arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20177 2026-05-20 cs.CL cs.CV (updated version)

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

Affiliations: Amazon; University of Waterloo; Vector Institute, Canada CIFAR AI Chair

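AI summary: By decoupling perception from reasoning, this study finds that performance on visual tasks is limited by insufficient perception rather than by reasoning itself. Staged training strengthens both capabilities and yields better results on several visual math and perception tasks.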

Comments 19 pages, 9 figures; Accepted to ICML 2026; Project Page: https://ucsc-vlaa.github.io/VLM-CapCurriculum/

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception tasks compared with their base counterparts (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).

2605.20174 2026-05-20 cs.CV cs.LG (updated version)

Multi-axis Analysis of Image Manipulation Localization

Keanu Nichols, Divya Appapogu, Giscard Biamby, Dina Bashkirova, Anna Rohrbach, Bryan A. Plummer

Affiliations: Boston University; University of California, Berkeley; Technical University of Darmstadt

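AI summary: This paper introduces the AUDITS benchmark for multi-axis analysis of image manipulation detection. It evaluates the robustness of existing methods under different types of domain shift, with the aim of driving more reliable and generalizable detection methods.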

Comments 28 pages, 5 figures, 5 tables

Abstract

Advanced image editing software enables easy creation of highly convincing image manipulations, which has been made even more accessible in recent years due to advances in generative AI. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people's opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K images from two distinct sources (user and news photos). We curate our dataset to support analysis across multiple axes using recent diffusion-based inpaintings, spanning a diverse range of manipulation types and sizes. We conduct experiments under different types of domain shift to evaluate robustness of existing image manipulation detection methods. Our goal is to drive further research in this area by offering new insights that would help develop more reliable and generalizable image manipulation detection methods.

2605.20165 2026-05-20 cs.CV (updated version)

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jianxu Shangguan, Cheng-Yen Yang, Jenq-Neng Hwang

Affiliations: Department of Electrical and Computer Engineering, University of Washington, Seattle, USA

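AI summary: This paper proposes CaMo, a camera-motion-grounded approach to evaluating and training vision-language models. By requiring models to generate explicit spatial narratives and then reason over them, it exposes weaknesses in the spatial cognition of existing spatial VLMs, and shows that CaMo performs consistently on both spatial narrative evaluation and direct spatial question answering.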

Comments Code and model available at https://github.com/hsiangwei0903/CaMo

Abstract

Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at https://github.com/hsiangwei0903/CaMo.

2605.20159 2026-05-20 cs.CV cond-mat.mtrl-sci cs.LG (updated version)

Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

Antonio Peña Corredor, Julien Lesseur, Romain Nunez, Paul Rivalland, Thomas Philippe

Affiliations: Safran Ceramics; Safran Engineering Services

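AI summary: This study proposes p-ResNet-50, a framework that extends ResNet-50 with a prototype layer. With new regularisation terms and semantic alignment of prototypes, it makes defect detection in X-ray tomography more interpretable and traceable while keeping accuracy high.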

Abstract

Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories (healthy matrix, matrix-air interfaces, pores, line-like defects, and mixed morphologies) so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is, and is not, reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.

2605.20158 2026-05-20 cs.CV cs.AI cs.CL (updated version)

Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

Affiliations: University of Virginia; National Institutes of Health

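AI summary: To assess how reliable visual attribution is for chest X-ray reasoning in large vision-language models, this paper proposes a causal evaluation framework that keeps only CXR-VQA samples whose expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Across 11 attribution methods, six open-source LVLMs, and two output modes, existing attribution methods often fail to identify the evidence the LVLMs actually use. The paper therefore proposes MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions; it substantially outperforms prior methods.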

Abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.

2605.20147 2026-05-20 cs.CV (updated version)

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Haojun Chen, Haoyang He, Chengming Xu, Qingdong He, Junwei Zhu, Yabiao Wang, Zhucun Xue, Xianfang Zeng, Zhennan Chen, Xiaobin Hu, Hao Zhao, Yong Liu, Jiangning Zhang, Dacheng Tao

Affiliations: Zhejiang University; Fudan University; Nanjing University; National University of Singapore; Tsinghua University; Nanyang Technological University

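AI summary: This paper introduces PixVerve-95K, a dataset built with a carefully designed data pipeline that contains 95K high-resolution images with seven-dimensional annotations, to advance ultra-high-resolution image generation. It extends T2I foundation models to 100MP generation with three training schemes and establishes the PixVerve-Bench evaluation protocol.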

Comments Project page is available at https://haojunchen663.github.io/projects/PixVerve/

Abstract

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.

2605.20110 2026-05-20 cs.CV (updated version)

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Zhixiong Zhang, Yizhuo Li, Shuangrui Ding, Yuhang Zang, Shengyuan Ding, Long Xing, Yibin Wang, Qiaosheng Zhang, Jiaqi Wang

Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory; Fudan University; University of Science and Technology of China

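AI summary: This paper proposes SetCon, which approaches open-ended referring segmentation through set-level concept prediction. It uses LVLM-generated natural-language concepts as semantic conditions for joint mask-set decoding, improving the completeness and mutual exclusivity of the segmentation.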

Abstract

Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.

2605.20090 2026-05-20 cs.CV (updated version)

MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi

Affiliations: Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Shenyuan Honors College, Beihang University

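AI summary: This paper proposes MetaEarth-MM, a unified framework for multi-modal remote sensing image generation that supports joint generation of multiple modalities and translation between arbitrary modalities, showing strong generative capability and broad applicability for multi-modal remote sensing observation.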

Abstract

Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.

2605.20085 2026-05-20 cs.CV (updated version)

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Yifan Li, Xinyu Zhou, Yunhao Ge, Yu Kong

Affiliations: Michigan State University; NVIDIA Research

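AI summary: This paper formalizes Spatially Prompted Visual Trajectory Prediction (SP-VTP), a setting in which spatial prompts define the task objective, and proposes SPOT, which combines a task encoder, an observation encoder, and a trajectory generator to improve cross-scene trajectory prediction for egocentric manipulation.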

Abstract

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.

2605.20082 2026-05-20 cs.CV cs.AI (updated version)

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled S. Refaat

Affiliations: Waymo

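AI summary: This paper proposes VL-DPO, a vision-language-guided framework that uses a VLM as a zero-shot reasoner to generate preference pairs for finetuning an autonomous driving model, improving alignment with human driving preferences. Experiments show gains over the baseline on both RFS and ADE.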

Comments Published in International Conference on Robotics and Automation (ICRA), 2026; 8 pages, 6 figures, 4 tables

Abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
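For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard DPO loss for a single preference pair. It is not taken from the paper: the abstract does not say how trajectory log-probabilities are computed or which `beta` is used, so the function name, argument names, and the default `beta=0.1` are all illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are log-probabilities (floats). `beta` is a hypothetical
    default; the abstract does not state the value used.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen sample over the rejected one, relative to the frozen reference.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)
```

With a zero margin the loss equals log 2, and it falls as the policy moves probability mass toward the preferred sample relative to the reference model.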

2605.20079 2026-05-20 cs.CV cs.AI cs.LG eess.IV (updated version)

Probability-Conserving Flow Guidance

Parsa Esmati, Junha Hyung, Amirhossein Dadashzadeh, Jaegul Choo, Majid Mirmehdi

Affiliations: University of Bristol; KAIST

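AI summary: This paper proposes AdaMaG, a probability-conserving flow guidance method. Analysing guidance through the continuity equation, it decomposes the effect of guidance into a divergence term and a score-parallel term, and controls both with a time-dependent schedule and score-parallel attenuation, improving generation quality and reducing hallucinations at no additional inference cost.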

Abstract

Diffusion and flow-based generative models dominate visual synthesis, with guidance aligning samples to user input and improving perceptual quality. However, Classifier-Free Guidance (CFG) and extrapolation-based methods are heuristic linear combinations of velocities/scores that ignore the generative manifold geometry, breaking probability conservation and driving samples off the learned manifold under strong guidance. We analyse guidance through the continuity equation and show its effect decomposes into a divergence term and a score-parallel term defined invariantly across parameterisations. We prove the divergence term blows up structurally as sampling approaches the data manifold, motivating a time-dependent schedule alongside score-parallel attenuation. The resulting plug-and-play rule, Adaptive Manifold Guidance (AdaMaG), bounds both terms at no additional inference cost. Finally, we show that most empirical heuristics for reducing saturation or improving generation quality correspond directly to the two terms in our decomposition. Across image generation benchmarks, AdaMaG improves realism, reduces hallucinations, and induces controlled desaturation in high-guidance regimes.
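The "heuristic linear combination" that the abstract criticises is the usual classifier-free guidance rule. The sketch below shows that baseline rule only, not AdaMaG, whose schedule and attenuation are not specified in the abstract. The function name and the plain-list representation of a velocity are assumptions for illustration; some codebases parameterise the scale as `1 + w` instead.

```python
def cfg_combine(v_uncond, v_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates beyond it.
    """
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]
```

Because the guided vector grows linearly with `scale`, large scales push samples far along the conditional-minus-unconditional direction, which is the regime the abstract associates with samples leaving the learned manifold.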

2605.20073 2026-05-20 cs.CV (updated version)

X-Ray cardiac angiographic vessel segmentation based on pixel classification using machine learning and region growing

E O Rodrigues, L O Rodrigues, J J Lima, D Casanova, F Favarim, E R Dosciatti, V Pegorini, L S N Oliveira, F F C Morais

Affiliations: Department of Academic Informatics (DAINF), Universidade Tecnológica Federal do Paraná (UTFPR); Graduate Program of Applied Sciences to Health Products, Universidade Federal Fluminense (UFF); Primary Health Care, Pato Branco Prefecture, Paraná, Brazil; Innovation Office, Mass General Brigham Hospital, Cambridge, Massachusetts, United States of America

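AI summary: This paper proposes a pixel-classification method for vessel segmentation in X-ray angiograms that combines textural features with region growing and uses a Random Forests classifier, reaching an accuracy of 95.48%.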

Journal ref Biomedical Physics & Engineering Express 2021

Abstract

This work proposes a pixel-classification approach for vessel segmentation in x-ray angiograms. The proposal uses textural features such as anisotropic diffusion, features based on the Hessian matrix, mathematical morphology and statistics. These features are extracted from the neighborhood of each pixel. The approach also uses the ELEMENT methodology, which consists of creating a pixel-classification controlled by region-growing where the result of the classification affects further classifications of pixels. The Random Forests classifier is used to predict whether the pixel belongs to the vessel structure. The approach achieved the best accuracy in the literature (95.48%) outperforming unsupervised state-of-the-art approaches.
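The idea of pixel classification controlled by region growing can be sketched as follows. This is a generic illustration, not the ELEMENT methodology itself, whose details the abstract does not give: the 4-connectivity, the seed handling, and the threshold classifier standing in for the Random Forests model are all assumptions.

```python
from collections import deque


def grow_region(image, seed, is_vessel):
    """Grow a region from `seed`; a pixel is classified only when it
    touches an already accepted pixel (4-connectivity assumed)."""
    rows, cols = len(image), len(image[0])
    region = {seed}            # the seed is assumed to lie on a vessel
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region and is_vessel(image, nr, nc):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region


# Hypothetical stand-in for a trained classifier: a plain intensity threshold.
def threshold_classifier(image, r, c):
    return image[r][c] > 0.5


example_image = [
    [0.9, 0.8, 0.1],
    [0.1, 0.7, 0.1],
    [0.9, 0.1, 0.1],
]
example_region = grow_region(example_image, (0, 0), threshold_classifier)
```

In this toy image the bright pixel at row 2, column 0 is never accepted because it is not connected to the seed, which shows how earlier decisions constrain which pixels are classified later.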

2605.20064 2026-05-20 cs.CV (updated version)

Cardiac fat segmentation using computed tomography and an image-to-image conditional generative adversarial neural network

Guilherme Santos da Silva, Dalcimar Casanova, Jefferson Tales Oliva, Erick Oliveira Rodrigues

Affiliations: Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR)

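AI summary: This study proposes a deep-learning method that uses the pix2pix network to automatically segment and quantify cardiac fat. It segments epicardial and mediastinal fat with high accuracy and outperforms existing methods in F1-score and run time.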

Journal ref Medical Engineering & Physics 2024

Abstract

In recent years, research has highlighted the association between increased adipose tissue surrounding the human heart and elevated susceptibility to cardiovascular diseases such as atrial fibrillation and coronary heart disease. However, the manual segmentation of these fat deposits has not been widely implemented in clinical practice due to the substantial workload it entails for medical professionals and the associated costs. Consequently, the demand for more precise and time-efficient quantitative analysis has driven the emergence of novel computational methods for fat segmentation. This study presents a novel deep learning-based methodology that offers autonomous segmentation and quantification of two distinct types of cardiac fat deposits. The proposed approach leverages the pix2pix network, a conditional generative adversarial network primarily designed for image-to-image translation tasks. By applying this network architecture, we aim to investigate its efficacy in tackling the specific challenge of cardiac fat segmentation, despite not being originally tailored for this purpose. The two types of fat deposits of interest in this study are referred to as epicardial and mediastinal fats, which are spatially separated by the pericardium. The experimental results demonstrated an average accuracy of 99.08% and an F1-score of 98.73 for the segmentation of the epicardial fat, and an accuracy of 97.90% and an F1-score of 98.40 for the mediastinal fat. These findings reflect the high precision and overlap agreement achieved by the proposed methodology. In comparison to existing studies, our approach exhibited superior performance in terms of F1-score and run time, enabling the images to be segmented in real time.
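As a reminder of how the two reported metrics relate to a segmentation mask, the sketch below computes pixel-wise accuracy and F1-score for a binary mask. It is a generic illustration: the abstract does not say how the paper aggregates over images or classes, and the function name and flat-list mask format are assumptions.

```python
def accuracy_and_f1(pred, truth):
    """Pixel-wise accuracy and F1-score for two equal-length binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    accuracy = (tp + tn) / len(truth)
    denom = 2 * tp + fp + fn
    # For binary masks this F1 is the same quantity as the Dice coefficient.
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

Accuracy counts background pixels as correct predictions, so on images where fat occupies a small area it can stay high even when overlap is poor; F1 ignores true negatives and is therefore the more informative overlap measure.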

2605.20044 2026-05-20 cs.CV (updated version)

OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives

Guiyu Liu, Niklas Vaara, Janne Mustaniemi, Juho Kannala, Janne Heikkilä

Affiliations: Center for Machine Vision and Signal Analysis, University of Oulu, Finland; Aalto University, Finland

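AI summary: OP2GS addresses the lack of object-level identity in 3D Gaussian Splatting with a dual-opacity mechanism: each primitive gains an explicit instance identity and a dedicated instance opacity σ*, in support of open-vocabulary scene understanding.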

Comments Under review

Abstract

3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $σ^{*}$ for object-mask rendering. The original opacity $σ$ remains responsible for visual reconstruction, while $σ^{*}$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.
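The dual-opacity idea in the abstract above can be illustrated with a minimal per-ray sketch. The abstract does not give the rendering equation, so everything here is an assumption for illustration: the function name `composite_ray`, the scalar colors, and in particular the choice to weight the mask by the visual compositing weight times the instance opacity. It is not presented as the OP2GS formulation.

```python
def composite_ray(alphas, inst_alphas, colors, inst_ids, target_id):
    """Front-to-back compositing of depth-sorted Gaussians along one ray.

    alphas      -- per-Gaussian visual alpha at this pixel (derived from sigma)
    inst_alphas -- per-Gaussian instance opacity in [0, 1] (sigma*)
    colors      -- per-Gaussian scalar color (grayscale, for brevity)
    inst_ids    -- per-Gaussian instance identity
    target_id   -- the object whose mask is being rendered
    """
    transmittance = 1.0
    pixel = 0.0
    mask = 0.0
    for a, a_star, c, gid in zip(alphas, inst_alphas, colors, inst_ids):
        # Standard 3DGS-style weight: what is still visible times this alpha.
        weight = transmittance * a
        pixel += weight * c
        # Assumption: the mask branch reuses the same visibility weight and
        # gates it by sigma*, so a Gaussian with sigma* = 0 is transparent
        # in the mask while still contributing to the image.
        if gid == target_id:
            mask += weight * a_star
        # Visibility is always attenuated by the visual opacity only.
        transmittance *= 1.0 - a
    return pixel, mask


# The first Gaussian is "mislabeled" for object 7: it keeps its visual alpha
# (it still colors the pixel) but its instance opacity is 0.
pixel, mask = composite_ray(
    alphas=[0.5, 0.5],
    inst_alphas=[0.0, 1.0],
    colors=[1.0, 1.0],
    inst_ids=[7, 7],
    target_id=7,
)
```

In this toy ray both Gaussians contribute to the pixel value, but only the second one contributes to the object mask, which is the decoupling of visual existence from instance occupancy that the abstract describes.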

2605.20035 2026-05-20 cs.CV 版本更新

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

面向高效全模态大语言模型的阶段自适应令牌选择

Zijie Xin, Jie Yang, Ruixiang Zhao, Tianyi Wang, Fengyun Rao, Jing Lyu, Xirong Li

发表机构 * Renmin University of China(中国人民大学) WeChat Vision, Tencent Inc.(腾讯微信视觉实验室)

AI总结 本文提出SEATS,一种免训练的阶段自适应令牌选择方法,用于提升全模态大语言模型的推理效率:仅保留10%的视觉和音频令牌,即可在保持96.3%原始性能的同时,实现9.3倍的FLOPs减少和4.8倍的prefill加速。

Comments Code Link: https://github.com/xxayt/SEATS

详情
AI中文摘要

全模态大语言模型(om-LLMs)将视频和音频编码为时间对齐、在窗口级别交错排列的令牌序列,从而实现统一的音视频理解。然而,在整个LLM中处理这些密集的非文本令牌会带来显著的计算开销。尽管免训练的令牌选择可以降低这一成本,但现有方法要么只针对纯视觉输入,要么仅在LLM之前以固定的每模态比例修剪om-LLM令牌,无法捕捉跨模态令牌重要性随层变化的规律。为了解决这一限制,我们首先分析om-LLMs的逐层令牌依赖性。我们发现视觉和音频依赖性呈现块状模式,并随深度逐渐减弱,这表明许多后期层的非文本令牌在跨模态融合后变得冗余。受此启发,我们提出SEATS,一种免训练的阶段自适应令牌选择方法,用于高效的om-LLM推理。在LLM之前,SEATS通过注意力加权的多样性选择去除时空冗余。在LLM内部,它在各块之间逐步修剪令牌,并利用查询相关性分数将保留预算从时间窗口到模态动态分配。在后期层中,一旦跨模态融合完成,它会移除所有剩余的非文本令牌。在Qwen2.5-Omni和Qwen3-Omni上的实验表明,SEATS有效提高了推理效率:仅保留10%的视觉和音频令牌,即可实现9.3倍的FLOPs减少和4.8倍的prefill加速,同时保持96.3%的原始性能。

英文摘要

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
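To make the budget-allocation step above more concrete, here is a small sketch of splitting a token-retention budget across temporal windows in proportion to query relevance, then keeping the top-scoring tokens inside a window. The abstract does not specify the actual allocation rule, so the proportional split, the largest-remainder rounding, and the names `allocate_budget` and `keep_top_tokens` are all hypothetical.

```python
def allocate_budget(relevance, total_keep):
    """Split `total_keep` tokens across windows in proportion to relevance.

    Assumption: a plain proportional split with largest-remainder rounding;
    the sketch does not cap a window's budget at its actual token count.
    """
    total = sum(relevance)
    if total <= 0:
        # No relevance signal: fall back to an even split.
        shares = [total_keep / len(relevance)] * len(relevance)
    else:
        shares = [total_keep * r / total for r in relevance]
    budget = [int(s) for s in shares]
    # Hand the tokens lost to rounding to the largest fractional remainders.
    leftover = total_keep - sum(budget)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - budget[i], reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget


def keep_top_tokens(scores, k):
    """Return indices of the k highest-scoring tokens, in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Two windows, the second three times as relevant to the query.
budget = allocate_budget([1.0, 3.0], total_keep=4)
kept = keep_top_tokens([0.1, 0.9, 0.5, 0.7], k=2)
```

Returning the kept indices in their original order preserves the temporal ordering of the surviving tokens, which matters when the sequence is interleaved by window.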

2605.20033 2026-05-20 cs.CV cs.GT 版本更新

A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

为无训练多模态步骤验证构建纳什均衡框架

Rohit Sinha, Kunal Tilaganji, Tanuja Ganu, Nagarajan Natarajan, Amit Sharma, Vineeth N. Balasubramanian

发表机构 * Microsoft Research India(微软印度研究院) Indian Institute of Technology Hyderabad(印度海得拉巴理工学院)

AI总结 本文提出一种无训练的多模态步骤验证方法,将分步验证视为多个专门评判者之间的协调问题,并将评判者之间的交互形式化为纳什均衡博弈,通过闭式解计算均衡分数,实现分歧感知的过滤和稳定性感知的排序;实验表明跨模态一致性(而非仅平均置信度)提供了鲁棒的验证信号。

Comments ICLR 2026 Workshop VerifAI-2

详情
AI中文摘要

多模态大语言模型经常生成包含细微错误的推理链,导致错误答案。当前的验证方法存在明显局限:学习型评判模型(critic)需要大量标注数据,且在不同任务上表现不一致;而现有的无训练方法只是简单平均不同来源的分数,忽略了一个关键见解:当这些分数不一致时,这种不一致本身就包含了关于推理步骤是否真正有效的重要信息。我们提出了一种无训练验证方法,将分步验证视为多个专门评判者之间的协调问题。我们将这些评判者的交互形式化为纳什均衡博弈,其中一致表示步骤有效,分歧则揭示不稳定性。我们的方法通过闭式解计算均衡分数,从而实现分歧感知的过滤和稳定性感知的推理步骤排序。在六个基准上的评估表明,我们的方法相比基线模型取得了2.4%至5.2%的稳定提升,并在与学习型评判模型的比较中表现出竞争力,证明了跨模态一致性(而非仅平均置信度)无需任务特定适应即可提供稳健的验证信号。

英文摘要

Multimodal large language models often generate reasoning chains containing subtle errors that lead to incorrect answers. Current verification approaches have notable limitations. Learned critics need extensive labeled data and show inconsistent performance across different tasks. Meanwhile, existing training-free methods simply average scores from different sources, missing a key insight: when these scores disagree, that disagreement itself carries important information about whether a reasoning step is truly valid or not. We propose a training-free verification approach that treats step-wise verification as a coordination problem among specialized judges. We formalize these judges' interaction as a Nash equilibrium game where agreement signals valid steps while disagreement reveals instability. Our method computes equilibrium scores through a closed-form solution, enabling both disagreement-aware filtering and stability-conscious ranking of reasoning steps. Evaluated across six benchmarks, our approach achieves consistent improvements of 2.4% to 5.2% over baseline models and shows competitive performance against learned critics, demonstrating that cross-modal agreement (not just average confidence) provides robust verification signals without task-specific adaptation.

2605.20016 2026-05-20 eess.IV cs.CV 版本更新

FGSVQA: Frequency-Guided Short-form Video Quality Assessment

FGSVQA:基于频率的短视频质量评估

Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull

发表机构 * School of Computer Science, University of Bristol(布里斯托大学计算机科学学院)

AI总结 本文提出了一种端到端的视频质量评估框架,利用基于CLIP的密集视觉编码器和频率域中的压缩先验,生成具有伪影和结构感知的权重图,以实现高效的视频质量预测。

Comments 4 pages, 1 figure

详情
AI中文摘要

短视频给用户生成内容(UGC)的质量评估带来了新挑战,由于其复杂的生成流程、快速的内容变化和混合的失真。为了解决这一挑战,我们提出了一种端到端的视频质量评估(VQA)框架,该框架采用基于CLIP的密集视觉编码器,并结合从频率域导出的压缩先验,生成具有伪影和结构感知的权重图用于特征聚合。通过显式分解伪影、结构和原始视觉特征分支,并通过学习的门控模块在时间上自适应融合,所提出的方法实现了准确且高效的质量预测。实验结果表明,我们的方法在短视频数据集上在平均排名和线性相关性(SRCC: 0.736,PLCC: 0.787)方面表现出色,同时保持了高效的推理运行时间。代码和额外结果可在:https://github.com/xinyiW915/FGSVQA 获取。

英文摘要

Short-form video poses new challenges to the quality assessment of user-generated content (UGC) due to its complex generation pipeline, rapid content variation, and mixed distortions. To address this challenge, we propose an end-to-end video quality assessment (VQA) framework that employs a dense visual encoder based on CLIP, and incorporates compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. By explicitly decomposing artifact, structure, and original visual feature branches and adaptively fusing them over time through a learned gating module, the proposed method achieves accurate and efficient quality prediction. Experimental results show that our method achieves strong performance on short-form video datasets in terms of average rank and linear correlation (SRCC: 0.736, PLCC: 0.787), while maintaining efficient inference runtime. The code and additional results are available at: https://github.com/xinyiW915/FGSVQA.
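The two figures quoted above, SRCC and PLCC, are the Spearman rank-order and Pearson linear correlation coefficients between predicted scores and subjective scores. A minimal standard-library sketch of both follows; the function names are illustrative, and the nonlinear (logistic) mapping that VQA studies often fit before computing PLCC is deliberately omitted here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation (PLCC) between two equal-length lists.

    Raises ZeroDivisionError if either input is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: predictions are monotone in the subjective scores but not linear.
predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 4.0, 9.0, 16.0]
srcc = spearman(predicted, subjective)
plcc = pearson(predicted, subjective)
```

On this toy data the rank correlation is perfect while the linear correlation is slightly below 1, which is why the two metrics are usually reported together.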

2605.19995 2026-05-20 cs.CV 版本更新

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

CogOmniControl: 通过创意意图认知实现推理驱动的可控视频生成

Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu, Jianbing Shen

发表机构 * SKL-IOTSC, CIS, University of Macau(澳门大学SKL-IOTSC、CIS实验室) Online-Video BU, Tencent(腾讯在线视频事业部)

AI总结 本文提出CogOmniControl框架,通过将可控视频生成分解为创意意图认知和生成两个阶段,利用专门训练的CogVLM生成更专业清晰的输出,并通过强化学习对齐不同条件的控制,最终在两个基准测试中超越现有开源模型。

详情
AI中文摘要

最近的扩散模型在视频生成中实现了出色的照片级真实感和流畅性,但在抽象、稀疏或复杂条件下仍然脆弱,导致在分镜草图和白模(clay render)条件等专业生产流程中性能不佳。现有视频生成模型要么通过适配器注入条件,要么在扩散骨干中耦合通用视觉-语言模型(VLM),留下了能力缺口,无法生成符合用户创意意图的视频。我们提出了CogOmniControl,一个推理驱动的框架,将可控视频生成分解为创意意图认知和生成。具体而言,我们使用真实的动画制作数据训练了一个专门的CogVLM。与通用VLM相比,它生成更专业、更清晰的输出,能够从稀疏和抽象的条件中准确认知用户的创意意图,并将这些线索转化为密集的推理输出。此外,CogOmniDiT通过上下文生成统一来自各种条件的控制,并通过强化学习与CogVLM的推理输出对齐。进一步地,借助CogVLM在引导视频生成方面的强大能力,我们发掘了其在规划特定评估器方面的潜力,并为生成的视频启用Best-of-N选择。这种整合将整个框架转变为闭环的“类harness”架构。我们还提出了CogReasonBench和CogControlBench,它们基于承载真实创意意图(而非模拟意图)的专业工作流数据构建。在两个基准上的实验表明,CogOmniControl超越了现有的开源模型。项目网站:https://um-lab.github.io/CogOmniControl/

英文摘要

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and turning these cues into dense reasoning output. Besides, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we unlock its potential in planning specific evaluators and enable Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflow data that carry genuine creative intent rather than simulated ones. Experiments on the two benchmarks show that CogOmniControl surpasses existing open-source models. The project website: https://um-lab.github.io/CogOmniControl/

2605.19990 2026-05-20 cs.RO cs.CV cs.LG 版本更新

Minimalist Visual Inertial Odometry

极简视觉惯性里程计

Francesco Pasti, Jeremy Klotz, Nicola Bellotto, Shree K. Nayar

发表机构 * Department of Information Engineering, University of Padua(帕多瓦大学信息工程系) Computer Science Department, Columbia University(哥伦比亚大学计算机科学系)

AI总结 本文提出了一种极简的平面里程计方法,通过四个视觉测量和一个IMU实现差分驱动机器人的鲁棒运动估计,展示了极简传感在高效准确平面里程计中的应用。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

视觉-惯性里程计(VIO)对移动机器人导航至关重要,但它使用像素数量庞大的相机,采集和处理相机图像需要大量资源。本文提出了一种极简的平面里程计方法,证明仅凭四个视觉测量和一个IMU即可为差分驱动机器人提供鲁棒的运动估计。我们的关键见解是:四个朝下的光电二极管透过光学Gabor掩码感知世界,其产生的信号编码了速度。基于此,我们利用基于物理的模拟器联合优化掩码参数和时间卷积网络(TCN)。所得模型仅凭光电二极管产生的四个测量值即可解码速度。将这些估计与IMU提供的角速度相结合,即可得到连续的平面轨迹。我们将原型传感器安装在差分驱动机器人上验证了该方法。在多样化的室内和室外地形上,我们的系统能够紧密跟踪参考真值,无需任何真实世界微调。我们的工作表明,极简传感能够实现高效且准确的平面里程计。

英文摘要

Visual-Inertial Odometry (VIO), which is critical to mobile robot navigation, uses cameras with a large number of pixels. Capturing and processing camera images requires significant resources. This work presents a minimalist approach to planar odometry, demonstrating that just four visual measurements and an IMU can provide robust motion estimation for differential-drive robots. Our key insight is that four downward-facing photodiodes that sense the world through optical Gabor masks produce signals that encode speed. Based on this, we jointly optimize the mask parameters alongside a Temporal Convolutional Network (TCN) using a physically-grounded simulator. The resulting model decodes speed from just the four measurements produced by the photodiodes. Pairing these estimates with the angular speed from an IMU yields a continuous planar trajectory. We validate our approach with a prototype sensor mounted on a differential-drive robot. Across diverse indoor and outdoor terrains, our system closely tracks the reference ground truth without any real-world fine-tuning. Our work shows that minimalist sensing enables efficient and accurate planar odometry.
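The last step described above, pairing a forward-speed estimate with the IMU's angular speed to obtain a planar trajectory, is ordinary dead reckoning for a unicycle-type robot. The sketch below assumes a fixed sample period and midpoint heading integration; the function name and these choices are illustrative and are not taken from the paper.

```python
import math


def integrate_planar(speeds, yaw_rates, dt, x=0.0, y=0.0, theta=0.0):
    """Dead-reckon a planar pose from forward speed and yaw rate.

    speeds    -- forward speed samples in m/s (e.g. decoded by the model)
    yaw_rates -- angular speed samples in rad/s (e.g. from the IMU gyro)
    dt        -- sample period in seconds (assumed constant)
    Returns the list of (x, y, theta) poses, starting with the initial pose.
    """
    trajectory = [(x, y, theta)]
    for v, w in zip(speeds, yaw_rates):
        # Use the heading at the middle of the step to reduce drift on arcs.
        theta_mid = theta + 0.5 * w * dt
        x += v * math.cos(theta_mid) * dt
        y += v * math.sin(theta_mid) * dt
        theta += w * dt
        trajectory.append((x, y, theta))
    return trajectory


# Driving at 1 m/s while turning at 1 rad/s traces a circle of radius 1 m.
path = integrate_planar([1.0] * 628, [1.0] * 628, dt=0.01)
final_x, final_y, final_theta = path[-1]
```

Because errors in speed and yaw rate accumulate over time, a scheme like this drifts without an external reference, which is why the abstract's comparison against ground-truth trajectories is the relevant test.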

2605.19986 2026-05-20 cs.RO cs.CV cs.LG 版本更新

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

超越二元成功:一种用于细粒度操控的诊断元评估框架

He-Yang Xu, Pengyuan Zhang, Zongyuan Ge, Xiaoshuai Hao, Serge Belongie, Xin Geng, Yuxin Peng, Xiu-Shen Wei

发表机构 * Southeast University(东南大学) Monash University(蒙纳士大学) Xiaomi EV(小米汽车) University of Copenhagen(哥本哈根大学) Peking University(北京大学)

AI总结 本文提出MetaFine框架,通过分解理解、感知和受控行为三个维度,诊断细粒度操控中的能力瓶颈,并通过因果干预识别视觉编码器在保持局部空间结构方面的关键限制,从而提升操控精度。

Comments Project page: https://metafine.github.io/

详情
AI中文摘要

细粒度操控所处的情形是:全局场景上下文已不再足够,成功取决于局部属性定位、高保真空间感知和遵守约束的运动执行之间的紧密耦合。然而,当前的具身AI基准将这些能力压缩为二元成功率,系统性地将报告的能力夸大多达70%,并掩盖了阻碍实际部署的架构瓶颈。我们提出了MetaFine,一种诊断式元评估框架,沿理解、感知和受控行为三个轴解耦操控能力。MetaFine建立在组合任务图之上,吸收异构的外部基准,并在统一协议下将其重构为不同复杂度的诊断场景。以这一视角评估最先进的视觉-语言-动作(VLA)模型,暴露出传统指标无法发现的严重的维度特定失败。通过有针对性的因果干预,我们确定视觉编码器保持局部空间结构的能力是细粒度精度的关键瓶颈:改进它可以直接解锁此前无法实现的操控能力,而无需修改下游策略。MetaFine还支持真实-仿真混合验证,利用有限的成对真实世界运行来校准可扩展的基于仿真的估计,以获得更稳定的物理基准测试。通过将评估从排名转向诊断,MetaFine将基准测试转变为可操作的指南,用于修复真正的物理灵巧性背后的分层能力。MetaFine框架、基准和相关资源将在项目页面上公开发布:https://metafine.github.io/。

英文摘要

Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder's ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.

2605.19982 2026-05-20 cs.CV 版本更新

InterLight: Leveraging Intrinsic Illumination Priors for Low-Light Image Enhancement

InterLight: 利用内在照明先验进行低光照图像增强

Ziqi Wang, Xu Zhang, Laibin Chang, Shi Chen, Jiaqi Ma, Huan Zhang

发表机构 * National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University(武汉大学计算机学院多媒体软件国家工程研究中心) Department of Computer Science, University of Macau(澳门大学计算机科学系) Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学计算机视觉系) School of Information Engineering, Guangdong University of Technology(广东工业大学信息工程学院)

AI总结 本文提出InterLight框架,通过系统挖掘和操作内在照明先验来解决低光照图像增强问题,核心方法是构建照明感知的处理流程,通过物理引导增强和自监督一致性目标实现更清晰的纹理和更一致的增强效果。

Comments Accepted by IJCAI 2026. Code: https://github.com/House-yuyu/InterLight

详情
AI中文摘要

低光照图像增强(LLIE)长期以来一直是低级视觉中的挑战性问题,由于光照不足常导致对比度低、细节丢失和噪声。最近的研究表明,基于深度学习的Retinex理论可以有效解耦光照和反光。然而,现有方法常面临过增强或色彩失真问题,并且通常假设均匀噪声或理想照明。为了解决这些限制,我们提出InterLight,一种新颖的框架,系统挖掘并操作内在照明先验用于LLIE。我们的核心见解是,稳健的增强不仅需要估计光照,还需要构建照明感知的处理流程。我们首先通过物理引导增强注入传感器级光照响应先验,然后通过适应性提示表示退化,这些提示基于场景的潜在光照状态。这种显式表示直接引导一个亮度门控的内在记忆机制,选择性补偿信息损失,优先重建暗区的同时在亮区保持保真度。最后,整个过程通过自监督一致性目标进行正则化,该目标蒸馏了光照不变特征。通过深入挖掘内在光照先验,我们的方法实现了更清晰的纹理和更一致的增强结果。在多个基准上的广泛实验验证了我们的方法的有效性。代码可在:https://github.com/House-yuyu/InterLight 获取。

英文摘要

Low-Light Image Enhancement (LLIE) has long been a challenging problem in low-level vision, as insufficient illumination often leads to low contrast, detail loss, and noise. Recent studies show that deep learning-based Retinex theory can effectively decouple illumination and reflectance. However, existing methods frequently suffer from over-enhancement or color distortion, and often assume uniform noise or ideal lighting. To address these limitations, we propose InterLight, a novel framework that systematically excavates and operationalizes intrinsic illumination priors for LLIE. Our core insight is that robust enhancement requires not just estimating illumination, but constructing an illumination-aware pipeline. We first inject sensor-level illumination-response priors via physics-guided augmentation, then represent the degradation through adaptive prompts conditioned on the scene's latent illumination state. This explicit representation directly guides a luminance-gated intrinsic memory mechanism to selectively compensate for information loss, prioritizing reconstruction in dark regions while preserving fidelity in bright ones. Finally, the entire process is regularized by a self-supervised consistency objective that distills illumination-invariant features. By deeply exploiting intrinsic illumination priors, our method achieves clearer textures and more visually coherent enhancement results. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach. Code is available at: https://github.com/House-yuyu/InterLight.

2605.19976 2026-05-20 cs.CV 版本更新

RECIPE: Procedural Planning via Grounding in Instructional Video

RECIPE: 通过教学视频中的定位(grounding)实现过程规划

Luigi Seminara, Antonino Furnari, Lorenzo Torresani

发表机构 * Khoury College of Computer Sciences, Northeastern University, Boston(美国东北大学Khoury计算机科学学院,波士顿) Department of Mathematics and Computer Science, University of Catania, Italy(卡塔尼亚大学数学与计算机科学系)

AI总结 该研究提出RECIPE方法,将生成的步骤序列在教学视频ASR转录中的时间定位(grounding)质量作为GRPO的奖励,并借助预计算的文本嵌入把验证扩展到数百万个视频,从而把含噪语料用作验证器而非标签来源,提升了过程规划的准确性和鲁棒性。

详情
AI中文摘要

视觉规划要求模型在给定部分视频上下文和目标的情况下,用自然语言生成一个过程的剩余步骤。该任务的进展受限于标注:干净的标注数据集规模小、领域狭窄,且每个示例只编码一条执行轨迹,尽管存在许多有效的步骤顺序。大规模教学视频语料库提供了多出数个数量级的过程内容,但在其含噪ASR旁白产生的伪标签上进行监督微调会传播分割和对齐错误,并且仍然局限于单一轨迹。我们发现了一个关键的不对称性:从含噪视频中提取干净的步骤标签很难,但验证生成的步骤序列是否在ASR转录中具有时间上的定位(grounding)却很便宜,并且可以借助预计算的文本嵌入扩展到数百万个视频。我们在RECIPE中利用这一不对称性,将定位质量作为GRPO的奖励,把含噪语料库变成验证器而非标签来源。该框架可统一应用于两种规划器输入配置(Socratic,使用由冻结的VLM提取的文本历史;以及Video,直接使用视频令牌),并适用于有标注和弱监督两种情形。我们在7个过程基准上进行评估,采用基于参考的LLM-as-judge协议,按6项过程标准对计划评分。RECIPE-RL在所有规模(0.5B、3B、7B)和每个基准上都优于基础检查点,领域内宏准确率提升7到8分,零样本情况下最高提升16分。它优于在标注计划和伪标签计划上进行的监督微调(后者会降低基础模型性能),并在没有人工标注的情况下保持稳健。作为已有的提议-评估-搜索规划器的提议阶段使用时,它在Visual Planning for Assistance的每个时间范围上均优于最强的零样本基线,并在COIN上保持了SFT会使其坍缩的生成多样性。

英文摘要

Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.
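The idea of using grounding quality as a GRPO reward can be sketched in two pieces: a reward that checks whether each generated step matches a transcript segment in temporal order, and the group-relative advantage that GRPO computes from the rewards of several sampled plans. The greedy monotone matching and the names below are assumptions for illustration; the abstract does not describe the paper's actual reward.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def grounding_reward(step_embs, segment_embs):
    """Mean best similarity of each step to a transcript segment.

    Assumption: greedy monotone matching. Each step may only match a segment
    at or after the one matched by the previous step, so plans whose steps
    are out of temporal order score lower.
    """
    if not step_embs or not segment_embs:
        return 0.0
    start, total = 0, 0.0
    for step in step_embs:
        best_sim, best_idx = max(
            (cosine(step, segment_embs[j]), j)
            for j in range(start, len(segment_embs))
        )
        total += best_sim
        start = best_idx
    return total / len(step_embs)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy 2-D "embeddings": the transcript mentions step A, then step B.
segments = [[1.0, 0.0], [0.0, 1.0]]
in_order = grounding_reward([[1.0, 0.0], [0.0, 1.0]], segments)
reversed_plan = grounding_reward([[0.0, 1.0], [1.0, 0.0]], segments)
advantages = group_relative_advantages([in_order, reversed_plan])
```

The in-order plan receives the higher reward and therefore a positive advantage, so the noisy transcript acts as a verifier of sampled plans rather than as a source of labels.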

2605.19974 2026-05-20 cs.CV 版本更新

SphericalDreamer: Generating Navigable Immersive 3D Worlds with Panorama Fusion

SphericalDreamer: 通过全景融合生成可导航的沉浸式3D世界

Antoine Schnepf, Karim Kassab, Flavian Vasile, Andrew Comport

发表机构 * Université Côte d'Azur, CNRS, I3S, France(法国蔚蓝海岸大学、国家科学研究中心、I3S研究所) Criteo AI Lab, Paris, France(法国Criteo人工智能实验室)

AI总结 本研究提出SphericalDreamer方法,通过生成多个全景图像并将其提升到3D空间中进行融合,从而生成高度细节且可导航的沉浸式3D户外环境,显著提升了尺度和可导航性。

Comments Accepted at ICML 2026. Project page available at https://sphericaldreamer.github.io

详情
AI中文摘要

沉浸式和可导航的3D环境的生成随着虚拟现实和3D内容的普及而变得越来越普遍。然而,最近的方法面临一个根本性的限制:它们无法生成同时(i)能够在长距离空间范围内导航且(ii)覆盖完整全方位视野(水平360度,垂直180度)的3D世界。为了解决这一挑战,我们引入了SphericalDreamer,一种从文本提示中生成完全沉浸和长距离3D户外环境的方法。我们的方法基于生成多个全景图像,这些图像随后被提升到3D空间中并融合在一起,同时保持视觉和几何一致性。SphericalDreamer生成高度细节的、完全沉浸的3D环境,同时在尺度和可导航性方面显著优于先前的方法。

英文摘要

The generation of immersive and navigable 3D environments is increasingly prevalent with the growing adoption of virtual reality and 3D content. However, recent methods face a fundamental limitation: they cannot produce 3D worlds that simultaneously (i) are navigable over long-range spatial extents and (ii) cover the complete omnidirectional field of view ($360^\circ$ horizontally and $180^\circ$ vertically). To address this challenge, we introduce SphericalDreamer, a method for generating fully immersive and long-range 3D outdoor environments from textual prompts. Our approach is built on the generation of multiple panoramic images, which are subsequently lifted into 3D and fused together while maintaining visual and geometric consistency. SphericalDreamer produces highly detailed, fully immersive 3D environments, while substantially improving scale and navigability compared to prior approaches.
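Lifting a panorama into 3D, as described above, starts from the mapping between equirectangular pixels and viewing directions. The sketch below assumes an equirectangular panorama with a y-up axis convention and a per-pixel depth measured along the ray; these conventions and the function names are illustrative, and the paper's own lifting procedure is not specified in the abstract.

```python
import math


def pixel_to_direction(u, v, width, height):
    """Unit viewing direction for pixel (u, v) of an equirectangular panorama.

    Longitude spans 360 degrees across the width and latitude spans
    180 degrees across the height. Assumed axes: y up, z forward at the
    image center.
    """
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi    # -pi .. pi
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi   # pi/2 .. -pi/2
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def lift_pixel(u, v, depth, width, height, center=(0.0, 0.0, 0.0)):
    """3D point for a pixel, given its depth along the ray and the panorama center."""
    direction = pixel_to_direction(u, v, width, height)
    return tuple(c + depth * d for c, d in zip(center, direction))


# The point seen at the image center, 2 units away, lies straight ahead.
point = lift_pixel(1.5, 0.5, depth=2.0, width=4, height=2)
```

Panoramas captured at different centers land in one shared coordinate frame through the `center` argument, which is the precondition for fusing several lifted panoramas into a single scene.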

2605.19957 2026-05-20 cs.CV cs.AI cs.RO 版本更新

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

面向混合具身任务长时域演化的世界-自我建模

Zuyao Lin, Jianhui Zhang, Peidong Jia, Xiaoguang Zhao, Shanghang Zhang, Xingyu Chen

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Chinese Academy of Sciences(中国科学院大学) Peking University(北京大学)

AI总结 本文提出了一种新的世界-自我建模范式,将未来演化分解为世界和自我两个组件,以缓解长时域具身场景(尤其是混合任务)中的性能退化,并通过HTEWorld基准验证了其有效性。

详情
AI中文摘要

世界模型在具身智能中被广泛研究,但它们通常在单一流中同时预测世界和自我这两种不同的演化,其中世界刻画持久的、与指令无关的场景规律,而自我刻画以机器人为中心、以指令为条件的动态。这种世界-自我纠缠导致长时域具身场景中的性能退化,特别是在导航与操作行为交替出现的混合任务中。在本文中,我们提出了世界-自我建模(World-Ego Modeling),一种将未来演化分解为世界和自我组件的新概念范式。我们从三种视角定义世界-自我边界,即基于运动、基于语义和基于意图的视角,并分析了三种解纠缠策略,即后解纠缠、前解纠缠和完全解纠缠。进一步,我们将该范式实例化为世界-自我模型(WEM),一个统一的具身世界模型,它将隐式分离的世界-自我规划器与级联并行混合专家(CP-MoE)扩散生成器相结合。为了实现严格的评估,我们还构建了HTEWorld,这是首个面向导航-操作混合任务的长时域世界建模基准,提供12.5万个视频片段(超过450万帧)及细粒度动作标注,以及300条多轮评估轨迹(超过2000条指令)。大量实验表明,WEM在HTEWorld上实现了最先进的性能,同时在现有的仅操作基准上保持竞争力。

英文摘要

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

2605.19956 2026-05-20 cs.CV 版本更新

Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

迈向细粒度鲁棒性:面向视觉-语言模型的注意力引导测试时提示调优

Jia-Wei Hai, Yijun Wang, Xiu-Shen Wei

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications(新一代人工智能技术及其交叉应用重点实验室) Southeast University(东南大学) School of Intelligence Science and Engineering(智能科学与工程学院)

AI总结 本文提出了一种注意力引导的测试时提示调优方法(A-TPT),旨在解决视觉-语言模型在对抗攻击下的鲁棒性问题,通过改进的梯度注意力机制和空间变化的增强强度来提升模型在细粒度场景下的表现。

Comments Accepted by ICML 2026. Project Page: this https URL. Code URL: this https URL

详情
AI中文摘要

视觉-语言模型(VLMs),如CLIP,通过各种微调适应方法在下游任务上实现了显著的零样本性能。然而,最近的研究证明,对抗攻击可以显著降低VLMs的推理能力,对实际应用构成重大风险。普遍的测试时适应方法通常依赖多视图增强来实现各种微调策略,但它们难以识别语义信息,并且在细粒度场景中容易破坏判别区域。为了解决这些限制,我们提出了注意力引导的测试时提示调优(A-TPT),一种旨在测试时适应的语义保持方法。我们首先改进了梯度注意力展开机制,以识别在对抗攻击下仍能存活的语义重要区域。进一步地,我们利用这些区域来指导空间变化的增强强度和多视图集成,以进行提示调优和推理。广泛的实验表明,A-TPT在对抗和干净数据上均优于现有的测试时适应方法。代码可在https://github.com/SEU-VIPGroup/A-TPT获取。

英文摘要

Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Codes are available at https://github.com/SEU-VIPGroup/A-TPT .

2605.19950 2026-05-20 cs.CV 版本更新

AffectVerse: Emotional World Models for Multimodal Affective Computing

AffectVerse: 多模态情感计算中的情感世界模型

Bo Zhao, Fanghua Ye, Yixin Ji, Sicheng Zhao, Xiaojiang Peng, Zitong YU

发表机构 * Great Bay University(大湾大学) Tencent(腾讯) Tsinghua University(清华大学) Shenzhen Technology University(深圳技术大学)

AI总结 本研究提出AffectVerse,一种基于Qwen2.5-Omni的多模态情感计算模型,通过引入情感世界模块实现短期潜在情感预测,利用未来预测作为自监督信号,提高了情感计算的准确性。

详情
AI中文摘要

人类通过整合观察到的多模态线索与对情绪状态可能演变的期望来推断情绪。然而,现有的多模态大语言模型(MLLMs)通常将情绪识别视为对完整音频视觉-文本输入的静态融合,忽略了情感动态。我们提出了AffectVerse,一种基于Qwen2.5-Omni的模型,配备了情感世界模块(EWM),这是一个无动作的表示层面模块,用于短期潜在情感预测。EWM包含三个模块:1)跨模态时间想象通过多步展开预测未来的视频/音频表示;2)MAMA(模态感知多步注意力)信念聚合将想象的标记压缩成模态感知的信念标记;3)信念注入将这些信念标记插入LLM中进行情绪推理。AffectVerse将未来预测作为过去条件的自监督信号:它不替换对观察历史的建模或需要未见过的信号,但迫使当前信念状态编码预测后续情绪变化的转换线索。在九个基准测试中,AffectVerse在其他模型上提高了至少2.57%,而受控消融实验显示了时间想象、跨模态展开和信念聚合的加性增益。这些结果表明,预测信念状态建模是情感计算的一种实用替代方案。

英文摘要

Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout; 2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation compresses imagined tokens into modality-aware belief tokens; 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a past-conditioned self-supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves by at least 2.57% over other models, while controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest predictive belief-state modeling is a practical alternative for affective computing.

2605.19949 2026-05-20 cs.CV 版本更新

Feed-Forward Gaussian Splatting from Sparse Aerial Views

基于稀疏航拍视图的前馈高斯泼溅

Dongli Wu, Zhuoxiao Li, Tongyan Hua, Yinrui Ren, Xiaobao Wei, Rongjun Qin, Wufan Zhao

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Peking University(北京大学) The Ohio State University(俄亥俄州立大学)

AI总结 本文提出AnyCity框架,采用以观察为依据的生成式重建方法,缓解稀疏航拍视图下大规模城市场景重建中的证据不平衡问题:先预测由观察支持的几何潜变量,再用以支架为条件的航拍补全令牌预测门控残差更新,最终通过单次前馈重建3D高斯场景。

详情
AI中文摘要

从稀疏航拍视图重建大规模城市场景是一项关键但具有挑战性的任务。由于相机位姿偏向俯视和浅倾斜视角,稀疏航拍数据表现出强烈的证据不平衡:屋顶和开阔区域被反复观察,而立面、远处建筑和被遮挡的结构则很少获得多视图支持。现有的前馈3D高斯泼溅方法直接从稀疏输入回归确定性表示,但这往往导致重影、融化的立面和拉伸的纹理。最近基于伪视图和基于视频的生成式重建方法使用额外的监督或生成先验,然而它们通常缺乏对已观察几何与先验驱动内容的清晰区分,这可能导致看似合理但不一致的结构。我们提出AnyCity,一种面向稀疏航拍城市场景、以观察为依据的生成式重建框架。AnyCity首先预测一个由观察支持的几何潜变量以锚定可靠的结构,然后在高斯解码之前,使用以支架为条件的航拍补全令牌为弱约束内容预测门控残差更新。在训练过程中,密集到稀疏的蒸馏从密集视图重建中迁移结构线索,同时一个适配航拍数据的视频扩散先验通过门控令牌条件提供细粒度的城市外观线索。观察保持目标使细化后的表示与输入支持的几何保持一致。在推理时,AnyCity通过单次前馈传递从稀疏航拍视图重建最终的3D高斯场景,以秒级推理实现连贯的城市新视图合成。在合成、航拍域、无人机纹理和真实世界场景上的实验显示,相比前馈基线取得了一致的改进。

英文摘要

Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.

2605.19931 2026-05-20 cs.CV cs.AI cs.LG Version update

StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

Reza M. Asiyabi, Juan Alberto Molina-Valero, The SEOSAW Partnership, Steven Hancock, Casey M. Ryan

Institutions: School of Geosciences, University of Edinburgh, UK; National Centre for Earth Observation (NCEO), UK; Department of Spatial Sciences, Faculty of Environmental Sciences, Czech University of Life Sciences Prague, Praha, Czech Republic

AI summary: This paper addresses multi-task dense regression under disjoint partial supervision with MNAR labels. It proposes StruMPL, which combines a shared encoder, a learnable physics module, and an Augmented IPW loss to improve estimates of forest aboveground biomass.

Comments 10 pages with 3 figures and 4 tables, References and Appendix 12 pages with 1 figure and 4 tables

Details
Abstract

Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.
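
The AIPW pseudo-outcome mentioned in the abstract can be illustrated with a minimal sketch. The function name, the toy strata, and the constant imputation value are assumptions made for illustration, not StruMPL's code, and the stop-gradients the abstract describes are a training-time detail that plain Python cannot show. The sketch uses the standard form $m(x) + r\,(y - m(x))/\pi(x)$, where $m$ is the imputation, $\pi$ the propensity, and $r$ marks whether the label was observed.

```python
def aipw_pseudo_outcome(y, observed, propensity, imputed):
    """Standard AIPW pseudo-outcome for one sample (illustrative helper)."""
    if not observed:
        return imputed                           # unlabeled: fall back to imputation
    return imputed + (y - imputed) / propensity  # labeled: reweighted residual


# Toy population: stratum A (label 10) is sampled often, stratum B (label 50) rarely.
# Each tuple is (label, observed, propensity); all values are made up.
samples = (
    [(10.0, True, 0.8)] * 4 + [(10.0, False, 0.8)] * 1 +
    [(50.0, True, 0.2)] * 1 + [(50.0, False, 0.2)] * 4
)
IMPUTED = 20.0  # deliberately poor constant imputation

observed_labels = [y for y, seen, _ in samples if seen]
naive_mean = sum(observed_labels) / len(observed_labels)

pseudo = [aipw_pseudo_outcome(y, seen, p, IMPUTED) for y, seen, p in samples]
aipw_mean = sum(pseudo) / len(pseudo)

true_mean = sum(y for y, _, _ in samples) / len(samples)
print(naive_mean, aipw_mean, true_mean)
```

In this toy case the naive mean of the observed labels is pulled toward the over-sampled stratum, while the average of the pseudo-outcomes matches the population mean even though the imputation is poor, because the propensities are correct.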

2605.19929 2026-05-20 cs.CV cs.AI Version update

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang, Guolei Sun

Institutions: VCIP, College of Computer Science, Nankai University; D-ITET, ETH Zürich; OPPO Research Institute; Department of Computing, Hong Kong Polytechnic University

AI summary: This paper proposes SplitQ, a framework that uses channel splitting and an adaptive cross-modal calibration module to counter the accuracy loss that modality heterogeneity causes in low-bit quantization of large vision-language models, substantially improving performance on multiple multimodal datasets.

Details
Abstract

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ
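
Why isolating outlier channels helps at low bit widths can be seen in a small sketch. This is not SplitQ's MOCD module; it is a generic symmetric quantizer with hypothetical helper names and made-up activations, shown only to illustrate how one outlier channel inflates a shared scale.

```python
def quantize(values, bits):
    """Symmetric uniform quantize-dequantize with a single shared scale."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed symmetric grid")
    levels = 2 ** (bits - 1) - 1            # 3 positive levels for signed 3-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def split_quantize(values, bits, outlier_channels):
    """Quantize outlier channels and the remaining channels with separate scales."""
    out = list(values)
    groups = (
        [i for i in range(len(values)) if i in outlier_channels],
        [i for i in range(len(values)) if i not in outlier_channels],
    )
    for idx in groups:
        if idx:
            for i, q in zip(idx, quantize([values[i] for i in idx], bits)):
                out[i] = q
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


acts = [0.1, -0.2, 0.3, 0.25, -0.15, 8.0]   # channel 5 is a made-up outlier
shared_error = mse(acts, quantize(acts, bits=3))
split_error = mse(acts, split_quantize(acts, bits=3, outlier_channels={5}))
print(shared_error, split_error)
```

With a shared scale the outlier stretches the 3-bit grid so far that every small activation rounds to zero; giving the outlier channel its own scale leaves the remaining channels a much finer grid.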

2605.19890 2026-05-20 cs.CV Version update

GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation

Shyma Alhuwaider, Yasmeen Alsaedy, Merey Ramazanova, Silvio Giancola, Bernard Ghanem

Institutions: Center of Excellence in Generative AI, KAUST, Saudi Arabia

AI summary: This paper studies the role of memory policies in test-time adaptation and proposes GOTTA, a family of policies based on class balance and feature-space diversity, showing that managing diversity improves adaptation under constrained memory and non-i.i.d. streams.

Details
Abstract

Test-time adaptation (TTA) enables a pre-trained model to adapt online to an unlabeled test stream under distribution shift. While most TTA research focuses on the adaptation objective, practical streams also depend critically on the memory used to select which test samples drive adaptation. Existing memory mechanisms are usually evaluated as components of specific TTA algorithms, making it difficult to isolate which memory design choices matter and when they matter. In this work, we provide a systematic benchmark that decouples memory from the adaptation algorithm and evaluates memory policies under unified conditions across i.i.d., non-i.i.d., continual, and practical test streams. Our study shows that effective memory management requires more than retaining recent or class-balanced samples. In particular, intra-class diversity is a key factor for avoiding redundant buffers and maintaining representative adaptation signals under temporally correlated and label-skewed streams. Motivated by this finding, we introduce Guided Observational Test-Time Adaptation (GOTTA), a family of diversity-aware memory policies that combine class-balanced allocation with feature-space diversity. GOTTA memories act as drop-in replacements for existing buffers and can be paired with different TTA objectives. Across corruption benchmarks and video-stream settings, diversity-aware memory improves adaptation most clearly under constrained memory budgets and challenging non-i.i.d. streams, while remaining competitive as memory capacity increases. These results highlight memory management as a first-class component of robust test-time adaptation and identify diversity as a central principle for practical TTA.
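
One way to combine class-balanced allocation with feature-space diversity is sketched below. The class name, the eviction rule (drop the newer member of the closest pair), and the toy features are assumptions for illustration; the abstract does not specify GOTTA's actual policies.

```python
import math


class DiverseClassBalancedMemory:
    """Toy buffer: equal slots per class, evicting the most redundant feature."""

    def __init__(self, num_classes, capacity):
        if capacity < num_classes:
            raise ValueError("capacity must allow at least one slot per class")
        self.per_class = capacity // num_classes
        self.slots = {c: [] for c in range(num_classes)}

    def add(self, feature, pseudo_label):
        slot = self.slots[pseudo_label]
        slot.append(tuple(feature))
        if len(slot) > self.per_class:
            # Find the closest pair in feature space and drop its newer member.
            pairs = [(a, b) for a in range(len(slot)) for b in range(a + 1, len(slot))]
            _, newer = min(pairs, key=lambda p: math.dist(slot[p[0]], slot[p[1]]))
            slot.pop(newer)


memory = DiverseClassBalancedMemory(num_classes=2, capacity=4)
for feat in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]:   # a temporally correlated burst
    memory.add(feat, pseudo_label=0)
print(memory.slots[0])
```

A recency-only buffer would have kept the two newest samples, including the near-duplicate; the diversity rule keeps the two samples that are far apart in feature space instead.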

2605.19889 2026-05-20 cs.GR cs.CV Version update

GLUT: 3D Gaussian Lookup Table for Continuous Color Transformation

Danna Xue, David Serrano-Lozano, Shaolin Su, Javier Vazquez-Corral

Institutions: Computer Vision Center; Universitat Autònoma de Barcelona

AI summary: This paper proposes GLUT, a continuous and explicit color representation that models color transformations with learned 3D Gaussian primitives, offering flexible representational capacity, a compact memory footprint, and efficient, user-friendly editing.

Comments Project page: https://color.cvc.uab.cat/glut/

Details
Abstract

3D Lookup Tables (3D LUTs) are widely used for color mapping, but their grid-based representation requires discretizing the RGB space, leading to a capacity-memory trade-off that becomes prohibitive when storing large numbers of LUTs. Recent approaches adopt implicit neural representations to improve scalability, yet their black-box nature limits interpretability and hinders intuitive, localized editing. In this paper, we propose Gaussian LUT (GLUT), a continuous and explicit color representation that models color transformations using a set of learnable 3D Gaussian primitives. By avoiding fixed-resolution grids, GLUT achieves flexible representational capacity while maintaining a compact memory footprint. Its explicit, spatially localized formulation further enables both accurate modeling and interpretability. Building on this representation, we introduce a compact conditional generator (CGLUT) that predicts GLUT parameters for multiple LUT instances, encoding diverse color styles in a single framework to enable smooth and controllable LUT style blending. Moreover, GLUT supports efficient, user-friendly editing by allowing localized adjustments to specific color regions without global retraining. Experimental results demonstrate that our approach outperforms prior neural LUT representations in both accuracy and efficiency, while offering improved interpretability and interactive control.
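
A color transform built from Gaussian primitives in RGB space might look like the following minimal sketch. The residual formulation, the isotropic Gaussians, the clamping, and the name `gaussian_lut` are all assumptions; the abstract does not give GLUT's actual parameterization.

```python
import math


def gaussian_lut(rgb, primitives):
    """Shift an RGB color by Gaussian-weighted offsets.

    primitives is a list of (mean_rgb, sigma, offset_rgb) tuples; each primitive
    only affects colors near its mean, which is what makes edits local.
    """
    out = list(rgb)
    for mean, sigma, offset in primitives:
        dist_sq = sum((c - m) ** 2 for c, m in zip(rgb, mean))
        weight = math.exp(-dist_sq / (2.0 * sigma ** 2))
        for k in range(3):
            out[k] += weight * offset[k]
    return tuple(min(1.0, max(0.0, v)) for v in out)   # keep colors in [0, 1]


# One hypothetical primitive that adds green to colors near a saturated red.
primitives = [((0.8, 0.2, 0.2), 0.1, (0.0, 0.1, 0.0))]

near = gaussian_lut((0.8, 0.2, 0.2), primitives)   # at the primitive's center
far = gaussian_lut((0.1, 0.1, 0.9), primitives)    # a distant blue
print(near, far)
```

Because each primitive has an explicit center and extent, adjusting one of them changes only nearby colors, which is the kind of localized edit that a fixed grid or a black-box network makes harder.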

2605.19869 2026-05-20 cs.CV cs.AI Version update

Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

Ananth Sriram, Neel Mokaria, Rajveer Singh

Institutions: Department of Computer Science, University of Maryland, College Park, MD, USA

AI summary: This paper proposes a passive construction-site safety monitoring method that processes video through a three-stage architecture combining fine-tuned YOLO11, SAM 3, and Qwen3-VL-8B-Instruct, and uses a persona-scaffolded adversarial chain-of-thought protocol for compliance verification and hallucination control. The main contribution is the Stage 3 prompt design, which improved precision by 12%.

Comments 10 pages, 4 figures. First place, Ironsite.ai Spatial Intelligence Hackathon, University of Maryland, February 2026. Code available at https://github.com/ananthsriram1/ironsite-hackathon-project-safety_assistant

Details
Abstract

Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and produces per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.

2605.19868 2026-05-20 cs.CV Version update

WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation

Muhammad Ashad Kabir, Rabin Dulal

Institutions: School of Computing, Mathematics and Engineering, Charles Sturt University, NSW, Australia

AI summary: This study proposes WoundFormer, a framework that improves the accuracy of multi-class wound tissue segmentation through multi-scale spatial feature fusion, addressing the shortcomings of existing methods on heterogeneous tissue composition.

Comments 10 pages

Details
Abstract

Chronic wounds such as diabetic foot ulcers and pressure injuries require accurate tissue-level assessment to guide treatment planning and monitor healing progression. While deep learning methods have advanced automated wound analysis, most existing approaches focus on binary segmentation and inadequately model heterogeneous tissue composition due to high intra-class variability and limited annotated data. Multi-class wound tissue segmentation, therefore, remains a challenging and clinically relevant problem. We propose WoundFormer, a transformer-based framework that enhances hierarchical spatial feature fusion for multi-class wound tissue segmentation. Specifically, we replace the standard SegFormer decoder with a spatially-preserving multi-scale aggregation head that maintains feature topology during cross-scale integration and strengthens contextual interactions through convolutional fusion. This design improves boundary localization and discrimination between visually similar tissue categories while preserving transformer efficiency. We evaluate WoundFormer on the WoundTissueSeg dataset (147 images, six tissue classes) and a second benchmark (DFUTissue dataset). The proposed method achieves an overall Dice score of 81.9%, outperforming strong CNN- and transformer-based baselines by up to 4.3 Dice points on the WoundTissueSeg benchmark, with consistent improvements across minority tissue classes. These results indicate that explicit modeling of hierarchical spatial interactions enhances transformer representations for heterogeneous wound tissue segmentation and supports more reliable quantitative wound assessment.
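
For readers unfamiliar with the metric, per-class Dice on label masks can be computed as in the sketch below. The flattened toy masks are made up, and how the paper aggregates per-class scores into its overall Dice is not stated in the abstract, so no averaging is shown.

```python
def dice_per_class(pred, target, num_classes):
    """Dice = 2 * |pred ∩ target| / (|pred| + |target|) for each class label."""
    scores = {}
    for c in range(num_classes):
        pred_count = sum(1 for v in pred if v == c)
        target_count = sum(1 for v in target if v == c)
        overlap = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        if pred_count + target_count == 0:
            continue                      # class absent from both masks
        scores[c] = 2.0 * overlap / (pred_count + target_count)
    return scores


# Flattened toy masks with three tissue classes (labels 0, 1, 2).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
scores = dice_per_class(pred, target, num_classes=3)
print(scores)
```

Reporting the score per class, as here, is what exposes weak performance on minority tissue classes that a single pixel-level average can hide.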

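2605.19866 2026-05-20 cs.CV Version update

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

Peter El Hachem, Ahmed Nassar, A. Said Gurbuz, Christoph Auer, Peter W. J. Staar

Institutions: ETH Zurich; IBM Research

AI summary: This paper proposes a structured layout prior: a lightweight RT-DETR detector runs outside the decoder, and its detections are serialized in the parser's DocTags vocabulary and injected into the prompt. This addresses the two-hop bottleneck in document layout parsing and makes document understanding more robust.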

Comments 18 pages, 7 figures. Main text: 9 pages (4 figures); Appendix: 9 pages (3 figures)

Details
Abstract

Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when the first hop fails the second collapses into omissions, malformed structure, or autoregressive repetition. We pre-resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser's native DocTags vocabulary, and injecting them into the prompt alongside the full page image. Unlike analyze-then-parse approaches that crop the page, or prior prompt-level priors written in plain text, our prior shares the decoder's generation space and leaves the global image in view as a fallback when detections are noisy. On a 10k-page structural out-of-distribution benchmark, markdown F1 rises from $0.37$ to $0.92$; on the Chinese subset of OmniDocBench, table TEDS rises from $0.01$ to $0.36$; and on the 26k-page ViDoRe V3 benchmark, infinite-loop decoding failures drop across every industrial domain tested. These gains cost $15\%$ wall-clock latency and a median of $74$ prompt tokens, with no architectural change to the base VLM. An attention-level analysis further reveals a bimodal phase shift in which the decoder attends to injected layout tokens when emitting structure and to image patches when emitting content, consistent with the two-hop bottleneck being alleviated. Model weights will be released to support reproducibility.

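2605.19865 2026-05-20 cs.CV Version update

Landscape-Awareness for Geometric View Diffusion Model

Yan-Ting Chen, Hao-Wei Chen, Tsu-Ching Hsiao, Chun-Yi Lee

Institutions: Elsa Lab, National Taiwan University

AI summary: This paper proposes a landscape-aware method for a geometric view diffusion model. It reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, then refines the estimate with a viewpoint-conditioned diffusion model, improving convergence, reducing reliance on brute-force sampling, and raising sample efficiency.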

Comments CVPR2026

Details
Abstract

Accurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123 to synthesize novel views conditioned on relative viewpoint, showing promising results when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from a nonconvex loss landscape with numerous local minima, making them sensitive to initialization and reliant on naive multistart strategies. We analyze these optimization challenges and visualize failure cases, showing that geometric ambiguities, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample efficiency.
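
The sensitivity to initialization that motivates this work can be reproduced on a toy problem. The one-dimensional loss below is a stand-in chosen for illustration, not the paper's diffusion-based MSE objective; `x` plays the role of a single viewpoint parameter, and the step size and start points are arbitrary.

```python
def loss(x):
    """Toy nonconvex loss: global minimum near x = -1.04, local minimum near x = 0.96."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def grad(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


def descend(x, lr=0.05, steps=500):
    """Plain gradient descent from a given initialization."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Naive multistart: run descent from several initializations and keep the best.
starts = [-1.5, -0.5, 0.5, 1.5]
finals = [descend(x0) for x0 in starts]
best = min(finals, key=loss)
print(finals, best)
```

Runs that start on the wrong side of the barrier settle in the local minimum, so plain descent is only as good as its initialization; multistart recovers the global minimum here at the cost of several full optimizations, which is the brute-force expense the abstract aims to reduce.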

2605.19855 2026-05-20 cs.CV cs.AI Version update

A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability

Giacomo Astolfi, Matteo Bianchi, Riccardo Campi, Antonio De Santis, Marco Brambilla

Institutions: Politecnico di Milano, DEIB

AI summary: This paper proposes a framework for evaluating zero-shot image generation in concept-based explainable AI. It generates synthetic concept datasets for concept-based XAI methods and examines the challenges and open questions of using zero-shot text-to-image models in model analysis.

Comments G. Astolfi, M. Bianchi, and R. Campi contributed equally

Details
Abstract

Concept-based Explainable Artificial Intelligence (XAI) interprets deep learning models using human-understandable visual features (e.g., textures or object parts) by linking internal representations to class predictions, thereby bridging the gap between low-level image data and high-level semantics. A major challenge, however, is the reliance on large sets of labeled images to represent each concept, which limits scalability. In this work, we investigate the use of zero-shot Text-to-Image (T2I) generative models as a source of synthetic concept datasets for concept-based XAI methods. Specifically, we generate concepts using predefined prompts and evaluate their faithfulness to real ones through four complementary analyses: (1) comparing synthetic vs. real concept images via concept representation similarity; (2) evaluating their intra-similarity by comparing pairs of subsets of the same concept with progressively increasing size; (3) evaluating their performance for downstream explanation tasks using relevant class images; (4) evaluating how removing a concept from tested class images affects explanations of generated concepts. While current T2I generative models promise a shortcut to concept-based XAI, our study highlights challenges and raises open questions about the use of synthetic data generated by zero-shot pipelines in model analyses. The resulting dataset is available at https://github.com/DataSciencePolimi/ZeroShot-T2I-Concepts.

2605.19837 2026-05-20 cs.CV cs.AI cs.CL cs.RO Version update

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

Sherif Khairy, Catherine M. Elias

Institutions: Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt

AI summary: This paper proposes CADENet, a training-free three-thread system that detects objects in adverse weather for autonomous driving through condition-adaptive enhancement and entropy-guided NMS fusion, with no retraining or additional hardware.

Details
Abstract

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.
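
The evaluation ceiling described in the abstract can be made concrete with a small worked example. The counts below are invented for illustration and are not taken from DAWN or from the paper's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up scene: 10 real objects, but only 6 were annotated on the degraded image.
# Baseline detector: finds 4 of the annotated objects and nothing else.
baseline = precision_recall_f1(tp=4, fp=0, fn=2)

# Enhanced detector: finds 5 annotated objects plus 3 real objects the annotators
# could not see, which incomplete labels score as false positives.
enhanced = precision_recall_f1(tp=5, fp=3, fn=1)

# The same enhanced detections scored against complete labels for all 10 objects.
enhanced_complete = precision_recall_f1(tp=8, fp=0, fn=2)
print(baseline, enhanced, enhanced_complete)
```

Against the incomplete labels the enhanced detector's F1 falls below the baseline even though it finds more real objects, because the recovered objects are charged to precision. Recall over the annotated objects still rises, which is why it is the safer headline metric in this setting.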

2605.19824 2026-05-20 cs.AI cs.CL cs.CV cs.RO Version update

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

Institutions: Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; M.Eng. Robotics Candidate at Deggendorf Institute of Technology, Germany; IAV GmbH, Berlin, Germany

AI summary: This study asks whether temporal conditioning in inter-agent communication can preserve or enhance reasoning coherence without degrading semantic or logical consistency, and evaluates three planner architectures with progressively increasing temporal integration on curated subsets of the BDD-X dataset. Temporal conditioning changes reasoning style but yields no statistically significant improvement on standard NLP correctness metrics; qualitative analysis nonetheless reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence.

Details
Abstract

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

2605.19821 2026-05-20 cs.CV Version update

LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition

Jiaxin Wang, Muwei Jian, Hui Yu, Junyu Dong, Yifan Xia

Institutions: School of Airspace and Engineering, Shandong University; School of Computer Science and Technology, Shandong University of Finance and Economics; School of Psychology and Neuroscience, University of Glasgow; Faculty of Information Science and Engineering, Ocean University of China

AI summary: This paper proposes LaCoVL-FER, a landmark-guided contrastive learning network with vision-language enhancement. It introduces geometric priors from facial landmarks and semantic priors from a vision-language model to address facial expression recognition in the wild and to improve robustness and generalization.

Details
Abstract

Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at https://github.com/ylin06804/LaCoVL-FER.

2605.19804 2026-05-20 cs.CV cs.AI cs.LG Version update

Stitched Value Model for Diffusion Alignment

Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler

Institutions: ETH Zurich; Google; University of Copenhagen

AI summary: This paper proposes StitchVM, a stitching framework that transfers reward models pretrained on clean images to the noisy latent space. The transfer and fine-tuning are lightweight and make diffusion alignment faster and more effective.

Comments Project page: https://gohyojun15.github.io/StitchVM/

Details
Abstract

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.
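
The bias-versus-cost trade-off between Tweedie-style and Monte Carlo value estimates can be checked on a one-dimensional case where everything has a closed form. The Gaussian prior, the quadratic stand-in reward, and the noising convention $x_t = x_0 + \sigma\epsilon$ are assumptions chosen so the posterior is tractable; real diffusion models need rollouts through the sampler instead of direct posterior draws.

```python
import random


def tweedie_mean(x_t, sigma, score):
    """Tweedie's formula for x_t = x_0 + sigma * noise: returns E[x_0 | x_t]."""
    return x_t + sigma ** 2 * score


def reward(x):
    return x * x          # stand-in reward defined on clean samples


# Assumed toy setup: x_0 ~ N(mu, s^2), so the quantities below are exact.
mu, s, sigma, x_t = 0.0, 1.0, 1.0, 1.0
score = -(x_t - mu) / (s ** 2 + sigma ** 2)          # score of the noisy marginal
post_mean = tweedie_mean(x_t, sigma, score)
post_var = (s ** 2 * sigma ** 2) / (s ** 2 + sigma ** 2)

tweedie_value = reward(post_mean)                    # one evaluation, but biased
exact_value = post_mean ** 2 + post_var              # E[x_0^2 | x_t]

rng = random.Random(0)
draws = [rng.gauss(post_mean, post_var ** 0.5) for _ in range(20000)]
mc_value = sum(reward(x) for x in draws) / len(draws)  # unbiased, needs many draws
print(tweedie_value, exact_value, mc_value)
```

Evaluating the reward at the posterior mean ignores the posterior variance, so the Tweedie-style value underestimates the true expected reward in this example, while the Monte Carlo average approaches it only after many samples. A learned value model aims to avoid both costs.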

2605.19799 2026-05-20 cs.CV cs.AI 版本更新

Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

协同基础模型用于半监督胎儿心脏超声分析:SAM-Med2D边界细化与DINOv3语义增强

Tonghao Zhuang, Shanglong Hu, Yongsheng Luo, Zhiqi Zhang, Yu Li

发表机构 * Zhuhai College of Science and Technology(珠海科技学院)

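AI summary: This paper proposes a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. It combines SAM-Med2D for boundary refinement with DINOv3 for semantic enhancement, improving performance on fetal congenital heart disease screening.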

Comments Accepted to the ISBI 2026 Fetal HearT UltraSound Segmentation and Diagnosis (FETUS) Challenge

详情
AI中文摘要

我们提出了一种半监督框架,用于胎儿心脏超声图像的联合分割和分类。基于EchoCare多任务主干网络,我们的方法整合了SAM-Med2D用于边界细化,并利用DINOv3提升伪标签质量。我们引入了视图特定的硬掩膜,并结合一种两阶段优化策略:一个EMA阶段用于巩固分割能力,随后是一个分类微调阶段,该阶段冻结分割参数并重置分类头以恢复分类性能,而不影响分割效果。在FETUS 2026排行榜上评估,我们的方法在Dice相似系数上达到79.99%,归一化表面距离为61.62%,F1分数为41.20%,验证了我们方法在产前先天性心脏病筛查中的有效性。源代码可在https://github.com/2826056177/zcst_fetus2026公开获取。

英文摘要

We present a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. Built upon the EchoCare multi-task backbone, our method integrates SAM-Med2D for boundary refinement and leverages DINOv3 to enhance pseudo-label quality. We introduce view-specific hard masking along with a two-stage optimization strategy: an EMA phase to consolidate segmentation capabilities, followed by a Classification Fine-Tuning phase that freezes segmentation parameters and resets the classification head to recover classification performance without compromising segmentation gains. Evaluated on the FETUS 2026 leaderboard, our method achieves a Dice Similarity Coefficient of 79.99%, a Normalized Surface Distance of 61.62%, and an F1-score of 41.20%, validating the effectiveness of our approach for prenatal congenital heart disease screening. Source code is publicly available at: https://github.com/2826056177/zcst_fetus2026.
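
For readers unfamiliar with the headline metric, here is a minimal sketch of the Dice Similarity Coefficient on flat binary masks. It is the standard definition, 2|P ∩ T| / (|P| + |T|), not the challenge's evaluation code; the function name and the convention for two empty masks are assumptions.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
print(f"Dice = {score:.4f}")
```

Here the masks share two foreground pixels and each has three, so the score is 4/6.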

2605.19797 2026-05-20 cs.CV 版本更新

Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

Depth2Pose: 一种用于单目深度估计的基于姿态的基准,无需真实深度

Viktor Kocur, Sithu Aung, Gabrielle Flood, Yaqing Ding, Lukas Bujnak, Torsten Sattler, Zuzana Kukelova

发表机构 * Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava(数学、物理与信息学学院,布拉迪斯拉瓦科门纽斯大学) Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague(电气工程学院视觉识别组,布拉格捷克技术大学) Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague(捷克信息学、机器人学与自动控制研究所,布拉格捷克技术大学)

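AI summary: This paper proposes the Depth2Pose benchmark for evaluating monocular depth estimators on downstream tasks. It combines depth predictions with feature matches and uses relative camera pose accuracy as a proxy for depth quality, avoiding the costly per-pixel ground-truth depth that traditional benchmarks require.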

详情
AI中文摘要

近年来,得益于日益强大的模型和大规模训练数据,单目深度估计取得了显著进步。预测深度越来越多地被用作运动恢复结构(SfM)、视觉定位和SLAM等下游任务的输入信号。然而,单目深度估计器(MDE)仍主要以深度精度进行评估。标准指标对误差做全局汇总,可能无法反映深度对下游几何任务的实际价值。因此,我们提出Depth2Pose,一个在下游任务背景下评估MDE的框架。通过在深度感知的几何求解器中结合深度预测与特征匹配,我们将相对相机位姿估计精度作为任务驱动的深度质量代理指标。传统基准需要逐像素深度形式的稠密真值,获取成本高昂。相比之下,我们的方法只需要相机位姿,而位姿可以高效估计,例如使用SfM流水线。因此,我们的框架可用于难以获取真实深度的场景,例如场景尺度大或遮挡严重(如植被环境)的情形。基于这一点,我们引入了D2P数据集,其中包含常用训练数据分布之外的高难度场景。我们表明,在现有基准上按标准深度误差指标表现良好的方法,在相同数据集上按我们基于位姿的指标评估时同样表现良好,但未必能泛化到我们更具挑战性的数据集。最后,我们提供了一个简单且可扩展的评估框架。数据集和代码可在kocurvik.github.io/depth2pose获取。

英文摘要

Monocular depth estimation has improved significantly in recent years, driven by increasingly powerful models and large-scale training data. Predicted depth is increasingly used as an input signal for downstream tasks such as Structure-from-Motion (SfM), visual localization, and SLAM. However, monocular depth estimators (MDEs) are still primarily evaluated in terms of depth accuracy. Standard metrics aggregate errors globally and may not reflect the usefulness of depth for downstream geometric tasks. We therefore propose Depth2Pose, a framework for evaluating MDEs in the context of downstream tasks. By combining depth predictions with feature correspondences in depth-aware geometric solvers, we use relative camera pose estimation accuracy as a task-driven proxy for depth quality. Traditional benchmarks require dense ground truth in the form of per-pixel depth, which is expensive to obtain. In contrast, our formulation requires only camera poses, which can be estimated efficiently, e.g., using Structure-from-Motion pipelines. As a result, our framework can be applied to scenes where ground-truth depth is difficult to obtain, for example due to large scene scale or heavy occlusions (e.g., vegetated environments). Leveraging this, we introduce the D2P dataset, which contains challenging scenes outside the distribution of commonly used training data. We show that methods performing well under standard depth error metrics on existing benchmarks also perform well under our pose-based metric when evaluated on the same datasets, but do not necessarily generalize to our more challenging dataset. Finally, we provide a simple and extensible evaluation framework. The dataset and code are available at kocurvik.github.io/depth2pose.
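
The abstract does not state which pose error is used, so the following is only a sketch of two error measures commonly reported for relative pose: the geodesic angle between the ground-truth and estimated rotations, and the angle between the translation directions. The function names and the choice of these two measures are assumptions, not the benchmark's actual protocol.

```python
import math


def transpose_times(a, b):
    """Return a^T @ b for 3x3 matrices given as nested lists."""
    return [[sum(a[k][i] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_error_deg(r_gt, r_est):
    """Geodesic angle of the relative rotation R_gt^T R_est, in degrees."""
    rel = transpose_times(r_gt, r_est)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_deg(t_gt, t_est):
    """Angle between the two translation directions, in degrees."""
    dot = sum(a * b for a, b in zip(t_gt, t_est))
    norm = math.sqrt(sum(a * a for a in t_gt)) * math.sqrt(sum(b * b for b in t_est))
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def rot_z(deg):
    """Rotation about the z axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_err = rotation_error_deg(identity, rot_z(10.0))
trans_err = translation_error_deg([1.0, 0.0, 0.0], [0.0, 2.0, 0.0])
print(f"rotation error = {rot_err:.3f} deg, translation error = {trans_err:.3f} deg")
```

Both measures need only camera poses as reference data, which is the property the abstract relies on to avoid per-pixel ground-truth depth.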

2605.19792 2026-05-20 cs.CV 版本更新

Mechanisms of Object Localization in Vision-Language Models

视觉-语言模型中物体定位的机制

Timothy Schaumlöffel, Martina G. Vilas, Gemma Roig

发表机构 * Goethe University Frankfurt, Germany(法兰克福歌德大学,德国) The Hessian Center for AI, Germany(黑森人工智能中心,德国)

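AI summary: This paper studies the core mechanisms of object localization in vision-language models. Analyzing LLaVA-1.5 and InternVL-3.5, it shows that localization relies on a containerization mechanism and that only a small number of attention heads mediate classification and localization, offering guidance for future model design.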

Comments Accepted at CVPR 2026

详情
AI中文摘要

视觉接地的语言模型(VLMs)在关联视觉和文本信息方面非常有效,但在基本的分类和定位任务上常常表现不佳。分类机制已得到较多研究,而支撑物体定位的过程仍缺乏了解。在本工作中,我们使用一系列机制可解释性工具,包括token消融、注意力敲除(attention knockout)和因果中介分析,研究了两个代表性模型家族:LLaVA-1.5和InternVL-3.5。我们发现,定位由一种“容器化”机制驱动:与物体对齐的token界定物体的空间范围,而边界内token的语义排列与预测框基本无关。只有极少数注意力头介导了分类和定位的因果效应,在LLaVA中集中于早期至中期层,在InternVL中集中于中期至后期层。两个任务共享部分早期处理,但最终依赖于基本不同的专用注意力头。总体而言,我们首次给出了VLMs中定位在层级和注意力头级别的解释,揭示了狭窄的计算路径,可为未来的模型设计和接地(grounding)目标提供指导。

英文摘要

Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.

2605.19786 2026-05-20 cs.CV 版本更新

Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

通过时空注意力链实现快速的4D网格生成

Dvir Samuel, Yuval Atzmon, Gal Chechik, Yoni Kasten

发表机构 * NVIDIA Research(NVIDIA研究)

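AI summary: This paper proposes a training-free method that accelerates 4D mesh generation and improves temporal correspondence quality through spatio-temporal attention chains. It generates a 4D mesh in 9 seconds (a 13x speedup), scales to videos up to 16x longer, performs competitively on 2D object tracking and 4D tracking, and supports reliable camera estimation.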

Comments https://research.nvidia.com/labs/par/fast4dmesh/

详情
AI中文摘要

4D网格生成最近已成为从视频中恢复动态3D结构的强大范式,但现有方法仍然缓慢、计算成本高且难以扩展到更长的序列。我们介绍了一种无需训练的方法,以加速4D网格生成并提高时间对应质量。我们的关键观察是,时间对应关系在4D骨干网络生成视觉准确的网格之前就已经在其中出现。我们利用这一发现,提出了一种通用框架,称为时空注意力链,该框架在空间和时间上传播信息。从锚定网格的顶点开始,链将顶点映射到潜在令牌,然后在潜在空间中跟随时间对应关系,并通过潜在到顶点的注意力恢复帧特定的顶点。这种设计避免了昂贵的显式匹配,同时保留了锚定网格的细节,从而改进了动态网格几何和时间一致性。与最先进的方法相比,我们的方法在9秒内生成4D网格,实现13倍的速度提升,同时产生更高质量的结果。此外,我们的方法可扩展到长达16倍的视频序列,而不降低网格质量。除了生成外,改进的对应关系使方法在两个下游任务上表现出色:2D物体跟踪和4D跟踪。我们进一步表明,我们的框架能够实现可靠的相机估计,这是先前4D网格生成方法所不支持的能力。

英文摘要

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain, which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details, thereby improving dynamic mesh geometry and temporal consistency. Compared to the state of the art, our method generates a 4D mesh in 9 seconds, achieving a $13\times$ speedup while producing higher-quality results. Moreover, our approach scales to videos up to $16\times$ longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.

2605.18739 2026-05-20 cs.CV cs.DC 版本更新

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

LongLive-2.0: 一个基于NVFP4的长视频生成并行基础设施

Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, Song Han

发表机构 * NVIDIA

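AI summary: This paper proposes LongLive-2.0, an NVFP4-based parallel infrastructure covering the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. Sequence-parallel autoregressive training combined with NVFP4 precision reduces GPU memory consumption and speeds up training. The system also tunes a diffusion model into an autoregressive diffusion model for long video generation and delivers efficient inference across GPU architectures.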

Comments Code, model, and demos are available at https://github.com/NVlabs/LongLive

详情
AI中文摘要

我们提出LongLive-2.0,一个基于NVFP4的并行基础设施,覆盖长视频生成的完整训练与推理流程,以解决速度和内存瓶颈。在训练方面,我们引入序列并行(SP)自回归(AR)训练,具体实现为Balanced SP:它在每个rank上配对干净历史与含噪目标的时间分块,将高效的教师强制布局与SP执行协同设计,从而实现自然的教师强制掩码和感知SP的分块VAE编码。结合NVFP4精度,它降低了GPU内存开销并加速了训练中的GEMM计算,而GEMM计算的占比会随视频长度增加而上升。此外,我们表明高质量的基础设施和数据集可以带来非常简洁的训练流程。现有Self-Forcing系列方法依赖ODE初始化和随后的分布匹配蒸馏(DMD),与之不同,LongLive-2.0直接将扩散模型微调为长时、多镜头、可交互的自回归(AR)扩散模型。借助独立的LoRA权重,它还可以进一步转换为实时生成(4到2步去噪)。在Blackwell GPU上推理时,我们启用W4A4 NVFP4推理,将KV缓存量化为NVFP4以节省内存,并通过异步流式VAE解码提高端到端吞吐量。在非Blackwell GPU架构上,我们部署SP推理以达到Blackwell GPU上的速度,同时量化后的KV缓存可以降低SP的跨GPU通信量。实验表明,训练最高提速2.15倍,推理提速1.84倍。LongLive-2.0-5B的推理速度达到45.7 FPS,同时在基准测试中取得强劲性能。据我们所知,LongLive-2.0是首个面向长视频生成的NVFP4训练与推理系统。

英文摘要

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.

2605.16137 2026-05-20 cs.CV cs.RO 版本更新

STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System

STABLE: 通过语义-物理双系统生成可直接用于仿真的桌面布局

Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu, Feng Zheng, Jiangmiao Pang, Yanwei Fu

发表机构 * Shanghai AI Laboratory(上海人工智能实验室)

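AI summary: This paper proposes STABLE, a semantics-physics dual system for generating simulation-ready tabletop layouts. A semantic reasoning module generates a coarse layout, and a physics correction module refines it to ensure physical plausibility, improving the physical validity of the scenes.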

Comments ICML 2026

详情
AI中文摘要

根据任务指令生成可直接用于仿真的桌面场景,是具身智能(Embodied AI)领域一个引人关注且有前景的研究方向。然而,现有的任务到场景生成方法完全依赖大型语言模型(LLMs)预测场景布局,由于LLMs在三维空间推理上的固有局限,不可避免地会产生物体碰撞或悬浮。在本文中,我们提出STABLE,一种专为生成可直接仿真的桌面场景而设计的语义-物理双系统。STABLE由两个互补模块组成:(i)语义推理器(Semantic Reasoner),一个在结构化桌面场景数据集上微调的LLM,用于根据输入的任务指令生成粗略布局;(ii)物理校正器(Physics Corrector),一个具有物理感知的基于流的去噪模型,输出位姿更新以细化布局,在保持与任务指令语义一致的同时确保场景的物理合理性。STABLE采用渐进式生成范式:通过交替调用语义推理器和物理校正器,将场景从任务关键物体逐步扩展到背景物体。实验表明,STABLE能够生成严格符合任务指令、可直接用于仿真的桌面场景,并且相比已有方法显著提升了场景的物理有效性。

英文摘要

Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding colliding or floating objects due to LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, ensuring the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and the Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.

2605.15599 2026-05-20 cs.CV cs.AI 版本更新

Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study

预训练目标在极低数据细粒度视觉分类中的影响:一个骨干网络控制研究

Alexander Hackett, Srikanth Thudumu, Ginny Fisher, Jason Fisher

发表机构 * Santa Clara University(圣克拉拉大学) IAAIR

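AI summary: This paper studies how the pretraining objective affects downstream representation quality in extreme low-data fine-grained visual classification. Comparing four frozen ViT-B/16 encoders, it recommends prioritizing margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules.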

Comments Presented at the 13th Workshop on Fine-Grained Visual Categorization (FGVC13) at CVPR 2026

Journal ref 13th Workshop on Fine-Grained Visual Categorization (FGVC13), CVPR 2026

详情
AI中文摘要

极低数据条件下的细粒度分类在专家领域很常见,这些领域的标注成本高昂,而从业者仍需要有原则的指导来选择预训练编码器。我们以祖母绿内含物分级为研究对象,使用一个包含三个类别标注图像的定制数据集,探讨:在骨干容量匹配的条件下,预训练目标如何影响下游表示质量?我们比较了四种冻结的ViT-B/16编码器,它们分别通过监督分类、对比学习(SigLIP2)、掩码重建(MAE)和自蒸馏(DINOv3)训练,并使用留一法交叉验证配合线性和非线性探针进行评估。为控制小样本(低N)情形下的统计噪声,我们对宏平均一对其余(one-vs-rest)AUC进行置换检验(N=1000)。监督和对比学习编码器的线性可分性最强(逻辑回归AUC:0.768和0.735;SVM AUC:0.739和0.697),而MAE在非线性探针下有所提升(XGBoost AUC:0.713)。我们发现DINOv3在该领域的各类探针上均表现不佳。这些结果支持一条针对极低数据FGVC的实用建议:当数据稀缺使探针只能采用线性决策规则时,优先选择强化间隔(margin-enforcing)的预训练目标;当数据集条件允许使用非线性分类器时,可考虑重建式编码器。

英文摘要

Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (N=1000) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE improves under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 underperforms across probe families in this domain. These results support a practical recommendation for extreme low-data FGVC: prioritize margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible given dataset constraints.
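
The permutation test mentioned in this abstract can be sketched in a simplified binary form. The paper applies it to macro one-vs-rest AUC over three classes; the version below, with made-up scores and hypothetical function names, only shows the mechanics: compute the observed AUC, shuffle the labels many times, and report how often a shuffled AUC is at least as large.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_permutations, rng):
    """One-sided p-value for the observed AUC under label shuffling."""
    observed = auc(scores, labels)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


# Illustrative data only: ten samples whose scores separate the classes perfectly.
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.65, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
observed_auc, p_value = permutation_p_value(scores, labels, 1000, random.Random(0))
print(f"AUC = {observed_auc:.3f}, permutation p-value = {p_value:.4f}")
```

With so few samples, an AUC that looks strong can still arise by chance, which is why the abstract pairs each AUC with a permutation test rather than reporting it alone.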

2605.13193 2026-05-20 cs.CV 版本更新

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

FIKA-Bench: 从细粒度识别到细粒度知识获取

Geng Li, Yuxin Peng

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学计算机技术研究所)

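AI summary: This paper proposes FIKA-Bench, a fine-grained knowledge acquisition benchmark of 311 public-source and real-life instances, filtered and audited to ensure instance quality. Evaluating the latest multimodal models and agents shows that the task remains challenging and that better agent designs are needed to improve knowledge acquisition.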

Comments Project page with code: https://ligeng0197.github.io/FIKA-Bench.github.io/

详情
AI中文摘要

日常生活中的细粒度识别往往不是闭卷分类问题:遇到陌生物体时,人类会主动搜索、比较视觉细节并核实证据,然后再做判断。现有基准主要评估视觉识别能力,这种主动获取外部知识的能力尚未得到充分研究。我们研究细粒度知识获取,即系统必须寻找、核实并使用外部证据来回答开放式的细粒度识别问题。我们引入FIKA-Bench,一个防泄漏且有证据支撑的集合,包含311个来自公开来源和现实生活的实例。为确保质量,每个样例都经过前沿闭卷模型过滤以去除已被记忆的案例,并经过审核以消除图像-答案泄漏,仅保留有已核实证据支持的样本。我们对最新的大型多模态模型(LMMs)和智能体的评估表明,该任务仍然极具挑战性:最佳系统的准确率仅为25.1%,没有任何模型超过30%。关键的是,我们发现仅为模型配备工具不足以弥合这一差距;智能体的失败主要源于错误的实体检索和较差的视觉判断。这些结果表明,可靠的知识获取需要更好的、聚焦于细粒度识别的智能体设计。

英文摘要

Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

2605.12640 2026-05-20 cs.CV 版本更新

MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation

MambaPanoptic:基于视觉Mamba的结构状态空间框架用于全景分割

Qing Cheng, Damiano Bertolini, Wei Zhang, Dong Wang, Niclas Zeller, Daniel Cremers

发表机构 * Technical University of Munich(慕尼黑技术大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Polytechnic University of Milan(米兰理工大学) University of Stuttgart(斯图加特大学) Wuhan University(武汉大学) Karlsruhe University of Applied Sciences(卡尔斯鲁厄应用科学大学)

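AI summary: This work proposes MambaPanoptic, a Vision Mamba-based structured state space framework for panoptic segmentation that targets long-range context modeling, multi-scale feature representation, and efficient dense prediction. It introduces MambaFPN and a PanopticFCN-style kernel generator, enhanced with QuadMamba-based feature refinement, for unified thing and stuff prediction.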

Comments Accepted to ISPRS Congress 2026, camera-ready version

详情
AI中文摘要

全景分割要求同时识别可计数的实例和无形态的物质区域,对长程上下文建模、多尺度特征表示和高效密集预测提出了联合需求。现有的卷积和Transformer方法难以同时满足这三个要求:卷积架构在建模长程依赖方面能力有限,而基于Transformer的方法在高分辨率下会带来二次计算成本。在本文中,我们提出MambaPanoptic,一种完全基于Mamba的全景分割框架,通过两个主要贡献来解决这些限制。首先,我们引入MambaFPN,一种自上而下的特征金字塔,利用Mamba块生成具有线性计算复杂度的全局一致、多尺度特征表示。其次,我们采用PanopticFCN风格的核生成器,产生统一的实例和物质核用于无提案的全景预测,并通过在多个网络阶段应用QuadMamba基于的特征细化模块进行增强。在Cityscapes和COCO全景分割基准测试中,实验表明MambaPanoptic在同等模型大小下一致优于PanopticDeepLab和PanopticFCN,并在Cityscapes上以更少的参数匹配或超越Mask2Former在PQ和AP上的表现。

英文摘要

Panoptic segmentation requires the simultaneous recognition of countable thing instances and amorphous stuff regions, placing joint demands on long-range context modelling, multi-scale feature representation, and efficient dense prediction. Existing convolutional and transformer-based methods struggle to satisfy all three requirements concurrently: convolutional architectures are limited in their capacity to model long-range dependencies, while transformer-based methods incur quadratic computational cost that is prohibitive at high resolutions. In this paper, we propose MambaPanoptic, a fully Mamba-based panoptic segmentation framework that addresses these limitations through two principal contributions. First, we introduce MambaFPN, a top-down feature pyramid that leverages Mamba blocks to generate globally coherent, multi-scale feature representations with linear computational complexity. Second, we adopt a PanopticFCN-style kernel generator that produces unified thing and stuff kernels for proposal-free panoptic prediction, enhanced by a QuadMamba-based feature refinement module applied at multiple network stages. Experiments on the Cityscapes and COCO panoptic segmentation benchmarks demonstrate that MambaPanoptic consistently outperforms PanopticDeepLab and PanopticFCN under comparable model sizes, and matches or surpasses Mask2Former on Cityscapes in PQ and AP while requiring fewer parameters.

2605.12320 2026-05-20 cs.CV 版本更新

Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos

在噪声时间自监督下利用对比学习进行结肠镜视频处理

Luca Parolari, Pietro Gori, Lamberto Ballan, Carlo Biffi, Loic Le Folgoc

发表机构 * Department of Mathematics, University of Padova, Padova, Italy(帕多瓦大学数学系) LTCI, Telecom Paris, Institut Polytechnique de Paris, Palaiseau, France(巴黎电信学院) Cosmo Intelligent Medical Devices, Dublin, Ireland(都柏林智能医疗设备公司)

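AI summary: This paper proposes contrastive learning under noisy temporal self-supervision for colonoscopy videos. It derives self-supervised associations from the sequential workflow of colonoscopy procedures and introduces a noise-aware contrastive loss to handle noisy associations, outperforming prior self-supervised and supervised baselines across multiple downstream tasks.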

Comments Accepted to MICCAI 2026

详情
AI中文摘要

学习鲁棒的息肉轨迹表示对于启用多项AI辅助结肠镜应用至关重要,从息肉特征化到自动化报告和检索。监督对比学习是学习此类表示的有效方法,但通常依赖于正确的正负定义。收集这些标签需要链接在整个视频中描绘相同基础息肉实体的轨迹,这成本高昂且需要专门的临床专业知识。在本工作中,我们利用结肠镜检查的顺序流程推导出自监督关联。由于时间推导的关联不保证正确,我们引入了噪声感知的对比损失以处理噪声关联。我们展示了所学表示在多项下游任务中的有效性,包括息肉检索和重识别、大小估计和组织学分类。我们的方法在多项任务中优于先前的自监督和监督基线方法,并且在所有任务中与最近的基座模型相匹配或超过,使用了一个仅在27个视频上训练的轻量级编码器。代码可在https://github.com/lparolari/ntssl上获得。

英文摘要

Learning robust representations of polyp tracklets is key to enabling multiple AI-assisted colonoscopy applications, from polyp characterization to automated reporting and retrieval. Supervised contrastive learning is an effective approach for learning such representations, but it typically relies on correct positive and negative definitions. Collecting these labels requires linking tracklets that depict the same underlying polyp entity throughout the video, which is costly and demands specialized clinical expertise. In this work, we leverage the sequential workflow of colonoscopy procedures to derive self-supervised associations from temporal structure. Since temporally derived associations are not guaranteed to be correct, we introduce a noise-aware contrastive loss to account for noisy associations. We demonstrate the effectiveness of the learned representations across multiple downstream tasks, including polyp retrieval and re-identification, size estimation, and histology classification. Our method outperforms prior self-supervised and supervised baselines, and matches or exceeds recent foundation models across all tasks, using a lightweight encoder trained on only 27 videos. Code is available at https://github.com/lparolari/ntssl.

2605.10180 2026-05-20 cs.CV cs.CR 版本更新

What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

什么概念在其中?在扩散变换器中检测和抑制危险内容

Chenyu Zhang

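AI summary: This paper studies how to detect and suppress risky content in Diffusion Transformers. It proposes AHV-D&S, a training-free inference-time safeguard that analyzes the concept-specific sensitivity of attention heads to detect and suppress risky generation tendencies, effectively suppressing sexual, copyrighted-style, and harmful content while preserving visual quality.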

Comments arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission

详情
AI中文摘要

文本到图像(T2I)模型的兴起日益引发了关于生成危险内容(如性、暴力和受版权保护的图像)的担忧,突显了在模型内部需要有效安全措施的必要性。尽管已有方法被提出以消除T2I模型中的危险概念,但它们主要针对早期的U-Net架构,使得最先进的基于扩散变换器(DiTs)的T2I模型缺乏充分保护。这一差距源于根本性的架构转变:扩散变换器(DiTs)通过联合注意力将语义注入和视觉合成结合起来,这使得在生成过程中隔离和消除危险内容变得困难。为了弥合这一差距,我们研究了DiTs中语义概念的表示方式,并发现注意力头表现出对概念的特定敏感性。这一特性使得能够同时检测和抑制危险内容。基于这一发现,我们提出AHV-D&S,一种无需训练的推理时图像生成安全机制。具体而言,AHV-D&S量化每个文本标记在所有注意力头上的敏感性作为注意力头向量(AHV),该向量用作检测危险生成倾向的判别签名。在推理阶段,我们提出了一种基于动量的策略,用于在去噪步骤中动态跟踪标记级别的AHVs,并提出一种基于敏感度的自适应抑制策略,该策略根据头特定的风险分数抑制已识别的危险标记的注意力权重。广泛的实验表明,AHV-D&S有效抑制了性内容、受版权保护的内容以及各种有害内容,同时保持了视觉质量,并进一步表现出对对抗性提示的强鲁棒性和在不同DiT-based T2I模型中的可转移性。

英文摘要

The rise of text-to-image (T2I) models has increasingly raised concerns regarding the generation of risky content, such as sexual, violent, and copyright-protected images, highlighting the need for effective safeguards within the models themselves. Although existing methods have been proposed to eliminate risky concepts from T2I models, they are primarily developed for earlier U-Net architectures, leaving the state-of-the-art Diffusion-Transformer-based T2I models inadequately protected. This gap stems from a fundamental architectural shift: Diffusion Transformers (DiTs) entangle semantic injection and visual synthesis via joint attention, which makes it difficult to isolate and erase risky content within the generation. To bridge this gap, we investigate how semantic concepts are represented in DiTs and discover that attention heads exhibit concept-specific sensitivity. This property enables both the detection and suppression of risky content. Building on this discovery, we propose AHV-D&S, a training-free inference-time safeguard for image generation in DiTs. Specifically, AHV-D&S quantifies each textual token's sensitivity across all attention heads as an Attention Head Vector (AHV), which serves as a discriminative signature for detecting risky generation tendencies. In the inference stage, we propose a momentum-based strategy to dynamically track token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression strategy that suppresses the attention weights of identified risky tokens based on head-specific risk scores. Extensive experiments demonstrate that AHV-D&S effectively suppresses sexual, copyrighted-style, and various harmful content while preserving visual quality, and further exhibits strong robustness against adversarial prompts and transferability across different DiT-based T2I models.

2604.15166 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

通过深度感知移除遗忘特定方向实现类别反学习

Arman Hatami, Romina Aalishah, Ilya E. Monosov

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

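AI summary: This paper proposes DAMP, which removes forget-specific directions in a depth-aware manner, improving selective forgetting in class unlearning while better preserving retain-class performance and reducing residual forget-class structure in deep layers.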

Comments Accepted for oral presentation at the CVPR 2026 Workshop on Machine Unlearning for Vision (MUV). Code: https://github.com/armanhtm/DAMP

详情
AI中文摘要

机器反学习旨在从已训练的模型中移除目标知识,而无需付出从头重新训练的代价。然而,在类别反学习中,降低遗忘类的准确率并不一定意味着真正的遗忘:被遗忘的信息可能仍编码在内部表示中,表面上的遗忘可能源于分类头的抑制而非表示层面的移除。我们表明,现有类别反学习方法往往表现出较弱甚至为负的选择性,在深层表示中保留遗忘类结构,或严重依赖最终层的偏置偏移。随后我们提出DAMP(Depth-Aware Modulation by Projection,基于投影的深度感知调制),一种一次性、闭式的权重手术方法,无需基于梯度的优化即可从预训练网络中移除遗忘特定方向。在每个阶段,DAMP在下一个可学习算子的输入空间中计算类别原型,将遗忘方向提取为相对于保留类原型的残差,并应用基于投影的更新以降低下游对这些方向的敏感性。为保持模型效用,DAMP使用一条由探针可分性导出的无参数深度感知缩放规则,在早期层施加较小的编辑,在较深层施加较大的编辑。该方法可通过低秩子空间移除自然扩展到多类遗忘。在MNIST、CIFAR-10、CIFAR-100和Tiny ImageNet上,以及在卷积和Transformer架构上,DAMP比部分已有方法更接近重新训练这一黄金标准,在改进选择性遗忘的同时更好地保留了保留类性能,并减少了深层中残留的遗忘类结构。

英文摘要

Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
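
The projection step described in this abstract can be sketched on a single linear layer. This is an illustration under strong simplifying assumptions, not the authors' implementation: it uses one retain prototype instead of several, takes the forget direction as the normalized residual of the forget prototype after removing its component along that retain prototype, and uses a hypothetical `strength` argument as a stand-in for the depth-aware scaling rule, whose actual form the abstract does not give.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def forget_direction(forget_proto, retain_proto):
    """Unit residual of the forget prototype relative to one retain prototype."""
    scale = dot(forget_proto, retain_proto) / dot(retain_proto, retain_proto)
    residual = [f - scale * r for f, r in zip(forget_proto, retain_proto)]
    norm = dot(residual, residual) ** 0.5
    if norm == 0.0:
        raise ValueError("forget prototype lies in the span of the retain prototype")
    return [x / norm for x in residual]


def project_out(weight, direction, strength=1.0):
    """Return W (I - strength * d d^T): each row loses its component along d."""
    edited = []
    for row in weight:
        coeff = strength * dot(row, direction)
        edited.append([w - coeff * d for w, d in zip(row, direction)])
    return edited


def apply_layer(weight, x):
    """Linear layer without bias: y = W x."""
    return [dot(row, x) for row in weight]


# Toy prototypes and weights, chosen only for illustration.
retain_proto = [1.0, 0.0, 0.0]
forget_proto = [1.0, 2.0, 0.0]
direction = forget_direction(forget_proto, retain_proto)
weight = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]
edited = project_out(weight, direction)

print("response to forget direction:", apply_layer(edited, direction))
print("response to retain prototype:", apply_layer(edited, retain_proto))
```

With `strength=1.0` the edited layer no longer responds to the forget direction, while its response to the retain prototype is unchanged because the direction is orthogonal to it by construction. A `strength` below 1 would give the smaller edits the abstract describes for early layers.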

2603.25722 2026-05-20 cs.CV cs.LG 版本更新

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

无需硬负样本:基于概念的学习在不降低对比模型零样本能力的情况下实现组合性

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez

发表机构 * Samsung AI Center(三星人工智能中心)

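AI summary: This paper proposes a concept-centric learning approach that achieves compositionality without hard negatives and without harming the zero-shot and retrieval capabilities of contrastive models, using simple remedies for the information lost by global pooling in the text and image encoders.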

Comments Accepted at CVPR 2026. 2nd rev: update github repo URL

详情
AI中文摘要

对比视觉-语言(V&L)模型仍然是各种应用中的流行选择。然而,出现了几个限制,尤其是V&L模型学习组合性表示的能力有限。先前的方法通常通过生成定制训练数据来获得硬负样本。硬负样本已被证明可以提高组合性任务的性能,但通常只适用于单一基准,无法推广,并且可能导致基本V&L能力如零样本或检索性能的显著下降,使其不切实际。在本工作中,我们采取了不同的方法。我们识别出两个限制V&L组合性性能的根本原因:1)长训练标题不需要组合性表示;2)文本和图像编码器中的最终全局池化导致完全失去学习绑定所需的必要信息。为了解决这一问题,我们提出了两种简单的解决方案:1)使用标准NLP软件获得短的概念导向标题部分,并将其对齐到图像;2)引入无参数的跨模态注意力池化,从图像编码器中获得概念导向的视觉嵌入。通过这些更改和简单的辅助对比损失,我们获得了标准组合性基准的SOTA性能,同时保持或提高了强大的零样本和检索能力。这在不增加推理成本的情况下实现。我们在此工作的代码已发布在https://github.com/saic-fi/concept_centric_clip。

英文摘要

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit the compositionality performance of V&L models: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the information needed to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/saic-fi/concept_centric_clip.
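
One plausible reading of "parameter-free cross-modal attention-pooling" is sketched below: the text embedding of a concept acts as the query, image patch embeddings act as keys and values, and no learned projection is involved. This is an assumption about the general idea rather than the released code; the function names and the `temperature` hyperparameter are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(text_embedding, patch_embeddings, temperature=1.0):
    """Pool patches with weights softmax(text . patch / temperature); no learned weights."""
    logits = [dot(text_embedding, p) / temperature for p in patch_embeddings]
    weights = softmax(logits)
    dim = len(patch_embeddings[0])
    pooled = [sum(w * p[i] for w, p in zip(weights, patch_embeddings))
              for i in range(dim)]
    return pooled, weights


# Toy 2-D embeddings: the first patch is aligned with the concept text, the last opposes it.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
concept_text = [2.0, 0.0]
pooled, weights = attention_pool(concept_text, patches)
print("weights:", [round(w, 3) for w in weights])
print("pooled:", [round(v, 3) for v in pooled])
```

Unlike a single global pooling, the pooled vector depends on the concept being queried, so different caption parts can attend to different image regions.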

2602.20700 2026-05-20 cs.CV 版本更新

NGL: Natural Garment Language for Training-Free Sewing Pattern Estimation

NGL:自然服装语言用于无训练缝纫图案估计

Anna Badalyan, Pratheba Selvaraju, Giorgio Becherini, Omid Taheri, Victoria Fernandez Abrevaya, Michael Black

发表机构 * Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所)

AI总结 本研究提出NGL自然服装语言,通过利用视觉语言模型的自然描述能力,实现无训练的缝纫图案估计,解决了传统方法在泛化能力、真实世界关联性和多层服装处理上的不足。

Comments 12 pages, 7 figures

详情
AI中文摘要

从图像估计缝纫图案是创建高质量3D服装的实用方法,但受限于真实世界图像和缝纫图案配对数据的稀缺性而具有挑战性。现有方法通过训练视觉语言模型(VLMs)从参数化服装模型采样的合成服装中学习低级缝纫图案表示来解决这一限制。然而,这些方法往往难以泛化到野外图像,无法捕捉真实世界服装部件之间的关联,并且局限于单层服装。相比之下,我们发现VLMs在描述服装时表现良好,但将这些描述映射到有效的缝纫图案仍然困难。为此,我们提出了NGL(自然服装语言),一种针对VLMs的领域特定语言,能够以与VLMs的自然描述能力对齐的方式表示服装。利用NGL,我们引入了一条完全无训练的流程,通过查询大型VLMs提取结构化的服装规格,并确定性地将其转换为有效的缝纫图案。我们在Dress4D、CloSe以及一个包含253张野外时尚图像的新数据集上评估了我们的方法。我们的方法在标准几何度量上实现了最先进的性能,并在人类和基于GPT的感知评估中优于现有基线。此外,NGL能够恢复多层服装,而竞争方法主要集中在单层服装上,突显了其在处理有遮挡部分的真实世界图像时的强大泛化能力。这些结果表明,高效的服装表示对于使用VLMs进行缝纫图案估计至关重要。我们的代码和数据将供研究使用。

英文摘要

Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments, but it remains challenging due to the scarcity of paired real-world image and sewing-pattern data. Existing methods address this limitation by training vision-language models (VLMs) to learn low-level sewing-pattern representations from synthetic garments sampled from parametric garment models. However, they often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are restricted to single-layer outfits. In contrast, we observe that VLMs are effective at describing garments in natural language, but mapping these descriptions into valid sewing patterns remains difficult. To bridge this gap, we propose NGL (Natural Garment Language), a novel domain-specific language that represents garments in terms aligned with VLMs' natural descriptive abilities. Leveraging NGL, we introduce a fully training-free pipeline that queries large VLMs to extract structured garment specifications and deterministically converts them into valid sewing patterns. We evaluate our method on Dress4D, CloSe, and a newly collected dataset of 253 in-the-wild fashion images. Our approach achieves state-of-the-art performance on standard geometry metrics and is preferred in both human and GPT-based perceptual evaluations compared to existing baselines. Furthermore, NGL recovers multi-layer outfits whereas competing methods focus mostly on single-layer garments, highlighting its strong generalization to real-world images even with occluded parts. These results demonstrate that an efficient garment representation is critical for sewing pattern estimation with VLMs. Our code and data will be released for research use.

2601.16200 2026-05-20 cs.LG cs.CV 版本更新

Feature-Space Smoothing: Certified Robustness of Deep Representations

特征空间平滑:深度表示的认证鲁棒性

Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang

发表机构 * Rapid-Rich Object Search Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore(快速-丰富目标搜索实验室,电气电子工程学院,南洋理工大学,新加坡) Pengcheng Laboratory, Shenzhen, China(鹏城实验室,深圳,中国)

AI总结 本文提出了一种特征空间平滑(FS)框架,通过在特征表示层面提供认证鲁棒性,以解决深度学习模型对恶意输入的脆弱性问题,核心方法是通过特征平滑保证清洁和对抗特征之间的余弦相似度下界,并引入高斯平滑增强器(GSB)提升编码器的高斯鲁棒性得分,从而提升模型的鲁棒性并保持下游任务性能。

Comments Under review

详情
AI中文摘要

现代深度学习模型在多种应用中表现出强大的能力,但仍然容易受到通过特征空间扭曲诱导错误预测的恶意输入的攻击。为了解决这一脆弱性,我们提出了特征空间平滑(FS),一种通用的防御框架,该框架能够在特征表示层面提供认证鲁棒性。我们证明,FS将给定的特征编码器转换为一个平滑版本,该版本在l_2有界扰动下保证清洁和对抗特征之间的余弦相似度的认证下界。然后我们建立该特征余弦相似度下界(FCSB)可以扩展到预测层面的认证,其值由编码器内在的高斯鲁棒性得分决定。基于这些见解,我们引入了高斯平滑增强器(GSB),一个即插即用的模块,用于提升编码器的高斯鲁棒性得分。具体来说,GSB模块被插入以增强特征空间的一致性,并在高斯扰动下保持特征的实用性,以供下游任务使用。这种设计使FS能够无缝集成到受保护的模型上,例如多模态大语言模型(MLLMs),而无需额外的模型重新训练或对齐,从而在提升鲁棒性的同时保持下游任务的性能。广泛的实验表明,整合FS一致地提供了非平凡的认证鲁棒性,并在多种模型和应用中显著提高了面向任务的性能,即使在强白盒对抗攻击下也如此。

英文摘要

Modern deep learning models exhibit strong capabilities across diverse applications, yet remain vulnerable to malicious inputs that induce erroneous predictions via feature-space distortion. To address this vulnerability, we propose Feature-space Smoothing (FS), a general defense framework that provides certified robustness at the feature representation level. We show that FS converts a given feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the cosine similarity between clean and adversarial features under l_2-bounded perturbations. We then establish that this Feature Cosine Similarity Bound (FCSB) can be extended to prediction-wise certification under the cosine similarity measure, and that the value of FCSB is determined by the encoder's intrinsic Gaussian robustness score. Building on these insights, we introduce the Gaussian Smoothness Booster (GSB), a plug-and-play module that improves the encoder's Gaussian robustness score. Specifically, the GSB module is plugged in to enhance feature-space consistency and maintain feature utility for downstream tasks under Gaussian perturbations. This design enables seamless integration of FS on the protected model, e.g., Multimodal Large Language Models (MLLMs), without additional model retraining or alignment, improving its robustness while preserving the performance for downstream task-oriented decoding. Extensive experiments demonstrate that integrating FS consistently provides non-trivial certified robustness and significantly improves task-oriented performance under strong white-box adversarial attacks across diverse models and applications.
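
As a rough illustration of smoothing in feature space, the sketch below averages an encoder's output over Gaussian-perturbed copies of the input and compares the smoothed features of a clean and a perturbed input by cosine similarity. It is only a Monte Carlo toy under stated assumptions: `toy_encoder`, `sigma`, and `n_samples` are hypothetical, and the code does not compute the certified bound (FCSB) or implement the GSB module described in the abstract.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def smoothed_features(encoder, x, sigma=0.25, n_samples=200, seed=0):
    """Monte Carlo estimate of E[encoder(x + noise)] with noise ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    acc = None
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        feat = encoder(noisy)
        acc = feat if acc is None else [a + f for a, f in zip(acc, feat)]
    return [a / n_samples for a in acc]

def toy_encoder(x):
    # Hypothetical stand-in for a real feature encoder: a fixed nonlinear 2-D map.
    return [math.tanh(x[0] + 0.5 * x[1]), math.tanh(x[1] - 0.5 * x[0])]

clean = [1.0, 0.5]
perturbed = [1.05, 0.45]  # small l_2-bounded perturbation of the clean input
similarity = cosine(smoothed_features(toy_encoder, clean),
                    smoothed_features(toy_encoder, perturbed))
```

The quantity `similarity` is the kind of clean-versus-adversarial feature agreement that the paper lower-bounds; here it is merely measured empirically for one input pair.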

2601.12373 2026-05-20 cs.CV cs.HC cs.RO 版本更新

CD-TWINSAFE: A ROS-enabled Digital Twin for Scene Understanding and Safety Emerging V2I Technology

CD-TWINSAFE:一种基于ROS的数字孪生用于场景理解和安全新兴V2I技术

Amro Khaled, Farah Khaled, Omar Riad, Catherine M. Elias

发表机构 * C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt(认知驾驶研究与车辆系统实验室,埃及开罗) Computer Science and Engineering Department - Faculty of Media Engineering and Technology(计算机科学与工程系-媒体工程与技术学院) German University in Cairo, Egypt(埃及开罗德国大学)

AI总结 本文提出了一种基于V2I的数字孪生系统CD-TWINSAFE,用于自动驾驶车辆的场景理解和安全监控,通过同时运行的两个栈结构实现车辆侧的驾驶模块和数字孪生模块,利用立体相机和Unreal Engine 5构建场景复现,并通过ROS架构实现V2I通信。

详情
AI中文摘要

本文介绍了CD-TWINSAFE,一种基于V2I的自动驾驶车辆数字孪生系统。所提出的架构由两个同时运行的栈组成,一个是车载驾驶栈,包含用于场景理解的立体相机,另一个是数字孪生栈,运行由Unreal Engine 5构建的相机所见场景的复制品,并向驾驶舱返回安全警报。车载栈在车辆侧实现,包括两个主要自主模块:定位和感知。车辆的位置和朝向通过车载传感器获取。感知模块负责处理立体相机的20fps图像,并通过两条互补的流水线理解场景,分别进行物体检测和特征提取,所提取的特征包括物体速度、偏航角以及两项安全指标:碰撞时间(time-to-collision)和车头时距(time-headway)。驾驶栈收集的数据通过基于ROS的架构以自定义ROS2消息的形式发送到基础设施侧,并经由4G调制解调器承载的UDP链路进行V2I通信。数字孪生通过共享消息监控环境,这些消息基于实时定位和感知数据更新所生成的自车和被检测物体的信息。通过不同驾驶场景下的多项测试,验证了所提出架构的有效性和实时响应能力。

英文摘要

This paper introduces CD-TWINSAFE, a V2I-based digital twin for autonomous vehicles. The proposed architecture is composed of two stacks running simultaneously: an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera and returns safety alerts to the cockpit. The on-board stack is implemented on the vehicle side and includes two main autonomous modules: localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. The perception module processes 20-fps images from the stereo camera and understands the scene through two complementary pipelines, which perform object detection and feature extraction, including object velocity, yaw, and the safety metrics time-to-collision and time-headway. The data collected from the driving stack are sent to the infrastructure side through the ROS-enabled architecture in the form of custom ROS2 messages, carried over UDP links that ride a 4G modem for V2I communication. The environment is monitored via the digital twin through the shared messages, which update the information of the spawned ego vehicle and detected objects based on the real-time localization and perception data. Several tests with different driving scenarios were conducted to confirm the validity and real-time response of the proposed architecture.
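
The two safety metrics named in the abstract have standard textbook definitions, sketched below. This is a minimal illustration for a lead vehicle directly ahead, assuming the gap and speeds are already available from perception; the function and parameter names are hypothetical, and the paper's own computation may differ in detail.

```python
import math

def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """TTC = gap / closing speed. Infinite when the ego vehicle is not closing in."""
    closing = ego_speed_mps - lead_speed_mps
    if closing <= 0.0:
        return math.inf
    return gap_m / closing

def time_headway(gap_m, ego_speed_mps):
    """THW = gap / ego speed. Infinite when the ego vehicle is stationary."""
    if ego_speed_mps <= 0.0:
        return math.inf
    return gap_m / ego_speed_mps

# Ego at 20 m/s, lead vehicle at 10 m/s, 30 m ahead.
ttc = time_to_collision(30.0, 20.0, 10.0)  # 3.0 seconds
thw = time_headway(30.0, 20.0)             # 1.5 seconds
```

TTC depends on relative speed and so flags imminent conflicts, while THW depends only on ego speed and reflects following distance; reporting both, as the abstract describes, covers the two situations.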

2601.12358 2026-05-20 cs.CV cs.AI cs.RO 版本更新

From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

从提示到路面:基于大型多模态模型(LMMs)的智能体行为树生成框架用于自动驾驶车辆

Omar Y. Goba, Ahmed Y. Gado, Catherine M. Elias, Ahmed Hussein

发表机构 * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt(开罗德国大学(GUC)计算机科学与工程系,埃及) C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt(C-DRiVeS实验室:车辆系统认知驾驶研究,开罗,埃及) IAV GmbH, Berlin, Germany(IAV GmbH,柏林,德国)

AI总结 本文提出了一种基于大语言模型和多模态视觉模型的代理行为树生成框架,用于自动驾驶车辆在复杂环境中自适应导航。该框架通过链式符号提示评估场景关键性,通过上下文学习构建高层子目标,并通过生成器合成可执行的BT子树,实现了在CARLA+Nav2模拟中对突发障碍物(如道路堵塞)的成功绕行。

详情
AI中文摘要

自动驾驶车辆(AVs)需要适应性行为规划器来安全地导航不可预测的现实环境。传统的行为树(BTs)提供结构化决策逻辑,但本质上是静态的,并且需要大量人工调优,限制了其在SAE Level 5自主性中的应用。本文提出了一种智能体框架,利用大语言模型(LLMs)和多模态视觉模型(LVMs)来实时生成和适应BTs。一个专门的Descriptor代理使用符号链(chain-of-symbols)提示来评估场景关键性,一个Planner代理通过上下文学习构建高层子目标,一个Generator代理合成XML格式的可执行BT子树。该系统集成到CARLA+Nav2模拟中,仅在基线BT失败时触发,展示了无需人工干预即可成功绕过突发障碍物(例如道路堵塞)的能力。与静态BT基线相比,该方法是一种概念验证,可扩展到多样的驾驶场景。

英文摘要

Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.
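
To make the last step of the pipeline concrete, the sketch below turns a list of planner sub-goals into an XML behavior sub-tree and checks that the result is well-formed. The layout is loosely modeled on the `root` / `BehaviorTree` / `Sequence` nesting used by common BT tooling, but every node name, attribute, and sub-goal string here is a hypothetical placeholder; the abstract does not specify the framework's actual schema.

```python
import xml.etree.ElementTree as ET

def build_detour_subtree(subgoals):
    """Serialize high-level sub-goals as a sequential BT sub-tree in XML."""
    root = ET.Element("root")
    tree = ET.SubElement(root, "BehaviorTree", ID="DetourSubtree")
    # A Sequence runs its children in order and fails as soon as one child fails.
    seq = ET.SubElement(tree, "Sequence", name="detour")
    for i, goal in enumerate(subgoals):
        # "ExecuteSubGoal" is an illustrative action name, not a real Nav2 node.
        ET.SubElement(seq, "Action", ID="ExecuteSubGoal", name=f"step_{i}", goal=goal)
    return ET.tostring(root, encoding="unicode")

xml_text = build_detour_subtree(["stop", "plan_detour", "follow_detour"])
parsed = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
```

Parsing the generated text before handing it to an executor is a cheap guard against malformed LLM output, which matters when the sub-tree is only generated after the baseline tree has already failed.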

2512.03869 2026-05-20 cs.CV cs.CY 版本更新

An Automated Framework for Large-Scale Graph-Based Cerebrovascular Analysis

一种用于大规模基于图的脑血管分析的自动化框架

Daniele Falcetta, Liane S. Canas, Lorenzo Suppa, Matteo Pentassuglia, Jon Cleary, Marc Modat, Sébastien Ourselin, Maria A. Zuluaga

发表机构 * EURECOM, Sophia Antipolis, France; School of Biomedical Engineering & Imaging Sciences, King's College London, UK; Politecnico di Torino, Torino, Italy

AI总结 本文提出了一种自动化脑血管分析框架,通过骨架化生成的图表示建模血管形态,并通过区域划分、中心线提取和图构建计算15种形态学、拓扑学、分形和几何特征,以多尺度方式表征脑血管组织。

Comments Accepted at IEEE ISBI 2026

详情
AI中文摘要

我们提出了CaravelMetrics,一种用于自动化脑血管分析的计算框架,通过骨架化生成的图表示建模血管形态。该框架整合了基于图谱的区域划分、中心线提取和图构建,以计算15种形态学、拓扑学、分形和几何特征。这些特征既可以从完整的血管网络全局估计,也可以在各动脉供血区内按区域估计,从而实现脑血管组织的多尺度表征。应用于IXI数据集中的570个3D TOF-MRA扫描(年龄20-86岁),CaravelMetrics产生可重复的血管图,捕捉年龄和性别相关变化以及教育程度相关的血管复杂性增加,与文献中的发现一致。该框架提供了一种可扩展且完全自动的定量脑血管特征提取方法,支持规范建模和群体水平的血管健康和衰老研究。

英文摘要

We present CaravelMetrics, a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations. The framework integrates atlas-based regional parcellation, centerline extraction, and graph construction to compute fifteen morphometric, topological, fractal, and geometric features. The features can be estimated globally from the complete vascular network or regionally within arterial territories, enabling multiscale characterization of cerebrovascular organization. Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), CaravelMetrics yields reproducible vessel graphs capturing age- and sex-related variations and education-associated increases in vascular complexity, consistent with findings reported in the literature. The framework provides a scalable and fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.

2511.22940 2026-05-20 cs.CV 版本更新

One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

一对全动画(One-to-All Animation):无需对齐的角色动画和图像姿态迁移

Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang

发表机构 * Jiangnan University(江南大学) University of Science and Technology of China(中国科学技术大学) Chinese Academy of Sciences(中国科学院) Beijing University of Posts and Telecommunications(北京邮电大学) Zhejiang University(浙江大学)

AI总结 本文提出了一种统一框架,用于高保真角色动画和图像姿态迁移,解决了参考图像与姿态错位的问题,通过自监督外绘(outpainting)任务和混合参考融合注意力机制提升生成质量。

Comments Project Page: https://ssj9596.github.io/one-to-all-animation-project/

详情
AI中文摘要

最近扩散模型的进步极大地提高了基于姿态的角色动画效果。然而,现有方法受限于空间对齐的参考-姿态对和匹配的骨骼结构。处理参考与姿态之间的错位仍是一个未解决的问题。为此,我们提出了One-to-All Animation,一种统一框架,用于高保真的角色动画和图像姿态迁移,适用于任意布局的参考。首先,为了处理空间错位的参考,我们将训练重新表述为自监督的外绘(outpainting)任务,将多样布局的参考转换为统一的遮挡输入格式。其次,为了处理部分可见的参考,我们设计了一个参考提取器用于全面的身份特征提取。进一步,我们整合了混合参考融合注意力机制以处理不同分辨率和动态序列长度。最后,从生成质量的角度,我们引入了身份鲁棒的姿态控制,将外观与骨骼结构解耦以缓解姿态过拟合,并引入了一种令牌替换策略以实现连贯的长视频生成。大量实验表明,我们的方法优于现有方法。代码和模型可在https://github.com/ssj9596/One-to-All-Animation上获得。

英文摘要

Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model are available at https://github.com/ssj9596/One-to-All-Animation.

2510.25897 2026-05-20 cs.CV cs.LG 版本更新

MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

MIRO:多奖励条件预训练提升T2I质量和效率

Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard

发表机构 * LIGM, ENPC, IP Paris, CNRS, UGE, France; LIX, CNRS, École Polytechnique, IP Paris, France

AI总结 MIRO通过在训练过程中对模型施加多个奖励,直接学习用户偏好,从而提升文本到图像生成的质量和效率,同时在GenEval组合基准和用户偏好评分上取得最佳成绩。

Comments Accepted at ICML 2026. Project page: https://nicolas-dufour.github.io/miro

详情
AI中文摘要

后训练文本到图像生成器的默认范式包括事后选择生成的图像,随后使用一个奖励模型进行训练以对齐生成器与奖励,通常为用户偏好。这会丢弃信息性数据,并且仅优化单一奖励,从而损害多样性、语义保真度和效率。相反,我们提出MIRO,一种在训练过程中对模型施加多个奖励的方法,从而让模型直接学习用户偏好。MIRO预训练不仅提高了生成图像的视觉质量,还加快了训练速度,在GenEval组合基准和用户偏好评分(PickAScore、ImageReward、HPSv2)上实现了最先进的性能。

英文摘要

The default paradigm of post-training text-to-image generators includes post-hoc selection of generated images, and subsequent training with one reward model to align the generator to the reward, typically user preference. This discards informative data as well as optimizes only for a single reward, hence harming diversity, semantic fidelity and efficiency. Instead, we propose MIRO, a method that conditions the model on multiple rewards during training, thus letting the model learn user preferences directly. MIRO pre-training both improves the visual quality of the generated images and speeds up the training, achieving state of the art on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

2509.15435 2026-05-20 cs.CV cs.AI cs.MA 版本更新

ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models

ORCA:一种用于视觉语言模型幻觉和对抗鲁棒性的代理推理框架

Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian

发表机构 * University of West Florida(西佛罗里达大学) United States Military Academy(美国军事学院,即西点军校)

AI总结 本文提出ORCA框架,通过推理时的结构化推理和小规模视觉模型,提升预训练视觉语言模型的事实准确性与对抗鲁棒性,并在幻觉基准和对抗扰动测试中取得显著提升。

Comments Accepted at the ACM International Conference on Cloud and Big Data Computing (ICCBDC 2026)

详情
AI中文摘要

大型视觉语言模型(LVLMs)虽然具备强大的多模态能力,但仍然容易受到内在错误和外部攻击的幻觉影响,限制了其在现实中的可靠性。我们提出了ORCA,一种代理推理框架,通过推理时的结构化推理和一系列小规模视觉模型(参数少于3B)来提高预训练LVLMs的事实准确性和对抗鲁棒性。ORCA通过观察-推理-批判-行动循环运行,通过证据问题查询多个视觉工具,验证跨模型不一致,并在不访问模型内部或重新训练的情况下迭代细化预测。ORCA还存储中间推理轨迹,支持可审计的决策。尽管主要设计用于缓解物体级幻觉,但ORCA在不需对抗训练或防御机制的情况下也表现出新兴的对抗鲁棒性。我们在三个设置上评估了ORCA:(1)干净图像上的幻觉基准,(2)无防御的对抗扰动图像,以及(3)应用防御的对抗扰动图像。在POPE幻觉基准上,ORCA在不同子集上将独立LVLMs的性能提升了+3.64%到+40.67%。在POPE上的对抗扰动中,ORCA在LVLMs上实现了平均准确率提升+20.11%。当与防御技术结合使用时,ORCA进一步提高了独立LVLM在对抗扰动AMBER图像上的性能,提升幅度在+1.20%到+48.00%之间。这些结果表明,ORCA为构建更可靠和鲁棒的多模态系统提供了一条有前途的路径。

英文摘要

Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through inference-time structured reasoning with a suite of small vision models (fewer than 3B parameters). ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLM performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.

2506.07209 2026-05-20 cs.GR cs.CV 版本更新

HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance

HOI-PAGE:基于部分可及性的零样本人类-物体交互生成

Lei Li, Angela Dai

发表机构 * University of Virginia(弗吉尼亚大学) Technical University of Munich(慕尼黑工业大学)

AI总结 本文提出HOI-PAGE,一种通过部分可及性推理生成高保真4D人类-物体交互的零样本方法,利用大语言模型进行部分级机械推理,并通过结构化部分可及性图(PAG)引导三阶段合成,生成复杂多物体或多人物交互序列。

Comments ICML 2026. Project page: https://craigleili.github.io/projects/hoipage/ Video: https://www.youtube.com/watch?v=gwXjOffCFyk

详情
AI中文摘要

我们提出了HOI-PAGE,一种新的方法,优先考虑部分级可及性推理,从文本提示中以零样本方式生成高保真的4D人类-物体交互(HOIs)。与之前专注于全局、整体身体-物体运动合成的方法不同,我们的方法利用大语言模型(LLMs)显式推理交互的底层部分级机械特性。我们通过结构化的部分可及性图(PAG)表示来捕捉这种推理,作为高层次交互框架,引导三阶段合成:首先,将输入3D对象分解为语义部分;然后,从文本提示生成参考HOI视频以提取基于部分的运动约束;最后,优化4D HOI运动序列,使其模仿参考动态并满足部分级接触约束。广泛的实验表明,我们的方法具有灵活性,能够生成复杂的多物体或多人物交互序列,具有显著提高的现实感和文本对齐性,对于零样本4D HOI生成具有明显优势。

英文摘要

We present HOI-PAGE, a new approach that prioritizes part-level affordance reasoning to generate high-fidelity 4D human-object interactions (HOIs) from text prompts in a zero-shot fashion. In contrast to prior works that focus on global, whole body-object motion synthesis, our approach explicitly reasons about the underlying part-level mechanics of interactions using large language models (LLMs). We capture this reasoning in a structured part affordance graph (PAG) representation, serving as a high-level interaction scaffolding to guide a three-stage synthesis: first, decomposing input 3D objects into semantic parts; then, generating reference HOI videos from text prompts to extract part-based motion constraints; and finally, optimizing for 4D HOI motion sequences that mimic the reference dynamics while satisfying part-level contact constraints. Extensive experiments show that our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences, with significantly improved realism and text alignment for zero-shot 4D HOI generation.

2506.03178 2026-05-20 eess.IV cs.AI cs.CV 版本更新

LLaMA-XR: A Novel Framework for Radiology Report Generation using LLaMA and QLoRA Fine Tuning

LLaMA-XR: 一种基于LLaMA和QLoRA微调的新型放射科报告生成框架

Md. Zihad Bin Jahangir, Muhammad Ashad Kabir, Sumaiya Akter, Israt Jahan, Minh Chau

发表机构 * Department of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程系) School of Computing, Mathematics and Engineering, Charles Sturt University(查尔斯特大学计算、数学与工程学院) Department of Computer Science and Engineering, University of Liberal Arts Bangladesh(孟加拉国文理大学计算机科学与工程系) Medical Imaging Group, School of Dentistry and Medical Sciences, Charles Sturt University(查尔斯特大学牙科与医学科学学院医学影像组)

AI总结 本文提出LLaMA-XR框架,结合LLaMA 3.1与DenseNet-121图像嵌入及QLoRA微调,提升放射科报告生成的准确性和临床相关性,同时保持计算效率。

Comments 25 pages

Journal ref Bioengineering 2026, 13(5), 493

详情
AI中文摘要

自动化放射科报告生成具有减少放射科医生工作负担和提高诊断准确性的潜力。然而,从胸部X光片生成精确且具有临床意义的报告仍然具有挑战性,因为医学语言的复杂性和对上下文理解的需求。现有模型在保持准确性和上下文相关性方面存在困难。在本文中,我们提出了LLaMA-XR,一种新型框架,整合了LLaMA 3.1与基于DenseNet-121的图像嵌入以及量化低秩适应(QLoRA)微调。LLaMA-XR在保持计算效率的同时实现了改进的连贯性和临床准确性。这种效率是由一种优化策略驱动的,该策略增强了参数利用并减少了内存开销,使报告生成速度更快,计算资源需求更低。在IU X光基准数据集上进行的广泛实验表明,LLaMA-XR优于一系列最先进的方法。我们的模型在ROUGE-L得分上达到0.433,在METEOR得分上达到0.336,建立了该领域的性能新基准。这些结果突显了LLaMA-XR作为自动化放射科报告的有效且高效的AI系统潜力,提供了增强的临床效用和可靠性。

英文摘要

Automated radiology report generation holds significant potential to reduce radiologists' workload and enhance diagnostic accuracy. However, generating precise and clinically meaningful reports from chest radiographs remains challenging due to the complexity of medical language and the need for contextual understanding. Existing models often struggle with maintaining both accuracy and contextual relevance. In this paper, we present LLaMA-XR, a novel framework that integrates LLaMA 3.1 with DenseNet-121-based image embeddings and Quantized Low-Rank Adaptation (QLoRA) fine-tuning. LLaMA-XR achieves improved coherence and clinical accuracy while maintaining computational efficiency. This efficiency is driven by an optimization strategy that enhances parameter utilization and reduces memory overhead, enabling faster report generation with lower computational resource demands. Extensive experiments conducted on the IU X-ray benchmark dataset demonstrate that LLaMA-XR outperforms a range of state-of-the-art methods. Our model achieves a ROUGE-L score of 0.433 and a METEOR score of 0.336, establishing new performance benchmarks in the domain. These results underscore LLaMA-XR's potential as an effective and efficient AI system for automated radiology reporting, offering enhanced clinical utility and reliability.
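
For readers unfamiliar with the ROUGE-L metric reported above, the sketch below computes a sentence-level ROUGE-L F1 score from the longest common subsequence (LCS) of two token sequences. It assumes lowercase whitespace tokenization and equal weighting of precision and recall; standard evaluation toolkits differ in tokenization, stemming, and the recall weight, so this will not reproduce the paper's 0.433 exactly. The example sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a generated report and a reference report."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the heart is normal in size", "heart size is normal")
```

In this example the LCS is "heart is normal" (3 tokens), so precision is 3/6, recall is 3/4, and the F1 score is 0.6.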

2505.16819 2026-05-20 cs.CV 版本更新

Character-Centered Dialogue Generation from Scene-Level Prompts

从场景级提示生成以角色为中心的对话

Taewon Kang, Ming C. Lin

发表机构 * University of Maryland at College Park, United States(马里兰大学学院市分校,美国)

AI总结 本研究提出了一种模块化流程,将动作级提示转化为视觉和听觉上一致的对话,丰富了基于场景的故事叙述。通过预训练的视觉-语言编码器提取高级视觉语义,并结合结构化提示引导大型语言模型生成对话。引入递归叙述银行以保持跨场景的上下文和情感一致性,最终生成具有表现力的角色条件语音,产生完整的视听叙事。

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026). 18 pages, 5 figures

详情
AI中文摘要

近期基于场景的视频生成技术能够根据结构化提示生成连贯的视觉叙事,但故事叙述中的一个关键方面,即角色驱动的对话和语音,仍未得到充分探索。我们提出了一种模块化流程,将动作级提示转化为视觉和听觉上一致的对话,从而以自然的语音和角色表达丰富基于场景的故事叙述。我们的方法在每个场景使用一对提示,分别定义场景设定和角色行为。故事生成模型(如Text2Story)负责生成视觉场景,而我们专注于生成富有表现力且角色一致的话语,这些话语同时以提示和代表性的场景图像为依据。预训练的视觉-语言编码器提取高级视觉语义,这些语义与结构化提示结合,引导大型语言模型进行对话合成。为了在跨场景中保持上下文和情感一致性,我们引入递归叙述银行(Recursive Narrative Bank),这是一种说话者感知、按时间组织的记忆,用于积累每个角色的对话历史。受脚本理论启发,这种设计使对话能够反映不断变化的目标、社会情境和叙事角色。最后,我们将每句话语渲染为富有表现力的角色条件语音,产生完整配音的多模态视频叙事。我们的免训练框架能够泛化到多样化的故事情境,为连贯且以角色为中心的视听叙事提供了一种可扩展的解决方案。

英文摘要

Recent advances in scene-based video generation enable coherent visual narratives from structured prompts, yet a key aspect of storytelling -- character-driven dialogue and speech -- remains underexplored. We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling with natural voice and character expression. Our method takes a pair of prompts per scene, defining the setting and character behavior. While a story generation model such as Text2Story produces the visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and a representative scene image. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis. To maintain contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank, a speaker-aware, temporally structured memory that accumulates each character's dialogue history. Inspired by Script Theory, this design enables dialogue that reflects evolving goals, social context, and narrative roles. Finally, we render each utterance as expressive, character-conditioned speech, producing fully voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings, providing a scalable solution for coherent, character-grounded audiovisual storytelling.

2503.16309 2026-05-20 eess.IV cs.CV physics.med-ph 版本更新

Rapid patient-specific neural networks for intraoperative X-ray to volume registration

快速的患者特异性神经网络用于术中X射线到体积的配准

Vivek Gopalakrishnan, David-Dimitris Chlorogiannis, Andrew Abumoussa, Anna M. Larson, Nazim Haouchine, Darren B. Orbach, Sarah Frisken, Neel Dey, Polina Golland

发表机构 * Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology(哈佛-麻省理工健康科学与技术, 麻省理工学院) Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology(计算机科学与人工智能实验室, 麻省理工学院) Department of Radiology, Harvard Medical School(哈佛医学院放射科) Saint Luke’s Marion Bloch Neuroscience Institute(圣卢克马里恩·布洛克神经科学研究所) Department of Critical Care Medicine, Shriners Children’s Hospital(施莱纳斯儿童医院重症医学科) Department of Interventional Neuroradiology, Boston Children’s Hospital(波士顿儿童医院介入神经放射科) Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital(阿西努拉·A·马蒂诺斯生物医学成像中心, 麻省总医院)

AI总结 本文提出了一种自监督框架xvr,结合患者特异性神经网络和梯度优化,实现了快速且准确的2D到3D配准,通过物理模拟生成训练数据,无需手动标注,提升了临床和研究社区的广泛应用能力。

详情
AI中文摘要

图像引导介入和手术机器人中的先进导航技术需要将3D术前体积(如CT、MRI)快速且精确地对齐到2D术中图像(如X射线透视)。然而,现有的2D/3D配准方法无法泛化到种类繁多的透视引导手术:传统的基于强度的优化器需要为每个患者仔细调整超参数,而深度学习方法需要大量的手动标注数据集,并且受限于训练时所用的特定解剖结构。为了解决这些限制,我们提出了xvr,一种自监督框架,结合了患者特异性神经网络和基于梯度的优化,实现了自动的2D/3D配准。xvr利用基于物理的模拟,从患者自身的术前扫描中生成训练数据,消除了手动标注的需要。我们提出了一种在数千次全身扫描上预训练的基础模型,仅需5分钟的微调即可针对任何解剖区域实现患者特异性适应。在迄今为止规模最大的真实透视影像2D/3D配准评估中,xvr在数秒内即可在多种解剖结构、成像模态和医院中实现高精度,精度比现有方法提高了一个数量级。xvr通过开源软件https://xvr.csail.mit.edu,使泛解剖的2D/3D刚性配准可供广大的临床和研究社区使用。

英文摘要

Advanced navigation techniques in image-guided interventions and surgical robotics require the rapid and precise alignment of 3D preoperative volumes (e.g., CT, MRI) to 2D intraoperative images (e.g., X-ray fluoroscopy). However, existing 2D/3D registration methods fail to generalize across the broad spectrum of fluoroscopy-guided procedures: traditional intensity-based optimizers require careful hyperparameter tuning for each subject, while deep learning approaches demand extensive manually labeled datasets and remain constrained to the specific anatomy on which they were trained. To address these limitations, we present xvr, a self-supervised framework that combines patient-specific neural networks with gradient-based optimization for automatic 2D/3D registration. xvr leverages physics-based simulation to generate training data from a patient's own preoperative scan, eliminating the need for manual annotation. We present a foundation model pretrained on thousands of whole-body scans, achieving patient-specific adaptation for any anatomical region in only 5 minutes of finetuning. In the largest evaluation of 2D/3D registration on real fluoroscopy to date, xvr achieves high accuracy in seconds across diverse anatomical structures, imaging modalities, and hospitals, improving upon the accuracy of existing methods by an order of magnitude. xvr makes pan-anatomical 2D/3D rigid registration accessible to broad clinical and research communities through open-source software at https://xvr.csail.mit.edu.

2503.06310 2026-05-20 cs.CV 版本更新

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

场景-动作提示融合用于连贯的文本到视频叙事

Taewon Kang, Divya Kothandaraman, Ming C. Lin

发表机构 * University of Maryland at College Park(马里兰大学学院市分校) Dolby Laboratories(杜比实验室)

AI总结 本文提出了一种整合场景和动作提示的叙事框架,通过动态启发的提示混合策略,解决文本到视频生成中时间一致性、语义一致性和场景-动作连续性的问题,通过三个关键组件实现了更连贯的视频叙事。

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026). 13 pages, 4 figures

详情
AI中文摘要

从离散文本提示生成连贯的长视频序列仍然具有挑战性,因为难以在片段之间维持时间一致性、语义一致性和场景-动作连续性。我们提出了一种新的叙事框架,通过动态启发的提示混合来整合场景和动作提示。我们的方法结合了三个关键组成部分:(i)双向时间加权潜在融合策略,强制连续视频片段之间的时间一致性;(ii)动态启发的提示权重(DIPW)机制,根据CLIP对齐、叙事进展和时间平滑性,在每个扩散时间步适应性地平衡场景和动作提示;(iii)语义动作表示,编码高层动作语义以根据动作相似性调节转换。潜在空间融合在场景内保持空间一致性,而时间加权融合引入双向时间约束以防止突兀的转换。这些组件共同实现了流畅且连贯的视频叙事,忠实反映了场景上下文和动作动态。大量实验表明,我们的方法显著优于基线,生成时间一致且视觉吸引人的长视频,无需额外训练,从而填补了短片段和扩展文本驱动视频叙事之间的差距。

英文摘要

Generating coherent long-form video sequences from discrete text prompts remains challenging due to difficulties in maintaining temporal coherence, semantic consistency, and scene-action continuity across segments. We propose a novel storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing. Our approach combines three key components: (i) a bidirectional time-weighted latent blending strategy that enforces temporal consistency between consecutive video segments, (ii) a dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances scene and action prompts at each diffusion timestep based on CLIP-based alignment, narrative progression, and temporal smoothness, and (iii) a semantic action representation that encodes high-level action semantics to modulate transitions according to action similarity. Latent-space blending preserves spatial coherence within scenes, while time-weighted blending introduces bidirectional temporal constraints to prevent abrupt transitions. Together, these components enable fluid and coherent video narratives that faithfully reflect both scene context and action dynamics. Extensive experiments demonstrate that our method significantly outperforms baselines, producing temporally consistent and visually compelling long-form videos without any additional training, thereby bridging the gap between short clips and extended text-driven video storytelling.
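
One simple way to picture time-weighted latent blending between consecutive segments is a linear cross-fade over an overlap window, sketched below. This is a deliberately simplified, one-directional assumption: the weighting schedule, the overlap handling, and the function name are hypothetical, and the paper's bidirectional strategy and its DIPW prompt weighting are not reproduced here.

```python
def blend_overlap(prev_tail, next_head):
    """Linearly cross-fade two equally long windows of latent frames.

    Each window is a list of frames; each frame is a list of floats.
    """
    if len(prev_tail) != len(next_head):
        raise ValueError("overlap windows must have the same number of frames")
    n = len(prev_tail)
    blended = []
    for t, (a, b) in enumerate(zip(prev_tail, next_head)):
        # Weight of the incoming segment grows with time across the overlap.
        w = (t + 1) / (n + 1)
        blended.append([(1.0 - w) * ai + w * bi for ai, bi in zip(a, b)])
    return blended

# Three overlapping frames: the outgoing segment sits at 0.0, the incoming at 4.0.
frames = blend_overlap([[0.0], [0.0], [0.0]], [[4.0], [4.0], [4.0]])
```

With three overlap frames the weights are 0.25, 0.5, and 0.75, so the blended values step through 1.0, 2.0, and 3.0 rather than jumping from 0.0 to 4.0 at the segment boundary.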

2404.07106 2026-05-20 cs.CV cs.GR 版本更新

3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion

3DMambaComplete:探索结构状态空间模型用于点云补全

Yixuan Li, Weidong Yang, Ben Fei

发表机构 * Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University(复旦大学计算机科学技术学院上海市数据科学重点实验室) Department of Information Engineering, The Chinese University of Hong Kong(香港中文大学信息工程系)

AI总结 本文提出3DMambaComplete,一种基于Mamba框架的点云补全网络,通过HyperPoint生成、分散和变形模块有效解决点云补全中的局部细节丢失和计算复杂度问题,实验表明其优于现有方法。

Comments 24 pages, 14 figures, 10 tables

详情
AI中文摘要

点云补全旨在从初始不完整且低质量的输入生成完整且高保真的点云。一种常见策略是利用基于Transformer的模型来编码全局特征并促进重建过程。然而,使用池化操作获取全局特征表示往往会导致点云中局部细节的丢失。此外,Transformer中的注意力机制引入了额外的计算复杂性,使得处理长序列变得困难。为了解决这些问题,我们提出了3DMambaComplete,一种基于新型Mamba框架的点云补全网络。它包含三个模块:HyperPoint生成模块利用Mamba的选择机制编码点云特征,并预测一组Hyperpoints;特定偏移量被估计,下采样的点成为HyperPoints;HyperPoint Spread模块将这些HyperPoints分散到不同的空间位置以避免集中。最后,一种变形方法将HyperPoints的2D网格表示转换为精细的3D结构以进行点云重建。在各种已建立的基准上进行的大量实验表明,3DMambaComplete超越了最先进的点云补全方法,这通过定性和定量分析得到证实。

英文摘要

Point cloud completion aims to generate a complete and high-fidelity point cloud from an initially incomplete and low-quality input. A prevalent strategy involves leveraging Transformer-based models to encode global features and facilitate the reconstruction process. However, the adoption of pooling operations to obtain global feature representations often results in the loss of local details within the point cloud. Moreover, the attention mechanism inherent in Transformers introduces additional computational complexity, rendering it challenging to handle long sequences effectively. To address these issues, we propose 3DMambaComplete, a point cloud completion network built on the novel Mamba framework. It comprises three modules. The HyperPoint Generation module encodes point cloud features using Mamba's selection mechanism and predicts a set of HyperPoints: a specific offset is estimated, and the down-sampled points become HyperPoints. The HyperPoint Spread module disperses these HyperPoints across different spatial locations to avoid concentration. Finally, a deformation method transforms the 2D mesh representation of HyperPoints into a fine-grained 3D structure for point cloud reconstruction. Extensive experiments conducted on various established benchmarks demonstrate that 3DMambaComplete surpasses state-of-the-art point cloud completion methods, as confirmed by qualitative and quantitative analyses.

2605.19750 2026-05-20 cs.CV 版本更新

CPC-VAR: Continual Personalized and Compositional Generation in Visual Autoregressive Models

CPC-VAR:视觉自回归模型中的持续个性化与组合生成

Junhao Li, Xinhao Zhong, Yi Sun, Yuxia Qiao, Bin Chen, Shu-Tao Xia, Yaowei Wang

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Peng Cheng Laboratory(鹏城实验室) South China University of Technology(华南理工大学)

AI总结 本文研究了视觉自回归模型中的持续个性化生成问题,提出了一种统一框架,通过梯度基概念神经元选择和上下文感知组合策略,解决了连续单概念学习和多概念合成中的关键挑战,提升了长序列持续个性化和多概念图像合成的性能。

详情
AI中文摘要

视觉自回归(VAR)模型最近涌现出作为一种高效的文本到图像生成范式。尽管其强大的生成能力,现有的基于VAR的个性化方法仍局限于静态设置,无法适应不断变化的用户需求。特别是,序列概念学习导致严重的灾难性遗忘,而多概念合成常遭受特征纠缠和属性不一致的问题。在本文中,我们首次系统研究了VAR模型中的持续个性化生成。我们识别出两个关键挑战:(i)在连续定制过程中保持已学习的概念,以及(ii)以可控的方式组合多个个性化概念。为了解决这些问题,我们提出了一种统一框架,包含两个核心组件。对于持续单概念学习,我们引入了基于梯度的概念神经元选择(GCNS),该方法识别出与概念相关的神经元,并仅约束跨任务的冲突参数,从而有效缓解遗忘而不增加模型规模。对于多概念合成,我们提出了一种上下文感知的组合策略,通过多分支特征建模和局部跨注意力融合,由空间条件引导,实现了精确且解耦的概念组合。大量实验表明,我们的方法在长序列持续个性化中显著提高了性能,并在多概念图像合成中优于现有基线。这些发现突显了VAR模型在可扩展和可控个性化生成中的潜力。

英文摘要

Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.

2605.19744 2026-05-20 cs.CV 版本更新

Real-World On-Vehicle Evaluation of Embedding-Based Anomaly Detection

车载场景中基于嵌入的异常检测实测

Albert Schotschneider, Daniel Bogdoll, Svetlana Pavlitska, Ahmed Abouelazm, Johann Marius Zoellner

发表机构 * FZI Research Center for Information Technology(FZI信息科技研究中心) KIT Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 本文提出了一种适应性强的实时异常检测方法,利用预训练视觉变换器嵌入来检测潜在异常,通过在潜在语义特征空间中使用最近邻相似性检测偏差,并在真实世界场景中评估了该方法的性能。

Comments Accepted at CVPR 2026 Workshop AUTOPILOT-NA

详情
AI中文摘要

在自动驾驶中检测交通场景中的异常对于确保安全至关重要,但收集具有代表性的异常数据仍然具有挑战性。现有的异常检测方法高度专业化,并且依赖于抽象语义Cityscapes类定义的正常性,这使得难以适应多样的现实世界场景。我们提出了一种适应性强的实时异常检测方法,该方法利用预训练的视觉变换器嵌入作为基础模型,通过潜在语义特征空间中的最近邻相似性来检测偏差。基于逐块处理,该算法生成密集的异常掩码,允许定位检测到的异常。该方法通过单个参考图像稳健地建模正常性。这种形式避免了显式监督和数据集特定的训练,使其适合现实世界部署。我们在标准基准和自动化车辆的真实场景中评估了该方法。尽管其简单性,该方法在Road Anomaly基准上表现良好,并在实践中表现出一致的定性行为,成功地在多样化的场景中突出显示语义上不寻常的对象。这些结果表明,在现实操作条件下,简单的基于参考的方法可以提供有用的异常信号。

英文摘要

Detecting anomalies in traffic scenes is crucial for ensuring safety in autonomous driving, yet collecting representative anomalous data remains challenging. Existing anomaly detection methods are highly specialized and rely on normality as defined by the abstract semantic Cityscapes classes, making it difficult to adapt to diverse real-world scenarios. We propose an adaptable real-time anomaly detection method that leverages foundation models in the form of pretrained vision transformer embeddings to detect deviations via nearest-neighbor similarity in the latent semantic feature space. Based on patch-wise processing, the algorithm produces dense anomaly masks, allowing for the localization of detected anomalies. The method robustly models normality through a single reference image. This formulation avoids explicit supervision and dataset-specific training, making it suitable for real-world deployment. We evaluate the method on standard benchmarks and on an automated vehicle in real-world scenarios. Despite its simplicity, the method achieves good performance on the Road Anomaly benchmark and demonstrates consistent qualitative behavior in practice, successfully highlighting semantically unusual objects in diverse scenes. These results suggest that simple, reference-based methods can provide useful anomaly signals under realistic operating conditions.
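The nearest-neighbor scoring described in this abstract can be illustrated with a minimal sketch. It assumes patch embeddings are already available as plain lists of floats; the function names, the cosine-similarity choice, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def anomaly_scores(test_patches, ref_patches):
    """Score each test patch as 1 - (highest similarity to any reference patch)."""
    return [1.0 - max(cosine(t, r) for r in ref_patches) for t in test_patches]

def anomaly_mask(scores, grid_w, threshold=0.5):
    """Arrange flat patch scores into a row-major 2D boolean mask.

    The threshold is a hypothetical value chosen for this toy example.
    """
    flags = [s > threshold for s in scores]
    return [flags[i:i + grid_w] for i in range(0, len(flags), grid_w)]

# Toy 2-D "embeddings": two reference patches, four test patches on a 2x2 grid.
ref = [[1.0, 0.0], [0.9, 0.1]]
test = [[1.0, 0.05], [0.0, 1.0], [0.8, 0.2], [0.95, 0.0]]
scores = anomaly_scores(test, ref)
mask = anomaly_mask(scores, grid_w=2)
```

Only the second test patch points away from both reference patches, so it is the only cell flagged in the mask. In a real pipeline the embeddings would come from a pretrained vision transformer, which is outside the scope of this sketch.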

2605.19737 2026-05-20 cs.GR cs.CV 版本更新

Decentralized Direct Volume Rendering: A Browser-Native GPU Architecture for MRI Digital Twins in Resource-Constrained Settings

去中心化直接体渲染:一种浏览器原生的GPU架构,用于资源受限环境中的MRI数字孪生

Oserebameh Augustine Beckley

发表机构 * Lagos State University(拉各斯州大学)

AI总结 本研究提出了一种去中心化的浏览器原生GPU架构,用于在资源受限环境中实现高保真的MRI数字孪生,通过在低成本集成边缘GPU上执行确定性的单次通过射线投射和形态学梯度计算,实现了快速的像素生成和稳定的交互性能。

Comments 10 pages, 4 figures. Live interactive browser demo available at: https://webgpu-mri.vercel.app/ . Source code repository: https://github.com/Bahdmanbabzo/webgpu-mri

详情
AI中文摘要

数字孪生(DT)技术在手术计划和个性化医学中具有巨大潜力。然而,生成交互式、患者特异性的解剖孪生目前依赖于计算密集型的服务器端渲染(SSR)或昂贵的本地工作站,这在资源受限环境(RCS)中构成了显著的部署障碍。本文提出了一种去中心化的、客户端侧的WebGPU架构,以民主化高保真解剖数字孪生的访问。通过绕过标准的服务器端渲染管线,该框架在低成本的集成边缘GPU上执行确定性的单次通过射线投射和形态学梯度计算。消除云渲染解决方案固有的网络延迟,系统实现了小于920.0毫秒的首次像素时间(TTFP)并保持>=82.0 FPS的稳定交互性。通过统一缓冲区维持连续交互保真度,实现了零延迟的组织参数操控,以支持动态临床决策。通过证明复杂的患者特异性MRI扫描的3D医学模拟可以在浏览器中原生执行,无需深度学习或外部计算依赖,该架构提供了一种可扩展且经济的平台,以促进医疗数字孪生的广泛临床应用。

英文摘要

Digital Twin (DT) technology holds immense potential for surgical planning and personalized medicine. However, generating interactive, patient-specific anatomical twins currently relies on computationally heavy Server-Side Rendering (SSR) or expensive local workstations, creating significant barriers to deployment, especially in resource-constrained settings (RCS). This paper presents a decentralized, client-side WebGPU architecture that democratizes access to high-fidelity anatomical Digital Twins. By bypassing standard server-side rendering pipelines, the framework executes deterministic single-pass raymarching and morphological gradient calculations directly on low-cost integrated edge GPUs. Eliminating the network latency inherent to cloud-rendered solutions, the system achieves a Time to First Pixel (TTFP) of under 920.0 ms and maintains stable interactivity at >= 82.0 FPS. Continuous Interaction Fidelity is maintained via uniform buffers, enabling zero-latency manipulation of tissue parameters for dynamic clinical decision-making. By proving that complex 3D medical simulations of patient-specific MRI scans can be executed natively in the browser without deep learning or external computational dependencies, this architecture provides a scalable, affordable foundation for the widespread clinical adoption of healthcare Digital Twins.
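As a rough illustration of single-pass raymarching, the sketch below composites intensity samples along one ray front to back. It is written in Python for readability rather than as a WebGPU shader, and the transfer function, the early-termination cutoff, and all names are assumptions made for this example, not the paper's implementation.

```python
def composite_ray(samples, transfer, cutoff=0.99):
    """Front-to-back compositing of scalar samples along a single ray.

    `transfer` maps a sample intensity to a (color, opacity) pair.
    Marching stops early once accumulated opacity reaches `cutoff`.
    """
    color, alpha = 0.0, 0.0
    for s in samples:
        c, a = transfer(s)
        color += (1.0 - alpha) * a * c   # contribution attenuated by what is in front
        alpha += (1.0 - alpha) * a       # accumulate opacity
        if alpha >= cutoff:
            break
    return color, alpha

def simple_transfer(intensity):
    """Hypothetical transfer function: brighter voxels are more opaque."""
    return intensity, 0.5 * intensity

result = composite_ray([0.0, 1.0, 1.0], simple_transfer)
```

Tissue parameters such as the opacity scale would be the kind of value a renderer could expose through a uniform buffer so that it can change without rebuilding the pipeline; that mapping is a design assumption here.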

2605.19734 2026-05-20 cs.CV 版本更新

GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

GeoMamba: 一种基于几何的MambaVision框架及数据集,用于细粒度光学-雷达目标检索

Tiantong Fang, Xiuwei Wang, Jing Xiao, Wujie Zhou, Liang Liao, Mi Wang

发表机构 * School of Artificial Intelligence, Wuhan University(武汉大学人工智能学院) School of Artificial Intelligence and Information Engineering, Zhejiang University of Science & Technology(浙江科技大学人工智能与信息工程学院) Hangzhou Institute of Technology, Xidian University(西安电子科技大学杭州研究院) State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University(武汉大学测绘遥感信息工程国家重点实验室)

AI总结 本文提出GeoMamba框架,通过引入几何特征注入模块和几何一致性约束模块,提升光学-雷达细粒度目标检索的鲁棒性,并构建了新的FGOS-as数据集来评估跨模态检索性能。

详情
AI中文摘要

多源遥感能够互补地观测地面物体,但跨模态细粒度目标检索仍具有挑战性,尤其是在光学和雷达条件不一致的情况下。与传统的依赖配对或空间对齐样本的检索设置不同,实际的光学-雷达检索受到显著的模态差异、斑点噪声和结构不一致的影响,限制了跨模态表示学习的鲁棒性。为此,我们提出GeoMamba,一种针对光学-雷达细粒度检索的几何驱动框架。具体而言,GeoMamba引入了一个几何特征注入(GFI)模块,以增强跨模态特征交互,并结合结构先验,从而提高雷达表示的鲁棒性并促进几何一致的特征学习。此外,几何一致性约束(GCC)模块与深度监督(DS)策略一起,利用经典操作符施加层次化的几何约束,帮助在表示学习过程中保留信息丰富的物体结构。我们进一步构建了一个新的数据集FGOS-as,包含11个航空航天和海洋类别,用于评估在现实遥感场景中的不一致跨模态细粒度目标检索性能。在FGOS-as上的大量实验表明,GeoMamba在所有对所有检索设置中优于现有方法,达到了63.3%的mAP和77.0%的Rank-1准确率。

英文摘要

Multi-source remote sensing enables complementary observation of ground objects, yet cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting.
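For readers unfamiliar with the two reported metrics, here is a minimal sketch of how mAP and Rank-1 accuracy are commonly computed from ranked retrieval lists. The class labels and the toy rankings are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def average_precision(ranked_labels, query_label):
    """Average of precision values taken at each rank where a correct match appears."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def evaluate(queries):
    """queries: list of (query_label, gallery labels sorted by descending similarity)."""
    aps = [average_precision(ranked, q) for q, ranked in queries]
    rank1 = [1.0 if ranked and ranked[0] == q else 0.0 for q, ranked in queries]
    return sum(aps) / len(aps), sum(rank1) / len(rank1)

# Two toy queries with hypothetical class labels.
queries = [
    ("ship", ["ship", "plane", "ship"]),
    ("plane", ["ship", "plane", "plane"]),
]
mean_ap, rank1_acc = evaluate(queries)
```

Rank-1 only checks the top result, while mAP rewards placing every correct match near the top of the list, which is why the two numbers can diverge.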

2605.19728 2026-05-20 cs.CV 版本更新

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

Aero-World: 从惯性控制生成动作条件的空中视频

Abdul Mohaimen Al Radi, Kunyang Li, Yuzhang Shang, Mubarak Shah, Yu Tian

发表机构 * Institute of Artificial Intelligence, University of Central Florida(中央佛罗里达大学人工智能研究所)

AI总结 本文提出Aero-World,一种将预训练图像到视频扩散模型转换为可控空中视频生成器的方法,通过注入加速度和角速度序列,利用冻结的物理探测器提供惯性一致性监督,从而提高生成视频对低级动作信号的符合度和时间稳定性。

详情
AI中文摘要

基础视频模型能够生成视觉逼真的结果,但其在具身AI中的应用受限,因为它们主要在自然语言上训练而不是低级控制信号。这种限制在空中飞行中尤为明显,因为运动发生在无约束的6自由度空间中,微小的自我运动误差会产生大的轨迹漂移。生成遵循精细惯性动作的空中视频可以支持可扩展的空中代理训练和评估,通过提供可控的现实世界或昂贵模拟数据代理。为此,我们提出了Aero-World,一种将预训练图像到视频扩散模型转换为可控空中视频生成器的方法。Aero-World通过动作令牌流将加速度和角速度序列注入到预训练的潜在扩散变换器中。一个冻结的潜在空间物理探测器,独立在真实视频-IMU配对上训练,通过LoRA微调期间提供可微的惯性一致性监督,同时避免计算昂贵的视频解码。我们进一步提出了AeroBench,一个评估生成无人机视频是否符合低级动作信号的基准。AeroBench使用动作对齐分数(AAS)测量与命令惯性动作的一致性,使用物理一致性率(PCR)测量时间运动稳定性。在AeroBench上,Aero-World将平均AAS从57.7提高到63.6,比仅动作微调有更高的质量控制权衡,与AirScape相比,FVD更低(596.5 vs. 1058.6),SSIM更高(0.595 vs. 0.505),Flow-IMU相关性更高(0.44 vs. 0.20)。这些结果表明,冻结的物理探测器监督是一种将预训练视频生成器适应更动作对齐的空中运动的实用机制。

英文摘要

Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.

2605.19727 2026-05-20 cs.CV 版本更新

Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

Tango3D: 向全局和局部2D-3D对应关系对齐迈进

Zebin He, Mingxin Yang, Shuhui Yang, Hanxiao Sun, Xintong Han, Chunchao Guo, Wenhan Luo

发表机构 * HKUST(香港科技大学) Tencent Hunyuan(腾讯混元)

AI总结 本文提出Tango3D,一种统一密集对应和全局检索的3D基础模型,通过几何感知的2D视觉骨干网络和预训练的3D VAE将图像编码为2D片段,点云编码为3D标记,并映射到共享空间以实现局部像素-点对齐和全局语义对齐。

详情
AI中文摘要

现有的3D基础模型通常将点云对齐到冻结的视觉-语言空间(如CLIP),通过将3D形状压缩成全局向量实现强大的跨模态检索。然而,这种仅全局对齐的方法无法建立精细的像素-点对应关系。为了解决这个问题,我们提出了Tango3D,一种基础模型,它统一了密集对应和全局检索。我们使用一个几何感知的2D视觉骨干网络和一个预训练的3D VAE将图像编码为2D片段,并将点云编码为3D标记。这些被映射到一个共享空间中,以实现局部像素-点对齐和全局语义对齐。为了稳定密集和全局目标的联合学习,我们引入了三阶段渐进训练策略。实验表明,我们的模型成功实现了对象级别的像素-点对齐,同时保持了具有竞争力的全局检索能力,这种联合能力是现有3D基础模型所不具备的。通过建立精细的对齐特征空间,Tango3D将丰富的语义注入到纯粹的几何3D标记中,为广泛密集3D下游任务铺平了道路。

英文摘要

Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.

2605.19726 2026-05-20 cs.CV 版本更新

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

通过块近似稀疏注意力实现扩散语言模型的高效长上下文建模

Wenhu Zhang, Yiming Wu, Huanyu Wang, Yaoyang Liu, Huanzhang Dou, Senqiao Yang, Sitong Wu, Hanbin Zhao, Jiaya Jia

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) The University of Hong Kong(香港大学) Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出了一种块近似稀疏注意力框架(BA-Att),通过块级预下采样操作识别信息区域,避免依赖脆弱的位置先验,从而在保持高性能的同时提升计算效率,实验表明其在注意力计算上比FlashAttention快6.95倍,并在50%稀疏度下保持接近全注意力性能。

Comments CVPR 2026 Findings paper

详情
AI中文摘要

扩散语言模型(DLMs)能够实现全局一致、双向且可控的文本生成,相较于传统自回归LLMs具有优势,但扩展到超长序列仍成本高昂。许多现有块稀疏注意力方法通过固定采样模式在高分辨率注意力空间中选择块,如尾部区域或反斜线条带。此类先验驱动的采样可能遗漏显著令牌并引入分布变化下的不稳定性。在本文中,我们提出块近似稀疏注意力框架(BA-Att)具有块级预下采样操作,能够在紧凑的下采样空间内识别信息区域,避免依赖脆弱的位置先验。为了分析其理论行为,我们定义了一个 oracle 后下采样注意力图,并正式化预下采样与后下采样方案之间的近似误差。基于这一见解,我们引入了一个轻量级的范数排序模块和一个协方差补偿修正,利用对角线QK方差近似完整协方差,从而降低计算复杂度。广泛的实验表明,我们的操作在注意力计算上比FlashAttention快达6.95倍,并在50%稀疏度下在语言模型、多模态语言模型和视频生成模型中保持接近全注意力性能,展示了强大的效率和泛化能力。

英文摘要

Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, while scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with block-wise pre-downsampled operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.
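The idea of choosing attention blocks in a compact downsampled space can be sketched as follows. This is a simplified illustration that mean-pools queries and keys per block and keeps the top-scoring key blocks; it omits the norm-sorting module and the covariance-compensated correction, and every name in it is an assumption rather than the paper's operator.

```python
def mean_pool(rows, block):
    """Average each consecutive group of `block` vectors into a single vector."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled

def select_blocks(q, k, block, keep):
    """For each query block, return sorted indices of the `keep` best key blocks.

    Block scores are dot products between pooled queries and pooled keys,
    so selection runs on (n / block)^2 entries instead of n^2.
    """
    qb, kb = mean_pool(q, block), mean_pool(k, block)
    selected = []
    for qv in qb:
        scores = [sum(a * b for a, b in zip(qv, kv)) for kv in kb]
        top = sorted(range(len(kb)), key=lambda j: scores[j], reverse=True)[:keep]
        selected.append(sorted(top))
    return selected

# Four tokens, block size 2; keeping 1 of 2 key blocks corresponds to 50% sparsity.
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
chosen = select_blocks(q, k, block=2, keep=1)
```

Full attention would then be computed only inside the selected blocks. Because selection depends on content rather than position, no fixed pattern such as tail regions or anti-diagonal stripes is needed.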

2605.19717 2026-05-20 cs.CV 版本更新

Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design

物理闭环:一种混合代理架构用于验证的CAD工程设计

Elias Berger, Muhammad Usama, Jan Mehlstäubl, Bernhard Saske, Kristin Paetzold-Byhain

发表机构 * Dresden University of Technology(德累斯顿技术大学) MAN Truck & Bus SE(MAN卡车与巴士股份有限公司) German Research Center for Artificial Intelligence(德国人工智能研究中心) RPTU Kaiserslautern-Landau(凯撒斯劳滕-兰道大学)

AI总结 本文提出了一种混合代理-物理架构,通过将经过验证的知识工程工具直接嵌入到自主AI代理的决策循环中,以解决大型语言模型在生成CAD设计时缺乏物理理解的问题。该方法通过显式的物理验证指导闭环、顺序决策过程,提高了生成CAD设计的物理正确性。

Comments Accepted in IJCAI-ECAI 2026 (Special Track on AI4Tech)

详情
AI中文摘要

大型语言模型(LLMs)可以生成计算机辅助设计(CAD),但缺乏可靠工程设计所需的物理理解。而不是试图从数据中隐式学习物理定律,我们提出了一种混合代理-物理架构,将经过验证的知识工程工具直接嵌入到自主AI代理的决策循环中。在该框架中,工程设计被建模为一个闭环、顺序决策过程,由显式的物理验证指导。基于负载案例,专用代理通过知识工程工具作为反馈信号,迭代地计划、生成、评估和修订工程设计。我们引入了一个基准数据集和评估功能有效性的指标。我们的系统生成了更复杂且经过物理验证的设计,结构复杂性提高了4.2%,与类似代理方法相比,编译率提高了3.5%。代码库、提示和数据集将向公众开放,以支持可重复性和未来研究。

英文摘要

Large Language Models (LLMs) can generate Computer-Aided Design (CAD), yet lack the physical comprehension required for reliable engineering design. Instead of attempting to implicitly learn physical laws from data, we propose a Hybrid Agentic-Physical Architecture that embeds validated knowledge-based engineering tools directly into the decision-making loop of autonomous AI agents. In this framework, engineering design is formulated as a closed-loop, sequential decision-making process guided by explicit physical verification. Based on a load case, dedicated agents iteratively plan, generate, evaluate, and revise engineering designs using knowledge-based tools as a feedback signal. We introduce a benchmark dataset and metrics for assessing functional validity in generative CAD. Our system generates more complex and physically verified designs, with a 4.2 increase in structural complexity and a 3.5% improvement in compile rate compared to similar agentic methods. The codebase, prompts and dataset will be made publicly available to support reproducibility and future research.

2605.19712 2026-05-20 cs.CV 版本更新

Physics-informed simulation framework for realistic sonar image generation and statistical validation

具有物理信息的模拟框架用于真实声纳图像生成和统计验证

Kamal Basha S, Athira Nambiar

发表机构 * Department of Computational Intelligence, SRM Institute of Science

AI总结 本文提出了一种基于物理的模拟框架ACOUSIM,用于生成真实声纳图像并进行统计验证,通过比较合成与真实声纳图像的统计特性,建立了可重复的分布级基准。

详情
AI中文摘要

合成声纳数据集为昂贵的实地采集提供了可扩展的替代方案,但其效用仍受缺乏严格定量验证的限制。我们提出了ACOUSIM(ACOustic SIMulation and Validation Platform),一个具有物理信息的框架,该框架在不依赖生成模型的情况下评估合成与真实声纳图像之间的统计一致性。基于Gazebo的环境通过显式控制海底纹理、光照驱动的阴影、平台高度和噪声生成声纳样图像。真实性通过两个公开声纳数据集SeabedObjects-KLSG-II和Sonar Common Target Detection(SCTD)进行量化,使用KL散度、JS散度和地球移动距离评估全局强度和局部纹理(LBP)分布。结果表明,在所有类别中纹理一致性都很强(KL < 0.07),其中平面类强度一致性优于船舶类,因为阴影几何复杂性。ACOUSIM为sim-to-real声纳评估建立了可重复的分布级基准,并直接支持水下图像分析的可靠数据集验证。

英文摘要

Synthetic sonar datasets offer a scalable alternative to costly real-world acquisition, yet their utility remains limited by the absence of rigorous quantitative validation. We present ACOUSIM (ACOustic SIMulation and Validation Platform), a physics-informed framework that evaluates the statistical alignment between synthetic and real sonar imagery without relying on generative models. A Gazebo-based environment generates sonar-like images by explicitly controlling seabed texture, illumination-driven shadowing, platform altitude, and noise. Realism is quantified against two public sonar datasets, SeabedObjects-KLSG-II and Sonar Common Target Detection (SCTD), using global intensity and local texture (LBP) distributions assessed via Kullback-Leibler divergence, Jensen-Shannon divergence, and Earth Mover's Distance. Results show strong texture alignment (KL < 0.07) across all classes, with plane-class intensity alignment outperforming ship-class due to shadow geometry complexity. ACOUSIM establishes a reproducible, distribution-level baseline for sim-to-real sonar evaluation and directly supports reliable dataset validation for underwater image analysis.
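The three distribution distances named in this abstract can be computed on histograms with a few lines of code. The sketch below assumes both histograms share the same unit-spaced bins; the smoothing constant and the toy histograms are illustrative assumptions, not values from the paper.

```python
import math

def normalize(hist):
    """Turn raw bin counts into a probability distribution."""
    total = sum(hist)
    return [h / total for h in hist]

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q); eps guards against empty bins in q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emd_1d(p, q):
    """Earth Mover's Distance for 1-D histograms: sum of absolute CDF differences."""
    total, cp, cq = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        total += abs(cp - cq)
    return total

# Toy intensity histograms for a synthetic and a real image (hypothetical counts).
synthetic = normalize([1, 1, 2])
real = normalize([2, 1, 1])
report = {"kl": kl(synthetic, real), "js": js(synthetic, real), "emd": emd_1d(synthetic, real)}
```

KL is asymmetric and sensitive to empty bins, JS is symmetric and bounded, and EMD accounts for how far mass must move between bins, so reporting all three gives complementary views of the same mismatch.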

2605.19692 2026-05-20 cs.CV 版本更新

WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

WBCAtt+: 细粒度像素级形态学标注用于白血球图像

Satoshi Tsutsui, Winnie Pang, Shuting He, Bihan Wen

发表机构 * Rapid-Rich Object Search (ROSE) Lab, School of Electrical and Electronic Engineering, Nanyang Technological University(快速丰富目标搜索(ROSE)实验室,电气与电子工程学院,南洋理工大学) Shanghai University of Finance and Economics(上海财经大学)

AI总结 本文提出WBCAtt+数据集,通过11个形态学属性和5个像素级细胞组件的密集标注,为白血球图像提供了全面的标注,用于改进属性识别和语义分割的基准模型,并展示了可解释AI模型等应用。

Comments Accepted to Medical Image Analysis. arXiv admin note: substantial text overlap with arXiv:2306.13531

详情
AI中文摘要

白血球(WBC)的显微检查在病理学中起着基础性作用,对于诊断如白血病和贫血等血液疾病至关重要。为了支持进一步的WBC图像研究,已提出多个数据集。然而,这些数据集主要标注细胞类别,缺乏病理学家用于解释细胞解释的详细形态学特征。为解决这一差距,我们引入WBCAtt+,一个包含11个形态学属性和5个像素级细胞组件的新型WBC图像数据集。WBCAtt+拥有113,000个图像级标签和10,000个分割图,是首个为WBC图像提供全面标注的数据集。利用此数据集,我们提供了属性识别和语义分割的基准模型。我们还设计了一个属性识别模型,以整合细胞的组成结构,进一步提高识别性能。最后,我们展示了由我们的数据集启用的各种应用,如可解释AI模型,包括反事实示例生成。

英文摘要

The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate the compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. The dataset and code are publicly available at https://doi.org/10.57967/hf/8143.

2605.19688 2026-05-20 cs.CV 版本更新

DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

DocQT: 通过多样化的JPEG量化表提高文档伪造定位的鲁棒性

Kylian Ronfleux-Corail, Guillaume Bernard, Mickaël Coustaty, Nicolas Sidère

发表机构 * MAIF, Niort, France(法国尼奥特MAIF机构) L3i Laboratory, La Rochelle University, La Rochelle, France(法国拉罗谢尔大学拉罗谢尔L3i实验室)

AI总结 本文提出DocQT数据集,通过对比不同架构在不同量化表训练下的表现,证明标准质量因子增强无法代表实际压缩多样性,并展示了显式考虑量化表的架构在实际部署中的鲁棒性优势。

详情
AI中文摘要

文档操纵定位模型在公开基准上表现强劲,但在实际文档工作流程中泛化能力不足。我们发现这一差距的关键原因在于训练过程中使用的JPEG量化表分布狭窄(仅限于标准libjpeg质量因子)与实际保险文档管道中遇到的异质压缩配置之间的不匹配。为了隔离这一因素,我们进行了一项受控的因子研究,比较了两种具有不同量化表意识水平的架构(FFDN [2] 和 Mesorch [20]),每种架构在标准质量因子增强(Standard-QT)或从DocQT量化表库(Real-QT)采样的操作校准量化表下进行训练,并在三种再压缩条件下进行评估。在DocTamper [15] 上训练时使用Real-QT带来了显著的定位增益,并显著降低了真实操作文档中的像素级误报率,但仅适用于显式将量化表作为输入的架构。发布的DocQT量化表数据集和压缩再生产材料可在https://github.com/Kyliroco/Improving-Document-Forgery-Localization-Robustness-via-Diverse-JPEG-Quantization-Tables直接获取。这些结果表明,标准质量因子增强无法充分代表实际压缩多样性,并且显式条件化于量化表的架构选择为实际部署提供了有意义的鲁棒性优势。

英文摘要

Document manipulation localization models achieve strong performance on public benchmarks yet fail to generalize to operational document workflows. We identify a critical and overlooked source of this gap: the mismatch between the narrow distribution of JPEG quantization tables used during training (restricted to standard libjpeg quality factors) and the heterogeneous compression profiles encountered in real-world insurance document pipelines. To isolate this factor, we conduct a controlled factorial study comparing two architectures with contrasting levels of quantization table awareness (FFDN [2] and Mesorch [20]), each trained under either standard quality factor augmentation (Standard-QT) or operationally calibrated quantization tables sampled from DocQT, a quantization-table bank derived from a MAIF operational image corpus (Real-QT), and evaluated under three recompression conditions. Training under Real-QT yields substantial localization gains on DocTamper [15] and significantly reduces the pixel-level false positive rate on authentic operational documents, but only for architectures that explicitly ingest the quantization table as input. The released DocQT quantization-table dataset and compression-reproduction material are directly available at https://github.com/Kyliroco/Improving-Document-Forgery-Localization-Robustness-via-Diverse-JPEG-Quantization-Tables. These results demonstrate that standard quality factor augmentation does not adequately proxy operational compression diversity, and that architectural choices explicitly conditioning on the quantization table provide a meaningful robustness advantage for real-world deployment.
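To see why standard quality factors cover only a narrow family of quantization tables, it helps to recall how a libjpeg-style encoder derives a table from a quality value: every entry of one fixed base table is scaled by the same factor. The sketch below reproduces that commonly documented scaling rule for the baseline luminance table; the function name is hypothetical and the code is an illustration, not part of the DocQT release.

```python
# Standard JPEG luminance base table (Annex K of the JPEG standard), row-major.
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]

def scaled_table(quality, base=BASE_LUMA):
    """Derive a baseline quantization table from a quality factor in 1..100."""
    quality = max(1, min(100, quality))
    # Percentage scale factor: large for low quality, zero at quality 100.
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Scale each entry with rounding, then clamp to the baseline range 1..255.
    return [max(1, min(255, (v * scale + 50) // 100)) for v in base]

table_q75 = scaled_table(75)
```

Because every quality factor yields a scaled copy of the same base table, the 100 possible tables all share one shape. Tables written by scanners, phones, or document software need not follow that shape, which is the diversity gap the abstract describes.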

2605.19656 2026-05-20 cs.CV 版本更新

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

跨视图泼溅:基于地理参考图像的馈送视图合成

Matias Turkulainen, Akshay Krishnan, Filippo Aleotti, Mohamed Sayed, Guillermo Garcia-Hernando, Juho Kannala, Arno Solin, Gabriel Brostow, Daniyar Turmukhambetov

发表机构 * Aalto University(阿尔托大学) Georgia Tech(佐治亚理工学院) Niantic Spatial(Niantic空间) University of Oulu(奥卢大学) ELLIS Institute Finland(芬兰ELLIS研究所) UCL(伦敦大学学院)

AI总结 本文提出了一种基于地理参考图像的馈送视图合成方法,通过融合正交校正的卫星图像与GPS标记的地面照片,预测统一3D坐标框架中的高斯泼溅,从而提升场景覆盖和新视角合成效果。

Comments Submitted to CVPR 2026. 8 figures, 3 tables. Project page: https://nianticspatial.github.io/cross-view-splatter/

详情
AI中文摘要

我们提出了Cross-View Splatter,一种预测像素对齐高斯泼溅的馈送方法,用于地面级和卫星拍摄的户外场景。忠实重建需要良好的相机覆盖,但地面影像在大规模户外场景中拍摄耗时且困难。幸运的是,卫星影像可以提供全球几何先验,可通过公共API轻松获取。Cross-View Splatter融合正交校正的卫星视图与GPS标记的地面照片,以统一的3D坐标框架预测高斯泼溅。通过对齐地面和鸟瞰特征表示,我们的模型相比仅使用地面影像提升了场景覆盖和新视角合成。我们在经过筛选的地理参考数据集和配对的卫星地形数据上进行训练,这些数据来自开源测绘服务。我们在新的新视角合成基准上评估了我们的方法,该基准允许与先前最先进的方法进行比较。我们的代码和数据准备将在https://nianticspatial.github.io/cross-view-splatter/上提供。

英文摘要

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery, allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

2605.19639 2026-05-20 cs.CV 版本更新

Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation

面向反思式视觉生成的Reason-Reflect-Rectify基准测试与演进

Junjie Wang, Xinghua Lou, Jason Li, Ye Tian, Keyu Chen, Yulin Li, Bin Kang, Jacky Mai, Yanwei Li, Zhuotao Tian, Liqiang Nie

AI总结 本文提出R^3-Bench基准和R^3-Refiner框架,用于评估和提升反思视觉生成能力,通过改进迭代推理和修正能力,提升文本到图像模型的生成质量。

详情
AI中文摘要

文本到图像(T2I)模型和统一多模态模型(UMMs)在视觉生成领域取得了显著进展。然而,其依赖于单次生成范式限制了处理需要迭代细化的复杂提示的能力。为了实现多轮反思视觉生成(RVG),我们正式将Reason-Reflect-Rectify(R^3)循环作为核心框架,并引入R^3-Bench,一个包含600多个专家标注实例的基准,用于量化迭代推理和修正能力。在R^3-Bench上的评估揭示了一个关键差距:尽管最先进的模型能够识别生成错误,但它们无法生成具有操作性的修正指令。为弥合这一差距,我们提出了R^3-Refiner,一个双阶段框架,利用组相对策略优化(GRPO)和分层奖励机制(HRM)来更好地对齐修正与反思推理。实验表明,R^3-Refiner在R^3-Bench上实现了显著改进(在反思判断分数上提升12.0%,在修正分数上提升9.0%),并且可以无缝集成到各种多语言大型模型(MLLMs)中,以提升不同T2I模型在GenEval++和T2I-CompBench上的生成质量。代码可在https://github.com/xiaomoguhz/R3-Bench获取。

英文摘要

Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at https://github.com/xiaomoguhz/R3-Bench.

2605.19634 2026-05-20 cs.CV cs.AI 版本更新

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

P2DNav: 全景到俯视视角的零样本视觉-语言导航

Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

发表机构 * Department of Control Science and Engineering, Tongji University(控制科学与工程系,同济大学)

AI总结 本文提出P2DNav框架,通过全景到俯视视角的分解、滑动窗口对话记忆和反思重新定位机制,解决零样本视觉-语言导航中的方向推理与局部定位问题,实验表明其在R2R-CE基准上性能优异。

详情
AI中文摘要

视觉-语言导航(VLN)要求一个具身代理将自然语言指令转化为可执行的导航动作,以应对未见环境。现有零样本方法通常依赖额外的航点预测模块,这些模块往往将高层方向推理与细粒度局部定位纠缠在一起,导致决策错误且不稳定。在本文中,我们提出P2DNav,一种用于零样本视觉-语言导航的分层框架。P2DNav包含三个核心组件:全景到俯视(P2D)、滑动窗口对话记忆(SDM)和反思重新定位机制(RRM)。P2D明确将导航决策分解为两个阶段:全景方向选择和俯视局部定位。它首先从360°全景中选择与指令相关的方向,然后从该方向的俯视RGB观察中预测像素级目标点。此外,SDM将导航历史组织为多轮对话上下文,并在滑动窗口内维护最近的视觉观察以支持长距离导航。RRM进一步通过评估局部定位的可靠性基于俯视观察,并在必要时返回全景方向选择。在R2R-CE基准上的实验表明,P2DNav在零样本方法中表现强劲。特别是,与最先进的(SOTA)零样本航点基于和航点自由方法相比,P2DNav在SR方面分别获得了146.6%和58.9%的提升,证明了P2D、SDM和RRM在零样本VLN中的有效性。代码将向公众发布。

英文摘要

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.

2605.19631 2026-05-20 cs.RO cs.CV 版本更新

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

HEAT: 基于轨迹引导的世界模型实现异构端到端自动驾驶

Hoonhee Cho, Giwon Lee, Jae-Young Kang, Hyemin Yang, Heejun Park, Kuk-Jin Yoon

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出一种基于轨迹引导的学习方法,通过规划轨迹组织训练,使模型能够捕捉驾驶意图的领域不变表示,并结合预测未来潜在特征的世界模型,提高特征一致性并缓解领域偏见,从而在多个异构数据集上实现强性能。

详情
AI中文摘要

端到端自动驾驶直接将原始传感器数据映射到驾驶动作,已成为传统模块化流水线之外一种极具吸引力的替代方案。尽管近期方法在单域数据集上表现强劲,但在多个异构领域上联合训练时,其性能显著下降。然而在实践中,自动驾驶系统必须在分布各异的多样环境中运行,包括不同的城市、传感器配置和交通模式,且无需针对特定领域重新训练。这一差距突显了多领域学习中的关键挑战:异构领域间的领域特定差异会引入相互冲突的学习信号,使模型趋向折中解,而这些解在各个领域上都是次优的。为此,我们提出了一种轨迹驱动的学习范式,围绕规划轨迹组织训练,使模型能够捕捉驾驶意图的领域不变表示。此外,我们还引入了一个世界模型,该模型以自车动作为条件预测未来的潜在特征,从而提高特征一致性并缓解领域引起的偏差。我们在nuScenes、NAVSIM和Waymo端到端数据集三个基准上评估了我们的方法,并在所有领域上展示了相对现有方法的显著改进。我们的结果表明,单个统一模型可以在异构数据集上训练,同时在每个领域中保持强大的性能,这是迈向可扩展的现实世界部署的一步。我们将公开我们的代码。

英文摘要

End-to-end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single-domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain-specific retraining. This gap highlights a key challenge in multi-domain learning: domain-specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory-driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain-invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain-induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end-to-end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real-world deployment. We will make our code publicly available.

2605.19623 2026-05-20 cs.CV 版本更新

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

PrAda:基于文本提示的分割的少样本视觉适应

Gabriele Rosi, Fabio Cermelli, Carlo Masone, Barbara Caputo

发表机构 * Politecnico di Torino(都灵理工大学) Focoos AI

AI总结 该研究针对文本提示分割在特定领域中的性能下降问题,提出了一种新的少样本视觉适应方法PrAda,通过结合细粒度像素特征和高层Transformer表示学习类特定原型,从而在不改变模型零样本潜力的情况下实现对新领域的强适应。

Comments CVPR 2026 Findings. Code: https://github.com/FocoosAI/PrAda

详情
AI中文摘要

图像分割对于视觉理解至关重要,但需要大量的像素级标注。基础模型已经使预测新类别的新范式成为可能,这些范式通过文本提示引导,而无需目标领域的标注。然而,在专门化的目标领域中,远离原始预训练,其性能会下降。我们研究了现有方法在这样的领域偏移下的误差,发现误分类而不是掩码生成是主要的罪魁祸首。为了解决这个问题,我们引入了新的问题:基于文本提示的分割的少样本视觉适应。这种适应在图像分类中已被广泛研究,但在分割中仍属未探索的领域。我们通过原型适应(PrAda)解决了这一任务,这是一种新颖且参数高效的适应方法,用于适应冻结的文本提示分割模型。我们的方法通过结合细粒度像素特征和高层Transformer表示来学习类特定原型,然后通过学习的重要性因子将这些原型与原始基于文本的预测融合。这在保持模型零样本潜力的同时,使模型能够适应新领域。在五个基准上的语义、实例和全景分割实验表明,PrAda在与现有最先进方法和所提基线相比时,取得了显著的改进。

英文摘要

Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model's zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.

2605.19622 2026-05-20 cs.CV 版本更新

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

UniRefiner: 通过对比寄存器教会预训练ViT自行清除杂质

Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu, Tong Zhang

发表机构 * Xi’an Jiaotong University, School of Software Engineering(西安交通大学软件工程学院) University of Chinese Academy of Sciences(中国科学院大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 本文提出UniRefiner,一种通用的精炼(refinement)框架,通过对比寄存器教会预训练 ViT 自行清除空间敏感任务中的伪影 token,提升模型在密集预测任务中的表现。

Comments CVPR 2026

详情
AI中文摘要

基于 Vision Transformers (ViTs) 的表示学习进展迅速,然而大规模模型在空间敏感任务中的实用性受到伪影 token(spurious token)的阻碍。此前的缓解工作较为有限,通常将这些伪影狭义地定义为例如简单的高范数异常值。我们认为这一范围并不充分。对于密集预测任务,我们主张任何未能编码位置对齐语义的 token 都应被视为伪影。这一更宽泛的定义揭示了一个更复杂的问题,促使我们系统地分类并刻画三种污染空间表示的基本伪影 token 类型。基于这一全面的诊断,我们提出了 UniRefiner,一种通用的精炼框架,教会预训练 ViTs 自行清除这些伪影。UniRefiner 使用对比寄存器(contrastive registers)显式地隔离并重新分配伪影 token,其双重目标为:(i) 将图像 token 与过滤后的正常 token 对齐以保持语义,(ii) 将寄存器 token 与检测到的伪影 token 对齐以捕捉伪影信号。我们的方法仅需在 ~5k 图像上进行几个 epoch 的微调即可精炼多种 ViTs,包括 EVA-CLIP-8B 和 InternViT-6B 等超大模型。实验显示了一致且显著的改进:特别是精炼后的 EVA-CLIP-8B 在 ADE20K 上达到 51.9% mIoU(+9.4%),超过 DINOv2(49.1%)等专用视觉模型,同时零样本分割精度最高提升 22%。UniRefiner 释放了现有大规模基础模型潜在的空间能力,为其更广泛的应用铺平了道路。

英文摘要

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

2605.19620 2026-05-20 cs.CV 版本更新

Bézier Degradation Modeling for LiDAR-based Human Motion Capture

基于LiDAR的人体动作捕捉的贝塞尔退化建模

Xiaoqi An, Lin Zhao, Jun Li, Chen Gong, Jian Yang

发表机构 * PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院PCA实验室) PCA Lab, School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院PCA实验室)

AI总结 本文提出BMLiCap框架,通过时间可压缩的贝塞尔曲线建模人体动作,采用轨迹保留策略减少控制点,设计渐进式动作重建模块,利用时间尺度运动变换器和多级动作聚合器有效融合多尺度曲线,以提高复杂场景下的动作重建精度和时间连续性。

Comments Accepted by CVPR 2026

详情
AI中文摘要

基于LiDAR的3D人体动作捕捉在自动驾驶和机器人领域有广泛应用,准确的动作重建至关重要。然而,现有方法在不稳定输入和严重遮挡情况下常常导致预测抖动甚至失败。为了解决这些挑战,我们提出BMLiCap,一种从粗到细的框架,通过时间可压缩的贝塞尔曲线建模运动。通过采用轨迹保留策略减少控制点,我们获得了一种连贯且易于学习的动作表示。为了从LiDAR点云线索中重建人体动作,我们设计了一个渐进式动作重建模块。具体来说,引入了时间尺度运动变换器(TMT)来在多个时间尺度上预测运动曲线,并利用多级动作聚合器(MMA)来适应性融合多尺度曲线,以恢复详细的、时间连贯的姿态,有效弥补由遮挡和噪声引起的观测缺口。在四个主流基准LiDARHuman26M、FreeMotion、NoiseMotion和SLOPER4D上,BMLiCap在复杂场景中实现了最先进的准确性和时间连续性,证明了其在严重遮挡下的补偿能力和减少预测抖动的能力。

英文摘要

LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible Bézier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D, BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.
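
To make the Bézier motion representation concrete, the sketch below evaluates a curve from a handful of control points with the Bernstein basis, so that a whole joint trajectory is stored as a few 3D points. It is a generic Bézier evaluation under assumed names (`bezier_point`, `sample_trajectory`), not BMLiCap's control-point reduction or its network modules.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bézier curve at t in [0, 1] using the Bernstein basis."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    point = [0.0] * dim
    for i, cp in enumerate(control_points):
        weight = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        for d in range(dim):
            point[d] += weight * cp[d]
    return tuple(point)


def sample_trajectory(control_points, num_frames):
    """Decode a smooth per-frame trajectory (num_frames >= 2) from the control points."""
    return [bezier_point(control_points, k / (num_frames - 1)) for k in range(num_frames)]


# Four control points stand in for a 30-frame trajectory of one joint.
wrist_controls = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 2.0, 0.0), (4.0, 0.0, 0.0)]
wrist_frames = sample_trajectory(wrist_controls, 30)
```

Because every decoded frame is a smooth function of the same few control points, frame-to-frame jitter cannot appear in the decoded motion, which is the property the abstract relies on for temporal continuity.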

2605.19613 2026-05-20 cs.CV 版本更新

White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

先白平衡,后调整:通过视觉-语言评估实现跨相机颜色恒常性

Shuwei Li, Lei Tan, Robby T. Tan

发表机构 * National University of Singapore(国立新加坡大学) ASUS Intelligent Cloud Services(ASUS智能云服务)

AI总结 本文提出VLM-CC框架,通过视觉-语言模型评估实现跨相机颜色恒常性的迭代反馈优化,利用感知反馈替代直接RGB回归,提升鲁棒性。

Comments In CVPR 2026

详情
AI中文摘要

颜色恒常性旨在保持物体颜色在不同光照下的一致性。跨相机颜色恒常性仍具挑战性,因为基于学习的模型常过拟合训练相机的颜色响应特性,导致在其他相机拍摄的图像上性能下降。我们提出VLM-CC,一种反馈引导的框架,将颜色恒常性建模为迭代细化过程。VLM-CC不直接从原始输入估计光源,而是在基于视觉-语言模型(VLM)的评估驱动下进行迭代校正。在每次迭代中,图像使用当前估计进行白平衡并转换为伪sRGB。随后,一个经LoRA微调的轻量级VLM评估校正后的图像,识别主导的残余色偏并给出定性反馈。该反馈被映射为残余光照方向(红、绿或蓝),并用于更新光源估计,直至收敛。我们的核心思想是将颜色恒常性重新表述为迭代的感知反馈问题,利用VLM评估而非直接的RGB回归。通过以VLM引导的感知反馈取代直接RGB估计,VLM-CC在多个数据集上实现了跨相机颜色恒常性的最先进鲁棒性。代码将在https://github.com/NothingIknow/VLM-CC上提供。

英文摘要

Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets. Code will be available at https://github.com/NothingIknow/VLM-CC.
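
The feedback loop described above can be sketched as follows. This is only an illustration of the control flow: `stub_evaluator` is a hypothetical stand-in for the LoRA-tuned VLM (it simply compares channel means), and the multiplicative update with a fixed step size is an assumption, since the abstract does not specify the update rule.

```python
def white_balance(pixel_mean, illuminant):
    """Divide each channel by the illuminant estimate (diagonal correction)."""
    return tuple(c / l for c, l in zip(pixel_mean, illuminant))


def stub_evaluator(corrected, tolerance=0.02):
    """Stand-in for the VLM: names the dominant residual cast, or 'neutral'."""
    average = sum(corrected) / 3
    deviations = [c / average - 1 for c in corrected]
    worst = max(range(3), key=lambda i: deviations[i])
    if deviations[worst] <= tolerance:
        return "neutral"
    return ("red", "green", "blue")[worst]


def refine_illuminant(pixel_mean, step=0.02, max_iters=200):
    """Iteratively raise the illuminant channel named by the feedback until it is neutral."""
    illuminant = [1.0, 1.0, 1.0]
    channel = {"red": 0, "green": 1, "blue": 2}
    for iteration in range(max_iters):
        feedback = stub_evaluator(white_balance(pixel_mean, illuminant))
        if feedback == "neutral":
            return illuminant, iteration
        illuminant[channel[feedback]] *= 1 + step   # qualitative feedback sets the direction only
    return illuminant, max_iters


# Mean RGB of a scene with a reddish cast (illustrative numbers).
estimate, iterations = refine_illuminant((0.6, 0.5, 0.4))
```

The evaluator only ever says which cast remains, never by how much, which mirrors the qualitative nature of the feedback in the abstract; the step size trades convergence speed against final precision.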

2605.19607 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

基于谱积分梯度的粗到细特征归因

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院) INEEJI Corp.(INEEJI公司)

AI总结 本文提出Spectral Integrated Gradients(SIG)方法,基于基线到输入差异的奇异值分解构建积分路径,以减少噪声并提高特征归因质量,优于现有基于路径的归因方法。

Comments 21 pages, 13 figures, 9 tables. Accepted to ACM KDD 2026; includes appendix

详情
AI中文摘要

积分梯度(IG)是一种被广泛采用的特征归因方法,满足若干理想的公理性质。然而,积分路径的选择显著影响归因质量,而标准的直线路径会同时引入所有输入特征,往往沿途累积噪声梯度。为解决这一局限,我们提出了Spectral Integrated Gradients(SIG),它基于基线到输入差异的奇异值分解(SVD)构建积分路径。通过按从大到小的顺序逐步激活奇异分量,SIG先引入全局结构,再引入细粒度细节,自然遵循由粗到细的进程。通过在多种图像分类数据集上的广泛评估,我们证明SIG生成的归因图更干净、噪声更少,并在定量性能上优于现有基于路径的归因方法。我们的代码可在https://github.com/leekwoon/sig/上获得。

英文摘要

Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight-line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients (SIG), which constructs integration paths based on singular value decomposition (SVD) of the baseline-to-input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine-grained details, naturally following a coarse-to-fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path-based attribution methods. Our code is available at https://github.com/leekwoon/sig/.
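
A minimal sketch of an SVD-ordered integration path is given below. It assumes the path is piecewise linear, adding one rank-1 singular component at a time from largest to smallest, and integrates gradients along each segment with a midpoint rule; this is one reading of the abstract, not the authors' released implementation, and the toy model and its gradient are hypothetical.

```python
import numpy as np


def spectral_ig(grad_fn, x, baseline, steps_per_component=128):
    """Path attribution along an SVD-ordered, piecewise-linear path (assumed construction)."""
    delta = x - baseline
    u, s, vt = np.linalg.svd(delta, full_matrices=False)   # singular values come sorted, largest first
    alphas = (np.arange(steps_per_component) + 0.5) / steps_per_component  # midpoint rule
    attribution = np.zeros_like(delta, dtype=float)
    current = baseline.astype(float).copy()
    for k in range(len(s)):
        component = s[k] * np.outer(u[:, k], vt[k])         # k-th rank-1 piece of the difference
        mean_grad = np.mean([grad_fn(current + a * component) for a in alphas], axis=0)
        attribution += component * mean_grad                # segment displacement times average gradient
        current = current + component
    return attribution


def toy_model(x, w):
    """Hypothetical differentiable scorer used only for this illustration."""
    return float(np.sum(w * np.tanh(x)))


def toy_gradient(x, w):
    return w * (1.0 - np.tanh(x) ** 2)


rng = np.random.default_rng(0)
image = 0.5 * rng.normal(size=(4, 5))
weights = rng.normal(size=(4, 5))
attributions = spectral_ig(lambda z: toy_gradient(z, weights), image, np.zeros((4, 5)))
```

Because the segments still connect the baseline to the input, the attributions sum to the change in model output (the completeness property of path methods); only the order in which structure is introduced differs from the straight-line path.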

2605.19605 2026-05-20 cs.CV 版本更新

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

deadtrees.earth-aerial: 用于树木覆盖与死亡检测的多分辨率航拍图像数据集

Ayushi Sharma, Clemens Mosig, Lukas Drees, Salim Soltani, Janusch Vajna-Jehle, Aaron Sheppard, Belqis Ahmadi, Jonathan Schmid, Paul Neumeier, Nathan Jacobs, Jan Dirk Wegner, Teja Kattenborn

发表机构 * Chair of Sensor-based Geoinformatics, University of Freiburg(弗赖堡大学基于传感器的地理信息学教席) EcoVision Lab, DM3L, University of Zurich(苏黎世大学生态视觉实验室) Institute for Earth System Science and Remote Sensing, Leipzig University(莱比锡大学地球系统科学与遥感研究所) Washington University, St. Louis(圣路易斯华盛顿大学)

AI总结 本文提出两个全新的开放数据集,用于从厘米级航拍图像中联合分割树木覆盖和树木死亡,解决了缺乏具有全球代表性的统一数据集的问题,并在多个生物群系上显著提升了死亡分割性能。

Comments Preprint. Under review. All rights reserved

详情
AI中文摘要

全球森林正日益受到气候变化以及火灾、虫害和病原体等干扰的威胁,迫切需要对树木覆盖和树木死亡进行可扩展的监测。来自无人机和飞机的航拍图像是对树冠和死亡情况进行精细、大范围制图的关键数据源。然而,由于缺乏具有全球代表性、经过统一处理的树木覆盖与死亡联合分割数据集,相关进展受到限制。我们介绍了两个新的、开放的、可直接用于机器学习的数据集,首次在全球尺度上支持从厘米级航拍图像中联合分割树木覆盖和树木死亡。通过DTE-aerial-train,我们提供了一个包含385K个1024x1024像素图像块的训练数据集,分辨率范围为2.5至20厘米。它包含针对树木覆盖和死亡的多类别专家标注以及经专家审核的伪标签。通过DTE-aerial-bench,我们提供了一个地理上均衡的基准测试集,包含25幅全球分布的正射影像,共525个图像块,均带有树木覆盖和死亡的高质量专家标注。训练集和基准集均涵盖热带、温带、寒带和旱地生物群系,并覆盖了广泛的森林结构和死亡模式。使用基准测试集进行评估,我们建立了强有力的参考基线,这些基线在所有生物群系和尺度上改进了死亡分割,并在具有挑战性的区域取得显著增益,例如在寒带森林中,F1分数从0.40提高到0.58,相对提升约45%。所有数据、模型和代码将在宽松的开源许可证下公开发布。基准数据集的交互式可视化可在deadtrees.earth/releases/dte-aerial-bench查看。

英文摘要

Forests worldwide are increasingly threatened by climate change and disturbances such as fire, pests, and pathogens, creating an urgent need for scalable monitoring of tree cover and tree mortality. Aerial imagery from drones and aircraft is a key data source for detailed and large-scale mapping of tree crowns and mortality. However, related progress is limited by the lack of globally representative, harmonized datasets for joint segmentation of tree cover and mortality. We introduce two novel, open, machine-learning-ready datasets to enable joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery for the first time at global scales. With DTE-aerial-train, we provide a training dataset comprising 385K image patches of size 1024x1024 pixels, with resolutions ranging from 2.5 to 20 cm. It includes multi-class expert-annotated and -audited pseudo-labels for tree cover and mortality. With DTE-aerial-bench, we provide a geographically balanced benchmark test set of 25 globally distributed orthoimages totaling 525 patches with high-quality expert annotations for both tree cover and mortality. Both the training and benchmark datasets span tropical, temperate, boreal, and dryland biomes and cover a wide range of forest structures and mortality patterns. Using the benchmark test set for evaluation, we establish strong reference baselines that improve mortality segmentation across all biomes and scales with significant gains in challenging regions, such as boreal forests, where the F1 score increases from 0.40 to 0.58 with around 45% relative improvement. All data, models, and code will be publicly released under permissive open-source licenses. An interactive visualization of the benchmark dataset is available at deadtrees.earth/releases/dte-aerial-bench.
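
The "around 45% relative improvement" quoted for boreal forests follows directly from the two F1 scores; the short check below separates the absolute gain (in F1 points) from the relative gain. The helper name is illustrative.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before


boreal_before, boreal_after = 0.40, 0.58          # F1 scores quoted in the abstract
absolute_gain = boreal_after - boreal_before      # about 0.18 F1 points
relative = relative_gain(boreal_before, boreal_after)   # about 0.45, i.e. roughly 45%
```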

2605.19595 2026-05-20 cs.CV cs.AI 版本更新

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

一种由LLM代理优化的YOLO26-MoE新型模型用于考虑无人机图像的绝缘子故障检测

João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

发表机构 * Department of Automation and Systems Engineering, Federal University of Santa Catarina, Florianópolis, Brazil(自动化与系统工程系,圣卡塔琳娜联邦大学,巴西弗洛里安波利斯) Applications Lab, Faculty of Science, University of Salamanca, Plaza de los Caídos s/n, 37008 Salamanca, Spain(应用实验室,科学学院,萨拉曼卡大学,西班牙萨拉曼卡)

AI总结 本文提出一种优化的YOLO26-MoE模型,在YOLO26检测器的高分辨率分支中集成稀疏的混合专家(MoE)模块,以针对细微且多样的故障模式进行自适应特征细化,同时保持单阶段检测框架的效率;超参数优化、训练和评估由工具增强的LLM代理协调,最终模型达到0.9900 mAP@0.5和0.9515 mAP@0.5:0.95,优于最新版本的YOLO。

详情
AI中文摘要

电力线路绝缘子的检查对于确保电网可靠性和防止因损坏或退化的绝缘组件引起的故障至关重要。近年来,结合深度学习视觉系统的无人机(UAV)已成为自动化此过程的有效解决方案。然而,由于缺陷区域小、故障模式异质性、复杂背景和变化的成像条件,绝缘子故障检测仍具挑战性。为解决这些挑战,本文提出了一种优化的YOLO26-MoE模型,一种新的目标检测架构,其在YOLO26检测器的高分辨率分支中集成了稀疏的混合专家(MoE)模块。所提出的修改使模型能够适应细微和多样的故障模式,同时保持单阶段检测框架的效率。超参数优化、最终训练和评估通过工具增强的大型语言模型(LLM)代理协调。所提出的模型实现了0.9900 mAP@0.5和0.9515 mAP@0.5:0.95的性能,优于最新版本的YOLO。这些结果表明,所提出的模型为基于无人机的绝缘子故障检测提供了一种有效且可靠的解决方案。

英文摘要

The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.
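
For readers unfamiliar with sparse Mixture-of-Experts layers, the sketch below shows one common routing scheme: a softmax gate scores the experts, only the top-k are evaluated, and their outputs are mixed with renormalized gate weights. The gate, the experts, and the choice of k are hypothetical; the abstract does not state how the YOLO26-MoE module routes features.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe(features, gate_weights, experts, top_k=2):
    """Route a feature vector to the top_k experts and mix their outputs."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)          # renormalize over the selected experts only
    output = [0.0] * len(features)
    for i in chosen:
        expert_out = experts[i](features)         # experts outside `chosen` are never evaluated
        for d, value in enumerate(expert_out):
            output[d] += (probs[i] / norm) * value
    return output, chosen


# Three toy experts acting on a 2-dimensional feature vector.
toy_experts = [
    lambda f: [2 * v for v in f],
    lambda f: [v + 1 for v in f],
    lambda f: [-v for v in f],
]
toy_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
mixed, selected = sparse_moe([2.0, 1.0], toy_gate, toy_experts, top_k=2)
```

Evaluating only k experts per input is what keeps the added capacity cheap at inference time, which is consistent with the abstract's claim of preserving one-stage efficiency.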

2605.19559 2026-05-20 cs.CV cs.AI 版本更新

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

EgoCoT-Bench: 面向MLLMs的有依据、可验证的以操作为中心的思维链推理基准

Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang

发表机构 * Zhejiang University(浙江大学)

AI总结 本文提出EgoCoT-Bench,一个用于评估MLLMs在第一人称视角下细粒度操作中心推理能力的基准测试,包含3172个可验证的问答对,涵盖感知、预见和高层次推理等任务,旨在解决现有基准测试在细粒度推理和证据验证方面的不足。

详情
AI中文摘要

多模态大语言模型(MLLMs)的快速发展使第一人称视频理解受到越来越多的关注,特别是MLLMs从第一人称视角识别细粒度手-物体交互、跟踪物体状态随时间的变化以及推理动态环境中操作过程的能力。然而,现有的第一人称视频基准对有依据的推理依据(rationale)评估有限,对细粒度的以操作为中心的推理支持不足,并且很少检查模型给出的推理依据是否建立在显式的时空证据之上。为弥补这一差距,我们引入了EgoCoT-Bench,一个细粒度的第一人称基准,用于有依据且可验证的以操作为中心的推理,并带有显式的逐步推理标注。总体而言,EgoCoT-Bench包含覆盖351个第一人称视频的3,172个可验证问答对,分为四个任务组、共12个子任务组,涵盖感知与回顾、预见和高层次推理。该基准通过时空场景图(STSG)引导的生成框架构建,并由人工标注者进一步精修,以确保正确性、第一人称相关性和细粒度质量。实验结果表明,第一人称细粒度推理仍然困难,并进一步揭示许多多模态模型给出的解释虽然答案正确,但其证据与答案不一致。我们希望EgoCoT-Bench能为第一人称视频理解中有依据且可验证的推理提供有用的测试平台。项目页面和补充材料可在:https://dstardust.github.io/EgoCoT/ 上找到。

英文摘要

The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability for MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from limited grounded rationale evaluation, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce EgoCoT-Bench, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graphs (STSG) guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations that are answer-correct, but have evidence that is inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.

2605.19556 2026-05-20 cs.CV 版本更新

EpiDiffVO: Geometry-Aware Epipolar Diffusion for Robust Visual Odometry

EpiDiffVO: 面向鲁棒视觉里程计的几何感知对极扩散

Prateeth Rao

发表机构 * International Institute of Information Technology Bangalore(班加罗尔国际信息技术学院)

AI总结 本文提出了一种稀疏对极匹配框架,预测针对几何一致性优化的紧凑对应点集以减少冗余,并结合对极扩散过程和图神经网络实现鲁棒的视觉里程计。

Comments 8 pages, 5 figures, in revision to be submitted to IEEE RA-L

详情
AI中文摘要

从图像对中估计相对位姿,本质上只需要一个由几何一致的对应点组成的最小子集。然而,大多数基于学习的方法依赖于密集匹配或直接回归,导致冗余并降低几何可解释性。在本工作中,我们提出了一种稀疏对极匹配框架,预测一组紧凑的对应点,这些对应点针对不同时间基线下的几何一致性进行了优化。为了处理残余噪声和错位,我们引入了对极扩散过程,该过程对对应点的不确定性建模,并将关键点朝对极一致的方向细化。细化后的对应点连同深度线索被提升为图表示,形成一个编码点间关系结构的Steiner图。图神经网络学习出一个由信息量大的对应点组成的紧凑子集,这些对应点被传递给可微的奇异值分解求解器,进行端到端的几何估计。相对位姿从得到的本质矩阵中恢复,并在TartanAir和KITTI SLAM数据集上的视觉里程计设置中进行评估。实验结果表明,结合稀疏匹配、基于扩散的细化和基于图的子集选择,可以减少对应点的冗余,同时在具有挑战性的基线下保持稳健的位姿估计。

英文摘要

Estimating relative pose from image pairs fundamentally requires only a minimal subset of geometrically consistent correspondences. However, most learning-based approaches rely on dense matching or direct regression, leading to redundancy and reduced geometric interpretability. In this work, we propose a sparse epipolar matching framework that predicts a compact set of correspondences optimized for geometric consistency across varying temporal baselines. To address residual noise and misalignment, we introduce an epipolar diffusion process that models correspondence uncertainty and refines keypoints toward epipolar consistency. The refined correspondences, along with depth cues, are lifted into a graph representation forming a Steiner graph that encodes relational structure between points. A graph neural network learns a compact subset of informative correspondences, which are passed to a differentiable singular value decomposition solver for end-to-end geometric estimation. Relative pose is recovered from the resulting essential matrix and evaluated in a visual odometry setting on the TartanAir and KITTI SLAM datasets. Experimental results demonstrate that combining sparse matching, diffusion-based refinement, and graph-based subset selection reduces correspondence redundancy while maintaining robust pose estimation across challenging baselines.
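
The geometric core of this pipeline, estimating an essential matrix from a small set of correspondences with an SVD, can be sketched with the classical linear (eight-point style) solver. This is a textbook formulation on synthetic, noise-free data, offered only as an assumption about what the "SVD solver" stage computes; it is not the paper's differentiable implementation, and all function names are illustrative.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def make_synthetic_pairs(rotation, translation, count=20, seed=0):
    """Project random 3D points into two views; returns normalized homogeneous points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(-1.0, 1.0, size=(count, 3)) + np.array([0.0, 0.0, 6.0])
    moved = points @ rotation.T + translation
    return points / points[:, 2:3], moved / moved[:, 2:3]


def estimate_essential(pts1, pts2):
    """Solve x2^T E x1 = 0 linearly, then project onto singular values (1, 1, 0)."""
    design = np.stack([np.kron(p2, p1) for p1, p2 in zip(pts1, pts2)])
    _, _, vt = np.linalg.svd(design)
    raw = vt[-1].reshape(3, 3)                   # null vector of the design matrix, row-major E
    u, _, vt_e = np.linalg.svd(raw)
    return u @ np.diag([1.0, 1.0, 0.0]) @ vt_e


def epipolar_residuals(essential, pts1, pts2):
    """Absolute epipolar error |x2^T E x1| for each correspondence."""
    return np.abs(np.einsum("ni,ij,nj->n", pts2, essential, pts1))


angle = 0.1
rotation = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(angle), 0.0, np.cos(angle)]])
translation = np.array([1.0, 0.0, 0.2])
pts1, pts2 = make_synthetic_pairs(rotation, translation)
estimated = estimate_essential(pts1, pts2)
```

The estimate is defined only up to scale and sign, which is why pose recovery from an essential matrix yields translation direction rather than metric translation.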

2605.19554 2026-05-20 cs.CV 版本更新

Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting

基于语义感知空间加权的自创文本到物体生成

Yue Yu, Haibo Chen, Shuo Chen, Jian Yang, Jun Li

发表机构 * Nanjing University of Science and Technology(南京理工大学)

AI总结 本文提出了一种自创扩散模型SCDiff,通过学习空间加权模块和视觉-语义混合损失模块,提升文本到图像生成的创意性和语义对齐性。

详情
AI中文摘要

在文本到图像(T2I)生成中注入创造力是一个重大挑战,因为合成图像不仅要具有视觉新颖性和惊喜,还应具有艺术价值。然而,当前T2I模型主要优化于字面文本-图像对齐,其噪声预测网络限制生成到高概率区域,导致生成结果缺乏真实创造力。为此,我们提出了一种自创扩散(SCDiff)模型,用于有意义的T2I生成,包含两个核心模块:可学习的空间加权(LSW)模块和视觉-语义混合损失(VSML)。LSW模块设计了一个参数化的Kaiser-Bessel窗,以强化中心图像特征,促进新颖和令人惊讶的生成。VSML模块引入了双重损失函数:相似性损失约束新图像与文本描述对齐,而多样性损失最大化其与原始图像的区别,从而增强语义价值和视觉新颖性。大量实验表明,我们的模型显著提高了创造力、语义对齐性和视觉一致性,提供了一个简单但强大的框架用于生成创意物体。

英文摘要

Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain the generation to high-probability regions, consequently generating outputs that lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generation, featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module designs a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains the new image to align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.
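
A separable 2D Kaiser-Bessel window of the kind the LSW module is described as using can be built from NumPy's 1D `np.kaiser`. In this sketch the shape parameter `beta` is a fixed float, whereas the abstract describes it as learnable; the function name and the way the map is applied to features are assumptions.

```python
import numpy as np


def kaiser_bessel_weight_map(height, width, beta):
    """Separable 2D Kaiser-Bessel window: large at the image center, small at the border."""
    window_y = np.kaiser(height, beta)
    window_x = np.kaiser(width, beta)
    return np.outer(window_y, window_x)


# Reweight a (channels, height, width) feature map so central features dominate.
features = np.ones((8, 9, 9))
weights = kaiser_bessel_weight_map(9, 9, beta=4.0)
weighted_features = features * weights            # broadcasts over the channel axis
```

With `beta = 0` the window is flat and the weighting is a no-op; larger values concentrate the weight more tightly around the center.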

2605.19551 2026-05-20 cs.GR cs.CV 版本更新

AnchorFlow: Editable SVG Reconstruction via Sparse Anchor Point Fields

AnchorFlow: 通过稀疏锚点场实现可编辑的SVG重建

Mengnan Jiang, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang

发表机构 * Mercedes-Benz AG(梅赛德斯-奔驰公司) Technical University of Darmstadt(达姆施塔特技术大学)

AI总结 本文提出AnchorFlow框架,通过稀疏锚点场实现路径级锚点放置,解决图像到SVG重建中精度与可编辑性的平衡问题,实验表明其在保持高质量的同时显著降低可编辑复杂度。

详情
AI中文摘要

图像到SVG重建旨在生成忠实于位图输入且易于编辑的矢量图形。现有方法在如何参数化矢量结构上面临结构性权衡,包括图像由多少路径表示以及每个路径由多少锚点定义。高保真方法通常依赖大量路径或密集参数化曲线,而过于紧凑的SVG生成可能会偏离输入几何。这个问题在局部位图证据不完美时更加明显,其中边界跟随重建可能会引入冗余锚点和碎片化结构。我们主张应在锚点放置层面解决这一权衡,因为贝塞尔曲线上的锚点定义局部路径结构,并强烈影响精度和可编辑性。我们提出AnchorFlow,一个可编辑的SVG重建框架,通过稀疏锚点场建模路径级锚点放置。给定从位图图像中提取的路径状前景组件,AnchorFlow为每个组件预测一个图像条件的稀疏锚点场,并将其解析为有序的贝塞尔路径。渲染引导的反馈随后纠正局部结构错误后再进行重新解析。恢复的路径随后被组装和优化为最终的SVG。在孤立路径和完整图像上的实验表明,AnchorFlow在精度和可编辑性之间实现了有利的权衡,显著降低了可编辑复杂度,同时保持竞争性的位图保真度。

英文摘要

Image-to-SVG reconstruction aims to produce vector graphics that are faithful to raster inputs and easy to edit. Existing methods face a structural trade-off in how vector structure is parameterized, including how many paths represent an image and how many anchor points define each path. High-fidelity methods often rely on many paths or densely parameterized curves, whereas overly compact SVG generation may deviate from the input geometry. This issue becomes more pronounced when local raster evidence is imperfect, where boundary-following reconstruction can introduce redundant anchors and fragmented structures. We argue that this trade-off should be addressed at the level of anchor placement, since anchors on Bezier curves define local path structure and strongly affect both accuracy and editability. We propose AnchorFlow, an editable SVG reconstruction framework that models path-level anchor placement with sparse anchor point fields. Given path-like foreground components extracted from a raster image, AnchorFlow predicts an image-conditioned sparse anchor field for each component and resolves it into an ordered Bezier path. Rendering-guided feedback then corrects local structural errors before re-resolution. The recovered paths are then assembled and optimized into the final SVG. Experiments on isolated paths and full images show that AnchorFlow achieves a favorable fidelity-editability trade-off, substantially reducing editable complexity while preserving competitive raster fidelity.

2605.19539 2026-05-20 cs.CV 版本更新

Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R

信还是不信:基于Trust3R的前馈3D重建证据不确定性

Zihao Zhu, Wenyuan Zhao, Nuo Chen, Chao Tian, Zhiwen Fan

发表机构 * Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA(德克萨斯农工大学电气与计算机工程系,美国德克萨斯州College Station)

AI总结 本文提出Trust3R,一种轻量级的证据不确定性框架,用于前馈3D重建,通过结合门控残差均值细化和正态-逆 Wishart 证据头,生成点云不确定性估计,提升几何重建的准确性和可靠性。

Comments Accepted at ICML 2026. 10 pages main paper, with appendix

详情
AI中文摘要

几何基础模型有望从未经标定的图像中进行无约束的密集几何预测。然而,在当前的前馈设计中,其预测的置信度分数是启发式的,缺乏概率解释,且往往无法指示预测几何在何处、在多大程度上可信。为弥补这一差距,我们提出了Trust3R,一种用于前馈3D重建的轻量级证据不确定性框架。Trust3R将门控残差均值细化与正态-逆Wishart证据头相结合,为逐点几何不确定性给出闭式的多元学生t分布。这一设计提供了具有概率依据的点图(pointmap)不确定性估计,同时仅增加适度的推理开销。我们在多样化的室内和室外基准上进行了评估,并与MASt3R内置的置信度图以及常见的不确定性感知基线进行了比较,后者涵盖单次前向的异方差回归和基于采样的方法(如MC dropout和深度集成)。实验结果表明,Trust3R一致地改进了风险-覆盖和稀疏化指标,并且总体上提高了几何精度。这些收益体现为各基准上更强的不确定性排序能力,在ScanNet++上AURC降低25%、AUSE降低41%,为下游几何流水线中的不确定性感知加权提供了实用的可靠性信号。项目页面和代码可在https://trust3r-z.github.io/上找到。

英文摘要

Geometric foundation models hold promise for unconstrained dense geometry prediction from uncalibrated images. However, in current feed-forward designs, their predicted confidence scores are heuristic, lack probabilistic interpretation, and often fail to indicate where and how much the predicted geometry can be trusted. To address this gap, we present Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction. Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head, yielding a closed-form multivariate Student-t distribution for per-point geometric uncertainty. This design provides probabilistically grounded pointmap uncertainty estimates while adding moderate inference overhead. We evaluate on diverse indoor and outdoor benchmarks and compare against MASt3R's built-in confidence map as well as common uncertainty-aware baselines spanning single-pass heteroscedastic regression and sampling-based methods such as MC dropout and deep ensembles. Experimental results show that Trust3R consistently improves risk-coverage and sparsification, and generally improves geometric accuracy. These gains are reflected in stronger uncertainty ranking across benchmarks, with 25% lower AURC and 41% lower AUSE on ScanNet++, providing a practical reliability signal for uncertainty-aware weighting in downstream geometry pipelines. The project page and code are available at https://trust3r-z.github.io/.
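
The risk-coverage metric reported above rewards uncertainty estimates that rank points correctly. A minimal sketch of one common definition follows: points are sorted from most to least confident, the mean error of the retained set is tracked as coverage grows, and AURC is the average of that curve. The exact variant used in the Trust3R evaluation may differ, and the function names are illustrative.

```python
def risk_coverage_curve(errors, uncertainties):
    """Mean error of the k most-confident points, for k = 1..N."""
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    risks, running = [], 0.0
    for count, i in enumerate(order, start=1):
        running += errors[i]
        risks.append(running / count)
    return risks


def aurc(errors, uncertainties):
    """Area under the risk-coverage curve (lower is better)."""
    risks = risk_coverage_curve(errors, uncertainties)
    return sum(risks) / len(risks)


point_errors = [0.1, 0.2, 0.4, 0.9]
well_ranked = aurc(point_errors, [0.1, 0.2, 0.3, 0.4])     # uncertainty grows with error
poorly_ranked = aurc(point_errors, [0.4, 0.3, 0.2, 0.1])   # uncertainty shrinks with error
```

Both rankings reach the same risk at full coverage, so the gap between the two AURC values comes entirely from how well the uncertainty orders the points, which is the property the abstract calls uncertainty ranking.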

2605.19538 2026-05-20 cs.CV cs.AI 版本更新

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

CaptchaMind: 通过强化学习与显式推理监督训练CAPTCHA求解器

Pengcheng Wang, Haoxiang Liu, Yang Dai, Xiangxiang Zeng, Guanhua Chen, Baotian Hu, Longyue Wang, Weihua Luo

发表机构 * Alibaba Group(阿里巴巴集团) Southern University of Science and Technology(南方科技大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本文提出CaptchaMind,一种基于强化学习的CAPTCHA求解器,通过显式推理监督训练,实现了82.9%的平均成功率,显著优于现有方法。

Comments 17 pages, 12 figures

详情
AI中文摘要

CAPTCHA被广泛部署为人类验证机制,经常阻止智能代理在真实网络环境中完成端到端自动化。求解现代CAPTCHA需要稳健的多步视觉推理和交互能力,但由于缺乏大规模训练数据和过程级标注,基于训练的方法一直缺位。我们介绍了CaptchaBench,首个为支持大规模训练而设计的CAPTCHA基准,包含16,000个程序化生成的样本,覆盖八个任务类别,并带有详细的区域级和过程级标注。在CaptchaBench上的系统评估表明,现有方法在需要细粒度视觉细节捕获和区域级比较的任务上一致失败。因此,我们提出了CaptchaMind,一种基于强化学习的求解器,通过显式推理过程监督进行训练,在八个任务上取得82.9%的平均成功率,在真实实例上取得71.0%的成功率,在不使用闭源API的情况下显著优于所有现有方法。

英文摘要

CAPTCHAs are widely deployed as human verification mechanisms and frequently block intelligent agents from completing end-to-end automation in real-world web environments. Solving modern CAPTCHAs requires robust multi-step visual reasoning and interaction capabilities, yet training-based approaches have remained absent due to the lack of large-scale training data and process-level annotations. We introduce CaptchaBench, the first CAPTCHA benchmark designed to support large-scale training, comprising 16,000 programmatically generated samples across eight task categories with detailed region and process-level annotations. Systematic evaluation on CaptchaBench reveals that existing methods fail consistently on tasks requiring fine-grained visual detail capture and region-level comparison. We therefore present CaptchaMind, an RL-based solver trained with explicit reasoning process supervision, achieving 82.9% average success rate across eight tasks and 71.0% on real-world instances, substantially outperforming all existing methods without closed-source APIs.

2605.19533 2026-05-20 cs.CV 版本更新

Replacement Learning: Training Neural Networks with Fewer Parameters

替代学习:用更少的参数训练神经网络

Yuming Zhang, Peizhe Wang, Tianyang Han, Hengyu Shi, Junhao Su, Dongzhi Guan, Jiabin Liu, Jiaji Wang

发表机构 * The University of Hong Kong(香港大学) Southeast University(东南大学)

AI总结 本文提出替代学习(RepL)方法,通过替换而非删除神经网络中的部分模块来减少全深度反向传播的冗余,从而在保持性能的同时降低参数量、内存使用和训练时间。

Comments 16 pages

详情
AI中文摘要

端到端训练结合全深度反向传播仍然是优化深度神经网络的主要范式,但随着模型变深,其效率会下降。由于每个块必须在单一全局目标下执行和微分,全深度反向传播引入了显著的参数冗余、激活-内存成本和训练延迟,尤其是在相邻层具有高度相关学习模式时。直接跳过或删除层可以降低成本,但通常会削弱表示能力或需要特定架构的重用设计。在本文中,我们提出了替代学习(RepL),一种训练时的范式,通过替换选定的块而不是简单地删除它们来减少全深度冗余。对于每个被移除的块,RepL插入一个轻量级计算层,通过可学习的转换从其相邻前序和后序块的参数合成一个替代操作符,并将该合成操作符应用于前序激活。这样,RepL在保持局部上下文连续性的同时避免了不必要的全层计算。我们为CNNs和ViTs实例化RepL,使用定制化的参数融合块来处理卷积通道、特征分辨率和Transformer子模块。在CIFAR-10、SVHN、STL-10、ImageNet、COCO和CityScapes等数据集上的广泛实验表明,RepL在减少可训练参数、GPU内存使用和训练时间的同时,在分类、检测和分割任务中与标准端到端训练相匹配或超越。此外,在WikiText-2、迁移学习、推理吞吐量、检查点、随机深度和INT8量化等额外结果中进一步展示了其通用性和兼容性。

英文摘要

End-to-end training with full-depth backpropagation remains the dominant paradigm for optimizing deep neural networks, but its efficiency deteriorates as models grow deeper. Since every block must be executed and differentiated under a single global objective, full-depth BP introduces substantial parameter redundancy, activation-memory cost, and training latency, especially when neighboring layers exhibit highly correlated learning patterns. Directly skipping or removing layers can reduce cost, but often weakens representation capacity or requires architecture-specific reuse designs. In this paper, we propose Replacement Learning (RepL), a training-time paradigm that reduces full-depth redundancy by replacing selected blocks rather than simply discarding them. For each removed block, RepL inserts a lightweight computing layer that synthesizes a surrogate operator from the parameters of its adjacent preceding and succeeding blocks through a learnable transformation, and applies the synthesized operator to the preceding activation. In this way, RepL preserves local contextual continuity while avoiding unnecessary full-layer computation. We instantiate RepL for CNNs and ViTs with tailored parameter-fusion blocks that handle convolutional channels, feature resolutions, and transformer submodules. Extensive experiments on CIFAR-10, SVHN, STL-10, ImageNet, COCO, and CityScapes show that RepL reduces trainable parameters, GPU memory usage, and training time while matching or surpassing standard end-to-end training across classification, detection, and segmentation. Additional results on WikiText-2, transfer learning, inference throughput, checkpointing, stochastic depth, and INT8 quantization further demonstrate its generality and compatibility.
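The replacement idea can be sketched in a few lines. This is a simplified illustration, not RepL's actual parameter-fusion block: it assumes plain dense layers of equal shape and takes the "learnable transformation" to be a scalar mix with two hypothetical coefficients, `alpha` and `beta`.

```python
import numpy as np

def surrogate_block(h_prev, w_prev, w_next, alpha, beta):
    """Stand-in for a removed block.

    The operator is synthesized from the neighbours' weights, so the only
    new trainable parameters are alpha and beta (an assumption made for
    this sketch).
    """
    w_sur = alpha * w_prev + beta * w_next   # synthesized surrogate operator
    return np.maximum(h_prev @ w_sur, 0.0)   # applied to the preceding activation

rng = np.random.default_rng(0)
d = 8
w_prev = rng.normal(size=(d, d))   # weights of the preceding block
w_next = rng.normal(size=(d, d))   # weights of the succeeding block
h_prev = rng.normal(size=(2, d))   # activation leaving the preceding block

out = surrogate_block(h_prev, w_prev, w_next, alpha=0.5, beta=0.5)
extra_params = 2                   # alpha and beta
full_block_params = d * d          # what a regular dense block would train
```

Under these assumptions the removed block costs two scalars instead of a full weight matrix, which is the kind of saving the abstract attributes to replacing rather than training every block.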

2605.19532 2026-05-20 cs.CV cs.LG 版本更新

Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

通过基于核心标记注意力的种子选择提升文本到图像扩散模型

Yunzhe Zhang, Hongfu Liu, Pengyu Hong

发表机构 * Brandeis University(布兰迪大学)

AI总结 本文研究了文本到图像扩散模型中种子对生成质量的影响,提出基于核心标记注意力的种子选择方法,无需训练即可提升文本与图像的一致性及视觉质量。

Comments Preprint

详情
AI中文摘要

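文本到图像扩散模型能够生成高质量的图像,但其输出对随机种子极为敏感:不同的初始种子往往导致图像质量和提示词与图像的一致性产生显著差异。我们重新审视这一“种子效应”,并表明在去噪最初几步中测得的、针对提示词核心标记(承载内容的词)的注意力动态能够强有力地预测最终生成质量。基于这一观察,我们提出了基于注意力的种子选择(ABSS),一种无需训练、即插即用的方法,它利用去噪过程中对核心标记的交叉注意力,为给定提示词的种子排序。ABSS无需微调,也不改变初始噪声;它对所有候选种子打分并排序,仅保留前k个进行完整生成,其余丢弃,且不依赖固定的接受/拒绝阈值。ABSS完全在推理阶段运行,可作为现有种子优化流程的轻量级预筛选附加组件,带来额外收益。在三个基准上的大量实验表明,ABSS能够持续提升Stable Diffusion各变体的文本与图像一致性和视觉质量,并得到人类偏好和对齐指标的佐证。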

英文摘要

Text-to-image diffusion models can synthesize high-quality images, yet the outcome is notoriously sensitive to the random seed: different initial seeds often yield large variations in image quality and prompt-image alignment. We revisit this "seed effect" and show that attention dynamics over prompt core tokens, the content-bearing words, measured during the first few denoising steps, strongly predict final generation quality. Building on this observation, we introduce Attention-Based Seed Selection (ABSS), a training-free, plug-and-play method that ranks seeds for a given prompt by leveraging cross-attention to core tokens during the denoising process. ABSS requires no finetuning and does not alter the initial noise; it scores and ranks all candidate seeds, keeps only the top-k for full generation, and discards the rest, without relying on a fixed accept/reject threshold. Operating purely at inference time, ABSS can serve as a lightweight pre-selection add-on for existing seed-optimization pipelines, enabling additional gains. Across three benchmarks, extensive experiments show that ABSS enables consistent improvements in text-image alignment and visual quality for Stable Diffusion variants, as corroborated by human preference and alignment metrics.
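The score-rank-keep loop described above can be sketched as follows. The scoring rule here (mean cross-attention mass on core tokens over the first few steps) is an assumption made for illustration; the abstract does not give the exact ABSS score, and `select_seeds` and the toy attention values are hypothetical.

```python
import numpy as np

def select_seeds(core_attention, k):
    """Rank candidate seeds and keep the top-k.

    core_attention maps seed -> array [early_steps, core_tokens] holding the
    cross-attention mass on core tokens (assumed to be collected elsewhere).
    """
    scores = {seed: float(np.mean(a)) for seed, a in core_attention.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

core_attention = {
    11: np.array([[0.2, 0.3], [0.3, 0.4]]),
    22: np.array([[0.6, 0.5], [0.7, 0.6]]),
    33: np.array([[0.1, 0.1], [0.2, 0.1]]),
}
kept, scores = select_seeds(core_attention, k=2)
```

Only the seeds in `kept` would go on to full generation. Because the selection is a relative ranking, no fixed accept/reject threshold is needed.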

2605.19528 2026-05-20 cs.CV 版本更新

Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

面向相机鲁棒的3D定位:基于方程的工具使用用于MLLMs

Xueying Jiang, Wenhao Li, Quanhao Qian, Deli Zhao, Shijian Lu, Gongjie Zhang, Ran Xu

发表机构 * Nanyang Technological University(南洋理工大学) DAMO Academy, Alibaba Group(阿里巴巴达摩院) HuPan Lab(湖畔实验室) Alibaba Group(阿里巴巴集团)

AI总结 本文提出了一种基于方程的工具使用框架,通过将空间工具作为公式变量重新利用,以解决多模态大语言模型(MLLMs)中3D定位的相机固有模糊问题,从而在3D物体检测和3D视觉定位任务中取得了显著提升。

详情
AI中文摘要

多模态大语言模型(MLLMs)中的3D定位,包括3D物体检测和3D视觉定位,本质上受限于相机内参的模糊性:相同图像在不同相机下可以对应不同的3D场景。现有的MLLMs要么忽略相机参数并过度拟合于标准训练内参,要么从外部工具检索深度和3D线索,但将返回值视为参考线索(数值提示,模型可以隐式解释)。我们提出了一种基于方程的工具使用框架,将空间工具重新作为公式变量。该框架主动检索相机内参并采样多点度量深度,将针孔反投影方程$\hat{X} = (u_c - c_x)\bar{Z}/f_x$明确写出在Chain-of-Thought(CoT)中,并在回归最终9自由度包围盒之前将工具输出代入公式。在从$0.5\times$到$1.5\times$缩放的相机内参下,我们的方法在3D物体检测和3D视觉定位任务中优于仅使用RGB和工具增强的基线方法,特别是在相机偏离训练尺度最显著时有显著提升。代码和数据将被发布。

英文摘要

3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
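A short worked example of the back-projection equation shows why the same pixel maps to different 3D positions under different cameras. The pixel, principal point, and depth values are made up, and only the focal length $f_x$ is rescaled here; how the paper rescales the full intrinsic matrix is not specified in the abstract.

```python
def back_project_x(u_c, c_x, z_bar, f_x):
    """Pinhole back-projection: X_hat = (u_c - c_x) * Z_bar / f_x."""
    return (u_c - c_x) * z_bar / f_x

u_c, c_x, z_bar = 800.0, 640.0, 4.0   # pixel column, principal point, mean depth (m)
base_f_x = 1000.0                      # hypothetical training focal length (pixels)

x_by_scale = {
    scale: back_project_x(u_c, c_x, z_bar, base_f_x * scale)
    for scale in (0.5, 1.0, 1.5)
}
```

With these numbers the lateral coordinate is 1.28 m at $0.5\times$, 0.64 m at $1.0\times$, and about 0.43 m at $1.5\times$. A model that silently assumes the training focal length would therefore be off by a factor equal to the rescaling, which is the ambiguity the explicit equation removes.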

2605.19527 2026-05-20 cs.CV 版本更新

Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification

双提示CLIP与混合视觉编码器用于遮挡行人重识别

Zhangjian Ji, Shaotong Qiao, Kai Feng, Wei Wei

发表机构 * School of Computer & Information Technology, Shanxi University, Wucheng Rd. 92, Taiyuan 030006, Shanxi, China; Key Laboratory of Computational Intelligence

AI总结 本文提出了一种双提示学习重识别模型DPL-ReID,通过双提示学习策略和现实遮挡增强方法,提升遮挡行人重识别的鲁棒性和准确性。

详情
AI中文摘要

遮挡行人重识别旨在在多个摄像头视图中匹配部分可见的行人。然而,遮挡会破坏身体区域线索,从而复杂化跨视图匹配。大多数基于预训练视觉-语言模型的行人重识别方法只关注增强基于提示的特征学习,而忽略遮挡物的语义信息。基于CLIP-ReID的成功,我们提出了一种新的双提示学习重识别(DPL-ReID)模型用于遮挡行人重识别。它结合了双提示学习(Dual-PL)策略,可以利用文本线索捕捉完整的行人语义并保持对遮挡的鲁棒性,以及现实世界遮挡增强(RWOA)方法,该方法真实模拟现实世界中遇到的遮挡场景以丰富遮挡样本。此外,我们还设计了加权门控特征融合(WGFF)方法,它结合LSNet来捕捉全局信息并作为特征门控机制。该机制可以有效引导CLIP视觉编码器生成更全面的特征表示。在多个基准遮挡重识别数据集上的广泛实验表明,所提出的DPL-ReID实现了最先进的性能。遮挡实例库可在https://github.com/stone-qiao/DPL-ReID上获取。

英文摘要

Occluded person re-identification focuses on matching partially visible pedestrians across multiple camera views. However, occlusions disrupt body-region cues, thereby complicating cross-view matching. Most person ReID methods built on pretrained vision-language models focus only on enhancing prompt-based feature learning while ignoring the semantic information of occluders. Building on the success of CLIP-ReID, we propose a novel Dual Prompt Learning ReID (DPL-ReID) model for occluded person ReID. It incorporates a Dual Prompt Learning (Dual-PL) strategy, which uses textual cues to capture complete pedestrian semantics while remaining robust to occlusion, and a Real-World Occlusion Augmentation (RWOA) method that realistically simulates occlusion scenarios encountered in the real world to enrich occluded samples. In addition, we design a Weighted Gated Feature Fusion (WGFF) method, which incorporates LSNet to capture global information and act as a feature-gating mechanism. This mechanism can effectively guide the CLIP visual encoder toward generating more comprehensive feature representations. Extensive experiments on several benchmark occluded ReID datasets show that our proposed DPL-ReID achieves state-of-the-art performance. The occlusion instance library is available at https://github.com/stone-qiao/DPL-ReID.

2605.19524 2026-05-20 cs.RO cs.CV 版本更新

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

SafeAlign-VLA: 一种增强负样本的安全对齐框架用于风险感知的自动驾驶

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

发表机构 * College of Transportation, Tongji University(同济大学交通运输学院) Department of Civil Engineering, Tsinghua University(清华大学土木工程系) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与移动系统学院) Department of Civil and Environmental Engineering, National University of Singapore(新加坡国立大学土木与环境工程系)

AI总结 本文提出SafeAlign-VLA框架,通过整合负样本数据提升自动驾驶系统对安全边界的理解,通过生成安全标签和反事实轨迹,结合两阶段训练策略和基于锚点的群体相对策略优化,提高了自动驾驶的安全性和鲁棒性。

详情
AI中文摘要

端到端的自动驾驶系统在常见场景中表现优异,但在安全关键的长尾案例中表现不佳。视觉-语言-动作(VLA)模型因其强大的推理能力而具有前景。然而,大多数基于VLA的方法依赖于正专家演示,很少利用负样本,导致对危险行为和安全边界的理解不足。为了解决这一限制,我们提出了SafeAlign-VLA,一种统一的增强负样本的安全对齐框架,将负数据整合到监督学习和强化学习中。首先,我们开发了一种反事实安全配对范式,通过反事实推理从危险场景中生成结构化的安全标签和反事实正轨迹。然后采用两阶段训练策略:负样本增强的监督微调用于故障反馈和轨迹修正,接着是基于锚点的群体相对策略优化,利用正负轨迹作为对比锚点,引导采样并惩罚高风险行为。在NAVSIM和DeepAccident上的实验验证了所提框架。SafeAlign-VLA在NAVSIM v1测试集上达到89.1 PDMS,比无负样本基线提高了1.3%。在DeepAccident上,碰撞率降低到3.36%,同时达到84.2%的语言准确率和85.8%的风险预测准确率。这些结果证明了所提增强负样本的安全对齐框架在安全和鲁棒自动驾驶中的有效性。

英文摘要

End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations and rarely exploit negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 test set, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.
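The group-relative advantage at the core of this kind of policy optimization can be sketched as below. It uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation). Placing a positive and a negative anchor trajectory inside the group is an assumption about how the anchors enter the computation, and the reward values are invented.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its own sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group: [positive anchor, sampled, sampled, negative anchor]
rewards = [0.9, 0.6, 0.5, 0.1]
advantages = group_relative_advantages(rewards)
```

In this toy group the positive anchor receives a positive advantage and the risky negative anchor a negative one, so the update pushes the policy toward the former and away from the latter.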

2605.19523 2026-05-20 cs.CL cs.AI cs.CV 版本更新

Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

探究跨模态技能注入:场景、方法与超参数

Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

发表机构 * State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University(多媒体信息处理国家重点实验室,计算机科学学院,北京大学) WeChat AI, Tencent Inc., China(腾讯公司,中国) The University of Hong Kong(香港大学)

AI总结 本文研究了跨模态技能注入在不同场景下的表现,分析了其方法和超参数的影响,发现其在指令遵循和跨语言任务中表现良好,但在数学推理中存在困难,同时指出经典方法如TA和DARE在性能上优于其他融合方法。

详情
AI中文摘要

视觉-语言模型(VLMs)在一般多模态理解方面表现出色;然而,它们在高效获取持续演化的领域特定技能方面存在困难。传统增强VLM能力的方法,如监督微调(SFT),需要大量的数据集整理和大量的计算资源。模型合并作为一种高效的替代方法,能够将领域专家的LLM专业知识转移到VLMs上,而无需额外的数据集要求或显著的计算开销。与传统合并同质LLM的方法不同,跨模态技能注入旨在通过将领域专家LLM整合到VLM中来诱导出新的跨模态能力。然而,现有研究缺乏对跨模态技能注入的适用性和方法的系统分析。在本研究中,我们从三个主要方面探讨了跨模态技能注入:场景、方法和超参数。在场景方面,我们发现跨模态技能注入在指令遵循和跨语言设置中表现良好,但在数学推理中表现不佳。在方法方面,我们发现经典方法如TA和DARE在性能上优于其他融合方法。我们还提供了这些经典方法所依赖的超参数调优的系统和定量分析。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.
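The two classic merging methods named above can be sketched on toy weight matrices. The sketch assumes the usual setup in which the VLM's language backbone and the domain-expert LLM were both fine-tuned from the same base LLM, so that `expert - base` is a meaningful task vector. The function names and the values of `lam` and `drop_rate` are illustrative, not the paper's settings.

```python
import numpy as np

def task_arithmetic_inject(vlm, base, expert, lam):
    """TA: add the expert's task vector, scaled by lam, to the VLM weights."""
    return vlm + lam * (expert - base)

def dare_inject(vlm, base, expert, lam, drop_rate, rng):
    """DARE: randomly drop task-vector entries, rescale survivors by 1/(1 - p)."""
    delta = expert - base
    keep = rng.random(delta.shape) >= drop_rate
    return vlm + lam * (keep * delta) / (1.0 - drop_rate)

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))                  # shared base LLM weights
expert = base + 0.1 * rng.normal(size=(4, 4))   # domain-expert LLM
vlm = base + 0.1 * rng.normal(size=(4, 4))      # VLM language backbone

merged_ta = task_arithmetic_inject(vlm, base, expert, lam=0.5)
merged_dare = dare_inject(vlm, base, expert, lam=0.5, drop_rate=0.9, rng=rng)
```

Both methods expose hyperparameters (`lam`, and `drop_rate` for DARE), which fits the abstract's point that these methods depend critically on hyperparameter tuning.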

2605.19522 2026-05-20 cs.CV 版本更新

iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment

iDiff:用于成对图像质量评估的可解释差异感知框架

Xinli Yue, JianHui Sun, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng

发表机构 * Tencent(腾讯)

AI总结 本文提出iDiff框架,通过双分支设计结合可解释的差异建模和结构化多模态推理,提升成对图像质量评估的鲁棒性和可解释性,并在NTIRE 2026 RAIM挑战中取得第一名。

Comments Accepted to CVPR 2026 Workshop

详情
AI中文摘要

成对图像质量评估(IQA)在专业摄影中需要一个模型不仅能够识别两个候选图像之间的优选图像,还能提供有说服力且基于图像的推理。在NTIRE 2026 RAIM挑战中,这一要求进一步通过联合评估偏好预测和推理生成被强调。为了解决这一任务,我们提出了iDiff,一个用于成对图像质量评估的可解释差异感知框架。我们的方法采用由答案模型和推理模型组成的双分支设计。答案模型通过显式地将每个样本分解为左右全局和局部视图,随后进行内容感知的专业化处理,针对人物和场景图像,并通过跨主干的集成方法进行聚合,以实现稳健的偏好预测。推理模型专注于推理生成,并逐步增强,通过专家式模板、多源质量特征以及基于答案模型预测的条件监督进行优化。通过这种方式,iDiff联合建模了判别性决策和结构化解释,提高了鲁棒性和可解释性。广泛的实验表明,所提出的框架在准确性和推理质量指标上都有效。我们的方法在NTIRE 2026 RAIM挑战中取得了第一名,展示了将显式差异建模与结构化多模态推理整合用于成对IQA的有效性。

英文摘要

Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.

2605.19511 2026-05-20 cs.CV 版本更新

Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing

水印图像可编辑吗?SafeMark用于水印保持的文本引导图像编辑

Xiaodong Wu, Qi Li, Xiangman Li, Zelin Zhang, Lingshuang Liu, Jianbing Ni

发表机构 * Queen’s University(皇后大学) University of Waterloo(滑铁卢大学)

AI总结 本文提出SafeMark框架,将阈值化的水印解码损失加入扩散编辑器的训练目标并微调编辑器,使文本引导的图像编辑在保持高质量语义编辑的同时保留嵌入的水印。

详情
AI中文摘要

本文研究了一个基础但未被充分探索的问题:水印图像能否在不损害水印完整性的情况下保持可编辑?我们提出了SafeMark,一个用于水印保持的文本引导图像编辑的框架,该框架在编辑过程中显式地整合水印完整性。具体来说,SafeMark将阈值化的水印解码损失直接添加到扩散编辑器的训练目标中,微调编辑器,使得语义上有效的编辑也能够在最终输出中保留嵌入的水印。这种设计具有清晰的信息论依据:在编辑图像上保持高比特准确性下限界了编辑通道所保持的水印与编辑输出之间的互信息,这一量根本控制着水印恢复能力。SafeMark与可微扩散编辑器兼容,且不需要架构修改。在多个数据集、文本引导编辑方法和编辑后失真设置上的广泛评估表明,SafeMark在多种编辑设置中实现了高水印比特准确性,同时保持高质量的语义编辑,而不会牺牲对常见编辑后失真的鲁棒性。这些结果表明,语义可编辑性和水印完整性本质上是兼容的,使生成编辑管道中的图像溯源变得可信。

英文摘要

This paper investigates a fundamental yet underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework for watermark-preserving text-guided image manipulation that explicitly integrates watermark integrity into the editing process. Specifically, SafeMark adds a thresholded watermark-decoding loss directly to the diffusion editor's training objective, fine-tuning the editor so that semantically valid edits also preserve the embedded watermark at the final output. This design admits a clean information-theoretic justification: maintaining high bit-accuracy on the edited image lower-bounds the mutual information that the editor channel preserves between watermark and edited output, the quantity that fundamentally controls watermark recoverability. SafeMark is compatible with differentiable diffusion-based editors, and requires no architectural modification. Extensive evaluations across multiple datasets, text-guided editing methods, and post-edit distortion settings demonstrate that SafeMark achieves high watermark bit accuracy across diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results demonstrate that semantic editability and watermark integrity are fundamentally compatible, enabling trustworthy image provenance in generative editing pipelines.
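The link between bit accuracy and mutual information can be made concrete with a standard Fano-style bound. For a uniformly random watermark bit decoded with error probability $e$, $I(W;\hat{W}) \ge 1 - H_b(e)$ bits, and by the data-processing inequality the edited image carries at least that much information about the bit. This is a textbook bound offered as an illustration; the abstract does not state the exact bound SafeMark derives.

```python
import math

def binary_entropy(p):
    """H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def mi_lower_bound_per_bit(bit_accuracy):
    """Fano-style bound: I(W; W_hat) >= 1 - H_b(1 - accuracy) for a uniform bit."""
    return 1.0 - binary_entropy(1.0 - bit_accuracy)

bound_at_95 = mi_lower_bound_per_bit(0.95)
bound_at_chance = mi_lower_bound_per_bit(0.5)
```

At 95% bit accuracy the bound guarantees roughly 0.71 bits of preserved information per watermark bit, while at chance accuracy it guarantees nothing, which is why keeping bit accuracy high through the edit is a sensible training target.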

2605.19510 2026-05-20 cs.CV 版本更新

Return of Frustratingly Easy Unsupervised Video Domain Adaptation

令人沮丧的简单无监督视频域适应重现

Pengfei Wei, Yiqun Sun, Zhiqiang Xu, Yiping Ke, Lawrence B. Hsieh

发表机构 * Magellan Technology Research Institute (MTRI)(麦哲伦技术研究院(MTRI)) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Nanyang Technological University(南洋理工大学)

AI总结 本文提出了一种名为MetaTrans的简单无监督视频域适应方法,通过巧妙的模型架构设计,分别处理跨域视频的空间和时间分歧,从而在多个跨域动作识别任务中实现了显著的性能提升。

Comments To appear in ICML 2026

详情
AI中文摘要

无监督视频域适应(UVDA)是一个实用但研究较少的问题。在本文中,我们提出了一种名为MetaTrans的令人沮丧的简单UVDA方法。具体来说,MetaTrans采用了一个包含仅两个基本损失项的简洁学习目标。尽管学习目标的简洁性,MetaTrans体现了一种先进的UVDA思想,即通过微妙的模型架构设计,分别处理跨域视频的空间和时间分歧。通过实现一个时间静态减法模块,MetaTrans有效地消除了空间和时间分歧。广泛的实证评估,特别是在各种跨域动作识别任务中,显示了显著的绝对适应性能提升和相对于最先进UVDA基线的显著优越性能提升。

英文摘要

Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method, called MetaTrans. Specifically, MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of the learning objective, MetaTrans embodies an advanced UVDA idea, that is, handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial absolute adaptation performance enhancement and significantly superior relative performance gain compared with state-of-the-art UVDA baselines.

2605.19506 2026-05-20 cs.CV 版本更新

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

EventPrune: 用于高效第一人称动态空间推理的级联事件辅助标记修剪

Pengtao Ma, Ziliang Zhou, Ciyu Ruan, Haoyang Wang, Kaiyuan Li, Zihang Gong, Wenhua Ding, Chen Gao, Jingao Xu, Xinlei Chen

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Harbin Institute of Technology(哈尔滨工业大学) Tsinghua University(清华大学) The University of Hong Kong(香港大学)

AI总结 本文提出Event Cascade Pruning (ECP),一种无需训练的框架,利用事件相机的高频运动线索作为连续事件引导的运动先验,指导标记选择,从而在第一人称动态空间推理中实现高效的标记修剪,提升推理速度和减少计算量。

详情
AI中文摘要

第一人称动态空间推理需要模型跟踪连续运动和精确的几何结构,但基于Transformer的视频大语言模型(Video-LLMs)的二次注意力成本使得密集视觉标记计算成本高昂。现有标记修剪方法主要依赖离散静态快照,无法保留推理所需的关键运动和几何线索。我们提出了Event Cascade Pruning (ECP),据我们所知,这是首个无需训练的框架,利用事件相机的高频运动线索作为连续事件引导的运动先验来指导标记选择。ECP结合了三个阶段:事件触发的因果采样用于锚定包含运动信息的关键帧,事件引导的运动显著性过滤用于抑制事件不活跃的视觉标记,以及事件-注意力排名融合用于校准空间注意力与运动显著动态。在减少80%的视觉标记的情况下,ECP在准确率上优于全标记基线(37.62% vs. 36.31%),同时实现了1.89倍的推理加速和52%的GFLOPs减少。我们进一步引入了ESR-Real,首个用于第一人称空间推理的真实世界RGB-事件基准,其中ECP在全标记基线上的准确率提高了2.68个百分点。

英文摘要

First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving 1.89x inference speedup and 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.
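The final fusion stage can be sketched as a rank-based combination of two per-token signals. Rank-sum fusion is an assumption chosen for simplicity; the abstract does not specify ECP's fusion rule. The attention and event-saliency values are invented, and `keep_ratio=0.2` mirrors the 80% token reduction reported above.

```python
import numpy as np

def fuse_and_prune(attention, event_saliency, keep_ratio=0.2):
    """Keep the tokens ranked best under attention and event saliency combined."""
    attention = np.asarray(attention, dtype=float)
    event_saliency = np.asarray(event_saliency, dtype=float)
    # Rank 0 is the most important token under each signal.
    att_rank = np.argsort(np.argsort(-attention))
    evt_rank = np.argsort(np.argsort(-event_saliency))
    fused = att_rank + evt_rank
    n_keep = max(1, int(round(keep_ratio * len(fused))))
    keep = np.argsort(fused, kind="stable")[:n_keep]
    return np.sort(keep)

attention = [0.9, 0.1, 0.8, 0.2, 0.3, 0.05, 0.4, 0.15, 0.25, 0.35]
event_saliency = [0.8, 0.2, 0.9, 0.1, 0.05, 0.3, 0.15, 0.4, 0.25, 0.35]
kept_tokens = fuse_and_prune(attention, event_saliency, keep_ratio=0.2)
```

Tokens 0 and 2 score highly under both signals and survive. A token that is salient under only one signal is ranked lower, which is the intended effect of calibrating spatial attention with motion cues.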

2605.19490 2026-05-20 cs.RO cs.CV 版本更新

Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

闭环混合数字孪生平台用于联网和自动化车辆验证

Kanglong Quan, Zhebing Xia, Linfeng Jiang, Hao Yu, Ziheng Qiao, Dapeng Dong, Dongyao Jia

发表机构 * National Natural Science Foundation of China(中国国家自然科学基金委员会) Suzhou Science and Technology Development Planning Programme(苏州科技发展计划)

AI总结 本文提出一种闭环混合数字孪生平台,通过高保真CARLA-SUMO协同模拟与物理测试现场和车辆的紧密耦合,实现联网和自动化车辆的高效验证。

详情
AI中文摘要

联网和自动化车辆(CAVs)的全面且高效的验证在实际部署前至关重要。虽然基于模拟的测试提供了可扩展性,但现有方法往往缺乏与真实车辆和现场数据的无缝集成,限制了其在捕捉动态真实世界交互方面的保真度。为弥合这一差距,本文提出了一种新的实时混合数字孪生平台。其核心创新在于高保真CARLA-SUMO协同模拟与物理测试现场和车辆通过低延迟的车辆到万物(V2X)通信链路的紧密耦合。定制开发的中间件作为关键桥梁,同步真实CAV的运动状态作为模拟中的影子车辆,并将虚拟控制命令转换为底盘执行的控制器局域网络(CAN)消息以实现闭环控制。详细的实现包括使用摄影测量法进行全尺寸资产重建以及云边协同架构以实现可扩展的多用户操作。实验结果表明同步稳定且闭环控制有效,延迟低,证实了该平台在多场景CAV验证中的实用性。

英文摘要

Comprehensive and efficient validation of connected and automated vehicles (CAVs) is critical prior to real-world deployment. While simulation-based testing offers scalability, existing approaches often lack seamless integration with real vehicles and field data, limiting their fidelity in capturing dynamic, real-world interactions. To bridge this gap, this paper proposes a novel real-time hybrid digital twin platform. Its core innovation lies in the tight coupling of a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle via a low-latency Vehicle-to-Everything (V2X) communication link. A custom-developed middleware serves as the critical bridge, synchronizing a real CAV's kinematic state as a shadow vehicle in the simulation and translating virtual control commands into chassis-actuating Controller Area Network (CAN) messages for closed-loop control. Detailed implementation includes using photogrammetry for full-scale asset reconstruction and a cloud-edge collaborative architecture for scalable, multi-user operation. Experimental results demonstrate stable synchronization and effective closed-loop control with low latency, confirming the platform's practicality for multi-scenario CAV verification.

2605.19484 2026-05-20 cs.CV cs.AI cs.GR cs.HC 版本更新

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

CutVerse: 一个用于媒体后期制作编辑的组合式GUI代理基准测试

Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao

发表机构 * MIPG, Communication University of China(MIPG,中国传媒大学) National University of Singapore(新加坡国立大学) USEIT AI(USEIT人工智能)

AI总结 本研究提出CutVerse,一个用于评估自主GUI代理在真实媒体后期制作环境中的能力的基准测试,揭示现有代理在复杂、长周期媒体后期制作工作流中的局限性。

详情
AI中文摘要

尽管GUI代理在网页导航和基础操作系统任务中取得了显著进展,但其在专业创意工作流中的能力仍鲜有研究。为弥合这一差距,我们引入CutVerse,一个旨在系统评估自主GUI代理在真实媒体后期制作环境中的基准测试。我们收集了7个专业应用(如Premiere Pro、Photoshop)的专家演示,涵盖186个复杂、长周期任务,这些任务基于真实的编辑工作流,涉及密集的多模态界面和紧密耦合的交互序列。为支持可扩展评估,我们开发了一个轻量级解析器,将原始屏幕记录和低级交互日志转换为结构化、组合式的GUI动作轨迹,具有精确的定位。广泛评估显示,现有代理在现实媒体编辑任务中的任务成功率仅为36.0%,凸显了复杂、长周期媒体后期制作工作流在本基准测试中的挑战。尽管当前模型在空间定位、多模态对齐和协调动作执行方面表现出色,但在长周期可靠性和领域特定规划方面仍存在限制。

英文摘要

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

2605.19478 2026-05-20 cs.CR cs.CV 版本更新

Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

揭示功能融合:动态提示架构中一种新的战略后门类别

Zeyao Liu, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Yuexin Xuan, Xiaoshuang Ji

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) State Key Laboratory of Cyberspace Security Defense(网络空间安全防御国家重点实验室) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) PetroChina (Beijing) Digital Intelligent Research Institute Co., Ltd.(中石油北京数字智能研究院有限公司)

AI总结 本文提出VIPER攻击框架,揭示动态提示架构中通过功能融合产生的新风险,该框架在轻量级动态视觉提示生成器上实现,展示了恶意逻辑与良性任务功能的紧密融合,从而在剪枝时破坏良性性能,同时保持高ASR和低延迟。

详情
AI中文摘要

现有的基于骨干网络重写式全量微调的ViT后门攻击计算开销大且会导致性能下降。这迫使攻击者转向以适配器为基础(例如LoRA)和提示为基础(例如VPT)的视觉参数高效微调(PEFT)范式。尽管适配器安全已有一些初步研究,但快速增长的提示基础生态系统中的风险仍严重未被探索。我们填补了这个关键缺口,揭示了VPT向动态和上下文感知架构演进如何促成一种更加危险和新兴的威胁。这种漏洞即使在这些动态模块解锁了优越良性性能的情况下也会出现。我们提出了VIPER,一个基于轻量级动态视觉提示生成器(VPG)的攻击框架,展示了这种漏洞。关键的是,这种动态架构使功能融合成为可能:恶意逻辑和良性任务功能紧密融合到同一个稀疏、高幅度参数核心中。这种融合创造了一个严峻的“人质”困境,因为剪除攻击必然破坏良性性能。全面评估显示VIPER有效解决了攻击者的三重困境:VIPER不仅在干净数据上实现了最先进的性能,而且在90% VPG模块剪枝(LoRA攻击崩溃)的情况下仍保持近100%的ASR,同时仅增加难以察觉的0.06ms(1.16%)推理延迟。VIPER的结果,由功能融合驱动,揭示了动态提示架构中一种新的、范式级别的风险。

英文摘要

Existing ViT backdoor attacks based on backbone-overwriting full-tuning are computationally expensive and inflict performance degradation. This has forced adversaries towards the Visual Parameter-Efficient Fine-Tuning (PEFT) paradigm, dominated by adapter-based (e.g., LoRA) and prompt-based (e.g., VPT) approaches. While adapter security has seen initial study, the risks of the burgeoning prompt-based ecosystem remain critically unexplored. We fill this critical gap, exposing how the evolution of VPT towards dynamic and context-aware architectures can facilitate a far more dangerous and emergent threat. This vulnerability arises even though these dynamic modules unlock superior benign performance. We propose VIPER, an attack framework built on a lightweight, dynamic Visual Prompt Generator (VPG) that demonstrates this vulnerability. Critically, this dynamic architecture enables Functional Fusion: an emergent phenomenon where malicious logic and benign task utility are tightly fused into the same sparse, high-magnitude parameter core. This fusion creates a formidable "hostage" dilemma, as pruning the attack necessarily destroys the benign performance. Comprehensive evaluations show VIPER effectively addresses the attacker's trilemma: VIPER not only achieves state-of-the-art performance on clean data, but also maintains near-100% ASR even under 90% VPG-module pruning (where LoRA attacks collapse), while adding only an imperceptible 0.06ms (1.16%) of inference latency. VIPER's results, driven by Functional Fusion, expose a new, paradigm-level risk in dynamic prompt architectures.
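The pruning claim is easier to picture with a small sketch of magnitude pruning, a common pruning defense. The abstract does not say which pruning criterion was evaluated, so magnitude pruning is an assumption here, and the weights are synthetic: 90 small values plus a ten-entry high-magnitude core.

```python
import numpy as np

def magnitude_prune(weights, prune_ratio):
    """Zero out the prune_ratio fraction of weights with the smallest magnitude."""
    w = np.asarray(weights, dtype=float)
    k = int(round(prune_ratio * w.size))
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w).ravel())[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

small = np.linspace(0.001, 0.09, 90)   # low-magnitude bulk of the module
core = np.array([5.0, -4.0, 3.0, -2.5, 2.0, 1.8, -1.6, 1.4, 1.2, -1.0])
weights = np.concatenate([small, core])

pruned = magnitude_prune(weights, prune_ratio=0.9)
```

After removing 90% of the weights, the high-magnitude core is untouched. If benign utility and malicious logic both live in that core, as Functional Fusion describes, a defender cannot prune one without the other.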

2605.19446 2026-05-20 cs.CV cs.AI 版本更新

Targeted Downstream-Agnostic Attack

定向下游无关攻击

Zhuxin Lei, Ziyuan Yang, Yi Zhang

发表机构 * College of Computer Science, Sichuan University(四川大学计算机学院)

AI总结 本文提出了一种定向下游无关攻击(TDAA)方法,通过在更严格的威胁模型下,要求攻击同时具有针对性和下游无关性,解决了传统下游无关攻击(DAAs)在目标未知和编码器不直接生成预测时的挑战。通过引入威胁图像作为特征级锚点,构建了任务无关的桥梁,揭示了受害者编码器的脆弱性。

详情
AI中文摘要

近年来,由于其在表示提取方面的强大能力,预训练编码器得到了广泛应用。然而,它们容易受到下游无关攻击(DAAs)的攻击。现有的DAA方法基于一种宽松的威胁模型,只要生成的下游无关对抗样本(DAEs)改变原始预测,攻击就算成功,而无需特定目标。在本文中,我们提出了一种在更严格的威胁模型下进行的定向DAA(TDAA)方法,要求攻击必须同时具有针对性和下游无关性。由于下游任务未知且编码器不直接生成预测,实现针对性攻击尤其具有挑战性。为此,我们引入了一个名为“威胁图像”的新组件,由攻击者预先选择作为目标。具体来说,设计了一个生成器,生成针对每个样本的对抗扰动,迫使受害者编码器为DAEs和威胁图像输出相同的特征。与以往的DAA方法生成所有样本共享的单一扰动不同,我们的方法采用样本特定的范式。这生成了针对每个图像的定制扰动,以确保高攻击成功率和隐蔽性。通过利用威胁图像作为特征级锚点,我们的方法构建了一个任务无关的桥梁,揭示了受害者编码器的脆弱性。在10种自监督方法上对3个基准数据集的广泛实验展示了我们方法的有效性,并揭示了预训练编码器的显著脆弱性。代码将在审查期结束后公开。

英文摘要

Recently, pre-trained encoders have gained widespread use due to their strong capability in representation extraction. However, they are vulnerable to downstream-agnostic attacks (DAAs). Existing DAA methods operate under a permissive threat model, where an attack is successful if the generated downstream-agnostic adversarial examples (DAEs) change the original prediction, without requiring a specific target. In this paper, we propose a Targeted DAA (TDAA) method under a stricter threat model requiring the attack to be both targeted and downstream-agnostic. Since the downstream task is unknown and encoders do not directly produce predictions, achieving a targeted attack is particularly challenging. To address this, we introduce a novel component termed the 'threat image', pre-selected by the attacker as the target. Specifically, a generator is designed to produce example-specific adversarial perturbations that compel the victim encoder to output identical features for both the DAEs and the threat image. Unlike previous DAA methods that generate a single shared perturbation for all samples, which often fails due to image diversity, our method adopts an example-specific paradigm. This generates tailored perturbations for each image to ensure a high attack success rate and invisibility. By leveraging the threat image as a feature-level anchor, our method builds a task-agnostic bridge to reveal the vulnerabilities of the victim encoder. Extensive experiments on 10 self-supervised methods across 3 benchmark datasets demonstrate the effectiveness of our approach and reveal the pronounced vulnerability of pre-trained encoders. The code will be made publicly available after the review period.
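The feature-level objective can be sketched on a toy encoder. Note the substitution: the paper trains a generator to emit example-specific perturbations, whereas this sketch directly optimizes one perturbation by projected gradient descent, using a random linear map as a stand-in for the frozen encoder. All names, sizes, and the budget `eps` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # toy stand-in for a frozen pre-trained encoder

def encode(x):
    return W @ x

def feature_gap(a, b):
    """Squared distance between the encoder features of two inputs."""
    return float(np.sum((encode(a) - encode(b)) ** 2))

def targeted_perturbation(x, threat, eps=0.1, steps=50):
    """Pull encode(x + delta) toward encode(threat) under an L-infinity budget."""
    delta = np.zeros_like(x)
    step = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    target = encode(threat)
    for _ in range(steps):
        grad = 2.0 * W.T @ (encode(x + delta) - target)
        delta = np.clip(delta - step * grad, -eps, eps)  # project back into the budget
    return delta

x = rng.normal(size=8)        # clean sample
threat = rng.normal(size=8)   # attacker-chosen threat image
delta = targeted_perturbation(x, threat)
gap_before = feature_gap(x, threat)
gap_after = feature_gap(x + delta, threat)
```

No downstream classifier appears anywhere in the objective: the threat image's features act as the anchor, which is what makes the attack both targeted and downstream-agnostic.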

2605.19436 2026-05-20 cs.LG cs.CL cs.CV 版本更新

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

CEPO: 使用对比证据策略优化进行RLVR自蒸馏

Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan

发表机构 * MBZUAI Linköping University(林雪平大学) Australian National University(澳大利亚国立大学)

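AI Summary This paper proposes CEPO, which uses contrastive evidence policy optimization to address the shortcomings of self-distillation in RLVR, improving model performance by distinguishing decisive reasoning steps from filler tokens.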

Comments 9 pages

详情
AI中文摘要

当模型在强化学习中产生正确解时,每个token都会收到相同的奖励信号,无论其是关键推理步骤还是语法填充。一种自然的解决方法是将模型条件化为正确的答案作为教师,识别出模型在知道答案时会生成不同的token。先前的工作表明,这种方法要么通过泄露答案到梯度而破坏训练,要么产生弱信号,无法区分关键步骤和填充内容,因为两者在模型基线下看起来同样令人惊讶。我们提出对比证据策略优化(CEPO),在每个token上提出更尖锐的问题:不仅“正确答案是否偏好此token?”而且“正确答案是否偏好它,而错误答案是否厌恶它?”满足两者的是真正的推理步骤;不满足的是填充内容。错误答案的教师是从训练批次中已有的拒绝rollouts构造的,不增加额外的采样成本。我们证明CEPO继承了先前最先进状态下的所有结构安全保证,同时在关键token上严格提高信用,改进在填充位置恰好消失。实验表明,CEPO在五个多模态数学推理基准上分别达到43.43%和60.56%的平均准确率(在2B和4B规模下),而GRPO在相同训练预算下为41.17%和57.43%。分布匹配自蒸馏方法(OPSD、SDPO)在未训练基线下表现低于,实验证实了我们的理论预测的信息泄漏。我们的代码可在https://github.com/ahmedheakl/CEPO上获得。

英文摘要

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.
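The two-sided question asked at every token (does the correct answer favor it, and does the wrong answer disfavor it?) can be illustrated with a small sketch. The scoring rule below is an assumption made for illustration, not the estimator from the paper: it takes per-token log-probabilities under the base policy, a correct-answer-conditioned teacher, and a wrong-answer-conditioned teacher, and gives credit only to tokens that pass both checks. The function and argument names are hypothetical.

```python
def contrastive_evidence(logp_base, logp_correct, logp_wrong):
    """Per-token credit that is non-zero only for tokens passing both checks."""
    scores = []
    for b, c, w in zip(logp_base, logp_correct, logp_wrong):
        favored = c - b      # > 0: knowing the correct answer raises this token
        disfavored = b - w   # > 0: knowing a wrong answer lowers this token
        if favored > 0 and disfavored > 0:
            # Sum of both margins, i.e. the log-likelihood ratio c - w.
            scores.append(favored + disfavored)
        else:
            # Filler tokens (teachers agree with the base policy) get no credit.
            scores.append(0.0)
    return scores
```

With this rule a token on which both teachers agree with the base policy receives zero extra credit, which mirrors the abstract's statement that the improvement vanishes at filler positions.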

2605.19435 2026-05-20 cs.CV cs.AI 版本更新

KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision

KappaPlace: 通过原型锚定监督学习超球面不确定性用于视觉位置识别

Maya Yanko, Yoli Shavit

发表机构 * Faculty of Engineering Bar-Ilan University(工程学院巴伊兰大学)

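AI Summary This paper proposes KappaPlace, a framework for learning uncertainty-aware visual place recognition (VPR) representations. Its prototype-anchored supervision strategy uses latent class representatives as probabilistic targets to counter poorly calibrated uncertainty estimates in VPR, improving the reliability of navigation systems.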

详情
AI中文摘要

视觉位置识别(VPR)对于自主导航至关重要,但最先进的方法缺乏良好的校准不确定性估计。标准流程无法可靠地指示查询是否模糊或匹配可能不正确,这在安全关键的机器人学中带来风险。我们提出KappaPlace,一种学习不确定性感知VPR表示的原理性框架。我们的核心贡献是一种原型锚定监督策略,利用潜在类别代表作为概率目标。通过将图像描述符建模为von Mises-Fisher(vMF)变量,我们学习了一个轻量级模块来预测浓度参数作为对aleatoric不确定性的直接代理。虽然现有的VPR不确定性方法通常局限于查询中心的视角,我们推导出一种新的匹配层面的公式来量化特定查询-参考对的可靠性。在五个多样化的基准测试中,KappaPlace将预期校准误差(ECE@K)比现有方法减少了高达50%,同时保持或提高了检索召回率。我们提供了联合训练变体和冻结骨干的后训练扩展。我们的结果表明,KappaPlace提供了稳健、稳定且校准良好的信号,能够在VPR流程中实现可靠的决策。我们的代码可在:https://github.com/mayayank95/UncertaintyAwareVPR

英文摘要

Visual Place Recognition (VPR) is critical for autonomous navigation, yet state-of-the-art methods lack well-calibrated uncertainty estimation. Standard pipelines cannot reliably signal when a query is ambiguous or a match is likely incorrect, posing risks in safety-critical robotics. We propose KappaPlace, a principled framework for learning uncertainty-aware VPR representations. Our core contribution is a Prototype-Anchored supervision strategy that leverages latent class representatives as targets for a probabilistic objective. By modeling image descriptors as von Mises-Fisher (vMF) variables, we learn a lightweight module to predict the concentration parameter as a direct proxy for aleatoric uncertainty. While existing VPR uncertainty methods are typically restricted to a query-centric view, we derive a novel match-level formulation to quantify the reliability of specific query-reference pairs. Across five diverse benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% compared to existing methods while maintaining or improving retrieval recall. We provide both a joint-training variant and a post-training extension for frozen backbones. Our results demonstrate that KappaPlace provides a robust, stable, and well-calibrated signal that enables reliable decision-making within the VPR pipeline. Our code is available at: https://github.com/mayayank95/UncertaintyAwareVPR
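For reference, the standard von Mises-Fisher density on the unit sphere in $d$ dimensions, and the negative log-likelihood it induces, are given below. Here $\mathbf{x}$ and $\boldsymbol{\mu}$ are unit vectors, $\kappa \ge 0$ is the concentration, and $I_{\nu}$ is the modified Bessel function of the first kind. Reading $\boldsymbol{\mu}$ as the class prototype and $\kappa$ as the output of the lightweight module is an assumption based on the abstract, not a statement of the paper's exact objective.

```latex
p(\mathbf{x} \mid \boldsymbol{\mu}, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \boldsymbol{\mu}^{\top} \mathbf{x}\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)},
\qquad
-\log p(\mathbf{x} \mid \boldsymbol{\mu}, \kappa) = -\log C_d(\kappa) - \kappa\, \boldsymbol{\mu}^{\top} \mathbf{x}
```

A larger $\kappa$ concentrates the distribution around $\boldsymbol{\mu}$, so a small predicted $\kappa$ signals an ambiguous descriptor.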

2605.19410 2026-05-20 cs.CV 版本更新

Vision Harnessing Agent for Open Ad-hoc Segmentation

用于开放即兴分割的视觉引导代理

Zilin Wang, Stella X. Yu

发表机构 * University of Michigan(密歇根大学)

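AI Summary This paper proposes VASA, a vision-guided ad-hoc segmentation agent that combines a vision-language model, a segmentation foundation model, and a visually grounded workflow to perform ad-hoc segmentation without training. It performs strongly on benchmarks including PARS and RefCOCOm.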

Comments 23 pages, 11 figures

详情
AI中文摘要

分割任务在概念已知时变得容易,仅需从文本中检索已学习的视觉基础。然而,对于开放即兴概念,这种基础可能并不以单个已学习的掩码形式存在,必须通过图像证据中的部分、关系、排除和集合来构建。我们提出了视觉引导的即兴分割代理(VASA),这是首个用于开放即兴分割的视觉引导代理。VASA无需训练,结合了VLM代理、分割基础模型和视觉引导工作流。不同于仅修改文本提示,VASA使用持久的工作掩码来推理、构建和验证解决方案。它计划视觉操作,调用分割工具,检查结果,编辑掩码并从错误中恢复。我们构建了PARS,一个将PartImageNet中的部分级标签转换为开放即兴概念的新基准,通过长文本定义查询实现。在PARS上,VASA优于开放词汇、推理和代理基线,超越SAM3代理14-25%。在RefCOCOm,一个标准的多粒度指代分割基准上,VASA比SAM3代理提高5-9%,比其他代理基线提高高达20%。这些结果验证了代理视觉构建在开放即兴分割中的有效性。我们的工作指出了AI代理超越将基础模型作为工具的路径:通过任务知识、VLM行为、视觉规程、工作记忆和故障意识工作流来编程它们。

英文摘要

Segmentation has become easy when the concept is known, requiring only retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc concepts through long-form definition queries. On PARS, VASA outperforms open-vocabulary, reasoning-based, and agentic baselines, surpassing SAM3 Agent by 14-25%. On RefCOCOm, a standard multi-granularity referring segmentation benchmark, VASA improves over SAM3 Agent by 5-9% and over other agentic baselines by up to 20%. These results validate agentic visual construction for open ad-hoc segmentation. Our work points to a path for AI agents beyond wrapping foundation models as tools: programming them with task knowledge, VLM behavior, visual routines, working memory, and failure-aware workflows.

2605.19393 2026-05-20 cs.CV cs.LG 版本更新

Neuron Incidence Redistribution for Fairness in Medical Image Classification

神经元发生再分配用于医疗图像分类中的公平性

Abin Shoby, Lyle John Palmer, Nikhil Cherian Kurian

发表机构 * Neuron Incidence Redistribution for Fairness in Medical Image Classification(神经发生再分配用于医学图像分类)

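AI Summary This paper proposes Neuron Incidence Redistribution (NIR), a lightweight regularization method that improves fairness in medical image classification by reducing the variance of predicted-probability-weighted mean activations. Experiments show markedly lower TPR and FPR disparities across age and gender groups.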

Comments 4 Pages, 1 Figure

详情
AI中文摘要

深度学习模型在医疗图像分类中容易出现因年龄、性别和种族等人口属性导致的子群体性能差异。我们识别出这些差异背后的潜在表征机制:在迁移学习模型中,正预测下的主导倒数第二层激活通道同时被疾病阳性样本和特权人口群体(男性、年长患者)激活,导致过度诊断;相反,负预测下的主导通道由不利群体(女性、年轻患者)激活,导致系统性误诊。为了解决这一问题,我们提出了Neuron Incidence Redistribution (NIR),一种轻量级正则化方法,该方法惩罚倒数第二层神经元预测概率加权平均激活值的方差,无需在训练时使用人口属性标签。在HAM10000数据集上,NIR使年龄组的TPR不平等从10.81%降至0.93%,性别组的TPR不平等从12.04%降至0.74%,同时AUC略有提高0.51个点。在Harvard OCT-RNFL数据集上,NIR减少了种族(从15.68%降至10.66%)和年龄(从12.69%降至1.80%)的FPR不平等,证明了在全倒数第二层分布潜在疾病证据是一种提升医疗AI人口公平性的原则性且有效的方法。

英文摘要

Deep learning models for medical image classification are susceptible to subgroup performance disparities across demographic attributes such as age, gender, and race. We identify a latent representational mechanism underlying these disparities: in transfer-learned models, the dominant penultimate-layer activation channel under positive predictions is co-activated by both disease-positive samples and privileged demographic groups (male, older patients), producing over-diagnosis; conversely, the dominant channel under negative predictions is co-activated by disadvantaged groups (female, younger patients), producing systematic under-diagnosis. To address this, we propose Neuron Incidence Redistribution (NIR), a lightweight regularization method that penalizes the variance of predicted-probability-weighted mean activations across penultimate-layer neurons, requiring no demographic labels at training time. On HAM10000, TPR disparity drops from 10.81% to 0.93% across age groups and from 12.04% to 0.74% across gender, with a marginal AUC improvement of 0.51 points. On Harvard OCT-RNFL, NIR reduces FPR disparity for race (from 15.68% to 10.66%) and age (from 12.69% to 1.80%), demonstrating that distributing latent disease evidence across the full penultimate layer is a principled and effective strategy for improving demographic fairness in medical AI.
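The regularizer described in the abstract (penalize the variance, across penultimate-layer neurons, of predicted-probability-weighted mean activations) can be sketched as follows. This is a minimal reading of that one sentence: the normalisation by the sum of probabilities, the use of population variance, and the function name `nir_penalty` are assumptions, and the weight with which the penalty is added to the classification loss is not given in the abstract.

```python
def nir_penalty(activations, probs):
    """Variance across neurons of probability-weighted mean activations.

    activations: one list of penultimate-layer activations per sample.
    probs: predicted positive-class probability per sample (weights).
    """
    total_p = sum(probs)
    n_neurons = len(activations[0])
    # Probability-weighted mean activation of each neuron over the batch.
    weighted_means = [
        sum(p * sample[j] for p, sample in zip(probs, activations)) / total_p
        for j in range(n_neurons)
    ]
    # Population variance of those means across neurons.
    center = sum(weighted_means) / n_neurons
    return sum((m - center) ** 2 for m in weighted_means) / n_neurons
```

The penalty is zero when every neuron carries the same weighted evidence and grows when a single dominant channel concentrates it. No demographic label appears anywhere in the computation, consistent with the abstract.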

2605.19390 2026-05-20 cs.CV 版本更新

LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

LMM-Track4D: 通过轨迹引导的对话激发LMM中的4D动态推理

Chaoyue Li, Yongxue Xu, Jie Feng, Jiayu Ding

发表机构 * Huazhong University of Science and Technology(华中科技大学) Sun Yat-sen University(中山大学) Beihang University(北航) Peking University(北京大学)

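AI Summary This paper formulates trajectory-grounded multi-turn spatiotemporal dialogue as a new task and proposes LMM-Track4D, which combines RTGE, a TRK state token, and an OSK-RA decoder to improve 4D dynamic reasoning in LMMs. Experiments suggest that explicit dynamic state modeling is a useful design principle.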

详情
AI中文摘要

近期大型多模态模型(LMMs)在图像和视频理解方面的能力不断增强,但仍难以持续进行4D连续时空动态推理。为研究这一能力差距,我们提出了轨迹引导的多轮时空对话任务,该任务要求模型在回答时空查询的同时,返回整个短片段或指定较长片段中的结构化3D目标轨迹,并引入Track4D-Bench基准,包含526个片段级对话样本,涵盖23.5k帧和7.5k对象注释,用于训练和评估。基于此任务,我们提出了LMM-Track4D,结合RTGE(射线-时间几何编码)、专门用于长时间跨度动态传播的流式状态令牌TRK,以及在遮挡和视角变化下稳定进行4步3D状态估计的Object-Slot Kinematic, Residual-Anchor(OSK-RA)解码器。在Track4D-Bench上的实验表明,与强基线相比,LMM-Track4D有持续的性能提升,表明显式动态状态建模是激发LMM中4D动态推理的有效设计原则。我们的代码和数据集将在https://github.com/mikubaka88/LMM-Track4D上公开。

英文摘要

Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray-Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at https://github.com/mikubaka88/LMM-Track4D.

2605.19386 2026-05-20 cs.CV 版本更新

MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos

MatPhys: 从视频中学习材料感知的物理参数以模拟可变形物体

Yang Yang, Yiyan Wang, Zheming Liu, Naoya Iwamoto

发表机构 * The University of Osaka(大阪大学) The University of Tokyo(东京大学) Huawei Technologies Japan K.K(华为技术日本株式会社)

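AI Summary This paper proposes MatPhys, which predicts spring-mass parameters from a single-view video and addresses the limitations of existing methods in material assumptions and cross-scene consistency, improving the parameter consistency and generalization of deformable object simulation.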

Comments Submitted to SIGGRAPH Asia 2026

详情
AI中文摘要

从视频中重建可变形物体的模拟准备版本对于视觉、图形学和机器人学至关重要。现有的物理驱动方法可以从视频中恢复物理数字双胞胎,但它们有两个根本性的局限性:它们通常假设物体整体具有均匀的材料属性,且其场景特定的逆向优化与单目观测的固有模糊性相结合,导致相同材料在不同场景或交互中参数不一致。我们提出了MatPhys,一种材料感知的前馈框架,通过单视角视频预测弹簧-质量参数,通过两个耦合的设计解决这两个问题。为了放松均匀材料假设,我们使用DINO特征将物体分解为具有语义意义的部分,并查询部分级材料先验,为每个部分分配其自身的物理行为。为了强制跨场景一致性,我们引入了一个学习的材料代码本,其中包含共享的材料嵌入,作为外观和物理之间的桥梁,并进一步使用部分级先验作为参考分布,约束解码器,使得相同材料在不同场景和交互中产生一致的参数。这些设计将一个欠约束的单目问题转化为基于共享、可重用材料概念的前馈推断。实验表明,我们的方法在重建和未来预测方面与每场景优化基线相匹配,同时在未见过的交互和物体上实现了更强的泛化能力,具有更一致的物理参数。

英文摘要

Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.

2605.19378 2026-05-20 cs.CV 版本更新

Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

视觉扩散变换器中稀疏专家混合路由的稀疏性:从路由崩溃到选择性死锁的诊断、边界校准和进化路线图

Haiying Sha

发表机构 * Haiying Sha

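AI Summary This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. From routing-decision time series covering more than 65 million tokens, it proposes the Functional Redundancy Hypothesis and outlines a three-step evolutionary roadmap from visual unification to a world model.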

详情
AI中文摘要

本文系统诊断了Token-Choice稀疏混合专家(MoE)在视频扩散变换器中的训练失败模式。从约50亿参数的预训练密集模型开始,我们遵循三条定律将其转换为MoE架构:路由专家精确克隆原始FFN权重,共享专家初始化为零以验证,然后初始化为极小的非零噪声以实际训练,而只有门控网络从随机初始化开始。实验揭示了五层失败模式的层次结构:(1)线性路由器经历全局软饱和,导致所有专家同质化;(2)MLP路由器引入选择性死锁,其中大约三分之一的层退化为单专家模式,无法通过增加辅助损失防止;(3)交叉注意力路由器表现出初步的自我恢复,但约九层仍顽固死锁;(4)死锁层显示U型分布,集中在浅层视觉处理层和深层语义整合层;(5)bfloat16混合精度导致微小权重更新被硬件截断为零。基于超过6500万个标记的路由决策时间序列,我们提出了功能冗余假说:死锁是共享专家在门控-共享专家-路由专家三元系统中成熟之前的理性等待策略。该假说由系统生物学中的功能冗余理论支持。在工程方面,我们总结了密集到MoE转换的三条定律,并提供了完整的bfloat16精度陷阱解决方案。我们校准了Token-Choice范式的当前能力边界,并概述了从视觉统一到世界模型的三步进化路线图。

英文摘要

This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model.
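Failure mode (5) is easy to reproduce numerically: bfloat16 keeps only 8 bits of significand precision, so near a weight of 1.0 any update much smaller than 2^-7 (about 0.0078) is lost when the result is stored. The sketch below emulates bfloat16 rounding with the standard library; `to_bfloat16` is a hypothetical helper that ignores NaN and overflow handling. Keeping a higher-precision master copy of the weights, shown at the end, is a common mixed-precision practice and is not necessarily the solution the paper provides.

```python
import struct

def to_bfloat16(x):
    """Round a float to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the discarded range, plus the parity bit for ties-to-even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000  # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# A weight stored in bfloat16 never moves under tiny updates...
w_bf16 = 1.0
# ...while a higher-precision master copy accumulates them.
w_master = 1.0
for _ in range(100):
    w_bf16 = to_bfloat16(w_bf16 + 1e-3)
    w_master = w_master + 1e-3
```

After the loop `w_bf16` is still exactly 1.0, whereas `w_master` has drifted to about 1.1 and is large enough to survive a final cast to bfloat16.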

2605.19374 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

基于概念的噪声负样本抑制用于零样本分类和胸片发现的 grounding

Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin

发表机构 * The Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China(香港理工大学智能健康中心,护理学院,中国香港) Research Institute for Smart Ageing, the Hong Kong Polytechnic University, Hong Kong, China(香港理工大学智能老龄化研究院,中国香港) School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University, Beijing, China(清华大学生物医学工程学院,清华大学,北京,中国) Queen Mary Hospital, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China(香港大学李嘉诚医学院玛丽医院,中国香港)

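AI Summary This paper proposes CoNNS, a concept-guided noisy-negative suppression framework that builds a hierarchical concept ontology to handle the noisy negatives caused by similar findings across different patients, improving performance on zero-shot understanding tasks.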

Comments Early accepted by MICCAI 2026

详情
AI中文摘要

利用胸片和放射学报告进行视觉-语言对齐已成为零样本分类和胸片发现 grounding 的先进范式。然而,标准对比学习通常将不同患者的影像和报告简单视为负样本对。这种假设引入了噪声负样本,因为不同患者经常表现出相似的发现。此类噪声负样本导致语义模糊并降低零样本理解任务的性能。为了解决这一挑战,我们提出CoNNS,一种基于概念的噪声负样本抑制框架。为了支持负样本抑制机制,不同于先前方法使用原始报告或模板化文本,我们利用大型语言模型构建层次化概念本体。本体通过显式建模存在性、属性(位置和特征)和文本(证据片段和存在陈述)来结构化41个关键临床概念。利用该本体,我们实现了包含三个步骤的跨患者对再标记策略:(1)细粒度分解,根据发现存在性对配对进行分类;(2)噪声负样本过滤,通过移除假负样本解决语义冲突;(3)困难负样本挖掘,利用轻量级语言模型识别细微属性差异。最后,我们提出了一种概念感知的NCE损失,以对齐视觉特征与文本并抑制识别出的噪声负样本。在多粒度零样本grounding任务和五个零样本分类数据集上的广泛实验验证了CoNNS优于现有最先进模型。代码可在https://github.com/DopamineLcy/conns获取。

英文摘要

Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.
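The idea of suppressing noisy negatives inside a contrastive loss can be sketched as an InfoNCE term whose denominator skips cross-patient pairs flagged as false negatives. This is a generic illustration rather than the paper's Concept-Aware NCE loss: the image-to-text direction only, the boolean `noisy_negative` mask, the temperature value, and the function name are all assumptions.

```python
import math

def masked_info_nce(sim, noisy_negative, temperature=0.07):
    """Image-to-text InfoNCE that drops flagged negatives from the denominator.

    sim[i][j]: similarity between image i and report j (matched pairs on the diagonal).
    noisy_negative[i][j]: True if report j describes findings similar to image i.
    """
    n = len(sim)
    losses = []
    for i in range(n):
        positive = math.exp(sim[i][i] / temperature)
        negatives = sum(
            math.exp(sim[i][j] / temperature)
            for j in range(n)
            if j != i and not noisy_negative[i][j]
        )
        losses.append(-math.log(positive / (positive + negatives)))
    return sum(losses) / n
```

When two patients share the same finding, masking their cross pair stops the loss from pushing semantically matching image and text features apart.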

2605.19371 2026-05-20 cs.CV cs.AI 版本更新

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

多尺度生成建模与热耗散流匹配

Jun Ma, Hanquan Zhang, Yanjun Qin, Haoyuan Guan, Ke Zhang

发表机构 * Department of Systems Science, Faculty of Arts and Sciences, Beijing Normal University(北京师范大学系统科学系,文理学院) School of Computer Science and Technology, Xinjiang University(新疆大学计算机科学与技术学院) International Academic Center of Complex Systems, Beijing Normal University(北京师范大学复杂系统学术中心) School of Systems Science, Beijing Normal University(北京师范大学系统科学学院)

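AI Summary This paper proposes Heat Dissipation Flow Matching (HDFM), which introduces a continuous blur (heat-dissipation) process into Flow Matching to inject multi-scale priors. It brings blur-based models, previously confined to SDE frameworks, into an ODE framework while better preserving multi-scale detail and color budgets.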

详情
AI中文摘要

扩散模型在图像生成中被广泛应用,大多数模型依赖于噪声为基础的破坏和去噪。一个不同的分支使用模糊作为主要破坏,通过提供多尺度先验来更好地保持颜色预算和多尺度细节。然而,基于模糊的模型仍局限于SDE框架,并未整合到ODE框架中,如Flow Matching (FM)。同时,在模糊基公式中,经典的逆热耗散(IHD)过程面临病态挑战。此外,在数据流形假设下,从高维噪声(或速度)空间回归模糊图像也具有困难。我们提出Heat Dissipation Flow Matching (HDFM),其引入连续模糊(热耗散)过程到FM中以注入多尺度先验。HDFM将插值热耗散路径对齐以解决病态问题,并采用x预测来缓解高维回归困难。玩具实验和消融研究显示,HDFM在模糊和x预测方面均受益。HDFM在所有数据集上均优于大多数基线方法。

英文摘要

Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, which provides multi-scale priors and thereby better preserves color budgets and multi-scale detail. However, blur-based models remain confined to SDE-based frameworks and have not been integrated into ODE-based frameworks such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classical inverse heat-dissipation (IHD) process is ill-posed. Moreover, under the data-manifold assumption, regressing blurred images from high-dimensional noise (or velocity) space is also difficult. We propose Heat Dissipation Flow Matching (HDFM), which introduces a continuous blur (heat-dissipation) process into FM to inject multi-scale priors. HDFM aligns an interpolated heat-dissipation path to address ill-posedness and adopts $x$-prediction to mitigate high-dimensional regression difficulty. Toy experiments and ablation studies show that HDFM consistently benefits from both blur and $x$-prediction. HDFM outperforms most baseline methods on all datasets.
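Heat dissipation as a corruption process can be illustrated with an explicit finite-difference step of the heat equation on a 1D signal. This is a toy sketch, not the HDFM forward process: the periodic boundary, the step size `alpha` (which must stay at or below 0.5 for stability), and the function names are assumptions. It shows why blur preserves the overall intensity, which is one way to read the "color budget" mentioned in the abstract, while removing fine-scale detail first.

```python
def heat_step(signal, alpha=0.25):
    """One explicit diffusion step with periodic boundaries (stable for alpha <= 0.5)."""
    n = len(signal)
    return [
        signal[i] + alpha * (signal[(i - 1) % n] - 2.0 * signal[i] + signal[(i + 1) % n])
        for i in range(n)
    ]

def dissipate(signal, steps, alpha=0.25):
    """Apply repeated heat steps; more steps means a coarser, blurrier signal."""
    for _ in range(steps):
        signal = heat_step(signal, alpha)
    return signal

sharp = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
blurred = dissipate(sharp, steps=10)
```

The sum of `blurred` equals the sum of `sharp` up to rounding, while its peak is lower. Inverting this process is ill-posed because high-frequency content is damped towards zero.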

2605.19360 2026-05-20 cs.CV cs.LG cs.NE physics.app-ph physics.optics 版本更新

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

可扩展的、节能的光学-神经架构用于多路复用的深度伪造视频检测

Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan

发表机构 * Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校电气与计算机工程系) Bioengineering Department, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校生物工程系) California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校加州纳米系统研究所)

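AI Summary This paper presents a hybrid deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end. A programmable spatial light modulator enables massively parallel analog inference, raising throughput and keeping video-level authenticity prediction accurate at reduced computational cost.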

Comments 30 Pages, 8 Figures

详情
AI中文摘要

AI生成视觉媒体的快速普及催生了对高效、可信的深度伪造检测系统的需求。然而,现有基于深度学习的检测方法依赖于计算密集且能耗高的推理算法,限制了其可扩展性。本文提出了一种混合的数字-模拟深度伪造视频检测框架,结合轻量级数字前端和空间复用光学解码后端,通过可编程空间光调制器实现大规模并行模拟推理。通过在单次光学传播过程中同时处理15个或更多的视频流,该系统在降低计算成本的同时实现了高吞吐量和准确的视频级真实性预测。我们使用不同数据集验证了该混合深度伪造视频处理器,包括经典面部交换、现实世界深度伪造记录和完全AI生成的视频。使用在可见光谱范围内操作的空间复用实验装置,我们在Celeb-DF视频数据集上实现了97.79%的深度伪造检测准确率、99.86%的灵敏度和95.72%的特异性,分别在15个视频并行处理的单次光学传播中测试。多路复用的光学解码器还展示了对各种视频退化、噪声、压缩、实验偏移和黑盒对抗攻击的鲁棒性。我们的结果表明,将光学计算整合到AI推理中可以同时提高吞吐量、能效和对抗鲁棒性——这三个属性在纯数字系统中难以同时实现。

英文摘要

The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness, three properties that are difficult to achieve together in purely digital systems.

2605.19359 2026-05-20 cs.CV cs.LG 版本更新

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

MAM-CLIP:基于乳腺X线图集的视觉-语言预训练用于BI-RADS分类

Halil Ibrahim Gulluk, Olivier Gevaert

发表机构 * Department of Electrical Engineering(电气工程系) Biomedical Informatics Research (BMIR)(生物医学信息学研究(BMIR)) Stanford University(斯坦福大学)

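AI Summary This paper proposes MAM-CLIP, which pairs a pretrained PubMedBERT language component with contrastive learning on image-text pairs from mammography atlases to improve BI-RADS classification of mammograms. Experiments show sizable F1 gains when labeled samples are scarce.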

详情
AI中文摘要

深度学习方法在预测乳腺X线图像的BI-RADS评分方面已显示出有前景的结果。然而,这些图像的解释可能因人而异,即使在放射科医生之间也可能存在差异。鉴于乳腺X线的固有复杂性,仅依靠图像标签训练分类模型通常效果有限。为了解决这一挑战,我们收集了来自两个乳腺图集的2313张乳腺X线图像及其对应的描述。我们提出的方法采用了一个多模态模型,使用预训练的PubMedBERT作为语言组件。通过在图像-文本对上进行对比学习训练,使视觉编码器能够吸收描述中丰富的信息,从而提高其对乳腺X线发现的理解。然后,我们对两个数据集进行微调以进行BI-RADS预测,其性能优于没有此预训练的模型,尤其是在标注样本稀缺时。在3类平均F1分数上,改进范围从+1%到+14%:在40K训练样本时增加+1%,在1K样本时增加+14%。此外,我们的实验表明,来自乳腺图集的2K图像-文本对比2K标注样本更具信息量,当训练样本超过10K时,平均提升幅度为+1.1%。总体而言,我们的工作提供了一个用于乳腺X线的视觉-语言模型,并突显了乳腺图集文本信息的价值。此外,我们公开发布了TEKNOFEST数据集的预处理乳腺X线图像。训练代码、预训练模型权重、数据提取脚本和发布的数据集均可在:https://github.com/igulluk/MAM-CLIP上公开获取。

英文摘要

Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP

2605.19355 2026-05-20 cs.GR cs.AI cs.CV cs.LG 版本更新

Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance

具有空间自适应交互引导的皮肤运动重定向

Soojin Choi, Seokhyeon Hong, Chaelin Kim, Junghyun Nam, Junhyuk Jeon, Junyong Noh

发表机构 * Visual Media Lab(视觉媒体实验室) KAIST(韩国科学技术院)

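AI Summary This paper proposes a geometry-aware motion retargeting framework that preserves interaction semantics, such as self-contact and near-body proximity, by performing proximity matching over spatially adaptive anchors when retargeting motion across characters with different body shapes.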

Comments SIGGRAPH 2026 / ACM TOG. Project page available at https://suzyn.github.io/space_page/

详情
AI中文摘要

在不同身体形状的角色之间进行运动重定向,同时保持交互语义,如自接触和近身体接近,仍是一个具有挑战性的问题。尽管最近的几何感知方法通过维持预定义对应区域之间的空间关系来解决这一问题,但它们对静态对应关系的依赖在目标角色表现出夸张的身体比例时往往遇到困难。在本文中,我们提出了一种几何感知的运动重定向框架,通过在空间自适应锚点上进行接近匹配来保留交互语义。与以往具有静态锚点定义的方法不同,所提出的方法动态地将锚点重新定位到目标角色上可到达的区域。这通过基于Transformer的锚点细化策略实现,该策略预测锚点位移,并通过可微的软投影将转换后的锚点限制在目标角色的几何结构上。通过结合源角色的姿势依赖空间结构,适应的锚点为交互感知的重定向提供结构上连贯的指导。在这些锚点的条件下,基于图的自编码器预测目标骨骼运动,以保持源的空间配置。为了鼓励锚点适应和运动重定向之间的任务对齐优化,我们采用交替训练方案,其中每个模块依次优化。通过广泛的评估,我们证明了我们的方法在保持交互保真度方面优于最先进的方法,适用于多样化的角色几何结构。

英文摘要

Retargeting motion across characters with varying body shapes while preserving interaction semantics, such as self-contact and near-body proximity, remains a challenging problem. While recent geometry-aware approaches address this by maintaining spatial relationships between predefined corresponding regions, their reliance on static correspondences often struggles when the target character exhibits exaggerated body proportions. In this paper, we present a geometry-aware motion retargeting framework that preserves interaction semantics by performing proximity matching over spatially adaptive anchors. Unlike prior methods with static anchor definitions, the proposed method dynamically repositions anchors to reachable regions on the target character. This is achieved via a Transformer-based anchor refinement strategy that predicts anchor displacements and constrains the translated anchors to remain on the target character geometry through differentiable soft projection. By incorporating pose-dependent spatial structures from the source character, the adapted anchors provide structurally coherent guidance for interaction-aware retargeting. Conditioned on these anchors, a graph-based autoencoder predicts target skeletal motion that preserves the spatial configuration of the source. To encourage task-aligned optimization between anchor adaptation and motion retargeting, we adopt an alternating training scheme in which each module is optimized in turn. Through extensive evaluations, we demonstrate that our method outperforms state-of-the-art approaches in preserving interaction fidelity across diverse character geometries.

2605.19340 2026-05-20 cs.CV 版本更新

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

选择性、正则化和校准:利用视觉基础模型进行跨域少样本语义分割

Junyuan Ma, Xunzhi Xiang, Wenbin Li, Qi Fan, Yang Gao

发表机构 * Nanjing University(南京大学) University of Chinese Academy of Sciences(中国科学院大学)

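AI Summary This paper proposes HERA, a select-regularize-calibrate framework that harnesses vision foundation models for cross-domain few-shot semantic segmentation, improving adaptation to novel domains and achieving higher mIoU on multiple benchmarks.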

Comments 20 pages, 11 figures, 13 tables. Accepted to CVPR 2026

详情
AI中文摘要

视觉基础模型(VFMs)在各种视觉任务中已取得优异性能。然而,将VFMs应用于跨域少样本分割(CD-FSS)仍然具有挑战性,因为CD-FSS需要在仅少量标记示例的情况下对新类别的对象进行分割,并且在域转移下进行。挑战主要由两个因素驱动:(1)每个新类别的标记示例有限,相对于VFM预训练的规模,这使模型在重新训练时容易过拟合;(2)目标域在预训练期间未被充分代表,导致跨域不一致性和层间敏感性。为了解决这些问题,我们提出了层次示例表示适应(HERA),一种基于VFMs的三阶段选择-正则化-校准分割框架,能够有效利用有限的标签并在不重新训练源数据的情况下适应新领域。我们首先设计了层次层选择(HLS)以自适应地识别最信息丰富的VFM层,使用数据依赖的示例转移风险(ETR)计算每个候选层。然后,先验引导正则化(PGR)对选定的表示进行正则化,产生后续阶段的结构化局部信号。此外,像素级自适应校准(PAC)将选定的表示与细化的交互图结合,校准像素级预测,产生一致的掩码。这些阶段共同形成一个层次选择-正则化-校准的管道,指导冻结的VFM特征在新领域中工作,同时在测试时仅微调不到2.7%的参数。广泛的实验表明,HERA在多个CD-FSS基准上超越了现有最佳方法,mIoU提高了超过4.1个百分点。

英文摘要

Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it remains challenging to apply VFMs to cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.

2605.19322 2026-05-20 cs.CV Version update

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

DynaTok: 时序自适应和位置偏见感知的视频大语言模型token压缩

Minyoung Park, Taehun Kong, Sangjun Ahn

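Affiliations * LG Electronics, Seoul, South Korea(LG电子,首尔,韩国)

AI summary This paper proposes DynaTok, a training-free, temporally adaptive and positional bias-aware token compression framework that allocates token budgets across the temporal and spatial dimensions, reducing redundant spatio-temporal coverage and improving the efficiency and robustness of Video-LLMs.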

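Details
AI-generated abstract

Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the large number of visual tokens extracted from long video sequences incurs high computational cost, which limits deployment in real-world scenarios. Existing training-free token compression methods select tokens using attention magnitude as a proxy for semantic importance, but they often overlook positional bias and rely only on short-term temporal locality, which leads to redundant spatio-temporal coverage and inefficient token usage. We propose DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both the temporal and spatial dimensions. Using a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more tokens to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module selects spatially diverse and semantically important features using activation-based attention maps, and uses a spatial memory to reduce redundancy with previously selected regions and to mitigate positional bias. DynaTok integrates with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining and preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, outperforming recent training-free methods. These results indicate that DynaTok provides a principled basis for efficient and robust video reasoning and paves the way toward real-time streaming video understanding with future Video-LLMs.

English abstract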

Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.
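
The temporal budget idea in the abstract above can be illustrated with a small sketch. This is not the DynaTok implementation: the function name `allocate_temporal_budget`, the mean-absolute-distance novelty score, and the proportional split are all assumptions made for illustration. It only shows how an EMA memory might steer more tokens toward frames that differ from what has been seen so far.

```python
from typing import List, Sequence


def allocate_temporal_budget(
    frame_features: Sequence[Sequence[float]],
    total_budget: int,
    ema_decay: float = 0.9,
    min_tokens: int = 1,
) -> List[int]:
    """Split total_budget across frames in proportion to each frame's novelty.

    Hypothetical helper: novelty is the mean absolute distance between a
    frame's feature vector and an exponential moving average of past frames.
    """
    memory = list(frame_features[0])  # EMA memory starts at the first frame
    novelty = []
    for feat in frame_features:
        dist = sum(abs(f - m) for f, m in zip(feat, memory)) / len(feat)
        novelty.append(dist + 1e-6)  # small floor so no frame has zero weight
        # Update the memory only after the frame has been scored.
        memory = [ema_decay * m + (1.0 - ema_decay) * f for f, m in zip(feat, memory)]

    spare = total_budget - min_tokens * len(frame_features)
    if spare < 0:
        raise ValueError("total_budget is too small for min_tokens per frame")

    total_novelty = sum(novelty)
    budgets = [min_tokens + int(spare * n / total_novelty) for n in novelty]

    # Hand out the tokens lost to rounding, most novel frames first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(novelty)), key=lambda i: novelty[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets
```

With three static frames and one frame that changes sharply, most of the budget goes to the changed frame, while every frame keeps at least `min_tokens`.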

2605.19319 2026-05-20 cs.CV Version update

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

SWEET:基于图像编辑的稀疏世界建模用于具身任务执行

Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou

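Affiliations * Show Lab, National University of Singapore(新加坡国立大学Show实验室) Central South University(中南大学)

AI summary This paper studies whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states rather than generating dense video. It proposes the SWEET framework for sparse visual planning, which generates keyframes from language instructions and spatial guidance and produces executable actions with a diffusion action predictor; experiments show improved keyframe prediction across different scenes.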

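Details
AI-generated abstract

Visual prediction has emerged as a promising paradigm for embodied control, in which future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for manipulation tasks whose progress can be summarized by a small number of task-relevant visual states. This paper studies whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states instead of generating dense video. We first compare the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting and find that image editing produces more reliable task-level keyframes, with better visual fidelity and substantially lower inference cost. Motivated by this, we propose SWEET, a one-shot sparse visual planning framework that generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy that uses filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction in both seen and unseen scenes and supports a full pipeline from sequential keyframe planning to executable robot actions, which suggests that image editing is a promising and underexplored direction for embodied visual prediction.

English abstract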

Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

2605.19307 2026-05-20 cs.CV Version update

MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

MetaRA: 多模态大语言模型基于视觉问答系统的元形态鲁棒性评估

Quanxing Xu, Yuhao Tian, Ling Zhou, Xian Zhong, Xiaohua Huang, Rubing Huang, Chia-Wen Lin

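Affiliations * School of Computer Science and Engineering, Macau University of Science and Technology, Macao SAR(澳门科技大学计算机科学与工程学院) Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology(湖北省交通物联网重点实验室,武汉理工大学)

AI summary This paper proposes MetaRA, a framework based on metamorphic testing for assessing the robustness of visual question answering systems built on multimodal large language models. By generating controlled variants of image-question inputs, it exposes model weaknesses in handling linguistic perturbations, reliance on visual cues, and multimodal reasoning.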

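Details
AI-generated abstract

Visual Question Answering (VQA), a representative multimodal task, is a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations rely mainly on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that uses Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variants of image-question inputs from specific MRs and evaluates models under diverse conditions. Applying MetaRA to several MLLM-based VQA models reveals subtle failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results show that MetaRA provides richer diagnostic insight than conventional accuracy metrics and exposes failure modes that stay hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach to trustworthy multimodal AI.

English abstract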

Visual Question Answering (VQA), as the representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.
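
To make the notion of a metamorphic relation concrete, here is a minimal sketch of an answer-invariance check. It is not MetaRA's code: `check_relation`, `flip_horizontal`, and the deliberately brittle `toy_model` are hypothetical stand-ins, and the two relations shown (a horizontal flip should not change a count; changing the question's letter case should not change the answer) are illustrative choices rather than the paper's MRs.

```python
from typing import Callable, List

Image = List[List[int]]                  # toy image: rows of pixel values
VQAModel = Callable[[Image, str], str]   # (image, question) -> answer


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right; object counts should be unaffected."""
    return [list(reversed(row)) for row in image]


def check_relation(
    model: VQAModel,
    image: Image,
    question: str,
    transform_image: Callable[[Image], Image],
    transform_question: Callable[[str], str],
) -> bool:
    """Return True if the answer is unchanged on the follow-up input."""
    source_answer = model(image, question)
    follow_up_answer = model(transform_image(image), transform_question(question))
    return source_answer == follow_up_answer


def toy_model(image: Image, question: str) -> str:
    """Hypothetical model: counts non-zero pixels, but only for lowercase questions."""
    if "how many" in question:
        return str(sum(1 for row in image for px in row if px > 0))
    return "unknown"
```

No ground-truth label is needed: a violated relation flags a robustness problem even when the correct answer is unknown. Here the toy model passes the flip relation and fails the letter-case relation.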

2605.19305 2026-05-20 cs.GR cs.CV cs.LG Version update

Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes

Matérn噪声用于三角化无关的网格上流匹配

Tianshu Kuai, Arman Maesumi, Daniel Ritchie, Noam Aigerman

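Affiliations * Université de Montréal & Mila(蒙特利尔大学及Mila) Brown University(布朗大学)

AI summary This paper proposes a triangulation-agnostic flow matching method for generating signals over triangle meshes. It uses a discretized Matérn process as the noise model, which allows efficient sampling and high-quality generation across different meshes and triangulations.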

Comments In ACM Transactions on Graphics (SIGGRAPH 2026). Project page: https://matern-fm.github.io/

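Details
AI-generated abstract

This paper addresses the task of learning to generate signals over triangle meshes and proposes a triangulation-agnostic flow matching method. On the theoretical side, it proposes a triangulation-agnostic noise distribution for use in the denoising process of the flow matching model. It gives a mathematical definition of triangulation agnosticism for distributions, shows that a discretization of a Matérn process has the required properties, and provides an efficient sampling algorithm. Triangulation-agnostic flow matching is obtained by using this noise model with PoissonNet as the denoiser. Experiments show that the method produces high-quality, diverse results on meshes with more than one million triangles, significantly exceeding the state of the art.

English abstract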

This paper tackles the task of learning to generate signals over triangle meshes in a triangulation-agnostic manner, meaning the trained model can be applied to different meshes and triangulations effectively. Practically, the paper adapts the flow matching (FM) paradigm to a mesh-based, triangulation-agnostic setting. Theoretically, it proposes a specific noise distribution which is triangulation agnostic, to be used inside the FM model's denoising process. While noise distributions are usually trivial to devise for, e.g., images, devising a triangulation-agnostic distribution proves to be a much more difficult task. We formulate a mathematical definition of triangulation agnosticism of distributions, via their spectrum. We then show that a discretization of a specific Gaussian random field called a Matérn process holds these desired properties, and provides a simple and efficient sampling algorithm. We use it as our noise model, and adapt FM to the triangulation-agnostic setting by using a state-of-the-art approach for learning signals on meshes in the gradient domain -- PoissonNet -- as the denoiser. We conduct experiments on elaborate tasks such as sampling elastic rest states, and generating poses of humanoids. Our method is shown to be capable of producing highly realistic results for meshes of over one million triangles, significantly exceeding the state-of-the-art in quality and diversity.

2605.19304 2026-05-20 cs.CV cs.GR Version update

MMGS: 10× Compressed 3DGS through Optimal Transport Aggregation based on Multi-view Ranking

MMGS: 通过多视图排序基于最优传输的10倍压缩3DGS

Beizhen Zhao, Sicheng Yu, Ziran Yin, Dongxu Shen, Hao Wang

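Affiliations * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI summary This paper proposes a method that combines multi-view ranking with optimal transport aggregation and formulates Gaussian optimization as a global geometric distribution matching problem. It achieves 10× compression of 3DGS and 10× faster training while maintaining high-quality rendering.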

Comments 19 pages

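Details
AI-generated abstract

Although 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it carries significant overhead from a large number of redundant primitives. Existing compression methods usually rely on local sampling or fixed pruning thresholds and struggle to balance redundancy reduction against high-fidelity rendering. We therefore propose a new framework that formulates Gaussian optimization as a global geometric distribution matching problem. Specifically, our method integrates three components: (1) a multi-view 3D Gaussian contribution ranking mechanism that filters primitives by geometric consistency rather than local heuristics; (2) a global Optimal Transport (OT)-based aggregation algorithm that merges redundant primitives while preserving the underlying geometry; and (3) an OT-based densification operator that preserves the distributional properties of the Gaussians for stable optimization. Our method achieves state-of-the-art rendering quality with only 10% of the primitives and 10× faster training than vanilla 3DGS.

English abstract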

While 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it suffers from significant overhead due to massive redundant primitives. Existing compression methods typically rely on local sampling or fixed pruning thresholds, which often struggle to balance redundancy reduction with high-fidelity rendering. To address this, we propose a novel framework that formulates Gaussian optimization as a global geometric distribution matching problem. Specifically, our approach integrates three components: (1) we introduce a multi-view 3D Gaussian contribution ranking mechanism that filters primitives using geometric consistency instead of local heuristics; (2) we propose a global Optimal Transport (OT)-based aggregation algorithm that merges redundant primitives while preserving the underlying geometry; and (3) we design an OT-based densification operator that maintains the Gaussian's distributional properties for stable optimization. Our approach achieves state-of-the-art rendering quality with only 10% primitives and 10× accelerated training speeds compared to vanilla 3DGS.

2605.19301 2026-05-20 cs.CV Version update

iGSP:Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

iGSP:隐式梯度子空间投影用于高效视觉-语言模型的持续学习

Xuezhi Cui, Dongbo Zhou, Wang Guo, Zeyuan Wang, Ziyu Li, Gaozhi Zhou, Xian Li, Ling Zhao, Wentao Yang, Chao Tao, Haifeng Li

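Affiliations * School of Geosciences and Info-Physics, Central South University(地球科学与信息物理学院,中南大学) School of Earth Sciences and Spatial Information Engineering, Hunan University of Science and Technology(地球科学与空间信息工程学院,湖南科技大学)

AI summary This paper proposes the iGSP framework, which uses implicit gradient subspace projection for efficient continual learning of vision-language models. It addresses the shortcomings of existing methods in parameter efficiency and cross-task alignment consistency and significantly improves training efficiency and knowledge reuse.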

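Details
AI-generated abstract

Vision-language models need to adapt efficiently to continually emerging downstream tasks. Parameter-efficient fine-tuning can mitigate catastrophic forgetting, but assigning an isolated module to every task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms wrongly equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch causes severe negative transfer between tasks that are visually similar but logically distinct, and fails to exploit alignment reuse across visually diverse tasks. We argue that alignment sharing is essentially a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Based on this insight, we propose iGSP, a new framework that achieves efficient adaptation through implicit gradient subspace projection. Using the early convergence of MoE routers to establish the subspace basis, iGSP divides adaptation into two phases. First, the Subspace Identification phase introduces candidate experts through basis pre-expansion, applies a new subspace-constrained regularization that implicitly projects new-task gradients onto the historical subspace, and prunes redundant dimensions by treating routing probabilities as indicators of gradient flow, with the aim of maximizing knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to quickly fit the task-specific residual loss. On the MTIL benchmark, iGSP achieves state-of-the-art accuracy while significantly improving training efficiency: it reduces the average number of trainable parameters by 42.7% compared with current state-of-the-art methods and the final total parameter count by 86.9% relative to counterpart methods. The source code is available at https://github.com/GeoX-Lab/iGSP.

English abstract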

Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue that alignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately to maximize knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7% compared to current SOTA methods, and decreasing the final total parameters by 86.9% relative to counterparts. The source code is available at https://github.com/GeoX-Lab/iGSP.
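
The geometric picture behind the subspace-constrained regularization can be sketched with plain vectors. The abstract describes an implicit projection through a regularizer; the code below shows an explicit projection only to make the decomposition visible. `split_gradient` and `subspace_penalty` are hypothetical names, and the basis is assumed to be orthonormal. It is not taken from the iGSP repository.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def split_gradient(grad: Vector, basis: List[Vector]) -> Tuple[Vector, Vector]:
    """Split grad into its part inside span(basis) and the orthogonal residual.

    Assumes the basis vectors are orthonormal.
    """
    inside = [0.0] * len(grad)
    for b in basis:
        coeff = dot(grad, b)  # coordinate of grad along this basis direction
        inside = [p + coeff * bi for p, bi in zip(inside, b)]
    residual = [g - p for g, p in zip(grad, inside)]
    return inside, residual


def subspace_penalty(grad: Vector, basis: List[Vector], weight: float = 1.0) -> float:
    """Penalty that grows with the gradient energy outside the historical subspace."""
    _, residual = split_gradient(grad, basis)
    return weight * dot(residual, residual)
```

Adding a penalty of this form to the training loss discourages updates that leave the historical subspace without hard-constraining them, which is one way to read the "implicit" projection described in the abstract.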

2605.19289 2026-05-20 cs.CV Version update

What Makes Synthetic Data Effective in Image Segmentation

是什么使合成数据在图像分割中有效

Jinjin Zhang, Xiefan Guo, Yizhou Jin, Nan Zhou, Di Huang

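Affiliations * State Key Laboratory of Complex and Critical Software Environment(复杂与关键软件环境国家重点实验室) Beihang University(北京航空航天大学) School of Computer Science and Engineering(计算机科学与工程学院)

AI summary This paper studies what makes synthetic data effective for image segmentation. By analyzing synthetic images generated by state-of-the-art diffusion models, it finds that dense scene composition and fine instance fidelity are the key factors, and it proposes SENSE, a unified framework for improving segmentation performance.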

Comments Accepted to ICML 2026

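Details
AI-generated abstract

Driven by rapid progress in large-scale generative models, synthetic data has become a promising solution for visual understanding. Although modern diffusion models excel at generating photorealistic images, their potential in complex visual segmentation tasks remains underexplored. In this work, we systematically analyze synthetic images generated by state-of-the-art diffusion models to uncover what determines their usefulness. In particular, synthetic images with dense scene composition and fine instance fidelity show clear advantages and yield more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that uses flexible and scalable synthetic data to substantially improve segmentation performance. Notably, SENSE is model-agnostic, is compatible with a variety of architectures (such as DPT and Mask2Former), and scales effectively across models of different parameter capacity. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization ability of our approach. Code is available at https://github.com/zhang0jhon/SENSE.

English abstract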

Driven by rapid advances in large-scale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code is available at https://github.com/zhang0jhon/SENSE.

2605.19279 2026-05-20 cs.CV Version update

FPED: A Functional-Network Prior-Guided Mixture-of-Experts Framework for Interpretable Brain Decoding

FPED: 一种基于功能网络先验的可解释性脑解码混合专家框架

Yudan Ren, Pengcheng Shi, Zihan Ma, Xiaowei He, Xiao Li

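Affiliations * School of Electronic Information (School of Artificial Intelligence), Northwest University(电子信息学院(人工智能学院),西北大学)

AI summary This paper proposes the FPED framework, which models different functional brain networks as experts and uses an adaptive routing mechanism to capture their complementary contributions to visual semantic understanding, enabling interpretable brain decoding.

Comments 15 pages, 4 figures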

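Details
AI-generated abstract

Reconstructing visual images from functional Magnetic Resonance Imaging (fMRI) is a fundamental task in brain decoding and an important route to understanding human perceptual mechanisms and developing advanced brain-computer interfaces (BCIs). However, most existing methods simply flatten fMRI signals from localized visual cortex into one-dimensional vectors and map them directly into latent spaces such as that of Contrastive Language-Image Pre-training (CLIP). This paradigm disrupts the brain's inherent network topology, which limits neuroscientific interpretability, and it overlooks the synergistic role of other distributed functional networks in processing high-level visual semantics. To address these limitations, we propose FPED, a functional-network prior-guided Mixture-of-Experts (MoE) framework for interpretable brain decoding. FPED explicitly models different functional brain networks as specialized experts and uses an adaptive routing mechanism to capture their complementary contributions to visual semantic understanding. Unlike conventional homogeneous decoding paradigms, our framework incorporates neurobiologically grounded priors to enable structured, interpretable network-level representation learning. Experimental results show that FPED achieves highly competitive semantic reconstruction performance with only 0.68B parameters. The learned routing dynamics reveal a biologically meaningful correspondence between functional brain networks and modality-specific semantic processing, which provides transparent neuroscientific interpretability. This suggests that brain-network-aware expert modeling is a promising direction for connecting neural decoding with biologically inspired artificial intelligence.

English abstract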

Visual image reconstruction from functional Magnetic Resonance Imaging (fMRI) is a fundamental task in brain decoding, providing a crucial pathway for understanding human perceptual mechanisms and developing advanced brain-computer interfaces (BCIs). However, most current methods simply flatten fMRI signals from localized visual cortices into one-dimensional (1D) vectors, mapping them directly into latent spaces such as that of Contrastive Language-Image Pre-training (CLIP). This paradigm not only disrupts the inherent network topology of the brain, leading to limited neuroscientific interpretability, but also overlooks the synergistic contributions of other distributed functional networks in processing high-level visual semantics. To address these limitations, we propose FPED, a Functional-Network Prior-Guided Mixture of Experts (MoE) framework for interpretable brain decoding. FPED explicitly models different functional brain networks as specialized experts and employs adaptive routing to capture their complementary contributions to visual semantic understanding. Unlike conventional homogeneous decoding paradigms, our framework incorporates neurobiologically grounded priors to enable structured and interpretable network-level representation learning. Experimental results demonstrate that FPED achieves highly competitive semantic reconstruction performance with only 0.68B parameters. The learned routing dynamics reveal biologically meaningful correspondence between functional brain networks and modality-specific semantic processing, providing transparent neuroscientific interpretability. This suggests that brain network-aware expert modeling is a promising direction for bridging neural decoding and biologically inspired artificial intelligence.

2605.19260 2026-05-20 cs.AI cs.CV cs.MA Version update

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

AQuaUI: 用于GUI代理的视觉令牌减少方法基于自适应四叉树

Yuankai Li, Tinghui Zhu, Ha Min Son, Zhe Zhao, Xin Liu, Muhao Chen

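Affiliations * UC Davis(加州大学戴维斯分校)

AI summary This paper proposes AQuaUI, a training-free, inference-time visual token reduction method for GUI agent models. It exploits the non-uniform information density of screenshots, uses an adaptive quadtree while preserving token positions for consistency, and improves temporal consistency across multi-step GUI interactions with a conditional quadtree algorithm; experiments show an improved trade-off between accuracy and efficiency.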

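Details
AI-generated abstract

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI agent models, in which a high-resolution GUI screenshot is added to the prompt at every iteration step. However, these screenshots have highly non-uniform spatial information density: large regions may carry little information and be visually homogeneous, while key text and icons may require high visual fidelity. Existing methods either require additional training or rely on attention-based token compression, and they ignore the structured layout and spatial redundancy of GUI screenshots. To fill this gap, this paper proposes AQuaUI, a training-free, inference-time token reduction method for GUI agent models that exploits the non-uniform information density of screenshots. AQuaUI builds an adaptive quadtree over each input screenshot and keeps one representative merged token for each leaf of the quadtree. It preserves the spatial positions of the retained tokens throughout the pipeline so that all position-encoding stages stay consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that exploits the continuity between consecutive screenshots within a single request. Specifically, it uses previous quadtrees as references to refine the current quadtree, which helps preserve fine-grained regions across static or slightly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and run experiments on standard grounding and navigation benchmarks. AQuaUI consistently achieves a better accuracy-efficiency trade-off than prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to a 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, which shows that the spatial redundancy of GUI screenshots can be exploited without retraining.

English abstract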

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.
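
The quadtree step described above can be sketched on a toy patch grid. This is an illustration under stated assumptions, not AQuaUI's algorithm: each patch is reduced to a single scalar, the grid is assumed square with a power-of-two side, the split rule (value range above a threshold) and the mean-based merge are hypothetical choices, and `build_quadtree` is an invented name.

```python
from typing import List, Optional, Tuple

# A leaf records where the region sits so position information survives merging:
# (top row, left column, side length, merged value)
Leaf = Tuple[int, int, int, float]


def build_quadtree(
    grid: List[List[float]],
    threshold: float,
    row: int = 0,
    col: int = 0,
    size: Optional[int] = None,
) -> List[Leaf]:
    """Recursively split a square region until it is homogeneous enough.

    Assumes grid is square and its side length is a power of two.
    """
    if size is None:
        size = len(grid)
    cells = [grid[r][c] for r in range(row, row + size) for c in range(col, col + size)]
    # Homogeneous (or single-cell) regions collapse to one merged token.
    if size == 1 or max(cells) - min(cells) <= threshold:
        return [(row, col, size, sum(cells) / len(cells))]
    half = size // 2
    leaves: List[Leaf] = []
    for dr in (0, half):
        for dc in (0, half):
            leaves.extend(build_quadtree(grid, threshold, row + dr, col + dc, half))
    return leaves
```

On a 4×4 grid whose only detail sits in the top-left quadrant, the tree keeps four fine-grained leaves there and one leaf for each of the three blank quadrants, so 16 patches become 7 tokens.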

2605.19256 2026-05-20 cs.CV Version update

Distribution Matching Distillation without Fake Score Network

无需假评分网络的分布匹配蒸馏

Youngjoong Kim, Deokyeong Lee, Jaesik Park

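Affiliations * Department of Computer Science and Engineering, Seoul National University(首尔国立大学计算机科学与工程系) Department of Computer Science and Engineering, Sogang University(西江大学计算机科学与工程系)

AI summary This paper proposes Fake-Score-network-Free Distribution Matching Distillation (FSF-DMD), which replaces the conventional fake-score network with a pseudo-velocity induced by the flow-map generator itself to provide distribution-level correction, and validates it on ImageNet-1K.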

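Details
AI-generated abstract

Distribution Matching Distillation (DMD) provides effective distribution-level correction for few-step generation, but it relies on an auxiliary fake-score network to track the evolving generative distribution. Recent work combines DMD-style objectives with flow-map generators to exploit both forward-divergence training and reverse-divergence correction. The fake-score estimator remains an extra component with memory and update overhead. In this work, we study whether an explicit tracker can be avoided when the generator itself has a flow-map structure. We propose Fake-Score-network-Free DMD (FSF-DMD), a DMD formulation for flow-map generators that replaces the auxiliary fake-score estimator with a pseudo-velocity surrogate induced by the generator. The key observation is that the endpoint pseudo-velocity of a flow-map generator is a tractable proxy for fake-velocity estimation, so the generator itself can supply the reverse-divergence signal. Building on this observation, we derive a practical objective, extend it with flow-map-consistent backward simulation, and introduce a self-teacher variant for training from scratch. In ImageNet-1K 256×256 experiments, FSF-DMD improves on flow-map baselines, reaches a lower FID than the listed DMD2 comparisons in the flow-map-initialized setting, and remains effective under flow-matching initialization and when training from scratch.

English abstract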

Distribution Matching Distillation (DMD) provides an effective distribution-level correction for few-step generation, while relying on an auxiliary fake-score network to track the evolving generative distribution. Recent work combines DMD-style objectives with flow-map generators to exploit both forward-divergence training and reverse-divergence correction. The fake-score estimator remains an additional component with memory and update overhead. In this work, we study whether this explicit tracker can be avoided when the generator itself has a flow-map structure. We propose Fake-Score-network-Free DMD (FSF-DMD), a DMD formulation for flow-map generators that replaces the auxiliary fake-score estimator with a generator-induced pseudo-velocity surrogate. The key observation is that the endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal. Building on this observation, we derive a practical objective, extend it with flow-map-consistent backward simulation, and introduce a self-teacher variant for training from scratch. In our ImageNet-1K 256×256 experiments, FSF-DMD improves flow-map baselines, reaches lower FID than the listed DMD2 comparisons in the flow-map-initialized setting, and remains effective under flow-matching initialization and training from scratch.

2605.19247 2026-05-20 cs.CV Version update

Structuring Open-Ended NAS: Semi-Automated Design Knowledge Structuring with LLMs for Efficient Neural Architecture Search

结构化开放端NAS:利用LLM进行半自动设计知识结构化以实现高效的神经架构搜索

Yuiko Sakuma, Masakazu Yoshimura, Marcel Gröpl, Zitang Sun, Junji Otsuka, Atsushi Irie, Takeshi Ohashi

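Affiliations * Sony Group Corporation(索尼集团公司) ETH Zurich(苏黎世联邦理工学院)

AI summary This paper proposes a semi-automated method that uses LLMs to structure model design knowledge and guide neural architecture search. By defining a high-level structural template and introducing the FairNAD algorithm, it explores open-ended search spaces efficiently and improves performance on several datasets.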

Comments 42 pages

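Details
AI-generated abstract

Current neural architecture search (NAS) methods are often limited by predefined, restrictive search spaces. Although recent large language model (LLM)-based NAS methods enable open-ended search spaces, they often explore inefficiently because of biased or low-quality design ideas. To address these issues, we propose a semi-automated way of structuring model design knowledge to guide the search process. Our method first defines a high-level structural template; an LLM then fills in the template by analyzing papers, creating a rich and diverse search space that embodies this structured design knowledge. To explore this large space efficiently, we introduce FairNAD, which uses multi-type mutation to explore broadly through fair idea sampling, Pareto-aware mutation, LLM-driven iterative mutation, and a fine-grained feedback loop. We show that FairNAD discovers high-performing architectures that improve on current state-of-the-art methods by 0.84, 2.17, and 2.35 points on CIFAR-10, CIFAR-100, and ImageNet16-120, respectively.

English abstract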

Current neural architecture search (NAS) methods are often limited by their predefined, restrictive search spaces. While recent large language model (LLM)-assisted NAS methods enable open-ended search spaces, they often suffer from inefficient exploration due to biased or low-quality design ideas. To address these issues, we propose to semi-automatically structure model design knowledge to guide the search process. Our approach first defines a high-level structural template of architectural attributes. An LLM then populates this template by analyzing papers, creating a rich and diverse search space that embodies this structured design knowledge. To efficiently explore this vast space, we introduce FairNAD, using a multi-type mutation that enables broad exploration through mutation with fair idea sampling, Pareto-aware mutation, LLM-driven iterative mutation, and a fine-grained feedback loop. We demonstrate the effectiveness of FairNAD in discovering high-performing architectures that yield 0.84, 2.17, and 2.35 points improvement on CIFAR-10, CIFAR-100, and ImageNet16-120, respectively, compared to current state-of-the-art methods.
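
The abstract mentions Pareto-aware mutation without detail. As background only, the sketch below shows the standard non-dominated filter that such a step would plausibly start from, assuming two objectives (accuracy to maximize, cost to minimize). The `Candidate` tuple layout and function names are hypothetical and are not taken from FairNAD.

```python
from typing import List, Tuple

# (name, accuracy to maximize, cost to minimize); layout is an assumption
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and better on at least one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

A search loop could then pick parents for mutation from this front instead of from the single most accurate architecture, which keeps cheaper designs in play.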

2605.19242 2026-05-20 cs.CV cs.AI cs.ET cs.LG cs.MM Version update

PhyWorld: Physics-Faithful World Model for Video Generation

PhyWorld: 用于视频生成的物理忠实世界模型

Pu Zhao, Juyi Lin, Timothy Rupprecht, Arash Akbari, Chence Yang, Rahul Chowdhury, Elaheh Motamedi, Arman Akbari, Yumei He, Chen Wang, Geng Yuan, Weiwei Chen, Yanzhi Wang

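Affiliations * Northeastern University(东北大学) University of Georgia(佐治亚大学) Tulane University(杜兰大学) EmbodyX

AI summary This paper proposes PhyWorld, which uses two-stage post-training to improve the physical faithfulness of video generation models so that they serve as more effective world simulators for Physical AI systems.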

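Details
AI-generated abstract

World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are becoming a promising basis for such simulators because they can generate diverse and realistic visual futures. Using them as world simulators, however, requires physically faithful video continuation: the generated video should preserve the physical state implied by the conditioning input and evolve in a way that is consistent with basic physical principles. We propose PhyWorld, a video generation world model that produces temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, which encourages stable visual attributes and coherent motion dynamics across frames. In the second stage, we align the generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, steering the model toward more physically plausible outputs. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark that scores each physical law separately. Experiments show that PhyWorld improves video consistency, with an average VBench score of 0.769 versus 0.756 or lower for state-of-the-art baselines. PhyWorld also improves physical plausibility, with an average score of 3.09 on our physical-faithfulness benchmark versus 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.

English abstract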

World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.
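
For readers unfamiliar with the second stage, the sketch below computes the standard DPO loss for one preference pair from log-probabilities. It is the generic formulation, not PhyWorld's training code: the abstract does not say how likelihoods are obtained for a video generator (such models often use a surrogate), and the function name, argument names, and `beta` value are assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is how much more the
    policy prefers the chosen sample than the frozen reference model does.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow for large |z|.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

In a physics preference pair, the chosen sample would be the continuation judged more physically plausible. The loss equals log 2 when the policy matches the reference and falls as the policy favors the chosen sample more strongly.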

2605.19230 2026-05-20 cs.CV cs.LG Version update

Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation

通过样本难度去相关性实现鲁棒的年龄依赖性混杂效应缓解

Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread, Lyle J. Palmer

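Affiliations * Australian Institute for Machine Learning(澳大利亚机器学习研究所) Adelaide University(阿德莱德大学)

AI summary This paper proposes a robust framework that mitigates age-dependent confounding by targeting spurious age-related trends rather than enforcing invariance. By modeling sample difficulty and decorrelating age from the dominant age-difficulty trend, it reduces age-related disparities in true and false positive rates while preserving clinically meaningful nonlinear age information.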

Comments 10 Pages, 3 Figures

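Details
AI-generated abstract

Age-dependent performance disparities in medical image classification often arise because age acts as a confounder that links imaging morphology with disease prevalence. In practice, these disparities can appear as overdiagnosis at ages where disease prevalence is higher and underdiagnosis at ages where it is lower, and they can worsen when the age distribution shifts between training and testing. Conventional mitigation methods that enforce strict age invariance may suppress diagnostically meaningful information encoded in age. We therefore propose a robust framework that mitigates age-dependent confounding by targeting spurious age-related trends rather than enforcing invariance. After a warm-up phase, we characterize sample difficulty and model its age-dependent trend in a label-conditioned manner. We decorrelate age from the dominant age-difficulty trend using robust Huber-weighted affinity weights, which weakens confounding-driven shortcuts while preserving clinically meaningful nonlinear age information. We further introduce an Age Coverage Score that scales the decorrelation penalty by the mini-batch age variance to keep optimization stable when age diversity is limited. On two radiology datasets, our method reduces age-related true and false positive disparities with minimal impact on AUC and remains robust as the train-test age distribution shift increases.

English abstract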

Age-dependent performance disparities in medical image classification often arise because age acts as a confounder, linking imaging morphology with disease prevalence. In practice, disparities can manifest as overdiagnosis at ages where disease prevalence is higher and underdiagnosis at ages where prevalence is lower, and can worsen under train-test shifts in the age distribution. Conventional mitigation approaches that enforce strict age invariance may suppress diagnostically meaningful information encoded in age. We therefore propose a robust framework that mitigates the effects of age-dependent confounding by targeting spurious age-linked trends rather than enforcing invariance. Following a warm-up phase, we characterize sample difficulty and model its age-dependent trends in a label-conditioned manner. We decorrelate age from dominant age-difficulty trends using robust, Huber-weighted affinity weights, attenuating confounding-driven shortcuts while preserving clinically meaningful, nonlinear age information. We further introduce an Age Coverage Score that scales the decorrelation penalty by minibatch age variance to ensure stable optimization under limited age diversity. Across two radiology datasets, our approach reduces age-dependent true and false positive disparities with minimal AUC impact and remains robust to increasing train-test age distribution shifts.
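
One way to picture the decorrelation penalty and its variance-based scaling is the sketch below. Every concrete choice here is an assumption rather than the paper's formulation: the penalty is a squared Huber-weighted Pearson correlation between age and per-sample difficulty, the weights come from difficulty residuals around the batch mean, and the coverage term is the batch age variance divided by a reference variance and clipped to 1. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def huber_weight(residual: float, delta: float) -> float:
    """Huber weight: 1 inside [-delta, delta], shrinking as delta/|r| outside."""
    magnitude = abs(residual)
    return 1.0 if magnitude <= delta else delta / magnitude


def weighted_correlation(
    x: Sequence[float], y: Sequence[float], w: Sequence[float]
) -> float:
    """Weighted Pearson correlation; returns 0 when either variable is constant."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def decorrelation_penalty(
    ages: Sequence[float],
    difficulty: Sequence[float],
    reference_age_var: float,
    delta: float = 1.0,
) -> float:
    """Squared age-difficulty correlation, scaled down for age-narrow batches."""
    n = len(ages)
    mean_difficulty = sum(difficulty) / n
    # Down-weight samples whose difficulty is an outlier within the batch.
    weights: List[float] = [huber_weight(d - mean_difficulty, delta) for d in difficulty]
    corr = weighted_correlation(ages, difficulty, weights)
    mean_age = sum(ages) / n
    age_var = sum((a - mean_age) ** 2 for a in ages) / n
    coverage = min(1.0, age_var / reference_age_var)  # stand-in for a coverage score
    return coverage * corr ** 2
```

A batch in which difficulty rises steadily with age gets a penalty near 1, while a batch drawn from a single age gets 0, so a batch with little age diversity cannot dominate the gradient.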

2605.19223 2026-05-20 cs.CV Version update

HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

HAVEN:用于统一视频理解的层次对齐多模态基准

Mengqi Shi, Haopeng Zhang

发表机构 * Department of Information and Computer Sciences(信息与计算机科学系)

AI总结 本文提出HAVEN,一个用于统一视频理解的层次对齐多模态基准,旨在解决现有多模态大语言模型在复杂叙事总结和推理方面评估不足的问题,通过引入全粒度和全多模态的数据集架构,提供了一个严谨的标准测试平台。

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)在标准视频任务上表现出色,但其在复杂叙事的忠实总结和推理能力仍缺乏充分评估。现有总结基准在监督上分散于孤立的粒度层面,如关键帧、关键镜头或不连贯的文本总结,未能捕捉跨模态对齐的内在层次结构。为了解决这一关键差距,我们引入了HAVEN,一个用于统一视频理解的层次对齐多模态基准。HAVEN开创了一种全粒度(帧、镜头和视频层面)且全多模态(视频和文本)的数据集架构,配备了明确的、连续的模态对齐。基于这一统一的标注范式,我们提出了涵盖总结、时间推理、多模态定位和显著性排序的综合评估套件。对最新MLLMs的广泛基准测试揭示了表面文本流畅性与基于多模态理解之间的持续差距。最终,HAVEN推动了多模态系统的评估超越传统问答格式,提供了一个严谨、标准化的测试平台,以推动未来可解释、层次化的视频理解研究。我们公开发布了数据集、基准套件和评估协议。

英文摘要

While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.

2605.19218 2026-05-20 cs.CV cs.AI 版本更新

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

面向高效视觉-语言模型推理的旋转对齐Key通道剪枝

Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出旋转对齐的Key通道剪枝方法RotateK,通过压缩通道维度,在固定KV缓存预算下保留更多视觉token,缓解传统token剪枝在细粒度感知任务上的性能下降,同时提升解码效率。

详情
AI中文摘要

视觉-语言模型在推理时面临严重的KV缓存压力,因为单张图像通常会被编码为数千个token。现有方法大多通过token剪枝来利用token稀疏性,但永久丢弃视觉内容会使细粒度感知任务的性能显著下降。这促使我们考虑一个互补的维度,即特征稀疏性:在固定的KV缓存预算下,压缩通道维度可以在相同内存成本下保留更多视觉token。然而,已有的Key通道剪枝方法面临结构上的权衡:逐token的通道剪枝表达能力强,但非结构化且速度慢;逐head的方法对硬件友好,但不够稳健。我们提出RotateK来解决这一问题,这是一种基于旋转的结构化Key通道剪枝框架。RotateK应用基于PCA的在线旋转,将依赖于token的通道重要性对齐到共享的低维子空间,从而在轻量级的逐head掩码下实现精确剪枝;融合的Triton注意力内核直接在通道稀疏的Key上运算,以实现高效解码。在两个具有代表性的VLM骨干网络上的实验表明,RotateK在准确率和解码延迟两方面均稳定优于已有的Key通道剪枝方法,而token-通道联合剪枝在相同的KV缓存预算下优于仅做token剪枝的基线。

英文摘要

Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
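The core idea, rotating Keys so that the important directions share the same leading channels, can be illustrated with a small NumPy sketch. This is not RotateK itself: the rotation below is an offline eigendecomposition of the uncentered Key second-moment matrix for a single head, the data are synthetic, and the online update, head-wise masks, and Triton kernel from the abstract are not modeled. The function names are hypothetical.

```python
import numpy as np

def pca_rotation(keys):
    """Orthogonal basis from the Key second-moment matrix, strongest direction first."""
    second_moment = keys.T @ keys / keys.shape[0]
    eigvals, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues ascending
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]

def rotate_and_prune(queries, keys, keep):
    """Rotate queries and keys with the same basis, then keep the leading channels."""
    rot = pca_rotation(keys)
    return (queries @ rot)[:, :keep], (keys @ rot)[:, :keep]

rng = np.random.default_rng(0)
# Synthetic Keys that lie almost entirely in a 4-dim subspace of a 16-dim head.
basis = rng.normal(size=(4, 16))
keys = rng.normal(size=(64, 4)) @ basis + 1e-3 * rng.normal(size=(64, 16))
queries = rng.normal(size=(8, 16))

exact = queries @ keys.T
q_full, k_full = rotate_and_prune(queries, keys, keep=16)
q_small, k_small = rotate_and_prune(queries, keys, keep=4)

# An orthogonal rotation applied to both sides leaves attention logits unchanged.
full_error = np.abs(q_full @ k_full.T - exact).max()
# After rotation, dropping 12 of 16 channels only loses the small residual.
pruned_error = np.abs(q_small @ k_small.T - exact).max()
```

Because the same orthogonal matrix multiplies queries and Keys, the logits are preserved exactly before pruning; the approximation error comes only from the channels that are dropped afterwards.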

2605.19214 2026-05-20 cs.LG cs.CV 版本更新

Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification

多属性公平医疗图像分类中的最差组等化几率正则化

Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread, Lauren Oakden-Rayner, Robert Vandersluis, Jessica Schrouff, Lyle J. Palmer, Mark Jenkinson

发表机构 * Australian Institute for Machine Learning, Adelaide University(澳大利亚机器学习研究所,阿德莱德大学) GlaxoSmithKline (GSK)(葛兰素史克(GSK))

AI总结 本文提出一种最差组等化几率边际正则化方法,用于在多个人口属性上同时评估和缓解医疗图像分类中的系统性差异;该方法针对推理操作点上子组层面的真阳性率和假阳性率偏差施加惩罚,从而减少等化几率和等化机会上的差异,同时对AUC影响极小。

Comments 11 Pages, 2 Figures

详情
AI中文摘要

医疗人工智能的诊断性能在不同人口群体间系统性地变化,但子组AUC可能掩盖了临床重要的不平等。在固定的推理时间操作点上,某些群体可能表现出过度诊断行为,其特征是真阳性率和假阳性率升高,而另一些群体则表现出不足诊断模式,其真阳性率和假阳性率降低。这些对立的趋势可能在总体AUC中相互抵消,但会产生有意义的临床决策不平等。受在操作点和多个人口属性上评估和缓解此类不平等的需要所驱动,我们提出了一种最差组等化几率边际正则化器。该正则化器明确针对推理时的子组层面真阳性率和假阳性率偏差。在每次更新时,该方法识别出由显式人口属性(如年龄、性别和种族)定义的最极端边际偏差的子组,并应用统一的惩罚,从而在多个人口轴上实现公平优化,而无需显式交集约束。在两个现实中的多标签医学影像数据集中,我们的方法在减少等化几率和等化机会的不平等方面表现一致,对AUC影响极小,从而在保持诊断性能的同时提高公平性。

英文摘要

Diagnostic performance in medical AI varies systematically across demographic groups, yet subgroup AUC can mask clinically important disparities. At a fixed inference-time operating point, some groups may exhibit over-diagnostic behaviour, characterized by elevated true and false positive rates, while others show under-diagnostic patterns with reduced true and false positive rates. These opposing tendencies can cancel in aggregate AUCs while producing meaningful inequities in clinical decision-making. Motivated by the need to assess and mitigate such disparities at the operating point and across multiple demographic attributes simultaneously, we propose a worst-group equalized-odds margin regularizer. The proposed regularizer explicitly targets subgroup-level deviations on both the true positive and false positive sides at inference. At each update, the method identifies subgroups defined by explicit demographic attributes (e.g., age, sex, and race) that exhibit the most extreme margin deviations and applies a unified penalty, enabling fairness optimization across multiple demographic axes without requiring explicit intersectional constraints. Across two medical imaging datasets in realistic multi-label settings, our method consistently reduces disparities in Equalized Odds and Equalized Opportunity with minimal impact on AUC, preserving diagnostic performance while improving fairness.
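To make the operating-point view concrete, the sketch below measures, for already-thresholded predictions, which single-attribute subgroup deviates most from the overall true and false positive rates. It covers only the evaluation-side quantity; the paper's regularizer acts on margins during training and its exact form is not given in the abstract. The function names and the toy labels are made up for illustration.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return tpr, fpr

def worst_group_gap(y_true, y_pred, attributes):
    """attributes maps an attribute name to one group label per sample.

    Returns (gap, (attribute, group)) for the subgroup whose TPR or FPR
    deviates most from the overall rate. Groups are taken per attribute,
    so no intersectional subgroups are enumerated.
    """
    overall_tpr, overall_fpr = rates(y_true, y_pred)
    worst = (0.0, None)
    for name, labels in attributes.items():
        for group in sorted(set(labels)):
            idx = [i for i, label in enumerate(labels) if label == group]
            tpr, fpr = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
            gap = max(abs(tpr - overall_tpr), abs(fpr - overall_fpr))
            if gap > worst[0]:
                worst = (gap, (name, group))
    return worst

# Toy data (invented): one over-diagnosed and one under-diagnosed sex group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
attrs = {
    "sex": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age": ["old", "young", "old", "young", "old", "young", "old", "young"],
}
gap, group = worst_group_gap(y_true, y_pred, attrs)
```

In this toy case the two sex groups err in opposite directions, which is the cancellation pattern the abstract says aggregate AUC can hide.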

2605.19213 2026-05-20 cs.CV 版本更新

Smartphone-based Circular Plot Sampling for Forest Inventory

基于智能手机的圆形采样法用于森林调查

Su Sun, Jui-Cheng Chiu, Nabin Khanal, Songlin Fei, Yingjie Victor Chen

发表机构 * School of Applied and Creative Computing, Purdue University(应用与创意计算学院,普渡大学) Department of Forestry and Natural Resources, Purdue University(林业与自然资源系,普渡大学)

AI总结 本文提出一种基于智能手机的轻量级流程,仅凭单次步行拍摄的视频即可完成圆形样地内的树木测量,无需额外的专业硬件;该方法将预训练的单目深度估计和树木实例分割与SLAM框架相结合,联合优化相机轨迹和深度,进而得到树木位置和胸径估计,精度与成熟的野外测量方法相当。

详情
AI中文摘要

圆形样地是森林调查的基石,但准确测量样地内树木的胸径(DBH)和空间位置仍然具有挑战性。传统方法要么依赖昂贵的地面激光雷达系统,要么依赖使用卡尺和罗盘方位的劳动密集型人工测量,限制了其在大范围环境中的可扩展性和可及性。本文提出一种基于智能手机的轻量级流程,仅凭单次步行拍摄的视频即可完成整个样地的树木测量,除一部安装在便携支架上的消费级智能手机外无需任何专用硬件。该方法将预训练的单目深度估计和树木实例分割与同时定位与建图(SLAM)框架相结合,在整个视频序列上联合优化相机轨迹和深度。通过融合SLAM得到的相机位姿与分割后的深度图来恢复树木位置和DBH估计,并借助一个经过校准的参考长度确定真实世界的绝对尺度。该系统在人工管理林样地和天然林样地中进行了评估,平均绝对误差分别为1.51厘米(MARE 3.98%)和2.30厘米(MARE 5.69%),且在不同起始方向和位置下性能保持一致。跨视频一致性分析进一步表明,从不同起始位置开始测量时,树木定位稳定且可重复。该方法的精度与成熟的野外测量方法相当,同时大幅降低了设备成本和操作复杂度,使专业研究人员和非专业的森林管理者都能在多样的作业环境中使用。

英文摘要

Circular sample plots are a cornerstone of forest inventory, yet accurate measurement of tree diameter at breast height (DBH) and spatial location within such plots remains challenging. Conventional approaches rely either on costly terrestrial LiDAR systems or labor-intensive manual methods involving calipers and compass bearings, limiting their scalability and accessibility in large-scale environments. We present a lightweight, smartphone-based pipeline that enables complete plot-sampling-based tree measurement from a single walkthrough video, requiring no specialized hardware beyond a consumer smartphone mounted on a portable stand. The proposed method integrates pretrained monocular depth estimation and tree instance segmentation with a simultaneous localization and mapping (SLAM) framework to jointly refine camera trajectories and depth across the video sequence. Tree positions and DBH estimates are recovered by fusing SLAM-derived camera poses with segmented depth maps, with absolute real-world scale anchored via a calibrated reference length. The system was evaluated in both managed forest plots and a natural forest plot, achieving a mean absolute error of 1.51 cm (MARE 3.98%) and 2.30 cm (MARE 5.69%) respectively, with consistent performance across varying starting directions and positions. Cross-video consistency analysis further demonstrated stable and reproducible tree localization across measurements initiated from different starting positions. The proposed approach achieves accuracy comparable to established field methods while substantially reducing equipment cost and operational complexity, making it accessible to both professional researchers and non-expert forest managers in diverse operational settings.
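As a rough illustration of how a trunk width in pixels, a depth value, and a calibrated reference length can combine into a diameter estimate, here is a sketch based on the standard pinhole relation (real width ≈ pixel width × depth / focal length in pixels). The abstract does not specify the authors' estimator, so the helper names and all input values are assumptions; the sketch also ignores the slight underestimate that arises when a cylinder is viewed from close range.

```python
def metric_scale(reference_length_m, reference_length_slam):
    """Scale factor that converts unitless SLAM distances into metres."""
    return reference_length_m / reference_length_slam

def trunk_diameter_cm(pixel_width, depth_slam, focal_px, scale):
    """Pinhole estimate of trunk width at the measured depth, in centimetres."""
    depth_m = depth_slam * scale
    return 100.0 * pixel_width * depth_m / focal_px

def mae_and_mare(estimates, truths):
    """Mean absolute error (same unit as inputs) and mean absolute relative error (%)."""
    n = len(truths)
    mae = sum(abs(e - t) for e, t in zip(estimates, truths)) / n
    mare = 100.0 * sum(abs(e - t) / t for e, t in zip(estimates, truths)) / n
    return mae, mare

# Hypothetical numbers: a 1.0 m reference object spans 0.5 SLAM units.
scale = metric_scale(1.0, 0.5)
dbh = trunk_diameter_cm(pixel_width=150, depth_slam=1.5, focal_px=1500, scale=scale)
mae, mare = mae_and_mare([31.0, 19.0], [30.0, 20.0])
```

The MAE and MARE helpers correspond to the two error measures quoted in the abstract; the values computed here are for the invented inputs only.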

2605.19210 2026-05-20 cs.CV 版本更新

D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation

D-Convexity:通过准凹性统一的可微凸形状先验用于数据驱动的图像分割

Shengzhe Chen, Hao Yan

发表机构 * School of Computing and Augmented Intelligence, Arizona State University(计算与增强智能学院,亚利桑那州立大学)

AI总结 本文基于网络输出掩码函数u的准凹性,提出一种统一的、无阈值的可微凸形状先验,用于数据驱动的图像分割;通过要求u的所有超水平集均为凸集,将全局形状约束转化为局部可微不等式,从而提升分割结果的形状规则性。

Comments Accepted by CVPR 2026

详情
AI中文摘要

凸性是许多自然和人造结构背后的基本几何先验,但在端到端可训练的分割网络中有效施加凸性仍然具有挑战性。我们从函数的角度重新审视凸性,并基于网络输出掩码函数u的准凹性提出一种统一的、无阈值的凸性先验。我们不是约束单个二值分割,而是要求u的所有超水平集都是凸的,从而将全局形状约束转化为关于u及其导数的局部可微不等式。从这一原则出发,我们推导出零阶、一阶和二阶刻画,分别得到一种局部中点凸化算法、一个与支撑超平面相关的基于梯度的条件,以及一个以切平面上的二次型表达的充分二阶不等式。一阶和二阶形式给出一个紧凑的卷积损失,可以在整幅图像上密集施加而无需阈值化。我们的准凹性损失通过所提出的凸梯度投影模块(CGPM)无缝集成到现代分割网络中。它们在多个数据集上稳定地强制凸性并提高形状规则性,优于专为视网膜分割设计的网络,也超越了以往的形状感知方法。值得注意的是,我们的分析将一大类已有的凸形状模型统一在一个连续且可微的框架中,从离散的1-0-1线约束和图割凸性公式,到基于曲率或符号距离拉普拉斯的水平集先验。

英文摘要

Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on the quasi-concavity of the network's output mask function u. Instead of constraining a single binary segmentation, we require all super-level sets of u to be convex, transforming global shape constraints into local, differentiable inequalities on u and its derivatives. From this principle, we derive zero, first, and second-order characterizations, yielding respectively a local midpoint convexification algorithm, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed as a quadratic form on the tangent plane. The first and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing previous shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1-0-1 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors, within a single continuous and differentiable framework.
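The zero-order characterization rests on the textbook definition of quasi-concavity: u(λx + (1−λ)y) ≥ min(u(x), u(y)) for all points x, y and λ in [0, 1], which is equivalent to every super-level set of u being convex. The sketch below tests only the midpoint case (λ = 1/2) on a pixel grid and counts violations. It is a brute-force diagnostic written for illustration, not the paper's local midpoint convexification algorithm or its convolutional loss, and the function name is hypothetical.

```python
def midpoint_violations(u, tol=1e-9):
    """Count pixel pairs whose on-grid midpoint has a lower value than both ends.

    u is a 2D list of mask values. A pair (p, q) is checked only when its
    midpoint falls exactly on a pixel. Zero violations is necessary, though
    on a discrete grid not sufficient, for all super-level sets to be convex.
    """
    h, w = len(u), len(u[0])
    pts = [(r, c) for r in range(h) for c in range(w)]
    count = 0
    for i, (r1, c1) in enumerate(pts):
        for r2, c2 in pts[i + 1:]:
            if (r1 + r2) % 2 or (c1 + c2) % 2:
                continue  # midpoint is not a grid pixel
            mid = u[(r1 + r2) // 2][(c1 + c2) // 2]
            if mid < min(u[r1][c1], u[r2][c2]) - tol:
                count += 1
    return count

# A single concave bump: every super-level set is a convex blob.
one_bump = [[-((r - 2) ** 2 + (c - 2) ** 2) for c in range(5)] for r in range(5)]
# Two separated peaks: the level set {u >= 1} is disconnected, hence non-convex.
two_peaks = [[1, 0, 0, 0, 1]]
```

Because the check compares raw mask values rather than a thresholded segmentation, it applies to every threshold at once, which is the sense in which such a prior is threshold-free.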

2605.19207 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

面向低资源医疗环境的医学影像量化机器学习模型

Sumanth Meenan Kanneti, Aryan Shah

发表机构 * Georgia State University(佐治亚州立大学)

AI总结 本文提出一种用于MRI脑肿瘤分类的多策略压缩框架,包括量化感知训练、从DenseNet-101教师模型到紧凑DenseNet-32学生模型的知识蒸馏,以及在轻量级MobileNetV2骨干网络上的Float16训练后量化;其中仅MobileNetV2流程给出了完整实验验证,量化后模型大小缩减6.14倍而准确率基本不变。

详情
AI中文摘要

深度学习模型在医学影像分析中表现出色,但受计算、内存和功耗的限制,在低资源临床环境中部署仍然困难。本文提出一种用于MRI脑肿瘤分类的多策略压缩框架,涵盖量化感知训练、从DenseNet-101教师模型到紧凑DenseNet-32学生模型的知识蒸馏(并辅以低比特训练后量化),以及在轻量级MobileNetV2骨干网络上的Float16训练后量化。我们使用一个包含胶质瘤、脑膜瘤、垂体瘤和健康对照的多类别脑肿瘤MRI数据集,对基于MobileNetV2的流程进行了完整的实验验证:通过三阶段迁移学习训练分类器,并通过TensorFlow Lite进行Float16量化。基于DenseNet的蒸馏和量化感知训练策略作为框架内的互补压缩方法加以描述,其完整的实证评估留待未来工作。MobileNetV2流程上的实验结果表明,量化模型的验证准确率为82.37%,而全精度基线为82.20%;模型大小从35.34 MB减少到5.76 MB,压缩比为6.14倍,且没有明显的精度损失。逐类别评估证实,量化在全部四个肿瘤类别上均一致地保持了诊断性能。这些发现表明,轻量级量化模型可以在资源受限的医疗环境中提供临床可行的脑肿瘤筛查。

英文摘要

Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.
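The headline figures can be re-derived from the numbers quoted in the abstract. The short sketch below only reproduces that arithmetic; the helper names are illustrative, and nothing here says where the size reduction comes from, since the abstract does not break it down.

```python
def compression_ratio(original_mb, compressed_mb):
    """How many times smaller the compressed model is."""
    return original_mb / compressed_mb

def size_reduction_percent(original_mb, compressed_mb):
    """Share of the original size that was removed, in percent."""
    return 100.0 * (1.0 - compressed_mb / original_mb)

# Figures quoted in the abstract for the MobileNetV2 pipeline.
ratio = compression_ratio(35.34, 5.76)
reduction = size_reduction_percent(35.34, 5.76)
accuracy_delta = 82.37 - 82.20  # quantized minus full-precision, in points

print(f"compression ratio: {ratio:.2f}x")
print(f"size reduction: {reduction:.1f}%")
print(f"accuracy change: {accuracy_delta:+.2f} points")
```

Stated this way, the reported 6.14x ratio corresponds to removing roughly 84% of the original file size, with a validation accuracy difference of under a fifth of a percentage point.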

2605.19155 2026-05-20 cs.CV 版本更新

Efficient coding along the visual hierarchy

视觉层次中的高效编码

Ananya Passi, Brian S. Robinson, Michael F. Bonner

发表机构 * Department of Cognitive Science, Johns Hopkins University(约翰霍普金斯大学认知科学系) Applied Physics Laboratory, Johns Hopkins University(约翰霍普金斯大学应用物理实验室)

AI总结 本文研究高效编码原理能否在有限数据下构建与人类对齐的视觉特征层次,提出一种无监督学习方法:网络每一层将输入压缩到自然图像的主要变化模式上,生成从边缘和颜色到纹理和形状的特征;与监督微调结合后,可提高与大脑的对齐程度并加快类别学习。

Comments 34 pages, 6 figures

详情
AI中文摘要

生物视觉系统能够从有限的经验中学习,这与依赖数百万张训练图像的深度学习模型不同。是什么学习原理使之成为可能?我们检验了高效编码(即神经表征捕捉自然输入统计结构的观点)能否从有限数据中构建与人类对齐的视觉特征层次。我们开发了一种无监督学习过程,其中深度网络的每一层仅利用局部统计信息,不使用标签、任务或反向传播,将其输入压缩到自然图像的主要变化模式上。这一无监督过程得到的特征从边缘和颜色逐步过渡到纹理和形状。该深度高效编码模型的特征易于被人类观察者识别,并能预测人类视觉皮层中由图像诱发的fMRI响应。此外,将高效编码与监督微调相结合的混合学习过程在低数据设置下能带来更好的大脑对齐,并加快类别学习。这些发现表明,高效编码可能塑造了整个视觉层次中的表征,并有助于解释生物视觉的数据效率。

英文摘要

Biological visual systems learn from limited experience, unlike deep learning models that rely on millions of training images. What learning principles make this possible? We tested whether efficient coding, the idea that neural representations capture the statistical structure of natural inputs, can build a hierarchy of human-aligned visual features from limited data. We developed an unsupervised learning procedure in which each layer of a deep network compresses its inputs onto the dominant modes of variation in natural images, using only local statistics and no labels, tasks, or backpropagation. This unsupervised procedure yields features that progress from edges and colors to textures and shapes. The features of this deep efficient coding model are readily recognized by human observers and are predictive of image-evoked fMRI responses in human visual cortex. Furthermore, a hybrid learning procedure that combines efficient coding with supervised fine-tuning yields better brain alignment in low-data settings and more rapid category learning. These findings suggest that efficient coding may shape representations across the entire visual hierarchy and help explain the data efficiency of biological vision.

2605.19137 2026-05-20 cs.CV 版本更新

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

迈向数据高效的视频预训练:使用冻结的图像基础模型

Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

AI总结 本文探讨了如何通过冻结预训练的图像基础模型并仅训练时间模块来实现数据高效的视频预训练,从而减少对大规模视频数据和计算资源的需求。

Comments Accepted to CVPR 2026 Workshops CV4Smalls

详情
AI中文摘要

视频基础模型在许多视频理解任务上表现出色,但通常需要在海量视频数据集上进行大规模预训练,带来高昂的数据和计算成本。相比之下,现代图像基础模型已经能够提供强大的空间表征。这引出一个重要问题:能否通过复用这些空间表征、仅针对时间推理进行预训练,来构建有竞争力的视频模型?我们初步探索了一种轻量级训练范式:冻结预训练的图像基础模型,仅训练一个循环时间模块来处理流式视频。通过将图像基础模型复用为空间编码器,与端到端视频预训练相比,这种方法有望显著减少所需的视频数据和计算量。在本工作中,我们在投入视频预训练所需的算力之前,先探索这种方法的可行性。在多个视频理解任务上的实证结果表明,无需大规模视频预训练也可以获得较强的时间建模性能,这为未来的工作提供了动机:通过在冻结的图像基础模型之上预训练时间模块来获得循环视频基础模型。代码:https://github.com/tue-mps/towards-video-image-frozen

英文摘要

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .

2605.19133 2026-05-20 cs.CV cs.AI 版本更新

Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

知道何时不预测:用于更安全糖尿病视网膜病变筛查的自监督学习与弃权

Muskaan Chopra, Lorenz Sparrenberg, Jan H. Terheyden, Rafet Sifa

发表机构 * Rheinische Friedrich-Wilhelms-Universität Bonn(波恩大学) University Hospital Bonn - Department of Ophthalmology(波恩大学医院眼科) Fraunhofer IAIS(弗劳恩霍夫智能分析与信息系统研究所) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习与人工智能研究所)

AI总结 本文研究自监督学习的预训练时长如何影响校准置信度和基于置信度的弃权策略,发现自监督预训练相比从头训练能改善选择性预测,但更长的预训练并不总能提高可靠性,强调了考虑弃权的评估的重要性。

Comments Accepted at IJCAI 2026

详情
AI中文摘要

自监督学习(SSL)如今已是预训练医学图像模型的标准方法,但其性能仍主要以下游准确率来衡量。对于糖尿病视网膜病变分级这类安全攸关的筛查任务,这还不够:模型还必须知道自己的预测何时不可靠,并将不确定的病例转交临床复核。在本工作中,我们考察SSL预训练时长如何影响校准置信度和基于置信度的弃权。我们在固定的微调协议下评估多个SSL检查点,并评估校准置信度、覆盖率、选择性准确率和选择性宏F1。在不同数据集和数据规模设定下,SSL预训练相比从头训练均能改善选择性预测。与以往主要评估下游准确率或AUROC的SSL研究不同,我们分析的是在基于校准置信度的弃权下,SSL预训练时长如何影响置信度行为。然而,一旦准确率趋于饱和,不同检查点之间的选择性性能仍可能有明显差异,而且更长的预训练并不总能提高可靠性。这些结果凸显了考虑弃权的评估的重要性,并表明预训练时长应被视为与可靠性相关的重要设计选择,而不仅仅是计算层面的细节。代码已在GitHub上公开。

英文摘要

Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.
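Coverage and selective accuracy, two of the quantities named above, follow directly from a confidence threshold: the model answers only when its confidence clears the threshold and defers otherwise. The sketch below uses the standard definitions; the function name, the threshold, and the toy predictions are invented for illustration, and calibration itself is not modeled.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Return (coverage, selective accuracy) for confidence-based abstention.

    Coverage is the fraction of cases the model answers. Selective accuracy
    is the accuracy on answered cases only, or None if everything is deferred.
    """
    kept = [(p, y) for c, p, y in zip(confidences, predictions, labels) if c >= threshold]
    coverage = len(kept) / len(labels)
    if not kept:
        return coverage, None
    accuracy = sum(1 for p, y in kept if p == y) / len(kept)
    return coverage, accuracy

# Toy example (invented): five fundus images with predicted DR grades.
conf = [0.95, 0.90, 0.60, 0.55, 0.80]
pred = [0, 1, 2, 1, 0]
true = [0, 1, 1, 1, 2]

cov_all, acc_all = selective_metrics(conf, pred, true, threshold=0.0)
cov_sel, acc_sel = selective_metrics(conf, pred, true, threshold=0.7)
```

Sweeping the threshold traces out the trade-off between how many cases are answered and how accurate those answers are, so two checkpoints with the same overall accuracy can still differ along this curve.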

2605.19130 2026-05-20 cs.LG cs.AI cs.CL cs.CV 版本更新

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

EgoBabyVLM:基于自然主义第一人称视频数据的跨模态学习基准测试

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

发表机构 * Meta Superintelligence Labs(Meta超智能实验室) Stanford University(斯坦福大学) Meta Reality Labs(Meta现实实验室) The University of Tokyo(东京大学)

AI总结 本文探讨儿童如何从有限的视觉-语言输入中稳健地习得语言接地(grounding)能力,评估了视觉-语言模型在弱对齐第一人称视频上的表现,并提出EgoBabyVLM挑战,以推动模型从自然主义数据中进行接地的语言学习。

详情
AI中文摘要

儿童能够从有限的视觉-语言输入中以惊人的稳健性习得语言接地(language grounding)能力,其方式超越了当今最好的大型多模态模型。近期研究表明,当前在精选网络数据上训练的视觉-语言模型(VLMs)无法泛化到由可穿戴设备、具身智能体和婴儿头戴摄像机产生的稀疏、弱对齐的第一人称视频流,而且目前没有固定的评估流程来衡量这一场景下的进展。我们在视觉与语言输入语义对齐程度各异的数据集上训练VLMs,其中包括自然主义的婴儿和成人第一人称视频,并用一套涵盖多模态语言接地以及单模态视觉和语言任务的综合评估套件对其进行评估。该套件的核心是Machine-DevBench,这是一个基于语料库、考察词汇和语法能力的基准,按对数频率分箱从模型的训练词汇中自动生成,以消除以往发展性基准存在的训练/评估不匹配和统计功效偏低的问题。我们的结果表明,当前的VLM范式依赖于精选数据的紧密语义对齐,无法利用在自然主义第一人称输入中占主导的弱对齐信号,而这恰恰是人类得心应手的场景。为了推动进展,我们提出EgoBabyVLM挑战,以促进能够从人类婴儿所经历的这类自然主义数据中进行接地语言学习的模型的开发。

英文摘要

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

2605.19111 2026-05-20 cs.CV cs.AI 版本更新

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

FAGER:基于事实的文本到图像模型评估与改进

Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram

发表机构 * Boston University(波士顿大学) Adobe(Adobe公司) IBM Research(IBM研究院)

AI总结 本文提出FAGER框架,用于评估和改进文本到图像模型的事实准确性,通过结合LLM生成事实和参考引导的视觉事实提取与验证,构建结构化事实评估标准,并通过VLM进行评估,验证FAGER在事实性测试中优于现有方法,并能无训练改进T2I输出。

Comments It was accepted for an oral presentation at the 2nd Workshop on the Evaluation of Generative Foundation Models (EVGENFM2026) at CVPR 2026. Total 8 pages (1 page for references). 5 figures

详情
AI中文摘要

现有的文本到图像(T2I)评估指标主要评估生成图像是否与提示中明确陈述的信息一致,但往往无法捕捉隐含的、依赖外部知识的或决定对象身份的事实性要求。因此,它们并不适合评估涉及科学知识、历史事实、产品或特定文化概念的提示的事实正确性。我们提出FActually Grounded Evaluation and Refinement(FAGER),这是一个智能体框架,用于评估生成图像是否正确反映了提示中包含或隐含的、可通过视觉验证的事实,同时提供可操作的改进反馈。FAGER首先将基于LLM的事实提议与参考引导的视觉事实提取和验证相结合,构建结构化的事实评估细则,然后将该细则转换为问答对,用于基于VLM的评估。为了验证FAGER作为事实性度量的有效性,我们引入事实性A/B测试,衡量一个度量是否更偏好符合事实的参考图像而非对应的生成图像。在涵盖科学、历史、产品、文化和知识密集型概念的五个数据集上,FAGER在该测试中始终优于已有度量。我们进一步表明,FAGER可以以完全无需训练的方式用于改进T2I输出,在各数据集上带来显著的事实性提升。

英文摘要

Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.
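The Factual A/B test described above reduces to a simple win rate: over pairs of (reference image, generated image), how often does a metric score the factual reference strictly higher? The sketch below shows that bookkeeping with a stand-in metric. The rubric format, the representation of images as sets of observed facts, and all names are assumptions made for illustration; FAGER's actual rubric construction and VLM-based scoring are not reproduced.

```python
def factual_ab_accuracy(metric, pairs):
    """Fraction of pairs where the metric prefers the factual reference image.

    pairs is a list of (prompt, reference_image, generated_image).
    Ties count as failures because the metric did not express a preference.
    """
    wins = sum(1 for prompt, ref, gen in pairs if metric(prompt, ref) > metric(prompt, gen))
    return wins / len(pairs)

# Stand-in rubric: facts an image of each prompt should visibly satisfy.
RUBRIC = {
    "Eiffel Tower": {"iron lattice", "four legs"},
    "water molecule": {"one oxygen", "two hydrogens"},
}

def toy_metric(prompt, image_facts):
    """Share of rubric facts present in the image (images are sets of facts here)."""
    required = RUBRIC[prompt]
    return len(required & image_facts) / len(required)

pairs = [
    ("Eiffel Tower", {"iron lattice", "four legs"}, {"iron lattice"}),
    ("water molecule", {"one oxygen", "two hydrogens"}, {"one oxygen", "two hydrogens"}),
]
accuracy = factual_ab_accuracy(toy_metric, pairs)
```

In the second pair the generated image satisfies every rubric item, so the metric ties and scores no win; whether ties should count for or against a metric is a design choice the abstract does not settle.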

2605.19075 2026-05-20 cs.CV cs.AI 版本更新

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

CRAFT: 基于批评的自适应关键帧目标定位用于多模态视频问答

Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan, Akhil Gorugantu, David Doermann

发表机构 * University at Buffalo(布法罗大学) New York University(纽约大学)

AI总结 该研究提出CRAFT方法,结合动态关键帧选择、带多语言回退的逐视频ASR以及混合评判循环,在整合之前迭代地验证和修复声明,并在MAGMaR 2026上取得最佳的总体平均分、参考召回率和引用F1。

Comments Accepted at ACL 2026 Multimodal Augmented Generation via MultimodAl Retrieval Workshop

详情
AI中文摘要

针对现实世界新闻事件的有据可依的多视频问答,要求系统在异构视频档案中找出与查询相关的证据,并将每条声明归因到支持它的来源。我们提出CRAFT(Critic-Refined Adaptive Key-Frame Targeting),这是一个以查询为条件的流程,结合了动态关键帧选择、带多语言回退的逐视频ASR,以及在整合之前迭代验证和修复声明的混合评判循环。该流程集成了UNLI时间蕴含、DeBERTa-v3跨声明筛查和Llama-3.2-3B裁决器,并在最后的引用合并阶段将每个事实只输出一次,同时附上所有支持来源的标识符。在MAGMaR 2026上,CRAFT取得了最佳的总体平均分(0.739)、参考召回率(0.810)和引用F1(0.635)。我们还在一个按MAGMaR风格转换的WikiVideo数据上进行了评估,其中包含52个互不重叠的事件查询,CRAFT同样表现出色(平均0.823),表明其以声明为中心的证据聚合能够泛化到MAGMaR之外。消融实验表明,原子声明、ASR和评判循环是相对于基础的查询条件基线取得主要增益的来源。代码和实现细节公开于 https://github.com/bhosalems/CRAFT 。

英文摘要

Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.
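The final citation-merging stage, emitting each fact once with all supporting source identifiers, can be sketched as a grouping step. The version below matches facts by normalized text (case and whitespace), which is an assumption made to keep the example small; a real system would likely need semantic matching, and the claim strings and video IDs are invented.

```python
def merge_citations(claims):
    """Group (fact_text, source_id) pairs so each fact appears once.

    Facts are matched on lowercased, whitespace-collapsed text. Output keeps
    first-seen order for both facts and their sources, without duplicates.
    """
    merged = {}
    for fact, source in claims:
        key = " ".join(fact.lower().split())
        entry = merged.setdefault(key, {"fact": fact, "sources": []})
        if source not in entry["sources"]:
            entry["sources"].append(source)
    return list(merged.values())

# Invented claims extracted from two hypothetical videos.
claims = [
    ("The bridge collapsed on Monday.", "vid_03"),
    ("the bridge  collapsed on monday.", "vid_11"),
    ("Two people were injured.", "vid_03"),
    ("The bridge collapsed on Monday.", "vid_03"),
]
merged = merge_citations(claims)
```

Keeping every supporting source on the merged fact is what lets a citation-level score credit the answer for each video that backs the claim.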

2605.19074 2026-05-20 cs.CV cs.AI 版本更新

Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

通过多时间尺度预测学习光伏功率输出预测中的长期时间依赖性

Sumit Laha, Ankit Sharma, Hassan Foroosh

发表机构 * Department of Computer Science University of Central Florida Orlando, Florida, United States(计算机科学系 中佛罗里达大学 奥兰多 佛罗里达州 美国)

AI总结 本文提出一种多时间尺度预测框架,通过联合优化多个未来值来提高深度神经网络对隐含的步间时间依赖性的捕捉能力,从而提升光伏功率输出预测的准确性和鲁棒性。

详情
AI中文摘要

全球太阳能光伏(PV)装机容量迅速扩张,2024年达到创纪录的597 GW,这凸显出迫切需要稳健的预测模型,以缓解太阳辐照度的间歇性所导致的电网不稳定。尽管基于地基天空图像(GSI)的深度学习直接预测已成为主流方法,现有文献往往局限于单一架构的评估,并且只关注单时间尺度(点)预测。本文提出从传统的单时间尺度估计转向多时间尺度预测框架,从而带来与架构无关的准确率提升。我们提出假设并通过实验证明:对一系列未来值进行联合优化,可以避免网络在权重梯度和滤波器多样性两方面过早收敛,使深度神经网络更好地捕捉潜在的步间时间依赖性。借助这一与架构无关的改进,并将连续的天空图像与历史光伏发电数据相结合,我们评估了模型同时预测多个离散未来时间步功率输出的能力。我们的方法通过在多种深度学习架构上的对比分析得到验证。结果表明,这种多时间尺度方法在整个预测时域内显著提高了预测准确性和鲁棒性,同时保持了计算上的简约。与单时间尺度模型相比,该方法以可忽略的额外开销取得了更优的性能,为提高现代电网的韧性提供了一种可扩展且高效的解决方案。

英文摘要

The rapid global expansion of solar photovoltaic (PV) capacity, which reached a record 597 GW in 2024, highlights the urgent need for robust forecasting models to mitigate the grid instability caused by the intermittent nature of solar irradiance. While deep learning-based direct forecasting using ground-based sky images (GSI) has emerged as a dominant approach, existing literature is often constrained by single-architecture evaluations and an exclusive focus on single-horizon (point) prediction. This paper proposes a transition from traditional single-horizon estimation toward a multi-horizon forecasting framework, leading to an architecture-independent improvement in accuracy. We hypothesize and demonstrate experimentally that joint optimization over a sequence of future values allows deep neural networks to better capture latent inter-step temporal dependencies by avoiding premature convergence of the network in terms of both weight gradients and filter diversity. Leveraging this architecture-independent improvement that integrates sequential sky imagery with historical PV generation data, we evaluate the models' abilities to predict power output across multiple discrete future time steps simultaneously. Our methodology is validated through a comparative analysis across diverse deep learning architectures. The results demonstrate that this multi-horizon approach significantly enhances predictive accuracy and robustness across the entire forecast horizon while maintaining computational parsimony. By achieving superior performance with negligible overhead compared to single-horizon models, this work provides a scalable and efficient solution to improve the resilience of modern power grids.
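The difference between single-horizon and multi-horizon training comes down to what the loss averages over. The sketch below contrasts the two with a plain mean squared error; the abstract does not state the actual loss or any horizon weighting, so equal weights, the function names, and the toy values are all assumptions.

```python
def multi_horizon_mse(predictions, targets):
    """Mean squared error over all samples and all future steps jointly.

    predictions and targets are lists of sequences, one value per horizon.
    Every horizon contributes equally (an assumption).
    """
    total, count = 0.0, 0
    for pred_seq, target_seq in zip(predictions, targets):
        for p, t in zip(pred_seq, target_seq):
            total += (p - t) ** 2
            count += 1
    return total / count

def single_horizon_mse(predictions, targets, step):
    """Mean squared error at one future step only (point forecasting)."""
    errors = [(p[step] - t[step]) ** 2 for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)

# Toy PV outputs in kW for two samples and three future steps (invented).
preds = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
truth = [[1.0, 2.0, 5.0], [2.0, 4.0, 2.0]]

joint_loss = multi_horizon_mse(preds, truth)
first_step_loss = single_horizon_mse(preds, truth, step=0)
last_step_loss = single_horizon_mse(preds, truth, step=2)
```

A model trained only on the first step would see zero loss on this toy batch, while the joint objective still produces a training signal from the later steps.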

2605.19060 2026-05-20 cs.CV cs.AI eess.IV 版本更新

LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

LiFT:用于从2D生成器生成3D图像的提升跨切片特征轨迹

Xinhe Zhang, Yuyang Zhang, Pengfei Jin, Arnau Marin-Llobet, Na Li, Quanzheng Li

发表机构 * School of Engineering and Applied Sciences, Harvard University(哈佛大学工程与应用科学学院) Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School(马萨诸塞总医院和哈佛医学院高级医学计算与分析中心) Kempner Institute, Harvard University(哈佛大学凯普纳研究所)

AI总结 本文提出LiFT框架,通过将3D体积合成分解为单切片图像生成和跨切片轨迹学习,解决高分辨率3D医学图像生成中体积模型计算成本高和2D切片生成器在第三维度上无法保持解剖一致性的问题。

详情
AI中文摘要

高分辨率3D医学图像生成仍然具有挑战性,因为完全体积化的模型计算成本高昂,而高效的2D切片生成器往往无法在第三个维度上保持解剖一致性。我们提出LiFT(Lifted inter-slice Feature Trajectories),该框架将3D体积合成分解为逐切片图像生成和跨切片轨迹学习。LiFT不对体积分布进行端到端建模,而是将一个体积视为特征空间中的一条有序轨迹,刻画解剖结构沿深度方向如何出现、变化和消失。三平面漂移损失将生成切片的轨迹与真实体积的轨迹对齐,使无条件生成中能够对跨切片的演变进行分布层面的学习;在配对翻译中,一个以配准后的目标为监督训练的双向$z$-上下文混合器提供层间(through-plane)连贯性,同时保持逐切片的保真度。我们在BraTS 2023(无条件生成和缺失模态MR)和SynthRAD2023(MR到CT)上评估LiFT。在这些设置中,LiFT保持了逐切片质量,在推理成本降低约$135\times$的情况下接近已报道的cWDM缺失MR重建质量(未做正式的等效性检验),并在MR到CT任务上相对于不含映射器的消融设置提高了层间连贯性,表明轻量级的跨切片轨迹学习是实现高分辨率3D医学合成的一条可行途径。

英文摘要

High-resolution 3D medical image generation remains challenging because fully volumetric models are computationally expensive, while efficient 2D slice generators often fail to preserve anatomical consistency across the third dimension. We propose LiFT, a framework for Lifted inter-slice Feature Trajectories that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. Rather than modeling the volumetric distribution end-to-end, LiFT treats a volume as an ordered trajectory in feature space, capturing how anatomical structures appear, transform, and disappear across depth. A tri-planar drifting loss aligns the trajectory of generated slices with the trajectories of real volumes, enabling distributional learning over inter-slice progressions in unconditional generation; in paired translation, a bidirectional $z$-context mixer trained against the registered target supplies through-plane coherence while preserving per-slice fidelity. We evaluate LiFT on BraTS 2023 (unconditional and missing-modality MR) and SynthRAD2023 (MR-to-CT). Across these settings, LiFT preserves per-slice quality, approaches the reported cWDM missing-MR reconstruction quality at $\sim 135\times$ lower inference cost (without formal equivalence testing), and improves through-plane coherence on MR-to-CT relative to a no-mapper ablation, demonstrating that lightweight inter-slice trajectory learning is a viable route to high-resolution 3D medical synthesis.

2605.19033 2026-05-20 cs.RO cs.AI cs.CV cs.LG cs.MA 版本更新

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

RLFTSim: 通过强化学习微调实现逼真且可控的多智能体交通仿真

Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee

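Affiliations * University of Alberta; Huawei Technologies Canada; York University; Canada CIFAR AI Chair, Amii

AI summary This paper proposes RLFTSim, a framework that uses reinforcement learning fine-tuning to improve the realism of traffic simulation scenarios and distills controllability through goal conditioning. Experiments on the Waymo Open Motion Dataset show improved realism, and the method needs significantly fewer samples than heuristic search-based fine-tuning methods.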

Comments CVPR 2026 Highlight; Project page at https://ehsan-ami.github.io/rlftsim

详情
AI中文摘要

监督式开环训练已被广泛用于训练交通仿真模型;然而,它无法捕捉复杂驾驶场景中固有的动态性和多智能体交互。我们引入RLFTSim,一种基于强化学习的微调框架,通过将模拟器运行与真实世界数据分布对齐来增强场景真实性,并提供一种方法用于在场景生成中提炼目标条件化的可控性。我们基于预训练的仿真模型实例化RLFTSim,设计一种平衡保真度和可控性的奖励函数,并在Waymo Open Motion Dataset上进行了全面实验。我们的结果表明在真实感方面取得了改进,实现了最先进的性能。与其它基于启发式搜索的微调方法相比,RLFTSim由于提出了一种低方差且密集的奖励信号,所需样本显著更少,并且通过设计直接解决了真实感对齐问题。我们还通过目标条件化展示了我们方法在提炼交通仿真可控性方面的有效性。项目页面可在https://ehsan-ami.github.io/rlftsim上访问。

英文摘要

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.

2605.19032 2026-05-20 cs.CV 版本更新

Personalized Face Privacy Protection From a Single Image

基于单张图像的个性化面部隐私保护

Zachary Yahn, Fatih Ilhan, Tiansheng Huang, Selim Tekin, Sihao Hu, Yichang Xu, Margaret Loper, Ling Liu

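Affiliations * Georgia Institute of Technology

AI summary This paper proposes FaceCloak, a system that generates a personalized face privacy mask from a single image of a user to make facial recognition fail. Experiments on three face datasets across ten recognition models show its effectiveness compared with 29 existing methods.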

详情
AI中文摘要

在线上传的面部照片容易受到恶意行为者的攻击,他们可以刮取面部图像并通过未经授权的面部识别模型侵犯个人隐私。本文提出了FaceCloak,一种新颖的个性化面部隐私保护系统,该系统能够从用户单张图像生成防御性身份特定的通用面部隐私掩码,使面部识别失败。FaceCloak引入了三阶段的个性化面部扰动学习方法:(1)基于用户的单张图像生成少量高多样性的合成面部图像;(2)通过迭代扰动生成在合成图像的小集合上学习面部伪装,通过增加关键面部身份泄露区域的保护,有效将用户的身份嵌入推向遥远的锚身份并远离相似身份;(3)生成以像素级伪装形式的个性化身份保护掩码,该掩码轻量且可以高效应用于任何用户的面部图像,同时保持良好的感知质量。在三个流行面部数据集上对十个识别模型的广泛实验显示,FaceCloak相比29种其他现有代表性方法更有效。代码可在https://github.com/zacharyyahn/FaceCloak获取。

英文摘要

Photos of faces uploaded online are vulnerable to malicious actors who can scrape facial images from online sources and intrude on personal privacy via unauthorized use of facial recognition models. This paper presents FaceCloak, a novel personalized face privacy protection system, which can generate defensive identity-specific universal face privacy masks from a single image of a user, causing facial recognition to fail. FaceCloak introduces a three-stage personalized face perturbation learning methodology: (1) It generates a small set of high-variety synthetic face images of a person based on a single image of the person. (2) It learns face cloaking by adding more protection to key facial-identity leakage regions through iterative perturbation generation over the small set of synthetic images, effectively shifting a user's identity embedding towards a distant anchor identity and away from a similar one. (3) It generates a personalized identity-protective mask in the form of pixel-wise cloaking, which is light-weight and can be efficiently applied to any facial image of a user while maintaining good perceptual quality. Extensive experiments on three popular face datasets across ten recognition models show the effectiveness of FaceCloak compared to 29 other existing representative methods. Code is available at https://github.com/zacharyyahn/FaceCloak

2605.18464 2026-05-20 cs.CV 版本更新

PERL: Parameter Efficient Reasoning in CLIP Latent Space

PERL:在CLIP潜在空间中实现参数高效的推理

Simone Carnemolla, Salvatore Calcagno, Daniela Giordano, Concetto Spampinato, Matteo Pennisi

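Affiliations * University of Catania

AI summary This paper proposes PERL, a framework for parameter-efficient adaptation through iterative latent reasoning in CLIP latent space. Across multiple benchmarks it achieves the best parameter-performance trade-off among the compared methods, reaching strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters.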

Comments Submitted to NeurIPS 2026

详情
AI中文摘要

对比训练的视觉-语言模型,如CLIP,通过在共享嵌入空间中对齐图像和文本,提供了强大的零样本迁移能力。然而,将这些模型适应到下游任务而不影响其开放词汇泛化能力仍然具有挑战性。现有的参数高效适应方法通常通过学习的提示、适配器或多模态转换来提高任务专业化,其中适应能力主要通过额外的可训练参数来表达。受最近语言模型中潜在推理方法的启发,我们探讨了一种互补的视角:适应是否可以来自于对潜在表示的迭代推理,而不是仅仅通过增加参数数量?我们介绍了PERL(在CLIP潜在空间中实现参数高效的推理),一种轻量级的适应框架,它通过在冻结的CLIP模型上添加一个紧凑的共享推理模块,在多次细化步骤中反复应用。在每一步中,PERL根据当前的表示生成一个潜在推理标记,并将其注入到中间编码器层中,逐步细化更高层次的语义表示,同时保持CLIP的预训练多模态结构。在15个基准测试中,涵盖基础到新颖泛化、跨数据集迁移以及非分布ImageNet变体,PERL在快速适应的少样本设置下,实现了与其他方法相比最佳的参数-性能权衡,仅使用约6K可训练参数,比最大的比较方法少817倍,同时结合了强的新类别准确率和具有竞争力的迁移性能。总体而言,我们的结果表明,迭代的潜在推理为判别视觉-语言模型中的参数扩展提供了一种互补的适应机制。

英文摘要

Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP's pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.

2605.18445 2026-05-20 cs.CV cs.AI cs.CL cs.LG 版本更新

What's Holding Back Latent Visual Reasoning?

是什么在阻碍潜在视觉推理?

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

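Affiliations * Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect; Carnegie Mellon University

AI summary This study examines how existing models use latent tokens and finds that they play only a limited role in the final prediction. It identifies two main causes: oracle latent tokens in the training data carry little information beyond the original image, and the latent tokens generated at inference time deviate from their oracle representations. Progress therefore depends on higher-quality data and more precise latent token prediction.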

详情
AI中文摘要

人类通过心理模拟中间视觉步骤来解决复杂视觉问题,而非仅通过语言推理。受此启发,近期有关视觉-语言模型的工作探索了连续潜在令牌作为中间视觉想象步骤的链式推理。在本工作中,我们研究了近期模型如何利用此类潜在令牌。令人惊讶的是,当潜在令牌被无信息的占位符令牌替代时,模型准确性不受影响。这表明潜在令牌在模型最终预测中起最小的因果作用。为了更好地理解这一现象,我们分析了由oracle潜在表示提供的训练信号以及推理时生成的潜在令牌质量。我们的实验揭示了两个阻碍潜在视觉推理的关键问题:首先,在大多数现有数据集中,oracle潜在令牌提供的信息有限,仅超出原始图像,且不显著简化任务,导致模型在训练时忽略它们,并在推理时有效绕过它们。当在诊断数据集上微调时,其中潜在令牌为最终预测提供充分支持,我们显示模型可以因果依赖于它们。其次,在推理时生成的潜在令牌偏离其对应的oracle表示,坍缩到狭窄区域,即使模型依赖它们也无法获得收益。总体而言,我们的发现表明,未来潜在视觉推理的进步取决于两个关键支柱:具有信息性中间步骤的高质量数据集和更精确的潜在令牌预测。

英文摘要

Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning. First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypass them at inference time. When models are fine-tuned on a diagnostic dataset in which latent tokens provide sufficient support for the final prediction, we show that they can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.

2605.18431 2026-05-20 cs.CV 版本更新

Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

协同视见:基于多模态大语言模型的多机器人协作自体空间推理

Kunyu Peng, Zhikun Zhou, Kailun Yang, Di Wen, Ruiping Liu, Yufan Chen, Junwei Zheng, Hao Shi, Yi Zhou, M. Saquib Sarfraz, Danda Pani Paudel, Luc Van Gool

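Affiliations * Karlsruhe Institute of Technology; Hunan University; University of Oxford; Zhejiang University; ETH Zurich; Ant Group

AI summary This paper studies multi-robot cooperative dynamic spatial reasoning. It introduces CoopSR, the first benchmark for the task, and EgoTeam, a multi-robot egocentric QA dataset, and proposes the SP-CoR framework for fine-grained cooperative spatial reasoning, which outperforms the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson.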

详情
AI中文摘要

多模态大语言模型(MLLMs)在自体视频理解方面取得了显著进展,但其从多个具身视角进行协作推理的能力仍鲜有探索。我们通过多机器人协作动态空间推理研究该问题,其中模型必须通过集成同步的自体视频来回答空间、时间、可见性和协调性问题。为此,我们引入了首个针对该任务的基准CoopSR,以及EgoTeam多机器人自体问答数据集。EgoTeam包含114,227个问答对,覆盖19种问题类型、四个难度等级和三种团队规模,在Habitat和iGibson中,以及一个包含约2,326个问题的现实世界测试集。我们进一步提出了SP-CoR(Spectral and Physics-Informed Cooperative Reasoner),一种用于细粒度协作空间推理的MLLM框架。SP-CoR结合了动态感知的多机器人帧采样、光谱和物理引导的视图融合以及物理对齐的提示蒸馏,使模型在训练时能够受益于特权机器人姿态监督,而在测试时仅需自体视频。在22个MLLM基线模型上,SP-CoR在Habitat上比最强的微调基线高出3.87%,在iGibson上高出7.12%。它还展示了更强的泛化能力,适用于未见过的团队规模和现实世界机器人测试。代码可在https://github.com/KPeng9510/seeing-together.git找到。

英文摘要

Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, enabling the model to benefit from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Across 22 MLLM baselines, SP-CoR consistently improves cooperative reasoning, outperforming the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson. It also shows stronger generalization to unseen team sizes and real-world robot tests. Code can be found at https://github.com/KPeng9510/seeing-together.git.

2605.18413 2026-05-20 cs.CV 版本更新

Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models

基础的裂缝:一个挑战视觉基础模型的民用基础设施数据集

Nicola Farronato, Niccolo Avogaro, Thomas Frick, Mattia Rigotti, Rizwan Ullah Khan, Michele Magno, Konrad Schindler, Cristiano Malossi, Florian Scheidegger

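Affiliations * IBM Research; ETH Zürich; University of Twente

AI summary This paper introduces the Cracks in the Foundation (CiF) dataset, whose high-resolution images challenge the dense image understanding of vision foundation models on civil infrastructure and expose the limitations of current models in real-world settings.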

详情
AI中文摘要

自动化结构健康监测对于防止基础设施灾难性失效至关重要。精确的像素级缺陷分割对于准确评估结构完整性至关重要,但进展受制于极少数数据的匮乏,这需要昂贵的专家标注。由于问题固有的算法障碍,如中心偏差和在检查近似无纹理的建筑材料时需要更多依赖形状,数据需求更加突出。为消除瓶颈,我们引入Cracks in the Foundation (CiF),这是迄今为止最大的、最详细的民用基础设施(实例)分割数据集,包含约150,000张高分辨率图像,经过五年与土木工程专家的合作精心编纂。借助这一前所未有的数据源,我们揭示了当前视觉AI的一个盲点:尽管提示式基础模型(FMs)和视觉语言模型(VLMs)已出现,尽管当今专门的分割模型表现出色,但建成环境中的密集图像理解仍远未解决。我们的评估表明,即使是最新的零样本FMs在部署到真实基础设施时也面临重大挑战,甚至专门模型在领域特定监督下的性能也停滞在约25%的mAP。CiF将民用基础设施检查,一个基础且看似简单的感知任务,确立为一个开放挑战,揭示了目前主要在互联网图像上训练的模型的根本性弱点,字面和比喻上都突显了当前基础模型范式的裂缝。

英文摘要

Automated structural health monitoring is essential to prevent catastrophic infrastructure failures. Precise, pixel-level defect segmentation is needed to accurately assess structural integrity, but progress in defect segmentation for civil infrastructures has been held back by an extreme scarcity of data, which requires costly expert annotation. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, including center-bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove the bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising $\approx$150,000 high-resolution images meticulously curated over five years in collaboration with civil engineering experts. With the help of this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable Foundation Models (FMs) and Vision Language Models (VLMs), and despite the impressive abilities of today's specialised segmentation models, it turns out that dense image understanding in the built environment is nowhere near solved. Our evaluations indicate that even the most recent zero-shot FMs face significant challenges when deployed on real-world infrastructure and even the performance of specialised models with domain-specific supervision plateaus at $\approx$25% mAP. CiF establishes inspection of civil infrastructure, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present-day models trained predominantly on internet images, literally and figuratively highlighting cracks in the current foundation model paradigm.

2605.18396 2026-05-20 cs.CV 版本更新

NEWTON: Agentic Planning for Physically Grounded Video Generation

NEWTON:面向物理基础视频生成的代理规划

Yuxiang Feng, Juncheng Wang, Chao Xu, Yijie Qian, Huihan Wang, Wenlong Hou, Yang Liu, Baigui Sun, Yong Liu, Shujun Wang

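Affiliations * Zhejiang University; The Hong Kong Polytechnic University; IROOTECH TECHNOLOGY; Sany Group

AI summary This paper proposes NEWTON, which demotes video generation from the system output to one action in an agent's toolbox. A learned planner orchestrates physics-aware tools to improve the physical plausibility of generated video, raising joint accuracy on VideoPhy-2 from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1.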

Comments project page: https://Newton026.github.io/newton

详情
AI中文摘要

视频生成模型能够产生视觉上吸引人的结果,但系统性地违反物理常识——在VideoPhy-2数据集上,最佳模型仅能达到32.6%的联合准确性。我们识别出一个规范瓶颈:文本提示是对物理世界的损失压缩,省略了完全决定动态的参数,而无论模型规模如何扩大都无法恢复从未指定的内容。从这一诊断中,我们得出物理条件必须满足的三个属性——充分性、动态性和可验证性,并展示现有方法均无法同时满足这三个属性。我们提出了NEWTON,其中视频生成被降级为代理工具箱中的一个动作:学习的规划器协调物理感知工具(关键帧生成、科学计算、提示优化)以构建丰富的条件输入,并通过验证器闭合回路以实现迭代再规划。规划器是唯一可训练的组件,通过Flow-GRPO在实时多轮循环中进行在线优化。在VideoPhy-2数据集上,NEWTON在LTX-Video上将联合准确性从21.4%提升到29.7%,在Veo-3.1上从30.7%提升到37.4%,而无需修改生成器。我们的项目页面:https://Newton026.github.io/newton

英文摘要

Video generation models produce visually compelling results but systematically violate physical commonsense: on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are a lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy (sufficiency, dynamism, and verifiability) and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Our project page: https://Newton026.github.io/newton

2605.17942 2026-05-20 cs.CV 版本更新

UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction

UAVFF3D: 一种面向无人机3D重建的几何感知基准

Xiang Yang, Yongli Wang, HaiFeng Li, Yunsheng Zhang

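Affiliations * School of Geosciences and Info-Physics

AI summary This paper proposes the UAVFF3D benchmark to address reconstruction failures caused by camera-geometry variations in UAV photogrammetry. It combines real and synthetic images with a controlled test subset to support UAV-domain adaptation and robustness evaluation.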

Comments 19 pages, 16 figures, 16 tables

详情
AI中文摘要

尽管前馈3D重建技术取得了快速发展,但当前模型在无人机摄影测量中仍不够可靠。我们认为,这种失败不仅源于外观域偏移,还源于无人机特定的相机几何变化,特别是斜视和HFOV高度模糊。现有无人机数据集主要强调场景多样性,但对相机配置的覆盖有限,限制了鲁棒性评估和无人机领域适应。为解决这一差距,我们引入UAVFF3D,一个面向前馈无人机3D重建的几何感知真实-合成基准。UAVFF3D包含超过170,000张真实无人机图像和超过370,000张由高质量纹理3D模型渲染的合成图像,覆盖多样的HFOV、飞行高度、观看方向和采集模式。它还包含一个受控的HFOV-高度测试子集,用于诊断投影几何模糊。我们进一步提出一个评估协议,联合评估相机几何估计和密集场景重建,通过共享的全局对齐,避免单独相机和几何对齐带来的偏差。在代表性前馈重建模型上的实验表明,基于UAVFF3D的领域适应一致地提高了相机和几何估计,将射线误差降低了高达84.2%,姿态ATE降低了高达76.0%,点距离降低了高达41.1%。在斜视场景中,适应减少了斜视-正视旋转差距高达90.7%。在HFOV-高度模糊情况下,它提高了在不同HFOV-高度配置下的鲁棒性,并在不同HFOV设置下产生了更稳定的性能。结合相机先验进一步改进了在无人机特定采集几何下的重建。数据集和评估代码可在https://github.com/yanxian-ll/UAVFF3D获取。

英文摘要

Feed-forward 3D reconstruction has advanced rapidly, but current models remain unreliable in UAV photogrammetric acquisition. We argue that this failure is caused not only by appearance-domain shift, but also by UAV-specific camera-geometry variations, especially oblique views and HFOV-height ambiguity. Existing UAV datasets mainly emphasize scene diversity and provide limited coverage of camera configurations, which restricts robustness evaluation and UAV-domain adaptation. To address this gap, we introduce UAVFF3D, a geometry-aware real-synthetic benchmark for feed-forward UAV 3D reconstruction. UAVFF3D contains more than 170k real UAV images and more than 370k synthetic images rendered from high-quality textured 3D models, covering diverse HFOVs, flight altitudes, viewing directions, and acquisition patterns. It also includes a controlled HFOV-height test subset for diagnosing projection-geometry ambiguity. We further propose an evaluation protocol that jointly assesses camera-geometry estimation and dense scene reconstruction under a shared global alignment, avoiding the bias caused by separate camera and geometry alignments. Experiments on representative feed-forward reconstruction models show that UAVFF3D-based domain adaptation consistently improves camera and geometry estimation, reducing Ray Error by up to 84.2%, Pose ATE by up to 76.0%, and Chamfer Distance by up to 41.1%. In oblique scenes, adaptation reduces the oblique-nadir rotation gap by up to 90.7%. Under HFOV-height ambiguity, it improves robustness across HFOV-height configurations and yields more stable performance across HFOV settings. Incorporating camera priors further improves reconstruction under UAV-specific acquisition geometries. The dataset and evaluation code are available at https://github.com/yanxian-ll/UAVFF3D .
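
One way to see why HFOV and flight height can be confused is a simple pinhole footprint calculation. The sketch below assumes a nadir-looking camera over flat terrain, which is a simplification introduced here and not the paper's formal definition of HFOV-height ambiguity; the function names and the example values are hypothetical.

```python
import math


def ground_footprint_width(height_m, hfov_deg):
    """Width of ground covered by a nadir-looking pinhole camera over flat terrain."""
    return 2.0 * height_m * math.tan(math.radians(hfov_deg) / 2.0)


def height_for_same_footprint(footprint_m, hfov_deg):
    """Flight height at which a camera with `hfov_deg` covers `footprint_m` of ground."""
    return footprint_m / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))


# A wide lens flown low and a narrow lens flown high cover the same ground width.
wide = ground_footprint_width(height_m=100.0, hfov_deg=90.0)
narrow_height = height_for_same_footprint(wide, hfov_deg=60.0)
narrow = ground_footprint_width(narrow_height, hfov_deg=60.0)
```

Under these assumptions, a 90-degree HFOV at 100 m and a 60-degree HFOV at roughly 173 m cover the same ground width, so image extent alone does not pin down either quantity.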

2605.17916 2026-05-20 cs.CV 版本更新

PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis

PanoWorld: 一种生成式空间世界模型,用于一致的整屋全景合成

Jinrang Jia, Zhenjia Li, Yijiang Hu, Yifeng Shi

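Affiliations * Ke Holdings Inc.

AI summary This paper proposes PanoWorld, a generative spatial world model that achieves consistent whole-house panorama synthesis by autoregressively generating node-based 360-degree panoramas. It addresses two problems: pure 2D generators re-imagine geometry and materials when the viewpoint changes, and monolithic 3D generation is expensive and loses fine texture at multi-room scale.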

Comments 17

详情
AI中文摘要

生成一致的整屋VR游览需要逼真的全景图和跨视角的空间一致性。纯2D生成器产生吸引人的单个全景图,但在视角变化时重新想象几何和材质,而单一3D生成在多房间尺度下变得昂贵且丢失细纹理。我们引入PanoWorld,一种生成式空间世界模型,将整屋合成视为自回归生成基于节点的360度全景图,匹配真实VR游览产品使用的离散导航。PanoWorld使用由平面图派生的3D壳体作为全局几何代理,并使用动态3D高斯点云缓存作为可渲染的空间记忆。一个用于度量尺度多房间360度输入的前馈全景LRM将生成的全景图提升到局部360度高斯点云更新,同时房间感知的组注意机制抑制跨房间特征干扰。一种拓扑感知的渐进缓存策略将这些局部更新融合,而无需反复重建完整历史。通过将基于壳体的几何指导与缓存渲染的视觉记忆解耦,PanoWorld在保持高频率2D合成质量的同时,提高了跨节点布局和材质一致性。项目链接是https://jjrcn.github.io/PanoWorld-project-home/

英文摘要

Generating a consistent whole-house VR tour from a floorplan and style reference requires both photorealistic panoramas and cross-view spatial coherence. Pure 2D generators produce appealing single panoramas but re-imagine geometry and materials when the viewpoint changes, whereas monolithic 3D generation becomes expensive and loses fine texture at multi-room scale. We introduce PanoWorld, a generative spatial world model that treats whole-house synthesis as autoregressive generation of node-based 360-degree panoramas, matching the discrete navigation used by real VR tour products. PanoWorld uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed-forward panoramic LRM designed for metric-scale multi-room 360-degree inputs lifts generated panoramas into local 3DGS updates, while Room-aware Group Attention suppresses cross-room feature interference. A topology-aware progressive caching strategy fuses these local updates without repeatedly reconstructing the full history. By decoupling shell-based geometry guidance from cache-rendered visual memory, PanoWorld preserves high-frequency 2D synthesis quality while improving cross-node layout and material consistency. The project link is https://jjrcn.github.io/PanoWorld-project-home/

2605.17470 2026-05-20 cs.CV cs.MM eess.IV 版本更新

EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution

EchoSR: 为轻量图像超分辨率实现高效的上下文利用

Hanli Zhao, Binhao Wang, Shihao Zhao, Tao Wang, Kaihao Zhang, Wanglong Lu

Affiliations * College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325000, China; vivo BlueImage Lab, vivo Mobile Communication Co., Ltd, Shanghai 200100, China; College of Engineering and Computer Science, Australian National University, Canberra, Australia; The AI/Analytics Team, Nasdaq, St. John’s, Canada

AI summary This paper proposes the EchoSR framework, which unifies multi-scale receptive field modeling and hierarchical context fusion to improve the efficiency and quality of lightweight image super-resolution. It outperforms existing methods on multiple benchmarks while running about twice as fast.

Comments Accepted by Information Fusion; 20 pages, 17 figures

详情
AI中文摘要

图像超分辨率(SR)旨在从低分辨率(LR)输入中重建高质量、高分辨率(HR)图像,并在各种下游应用中发挥关键作用。尽管近年来取得了进展,但平衡重建保真度和计算效率仍然是一个根本性挑战,尤其是在资源受限的场景中。虽然现有轻量方法试图扩展感受野,但许多方法要么导致显著的计算开销,要么简单地扩大内核大小,或缺乏机制进行一致的多尺度整合,限制了它们的整体效果和可扩展性。为了解决这些限制,我们提出了EchoSR,一个高效的上下文利用框架,用于轻量图像超分辨率,它统一了多尺度感受野建模和层次化上下文融合。EchoSR通过一种高效的上下文利用策略将特征学习解耦为分离的局部、多尺度和全局建模阶段,并进一步通过跨尺度重叠融合机制促进无缝的跨尺度整合。广泛的实验表明,EchoSR在多个基准上一致优于现有最先进的轻量超分辨率方法,同时也实现了更快的速度(约2倍)。源代码可在https://github.com/funnyWang-Echoes/EchoSR上获得。

英文摘要

Image super-resolution (SR) aims to reconstruct high-quality, high-resolution (HR) images from low-resolution (LR) inputs and plays a critical role in various downstream applications. Despite recent advancements, balancing reconstruction fidelity and computational efficiency remains a fundamental challenge, particularly in resource-constrained scenarios. While existing lightweight methods attempt to expand receptive fields, many of them either incur substantial computational overhead, naively scale up kernel sizes, or lack mechanisms for coherent multi-scale integration, limiting their overall effectiveness and scalability. To address these limitations, we propose EchoSR, an efficient context-harnessing framework for lightweight image super-resolution, which unifies multi-scale receptive field modeling and hierarchical context fusion. EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages through an efficient context-harnessing strategy, and further promotes seamless cross-scale integration via a cross-scale overlapping fusion mechanism. Extensive experiments have shown that EchoSR consistently outperforms state-of-the-art lightweight super-resolution methods across multiple benchmarks, while also achieving a faster speed $(\sim 2\times)$. The source code is available at https://github.com/funnyWang-Echoes/EchoSR.

2605.16736 2026-05-20 cs.CV 版本更新

CAB: Accelerating Flow and Diffusion Sampling via Rectification and Corrected Adams-Bashforth

CAB: 通过校正和修正Adams-Bashforth加速流和扩散采样

Anuska Roy, Pravin Nair

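Affiliations * Department of Electrical Engineering

AI summary This paper proposes CAB, a training-free sampler that transforms the sampling dynamics into a common rectified coordinate system and applies a multistep Adams-Bashforth predictor with a simple correction term based on past velocity evaluations, accelerating flow and diffusion models without additional function evaluations.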

详情
AI中文摘要

流和扩散模型能够实现高质量、高分辨率的图像合成,但通常在采样时需要大量的函数评估次数(NFEs)。现有的加速方法要么需要通过蒸馏进行额外训练,要么依赖于无需训练的高阶求解器,但两者在低NFE预算下都会降低样本质量。我们提出CAB(Corrected Adams-Bashforth),一种无需训练的采样器,能够加速流和扩散模型。CAB首先将采样动态转换为统一的校正坐标系,然后应用一个带有基于过去速度评估的简单修正项的多步Adams-Bashforth预测器,因此不增加额外的NFEs。所得到的方法简单,具有相同的算法形式,适用于所有模型类别,并且具有至少第三阶局部截断误差和第二阶全局误差。在预训练的流和扩散模型上进行的实验,包括类别条件和大规模文本到图像基准,表明CAB在6-20 NFEs的低步数范围内改进了质量-NFE权衡。它在大多数测试模型中在更高步数时与强大的无需训练采样器保持竞争力。官方实现可在https://github.com/Anuska-Roy/CAB上获得。

英文摘要

Flow and diffusion models achieve high-fidelity, high-resolution image synthesis, but often require many function evaluations (NFEs) at sampling time. Existing acceleration methods either require additional training through distillation or rely on training-free high-order solvers, and both can degrade sample quality at low NFE budgets. We propose CAB (Corrected Adams-Bashforth), a training-free sampler that accelerates both flow and diffusion models. CAB first transforms the sampling dynamics to a common rectified coordinate system, and then applies a multistep Adams-Bashforth predictor augmented with a simple correction term that is based on past velocity evaluations and therefore incurs no additional NFEs. The resulting method is simple, has the same algorithmic form across model classes, and has at least third-order local truncation error and second-order global error. Experiments on pretrained flow and diffusion models, including class-conditional and large-scale text-to-image benchmarks, show that CAB improves quality-NFE trade-offs in the low-step regime of 6-20 NFEs. It also remains competitive with strong training-free samplers at higher step counts across most tested models. The official implementation is available at https://github.com/Anuska-Roy/CAB.
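
To illustrate why a multistep predictor adds no function evaluations, the sketch below implements the textbook two-step Adams-Bashforth method on a toy ODE and compares it with Euler at the same NFE budget. It is not CAB itself: the rectified coordinate transform and the correction term are not specified in the abstract and are omitted here, and the function names and the toy problem $dx/dt = -x$ are assumptions made for illustration.

```python
import math


def ab2_integrate(velocity, x0, t0, t1, steps):
    """Integrate dx/dt = velocity(x, t) with two-step Adams-Bashforth.

    Returns the final state and the number of velocity evaluations used.
    """
    h = (t1 - t0) / steps
    x, t = x0, t0
    nfe = 0
    prev_v = None
    for _ in range(steps):
        v = velocity(x, t)  # the only new evaluation in this step
        nfe += 1
        if prev_v is None:
            x = x + h * v  # bootstrap with a single Euler step
        else:
            x = x + h * (1.5 * v - 0.5 * prev_v)  # reuse the cached velocity
        prev_v = v
        t += h
    return x, nfe


def euler_integrate(velocity, x0, t0, t1, steps):
    """First-order baseline that also uses one evaluation per step."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = x + h * velocity(x, t)
        t += h
    return x, steps


def decay(x, t):
    """Toy velocity field with known solution x(t) = exp(-t)."""
    return -x


x_ab2, nfe_ab2 = ab2_integrate(decay, 1.0, 0.0, 1.0, steps=10)
x_euler, nfe_euler = euler_integrate(decay, 1.0, 0.0, 1.0, steps=10)
exact = math.exp(-1.0)
```

Both integrators spend one evaluation per step; the multistep update gains accuracy only by reusing the velocity cached from the previous step.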

2605.16353 2026-05-20 cs.CV cs.AI 版本更新

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

StrLoRA: 向流式连续视觉指令微调迈进以适应大规模多模态语言模型

Chang Che, Ziqi Wang, Hui Ma, Cheems Wang, Zenglin Shi

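Affiliations * Hefei University of Technology; Tsinghua University

AI summary This paper proposes StrLoRA, a method for streaming continual visual instruction tuning that addresses continual learning over dynamic task streams. Its task-aware expert routing framework improves model performance on continuously evolving data streams.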

详情
AI中文摘要

持续视觉指令微调(CVIT)使多模态大语言模型能够逐步获得新能力。然而,现有CVIT方法在任务增量设置下运行,每个训练阶段对应一个预定义任务,这不符合现实世界中数据作为连续流中交织和动态变化的任务的条件。为弥合这一差距,我们引入流式CVIT(StrCVIT),一种更通用和现实的设置,其中模型从包含动态混合任务的数据块中学习。在StrCVIT中,模型必须同时获得新能力、强化常见能力并减轻遗忘。现有CVIT方法在此处失败,因为它们无法可靠地区分或适应每个块内的异构任务样本。因此,我们提出了StrLoRA,一种正则化的两阶段专家路由框架。StrLoRA首先使用文本指令进行任务感知的专家选择,激活相关专家的稀疏子集,减少跨任务干扰。然后在该子集内应用基于令牌的专家加权,其中贡献权重通过本地视觉令牌与全局指令表示之间的跨模态注意力计算。为了在非平稳流中保持稳定性,路由稳定性正则化将当前路由分布与历史指数移动平均参考对齐。在新开发的StrCVIT基准上的广泛实验表明,StrLoRA显著优于现有方法,有效提升了模型从持续演变的数据流中获取能力的能力。代码可在https://github.com/chanceche/StrCVIT获取。

英文摘要

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing a model's ability to acquire new abilities from continuously evolving data streams. The code is available at https://github.com/chanceche/StrCVIT.
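
The routing-stability idea can be sketched as a penalty between the current routing distribution and an exponential moving average (EMA) of past routing. The abstract states only that the two are aligned; using KL divergence as the alignment measure, the decay value of 0.9, and all function names below are assumptions for illustration rather than the StrLoRA implementation.

```python
import math


def ema_update(reference, current, decay):
    """Blend the historical routing distribution with the current one."""
    return [decay * r + (1.0 - decay) * c for r, c in zip(reference, current)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over experts."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def routing_stability_penalty(current, reference):
    """Penalty that grows as current routing drifts from the EMA reference."""
    return kl_divergence(current, reference)


reference = [0.25, 0.25, 0.25, 0.25]      # historical EMA over four experts
chunk_routing = [0.70, 0.10, 0.10, 0.10]  # routing observed on a new data chunk
penalty = routing_stability_penalty(chunk_routing, reference)
reference = ema_update(reference, chunk_routing, decay=0.9)
```

A high decay keeps the reference close to long-run routing behaviour, so a single unusual chunk is penalized without immediately shifting the reference.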

2605.15497 2026-05-20 cs.CV cs.GR 版本更新

AnyAct: Towards Human Reenactment of Character Motion From Video

AnyAct: 向视频中非人类角色动作的重新演绎迈进

Liuhan Chen, Lei Zhong, Jiewei Wang, Qin Shuai, Li Yuan, Leidong Fan, Qing Li, Kanglin Liu

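Affiliations * Peking University; Nankai University; The University of Hong Kong; Zhejiang University; Pengcheng Laboratory

AI summary This paper studies how to derive an initial human reenactment directly from a monocular video of a non-human character, reinterpreting the character's motion as an editable human performance for downstream animation authoring. Its core idea is that sparse local articulated motion cues preserve essential dynamics across large structural differences; the proposed AnyAct model formulates the task as conditional human motion generation from transferable sparse local 2D articulated motion.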

Comments 12 pages

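Details
AI Chinese abstract (translated)

We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible, editable human performance for downstream animation authoring. The task is challenging because existing video-based motion capture methods are mostly confined to human-centric structural spaces, while motion retargeting methods usually require structured 3D source motion and a known source topology. Our key insight is that sparse local articulated motion cues preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Building on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We also construct a benchmark that mainly covers diverse non-human character videos. Experiments on this benchmark show that AnyAct generates high-fidelity initial human reenactments that preserve the essential dynamics of the characters in the reference videos, and ablation studies validate the effectiveness of its core designs.

English abstract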

We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.

2605.15186 2026-05-20 cs.CV cs.AI Updated version

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

VGGT-Edit: Feed-Forward Native 3D Scene Editing Based on Residual Field Prediction

Kaixin Zhu, Yiwen Tang, Yifan Yang, Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang

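Affiliations * Peking University; Tencent; The Chinese University of Hong Kong; Shanghai AI Lab; NTU Singapore; Zhongguancun Academy; Beijing Key Lab of Data Intel. & Security (PKU)

AI summary This paper proposes VGGT-Edit, a text-conditioned feed-forward framework for native 3D scene editing. It introduces depth-synchronized text injection and a residual transformation head to achieve high-quality 3D scene editing, and it constructs the DeltaScene dataset to improve editing quality and inference speed.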

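Details
AI Chinese abstract (translated)

High-quality 3D scene reconstruction has recently moved toward generalizable feed-forward architectures that can generate complex environments in a single forward pass. However, although these models perform strongly on static scene perception, they remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy: each view is edited separately and then lifted back into 3D space. This indirect pipeline often produces blurry textures and inconsistent geometry, because 2D editors lack the spatial awareness needed to preserve structure across views. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head that directly predicts 3D geometric displacements to deform the scene while keeping the background stable. To ensure high-fidelity results, we supervise the framework with a multi-term objective that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene dataset, a large-scale dataset generated by an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference. The project page is https://chriszkxxx.github.io/VGGT-Edit/.

English abstract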

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed. The project page is https://chriszkxxx.github.io/VGGT-Edit/.

2605.14530 2026-05-20 cs.CV Updated version

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Sujung Hong, Chanyong Yoon, Seong Jae Hwang

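Affiliations * Department of Artificial Intelligence, Yonsei University, Seoul, Republic of Korea

AI summary This paper studies repetitive generation and degraded visual grounding in large diffusion vision-language models during long-form generation, and proposes a training-free solution that mitigates mask prior drift and positional attention collapse.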

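Details
AI Chinese abstract (translated)

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, offering efficient parallel decoding and using bidirectional attention to capture global context. Despite this progress, their behavior in long-form generation remains underexplored. In this paper, we find that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and we identify two root causes. First, repetitive generation stems from the mask token prior: because generation tokens are initialized as mask tokens, their hidden representations gradually drift toward a shared prior direction over the generation steps. Second, a fundamental mismatch between the positional attention bias and the iterative unmasking process suppresses attention to informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free method that introduces Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks show improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be addressed effectively with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

English abstract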

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

2605.10525 2026-05-20 cs.CV Updated version

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

Yuecheng Liu, Junda Cheng, Longliang Liu, Wenjing Liao, Hanrui Cheng, Yuzhou Wang, Xin Yang

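Affiliations * Huazhong University of Science & Technology; Optics Valley Laboratory

AI summary This paper proposes GemDepth, a framework that introduces a Geometry-Embedding Module and an Alternating Spatio-Temporal Transformer to address spatial blurring and temporal inconsistency in video depth estimation, achieving high accuracy and robust 3D consistency.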

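Details
AI Chinese abstract (translated)

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and from temporal inconsistency. We propose GemDepth, a framework built on the insight that explicit awareness of camera motion and global 3D structure is a prerequisite for maintaining 3D consistency. GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. Injecting these motion priors gives the network intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences, improving spatial precision for sharper details while enforcing strict temporal consistency. In addition, GemDepth adopts a data-efficient training strategy that bridges the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenes. The code is publicly available at https://github.com/Yuecheng919/GemDepth.

English abstract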

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth.

2605.08830 2026-05-20 cs.CV cs.AI cs.RO Updated version

VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

Rui Zhao, Jianlin Yu, Zhenhai Gao, Jiaqiao Liu, Fei Gao

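Affiliations * College of Automotive Engineering, Jilin University; The National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University; ReeFocus AI Technology

AI summary This paper proposes VECTOR-DRIVE, a framework that uses tightly coupled vision-language and trajectory expert routing to address the coupling trade-off between vision-language understanding and trajectory prediction in end-to-end autonomous driving, achieving higher task performance.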

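Details
AI Chinese abstract (translated)

End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining but still face a coupling trade-off: a fully shared backbone preserves multimodal interaction but may entangle language reasoning with trajectory prediction, whereas a decoupled reasoning-action pipeline reduces task conflict but weakens semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves a Driving Score of 88.91 and outperforms representative end-to-end and VLA baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.

English abstract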

End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves a Driving Score of 88.91 and outperforms representative end-to-end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.
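
The routing rule in this abstract (feed-forward computation chosen by token semantics) can be illustrated with a toy dispatcher. This is a hedged sketch rather than the VECTOR-DRIVE code: the token-type labels, the function names, and the two stand-in "experts" (a scaling and a shift instead of real FFNs) are all hypothetical, and the shared self-attention step that precedes the FFN is omitted.

```python
# Token types per the abstract; the string labels themselves are hypothetical.
VISION_LANGUAGE_TYPES = {"vision", "language"}
TRAJECTORY_TYPES = {"target_point", "ego_state", "noisy_action"}

def vision_language_expert(hidden):
    """Stand-in for the Vision-Language Expert FFN (here: a simple scaling)."""
    return [2.0 * h for h in hidden]

def trajectory_expert(hidden):
    """Stand-in for the Trajectory Expert FFN (here: a simple shift)."""
    return [h - 1.0 for h in hidden]

def route_feed_forward(tokens):
    """Send each (token_type, hidden_state) pair to the expert for its type."""
    outputs = []
    for token_type, hidden in tokens:
        if token_type in VISION_LANGUAGE_TYPES:
            outputs.append(vision_language_expert(hidden))
        elif token_type in TRAJECTORY_TYPES:
            outputs.append(trajectory_expert(hidden))
        else:
            raise ValueError(f"unknown token type: {token_type}")
    return outputs

sequence = [
    ("vision", [1.0, 2.0]),
    ("language", [0.5, 0.5]),
    ("ego_state", [3.0, 4.0]),
    ("noisy_action", [0.0, 1.0]),
]
routed = route_feed_forward(sequence)
```

Because the routing depends only on a fixed token type, no learned gate is needed in this sketch. Whether the actual model uses a learned router is not stated in the abstract.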

2605.07379 2026-05-20 cs.CV cs.AI Updated version

RELO: Reinforcement Learning to Localize for Visual Object Tracking

RELO: Reinforcement Learning to Localize for Visual Object Tracking

Xin Chen, Chuanyu Sun, Jiao Xu, Houwen Peng, Dong Wang, Huchuan Lu, Kede Ma

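Affiliations * City University of Hong Kong; Hunyuan Team, Tencent; Dalian University of Technology

AI summary This paper proposes RELO, which models target localization as a Markov decision process and uses reinforcement learning in place of conventional handcrafted spatial priors to improve tracking performance and consistency.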

Comments ICML 2026 paper

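Details
AI Chinese abstract (translated)

Conventional visual object trackers typically localize targets with handcrafted spatial priors such as heatmaps, but these priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics such as intersection over union (IoU) and area under the success curve (AUC). We introduce RELO, a reinforcement-learning-to-localize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy over spatial positions learned via reinforcement learning, with rewards that combine frame-level IoU and sequence-level AUC. We also introduce layer-aligned temporal token propagation to improve semantic consistency across frames at negligible computational cost. Across multiple benchmarks, RELO achieves superior results, reaching 57.5% AUC on LaSOT_ext without template updates. This confirms that reward-driven localization is an effective alternative for visual object tracking.

English abstract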

Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOT_ext without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.
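
For readers unfamiliar with the two quantities that RELO's reward is said to combine, the sketch below computes a frame-level IoU and a sequence-level success-curve AUC. It is illustrative only: the 21 evenly spaced thresholds follow a common tracking-benchmark convention rather than anything stated in the abstract, the boxes are made-up data, and how RELO weights the two terms in its reward is not specified, so no combined reward is shown.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def success_auc(ious, steps=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(iou > t for iou in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)

# Made-up predictions and ground-truth boxes for a three-frame sequence.
predicted = [(0, 0, 2, 2), (1, 1, 3, 3), (2, 2, 4, 4)]
ground_truth = [(0, 0, 2, 2), (1, 1, 3, 4), (5, 5, 6, 6)]

frame_ious = [box_iou(p, g) for p, g in zip(predicted, ground_truth)]  # per-frame signal
sequence_auc = success_auc(frame_ious)                                 # per-sequence signal
```

The per-frame IoU gives a dense signal for each localization decision, while the AUC summarizes the whole sequence. This is presumably why the two are combined, though the abstract does not elaborate.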

2605.06270 2026-05-20 cs.CV Updated version

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Spark3R: Asymmetric Token Reduction Enables Fast Feed-Forward 3D Reconstruction

Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu, Jiaqi Zhang, Siwei Ma, Jian Zhang

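Affiliations * School of Electronic and Computer Engineering, Peking University; School of Computer Science, Peking University

AI summary This paper proposes Spark3R, a framework that uses asymmetric token reduction to accelerate feed-forward 3D reconstruction models without retraining, achieving up to a 28× speedup while preserving high-quality reconstruction.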

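Details
AI Chinese abstract (translated)

Feed-forward 3D reconstruction models based on Vision Transformers can estimate scene geometry and camera poses directly from a small number of input images, but extending them to video inputs with hundreds or thousands of frames remains challenging because of the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence inside the global attention layers, but they apply uniform reduction to query tokens and key-value tokens, ignoring their functionally different roles in 3D reconstruction. In this paper, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, whereas key-value tokens represent shared scene context and tolerate aggressive compression. Motivated by this insight, we propose Spark3R, a training-free acceleration framework that decouples compression by assigning different reduction factors to query tokens and key-value tokens, applying intra-group token merging to query tokens and lightweight token pruning to key-value tokens. Spark3R also adapts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework that needs no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, π³, Depth-Anything-3, and VGGT-Ω, and achieves up to a 28× speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.

English abstract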

Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, π³, Depth-Anything-3, and VGGT-Ω, and achieves up to 28× speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
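
The asymmetric idea (mild merging for queries, aggressive pruning for keys and values) can be sketched with plain Python lists. This is a toy illustration under stated assumptions, not Spark3R itself: averaging consecutive queries stands in for intra-group merging, ranking keys by squared L2 norm is a hypothetical pruning criterion, and the per-layer adaptation of the key-value reduction factor is left out. All names are invented for the example.

```python
import math

def merge_queries(queries, group):
    """Average each run of `group` consecutive queries into one merged query."""
    merged = []
    for start in range(0, len(queries), group):
        chunk = queries[start:start + group]
        merged.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return merged

def prune_key_values(keys, values, keep):
    """Keep the `keep` keys with the largest squared norm (hypothetical criterion)."""
    ranked = sorted(range(len(keys)), key=lambda i: -sum(k * k for k in keys[i]))
    kept = sorted(ranked[:keep])  # restore the original token order
    return [keys[i] for i in kept], [values[i] for i in kept]

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                        for d in range(len(values[0]))])
    return outputs

def asymmetric_attention(queries, keys, values, q_group, kv_keep):
    """Attend with fewer queries and far fewer key-values, then unmerge."""
    merged_q = merge_queries(queries, q_group)
    kept_k, kept_v = prune_key_values(keys, values, kv_keep)
    merged_out = attention(merged_q, kept_k, kept_v)
    # Unmerge: every original query position receives its group's output.
    return [merged_out[i // q_group] for i in range(len(queries))]

tokens = [[float(i), float(i % 3)] for i in range(8)]
out = asymmetric_attention(tokens, tokens, tokens, q_group=2, kv_keep=3)
```

With 8 tokens, full attention evaluates 8 × 8 = 64 query-key pairs, while this configuration evaluates 4 × 3 = 12. The reduction factor differs between the query side (2×) and the key-value side (roughly 2.7×), which is the asymmetry the abstract describes.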

2605.02223 2026-05-20 cs.SD cs.CV Updated version

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran

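Affiliations * Posts and Telecommunications Institute of Technology

AI summary This paper proposes the MIST dataset, the ISA method, and the SF1@tau metric for multi-region speech inpainting detection, revealing the shortcomings of existing deepfake detectors in fine-grained speech inpainting detection.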

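Details
AI Chinese abstract (translated)

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation, in which an attacker replaces a few words in an utterance to change its meaning while preserving the speaker's identity, an increasingly realistic threat. Existing audio deepfake detection benchmarks focus mainly on utterance-level binary classification or single-region tampering and do not address detecting and localizing multiple inpainted regions whose number is unknown. We fill this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset covering 6 languages, with 1-3 independently inpainted word-level segments per utterance, generated through LLM-guided semantic replacement and neural voice cloning, in which fake content makes up only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that uses coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region-count accuracy and localization precision. Zero-shot evaluation shows that word-level partial inpainting remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near-zero fake probability to MIST utterances. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.

English abstract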

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
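
A segment-level F1 based on temporal IoU matching, in the spirit of SF1@tau, might look like the sketch below. The abstract does not give the matching procedure, so the greedy one-to-one matching by descending IoU, the convention of returning 1.0 when both lists are empty, and all names and example timings are assumptions.

```python
def temporal_iou(a, b):
    """IoU of two time segments given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_f1(predicted, truth, tau=0.5):
    """F1 over segments; a prediction counts if matched to a truth with IoU >= tau."""
    pairs = sorted(
        ((temporal_iou(p, t), i, j)
         for i, p in enumerate(predicted)
         for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < tau:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_truth:
            continue  # enforce one-to-one matching
        used_pred.add(i)
        used_truth.add(j)
        true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(truth) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 1.0

# Made-up example: two tampered words, one found accurately, one missed,
# plus one spurious detection.
truth = [(1.0, 1.5), (3.0, 3.4)]
predicted = [(1.0, 1.4), (5.0, 5.3)]
score = segment_f1(predicted, truth, tau=0.5)
```

Because unmatched predictions count as false positives and unmatched ground-truth segments as false negatives, a single score penalizes both a wrong region count and poor boundaries. This matches the "jointly evaluates" wording in the abstract.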

2605.00578 2026-05-20 cs.CV Updated version

Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration

Federated Distillation for Whole Slide Images via Gaussian-Mixture Feature Alignment and Curriculum Integration

Luru Jing, Cong Cong, Yanyuan Chen, Yongzhi Cao

Affiliations * School of Computer Science, Peking University, Beijing, China; Center for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Sydney, NSW 2113, Australia; School of Data Science, University of Virginia, Charlottesville, VA, USA

AI summary This paper proposes FedHD, a framework that uses Gaussian-mixture feature alignment and a curriculum integration strategy to enable whole slide image analysis in federated learning. Locally generated, semantically rich synthetic feature representations improve model performance while preserving diagnostic diversity.

Comments Accepted by ICML 2026, Camera-Ready version updated

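Details
AI Chinese abstract (translated)

Federated learning (FL) offers a promising framework for collaborative digital pathology by enabling model training across institutions. However, real-world deployment faces heterogeneity arising from the diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors used at different institutions. We propose FedHD, a new FL framework that performs local Gaussian-mixture feature alignment tailored to WSI analysis. Rather than exchanging model parameters, each client independently distills semantically rich synthetic feature representations that are aligned with the distribution of real WSIs. To preserve diagnostic diversity, FedHD adopts a one-to-one distillation strategy that generates one synthetic counterpart for each real slide, avoiding over-compression. During federation, a curriculum-based integration strategy progressively incorporates cross-site synthetic features into local training once performance plateaus. In addition, an optional interpretation module reconstructs pseudo-patches from the synthetic embeddings, improving transparency. FedHD is architecture-agnostic and privacy-preserving, and it supports personalized yet collaborative training across institutions. Experiments on TCGA-IDH, CAMELYON16, and CAMELYON17 show that FedHD consistently outperforms state-of-the-art federated and distillation baselines.

English abstract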

Federated learning (FL) offers a promising framework for collaborative digital pathology by enabling model training across institutions. However, real-world deployments face heterogeneity arising from diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors across institutions. We propose FedHD, a novel FL framework that performs local Gaussian-mixture feature alignment tailored for WSI analysis. Instead of exchanging model parameters, each client independently distills semantically rich synthetic feature representations aligned with the distribution of real WSIs. To preserve diagnostic diversity, FedHD adopts a one-to-one distillation strategy, generating a synthetic counterpart for each real slide to avoid over-compression. During federation, a curriculum-based integration strategy progressively incorporates cross-site synthetic features into local training once performance plateaus. Furthermore, an optional interpretation module reconstructs pseudo-patches from synthetic embeddings, enhancing transparency. FedHD is architecture-agnostic, privacy-preserving, and supports personalized yet collaborative training across diverse institutions. Experiments on TCGA-IDH, CAMELYON16, and CAMELYON17 show that FedHD consistently outperforms state-of-the-art federated and distillation baselines.

2604.25646 2026-05-20 cs.CV cs.RO Updated version

SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

Jing Zhang, Duojie Chen, Wentao Jiang, Zihan Lou, Jianxin Liu, Xinwu Cui, Qinghong Zhao, Bo Du, Christoph F. Dietrich, Dacheng Tao

Affiliations * School of Computer Science, Wuhan University; Hubei Center for Applied Mathematics, Wuhan University; Department of Ultrasound, The Central Hospital of Wuhan; Department of Medical Ultrasound, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology; Department of Ultrasound in Medicine, Renmin Hospital of Wuhan University; University Hospital, Johann-Wolfgang-Goethe University Frankfurt am Main; College of Computing and Data Science, Nanyang Technological University

AI summary This study proposes SAMe, a semantic anatomy mapping engine that provides an explicit anatomical prior layer to address scan initialization in robotic ultrasound. It identifies anatomical targets from clinical complaints and generates control commands, improving the accuracy and efficiency of automated scanning.

Comments Supplementary information included. Code will be released at https://github.com/MiliLab/Echo-SAMe

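Details
AI Chinese abstract (translated)

Robotic ultrasound has achieved local image-driven control, contact regulation, and view optimization, but current systems lack the anatomical understanding needed to decide what to scan, where to begin, and how to adapt to an individual patient's anatomy. These gaps leave systems dependent on expert intervention to start a scan. This paper presents SAMe, a semantic anatomy mapping engine that gives robotic ultrasound an explicit anatomical prior layer. SAMe treats scan initialization as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, generates a patient-specific anatomical representation for those targets from a single external body image, and converts this representation into control-facing 6-DoF probe initialization states, with no additional registration against preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference completes in 0.08 s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe performs strongly over the full initialization pipeline. In real-robot experiments, centroid-based SAMe initialization outperformed a body-keypoint-based heuristic baseline in the single-target setting for both liver (86.7% vs. 46.7%) and kidney (80.0% vs. 73.3%) initialization. When multiple candidate targets were available, the trial-level organ-hit rate reached 97.3% for liver and 83.3% for kidney. These results establish an explicit anatomical prior layer that addresses scan initialization and provides an anatomical foundation for broader downstream autonomous scanning pipelines and for complaint-driven, anatomically informed robotic ultrasound.

English abstract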

Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps make systems still reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, instantiates a patient-specific anatomical representation for the grounded targets from a single external body image, and translates this representation into control-facing 6-DoF probe initialization states without any additional registration using preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference in 0.08 s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe shows strong performance across the full initialization pipeline. In real-robot experiments, centroid-based SAMe initialization outperformed the body-keypoint-based heuristic baseline under a budget-matched single-target setting for both liver (86.7% versus 46.7%) and kidney (80.0% versus 73.3%) initialization. Furthermore, the trial-level organ-hit rate reached 97.3% for liver and 83.3% for kidney when multiple candidate targets were available. These results establish an explicit anatomical prior layer that addresses scan initialization and is designed to support broader downstream autonomous scanning pipelines, providing the anatomical foundation for complaint-driven, anatomically informed robotic ultrasonography.

2604.18225 2026-05-20 cs.CV cs.AI Updated version

Is SAM3 ready for pathology segmentation?

Is SAM3 Ready for Pathology Segmentation?

Qiuyu Kong, Shakiba Sharifi, Yiming Wang, Marco Cristani, Zanxi Ruan

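Affiliations * Sapienza University of Rome; University of Verona; Fondazione Bruno Kessler

AI summary This paper evaluates SAM3 on pathology image segmentation and finds that text prompts have limited effect, that visual prompt type and budget strongly affect performance, that few-shot learning brings gains but robustness remains insufficient, and that a significant gap exists between prompt-based usage and task-trained adapter-based methods.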

Comments Accepted to ICIP 2026

Details
English abstract

Is Segment Anything Model 3 (SAM3) capable of segmenting Any Pathology Images? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. With this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings, including zero-shot, few-shot, and supervised, with varying prompting strategies. Our extensive evaluation on pathological datasets, including NuInsSeg, PanNuke, and GlaS, reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.

2604.16503 2026-05-20 cs.CV cs.AI Updated version

Motif-Video 2B: Technical Report

Motif-Video 2B: Technical Report

Junghwan Lim, Wai Ting Cheung, Minsu Ha, Beomgyu Kim, Taewhan Kim, Haesol Lee, Dongpin Oh, Jeesoo Lee, Taehyun Kim, Minjae Kim, Sungmin Lee, Hyeyeon Cho, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Dongseok Kim, Jangwoong Kim, Youngrok Kim, Hyukjin Kweon, Hongjoo Lee, Jeongdoo Lee, Junhyeok Lee, Eunhwan Park, Yeongjae Park, Bokki Ryu, Dongjoo Weon

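Affiliations * Motif Technologies

AI summary This study examines whether a high-quality text-to-video generation model can be trained on a limited budget. It proposes improving performance through architectural design rather than simply scaling the model, combining Shared Cross-Attention with a three-part backbone to achieve high-quality video generation with fewer parameters and less data.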

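Details
AI Chinese abstract (translated)

Training strong video generation models usually requires large-scale datasets, large parameter counts, and substantial compute. In this work, we ask whether high-quality text-to-video generation is possible on a much smaller budget: fewer than 10M clips and fewer than 100,000 H200 GPU hours. Our core claim is that how model capacity is organized, not just how much of it there is, is a key factor. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control as video token sequences grow longer. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap usually associated with much larger video models.

English abstract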

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.

2604.16491 2026-05-20 cs.CV cs.AI Updated version

A Lightweight Transformer for Pain Recognition from Brain Activity

A Lightweight Transformer for Pain Recognition from Brain Activity

Stefanos Gkikas, Christian Arzate Cruz, Yu Fang, Lu Cao, Muhammad Umar Khan, Thomas Kassiotis, Giorgos Giannakakis, Raul Fernandez Rojas, Randy Gomez

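Affiliations * Honda Research Institute Japan, Wako City, Japan; BioSIS (Biosensing & Intelligent Systems) Lab, Centre for Intelligent Computing Systems, University of Canberra, Canberra, Australia; Department of Electronic Engineering, Hellenic Mediterranean University, Chania, Greece

AI summary This paper proposes a lightweight Transformer that fuses multiple fNIRS representations through a unified tokenization mechanism, jointly modeling complementary signal views without modality-specific adaptations or added architectural complexity, and achieving competitive pain recognition performance while remaining computationally compact.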

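Details
AI Chinese abstract (translated)

Pain is a complex and widespread phenomenon with a significant clinical and societal burden, which makes reliable automated assessment a key objective. This paper presents a lightweight Transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without modality-specific adaptations or added architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs into a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations. Experimental results show that the approach achieves competitive pain recognition performance while remaining computationally compact, making it suitable for real-time inference on both GPU and CPU hardware.

English abstract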

Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.

2604.11089 2026-05-20 cs.CV 版本更新

Structured State-Space Regularization for Generation-Friendly Image Tokenization

结构化状态空间正则化用于生成友好的图像标记化

Jinsung Lee, Jaemin Oh, Namhun Kim, Dongwon Kim, Byung-Jun Yoon, Suha Kwak

发表机构 * POSTECH Brown University(布朗大学) KAIST(韩国科学技术院) Texas A&M University(德克萨斯农工大学) Brookhaven National Laboratory(布鲁克海文国家实验室)

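AI summary: This paper proposes structured state-space regularization, which induces spectral structure in the latent space to improve the generative performance of image tokenizers while preserving reconstruction fidelity.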

Comments Related blog posts in https://jinsingsangsung.github.io/collections/blog/ : Towards 2-Dimensional State-Space Models series

详情
AI中文摘要

图像标记器在现代生成模型中起着核心作用,其中潜在空间的结构关键决定了下游生成性能。有效潜在表示的一个关键但未被充分探索的特性是频谱组织,即能够跨频率组件编码信息。在本文中,我们引入了结构化状态空间正则化,一种系统诱导潜在空间频谱结构的方法。我们通过重新审视状态空间模型(SSMs)作为模仿基函数行为的系统,推导出一个正则化目标。这种视角揭示了SSMs的隐藏状态被诱导以捕捉频率组件,从而产生一种新的正则器,强制潜在空间捕捉图像的频谱结构。实验表明,我们的正则器在提升图像标记器生成性能的同时,仅导致微小的重建保真度损失。

英文摘要

Image tokenizers play a central role in modern generative models, where the structure of the latent space critically determines the downstream generation performance. A key but underexplored property of effective latent representations is spectral organization, the ability to encode information across frequency components. In this work, we introduce structured state-space regularization, a principled approach to inducing spectral structure in latent spaces. We derive a regularization objective by revisiting state-space models (SSMs) as systems mimicking a basis function's behavior. This perspective reveals that hidden states of SSMs are induced to capture the frequency components, resulting in a novel regularizer that enforces the latent space to capture spectral structure of images. Experiments demonstrate that our regularizer improves the generative performance of image tokenizers while incurring only minimal loss in their reconstruction fidelity.

2604.08503 2026-05-20 cs.CV 版本更新

Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Phantom:通过联合建模视觉与潜在物理动态实现融入物理的视频生成

Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

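AI summary: This paper proposes Phantom, a model that jointly models visual content and latent physical dynamics so that video generation is physically consistent, producing videos that are both visually realistic and physically plausible.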

Comments 15 pages, 6 figures, CVPR 2026

详情
AI中文摘要

近期生成视频建模的进展,受到大规模数据集和强大架构的推动,已经取得了显著的视觉真实效果。然而,越来越多的证据表明,仅仅扩大数据和模型规模并不能使这些系统理解支配现实世界动态的底层物理定律。现有方法往往无法捕捉或强制执行这种物理一致性,导致不真实的运动和动态。在本文中,我们探讨将潜在物理属性的推断直接整合到视频生成过程中,是否可以赋予模型生成物理合理视频的能力。为此,我们提出了Phantom,一个融入物理的视频生成模型,该模型联合建模视觉内容和潜在物理动态。在观察到的视频帧和推断出的物理状态条件下,Phantom联合预测潜在物理动态并生成未来的视频帧。Phantom利用一种物理感知的视频表示,作为底层物理的抽象但信息丰富的嵌入,从而在不需显式指定复杂物理动态和属性集的情况下,联合预测物理动态和视频内容。通过将物理感知视频表示的推断直接整合到视频生成过程中,Phantom生成的视频序列既具有视觉真实性又具有物理一致性。在标准视频生成和物理感知基准上的定量和定性结果表明,Phantom不仅在遵守物理动态方面优于现有方法,还提供了具有竞争力的感知保真度。

英文摘要

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

2604.02784 2026-05-20 cs.CV cs.CL 版本更新

EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

EnsemHalDet: 通过内部状态检测器的集成实现鲁棒的视觉语言模型幻觉检测

Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

发表机构 * The University of Electro-Communications(电通大学)

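AI summary: This paper proposes EnsemHalDet, a hallucination detection framework for vision-language models that ensembles detectors over multiple internal representations to make multimodal hallucination detection more robust.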

详情
AI中文摘要

视觉语言模型(VLMs)在多模态任务中表现出色,但它们仍然容易受到事实错误或与输入图像无关的幻觉影响。最近的研究表明,利用内部表示进行幻觉检测比仅依赖模型输出的方法更高效和准确。然而,现有的基于内部表示的方法通常依赖于单一的表示或检测器,限制了它们捕捉多样化幻觉信号的能力。在本文中,我们提出了EnsemHalDet,一种基于集成的幻觉检测框架,利用VLMs的多种内部表示,包括注意力输出和隐藏状态。EnsemHalDet为每个表示训练独立的检测器,并通过集成学习进行组合。在多个VQA数据集和VLMs上的实验结果表明,EnsemHalDet在AUC方面始终优于先前的方法和单检测器模型。这些结果表明,集成多样化的内部信号显著提高了多模态幻觉检测的鲁棒性。

英文摘要

Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.
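
As a rough illustration of the ensembling idea in this abstract, the sketch below averages the scores of two per-representation detectors and compares AUC before and after combining them. Score averaging is only one simple ensembling rule and is an assumption here; the abstract does not specify the combination method. The function names and the toy scores are hypothetical and are not results from the paper.

```python
# Hypothetical sketch: score-averaging ensemble over per-representation detectors.
# All scores below are made-up toy values, not data from the paper.

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ensemble(score_lists):
    """Average the per-sample scores of several detectors (assumed rule)."""
    return [sum(col) / len(col) for col in zip(*score_lists)]

labels = [1, 1, 1, 0, 0, 0]                       # 1 = hallucinated answer
attn_scores = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]      # detector on attention outputs
hidden_scores = [0.6, 0.8, 0.3, 0.2, 0.4, 0.1]    # detector on hidden states

combined = ensemble([attn_scores, hidden_scores])
print(auc(attn_scores, labels), auc(hidden_scores, labels), auc(combined, labels))
```

In this toy case each detector misranks a different sample, so the averaged score separates the two classes better than either detector alone.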

2603.29092 2026-05-20 cs.CV 版本更新

TrajectoryMover: Generative Movement of Object Trajectories in Videos

TrajectoryMover: 视频中物体轨迹的生成性运动

Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang, Paul Guerrero

发表机构 * KTH Royal Institute of Technology(皇家理工学院) Adobe Research(Adobe研究)

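AI summary: This paper proposes TrajectoryMover, a method for moving object trajectories in videos, built on a pipeline that generates large-scale synthetic paired video data and a video generator fine-tuned on that data.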

Comments 15 pages, 9 figures. Project page: https://chhatrekiran.github.io/trajectorymover

详情
AI中文摘要

生成性视频编辑已经使一些直观的编辑操作成为可能,这些操作以前在短视频片段中难以实现,特别是对于非专业编辑者而言。现有方法专注于在视频中为对象的3D或2D运动轨迹指定路径,或改变对象或场景的外观,同时保持视频的合理性和身份。然而,目前仍缺少一种方法,可以在视频中移动对象的3D运动轨迹,即在保持其相对3D运动的情况下移动对象。主要挑战在于获取这种场景下的配对视频数据。先前的方法通常依赖于巧妙的数据生成方法,从不成对的视频中构造出合理的配对数据,但这种方法在无法从另一视频轻易构造出配对视频时会失效。相反,我们引入了TrajectoryAtlas,一种新的大规模合成配对视频数据生成管道,以及一个通过此数据细调的视频生成器TrajectoryMover。我们证明这种方法成功实现了物体轨迹的生成性移动。项目页面:https://chhatrekiran.github.io/trajectorymover

英文摘要

Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object's 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video's plausibility and identity. Yet a method to move an object's 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair cannot easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data, and a video generator, TrajectoryMover, fine-tuned with this data. We show that this approach successfully enables generative movement of object trajectories. Project page: https://chhatrekiran.github.io/trajectorymover

2603.16284 2026-05-20 cs.CV cs.LG 版本更新

Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

定位后再稀疏化:基于归因的视觉幻觉缓解稀疏策略

Tiantian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang, Shuhui Wang

发表机构 * State Key Lab. of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院人工智能安全国家重点实验室,计算技术研究所) School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences(中国科学院大学先进交叉科学学院) School of Computer Science and Technology, University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院)

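AI summary: This paper proposes Locate-Then-Sparsify for Feature Steering (LTS-FS), a framework that adjusts feature steering intensity according to each layer's relevance to hallucination, mitigating hallucinations in vision-language models while preserving strong performance.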

Comments Accepted by CVPR 2026

详情
AI中文摘要

尽管大型视觉-语言模型(LVLMs)在技术上取得了显著进展,但其生成幻觉的倾向削弱了可靠性并限制了更广泛的实际应用。在幻觉缓解方法中,特征引导作为一种有前景的方法,能够在不增加推理成本的情况下减少LVLMs中的错误输出。然而,当前的方法在所有层上应用统一的特征引导策略。这种启发式策略忽略了层间的差异,可能会干扰与幻觉无关的层,最终导致在通用任务上的性能下降。在本文中,我们提出了一种名为Locate-Then-Sparsify for Feature Steering (LTS-FS)的即插即用框架,该框架根据每层与幻觉的相关性来控制引导强度。我们首先构建了一个包含token级和句子级幻觉案例的数据集。基于此数据集,我们引入了一种基于因果干预的归因方法,以量化每层的幻觉相关性。利用各层的归因分数,我们提出了一种逐层策略,将这些分数转换为针对单个层的特征引导强度,从而在幻觉相关的层上实现更精确的调整。在多个LVLMs和基准测试中进行的广泛实验表明,LTS-FS有效缓解了幻觉问题,同时保持了强大的性能。代码可在https://github.com/huttersadan/LTS-FS上获得。

英文摘要

Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose Locate-Then-Sparsify for Feature Steering (LTS-FS), a plug-and-play framework that controls the steering intensity according to the hallucination relevance of each layer. We first construct a dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that LTS-FS effectively mitigates hallucination while preserving strong performance. Code is available at https://github.com/huttersadan/LTS-FS.
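
The layerwise idea described in this abstract, turning per-layer attribution scores into per-layer steering intensities, might be sketched as follows. The top-k rule, the normalization by the peak score, and the names `steering_intensities` and `steer` are assumptions made for illustration only; the abstract does not give the actual conversion used by LTS-FS.

```python
# Hypothetical sketch: sparse, layer-dependent steering strengths.
# The selection rule (keep the top-k layers) is an assumption, not the paper's.

def steering_intensities(scores, keep=2, max_alpha=1.0):
    """Map attribution scores to strengths; layers outside the top `keep` get 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:keep])
    peak = max(scores[i] for i in top)
    if peak <= 0:
        return [0.0] * len(scores)
    return [max_alpha * scores[i] / peak if i in top else 0.0
            for i in range(len(scores))]

def steer(hidden, direction, alpha):
    """Shift a layer's feature vector along a steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

layer_scores = [0.05, 0.40, 0.10, 0.80]   # toy attribution scores, one per layer
alphas = steering_intensities(layer_scores, keep=2)
print(alphas)
print(steer([1.0, 2.0], [0.5, -0.5], alphas[1]))
```

Layers with zero intensity are left untouched, which reflects the stated goal of not disturbing layers unrelated to hallucination.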

2603.13609 2026-05-20 cs.CV 版本更新

A Grid-Based Framework for E-Scooter Demand Representation and Temporal Input Design for Deep Learning: Evidence from Austin, Texas

基于网格的电动滑板车需求表示与深度学习的时序输入设计框架:以德克萨斯州奥斯汀为例

Mohammad Sahnoon, Merkebe Getachew Demissie, Roberto Souza

发表机构 * Schulich School of Engineering, University of Calgary(卡尔加里大学舒立克工程学院)

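AI summary: This paper proposes a grid-based representation of e-scooter demand and a temporal input design framework for deep learning. A systematic data-processing pipeline and a statistically grounded method support consistent spatial learning while preserving demand patterns, reducing mean squared error by up to 37% for next-hour prediction and 35% for next-24-hour prediction.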

Comments 16 pages, 7 tables, 10 figures

详情
AI中文摘要

尽管在共享微出行需求预测方面深度学习取得了进展,但系统设计和时序输入结构的统计验证仍然缺乏。时序特征通常被启发式选择,尽管历史需求强烈影响模型性能和泛化能力。本文介绍了一种可重复的数据处理流程和一种基于统计学的方法,用于设计图像到图像需求预测的时序输入结构。利用德克萨斯州奥斯汀的大规模电动滑板车数据,我们通过将行程记录转换为每小时的起点和终点需求图像,构建了一个基于网格的时空数据集。该流程包括行程过滤、将人口普查街区映射到空间位置、网格构建、需求汇总以及创建一个全球活动掩码,以限制评估仅限于历史上活跃的区域。这种表示支持一致的空间学习,同时保留需求模式。我们随后引入了一种结合相关性和误差的程序来识别有信息的历史输入。通过使用基线UNET模型的消融研究,结合配对非参数检验和Holm校正,选择最优的时序深度。所得到的时序结构能够捕捉短期持续性以及日和周周期。与相邻小时和固定周期基线相比,所提出的设计在下一小时预测中将均方误差降低了高达37%,在下一24小时预测中降低了35%。这些结果突显了系统性数据集构建和统计学验证的时序输入设计在时空微出行需求预测中的价值。

英文摘要

Despite progress in deep learning for shared micromobility demand prediction, the systematic design and statistical validation of temporal input structures remain underexplored. Temporal features are often selected heuristically, even though historical demand strongly affects model performance and generalizability. This paper introduces a reproducible data-processing pipeline and a statistically grounded method for designing temporal input structures for image-to-image demand prediction. Using large-scale e-scooter data from Austin, Texas, we build a grid-based spatiotemporal dataset by converting trip records into hourly pickup and dropoff demand images. The pipeline includes trip filtering, mapping Census Tracts to spatial locations, grid construction, demand aggregation, and creation of a global activity mask that limits evaluation to historically active areas. This representation supports consistent spatial learning while preserving demand patterns. We then introduce a combined correlation- and error-based procedure to identify informative historical inputs. Optimal temporal depth is selected through an ablation study using a baseline UNET model with paired non-parametric tests and Holm correction. The resulting temporal structures capture short-term persistence as well as daily and weekly cycles. Compared with adjacent-hour and fixed-period baselines, the proposed design reduces mean squared error by up to 37 percent for next-hour prediction and 35 percent for next-24-hour prediction. These results highlight the value of principled dataset construction and statistically validated temporal input design for spatiotemporal micromobility demand prediction.
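
The abstract mentions paired non-parametric tests with Holm correction for choosing temporal depth. A minimal sketch of the Holm step-down procedure is shown below; the p-values are invented for illustration, and how the authors computed or grouped their tests is not stated in the abstract.

```python
# Sketch of the Holm step-down correction for multiple comparisons.
# The p-values used here are hypothetical.

def holm_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis (True = reject the null)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every remaining (larger) p-value is also not rejected
    return reject

p_values = [0.01, 0.04, 0.03, 0.005]   # e.g. one test per candidate temporal depth
print(holm_reject(p_values))
```

With four tests, the thresholds are 0.0125, 0.0167, 0.025, and 0.05 in order of increasing p-value, so the procedure stops at the third smallest value (0.03).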

2603.11024 2026-05-20 cs.CV cs.AI 版本更新

Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

AI 是否能像艺术史家一样看?解析视觉语言模型如何识别艺术风格

Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Emily L. Spratt, Anna Filonenko, Hannah Pivo, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown

发表机构 * Columbia University, Department of Computer Science(哥伦比亚大学计算机科学系) Columbia University, Department of Art History & Archaeology(哥伦比亚大学艺术史与考古系) University of Texas at Austin(德克萨斯大学奥斯汀分校) UNC Chapel Hill(北卡罗来纳大学教堂山分校)

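AI summary: Through an interdisciplinary collaboration, this paper studies the mechanisms by which vision-language models (VLMs) predict artistic style and assesses how well they align with the criteria art historians use to judge style.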

Comments 20 pages, 18 figures

详情
AI中文摘要

视觉语言模型(VLMs)在多种计算机视觉任务上已表现出越来越强的能力,例如视觉问答和目标检测。这包括在艺术领域中越来越强的能力,从分析艺术品到生成艺术品。在计算机科学家和艺术史家的跨学科合作中,我们表征了VLMs预测艺术风格的机制,并评估其与艺术史家用于推理艺术风格标准的契合程度。我们采用潜在空间分解方法来识别驱动艺术风格预测的概念,并通过定量评估、因果分析和艺术史家的评估进行评估。我们的发现表明,73%的提取概念被艺术史家认为具有连贯且语义明确的视觉特征,90%用于预测特定艺术品风格的概念被判定为相关。在无关概念成功预测风格的情况下,艺术史家发现了其成功的原因;例如,模型可能以更正式的方式理解概念,如明暗对比。

英文摘要

VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generating art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis, and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature, and 90% of the concepts used to predict the style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.

2603.07561 2026-05-20 cs.CV 版本更新

PureCC: Pure Learning for Text-to-Image Concept Customization

PureCC: 文本到图像概念定制的纯学习

Zhichao Liao, Xiaole Xian, Qingyu Li, Wenyu Qin, Meng Wang, Weicheng Xie, Siyang Song, Pingfa Feng, Long Zeng, Liang Pan

发表机构 * Tsinghua University(清华大学) School of Computer Science & Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University(深圳大学智能信息处理广东省重点实验室) Kling Team, Kuaishou Technology(快手科技Kling团队) University of Exeter(埃克塞特大学) S-Lab, Nanyang Technological University(南洋理工大学S实验室)

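AI summary: This paper proposes PureCC, a pure learning method for text-to-image concept customization that uses a decoupled learning objective to balance customization fidelity and preservation of the original model.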

Comments Accepted to CVPR 2026

详情
AI中文摘要

现有概念定制方法在高保真和多概念定制方面取得了显著成果。然而,它们往往忽视了在学习新个性化概念时对原始模型行为和能力的影响。为了解决这个问题,我们提出了PureCC。PureCC引入了一个新的分离学习目标用于概念定制,结合了目标概念的隐式指导与原始条件预测。这种分离形式使PureCC在训练过程中能够显著专注于原始模型。此外,基于此目标,PureCC设计了一个双分支训练流水线,包括一个冻结的提取器提供纯净的目标概念表示作为隐式指导,以及一个可训练的流模型产生原始条件预测,共同实现对个性化概念的纯学习。此外,PureCC引入了一个新的自适应指导尺度$λ^\star$,以动态调整目标概念的指导强度,平衡定制保真度和模型保留。广泛的实验表明,PureCC在保留原始行为和能力的同时,实现了高保真的概念定制。代码可在https://github.com/lzc-sg/PureCC上获得。

英文摘要

Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale $λ^\star$ to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization. The code is available at https://github.com/lzc-sg/PureCC.

2603.03066 2026-05-20 cs.CV 版本更新

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

EduVQA: 向概念感知的教育AI生成视频评估迈进

Baoliang Chen, Xinlong Bu, Hanwei Zhu, Lingyu Zhu, Jieyu Zhan

发表机构 * College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Department of Computer Science, South China Normal University, China(华南师范大学计算机学院) School of Computer Science, City University of Hong Kong(香港城市大学计算机科学学院)

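AI summary: This study proposes EduVQA, a framework with a Structured 2D Mixture-of-Experts architecture for concept-aware assessment of educational AI-generated videos, addressing the neglect of concept correctness by conventional methods in educational scenarios.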

详情
AI中文摘要

现有的AI生成视频质量评估(AIGVQA)方法主要关注全局感知真实性和粗略的文本-视频对齐,而忽视了教育场景中的关键要求:概念正确性。在早期数学教育中,即使视觉上合理,数值量、几何关系或空间配置中的细微错误也可能从根本上改变传达的知识。为了解决这个问题,我们引入了EduAVQABench,这是首个概念感知的教育AIGV评估基准,包含1,130个由十种最先进的T2V模型生成的视频,以及超过310,650个精细的人工标注,涵盖感知质量和语义对齐。基于此基准,我们进一步提出了EduVQA,一个概念感知的AIGVQA框架,配备了结构化2D混合专家(S2D-MoE)架构。通过通过共享专家和自适应二维路由联合建模细粒度概念评估和整体质量预测,EduVQA有效地捕捉了传统全局评分方法所忽略的细微概念层面不一致。广泛的实验表明,EduVQA在感知和语义评估任务中均优于现有AIGVQA方法,并在未见过的基准上表现出强大的泛化能力。代码和数据集将在:https://github.com/EduVQA/EduVQA 公开。

英文摘要

Existing AI-generated video quality assessment (AIGVQA) methods mainly focus on global perceptual realism and coarse text-video alignment, while overlooking a critical requirement in educational scenarios: concept correctness. In early mathematics education, subtle errors in numerical quantities, geometric relations, or spatial configurations may fundamentally alter the conveyed knowledge despite visually plausible generation. To address this problem, we introduce EduAVQABench, the first benchmark for concept-aware educational AIGV assessment, containing 1,130 videos generated by ten state-of-the-art T2V models together with over 310,650 fine-grained human annotations spanning perceptual quality and semantic alignment. Built upon this benchmark, we further propose EduVQA, a concept-aware AIGVQA framework equipped with a Structured 2D Mixture-of-Experts (S2D-MoE) architecture. By jointly modeling fine-grained concept assessment and overall quality prediction through shared experts and adaptive two-dimensional routing, EduVQA effectively captures subtle concept-level inconsistencies overlooked by conventional global scoring methods. Extensive experiments demonstrate that EduVQA consistently outperforms existing AIGVQA approaches across both perceptual and semantic evaluation tasks while exhibiting strong generalization capability on unseen benchmarks. Code and dataset will be publicly available at: https://github.com/EduVQA/EduVQA.

2602.23622 2026-05-20 cs.CV cs.AI 版本更新

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

DLEBench: 评估基于指令的图像编辑模型在小规模物体编辑能力

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(复旦大学)

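AI summary: This paper proposes DLEBench, the first benchmark dedicated to evaluating how well instruction-based image editing models edit small-scale objects. Its 1889 samples cover complex scenarios and reveal performance gaps on small objects, underscoring the need for specialized benchmarks.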

详情
AI中文摘要

在基于指令的图像编辑模型(IIEMs)领域已取得显著进展。然而,尽管这些模型在当前基准上表现出对指令的合理遵循和强大的推理能力,但它们在编辑小物体方面的能力仍缺乏深入探索,尽管这对精确局部编辑和生成图像中细节的细化至关重要。本文介绍了DeepLookEditBench(DLEBench),首个专门评估IIEMs在编辑小规模物体能力的基准。具体而言,我们构建了一个包含七个指令类型的挑战性测试平台,共1889个样本。在这些样本中,目标物体仅占图像面积的1%-10%,涵盖了部分遮挡和多物体编辑等复杂场景。为确保对本基准的稳健评估,我们提出了一种评估协议,包含细化的评分标准,以最小化在“指令遵循”和“视觉一致性”两个标准中的主观性和歧义性。该协议还引入了双模式评估框架(工具驱动模式和Oracle引导模式),以解决DLEBench中LMM-as-a-Judge与人类判断之间的不一致问题。在10个IIEMs上的实证结果揭示了小规模物体编辑上的显著性能差距,突显了专用基准在推动该能力发展方面的重要性。

英文摘要

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
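
The 1%-10% area criterion stated in this abstract can be checked with a few lines of arithmetic. The sketch below assumes targets are described by axis-aligned bounding boxes in pixel coordinates `(x0, y0, x1, y1)`; that representation and the function names are assumptions, since the abstract does not say how target area is measured.

```python
# Hypothetical sketch: does a target object occupy 1%-10% of the image area?
# Bounding-box input is an assumption; the benchmark may use masks instead.

def area_ratio(box, image_size):
    """Fraction of the image covered by an axis-aligned box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return ((x1 - x0) * (y1 - y0)) / (width * height)

def is_small_target(box, image_size, lo=0.01, hi=0.10):
    """True when the box area falls inside the stated 1%-10% range."""
    return lo <= area_ratio(box, image_size) <= hi

print(area_ratio((100, 100, 300, 250), (1000, 1000)))   # 200 x 150 px box
print(is_small_target((100, 100, 300, 250), (1000, 1000)))
print(is_small_target((0, 0, 500, 500), (1000, 1000)))
```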

2602.09872 2026-05-20 cs.CV cs.HC 版本更新

BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices

BabyMamba-HAR:轻量级选择性状态空间模型用于资源受限设备上高效的人体活动识别

Mridankan Mandal

发表机构 * Department of Information Technology, Indian Institute of Information Technology Allahabad, Prayagraj(印度信息技术学院阿拉哈巴德分校信息技术系)

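AI summary: This paper proposes BabyMamba-HAR, a pair of lightweight selective state space model architectures for efficient human activity recognition on resource-constrained devices, combining high accuracy with low resource consumption.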

详情
AI中文摘要

在资源受限的设备上进行人体活动识别(HAR)需要在多样化的传感器设置下保持高精度。选择性状态空间模型(SSMs)提供了高效的线性时间序列处理,成为注意力机制的一种有吸引力的替代方案。然而,其TinyML设计空间仍待探索。本文介绍了BabyMamba-HAR,包含两种轻量级架构:(1)CI-BabyMamba-HAR,利用通道独立的茎部以提高噪声鲁棒性;(2)Crossover-BiDir-BabyMamba-HAR,利用早期融合的茎部以实现通道计数独立的复杂度。两者都集成了权重绑定的双向扫描和门控时间注意力池化。在八个基准测试中,Crossover-BiDir-BabyMamba-HAR平均达到86.52%的F1分数,使用27K参数和2.21M MACs,与TinyHAR(86.16%)相当,但要求在高通道数据集上减少11倍的MACs。在设备上部署到Raspberry Pi Pico 2和ESP32上使用混合精度C++运行时(INT8投影,float32状态)。融合计算策略与生命周期感知内存管理将峰值内存足迹从O(B*dmodel*L*dstate)减少到O(B*dmodel*dstate),适应于支持权重绑定的双向和通道流执行。两种架构均实现了完整的8/8数据集覆盖,与PyTorch的>99.2%的兼容性,而INT8量化TFLite基线显示了退化的覆盖和兼容性(TinyHAR:7/8和4/8覆盖,60.4%和88.6%兼容性,TinierHAR:8/8和6/8在54.2%和90.8%兼容性,DeepConvLSTM:1/8和0/8在Pico 2和ESP32上)。Crossover-BiDir-BabyMamba-HAR在ESP32上平均延迟为154.4 ms,在Pico 2上为481.9 ms。消融实验确认双向扫描和门控注意力分别将F1分数提高高达8.42%和8.94%,建立了TinyML SSM部署的实用原则。

英文摘要

Human activity recognition (HAR) on resource constrained devices requires high accuracy across diverse sensor setups. Selective state space models (SSMs) offer efficient linear time sequence processing, presenting a compelling alternative to attention mechanisms. However, their TinyML design space remains unexplored. This paper introduces BabyMamba-HAR, comprising two lightweight architectures: (1) CI-BabyMamba-HAR, utilizing a channel independent stem for noise robustness, and (2) Crossover-BiDir-BabyMamba-HAR, utilizing an early fusion stem for channel count independent complexity. Both integrate weight tied bidirectional scanning and gated temporal attention pooling. Across eight benchmarks, Crossover-BiDir-BabyMamba-HAR averages an 86.52% F1-score with 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. On-device deployment on the Raspberry Pi Pico 2 and ESP32 utilized a mixed precision C++ runtime (INT8 projections, float32 states). A fused computation strategy with lifetime aware memory management reduces peak memory footprint from $O(B \cdot d_{\text{model}} \cdot L \cdot d_{\text{state}})$ to $O(B \cdot d_{\text{model}} \cdot d_{\text{state}})$, adapting to support weight-tied bidirectional and channel-streaming execution. Both architectures achieved full 8/8 dataset coverage with >99.2% PyTorch parity, whereas INT8 quantized TFLite baselines showed degraded coverage and parity (TinyHAR: 7/8 and 4/8 coverage at 60.4% and 88.6% parity, TinierHAR: 8/8 and 6/8 at 54.2% and 90.8%, DeepConvLSTM: 1/8 and 0/8 on Pico 2 and ESP32, respectively). Crossover-BiDir-BabyMamba-HAR averages 154.4 ms latency on ESP32 and 481.9 ms on Pico 2. Ablations confirm bidirectional scanning and gated attention improve F1-scores by up to 8.42% and 8.94%, respectively, establishing practical principles for TinyML SSM deployment.
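
The memory reduction described in this abstract (dropping the sequence-length factor L from the peak state footprint) follows from not materializing the hidden state at every time step. The sketch below illustrates this for a single channel with a diagonal linear recurrence; it uses fixed coefficients `a`, `b`, `c` for simplicity, whereas a selective SSM makes them input dependent. It is not the paper's fused C++ runtime, and all names are hypothetical.

```python
# Hypothetical sketch: same recurrence, two memory profiles.
# h_t = a * h_{t-1} + b * x_t (elementwise over the state), y_t = sum(c * h_t)

def scan_materialized(x, a, b, c):
    """Stores every hidden state, so state memory grows with len(x)."""
    n = len(a)
    states = [[0.0] * n]
    for x_t in x:
        prev = states[-1]
        states.append([a[k] * prev[k] + b[k] * x_t for k in range(n)])
    return [sum(c[k] * h[k] for k in range(n)) for h in states[1:]]

def scan_streaming(x, a, b, c):
    """Keeps only the current hidden state, so state memory is independent of len(x)."""
    n = len(a)
    h = [0.0] * n
    outputs = []
    for x_t in x:
        h = [a[k] * h[k] + b[k] * x_t for k in range(n)]
        outputs.append(sum(c[k] * h[k] for k in range(n)))
    return outputs

x = [1.0, 0.0, 2.0]                      # toy input sequence (L = 3)
a, b, c = [0.5, 0.9], [1.0, 1.0], [1.0, -1.0]
print(scan_materialized(x, a, b, c) == scan_streaming(x, a, b, c))
```

Both functions perform the same arithmetic in the same order, so their outputs agree; only the number of state vectors held at once differs.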

2602.07570 2026-05-20 q-bio.NC cs.AI cs.CV cs.LG 版本更新

How does longer temporal context enhance multimodal narrative video processing in the brain?

更长的时间上下文如何增强大脑对多模态叙事视频的处理?

Prachi Jindal, Anant Khandelwal, Manish Gupta, Bapi S. Raju, Subba Reddy Oota, Tanmoy Chakraborty

发表机构 * Technische Universität Berlin(柏林工业大学) Microsoft Research(微软研究院) IIT Delhi(印度理工学院德里分校) Microsoft(微软) IIIT-Hyderabad(海得拉巴国际信息技术学院)

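AI summary: This study examines how video clip duration and narrative-task prompts shape brain-model alignment during naturalistic movie watching, finding that longer clips substantially improve brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain.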

Comments 22 pages, 15 figures

详情
AI中文摘要

理解人类和人工智能系统如何处理复杂的叙事视频是一个在神经科学和机器学习交汇处的基本挑战。本研究调查了视频片段的时间上下文长度(3-24秒片段)和叙事任务提示如何影响自然电影观看过程中大脑模型的对齐情况。利用受试者观看完整电影的fMRI记录,我们研究了对叙事上下文敏感的大脑区域如何在不同时间尺度上动态表示信息,以及这些神经模式如何与模型派生的特征对齐。我们发现,增加片段持续时间显著提高了多模态大语言模型(MLLMs)的大脑对齐程度,而单模态视频模型则几乎没有提升。进一步地,较短的时间窗口与感知和早期语言区域对齐,而较长的窗口则更倾向于与更高阶整合区域对齐,这在MLLMs中表现为层到皮层的层次结构。最后,使用四个叙事任务提示的实验显示,这些提示会引发任务特定、区域依赖性的大脑对齐模式,并在更高阶区域引起上下文依赖的片段级调谐变化。我们的工作将长篇叙事电影定位为研究长时间尺度时间整合在长上下文MLLMs中的原理性测试平台,以及其与叙事理解过程中皮层响应关系的桥梁。

英文摘要

Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3-24 s) and narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, experiments with four narrative-task prompts show that they elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Our work positions long-form narrative movies as a principled testbed for studying long-timescale temporal integration in long-context MLLMs and its relationship to cortical responses during narrative comprehension.

2602.07008 2026-05-20 cs.CV cs.LG 版本更新

Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

不应学习的地方:基于子集归因约束的先验对齐训练以实现可靠的决策制定

Ruoyu Chen, Shangquan Sun, Xiaoqing Guo, Sanyi Zhang, Kangwei Liu, Shiming Liu, Zhangcheng Wang, Qunli Zhang, Hua Zhang, Xiaochun Cao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) University of Chinese Academy of Sciences(中国科学院大学) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系) Communication University of China(中国传媒大学) Imperial College London(伦敦帝国理工学院) School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区网络科学与技术学院)

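AI summary: This paper proposes an attribution-based prior alignment method that uses subset-selection attribution to constrain the model to rely on human-prior regions, improving decision reliability.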

详情
AI中文摘要

可靠的模型不仅要预测正确,还要能用可接受的证据来解释决策。然而,传统监督学习通常只提供类别级标签,使模型通过捷径相关性实现高精度,而非预期的证据。人类先验可以约束此类行为,但对齐模型到这些先验仍然具有挑战性,因为学习的表示往往偏离人类感知。为了解决这一挑战,我们提出了一种基于归因的人类先验对齐方法。我们将人类先验编码为模型应依赖的输入区域(例如边界框),并利用高度忠实的子集选择归因方法,在训练过程中暴露模型的决策证据。当归因区域显著偏离先验区域时,我们惩罚对非先验证据的依赖,促使模型将归因转向预期区域。这是通过一个训练目标实现的,该目标通过人类先验诱导归因约束。我们在基于MLLM的GUI代理模型上验证了我们的方法,涵盖图像分类和点击决策任务。在传统分类和自回归生成设置中,人类先验对齐一致提高了任务准确性,同时增强了模型的决策合理性。

英文摘要

Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model's decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model's decision reasonability.
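
One way to picture the attribution constraint in this abstract is a penalty on the share of attribution that falls outside the human-prior region, applied only when that share is substantial. The hinge form, the `tolerance` parameter, and the grid representation below are assumptions for illustration; the abstract does not give the paper's actual training objective.

```python
# Hypothetical sketch: penalize attribution mass that lies outside the prior region.
# `attribution` is a 2D grid of non-negative scores; `prior_mask` marks the
# region (e.g. a bounding box) the model is expected to rely on.

def off_prior_penalty(attribution, prior_mask, tolerance=0.1):
    """Hinge penalty on the fraction of attribution outside the prior region."""
    total = sum(sum(row) for row in attribution)
    if total == 0:
        return 0.0
    outside = sum(
        score
        for score_row, mask_row in zip(attribution, prior_mask)
        for score, inside in zip(score_row, mask_row)
        if not inside
    )
    return max(0.0, outside / total - tolerance)

attribution = [[1, 1], [6, 2]]      # toy attribution map
prior_mask = [[0, 0], [1, 1]]       # bottom row is the human-prior region
print(off_prior_penalty(attribution, prior_mask))
```

In this toy case 20% of the attribution lies outside the prior region, so with a 10% tolerance the penalty is about 0.1; an attribution map fully inside the region yields no penalty.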

2602.04381 2026-05-20 cs.CV cs.AI 版本更新

Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture

通过超轻量架构在商用CPU上实现实时结肠镜息肉分割

Weihao Gao, Zhuo Deng, Zheng Gong, Lan Ma

发表机构 * School of Computer Science and Artificial Intelligence, Guangdong University of Education(广东第二师范学院计算机科学与人工智能学院) Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院)

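AI summary: This paper proposes the UltraSeg family of lightweight, CPU-native segmentation models that achieve real-time colonoscopic polyp segmentation without a GPU. The core method uses grouped multi-rate dilated convolutions and attention-gated cross-layer fusion, and the main contribution is the first strong baseline for high-accuracy real-time polyp segmentation on commodity CPUs.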

Comments 18 pages, 4 figures

详情
AI中文摘要

实时息肉分割对于早期结直肠癌检测至关重要,但临床部署仍受GPU依赖的阻碍。我们引入UltraSeg家族,一组在CPU上运行的分割模型,参数量低于0.3M。UltraSeg-108K(0.108M)建立了极端压缩的前沿,而UltraSeg-130K(0.130M)通过跨层轻量融合提升了多中心泛化能力。该架构用组多率扩张卷积和注意力门控跨层融合取代参数密集的组件,实现了在单个CPU核心上实时吞吐(在256*256分辨率上超过50 FPS,在352*352分辨率上超过30 FPS)而不牺牲临床级精度。在七个公开数据集上评估,UltraSeg-130K在两个分辨率上均达到Dice分数超过0.8,显著优于所有现有的子0.3M竞争者。值得注意的是,在零样本外部验证中,它接近或超过了UNet-Medium(7.76M参数)的性能,但仅使用其1.7%的参数,建立了首个在CPU上实现实时息肉分割的强基准线。当扩展到4.38M参数时,UltraSeg的准确性可与重型最先进的模型相媲美,同时保持数量级的参数优势,证明了所提出的设计原则在效率光谱的整个范围内实现了内在的表示增益。通过提供首个在商用CPU上可部署的实时解决方案,本工作为资源有限的环境提供了一个立即可用的工具,并为超越内窥镜的实时医疗AI提供了可复现的蓝图。源代码已公开。

英文摘要

Real-time polyp segmentation is essential for early colorectal cancer detection, yet clinical deployment remains blocked by GPU dependency. We introduce the UltraSeg family, a set of CPU-native segmentation models operating below 0.3M parameters. UltraSeg-108K (0.108M) establishes the extreme-compression frontier, while UltraSeg-130K (0.130M) integrates cross-layer lightweight fusion for enhanced multi-center generalization. The architecture replaces parameter-heavy components with grouped multi-rate dilated convolutions and attention-gated cross-layer fusion, achieving real-time throughput on a single CPU core (exceeding 50 FPS at 256*256 and 30 FPS at 352*352) without sacrificing clinical-grade accuracy. Evaluated on seven public datasets, UltraSeg-130K attains Dice scores exceeding 0.8 at both resolutions, substantially outperforming all existing sub-0.3M competitors. Notably, it approaches or exceeds UNet-Medium (7.76M parameters) on zero-shot external validations while using only 1.7% of its parameters, establishing the first strong baseline for CPU-native real-time polyp segmentation. When scaled to 4.38M parameters, UltraSeg achieves accuracy competitive with heavyweight state-of-the-art models while maintaining an order-of-magnitude parameter advantage, demonstrating that the proposed design principles yield intrinsic representational gains across the entire efficiency spectrum. By delivering the first clinically deployable, CPU-native real-time solution, this work provides an immediately usable tool for resource-limited settings and a reproducible blueprint for real-time medical AI beyond endoscopy. Source code is publicly available.
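
Two of the numbers in this abstract can be checked or illustrated with a few lines of code. The sketch below shows the standard Dice coefficient on binary masks and recomputes the quoted parameter ratio (0.130M versus 7.76M). The toy masks and the helper name `dice_score` are illustrative assumptions, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as 2D lists of 0/1."""
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p * t
            pred_sum += p
            target_sum += t
    if pred_sum + target_sum == 0:
        return 1.0  # convention: two empty masks count as a perfect match
    return 2.0 * intersection / (pred_sum + target_sum)


# Toy masks: 3 predicted pixels, 4 ground-truth pixels, 3 overlapping.
pred = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
target = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
print(dice_score(pred, target))  # 2 * 3 / (3 + 4)

# Parameter ratio quoted in the abstract: UltraSeg-130K vs UNet-Medium.
ratio = 0.130 / 7.76
print(f"{ratio:.1%}")
```

The ratio rounds to 1.7%, which matches the figure stated in the abstract.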

2602.03454 2026-05-20 cs.CV 版本更新

Contextualized Visual Personalization in Vision-Language Models

基于上下文的视觉个性化在视觉-语言模型中

Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon

Affiliations: Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea; Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, Korea

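AI summary: This paper proposes a contextualized visual personalization method that improves the personalized image captioning ability of vision-language models through reinforcement-learning-based post-training and caption-augmented generation. Diagnostic evaluations verify that models genuinely use visual context, and CoViP yields holistic gains on downstream personalization tasks.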

Comments Accepted at ICML 2026

详情
AI中文摘要

尽管视觉-语言模型(VLMs)在最近取得了进展,但现有方法往往无法根据用户的特定经历生成个性化响应,因为它们缺乏将视觉输入与用户积累的视觉-文本上下文相关联的能力。我们首次将这一挑战正式化为“基于上下文的视觉个性化”,要求VLMs在解释新图像时通过视觉识别和文本检索个性化视觉经验。为了解决这一问题,我们提出了CoViP,一个统一的框架,将个性化图像描述作为基于上下文的视觉个性化的核心任务,并通过基于强化学习的后训练和描述增强生成来提高这一能力。我们进一步引入了诊断评估,明确排除了文本捷径解决方案,并验证VLMs是否真正利用了视觉上下文。广泛的实验表明,现有开源和专有VLMs存在显著限制,而CoViP不仅提高了个性化图像描述能力,还在下游个性化任务中实现了全面提升。这些结果突显了CoViP作为实现稳健且可推广的基于上下文的视觉个性化关键阶段的重要性。

英文摘要

Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.

2602.03139 2026-05-20 cs.CV 版本更新

Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

保留多样性的分布匹配蒸馏用于快速视觉合成

Tianhe Wu, Ruibin Li, Lei Zhang, Kede Ma

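AI summary: This paper proposes diversity-preserved distribution matching distillation (DP-DMD), a role-separated distillation strategy that preserves sample diversity under few-step sampling while maintaining competitive visual quality, offering a simple and stable alternative to other DMD variants.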

详情
AI中文摘要

分布匹配蒸馏(DMD)通过将蒸馏的学生模型与参考多步骤教师模型对齐,实现了少步图像生成。然而,在实践中,优化DMD可能会减少少步合成中的样本多样性,而现有解决方案通常依赖于感知或对抗正则化,导致训练过程中的稳定性和可扩展性挑战。本文描述了保留多样性的DMD(DP-DMD),一种受早期和晚期去噪步骤互补作用启发的角色分离蒸馏方法。具体而言,第一个蒸馏步骤通过教师衍生的目标预测目标(例如v-prediction)进行训练,以保留样本多样性,而其余步骤则通过标准DMD损失进行优化,以提高感知质量。DP-DMD无需感知或对抗正则化、额外模块和教师生成的参考样本,在少量步骤采样下保持样本多样性,同时维持竞争性的视觉质量,为其他DMD变体提供了一种简单且稳定的替代方案。

英文摘要

Distribution matching distillation (DMD) facilitates few-step image generation by aligning a distilled student with a reference multi-step teacher. In practice, however, optimizing DMD can reduce sample diversity in few-step synthesis, and existing remedies typically rely on perceptual or adversarial regularization, leading to stability and scalability challenges during training. Here, we describe diversity-preserved DMD (DP-DMD), a role-separated distillation method inspired by the complementary roles of early and late denoising steps. Specifically, the first distillation step is trained with a teacher-derived target-prediction objective (e.g., v-prediction) to preserve sample diversity, while the remaining steps are optimized with the standard DMD loss to refine perceptual quality. DP-DMD, with no perceptual or adversarial regularization, no additional modules, and no teacher-generated reference samples, preserves sample diversity while maintaining competitive visual quality under few-step sampling, providing a simple and stable alternative to other DMD variants.

2601.20308 2026-05-20 cs.CV cs.GR 版本更新

Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

通过一步扩散模型平滑现实世界的时空视频超分辨率

Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao, Huihui Bai

Affiliations: Institute of Information Science, Beijing Jiaotong University; Visual Intelligence + X International Cooperation Joint Laboratory of MOE, Beijing; Innovation School of Artificial Intelligence, Hefei University of Technology; School of Control Science and Engineering, Shandong University

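AI summary: This paper proposes OSDEnhancer, a framework that achieves robust space-time video super-resolution with one-step diffusion under complex, unknown real-world degradations. A linear initialization and a divide-and-conquer strategy improve the recovery of spatiotemporal dynamics and textures.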

Comments 12 pages, 9 figures

详情
AI中文摘要

扩散模型在视频超分辨率(VSR)中表现出色,能够生成精细细节。然而,其在时空视频超分辨率(STVSR)中的潜力仍被忽视,STVSR需要恢复真实的高分辨率视觉内容并提高帧率,同时保持时间动态的一致性。此外,现有STVSR方法主要在简单退化假设下处理时空上采样,无法应对现实世界中复杂的未知退化。为了解决这些挑战,我们提出了OSDEnhancer,这是首个在一步扩散中实现稳健STVSR的框架。OSDEnhancer首先通过线性初始化建立必要的时空结构并适应模型进行一步重建。然后应用分治策略,引入时间一致性(TC)和纹理丰富(TE)LoRAs,分别专注于帧间动态建模和精细纹理恢复,同时在推理过程中协作以提升整体性能。双向VAE解码器使用可变形递归块来利用常规VAE的多尺度结构,通过联合多尺度可变形聚合和帧间特征传播增强潜在到像素的重建。实验结果表明,所提出的方法在现实世界场景中实现了最先进的性能,并具有更强的泛化能力。代码可在https://github.com/W-Shuoyan/OSDEnhancer获取。

英文摘要

Diffusion models have demonstrated exceptional success in video super-resolution (VSR), exhibiting powerful capabilities for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic high-resolution visual content but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simple degradation assumptions, thus failing in real-world scenarios with complex unknown degradations. To address these challenges, we propose OSDEnhancer, the first framework that achieves robust STVSR in one-step diffusion. OSDEnhancer begins with a linear initialization to establish essential spatiotemporal structures and adapt the model for one-step reconstruction. It then applies a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference for enhanced overall performance. A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation. Experimental results demonstrate that the proposed method attains state-of-the-art performance with superior generalization in real-world scenarios. The code is available at https://github.com/W-Shuoyan/OSDEnhancer.

2601.18993 2026-05-20 cs.CV cs.AI cs.GR 版本更新

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

FreeOrbit4D: 通过前景完整4D重建实现免训练的任意相机重定向

Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, Yaoyao Liu

Affiliations: University of Illinois Urbana-Champaign; University of Pennsylvania; Eyeline Labs

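AI summary: This paper proposes FreeOrbit4D, a training-free framework that resolves the geometric ambiguity of large-angle camera redirection by recovering a foreground-complete 4D proxy, producing more faithful and temporally coherent videos.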

Comments 12 pages, 10 figures. Accepted to SIGGRAPH Conference Papers 2026

详情
英文摘要

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

2601.14822 2026-05-20 cs.CV cs.AI 版本更新

Multimodal system for skin cancer detection

多模态皮肤癌检测系统

Volodymyr Sydorskyi, Igor Krashenyi, Oleksii Yakubenko

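AI summary: This paper presents a multimodal skin cancer detection system that combines conventional photo images with tabular metadata such as patient demographics and lesion characteristics. A multimodal neural network and a two-step model improve detection accuracy, a three-stage pipeline further refines predictions, and the system achieves significant performance gains on a highly imbalanced dataset.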

Comments Accepted to System research and information technologies

Journal ref System Research and Information Technologies, no. 1, pp. 33-57, 2026

详情
AI中文摘要

皮肤癌检测对于早期诊断和有效治疗至关重要。尽管基于dermoscopic图像的深度学习模型已显示出潜力,但它们需要专门的设备,限制了其在更广泛临床环境中的应用。本研究介绍了一种使用传统照片图像的多模态皮肤癌检测系统,使其更具可访问性和适应性。我们的系统整合图像数据与表格型元数据,如患者人口统计数据和病变特征,以提高检测准确性。它采用结合图像和元数据处理的多模态神经网络,并支持有或无元数据的两阶段模型。一个三阶段流程进一步通过提升算法和增强性能来优化预测。为解决高度不平衡数据集的挑战,实施了特定技术以确保稳健的训练。通过消融研究评估了最近的视觉架构、提升算法和损失函数,实现了峰值部分ROC AUC为0.18068(0.2最大)和前15检索灵敏度为0.78371。结果表明,通过结构化、多阶段的图像与元数据整合流程,实现了显著的性能提升。该系统通过提供一个可扩展、设备无关的解决方案,推进了皮肤癌检测,适用于多样化的医疗环境,弥合了专业与一般临床实践之间的差距。

英文摘要

Melanoma detection is vital for early diagnosis and effective treatment. While deep learning models on dermoscopic images have shown promise, they require specialized equipment, limiting their use in broader clinical settings. This study introduces a multi-modal melanoma detection system using conventional photo images, making it more accessible and versatile. Our system integrates image data with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network combining image and metadata processing and supports a two-step model for cases with or without metadata. A three-stage pipeline further refines predictions by boosting algorithms and enhancing performance. To address the challenges of a highly imbalanced dataset, specific techniques were implemented to ensure robust training. An ablation study evaluated recent vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Results demonstrate that integrating photo images with metadata in a structured, multi-stage pipeline yields significant performance improvements. This system advances melanoma detection by providing a scalable, equipment-independent solution suitable for diverse healthcare environments, bridging the gap between specialized and general clinical practices.
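
The "0.2 maximum" quoted for the Partial ROC AUC is easier to read with a concrete definition in hand. The sketch below assumes the partial AUC is the area between the ROC curve and the line TPR = 0.8, which is one definition that yields a 0.2 ceiling. The abstract does not state the exact definition, so the `min_tpr=0.8` default and the function name are assumptions.

```python
def partial_auc_above_tpr(labels, scores, min_tpr=0.8):
    """Area between the ROC curve and the horizontal line TPR = min_tpr.

    labels: list of 0/1 ground-truth values (needs at least one of each).
    scores: list of predicted scores, higher meaning more likely positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    # Build ROC points, one per distinct score so tied scores share a point.
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((fp / negatives, tp / positives))

    # Trapezoidal integration of max(TPR - min_tpr, 0) over FPR.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        h0, h1 = y0 - min_tpr, y1 - min_tpr
        if h1 <= 0:
            continue  # segment lies entirely at or below the threshold
        width = x1 - x0
        if h0 < 0:
            # Segment crosses the threshold: keep only the part above it.
            width *= h1 / (h1 - h0)
            h0 = 0.0
        area += width * (h0 + h1) / 2
    return area


# A perfectly separating classifier reaches the ceiling of about 0.2.
print(partial_auc_above_tpr([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```

Under this assumed definition, a classifier that gives every sample the same score obtains 0.02, so the reported 0.18068 would sit close to the 0.2 ceiling.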

2512.11234 2026-05-20 cs.CV 版本更新

RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing

RoomPilot: 通过多模态语义解析实现可控的室内场景合成

Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

Affiliations: School of Information Science and Engineering, Hunan University

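AI summary: This work proposes RoomPilot, a framework for controllable indoor scene synthesis via multimodal semantic parsing. It addresses the limited input modalities and implicit generation processes of existing methods, improving control over scene structure and semantics.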

Comments 30 pages, 8 figures

详情
AI中文摘要

生成可控的室内场景对于游戏开发、建筑可视化和具身AI应用至关重要。然而,现有方法要么只支持有限的输入模态,要么依赖隐式生成过程,限制了对场景结构和语义的精确控制。为了解决这些限制,我们引入RoomPilot,一个统一的框架,从多模态输入(包括文本描述和CAD平面图)中生成可控的室内场景。RoomPilot将异构输入映射到一个室内领域特定语言(IDSL),作为描述室内场景的结构化和可解释的语义表示。基于IDSL,RoomPilot提出一个分层合成流程,逐步在建筑、房间和物体层面组织场景,促进多房间布局中的结构一致性和功能一致性。此外,RoomPilot构建了一个经过精心挑选的资产数据集,具有丰富的语义注释,以支持高质量的场景合成,提高视觉真实感和外观一致性。广泛的实验表明,该方法在多模态理解、场景生成的细粒度可控性以及物理一致性和视觉保真度方面均有所提升,标志着可控3D室内场景合成的重要一步。代码和模型将公开。

英文摘要

Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency across multi-room layouts. Moreover, RoomPilot constructs a curated asset dataset with rich semantic annotations to support high-quality scene synthesis, improving visual realism and appearance consistency. Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity, marking a significant step toward controllable 3D indoor scene synthesis. Code and model will be available.

2512.08237 2026-05-20 cs.CV 版本更新

Fast-BEV++: Fast by Algorithm, Deployable by Design

Fast-BEV++: 通过算法加速,通过设计部署

Yuanpeng Chen, Hui Song, Sheng Yang, Wei Tao, Shanhui Mo, Shuang Zhang, Xiao Hua, Tiankun Zhao

Affiliations: iMotion Automotive Technology (Suzhou) Co., Ltd; School of Data Science, Fudan University

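AI summary: This paper proposes Fast-BEV++, which follows two principles, Fast by Algorithm and Deployable by Design, to resolve the trade-off between accuracy and deployment efficiency in cost-effective bird's-eye-view perception for autonomous driving. It achieves at least a 3x speedup over the Fast-BEV baseline and a new state-of-the-art result of 0.488 NDS on the nuScenes benchmark, with real-time inference at more than 134 FPS.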

Comments most up-to-date version

详情
AI中文摘要

视觉-only鸟眼视图(BEV)感知的进步受制于感知精度与设备部署效率之间的长期根本权衡。在本文中,我们引入了Fast-BEV++,一种通过两个基本设计原则解决这一矛盾的BEV感知框架:通过算法加速和通过设计部署。通过将核心视图转换模块分解为硬件导向的标准索引-收集-重塑流水线,Fast-BEV++消除了对定制内核的依赖,从而在主流边缘平台上实现了至少3倍于Fast-BEV基线的速度提升。实证表明,Fast-BEV++在nuScenes 3D物体检测基准上建立了新的状态-of-the-art结果0.488 NDS,同时通过我们的加速设计实现了超过134 FPS的实时推理。特别是,我们的集成、可学习深度模块带来了持续的性能提升,在可比方法中保持最高准确性。总体而言,这种本质上分解的架构使在各种生产级汽车平台上的无缝实时部署成为可能,缓解了硬件限制,而不会牺牲感知精度或推理效率。

英文摘要

The advancement of vision-only Bird's-Eye-View (BEV) perception, a core paradigm for cost-effective autonomous driving, is hindered by the long-standing fundamental trade-off between perception accuracy and on-device deployment efficiency. In this work, we introduce Fast-BEV++, a BEV perception framework that resolves this tension through two fundamental design principles: Fast by Algorithm and Deployable by Design. By decomposing the core view transformation module into a hardware-oriented standard Index-Gather-Reshape pipeline, Fast-BEV++ eliminates dependencies on custom kernels while achieving at least a 3x speedup over the Fast-BEV baseline across mainstream edge platforms. Empirically, Fast-BEV++ establishes a new state-of-the-art result of 0.488 NDS on the nuScenes 3D object detection benchmark, simultaneously delivering real-time inference at more than 134 FPS via our acceleration design. In particular, our integrated, learnable depth module yields consistent performance gains, maintaining the highest accuracy among comparable methods. Overall, this inherently decomposed architecture enables seamless real-time deployment across diverse production-grade automotive platforms, alleviating hardware limitations without compromising perception accuracy or inference efficiency.
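
The Index-Gather-Reshape decomposition mentioned above can be pictured with a toy example: precompute which image-feature slot feeds each BEV cell, look the values up, then fold the flat result into a grid. This is only a sketch of the general pattern using plain Python lists. The helper names, the `-1` convention for uncovered cells, and the toy projection table are hypothetical and are not taken from the Fast-BEV++ implementation.

```python
def build_index_table(bev_height, bev_width, project):
    """Index step: precompute once which flat feature slot feeds each BEV cell."""
    return [project(y, x) for y in range(bev_height) for x in range(bev_width)]


def gather(features, index_table, fill=0.0):
    """Gather step: a plain lookup; -1 marks cells that no camera pixel covers."""
    return [features[i] if i >= 0 else fill for i in index_table]


def reshape(flat, height, width):
    """Reshape step: fold the flat gathered list into a height x width grid."""
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Toy data: six flattened image-feature values and a 2 x 3 BEV grid.
image_features = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]


def toy_projection(y, x):
    # Hypothetical stand-in for camera geometry: a fixed cell-to-pixel mapping.
    lookup = {(0, 0): 4, (0, 1): 5, (0, 2): -1, (1, 0): 0, (1, 1): 2, (1, 2): 2}
    return lookup[(y, x)]


table = build_index_table(2, 3, toy_projection)
bev = reshape(gather(image_features, table), 2, 3)
print(bev)  # [[14.0, 15.0, 0.0], [10.0, 12.0, 12.0]]
```

Because the index table depends only on the camera geometry, it can be built once offline, leaving only lookups and a reshape at inference time. That is consistent with the abstract's point about avoiding custom kernels, though the real pipeline operates on tensors, not lists.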

2512.04556 2026-05-20 cs.GR cs.CV 版本更新

DISK: Differentiable Sparse Kernel Complex for Efficient Spatially-Variant Convolution

DISK: 可微稀疏核复数用于高效空间变体卷积

Zhizhen Wu, Zhe Cao, Yuchi Huo

Affiliations: State Key Lab of CAD&CG, Zhejiang University, China

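AI summary: This paper proposes a differentiable sparse kernel decomposition framework for efficient spatially-variant convolution. It represents a target spatially-variant, dense, complex kernel with a set of sparse kernel samples, giving an efficient and differentiable optimization method suited to mobile imaging and real-time rendering.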

Comments Accepted as a conference paper at ICLR 2026. OpenReview: https://openreview.net/forum?id=bbuxDoRD2D

详情
AI中文摘要

复数核图像卷积是摄影、科学成像和动画效果中的基本操作,但直接密集卷积在资源受限设备上计算上是不可行的。现有的近似方法,如模拟退火或低秩分解,要么效率低下,要么无法捕捉非凸核。我们介绍了一种可微的核分解框架,通过一组稀疏核样本表示目标空间变体、密集复数核。我们的方法具有(i)一种允许对稀疏核进行可微优化的分解;(ii)一种专门的初始化策略用于非凸形状以避免较差的局部极小值;(iii)一种核空间插值方案,将单核过滤扩展到空间变化过滤,无需重新训练和额外的运行时开销。在高斯和非凸核的实验中,我们的方法在保真度上优于模拟退火,并且在成本上显著低于低秩分解。我们的方法为移动成像和实时渲染提供了实用的解决方案,同时保持完全可微,可用于更广泛的学习管道。

英文摘要

Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining and additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.
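
To see why sparse kernel samples are cheaper than a dense kernel, the sketch below filters an image with a kernel stored as a short list of `(dy, dx, weight)` taps, so the work per pixel scales with the number of taps instead of the kernel area. It illustrates the cost argument only. The tap representation and the function name are assumptions, and the paper's decomposition, initialization, and interpolation schemes are not reproduced here.

```python
def sparse_filter(image, taps):
    """Correlation-style filtering with a sparse kernel.

    image: 2D list of floats.
    taps: list of (dy, dx, weight); taps that fall outside the image are skipped,
          which is equivalent to zero padding.
    """
    height = len(image)
    width = len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy, dx, weight in taps:
                yy, xx = y + dy, x + dx
                if 0 <= yy < height and 0 <= xx < width:
                    acc += weight * image[yy][xx]
            out[y][x] = acc
    return out


# Unit impulse in the centre of a 3 x 3 image, filtered by a 3-tap vertical kernel.
image = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
taps = [(0, 0, 0.5), (-1, 0, 0.25), (1, 0, 0.25)]
result = sparse_filter(image, taps)
print(result)  # [[0.0, 0.25, 0.0], [0.0, 0.5, 0.0], [0.0, 0.25, 0.0]]
```

Here each output pixel costs three multiply-adds, where a dense 3 x 3 kernel would cost nine. The gap widens quickly for the large kernels used in effects such as depth-of-field blur.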

2512.01152 2026-05-20 cs.LG cs.AI cs.CV 版本更新

Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

开放集域适应在背景分布偏移下的挑战:挑战与一种可证明高效的解决方案

Shravan Chaudhari, Yoav Wald, Suchi Saria

Affiliations: Department of Computer Science, Johns Hopkins University; Faculty of Data and Decision Sciences, Technion; Center for Data Science, New York University; Bayesian Health

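AI summary: This paper studies the challenges of open-set domain adaptation under background distribution shift and proposes CoLOR, a provably efficient solution. Theoretical analysis shows that it outperforms a representative baseline in a simplified overparameterized setting, and experiments demonstrate its applicability to image and text data.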

Comments Project page at https://github.com/Shra1-25/CoLOR

Journal ref Transactions on Machine Learning Research (TMLR) 2026/May ISSN: 2835-8856

详情
AI中文摘要

随着我们将机器学习系统部署到现实世界中,一个核心挑战是保持模型在数据偏移时的性能。这种偏移可以以多种形式存在:新类可能在训练时不存在,这被称为开放集识别,以及已知类别的分布可能发生变化。对于开放集识别的保证大多基于假设已知类别的分布(我们称之为背景分布)是固定的。在本文中,我们开发了CoLOR,一种在挑战性情况下(即背景分布偏移)也能解决开放集识别的方法。我们证明该方法在温和假设下有效,即新类可与非新类分离,并提供理论保证,表明其在简化过参数化设置中优于代表基线方法。我们开发了使CoLOR可扩展和稳健的技术,并在图像和文本数据上进行了全面的实证评估。结果表明,CoLOR在背景偏移下显著优于现有开放集识别方法。此外,我们还提供了新的见解,探讨了诸如新类大小等因素对性能的影响,这在先前工作中尚未得到广泛探索。

英文摘要

As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call the background distribution, is fixed. In this paper we develop CoLOR, a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make CoLOR scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that CoLOR significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influence performance, an aspect that has not been extensively explored in prior work.

2512.00281 2026-05-20 cs.CV q-bio.NC 版本更新

Beyond Size and Growth: Rethinking Lung Cancer Screening with AI Based Nodule Detection and Diagnosis

超越尺寸和增长:利用AI进行肺结节检测与诊断的肺癌筛查再思考

Sylvain Bodard, Pierre Baudot, Benjamin Renoust, Charles Voyton, Gwendoline De Bie, Ezequiel Geremia, Van-Khoa Le, Danny Francis, Pierre-Henri Siot, Yousra Haddou, Vincent Bobin, Jean-Christophe Brisset, Carey C. Thomson, Valerie Bourdes, Benoit Huet

Affiliations: Université de Paris Cité, AP-HP, Hôpital Universitaire Necker Enfants Malades, Service d’Imagerie Adulte; Memorial Sloan Kettering Cancer Center, Department of Radiology; Sorbonne Université, CNRS UMR 7371, INSERM U 1146, Laboratoire d’Imagerie Biomédicale (LIB); Median Technologies, eyonis; Mount Auburn Hospital/Beth Israel Lahey Health, Cambridge MA, USA; Harvard Medical School, Boston MA, USA

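AI summary: This paper presents an integrated AI system that performs nodule detection and malignancy assessment directly at the nodule level from low-dose CT scans, moving beyond traditional size- and growth-based screening criteria to improve the accuracy and efficiency of lung cancer screening.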

Comments 25 pages, 8 figures, with supplementary information containing 11 figures

详情
AI中文摘要

早期检测恶性肺结节仍然受到基于尺寸和生长的筛查标准的限制,常常延迟诊断。我们提出了一种集成的AI系统,该系统在统一的CADe/CADx框架内,从低剂量CT扫描中联合执行结节检测和恶性评估。与传统将检测和诊断分开的流程不同,我们的方法直接针对恶性结节,重新定义了临床决策点的评估。为了解决数据集规模和可解释性限制,系统由一个大型集成模型(LEM)组成,结合了浅层深度学习和基于特征的模型。该系统在25,709例扫描中训练和评估,其中69,449个结节被标注,并在独立队列上进行了外部验证。其内部AUC为0.98,外部AUC为0.945,优于所有基于生长的指标、Lung RADS尺寸基于的分流、欧洲体积和VDT基于的筛查标准、放射科医生和领先的AI模型。该模型在低假阳性率下保持高灵敏度,对小和早期阶段的癌症表现出色,并能对不确定和缓慢生长的结节在一年内更早地评估恶性性。这种方法有潜力优化肺癌筛查流程,支持更早、更可行的临床决策。

英文摘要

Early detection of malignant lung nodules remains constrained by size- and growth-based screening criteria, often delaying diagnosis. We present an integrated AI system that jointly performs nodule detection and malignancy assessment directly at the nodule level from low-dose CT scans, within a unified CADe/CADx framework. Unlike conventional pipelines separating detection and diagnosis, our approach targets malignant nodules directly, redefining evaluation at the point where clinical decisions are made. To address limitations in dataset scale and explainability, the system consists of a Large Ensemble Model (LEM) combining ensembles of shallow deep learning and feature-based models. It was trained and evaluated on 25,709 scans with 69,449 annotated nodules, with external validation on an independent cohort. It achieved an AUC of 0.98 internally and 0.945 externally, outperforming all growth-based metrics, Lung-RADS size-based triage, European volume- and VDT-based screening criteria, radiologists, and leading AI models. The model maintains high sensitivity at low false-positive rates, excels for small and early-stage cancers, and enables malignancy assessment up to one year earlier than radiologists for indeterminate and slow-growing nodules. This approach has the potential to streamline lung cancer screening workflows and support earlier, more actionable clinical decision-making.

2511.16766 2026-05-20 cs.CV 版本更新

SVG360: Editable Multiview Vector Graphics from a Single SVG

SVG360: 从单个SVG生成可编辑的多视角矢量图形

Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang

Affiliations: Technical University of Darmstadt; University of Stuttgart

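AI summary: This paper proposes SVG360, a framework that converts a single SVG into geometrically and visually consistent multiview SVG assets through a view-consistent vectorization pipeline. It addresses the fragmented paths and unstable colors of per-view vectorization, improving multiview consistency and editability.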

详情
AI中文摘要

可缩放矢量图形(SVG)是可编辑视觉设计的标准表示形式,但通常作为单视角二维插图进行作者创作。这限制了其在需要对象级资产在不同视角下保持一致时的应用。我们提出了SVG360,一个框架,将单个输入SVG转换为几何和视觉一致的多视角SVG资产。关键挑战在于直接按视角生成或矢量化会产生视角依赖的区域、碎片化的路径和不稳定的颜色,使生成的SVG难以作为整体对象进行编辑。SVG360通过视图一致的矢量化流程解决这一问题。它首先将栅格化输入提升为视图条件的对象表示,并在规定相机下渲染目标视角。然后通过一种源自视频分割的时空记忆机制,将部分身份传播到相邻视角,建立一致的区域分解、路径对应和颜色分配,而无需特定任务的重新训练。最后,每个视角通过结构感知的矢量化重建为可编辑的SVG,其中冗余路径被合并,局部几何被优化,同时保持边界和语义部分。在对象级SVG资产上的实验表明,与直接按视角矢量化相比,SVG360提高了多视角一致性,减少了路径冗余,并更好地保留了细结构。通过将单视角SVG转换为一致的360度矢量资产,SVG360将矢量图形从静态插图扩展到可编辑的多视角内容,适用于设计、动画和结构化视觉编辑。

英文摘要

Scalable Vector Graphics are a standard representation for editable visual design, yet they are usually authored as single-view, two-dimensional illustrations. This limits their use in applications that require object-level assets to remain coherent when observed, edited, or animated from different viewpoints. We present SVG360, a framework that converts a single input SVG into geometrically and visually consistent multiview SVG assets. The key challenge is that direct per-view generation or vectorization produces view-dependent regions, fragmented paths, and unstable colors, making the resulting SVGs difficult to edit as a coherent object. SVG360 addresses this problem through a view-consistent vectorization pipeline. It first lifts the rasterized input into a view-conditioned object representation and renders target views under prescribed cameras. It then propagates part identity across neighboring views using a spatial memory mechanism adapted from video segmentation, establishing consistent region decomposition, path correspondence, and color assignment without task-specific retraining. Finally, each view is reconstructed as an editable SVG through structure-aware vectorization, where redundant paths are consolidated and local geometry is optimized while preserving boundaries and semantic parts. Experiments on object-level SVG assets show that SVG360 improves multiview consistency, reduces path redundancy, and better preserves fine structures compared with direct per-view vectorization. By turning a single-view SVG into a coherent 360-degree vector asset, SVG360 expands vector graphics from static illustration toward editable multiview content for design, animation, and structured visual editing.

2511.13864 2026-05-20 cs.CV 版本更新

GRLoc: Geometric Representation Regression for Visual Localization

GRLoc: 用于视觉定位的几何表示回归

Changyang Li, Xuejian Ma, Lixiang Liu, Zhan Li, Qingan Yan, Yi Xu

Affiliations: Goertek Alpha Labs

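AI summary: This paper proposes Geometric Representation Regression (GRR), which improves visual localization by decoupling rotation and translation prediction, and achieves state-of-the-art results on the 7-Scenes and Cambridge Landmarks datasets.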

详情
AI中文摘要

绝对姿态回归(APR)已成为视觉定位中的有力范式。然而,APR模型通常作为黑箱操作,直接从查询图像回归6自由度姿态,这可能导致记忆训练视图而非理解3D场景几何。在本文中,我们提出了一种基于几何的替代方法。受新颖视角合成的启发,该方法通过从中间几何表示生成图像,将APR重新公式化为其逆过程,即从图像直接回归底层3D表示,并将此范式称为几何表示回归(GRR)。我们的模型显式预测两种解耦的几何表示:(1)方向图以估计相机旋转,(2)对应点图以估计相机翻译。最终的相机姿态通过可微确定性求解器从这些几何组件中恢复。这种解耦方法将学习的视觉到几何映射与最终姿态计算分离,为网络引入了强几何先验。我们发现,显式分离旋转和翻译预测可显著提升性能。我们证明在7-Scenes和Cambridge Landmarks数据集上实现了最先进的性能,验证了建模逆渲染过程是更稳健的通用绝对姿态估计路径。

英文摘要

Absolute Pose Regression (APR) has emerged as a compelling paradigm for visual localization. However, APR models typically operate as black boxes, directly regressing a 6-DoF pose from a query image, which can lead to memorizing training views rather than understanding 3D scene geometry. In this work, we propose a geometrically-grounded alternative. Inspired by novel view synthesis, which renders images from intermediate geometric representations, we reformulate APR as its inverse that regresses the underlying 3D representations directly from the image, and we name this paradigm Geometric Representation Regression (GRR). Our model explicitly predicts two disentangled geometric representations in the world coordinate system: (1) a raymap's directions to estimate camera rotation, and (2) a corresponding pointmap to estimate camera translation. The final camera pose is then recovered from these geometric components using a differentiable deterministic solver. This disentangled approach, which separates the learned visual-to-geometry mapping from the final pose calculation, introduces a strong geometric prior into the network. We find that the explicit decoupling of rotation and translation predictions measurably boosts performance. We demonstrate state-of-the-art performance on 7-Scenes and Cambridge Landmarks datasets, validating that modeling the inverse rendering process is a more robust path toward generalizable absolute pose estimation.

2511.11688 2026-05-20 cs.LG cs.CV updated version

Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling

分层调度优化用于快速且稳健的扩散模型采样

Aihua Zhu, Rui Su, Qinglin Zhao, Li Feng, Meng Shen, Shibo He

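Affiliations * School of Computer Science and Engineering, Macau University of Science and Technology; Beijing Institute of Technology; Zhejiang University

AI summary This paper proposes a hierarchical schedule optimization method that, through an improved bi-level optimization framework, samples efficiently from diffusion models at extremely low numbers of function evaluations, markedly improving sample quality and computational efficiency.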

Comments Preprint, accepted to AAAI 2026

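Details
AI Chinese abstract (translated)

Diffusion probabilistic models have set a new standard for generative fidelity but are limited by a slow, iterative sampling process. A powerful training-free strategy is schedule optimization, which aims to find the optimal distribution of timesteps that maximizes sample quality under a fixed, small number of function evaluations (NFE). To this end, a successful schedule optimization method must follow four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing methods struggle to satisfy these principles simultaneously, which motivates a more advanced solution. To overcome these limitations, we propose the Hierarchical Schedule Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO turns the search for a globally optimal schedule into a more tractable problem by alternating between two cooperating levels: an upper-level global search for the best initialization strategy and a lower-level local optimization that refines the schedule. The process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state of the art for training-free sampling in the extremely low-NFE regime. For example, with only 5 NFE, HSO achieves an FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this performance is obtained not through costly retraining but with a one-time optimization cost of less than 8 seconds, offering an efficient and practical paradigm for accelerating diffusion models.

English abstract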

Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.
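
The abstract does not give the form of the Spacing-Penalized Fitness, so the following is only an illustrative sketch of the general idea of penalizing timesteps that sit too close together. The quadratic hinge penalty, the function name, and the `min_gap` and `weight` parameters are all assumptions introduced here, not the paper's definition.

```python
import numpy as np

def spacing_penalized_fitness(timesteps, base_error, min_gap=0.02, weight=10.0):
    """Hypothetical fitness (lower is better): a quality proxy plus a spacing penalty.

    timesteps:  candidate schedule, values in [0, 1], in any order
    base_error: any scalar quality proxy supplied by the search (assumed given)
    min_gap:    assumed smallest acceptable spacing between neighbouring steps
    weight:     assumed strength of the penalty
    """
    t = np.sort(np.asarray(timesteps, dtype=float))[::-1]  # descending order
    gaps = -np.diff(t)                                     # positive spacings
    violation = np.clip(min_gap - gaps, 0.0, None)         # zero when well spaced
    return float(base_error + weight * np.sum(violation ** 2))
```

A schedule whose neighbouring steps are all at least `min_gap` apart keeps its base error unchanged, while one with near-duplicate steps is scored worse, so a global search would tend to discard it.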

2511.10292 2026-05-20 cs.CV cs.AI updated version

Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

自适应残差更新引导用于大型视觉语言模型中低开销幻觉抑制

Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen

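Affiliations * Aalto University, Espoo, Finland; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

AI summary This paper proposes the RUDDER framework, which counters visual dilution by creating a persistent visual anchor: it extracts a robust evidence direction from the model's prefill residual updates and injects it into decoding through an adaptive gating mechanism, effectively mitigating hallucination while maintaining high throughput.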

Comments Accepted by ICML 2026; Code available at: https://github.com/Akko000/RUDDER-Residual-Update-Directed-DEcoding-Regulation-

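Details
AI Chinese abstract (translated)

Large vision-language models (LVLMs) typically process visual input as a prefix placed before the language decoder. As the model generates text autoregressively, this initial visual information inevitably undergoes "dilution", causing the model to over-rely on language priors and hallucinate. Existing interventions try to correct this by contrasting logits or iteratively refining outputs, but they incur unacceptable latency. We propose the Residual-Update Directed DEcoding Regulation (RUDDER) framework, which counters visual dilution by creating a persistent visual anchor. We extract a robust evidence direction (CARD) directly from the model's prefill residual updates and inject it into the decoding process. The injection is modulated by an adaptive gating mechanism (the Beta Gate), which acts as a trust mechanism and ensures the visual reminder is applied only when necessary. Experiments on LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL show that RUDDER consistently mitigates hallucination (with greedy decoding, RUDDER reduces CHAIR_S by 24.4% and CHAIR_i by 23.6% on average) and scales effectively across architectures while maintaining >96.0% throughput.

English abstract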

Large Vision-Language Models (LVLMs) typically process visual inputs as a prefix to the language decoder. As the model autoregressively generates text, this initial visual information inevitably undergoes "dilution", leading the model to over-rely on language priors and hallucinate objects. Existing interventions attempt to correct this by contrasting logits or iteratively refining outputs, but they incur prohibitive latency costs. We propose Residual-Update Directed DEcoding Regulation (RUDDER), a framework that counters visual dilution by creating a persistent visual anchor. We extract a robust evidence direction (CARD) directly from the model's prefill residual updates and inject it into the decoding process. This injection is modulated by an adaptive gate, the Beta Gate, which acts as a trust mechanism and ensures the visual reminder is applied only when necessary. Experiments on LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL demonstrate that RUDDER consistently mitigates hallucination (with greedy decoding, RUDDER reduces CHAIR_S by 24.4% and CHAIR_i by 23.6% on average, in relative terms) and scales effectively across architectures, all while maintaining >96.0% throughput.

2511.06943 2026-05-20 cs.CV cs.AI updated version

PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data

PlantTraitNet: 一种考虑不确定性的多模态框架,用于从公民科学数据中进行全球尺度植物特性推断

Ayushi Sharma, Johanna Trost, Daniel Lusk, Johannes Dollinger, Julian Schrader, Christian Rossi, Javier Lopatin, Etienne Laliberté, Simon Haberstroh, Jana Eichel, Daniel Mederer, Jose Miguel Cerda-Paredes, Shyam S. Phartyal, Lisa-Maricia Schwarz, Anja Linstädter, Maria Conceição Caldeira, Teja Kattenborn

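Affiliations * GeoSense-Freiburg

AI summary This study presents PlantTraitNet, a multi-modal, multi-task, uncertainty-aware deep learning framework that uses weak supervision to predict four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos and generates global trait distribution maps by spatial aggregation; validation shows that it outperforms existing trait maps on all evaluated traits.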

Comments Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI-26). Link: https://ojs.aaai.org/index.php/AAAI/article/view/41272

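Details
AI Chinese abstract (translated)

Global maps of plant traits, such as leaf nitrogen content or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps are limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer an underused resource for overcoming these limitations: more than 50 million geotagged plant photographs worldwide capture valuable information on plant morphology and physiology. In this study we introduce PlantTraitNet, a multi-modal, multi-task, uncertainty-aware deep learning framework that uses weak supervision to predict four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet outperforms existing trait maps on all evaluated traits, demonstrating that combining citizen science imagery with computer vision and geospatial AI enables global trait mapping that is not only scalable but also more accurate. This approach offers a powerful new pathway for ecological research and Earth system modeling.

English abstract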

Global maps of plant traits, such as leaf nitrogen or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps remain limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer a largely untapped resource to overcome these limitations, with over 50 million geotagged plant photographs worldwide capturing valuable visual information on plant morphology and physiology. In this study, we introduce PlantTraitNet, a multi-modal, multi-task uncertainty-aware deep learning framework that predicts four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos using weak supervision. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet consistently outperforms existing trait maps across all evaluated traits, demonstrating that citizen science imagery, when integrated with computer vision and geospatial AI, enables not only scalable but also more accurate global trait mapping. This approach offers a powerful new pathway for ecological research and Earth system modeling.
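
The spatial aggregation step can be pictured with a small sketch. It assumes each photo yields a trait prediction together with a predictive standard deviation, and it combines the photos in a latitude/longitude grid cell by inverse-variance weighting. The weighting rule, the grid layout, and the name `aggregate_to_grid` are assumptions made for illustration; the abstract does not specify how the paper aggregates.

```python
import numpy as np

def aggregate_to_grid(lat, lon, pred, sigma, cell_deg=1.0):
    """Hypothetical aggregation: inverse-variance weighted mean per grid cell.

    lat, lon: photo coordinates in degrees
    pred:     per-photo trait predictions
    sigma:    per-photo predictive standard deviations (assumed available)
    Returns a dict mapping (row, col) -> (weighted mean, number of photos).
    """
    lat, lon, pred, sigma = (np.asarray(a, dtype=float) for a in (lat, lon, pred, sigma))
    rows = np.floor((lat + 90.0) / cell_deg).astype(int)
    cols = np.floor((lon + 180.0) / cell_deg).astype(int)
    weights = 1.0 / np.square(sigma)  # confident photos count more
    cells = {}
    for key in set(zip(rows.tolist(), cols.tolist())):
        mask = (rows == key[0]) & (cols == key[1])
        mean = np.sum(weights[mask] * pred[mask]) / np.sum(weights[mask])
        cells[key] = (float(mean), int(mask.sum()))
    return cells
```

Keeping the photo count next to each cell mean makes it easy to mask cells that are supported by too few observations.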

2510.21464 2026-05-20 cs.CV updated version

CXR-LanIC: Language-Grounded Interpretable Classifier for Chest X-Ray Diagnosis

CXR-LanIC:基于语言的可解释分类器用于胸部X光诊断

Yiming Tang, Wenjia Zhong, Rushi Shah, Dianbo Liu

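Affiliations * National University of Singapore

AI summary This paper presents CXR-LanIC, a language-grounded interpretable classifier that addresses the interpretability challenge of chest X-ray diagnosis through task-aligned pattern discovery; it trains sparse autoencoders to extract interpretable visual patterns, achieving competitive diagnostic accuracy and supporting natural language explanations.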

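Details
AI Chinese abstract (translated)

Deep learning models have achieved remarkable accuracy in chest X-ray diagnosis, but their wide adoption is still limited by the black-box nature of their predictions. Clinicians need transparent, verifiable explanations in order to trust automated diagnoses and identify potential failure modes. We introduce CXR-LanIC (Language-Grounded Interpretable Classifier for Chest X-rays), a new framework that addresses this interpretability challenge through task-aligned pattern discovery. Our method trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier to decompose medical image representations into interpretable visual patterns. By training 100 transcoders on the MIMIC-CXR dataset, we discover about 5,000 monosemantic patterns covering cardiac, pulmonary, pleural, structural, device, and artifact categories. Each pattern shows consistent activation behavior across images that share specific radiological features, so predictions decompose into 20-50 interpretable patterns with verifiable activation galleries. CXR-LanIC achieves competitive diagnostic accuracy on five key findings while laying the groundwork for natural language explanations through planned annotation by large multimodal models. Our key innovation is to extract interpretable features from a classifier trained on specific diagnostic objectives rather than from general-purpose embeddings, which ensures the discovered patterns are directly relevant to clinical decision-making. This demonstrates that medical AI systems can be both accurate and interpretable, supporting safer clinical deployment through transparent, clinically grounded explanations.

English abstract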

Deep learning models have achieved remarkable accuracy in chest X-ray diagnosis, yet their widespread clinical adoption remains limited by the black-box nature of their predictions. Clinicians require transparent, verifiable explanations to trust automated diagnoses and identify potential failure modes. We introduce CXR-LanIC (Language-Grounded Interpretable Classifier for Chest X-rays), a novel framework that addresses this interpretability challenge through task-aligned pattern discovery. Our approach trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier to decompose medical image representations into interpretable visual patterns. By training an ensemble of 100 transcoders on multimodal embeddings from the MIMIC-CXR dataset, we discover approximately 5,000 monosemantic patterns spanning cardiac, pulmonary, pleural, structural, device, and artifact categories. Each pattern exhibits consistent activation behavior across images sharing specific radiological features, enabling transparent attribution where predictions decompose into 20-50 interpretable patterns with verifiable activation galleries. CXR-LanIC achieves competitive diagnostic accuracy on five key findings while providing the foundation for natural language explanations through planned large multimodal model annotation. Our key innovation lies in extracting interpretable features from a classifier trained on specific diagnostic objectives rather than general-purpose embeddings, ensuring discovered patterns are directly relevant to clinical decision-making. This demonstrates that medical AI systems can be both accurate and interpretable, supporting safer clinical deployment through transparent, clinically grounded explanations.

2510.16814 2026-05-20 cs.LG cs.AI cs.CV updated version

Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

景观中的针:在标签稀缺条件下用于考古遗址发现的半监督伪标签方法

Simon Jaxy, Anton Theys, Patrick Willett, W. Chris Carleton, Ralf Vandam, Pieter Libin

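Affiliations * Sensors, Royal Military Academy, Brussels, Belgium; AMGC (Archaeology, Environmental Changes & Geo-Chemistry), Vrije Universiteit Brussel; Max Planck Institute of Geoanthropology, Jena, Germany; Shared first author; Shared last author

AI summary This paper proposes asymmetric dual pseudolabeling (DPL), an end-to-end deep learning method that learns from sparse positives directly from multi-band remote sensing imagery, without hand-crafted feature engineering or assumptions about site absence, and evaluates it on two prominent archaeological datasets. On the Sagalassos dataset DPL outperforms the LAMAP baseline by 12% in F1 and 29% in recall; on the Cyprus dataset, a pure PU setting without confirmed negatives, DPL recovers discriminative ability. DPL ensembles produce interpretable probability surfaces that support survey planning and enable effective site discovery from minimal labeled data.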

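Details
AI Chinese abstract (translated)

Archaeological predictive modelling estimates the likely locations of undiscovered sites by combining known locations with environmental and geospatial variables. This poses a positive-unlabeled (PU) learning challenge in which confirmed sites are rare and most locations are unlabeled rather than truly negative. To overcome this, we propose asymmetric dual pseudolabeling (DPL), an end-to-end deep learning method that learns from sparse positives directly from multi-band remote sensing imagery, without hand-crafted feature engineering or assumptions about site absence, and we evaluate it on two prominent archaeological datasets. On the Sagalassos dataset, evaluated against an independent validation field survey, DPL outperforms the LAMAP baseline by 12% in F1 and 29% in recall, while LAMAP retains an advantage in probability ranking. Standard supervised baselines fail catastrophically when negatives are uncertain, and positive-only training collapses to predicting sites everywhere, which establishes empirical bounds. On the Cyprus dataset, a pure PU setting without confirmed negatives, SL inverts the probability ranking whereas DPL recovers discriminative ability. DPL ensembles produce interpretable probability surfaces that support survey planning and enable effective site discovery from minimal labeled data.

English abstract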

Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental and geospatial variables, presenting a positive-unlabeled (PU) learning challenge where confirmed sites are rare and most locations are unlabeled rather than truly negative. To overcome this, we propose asymmetric dual pseudolabeling (DPL), an end-to-end deep learning method that learns from sparse positives directly from multi-band geospatial imagery without hand-crafted feature engineering or assumptions about site absence, and evaluate it on two prominent archaeological datasets. On the Sagalassos dataset, evaluated against an independent, held-out field survey, DPL outperforms the LAMAP baseline by 12% in F1 and 29% in Recall, while LAMAP maintains advantages in probability ranking. Standard supervised baselines fail catastrophically when negatives are uncertain; positive-only training collapses to predicting everywhere, establishing empirical bounds. On the Cyprus dataset, a pure PU setting without confirmed negatives, the supervised-learning (SL) baseline inverts probability rankings while DPL recovers discrimination. DPL ensembles produce interpretable probability surfaces supporting survey planning, enabling effective site discovery from minimal labeled data.

2510.11344 2026-05-20 cs.CV updated version

MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression

MMAP: 一种多倍率和原型感知架构,用于预测空间基因表达

Hai Dang Nguyen, Nguyen Dang Huy Pham, The Minh Duc Nguyen, Dac Thai Nguyen, Hang Thi Nguyen, Duong M. Nguyen

Affiliations * Institute for AI Innovation and Societal Impact; Hanoi University of Science and Technology; Amsterdam High School for the Gifted; Anatomic Pathology Division, Laboratory Department, Vinmec Times City International Hospital; Vinmec Healthcare System; University of Illinois Urbana-Champaign

AI summary This paper proposes the MMAP architecture, which uses multi-magnification and prototype-enhanced representations to address insufficient local feature granularity and insufficient coverage of global spatial context in spatial gene expression prediction; experiments show that it outperforms existing state-of-the-art methods on multiple evaluation metrics.

Comments Received Best Paper Award at the 2025 Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025)

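Details
AI Chinese abstract (translated)

Spatial transcriptomics (ST) measures gene expression while preserving spatial information, providing key insights into tissue architecture and disease pathology. Recent work has explored using hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles with deep neural networks. The task is usually framed as a regression problem in which each input is a local image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains challenging because of the significant modality gap between visual features and molecular signals. Recent studies have tried to incorporate both local and global information into predictive models, but existing methods still have two key limitations: (1) insufficient granularity in local feature extraction, and (2) insufficient coverage of global spatial context. In this work we propose a new framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges at once. To enhance local feature granularity, MMAP uses multi-magnification patch representations to capture fine histological detail. To improve understanding of global context, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experiments show that MMAP outperforms all existing state-of-the-art methods on multiple evaluation metrics, including mean absolute error (MAE), mean squared error (MSE), and the Pearson correlation coefficient (PCC).

English abstract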

Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).
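
For readers unfamiliar with the three reported metrics, here is a short sketch of how they are commonly computed for a spots-by-genes prediction matrix. Averaging the Pearson correlation per gene is an assumption about the evaluation protocol, and the function name is hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, mean per-gene PCC) for arrays of shape (n_spots, n_genes).

    Assumes every gene varies across spots; a constant column would make the
    correlation undefined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    mse = float(np.mean(err ** 2))
    # Pearson correlation computed column by column (one value per gene).
    t = y_true - y_true.mean(axis=0)
    p = y_pred - y_pred.mean(axis=0)
    denom = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    pcc = float(np.mean((t * p).sum(axis=0) / denom))
    return mae, mse, pcc
```

Lower MAE and MSE and higher PCC indicate better predictions; a prediction offset by a constant keeps a perfect PCC while its MAE and MSE grow, which is why the metrics are usually reported together.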

2510.07538 2026-05-20 cs.CV updated version

Low-Compute Watermark Removal via Dual-Domain Natural Projection

基于双域自然投影的低计算量水印移除

Pragati Shuddhodhan Meshram, Varun Chandrasekaran

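Affiliations * Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, USA

AI summary This paper proposes DAWN, a lightweight, training-free attack that projects watermarked images in complementary frequency and semantic spaces to remove watermarks efficiently at low computational cost while preserving structural and semantic fidelity.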

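Details
AI Chinese abstract (translated)

Effective removal of semantic watermarks requires balancing three competing objectives: high removal success, low perceptual distortion, and low computational cost. However, existing single-image attacks usually optimize only the first two, achieving strong watermark suppression but relying on expensive multi-step optimization that limits practical deployment. In this paper we show that this trade-off is fundamental: no current method achieves all three properties simultaneously. We introduce DAWN, a lightweight, training-free attack that specifically targets the low-compute regime while maintaining competitive removal performance. DAWN projects the watermarked image onto natural-image priors in complementary frequency and semantic spaces, suppressing watermark signals that deviate from natural statistics, and then applies a decoupled perceptual-alignment step that restores visual consistency with minimal artifacts. Across diverse pixel-, frequency-, and latent-space watermarking schemes, DAWN consistently reduces detectability while preserving structural and semantic fidelity, showing that efficient, low-resource watermark removal is feasible with only modest perceptual degradation. Our code is available at https://github.com/Pragati-Meshram/DAWN.

English abstract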

Effective removal of semantic watermarks requires balancing three competing objectives: high removal success, low perceptual distortion, and low computational cost. However, existing single-image attacks typically optimize only for the first two, achieving strong watermark suppression but relying on expensive, multi-step optimization that limits practical deployment. In this work, we show that this trade-off is fundamental: no current approach achieves all three properties simultaneously. We introduce DAWN, a lightweight, training-free attack that explicitly targets the low-cost regime while maintaining competitive removal performance. DAWN works by projecting a watermarked image onto natural-image priors in complementary frequency and semantic spaces, suppressing watermark signals that deviate from natural statistics, and then applying a decoupled perceptual-alignment step to restore visual consistency with minimal artifacts. Across diverse pixel-, frequency-, and latent-space watermarking schemes, DAWN consistently reduces detectability while preserving structural and semantic fidelity, demonstrating that efficient, low-resource watermark removal is feasible with only modest perceptual degradation. Our code is available at https://github.com/Pragati-Meshram/DAWN.

2510.00660 2026-05-20 cs.CV updated version

Unsupervised Unfolded rPCA (U2-rPCA): Deep Interpretable Clutter Filtering for Ultrasound Microvascular Imaging

无监督展开rPCA(U2-rPCA):用于超声微血管成像的深度可解释杂波过滤

Huaying Li, Chuling Ye, Manfei Liao, Xiaobo Qu, Liansheng Wang, Yinran Chen

Affiliations * Fujian Key Laboratory of Urban Intelligent Sensing and Computing, School of Informatics, Xiamen University; School of Electronic Science and Engineering, Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Xiamen University; Department of Computer Science and Technology, School of Informatics, and the National Institute for Data Science in Health and Medicine, Xiamen University

AI summary This paper proposes an unsupervised unfolded rPCA (U2-rPCA) method, unfolded from an iteratively reweighted least squares (IRLS) rPCA baseline and combined with a sparse-enhancement unit that improves the capture of sparse micro-flow signals, for more effective clutter filtering in ultrasound microvascular imaging.

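Details
AI Chinese abstract (translated)

High-sensitivity clutter filtering is a fundamental step in ultrasound microvascular imaging. Singular value decomposition (SVD) and robust principal component analysis (rPCA) are the main clutter filtering strategies. However, both are limited in feature modeling and in separating tissue from blood flow for high-quality microvascular imaging. Recently, deep learning-based clutter filtering has shown potential for separating tissue and blood flow signals more thoroughly. However, existing supervised filters lack interpretability and training ground truth. While the interpretability problem can be addressed by deep unfolding of algorithms, the lack of training ground truth remains unsolved. This paper proposes an unsupervised unfolded rPCA (U2-rPCA) method that preserves mathematical interpretability and does not depend on training labels. Specifically, U2-rPCA is unfolded from an iteratively reweighted least squares (IRLS) rPCA baseline with intrinsic low-rank and sparse regularization. In addition, a sparse-enhancement unit is inserted into the network to strengthen its ability to capture sparse micro-flow signals. U2-rPCA behaves like an adaptive filter: it is trained on part of the image sequence and then applied to the subsequent frames. Experimental validation on an in-silico dataset and public in-vivo datasets shows that U2-rPCA outperforms the SVD filter, the rPCA baseline, and another deep learning-based filter. In particular, the proposed method improves the contrast-to-noise ratio (CNR) of the power Doppler image by 1.91 dB to 8.48 dB compared with the other methods. The effectiveness of the building blocks of U2-rPCA is further validated through ablation studies.

English abstract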

High-sensitivity clutter filtering is a fundamental step in ultrasound microvascular imaging. Singular value decomposition (SVD) and robust principal component analysis (rPCA) are the main clutter filtering strategies. However, both strategies are limited in feature modeling and in the separation of tissue and blood flow for high-quality microvascular imaging. Recently, deep learning-based clutter filtering has shown potential in more thoroughly separating tissue and blood flow signals. However, the existing supervised filters suffer from a lack of interpretability and of training ground truth. While the interpretability issue can be addressed by algorithm deep unfolding, the lack of training ground truth remains unsolved. This paper proposes an unsupervised unfolded rPCA (U2-rPCA) method that preserves mathematical interpretability and does not depend on training labels. Specifically, U2-rPCA is unfolded from an iteratively reweighted least squares (IRLS) rPCA baseline with intrinsic low-rank and sparse regularization. In addition, a sparse-enhancement unit is plugged into the network to strengthen its capability to capture the sparse micro-flow signals. U2-rPCA acts like an adaptive filter that is trained with part of the image sequence and then used for the following frames. Experimental validations on an in-silico dataset and public in-vivo datasets demonstrated that U2-rPCA outperforms the SVD filter, the rPCA baseline, and another deep learning-based filter. In particular, the proposed method improved the contrast-to-noise ratio (CNR) of the power Doppler image by 1.91 dB to 8.48 dB compared to other methods. Furthermore, the effectiveness of the building modules of U2-rPCA was validated through ablation studies.
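
For context, the SVD clutter filter that serves as a comparison baseline is commonly implemented as below: the frame stack is reshaped into a pixels-by-frames (Casorati) matrix and the largest singular components, which are dominated by slowly varying tissue, are set to zero. This is a sketch of that classical baseline, not of U2-rPCA itself; the function name and the fixed cutoff `n_tissue` are assumptions.

```python
import numpy as np

def svd_clutter_filter(frames, n_tissue):
    """Classical SVD clutter filter (baseline sketch, not the proposed network).

    frames:   array of shape (H, W, T), real or complex ultrasound frames
    n_tissue: number of leading singular components treated as tissue clutter
    Returns the filtered stack with the same shape as the input.
    """
    h, w, t = frames.shape
    casorati = frames.reshape(h * w, t)             # pixels x frames
    U, s, Vh = np.linalg.svd(casorati, full_matrices=False)
    s = s.copy()
    s[:n_tissue] = 0.0                              # drop the tissue subspace
    blood = (U * s) @ Vh                            # rebuild from the remaining components
    return blood.reshape(h, w, t)
```

The hard cutoff is the main weakness the abstract alludes to: tissue and slow flow can share singular components, which motivates low-rank plus sparse models such as rPCA.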

2510.00600 2026-05-20 cs.RO cs.AI cs.CV cs.LG updated version

Hybrid Training for Vision-Language-Action Models

视觉-语言-动作模型的混合训练

Pietro Mazzaglia, Cansu Sancaktar, Markus Peschl, Daniel Dijkman

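Affiliations * Qualcomm AI Research

AI summary This paper proposes a hybrid training framework that lets vision-language-action models either generate thoughts or predict actions directly at inference time, as needed, preserving the performance gains while improving inference efficiency.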

Comments Published as a conference paper at ICLR 2026

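Details
AI Chinese abstract (translated)

Using large language models to generate intermediate thoughts, known as chain-of-thought (CoT), before giving an answer has become an effective way to solve complex language tasks. In robotics, similar embodied CoT strategies, which generate thoughts before executing actions, have also been shown to improve performance with vision-language-action models (VLAs). However, these techniques lengthen the model's generated output to include the thoughts, which increases inference time. In real-world execution, such as robotic manipulation, delaying the agent's actions seriously harms a method's usability because tasks require long sequences of actions. But is generating long chains of thought a necessary condition for the performance gains? In this paper we explore Hybrid Training (HyT), a framework that lets VLAs learn from thoughts and benefit from the associated performance gains while allowing CoT generation to be omitted at inference time. In addition, by learning to conditionally predict diverse outputs, HyT provides flexibility at inference time: the model can predict actions directly, generate thoughts, or follow instructions. We evaluate the proposed method on a series of simulated benchmarks and in real-world experiments.

English abstract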

Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs). As these techniques increase the length of the model's generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent's actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions. However, is the generation of long chains-of-thought a strict prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while making it possible to omit CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts, or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.

2509.22292 2026-05-20 cs.CV cs.AI updated version

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

通过场景分割策略对文本到视频模型进行劫持

Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim

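Affiliations * Yonsei University; Korea Institute of Science and Technology; AIM Intelligence; Seoul National University; Kyung Hee University

AI summary This paper proposes SceneSplit, a new black-box jailbreak method that splits a harmful narrative into multiple benign scenes and uses their combination as a constraint that steers the final output, raising the likelihood of generating harmful videos and showing that the safety mechanisms of current text-to-video models are vulnerable.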

Comments ICLR 2026. Project page at https://velpegor.github.io/SceneSplit/

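Details
AI Chinese abstract (translated)

As text-to-video (T2V) models advance rapidly, concern about their safety risks is growing. Although recent studies have examined vulnerabilities in models such as LLMs, VLMs, and text-to-image (T2I) models, T2V models remain little studied, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that splits a harmful narrative into multiple scenes, each of which is benign on its own. The method uses the combination of scenes as a strong constraint that steers the final output space. While each scene alone corresponds to a wide, safe space in which most outcomes are harmless, their sequential combination jointly restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further strengthened by iterative scene manipulation, which can bypass safety filters within the constrained unsafe region. In addition, a strategy library that reuses successful attack patterns further improves the overall effectiveness and robustness of the attack. To validate our method, we evaluate SceneSplit on T2V models across 11 safety categories from T2VSafetyBench. Our results show an average attack success rate of 77.2% on Luma Ray2, 84.1% on Hailuo, 78.2% on Veo2, 78.6% on Kling V1.0, and 68.6% on Sora2, significantly outperforming existing baselines. Through this work we show that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, offering new insights for understanding and improving the safety of T2V models.

English abstract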

Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack's overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories from T2VSafetyBench on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, 78.2% on Veo2, 78.6% on Kling V1.0, and 68.6% on Sora2, significantly outperforming the existing baselines. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.

2509.22258 2026-05-20 cs.CV cs.AI updated version

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

超越分类准确度:Neural-MedBench与更深层次推理基准的需求

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li

Affiliations * School of Physics Science and Technology, Beijing University of Posts and Telecommunications; Guangdong Institute of Intelligence Science and Technology; Beijing Chaoyang Hospital, Capital Medical University; Sleep Medical Center, Huzhou Third Municipal Hospital, Affiliated Hospital of Wenzhou Medical University; University of Macau; Renyixun Health Technology Co., Ltd; Academy for Advanced Interdisciplinary Studies, Peking University

AI summary This paper presents Neural-MedBench, a benchmark designed to test multimodal reasoning in neurology. It shows that existing medical datasets overemphasize classification accuracy, finds through systematic evaluation that reasoning failures rather than perceptual errors dominate the performance drop, and argues for an evaluation framework that balances breadth and depth.

Comments 23 pages, 12 figures

Journal ref ICLR'2026

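Details
AI Chinese abstract (translated)

Recent vision-language models (VLMs) have made remarkable progress on standard medical benchmarks, but their true clinical reasoning ability remains unclear. Existing datasets mainly emphasize classification accuracy, so models still fall short in high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact, reasoning-intensive benchmark designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and covers three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based grading, clinician validation, and semantic similarity metrics. In a systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we find that performance drops sharply compared with conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate the models' shortcomings. Our findings highlight the need for a two-axis evaluation framework: breadth-oriented large datasets for statistical generalization, and depth-oriented compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed that guides the expansion of future benchmarks and enables rigorous yet cost-effective evaluation of clinically trustworthy AI.

English abstract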

Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

2509.21196 2026-05-20 cs.LG cs.CV updated version

Differential-Integral Neural Operator for Long-Term Turbulence Forecasting

微分-积分神经算子用于长期湍流预测

Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu

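Affiliations * Tsinghua University; University of Science and Technology of China; The Chinese University of Hong Kong; Nanyang Technological University; Tencent

AI summary This paper proposes a physics-based differential-integral neural operator that learns distinct physical operators in parallel branches to improve the stability and robustness of long-term turbulence forecasting, achieving more accurate predictions on the 2D Kolmogorov flow benchmark.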

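Details
AI Chinese abstract (translated)

Accurately predicting the long-term evolution of turbulence is a major challenge in scientific computing and is critical for applications such as climate modeling and aerospace engineering. Existing deep learning methods, especially neural operators, often fail in long-term autoregressive prediction, leading to catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper we propose the Differential-Integral Neural Operator (DINO), a principled method based on operator decomposition. DINO explicitly models turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, implemented by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition gives DINO exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we show that DINO significantly outperforms state-of-the-art models in long-term forecasting. It suppresses error accumulation over hundreds of timesteps, maintains high fidelity in the vorticity field and the energy spectrum, and sets a new benchmark for physically consistent, long-range turbulence forecasting.

English abstract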

Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the Differential-Integral Neural Operator (DINO), a novel framework designed from a first-principles approach of operator decomposition. DINO explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows DINO with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that DINO significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecasting.
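
To make the idea of a convolution that is constrained to behave like a derivative concrete, here is a one-dimensional sketch. It projects an arbitrary 3-tap kernel onto two moment conditions (weights sum to zero, first moment equals one), which is a standard way to guarantee consistency with the first derivative as the grid is refined. The 1D periodic setting, the function names, and the choice of constraints are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

OFFSETS = np.array([-1.0, 0.0, 1.0])  # stencil positions relative to the centre

def project_to_first_derivative(kernel):
    """Project a free 3-tap kernel onto sum(k) = 0 and sum(k * offset) = 1."""
    k = np.asarray(kernel, dtype=float)
    k = k - k.sum() / 3.0                            # zeroth moment: constants map to 0
    k = k - ((k @ OFFSETS) - 1.0) / 2.0 * OFFSETS    # first moment: slope is reproduced
    return k

def apply_stencil(u, kernel, dx):
    """Periodic 1D cross-correlation of u with the 3-tap kernel, scaled by 1/dx."""
    k_left, k_mid, k_right = kernel
    return (k_left * np.roll(u, 1) + k_mid * u + k_right * np.roll(u, -1)) / dx
```

Any kernel that passes through the projection equals the central difference plus a multiple of the second-difference stencil, so its error shrinks in proportion to the grid spacing. A learned kernel kept on this constraint set therefore still converges to the derivative under refinement.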

2509.14839 2026-05-20 cs.CV 版本更新

MapAnything: Evaluating Monocular Metric Depth Models for 3D Urban Asset Localization

MapAnything: 评估单目度量深度模型用于3D城市资产定位

Miriam Louise Carnot, Jonas Kunze, Erik Quinten Fastermann, Eric Peukert, André Ludwig, Bogdan Franczyk

发表机构 * ScaDS.AI (University of Leipzig)(ScaDS.AI(莱比锡大学)) University of Leipzig(莱比锡大学) Kühne Logistics University(库赫内物流大学) Wrocław University of Economics(弗罗茨瓦夫经济大学)

AI总结 本文提出MapAnything框架,通过单目图像自动映射城市物体和事件,利用度量深度估计模型计算物体坐标,验证其在复杂城市环境中的精度,展示其在交通标志和道路损坏等实际应用中的有效性。

详情
AI中文摘要

城市管理部门越来越多地依赖涵盖交通标志、树木等城市资产以及涂鸦、道路损坏等事件的全面数据库和城市数字孪生,以有效掌握城市状况。数字化提高了对持续更新的空间数据集的需求,但当前的数据采集和维护过程仍涉及大量人工劳动,带来了显著的可扩展性挑战。本文介绍了MapAnything,一种新颖的地理定位框架,能够从单个单目图像自动映射城市物体和事件。通过利用先进的度量深度估计模型,MapAnything准确计算物体的地理坐标,将2D图像数据转换为有价值的3D空间信息。该方法将估计的相机到物体距离与几何原理和已知相机规格相结合。我们展示了该框架的详细验证,在复杂城市环境中将其距离估计精度与高精度LiDAR点云进行对比。我们的评估提供了在各种距离区间和语义区域(如道路和植被)上的空间性能的细致分析。最后,我们通过具体的使用案例,如映射交通标志和道路路面损坏,展示了该框架的实际有效性,并提供了将其整合到自动化城市库存系统中的建议。

英文摘要

City administrations increasingly rely on comprehensive databases and urban digital twins of city assets, such as traffic signs and trees, as well as incidents like graffiti or road damage, to maintain an effective overview of urban conditions. Digitization has increased the demand for continuously updated spatial datasets, yet current data acquisition and maintenance processes still involve considerable manual effort, posing significant scalability challenges. This paper introduces MapAnything, a novel geo-localization framework that automates the spatial mapping of urban objects and incidents from a single monocular image. By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information. The methodology integrates the estimated camera-to-object distance with geometric principles and known camera specifications. We present a detailed validation of the framework, comparing its distance-estimation accuracy against high-precision LiDAR point clouds in complex urban environments. Our evaluation provides a granular analysis of spatial performance across various distance intervals and semantic areas, such as roads and vegetation. Finally, we demonstrate the framework's practical efficacy through specific use cases, including mapping traffic signs and road pavement damage, and provide recommendations for its integration into automated urban inventory systems.
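Combining an estimated camera-to-object distance with camera geometry can be sketched as follows. This is a minimal illustration, not the MapAnything implementation: it assumes a pinhole camera with a known horizontal field of view, a known camera position and compass heading, a distance that is already the horizontal range to the object, and a spherical Earth. The function names and the example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)

def pixel_bearing(px, image_width, hfov_deg, heading_deg):
    """Compass bearing (degrees clockwise from north) toward image column `px`."""
    focal_px = (image_width / 2) / math.tan(math.radians(hfov_deg) / 2)
    offset_deg = math.degrees(math.atan((px - image_width / 2) / focal_px))
    return (heading_deg + offset_deg) % 360

def destination(lat_deg, lon_deg, bearing_deg, distance_m):
    """Point reached by travelling `distance_m` along `bearing_deg` on a sphere."""
    lat1, lon1 = math.radians(lat_deg), math.radians(lon_deg)
    brg = math.radians(bearing_deg)
    delta = distance_m / EARTH_RADIUS_M  # angular distance in radians
    lat2 = math.asin(
        math.sin(lat1) * math.cos(delta)
        + math.cos(lat1) * math.sin(delta) * math.cos(brg)
    )
    lon2 = lon1 + math.atan2(
        math.sin(brg) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)

# Example: camera facing east, object right of the image centre, 20 m away.
bearing = pixel_bearing(px=1200, image_width=1920, hfov_deg=90, heading_deg=90)
obj_lat, obj_lon = destination(51.3397, 12.3731, bearing, 20.0)
```

In practice the depth model's output may be depth along the optical axis rather than range, in which case it would need converting before this step.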

2507.10492 2026-05-20 cs.CV cs.AI cs.LG 版本更新

BenchReAD: A systematic benchmark for retinal anomaly detection

BenchReAD: 一种系统性的视网膜异常检测基准

Chenyu Lian, Hong-Yu Zhou, Zhanli Hu, Jing Qin

发表机构 * The Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China(香港理工大学护理学院智能健康中心) School of Biomedical Engineering, Tsinghua University, Beijing, China(清华大学生物医学工程学院) Research Center for Medical AI, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China(中国科学院深圳先进技术研究院医学人工智能研究中心)

AI总结 本研究提出BenchReAD基准,旨在解决视网膜异常检测领域缺乏全面且公开的评估标准的问题,通过系统化的数据和算法分类,引入了全监督方法DRA,并改进为NFM-DRA,实现了SOTA性能。

Comments MICCAI 2025

详情
AI中文摘要

视网膜异常检测在筛查眼部和系统性疾病中起着关键作用。尽管其重要性,该领域的进展受到缺乏全面且公开可用的基准的阻碍,这对于公平评估和推进方法至关重要。由于这一限制,与视网膜图像相关的先前异常检测工作受到(1)异常类型有限且过于简单的限制,(2)测试集几乎饱和,以及(3)缺乏泛化评估的影响,导致实验设置说服力不足。此外,现有医学异常检测基准大多专注于单类监督方法(仅使用负样本训练),忽视了临床实践中大量可用的标记异常数据和未标记数据。为了填补这些差距,我们引入了视网膜异常检测的基准,该基准在数据和算法上都是全面且系统的。通过分类和评估先前方法,我们发现利用解耦异常表示的全监督方法(DRA)取得了最佳性能,但在遇到某些未见异常时性能显著下降。受单类监督学习中记忆库机制的启发,我们提出了NFM-DRA,将其与正常特征记忆结合,以缓解性能下降,建立新的SOTA。该基准可在https://github.com/DopamineLcy/BenchReAD上公开获取。

英文摘要

Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.
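The memory-bank idea borrowed from one-class methods can be sketched generically: store features of normal samples and score a test feature by its distance to the nearest stored one. This is only an illustration of that general mechanism under toy 2-D features; how NFM-DRA actually combines its Normal Feature Memory with DRA is not described in the abstract, and `anomaly_score` is a hypothetical name.

```python
import math

def anomaly_score(feature, memory_bank):
    """Distance from `feature` to its nearest neighbour among stored normal features."""
    return min(math.dist(feature, normal) for normal in memory_bank)

# Toy memory of normal features (a real system would store encoder embeddings).
memory_bank = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

near = anomaly_score((0.1, 0.0), memory_bank)  # close to a stored normal feature
far = anomaly_score((5.0, 5.0), memory_bank)   # far from everything in memory
```

A larger score means the sample looks less like anything seen among the normal data, which is why such a score can still react to anomaly types absent from the labeled abnormal set.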

2507.05843 2026-05-20 cs.CV 版本更新

USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining

USIGAN: 用于弱配对图像IHC虚拟染色的不平衡自信息特征传输

Yue Peng, Bing Xiong, Fuqiang Chen, De Eybo, RanRan Zhang, Wanming Hu, Jing Cai, Wenjian Qin

发表机构 * Shenzhen Institutes of Advanced Technology, University of Chinese Academy of Sciences(深圳先进技术研究院,中国科学院大学)

AI总结 本文提出USIGAN方法,通过提取全局形态学语义来解决弱配对条件下IHC虚拟染色的不一致问题,改进生成结果的病理语义一致性。

详情
AI中文摘要

免疫组化(IHC)虚拟染色任务旨在从H&E图像生成虚拟IHC图像,同时保持与相邻切片的病理语义一致性。该任务通过生成模型实现形态结构与染色模式的跨域映射,为病理分析提供高效且经济的解决方案。然而,在弱配对条件下,相邻切片之间的空间异质性带来了显著挑战,可能导致不准确的一对多映射并生成与相邻切片病理语义不一致的结果。为了解决这个问题,我们提出了一种新的IHC虚拟染色的不平衡自信息特征传输方法,称为USIGAN,该方法在不依赖位置对应的情况下提取全局形态学语义。通过在联合边缘分布中移除弱配对项,我们有效减轻了弱配对对联合分布的影响,从而显著提高了生成结果的内容一致性和病理语义一致性。此外,我们设计了不平衡最优传输一致性(UOT-CTM)机制和病理自对应(PC-SCM)机制,以构建H&E与生成IHC在图像级别以及真实IHC与生成IHC图像集内的相关矩阵。在两个公开数据集上的实验表明,我们的方法在多个临床相关指标上表现优异,如IoD和Pearson-R相关性,证明了更好的临床相关性。

英文摘要

Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H&E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges: it can lead to inaccurate one-to-many mappings and to generated results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport method for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional correspondence. By removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. Moreover, we design the Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism and the Pathology Self-Correspondence (PC-SCM) mechanism to construct correlation matrices between H&E and generated IHC images at the image level, and between real and generated IHC image sets at the intra-group level. Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance.

2507.01123 2026-05-20 cs.CV cs.LG eess.IV 版本更新

Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions

利用多源卫星数据和地理区域的深度学习进行滑坡检测与制图

Rahul A. Burange, Harsh K. Shinde, Omkar Mutyalwar

发表机构 * Department of Electronics & Telecommunication, KDK College of Engineering(电子与电信系,KDK工程学院)

AI总结 本文提出了一种综合方法,结合多源卫星影像和深度学习模型,以提高滑坡识别和预测的准确性,通过Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度和数字高程模型(DEM)层来捕捉影响滑坡发生的关键环境特征,并评估多种地理空间分析技术对检测精度的影响,同时评估了多种先进的深度学习分割模型,如U-Net、DeepLabV3+和Res-Net,以确定其在滑坡检测中的有效性。

Comments 17 pages, 22 figures

Journal ref JETIR March 2025, Volume 12, Issue 3

详情
AI中文摘要

滑坡对基础设施、经济和人类生命构成严重威胁,需要在多样化的地理区域中进行准确的检测和预测制图。随着深度学习和遥感技术的进步,自动化滑坡检测已变得更加有效。本文提出了一种综合方法,整合多源卫星影像和深度学习模型,以增强滑坡识别和预测。我们利用Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度和数字高程模型(DEM)层来捕捉影响滑坡发生的关键环境特征。各种地理空间分析技术被用来评估地形特征、植被覆盖和降雨对检测精度的影响。此外,我们评估了多种先进的深度学习分割模型,包括U-Net、DeepLabV3+和Res-Net,以确定其在滑坡检测中的有效性。所提出的框架有助于发展可靠的早期预警系统,改进灾害风险管理,并促进可持续的土地利用规划。我们的发现为深度学习和多源遥感在创建稳健、可扩展和可转移的滑坡预测模型中的潜力提供了有价值的见解。

英文摘要

Landslides pose severe threats to infrastructure, economies, and human lives, necessitating accurate detection and predictive mapping across diverse geographic regions. With advancements in deep learning and remote sensing, automated landslide detection has become increasingly effective. This study presents a comprehensive approach integrating multi-source satellite imagery and deep learning models to enhance landslide identification and prediction. We leverage Sentinel-2 multispectral data and ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers to capture critical environmental features influencing landslide occurrences. Various geospatial analysis techniques are employed to assess the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy. Additionally, we evaluate the performance of multiple state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, and Res-Net, to determine their effectiveness in landslide detection. The proposed framework contributes to the development of reliable early warning systems, improved disaster risk management, and sustainable land-use planning. Our findings provide valuable insights into the potential of deep learning and multi-source remote sensing in creating robust, scalable, and transferable landslide prediction models.

2506.08618 2026-05-20 cs.LG cond-mat.mes-hall cond-mat.other cs.AI cs.CV 版本更新

HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals

HSG-12M: 一种大规模空间多图基准,源自非厄密晶体能量谱

Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee

发表机构 * National University of Singapore(新加坡国立大学) NUS Centre for Bioimaging Sciences(新加坡国立大学生物成像科学中心)

AI总结 本文提出HSG-12M,一个包含1160万静态和510万动态哈密顿量谱图的数据集,用于研究非厄密量子物理中的复杂几何结构,填补了现有图基准在空间多边学习方面的空白。

Comments Accepted to ICLR 2026, OpenReview: [https://openreview.net/forum?id=YxuKCME576]. 49 pages, 13 figures, 14 tables. Code & pipeline: [https://github.com/sarinstein-yan/Poly2Graph] Dataset: [https://github.com/sarinstein-yan/HSG-12M] Dataset released under CC BY 4.0.

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
AI中文摘要

人工智能正通过揭示理解复杂物理系统的新方法改变科学研究,但其影响仍受限于缺乏大规模、高质量的领域专用数据集。非厄密量子物理中蕴藏着丰富的资源,其中晶体的能量谱在复平面上形成复杂的几何结构,称为哈密顿量谱图。尽管这些谱图作为电子行为的指纹具有重要意义,但其系统研究一直受限于手动提取的依赖。为释放这一潜力,我们引入Poly2Graph:一个高性能、开源的管道,自动化将一维晶体哈密顿量映射到谱图。使用该工具,我们提出了HSG-12M:一个包含1160万静态和510万动态哈密顿量谱图的数据集,涵盖1401个特征多项式类别,源自177TB的谱势数据。关键的是,HSG-12M是首个大规模空间多图数据集——图嵌入在度量空间中,其中两个节点之间不同的几何轨迹被保留为单独的边。这同时填补了现有图基准在空间多边学习方面的空白。流行的GNN基准测试揭示了在大规模学习空间多边时的新挑战。除了其实际用途外,我们还表明谱图是多项式、向量和矩阵的通用拓扑指纹,建立了新的代数到图的联系。HSG-12M为凝聚态物理的数据驱动科学发现奠定了基础,为几何感知图学习的新机会以及更广泛领域铺平了道路。

英文摘要

AI is transforming scientific research by revealing new ways to understand complex physical systems, but its impact remains constrained by the lack of large, high-quality domain-specific datasets. A rich, largely untapped resource lies in non-Hermitian quantum physics, where the energy spectra of crystals form intricate geometries on the complex plane, termed Hamiltonian spectral graphs. Despite their significance as fingerprints for electronic behavior, their systematic study has been intractable due to the reliance on manual extraction. To unlock this potential, we introduce Poly2Graph: a high-performance, open-source pipeline that automates the mapping of 1-D crystal Hamiltonians to spectral graphs. Using this tool, we present HSG-12M: a dataset containing 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, distilled from 177 TB of spectral potential data. Crucially, HSG-12M is the first large-scale dataset of spatial multigraphs, that is, graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. This simultaneously addresses a critical gap, as existing graph benchmarks overwhelmingly assume simple, non-spatial edges, discarding vital geometric information. Benchmarks with popular GNNs expose new challenges in learning spatial multi-edges at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for data-driven scientific discovery in condensed matter physics and opens new opportunities in geometry-aware graph learning and beyond.
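The notion of a spatial multigraph, where two nodes may be joined by several edges that differ only in their geometry, can be made concrete with a small data structure. This is a toy sketch under the assumption that nodes are points in the complex plane stored as (re, im) tuples and each edge is a polyline; it is not the HSG-12M storage format or the Poly2Graph API, and the class and method names are hypothetical.

```python
import math
from collections import defaultdict

def polyline_length(points):
    """Total Euclidean length of a path given as (re, im) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

class SpatialMultigraph:
    """Undirected graph whose parallel edges each keep their own geometry."""

    def __init__(self):
        self.node_pos = {}              # node -> (re, im) position
        self.edges = defaultdict(list)  # (u, v) -> list of polylines

    def add_node(self, node, pos):
        self.node_pos[node] = pos

    def add_edge(self, u, v, path):
        # Sort the endpoints so (u, v) and (v, u) share one key.
        self.edges[tuple(sorted((u, v)))].append(list(path))

    def edge_lengths(self, u, v):
        return [polyline_length(p) for p in self.edges[tuple(sorted((u, v)))]]

g = SpatialMultigraph()
g.add_node("a", (0.0, 0.0))
g.add_node("b", (2.0, 0.0))
g.add_edge("a", "b", [(0.0, 0.0), (2.0, 0.0)])              # straight segment
g.add_edge("b", "a", [(2.0, 0.0), (1.0, 1.0), (0.0, 0.0)])  # detour through (1, 1)
lengths = g.edge_lengths("a", "b")
```

A simple-graph representation would collapse these two edges into one and lose the fact that the trajectories differ, which is the information the abstract says existing benchmarks discard.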

2506.05317 2026-05-20 cs.CV 版本更新

ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

ProJo4D:渐进式联合优化用于稀疏视图逆物理估计

Daniel Rho, Jun Myeong Choi, Biswadip Dey, Roni Sengupta

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Meta Reality Labs(Meta现实实验室)

AI总结 本文提出ProJo4D,一种渐进式联合优化框架,用于解决稀疏视图下逆物理参数估计问题,通过逐步扩展联合优化参数集,提高了4D未来状态预测和物理参数估计的准确性,达到几何精度提升10倍的性能。

Comments TMLR 2026

详情
AI中文摘要

神经渲染在3D重建和新视图合成方面已取得显著进展,将物理整合到这些框架中开辟了新的应用,如机器人和XR中的物理准确数字孪生。然而,从视觉观测中估计物理参数的逆问题仍具挑战性。现有物理感知神经渲染方法通常需要密集多视角视频,使其在可扩展的实际部署中不切实际。在稀疏视图设置下,当前方法采用的顺序优化策略导致严重误差累积:初始3D重建的不准确性会传播到后续阶段,降低物理状态和材料参数估计。另一方面,同时优化所有参数失败,因为问题高度非凸且通常非可微。我们提出ProJo4D,一种渐进式联合优化框架,逐步扩展联合优化的参数集。这种设计使物理感知梯度能够细化几何,同时避免直接对所有参数进行联合优化的不稳定性。在合成和真实世界数据集上的评估表明,ProJo4D在4D未来状态预测和物理参数估计方面显著优于先前工作,实现几何精度提升高达10倍,同时保持计算效率。请访问项目网页:https://daniel03c1.github.io/ProJo4D/

英文摘要

Neural rendering has advanced significantly in 3D reconstruction and novel view synthesis, and integrating physics into these frameworks opens new applications such as physically accurate digital twins for robotics and XR. However, the inverse problem of estimating physical parameters from visual observations remains challenging. Existing physics-aware neural rendering methods typically require dense multi-view videos, making them impractical for scalable, real-world deployment. Under sparse-view settings, the sequential optimization strategies employed by current approaches suffer from severe error accumulation: inaccuracies in initial 3D reconstruction propagate to subsequent stages, degrading physical state and material parameter estimates. On the other hand, simultaneous optimization of all parameters fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually expands the set of jointly optimized parameters. This design enables physics-informed gradients to refine geometry while avoiding the instability of direct joint optimization over all parameters. Evaluations on synthetic and real-world datasets demonstrate that ProJo4D substantially outperforms prior work in 4D future state prediction and physical parameter estimation, achieving up to 10x improvement in geometric accuracy while maintaining computational efficiency. Please visit the project webpage: https://daniel03c1.github.io/ProJo4D/

2506.01418 2026-05-20 cs.RO cs.CV 版本更新

SEMNAV: Enhancing Visual Semantic Navigation in Robotics through Semantic Segmentation

SEMNAV: 通过语义分割增强机器人中的视觉语义导航

Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre

发表机构 * University of Alcalá(阿尔卡拉大学) CAM-UAH Ministry of Science and Innovation of Spain(西班牙科学与创新部)

AI总结 本文提出SEMNAV,一种利用语义分割作为环境主要视觉输入表示的方法,以增强机器人代理的感知和决策能力,通过引入高层面的语义信息,提升模型在未知环境中的泛化能力,并引入SEMNAV数据集进行训练。

Journal ref Applied Intelligence, 2026

详情
AI中文摘要

视觉语义导航(VSN)是机器人学中的基本问题,其中智能体必须在未知环境中导航至目标对象,主要依靠视觉信息。大多数最先进的VSN模型是在模拟环境中训练的,其中使用的是现实世界的渲染场景,最理想的情况。这些方法通常依赖于虚拟场景的原始RGB数据,这限制了它们在真实世界环境中的泛化能力,由于域适应问题。为了解决这个问题,本文提出了SEMNAV,一种新的方法,利用语义分割作为环境的主要视觉输入表示,以增强代理的感知和决策能力。通过显式地引入这种高层语义信息,我们的模型学习到稳健的导航策略,提高了在未见过的环境中泛化的能力,无论是模拟还是真实世界。我们还引入了SEMNAV数据集,这是一个新编纂的数据集,用于训练如SEMNAV这样的语义分割感知导航模型。我们的方法在模拟环境和真实世界机器人平台上进行了广泛的评估。实验结果表明,SEMNAV优于现有的最先进VSN模型,在Habitat 2.0模拟环境使用HM3D数据集时实现了更高的成功率。此外,我们的实际实验突显了语义分割在缓解仿真到现实差距方面的有效性,使我们的模型成为实用VSN基于机器人应用的有希望的解决方案。代码和数据集可在https://github.com/gramuah/semnav访问。

英文摘要

Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, which at best use rendered scenes of the real world. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating this type of high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, in both simulated and real-world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively both in simulated environments and on real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

2505.23747 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Spatial-MLLM: 提升基于视觉的空域智能的MLLM能力

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出Spatial-MLLM,一种基于纯2D观测的视觉空间推理框架,通过双编码器架构和空间感知帧采样策略提升空间理解能力,实验表明其在多种视觉空间任务中达到SOTA性能。

Comments 22 pages

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)在2D视觉任务上的性能显著提升。然而,提高其空间智能仍是一个挑战。现有的3D MLLMs总是依赖额外的3D或2.5D数据来整合空间意识,限制了它们在只有2D输入(如图像或视频)场景中的实用性。在本文中,我们提出了Spatial-MLLM,一种新颖的框架,用于从纯2D观测中进行基于视觉的空间推理。与传统视频MLLMs依赖CLIP-based视觉编码器优化语义理解不同,我们的关键见解是释放来自前馈视觉几何基础模型的强大结构先验。具体来说,我们提出了双编码器架构:一个预训练的2D视觉编码器用于提取语义特征,以及一个3D空间编码器,从视觉几何模型的主干初始化以提取3D结构特征。然后,一个连接器将两种特征整合到统一的视觉标记中以增强空间理解。此外,我们提出了一种在推理时间的空间感知帧采样策略,该策略选择视频序列中具有空间信息的帧,确保在有限的token长度下,模型专注于对空间推理至关重要的帧。除了架构改进外,我们从多个来源构建了一个训练数据集,并使用监督微调和GRPO对其进行训练。在各种真实世界数据集上的广泛实验表明,Spatial-MLLM在广泛的基于视觉的空间理解和推理任务中实现了SOTA性能。项目页面:https://diankun-wu.github.io/Spatial-MLLM/.

英文摘要

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

2505.17726 2026-05-20 cs.CV cs.AI 版本更新

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Slot-MLLM: 多模态大语言模型中的面向对象视觉标记化

Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, Sungwoong Kim

发表机构 * Department of Artificial Intelligence, Korea University(韩国大学人工智能系) Kakao Corp(Kakao公司) School of Computing, KAIST(韩国科学技术院计算机学院)

AI总结 本文提出了一种面向对象的视觉标记化方法Slot-MLLM,通过基于Slot Attention的标记器,有效编码局部视觉细节并保持高层语义,从而提升多模态大语言模型在视觉内容理解和生成中的性能。

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)已成为实现人工通用智能的关键方法。特别是,视觉语言MLLMs已被开发用于从多模态输入中生成文本和视觉输出。这一进展需要高效的图像标记,使LLMs能够有效处理输入和输出。然而,现有的图像标记方法通常只能捕捉全局抽象概念或均匀分割的图像块,限制了MLLMs在理解和生成细节视觉内容方面的能力,尤其是在对象层面。为了解决这一限制,我们提出了一种基于Slot Attention的面向对象视觉标记器,专门针对MLLMs。具体而言,基于Q-Former编码器、扩散解码器和残差向量量化,我们提出的离散化槽标记能够编码局部视觉细节,同时保持高层语义,并与文本数据对齐,无缝集成到LLMs的统一下一个标记预测框架中。所得到的Slot-MLLM在各种涉及局部详细理解和生成的视觉语言任务中,相对于先前视觉标记器的基线表现显著提升。值得注意的是,这项工作是首次展示了使用MLLMs和真实自然图像进行面向对象槽注意力的可行性。

英文摘要

Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.
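Residual vector quantization, one of the building blocks named above, can be sketched in a few lines: each stage quantizes whatever the previous stages failed to capture, so a vector becomes a short list of discrete codes. This is a generic illustration with tiny hand-written codebooks, not the Slot-MLLM tokenizer; the codebooks, the input vector, and the function names are all made up for the example.

```python
def nearest(vec, codebook):
    """Index of the codeword closest to `vec` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )

def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage encodes the residual left so far."""
    residual, codes = list(vec), []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out

codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],    # coarse stage
    [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25)],  # fine stage
]
codes, residual = rvq_encode((1.2, 0.9), codebooks)
approx = rvq_decode(codes, codebooks)
```

Here the discrete codes are what a language model would consume as tokens, while `residual` is the detail lost to quantization.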

2505.12217 2026-05-20 cs.CV 版本更新

HyperCap: Hyperspectral Land Cover Captioning Dataset for Vision Language Models

HyperCap:面向视觉语言模型的超光谱土地覆盖描述数据集

Aryan Das, Tanishq Rachamalla, Pravendra Singh, Koushik Biswas, Vinay Kumar Verma, Salvador Garcia, Antonio Plaza, Swalpa Kumar Roy

发表机构 * Department of Computer Science and Engineering, Vellore Institute of Technology(计算机科学与工程系,维洛雷理工学院) Department of Information Technology, Siddhartha Academy of Higher Education(信息技术系,斯里达拉塔高等教育学院) Department of Computer Science and Engineering, Indian Institute of Technology, Roorkee(计算机科学与工程系,印度理工学院罗尔基分校) Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi(计算机科学与工程系,印度信息技术学院德里) Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur(计算机科学与工程系,印度理工学院坎浦尔) Department of Computer Science and Artificial Intelligence, University of Granada(计算机科学与人工智能系,格拉纳达大学) Hyperspectral Computing Laboratory, Department of Computers and Communications, University of Extremadura(超光谱计算实验室,计算机与通信系,埃斯特拉达大学)

AI总结 本文提出HyperCap数据集,通过整合光谱数据与像素级文本标注,提升遥感应用中的模型性能,为未来研究提供基础资源。

Comments Accepted for publication in IEEE Geoscience and Remote Sensing Magazine (GRSM), 2026

详情
AI中文摘要

我们介绍了HyperCap,首个大规模超光谱描述数据集,旨在提升模型在遥感应用中的性能和有效性。与传统超光谱成像(HSI)基准不同,HyperCap将光谱数据与像素级文本标注相结合,实现更深入的语义理解。该数据集通过结合自动和手动方法对四个基准数据集进行标注,确保准确性和一致性。使用最先进的编码器和多样的融合技术进行实证评估,显示出显著的分类性能提升。这些结果突显了视觉-语言学习在HSI中的潜力,并将HyperCap定位为未来研究的基础数据集。代码和数据集可在https://github.com/arya-domain/HyperCap获取。

英文摘要

We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) benchmarks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field. The code and dataset are available at https://github.com/arya-domain/HyperCap.

2504.04065 2026-05-20 cs.CV cs.IR cs.MM 版本更新

Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering

使检索增强的视觉问答实现协作参数知识校准

Jiaqi Deng, Kaize Shi, Zonghan Wu, Huan Huo, Dingxian Wang, Guandong Xu

发表机构 * University of Technology Sydney(悉尼科技大学) East China Normal University(华东师范大学) The Education University of Hong Kong(香港教育大学)

AI总结 本文提出了一种统一的检索增强视觉问答框架,通过协作参数知识校准来充分利用KB-VQA中的跨任务协同效应,从而提升问答准确性。

Comments 10 pages, 5 figures, Under Review

Journal ref Knowledge-Based Systems, 8 July 2026, Volume 346

详情
AI中文摘要

基于知识的视觉问答(KB-VQA)系统通过从外部知识库检索的知识来解决复杂的视觉-地面化问题。知识检索和答案生成任务都要求对问题上下文和外部知识进行精确的多模态理解。然而,现有方法将这两个阶段视为独立模块,在训练过程中交互有限,这阻碍了双向参数知识共享,最终导致性能不佳。为充分利用KB-VQA中的跨任务协同效应,我们提出了一种统一的检索增强VQA框架,具有协作参数知识校准。所提出的框架可以有效地将通用多模态预训练模型适应于细粒度、知识密集型任务,同时在训练和推理过程中使检索器和生成器能够协作增强和共享其参数知识。为了增强对问题和外部文档的细粒度理解,我们还将晚期交互机制整合到所提出的训练框架中。此外,我们引入了一种反思-回答机制,使模型能够显式评估并细化其知识边界。我们的方法在与最先进的模型竞争中取得了竞争力的表现,实现了回答准确率的显著4.7%的提升,并为基础MLLMs的VQA性能带来了平均7.5%的提升。

英文摘要

Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions with knowledge retrieved from external knowledge bases. Knowledge retrieval and answer generation both necessitate precise multimodal understanding of question context and external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing, ultimately leading to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework can effectively adapt general multimodal pre-trained models for fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate a late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.

2504.03758 2026-05-20 cs.CY cs.CV cs.GR 版本更新

Improved visual-information-driven model for crowd simulation and its modular application

改进的视觉信息驱动模型用于人群模拟及其模块化应用

Xuanwen Liang, Jiayu Chen, Eric Wai Ming Lee, Wei Xie

发表机构 * Department of Architecture and Civil Engineering(建筑与土木工程系) Department of Construction Management(建设管理系) Sichuan University-The Hong Kong Polytechnic University Institute for Disaster Management and Reconstruction(四川大学-香港理工大学灾难管理与重建研究院)

AI总结 本文提出一种数据驱动的人群模拟模型,通过改进的视觉信息提取和显式出口提示,提高在多个场景中的灵活性,并在四个基本模块和复合场景中进行了测试和评估,结果显示该模型在多个场景中表现良好,优于传统知识驱动模型。

Journal ref Xuanwen Liang, Jiayu Chen, Eric Wai Ming Lee, & Wei Xie (2026). Improved visual-information-driven model for crowd simulation and its modular application. Chaos, Solitons & Fractals, 209, 118481

详情
AI中文摘要

人群运动模拟对行人安全管理及设施设计至关重要。数据驱动模型有潜力提高真实性和预测准确性,但大多数模型仅适用于单一场景,限制了其灵活性。我们提出了一种数据驱动的人群模拟模型,结合了精细化的视觉信息提取和显式出口提示,旨在通过更有效地捕捉核心导航特征,提高在多个场景中的灵活性。该模型在四个基本模块(瓶颈、走廊、拐角和T形交叉口)上进行了测试,并进一步在复合场景中使用模块化方法进行评估。结果表明,该模型在这些场景中表现良好,与现实世界实验中的行人运动一致,并在这些场景中优于传统知识驱动模型。研究结果可为数据驱动的人群模拟模型发展提供启发,并推进数据驱动方法的应用。

英文摘要

Crowd movement simulation is crucial for pedestrian safety management and facility design. Data-driven models offer the potential to improve realism and predictive accuracy, but most are developed for a single scenario, limiting their flexibility. We propose a data-driven crowd simulation model that incorporates refined visual-information extraction and explicit exit cues, aiming to improve flexibility across multiple scenarios by more effectively capturing core navigational features. The model is tested on four fundamental modules (bottleneck, corridor, corner, and T-junction) and further evaluated in a composite scenario using a modular approach. Results show that our model performs well across these scenarios, aligning with pedestrian movement in real-world experiments, and outperforms the classical knowledge-driven model in these scenarios. The research outcomes can provide inspiration for the development of data-driven crowd simulation models and advance the application of data-driven approaches.

2504.00470 2026-05-20 cs.LG cs.CV 版本更新

Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection

少即是多:通过最小可解释子集选择实现高效的黑盒属性分析

Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) University of Chinese Academy of Sciences(中国科学院大学) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) School of Artificial Intelligence, University of Science and Technology Beijing(北京科技大学人工智能学院) Department of Mechanical Engineering, Imperial College London(伦敦帝国理工学院机械工程系) Center for Machine Vision and Signal Analysis (CMVS), University of Oulu(奥卢大学机器视觉与信号分析中心) School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区计算机科学与技术学院)

AI总结 本文提出了一种高效的黑盒属性分析方法LiMA,通过将重要区域的属性分析转化为子模函数子集选择的优化问题,以更少的区域提供更准确的解释,并在多个基准模型上展示了显著的改进。

详情
AI中文摘要

为了开发一个可信的AI系统,目标是识别对模型决策影响最大的输入区域。现有属性方法的主要任务是高效且准确地识别输入-预测交互关系。特别是当输入数据是离散的,如图像时,分析输入和输出之间的关系由于组合爆炸而成为重大挑战。在本文中,我们提出了一种新颖且高效的黑盒属性机制LiMA(Less input is More faithful for Attribution),它将重要区域的属性分析重新表述为一个子模子集选择的优化问题。首先,为了准确评估交互,我们设计了一个子模函数,该函数量化子集的重要性并有效捕捉其对决策结果的影响。然后,通过一种新的双向贪心搜索算法,高效地对输入子区域按重要性进行排序。LiMA能够识别最和最不重要的样本,同时确保一个最优的属性边界,以最小化误差。在八个基础模型上的广泛实验表明,我们的方法在更少的区域上提供了忠实的解释,并表现出强大的泛化能力,插入和删除任务的平均改进分别为36.3%和39.6%。我们的方法在属性效率方面也优于朴素的贪心搜索,速度提高了1.6倍。此外,当解释模型预测错误的原因时,我们的方法平均最高置信度比最先进的属性算法高86.1%。代码可在https://github.com/RuoyuChen10/LIMA上获得。

英文摘要

Developing a trustworthy AI system requires identifying the input regions that most influence the model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to rank input sub-regions efficiently by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
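
Illustrative sketch (not from the paper): to show what submodular subset selection looks like in general, the Python snippet below runs plain greedy maximization on a toy coverage function. It is neither LiMA's submodular function nor its bidirectional greedy search. The names `coverage` and `greedy_select` and the toy region-to-feature mapping are assumptions made up for this example.

```python
from typing import Callable, Dict, FrozenSet, List, Set


def coverage(regions: Dict[str, Set[int]]) -> Callable[[FrozenSet[str]], float]:
    """Build a toy set function: the number of distinct features a subset covers.

    Coverage functions are a standard example of monotone submodular functions.
    """
    def score(subset: FrozenSet[str]) -> float:
        covered: Set[int] = set()
        for name in subset:
            covered |= regions[name]
        return float(len(covered))
    return score


def greedy_select(items: List[str],
                  score: Callable[[FrozenSet[str]], float],
                  k: int) -> List[str]:
    """Pick up to k items, each time adding the one with the largest marginal gain."""
    chosen: List[str] = []
    remaining = list(items)
    for _ in range(min(k, len(items))):
        base = score(frozenset(chosen))
        # Marginal gain of each candidate given what is already chosen;
        # ties are broken by list order because max() keeps the first maximum.
        best = max(remaining, key=lambda r: score(frozenset(chosen + [r])) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical image regions mapped to the feature ids they account for.
toy_regions = {"r1": {1, 2, 3}, "r2": {3, 4}, "r3": {5}, "r4": {1, 2}}
ranking = greedy_select(list(toy_regions), coverage(toy_regions), k=3)
print(ranking)
```

The order in which items are picked doubles as an importance ranking, which is the property an attribution method built on subset selection would rely on.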

2503.12172 2026-05-20 cs.LG cs.CR cs.CV 版本更新

SEAL: Semantic Aware Image Watermarking

SEAL:语义感知图像水印

Kasra Arabi, R. Teal Witter, Chinmay Hegde, Niv Cohen

Affiliations * New York University

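AI summary This paper proposes a new watermarking method that embeds semantic information about the generated image directly into the watermark, giving a distortion-free watermark that can be verified without a database of key patterns. The key pattern is inferred from the image's semantic embedding via locality-sensitive hashing, and conditioning detection on the original image content improves robustness to forgery attacks.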

详情
AI中文摘要

生成模型已迅速发展以生成逼真的输出。然而,它们的合成输出越来越多地挑战自然与AI生成内容之间的清晰区分,需要稳健的水印技术。水印通常需要保持目标图像的完整性,抵御移除尝试,并防止未经授权的复制到无关图像上。为了解决这一需求,最近的方法将持久水印嵌入由扩散模型生成的图像中使用初始噪声。然而,为此,它们要么会扭曲生成图像的分布,要么依赖于搜索一个长密钥字典进行检测。在本文中,我们提出了一种新的水印方法,将生成图像的语义信息直接嵌入水印中,使水印无损,且无需数据库中的密钥模式即可验证。相反,密钥模式可以从图像的语义嵌入中使用局部敏感哈希推断。此外,将水印检测条件化于原始图像内容可以提高对伪造攻击的鲁棒性。为了证明这一点,我们考虑了两种被忽视的攻击策略:(i)攻击者提取初始噪声并生成具有相同模式的新图像;(ii)攻击者在水印图像中插入无关(可能有害)的对象,可能在保持水印的情况下。我们通过实验证明了我们的方法对这些攻击的增强鲁棒性。总的来说,我们的结果表明,内容感知的水印可以缓解图像生成模型带来的风险。

英文摘要

Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving the watermark. We empirically validate our method's increased robustness to these attacks. Taken together, our results suggest that content-aware watermarks can mitigate risks arising from image-generative models.
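
Illustrative sketch (not from the paper): the abstract says the key pattern is inferred from a semantic embedding via locality-sensitive hashing. The Python snippet below shows one classic LSH family, random-hyperplane hashing, as an example of the general idea. Whether SEAL uses this particular family, and how the bits are turned into an initial-noise pattern, is not stated in the abstract. The function names, the 4-dimensional toy embedding, and the seed are assumptions.

```python
import random
from typing import List


def make_hyperplanes(dim: int, n_bits: int, seed: int) -> List[List[float]]:
    """Draw n_bits random Gaussian hyperplane normals from a seeded generator."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_bits(embedding: List[float], planes: List[List[float]]) -> List[int]:
    """One bit per hyperplane: 1 if the embedding lies on its non-negative side."""
    bits = []
    for plane in planes:
        dot = sum(e * p for e, p in zip(embedding, plane))
        bits.append(1 if dot >= 0.0 else 0)
    return bits


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions where two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, n_bits=16, seed=0)   # hypothetical secret seed
embedding = [0.2, -1.0, 0.5, 0.3]                     # toy "semantic embedding"
scaled = [2.0 * v for v in embedding]                 # same direction, other norm

key_bits = lsh_bits(embedding, planes)
print(key_bits, hamming(key_bits, lsh_bits(scaled, planes)))
```

The hash depends only on the direction of the embedding, so nearby embeddings tend to share most bits while unrelated ones do not. This is the property that lets a detector recompute the key from image content instead of searching a key database.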

2503.02170 2026-05-20 cs.CV cs.AI 版本更新

Adaptive Camera Sensor for Vision Models

自适应摄像头传感器用于视觉模型

Eunsu Baek, Sunghwan Han, Taesik Gong, Hyung-Sin Kim

Affiliations * Graduate School of Data Science; Seoul National University; Department of Computer Science & Engineering; Seogang University; Ulsan National Institute of Science and Technology

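AI summary This paper proposes Lens, an adaptive camera sensor control method inspired by human visual perception that improves model performance by capturing images that are high-quality from the model's perspective, adapts to specific models and scenes in real time, and is validated on the new ImageNet-ES Diverse dataset.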

Comments The International Conference on Learning Representations (ICLR 2025)

详情
AI中文摘要

领域偏移仍然是基于深度学习的计算机视觉中的持续挑战,通常需要大量的模型修改或标记数据集来解决。受人类视觉感知的启发,即通过矫正透镜调整输入质量而不是过度训练大脑,我们提出了Lens,一种新颖的摄像头传感器控制方法,通过从模型视角捕获高质量图像来增强模型性能,而不是依赖传统的以人类为中心的传感器控制。Lens是轻量级的,并且能够实时适应特定模型和场景的传感器参数。其核心是VisiT,一种无需训练的、模型特定的质量指标,它在测试时使用置信度分数评估单个未标记样本,而无需额外的适应成本。为了验证Lens,我们引入了ImageNet-ES Diverse,一个新基准数据集,捕捉了来自变化的传感器和光照条件的自然扰动。在ImageNet-ES和我们新的ImageNet-ES Diverse上的大量实验表明,Lens在各种传感器控制和模型修改的基线方案中显著提高了模型的准确性,同时保持了低延迟的图像捕获。Lens有效补偿了大模型大小差异,并与模型改进技术协同作用。我们的代码和数据集可在github.com/Edw2n/Lens.git上获得。

英文摘要

Domain shift remains a persistent challenge in deep-learning-based computer vision, often requiring extensive model modifications or large labeled datasets to address. Inspired by human visual perception, which adjusts input quality through corrective lenses rather than over-training the brain, we propose Lens, a novel camera sensor control method that enhances model performance by capturing high-quality images from the model's perspective rather than relying on traditional human-centric sensor control. Lens is lightweight and adapts sensor parameters to specific models and scenes in real-time. At its core, Lens utilizes VisiT, a training-free, model-specific quality indicator that evaluates individual unlabeled samples at test time using confidence scores without additional adaptation costs. To validate Lens, we introduce ImageNet-ES Diverse, a new benchmark dataset capturing natural perturbations from varying sensor and lighting conditions. Extensive experiments on both ImageNet-ES and our new ImageNet-ES Diverse show that Lens significantly improves model accuracy across various baseline schemes for sensor control and model modification while maintaining low latency in image captures. Lens effectively compensates for large model size differences and integrates synergistically with model improvement techniques. Our code and dataset are available at github.com/Edw2n/Lens.git.

2502.20981 2026-05-20 cs.CV 版本更新

Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection

分布原型扩散学习用于开放集监督异常检测

Fuyun Wang, Tong Zhang, Yuanzhi Wang, Yide Qiu, Xin Liu, Xu Guo, Zhen Cui

Affiliations * Nanjing University of Science and Technology; Nanjing SeetaCloud Technology; Beijing Normal University

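AI summary This paper proposes a distribution prototype diffusion learning method that builds learnable Gaussian prototypes to form a latent representation space and sharpen the discriminative boundary of normal samples, and learns a Schrödinger bridge that diffuses normal samples toward the prototypes while pushing anomalies away, improving anomaly detection performance.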

Comments Accepted by CVPR 2025

详情
AI中文摘要

在开放集监督异常检测(OSAD)中,现有方法通常生成伪异常来补偿观察到的异常样本稀缺,而忽视了正常样本的关键先验,导致判别边界效果不佳。为了解决这个问题,我们提出了一种分布原型扩散学习(DPDL)方法,旨在将正常样本封闭在紧凑且判别的分布空间中。具体来说,我们构建了多个可学习的高斯原型,以创建一个容纳丰富且多样正常样本的潜在表示空间,并学习Schroedinger桥以促进正常样本向这些原型的扩散过渡,同时将异常样本推离。此外,为了增强样本间的分离,我们设计了一种在超球面空间中的分散特征学习方法,有助于识别分布外的异常。实验结果表明,所提出的DPDL方法在9个公开数据集上取得了最先进的性能。

英文摘要

In Open-set Supervised Anomaly Detection (OSAD), existing methods typically generate pseudo anomalies to compensate for the scarcity of observed anomaly samples, while overlooking critical priors of normal samples, leading to less effective discriminative boundaries. To address this issue, we propose a Distribution Prototype Diffusion Learning (DPDL) method aimed at enclosing normal samples within a compact and discriminative distribution space. Specifically, we construct multiple learnable Gaussian prototypes to create a latent representation space for abundant and diverse normal samples and learn a Schrödinger bridge to facilitate a diffusive transition toward these prototypes for normal samples while steering anomaly samples away. Moreover, to enhance inter-sample separation, we design a dispersion feature learning scheme in hyperspherical space, which benefits the identification of out-of-distribution anomalies. Experimental results demonstrate the effectiveness and superiority of our proposed DPDL, achieving state-of-the-art performance on 9 public datasets.

2501.09203 2026-05-20 cs.CV cs.RO 版本更新

3D Modeling and Automated Measurement of Concrete Cracks via Segment Anything Refinement and Visual Inertial LiDAR Fusion

通过段落任何精修和视觉惯性LiDAR融合进行混凝土裂缝的3D建模与自动测量

Pengru Deng, Jiapeng Yao, Chun Li, Su Wang, Xinrun Li, Varun Ojha, Xuhui He

Affiliations * School of Civil Engineering; Central South University; Hunan Provincial Key Laboratory for Disaster Prevention and Mitigation of Rail Transit Engineering Structures; Nvidia; School of Computing; Newcastle University

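AI summary This paper proposes an innovative framework that combines computer vision techniques with multi-modal simultaneous localization and mapping (SLAM) for 2D crack detection, 3D reconstruction, and automatic 3D crack measurement, addressing the limited adaptability and robustness of existing methods, particularly on curved or complex geometries.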

Comments Title and author list updated

Journal ref Computer-Aided Civil and Infrastructure Engineering, Volume 45, 2026, 100019, ISSN 1093-9687

详情
AI中文摘要

视觉-空间系统在混凝土裂缝检测中变得越来越关键。然而,现有方法往往缺乏对多样化场景的适应性,在基于图像的方法中表现出有限的鲁棒性,并且在处理曲线或复杂几何形状时存在困难。为了解决这些限制,本文提出了一种创新的框架,通过整合计算机视觉技术和多模态同时定位与建图(SLAM),用于二维(2D)裂缝检测、三维(3D)重建和三维自动裂缝测量。首先,基于基础的DeepLabv3+分割模型,并结合特定的改进利用基础模型Segment Anything Model(SAM),我们开发了一种具有强泛化能力的裂缝分割方法,能够在不熟悉的场景中生成精确的2D裂缝掩码。为了提高三维重建的准确性和鲁棒性,利用Light Detection and Ranging(LiDAR)点云与图像数据和分割掩码。通过利用图像和LiDAR-SLAM,我们开发了多帧和多模态融合框架,产生密集、着色的点云,有效捕捉裂缝语义在三维现实尺度上。此外,裂缝几何属性在三维密集点云空间中自动且直接地进行测量,超越了传统二维图像测量方法的限制。这一进步使该方法适用于具有曲线和复杂三维几何结构的结构部件。在各种混凝土结构上的实验结果突显了所提出方法的显著改进和独特优势,展示了其在现实应用中的有效性、准确性和鲁棒性。

英文摘要

Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement that integrates computer vision technologies and multi-modal simultaneous localization and mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements that use the Segment Anything Model (SAM) foundation model, we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were used together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the geometric attributes of cracks were measured automatically and directly within the dense 3D point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

2412.13111 2026-05-20 cs.CV cs.GR 版本更新

Motion-2-To-3: Leveraging 2D Motion Data for 3D Motion Generations

Motion-2-To-3: 利用2D运动数据进行3D运动生成

Ruoxi Guo, Huaijin Pi, Zehong Shen, Qing Shuai, Zechen Hu, Zhumei Wang, Yajiao Dong, Ruizhen Hu, Taku Komura, Sida Peng, Xiaowei Zhou

Affiliations * Zhejiang University; Deep Glint; The University of Hong Kong; Shenzhen University

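AI summary This paper proposes a method that uses motion data extracted from 2D videos to improve text-driven 3D motion generation; by disentangling local joint motion from global movement, it learns local motion priors effectively and improves the realism and diversity of generated 3D human motion.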

Comments Project page: https://zju3dv.github.io/Motion-2-to-3/

Journal ref 2025 IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 2025, pp. 14305-14316

详情
AI中文摘要

文本驱动的人体运动合成已展现出在电影和游戏行业颠覆性设计的潜力。现有方法通常依赖于3D运动捕捉数据,这需要特殊设置,导致数据采集成本高,最终限制了人体运动的多样性和范围。相比之下,2D人体视频提供了一种广泛且易于获取的运动数据源,涵盖了更广泛风格和活动。在本文中,我们探索了从视频中提取的2D人体运动作为替代数据源,以改进基于文本的3D运动生成。我们的方法引入了一个新颖的框架,将局部关节运动与全局运动解耦,从而能够高效地从2D数据中学习局部运动先验。我们首先在大量文本-2D运动配对数据集上训练了一个单视角的2D局部运动生成器。然后,我们用3D数据对生成器进行微调,将其转换为多视角生成器,该生成器能够预测视图一致的局部关节运动和根动力学。在知名数据集和新文本提示上的评估表明,我们的方法能够高效利用2D数据,支持更广泛的真实3D人体运动生成。我们的代码在https://zju3dv.github.io/Motion-2-to-3/上公开提供。

英文摘要

Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on the well-acknowledged dataset and novel text prompts demonstrate that our method can efficiently utilize 2D data, supporting a wider range of realistic 3D human motion generation. Our code is publicly available at https://zju3dv.github.io/Motion-2-to-3/.

2412.00404 2026-05-20 cs.CV 版本更新

Hard-Label Black-Box Attacks on 3D Point Clouds

针对3D点云的硬标签黑盒攻击

Daizong Liu, Yunbo Tao, Junhao Dong, Keke Tang, Pan Zhou, Wei Hu, Yew-Soon Ong

Affiliations * Institute for Math & AI; Wuhan University; Huazhong University of Science and Technology; Shenzhen Huazhong University of Science and Technology Research Institute; College of Computing and Data Science; Nanyang Technological University; Cyberspace Institute of Advanced Technology; Guangzhou University; Wangxuan Institute of Computer Technology; Peking University

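AI summary This paper proposes a hard-label black-box attack on 3D point clouds that introduces a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples, improving attack performance and adversarial quality.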

详情
AI中文摘要

随着深度传感器在各种3D安全关键应用中的成熟,3D点云模型已被证明对对抗攻击脆弱。几乎所有的现有3D攻击者只是遵循白盒或黑盒设置,通过反向传播或估计的梯度迭代更新坐标扰动。然而,这些方法很难在现实世界中部署(没有提供模型细节),因为它们严重依赖于受害者模型的参数或输出logits。为此,我们提出了一种更具实际应用的攻击方法,即硬标签黑盒攻击,其中攻击者只能访问3D输入的预测标签。我们引入了一种基于新频谱感知决策边界算法的新型3D攻击方法,以生成高质量的对抗样本。具体而言,我们首先构建了一个类感知的模型决策边界,通过开发一种可学习的频谱融合策略,适应性地在频谱域中融合不同类别的点云,旨在在不扭曲原始几何的情况下制造其中间样本。然后,我们设计了一种迭代坐标-频谱优化方法,带有曲率感知的边界搜索,以沿决策边界移动中间样本,生成具有微小扰动的对抗点云。实验表明,我们的攻击在攻击性能和对抗质量方面优于现有的白盒/黑盒攻击者。

英文摘要

With the maturity of depth sensors in various 3D safety-critical applications, 3D point cloud models have been shown to be vulnerable to adversarial attacks. Almost all existing 3D attackers simply follow the white-box or black-box setting to iteratively update coordinate perturbations based on back-propagated or estimated gradients. However, these methods are hard to deploy in real-world scenarios, where no model details are provided, because they rely heavily on the parameters or output logits of victim models. To this end, we propose point cloud attacks from a more practical setting, i.e., the hard-label black-box attack, in which attackers can only access the predicted label of a 3D input. We introduce a novel 3D attack method based on a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples. In particular, we first construct a class-aware model decision boundary by developing a learnable spectrum-fusion strategy that adaptively fuses point clouds of different classes in the spectral domain, aiming to craft their intermediate samples without distorting the original geometry. Then, we devise an iterative coordinate-spectrum optimization method with curvature-aware boundary search to move the intermediate sample along the decision boundary, generating adversarial point clouds with minimal perturbations. Experiments demonstrate that our attack outperforms existing white/black-box attackers in terms of attack performance and adversary quality.

2409.08248 2026-05-20 cs.CV 版本更新

TextBoost: Boosting Text Encoder for Personalized Text-to-Image Generation

TextBoost: 通过文本编码器提升文本到图像生成的个性化

NaHyeon Park, Kunhee Kim, Hyunjung Shim

Affiliations * KAIST

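AI summary This paper proposes TextBoost, an efficient one-shot personalization method for text-to-image diffusion models that fine-tunes only the text encoder, improving computational and storage efficiency while preserving semantic integrity, and thereby achieving faster convergence and lower storage requirements with high-quality generation.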

Comments Project page: https://textboost.github.io. Accepted to TMLR

详情
AI中文摘要

在本文中,我们介绍了TextBoost,一种高效的文本到图像扩散模型单次个性化方法。传统个性化方法通常涉及微调模型的大量部分,导致存储需求大且收敛慢。相反,我们提出仅选择性地微调文本编码器,显著提高了计算和存储效率。为了保持原始语义完整性,我们开发了一种新颖的因果保持适应机制。此外,轻量级适配器被用于在文本嵌入与交叉注意层交互之前局部细化文本嵌入,从而在极小的计算开销下显著增强文本嵌入的表达能力。在多样化的概念上进行的实证评估表明,TextBoost通过减少可训练参数的数量实现了更快的收敛速度和显著的存储需求降低。此外,TextBoost在主体保真度、文本保真度和生成多样性方面与现有方法相比具有可比性。我们展示所提出的方法为高质量文本到图像个性化提供了一种高效、可扩展且实用的解决方案,尤其在资源受限的环境中具有优势。

英文摘要

In this paper, we introduce TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models. Traditional personalization methods typically involve fine-tuning extensive portions of the model, leading to substantial storage requirements and slow convergence. In contrast, we propose selectively fine-tuning only the text encoder, significantly improving computational and storage efficiency. To preserve the original semantic integrity, we develop a novel causality-preserving adaptation mechanism. Additionally, lightweight adapters are employed to locally refine text embeddings immediately before their interaction with cross-attention layers, greatly enhancing the expressiveness of text embeddings with minimal computational overhead. Empirical evaluations across diverse concepts demonstrate that TextBoost achieves faster convergence and substantially reduces storage demands by minimizing the number of trainable parameters. Furthermore, TextBoost maintains comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods. We show that our proposed method offers an efficient, scalable, and practically applicable solution for high-quality text-to-image personalization, particularly beneficial in resource-constrained environments.

2409.03192 2026-05-20 cs.CV 版本更新

PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning

PEPL: 精度增强的伪标签法用于半监督学习中的细粒度图像分类

Bowen Tian, Songning Lai, Lujundong Li, Zhihao Shuai, Runwei Guan, Tian Wu, Yutao Yue

Affiliations * HKUST(GZ); Institute of Deep Perception Technology, JITRI; University of Liverpool; Nanchang University; DI^2 Lab

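AI summary This paper proposes PEPL, which addresses the scarcity of annotated data in fine-grained image classification by generating high-quality pseudo-labels, using class activation maps (CAMs) for semantic-mixed pseudo-label generation to improve classification accuracy and robustness.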

Comments Accepted by ICASSP 2025

详情
AI中文摘要

细粒度图像分类随着深度学习和计算机视觉技术的发展取得了显著进步。然而,详细的标注数据稀缺仍然是一个主要挑战,尤其是在获取高质量标注数据成本高或耗时的情况下。为了解决这一限制,我们引入了Precision-Enhanced Pseudo-Labeling(PEPL)方法,专门设计用于半监督学习框架下的细粒度图像分类。我们的方法通过生成高质量的伪标签,利用大量未标注数据,通过两个关键阶段:初始伪标签生成和语义混合伪标签生成,逐步细化伪标签。这些阶段利用类激活图(CAMs)准确估计语义内容,并生成捕获细粒度分类所需关键细节的精炼标签。通过聚焦语义层面的信息,我们的方法有效克服了标准数据增强和图像混合技术在保留关键细粒度特征方面的局限性。我们在基准数据集上实现了最先进的性能,证明了与现有半监督策略相比,在准确性和鲁棒性上有了显著提升。

英文摘要

Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce the Precision-Enhanced Pseudo-Labeling (PEPL) approach, specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness.
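
Illustrative sketch (not from the paper): for readers unfamiliar with Class Activation Maps, the Python snippet below computes the standard CAM for a network that ends in global average pooling plus a linear classifier: the map for class c is the sum over channels k of w_k^c times the k-th feature map. How PEPL uses CAMs to mix images and labels is not shown here. The function name and the 2x2 toy feature maps are assumptions.

```python
from typing import List

FeatureMap = List[List[float]]  # one H x W channel


def class_activation_map(features: List[FeatureMap],
                         class_weights: List[float]) -> FeatureMap:
    """Weighted sum of the K feature maps using one class's classifier weights."""
    height = len(features[0])
    width = len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w_k in zip(features, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w_k * fmap[i][j]
    return cam


# Two toy 2x2 channels from a hypothetical last convolutional layer.
features = [
    [[1.0, 0.0], [0.0, 0.0]],   # channel 0 fires at the top-left
    [[0.0, 0.0], [0.0, 2.0]],   # channel 1 fires at the bottom-right
]
weights_for_class = [0.5, 1.0]  # hypothetical classifier weights for one class

cam = class_activation_map(features, weights_for_class)
print(cam)
```

High values in the resulting map mark the spatial locations that contribute most to the class score, which is why CAMs can serve as an estimate of where the semantic content of an image lies.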

2002.09053 2026-05-20 cs.CV 版本更新

Adapted Center and Scale Prediction: More Stable and More Accurate

适应中心和尺度预测:更加稳定和准确

Wenhao Wang, Jusheng Zhang

Affiliations * University of Technology Sydney; Sun Yat-sen University

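AI summary This paper proposes improvements to the Center and Scale Prediction (CSP) detector, aiming to combine the simplicity of anchor-free detectors with the accuracy of two-stage detectors: it makes CSP more robust, introduces a new width-prediction method called compressing width, achieves the second-best performance on the CityPersons benchmark, and explores capabilities of Switchable Normalization.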

Comments 14 pages, 7 figures

详情
AI中文摘要

行人检测受益于深度学习技术,在近年来迅速发展。大多数检测器遵循通用目标检测框架,即默认框和两阶段过程。最近,无锚点和单阶段检测器被引入到这一领域。然而,它们的准确性并不令人满意。因此,为了同时享受无锚点检测器的简洁性和两阶段检测器的准确性,我们基于检测器提出了一些改进,即中心和尺度预测(CSP)。本文的主要贡献包括:(1)我们改进了CSP的鲁棒性,使其更容易训练。(2)我们提出了一种新的方法来预测宽度,即压缩宽度。(3)我们在CityPersons基准上取得了第二好的性能,即在合理集上9.3%的log-average miss rate(MR),在部分集上8.7%的MR,在裸集上5.6%的MR,这表明无锚点和单阶段检测器仍能保持高精度。(4)我们探索了可切换归一化的一些能力,这些能力在原始论文中未被提及。代码可在https://github.com/WangWenhao0716/Adapted-Center-and-Scale-Prediction上公开获取。

英文摘要

Pedestrian detection has benefited from deep learning and has developed rapidly in recent years. Most detectors follow the general object detection framework, i.e., default boxes and a two-stage process. Recently, anchor-free and one-stage detectors have been introduced into this area. However, their accuracy is unsatisfactory. Therefore, to enjoy the simplicity of anchor-free detectors and the accuracy of two-stage ones simultaneously, we propose some adaptations based on a detector, Center and Scale Prediction (CSP). The main contributions of our paper are: (1) We improve the robustness of CSP and make it easier to train. (2) We propose a novel method to predict width, namely compressing width. (3) We achieve the second-best performance on the CityPersons benchmark, i.e., 9.3% log-average miss rate (MR) on the reasonable set, 8.7% MR on the partial set, and 5.6% MR on the bare set, which shows that an anchor-free and one-stage detector can still have high accuracy. (4) We explore some capabilities of Switchable Normalization that are not mentioned in its original paper. The code is publicly available at https://github.com/WangWenhao0716/Adapted-Center-and-Scale-Prediction.

2605.19020 2026-05-20 cs.CV 版本更新

A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection

对用于开放集虹膜呈现攻击检测的视觉基础模型系统性失败分析

Rahul Anand, Siddharth Singh, Dileep A D, Mahadeva Prasanna, Raghavendra Ramachandra

Affiliations * Indian Institute of Technology, Dharwad, India; Indian Institute of Information Technology Dharwad, India; SAFE Center, Norwegian University of Science and Technology (NTNU)

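AI summary This paper systematically analyzes vision foundation models for open-set iris presentation attack detection, finds that they perform poorly on unseen presentation attack instruments and under cross-spectral transfer, and stresses the need for more robust PAD representations.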

详情
AI中文摘要

视觉基础模型在多种视觉识别任务中表现出强大的迁移能力,并日益被应用于生物识别领域。然而,其在开放集条件下用于虹膜呈现攻击检测(PAD)的适用性仍不够充分。本文系统分析了通用视觉基础模型在开放集虹膜PAD中的表现,使用周缘视觉图像进行评估。在三个明确分离不同分布偏移的开放集协议下,评估了五个代表性基础模型:未见过的呈现攻击设备(PAIs)、使用不同传感器捕获的未见数据集以及近红外(NIR)到可见光(VIS)光谱的跨光谱转移。在统一的实验框架内,评估了冻结的特征表示和参数高效的LoRA任务适应方法。结果表明,基础模型能够在具有相似传感特征的数据集之间迁移,但无法可靠地推广到未见过的攻击设备,并在跨光谱评估中急剧退化。尽管LoRA在某些跨数据集设置中提高了性能,但在攻击级别和光谱偏移下经常放大失败。额外的验证实验使用分段虹膜输入、完整主干微调、联合跨数据集和跨PAI偏移以及反向VIS到NIR转移进一步证实,这些失败并非仅仅是周缘视觉输入、弱适应或单向光谱评估的产物。这些发现表明,强闭合集或跨数据集性能不应被视为开放集安全性的证据,并突显了需要虹膜检测表示方法在保持对呈现伪影的敏感性的同时,在现实部署变化下保持稳定性的需求。

英文摘要

Vision foundation models have demonstrated strong transferability across diverse visual recognition tasks and are increasingly considered for biometric applications. Their suitability for iris Presentation Attack Detection (PAD), particularly under realistic open-set operating conditions, remains insufficiently examined. This work presents a systematic failure analysis of general-purpose vision foundation models for open-set iris PAD using periocular imagery. Five representative foundation models are evaluated under three open-set protocols that explicitly separate different sources of distribution shift: unseen Presentation Attack Instruments (PAIs), unseen datasets captured with different sensors and cross-spectral transfer from near-infrared (NIR) to visible spectrum (VIS) imagery. Both frozen feature representations and parameter-efficient task adaptation using Low-Rank Adaptation (LoRA) are assessed within a unified experimental framework. The results indicate that foundation models can transfer across datasets with similar sensing characteristics, but fail to generalise reliably to unseen attack instruments and degrade sharply under cross-spectral evaluation. While LoRA improves performance in certain cross-dataset settings, it frequently amplifies failure under attack-level and spectral shifts. Additional validation experiments using segmented iris inputs, full backbone fine-tuning, joint cross-dataset and cross-PAI shifts, and reverse VIS to NIR transfer further confirm that these failures are not simply artefacts of periocular input, weak adaptation, or one-directional spectral evaluation. These findings show that strong closed-set or cross-dataset performance should not be treated as evidence of robust open-set security, and highlight the need for PAD representations that maintain sensitivity to presentation artefacts while remaining stable under realistic deployment variation.
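
Illustrative sketch (not from the paper): the abstract evaluates parameter-efficient adaptation with Low-Rank Adaptation (LoRA). The Python snippet below shows the standard LoRA forward pass for one linear layer, y = W x + (alpha / r) B A x, where W stays frozen and only the low-rank factors A and B are trained. The toy matrices, the `alpha` value, and the function names are assumptions. Which layers the paper adapts and with what rank is not stated in the abstract.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix,
                 x: List[float], alpha: float) -> List[float]:
    """Frozen weight W (d_out x d_in) plus a scaled low-rank update B (d_out x r) A (r x d_in)."""
    rank = len(a)
    base = matvec(w, x)               # frozen backbone path
    update = matvec(b, matvec(a, x))  # trainable low-rank path
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (identity, for readability)
a = [[1.0, 1.0]]               # rank-1 down-projection
b_zero = [[0.0], [0.0]]        # B starts at zero, so the adapted layer equals the frozen one
b_trained = [[1.0], [2.0]]     # hypothetical values after some training
x = [1.0, 2.0]

print(lora_forward(w, a, b_zero, x, alpha=2.0))
print(lora_forward(w, a, b_trained, x, alpha=2.0))
```

With rank r, the adapter trains r * (d_in + d_out) parameters per layer instead of d_in * d_out, which is what makes the adaptation parameter-efficient.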

2605.19004 2026-05-20 cs.CV cs.LG cs.RO 版本更新

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

EgoTraj: 用于多模态预测的现实世界人轨迹数据集

Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian, Junfeng Jiao, Christian Claudel

Affiliations * Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin; Meta Reality Labs; School of Architecture, The University of Texas at Austin

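AI summary This paper presents EgoTraj, a dataset for multimodal prediction containing 75 human navigation sequences recorded in real-world urban environments, with synchronized RGB video and ground-truth data including 6-degree-of-freedom head poses, 3D eye gaze vectors, and scene annotations, and demonstrates its value for AR-based perception, navigation, and assistive systems.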

Comments 21 pages, 14 figures. Project page: https://github.com/yehiahmad/EgoTraj

详情
AI中文摘要

准确地从第一人称视角预测人类轨迹在人形机器人、可穿戴传感系统和辅助导航等应用中起着核心作用。然而,由于现实世界环境中缺乏第一人称轨迹数据集,这一方向的进展受到限制。为了解决这一需求,我们介绍了EgoTraj,一个使用Meta Quest Pro (MQPro)录制的egocentric多模态开放数据集。EgoTraj包含75个由多个MQPro穿戴设备在真实城市环境中收集的人导航轨迹。每个记录都提供了同步的RGB视频以及地面真实数据,包括连续时间同步的6自由度头部姿态、每帧3D眼 gaze向量和场景注释。据我们所知,EgoTraj不同于典型的egocentric轨迹数据集,因为它捕捉了在多样化的城市路线中进行的长视距、自主导航,具有广泛的参与者多样性。为了展示该数据集的潜力,我们对几种最先进的egocentric轨迹预测方法进行了基准测试,并进行了消融研究以分析注视、场景和运动提示的贡献。结果突显了EgoTraj在AR感知、导航和辅助系统中的实用性。EgoTraj数据集、代码和EgoViz仪表板已公开在https://github.com/yehiahmad/EgoTraj。

英文摘要

Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

2605.18984 2026-05-20 cs.CV 版本更新

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Artifact-Bench: 评估MLLMs在检测和评估AI生成视频中的伪影

Yuqi Tang, Yang Shi, Zhuoran Zhang, Qixun Wang, Xuehai Bai, Yue Ding, Ruizhe Chen, Bohan Zeng, Xinlong Chen, Xuanyu Zhu, Bozhou Li, Yuran Wang, Yifan Dai, Chengzhuo Tong, Xinyu Liu, Yiyan Ji, Yujie Wei, Yuhao Dong, Shilin Yan, Fengxiang Wang, Yi-Fan Zhang, Haotian Wang, Yuanxing Zhang, Pengfei Wan

Affiliations * Kling Team; Shanghai AI Lab

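AI summary This paper presents Artifact-Bench, a benchmark for evaluating the ability of multimodal large language models to detect and analyze artifacts in AI-generated videos, and reveals significant limitations of existing models in artifact perception and reasoning.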

详情
AI中文摘要

近年来,视频生成模型在提高AI生成视频的真实感方面取得了显著进步,但其输出仍存在时间不一致、结构失真和语义不连贯等伪影。尽管多模态大语言模型(MLLMs)在视觉理解方面表现出色,但其感知和推理这些伪影的能力仍不明确。现有基准缺乏对伪影感知和细粒度诊断推理的系统评估,尤其是在超越逼真内容的多样化AI生成视频领域。为解决这一差距,我们引入Artifact-Bench,一个全面的基准,用于评估MLLMs在AI生成视频伪影检测和分析上的能力。我们首先建立了涵盖逼真、动画和CG风格视频的三级层次化伪影分类法。基于此分类法,Artifact-Bench定义了三个互补任务:真实与AI生成视频分类、成对真实感比较和细粒度伪影识别。在19种领先MLLMs上的实验揭示了伪影感知和推理的显著局限性,许多模型在挑战性设置中接近随机甚至低于随机表现。我们进一步观察到MLLM判断与人类感知偏好之间存在显著不一致,突显了其作为AI生成视频真实感一般评估者的有限可靠性。

英文摘要

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.

2605.18974 2026-05-20 cs.CV cs.AI cs.MM 版本更新

Harnessing Self-Supervised Features for Art Classification

利用自监督特征进行艺术分类

Federico Melis, Davide Bilardello, Emanuele Prato, Evelyn Turri, Lorenzo Baraldi

Affiliations * University of Modena and Reggio Emilia

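AI summary This paper studies the effectiveness of supervised and self-supervised backbones as feature extractors for art classification and retrieval, particularly for paintings; experiments with the DINO family and CLIP models show that self-supervised backbones yield consistent gains in art classification, and the work offers insights for real-world applications such as virtual reality museum navigation.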

Comments IRCDL 2026

详情
AI中文摘要

对艺术品进行分类是一项具有挑战性的任务,因为精细细节和抽象特征的复杂相互作用决定了艺术作品的风格或流派。本文系统地研究了监督和自监督主干作为特征提取器在艺术品分类和检索中的有效性,特别是绘画。我们通过DINO家族和CLIP模型进行了广泛的实验评估,评估了多种分类策略和特征表示。我们的结果表明,使用自监督主干在艺术品分类性能上产生了持续的改进。此外,我们的工作为现实应用中的分类和检索模块提供了见解,例如支持博物馆导航的虚拟现实(VR)应用。

英文摘要

Classifying artworks presents a significant challenge due to the complex interplay of fine-grained details and abstract features that condition the style or genre of an artwork. This paper presents a systematic investigation of the effectiveness of supervised and self-supervised backbones as feature extractors for both artwork classification and retrieval, with a particular focus on paintings. We conduct an extensive experimental evaluation using the DINO family and CLIP models, assessing multiple classification strategies and feature representations. Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance. Moreover, our work provides insights into the applicability of classification and retrieval modules in real-world applications, such as virtual reality (VR) applications that support museum navigation.

2605.18956 2026-05-20 cs.CV 版本更新

MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

MotionMERGE: 一种用于人体动作编辑、推理、生成和解释的多粒度框架

Bizhu Wu, Jinheng Xie, Wenting Chen, Zhe Kong, Jianfeng Ren, Linlin Shen, Ruibin Bai, Rong Qu

Affiliations * Computer Vision Institute, School of Computer Science and Software Engineering, Shenzhen University; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University; School of Computer Science, University of Nottingham Ningbo China; Department of Electrical and Computer Engineering, National University of Singapore; Department of Radiation Oncology, Stanford University; Sun Yat-sen University; School of Computer Science, University of Nottingham

AI summary This paper proposes the MotionMERGE framework, which achieves more precise motion generation, understanding, and editing through fine-grained language-guided motion control, cross-granularity synergy pre-training, and fine-grained motion-language alignment, and establishes a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning.

详情

英文摘要

Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can't focus on motion's localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.

2605.18923 2026-05-20 eess.IV cs.CV cs.LG q-bio.QM 版本更新

From Division to Decision: Leveraging Temporal Cell-Stage Segmentation for Embryo Transferability Prediction

从分裂到决策:利用时间细胞阶段分割预测胚胎可转移性

Yasmine Hachani, Patrick Bouthemy, Elisa Fromont, Véronique Duranthon, Ludivine Laffont, Alline de Paula Reis

Affiliations * Inria center at Rennes University, Paris-Saclay University, UVSQ, INRAE, BREED; University of Rennes, IRISA; The National Veterinary School of Alfort

AI summary This study proposes the TransFACT framework, which uses early developmental-stage information from time-lapse videos and combines frame-level temporal features with stage-level representations to predict embryo transferability, outperforming existing methods.

Journal ref ICIP 2026 - IEEE International Conference on Image Processing, Sep 2026, Tampere, Finland

详情
AI中文摘要

准确选择牛胚胎是一项具有挑战性的任务,因为当前实践依赖于受精后第七天单一专家评估,导致高妊娠丢失率。时间延展显微镜提供了早期发育的详细信息,但由于复杂的运动模式和耗时的分析而难以利用。我们提出TransFACT,一种基于变压器的框架,用于使用发育前四天的2D时间延展视频建模早期发育阶段和胚胎可转移性。TransFACT结合帧级时间特征和阶段级表示,利用发育阶段作为辅助监督,在第四天预测可转移性。我们的实验表明,TransFACT通过利用现有用于动作识别的方法,在预测胚胎可转移性方面优于其竞争对手。

英文摘要

Accurate selection of bovine embryos is a challenging task, as current practice relies on a single expert assessment on the seventh day after insemination, resulting in high rates of pregnancy loss. Time-lapse videomicroscopy provides detailed information on early development, but is difficult to exploit because of complex motion patterns and time-consuming analysis. We propose TransFACT, a transformer-based framework for modeling early developmental stages and embryo transferability using 2D time-lapse videos from the first four days of development. TransFACT combines frame-level temporal features with stage-level representations, using developmental stages as auxiliary supervision to predict transferability on day four. Our experiments demonstrate that TransFACT, by leveraging an existing method designed for action recognition, outperforms its competitor in predicting embryo transferability.

2605.18903 2026-05-20 cs.LG cs.CV 版本更新

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

推理可移植性:引导MLLMs在RLVR时代的持续学习

Qiuhe Hong, Yuyang Liu, Shuo Yang, Tiantian Peng, Fei Zhu, Yonghong Tian

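Affiliations * Shenzhen Graduate School of Peking University; Centre for Artificial Intelligence and Robotics, HKISI, CAS; Peng Cheng Laboratory

AI summary This paper proposes Reasoning Portability (RP), which introduces reasoning-level constraints into continual learning to improve the adaptation of multimodal large language models under RLVR; experiments show that RDB-CL outperforms baseline methods in Last accuracy.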

详情
AI中文摘要

在持续学习中,视觉-语言模型(VLM-CL)旨在不断适应新多模态任务的同时保留先前知识。新兴的将多模态大语言模型(MLLMs)与具有可验证奖励的强化学习(RLVR)相结合的范式,要求一种新的模式来引导持续适应。随着推理能力的进步,现在可以在推理层面施加约束。我们正式化了可移植性,即一个样本级别的度量,用于衡量先前策略行为在新任务中的可重用性,并实证表明推理层面的信号在分布外样本上仍可靠,而答案层面的信号则不然。我们将此形式化为推理可移植性(RP),并提出基于推理的动态平衡持续学习(RDB-CL),该方法根据RP调节RLVR中的每样本Kullback-Leibler正则化:一个紧密的锚点在高RP样本上保留可重用的推理,而低RP样本上的放松锚点则允许探索新的推理路径。实验表明,RDB-CL在提升最后准确率方面优于基线方法,相比 vanilla RLVR 基线提升了+12.0%。

英文摘要

Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
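
The per-sample modulation described above can be illustrated with a minimal sketch. It assumes each sample already has an RP score in [0, 1]; the linear schedule, the `beta_min`/`beta_max` range, and both function names are hypothetical choices made for illustration, not the formulation used in the paper.

```python
def kl_coefficients(rp_scores, beta_min=0.01, beta_max=0.1):
    """Map each sample's RP score in [0, 1] to a KL weight.

    High RP -> weight near beta_max (tight anchor to the previous policy).
    Low RP  -> weight near beta_min (relaxed anchor, more exploration).
    The linear schedule and the beta range are illustrative assumptions.
    """
    coeffs = []
    for rp in rp_scores:
        rp = min(1.0, max(0.0, rp))  # clamp to [0, 1]
        coeffs.append(beta_min + (beta_max - beta_min) * rp)
    return coeffs


def regularized_objective(rewards, kl_values, rp_scores):
    """Batch mean of (reward - beta_i * KL_i), with beta_i set per sample."""
    betas = kl_coefficients(rp_scores)
    terms = [r - b * kl for r, b, kl in zip(rewards, betas, kl_values)]
    return sum(terms) / len(terms)
```

With this schedule, a sample whose earlier reasoning transfers well is penalized more for drifting from the previous policy than a sample whose earlier reasoning does not transfer.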

2605.18884 2026-05-20 cs.LG cs.CV 版本更新

Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition

在情绪树中导航:用于多模态情绪识别的分层双曲RAG

Zeheng Wang, Bo Zhao, Yijie Zhu, Zhishu Liu, Hui Ma, Ruixin Zhang, Shouhong Ding, Qianyu Xie, Zitong Yu

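Affiliations * Great Bay University; Tencent Youtu Lab

AI summary This paper proposes HyperEmo-RAG, a retrieval-augmented generation framework that leverages a structured emotional knowledge base and improves multimodal emotion recognition through hyperbolic-space embedding and evidence-graph construction.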

详情
AI中文摘要

多模态情绪识别旨在整合文本、音频和视频源以理解人类情感状态。尽管多模态大语言模型在多模态推理方面表现优异,但通常将情绪类别视为独立标签,忽略了人类心理的丰富层次分类。此外,缺乏外部上下文知识使它们容易过度解释噪声线索,进一步复杂化细粒度情绪分类。为了解决这些问题,我们提出了HyperEmo-RAG,一种检索增强生成框架,利用结构化情绪知识库。我们的框架引入了两个关键创新。1)层次双曲 grounding。认识到情绪分类的内在层次树结构,我们将层次情绪标签和多模态样本嵌入到连续双曲空间(Poincaré球)中,并设计了层次束搜索 deliberation 过程,逐步从粗粒度到细粒度级别检索样本。2)结构化证据注入。基于检索到的证据,我们构建证据图,并通过Tree-Aware Attention机制和EmotionGraphFormer将结构化知识作为显式认知上下文注入LLM中,保持图结构信息的完整性。在多个数据集上的实验表明,HyperEmo-RAG显著优于现有方法。

英文摘要

Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states. Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories as independent labels, ignoring the rich hierarchical taxonomy of human psychology. Moreover, lacking external contextual knowledge makes them highly susceptible to over-interpreting noisy cues, further complicating fine-grained emotion classification. To address these issues, we propose HyperEmo-RAG, a retrieval-augmented generation framework that leverages a structured emotional knowledge base. Our framework introduces two key innovations. 1) Hierarchical hyperbolic grounding. Recognizing the inherent hierarchical tree structure of emotion taxonomies, we jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels. 2) Structured evidence injection. Based on the retrieved evidence, we construct an evidence graph and inject the structured knowledge as explicit cognitive context into the LLM through a Tree-Aware Attention mechanism and an EmotionGraphFormer, preserving the integrity of graph-structured information. Experiments on multiple datasets demonstrate that HyperEmo-RAG significantly outperforms existing methods.
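
For readers unfamiliar with the Poincaré ball, the sketch below shows the standard geodesic distance on it and a nearest-label lookup. It is only an illustration of the geometry: the label names and coordinates are made up, and the paper's beam-search retrieval and embedding training are not reproduced here.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    Both points must have Euclidean norm strictly below 1.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


def nearest_label(sample, label_embeddings):
    """Return the label whose embedding is closest to the sample."""
    return min(
        label_embeddings,
        key=lambda name: poincare_distance(sample, label_embeddings[name]),
    )


# Hypothetical 2-D embeddings, for illustration only.
labels = {"joy": [0.3, 0.0], "anger": [-0.3, 0.0]}
closest = nearest_label([0.6, 0.1], labels)
```

Distances grow rapidly near the boundary of the ball, which is why tree-like hierarchies embed well there: coarse categories can sit near the origin and fine-grained ones near the edge.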

2605.18880 2026-05-20 cs.LG cs.CV q-bio.QM 版本更新

A Multi-Dimensional Clustering Approach for Identifying Inborn Errors of Immunity

一种多维聚类方法用于识别先天性免疫缺陷

Nishad Kulkarni, Alexandra K. Martinson, Nicholas L. Rider, Michael Keller, Syed Muhammad Anwar

Affiliations * Sheikh Zayed Institute for Pediatric Surgical Innovation, Children's National Hospital, Washington, DC; Children's National Hospital, Washington, DC; Department of Health Systems & Implementation Science, Division of Allergy & Immunology, Virginia Tech Carilion School of Medicine, Roanoke, VA; Division of Allergy & Immunology, Children's National Hospital, Washington, DC; School of Medicine and Health Sciences, George Washington University, Washington, DC

AI summary This paper proposes a multi-dimensional clustering approach for identifying novel rare disease patterns and extracting features associated with inborn errors of immunity (IEI) from a national data registry. It refines IEI feature awareness, develops data toolkits for analyzing rare disease populations, and extends the transformation of complex medical records into data structures interpretable by unsupervised ML.

Comments Accepted at EMBC 2026

详情
AI中文摘要

先天性免疫缺陷(IEI)等罕见疾病需要早期诊断以防止终器官损伤并提高生活质量。获取和整理大规模电子健康记录(EHR)数据的障碍限制了常规数据驱动分析保持在IEI和其他罕见疾病趋势的前沿。在IEI中开发机器学习(ML)算法进行模式识别以及已发表的方法研究如何系统地处理和整合复杂医疗数据有限。我们提出的流程,包括数据整理和ML聚类算法,旨在识别新的罕见疾病模式并从全国数据注册中提取IEI相关的特征。我们的EHR数据格式化和处理方法提出了一个流程,将原始免疫学实验室数据转换为向量。这进一步结合了通过聚类进行疾病模式识别的超参数调优。本研究改进了IEI特征意识,开发了罕见疾病人群分析的数据工具包,并扩展了将复杂医疗记录转换为可被无监督ML解释的数据结构。

英文摘要

Rare diseases such as inborn errors of immunity (IEI) require early diagnosis to prevent end-organ damage and improve quality of life. Hurdles in accessing and curating large-scale electronic health record (EHR) data limit the routine data-driven analyses needed to stay at the forefront of IEI and other rare disease trends. Both the development of machine learning (ML) algorithms for pattern recognition in IEI and published methodology on how to systematically process and integrate complex medical data are limited. Our proposed pipeline, including data curation and ML clustering algorithms, is designed to recognize novel rare disease patterns and extract IEI-associated features from a national data registry. Our methodology for EHR data formatting and processing presents a pipeline that transforms raw immunologic lab data into vectors. This is further combined with hyperparameter tuning for disease pattern recognition via clustering. This study refines IEI feature awareness, develops data toolkits for rare disease population analysis, and expands on transforming complex medical records into data structures interpretable by unsupervised ML.

2605.18878 2026-05-20 eess.SP cs.CV cs.LG eess.IV 版本更新

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

心力衰竭再入院风险的肺部超声生物标志物预后价值:一项试点数据驱动分析

Jana Armouti, Laura Hutchins, Jacob Duplantis, Thomas Deiss, Thales Nogueira Gomes, Keyur H. Patel, Seema Walvekar, Shane Guillory, Thomas H. Fox, Amita Krishnan, Ricardo Rodriguez, Bennett DeBoisblanc, Deva Ramanan, John Galeotti, Gautam Gare

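Affiliations * Carnegie Mellon University; LSUHSC Internal Medicine; Cosmetic Surgery Facility LLC

AI summary This study takes a data-driven approach, using B-mode lung ultrasound (LUS) acquired during hospitalization to predict the risk of heart failure readmission within 30 days. Dependent lower-lung regions, temporal difference features, and multi-view feature concatenation perform best, demonstrating the practicality of ultrasound biomarkers for noninvasive heart failure risk stratification.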

详情
AI中文摘要

住院后30天内再入院是心力衰竭(CHF)导致发病率、死亡率和可避免医疗支出的主要驱动因素。当前的临床风险分层工具主要依赖于非成像数据,且预测性能有限。床旁肺部超声(LUS)提供了一个敏感的、非侵入性的窗口,以观察肺部充血,这特征于CHF失代偿,但其用于再入院预测的预后作用仍待探索。我们提出了一个试点可行性研究,这是首个系统使用住院期间获得的B型LUS进行机器学习预测30天内CHF再入院的系统研究。从预训练的Temporal Shift Module(TSM)ResNet-18编码器中提取定量时空嵌入,并分别评估可解释的生物标志物特征。通过结构化消融研究肺部视图、时间表示、多视图融合和跨肺增强,我们识别出驱动再入院风险的关键成像因素。我们的发现表明(1)依赖性下肺区域(左3、右3)携带最强的预后信号,与它们对静水性充血的更大易感性一致;(2)连续检查之间的时间差特征显著优于单时间点表示,突显了捕捉疾病轨迹的重要性;(3)多视图特征拼接产生了最佳整体性能,我们的最佳MLP模型实现了F1得分为0.80(95% CI: 0.62-0.96)。生物标志物分析进一步表明,胸膜线异常,包括断裂和凹陷,的信息量与传统A线和B线标志物相当。这些结果支持POCUS衍生的生物标志物作为实用、可解释的非侵入性CHF风险分层工具。

英文摘要

Hospital readmission within 30 days of discharge is a leading driver of morbidity, mortality, and avoidable healthcare expenditure in congestive heart failure (CHF). Current clinical risk stratification tools rely primarily on non-imaging data and exhibit limited predictive performance. Point-of-care lung ultrasound (LUS) offers a sensitive, noninvasive window into the pulmonary congestion that characterizes CHF decompensation, yet its prognostic utility for readmission prediction remains largely unexplored. We present a pilot feasibility study, the first systematic machine learning study using B-mode LUS acquired during hospitalization to predict 30-day CHF readmission. Quantitative spatiotemporal embeddings are extracted from a pretrained Temporal Shift Module (TSM) ResNet-18 encoder, and interpretable biomarker features are separately evaluated. Through structured ablations over lung view, temporal representation, multi-view fusion, and cross-lung augmentation, we identify the key imaging factors driving readmission risk. Our findings reveal that (1) dependent lower-lung regions (Left-3, Right-3) carry the strongest prognostic signal, consistent with their greater susceptibility to hydrostatic congestion; (2) temporal difference features between sequential examinations substantially outperform single-timepoint representations, highlighting the importance of capturing disease trajectory; and (3) multi-view feature concatenation yields the best overall performance, with our top MLP model achieving an F1 score of 0.80 (95% CI: 0.62-0.96). Biomarker analysis further reveals that pleural-line abnormalities, including breaks and indentations, are as informative as the canonical A-line and B-line markers. These results support POCUS-derived biomarkers as practical, interpretable tools for noninvasive CHF risk stratification.
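
Findings (2) and (3) above combine naturally: take the per-view difference between two sequential examinations, then concatenate the differences across views. The sketch below assumes each examination is a dictionary mapping a view name to an embedding vector; the view names come from the abstract, while the function names and the tiny 2-D embeddings are hypothetical.

```python
def temporal_difference(earlier, later):
    """Element-wise change in an embedding between two examinations."""
    return [b - a for a, b in zip(earlier, later)]


def fuse_views(exam_1, exam_2, views):
    """Concatenate per-view temporal differences into one feature vector."""
    fused = []
    for view in views:
        fused.extend(temporal_difference(exam_1[view], exam_2[view]))
    return fused


# Toy embeddings for two examinations of the same patient (illustrative).
exam_1 = {"Left-3": [1.0, 2.0], "Right-3": [0.5, 0.5]}
exam_2 = {"Left-3": [1.5, 2.0], "Right-3": [0.5, 1.5]}
features = fuse_views(exam_1, exam_2, ["Left-3", "Right-3"])
```

The fused vector would then be the input to a classifier such as the MLP mentioned in the abstract; that model is not sketched here.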

2605.18868 2026-05-20 cs.CR cs.AI cs.CV cs.LG 版本更新

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

DarkLLM: 利用大语言模型学习语言驱动的对抗攻击

Ye Sun, Xin Wang, Jiaming Zhang, Yifeng Gao, Yixu Wang, Yifan Ding, Qixian Zhang, Henghui Ding, Xingjun Ma, Yu-Gang Jiang

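Affiliations * Fudan University; Nanyang Technological University; Tongji University

AI summary This paper proposes DarkLLM, an LLM-based adversarial attack framework that translates natural-language attack instructions into latent attack vectors and generates effective adversarial perturbations, unifying multiple attack types and enabling flexible, controllable adversarial generation.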

Comments 23 pages, 13 figures

详情
AI中文摘要

尽管视觉和多模态基础模型在感知到复杂推理任务中至关重要,但它们仍然极易受到对抗攻击的影响。然而,传统对抗攻击通常局限于单一、预定义的目标,紧密耦合每个攻击到特定模型或任务,限制了其在现实场景中的可扩展性和灵活性。在本文中,我们提出了DarkLLM,一种新的攻击框架,该框架训练了一个大语言模型(LLM)将自然语言攻击指令转换为潜在攻击向量,然后解码为视觉对抗扰动。通过利用自然语言指令微调,DarkLLM不仅在一个框架内统一了目标攻击、非目标攻击、分割攻击和多模型攻击,还实现了灵活且可控的对抗生成,使每个指令都能生成一种扰动,以在异构模型上诱导期望的行为。通过在4个任务、13个数据集和15个模型上的广泛实验,我们证明DarkLLM仅需1B参数即可遵循攻击者的指令,生成对CLIP、SAM和前沿LLM高度有效的攻击,揭示了现代基础模型系统性的脆弱性。

英文摘要

While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.

2605.18855 2026-05-20 cs.LG cs.CV 版本更新

Delta Attention Residuals

Delta Attention Residuals

Cheng Luo, Zefan Cai, Junjie Hu

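Affiliations * Independent Researcher; University of Wisconsin–Madison

AI summary This paper proposes Delta Attention Residuals, which attend over the change (delta) introduced by each sublayer rather than over cumulative hidden states. This addresses the routing collapse that redundant cumulative states cause in standard Attention Residuals and improves the model's ability to select information across layers.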

详情
AI中文摘要

Attention Residuals将标准加性残差连接替换为在前一层输出上学习的softmax注意力,实现了选择性的跨层路由。然而,标准Attention Residuals仍然在累积的隐藏状态上进行注意力计算,这些状态高度冗余。我们发现这种冗余导致在更深的层中出现路由崩溃:注意力权重变得低对比度且接近均匀(最大权重≈0.2),限制了模型在前一层中选择信息性状态的能力。这提出了一个关键但尚未深入研究的设计问题:在Attention Residuals中应路由何种层间表示?为回答这个问题,我们提出了Delta Attention Residuals,其在delta(每个子层引入的变化(v_i = h_{i+1} - h_i))上进行注意力计算,而非累积状态。Delta表示在结构上具有多样性,产生更高对比度的注意力分布(最大权重≈0.6),从而在层间实现更选择性和有效的路由。这一原则适用于单个子层和块粒度。在所有测试的规模(220M-7.6B)中,Delta Attention Residuals始终优于标准残差和Attention Residuals,验证困惑度提升1.7-8.2%。Delta Attention Residuals还允许通过标准微调将预训练检查点转换为Delta Attention Residuals。代码可在https://github.com/wdlctc/delta-attention-residuals-code获得。

英文摘要

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight $\approx 0.2$), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer ($\mathbf{v}_i = \mathbf{h}_{i+1} - \mathbf{h}_i$), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight $\approx 0.6$), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M-7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with 1.7-8.2% validation perplexity gains. The method also allows pretrained checkpoints to be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
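
A toy sketch of the core idea follows: compute the deltas $\mathbf{v}_i = \mathbf{h}_{i+1} - \mathbf{h}_i$ from a stack of hidden states, then mix them with softmax attention. The dot-product scoring against a single query vector is an assumption made for illustration; the abstract does not specify how the attention scores are computed, and the authors' repository should be consulted for the actual design.

```python
import math


def deltas(hidden_states):
    """v_i = h_{i+1} - h_i for each pair of consecutive hidden states."""
    return [
        [b - a for a, b in zip(h0, h1)]
        for h0, h1 in zip(hidden_states, hidden_states[1:])
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(query, candidates):
    """Attend over candidate vectors with dot-product scores (assumed).

    Returns the attention weights and the weighted sum of candidates.
    """
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    weights = softmax(scores)
    mixed = [
        sum(w * cand[d] for w, cand in zip(weights, candidates))
        for d in range(len(query))
    ]
    return weights, mixed


# Three cumulative hidden states h_0, h_1, h_2 (illustrative values).
states = [[0, 0], [1, 0], [1, 2]]
v = deltas(states)
weights, mixed = route([0.0, 1.0], v)
```

In this toy case the cumulative states `[1, 0]` and `[1, 2]` overlap in their first coordinate, while the deltas `[1, 0]` and `[0, 2]` are orthogonal, which is the kind of reduced redundancy the abstract attributes to delta representations.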

2605.18853 2026-05-20 cs.LG cs.CV cs.DC 版本更新

INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference

INAR-VL:面向边缘-云视觉-语言推断的输入感知路由

Ahmed Šabanović, Paul Joe Maliakel, Ivona Brandić

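Affiliations * TU Wien

AI summary This paper proposes INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. It uses lightweight image and text complexity signals to guide routing and model selection, running simple queries locally and offloading complex ones to the cloud to balance latency, energy, and accuracy.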

Comments 8 pages, 3 figures

详情
AI中文摘要

边缘部署的视觉-语言模型(VLMs)面临延迟与准确性的权衡:云端执行提供高质量预测但会带来通信延迟和能耗,而仅边缘执行则速度更快但准确性较低,因为模型容量有限。这种权衡进一步受到图像质量和推理复杂度异质性的影响,使静态部署效果不佳。我们提出了INAR-VL,一种轻量级的边缘-云路由系统,用于两级部署中的多模态推断。INAR-VL在边缘和云端维护互补的VLMs,并利用轻量级的图像和文本复杂度信号指导路由和模型选择,执行简单查询本地化,当有利时将复杂查询卸载到云端。在视觉问答任务上的评估表明,INAR-VL将36%的请求执行在边缘,延迟降低24%,能耗降低26%,并保持97%的云端准确性。

英文摘要

Edge deployment of Vision-Language Models (VLMs) faces a trade-off between latency and accuracy: cloud execution provides high-quality predictions but incurs communication delay and energy cost, while edge-only execution is faster but less accurate due to limited model capacity. This trade-off is further complicated by heterogeneity in image quality and reasoning complexity, making static placement suboptimal. We present INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial. Evaluation on visual question answering shows that INAR-VL executes 36% of requests on the edge, reduces latency by 24%, lowers energy by 26%, and preserves 97% of cloud-level accuracy.
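
The routing decision can be pictured as a threshold on a combined complexity score. Everything in the sketch below is a hypothetical stand-in: the `Query` fields, the weighted average, and the default threshold are assumptions for illustration, since the abstract does not define the actual signals or decision rule.

```python
from dataclasses import dataclass


@dataclass
class Query:
    image_complexity: float  # hypothetical score in [0, 1]
    text_complexity: float   # hypothetical score in [0, 1]


def route(query, threshold=0.5, image_weight=0.5):
    """Send simple queries to the edge model and complex ones to the cloud.

    The score is a weighted average of the two complexity signals; the
    weighting and the threshold are illustrative, not taken from INAR-VL.
    """
    score = (
        image_weight * query.image_complexity
        + (1.0 - image_weight) * query.text_complexity
    )
    return "cloud" if score > threshold else "edge"


simple = route(Query(image_complexity=0.1, text_complexity=0.2))
hard = route(Query(image_complexity=0.9, text_complexity=0.8))
```

Raising the threshold keeps more traffic on the edge at the cost of accuracy on harder queries, which is the latency-accuracy trade-off the abstract describes.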

2605.18836 2026-05-20 cs.LG cs.CV 版本更新

Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation

谱梯度手术用于领域通用化数据集蒸馏

Minyoung Oh, Najeong Chae, Jae-Young Sim

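Affiliations * Graduate School of Artificial Intelligence; Ulsan National Institute of Science and Technology (UNIST)

AI summary This paper introduces a new problem setting, Domain Generalizable Dataset Distillation (DGDD), and proposes Spectral Gradient Surgery (SGS) to improve the out-of-distribution (OOD) generalization of distilled datasets while remaining compatible with existing dataset distillation methods.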

Comments 17pages

详情
AI中文摘要

数据集蒸馏(DD)合成一个紧凑的合成数据集,以保留完整数据集的训练效用。然而,其标准公式假设测试数据遵循与训练数据相同的分布,这一假设在实践中很少成立。一种直接的扩展——将事后域泛化(DG)技术应用于蒸馏数据——并不合适,因为现有DG方法依赖于真实数据集的自然多样性,而压缩的合成集本质上缺乏这种多样性,同时还会带来显著的增强开销,这与数据集蒸馏的效率目标相冲突。为了解决这一限制,我们引入了领域通用化数据集蒸馏(DGDD),一种新的问题设定,明确针对蒸馏数据集的超出分布泛化。我们通过广泛采用的DD基线分布匹配(DM)来研究这一问题。我们将DM的超出分布脆弱性归因于压缩合成集中类判别信息和领域特定信息的纠缠,并提出谱梯度手术(SGS)来解纠缠。SGS的关键见解是跨域在谱域中的梯度一致性和跨域梯度组件的共享揭示了哪些梯度组件在源域之间共享——因此是类判别性的——以及哪些是领域特定的。基于这一观察,SGS在标准DM更新中添加了两个互补的梯度:一个强化跨域共享组件,另一个促进蒸馏数据集内的多样性。在多样规模基准上的广泛实验表明,SGS在提升超出分布泛化的同时,仍保持与现有DM方法的即插即用兼容性。

英文摘要

Dataset Distillation (DD) synthesizes a compact synthetic dataset that preserves the training utility of a full dataset. However, its standard formulation assumes that test data follow the same distribution as training data, an assumption that rarely holds in practice. A straightforward extension, applying post-hoc Domain Generalization (DG) techniques to distilled data, is ill-suited because existing DG methods rely on the natural diversity of real datasets, which compact synthetic sets inherently lack, while also incurring substantial augmentation overhead that conflicts with the efficiency objective of dataset distillation. To address this limitation, we introduce Domain Generalizable Dataset Distillation (DGDD), a new problem setting that explicitly targets out-of-distribution (OOD) generalization of distilled datasets. We study this problem through a widely adopted DD baseline of Distribution Matching (DM). We attribute the OOD vulnerability of DM to the entanglement of class-discriminative and domain-specific information within the compressed synthetic set, and propose Spectral Gradient Surgery (SGS) to disentangle the two. The key insight of SGS is that cross-domain agreement among domain-wise gradients in the spectral domain reveals which gradient components are shared across source domains (and are therefore class-discriminative) and which are domain-specific. Based on this observation, SGS augments the standard DM update with two complementary gradients: one that reinforces cross-domain shared components and another that explicitly promotes diversity within the distilled dataset. Extensive experiments on diverse-scale benchmarks demonstrate that SGS substantially improves OOD generalization while remaining plug-and-play compatible with existing DM methods.
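
As background for the DM baseline that SGS builds on, the sketch below shows a simplified distribution-matching loss: the squared distance between the mean feature vector of real samples and that of synthetic samples. It assumes features have already been extracted by some embedding network, and it does not implement SGS itself; the function names are hypothetical.

```python
def mean_vector(features):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def distribution_matching_loss(real_features, synthetic_features):
    """Squared Euclidean distance between the two mean feature vectors."""
    mu_real = mean_vector(real_features)
    mu_syn = mean_vector(synthetic_features)
    return sum((r - s) ** 2 for r, s in zip(mu_real, mu_syn))


# Toy features for one class (illustrative values only).
real = [[1.0, 2.0], [3.0, 4.0]]
matched = distribution_matching_loss(real, [[2.0, 3.0]])
mismatched = distribution_matching_loss(real, [[0.0, 3.0]])
```

Minimizing this loss with respect to the synthetic samples pulls their feature statistics toward those of the real data; per the abstract, SGS adds two further gradient terms on top of this update.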

2605.18791 2026-05-20 eess.IV cs.CV cs.LG q-bio.OT 版本更新

SpecX: A Large-Scale Benchmark for Multi-Modal Spectroscopy and Cross-Paradigm Evaluation

SpecX:多模态光谱的大规模基准及跨范式评估

Chengrui Xiang, Tengfei Ma, Yujie Chen, Tong Wang, Haowen Chen, Xiangxiang Zeng

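Affiliations * College of Computer Science and Technology, Hunan University

AI summary This paper presents SpecX, a large-scale benchmark for multi-modal spectroscopy whose tiered datasets support molecular elucidation, spectrum simulation, and spectral understanding tasks, and it reveals the different strengths of specialized spectral models and multimodal language models in spectral intelligence.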

Comments 9 pages, 1 figure

详情
AI中文摘要

现有的光谱基准在规模、模态对齐和评估范围上存在局限,通常专注于专用模型或多模态语言模型(MLLMs)。我们引入SpecX,一个大规模的多模态光谱基准,具有跨范式评估。SpecX包含170万种分子,涵盖NMR(1H,13C,HSQC)、IR、MS、UV、拉曼和FL等多种光谱模态,并分为三个层级:大规模数据集用于预训练,对齐的多光谱子集用于基准测试,以及高质量实验子集用于评估。SpecX支持分子解析、光谱模拟和光谱理解等多种任务,并在专用光谱模型和MLLMs之间实现统一评估。实验表明,专用模型在信号层面建模上表现优异,而MLLMs在高层推理上表现出色,但缺乏精确的光谱定位。SpecX建立了一个统一的光谱智能基准,并强调了需要光谱原生的基础模型。

英文摘要

Existing spectral benchmarks are limited in scale, modality alignment, and evaluation scope, and typically focus on either specialized models or multimodal language models (MLLMs). We introduce SpecX, a large-scale benchmark for multi-modal spectroscopy with cross-paradigm evaluation. SpecX contains 1.7M molecules with diverse spectral modalities, including NMR (1H, 13C, HSQC), IR, MS, UV, Raman, and FL, and is organized into three tiers: a large-scale dataset for pretraining, an aligned multi-spectral subset for benchmarking, and a high-quality experimental subset for evaluation. SpecX supports a range of tasks such as molecular elucidation, spectrum simulation, and spectral understanding, and enables unified evaluation across both specialized spectral models and MLLMs. Experiments show that specialized models excel at signal-level modeling, while MLLMs exhibit strengths in high-level reasoning but lack precise spectral grounding. SpecX establishes a unified benchmark for spectral intelligence and highlights the need for spectrum-native foundation models.

2605.18777 2026-05-20 cs.SI cs.CV 版本更新

XFlowMap: Cross-Scale Generalization and Mapping of Massive Origin-Destination Data

XFlowMap:大规模出行生成数据的跨尺度泛化与制图

Diansheng Guo, Hai Jin

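Affiliations * PolyU

AI summary This paper presents XFlowMap, a framework for the cross-scale generalization and mapping of massive origin-destination data. It integrates cross-scale flow pattern detection, automated flow map generalization, and a new cartographic representation to analyze and visualize complex origin-destination flow structures.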

详情
AI中文摘要

将大规模出行生成(OD)数据集进行制图仍具挑战性,因为流量图变得杂乱,有意义的模式出现在多个空间尺度上,而现有流量制图方法通常依赖于预定义的聚合单元或手动泛化。本文提出了XFlowMap,一种用于大规模OD数据的跨尺度泛化和制图的框架。具体而言,该框架整合了跨尺度流量模式(集群)检测、自动化流量图泛化和新的制图表示法,用于分析和可视化复杂的出行流量结构。该方法在适当的起源和目的地尺度上检测显著的流量模式,提取高层结构,并生成一种新的流量图表示法,以支持对复杂出行流量模式的全面解释。开发了一种基于扫描统计的程序来评估和泛化跨尺度流量集群。检测到的集群随后使用一种新的流量符号进行可视化,该符号将位置、方向、强度和OD尺度整合到单一表示中。该框架支持基于区域和基于点的OD数据,对稀疏和噪声数据具有鲁棒性,并能够对分层流量数据进行比较制图。使用合成数据和美国迁移数据的实验表明,该方法有效地提取了有意义的跨尺度流量模式,并为大规模移动数据集生成清晰且信息丰富的流量图,支持静态展示和交互式探索。

英文摘要

Mapping large origin-destination (OD) datasets remains challenging because flow maps become cluttered, meaningful patterns occur at multiple spatial scales, and existing flow-mapping approaches frequently rely on predefined aggregation units or manual generalization. This paper presents XFlowMap, a framework for the cross-scale generalization and mapping of massive OD data. Specifically, the framework integrates cross-scale flow pattern (cluster) detection, automated flow map generalization, and a new cartographic representation for analyzing and visualizing complex origin-destination flow structures. The approach detects salient flow patterns at their appropriate origin and destination scales, extracts high-level structures, and generates a new flow map representation that supports holistic interpretation of complex origin-destination flow patterns. A scan-statistic-based procedure is developed to evaluate and generalize cross-scale flow clusters. The detected clusters are then visualized using a novel flow symbol that integrates location, direction, strength, and OD scales in a single representation. The framework supports both area-based and point-based OD data, is robust to sparse and noisy datasets, and enables comparative mapping of stratified flow data. Experiments with synthetic data and U.S. migration data demonstrate that the method effectively extracts meaningful cross-scale flow patterns and produces clear, information-rich flow maps for large mobility datasets, supporting both static presentation and interactive exploration.