arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10937 2026-05-12 cs.CV Updated

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, Xueqian Wang, Shijian Lu

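Affiliations * Nanyang Technological University; Baidu Inc.; Zhejiang University; City University of Hong Kong; Tsinghua University; Jimei University

AI summary This paper studies how reinforcement-learning post-training can further improve text-to-image generation models and proposes a remedy for reward hacking in existing methods. The authors point out that normalization can miscalibrate the policy and thereby degrade training, and propose Super-Linear Advantage Shaping (SLAS), an information-geometry-based method that introduces advantage-dependent weighting to reshape the policy space non-linearly, amplifying informative updates and suppressing spurious gradients. Experiments show that SLAS outperforms existing methods across multiple models and benchmarks, improving training efficiency, generalization, and generation quality.

Abstract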

Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
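
Illustrative sketch (not from the paper): the abstract's point about prompt-level normalization can be seen on toy numbers. The code below compares group-relative advantages with and without the per-prompt standard deviation, then applies a super-linear shaping. The function names, the power-law form `sign(A) * |A|**power`, and the batch-level rescaling are assumptions chosen for illustration; the abstract does not give the actual SLAS weighting.

```python
import numpy as np

def group_advantages(rewards, use_std=True, eps=1e-8):
    """Group-relative advantages for the samples of ONE prompt.

    use_std=True  -> (r - mean) / std   (GRPO-style prompt-level normalization)
    use_std=False -> (r - mean)         (std term removed)
    """
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    return centered / (r.std() + eps) if use_std else centered

def superlinear_shape(adv, power=2.0):
    """HYPOTHETICAL shaping, not the paper's formula: sign(A) * |A|**power."""
    a = np.asarray(adv, dtype=float)
    return np.sign(a) * np.abs(a) ** power

def batch_normalize(adv, eps=1e-8):
    """Rescale advantages by a single batch-level std (assumed form)."""
    a = np.asarray(adv, dtype=float)
    return a / (a.std() + eps)

near_tie = [0.50, 0.51]   # tiny reward gap, plausibly noise
clear_gap = [0.00, 1.00]  # large reward gap, plausibly signal

# With prompt-level std, both groups end up with advantages of about -1 and +1.
print(group_advantages(near_tie), group_advantages(clear_gap))

# Without it, the gap sizes survive; super-linear shaping widens them further.
linear = np.concatenate(
    [group_advantages(g, use_std=False) for g in (near_tie, clear_gap)]
)
print(batch_normalize(linear))
print(batch_normalize(superlinear_shape(linear)))
```

In this toy setting the near-tied group and the clearly separated group are indistinguishable after prompt-level normalization, while the shaped variant widens the ratio between them.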

2605.10936 2026-05-12 cs.CV Updated

Personal Visual Context Learning in Large Multimodal Models

Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman

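Affiliations * The University of Texas at Austin

AI summary As wearable devices such as smart glasses bring Large Multimodal Models (LMMs) into users' continuous first-person visual streams, visual personalization becomes key to turning these models into true personal assistants. The paper proposes Personal Visual Context Learning (Personal VCL), which uses user-specific visual information to resolve personalized queries, and builds Personal-VCL-Bench as an evaluation benchmark. It finds a significant gap in how current LMMs use visual context and proposes Agentic Context Bank, an inference-time baseline that improves performance across tasks through a structured memory bank and query-adaptive evidence selection.

Comments Project website: https://vision.cs.utexas.edu/projects/PersonalVCL/

Abstract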

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.

2605.10934 2026-05-12 cs.LG cs.AI cs.CV cs.RO stat.ML Updated

Variational Inference for Lévy Process-Driven SDEs via Neural Tilting

Yaman Kindap, Manfred Opper, Benjamin Dupuis, Umut Simsekli, Tolga Birdal

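Affiliations * Imperial College London, UK; Technical University of Berlin, Germany; INRIA, CNRS, Département d’Informatique de l’Ecole Normale Supérieure / PSL, France

AI summary The paper studies variational inference for stochastic differential equations (SDEs) driven by Lévy processes, with the aim of accurately capturing extreme events and heavy-tailed phenomena in fields such as finance and climate science. Existing methods are either computationally expensive or rely on Gaussian assumptions that cannot handle jumps. The authors propose a neural exponential tilting framework that exponentially reweights the Lévy measure with neural networks, yielding a flexible variational family that preserves the jump structure while remaining computationally tractable. Experiments on synthetic and real data show that the method captures jump dynamics and gives reliable posterior inference where Gaussian variational methods fail.

Comments The associated project page which contains the official implementation can be found in https://circle-group.github.io/research/NeuralTilting/

Abstract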

Modelling extreme events and heavy-tailed phenomena is central to building reliable predictive systems in domains such as finance, climate science, and safety-critical AI. While Lévy processes provide a natural mathematical framework for capturing jumps and heavy tails, Bayesian inference for Lévy-driven stochastic differential equations (SDEs) remains intractable with existing methods: Monte Carlo approaches are rigorous but lack scalability, whereas neural variational inference methods are efficient but rely on Gaussian assumptions that fail to capture discontinuities. We address this tension by introducing a neural exponential tilting framework for variational inference in Lévy-driven SDEs. Our approach constructs a flexible variational family by exponentially reweighting the Lévy measure using neural networks. This parametrization preserves the jump structure of the underlying process while remaining computationally tractable. To enable efficient inference, we develop a quadratic neural parametrization that yields closed-form normalization of the tilted measure, a conditional Gaussian representation for stable processes that facilitates simulation, and symmetry-aware Monte Carlo estimators for scalable optimization. Empirically, we demonstrate that the method accurately captures jump dynamics and yields reliable posterior inference in regimes where Gaussian-based variational approaches fail, on both synthetic and real-world datasets.
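
Illustrative note (not from the paper): "exponentially reweighting the Lévy measure" can be written in generic form as below. The symbols are assumptions introduced here: \nu is the prior Lévy measure, f_\phi a neural network, and \nu_\phi the tilted measure. The abstract does not state the network's inputs or the exact quadratic parametrization, so this is only the general shape of such a tilt.

```latex
% Assumed notation, not taken from the abstract.
\nu_\phi(\mathrm{d}z) = \exp\!\big(f_\phi(z)\big)\,\nu(\mathrm{d}z),
\qquad
\int_{\mathbb{R}^d\setminus\{0\}} \min\!\big(1, \lVert z\rVert^2\big)\,\nu_\phi(\mathrm{d}z) < \infty .
```

The integrability condition on the right is the standard requirement for \nu_\phi to be a valid Lévy measure. A linear choice f_\phi(z) = \theta^\top z recovers the classical Esscher-type tilt.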

2605.10922 2026-05-12 cs.CV Updated

Pixal3D: Pixel-Aligned 3D Generation from Images

Dong-Yang Li, Wang Zhao, Yuxin Chen, Wenbo Hu, Meng-Hao Guo, Fang-Lue Zhang, Ying Shan, Shi-Min Hu

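Affiliations * BNRist, Department of Computer Science and Technology, Tsinghua University; Tencent ARC Lab; Victoria University of Wellington

AI summary Pixal3D is a high-fidelity image-to-3D generation method that addresses the weak pixel-level fidelity of existing 3D generative models. It introduces a pixel back-projection conditioning scheme that generates pixel-aligned 3D geometry directly in the input view, establishing explicit pixel-to-3D feature correspondence and substantially improving fidelity. Pixal3D also supports multi-view generation and scene-level synthesis, offering a new route to high-fidelity 3D objects and scenes from single or multiple images.

Comments SIGGRAPH 2026. Project page: https://ldyang694.github.io/projects/pixal3d/

Abstract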

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: https://ldyang694.github.io/projects/pixal3d/

2605.10903 2026-05-12 cs.CV cs.RO Updated

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

Wenxuan Song, Han Zhao, Fuhao Li, Ziyang Zhou, Xi Wang, Jing Lyu, Pengxiang Ding, Yan Wang, Donglin Wang, Haoang Li

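Affiliations * Zhejiang University; Westlake University; Tsinghua University; Beijing Academy of Artificial Intelligence

AI summary This paper proposes a new method to address the limited performance gains and high adaptation cost of standard supervised finetuning for pretrained vision-language-action (VLA) models. It decouples, in parameter space, the two objectives of auxiliary-objective finetuning (enhancing general capabilities and fitting task-specific action distributions) and trains two finetuned models on a small task set with two different training strategies, from which it extracts the capability vectors contributed by the auxiliary objectives. These vectors are merged with the pretrained parameters to form a capability-enhanced meta model, and a lightweight orthogonal regularization loss lets the model keep high performance at markedly lower computational cost. Experiments show that the method is effective and generalizes across multiple models and new environments.

Abstract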

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver the goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameters' difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, (2) can generalize to novel environments and embodiments out of the box.
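
Illustrative sketch (not from the paper): the parameter arithmetic described in the abstract can be written in a few lines. It assumes the two finetuned models are a plain-SFT model and an auxiliary-objective model, that the capability vector is their per-tensor difference, and that merging is a scaled addition to the pretrained weights. The function names and the `scale` parameter are hypothetical, and the orthogonal regularization loss is not shown because the abstract does not specify it.

```python
import numpy as np

def capability_vector(theta_aux, theta_sft):
    """Per-tensor difference: auxiliary-objective model minus plain-SFT model."""
    return {name: theta_aux[name] - theta_sft[name] for name in theta_aux}

def merge_into_pretrained(theta_pre, cap_vec, scale=1.0):
    """Add the (optionally scaled) capability vector to the pretrained weights."""
    return {name: theta_pre[name] + scale * cap_vec[name] for name in theta_pre}

# Toy "checkpoints" with a single weight matrix each (values are made up).
theta_pre = {"w": np.zeros((2, 2))}
theta_sft = {"w": np.array([[1.0, 0.0], [0.0, 1.0]])}
theta_aux = {"w": np.array([[1.0, 0.5], [0.5, 1.0]])}

cap = capability_vector(theta_aux, theta_sft)
theta_meta = merge_into_pretrained(theta_pre, cap)
print(theta_meta["w"])
```

The shared task-fitting component (the diagonal in this toy example) cancels in the difference, so only the part attributed to the auxiliary objective is carried into the merged model.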

2605.10894 2026-05-12 cs.CV Updated

Counterfactual Stress Testing for Image Classification Models

Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta, Mélanie Roschewitz, Ben Glocker

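Affiliations * Department of Computing, Imperial College London, UK

AI summary This paper studies how medical image classification models fail in new clinical environments because of distribution shift, and proposes a counterfactual stress-testing framework based on causal generative models. By intervening on attributes such as scanner type and patient sex, it generates clinically realistic "what if" images that preserve anatomy, enabling targeted evaluation under distribution shift. Experiments show that the approach reflects real out-of-distribution performance changes more accurately than classical perturbation methods, providing a more reliable basis for assessing the robustness of medical AI systems.

Abstract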

Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.

2605.10887 2026-05-12 cs.CV Updated

Count Anything at Any Granularity

Chang Liu, Haoning Wu, Weidi Xie

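Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University, China; CMIC, Shanghai Jiao Tong University, China

AI summary This paper studies fine-grained counting in open-world object counting and argues that current methods count unreliably because the counting granularity is left unspecified. The authors propose a multi-grained counting framework in which visual exemplars and fine-grained text descriptions explicitly specify the target, and build the first fully automatic data-scaling pipeline to generate KubriCount, described as the largest fine-grained counting dataset to date. Trained on this dataset, the HieraCount model substantially improves fine-grained counting accuracy and generalization to real-world scenes.

Comments Project page: https://verg-avesta.github.io/KubriCount/

Abstract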

Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.

2605.10885 2026-05-12 cs.CV Updated

Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation

Feifan Song, Yuntian Bo, Haofeng Zhang

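Affiliations * School of Computer Science and Engineering, Nanjing University of Science and Technology

AI summary Cross-domain few-shot medical image segmentation (CD-FSMIS) aims to adapt a model to both new anatomical categories and unseen imaging domains from only a few annotated samples. Existing prototype-based methods tend to entangle anatomical structure with domain-specific appearance variation, which makes stable matching under domain shift difficult. The paper proposes GeoProto, a framework whose geometry-aware prototype enrichment uses geometric priors of human anatomy to make prototype matching more robust and generalizable, and reports state-of-the-art performance on multiple cross-modality, cross-sequence, and cross-context datasets.

Abstract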

Cross-domain few-shot medical image segmentation (CD-FSMIS) requires a model to generalise simultaneously to novel anatomical categories and unseen imaging domains from only a handful of annotated examples. Existing prototypical approaches inevitably entangle anatomical structure with domain-specific appearance variations, and thus lack a stable reference for reliable matching under domain shift. We observe that the geometric structure of human anatomy constitutes a reliable, domain-transferable prior that has been overlooked. Building on this insight, we propose GeoProto, a geometry-aware CD-FSMIS framework that enriches prototypical matching with explicit structural priors. The core component, Geometry-Aware Prototype Enrichment (GAPE), augments each local appearance prototype with a learned geometric offset encoding its ordinal position within the organ's interior topology. This offset is derived from an auxiliary Ordinal Shape Branch (OSB) trained under an ordinally consistent objective that enforces monotonic variation of geometric embeddings across interior strata, requiring no annotation beyond standard segmentation masks. Extensive experiments across seven datasets spanning three evaluation settings (cross-modality, cross-sequence, and cross-context) demonstrate that GeoProto achieves state-of-the-art performance.

2605.10859 2026-05-12 cs.CV cs.LG Updated

Masked Generative Transformer Is What You Need for Image Editing

Wei Chow, Linfeng Li, Xian Sun, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

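Affiliations * ByteDance; National University of Singapore; Duke University; Shanghai Jiao Tong University; HKUST(GZ)

AI summary The paper proposes EditMGT, an image editing framework based on Masked Generative Transformers (MGTs), to address the tendency of diffusion models to spread edits into non-target regions. Through localized token prediction and multi-layer attention consolidation, EditMGT controls the edited region precisely while avoiding unintended changes elsewhere. The authors also build CrispEdit-2M, an editing dataset of 2 million high-resolution samples, and report state-of-the-art image similarity on multiple benchmarks with editing 6x faster than existing methods.

Comments CVPR 2026 HiGen Workshop; Project Page at https://weichow23.github.io/EditMGT/ GitHub at https://github.com/weichow23/EditMGT

Abstract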

Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (>1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.

2605.10858 2026-05-12 cs.CV cs.RO Updated

Is Your Driving World Model an All-Around Player?

Lingdong Kong, Ao Liang, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Xian Sun, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

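AI summary Current driving world models can generate realistic dash-cam videos, but no single model excels in every respect. The paper introduces the WorldLens benchmark, which evaluates world-model fidelity along multiple dimensions, including pixel quality, 4D geometry, closed-loop driving behavior, and human perception, and shows that existing models are strong in texture, geometry, or behavioral consistency but not in all at once. The authors also build WorldLens-26K, a dataset of 26,808 human annotations, and WorldLens-Agent, a vision-language model that automatically evaluates generated worlds, providing a unified evaluation framework closer to human perception.

Comments CVPR 2026 VideoWorldModel Workshop; Project Page at https://worldbench.github.io/worldlens GitHub at https://github.com/worldbench/WorldLens

Abstract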

Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.

2605.10850 2026-05-12 cs.CV Updated

Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

Ruinan Jin, Beidi Zhao, Myeongkyun Kang, Qiong Zhang, Xiaoxiao Li

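Affiliations * The University of British Columbia; Vector Institute; Renmin University of China

AI summary This paper studies the reliability boundary of self-verification in medical visual question answering (VQA) and argues that the common practice of re-invoking the same vision language model (VLM) to verify its own answer is fundamentally unreliable. The authors propose a diagnostic framework that decomposes verifier behavior into discrimination capability and agreement bias, and show that capacity coupling between verifier and generator produces a "verification mirage": a regime in which verifier error and agreement bias are both high, driven by false acceptance of incorrect answers. Experiments show that verification does not provide an independent safety signal and that wrong answers can become locked in by false verification in multi-turn interaction, pointing to serious risks for self-verification in clinical use.

Comments 31 pages, 12 figures

Abstract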

Self-verification, re-invoking the same vision language model (VLM) in a fresh context to check its own generated answer, is increasingly used as a default safety layer for medical visual question answering (VQA). We argue that this practice is fundamentally unreliable. We introduce [METHOD NAME], a diagnostic framework for mapping the reliability boundary of medical VLM self-verification by decomposing verifier behavior into discrimination capability and agreement bias. Because the verifier and answer generator are capacity-coupled, the verifier can overly agree with the generator, creating a verification mirage: a regime with both high verifier error and high agreement bias, driven by false acceptance of incorrect answers. Evaluating six open-weight VLMs across five medical VQA datasets and seven medical tasks, we find that this boundary is strongly task-conditioned. Knowledge-intensive clinical tasks fall deepest into the mirage, simpler tasks are more resistant, and perceptual tasks lie in between. Verification also fails to provide an independent safety signal: logistic mixed-effects analysis shows that verifier error and agreement bias become more likely when the generator is wrong, while saliency analyses show that verifiers under-attend to image evidence relative to generators, a phenomenon we call the lazy verifier. Cross-verification reduces but does not eliminate the mirage. Moreover, when verification is reused in multi-turn actor-verifier loops, most initially wrong answers become locked in by false verification. Since our experiments use clean benchmarks, the observed reliability boundary likely underestimates failures in real clinical deployment.

2605.10845 2026-05-12 cs.CV cs.CL Updated

BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation

Qi Yang, Xiangyao Ma, Xiao Wang, Hao Wang, Rui Wang

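Affiliations * School of Computer Engineering and Science, Shanghai University, Shanghai, China; Funstory.ai Limited, Hong Kong SAR, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

AI summary As cross-lingual communication grows, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation methods face a tension between linguistic processing and layout preservation. BabelDOC introduces an intermediate-representation framework that decouples visual layout information from semantic content, enabling document-level translation operations such as terminology extraction and cross-page context handling, and re-anchors the translated content to the original layout through an adaptive typesetting engine. Experiments show that BabelDOC outperforms existing methods in layout fidelity, visual aesthetics, and terminology consistency while maintaining high translation precision.

Comments ACL 2026 System Demonstration paper. 2 figures

Abstract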

As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering. The translated content is then re-anchored to the original layout through an adaptive typesetting engine. Experiments on a curated 200-page benchmark, together with human evaluation and multimodal LLM-as-a-judge evaluation, show that BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines, while maintaining competitive translation precision. The open-source toolkit and its interactive downstream applications are publicly available and have attracted over 8.4K GitHub stars and 17 contributors at the time of writing. A demonstration video is also available.

2605.10835 2026-05-12 cs.CV cs.LG Updated

Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training

Daniel Dratschuk, Paul Swoboda

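Affiliations * Heinrich Heine University Düsseldorf

AI summary Optical Music Recognition (OMR) is bottlenecked by the lack of large-scale datasets of real scans, and existing methods mostly rely on few-shot transfer or overly simplistic synthetic training. The paper proposes Transcoda, a system that combines improved synthetic data generation, normalization of the **kern encoding, and grammar-based decoding to resolve the non-uniqueness of textual score encodings. The approach trains a compact 59M-parameter model in only 6 hours on a single GPU and substantially outperforms existing methods on both a synthetic score dataset and a dataset of historical Polish scores.

Comments 13 pages, 7 figures

Abstract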

Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state of the art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).
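
Illustrative sketch (not from the paper): the reported scores are normalized edit distances, where lower is better. The abstract does not define OMR-NED, so the code below shows only a generic token-level normalized edit distance, assuming Levenshtein distance divided by the longer sequence length. The actual metric may tokenize and normalize differently, and the token strings in the example are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length; 0.0 for two empty inputs."""
    denom = max(len(pred), len(ref))
    return 0.0 if denom == 0 else edit_distance(pred, ref) / denom

# Toy token sequences: the prediction drops one token.
reference = ["4c", "4d", "4e"]
prediction = ["4c", "4e"]
print(normalized_edit_distance(prediction, reference))
```

Under a metric of this kind, two encodings that render to the same score but differ as text would still be penalized, which is one reason a unique normal form for the encoding matters.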

2605.10833 2026-05-12 cs.CV cs.AI Updated

MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

Xiran Zhao, Jing Jin, Yan Bai, Zhongan Wang, Yifeng Sun, Yihang Lou, Xuanyu Zhu, Tao Feng, Yingna Wu

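Affiliations * ShanghaiTech University; Tsinghua University; Meituan Inc.; Peking University

AI summary This paper introduces MMVIAD, presented as the first multi-view continuous video dataset for industrial anomaly detection, covering a range of object categories, environments, and anomaly types and supporting multi-task evaluation. To improve fine-grained defect recognition and temporal localization, the authors design a two-stage post-training pipeline that substantially improves model performance and outperforms existing mainstream models. The work provides a new benchmark and method for industrial video understanding and anomaly detection.

Abstract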

Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.

2605.10806 2026-05-12 cs.CV cs.AI cs.LG Updated

PhyGround: Benchmarking Physical Reasoning in Generative World Models

Juyi Lin, Arash Akbari, Yumei He, Lin Zhao, Haichao Zhang, Arman Akbari, Xingchen Xu, Zoe Y. Lu, Enfu Nan, Hokin Deng, Edmund Yeh, Sarah Ostadabbas, Yun Fu, Jennifer Dy, Pu Zhao, Yanzhi Wang

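Affiliations * Northeastern University; Tulane University; University of Washington; Carnegie Mellon University

AI summary PhyGround is a new benchmark for evaluating the physical reasoning of generative world models, aimed at the difficulty of assessing whether video generation models obey physical laws. It contains 250 curated prompts, each paired with an expected physical outcome, and a taxonomy of 13 physical laws. With a large-scale, quality-controlled human annotation study and PhyJudge-9B, a physics-specialized vision-language model judge, PhyGround enables fine-grained, reproducible evaluation of the physical plausibility of generated videos, markedly improving the accuracy and reliability of evaluation.

Comments Preprint. 56 pages, 39 figures, 40 tables. Project page: https://phyground.github.io/

Abstract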

Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges, including the coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address those challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded on social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.
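
Illustrative sketch (not from the paper): a split-half model-ranking correlation can be computed roughly as follows. The code assumes a dense (annotators x models) score matrix, a random split of annotators into two halves, and no tied values. The study's actual annotation layout and splitting procedure are not given in the abstract, so the function names and setup here are hypothetical.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation; assumes no tied values in x or in y."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def split_half_rho(scores, rng):
    """scores: (annotators, models). Split annotators into two random halves,
    average each half per model, and correlate the two model rankings."""
    scores = np.asarray(scores, dtype=float)
    idx = rng.permutation(scores.shape[0])
    half = scores.shape[0] // 2
    first = scores[idx[:half]].mean(axis=0)
    second = scores[idx[half:]].mean(axis=0)
    return spearman_rho(first, second)

# Toy data: 40 annotators score 8 models around made-up "true" qualities.
rng = np.random.default_rng(0)
true_quality = np.linspace(1.0, 5.0, 8)
scores = true_quality + rng.normal(scale=0.5, size=(40, 8))
print(split_half_rho(scores, rng))
```

A value near 1 means the two halves of the annotator pool rank the models almost identically, which is the sense in which the retained annotations are reported as reliable.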

2605.10789 2026-05-12 cs.CV version update

Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction

Quanyun Wu, Kyle Gao, Wentao Sun, Zhengsen Xu, Hudson Sun, Linlin Xu, Yuhao Chen, David A. Clausi, Jonathan Li

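Affiliations * University of Waterloo; University of Calgary

AI summary This paper proposes a rapid forest fuel load estimation method based on virtual remote sensing data and metric-scale feed-forward 3D reconstruction, targeting the high cost and long turnaround of traditional methods. The method uses Google Earth Studio to generate low-altitude orbital imagery and camera poses, performs dense 3D reconstruction with an improved Pi-Long model, resolves the scale ambiguity of monocular reconstruction with a metric recovery module, and produces bird's-eye-view height and density maps that support tree species classification, Leaf Area Index calculation, and fuel load estimation. Experiments show that the method maintains geometric consistency while offering an efficient, low-cost approach to forest biomass estimation.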

Comments Accepted for publication at IEEE IGARSS 2026

Details
English abstract

Accurate quantification of forest coverage and combustible biomass (fuel load) is critical for wildfire risk assessment and ecosystem management. However, traditional methods relying on airborne LiDAR or field surveys are cost-prohibitive and time-intensive, while satellite imagery often lacks the vertical resolution required for canopy volume analysis. This paper proposes a novel, automated pipeline for rapid forest inventory using virtual remote sensing data derived from Google Earth Studio (GES). Our approach first generates low-altitude orbital imagery and camera poses for a target region. For dense 3D reconstruction, we employ Pi-Long, developed within the VGGT-Long framework. This model serves as a scalable extension of the Pi-3 feed-forward Transformer architecture. To address the inherent scale ambiguity in monocular reconstruction, we introduce a metric recovery module that aligns the reconstructed trajectory with GES ground truth poses via Sim(3) Umeyama optimization. The metric-scale point cloud is then orthogonally projected into Bird's-Eye-View (BEV) height and density maps. Finally, we employ a watershed-based segmentation algorithm combined with height variance analysis to classify tree species (conifer vs. broadleaf), calculate Leaf Area Index (LAI), and estimate total fuel load. Experimental results demonstrate that this pipeline offers a scalable, cost-effective alternative to physical scanning, enabling near-real-time estimation of forest biomass with high geometric consistency.
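
The metric recovery step in the abstract above aligns reconstructed camera positions with reference poses through a Sim(3) Umeyama fit. The following is a minimal NumPy sketch of the standard Umeyama closed-form solution, not the authors' implementation; the function name `umeyama_sim3` and the assumption that both trajectories are given as matched (N, 3) arrays of camera centres are illustrative.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ≈ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points, e.g. reconstructed
    camera centres (src) and reference camera centres (dst). Hypothetical
    helper for illustration only.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard: flip the last axis so that det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Applying `s * (R @ p) + t` to every reconstructed point `p` would then place the point cloud in the metric frame of the reference poses, under the assumption that the pose correspondences are correct.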

2605.10772 2026-05-12 cs.CV cs.AI eess.IV version update

Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition

David F. Ramirez, Tim L. Overman, Kristen Jaskie, Marv Kleine, Andreas Spanias

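Affiliations * SenSIP Center, School of ECEE, Arizona State University; Prime Solutions Group

AI summary This paper studies the application of large language-vision models (LLVM) to target recognition in synthetic aperture radar (SAR) imagery, particularly automatic target recognition (ATR) of military vehicles. By building a training and evaluation benchmark from the MSTAR Public Dataset and adding descriptive captions and question-answer pairs, the authors explore LLVM performance on remote sensing image captioning and visual question answering (VQA). Experiments show that, with parameter-efficient fine-tuning, the model identifies fine-grained target characteristics at 98% accuracy, offering a new technical path for machine-assisted remote sensing target recognition in military and intelligence contexts.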

Comments Accepted to SPIE Defense + Commercial Sensing, Automatic Target Recognition XXXV

Journal ref Proc. SPIE 13463, Automatic Target Recognition XXXV, 134630D (29 May 2025)

Details
English abstract

Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.

2605.10769 2026-05-12 cs.CV cs.AI version update

MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

Ziyi Wang, Xianping Ma, Ziyao Wang, Hongyang Zhang, Man On Pun

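Affiliations * The Chinese University of Hong Kong (Shenzhen); Southwest Jiaotong University

AI summary This paper proposes MPerS, a dynamic multimodal large language model (MLLM) mixture-of-experts perception-guided method for remote sensing scene segmentation, aimed at improving semantic segmentation of remote sensing imagery. The method designs multiple prompts that guide MLLMs to generate high-quality remote sensing scene captions, uses DINOv3 to extract dense visual features of land cover, and adaptively fuses the most effective textual semantics through a dynamic mixture-of-experts module, yielding more precise scene segmentation. Experiments show superior performance on three public remote sensing semantic segmentation datasets.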

Comments Accepted to CVPR 2026 Findings. 11 pages, 6 figures

Details
English abstract

The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.

2605.10765 2026-05-12 cs.CV cs.AI cs.LG version update

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

Tao Hu, Da-Wei Zhou

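Affiliations * School of Artificial Intelligence, Nanjing University; State Key Laboratory for Novel Software Technology, Nanjing University

AI summary Multimodal large language models (MLLMs) achieve strong performance through instruction tuning, but practical deployment often requires expanding capabilities step by step across sequential tasks while avoiding catastrophic forgetting. Existing methods mainly rely on a module-composition paradigm, which struggles with differences in visual scenes, question intents, and reasoning demands within the same task. To address this, the paper proposes DRAPE, a dynamic cross-modal prompt generation framework that derives prompt queries from the textual instruction and cross-attends to visual features, producing a tailored soft prompt for each query-image pair and enabling finer-grained, instance-level adaptation. Experiments show that DRAPE achieves state-of-the-art performance on multimodal continual instruction tuning benchmarks.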

Details
English abstract

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.

2605.10762 2026-05-12 cs.CV cs.AI version update

GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

Mohamed Eltahir, Lama Ayash, Ali Habibullah, Tanveer Hussain, Naeemullah Khan

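Affiliations * King Abdullah University of Science and Technology (KAUST); Department of Computer Science, Edge Hill University

AI summary In long-video understanding, vision-language models (VLMs) are bottlenecked by the quadratic attention cost of processing thousands of frames. To address this, the paper proposes GridProbe, an efficient training-free posterior-probing inference framework that uses a frozen VLM's own reasoning to score evidence in answer space and adaptively selects question-relevant frames, substantially reducing compute with almost no loss in accuracy. GridProbe arranges frames on a K×K grid and runs lightweight row and column probes to produce an interpretable importance map, which in turn drives shape-adaptive frame selection and improves the efficiency and performance of long-video understanding.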

Details
English abstract

Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to select a small subset of informative frames before the forward pass; training-free selectors typically do this via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM's own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K{\times}K$ grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of R and C yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within $1.6$ pp Avg Acc at $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA is strictly Pareto-dominant over the 2B monolithic baseline (up to $+4.0$ pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.
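
The importance map described in the abstract above can be illustrated with a small NumPy sketch. It only shows the outer product of row and column confidences, the two shape statistics, and a fixed-budget top-M pick; the abstract does not state the closed-form Shape-Adaptive Selection rule, so none is reproduced here. The function names, the row-major temporal ordering of grid cells, and the zero-variance convention are assumptions made for illustration.

```python
import numpy as np

def importance_map(row_conf, col_conf):
    """Outer product of K row and K column probe confidences -> (K, K) map."""
    return np.outer(row_conf, col_conf)

def shape_stats(imp):
    """Skewness and excess kurtosis of the flattened importance map."""
    x = np.asarray(imp, dtype=float).ravel()
    std = x.std()
    if std == 0:
        # Assumed convention for a perfectly flat map.
        return 0.0, 0.0
    z = (x - x.mean()) / std
    return float((z ** 3).mean()), float((z ** 4).mean() - 3.0)

def top_frames(imp, m):
    """Indices of the m highest-scoring cells, returned in temporal order.

    Assumes frames fill the grid row by row, so the flat index of a cell
    is also the frame's position in time.
    """
    flat = np.asarray(imp).ravel()
    idx = np.argsort(flat)[::-1][:m]
    return np.sort(idx)

# Hypothetical probe confidences for a 3x3 grid.
imp = importance_map([0.1, 0.9, 0.2], [0.3, 0.8, 0.1])
skew, kurt = shape_stats(imp)
selected = top_frames(imp, 2)
```

A sharply peaked map (high skewness) suggests that a few frames carry most of the evidence, which is the intuition behind letting the map's shape set the frame budget per question.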

2605.10761 2026-05-12 cs.CV version update

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal, Alan L. Yuille, Zongwei Zhou

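Affiliations * Department of Computer Science, Johns Hopkins University; Clinic of Radiology and Nuclear Medicine, University Hospital Basel; Department of Oncology, Johns Hopkins School of Medicine

AI summary RadThinking is a visual question answering (VQA) dataset for longitudinal clinical reasoning in radiology, intended to make the diagnostic reasoning involved in cancer screening explicit and trainable. It contains question-answer pairs at different difficulty levels, from basic perception questions to compositional questions that require multi-step reasoning, and provides the reasoning chain for every compositional question in line with clinical reporting standards. RadThinking covers CT scans from a large number of patients and is an important resource for systematic training and evaluation of reasoning in AI systems.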

Details
English abstract

Cancer screening is a reasoning task. A radiologist observes findings, compares them to prior scans, integrates clinical context, and reaches a diagnostic conclusion confirmed by pathology. We present RadThinking, a Visual Question Answering (VQA) dataset that makes this reasoning explicit and trainable. RadThinking releases VQA pairs at three difficulty tiers. Foundation VQAs are atomic perception questions. Single-step reasoning VQAs apply one clinical rule. Compositional VQAs require multi-step chain-of-thought to reach a guideline category such as LI-RADS-5. For every compositional VQA, we release the chain of foundation VQAs that solves it. The chain follows the rules of the governing clinical reporting standard. The dataset spans 20,362 CT scans from 9,131 patients across 43 cancer groups, plus 2,077 verified healthy controls with >1-year follow-up. To our knowledge, RadThinking is the first cancer-screening VQA corpus that stratifies questions by reasoning depth and grounds compositions in clinical reporting standards. The foundation tier supplies atomic perception supervision. The compositional tier supplies chain-of-thought data and verifiable rewards for reinforcement-learning recipes such as DeepSeek-R1 and OpenAI o1. RadThinking enables systematic training and evaluation of whether AI systems can reason about cancer, not merely detect it.

2605.10756 2026-05-12 cs.CV version update

TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection

Yifeng Yang, Jubo Feng, Jing Xu, Xinbing Wang, Qinying Gu, Nanyang Ye

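Affiliations * Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute; University of Electronic Science and Technology of China

AI summary This work proposes TINS, a test-time ID-prototype-separated negative semantics learning method that improves vision-language models on out-of-distribution (OOD) detection. To address the reliance of existing methods on static negative labels, which adapt poorly to diverse and evolving OOD concepts, TINS learns sample-specific negative semantic embeddings through image-to-text modality inversion and introduces ID-prototype-separated regularization to avoid confusion with ID semantics. Experiments show that TINS outperforms existing methods on multiple benchmarks; on the Four-OOD benchmark in particular, it reduces the average FPR95 from 14.04% to 6.72%.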

Details
English abstract

Vision-language models enable OOD detection by comparing image alignment with ID labels and negative semantics. Existing negative-label-based methods mainly rely on static negative labels constructed before inference, limiting their ability to cover diverse and evolving OOD concepts. Although test-time expansion provides a natural solution, naively learning negative semantics from potential OOD samples may introduce hard ID contamination. To address this issue, we propose a Test-time ID-prototype-separated Negative Semantics learning method, termed TINS. TINS learns sample-specific negative text embeddings via image-to-text modality inversion and introduces ID-prototype-separated regularization to keep them separated from ID semantics. To further stabilize negative semantics expansion, TINS employs group-wise aggregation scoring and a buffer update strategy. Extensive experiments across Four-OOD, OpenOOD, Temporal-shift, and Various ID settings show consistent improvements over strong baselines. Notably, on the Four-OOD benchmark with ImageNet-1K as ID, TINS reduces the average FPR95 from 14.04% to 6.72%. Our code is available at https://github.com/zxk1212/tins.
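
For readers unfamiliar with the FPR95 metric quoted above, the sketch below shows one common way to compute it: fix the score threshold that keeps 95% of in-distribution (ID) samples, then measure the fraction of OOD samples that still pass. It assumes higher scores mean "more ID-like"; the function name and the quantile-based threshold are illustrative choices, not taken from the TINS code.

```python
import numpy as np

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD data at the threshold that retains `tpr` of ID data.

    Assumes a higher score means the sample looks more in-distribution.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold at the (1 - tpr) quantile of ID scores, so that roughly
    # `tpr` of ID samples score at or above it.
    thresh = np.quantile(id_scores, 1.0 - tpr)
    # OOD samples at or above the threshold are wrongly accepted as ID.
    return float((ood_scores >= thresh).mean())
```

A lower value is better: an FPR95 of 6.72% means that, at a threshold accepting 95% of ID images, about 6.72% of OOD images are mistaken for ID.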

2605.10744 2026-05-12 cs.CV cs.RO version update

C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

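Affiliations * College of Transportation, Tongji University; Department of Civil Engineering, Tsinghua University; School of Vehicle and Mobility, Tsinghua University; Department of Civil and Environmental Engineering, National University of Singapore

AI summary This paper proposes C-CoT, a counterfactual reasoning framework built on vision-language models, to improve autonomous driving decisions in safety-critical scenarios such as complex urban intersections. The method decomposes driving decisions into five stages and introduces a structured meta-action evaluation tree that, in the counterfactual reasoning stage, explicitly assesses the potential consequences of different action combinations, establishing causal links between actions and safety outcomes and strengthening robustness in rare and out-of-distribution scenarios. Experiments show that the method outperforms existing approaches on metrics such as risk prediction and collision rate, markedly improving the safety and interpretability of autonomous driving systems.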

Details
English abstract

Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. While vision-language models (VLMs) offer promising approaches for safe decision-making in these environments, most current approaches lack reflective and causal reasoning, thereby limiting their overall robustness. To address this, we propose a counterfactual chain-of-thought (C-CoT) framework that leverages VLMs to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual reasoning stage, we introduce a structured meta-action evaluation tree to explicitly assess the potential consequences of alternative action combinations. This self-reflective reasoning establishes causal links between action choices and safety outcomes, improving robustness in long-tail and out-of-distribution scenarios. To validate our approach, we construct the DeepAccident-CCoT dataset based on the DeepAccident benchmark and fine-tune a Qwen2.5-VL (7B) model using low-rank adaptation. Our model achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies further confirm the critical role of counterfactual reasoning and the meta-action evaluation tree in enhancing safety and interpretability.

2605.10739 2026-05-12 eess.IV cs.AI cs.CV version update

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

David F. Ramirez, Tim Overman, Kristen Jaskie, Andreas Spanias

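Affiliations * SenSIP Center, School of ECEE, Arizona State University; Prime Solutions Group Inc; Intelligence Advanced Research Projects Activity

AI summary This paper presents SMART-HC-VQA, a multimodal visual question answering dataset built on Sentinel-2 satellite imagery for analyzing the spatiotemporal evolution of human activity. The dataset converts construction annotations, type labels, temporal-phase labels, and related information into natural language question-answer pairs, framing a temporally extended automatic target recognition and visual question answering challenge. The work also introduces a multi-image large language model training framework that handles multi-temporal remote sensing imagery and performs semantic reasoning, providing a reproducible foundation for understanding language-guided remote sensing activities.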

Comments Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI

Details
English abstract

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.

2605.10732 2026-05-12 cs.CV cs.AI version update

iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning

Kaicong Huang, Weiheng Oh, Thomas Guggisberg, Ruimin Ke

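Affiliations * Rensselaer Polytechnic Institute; Capital District Transportation Authority

AI summary This paper proposes iPay, an integrated payment action recognition framework for onboard public transit surveillance systems. The method combines RGB images and skeleton data in a multimodal mixture-of-experts architecture that captures local detail and global motion respectively, and introduces a dual-attention fusion mechanism and a Spatial Difference Discriminator to improve recognition of payment actions. Experiments show that iPay reaches 83.45% recognition accuracy on real surveillance data with good computational efficiency, making it suitable for edge deployment.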

Details
English abstract

Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To bridge both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance systems. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at https://github.com/ccoopq/iPay.

2605.10730 2026-05-12 cs.CV version update

Qwen-Image-2.0 Technical Report

Bing Zhao, Chenfei Wu, Deqing Li, Hao Meng, Jiahao Li, Jie Zhang, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kuan Cao, Kun Yan, Liang Peng, Lihan Jiang, Niantong Li, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Xihua Wang, Yan Shu, Yanran Zhang, Yi Wang, Yilei Chen, Ying Ba, Yixian Xu, Yujia Wu, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhendong Wang, Zihao Liu, Zikai Zhou, An Yang, Chen Cheng, Chenxu Lv, Dayiheng Liu, Fan Zhou, Hantian Xiong, Hongzhu Shi, Hu Wei, Huihong Zhao, Ivy Liu, Jianwei Zhang, Jiawei Zhang, Kai Chen, Kang He, Levon Xue, Lin Qu, Linhan Tang, Luwen Feng, Minggang Wu, Minmin Sun, Na Ni, Rui Men, Shuai Bai, Sishou Zheng, Tao Lan, Tianqi Zhang, Tingkun Wen, Wei Wang, Weixu Qiao, Weiyi Lu, Wenmeng Zhou, Xiaodong Deng, Xiaoxiao Xu, Xinlei Fang, Xionghui Chen, Yanan Wang, Yang Fan, Yichang Zhang, Yixuan Xu, Yu Wu, Zhiyuan Ma, Zhizhi Cai

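Affiliations * Qwen Team

AI summary This report introduces Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity image generation and precise image editing. By coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer, the model addresses challenges such as ultra-long text rendering, multilingual typography, and high-resolution photorealistic generation; supported by large-scale training data and a customized multi-stage training pipeline, it achieves strong multimodal understanding together with flexible generation and editing. Experiments show that Qwen-Image-2.0 substantially outperforms previous versions on generation and editing tasks, an important step toward more general, reliable, and practical image generation models.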

Details
English abstract

We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.

2605.10723 2026-05-12 cs.CV cs.AI cs.LG cs.MA version update

AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State

Huimin Wang, Leilei Ouyang, Chang Xia, Yongqi Kang, Yu Fu, Yuqi Ouyang

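Affiliations * College of Computer Science, Sichuan University

AI summary AllocMV is a hierarchical framework for music video generation that targets the high computational cost and the difficulty of maintaining cross-shot consistency in long-horizon video generation. The method models video synthesis as a Multiple-Choice Knapsack Problem, optimizes resource allocation using a structured persistent state object, and introduces a dynamic-programming solver for efficient resource scheduling. Experiments show that AllocMV achieves an optimal trade-off between generation quality and resource consumption under strict budget and rhythm constraints.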

Details
English abstract

Generating long-horizon music videos (MVs) is frequently constrained by prohibitive computational costs and difficulty maintaining cross-shot consistency. We propose AllocMV, a hierarchical framework formulating music video synthesis as a Multiple-Choice Knapsack Problem (MCKP). AllocMV represents the video's persistent state as a compact, structured object comprising character entities, scene priors, and sharing graphs, produced by a global planner prior to realization. By estimating segment saliency from multimodal cues, a group-level MCKP solver based on dynamic programming optimally allocates resources across High-Gen, Mid-Gen, and Reuse branches. For repetitive musical motifs, we implement a divergence-based forking strategy that reuses visual prefixes to reduce costs while ensuring motif-level continuity. Evaluated via the Cost-Quality Ratio (CQR), AllocMV achieves an optimal trade-off between perceived quality and resource expenditure under strict budgetary and rhythmic constraints.
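
To make the Multiple-Choice Knapsack formulation above concrete, here is a small dynamic-programming sketch in plain Python. Each segment (group) must pick exactly one branch, and the total cost must stay within the budget. The integer costs, the saliency-weighted values, and the function name `solve_mckp` are hypothetical; the abstract does not specify how AllocMV defines costs, values, or its group-level solver.

```python
def solve_mckp(groups, budget):
    """Pick exactly one (cost, value) option per group, maximising total value.

    groups: list of lists of (cost: int, value: float) tuples.
    Returns (best_value, chosen_option_indices), or (None, None) if no
    selection fits within the budget.
    """
    NEG = float("-inf")
    # best[b] = best value reachable with total cost exactly b.
    best = [0.0] + [NEG] * budget
    picks = []  # picks[g][b] = (option index, cost total before group g)
    for options in groups:
        new = [NEG] * (budget + 1)
        pick = [None] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == NEG:
                continue
            for k, (cost, value) in enumerate(options):
                nb = b + cost
                if nb <= budget and best[b] + value > new[nb]:
                    new[nb] = best[b] + value
                    pick[nb] = (k, b)
        best = new
        picks.append(pick)

    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == NEG:
        return None, None

    # Walk back through the groups to recover the chosen options.
    choices, b = [], end
    for pick in reversed(picks):
        k, b = pick[b]
        choices.append(k)
    choices.reverse()
    return best[end], choices

# Hypothetical segments with branches ordered as (High-Gen, Mid-Gen, Reuse).
segments = [
    [(5, 9.0), (3, 6.0), (1, 2.0)],  # salient chorus
    [(5, 5.0), (3, 4.0), (1, 3.0)],
    [(5, 4.0), (3, 3.5), (1, 3.0)],
]
value, choices = solve_mckp(segments, budget=9)  # -> 16.0, [0, 1, 2]
```

With a budget of 9, the sketch spends the expensive branch on the most salient segment and falls back to cheaper branches elsewhere, which is the qualitative behaviour the abstract describes.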

2605.10717 2026-05-12 cs.LG cs.CV version update

Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling

Guillem Capellera, Antonio Rubio, Luis Ferraz, Antonio Agudo

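Affiliations * Institut de Robòtica i Informàtica Industrial, CSIC-UPC

AI summary This paper proposes U2Diffine, a heteroscedastic diffusion model for multi-agent trajectory modeling that also provides an uncertainty estimate for every state, addressing the shortcomings of traditional methods in trajectory completion and uncertainty quantification. By adding the negative log-likelihood of the predicted noise to the denoising loss and propagating latent-space uncertainty to the real state space with a first-order Taylor expansion, it unifies trajectory completion and uncertainty estimation. The paper also proposes U2Diff, a more efficient baseline, and adds a ranking neural network for post-processing, which markedly improves inference speed and prediction reliability and outperforms existing methods on multiple sports datasets.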

Comments Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Extended version of arXiv:2503.18589 (CVPR 2025)

Details
English abstract

Multi-agent trajectory modeling traditionally focuses on forecasting, often neglecting more general tasks like trajectory completion, which is essential for real-world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state-wise measure of heteroscedastic uncertainty. Moreover, popular multi-modal sampling methods lack error probability estimates for each generated scene under the same prior observations, which makes it difficult to rank the predictions at inference time. We introduce U2Diffine, a unified diffusion model built to perform trajectory completion while simultaneously offering state-wise heteroscedastic uncertainty estimates. This is achieved by augmenting the standard denoising loss with the negative log-likelihood of the predicted noise, and then propagating the latent space uncertainty to the real state space using a first-order Taylor approximation. We also propose U2Diff, a faster baseline that avoids gradient computation during sampling. This approach significantly increases inference speed, making it as efficient as a standard generative-only diffusion model. For post-processing, we integrate a Rank Neural Network (RankNN) that enables error probability estimation for each generated mode, demonstrating strong correlation with ground truth errors. Our method outperforms state-of-the-art solutions in both trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), underscoring the effectiveness of our uncertainty and error probability estimation.
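
Two ingredients named in the abstract above, a Gaussian negative log-likelihood with a predicted variance and first-order propagation of uncertainty, can be sketched generically in NumPy. This is a textbook illustration, not the U2Diffine loss: the paper's exact weighting, the variables the likelihood is placed on, and the Jacobian used for propagation are not given in the abstract, so the signatures below are assumptions.

```python
import numpy as np

def gaussian_nll(target, mean, log_var):
    """Element-wise Gaussian negative log-likelihood with predicted log-variance.

    A large predicted variance down-weights the squared error but is
    penalised by the log-variance term, which is what makes the
    uncertainty heteroscedastic (input-dependent).
    """
    target, mean, log_var = map(np.asarray, (target, mean, log_var))
    return 0.5 * (log_var + (target - mean) ** 2 / np.exp(log_var)
                  + np.log(2.0 * np.pi))

def propagate_variance(jacobian, latent_var):
    """First-order (delta-method) propagation: Sigma_x ≈ J diag(var_z) J^T.

    jacobian: (d_x, d_z) derivative of the state with respect to the latent
    variable, evaluated at the latent mean; latent_var: (d_z,) variances.
    """
    J = np.asarray(jacobian, dtype=float)
    return J @ np.diag(np.asarray(latent_var, dtype=float)) @ J.T
```

For a fixed error, the NLL is smallest when the predicted variance equals the squared error, so a model trained with it is encouraged to report variances that match its actual mistakes.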

2605.10715 2026-05-12 cs.CV version update

UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting

Zhenyu Liang, Jack C. P. Cheng

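AI summary This paper proposes a UAV-based scan-to-simulation framework that improves the realism and accuracy of landslide monitoring and simulation. The method combines physics-informed 3D Gaussian Splatting (3DGS) with the Material Point Method (MPM), covering the full process from UAV-captured imagery to a physically grounded landslide simulation. Validation on a real landslide site in Hong Kong demonstrates the method's strengths in both visual reconstruction and physical simulation, providing a more effective tool for disaster prevention and public education.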

Details
English abstract

Landslide monitoring and simulation play an important role in urban safety assessment and disaster prevention. Existing landslide simulation pipelines typically rely on digital elevation model and mesh-based representations, which are suitable for geometric analysis, but often lack visual realism. This limitation reduces their effectiveness in interactive applications, hazard communication, and public education. In this paper, we propose a UAV-based scan-to-simulation framework that bridges photorealistic scene capture and physics-based landslide simulation through 3DGS. Specifically, our pipeline includes four stages: (1) UAV-based acquisition of slope imagery, (2) reconstruction of a low-anisotropy 3DGS scene representation, (3) volumetric conversion of the target simulation region by filling the interior of the surface-based model, and (4) integration with the Material Point Method (MPM) for landslide simulation. We validate the proposed framework on a real landslide site in Hong Kong that experienced a severe landslide event. The results show that our method supports both realistic visual reconstruction and effective simulation.

2605.10705 2026-05-12 cs.CV version update

TransmissiveGS: Residual-Guided Disentangled Gaussian Splatting for Transmissive Scene Reconstruction and Rendering

Zhenyu Liang, Xiao Zhang, Tianchao Li, Jack C. P. Cheng, Chi-Keung Tang

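Affiliations * HKUST

AI summary This paper proposes TransmissiveGS, a new framework for the challenging problem of reconstructing and rendering transmissive scenes. By introducing a dual-Gaussian representation and a deferred shading function, the method reconstructs reflection and transmission components separately; it uses multi-view inconsistency and residual information to separate surface geometry from lighting properties, and proposes a reflection light field to improve the accuracy of near-field reflection estimation. Experiments show that the method outperforms existing Gaussian Splatting techniques on both synthetic and real scenes, markedly improving reconstruction and rendering quality for transmissive scenes.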

Details
English abstract

Transmissive scenes are ubiquitous in daily life, yet reconstructing and rendering them remains highly challenging due to the inherent entanglement between near-field reflections from the surrounding environment on the transmissive surface, and the transmitted content of the scene behind it. This coupling gives rise to dual surface geometries and dual radiance components within each observation, posing ambiguities for standard methods. We present TransmissiveGS, a novel framework for disentangled reconstruction and rendering of transmissive scenes. Specifically, we model the scene with a dual-Gaussian representation and introduce a deferred shading function to jointly render the two Gaussian components. To separate reflection and transmission, we exploit the inherent multi-view inconsistency of reflections and leverage the residuals from reconstructing multi-view consistent content as cues for disentangled geometry and appearance modeling. We further propose a reflection light field that enables high-fidelity estimation of near-field reflections. During training, we introduce a high-frequency regularization to preserve fine details. We also contribute a new synthetic dataset for evaluating transmissive surface reconstruction. Experiments on both synthetic and real-world scenes demonstrate that TransmissiveGS consistently outperforms prior Gaussian Splatting-based methods in both reconstruction and rendering quality for transmissive scenes.

2605.10676 2026-05-12 cs.CV cs.LG version update

Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

Qingxin Xiao, Peilin Zhao, Yangyang Zhao, Lingwei Dang, Qingyao Wu

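Affiliations * South China University of Technology; Institute for Super Robotics (Huangpu); Shanghai Jiao Tong University; Changsha University of Science and Technology

AI summary During multimodal language model decoding, attention often concentrates abnormally on image regions irrelevant to the task. Existing methods usually treat these regions as noise and forcibly redirect attention, but this paper argues that the regions carry important visual and narrative logic and that forced redirection worsens the imbalance between vision and language. The authors therefore propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that introduces counter-commonsense image perturbation patches and dynamically adjusts the attention distribution during decoding, effectively suppressing hallucinated content and restoring the vision-language balance without any additional training. Experiments show that the method markedly improves model trustworthiness with almost no added inference overhead.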

详情
英文摘要

During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual and narrative logic, and such coercive corrections exacerbate visual-language imbalance. Adopting a "decoding-as-game" perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.

2605.10675 2026-05-12 cs.CV Version update

Neuromorphic Monocular Depth Estimation with Uncertainty Modeling

Viktor Bergkvist, Felix Rydell, Per-Erik Forssén, David Gustafsson, Johan Rideg

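Affiliations * Swedish Defence Research Agency; Linköping University

AI Summary This paper studies monocular depth estimation with event cameras and proposes a neuromorphic depth estimation method with uncertainty modeling. Using Gaussian, log-normal, and evidential learning frameworks, the model predicts a depth distribution for each pixel and estimates its uncertainty. The experiments compare six event representations, with U-Net models trained on synthetic data and fine-tuned on real sequences. The results indicate that uncertainty modeling improves the reliability of depth estimation and performs well across several metrics.

Details
English abstract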

Event cameras offer distinct advantages over conventional frame-based sensors, including microsecond-level temporal resolution, high dynamic range, and low bandwidth. In this paper, we predict per-pixel depth distributions from monocular event streams using deep neural networks. We estimate uncertainty using Gaussian, log-normal, and evidential learning frameworks. We compare six event representations: spatio-temporal voxel grids with 1, 5, 10, and 20 temporal bins, the Compact Spatio-Temporal Representation (CSTR), and Time-Ordered Recent Event (TORE) volumes. Our U-Net-based models are trained on synthetic data and then fine-tuned on real sequences. We evaluate performance using absolute relative error, root mean squared error, and the area under the sparsification error curve. Quantitative results show that the representations perform similarly, while 10-bin log-normal and 5-bin evidential learning perform best across metrics. Our experiments demonstrate that uncertainty estimation can be successfully integrated into event-based monocular depth estimation and used to indicate pixels with reliable depth.
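
As a rough illustration of how a per-pixel depth distribution yields both a prediction and an uncertainty, the sketch below evaluates Gaussian and log-normal negative log-likelihoods for a single pixel. The function names and the mean/standard-deviation parameterization are assumptions for illustration; the abstract does not state the exact losses used.

```python
import math

def gaussian_nll(depth, mu, sigma):
    """Negative log-likelihood of `depth` under N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (depth - mu) ** 2 / (2.0 * sigma ** 2)

def lognormal_nll(depth, mu, sigma):
    """Negative log-likelihood of `depth` > 0 when log(depth) ~ N(mu, sigma^2)."""
    log_d = math.log(depth)
    return log_d + 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (log_d - mu) ** 2 / (2.0 * sigma ** 2)

# Hypothetical pixel: true depth 4.0, two predictions with the same (wrong) mean.
confident = gaussian_nll(4.0, mu=3.0, sigma=0.2)  # small sigma: the error is penalized heavily
cautious = gaussian_nll(4.0, mu=3.0, sigma=1.0)   # larger sigma: the same error costs less
```

Under a loss of this kind, the predicted sigma can be used to rank pixels by reliability, which is the ordering a sparsification curve evaluates.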

2605.10661 2026-05-12 cs.CV cs.AI Version update

bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

Michal Byra, Pawel Olszowiec, Grzegorz Stefanski, Grzegorz Gruszczynski, Alberto Presta

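Affiliations * Samsung AI Center; Institute of Fundamental Technological Research, Polish Academy of Sciences

AI Summary This paper asks whether a single-block recurrent structure can replace the conventional stack of independently parameterized layers in Vision Transformers (ViTs). It proposes bViT, a model that processes an image by repeatedly applying a single transformer block, which preserves the deep iterative structure while greatly reducing the parameter count. Experiments show that, under the same training recipe and compute budget, bViT reaches performance comparable to a standard ViT on ImageNet-1K with roughly an order of magnitude fewer parameters, demonstrating the effectiveness and potential of recurrence for vision tasks.

Comments 31 pages, 16 figures

Details
English abstract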

Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer-specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer-specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
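
The "order of magnitude fewer parameters" claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts the parameters of one standard transformer block and compares a 12-block stack with a single shared block at ViT-B width. The helper names are hypothetical, and the count covers blocks only (no patch embedding, position embeddings, or classification head), so it is an approximation rather than the paper's reported figure.

```python
def block_params(d, mlp_ratio=4):
    """Approximate parameter count of one standard transformer block of width d."""
    attention = 4 * d * d + 4 * d                    # Q, K, V and output projections, with biases
    mlp = 2 * mlp_ratio * d * d + mlp_ratio * d + d  # two linear layers, with biases
    norms = 4 * d                                    # two LayerNorms (scale and shift)
    return attention + mlp + norms

def apply_recurrent(block, x, steps):
    """Apply the same block `steps` times, as a single-block recurrent model would."""
    for _ in range(steps):
        x = block(x)
    return x

d, depth = 768, 12                 # ViT-B width and depth
stacked = depth * block_params(d)  # 12 independently parameterized blocks
shared = block_params(d)           # one block reused for 12 steps
```

With these assumptions the block stack shrinks by a factor of 12 while the number of sequential steps stays the same.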

2605.10645 2026-05-12 cs.CV Version update

GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks

Hantao Zhang, Weidong Guo, Yuhe Liu, Jiancheng Yang, Sathvik Bhagavan, Danli Shi, Mingda Xu, Pascal Fua

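Affiliations * CVLab, École Polytechnique Fédérale de Lausanne (EPFL); Fudan University; Beihang University; ELLIS Institute Finland; Aalto University; The Hong Kong Polytechnic University

AI Summary This paper proposes GenMed, a new generative framework for medical diagnosis that models the joint distribution $P(X,Y)$ of inputs and outputs and recasts diagnostic tasks as output optimization at inference time. Built on diffusion models, it enables flexible gradient-based guidance under diverse input conditions without changing the architecture or retraining, supporting medical image segmentation in cross-modality, few-shot, and zero-shot settings. Experiments show strong performance across a range of medical imaging tasks, and the authors release a large-scale text-shape dataset to support the research.

Details
English abstract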

Data-driven medical AI is traditionally formulated as a discriminative mapping from input $X$ to output $Y$ via a learned function $f$, which does not generalize well across heterogeneous data and modalities encountered in real-world clinical settings. In this work, we propose a fundamentally different, generative paradigm. We model the joint distribution $P(X,Y)$ using diffusion models and reframe inference as a test-time output optimization problem. By guiding the generative process to match observed inputs, our framework enables flexible, gradient-based conditioning at inference time without architectural changes or retraining, effectively supporting arbitrary and previously unseen combinations of observations. Extensive experiments demonstrate strong performance across standard and cross-modality medical image segmentation, few-shot segmentation with only 2 or 4 training samples, degraded-input segmentation, shape completion from sparse and partial observations, and zero-shot application to demonstrate generality. To support these evaluations, we curated and released a large-scale text-shape dataset derived from MedShapeNet. Our results highlight the versatility of generative joint modeling as a foundation for reusable, task-agnostic medical AI systems.

2605.10641 2026-05-12 cs.CV cs.AI Version update

LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

Nikolaos Gkalelis, Vasileios Mezaris

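Affiliations * CERTH-ITI

AI Summary This paper proposes LLaVA-CKD, a bottom-up cascaded knowledge distillation framework that targets the heavy compute and memory demands vision-language models (VLMs) face in practical deployment. By introducing Teacher models of intermediate capacity that guide the Student step by step, the method mitigates the degradation in knowledge transfer caused by an overly large Teacher-Student capacity gap in conventional knowledge distillation. Experiments show that the framework achieves state-of-the-art performance on several standard visual question answering benchmarks.

Comments Under review

Details
English abstract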

Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis in order to study the effect of cascaded distillation on the generalization performance of the Student. We apply the proposed framework to models built upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.
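
The cascade described above can be sketched with the standard temperature-scaled distillation loss applied stage by stage, from the intermediate Teacher to the largest one. This is a minimal sketch under stated assumptions: `kd_loss` is the textbook KL-based formulation, not necessarily the objective used in the paper, and `cascade` with its `student_step` callback is a hypothetical stand-in for a full training stage.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2

def cascade(student_step, teachers):
    """Run one distillation stage per Teacher, in order of increasing capacity."""
    for teacher in teachers:  # e.g. [intermediate_teacher, large_teacher]
        student_step(teacher)
```

The ordering is the point of the design: the Student meets the intermediate Teacher first, so the capacity gap at each stage is smaller than in one-shot distillation from the largest model.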

2605.10629 2026-05-12 cs.CV Version update

Product-of-Gaussian-Mixture Diffusion Models for Joint Nonlinear MRI Reconstruction

Laurenz Nagler, Martin Zach, Thomas Pock

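Affiliations * Graz University of Technology; École Polytechnique Fédérale de Lausanne; Biomedical Imaging Group and Center for Biomedical Imaging

AI Summary This paper proposes a joint nonlinear MRI reconstruction method based on product-of-Gaussian-mixture diffusion models, addressing the large networks, opaque time-conditioning mechanisms, and offline coil sensitivity estimation that existing methods rely on. It uses the parameter-efficient Gaussian-mixture diffusion model as an image prior and combines it with a classical smoothness prior on the coil sensitivities to reconstruct the image and the coil sensitivities jointly. The method maintains reconstruction quality while improving robustness to shifts in contrast and anatomical distribution and to different k-space trajectories.

Details
English abstract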

Recently, diffusion models have attracted considerable attention for magnetic resonance image reconstruction due to their high sample quality. However, most existing methods rely on large networks with opaque time-conditioning mechanisms, and require offline coil sensitivity estimation. This results in limited interpretability of the reconstruction process and reduced flexibility in the acquisition setup. To address these limitations, we jointly reconstruct the image and the coil sensitivities by combining the parameter-efficient product-of-Gaussian-mixture diffusion model as an image prior with a classical smoothness prior on the coil sensitivities. The proposed method is fast and robust to both contrast and anatomical distribution shifts as well as changing k-space trajectories. Finally, we propose a more expressive parameterization of the image prior which improves results in denoising and magnetic resonance image reconstruction.

2605.10628 2026-05-12 cs.CV Version update

Hypergraph-Enhanced Training-Free and Language-Free Few-Shot Anomaly Detection

Guohuan Xie, Xin He, Dingying Fan, Siqi Li, Yun Liu

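Affiliations * Nankai University; Tianjin University of Technology; Tsinghua University

AI Summary This paper proposes HyperFSAD, a few-shot anomaly detection framework that requires neither training nor language prompts and is robust across domains, removing the dependence of existing methods on task-specific training, language supervision, and domain adaptation. Built on DINOv3 and a hypergraph-based inference mechanism, it uses Sparse Hyper Matching and a Dual-Branch Image Scoring strategy to form a compact representation of normal samples and to identify anomalous regions accurately. Experiments on six datasets covering industrial and medical scenarios show that HyperFSAD achieves state-of-the-art detection performance under the strict training-free, language-free setting.

Details
English abstract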

Few-shot anomaly detection (FSAD) has made significant strides, yet existing methods still face critical challenges: (i) dependence on task- or dataset-specific training/fine-tuning, (ii) reliance on language supervision or carefully hand-crafted prompts, and (iii) limited robustness across domains. In this paper, we introduce HyperFSAD, a novel FSAD framework that is training-free, language-free, and robust across domains, offering a powerful solution to these challenges. Built upon DINOv3 and a hypergraph-based inference mechanism, our approach performs inference without any task-specific optimization or text prompts, while remaining competitive. Specifically, we replace sensitive nearest-neighbor / top-$n$ matching with Sparse Hyper Matching: sparsemax first selects the most relevant support patches, which are then aggregated into a hyperedge as compact normal evidence to suppress background noise and distractors. We further introduce Dual-Branch Image Scoring, which fuses spatial anomaly evidence from the patch-grid anomaly map with global semantic deviation captured by support-aware CLS matching, yielding a robust image-level anomaly score in a strictly visual manner. Notably, all components of HyperFSAD are purely visual, eliminating the need for labor-intensive hand-crafted text prompts. Under the stringent training-free and language-free setting, HyperFSAD achieves state-of-the-art performance across six datasets spanning four industrial datasets (MVTecAD, VisA, MPDD, BTAD) and two medical datasets (RESC, BraTS).
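
To make the sparsemax selection step concrete, the sketch below implements the standard sparsemax projection and then forms a weighted sum of support patch features. The `hyperedge_feature` helper and the toy features are assumptions for illustration; the abstract does not specify how the hyperedge is actually aggregated.

```python
def sparsemax(scores):
    """Project `scores` onto the probability simplex; many outputs are exactly zero."""
    z_sorted = sorted(scores, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, z in enumerate(z_sorted, start=1):
        cumulative += z
        if 1.0 + j * z > cumulative:      # the j-th largest score is still in the support
            tau = (cumulative - 1.0) / j  # threshold for a support of size j
    return [max(s - tau, 0.0) for s in scores]

def hyperedge_feature(weights, support_features):
    """Weighted sum of support patch features; zero-weight patches drop out."""
    dim = len(support_features[0])
    return [sum(w * f[i] for w, f in zip(weights, support_features)) for i in range(dim)]

# Hypothetical similarities between one query patch and three support patches.
weights = sparsemax([2.0, 1.5, 0.1])  # → [0.75, 0.25, 0.0]
edge = hyperedge_feature(weights, [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
```

Unlike softmax, sparsemax assigns exactly zero weight to the weakly matching third patch, so its features cannot leak into the aggregated evidence.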

2605.10622 2026-05-12 cs.MM cs.CV Version update

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

Yangneng Chen, Junlin Li, Weijun Yao, Xilai Ma, Guodong Du, Wenya Wang, Jing Li

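Affiliations * Harbin Institute of Technology (Shenzhen); Huawei Technologies Co., Ltd.; The Hong Kong Polytechnic University; Nanyang Technological University

AI Summary Large Vision-Language Models (LVLMs) perform well on multimodal tasks, but their reliability is often undermined by hallucinations, that is, generated text that contradicts the visual input. This paper identifies a phenomenon called Vocabulary Hijacking: certain visual tokens (called Inert Tokens) attract a disproportionate share of attention and consistently decode to unrelated words (Hijacking Anchors) in the vocabulary space, leading to semantic collapse. Building on this, the authors propose HAVAE, a training-free intervention that strengthens the focus of critical attention heads on visual content, mitigating hallucinations while preserving the model's overall performance.

Comments Accepted by ACL 2026 Main

Details
English abstract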

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations, that is, generated text that contradicts the visual input. Recent studies often attribute these errors to inadequate visual attention. In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term Vocabulary Hijacking. We discover that specific visual tokens, defined as Inert Tokens, disproportionately attract attention. Crucially, when their intermediate hidden states are projected into the vocabulary space, they consistently decode to a fixed set of unrelated words (termed Hijacking Anchors) across layers, revealing a rigid semantic collapse. Leveraging this semantic rigidity, we propose Hijacking Anchor-Based Identification (HABI), a robust strategy to accurately localize these Inert Tokens. To quantify the impact of this phenomenon, we introduce the Non-Hijacked Visual Attention Ratio (NHAR), a novel metric designed to identify attention heads that remain resilient to hijacking and are critical for factual accuracy. Building on these insights, we propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that selectively strengthens the focus of these identified heads on salient visual content. Extensive experiments across multiple benchmarks demonstrate that HAVAE significantly mitigates hallucinations with no additional computational overhead, while preserving the model's general capabilities. Our code is publicly available at https://github.com/lab-klc/HAVAE.

2605.10616 2026-05-12 cs.LG cs.CL cs.CV Version update

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

Alan Arazi, Eilam Shapira, Shoham Grunblat, Mor Ventura, Elad Hoffer, Gioia Blayer, David Holzmüller, Lennart Purucker, Gaël Varoquaux, Frank Hutter, Roi Reichart

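Affiliations * Technion – Israel Institute of Technology; Prior Labs; NVIDIA; SODA Team, INRIA Saclay, Palaiseau; University of Freiburg; Probabl; ELLIS Institute Tübingen

AI Summary This paper introduces MulTaBench, a multimodal tabular learning benchmark of 40 datasets covering image-tabular and text-tabular tasks, designed to evaluate models that combine structured data with unstructured modalities such as text and images. The study finds that tuning embeddings to the task improves performance, whereas existing benchmarks often overlook task relevance, which leads to high variance in results. By emphasizing complementary information across modalities, MulTaBench promotes target-aware representation learning and opens a research direction toward multimodal tabular foundation models.

Details
English abstract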

Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.

2605.10588 2026-05-12 cs.CV Version update

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

Yanbing Zhang, Bo Wang, Jianhui Liu, Nan Jiang, Jiaxiu Jiang, Haoze Sun, Yijun Yang, Shenghe Zheng, Lin Song, Haoyang Huang, Nan Duan, Wenbo Li

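Affiliations * Joy Future Academy

AI Summary Current Large Multimodal Models (LMMs) perform poorly on spatial reasoning tasks that require viewpoint-dependent understanding, mainly because they are limited to a single static observation. The paper proposes a paradigm called Thinking with Novel Views (TwNV), which introduces synthesized novel-view images into the reasoning process to improve the model's understanding of spatial relationships. Experiments show that TwNV consistently improves performance across several spatial subtasks and LMM architectures, supporting novel-view generation as an effective way to strengthen spatial intelligence.

Comments Submitted to NeurIPS 2026

Details
English abstract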

Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.

2605.10586 2026-05-12 cs.CV Version update

CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations

Nengbo Lu, Minghua Pan

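Affiliations * Guilin University of Electronic Technology

AI Summary This paper proposes CausalGS, a framework that learns the physical causality of complex dynamic 3D scenes from multi-view videos alone, without relying on explicit priors. Its core is an inverse physics inference module that decomposes the dynamics into two jointly inferred factors, the scene's initial velocity field and its intrinsic material properties, and uses a differentiable physics simulator for physics-regularized learning. Experiments show that CausalGS outperforms existing methods on long-term future frame extrapolation and novel view interpolation, demonstrating that it can learn interactions among physical properties and causal relationships from visual observations alone.

Comments ICMR2026 Accepted

Details
English abstract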

Learning a physical model from video data that can comprehend physical laws and predict the future trajectories of objects is a formidable challenge in artificial intelligence. Prior approaches either leverage various Partial Differential Equations (PDEs) as soft constraints in the form of PINN losses, or integrate physics simulators into neural networks; however, they often rely on strong priors or high-quality geometry reconstruction. In this paper, we propose CausalGS, a framework that learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, while dispensing with the reliance on explicit priors. At its core is an inverse physics inference module that decouples the complex dynamics problem from the video into the joint inference of two factors: the initial velocity field representing the scene's kinematics, and the intrinsic material properties governing its dynamics. This inferred physical information is then utilized within a differentiable physics simulator to guide the learning process in a physics-regularized manner. Extensive experiments demonstrate that CausalGS surpasses the state-of-the-art on the highly challenging task of long-term future frame extrapolation, while also exhibiting advanced performance in novel view interpolation. Crucially, our work shows that, without any human annotation, the model is able to learn the complex interactions between multiple physical properties and understand the causal relationships driving the scene's dynamic evolution, solely from visual observations.

2605.10576 2026-05-12 cs.CV cs.AI Version update

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

Chen Zhong, Xiao An, Jiaxing Sun, Zihan Gui, Guangyi Yang, Wei He

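Affiliations * Wuhan University; Shanghai Artificial Intelligent Laboratory

AI Summary This paper proposes SenseBench, the first benchmark dedicated to evaluating the remote sensing low-level visual perception and description abilities of large vision-language models. Addressing the inability of current image quality assessment methods to characterize remote sensing degradations, it provides more than 10,000 carefully annotated samples across 6 major and 22 fine-grained degradation categories, together with two evaluation protocols, perception and description. The evaluation reveals key problems in existing models, including domain bias and multi-distortion confusion, and offers support for developing remote sensing low-level perception models.

Details
English abstract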

Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at https://github.com/Zhong-Chenchen/SenseBench.

2605.10571 2026-05-12 eess.IV cs.CV Version update

Set-Based Groupwise Registration for Variable-Length, Variable-Contrast Cardiac MRI

Yi Zhang, Yidong Zhao, Tijmen Toxopeus, Maša Božić-Iven, Sebastian Weingärtner, Qian Tao

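Affiliations * Department of Imaging Physics, Delft University of Technology, The Netherlands

AI Summary For cardiac MRI sequences of variable length and contrast, this study proposes AnyTwoReg, a set-based groupwise registration method that addresses the poor cross-protocol generalization of conventional deep-learning methods. It treats an MRI sequence as an unordered set, decoupling network design from sequence length and input ordering, and uses a shared encoder with correlation-guided feature aggregation to build a permutation-invariant canonical reference, yielding a permutation-equivariant mapping from images to deformation fields. Experiments show good zero-shot generalization to unseen quantitative MRI datasets and improved quality of downstream quantitative mapping.

Comments MICCAI 2026. Submitted Version

Details
English abstract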

Quantitative cardiac magnetic resonance imaging (MRI) enables non-invasive myocardial tissue characterization but relies on robust motion correction within these variable-length, variable-contrast image sequences. Groupwise registration, which simultaneously aligns all images, has shown greater robustness than pairwise registration for motion correction. However, current deep-learning-based groupwise registration methods cannot generalize across MRI sequences: the architecture typically encodes input data as a fixed-length channel stack, which rigidly couples network design to protocol-specific sequence length, input ordering, and contrast dynamics. At inference time, any change in imaging protocols will render the network unusable. In this work, we introduce AnyTwoReg, a new set-based groupwise registration framework that takes a quantitative MRI sequence as an unordered set. This set formulation fundamentally decouples network design from sequence length and input ordering. By utilizing a shared encoder and correlation-guided feature aggregation, AnyTwoReg constructs a permutation-invariant canonical reference for registration, and learns a permutation-equivariant mapping from images to deformation fields. Additionally, we extract contrast-insensitive image features from an existing foundation model to handle extreme contrast variations. Trained exclusively on a single public $T_1$ mapping dataset (STONE, sequence length $L=11$), AnyTwoReg generalizes to two unseen quantitative MRI datasets (MOLLI, ASL) with variable lengths ($L \in [11, 60]$) and different contrast dynamics. It achieves strong cross-protocol generalization in a zero-shot manner, and consistently improves downstream quantitative mapping quality. Notably, while designed for quantitative MRI sequences, our framework is directly applicable to Cine MRI sequences for inter-cardiac-phase registration.

2605.10567 2026-05-12 cs.CV Version update

VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos

Nengbo Lu, Bin Zhao

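Affiliations * Guangxi Key Laboratory of Robot Intelligent Perception and Control; School of Artificial Intelligence, Guilin University of Electronic Technology

AI Summary This paper proposes VeloGauss, a method that jointly models the geometry, appearance, and physical information of 3D scenes from dynamic multi-view videos alone, without relying on any physical priors. It learns a velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and adds Global Physical Constraints to ensure the physical consistency of the scene. Experiments show that VeloGauss outperforms existing methods on both novel view interpolation and future frame extrapolation.

Comments ICME2026 Accepted

Details
English abstract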

In this paper, we aim to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works typically employ physical losses merely as soft constraints or integrate physical simulations into neural networks; however, these approaches often fail to effectively learn complex motion physics. Although modeling velocity fields holds the potential to capture authentic physical information, due to the lack of appropriate physical constraints, current methods are unable to correctly learn the interaction mechanisms between rigid and non-rigid particles. To address this, we propose VeloGauss, designed to learn the physical properties of complex dynamic 3D scenes without physical priors. Our method learns the velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and ultimately incorporates Global Physical Constraints to ensure the physical consistency of the scene. Extensive experiments on four public datasets demonstrate that our method achieves state-of-the-art performance in both Novel View Interpolation and Future Frame Extrapolation tasks.

2605.10564 2026-05-12 cs.CV cs.RO Version update

DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

Lingjun Zhang, Changjie Wu, Linzhe Shi, Jiangyang Li, Jiaxin Liu, Lei Yang, Hang Zhang, Mu Xu, Hong Wang

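Affiliations * Tsinghua University; Amap, Alibaba Group; Nanyang Technological University

AI Summary This paper proposes DeepSight, a world model for end-to-end autonomous driving that predicts latent semantic features of consecutive future frames in parallel in bird's-eye-view (BEV) space, enabling long-horizon modeling of future world states. It also introduces an efficient, adaptive text reasoning mechanism that draws on additional social knowledge and reasoning capabilities to improve driving performance in complex long-tail scenarios. Experiments show state-of-the-art results on the closed-loop Bench2drive benchmark.

Comments ICML 2026

Details
English abstract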

End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. The resulting approach achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Code is available at: https://github.com/hotdogcheesewhite/DeepSight.

2605.10523 2026-05-12 cs.CV Version update

Improving Human Image Animation via Semantic Representation Alignment

Chang Liu, Mengting Chen, Yixuan Huang, Haoning Wu, Chen Ju, Shuai Xiao, Jinsong Lan, Yanfeng Wang

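Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University, China; Alibaba Group, China

AI Summary This paper studies how semantic representation alignment can improve the quality of human image animation, addressing the limb twisting and facial distortion that appear when generating long videos or complex motions. It proposes SemanticREPA, which uses a structure alignment module and an ID alignment module to align, respectively, the structure representations in video latents with depth features and the identity representations of generated videos with face recognition features, improving structural stability and identity consistency. The method performs well on complex motion generation and character consistency, offering a higher-quality and more flexible approach to human animation.

Comments Accepted by CVPR 2026 workshop

Details
English abstract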

The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations could decrease the generation flexibility. Moreover, their reliance on RGB pixel supervision also lacks emphasis on learning necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.

2605.10521 2026-05-12 cs.CV cs.AI Version update

DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

Yiqi Tian, Sangjoon Park, Bo Zeng, Pengfei Jin, Yujin Oh, Quanzheng Li

发表机构 * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School(先进医学计算与分析中心,麻省总医院和哈佛医学院) Department of Industrial Engineering, University of Pittsburgh(工业工程系,匹兹堡大学) Department of Radiation Oncology, College of Medicine, Yonsei University(放射肿瘤学系,延世大学医学院) Institute for Innovation in Digital Healthcare, Yonsei University(数字医疗创新研究所,延世大学) Department of Biomedical Systems Informatics, College of Medicine, Yonsei University(生物医学系统信息学系,延世大学医学院)

AI总结 医学图像分割模型在不同子群体中的表现可能存在差异,现有公平性方法大多关注提升子群体平均性能,忽略了子群体内部可能存在的隐藏失效问题。为此,本文提出DuetFair机制,通过联合考虑子群体间适应与子群体内鲁棒性,引入FairDRO方法,结合分布感知的专家混合模型与子群体条件分布鲁棒优化,有效提升了模型在不同子群体中的公平性与分割性能。实验表明,FairDRO在多个医学图像分割基准上取得了优越的公平性与性能提升。

Comments 16 pages, 2 figures

Details
English abstract

Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem intra-group hidden failure. To solve it, we propose the DuetFair mechanism, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce FairDRO, which combines a distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice over the strongest baseline by 3.5 points (+6.0%) under the tumor-stage grouping and by 4.1 points (+7.4%) under the institution grouping.
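
The idea of an intra-group hidden failure can be illustrated with a small sketch. This is not the FairDRO loss, whose exact form the abstract does not give. It assumes a CVaR-style surrogate (the mean of the worst fraction of losses) inside each subgroup, followed by a plain mean across subgroups. The names `cvar`, `subgroup_dro_loss`, and `alpha` are hypothetical.

```python
import math
from collections import defaultdict


def cvar(losses, alpha):
    """Mean of the worst ceil(alpha * n) losses (a CVaR-style DRO surrogate)."""
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def subgroup_dro_loss(losses, groups, alpha=0.5):
    """CVaR within each subgroup, then an unweighted mean across subgroups.

    Assumption: a generic stand-in for subgroup-conditioned DRO aggregation,
    not the aggregation defined in the paper.
    """
    by_group = defaultdict(list)
    for loss, group in zip(losses, groups):
        by_group[group].append(loss)
    per_group = {g: cvar(v, alpha) for g, v in by_group.items()}
    return sum(per_group.values()) / len(per_group), per_group


# Both subgroups have the same mean loss (0.3), but "a" hides one hard sample.
losses = [0.1, 0.1, 0.1, 0.9, 0.3, 0.3, 0.3, 0.3]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
total, per_group = subgroup_dro_loss(losses, groups, alpha=0.5)
```

In this toy input the subgroup means are identical, so a mean-based fairness objective would see no gap, while the tail-focused aggregation assigns subgroup "a" a larger loss than subgroup "b".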

2605.10498 2026-05-12 cs.CV cs.AI stat.ML Updated version

Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data

Heegeon Yoon, Heeyoung Kim

Affiliations * Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)

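AI summary This study targets highly imbalanced multi-modal data and proposes a new framework that handles long-tailed recognition and multi-modal fusion simultaneously. The method uses a multi-expert architecture, employs modality-specific networks to estimate the informativeness of each modality, and uses confidence-guided weights to adjust the fusion process dynamically, integrating multi-source data more effectively. Experiments show that it outperforms existing methods on several benchmark and real-world datasets, demonstrating robustness and generalization in long-tailed classification.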

Details
English abstract

Long-tailed distributions in class-imbalanced data present a fundamental challenge for deep learning models, which tend to be biased toward majority classes. While recent methods for long-tailed recognition have mitigated this issue, they are largely restricted to single-modal inputs and cannot fully exploit complementary information from diverse data sources. In this work, we introduce a new framework for long-tailed recognition that explicitly handles multi-modal inputs. Our approach extends multi-expert architectures to the multi-modal setting by fusing heterogeneous data into a unified representation while leveraging modality-specific networks to estimate the informativeness of each modality. These confidence-guided weights dynamically modulate the fusion process, ensuring that more informative modalities contribute more strongly to the final decision. To further enhance performance, we design specialized training and test procedures that accommodate diverse modality combinations, including images and tabular data. Extensive experiments on benchmark and real-world datasets demonstrate that the proposed approach not only effectively integrates multi-modal information but also outperforms existing methods in handling long-tailed, class-imbalanced scenarios, highlighting its robustness and generalization capability.
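
A minimal sketch of confidence-guided fusion follows. It assumes each modality already provides class logits and a scalar confidence, and that the confidences are normalized with a softmax. Both are illustrative assumptions, since the abstract does not specify how the modality-specific networks produce or normalize their weights. The names `softmax` and `fuse_logits` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(modality_logits, confidences):
    """Weight each modality's class logits by its normalized confidence."""
    weights = softmax(confidences)
    n_classes = len(modality_logits[0])
    fused = [
        sum(w * logits[c] for w, logits in zip(weights, modality_logits))
        for c in range(n_classes)
    ]
    return fused, weights


# Hypothetical two-class example: the image branch is more confident.
image_logits = [2.0, 0.0]
tabular_logits = [0.0, 2.0]
fused, weights = fuse_logits([image_logits, tabular_logits], confidences=[1.0, 0.0])
```

Here the two modalities disagree, and the fused logits lean toward the class preferred by the modality with the higher confidence.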

2605.10484 2026-05-12 cs.CV cs.RO Updated version

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

Gang Chen, Sebastián Barbas Laina, Stefan Leutenegger, Javier Alonso-Mora

Affiliations * Autonomous Multi-Robots Lab, Department of Cognitive Robotics, School of Mechanical Engineering, Delft University of Technology, 2628 CD, Delft, Netherlands; Mobile Robotics Lab, School of Computation, Information and Technology, Technical University of Munich

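AI summary This paper proposes OpenSGA, an efficient 3D scene graph alignment framework that addresses object-level relocalization and map fusion when robots revisit scenes in open environments. The method fuses vision-language, textual, and geometric features with spatial context to align scene graphs accurately even under large coordinate discrepancies. The authors also build ScanNet-SG, a large-scale dataset with over 700k samples and a rich set of object categories, which strengthens training and evaluation for scene graph alignment. Experiments show that the method achieves the best overall performance on both frame-to-scan (F2S) and subscan-to-subscan (S2S) tasks.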

Comments 13 figures

Details
English abstract

Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.

2605.10470 2026-05-12 cs.CV Updated version

Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution

Jinyi Luo, Minghao Liu, Yifan Li, Zejia Fan, Jiaying Liu

Affiliations * Wangxuan Institute of Computer Technology, Peking University

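AI summary Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity. This paper presents the first theoretical modeling of multi-modal SR, reveals shortcomings in how existing methods utilize modalities, and proposes the Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR), built on dynamic modality fusion. A spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism enable more precise risk control and optimized modality contributions. Experiments show significant gains in generalization and semantic consistency.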

Details
English abstract

Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity, as widely recognized in both empirical and theoretical studies. Although recent semantic-guided and multi-modal SR methods exploit large models or external priors to enhance semantic alignment, the fusion of heterogeneous modalities remains insufficiently understood in practice and theory. In this work, we provide the first theoretical modeling of multi-modal SR, revealing that prior methods are bottlenecked by sub-optimal modality utilization. Our analysis shows that the generalization risk bound can be improved by strengthening the alignment between modality weights and their effective contributions, while reducing representation complexity. This theoretical insight motivates the Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR), which employs generalization-oriented dynamic modality fusion for accurate risk control and modality contribution optimization. In detail, we propose a spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism, enabling flexible and adaptive spatial-temporal modality weighting for effective risk control. Extensive experiments demonstrate that M$^3$ESR significantly improves generalization and semantic consistency.

2605.10464 2026-05-12 cs.CV Updated version

Automated Detection of Abnormalities in Zebrafish Development

Sarath Sivaprasad, Hui-Po Wang, Anna-Lisa Jäckel, Jonas Baumann, Carole Baumann, Jennifer Herrmann, Mario Fritz

Affiliations * CISPA Helmholtz Center for Information Security; Helmholtz Institute for Pharmaceutical Research Saarland

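AI summary This paper proposes a method for automatically detecting abnormalities in zebrafish embryonic development. To address the inefficiency of current manual assessment, the authors build a large dataset of high-resolution microscopic image sequences covering both normal development and compound exposure, with fine-grained temporal annotations. They also introduce a transformer-based model that integrates spatiotemporal features to predict developmental abnormalities early, reaching 98% accuracy on fertility classification (egg viability) and 92% on toxicity assessment, providing an effective tool for automated zebrafish toxicity analysis.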

Details
English abstract

Zebrafish embryos are a valuable model for drug discovery due to their optical transparency and genetic similarity to humans. However, current evaluations rely on manual inspection, which is costly and labor-intensive. While machine learning offers automation potential, progress is limited by the lack of comprehensive datasets. To address this, we introduce a large-scale dataset of high-resolution microscopic image sequences capturing zebrafish embryonic development under both control conditions and exposure to a compound (3,4-dichloroaniline). This dataset, with expert annotations at fine-grained temporal levels, supports two benchmarking tasks: (1) fertility classification, assessing zebrafish egg viability (130,368 images), and (2) toxicity assessment, detecting malformations induced by toxic exposure over time (55,296 images). Alongside the dataset, we present the first transformer-based baseline model that integrates spatiotemporal features to predict developmental abnormalities at early stages. Experimental results demonstrate the model's effectiveness: it achieves 98% accuracy in fertility classification and 92% in toxicity assessment. These findings underscore the potential of automated approaches to enhance zebrafish-based toxicity analysis.

2605.10449 2026-05-12 cs.CV Updated version

Automated high-frequency quantification of fish communities and biomass using computer vision

Kota Ishikawa, Takuma Masui, Keita Koeda, Rickdane Gomez, Lucas Yutaka Kimura, Michio Kondoh

Affiliations * Graduate School of Life Sciences, Tohoku University; Advanced Institute for Marine Ecosystem Change (WPI-AIMEC), Tohoku University; Graduate School of Science and Engineering, University of the Ryukyus; Faculty of Science, University of the Ryukyus

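AI summary This study proposes an automated computer-vision method for high-frequency quantification of underwater fish community structure and biomass. It combines deep-learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species, abundance, and biomass from video captured by a stereo camera system. The authors monitored a reef fish community over a 20-day period, showing the method's strength in capturing dynamic changes in species richness, abundance, and biomass and validating its effectiveness for non-invasive, continuous monitoring.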

Comments 21 pages, 3 figures, supplementary information under Ancillary files

Details
English abstract

Quantifying fish community structure is essential for understanding biodiversity and ecosystem responses in a changing environment, yet existing survey methods provide limited high-frequency, quantitative observations. Conventional approaches, including catch-based methods, underwater visual censuses, and environmental DNA metabarcoding, either require intensive labor or lack reliable estimates of abundance and biomass. Here, we develop an automated framework for quantifying fish communities from underwater video using computer vision. Using videos acquired with a custom-made stereo camera system, the framework integrates deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species-level abundance and biomass. We applied the approach to a reef fish community over a 20-day period with hourly daytime observations, revealing dynamic fluctuations in species richness, abundance, and biomass associated with changes in species composition. By comparing fish communities estimated from visual census and environmental DNA surveys, we demonstrate that our method provides complementary strengths for continuous, non-invasive, and quantitative monitoring of consistently observed species. This approach provides a scalable foundation for long-term monitoring and advances the capacity to resolve fine-scale temporal dynamics in fish communities.

2605.10445 2026-05-12 cs.CV Updated version

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

Zijun Shen, Sihan Yang, Ruichuan An, Ziyu Guo, Hao Liang, Ming Lu, Renrui Zhang, Wentao Zhang

Affiliations * Peking University; Nanjing University; CUHK; Zhongguancun Academy

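AI summary This paper proposes Sync-R1, an end-to-end reinforcement learning framework that bridges personalized understanding and generation through joint optimization. It introduces Sync-GRPO and Dynamic Group Scaling (DGS) to strengthen cross-task synergy and improve training efficiency, and builds the UnifyBench++ dataset to better reflect real-world scenarios. Experiments show that Sync-R1 performs strongly in cross-task reasoning and personalized generation without requiring complex cold-start procedures.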

Details
English abstract

Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: https://github.com/arctanxarc/UniCTokens.

2605.10439 2026-05-12 cs.CV Updated version

Filtering Memorization from Parameter-Space in Diffusion Models

Yu Zhe, Yang Jiayan, Wei Junhao, Yu-Lin Tsai, Wang Chen

Affiliations * RIKEN AIP; Science of Tokyo; University of California, Berkeley; Zhejiang University

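AI summary This paper studies the problem that Low-Rank Adaptation (LoRA) modules in diffusion models can memorize training images, causing generated content to leak copyrighted or sensitive information. The authors propose Base-Anchored Filtering (BAF), a training-free and data-free post-hoc method that decomposes LoRA updates into spectral channels, measures their alignment with the principal subspace of the pretrained backbone, and filters out channels that may carry memorized content. Experiments show that BAF reduces memorization across multiple datasets and diffusion backbones while preserving or improving generation quality.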

Details
English abstract

Low-Rank Adaptation (LoRA) has become a widely used mechanism for customizing diffusion models, enabling users to inject new visual concepts or styles through lightweight parameter updates. However, LoRAs can memorize training images, causing generated outputs to reproduce copyrighted or sensitive content. This risk is particularly concerning in LoRA-sharing ecosystems, where users distribute trained LoRAs without releasing the underlying training data. Existing approaches for mitigating memorization rely on access to the training pipeline, training data, or control over the inference process, making them difficult to apply when only the released LoRA weights are available. We propose Base-Anchored Filtering (BAF), a training-free and data-free framework for post-hoc memorization mitigation in diffusion LoRAs. BAF decomposes LoRA updates into spectral channels and measures their alignment with the principal subspace of the pretrained backbone. Channels strongly aligned with this subspace are retained as generalizable adaptations, while weakly aligned channels are suppressed as potential carriers of memorized content. Experiments on multiple datasets and diffusion backbones demonstrate that BAF consistently reduces memorization while preserving or even improving generation quality. Our code is available in the supplementary material.
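
The retain-or-suppress step can be sketched as follows. This is a toy illustration under stated assumptions, not the BAF implementation: the spectral decomposition is assumed to be done already, so each channel is given as a (singular value, unit direction) pair; the base model's principal subspace is given as an orthonormal basis; alignment is taken to be the squared norm of the projection onto that basis; and a fixed `threshold` decides which channels are kept. All names and values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def subspace_alignment(direction, basis):
    """Squared norm of the projection of a unit vector onto an orthonormal basis."""
    return sum(dot(direction, b) ** 2 for b in basis)


def filter_channels(channels, basis, threshold):
    """Keep (sigma, direction) channels whose alignment meets the threshold."""
    kept = []
    for sigma, direction in channels:
        if subspace_alignment(direction, basis) >= threshold:
            kept.append((sigma, direction))
    return kept


# Hypothetical base principal subspace: the first two axes of R^3.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Hypothetical LoRA spectral channels: (singular value, unit direction).
channels = [
    (3.0, [1.0, 0.0, 0.0]),  # fully inside the subspace, alignment 1.0
    (2.0, [0.0, 0.0, 1.0]),  # orthogonal to the subspace, alignment 0.0
    (1.0, [0.6, 0.0, 0.8]),  # partially aligned, alignment about 0.36
]
kept = filter_channels(channels, basis, threshold=0.5)
```

With this threshold only the first channel survives. A filtered update would then be rebuilt from the kept channels alone.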

2605.10438 2026-05-12 cs.LG cs.CV Updated version

Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure

Xiang Chen, Alexander Binder

Affiliations * DSC ScaDS.AI, Leipzig University; Institute for Cancer Genetics and Informatics (ICGI), Oslo, Norway; ICT Cluster, Singapore Institute of Technology, Singapore

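AI summary Most current 3D tokenizers treat representation as spatial compression: they can reconstruct surface geometry but leave component ownership and attachment validity implicit. This paper proposes interface-centric generative states, which frame tokenization as constructing an operational state rather than a passive compressed code, so that local geometry, component ownership, and attachment validity can be queried, constrained, and repaired during decoding. By introducing Component-Conditioned Canonical Local Tokens (C2LT-3D), the method improves structural robustness in open-world multi-component settings and shows that its latent states are effective for assembly-level structural reasoning.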

Details
English abstract

Current 3D tokenizers largely treat representation as spatial compression: compact codes reconstruct surface geometry, but leave component ownership and attachment validity implicit. In open-world assets with intersecting components, noisy topology, and weak canonical structure, this creates a representation mismatch: local shape, component identity, and assembly relations become entangled in a latent stream and are not natively addressable during decoding. We formulate an alternative view, interface-centric generative states, in which tokenization constructs an operational state rather than a passive compressed code. The state exposes local geometry, component ownership, and attachment validity as variables that can be queried, constrained, and repaired during decoding. We instantiate this formulation with Component-Conditioned Canonical Local Tokens (C2LT-3D), factorizing representation into canonical local geometry, partition-conditioned context, and relational seam variables. Each factor targets a distinct failure mode of compression-centric tokens: pose leakage, cross-component interference, or invalid local attachment. This exposed state supports attachment validation, latent structural repair, targeted intervention, and constrained serialization without a separate post-hoc structure recovery module. Trained on single-object CAD models and evaluated zero-shot on open-world multi-component assets, C2LT-3D improves structural robustness and shows that its latent variables remain actionable under adversarial attachment settings. These results suggest that open-world 3D generative representations should be evaluated not only by reconstruction fidelity, but by whether their discrete states remain operational for assembly-level structural reasoning.

2605.10434 2026-05-12 cs.CV Updated version

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Keming Wu, Yijing Cui, Wenhan Xue, Qijie Wang, Xuan Luo, Zhiyuan Feng, Zuhao Yang, Sudong Wang, Sicong Jiang, Haowei Zhu, Zihan Wang, Ping Nie, Wenhu Chen, Bin Wang

Affiliations * Tsinghua University; Nanyang Technological University; University of Waterloo; Hong Kong University of Science and Technology (Guangzhou)

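AI summary This paper proposes WorldReasonBench, which evaluates video generation models as predictors of future world states, focusing on reasoning about physical, social, logical, and informational consistency. The benchmark contains 436 structured test cases and uses a human-aligned two-part evaluation that separately verifies the reasoning process and video quality. The study reveals a significant gap between visual plausibility and world reasoning in current video generators and provides WorldRewardBench for reward-model evaluation, supporting research on more genuinely world-aware video generation.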

Comments Project Page: https://unix-ai-lab.github.io/WorldReasonBench/

Details
English abstract

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.

2605.10409 2026-05-12 cs.CV Updated version

Progressive Photorealistic Simplification

Adi Rosenthal, Dana Berman, Yedid Hoshen, Ariel Shamir

Affiliations * Reichman University and Google; Google Israel; Hebrew University and Google; Google

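AI summary This paper proposes progressive photorealistic simplification, which reduces visual complexity while preserving image realism. The method combines semantic understanding with generative editing, using vision-language models to identify and prioritize elements for removal and a learned verifier to ensure photorealism and coherence during simplification. The authors further distill the process into an image-to-video generation model that produces coherent simplification sequences directly from a single image, enabling applications such as content-aware decluttering and semantic layer decomposition.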

Details
English abstract

Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.

2605.10404 2026-05-12 cs.CV Updated version

Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable

Tianyuan Zou, Liang Yue, Yang Liu, Ya-Qin Zhang, Sijie Cheng

Affiliations * Institute for AI Industry Research, Tsinghua University, Beijing, China; RayNeo.AI, Shenzhen, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China

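AI summary As always-on hardware such as smart glasses and body cameras becomes more widespread, life-logging video streams are becoming a core component of persistent AI systems. These streams can substantially improve system utility, but they also bring serious privacy risks by exposing sensitive information such as behavioral patterns, emotional states, and social interactions. Existing privacy protections are either attack-specific or incur substantial utility loss and do not consider the full data processing pipeline, so the privacy-utility trade-off in life-logging video streams is a foundational challenge that next-generation AI systems must address.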

Comments 19 pages, 7 figures

Details
English abstract

With the growing prevalence of always-on hardware such as smart glasses, body cameras, and home security systems, life-logging visual sensing is becoming inevitable, forming the backbone of persistent, always-on AI systems. Meanwhile, recent advances in proactive agents and world models signal a fundamental shift from episodic, prompt-driven tools to next-generation AI systems that continuously perceive and react to the physical world. Although life-logging video streams can substantially improve the utility of these promising systems, they also introduce significant privacy risks by revealing sensitive information, such as behavioral patterns, emotional states, and social interactions, beyond what isolated images expose. If unresolved, these risks may undermine public trust and hinder the sustainable development of always-on AI technologies. Existing privacy protections are either attack-specific or incur substantial utility loss, and they fail to consider the entire data exploitation pipeline. We therefore posit that the privacy-utility trade-off in life-logging video streams is a foundational challenge for next-generation AI systems that demands further investigation. We call for novel pipeline-aware privacy-preserving designs that jointly optimize utility and privacy for long-horizon life-logging visual data. In parallel, formal privacy leakage metrics and standardized benchmarks remain important open directions for future research.

2605.10397 2026-05-12 cs.CV cs.AI Updated version

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

Xi Jiang, Yinjie Zhao, Zesheng Yang, Feng Zheng

Affiliations * Department of Computer Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China; School of EEE, Nanyang Technological University (NTU), Singapore; CFAR, Agency for Science, Technology and Research (A*STAR), Singapore

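AI summary Visual anomaly detection matters in fields such as industrial inspection and medical imaging, but differences in data modalities and annotation standards across domains make models trained on a single domain hard to transfer. This paper proposes AnomalyClaw, a training-free visual anomaly detection agent that improves judgment reliability through a multi-round refutation mechanism, drawing on 13 tools for visual verification and reference parsing. Experiments show that AnomalyClaw clearly outperforms single-step inference on multiple cross-domain datasets and that a self-evolution mechanism further improves detection performance.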

Comments We release the agent, the benchmark, and the analysis artifacts at https://github.com/jam-cc/AnomalyClaw

Details
English abstract

Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer VAD models trained on a single domain. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference: +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension, which builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improves the anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
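
For readers unfamiliar with the reported metric, the sketch below shows how a macro-AUROC and a gain in percentage points (pp) can be computed. It assumes that "macro" means an unweighted mean of per-dataset AUROC values and that scores are sample-level anomaly scores. The benchmark's actual protocol is not described in the abstract, and the data here is made up for illustration.

```python
def auroc(labels, scores):
    """AUROC as the chance that a random anomaly outscores a random normal sample.

    labels: 1 = anomalous, 0 = normal. Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(datasets):
    """Unweighted mean of per-dataset AUROC values."""
    values = [auroc(labels, scores) for labels, scores in datasets]
    return sum(values) / len(values)


def gain_pp(new, old):
    """Difference between two AUROC values, in percentage points."""
    return (new - old) * 100.0


# Hypothetical toy data: (labels, scores) per dataset.
datasets = [
    ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),  # perfectly separated
    ([0, 1, 0, 1], [0.1, 0.2, 0.3, 0.4]),  # one anomaly ranked below a normal
]
baseline = macro_auroc(datasets)
```

Because every dataset contributes equally to the macro average, a small dataset with a large gain can move the headline number as much as a large one.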

2605.10394 2026-05-12 cs.CV Updated version

Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection

Andreas Goulas, Damianos Galanopoulos, Evlampios Apostolidis, Vasileios Mezaris

Affiliations * IDT-ITI

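AI summary This paper introduces a new task, sensational image detection, which determines whether an image contains shocking, provocative, or emotionally charged features intended to grab attention and trigger strong emotional responses. The authors build a benchmark dataset, Sens-VisualNews, containing 9,576 news images annotated for the presence or absence of various sensational concepts and events in their visual content. Using this dataset, they study the prompt sensitivity, performance, and robustness of several state-of-the-art multimodal large language models in zero-shot and fine-tuned settings.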

Comments Authors' Accepted Version; Accepted at IEEE ICIP 2026

Details
English abstract

The detection of sensational content in media items can be a critical filtering mechanism for identifying check-worthy content and flagging potential disinformation, since such content triggers physiological arousal that often bypasses critical evaluation and accelerates viral sharing. In this paper we introduce the task of sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features intended to grab attention and trigger strong emotional responses. To support research on this task, we create a new benchmark dataset, Sens-VisualNews, that contains 9,576 images from news items, annotated for the presence or absence of various sensational concepts and events in their visual content. Finally, using Sens-VisualNews, we study the prompt sensitivity, performance, and robustness of a wide range of open state-of-the-art multimodal LLMs in both zero-shot and fine-tuned settings.

2605.10391 2026-05-12 cs.CL cs.AI cs.CV Updated version

Phoenix-VL 1.5 Medium Technical Report

Team Phoenix: Arka Ray, Askar Ali Mohamed Jawad, Biondi Lee, Elijah Seah, Eva Lim, Fiona Teo, Grace Toh, Guang Xiang Teo, Jun En Tan, Jia Hui Bong, Jiale Wang, Jonathan Ng, Justin Tan, Kai Zhe Yew, Matthew Ong, Shun Yi Yeo, Wen Jett Lam, Wen Xiu Tan, Ze Yu Zhang, Gee Wah Ng, Chee Wee Ang; Mistral AI: Adrien Sadé, Guillaume Kunsch, Jia Sin Loh, Nicolas Schuhl, Rupert Menneer, Umar Jamil, Vincent Maladière, Yimu Pan

Affiliations * Mistral AI

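AI summary This paper presents Phoenix-VL 1.5 Medium, a 123B-parameter localized multimodal and multilingual foundation model adapted to the Singapore context and regional languages. The model undergoes continued pretraining on a large localized multimodal corpus and post-training on data covering Singapore culture, law, and related domains, which substantially improves performance on Singapore-related tasks while maintaining strong performance on general multimodal, multilingual, and STEM tasks. The work also proposes a safety framework that includes localized knowledge evaluation and institutionally aligned behavior, offering a new approach to regional AI model development.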

Comments Release page: https://medium.com/htx-ai/introducing-phoenix-vl-1-5-medium-multimodal-intelligence-uniquely-singaporean-ef8214c8cfa1

Details
English abstract

We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion-token multimodal corpus, followed by a 250-billion-token long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and a curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22 billion tokens. An additional 5 billion tokens of model alignment was performed through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles and training methodology, and we highlight benchmark and inference performance.

2605.10388 2026-05-12 cs.CV cs.RO Updated version

Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

Yumao Liu, Tao Liu, Xiangyu Li, Jiaxiang Li, Ke Ma

Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

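AI summary This paper studies how temporal sampling frequency affects model performance in end-to-end autonomous driving trajectory prediction, challenging the common assumption that higher-frequency sampling always helps. The authors construct training sets at different frequencies and train and evaluate the same models under a fixed protocol to analyze the relationship between sampling frequency and prediction performance. They find that frequency responses depend on the model and dataset: smaller models often perform best at intermediate or lower frequencies, while a larger model such as AutoVLA performs better at the highest frequency, indicating that temporal sampling frequency should be tuned rather than fixed at the highest available value.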

Details
English abstract

End-to-end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, on the assumption that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training-set design variable. Starting from high-frequency E2E driving datasets, we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory. For each model-dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity-aware perspective. Sparse sampling may miss driving-relevant cues, while dense sampling may add redundant visual content and off-manifold noise. For finite-capacity models, this can create a driving-irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA-style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model- and dataset-dependent frequency responses. Smaller E2E models often show non-monotonic or near-plateau trends and achieve their best 3-second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3-second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration-matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.
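
A frequency sweep and the two reported metrics can be sketched as below. The sketch assumes stride-based subsampling from a hypothetical 10 Hz source and uses the standard definitions of ADE (mean pointwise Euclidean error) and FDE (error at the final point). The paper's actual source rates, target rates, and evaluation code are not given in the abstract, and the names `subsample` and `ade_fde` are illustrative.

```python
import math


def subsample(frames, source_hz, target_hz):
    """Keep every k-th frame, where k = source_hz // target_hz."""
    if target_hz <= 0 or source_hz % target_hz != 0:
        raise ValueError("target_hz must be positive and evenly divide source_hz")
    stride = source_hz // target_hz
    return frames[::stride]


def ade_fde(pred, truth):
    """Average and final displacement error between two 2D trajectories."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


# 2 s of frame indices at a hypothetical 10 Hz source rate.
frames = list(range(20))
sweep = {hz: subsample(frames, 10, hz) for hz in (10, 5, 2, 1)}

# Toy trajectories: the prediction drifts sideways by 0, 1, and 2 metres.
truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pred = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]
ade, fde = ade_fde(pred, truth)
```

Each entry of `sweep` would feed one training run under the same protocol, so that differences in ADE and FDE can be attributed to the sampling frequency.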

2605.10374 2026-05-12 cs.CV version update

Halo Separation-guided Underwater Multi-scale Image Restoration

Jiaxin Yang, Honglin Liu, Yongli Wang, Shuyi Cao, Chengcheng Jiang, Jiale Wang

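Affiliations * College of Information Science and Technology; Dalian Maritime University; College of Marine Electrical Engineering

AI summary This paper addresses halos caused by artificial light sources in images captured by autonomous underwater vehicles and proposes a single halo image correction method based on an iterative structure. Two sub-networks perform halo layer separation and multi-scale image restoration, respectively, improving the clarity and quality of underwater images. The network is trained on synthetic datasets and tested on real halo images, and a radial gradient constraint is introduced to further improve halo removal, offering a more robust approach to underwater image enhancement.

Details
English abstract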

Underwater images captured by Autonomous Underwater Vehicles (AUVs) are inevitably affected by artificial light sources, which often produce halos in the camera's foreground and seriously degrade image quality. Existing underwater image enhancement methods do not fully consider this problem, and their robustness on images taken under artificial light is poor. In practical applications, underwater image enhancement is already a very challenging task, and artificial light sources further degrade image quality and affect subsequent vision tasks. To deal with this problem effectively, this paper designs a single halo image correction method based on an iterative structure. The network is divided into two sub-networks: a halo layer separation sub-network, which separates the halo by gradient minimization, and a multi-scale recovery sub-network, which recovers the image information masked by the halo. The UIEB and EUVP synthetic datasets are used for training so that the network can learn the characteristics of underwater halo images. A large number of halo images taken in an underwater environment with real artificial light are then collected for testing. In addition, the brightness distribution of underwater halo images is analyzed, and a radial gradient constraint is introduced to guide halo removal and improve underwater image restoration.

2605.10362 2026-05-12 cs.CV version update

CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers

Alexey Pchelnikov, Aleksei Pchelnikov

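Affiliations * HistAI

AI summary CellDX AI Autopilot is a platform for training and deploying pathology image classification models through an AI agent, aiming to reduce the dependence of computational pathology on specialist skills and compute resources. It provides structured agent skills that guide users through dataset curation, hyperparameter optimization, multi-strategy model comparison, and human-in-the-loop deployment, with training based on a pre-built dataset of more than 32,000 cases and 66,000 H&E-stained whole-slide images. Its core contributions are an agent skill architecture designed for pathology tasks and a multiple instance learning framework, which improve training efficiency and ease of use.

Details
English abstract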

Training AI models for computational pathology currently requires access to expensive whole-slide-image datasets, GPU infrastructure, deep expertise in machine learning, and substantial engineering effort. We present CellDX AI Autopilot, a platform that lets users -- from pathologists with no ML background to ML practitioners running many parallel experiments -- train, evaluate, and deploy whole-slide image classifiers through natural language interaction with an AI agent. The platform provides a structured set of agent skills that guide the user through dataset curation, automated hyperparameter tuning, multi-strategy model comparison, and human-in-the-loop deployment, all on a pre-built dataset of over 32,000 cases and 66,000 H&E-stained whole-slide images with pre-extracted features. We describe the agent skill architecture, the underlying Multiple Instance Learning (MIL) training framework supporting four classification strategies, and an iterative pairwise hyperparameter search (grid or seeded random) that reduces tuning cost by over 30x compared to exhaustive search. CellDX AI Autopilot is, to our knowledge, the first system to expose pathology-specialized agent skills and a pathology-specialized training platform to general-purpose AI agents (e.g. any LLM-based agent runtime), delivering end-to-end automated model training without requiring the agent itself to be domain-specific. The platform addresses both the ML-expertise bottleneck that limits adoption in diagnostic pathology and the engineering bottleneck that limits how many experiments a researcher can run cost-effectively.

2605.10349 2026-05-12 cs.CV cs.AI cs.LG version update

Portable Active Learning for Object Detection

Rashi Sharma, Justin Timothy C. Bersamin, Karthikk Subramanian

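Affiliations * Panasonic R&D Center Singapore; Nanyang Technological University

AI summary This paper proposes PAL, a portable active learning framework for improving annotation efficiency in object detection. The method requires no changes to the detector's internals or training pipeline and selects data using only the model's inference outputs, combining class-wise instance uncertainty with image-level diversity to make the selected samples more informative and diverse. Experiments show that PAL outperforms existing active learning methods on multiple datasets, improving label efficiency and detection accuracy and offering a practical route to efficient object detection deployment.

Comments CVPR 2026 (highlight)

Details
English abstract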

Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
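
The entropy-based uncertainty score can be sketched as follows, assuming each proposal already has a true-positive probability from a class-specific logistic classifier. The mean aggregation over proposals and the names `binary_entropy`, `image_uncertainty`, and `pool` are illustrative assumptions; the abstract does not specify how PAL aggregates proposal scores, and the later diversity and similarity refinement is not reproduced here.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy (in bits) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def image_uncertainty(tp_probs):
    """Hypothetical image-level score: mean proposal entropy."""
    if not tp_probs:
        return 0.0
    return sum(binary_entropy(p) for p in tp_probs) / len(tp_probs)

# Assumed true-positive probabilities for the proposals of three unlabeled images.
pool = {
    "img_a": [0.98, 0.95],  # confident detections
    "img_b": [0.55, 0.40],  # ambiguous detections
    "img_c": [0.90, 0.50],
}
ranked = sorted(pool, key=lambda name: image_uncertainty(pool[name]), reverse=True)
```

Images at the front of `ranked` are the ones whose proposals the classifier is least sure about, which makes them candidates for annotation in this simplified setting.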

2605.10345 2026-05-12 cs.CV version update

BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

Wei Wang, Dou Quan, Ning Huyan, Shuang Wang, Yi Li, Pei He, Licheng Jiao

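Affiliations * Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University; School of Telecommunications, Xidian University; Department of Automation, Tsinghua University

AI summary This paper proposes BGG, a parameter-efficient adaptation framework built on a vision foundation model (VFM), to address geometric differences between cross-view images (such as drone and satellite images) and improve cross-view geo-localization (CVGL). With a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module, BGG improves the scale adaptability and viewpoint robustness of features and strengthens local structural features, achieving more accurate geo-localization at low training cost. Experiments show that BGG achieves state-of-the-art performance on multiple datasets.

Details
English abstract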

Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.

2605.10343 2026-05-12 cs.CV cs.AI version update

EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao, Junxi Wang, Xuyang Liu, Linfeng Zhang

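Affiliations * EPIC Lab, Shanghai Jiao Tong University; Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); Fudan University

AI summary This paper proposes EvoStreaming, a self-evolving framework for adapting offline video-language models (VideoLLMs) into streaming video assistants. The study finds that existing VideoLLMs have good visual understanding but lack an interaction policy for deciding when to respond in streaming scenarios. EvoStreaming has the model itself generate data, annotate relevance, and act as the response policy, synthesizing streaming interaction trajectories without external supervision; with very few samples it substantially improves streaming evaluation results while largely preserving offline performance.

Comments 33 pages, 9 figures

Details
English abstract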

Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observe that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only 1,000 self-generated samples (139× fewer than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to 10.8 points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.

2605.10334 2026-05-12 cs.CV version update

The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection

Andrii Yermakov, Jan Cech, Mario Fritz, Jiri Matas

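Affiliations * Czech Technical University in Prague; CISPA Helmholtz Center for Information Security

AI summary Recent deepfake detection methods have improved cross-dataset generalization, but the mechanisms behind this remain unclear. This paper proposes the Alpha Blending Hypothesis: current state-of-the-art frame-based detectors are in effect searching for alpha blending traces rather than learning semantic anomalies or the fingerprints of generative models. The study validates the hypothesis experimentally and proposes BlenD, a detection method based on a dataset of real face images augmented with self-blended images, which achieves the best cross-dataset generalization on multiple compositional deepfake datasets without using explicitly generated deepfakes in training.

Details
English abstract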

Recent deepfake detection methods demonstrate improved cross-dataset generalization, yet the underlying mechanisms remain underexplored. We introduce the Alpha Blending Hypothesis, positing that state-of-the-art frame-based detectors primarily function as alpha blending searchers; rather than learning semantic anomalies or specific generative neural fingerprints, they localize low-level compositing artifacts introduced during the integration of manipulated faces into target frames. We experimentally validate the hypothesis, demonstrating that deepfake detectors exhibit high sensitivity to the so-called self-blended images (SBI) and non-generative manipulations. We propose the method BlenD that leverages a large-scale, diverse dataset of real-only facial images augmented with SBI. This approach achieves the best average cross-dataset generalization on 15 compositional deepfake datasets released between 2019 and 2025 without utilizing explicitly generated deepfakes during training. Furthermore, we show that predictions from explicit blending searchers and models resilient to blending shortcuts are highly complementary, yielding a state-of-the-art AUROC of 94.0% in an ensemble configuration. The code with experiments and the trained model will be publicly released.
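
Alpha blending itself is the standard per-pixel composite `out = alpha * foreground + (1 - alpha) * background`. The sketch below applies it to tiny grayscale images and builds a toy self-blended image by compositing a brightness-shifted copy of an image back onto itself. The brightness gain is a stand-in for whatever transforms a real SBI pipeline uses; it and all names here are assumptions, not the BlenD implementation.

```python
def alpha_blend(foreground, background, alpha_mask):
    """Per-pixel composite for grayscale images stored as nested lists in [0, 1]."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha_mask)
    ]

def self_blend(image, alpha_mask, gain=1.1):
    """Toy self-blended image: blend a brightness-shifted copy onto the original.

    The gain transform is an illustrative assumption.
    """
    shifted = [[min(1.0, gain * v) for v in row] for row in image]
    return alpha_blend(shifted, image, alpha_mask)

real = [[0.2, 0.4], [0.6, 0.8]]  # assumed 2x2 "real" image
mask = [[0.0, 0.5], [1.0, 0.0]]  # alpha = 0 keeps the original pixel
fake = self_blend(real, mask)
```

Pixels where the mask is zero are untouched, so the only evidence of manipulation lies where the mask is non-zero, which is the kind of compositing artifact the hypothesis says detectors localize.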

2605.10319 2026-05-12 cs.CV version update

LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency

Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Ko Watanabe, Riku Takahashi, Issey Sukeda, Andreas Dengel

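Affiliations * RPTU Kaiserslautern-Landau & DFKI GmbH, Kaiserslautern, Germany; Faculty of Science Engineering, Hosei University, Tokyo, Japan; EQUES, Tokyo, Japan

AI summary This paper proposes LimeCross, a training-free, context-conditioned layered image editing framework that edits user-selected RGBA layers according to text instructions while keeping unselected layers unchanged. It uses a bi-stream attention mechanism to draw on contextual information from other layers, preserving cross-layer consistency and preventing contamination of the edited layers. The study also introduces the LayerEditBench dataset and evaluation protocol; experiments show that LimeCross outperforms existing methods in layer purity and composite realism, offering a new layered editing paradigm for controllable generative creation.

Details
English abstract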

Layered image assets are widely used in real-world creative workflows, enabling non-destructive iteration and flexible re-composition. Recent advances in layered image generation and decomposition synthesize or recover layered representations, yet controllable editing of layered images remains challenging. Manual editing requires careful coordination across layers to maintain consistent illumination and contact, while AI-based pipelines collapse layers into a flattened image for editing, then decompose them again, introducing background-to-foreground leakage and unstable transparency. To address these limitations, we propose LimeCross, a training-free context-conditioned layered image editing framework that edits user-selected RGBA layers according to text while keeping the remaining layers unchanged. It leverages contextual cues from other layers using a bi-stream attention mechanism to preserve cross-layer consistency, while explicitly maintaining layer integrity to prevent the contamination of edited layers. To evaluate our approach, we introduce LayerEditBench, a benchmark of 1500 layered scenes with paired source/target prompts, along with evaluation protocols that assess both edit fidelity and alpha channel stability. Extensive experiments demonstrate that LimeCross improves layer purity and composite realism over strong editing baselines, establishing context-conditioned layered editing as a principled framework for controllable generative creation.

2605.10307 2026-05-12 cs.CV cs.GR cs.RO version update

PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction

Yinan Deng, Jianyu Dou, Jiahui Wang, Jingyu Zhao, Yi Yang, Yufeng Yue

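AI summary Dynamic scene reconstruction is a fundamental yet challenging problem in computer vision and robotics. To achieve high-fidelity rendering and accurate tracking in scenes with complex motion, this paper proposes PaMoSplat, a new dynamic Gaussian splatting framework that combines part awareness with motion priors. By lifting multi-view segmentation masks into 3D and estimating part motion under optical-flow guidance, PaMoSplat achieves higher-quality rendering and more accurate tracking, and outperforms existing methods in performance and convergence speed across several real-world scenes.

Comments Accepted by TCSVT. Project URL: https://pamosplat.github.io

Details
English abstract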

Dynamic scene reconstruction represents a fundamental yet demanding challenge in computer vision and robotics. While recent progress in 3DGS-based methods has advanced dynamic scene modeling, obtaining high-fidelity rendering and accurate tracking in scenarios with substantial, intricate motions remains significantly challenging. To address these challenges, we propose PaMoSplat, a novel dynamic Gaussian splatting framework incorporating part awareness and motion priors. Our approach is grounded in two key observations: 1) Parts serve as primitives for scene deformation, and 2) Motion cues from optical flow can effectively guide part motion. Specifically, PaMoSplat initializes by lifting multi-view segmentation masks into 3D space via graph clustering, establishing coherent Gaussian parts. For subsequent timestamps, we leverage a differential evolutionary algorithm to estimate the rigid motion of these parts using multi-view optical flow cues, providing a robust warm-start for further optimization. Additionally, PaMoSplat introduces an adaptive iteration count mechanism, internal learnable rigidity, and flow-supervised rendering loss to accelerate and optimize the training process. Comprehensive evaluations across diverse scenes, including real-world environments, demonstrate that PaMoSplat delivers superior rendering quality, improved tracking precision, and faster convergence compared to existing methods. Furthermore, it enables multiple part-level downstream applications, such as 4D scene editing.

2605.10275 2026-05-12 cs.CV version update

PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction

Chenggong Li, Yidong Luo, Junchao Zhang, Boxin Shi, Degui Yang

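Affiliations * School of Automation, Central South University; Hunan Provincial Key Laboratory of Optic-Electronic Intelligent Measurement and Control; Zhejiang University; School of Engineering, Westlake University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; National Engineering Research Center of Visual Technology, School of Computer Science, Peking University

AI summary This paper proposes PolarVSR, a unified space-time polarization video reconstruction framework, to address the challenging inverse problem of recovering polarization parameters from mosaic arrays in mainstream division-of-focal-plane polarization imaging. The method jointly models polarization directions in space and time and combines them with a polarization-aware implicit neural representation to achieve continuous, high-fidelity super-resolution reconstruction. It also introduces a flow-guided polarization variation loss to supervise polarization dynamics and establishes the first large-scale color DoFP polarization video benchmark; experiments validate the method's effectiveness.

Details
English abstract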

Polarimetric imaging captures surface polarization characteristics, such as the Degree of Linear Polarization (DoLP) and the Angle of Polarization (AoP). In mainstream Division-of-Focal-Plane (DoFP) color polarization imaging, recovering polarization parameters from captured mosaic arrays remains a challenging inverse problem. Existing DoFP cameras also face hardware bottlenecks and often cannot support high-frame-rate acquisition, limiting polarimetric imaging in dynamic video tasks. These limitations motivate joint spatial and temporal enhancement. To this end, we propose the first space-time polarization video reconstruction architecture. The method jointly models polarization directions in space and time and uses a polarization-aware implicit neural representation for continuous, high-fidelity upsampling. By analyzing temporal variations in polarization parameters, we further introduce a flow-guided polarization variation loss to supervise polarization dynamics. We also establish the first large-scale color DoFP polarization video benchmark to support this research direction. Extensive experiments on this benchmark demonstrate the effectiveness of the method.
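
For readers unfamiliar with DoLP and AoP, the sketch below computes both from the four intensities behind 0°, 45°, 90°, and 135° polarizers, using the standard linear Stokes parameters. It assumes the four values are already available at the same pixel, which is exactly what demosaicing a DoFP mosaic has to estimate; function names and sample intensities are illustrative and not taken from the paper.

```python
import math

def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2

def dolp_aop(i0, i45, i90, i135):
    """Degree of Linear Polarization and Angle of Polarization (radians)."""
    s0, s1, s2 = linear_stokes(i0, i45, i90, i135)
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aop = 0.5 * math.atan2(s2, s1)
    return dolp, aop

# Assumed sample pixels: fully polarized at 0 degrees, at 45 degrees, and unpolarized.
horizontal = dolp_aop(1.0, 0.5, 0.0, 0.5)
diagonal = dolp_aop(0.5, 1.0, 0.5, 0.0)
unpolarized = dolp_aop(0.5, 0.5, 0.5, 0.5)
```

Because DoLP and AoP are nonlinear in the measured intensities, small interpolation errors in the mosaic can produce large errors in these quantities, which is one reason the reconstruction problem is hard.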

2605.10269 2026-05-12 cs.CV cs.RO version update

Increasing the Efficiency of DETR for Maritime High-Resolution Images

Tinsae Yehuala, Hao Cheng, Ville Lehtola

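Affiliations * Dept. of Earth Observation Science, ITC Faculty, University of Twente

AI summary This paper targets object detection in high-resolution images for the safe navigation of unmanned surface vessels (USVs) and studies how to make DETR more efficient. The authors adopt Vision Mamba (ViM), which is based on state space models (SSMs), as the backbone and combine serialized image patch processing with a Feature Pyramid Network design to improve detection of distant objects, small objects, and large scale variations. With optimizations such as token pruning, the method reduces compute and memory overhead while maintaining detection accuracy, offering a more efficient approach to real-time maritime object detection.

Comments Accepted to IEEE ITSC 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. DOI to be added upon publication

Details
English abstract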

Maritime object detection is critical for the safe navigation of unmanned surface vessels (USVs), requiring accurate recognition of obstacles from small buoys to large vessels. Real-time detection is challenging due to long distances, small object sizes, large-scale variations, edge computing limitations, and the high memory demands of high-resolution imagery. Existing solutions, such as downsampling or image splitting, often reduce accuracy or require additional processing, while memory-efficient models typically handle only limited resolutions. To overcome these limitations, we leverage Vision Mamba (ViM) backbones, which build on State Space Models (SSMs) to capture long-range dependencies while scaling linearly with sequence length. Images are tokenized into sequences for efficient high-resolution processing. For further computational efficiency, we design a tailored Feature Pyramid Network with successive downsampling and SSM layers, as well as token pruning to reduce unnecessary computation on background regions. Compared to state-of-the-art methods like RT-DETR with ResNet50 backbone, our approach achieves a better balance between performance and computational efficiency in maritime object detection.

2605.10251 2026-05-12 cs.CV version update

Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation

Ishan Narayan

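Affiliations * IMCS Lab, CSIR-CSIO

AI summary This paper proposes GraphDepth, a monocular depth estimation architecture that introduces graph neural networks (GNNs) into a convolutional encoder-decoder framework to model long-range spatial relationships that local convolutions struggle to capture. The method embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone and adds channel-attention gated skip connections and a heteroscedastic uncertainty estimation module, improving the accuracy and robustness of depth estimation. Experiments show that, compared with transformer-based hybrid models, GraphDepth is more computationally efficient while keeping a comparable global receptive field, and performs well on several benchmark datasets.

Details
English abstract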

We present GraphDepth, a monocular depth estimation architecture that synergistically integrates Graph Neural Networks (GNNs) within a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at bottleneck and decoder stages (1/32, 1/16, 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic uncertainty estimation via a dedicated aleatoric uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, which suffer from quadratic complexity in sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks demonstrate competitive accuracy within 4.6% of state-of-the-art transformers on indoor scenes with substantially lower computational cost (25 FPS vs 9 FPS, 3.8 GB vs 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, validating the generalization power of explicit relational reasoning for depth estimation.
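
How iterative message passing widens the receptive field on a grid graph can be shown with a toy example. The sketch uses a 4-connected grid adjacency and a mean aggregator with fixed scalar weights in place of learned weight matrices; these simplifications, the function names, and the 2x2 feature map are assumptions for illustration, not the GraphDepth layers.

```python
def grid_adjacency(height, width):
    """4-connected neighbours for a height x width map, nodes indexed row-major."""
    neighbours = {}
    for r in range(height):
        for c in range(width):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            neighbours[r * width + c] = [
                rr * width + cc
                for rr, cc in candidates
                if 0 <= rr < height and 0 <= cc < width
            ]
    return neighbours

def sage_mean_step(features, neighbours, w_self=0.5, w_nbr=0.5):
    """One mean-aggregator step: ReLU(w_self * x_i + w_nbr * mean of neighbours).

    Scalar weights stand in for the learned matrices of a real GraphSAGE layer.
    """
    out = []
    for i, x in enumerate(features):
        nbrs = neighbours[i]
        mean = sum(features[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        out.append(max(0.0, w_self * x + w_nbr * mean))
    return out

neighbours = grid_adjacency(2, 2)
features = [1.0, 0.0, 0.0, 0.0]  # a single active node in the top-left corner
step1 = sage_mean_step(features, neighbours)
step2 = sage_mean_step(step1, neighbours)
```

After one step the signal has reached only the direct neighbours of the active node; after two steps it reaches the opposite corner. Each step touches every edge once, so the cost grows with the number of nodes rather than with its square.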

2605.10229 2026-05-12 cs.CV cs.CY version update

VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection

Xiaobin Hu, Enpu Zuo, Lanping Hu, Kaiwen Yang, Dianshu Liao, Tianyi Zhang, Bo Yin, Yinsi Zhou, Shidong Pan, Xiaoyu Sun

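Affiliations * National University of Singapore; Australian National University; New York University; The University of New South Wales

AI summary As visual data sharing becomes widespread, privacy protection has become an important requirement, yet existing privacy detection algorithms are held back by the lack of comprehensive datasets. This paper presents VPD-100K, a large-scale, fine-grained visual privacy dataset covering four domains (human presence, on-screen personally identifiable information, physical identifiers, and location indicators), with 100,000 images and more than 190,000 annotated object instances, characterized by long-tailed distributions, small objects, and high visual complexity. The study also designs a lightweight frequency-enhanced module that better captures subtle features of sensitive information; experiments show that the dataset and method perform well on several benchmarks.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Details
English abstract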

Privacy protection has become a critical requirement in the era of ubiquitous visual data sharing, imposing higher demands on efficient and robust privacy detection algorithms. However, current robust detection models are severely hindered by the lack of comprehensive datasets. Existing privacy-oriented datasets often suffer from limited scale, coarse-grained annotations, and narrow domain coverage, failing to capture the intricate details of sensitive information in real-world environments. To bridge this gap, we present a large-scale, fine-grained Visual Privacy Dataset (VPD-100K), designed to facilitate generalized privacy detection. We establish a holistic taxonomy comprising four primary domains: Human Presence, On-Screen Personally Identifiable Information (PII), Physical Identifiers, and Location Indicators, containing 100,000 images annotated with 33 fine-grained classes and over 190,000 object instances. Statistical analysis reveals that our dataset features long-tailed distributions, small object scales, and high visual complexity. These characteristics make the dataset particularly valuable for demanding, unconstrained applications such as live streaming, where actors frequently face unintentional, real-time information leakage. Furthermore, we design an effective frequency-enhanced lightweight module consisting of frequency-domain attention fusion and an adaptive spectral gating mechanism that breaks the limitations of spatial pixel intensity to better capture the subtle details of sensitive information. Extensive experiments conducted on both diverse image and streaming video benchmarks consistently demonstrate the effectiveness of our VPD-100K dataset and the well-curated frequency mechanism. The code and dataset are available at https://vpd-100k.github.io/.

2605.10210 2026-05-12 cs.RO cs.CV version update

Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation

Federico Pizzolato, Francesco Pasti, Nicola Bellotto

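Affiliations * Dept of Information Engineering, University of Padua

AI summary This paper studies how to achieve efficient terrain segmentation on tiny robots to support autonomous navigation in unstructured outdoor environments. To address the difficulty of deploying existing models on resource-constrained microcontrollers, the authors propose Nano-U, a lightweight binary segmentation network, and train it with quantization-aware distillation. The model performs well on multiple datasets and is deployed on a low-cost microcontroller through an extended compiler toolchain, achieving low-power, low-latency real-time terrain perception.

Comments Code repository: https://github.com/federico-pizz/Nano-U

Details
English abstract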

Terrain segmentation is a fundamental capability for autonomous mobile robots operating in unstructured outdoor environments. However, state-of-the-art models are incompatible with the memory and compute constraints typical of microcontrollers, limiting scalable deployment in small robotics platforms. To address this gap, we develop a complete framework for robust binary terrain segmentation on a low-cost microcontroller. At the core of our approach we design Nano-U, a highly compact binary segmentation network with a few thousand parameters. To compensate for the network's minimal capacity, we train Nano-U via Quantization-Aware Distillation (QAD), combining knowledge distillation and quantization-aware training. This allows the final quantized model to achieve excellent results on the Botanic Garden dataset and to perform very well on TinyAgri, a custom agricultural field dataset with more challenging scenes. We deploy the quantized Nano-U on a commodity microcontroller by extending MicroFlow, a compiler-based inference engine for TinyML implemented in Rust. By eliminating interpreter overhead and dynamic memory allocation, the quantized model executes on an ESP32-S3 with a minimal memory footprint and low latency. This compiler-based execution demonstrates a viable and energy-efficient solution for perception on low-cost robotic platforms.
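
As background for the quantized deployment, the sketch below shows generic affine int8 quantization: a float value is mapped to an 8-bit code through a scale and zero point, and mapped back with a small rounding error. It does not reproduce Quantization-Aware Distillation or MicroFlow; the helper names and sample weights are assumptions, and the abstract does not state the exact quantization scheme used for Nano-U.

```python
def quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Scale and zero point for affine quantization; the range is widened to include 0."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min) or 1.0  # avoid a zero scale
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Map a float to a clamped int8 code."""
    return max(q_min, min(q_max, int(round(x / scale)) + zero_point))

def dequantize(q, scale, zero_point):
    """Map an int8 code back to an approximate float."""
    return (q - zero_point) * scale

weights = [-0.5, -0.1, 0.0, 0.7, 1.5]  # assumed float weights
scale, zero_point = quant_params(min(weights), max(weights))
codes = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in codes]
```

Quantization-aware training simulates this rounding during training so that the network learns weights that survive it, which matters more as the parameter count shrinks.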

2605.10204 2026-05-12 cs.CV version update

3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects

Zhicheng Liang, Haoyi Yu, Boyan Li, Dayou Zhang, Zijian Cao, Tianyi Gong, Junhua Liu, Shuguang Cui, Fangxin Wang

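Affiliations * The Chinese University of Hong Kong, Shenzhen; Capital Normal University; University of Southern California

AI summary This paper introduces 3DReflecNet, a large-scale dataset designed for 3D vision methods that reconstruct objects with reflective, transparent, and low-texture surfaces. It contains more than 120,000 synthetic samples produced by physically based rendering and more than 1,000 real objects captured with consumer devices, totals over 22 TB, and covers diverse materials, complex lighting conditions, and geometric forms. The study also designs benchmarks for five core tasks, which reveal the limitations of existing methods on such challenging materials and motivate more robust 3D vision models.

Comments This paper has been accepted by CVPR 2026 Oral

Details
English abstract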

Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains notoriously challenging. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the availability of distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, and therefore provide limited insight into performance under real-world material complexities. We introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 120,000 synthetic instances generated via physically-based rendering of more than 12,000 shapes, and over 1,000 real-world objects captured using consumer devices. Together, these data comprise more than 7 million multi-view frames. The dataset spans diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based pipelines. To support robust evaluation, we design benchmarks for five core tasks: image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting. Extensive experiments demonstrate that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models.

2605.10190 2026-05-12 cs.CV version update

DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer

Soichiro Okazaki, Tatsuya Sasaki, Hiroki Ohashi

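Affiliations * Hitachi, Ltd. Research and Development Group

AI summary DetRefiner is a model-agnostic detection refinement framework for open-vocabulary object detection that aims to improve detection of both seen and unseen categories. It fuses global image features and local patch features with a lightweight Transformer encoder and produces attribute reliability information to calibrate the base detector's confidence. DetRefiner does not depend on the base model's internal features or on retraining and only provides auxiliary calibration of detection results at inference; it improves several open-vocabulary detectors on multiple datasets, with gains of up to +10.1 AP on unseen categories.

Comments CVPR 2026 Findings

Details
English abstract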

Open-vocabulary object detection (OVOD) aims to detect both seen and unseen categories, yet existing methods often struggle to generalize to novel objects due to limited integration of global and local contextual cues. We propose DetRefiner, a simple yet effective plug-and-play framework that learns to fuse global and local features to refine open-vocabulary detection. DetRefiner processes global image features and patch-level image features from foundational models (e.g., DINOv3) through a lightweight Transformer encoder. The encoder produces a class vector capturing image-level attributes and patch vectors representing local region attributes, from which attribute reliability is inferred to recalibrate the base model's confidence. Notably, DetRefiner is trained independently of the base OVOD model, requiring neither access to its internal features nor retraining. At inference, it operates solely on the base detector's predictions, producing auxiliary calibration scores that are merged with the base detector's scores to yield the final refined confidence. Despite this simplicity, DetRefiner consistently enhances multiple OVOD models across COCO, LVIS, ODinW13, and Pascal VOC, achieving gains of up to +10.1 AP on novel categories. These results highlight that learning to fuse global and local representations offers a powerful and general mechanism for advancing open-world object detection. Our codes and models are available at https://github.com/hitachi-rd-cv/detrefiner.

2605.10184 2026-05-12 cs.CV cs.AI version update

Developing a foundation model for high-resolution remote sensing data of the Netherlands

Paul Vermeeren, Heysem Kaya

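Affiliations * Utrecht University, Department of Information and Computing Sciences

AI summary This paper presents a foundation model built on high-resolution (1.2 m) satellite imagery of the Netherlands that combines a convolutional neural network with a Vision Transformer to capture fine textures, edges, and small objects as well as large terrain structures, elevation patterns, and land-cover distributions. By taking time-series data as input, the model learns contextual information across time and better models temporal dependencies such as topographic features, land-cover changes, and seasonal dynamics, which reduces feature ambiguity, strengthens representation learning, and improves generalization with few labeled samples. Experiments show that the model performs well on tasks such as vegetation monitoring in the Netherlands and is competitive with state-of-the-art models on several global benchmarks, demonstrating that general representations can be learned with limited data and parameters.

Comments 9 pages, 4 figures, under review in a journal

Details
English abstract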

We develop a foundation model using high-resolution (1.2 m) satellite images of the Netherlands. By combining a Convolutional Neural Network and a Vision Transformer, the model captures both high- and low-frequency landscape features: fine textures, edges, and small objects as well as large terrain structures, elevation patterns, and land-cover distributions. Leveraging temporal data as input, the model learns from broader contextual information across time, allowing it to exploit temporal dependencies such as topographic features, land-cover changes, and seasonal dynamics. These additional constraints reduce feature ambiguity, improve representation learning, and enable better generalization with fewer labeled samples. The foundation model is evaluated on multiple downstream tasks, ranging from use cases within the Netherlands to global benchmarking datasets. On the vegetation monitoring dataset of the Netherlands, the model shows clear performance improvements when it incorporates temporal information instead of relying on a single time point. Despite using a smaller model and less pretraining data limited to the Netherlands, it achieves competitive results on global benchmarks when compared to state-of-the-art models. These results demonstrate that the model can learn rich, generalizable representations from limited data while using a fraction of the parameters of larger state-of-the-art remote sensing models. To maximize reproducibility and reuse, we made the scripts and the model accessible on GitHub.

2605.10177 2026-05-12 cs.CV cs.AI cs.RO updated version

MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning

Guangli Chen, Dianzhao Li, Wenjian Zhong, Bangquan Xie, Ostap Okhrin

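Affiliations * Dongguan Key Laboratory of Intelligent Equipment and Smart Industry, School of Advanced Engineering, Great Bay University; Chair of Applied Statistics, Technische Universität Dresden; Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI); College of Automation, Guangdong University of Technology

AI summary This paper proposes MTA-RL, a framework that improves the robustness of urban autonomous driving through multi-modal Transformer-based 3D affordance representations and reinforcement learning. The method fuses RGB images and LiDAR point clouds into structured, geometry-aware affordance representations that serve as input to the reinforcement learning policy, improving decision efficiency and stability. Experiments show that MTA-RL outperforms existing methods in traffic scenarios of varying density and exhibits strong zero-shot generalization in unseen urban environments.

Details
Abstract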

Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, RGB images and LiDAR point clouds are fused using a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying densities (20-60 background vehicles) show that MTA-RL consistently outperforms state-of-the-art baselines. Trained solely on Town03, our method demonstrates superior zero-shot generalization in unseen towns, achieving up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Furthermore, ablation studies confirm that our multi-modal fusion and reward shaping are critical, significantly outperforming image-only and unshaped variants, demonstrating the effectiveness of MTA-RL for robust urban autonomous driving.

2605.10174 2026-05-12 cs.CV updated version

BathyFacto: Refraction-Aware Two-Media Neural Radiance Fields for Bathymetry

Markus Brezovsky, Anatol Günthner, Frederik Schulte, Lukas Winiwarter, Boris Jutzi, Gottfried Mandlburger

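Affiliations * Department of Geodesy and Geoinformation, TU Wien; Institute of Photogrammetry and Remote Sensing (IPF), Karlsruhe Institute of Technology (KIT); Unit of Geometry and Surveying, University of Innsbruck

AI summary BathyFacto is a refraction-aware two-media neural radiance field method for bathymetry that addresses the depth bias caused by light refraction in conventional photogrammetric reconstruction of underwater scenes. It combines a medium-conditioned color head with a hash-grid-based density field and uses Snell's law to model the refracted ray path at the air-water interface, yielding more accurate underwater point clouds. Experiments show that BathyFacto substantially improves reconstruction accuracy and completeness on a simulated scene, outperforming both conventional methods and a neural radiance field baseline without refraction handling.

Comments 16 pages, 8 figures, 3 tables. Submitted to ISPRS Open Journal of Photogrammetry and Remote Sensing, Special Issue "3D Underwater Mapping from Above and Below"

Details
Abstract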

Through-water photogrammetry based on UAV imagery enables shallow-water bathymetry, but refraction at the air-water interface violates the straight-ray assumption of Structure-from-Motion and causes systematic depth bias. We present BathyFacto, a refraction-aware two-media extension of Nerfacto integrated into Nerfstudio that targets metrically precise underwater point clouds. BathyFacto uses a shared hash-grid-based density field with a medium-conditioned color head that receives a one-bit medium flag (air or water) and traces each camera ray as two segments: a straight segment in air up to a planar water surface and a refracted segment in water computed via Snell's law with known refractive indices. To allocate samples efficiently across the air-water boundary, we employ a single proposal-network sampler that operates on a virtual straight ray spanning both media, combined with a kinked density wrapper that transparently corrects water-segment positions along the refracted direction before density evaluation. A data adaptation pipeline converts photogrammetric reconstructions to a Nerfstudio-compatible format, estimates the water plane from boundary markers, and provides per-pixel medium masks to gate refraction. We also extend the point cloud export with refraction-corrected backprojection and reversible coordinate transforms to world and global frames. On a simulated two-media scene with known ground truth, BathyFacto with refraction achieves a Cloud-to-Mesh mean distance of 0.06 m and 87% completeness, compared to 0.52 m / 29% for the Nerfacto baseline and 0.36 m / 21% for conventional MVS without refraction correction.
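The two-segment ray described in this abstract can be illustrated with the vector form of Snell's law. The following is only a minimal sketch, not BathyFacto's implementation: the function names, the water plane at z = 0, and the refractive indices (1.0 for air, 1.333 for water) are assumptions chosen for illustration.

```python
import math

def refract(direction, normal, n_in, n_out):
    """Refract a unit ray direction at a surface (vector form of Snell's law).

    `normal` is a unit vector pointing back into the incoming medium.
    Returns None when total internal reflection occurs.
    """
    eta = n_in / n_out
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * d + (eta * cos_i - cos_t) * n
                 for d, n in zip(direction, normal))

def point_on_kinked_ray(origin, direction, distance, water_z=0.0,
                        n_air=1.0, n_water=1.333):
    """Position at `distance` along a ray that bends at the plane z = water_z.

    Hypothetical helper: the camera is assumed to sit above the water and
    the indices are illustrative defaults, not values from the paper.
    """
    if direction[2] >= 0.0 or origin[2] <= water_z:
        # The ray never enters the water from above: plain straight ray.
        return tuple(o + distance * d for o, d in zip(origin, direction))
    to_surface = (water_z - origin[2]) / direction[2]
    if distance <= to_surface:
        return tuple(o + distance * d for o, d in zip(origin, direction))
    hit = tuple(o + to_surface * d for o, d in zip(origin, direction))
    bent = refract(direction, (0.0, 0.0, 1.0), n_air, n_water)
    rest = distance - to_surface
    return tuple(h + rest * b for h, b in zip(hit, bent))
```

Sampling positions on the straight "virtual" ray and then mapping the water-segment samples through a function like `point_on_kinked_ray` is one way to picture the position correction the abstract mentions.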

2605.10172 2026-05-12 cs.CV cs.CL updated version

V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

Zhiwei Ning, Xuanang Gao, Jiaxi Cao, Gengming Zhang, Shengnan Ma, Wenwen Tong, Hanming Deng, Jie Yang, Wei Liu

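Affiliations * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; SenseTime Research; Institute of Medical Robotics, Shanghai Jiao Tong University

AI summary This paper proposes V-ABS, an action-observer driven beam search framework for complex multi-step tasks in dynamic visual reasoning. It introduces thinker-actor-observer iterations together with an entropy-based adaptive weighting algorithm, which mitigates the imagination-action-observer (IAO) bias and improves the stability and optimality of reasoning. Experiments show that V-ABS achieves leading performance on multiple benchmarks and clearly outperforms existing models.

Details
Abstract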

Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.

2605.10162 2026-05-12 cs.CV updated version

Active-SAOOD: Active Sparsely Annotated Oriented Object Detection in Remote Sensing Images

Yu Lin, Jianghang Lin, Kai Ye, Shengchuan Zhang, Liujuan Cao

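Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

AI summary This paper proposes Active-SAOOD, an active-learning-based method for sparsely annotated oriented object detection in remote sensing images, which aims to reduce annotation cost. Using a model state observation module, it jointly considers orientation, classification, and localization uncertainty as well as inter- and intra-class diversity at the instance level, and actively selects the sparse samples most valuable to the current model, enabling stable detection under completely randomly initialized sparse annotations. Experiments show that Active-SAOOD significantly improves the performance and stability of existing sparse-annotation methods on multiple datasets, with a 9% gain at an annotation ratio of only 1%.

Details
Abstract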

Reducing the annotation cost of oriented object detection in remote sensing remains a major challenge. Recently, sparse annotation has gained attention for effectively reducing annotation redundancy in dense remote sensing scenes. However, (1) the reliance of sparse data on class-dependent sampling and (2) the lack of in-depth investigation into the characteristics of sparse samples hinder its further development. This paper proposes an active learning-based sparsely annotated oriented object detection (SAOOD) method, termed Active-SAOOD. Based on a model state observation module, Active-SAOOD actively selects the most valuable sparse samples at the instance level that are best suited to the current model state, by jointly considering orientation, classification, and localization uncertainty, as well as inter- and intra-class diversity. This design enables SAOOD to operate stably under completely randomly initialized sparse annotations and extends its applicability to broader real-world scenarios. Experiments on multiple datasets demonstrate that Active-SAOOD significantly improves both the performance and stability of existing SAOOD methods under various random sparse annotations. In particular, with an annotation ratio of only 1%, it achieves a 9% performance gain over the baseline, further enhancing the practical value of SAOOD in remote sensing. The code will be public.

2605.10149 2026-05-12 cs.CV updated version

Improving Temporal Action Segmentation via Constraint-Aware Decoding

Yeo Keat Ee, Debaditya Roy, Chen Li, Hao Zhang, Basura Fernando

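Affiliations * Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; Indian Institute of Technology Kharagpur, India; College of Computing and Data Science, Nanyang Technological University, Singapore

AI summary This paper studies how structural prior constraints can improve temporal action segmentation. The authors propose a lightweight constraint-aware decoding framework that integrates statistical structural priors such as action transition confidence, action boundary sets, and per-class duration, refining predictions at inference time without adding model complexity. The method improves both fully supervised and semi-supervised action segmentation models, especially when annotated data are limited or the domain is new.

Comments accepted to ICPR 2026

Details
Abstract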

Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing, which limits scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors, such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at https://github.com/LUNAProject22/CAD
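To make the idea of constraint-aware Viterbi decoding concrete, here is a minimal sketch that refines frame-wise class probabilities using only a hard set of allowed transitions. It is an illustration under stated assumptions, not the authors' algorithm: the function name, the `allowed` set of (previous, next) pairs, and the optional `switch_penalty` are hypothetical, and the boundary-set and duration priors from the abstract are not modeled.

```python
import math

def constrained_viterbi(frame_probs, allowed, switch_penalty=0.0):
    """Most likely label path where class changes must be in `allowed`.

    frame_probs: list of per-frame probability lists (all entries > 0).
    allowed: set of (prev_class, next_class) pairs; staying in the same
             class is always permitted.
    switch_penalty: hypothetical log-domain cost charged per class change.
    """
    log_p = [[math.log(p) for p in row] for row in frame_probs]
    num_classes = len(log_p[0])
    score = list(log_p[0])
    backpointers = []
    for t in range(1, len(log_p)):
        new_score, pointers = [], []
        for c in range(num_classes):
            best_prev, best_val = c, score[c]  # default: stay in class c
            for prev in range(num_classes):
                if prev != c and (prev, c) in allowed:
                    val = score[prev] - switch_penalty
                    if val > best_val:
                        best_prev, best_val = prev, val
            new_score.append(best_val + log_p[t][c])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final class.
    last = max(range(num_classes), key=lambda c: score[c])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path
```

With three classes and only the transitions 0 to 1 and 1 to 2 allowed, a noisy frame-wise argmax such as `[0, 2, 1, 2]` is replaced by a path that respects the ordering, which is the kind of structural error correction the abstract describes.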

2605.10148 2026-05-12 cs.CV updated version

MicroViTv2: Beyond the FLOPS for Edge Energy-Friendly Vision Transformers

Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh

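Affiliations * Department of Electro-Optics, National Formosa University; Department of Electrical Engineering, National Taipei University; College of Artificial Intelligence and Green Energy, National Yang Ming Chiao Tung University

AI summary This paper proposes MicroViTv2, a lightweight Vision Transformer designed for better energy efficiency on edge devices. It adopts a re-parameterized design, comprising Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW), combined with a Single Depth-Wise Transposed Attention (SDTA) module, and achieves higher accuracy while keeping inference fast. Experiments show strong energy efficiency on hardware such as Jetson AGX Orin, supporting the case for evaluating efficiency beyond FLOPs.

Details
Abstract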

The Vision Transformer (ViT) achieves remarkable accuracy across visual tasks but remains computationally expensive for edge deployment. This paper presents MicroViTv2, a lightweight Vision Transformer optimized for real-device efficiency. Built upon the original MicroViT, the proposed model adopts a reparameterized design, specifically Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW) for faster inference, and introduces the Single Depth-Wise Transposed Attention (SDTA) to capture long-range dependencies with minimal redundancy. Despite slightly higher FLOPs, MicroViTv2 improves accuracy by up to 0.5% over its predecessor and surpasses MobileViTv2, EdgeNeXt, and EfficientViT while maintaining fast inference and high energy efficiency on Jetson AGX Orin. Experiments on ImageNet-1K and COCO demonstrate that hardware-aware design and structural re-parameterization are key to achieving high accuracy and low energy consumption, validating the need to evaluate efficiency beyond FLOPs. Code is available at https://github.com/novendrastywn/MicroViT.
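Structural re-parameterization, in its generic form, trains a multi-branch block and then folds the branches into a single convolution for inference. The sketch below shows that general idea on a one-channel 1D signal; it is not the RepEmbed or RepDW design from the paper, and the branch layout (a 3-tap convolution, a 1-tap convolution, and an identity skip) and all names are assumptions. Batch-normalization folding, which such designs usually also involve, is left out.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1D cross-correlation with an odd-length kernel."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

def multi_branch(x, k3, k1):
    """Training-time form: 3-tap conv + 1-tap conv + identity skip."""
    a = conv1d_same(x, k3)
    b = conv1d_same(x, [k1])
    return [a[i] + b[i] + x[i] for i in range(len(x))]

def merge_branches(k3, k1):
    """Fold the 1-tap conv and the identity into the centre tap of k3."""
    merged = list(k3)
    merged[1] += k1 + 1.0
    return merged

def single_branch(x, k3, k1):
    """Inference-time form: one 3-tap conv with the merged kernel."""
    return conv1d_same(x, merge_branches(k3, k1))
```

Because convolution is linear, both forms give the same output, while the merged form needs one pass over the input instead of three.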

2605.10142 2026-05-12 cs.CV cs.AI updated version

Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

Mateusz Cedro, Marcin Chlebus

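Affiliations * University of Warsaw

AI summary This paper examines whether scaling up vision models improves the quality of localisation-based explanations. Evaluating ResNet, DenseNet, and Vision Transformer models of varying depth and complexity on several image datasets with five post-hoc explanation methods, it finds that larger models do not improve explanation quality in most cases and that smaller models often perform comparably or better. It also finds that pretraining improves predictive performance but does not consistently improve localisation precision, indicating that explainability should be evaluated explicitly during model selection for safe deployment.

Comments 28 pages, 8 figures, 8 tables

Details
Abstract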

Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
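As a concrete reading of the first localisation metric, Relevance Rank Accuracy takes the K highest-attribution pixels, where K is the size of the ground-truth mask, and reports the fraction that fall inside the mask. The sketch below follows that description on flattened inputs; the function name and the list-based interface are assumptions, and the proposed Dual-Polarity Precision is not implemented because its exact formula is not given in the abstract.

```python
def relevance_rank_accuracy(attribution, mask):
    """Fraction of the top-K attributed pixels that lie inside the mask.

    attribution: flat list of attribution scores, one per pixel.
    mask: flat list of 0/1 (or bool) ground-truth labels, same length.
    K is the number of pixels inside the mask.
    """
    if len(attribution) != len(mask):
        raise ValueError("attribution and mask must have the same length")
    k = sum(1 for m in mask if m)
    if k == 0:
        raise ValueError("mask must contain at least one positive pixel")
    ranked = sorted(range(len(attribution)),
                    key=lambda i: attribution[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if mask[i])
    return hits / k
```

A score of 1.0 means the strongest attributions coincide with the annotated object, and a score near zero corresponds to the case the abstract highlights, where a model predicts well while its explanation points elsewhere.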

2605.10130 2026-05-12 cs.CV updated version

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

Yasiru Ranasinghe, Elim Schenck, Florence Yellin, Shuowen Hu, Christopher Funk, Vishal M. Patel

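Affiliations * Johns Hopkins University; Kitware; DEVCOM Army Research Laboratory

AI summary Existing open-vocabulary detectors mainly target RGB images and generalize poorly to thermal imagery, where low texture and emissivity variations challenge RGB-based semantic understanding. This paper proposes Thermal-Det, the first open-vocabulary thermal object detector supervised by a large language model (LLM). It builds a synthetic dataset of over one million thermally aligned samples and combines cross-modal distillation with a text calibration module to transfer detection knowledge to the thermal domain without manual annotation. Experiments on public datasets show AP gains of 2-4% over existing open-vocabulary detectors, laying a foundation for language-driven thermal perception systems.

Comments Accepted at CVPR 26

Details
Abstract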

Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.

2605.10120 2026-05-12 cs.CV cs.AI updated version

MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

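Affiliations * Shanghai Key Laboratory of Intelligent Information Processing; School of Computer Science, Fudan University

AI summary This paper proposes MicroWorld, a framework that addresses the weak performance of multimodal large language models in specialized domains such as microscopy. It builds a multimodal attributed property graph (MAPG) to strengthen the model's reasoning at inference time, without domain-specific fine-tuning. Experiments show that MicroWorld substantially improves Qwen3-VL-8B-Instruct on benchmarks such as MicroVQA, achieving state-of-the-art results, and demonstrates improved cross-domain generalization.

Comments 29 pages, 14 figures

Details
Abstract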

Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image-caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.

2605.10117 2026-05-12 cs.CV cs.AI updated version

Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving

Donghyun Kim, Jaehyoung Park

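Affiliations * Stony Brook University

AI summary This paper studies how to adjust perception compute dynamically according to scene complexity in autonomous driving. It proposes Enhanced HOPE, an adaptive perception architecture that estimates the geometric complexity of each LiDAR frame with an unsupervised method and routes the frame through a shallow or deep processing path accordingly, improving computational efficiency while preserving accuracy. It also introduces a linear-time subspace attention network and a persistent temporal memory module, which improve tracking of occluded objects, and reports superior performance on multiple benchmarks.

Details
Abstract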

Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.

2605.07846 2026-05-12 cs.CV updated version

BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

Peilin Xiong, Honghui Yuan, Junwen Chen, Keiji Yanai

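Affiliations * Department of Informatics, The University of Electro-Communications

AI summary This paper studies the distortion of edit-region boundaries caused by mask-shape bias in coarse-mask local image editing and proposes BRIDGE. The method keeps masks outside the DiT backbone and introduces a learnable discrete geometric gate, keeping the background stable while letting the editable region be generated flexibly. Experiments show that BRIDGE significantly improves editing quality on multiple benchmarks while remaining lightweight.

Comments 11 pages, 6 figures

Details
Abstract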

Coarse-mask local image editing asks a model to modify a user-indicated region while preserving the surrounding scene. In practice, however, rough masks often become unintended shape priors: instead of serving as flexible edit support, the mask can pull the generated object toward its accidental boundary. We study this failure as mask-shape bias and frame the task through a Two-Zone Constraint, where the background should remain stable while the editable region should follow the instruction without being forced to inherit the mask contour. BRIDGE addresses this setting by keeping masks outside the DiT backbone for support construction and blending, avoiding DiT-internal mask injection and copied control branches. It uses BridgePath generation, where a Main Path preserves background context and a Subject Path generates editable content from independent noise. Motivated by a diagnostic Qwen-Image experiment showing that positional embeddings and attention connectivity regulate which image context visual tokens reuse, BRIDGE introduces a learnable Discrete Geometric Gate for token-level positional-embedding routing. This gate lets subject tokens borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. We evaluate BRIDGE on BRIDGE-Bench, MagicBrush, and ICE-Bench. On BRIDGE-Bench, BRIDGE improves Local SigLIP2-T from 0.262 with FLUX.1-Fill and 0.390 with ACE++ to 0.503, with parallel gains in local DINO and DreamSim. Zero-shot results on MagicBrush and ICE-Bench further indicate competitive alignment and source preservation beyond the curated benchmark, while the added routing module remains compact at 13.31M parameters compared with ControlNet-style copied branches.

2605.07786 2026-05-12 cs.CV cs.AI updated version

APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

Caterina Gallegati, Monica Bianchini, Franco Scarselli, Vittorio Murino, Barbara Toniella Corradini

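Affiliations * University of Siena; AI for Good (AIGO), Istituto Italiano di Tecnologia; University of Verona

AI summary Although generative models have made breakthroughs in visual quality, traditional feature-distribution metrics such as FID are still treated as the gold standard for image evaluation, yet they are limited by outdated features and parametric assumptions. To address this, the paper proposes APEX, an assumption-free embedding evaluation framework based on the Sliced Wasserstein Distance that does not rely on a specific parametric form and works with multiple embedding models such as CLIP and DINOv2. Experiments show that APEX scales well to high-dimensional spaces, is more robust to visual degradations, and is highly stable in cross-dataset evaluation.

Details
Abstract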

As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidence. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
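The Sliced Wasserstein Distance at the core of this metric can be estimated by projecting both sets of embeddings onto random unit directions and comparing the sorted 1D projections. The sketch below is a plain Monte Carlo estimator, not the APEX implementation: the function name, the number of projections, the fixed seed, and the restriction to equally sized sample sets are all assumptions made to keep the example short.

```python
import math
import random

def sliced_wasserstein(xs, ys, num_projections=128, p=2, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-p distance.

    xs, ys: equally sized lists of feature vectors (sequences of floats).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized sample sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # Draw a random direction uniformly on the unit sphere.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # In 1D, optimal transport simply matches sorted samples.
        px = sorted(sum(a * t for a, t in zip(x, theta)) for x in xs)
        py = sorted(sum(a * t for a, t in zip(y, theta)) for y in ys)
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / num_projections) ** (1.0 / p)
```

No Gaussian fit is involved, in contrast to FID, which is the sense in which such a measure avoids a parametric assumption. In practice the inputs would be embeddings from a feature extractor such as CLIP or DINOv2 rather than the toy vectors used for testing.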

2605.07575 2026-05-12 cs.CV cs.AI updated version

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Ke Ma, Jiaqi Tang, Bin Guo, Xueting Han, Ruonan Xu, Qingfeng He, Ziheng Wang, Xu Wang, Qifeng Chen, Zhiwen Yu, Yunhao Liu

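Affiliations * Northwestern Polytechnical University; Tsinghua University; The Hong Kong University of Science and Technology; Harbin Engineering University

AI summary This paper proposes Response-G1, a framework for deciding when to respond proactively in streaming video understanding. Through explicit scene graph modeling, it structurally aligns video content with the response conditions of the query, improving the accuracy and interpretability of response decisions. The framework has three fine-tuning-free stages: online scene graph generation, memory-based semantic retrieval, and retrieval-augmented trigger prompting. Experiments show that it outperforms existing methods on both proactive and reactive tasks.

Comments Accepted to ACL 2026

Details
Abstract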

Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.

2605.07574 2026-05-12 cs.CV updated version

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

Yuliang Li, Chu Zhou, Heng Guo, Boxin Shi, Imari Sato, Zhanyu Ma

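Affiliations * Beijing University of Posts and Telecommunications, China; National Institute of Informatics, Japan; Peking University, China; The University of Tokyo, Japan

AI summary Mainstream vision-language models (VLMs) rely on standard RGB inputs and therefore struggle with optically ambiguous scenes such as reflections and transparent objects. To address this, the paper proposes PolarVLM, the first multimodal framework that integrates polarimetric physical parameters into VLMs; its dual-stream architecture and progressive training strategy prevent physical misinterpretations while preserving general visual abilities. The work also builds PolarVQA, the first benchmark for polarization-aware visual question answering. Experiments show that PolarVLM clearly outperforms the RGB baseline across multiple tasks, with especially large gains in reflection recognition and glass counting.

Comments 23 pages, 12 figures, including appendices

Details
Abstract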

Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.

2605.07429 2026-05-12 cs.CV updated version

Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

Linxiao Shi, Siming Zheng, Zerong Wang, Hao Zhang, Jinwei Chen, Bo Li, Shifeng Chen, Peng-Tao Jiang

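Affiliations * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; vivo BlueImage Lab, vivo Mobile Communication Co., Ltd.; Shenzhen University of Advanced Technology

AI summary Mobile devices struggle to produce natural optical depth-of-field effects because of their optical design constraints. To address this, the paper proposes MagicBokeh, a unified diffusion-based method that efficiently generates high-quality, photorealistic bokeh. Through an alternative training strategy and a focus-aware masked attention mechanism, it jointly optimizes bokeh rendering and super-resolution, substantially improving controllability and visual realism, and it introduces a degradation-aware depth module for more accurate depth estimation from low-quality inputs. Experiments show that MagicBokeh efficiently produces highly realistic bokeh on real-world low-resolution images.

Comments Accepted by CVPR 2026

Details
Abstract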

Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce a degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. Our code and models are available at https://github.com/vivoCameraResearch/MagicBokeh.

2605.05775 2026-05-12 cs.CV cs.AI updated version

The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization

Jakob Dexl, Katharina Jeblick, Andreas Mittermeier, Balthasar Schachtner, Anna Theresa Stüber, Johanna Topalis, Maximilian Rokuss, Fabian Isensee, Klaus H. Maier-Hein, Hamza Kalisch, Jens Kleesiek, Constantin M. Seibold, Hussain Alasmawi, Lap Yan Lennon Chan, Yixuan Yuan, Alexander Jaus, Rainer Stiefelhagen, Pauline Ornela Megne Choudja, Konstantin Nikolaou, Christian La Fougère, Sergios Gatidis, Matthias P. Fabritius, Maurice Heimer, Gizem Abaci, Lalith Kumar Shiyam Sundar, Rudolf A. Werner, Jens Ricke, Clemens C. Cyran, Thomas Küstner, Michael Ingrisch

Affiliations * Department of Radiology, LMU University Hospital, LMU Munich; Munich Center for Machine Learning (MCML); University Hospital Tübingen, Department of Radiology; Department of Radiology, Stanford University; German Cancer Research Center (DKFZ); Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital; Faculty of Mathematics and Computer Science, Heidelberg University; Institute for AI in Medicine (IKIM), University Hospital Essen (AöR); Department of Nuclear Medicine, University Hospital Essen (AöR); Mohamed bin Zayed University of Artificial Intelligence; Department of Computer Science and Engineering, The Chinese University of Hong Kong; Department of Electronic Engineering, The Chinese University of Hong Kong; Karlsruhe Institute of Technology; HIDSS4Health - Helmholtz Information and Data Science School for Health; Department of Nuclear Medicine, LMU University Hospital, LMU Munich; Comprehensive Pneumology Center (CPC-M), Member of the German Center for Lung Research (DZL); relAI – Konrad Zuse School of Excellence in Reliable AI; Cluster of Excellence iFIT (EXC 2180) "Image Guided and Functionally Instructed Tumor Therapies", University of Tübingen

AI summary This paper presents the design and results of the third autoPET challenge (MICCAI 2024), which evaluated how well algorithms for automated lesion segmentation in whole-body PET/CT generalize across tracers and centers. The study used a large annotated dataset from two hospitals and evaluated algorithms on a test set containing unseen tracer-center combinations; the best algorithm outperformed the baseline on several metrics. The study also finds that current algorithms perform well on in-domain multitracer segmentation but still struggle to generalize across centers and tracers, with performance differences driven mainly by data heterogeneity and case difficulty.

Comments Preprint submitted to Medical Image Analysis

Details
Abstract

We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
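The three reported metrics can be made concrete with a small sketch. It assumes the component-wise definitions used in earlier autoPET editions (false-positive volume counts predicted connected components that touch no ground-truth lesion; false-negative volume counts ground-truth components that the prediction misses entirely), 6-connectivity, and masks stored as sets of voxel coordinates. None of these details is stated in the abstract, and all function names are illustrative.

```python
from collections import deque

# 6-connected neighbourhood offsets (an assumption; the challenge may use another connectivity).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def components(voxels):
    """Split a set of (z, y, x) voxels into connected components via breadth-first search."""
    remaining = set(voxels)
    found = []
    while remaining:
        seed = remaining.pop()
        comp, queue = {seed}, deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBOURS:
                n = (z + dz, y + dy, x + dx)
                if n in remaining:
                    remaining.remove(n)
                    comp.add(n)
                    queue.append(n)
        found.append(comp)
    return found


def dice(pred, gt):
    """Dice similarity coefficient; two empty masks are scored 1.0 by convention here."""
    if not pred and not gt:
        return 1.0
    return 2 * len(pred & gt) / (len(pred) + len(gt))


def false_positive_volume(pred, gt, voxel_ml):
    """Volume (mL) of predicted components that overlap no ground-truth voxel."""
    return sum(len(c) for c in components(pred) if not (c & gt)) * voxel_ml


def false_negative_volume(pred, gt, voxel_ml):
    """Volume (mL) of ground-truth components that the prediction misses entirely."""
    return sum(len(c) for c in components(gt) if not (c & pred)) * voxel_ml


# Toy masks: one partially detected lesion, one missed lesion, one spurious prediction.
gt = {(0, 0, 0), (0, 0, 1), (0, 5, 5)}
pred = {(0, 0, 1), (0, 0, 2), (0, 9, 9)}
print(dice(pred, gt), false_positive_volume(pred, gt, 0.5), false_negative_volume(pred, gt, 0.5))
```

Under these definitions a partially detected lesion lowers DSC but contributes nothing to FNV, which is why the abstract reports the three numbers side by side.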

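2605.04617 2026-05-12 cs.CV cs.HC cs.LG version update

Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition

Zishu Zhou, Zaipeng Xie, Xuanyao Jie

Affiliations * College of Computer Science and Software Engineering, Hohai University

AI summary Wearable human activity recognition models often degrade under real-world cross-user distribution shifts, and existing test-time adaptation methods mostly inherit assumptions from vision tasks without fully exploiting the temporal structure of activity streams. This paper revisits temporal structure as a conditioning signal for inference and proposes an adaptation mechanism based on temporal continuity and feature deviation that decides when to preserve or release temporal inertia and where to route prediction refinement. Building on this, the authors design SIGHT, a lightweight framework that adapts in real time without backpropagation; experiments on real-world datasets show that it outperforms existing methods while reducing compute and memory overhead.

Details
English abstract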
Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.
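To make the idea of prototype-based surprise concrete, here is a minimal sketch. It is not the SIGHT algorithm: it assumes cosine distance as the deviation measure, a fixed threshold, and nearest-prototype routing, and it leaves out the stream-level marginal habit tracking mentioned in the abstract. All names and values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def step(feature, prototypes, prev_label, threshold=0.3):
    """Return (label, surprise) for the current window.

    surprise = cosine distance between the feature and the prototype of the
    previously predicted activity (the "expected state" in this sketch).
    Low surprise: keep the previous label (temporal inertia).
    High surprise: release inertia and route to the best-aligned prototype.
    """
    surprise = 1.0 - cosine(feature, prototypes[prev_label])
    if surprise < threshold:
        return prev_label, surprise
    best = max(prototypes, key=lambda c: cosine(feature, prototypes[c]))
    return best, surprise


# Toy two-class example with 2-D features.
prototypes = {"walk": [1.0, 0.0], "sit": [0.0, 1.0]}
print(step([0.9, 0.1], prototypes, "walk"))  # feature near the expected state
print(step([0.1, 0.9], prototypes, "walk"))  # feature far from the expected state
```

Nothing here requires gradients, which is the property the abstract highlights for edge deployment.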

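2605.04541 2026-05-12 cs.CV version update

Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection

Muyao Peng, Shun Zou, Pei An, You Yang, Qiong Liu

Affiliations * School of Electronic Information and Communications, Huazhong University of Science and Technology

AI summary This paper proposes Angle-I2P, an image-to-point-cloud registration method that addresses the difficulty conventional PnP methods have in registering accurately when the inlier ratio is low. By introducing an angle-consistency constraint and a hierarchical attention mechanism, the method improves registration robustness and accuracy. Experiments show that Angle-I2P achieves state-of-the-art registration results on several datasets.

Comments Accepted by ICRA 2026

Details
English abstract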
Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation, grasping, and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, when the inlier ratio of the initial matching pairs is low, conventional Perspective-n-Points (PnP) methods may struggle to achieve accurate results. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, cross-modality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-to-local hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.

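2605.03650 2026-05-12 cs.CV cs.AI cs.LG version update

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

Zhiyuan Li, Rongzhen Zhao, Wenyan Yang, Wenshuai Zhao, Pekka Marttinen, Joni Pajarinen

Affiliations * Department of Electrical Engineering and Automation; Aalto University; Department of Computer Science

AI summary This paper rethinks temporal consistency in video object-centric learning and argues that current methods, which rely on dynamics modules to predict future object representations, are in effect expensive approximations of a discrete correspondence problem. The authors propose a new framework, Grounded Correspondence, which initializes object slots from salient regions extracted by a frozen backbone and uses Hungarian matching to maintain identity across frames. Without any learnable parameters for temporal modeling, it achieves competitive performance on several datasets.

Details
English abstract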
The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
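The frame-to-frame matching step can be sketched as a minimum-cost bipartite assignment between the previous and current slots. The sketch below assumes squared Euclidean distance as the cost and, to stay dependency-free, finds the optimum by brute force over permutations rather than with the Hungarian algorithm itself; both give the same assignment, but brute force is only practical for a handful of slots. In practice a solver such as `scipy.optimize.linear_sum_assignment` would typically be used. The cost choice and function names are assumptions, not taken from the paper.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two slot vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_slots(prev_slots, curr_slots):
    """Return assignment[i] = index of the current slot matched to previous slot i.

    Exhaustive search over all permutations (O(n!)); a stand-in for Hungarian
    matching that yields the same minimum-cost assignment on small inputs.
    """
    n = len(prev_slots)
    cost = [[sq_dist(p, c) for c in curr_slots] for p in prev_slots]
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Toy example: the current frame lists the same three objects in a different order.
prev_slots = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
curr_slots = [[10.2, 0.1], [0.1, -0.1], [4.8, 5.1]]
assignment = match_slots(prev_slots, curr_slots)
aligned = [curr_slots[j] for j in assignment]  # current slots reordered to keep identities
print(assignment, aligned)
```

Because the matching is deterministic and parameter-free, it adds no learnable parameters for temporal modeling, consistent with the claim in the abstract.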

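2605.03639 2026-05-12 cs.CV version update

Diffusion Masked Pretraining for Dynamic Point Cloud

Zhuoyue Zhang, Jihua Zhu, Chaowei Fang, Jian Liu, Ajmal Saeed Mian

Affiliations * Xi’an Jiaotong University; School of Artificial Intelligence and Robotics, Hunan University; University of Western Australia

AI summary This paper proposes DiMP, a unified self-supervised pretraining framework for dynamic point clouds. By introducing diffusion modeling, it addresses two problems in existing masked reconstruction objectives: spatio-temporal positional leakage and the loss of motion uncertainty. DiMP applies diffusion modeling to both positional inference and motion learning: it predicts clean tube centers from the visible spatio-temporal context, and it recasts inter-frame displacement supervision as a noise-prediction task for a conditional diffusion model, so that the conditional distribution of motion is modeled more completely. Experiments show that DiMP substantially improves performance on several downstream tasks.

Details
English abstract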
Dynamic point cloud pretraining is still dominated by masked reconstruction objectives. However, these objectives inherit two key limitations. Existing methods inject ground-truth tube centers as decoder positional embeddings, causing spatio-temporal positional leakage. Moreover, they supervise inter-frame motion with deterministic proxy targets that systematically discard distributional structure by collapsing multimodal trajectory uncertainty into conditional means. To address these limitations, we propose Diffusion Masked Pretraining (DiMP), a unified self-supervised framework for dynamic point clouds. DiMP introduces diffusion modeling into both positional inference and motion learning. It first applies forward diffusion noise only to masked tube centers, then predicts clean centers from visible spatio-temporal context. This removes positional leakage while preserving visible coordinates as clean temporal anchors. DiMP also reformulates point-wise inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations. This design drives the encoder to target the full conditional distribution of plausible motions under a variational surrogate, rather than collapsing to a single deterministic estimate. Extensive experiments demonstrate that DiMP consistently improves downstream accuracy over the backbone alone, with absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference. Code is available at https://github.com/InitalZ/DiMP.git.
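The "noise only the masked tube centers" step can be illustrated with the standard DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard normal. The sketch below applies it to masked centers and leaves visible centers untouched. The data layout, the single scalar `alpha_bar`, and the function name are assumptions for illustration; the paper's actual noise schedule and tensor shapes are not given in the abstract.

```python
import math
import random


def noise_masked_centers(centers, masked, alpha_bar, rng):
    """Apply the DDPM forward process to masked centers only.

    centers:   list of [x, y, z] clean tube centers
    masked:    list of bools, True where the tube is masked
    alpha_bar: cumulative noise-schedule product at the sampled timestep (0 < alpha_bar <= 1)
    Returns (noisy_centers, eps) where eps[i] is the sampled noise, or None for visible tubes.
    """
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    noisy, eps_all = [], []
    for center, is_masked in zip(centers, masked):
        if is_masked:
            eps = [rng.gauss(0.0, 1.0) for _ in center]
            noisy.append([signal * x + sigma * e for x, e in zip(center, eps)])
            eps_all.append(eps)
        else:
            # Visible centers stay clean and act as temporal anchors.
            noisy.append(list(center))
            eps_all.append(None)
    return noisy, eps_all


rng = random.Random(0)
centers = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
noisy, eps = noise_masked_centers(centers, [False, True], alpha_bar=0.5, rng=rng)
print(noisy[0], noisy[1])
```

A denoiser trained on such inputs never receives the ground-truth position of a masked tube, which is how the leakage described in the abstract is avoided.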

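2604.17565 2026-05-12 cs.CV version update

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

Hong Jiang, Wensong Song, Zongxing Yang, Ruijie Quan, Yi Yang

Affiliations * ReLER, CCAI, Zhejiang University; DBMI, HMS, Harvard University

AI summary UniGeo is a new camera-controllable image editing framework that aims to generate geometrically consistent views of a scene under different camera poses. By injecting unified geometric guidance at the representation, architecture, and loss-function levels, it addresses the geometric drift and structural degradation that existing methods exhibit under continuous camera motion. Experiments show that UniGeo significantly outperforms existing methods on several public benchmarks, with higher visual quality and geometric consistency.

Details
English abstract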
Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.

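2604.06720 2026-05-12 cs.CV version update

Exploring 6D Object Pose Estimation with Deformation

Zhiqiang Liu, Rui Song, Duanmu Chuangqi, Jiaojiao Li, David Ferstl, Yinlin Hu

Affiliations * State Key Laboratory of ISN, Xidian University; MagicLeap

AI summary This paper presents DeSOPE, a large-scale dataset for six-degrees-of-freedom (6DoF) pose estimation of deformed objects. Conventional 6D pose estimation methods usually assume rigid or articulated objects, but in practice objects deviate from their canonical shapes because of wear, impact, or deformation, which causes these methods to fail. DeSOPE contains high-fidelity 3D scans of 26 common object categories in a canonical state and three deformed states, together with 133K RGB-D frames and 665K pose annotations, providing an important resource for studying pose estimation of deformed objects.

Comments Accepted at CVPR 2026

Details
English abstract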
We present DeSOPE, a large-scale dataset for 6DoF pose estimation of deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications.

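2604.04306 2026-05-12 cs.CV cs.AI version update

HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data

Stella Girtsou, Konstantinos Alexis, Giorgos Giannopoulos, Charalambos Kontoes

Affiliations * National Observatory of Athens; National Technical University of Athens; National and Kapodistrian University of Athens; Athena Research Center

AI summary As climate-related disasters become more frequent, the need for real-time monitoring and early warning grows more urgent. This paper proposes HighFM, a foundation model for multispectral remote sensing data with high temporal resolution. Using more than 2 TB of SEVIRI satellite imagery, it adapts a masked autoencoding framework to learn robust spatiotemporal representations, outperforms traditional methods and recent geospatial foundation models on cloud masking and active fire detection, and demonstrates the potential of geostationary satellite data for real-time remote sensing applications.

Details
English abstract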
The increasing frequency and severity of climate-related disasters have intensified the need for real-time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large-scale remote sensing datasets. However, most existing models rely on high-resolution satellite imagery with low revisit rates, limiting their suitability for fast-evolving phenomena and time-critical emergency response. In this work, we present HighFM, a first-cut approach towards an FM for high-temporal-resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real-time monitoring, we enhance the original architecture with fine-grained temporal encodings to capture short-term variability. The pretrained models are then finetuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI-pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.

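2603.21901 2026-05-12 cs.CV version update

CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu

Affiliations * University of Electronic Science and Technology of China; University of Chinese Academy of Sciences; Technical University of Munich; Shanghai Jiao Tong University; National University of Singapore

AI summary CLEAR is a mask-free, end-to-end framework for video subtitle removal that aims to separate subtitles from background content while preserving temporal coherence. It uses a two-stage design: Stage I learns disentangled subtitle representations through self-supervised orthogonality constraints, and Stage II uses LoRA-based fine-tuning with a generation feedback mechanism for dynamic context adjustment, enabling adaptive inference without ground-truth masks. CLEAR is parameter-efficient and generalizes across languages: training only 0.77% of the base diffusion model's parameters, it outperforms mask-dependent baselines on Chinese subtitle benchmarks and shows strong zero-shot generalization across six languages.

Comments Accepted by ICML 2026 (Spotlight)

Details
English abstract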
Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.
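For readers less familiar with the PSNR figure, the standard definition is PSNR = 10·log10(MAX² / MSE). The sketch below computes it for flat lists of pixel values; the 8-bit peak of 255 is an assumption, since the abstract does not state the evaluation range. For a single image at a fixed peak value, a gain of 6.77 dB corresponds to the mean squared error shrinking by a factor of about 10^0.677, roughly 4.75; averaged benchmark scores need not translate this cleanly.

```python
import math


def psnr(reference, output, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (gain_db / 10.0)


print(psnr([100, 100], [110, 90]))  # MSE = 100 for this toy pair
print(mse_ratio_for_gain(6.77))
```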

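2603.16964 2026-05-12 cs.CV cs.LG version update

Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE

Niklas Roßberg, Sinan Hasirlioglu, Mohamed Essayed Bouzouraa, Wolfgang Utschick, Michael Botsch

Affiliations * Technische Hochschule Ingolstadt; AUDI AG; Technische Universität München

AI summary This study aims to extract scenarios from highway traffic data in a standardized way and to cluster them using domain knowledge, in support of behavior evaluation for automated driving systems. It proposes a scenario extraction method based on the Scenario-as-Specification concept and combines it with a CVQ-VAE model for domain-knowledge-guided clustering, improving the interpretability and consistency of scenario categorization. Experiments show that the method reliably extracts scenarios from real-world data and effectively integrates domain knowledge, providing a more efficient and standardized scenario categorization framework for validating automated driving systems.

Comments Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

Details
English abstract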
Approval of an automated driving system (ADS) depends on evaluating its behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as the basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be utilized. However, while modern ML-based approaches, unlike rule-based approaches, can handle the complexity of traffic scenarios, they lack interpretability and may not align with domain knowledge. This work contributes a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process for automated vehicles.

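2603.11969 2026-05-12 cs.CV version update

AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies

Jennifer Nolan, Travis Driver, John Christian

Affiliations * Georgia Institute of Technology

AI summary This paper proposes AstroSplat, a physics-based Gaussian splatting framework for rendering and reconstructing the surfaces of small celestial bodies such as asteroids. The method introduces planetary reflectance models that explicitly model surface material properties and light-surface interactions, overcoming the limited physical expressiveness of the conventional spherical-harmonic appearance parameterization. Experiments on real imagery from NASA's Dawn mission show better rendering performance and surface reconstruction accuracy.

Comments 10 pages, 6 figures, conference

Details
English abstract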
Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as they inform mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA's Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.
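As an illustration of what a planetary reflectance model looks like, the sketch below evaluates two classical ones: Lommel-Seeliger, proportional to μ0 / (μ0 + μ), and the Lunar-Lambert blend L·2μ0/(μ0 + μ) + (1 − L)·μ0, where μ0 and μ are the cosines of the incidence and emission angles. The abstract does not say which models AstroSplat uses, so this choice is an assumption; albedo and phase-function factors are omitted, and the function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def lommel_seeliger(normal, to_light, to_viewer):
    """Lommel-Seeliger term mu0 / (mu0 + mu) for unit vectors; 0 if unlit or facing away."""
    mu0 = dot(normal, to_light)   # cosine of incidence angle
    mu = dot(normal, to_viewer)   # cosine of emission angle
    if mu0 <= 0.0 or mu <= 0.0:
        return 0.0
    return mu0 / (mu0 + mu)


def lunar_lambert(normal, to_light, to_viewer, weight):
    """Blend of Lommel-Seeliger (scaled by 2) and Lambert, with weight in [0, 1]."""
    mu0 = dot(normal, to_light)
    mu = dot(normal, to_viewer)
    if mu0 <= 0.0 or mu <= 0.0:
        return 0.0
    return weight * 2.0 * lommel_seeliger(normal, to_light, to_viewer) + (1.0 - weight) * mu0


up = (0.0, 0.0, 1.0)
print(lommel_seeliger(up, up, up))        # light and camera both at the surface normal
print(lunar_lambert(up, up, up, 0.3))
```

Unlike a spherical-harmonic color term, such a model depends explicitly on the surface normal and the illumination and viewing directions, so its parameters can be interpreted photometrically.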

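2603.11566 2026-05-12 cs.CV version update

R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Zhongyu Xia, Yousen Tang, Yongtao Wang, Zhifeng Wang, Weijun Qin

Affiliations * Wangxuan Institute of Computer Technology, Peking University; EBTech Co. Ltd

AI summary This paper proposes R4Det, a 4D radar-camera fusion method for improving 3D object detection in autonomous driving. To address the shortcomings of existing methods in depth estimation, temporal fusion, and small-object detection, R4Det introduces a Panoramic Depth Fusion module to improve depth estimation, designs a Deformable Gated Temporal Fusion module that does not depend on the ego vehicle's pose, and builds an Instance-Guided Dynamic Refinement module to improve small-object detection. Experiments show that R4Det achieves state-of-the-art 3D detection results on the TJ4DRadSet and VoD datasets.

Comments Accepted to CVPR 2026

Details
English abstract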
The 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data confront several challenges. First, their absolute depth estimation module is neither robust nor accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module degrades dramatically, or the module fails outright, when the ego vehicle's pose is missing or inaccurate. Third, some small objects may return no points at all in the sparse radar point cloud, so detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we build an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets. The source code and models will be released at https://github.com/VDIGPKU/R4Det.

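2603.10165 2026-05-12 cs.CL cs.AI cs.CV cs.LG version update

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang

AI summary OpenClaw-RL is a reinforcement learning framework that optimizes agents online using next-state signals such as user replies, tool outputs, and interface state changes. On the infrastructure side, it adopts a server-client architecture that decouples signal extraction from policy optimization, improving training efficiency. On the methodology side, it proposes a hybrid RL objective that combines sparser but finer-grained directive signals with more broadly available evaluative signals, improving learning stability. The study demonstrates OpenClaw-RL on both personal-agent and general-agent tasks, including long-horizon settings.

Comments Code: https://github.com/Gen-Verse/OpenClaw-RL

Details
English abstract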
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
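The two stabilization ideas named in the abstract can be sketched as follows. The sketch assumes that "overlap" means the teacher probability mass that falls on the student's top-k tokens, and that the per-token advantage is the teacher-minus-student log-probability difference clipped to a symmetric range. Both are guesses at one plausible reading, not the paper's definitions, and every name here is hypothetical.

```python
def top_k_tokens(dist, k):
    """Return the k highest-probability tokens of a {token: prob} distribution."""
    return set(sorted(dist, key=dist.get, reverse=True)[:k])


def select_hint(student_dist, teacher_dist_by_hint, k):
    """Pick the hint whose induced teacher distribution overlaps most with the student's top-k.

    Overlap (assumed measure): total teacher probability on the student's top-k tokens.
    """
    top = top_k_tokens(student_dist, k)

    def overlap(hint):
        teacher = teacher_dist_by_hint[hint]
        return sum(teacher.get(tok, 0.0) for tok in top)

    return max(teacher_dist_by_hint, key=overlap)


def clipped_advantage(teacher_logp, student_logp, clip):
    """Per-token advantage: log-probability difference bounded to [-clip, clip]."""
    return max(-clip, min(clip, teacher_logp - student_logp))


student = {"a": 0.5, "b": 0.3, "c": 0.2}
teachers = {
    "hint_1": {"c": 0.9, "a": 0.1},            # mostly outside the student's top-2
    "hint_2": {"a": 0.6, "b": 0.3, "c": 0.1},  # mostly inside the student's top-2
}
print(select_hint(student, teachers, k=2))
print(clipped_advantage(-0.1, -5.0, clip=2.0))
```

Preferring the hint with high overlap keeps the teacher's target close to what the student can already produce, and the clip keeps any single token from dominating the update.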

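2603.09465 2026-05-12 cs.CV cs.AI version update

EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation

Jiajun Cao, Xiaoan Zhang, Xiaobao Wei, Liyuqiu Huang, Zijian Wang, Hanzhen Zhang, Zhengyu Jia, Wei Mao, Hao Wang, Xianming Liu, Shuchang Zhou, Yang Wang, Shanghang Zhang

Affiliations * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; XPeng Motors

AI summary This paper proposes EvoDriveVLA, a collaborative perception-planning distillation framework that addresses two problems of vision-language-action models in autonomous driving: degraded perception after the visual encoder is unfrozen, and instability in long-term planning. The method combines self-anchored perceptual constraints with future-informed trajectory optimization: a self-anchored teacher guides the student to attend to key regions, and a future-aware oracle teacher performs trajectory refinement and uncertainty modeling, improving the model's perception and planning. Experiments show that EvoDriveVLA performs strongly on the nuScenes and NAVSIM datasets.

Comments 19 pages, 5 figures, 5 tables

Details
English abstract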
Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and future-informed trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchored teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, future-informed trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to synthesize reasoning trajectories that model future evolutions, enabling the student model to internalize the future-aware insights of the teacher. EvoDriveVLA achieves SOTA performance in nuScenes open-loop evaluation and significantly enhances performance in NAVSIM closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.

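2603.03239 2026-05-12 cs.CV version update

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski

Affiliations * School of Engineering, University of Edinburgh; European Space Agency (ESA); Asterisk Labs

AI summary This work proposes COP-GEN, a multimodal latent diffusion transformer for generating Copernicus Earth observation data that models the joint distribution of different sensors (optical, radar, elevation, and land cover) at their native spatial resolutions. By parameterizing cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation without task-specific retraining. Experiments show that the model produces diverse yet physically consistent observations while maintaining strong peak fidelity, and that on the authors' benchmark it clearly outperforms existing methods in generative capability.

Details
English abstract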
Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover. Relationships between modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations, and should be parametrised as conditional distributions. Deterministic models, by contrast, collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous EO modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation without task-specific retraining. Experiments show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and adapts its output uncertainty as conditioning information increases. We release a stochastic benchmark built from multi-temporal Sentinel-2 observations that enables distribution-level comparison of generative EO models. On this benchmark, COP-GEN covers 90% of the real observation manifold and 63% of its per-band reflectance range, while the strongest competing method collapses to 2.8% and 18%, respectively. These results highlight the importance of stochastic generative modeling for EO and motivate evaluation protocols beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen

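2602.05391 2026-05-12 cs.CV version update

Efficient Dataset Distillation for Pre-Trained Self-Supervised Models via Statistical Flow Matching

Qianxin Xia, Jiawei Du, Xin Zhang, Yuhan Zhang, Jielei Wang, Guoming Lu

Affiliations * University of Electronic Science and Technology of China

AI summary This paper studies efficient dataset distillation for pre-trained self-supervised models, with the goal of synthesizing a small dataset whose performance approaches that of the original. To avoid the high compute and memory cost of prior methods, the authors propose statistical flow matching, which optimizes synthetic images by aligning statistical flows from target class centers to non-target class centers in the original data, greatly reducing resource requirements. Experiments show that the method matches or exceeds existing methods while using 10x less GPU memory and running 4x faster; the authors also propose a classifier inheritance strategy that further improves efficiency and performance.

Details
English abstract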
Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For classification tasks that use pre-trained self-supervised models as backbones, the earlier linear gradient matching approach optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching, a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.

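2602.04712 2026-05-12 cs.CV cs.AI eess.IV version update

SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation

David F. Ramirez, Tim Overman, Kristen Jaskie, Joe Marvin, Andreas Spanias

Affiliations * SenSIP Center, School of ECEE, Arizona State University; Prime Solutions Group Inc

AI summary This paper proposes SAR-RAG, a visual-context image-retrieval-augmented generation (ImageRAG) assisted AI method for automatic target recognition (ATR) in synthetic aperture radar (SAR) imagery. The method combines a multimodal large language model (MLLM) with a vector database of semantic embeddings and retrieves image exemplars of known target types to improve recognition of military vehicles in SAR images. Experiments show that adding SAR-RAG to an MLLM baseline improves retrieval, classification, and dimension-regression metrics.

Comments Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI

Details
English abstract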
We present a visual-context image-retrieval-augmented generation (ImageRAG)-assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR) imagery. SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples of known true target types, our SAR-RAG system can compare similar vehicle categories, thereby improving ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR-RAG is added to an MLLM baseline method as an attached ATR memory bank.
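The retrieval half of such a pipeline can be sketched as a nearest-neighbor search over stored embeddings. The sketch assumes cosine similarity, an in-memory list standing in for the vector database, and toy three-dimensional embeddings with made-up labels and lengths. In the described system the retrieved exemplars would then be passed to the MLLM as context, a step not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query_embedding, library, k=3):
    """Return the k library entries most similar to the query, best first."""
    ranked = sorted(
        library,
        key=lambda item: cosine(query_embedding, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical exemplar library: labels and dimensions are placeholders, not real data.
library = [
    {"label": "class_a", "length_m": 6.9, "embedding": [0.9, 0.1, 0.0]},
    {"label": "class_b", "length_m": 9.5, "embedding": [0.0, 1.0, 0.0]},
    {"label": "class_c", "length_m": 7.2, "embedding": [0.8, 0.0, 0.2]},
]
hits = retrieve([1.0, 0.0, 0.0], library, k=2)
print([h["label"] for h in hits])
```

Returning known labels and dimensions alongside each hit is what lets the downstream model compare a test chip against exemplars with known qualities, as the abstract describes.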

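2601.22143 2026-05-12 cs.GR cs.CV Updated version

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

Anthony Chen, Naomi Ken Korem, Gal Zeevi, Tavi Halperin, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Or Patashnik, Daniel Cohen-Or

Affiliations * Tel Aviv University

AI summary This paper proposes JUST-DUB-IT, a video dubbing method built on an audio-visual diffusion model. A lightweight LoRA adapter lets the model generate dubbed audio in the target language together with synchronized facial motion from an input video. The generative model itself produces the paired multilingual training videos: the language is switched within a single clip, and the face and audio are then inpainted. The result is high-quality dubbing that preserves speaker identity and lip synchronization and is more robust under complex motion and in real-world scenes.

Comments Project webpage available at https://justdubit.github.io

Details
English abstract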

Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.

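2601.08321 2026-05-12 cs.CV Updated version

UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing

Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, Junshi Huang

Affiliations * Sun Yat-sen University

AI summary With the rapid development of image generation, visual text editing driven by natural language instructions is attracting growing attention. The core challenge is to understand the instruction and the reference image accurately and to generate visual text whose style is consistent with the image. The paper proposes UM-Text, a unified multimodal model that introduces a Visual Language Model (VLM) and a UM-Encoder to design text content and layout in fine detail, improves generation with a regional consistency loss and a three-stage training strategy, and contributes UM-DATA-200K, a large-scale visual text image dataset.

Comments Accepted by AAAI 2026

Details
English abstract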

With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.

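2512.15977 2026-05-12 cs.CV Updated version

Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario, Mason J. Earles

Affiliations * University of California, Davis

AI summary This study evaluates a range of open-source and closed-source vision-language models (VLMs) on agricultural image classification, covering 27 datasets, 162 classes, and 248,000 images. Zero-shot VLMs trail the supervised YOLO11 baseline substantially across all tasks and perform even worse under open-ended prompting, where LLM-based semantic judging is needed to raise measured accuracy. Although the best open-source model, Qwen-VL-72B, comes close to closed-source models under constrained prompting, current VLMs are not yet suitable as standalone agricultural diagnostic systems and are better used as assistive tools backed by constrained interfaces and domain knowledge.

Details
English abstract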

Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.

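2512.06949 2026-05-12 cs.CV Updated version

Can We Go Beyond Visual Features? Neural Tissue Relation Modeling for Relational Graph Analysis in Non-Melanoma Skin Histology

Shravan Venkatraman, Muthu Subash Kavitha, Joe Dhanith P R, V Manikandarajan, Jia Wu

Affiliations * Mohamed bin Zayed University of AI; School of Information and Data Sciences; Vellore Institute of Technology; Loughborough University; MD Anderson Cancer Center, The University of Texas

AI summary In skin cancer diagnosis, histopathology image segmentation is essential for identifying tissue structures, but modeling spatial context and inter-tissue relationships remains challenging, especially where tissues overlap or look morphologically similar. The paper proposes Neural Tissue Relation Modeling (NTRM), a segmentation framework that adds a graph neural network to a convolutional neural network to model spatial and functional relationships among tissue types and so improve the structural coherence of the segmentation. On a non-melanoma skin cancer segmentation dataset, NTRM outperforms existing methods, with Dice similarity coefficient gains of 4.9% to 31.25%, which shows the potential of relational modeling for more accurate and interpretable segmentation.

Comments CVPR 2026 Workshops

Details
English abstract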

Histopathology image segmentation is essential for delineating tissue structures in skin cancer diagnostics, but modeling spatial context and inter-tissue relationships remains a challenge, especially in regions with overlapping or morphologically similar tissues. Current convolutional neural network (CNN)-based approaches operate primarily on visual texture, often treating tissues as independent regions and failing to encode biological context. To this end, we introduce Neural Tissue Relation Modeling (NTRM), a novel segmentation framework that augments CNNs with a tissue-level graph neural network to model spatial and functional relationships across tissue types. NTRM constructs a graph over predicted regions, propagates contextual information via message passing, and refines segmentation through spatial projection. Unlike prior methods, NTRM explicitly encodes inter-tissue dependencies, enabling structurally coherent predictions in boundary-dense zones. On the benchmark Histopathology Non-Melanoma Skin Cancer Segmentation Dataset, NTRM outperforms state-of-the-art methods, achieving a robust Dice similarity coefficient that is 4.9% to 31.25% higher than the best-performing models among the evaluated approaches. Our experiments indicate that relational modeling offers a principled path toward more context-aware and interpretable histological segmentation, compared to local receptive-field architectures that lack tissue-level structural awareness. Our code is available at https://github.com/shravan-18/NTRM.
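For readers unfamiliar with the metric reported above, the Dice similarity coefficient for two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened toy masks; the mask values and the convention of returning 1.0 for two empty masks are illustrative assumptions, not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumed convention)
    return 2.0 * intersection / total

# Toy flattened masks for a single tissue class (values are made up).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
print(round(score, 4))
```

For multi-class segmentation the score is typically computed per class and then averaged; the abstract does not say which averaging the authors use.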

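2511.23332 2026-05-12 cs.CV Updated version

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang

Affiliations * Beijing Institute of Technology; Wuhan University; Zhongguancun Academy; Hong Kong Polytechnic University

AI summary This paper proposes UniGeoSeg, a unified open-world segmentation framework for remote sensing geospatial scenes, which addresses the fragmented task definitions and limited instruction data of existing methods. The authors build the GeoSeg-1M dataset of image-mask-instruction triplets and design GeoSeg-Bench to evaluate understanding and reasoning in complex geospatial scenes. UniGeoSeg supports multi-task learning through task-aware text enhancement, latent knowledge memory, and a progressive training strategy; it performs strongly on multiple benchmarks and shows strong zero-shot generalization.

Comments Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg ; Accepted by CVPR 2026

Details
English abstract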

Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.

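2511.07756 2026-05-12 cs.CV Updated version

Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation

Song Yan, Wei Zhai, Chenfeng Wang, Xinliang Bi, Jian Yang, Yancheng Cai, Yusen Zhang, Yunwei Lan, Tao Zhang, GuanYe Xiong, Min Li, Zheng-Jun Zha

Affiliations * USTC; Li Auto Inc.; Xi’an High-tech Research Institute; Wechat Vision; Cambridge University; HUST

AI summary Diffusion models start generation from an isotropic Gaussian latent, yet changing only the random seed produces marked differences in prompt faithfulness, composition, and visual quality. By analyzing the semantic map from initial noise to generated content, the paper gives a geometric explanation of seed sensitivity: most directions in the latent space barely affect semantics, while semantically sensitive variation is concentrated in a much smaller subspace. Building on this, the authors propose a training-free prompt-residual seed-shaping method that injects only the tangential component of a prompt residual and then retracts the seed to the original Gaussian radius shell, which improves alignment and quality while keeping the initialization compatible with the prior.

Details
English abstract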

Diffusion models start generation from an isotropic Gaussian latent, yet changing only the random seed can lead to large differences in prompt faithfulness, composition, and visual quality. We study this seed sensitivity through the semantic map from initial noise to generated meaning. Although the sampling flow is locally invertible, the subsequent semantic projection is many-to-one, inducing a degenerate pullback semi-metric on the latent space: most local directions are nearly semantic-invariant, while semantic-sensitive variation is concentrated in a much smaller horizontal subspace. This provides an explanatory geometric view of the seed lottery. Motivated by this view, we introduce a training-free prompt-residual seed-shaping procedure. Rather than claiming to recover the exact horizontal space, the method uses a single high-noise cold-start prompt residual as a model-coupled proxy, injects only its tangential component, and retracts the seed to the original Gaussian radius shell. This keeps the initialization prior-compatible while adding only one conditional/unconditional probe before standard sampling. Across multiple generation benchmarks, the method improves alignment and quality metrics over standard sampling, supporting both the practical value of the proxy and the explanatory relevance of semantic anisotropy.
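The "inject only the tangential component, then retract to the original radius shell" step can be sketched geometrically as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: `shape_seed` and its `step` size are hypothetical, and the residual here is a random stand-in for the conditional/unconditional probe that the abstract mentions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def shape_seed(seed, residual, step=0.1):
    """Move the seed along the residual's tangential part, then restore its norm."""
    radius = norm(seed)
    # Tangential component: subtract the projection of the residual onto the seed.
    coeff = dot(residual, seed) / dot(seed, seed)
    tangential = [r - coeff * z for r, z in zip(residual, seed)]
    moved = [z + step * t for z, t in zip(seed, tangential)]
    # Retraction: rescale so the shaped seed lies on the original radius shell.
    scale = radius / norm(moved)
    return [scale * m for m in moved], tangential

rng = random.Random(0)
seed = [rng.gauss(0.0, 1.0) for _ in range(16)]      # initial Gaussian latent
residual = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for the prompt residual

shaped, tangential = shape_seed(seed, residual)
print(abs(norm(shaped) - norm(seed)) < 1e-9)
```

Because the tangential component is orthogonal to the seed, the move never shrinks the norm before retraction, so the rescaling is always well defined for a nonzero seed.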

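2510.25372 2026-05-12 cs.CV cs.LG Updated version

Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty

Affiliations * Department of Computational and Data Sciences, Indian Institute of Science; Accenture, Japan; Google, India; Indian Institute of Science

AI summary This paper studies how to prompt-tune Vision Transformers efficiently and with good generalization in federated learning. To address the poor generalization of global prompt tuning and the overfitting of personalized tuning, the authors propose the PEP-FedPT framework and introduce the Class-Contextualized Mixed Prompt (CCMP), which dynamically combines class-specific prompts using global class prototypes and client class priors to personalize prompts per sample without storing client-specific parameters. Experiments show that the method outperforms existing approaches on multiple datasets.

Comments Accepted to TMLR 2026

Details
English abstract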

Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP), based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via the traditional federated averaging technique. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
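The abstract does not state how the mixing weights are computed from prototypes and priors. One plausible reading, assumed here purely for illustration, is a softmax over the similarity between the input feature and each class prototype, shifted by the log of the client's class prior. The function names and toy values below are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mixing_weights(feature, prototypes, class_priors):
    """Softmax over <feature, prototype_c> + log(prior_c); assumes every prior > 0."""
    logits = [dot(feature, proto) + math.log(prior)
              for proto, prior in zip(prototypes, class_priors)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_prompt(weights, class_prompts):
    """Weighted sum of the class-specific prompt vectors."""
    dim = len(class_prompts[0])
    return [sum(w * prompt[d] for w, prompt in zip(weights, class_prompts))
            for d in range(dim)]

# Toy setup: 3 classes, 2-D features, 3-D prompts (all values made up).
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # global class prototypes
class_priors = [0.5, 0.3, 0.2]                       # this client's label frequencies
class_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
feature = [2.0, 0.1]                                 # feature of one input sample

weights = mixing_weights(feature, prototypes, class_priors)
prompt = mixed_prompt(weights, class_prompts)
print([round(w, 3) for w in weights])
```

Under this reading the per-sample prompt depends only on shared prompts, shared prototypes, and a non-trainable prior vector, which is consistent with the claim that no client-dependent trainable parameters are stored.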

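2510.10606 2026-05-12 cs.CV Updated version

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia

Affiliations * The Chinese University of Hong Kong; Renmin University of China; The Hong Kong University of Science

AI summary ViSurf is a unified single-stage fine-tuning method that aims to resolve the tension between knowledge injection and performance improvement in large vision-and-language models. It combines the strengths of Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) by injecting ground-truth labels directly into the RLVR process, so that external supervision and internal reinforcement are optimized simultaneously. ViSurf also introduces three new reward control strategies to keep training stable, and experiments show that it outperforms standalone SFT, standalone RLVR, and the traditional two-stage pipeline on multiple benchmarks.

Details
English abstract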

Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. While a sequential SFT → RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

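2510.04142 2026-05-12 cs.CV cs.AI cs.LG Updated version

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments

Xiaoyu Yang, En Yu, Wei Duan, Jie Lu

Affiliations * Australian Artificial Intelligence Institute (AAII); Faculty of Engineering and Information Technology; University of Technology Sydney; Australia

AI summary This paper studies robust reasoning alignment from multiple multi-modal large language models in non-stationary multi-stream environments. To counter the systematic bias that arises as the source models' reasoning distributions evolve over time, the authors propose Autonomous Preference Optimization (APO), a constraint-satisfaction framework that treats inter-model divergences as dynamic negative constraints and aligns in two stages: supervised bootstrapping first gives the target model the combined capabilities of the source models, and constraint-aware optimization then produces a consistent consensus manifold. Experiments on chest X-ray interpretation show superior robustness, and the authors release CXR-MAX, a benchmark of reasoning trajectories from seven multi-modal large language models.

Comments ICML 2026

Details
English abstract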

This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
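The abstract names a multi-negative Plackett-Luce objective without giving its form. One common reading, assumed here, is the negative log of the Plackett-Luce probability that the preferred trajectory ranks first among itself and all drifting (negative) trajectories. The function name and the scalar scores are illustrative; how APO actually scores trajectories is not stated in the abstract.

```python
import math

def multi_negative_pl_loss(preferred_score, negative_scores):
    """-log( exp(s_pref) / (exp(s_pref) + sum_j exp(s_neg_j)) ), computed stably."""
    scores = [preferred_score] + list(negative_scores)
    top = max(scores)
    # log-sum-exp with the maximum subtracted to avoid overflow
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_denominator - preferred_score

# Made-up scores: higher means the model currently favors that trajectory.
loss_good = multi_negative_pl_loss(2.0, [0.0, -1.0])  # preferred already ranked first
loss_bad = multi_negative_pl_loss(0.0, [2.0, 1.0])    # negatives outrank the preferred one
print(loss_good < loss_bad)
```

Minimizing this quantity raises the preferred score relative to every negative at once, which is one way to "suppress drifting trajectories" with several negatives per example.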

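2510.03895 2026-05-12 cs.RO cs.CV Updated version

NoTVLA: Semantics-Preserving Robot Adaptation via Narrative Action Interfaces

Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Ye Lin, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen

Affiliations * Zhejiang University

AI summary This study proposes NoTVLA, a semantics-preserving robot adaptation framework that addresses the catastrophic forgetting that Vision-Language-Action (VLA) models face in real-world deployment. Its core idea is to focus on sparse trajectories rather than dense action sequences, combining temporal compression with spatial reasoning pruning to optimize trajectory planning and reduce compute requirements. In multi-task evaluation NoTVLA outperforms existing models while using far less compute and requiring no wrist-mounted camera, and it supports cross-platform deployment and zero-shot generalization.

Details
English abstract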

Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories, thereby avoiding the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object's trajectory, it leverages temporal compression and spatial reasoning pruning specifically for the robot end effector's trajectory. Furthermore, training is conducted using these sparse trajectories rather than dense action trajectories, an optimization that delivers remarkable practical advantages, including better zero-shot performance. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints: it uses over an order of magnitude less computing power than pi0 and requires no wrist-mounted camera. This design ensures that NoTVLA's operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model's inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and fostering a degree of generalization even when perceiving tasks from novel perspectives.

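2508.20325 2026-05-12 cs.CL cs.AI cs.CV Updated version

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Zelei Cheng, Haohan Wang

Affiliations * Hong Kong University of Science and Technology; Lapis Labs; Capital One

AI summary As Large Language Models (LLMs) are used ever more widely across domains, the risk that they generate harmful content has raised societal and regulatory concern. To verify whether LLMs comply with government-issued ethics guidelines, the paper proposes GUARD, which automatically generates guideline-violating questions and combines them with jailbreak diagnostics to assess how well a model adheres to the guidelines. The method identifies responses that directly violate the guidelines and also uncovers scenarios that could bypass safety mechanisms. It has been validated empirically on several mainstream LLMs.

Comments 56 pages

Details
English abstract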

As Large Language Models (LLMs) become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into its diagnostics through a component named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We empirically validated the effectiveness of GUARD on eight LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models (MiniGPT-v2 and Gemini-1.5), demonstrating its usage in promoting reliable LLM-based applications.

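2508.06248 2026-05-12 cs.CV Updated version

Deepfake Detection that Generalizes Across Benchmarks

Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz

Affiliations * Czech Technical University in Prague; CISPA Helmholtz Center for Information Security

AI summary This paper studies how to make deepfake detection generalize well to unseen manipulation techniques. It proposes GenD, which fine-tunes only the Layer Normalization parameters of a pre-trained vision encoder (0.03% of all parameters) and combines L2 normalization with metric learning to generalize efficiently. The method achieves state-of-the-art results on 14 benchmark datasets from different years, showing that strong cross-dataset detection is achievable while keeping the model simple.

Details
English abstract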

The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders. The proposed method, GenD, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and metric learning on it. We conducted an extensive evaluation on 14 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained foundational image encoder model. The code is at: https://github.com/yermandy/GenD

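2505.20381 2026-05-12 cs.CV Updated version

ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

Sijia Chen, Yanqiu Yu, En Yu, Wenbing Tao

Affiliations * National Key Laboratory of Science and Technology on Multi-spectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

AI summary ReaMOT is a reasoning-based multi-object tracking task that aims to track targets specified by language instructions through logical reasoning, overcoming existing methods' reliance on explicit visual-textual matching. The authors introduce the ReaMOT Challenge benchmark, which contains a large number of language instructions and video sequences, and design the ReaTrack framework, which combines a large vision-language model with motion priors for more robust tracking. Experiments show that ReaTrack delivers marked gains on high-level reasoning tasks.

Comments Code: https://github.com/chen-si-jia/ReaMOT

Details
English abstract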

Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms heavily rely on explicit visual-textual matching and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that elevates tracking to a cognitive level, requiring models to infer and track specific targets satisfying implicit constraints via logical reasoning. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark featuring a tailored metric suite and a large-scale dataset. This dataset comprises 1,156 language instructions, 423,359 image-language pairs, and 869 distinct video sequences systematically categorized into six distinct evaluation scenarios, with over 75% of the instructions dedicated to High Level Reasoning. Furthermore, recognizing that traditional trackers lack cognitive capacity while direct application of a Large Vision-Language Model (LVLM) yields severe temporal inconsistencies, we propose ReaTrack. Driven by the insight to decouple high-level cognitive localization from low-level physical motion continuity, this training-free framework dynamically aligns the semantic detections of a Thinking-variant LVLM with the robust motion priors of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate that ReaTrack establishes a new leading performance standard. Notably, it achieves a more than threefold improvement in RHOTA on the High Level Reasoning subset. Our dataset and code will be available at https://github.com/chen-si-jia/ReaMOT.

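2505.20001 2026-05-12 cs.CV Updated version

NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification

Shihao Li, Huaibo Huang, Junxian Duan, Aihua Zheng, Jin Tang, Jixin Ma

Affiliations * State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation; School of Artificial Intelligence; Anhui University; State Key Laboratory of Multimodal Artificial Intelligence Systems; New Laboratory of Pattern Recognition; CASIA; University of Chinese Academy of Sciences; School of Computing and Mathematical Sciences; University of Greenwich

AI summary This paper studies multi-modal object re-identification, which aims to obtain complete identity features from heterogeneous modalities. Because existing methods rely on implicit feature fusion and struggle to model fine-grained recognition patterns, the authors propose NEXT, a multi-grained mixture-of-experts framework based on text modulation. The method generates high-quality descriptive captions using attribute confidence and splits recognition into a semantic branch and a structural branch, which capture fine-grained appearance features and coarse-grained structure features respectively. A multi-grained feature aggregation strategy then yields a more accurate identity representation. Experiments show that the method significantly outperforms existing state-of-the-art methods on multiple datasets.

Details
English abstract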

Multi-modal object Re-IDentification (ReID) aims to obtain complete identity features across heterogeneous modalities. However, most existing methods rely on implicit feature fusion modules, making it difficult to model fine-grained recognition patterns under various challenges in the real world. Benefiting from the powerful Multi-modal Large Language Models (MLLMs), the object appearances are effectively translated into descriptive captions. In this paper, we propose a reliable caption generation pipeline based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs and improves the quality of generated text. Additionally, to model diverse identity patterns, we propose a novel ReID framework, named NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural branches to separately capture fine-grained appearance features and coarse-grained structure features. For semantic recognition, we first propose a Text-Modulated Semantic Experts (TMSE) module, which randomly samples high-quality captions to modulate experts capturing semantic features and mining inter-modality complementary cues. Second, to recognize structure features, we propose a Context-Shared Structure Experts (CSSE) module, which focuses on the holistic object structure and maintains identity structural consistency via a soft routing mechanism. Finally, we propose a Multi-Grained Features Aggregation (MGFA) module, which adopts a unified fusion strategy to effectively integrate multi-grained expert features into the final identity representations. Extensive experiments on two public person datasets and three vehicle datasets demonstrate the effectiveness of our method, showing that it significantly outperforms existing state-of-the-art methods.

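2505.19519 2026-05-12 cs.CV Updated version

Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift

Gihoon Kim, Hyungjin Park, Taesup Kim

Affiliations * Graduate School of Data Science; Seoul National University

AI summary This paper studies how to personalize text-to-image diffusion models without causing distributional drift. The authors point out that existing methods tend to overfit the reference images and ignore the user's prompt during personalization, and that the root cause is a failure to enforce subject fidelity and text alignment at the same time. They propose a Lipschitz-based regularization objective that constrains parameter updates and keeps the pretrained model's output distribution stable, so the model adapts accurately to new concepts while retaining its original generative ability. Experiments show superior quantitative and qualitative performance across several diffusion model architectures.

Comments Accepted at ICLR 2026

Details
English abstract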

Personalizing text-to-image diffusion models involves integrating novel visual concepts from a small set of reference images while retaining the model's original generative capabilities. However, this process often leads to overfitting, where the model ignores the user's prompt and merely replicates the reference images. We attribute this issue to a fundamental misalignment between the true goals of personalization, which are subject fidelity and text alignment, and the training objectives of existing methods that fail to enforce both objectives simultaneously. Specifically, prior approaches often overlook the need to explicitly preserve the pretrained model's output distribution, resulting in distributional drift that undermines diversity and coherence. To resolve these challenges, we introduce a Lipschitz-based regularization objective that constrains parameter updates during personalization, ensuring bounded deviation from the original distribution. This promotes consistency with the pretrained model's behavior while enabling accurate adaptation to new concepts. Furthermore, our method offers a computationally efficient alternative to commonly used, resource-intensive sampling techniques. Through extensive experiments across diverse diffusion model architectures, we demonstrate that our approach achieves superior performance in both quantitative metrics and qualitative evaluations, consistently excelling in visual fidelity and prompt adherence. We further support these findings with comprehensive analyses, including ablation studies and visualizations.

2504.02373 2026-05-12 eess.IV cs.CV updated version

HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement

Hantang Li, Qiang Zhu, Xiandong Meng, Lei Xiong, Shuyuan Zhu, Xiaopeng Fan

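Affiliations * Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; University of Electronic Science and Technology of China

AI summary In practical applications, low-light images are often compressed for efficient storage and transmission, yet most existing methods either ignore compression artifact removal or struggle to provide a unified enhancement framework. This paper proposes a hybrid priors-guided network (HPGN) that combines compression and illumination priors, using the JPEG quality factor and DCT quantization matrix to guide module design, and jointly enhances low-light images across different compression qualities. Experimental results indicate that the method improves image quality over existing approaches.

Comments 5 pages, 3 figures

Details
English abstract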

In practical applications, low-light images are often compressed for efficient storage and transmission. Most existing methods either disregard compression artifact removal or fail to establish a unified framework for jointly enhancing low-light images with varying compression qualities. To address this problem, we propose an efficient hybrid priors-guided network (HPGN) that enhances compressed low-light images by integrating both compression and illumination priors. Our approach fully utilizes the JPEG quality factor (QF) and DCT quantization matrix (QM) to guide the design of efficient plug-and-play modules for joint tasks. Additionally, we employ a random QF generation strategy to guide model training, enabling a single model to enhance low-light images with different compression levels. Experimental results demonstrate the superiority of our proposed method.
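
The relationship between a JPEG quality factor and its quantization matrix can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the standard JPEG luminance table and the widely used libjpeg quality-scaling convention, and the `sample_qf` helper with its 10 to 90 range is a hypothetical stand-in for the random QF generation strategy, whose actual range the abstract does not state.

```python
import numpy as np

# Standard JPEG luminance quantization table (Annex K of the JPEG standard).
BASE_LUMA_QM = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
], dtype=np.int64)


def quantization_matrix(qf: int) -> np.ndarray:
    """Scale the base table for a quality factor in [1, 100] (libjpeg convention)."""
    if not 1 <= qf <= 100:
        raise ValueError("quality factor must be in [1, 100]")
    scale = 5000 // qf if qf < 50 else 200 - 2 * qf
    qm = (BASE_LUMA_QM * scale + 50) // 100
    # Baseline JPEG keeps every entry in [1, 255].
    return np.clip(qm, 1, 255)


def sample_qf(rng: np.random.Generator, low: int = 10, high: int = 90) -> int:
    """Hypothetical random-QF sampler; the range is an assumption for illustration."""
    return int(rng.integers(low, high + 1))


rng = np.random.default_rng(0)
qf = sample_qf(rng)
qm = quantization_matrix(qf)  # an 8x8 prior that a network could be conditioned on
```

Lower quality factors produce larger quantization steps, which is why the QF and QM carry usable information about how severe the compression artifacts are likely to be.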

2411.18111 2026-05-12 cs.CV updated version

When Large Vision-Language Models Meet Person Re-Identification

Qizao Wang, Bin Li, Xiangyang Xue

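Affiliations * School of Computer Science, Fudan University, Shanghai, China

AI summary This paper studies how to apply large vision-language models (LVLMs) to person re-identification (ReID). Conventional ReID relies on extracting discriminative identity features, whereas LVLMs operate in a generative paradigm geared toward cross-modal understanding. The authors propose LVLM-ReID, a framework that uses instructions to guide the LVLM to generate one semantic token capturing the key appearance semantics of the person, refines that token through a Semantic-Guided Interaction (SGI) module that lets it interact with the visual tokens, and uses the reinforced token as the identity representation. The method achieves competitive results on multiple benchmarks without additional image-text annotations, showing the potential of LVLM-generated semantics for ReID.

Comments Accepted by ICASSP 2026

Details
English abstract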

Large Vision-Language Models (LVLMs) that incorporate visual models and large language models have achieved impressive results across cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the representation of pedestrian identity. Our framework integrates the semantic understanding and generation capabilities of LVLM into end-to-end ReID training, allowing LVLM to capture rich semantic cues during both training and inference. LVLM-ReID achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID.

2411.04077 2026-05-12 cs.CV updated version

H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

Nhi Pham, Michael Schott

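Affiliations * Max Planck Institute for Informatics; Saarland University; Zuse School

AI summary This paper proposes H-POPE, a hierarchical polling-based probing benchmark for systematically assessing hallucination in large vision-language models at the level of object existence and attributes. Its coarse-to-fine evaluation shows that models are even more prone to hallucination on fine-grained attributes than on object existence. The study further investigates whether models rely on the visual input when producing text.

Comments Poster at https://sites.google.com/berkeley.edu/bb-stat/home

Details
English abstract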

By leveraging both texts and images, large vision language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, e.g., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.

2410.10247 2026-05-12 cs.CV cs.AI updated version

LPT: Less-overfitting Prompt Tuning for Vision-Language Model

Chenhao Ding, Xinyuan Gao, Songlin Dong, Jizhou Han, Qiang Wang, Zhengdong Zhou, Yuhang He, Yihong Gong

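Affiliations * IEEE

AI summary This work targets the overfitting that vision-language models suffer during transfer to downstream tasks and proposes a prompt-tuning framework named LPT. It uses CLIP to filter out fine-grained foreground information so that prompts are guided by basic visual concepts, and adds a feature-level Structural Preservation (SP) constraint and an output-level Hierarchical Logit (HL) constraint to strengthen generalization. Experiments on base-to-novel, cross-dataset transfer, and domain generalization benchmarks show improved generalization and reduced overfitting.

Details
English abstract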

Vision-language models (VLMs) have demonstrated exceptional generalization capabilities for downstream tasks. Owing to its efficiency, prompt learning has gradually become a more effective method than traditional finetuning for transferring VLMs to downstream tasks. However, during the transfer process, these models are prone to severe overfitting, leading to a significant decline in generalization ability. To address this issue, we propose a framework named LPT, specifically designed for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that may lead to overfitting, thereby guiding the prompts with basic visual concepts. Additionally, to further mitigate overfitting, we have developed a Structural Preservation (SP) constraint at the feature level, which aligns the model's overall feature space structure with the frozen CLIP, endowing the feature space with overall plasticity and enabling effective reshaping of the feature space during optimization. Moreover, we employ a Hierarchical Logit (HL) constraint at the output layer to constrain the overall class information in the output, complementing the role of SP at the output end. Extensive experiments across various benchmarks (from base-to-novel, cross-dataset transfer, and domain generalization) demonstrate that our approach significantly improves generalization capability and effectively alleviates overfitting compared to state-of-the-art methods.

2006.02666 2026-05-12 eess.IV cs.CV updated version

Deep Sequential Feature Learning in Clinical Image Classification of Infectious Keratitis

Yesheng Xu, Ming Kong, Wenjia Xie, Runping Duan, Zhengqing Fang, Yuxiao Lin, Qiang Zhu, Siliang Tang, Fei Wu, Yu-Feng Yao

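Affiliations * Department of Ophthalmology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine; College of Computer Science and Technology, Zhejiang University

AI summary This paper addresses clinical image classification of infectious keratitis and proposes a sequential-level deep learning model to discriminate the subtle differences among infectious corneal diseases. The method preserves the spatial structure of clinical images and disentangles informative features for classification. On 120 test images the model reached 80.00% diagnostic accuracy, compared with 49.27% for the 421 ophthalmologists it was compared against.

Comments Accepted by Engineering

Details
English abstract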

Infectious keratitis is the most common entity among corneal diseases, in which pathogens grow in the cornea, leading to inflammation and destruction of the corneal tissues. Infectious keratitis is a medical emergency that requires rapid and accurate diagnosis so that precise treatment can be started promptly to halt disease progression and limit the extent of corneal damage; otherwise it may develop into a sight-threatening and even eye-globe-threatening condition. In this paper, we propose a sequential-level deep learning model to effectively discriminate the subtle distinctions among infectious corneal diseases via the classification of clinical images. In this approach, we devise an appropriate mechanism to preserve the spatial structures of clinical images and disentangle the informative features for clinical image classification of infectious keratitis. In a comparison with 421 ophthalmologists over 120 test images, the proposed sequential-level deep model achieved 80.00% diagnostic accuracy, far better than the 49.27% achieved by the ophthalmologists.

2605.10111 2026-05-12 cs.LG cs.AI cs.CV updated version

CFSPMNet: Cross-subject Fourier-guided Spatial-Patch Mamba Network for EEG Motor Imagery Decoding in Stroke Patients

Xiangkai Wang, Yun Zhao, Dongyi He, Qingling Xia, Gen Li, Xinlai Xing, Yuchi Pan, Bin Jiang

Affiliations * School of Artificial Intelligence, Chongqing University of Technology; Chongqing Key Laboratory of Embodied Intelligence Perception and Autonomous Learning for Humanoid Robots; Key Laboratory of Advanced Equipment Intelligence of the Chongqing Education Commission; School of Smart Health, Chongqing Polytechnic University of Electronic Technology; Department of Language Science and Technology, The Hong Kong Polytechnic University; School of Pharmacy and Bioengineering, Chongqing University of Technology

AI summary This work addresses the difficulty of cross-patient brain-computer interface (BCI) decoding for stroke patients and proposes a network framework named CFSPMNet. The method combines Fourier-domain state reorganization with Shared-Private Prototype Matching and models post-stroke MI-EEG as latent neural-state organization. In leave-one-subject-out experiments on two stroke MI-EEG datasets, CFSPMNet outperforms representative CNN-, Transformer-, Mamba-, and adaptation-based baselines.

Details
English abstract

Motor imagery electroencephalography (MI-EEG) decoding offers a non-invasive route for post-stroke rehabilitation, but cross-patient use remains difficult because pathological neural reorganization changes task-related EEG dynamics, aperiodic activity, local excitability, cross-regional coordination, and trial-level brain-state context. This makes source-learned MI representations unreliable for unseen patients. To address this problem, we propose CFSPMNet, a cross-patient adaptation framework that models post-stroke MI-EEG as latent neural-state organization. CFSPMNet combines a Fourier-Reorganized State Mamba Network (FRSM) with Shared-Private Prototype Matching (SPPM). FRSM represents each trial as a latent physiological token sequence, reorganizes token states in the Fourier domain, and uses Fourier-derived trial context to guide Mamba state-space propagation. SPPM improves target pseudo-label updating by combining semantic confidence with shared-private physiological consistency, filtering confident but physiologically inconsistent target predictions. Leave-one-subject-out experiments on two stroke MI-EEG datasets show that CFSPMNet outperforms representative CNN-, Transformer-, Mamba-, and adaptation-based baselines, achieving average accuracies of 68.23% on XW-Stroke and 73.33% on 2019-Stroke, with gains of 5.63 and 8.25 percentage points over the strongest competitors. Ablation, sensitivity, feature-alignment, pseudo-label selection, and neurophysiological visualization analyses further support the roles of Fourier-domain token-state reorganization and calibrated pseudo-label updating. These results suggest that latent neural-state modeling can improve rehabilitation-oriented cross-patient BCI decoding. Code is available at https://github.com/wxk1224/CFSPMNet.

2605.10106 2026-05-12 cs.CV cs.AI updated version

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang, Tiehua Zhang, Jingjing Chen, Xingjun Ma

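Affiliations * Fudan University; Bosch Center for Artificial Intelligence (BCAI); Tongji University

AI summary This paper proposes ViSRA, a Video-based Spatial Reasoning Agent that aims to improve the 3D spatial reasoning of multi-modal large language models (MLLMs). ViSRA requires no additional training: it draws on explicit spatial information from expert models to elicit spatial reasoning in a modular, extensible, plug-and-play manner. It improves a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively, while offering transferable 3D understanding and no post-training compute cost.

Details
English abstract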

Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a flexible plug-and-play paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost and no heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.

2605.10087 2026-05-12 cs.CV updated version

Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction

Guhnoo Yun, Juhan Yoo, Kijung Kim, Dong Hwan Kim

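Affiliations * Korea Institute of Science and Technology; Department of Computer Science, Semyung University

AI summary This paper presents a keyword-free initiation of interaction (IoI) detection framework for human-robot interaction (HRI) in domestic environments, based on fusing audio and vision sensors. When the user speaks while looking at the robot, the framework combines sound source localization with human tracking to locate the speaker and detects the IoI if the speaker's face is turned toward the robot; when the user does not speak, it detects the IoI once the user has looked at the robot for longer than a predefined period. A state transition model is designed and verified in experiments with a mobile robot, and all components are implemented and integrated in ROS.

Details
English abstract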

This paper describes a keyword-free initiation of interaction (IoI) detection framework for human-robot interaction (HRI), based on audio and vision sensor fusion in a domestic environment. In the proposed framework, the robot has its own audio and vision sensors and can also employ an external vision sensor for stable human detection and tracking. When the user starts to speak while looking at the robot, the robot localizes the user's position by combining sound source localization with human tracking information. The robot then detects the IoI if it perceives that the speaker's face is turned toward it. If the user does not speak, the robot can still detect the IoI when the user looks at it for longer than a predefined period of time. A state transition model for the proposed IoI detection framework is designed and verified by experiments with a mobile robot. To implement and integrate the model in a robot architecture, all components are implemented in the Robot Operating System (ROS) environment.
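
The two triggers described above (speech while facing the robot, or a sufficiently long gaze) can be sketched as a small state machine. This is only an illustration under assumptions: the state names, the `IoIDetector` class, the boolean inputs, and the 2-second threshold are all hypothetical, since the abstract does not specify the actual state transition model.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()         # nobody is being tracked
    TRACKING = auto()     # a person is tracked but is not facing the robot
    GAZING = auto()       # the person faces the robot; dwell time is accumulating
    INTERACTING = auto()  # initiation of interaction has been detected


class IoIDetector:
    """Hypothetical IoI detector driven by per-frame perception flags."""

    def __init__(self, gaze_threshold_s: float = 2.0):
        self.gaze_threshold_s = gaze_threshold_s  # assumed value, not from the paper
        self.state = State.IDLE
        self.gaze_time = 0.0

    def step(self, dt: float, person_tracked: bool,
             facing_robot: bool, speech_from_person: bool) -> State:
        if self.state is State.INTERACTING:
            return self.state  # stays latched until the caller resets the detector
        if not person_tracked:
            self.state, self.gaze_time = State.IDLE, 0.0
        elif not facing_robot:
            self.state, self.gaze_time = State.TRACKING, 0.0
        else:
            self.gaze_time += dt
            # Trigger 1: speech localized to a person who faces the robot.
            # Trigger 2: silent gaze that lasts longer than the threshold.
            if speech_from_person or self.gaze_time > self.gaze_threshold_s:
                self.state = State.INTERACTING
            else:
                self.state = State.GAZING
        return self.state
```

In a ROS setting, `step` would typically be called from a timer or subscriber callback, with the three flags filled in from the tracking, face-orientation, and sound-source-localization modules.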

2605.10079 2026-05-12 cs.CV updated version

SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

Liangyang Ouyang, Ruicong Liu, Caixin Kang, Yifei Huang, Yoichi Sato

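Affiliations * The University of Tokyo; Shanda AI Research Tokyo

AI summary This paper proposes SocialDirector, a training-free interaction controller that improves control over social interactions in multi-person video generation. By modulating cross-attention maps, it controls who performs which action, when the action occurs, and toward whom it is directed, which addresses actor-action mismatch, disordered social dynamics, and wrong action targets in existing models. The authors also build an automated evaluation pipeline, and experiments show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.

Details
English abstract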

Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in the wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person's visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., "leftward", "right"), leading each action towards its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
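
The two modules can be illustrated on a single cross-attention map. The sketch below is a simplified reading of the abstract, not the authors' implementation: the function name, the per-token owner arrays (with -1 meaning background or shared text), and the multiplicative `boost` are all assumptions, and the spatiotemporal mask is represented simply by assigning an owner to each flattened visual token.

```python
import numpy as np


def controlled_cross_attention(logits, visual_owner, text_owner, directional, boost=2.0):
    """Masked and reweighted cross-attention over one (visual, text) score matrix.

    logits:       (n_visual, n_text) raw attention scores.
    visual_owner: (n_visual,) person id per visual token, -1 for background.
    text_owner:   (n_text,) person id per text token, -1 for shared text.
    directional:  (n_text,) bool, True for directional words.
    """
    # Actor masking: a person's visual tokens may attend to shared text and to
    # that person's own description, but not to another person's description.
    allowed = (
        (text_owner[None, :] == -1)
        | (visual_owner[:, None] == -1)
        | (visual_owner[:, None] == text_owner[None, :])
    )
    masked = np.where(allowed, logits, -np.inf)
    # Directional reweighting: adding log(boost) to a logit multiplies that
    # token's unnormalized attention weight by `boost`.
    masked = masked + np.log(boost) * directional[None, :]
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)
```

The sketch assumes every visual token has at least one allowed text token (for example some shared text); otherwise a row would be fully masked and the softmax would be undefined.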

2605.10071 2026-05-12 cs.CV updated version

MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

Yaning Zhang, Tianyi Wang, Zan Gao, Yibo Zhao, Chunjie Ma, Meng Wang

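Affiliations * Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences); School of Computing, National University of Singapore; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences); Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology

AI summary With the rapid progress of photo-realistic face generation, generalizable face forgery detection and localization has become increasingly important. This paper proposes a multi-domain fine-grained vision-language reconstruction (MFVLR) model that captures diverse visual forgery traces through language-guided face forgery representation learning, enabling generalizable detection and localization of diffusion-synthesized face forgeries. The model comprises a fine-grained language transformer, a multi-domain vision encoder, and a vision decoder, together with a plug-and-play vision injection module, and outperforms the state of the art in cross-generator, cross-forgery, and cross-dataset evaluations.

Details
English abstract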

The swift advancement in photo-realistic face generation technology has sparked considerable concerns across society and academia, emphasizing the need for generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using the image modality alone; other modalities, such as fine-grained text, have not been comprehensively investigated, which restricts the generalization capability of models. Besides, they usually analyze facial images created by GANs, but struggle to identify and localize those synthesized by diffusion models. To address these problems, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that learns general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Besides, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art in different settings, including cross-generator, cross-forgery, and cross-dataset evaluations.

2605.10054 2026-05-12 cs.CV updated version

Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging

Zubair Faruqui, Rahul Dubey

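Affiliations * Department of Computer Science, Missouri State University

AI summary This study addresses the tendency of deep neural networks for medical image diagnosis to rely on spurious or clinically irrelevant cues, and incorporates explanation supervision directly into training to guide the model toward clinically meaningful regions. It systematically analyzes how different explanation loss designs and supervision strengths affect predictive performance and the spatial faithfulness of explanations, and introduces two quantitative metrics, annotation coverage and saliency precision, for assessing explanation quality. Experiments on annotated chest X-ray datasets show improved explanation alignment with comparable accuracy; the framework is applicable to other annotated biomedical imaging modalities.

Comments Under review at IEEE Journal of Biomedical and Health Informatics (JBHI)

Details
English abstract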

Deep neural networks for medical image diagnosis often achieve high predictive accuracy while relying on spurious or clinically irrelevant visual cues, limiting their trustworthiness in practice. Post-hoc explanation methods are widely used to visualize model decisions in the form of saliency maps; however, these explanations do not influence how models learn during training, allowing non-causal or confounding features to persist. This motivates the incorporation of explanation supervision directly into the training objective to guide model attention toward clinically meaningful regions and promote clinically grounded decision-making. This paper presents a systematic approach to integrate explanation loss into model training and analyzes how different explanation loss designs and supervision strengths influence both predictive performance and spatial faithfulness of explanations. To quantitatively assess interpretability, two complementary explanation performance metrics, annotation coverage and saliency precision, are introduced, enabling rigorous evaluation beyond qualitative visualization. Our experimental results reveal a clear trade-off between explanation quality and explanation loss coefficients. Furthermore, quantitative statistical analysis yields consistently improved explanation alignment while maintaining comparable accuracy. Experiments were conducted on annotated chest X-ray datasets; however, the proposed framework is applicable to a broad range of annotated biomedical imaging modalities. Overall, these findings demonstrate that explanation supervision is not a monolithic design choice and provide practical guidance for incorporating explanation loss into training objectives under noisy clinical annotations.
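
One plausible reading of the two metrics is sketched below. The abstract names annotation coverage and saliency precision but does not define them, so the definitions here are assumptions: coverage is taken as the fraction of the annotated region that the thresholded saliency map hits, and precision as the fraction of salient pixels that fall inside the annotation. The function name and the 0.5 threshold are likewise hypothetical.

```python
import numpy as np


def explanation_metrics(saliency, annotation, threshold=0.5):
    """Return (annotation_coverage, saliency_precision) under assumed definitions.

    saliency:   2D array of saliency values, assumed normalized to [0, 1].
    annotation: 2D binary mask of the clinically annotated region.
    """
    salient = saliency >= threshold
    annotated = annotation.astype(bool)
    overlap = np.logical_and(salient, annotated).sum()
    # Assumed definition: share of the annotated region that is highlighted.
    coverage = overlap / annotated.sum() if annotated.any() else 0.0
    # Assumed definition: share of highlighted pixels that lie in the annotation.
    precision = overlap / salient.sum() if salient.any() else 0.0
    return float(coverage), float(precision)
```

Under these definitions the two numbers behave like recall and precision for the saliency map, which is one way they could be complementary: a map that lights up the whole image has full coverage but low precision.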

2605.10050 2026-05-12 cs.CV updated version

EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs

Jiameng Li, Minye Wu, Jiezhang Cao, Aleksei Tiulpin, Matthew B. Blaschko

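Affiliations * KU Leuven; Shanghai Jiaotong University; Weill Cornell Medicine

AI summary Video large language models (VideoLLMs) struggle with long videos: dense frame sampling produces massive numbers of visual tokens, while sparse sampling risks missing critical temporal evidence and causing hallucination. This paper proposes EchoPrune, a lightweight, training-free token pruning method that interprets redundant tokens as temporal echoes and scores tokens by query-guided cross-modal relevance and temporal reconstruction error, improving temporal resolution under a fixed token budget. Experiments show that EchoPrune lets VideoLLMs process up to 20x as many frames under the same token budget, with improved performance and faster inference across multiple benchmarks.

Comments 9 pages

Details
English abstract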

Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as dense frame sampling introduces massive numbers of visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos the same as static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided cross-modal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x as many frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
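
The scoring idea can be sketched in a few lines. This is a simplified stand-in rather than the paper's method: reconstruction error is approximated here by one minus the best cosine match against the previous frame's tokens, and the `select_tokens` name, the linear mixing weight `alpha`, and the top-k selection are assumptions made for illustration.

```python
import numpy as np


def _normalize(x):
    """L2-normalize along the last axis, guarding against zero vectors."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True).clip(min=1e-8)


def select_tokens(prev_tokens, cur_tokens, query, budget, alpha=0.5):
    """Keep the `budget` highest-scoring tokens of the current frame.

    prev_tokens: (n_prev, d) visual tokens of the previous frame.
    cur_tokens:  (n_cur, d) visual tokens of the current frame.
    query:       (d,) pooled text-query embedding.
    """
    prev_n, cur_n, q_n = _normalize(prev_tokens), _normalize(cur_tokens), _normalize(query)
    # A token that closely matches something in the previous frame is an "echo".
    best_match = (cur_n @ prev_n.T).max(axis=1)
    novelty = 1.0 - best_match
    # Query-guided relevance as cosine similarity to the query embedding.
    relevance = cur_n @ q_n
    score = alpha * relevance + (1.0 - alpha) * novelty
    keep = np.argsort(-score)[:budget]
    return np.sort(keep)  # indices in original token order
```

Because the budget is spent per frame on novel or query-relevant tokens only, the same total token count can be spread over more frames.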

2605.10046 2026-05-12 cs.CV cs.LG cs.MA updated version

PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows

Yufeng Zhu, Chunlei Shi, Yongchao Feng, Dan Niu

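Affiliations * Department of Automation, Southeast University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

AI summary This paper proposes PixelFlowCast, a precipitation nowcasting method for efficient, high-fidelity short-term radar echo forecasting without latent space compression. It uses a two-stage framework: in the first stage a deterministic model produces coarse forecasts that capture global evolution trends; in the second stage KANCondNet extracts deep spatiotemporal features to provide accurate conditional guidance, and a Pixel Mean Flows predictor generates high-quality predictions in a few steps. Experiments on the SEVIR dataset show that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly for long sequence forecasting.

Comments 26 pages, 7 figures

Details
English abstract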

Precipitation nowcasting aims to forecast short-term radar echo sequences for extreme weather warning, where both prediction fidelity and inference efficiency are critical for real-world deployment. However, diffusion-based models, despite their strong generative capability, suffer from slow inference due to multi-step sampling trajectories, limiting their practical usability. Conditional Flow Matching (CFM) improves efficiency via straightened trajectories, but relies on latent space compression, which inevitably discards high-frequency physical details and degrades fine-grained prediction quality. To address these limitations, we propose PixelFlowCast, a two-stage probabilistic forecasting framework that achieves both high-efficiency and high-fidelity prediction without latent compression. Specifically, in the first stage, a deterministic model first produces coarse forecasts to capture global evolution trends. In the subsequent stage, the proposed KANCondNet extracts deep spatiotemporal evolution features to provide accurate conditional guidance. Based on this, a latent-free, few-step Pixel Mean Flows (PMF) predictor employs an $x$-prediction mechanism to generate high-quality predictions, effectively preserving fine-grained structures while maintaining fast inference. Experiments on the publicly available SEVIR dataset demonstrate that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly for long sequence forecasting, highlighting its strong potential for real-world operational deployment.

2605.10045 2026-05-12 cs.CV updated version

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

Feihong Yan, Shaoyu Liu, Haixuan Wang, Shuai Lu, Linfeng Zhang, Huiqi Li, Xiangyang Ji

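Affiliations * Beijing Institute of Technology; Xidian University; Northeastern University at Qinhuangdao; Shanghai Jiao Tong University; Department of Automation, Tsinghua University

AI summary Visual autoregressive (VAR) models are a strong alternative to diffusion models for image synthesis, but their fixed training resolution prevents direct generation at higher resolutions. This paper proposes ExtraVAR, which introduces Stage-Aware RoPE Remapping to suppress the global repetition, local repetition, and detail degradation that arise during resolution extrapolation, and Entropy-Driven Adaptive Attention Calibration to compensate for attention dispersion at higher resolutions. Experiments show that the method outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity.

Comments 10 pages, 7 figures

Details
English abstract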

Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at https://github.com/feihongyan1/ExtraVAR.
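
The entropy-matching idea behind the attention calibration can be illustrated numerically. The paper derives a closed-form per-head scaling factor, which the abstract does not give; the sketch below instead finds a scale by bisection, purely as a stand-in, and the function names, search bounds, and random test logits are all assumptions.

```python
import numpy as np


def normalized_entropy(logits, scale=1.0):
    """Entropy of softmax(scale * logits) divided by log(N), so it lies in [0, 1]."""
    z = scale * logits
    z = z - z.max()
    p = np.exp(z)
    p = p / p.sum()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(len(logits)))


def calibrate_scale(logits, target, lo=0.1, hi=10.0, iters=60):
    """Bisection stand-in for the closed-form factor (not the paper's formula).

    Softmax entropy decreases as the scale grows, so the search raises the
    scale while the attention is still more dispersed than the target.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if normalized_entropy(logits, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = np.random.default_rng(0)
train_logits = rng.normal(size=64)    # one attention row at the training resolution
extra_logits = rng.normal(size=256)   # the same head at a higher resolution
target = normalized_entropy(train_logits)
scale = calibrate_scale(extra_logits, target)
```

Dividing by log(N) is what makes the entropy comparable across resolutions with different numbers of keys, which is the role the abstract assigns to the resolution-invariant normalization.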

2605.10029 2026-05-12 cs.CV updated version

Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities

Shuyang Hou, Ziqi Liu, Haoyue Jiao, Zhangyan Xu, Xiaopu Zhang, Lutong Xie, Yaxian Qing, Jianyuan Liang, Xuefeng Guan, Huayi Wu

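Affiliations * State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing

AI summary This study evaluates AlphaEarth Foundations (AEF), a globally consistent annual surface embedding at 10 m resolution, for slum detection and density estimation across 12 cities. Across several training strategies and auxiliary feature configurations, same-city cross-year training performs best, and the analysis reveals AEF's limited capacity to model intra-pixel density gradients: regression performance is driven mainly by zero/non-zero boundary discrimination. POI features yield the largest density gain, and full-AOI inference over 2017-2024 preserves slum cluster structure in the six cities that meet the usability thresholds.

Details
English abstract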

Pixel-level slum mapping has long been constrained by limited cross-city generalisation, the absence of continuous density estimation, and weak global comparability. AlphaEarth Foundations (AEF), a globally consistent 64-dimensional annual surface embedding at 10 m, offers a new analysis-ready basis for lightweight slum monitoring, but its applicability to slum detection - an indirectly coupled task shaped by both built form and socio-economic processes - remains untested. We evaluate AEF on slum classification and sub-pixel density estimation across 12 cities and 69 city-year pairs (2017-2024), using GRAM pseudo-masks as supervisory labels. The evaluation spans four training strategies, two protocols (random split and 3x3 spatial block cross-validation), six auxiliary feature configurations, and five baseline models, complemented by representation-level analyses (PCA, SHAP) and full-AOI mapping. Five findings emerge. (1) Same-city cross-year training is optimal under both protocols (median spatial F1 = 0.616, R^2 = 0.466); temporal expansion outperforms cross-city transfer, indicating city-scale representational drift. (2) Regression R^2 is driven primarily by zero/non-zero boundary discrimination: positive-pixel R^2 is consistently negative across all cities, revealing limited capacity to model intra-pixel density gradients at 10 m. (3) PC36 is consistently top-ranked across tasks; classification saturates at k = 32 while regression remains unsaturated at k = 64. (4) POI features yield the largest density gain (Delta R^2 = +0.064). (5) For six cities meeting dual-task usability thresholds, full-AOI inference across 2017-2024 preserves slum cluster structure (mean SSIM = 0.926). The study delineates the capabilities and complementarity needs of foundation-model embeddings for slum monitoring.

2605.10026 2026-05-12 cs.CV updated version

MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving

Xiaohu Lu, Hamed Khatounabadi, Hayder Radha

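Affiliations * Electrical and Computer Engineering, Michigan State University

AI summary As autonomous driving advances, annotated multi-modality datasets are increasingly available, which makes it possible to adapt 3D object detectors to new environments without manual annotation. Traditional domain adaptation methods, however, usually address a single source domain or a single modality and handle multi-source, multi-modality settings poorly. This paper proposes a multi-source, multi-modality unsupervised domain adaptation framework for 3D object detection, using hierarchical spatially-conditioned (HSC) domain classifiers and a prototype graph weighted (PGW) fusion strategy to align features and aggregate predictions across sources and modalities. Experiments on Waymo, nuScenes, and Lyft show that it outperforms state-of-the-art methods.

Details
English abstract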

With the advancement of autonomous driving, numerous annotated multi-modality datasets have become available. This presents an opportunity to develop domain-adaptive 3D object detectors for new environments without relying on labor-intensive manual annotations. However, traditional domain adaptation methods typically focus on a single source domain or a single modality, limiting their effectiveness in multi-source, multi-modality scenarios. In this paper, we propose a novel framework for multi-source, multi-modality unsupervised domain adaptation in 3D object detection for autonomous driving. Given multiple labeled source domains and one unlabeled target domain, our framework first introduces hierarchical spatially-conditioned (HSC) domain classifiers, which jointly align features from both camera and LiDAR modalities at two distinct levels for each source-target domain pair. To effectively leverage information from multiple source domains, we construct a prototype graph between each pair of domains. Based on this, we develop a prototype graph weighted (PGW) multi-source fusion strategy to aggregate predictions from multiple source detection heads. Experimental results on three widely used 3D object detection datasets - Waymo, nuScenes, and Lyft - demonstrate that our proposed framework effectively integrates information across both modalities and source domains, consistently outperforming state-of-the-art methods.

2605.10009 2026-05-12 cs.CV Updated version

Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation

Yujia Cai, Boxuan Li, Chenghao Xu, Jiexi Yan

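Affiliations: School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China; School of Electronic Engineering, Xidian University, Xi’an, Shaanxi, China

AI summary: This paper proposes Hystar, a lightweight framework that addresses the distribution shift caused by diverse query styles in query-based image retrieval (QBIR). A hypernetwork dynamically generates singular-value perturbations for attention layers, adapting the model to each query's style, while static singular-value offsets keep behavior stable across styles. Hystar also introduces StyleNCE, an optimal-transport-weighted contrastive loss that strengthens cross-style semantic discrimination. Experiments show that the method outperforms existing approaches on multi-style retrieval and cross-style classification while remaining parameter-efficient and stable across styles.

Comments Accepted by ICLR2026

English abstract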

Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision-language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose the Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query's style. Hystar employs a hypernetwork to generate singular-value perturbations ($\Delta S$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusions across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
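
The singular-value modulation can be written out explicitly. The notation below is assumed for illustration and is not taken from the paper: $W$ is a pretrained weight matrix with singular value decomposition $U\,\mathrm{diag}(S)\,V^{\top}$, $x$ is the query, $h_{\phi}$ is the hypernetwork, and $\delta$ is a learned, input-independent offset.

```latex
% Attention layers: per-query perturbation of the singular values
W = U \,\mathrm{diag}(S)\, V^{\top}, \qquad
W_{\text{attn}}'(x) = U \,\mathrm{diag}\bigl(S + \Delta S(x)\bigr)\, V^{\top}, \qquad
\Delta S(x) = h_{\phi}(x)

% MLP layers: static offset shared by all queries
W_{\text{mlp}}' = U \,\mathrm{diag}(S + \delta)\, V^{\top}
```

Under this reading, $U$ and $V$ stay fixed and only a vector the length of $S$ changes per layer, which would be consistent with the parameter efficiency the abstract reports.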

2605.10008 2026-05-12 physics.optics cs.CV cs.ET Updated version

Measurement-Adapted Eigentask Representations for Photon-Limited Optical Readout

Tianyang Chen, Mandar M. Sohoni, Saeed A. Khan, Jérémie Laydevant, Shi-Yuan Ma, Tianyu Wang, Peter L. McMahon, Hakan E. Türeci

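Affiliations: Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA; School of Applied and Engineering Physics, Cornell University, Ithaca, NY 14853, USA; USRA Research Institute for Advanced Computer Science, Mountain View, CA 94035, USA; Kavli Institute at Cornell for Nanoscale Science, Cornell University, Ithaca, NY 14853, USA

AI summary: In low light, optical readout is limited by photon shot noise, detector noise, and quantization error, which degrades downstream classification and decision-making. This paper proposes eigentask representations, which order readout features by their resolvability under noise, as a noise-adapted representation of optical sensor outputs. Experiments show that the approach outperforms conventional methods such as principal component analysis when the photon budget is limited, samples are scarce, and the task is difficult, improving classification performance and learning efficiency.

Comments 15+14 pages, 4+9 figures, 55 references

English abstract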

Optical readout in low-light imaging is fundamentally limited by measurement noise, including photon shot noise, detector noise, and quantization error. In this regime, downstream inference depends not only on the optical front end, but also on how noisy high-dimensional sensor measurements are represented before classification or decision-making. Here we show that eigentasks provide a measurement-adapted representation for optical sensor outputs by ordering readout features according to their resolvability under noise. Using experimental data from a lens-based optical imaging system and a reanalysis of published data from a single-photon-detection neural network, we find that eigentask representations frequently outperform standard baselines including principal component analysis and filtering-based compression. The advantage is most pronounced in photon-limited, few-shot, and higher-difficulty classification regimes. In few-shot MPEG-7 classification, for example, the advantage over other methods reaches about 10 percentage points as the number of classes increases. In these settings, eigentasks yield more informative low-dimensional features and improve sample-efficient downstream learning. These results identify measurement-adapted representation as a promising strategy for optical inference when photon budget, acquisition time, and task complexity are constrained.

2605.10002 2026-05-12 cs.CV Updated version

Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models

Minh Khoi Nguyen, Dai Lam Le, Amir Reza Jafari, Tuan Dung Nguyen, Mai Hong Son, Mai Huy Thong, Quang Huy Nguyen, Thanh Trung Nguyen, Reza Farahbakhsh, Noel Crespi, Phi Le Nguyen

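Affiliations: AI4LIFE, Hanoi University of Science and Technology, Vietnam; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France; Military Central Hospital, Vietnam

AI summary: This study introduces Med-StepBench, the first large-scale benchmark for evaluating the step-wise reasoning of medical vision-language models on 3D PET/CT imaging, aimed at detecting hallucinations in which a model produces clinically plausible but incorrect statements. The framework decomposes clinical reasoning into four diagnostic stages and, using more than 12,000 images and over 1,000,000 image-statement pairs, exposes systematic failures of existing models in multi-step reasoning. The study also shows that current models are highly susceptible to plausible but misleading intermediate explanations, which further amplify hallucinations, and it provides a basis for building safer and more reliable medical vision-language models.

Comments Accepted at IJCAI-ECAI 2026

English abstract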

Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.

2605.09996 2026-05-12 cs.CV Updated version

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

Yeongtak Oh, Dongwook Lee, Sangkwon Park, Heeseung Kim, Sungroh Yoon

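Affiliations: Department of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University; Department of Artificial Intelligence, University of Seoul

AI summary: This paper presents Omni-Persona, the first comprehensive benchmark for omnimodal personalization, designed to systematically evaluate and improve joint personalization across text, image, and audio. The benchmark formalizes the task over a Persona Modality Graph, covers 4 task groups and 18 fine-grained tasks, and introduces the Calibrated Accuracy (Cal) metric, which jointly measures correct grounding and appropriate abstention. Experiments reveal an audio-versus-visual grounding gap in open-source models, show that parameter scale and recall are incomplete diagnostics, and highlight the distinct limitations of supervised fine-tuning and reward-based reinforcement learning for personalization.

Comments Project Page: https://github.com/oyt9306/Omni-Persona

English abstract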

While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language. Unified omnimodal benchmarking that jointly covers text, image, and audio is still limited, and it lacks the methodological rigor to account for absent-persona scenarios or to support systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across about 750 items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy (Cal), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. In our experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher Cal, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
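
The abstract does not give the formula for Calibrated Accuracy, so the sketch below is a deliberately simplified stand-in and not the paper's definition. It assumes an answerable query counts as correct on an exact match and an absent-persona query counts as correct only when the model abstains. The `ABSTAIN` marker, the item format, and the function name are hypothetical.

```python
ABSTAIN = "<abstain>"  # hypothetical marker meaning "this persona is not present"

def calibrated_accuracy(items):
    """Fraction of items handled correctly under the simplified rule.

    Each item is (gold, prediction); gold is None for an absent-persona query.
    Answerable items need an exact match; absent-persona items need abstention.
    """
    correct = 0
    for gold, pred in items:
        if gold is None:
            correct += pred == ABSTAIN
        else:
            correct += pred == gold
    return correct / len(items)

items = [
    ("alice", "alice"),   # answerable, grounded correctly
    ("bob", "bob"),       # answerable, grounded correctly
    (None, ABSTAIN),      # absent persona, model abstains
    (None, "alice"),      # absent persona, model hallucinates a match
]

cal = calibrated_accuracy(items)

# Recall over answerable items only, which ignores absent-persona behavior.
answerable = [(g, p) for g, p in items if g is not None]
recall = sum(p == g for g, p in answerable) / len(answerable)
```

In this toy set, recall on answerable items is perfect while the combined score is lower, which is the kind of gap finding (ii) describes.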

2605.09984 2026-05-12 cs.CV cs.AI cs.LG Updated version

Geometric 4D Stitching for Grounded 4D Generation

Sunwoo Park, Taesung Kwon, Jong Chul Ye

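Affiliations: KAIST AI

AI summary: This paper proposes Geometric 4D Stitching, an efficient framework that addresses the geometric inconsistency and high reconstruction cost of existing 4D scene generation methods. The method explicitly identifies missing geometric regions and fills them with geometrically grounded 4D stitches, improving geometric consistency while making 4D scene construction much more efficient. It also supports iterative expansion of 4D meshes and 4D scene editing.

English abstract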

Recent 4D generation methods complete scene-level missing information using generative models and reconstruct the scene into radiance-based representations. However, these pipelines often present geometric inconsistencies in the generated content, and the radiance-based reconstruction requires expensive optimization. Furthermore, radiance-based representations often absorb these geometric inconsistencies into their view-dependent nature, failing to enforce grounded geometric consistency. To address these issues, we propose Geometric 4D Stitching, an efficient framework that explicitly identifies missing geometric regions and complements them with geometrically grounded 4D stitches. As a result, our method constructs 4D scene representations in under 10 minutes on a single NVIDIA RTX 5090 GPU per one-step scene expansion, while improving geometric consistency. Moreover, we demonstrate that our explicit 4D stitching supports iterative expansion of 4D meshes as well as 4D scene editing.

2605.09982 2026-05-12 cs.CV Updated version

ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

Yuna Lee, Kyoungho Min, Yulhwa Kim

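Affiliations: Department of Electrical and Computer Engineering, Sungkyunkwan University, Republic of Korea; Department of Semiconductor Systems Engineering, Sungkyunkwan University, Republic of Korea

AI summary: This paper proposes ERASE, a two-stage vision token pruning framework that targets the computational burden created by the large number of vision tokens produced when vision-language models process high-resolution images. Using pruning strategies that adapt to the complexity of the input image, it identifies and retains salient vision tokens, substantially reducing the token count while preserving accuracy. On Qwen2.5-VL-7B at a pruning ratio of 85%, ERASE retains 89.46% of the original accuracy, ahead of the best prior method.

Comments 20 pages, 8 figures

English abstract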

Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.
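
To make the meaning of a token pruning ratio concrete, here is a minimal sketch of generic fixed-ratio top-k pruning. It does not reproduce ERASE's two stages or its adaptation to image complexity. The saliency scores are synthetic and the function name is hypothetical.

```python
def prune_tokens(scores, pruning_ratio):
    """Return the sorted indices of the tokens kept after pruning.

    scores: one saliency score per vision token (higher means more salient).
    pruning_ratio: fraction of tokens to drop, for example 0.85.
    """
    n_keep = max(1, round(len(scores) * (1.0 - pruning_ratio)))
    # Rank token indices from most to least salient.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order for the survivors.
    return sorted(ranked[:n_keep])

# Deterministic toy scores: a permutation of 0.00, 0.01, ..., 0.99.
scores = [((i * 37) % 100) / 100 for i in range(100)]
kept = prune_tokens(scores, pruning_ratio=0.85)
```

At a ratio of 0.85, 15 of the 100 toy tokens survive. An adaptive method would vary the ratio or the selection rule per image instead of fixing it in advance.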

2605.09977 2026-05-12 cs.CV Updated version

INFANiTE: Implicit Neural representation for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI

Xiaotian Hu, Mingxuan Liu, Hongjia Yang, Juncheng Zhu, Yijin Li, Yifei Chen, Haoxiang Li, Tongxi Song, Zihan Li, Yingqi Hao, Ziyu Li, Yujin Zhang, Gang Ning, Yi Liao, Haibo Qu, Qiyuan Tian

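Affiliations: Beihang University; Tsinghua University; Sichuan University; University of Oxford

AI summary: This study proposes INFANiTE, an implicit neural representation framework that efficiently learns a high-resolution spatio-temporal fetal brain atlas from clinical thick-slice MRI scans, removing the time-consuming slice-to-volume reconstruction and iterative registration steps of traditional pipelines. The method substantially accelerates atlas construction, and experiments show that it maintains high accuracy and biological plausibility even with sparse data, offering a practical route to large-scale analysis of fetal brain development.

English abstract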

Spatio-temporal fetal brain atlases are important for characterizing normative neurodevelopment and identifying congenital anomalies. However, existing atlas construction pipelines necessitate days for slice-to-volume reconstruction (SVR) to generate high-resolution 3D brain volumes and several additional days for iterative volume registration, thereby rendering atlas construction from large-scale cohorts prohibitively impractical. We address these limitations with INFANiTE, an Implicit Neural Representation (INR) framework for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI scans, bypassing both the costly SVR and the iterative non-rigid registration steps entirely, thereby substantially accelerating atlas construction. Extensive experiments demonstrate that INFANiTE outperforms existing baselines in subject consistency, reference fidelity, intrinsic quality and biological plausibility, even under challenging sparse-data settings. Additionally, INFANiTE reduces the end-to-end processing time (i.e., from raw scans to the final atlas) from days to hours compared to the traditional 3D volume-based pipeline (e.g., SyGN), facilitating large-scale population-level fetal brain analysis. Our code is publicly available at: https://anonymous.4open.science/r/INFANiTE-5D74

2605.09976 2026-05-12 cs.CV Updated version

OZ-TAL: Online Zero-Shot Temporal Action Localization

Chaolei Han, Hongsong Wang, Xin Gong, Jie Gui

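Affiliations: School of Cyber Science and Engineering, Southeast University; Engineering Research Center of Blockchain Application, Supervision and Management (Southeast University), Ministry of Education; Purple Mountain Laboratories, Nanjing; School of Computer Science and Engineering, Southeast University

AI summary: This paper introduces a new task, Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen action categories and when they occur while a video stream is being processed. To address the limited generalization of existing methods to arbitrary videos, the authors design a training-free framework that uses off-the-shelf vision-language models and adds mechanisms to enhance visual representations and reduce their biases. Experiments on THUMOS14 and ActivityNet-1.3 show that the method substantially outperforms existing state-of-the-art approaches and establish new benchmarks and baselines.

English abstract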

Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from the Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.

2605.09972 2026-05-12 cs.RO cs.CV Updated version

HiDrive: A Closed-Loop Benchmark for High-Level Autonomous Driving

Zhongyu Xia, Guanyu Zhu, Guo Tang, Wenhao Chen, Yongtao Wang

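Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI summary: HiDrive is a new closed-loop autonomous driving benchmark that addresses the limits of existing benchmarks in scenario diversity, object variety, and the range of driving capabilities evaluated. It emphasizes long-tail scenarios, introduces a variety of rare objects and uncommon traffic situations, and extends evaluation to advanced capabilities such as rule compliance, moral reasoning, and emergency decision-making. Built on a more advanced physics engine, HiDrive provides physically realistic lighting and high-fidelity rendering, offering a more challenging testbed for how autonomous driving systems perform in complex real-world conditions.

English abstract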

End-to-end autonomous driving has witnessed rapid progress, yet existing benchmarks are increasingly saturated, with state-of-the-art models achieving near-perfect scores on widely used open-loop and closed-loop benchmarks. This saturation does not mean that the problem has been solved; instead, it reveals that current benchmarks remain limited in scenario diversity, object variety, and the breadth of driving capabilities they evaluate. In particular, they lack sufficient long-tail scenarios involving rare but safety-critical objects and fail to assess advanced decision-making such as legal compliance, ethical reasoning, and emergency response. To address these gaps, we propose HiDrive, a new closed-loop benchmark for end-to-end autonomous driving that emphasizes long-tail scenarios and a richer evaluation of driving capabilities. HiDrive introduces a diverse set of rare objects and uncommon traffic situations, and expands evaluation from basic driving skills to more advanced capabilities, including rule compliance, moral reasoning, and context-dependent emergency maneuvers. Correspondingly, we extend previous collision-avoidance-centered metrics into a comprehensive evaluation system that encompasses collision and braking, traffic-rule compliance, and moral-reasoning indicators. Built on a more advanced physics engine, HiDrive provides physically realistic lighting and high-fidelity visual rendering, offering a more challenging and realistic testbed for assessing whether autonomous driving systems can handle the complexity of real-world deployment. The HiDrive software, source code, digital assets, and documentation are available at https://github.com/VDIGPKU/HiDrive.

2605.09963 2026-05-12 cs.CV Updated version

Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

Yang Shen, Yusen Cai, Weronika Hryniewska-Guzik, Qing Lin, Mengmi Zhang

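Affiliations: Nanyang Technological University, Singapore; Warsaw University of Technology, Poland

AI summary: Existing self-supervised learning methods mainly learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this, the paper proposes Spatial Prediction (SP), a spatially aware pretext task that learns fine-grained spatial dependencies by predicting the relative position and scale between two disentangled local views of the same image. Experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, along with better robustness in out-of-distribution settings.

English abstract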

Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.
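
One way to picture the regression targets for a pretext task of this kind is sketched below. The parameterization is an assumption made for illustration (centre offsets normalized by the first view's size, plus a log scale ratio, in the style of box regression in object detection). The abstract does not specify the paper's actual targets, and the function name is hypothetical.

```python
import math

def relative_targets(box_a, box_b):
    """Regression targets describing view B relative to view A.

    Boxes are (cx, cy, w, h) in normalized image coordinates.
    Returns (dx, dy, ds): centre offsets in units of A's width and height,
    and the log of the linear scale ratio between B and A.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    dx = (bx - ax) / aw
    dy = (by - ay) / ah
    # Half the log area ratio equals the log of the linear scale ratio.
    ds = 0.5 * math.log((bw * bh) / (aw * ah))
    return dx, dy, ds

view_a = (0.25, 0.25, 0.5, 0.5)    # large crop near the top-left
view_b = (0.75, 0.5, 0.25, 0.25)   # smaller crop to the right and lower
dx, dy, ds = relative_targets(view_a, view_b)
```

Because the targets are continuous, a network trained on them has to encode where one part sits relative to another, rather than only which category the image belongs to.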

2605.09956 2026-05-12 cs.CV cs.AI Updated version

SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis

Peng Jia, Zhen Xiao, Jia Li, Xueliang Liu, Zhenzhen Hu, Lingyun Yu

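Affiliations: Hefei University of Technology; University of Science and Technology of China

AI summary: This paper proposes SDTalk, a one-shot 3D Gaussian Splatting (3DGS) framework for high-quality, real-time talking head synthesis that generalizes to unseen identities without personalized training. Structured facial priors improve the completeness of head reconstruction and a dual-branch motion field improves the detail of facial dynamics, so the method surpasses existing approaches in both visual quality and inference efficiency.

Comments 5 pages, 4 figures, 4 tables

English abstract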

High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.

2605.09954 2026-05-12 cs.RO cs.CV Updated version

JODA: Composable Joint Dynamics for Articulated Objects

Tianhong Gao, Cheng Yu, Yinghao Xu, Mengyu Chu

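Affiliations: Peking University; Ant Group, Robbyant

AI summary: This paper proposes JODA, a composable framework for generating joint-level dynamics that captures fine-grained mechanical behaviors such as frictional holding, snap latching, and soft closing. JODA describes conservative forces, dry friction, and damping as a structured three-channel field over the joint degree of freedom and, using shape-constrained piecewise cubic interpolation, yields a dynamics model that is expressive and compatible with differentiable simulation. The method supports inferring and refining joint dynamics from multimodal inputs, providing a unified interface for modeling, editing, and optimizing complex mechanical systems.

English abstract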

Articulated objects used in simulation and embodied AI are typically specified by geometry and kinematic structure, but lack the fine-grained dynamical effects that govern realistic mechanical behavior, such as frictional holding, detents, soft closing, and snap latching. Existing approaches either ignore the detailed structure of dynamics entirely, or use simple models with limited expressiveness. We introduce JODA, a framework for generating joint-level dynamics as a structured three-channel field over the joint degree of freedom, capturing conservative forces, dry friction, and damping. Instantiated using shape-constrained piecewise cubic interpolation (PCHIP), this formulation defines a compact and expressive function space that is both interpretable and compatible with differentiable simulation. Building on this representation, we develop methods for inferring and refining joint dynamics from multimodal inputs. Given visual observations and joint context, a vision-language model proposes structured dynamical primitives, which are composed into a unified dynamics field. The resulting representation supports both direct manipulation and gradient-based refinement. We demonstrate that JODA enables plausible and controllable modeling of diverse joint behaviors, providing a unified interface for inference, editing, and optimization. Code and example assets with their generated profiles will be released upon publication.

2605.09948 2026-05-12 cs.AI cs.CV cs.RO Updated version

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

Boyang Shen, Kaixiang Yang, Hao Wang, Qiuyu Yu, Qiang Xie, Qiang Li, Zhiwei Wang

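Affiliations: Huazhong University of Science and Technology; Wuhan United Imaging Surgical Co., Ltd. (UIS)

AI summary: Current Vision-Language-Action (VLA) models usually treat the deepest representation of the vision-language backbone as the best input for action prediction, yet robotic manipulation involves frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken the low-level geometric cues needed for precise control. This paper proposes LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. A shared Transformer block iteratively refines multimodal features and, at each step, produces a candidate action and a sufficiency score that decides whether further refinement is needed. Experiments show that LoopVLA improves efficiency while maintaining task success, reducing parameters by 45% and raising inference throughput by up to 1.7 times.

English abstract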

Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
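
The control flow of recurrent refinement with a sufficiency check can be sketched in a few lines. This is a toy illustration of the inference loop only. The threshold, the iteration cap, and the stand-in functions are assumptions, and the self-supervised distribution alignment objective used for training is not shown.

```python
def refine_until_sufficient(tokens, block, action_head, sufficiency_head,
                            threshold=0.9, max_iters=8):
    """Apply one shared block repeatedly until the sufficiency score is high enough.

    Returns (action, iterations_used).
    """
    action, step = None, 0
    for step in range(1, max_iters + 1):
        tokens = block(tokens)            # same parameters at every iteration
        action = action_head(tokens)      # candidate action at this depth
        if sufficiency_head(tokens) >= threshold:
            break                         # representation judged sufficient
    return action, step

# Toy stand-ins: the "representation" is one number that converges to 1.0.
block = lambda x: x + 0.5 * (1.0 - x)
action_head = lambda x: x
sufficiency_head = lambda x: x   # pretend confidence equals closeness to 1.0

action, steps = refine_until_sufficient(0.0, block, action_head, sufficiency_head)
```

Because the block is shared, the stopping decision depends on the current representation and not on a fixed layer index, which is the decoupling the abstract describes.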

2605.09936 2026-05-12 cs.CV cs.IR cs.LG Updated version

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini

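Affiliations: University of Auckland; University of Pennsylvania; Stanford University; Harbin Institute of Technology, Shenzhen

AI summary: This paper presents Urban-ImageNet, a large-scale multi-modal dataset and evaluation framework for perceiving urban space from social media imagery. The dataset contains over 2 million public images and paired text posts from Weibo, covering 61 urban sites in 24 Chinese cities, and supports training and evaluation at scales from 1K to 2M. Built on a hierarchical taxonomy grounded in urban theory, Urban-ImageNet supports three tasks (urban scene semantic classification, cross-modal image-text retrieval, and instance segmentation) and is intended to evaluate how well AI models understand the social, functional, and spatial characteristics of urban space.

English abstract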

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities from 2019 to 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.

2605.09925 2026-05-12 cs.CV Updated version

Frequency Adapter with SAM for Generalized Medical Image Segmentation

Phuoc-Nguyen Bui, Van-Nguyen Pham, Duc-Tai Le, Junghyun Bum, Hyunseung Choo

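Affiliations: Sungkyunkwan University, Korea

AI summary: Medical image segmentation is important for computer-aided diagnosis and treatment planning, but deep learning models often fail to generalize across datasets because of differences in imaging protocols, scanners, and patient populations. This paper proposes FSAM, a frequency-domain adaptation approach to generalizable medical image segmentation that combines Low-Rank Adaptation (LoRA) with a frequency adapter to extract domain-invariant high-frequency features and improve generalization from a single source domain. Experiments on fundus and prostate datasets show that it outperforms both traditional domain generalization methods and SAM-based ones.

Comments Under review, 10 pages, 1 figure, 2 tables

English abstract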

Medical image segmentation is a critical task in computer-aided diagnosis and treatment planning. However, deep learning models often struggle to generalize across datasets due to domain shifts arising from variations in imaging protocols, scanner types, and patient populations. Traditional domain generalization (DG) methods utilize causal feature learning, adversarial consistency, and style augmentation to improve segmentation robustness. While effective, these approaches rely on explicit feature alignment, adversarial objectives, or handcrafted augmentations, which may not fully exploit the capabilities of foundation models. Recently, the Segment Anything Model (SAM) has demonstrated strong generalization capabilities in segmentation tasks, and SAM-based DG methods attempt to carry these over to medical image segmentation. However, these approaches primarily operate in the spatial domain and overlook frequency-based discrepancies that significantly affect model robustness. In this work, we propose Frequency-based Domain Generalization with SAM (FSAM), a novel framework that integrates Low-Rank Adaptation (LoRA) for efficient fine-tuning and a frequency adapter to incorporate frequency-domain representations for single-source domain generalization. FSAM enhances SAM's segmentation robustness by extracting domain-invariant high-frequency features, mitigating frequency-related domain shifts. Experimental results on fundus and prostate datasets demonstrate that FSAM outperforms existing traditional DG and SAM-based DG approaches in domain generalization. Code and pre-trained models will be made available on GitHub.

2605.09902 2026-05-12 cs.CV Updated version

Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment

Haobo Wang, Xiaorong Ma, Weiqi Luo, Xiaojun Jia, Jiwu Huang

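Affiliations: Sun Yat-sen University; Nanyang Technological University; Shenzhen MSU-BIT University

AI summary: This study examines the security of multimodal large language models (MLLMs) and proposes PRAF-Attack, a new targeted transfer-based attack that uses adversarial examples to mislead a model's interpretation of image content. The method introduces progressive resolution processing and adaptive feature alignment, uses intermediate-layer features to strengthen transferability and robustness, and selects transferable hierarchical features through gradient consistency. Experiments show that PRAF-Attack transfers better than existing methods across a range of black-box MLLMs.

English abstract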

Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) into recognizing a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder's final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.

2605.09900 2026-05-12 cs.AI cs.CL cs.CV Version update

The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

Hao Liu, Jicheng Liu

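Affiliations * Department of Psychology; New York University; Department of Computer Science; University of Southern California

AI summary This paper introduces KnotBench, a new benchmark for evaluating how well vision-language models handle knot-diagram tasks. Using a large corpus of knot images paired with canonical signatures, the study designs 14 tasks covering equivalence judgment, move prediction, identification, and cross-modal grounding, and exposes the gap between perception and operation in current models. Experiments show that even state-of-the-art models such as Claude Opus 4.7 and GPT-5 score near a random baseline on many tasks; thinking mode helps somewhat, but the models still struggle to simulate knot moves accurately.

Comments 41 pages, 18 figures

Details
English abstract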

A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families, equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.

2605.09899 2026-05-12 cs.CV cs.AI Version update

Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

Kanglin Ning, Wenrui Li, Houde Quan, Qifan Li, Xingtao Wang, Xiaopeng Fan

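Affiliations * Faculty of Computing, Harbin Institute of Technology; Suzhou Research Institute of HIT; PengChengLab

AI summary This paper proposes HGC-Det, a cross-modal knowledge distillation method with hyperbolic geometry constraints, to improve multimodal 3D object detection. The method extracts semantic features with an image branch and a point cloud branch and introduces three core components: semantic-guided voxel optimization, hyperbolic-geometry-constrained cross-modal feature transfer, and feature-aggregation-based geometry optimization. Together these alleviate modality heterogeneity, spatial misalignment, and the representation crisis. Experiments show that the method achieves a good trade-off between detection accuracy and computational cost on indoor and outdoor datasets.

Comments The current version has been submitted to IEEE Transactions on Multimedia and is under review.

Details
English abstract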

Cross-modal knowledge distillation has emerged as an effective strategy for integrating point cloud and image features in 3D perception tasks. However, the modality heterogeneity, spatial misalignment, and the representation crisis of multiple modalities often limit the efficiency of these cross-modal distillation methods. To address these limitations in existing approaches, we propose a hyperbolic constrained cross-modal distillation method for multimodal 3D object detection (HGC-Det). The proposed HGC-Det framework includes an image branch and a point cloud branch to extract semantic features from two different modalities. The point cloud branch comprises three core components: a 2D semantic-guided voxel optimization component (SGVO), a hyperbolic geometry constrained cross-modal feature transfer component (HFT), and a feature aggregation-based geometry optimization component (FAGO). Specifically, the SGVO component adaptively refines the spatial representation of the 3D branch by leveraging semantic cues from the image branch, thereby mitigating the issue of inadequate representation fusion. The HFT component exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features. Finally, the FAGO compensates for potential spatial feature degradation introduced by the 2D semantic-guided voxel optimization component. Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) demonstrate that our method achieves a better trade-off between detection accuracy and computational cost.

2605.09874 2026-05-12 cs.CV cs.AI cs.CL Version update

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

Ziyang Wang, Yue Zhang, Shoubin Yu, Ce Zhang, Zengqi Zhao, Jaehong Yoon, Hyunji Lee, Gedas Bertasius, Mohit Bansal

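Affiliations * UNC Chapel Hill; NTU Singapore

AI summary EgoMemReason is a memory-driven reasoning benchmark for long-horizon egocentric video understanding, designed to evaluate a model's ability to accumulate, recall, and reason over visual information spanning multiple consecutive days. The benchmark introduces three complementary memory types (entity memory, event memory, and behavior memory) to assess recognition of object state changes, activity order, and long-term behavior patterns. Experiments show that the best current model reaches only 39.6% overall accuracy on the benchmark, indicating that long-horizon memory reasoning remains a major challenge.

Comments The first two authors contributed equally. Project website: https://egomemreason.github.io/

Details
English abstract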

Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week period. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate EgoMemReason on 17 methods across MLLMs and agentic frameworks, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.

2605.09864 2026-05-12 cs.CV cs.LG Version update

DA-SegFormer: Damage-Aware Semantic Segmentation for Fine-Grained Disaster Assessment

Kevin Zhu, William Tang, Raphael Hay Tene, Zesheng Liu, Nhut Le, Maryam Rahnemoonfar

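Affiliations * Bina Labs, Lehigh University

AI summary This paper proposes DA-SegFormer, a semantic segmentation method for fine-grained disaster assessment that targets the difficulty of recognizing subtle damage in UAV imagery caused by texture degradation and class imbalance. Built on the SegFormer architecture, it introduces a class-aware sampling strategy and combines online hard example mining with Dice loss to strengthen learning of rare damage features, and it uses a resolution-preserving inference protocol to retain native texture detail. Experiments show that DA-SegFormer reaches 74.61% mIoU on the RescueNet dataset, outperforming the baseline, with marked gains on critical damage classes.

Comments Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

Details
English abstract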

Rapid and accurate damage assessment following natural disasters is critical for effective emergency response. However, identifying fine-grained damage levels (e.g., distinguishing minor from major roof damage) in UAV imagery remains challenging due to the degradation of texture cues during resizing and extreme class imbalance. We propose DA-SegFormer, a damage-aware adaptation of the SegFormer architecture optimized for high-resolution disaster imagery. Our method introduces a Class-Aware Sampling strategy to guarantee exposure to rare damage features, and it integrates Online Hard Example Mining (OHEM) with Dice Loss to dynamically focus on underrepresented classes. In addition, we employ a resolution-preserving inference protocol that maintains native texture details. Evaluated on the RescueNet dataset, DA-SegFormer achieves 74.61% mIoU, outperforming the baseline by 2.55%. Notably, our improvements yield double-digit gains in critical damage classes: Minor Damage (+11.7%) and Major Damage (+21.3%).
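
The combination of OHEM and Dice loss named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the `keep_ratio` and `dice_weight` parameters, and the choice to apply OHEM to per-pixel cross-entropy are all assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def ohem_cross_entropy(probs, labels, keep_ratio=0.25):
    """Mean cross-entropy over only the hardest `keep_ratio` fraction of pixels."""
    losses = [-math.log(p[y] + 1e-12) for p, y in zip(probs, labels)]
    k = max(1, math.ceil(keep_ratio * len(losses)))
    hardest = sorted(losses, reverse=True)[:k]
    return sum(hardest) / k

def dice_loss(probs, labels, num_classes, eps=1e-6):
    """Soft Dice loss averaged over classes; each class counts equally."""
    total = 0.0
    for c in range(num_classes):
        inter = sum(p[c] for p, y in zip(probs, labels) if y == c)
        denom = sum(p[c] for p in probs) + sum(1 for y in labels if y == c)
        total += 1.0 - (2.0 * inter + eps) / (denom + eps)
    return total / num_classes

def combined_loss(logits, labels, num_classes, keep_ratio=0.25, dice_weight=1.0):
    """Hypothetical combination: OHEM cross-entropy plus weighted Dice loss."""
    probs = [softmax(row) for row in logits]
    ce = ohem_cross_entropy(probs, labels, keep_ratio)
    return ce + dice_weight * dice_loss(probs, labels, num_classes)

labels = [0, 1, 2, 1]
good_logits = [[10, 0, 0], [0, 10, 0], [0, 0, 10], [0, 10, 0]]
bad_logits = [[0, 10, 0], [10, 0, 0], [10, 0, 0], [0, 0, 10]]
good = combined_loss(good_logits, labels, num_classes=3)
bad = combined_loss(bad_logits, labels, num_classes=3)
```

Because Dice loss averages over classes rather than pixels, a rare class contributes as much as a frequent one, while OHEM restricts the cross-entropy term to the pixels the model currently gets most wrong.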

2605.09859 2026-05-12 cs.CV Version update

Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval

Shijie Wang, Yadan Luo, Zijian Wang, Xin Yu, Zi Huang

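Affiliations * The University of Queensland, Australia; The University of Adelaide, Australia

AI summary This paper studies how to improve retrieval performance on unseen categories in fine-grained image retrieval and proposes GAPan, a new method based on aligning generative appearance priors. The method uses an invertible density model to reformulate the learning objective from category prediction to appearance modeling: a normalizing flow maps features into a latent density space, which is optimized with class-conditional Gaussian priors, thereby preserving richer appearance detail. Reverse sampling then generates appearance-aware anchors that guide retrieval embeddings to align with category-specific appearance distributions, which improves generalization to unseen categories.

Details
English abstract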

Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that our GAPan achieves state-of-the-art performance on both widely-used fine- and coarse-grained benchmarks.

2605.09858 2026-05-12 cs.CV Version update

Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking

Riku Inoue, Shogo Sato, Kazuhiko Murasaki, Tomoyasu Shimada, Toshihiko Nishimura, Ryuichi Tanida

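Affiliations * NTT, Inc.

AI summary This paper studies how active learning (AL) can improve annotation efficiency for end-to-end multi-object tracking (MOT) in dynamic environments. To address the mismatch in temporal granularity between existing frame-based AL methods and modern Transformer-based end-to-end trackers, it proposes CUTAL, a clip-based active learning method that scores each clip with uncertainty metrics derived from multi-frame predictions and adds a temporal diversity constraint to select clips that are informative and have low redundancy. Experiments show that CUTAL outperforms existing methods at the same annotation budget and reaches tracking performance close to full supervision with only 50% of the labeled data.

Comments Accepted to 2026 IEEE International Conference on Image Processing (ICIP). Copyright 2026 IEEE. Published in 2026 IEEE International Conference on Image Processing (ICIP), scheduled for 13-17 September 2026 in Tampere, Finland

Details
English abstract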

Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.
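
The idea of picking uncertain clips while enforcing temporal diversity can be sketched as a greedy selection. The abstract does not specify CUTAL's uncertainty metrics or its diversity rule, so everything below is an illustrative assumption: the clip tuple layout, the precomputed `uncertainty` score, and the `min_gap` frame spacing are hypothetical.

```python
def select_clips(clips, budget, min_gap):
    """Greedy clip selection under a labeling budget.

    clips: list of (video_id, start_frame, uncertainty) tuples (hypothetical layout).
    A clip is skipped if an already chosen clip from the same video starts
    fewer than `min_gap` frames away, a simple stand-in for temporal diversity.
    """
    chosen = []
    for vid, start, score in sorted(clips, key=lambda c: c[2], reverse=True):
        if len(chosen) == budget:
            break
        too_close = any(v == vid and abs(s - start) < min_gap for v, s, _ in chosen)
        if not too_close:
            chosen.append((vid, start, score))
    return chosen

clips = [("a", 0, 0.9), ("a", 5, 0.8), ("a", 40, 0.7), ("b", 0, 0.6)]
picked = select_clips(clips, budget=3, min_gap=30)
```

In this toy input the second clip is the second most uncertain but overlaps in time with the first, so the selection moves on to clips that add new temporal coverage.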

2605.09856 2026-05-12 cs.CV cs.AI Version update

MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery

Tao Tang, Hong Liu, Xinshun Wang, Wanruo Zhang

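Affiliations * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, China

AI summary Despite notable recent progress in human mesh recovery, existing methods remain insufficiently robust to occlusion, which often leads to inaccurate pose estimates and motion jitter. This paper proposes MoPO, which introduces a motion prior to improve occluded human mesh recovery. MoPO comprises a motion de-occlusion module and a motion-aware fusion and refinement module: the former predicts the positions of occluded joints from past poses, and the latter combines image features with the predicted poses to estimate human shape and pose, then further refines the final pose through inverse kinematics. This improves the accuracy and temporal consistency of human mesh recovery in occluded scenes.

Comments 35 pages

Details
English abstract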

Although recent studies have made remarkable progress in human mesh recovery, they still exhibit limited robustness to occlusions and often produce inaccurate poses and severe motion jitter due to insufficient spatial features for occluded body parts. Inspired by the rapid advancements in human motion prediction, we observe that, compared with occluded image features, a pose sequence inherently contains a reliable motion prior for estimating occluded body parts. In this paper, we incorporate Motion Prior for Occluded human mesh recovery, called MoPO. Our MoPO mainly consists of two components: 1) The motion de-occlusion module, where we propose a spatial-temporal occlusion detector to detect joint visibility, and then we propose a lightweight motion predictor to complete the occluded body parts by predicting the most plausible joint positions based on past poses. 2) The motion-aware fusion and refinement module, which fuses the completed joint sequence with image features to estimate human shape and initial human pose. Moreover, the completed joint sequence is further used to refine the final human pose through inverse kinematics, which provides the occlusion-free motion prior for regressing human poses. Extensive experiments demonstrate that MoPO achieves state-of-the-art performance on both occlusion-specific and standard benchmarks, significantly enhancing the accuracy and temporal consistency of occluded human mesh recovery. Our code and demo can be found in the supplementary material.

2605.09850 2026-05-12 cs.CV cs.AI Version update

Probing Routing-Conditional Calibration in Attention-Residual Transformers

Wenhao Liang, Lin Yue, Wei Emma Zhang, Miao Xu, Mingyu Guo, Olaf Maennel, Weitong Chen

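Affiliations * Adelaide University; Australian Institute for Machine Learning (AIML), Adelaide University; The University of Queensland

AI summary This paper studies how routing information affects model calibration in Attention-Residual transformers. Using matched-confidence diagnostic experiments, the authors find that routing summaries do not provide stable evidence of routing-conditional miscalibration, and that a calibration probe based on routing depth does not outperform confidence-only controls on several evaluation metrics. The experiments indicate that apparent gains from routing-aware calibration may be caused by other factors; only after controlling for matched confidence, bandwidth, model capacity, and permutation can they be read as genuine internal-state calibration gains.

Comments Under review

Details
English abstract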

Post-hoc calibration is usually evaluated as a function of logits or softmax confidence alone, even as routing-augmented architectures increasingly accompany predictions with sample-specific internal routing traces and pair them with claims of calibration-relevant uncertainty. We ask a basic question: do these traces provide stable routing-specific evidence for post-hoc calibration beyond confidence? We study this in Attention-Residual transformers (Kimi Team, 2026) through a matched-confidence diagnostic suite that stratifies examples by routing-derived state, compares subgroup gaps against within-bin routing-permutation nulls, and evaluates matched post-hoc probes differing only in their auxiliary feature. Across our completed AR runs, scalar routing summaries do not provide stable evidence of routing-conditional miscalibration: weighted gaps remain small or seed-sensitive, and only $1$ of $30$ within-bin permutation tests rejects the conditional-null at $α=0.05$ (only on one seed; not stable across seeds in that cell). AR-CondCal, a minimal $2$-D Nadaraya--Watson probe on confidence and routing-depth variance, lies within the seed-variance band of matched confidence-only and predictive-entropy controls and does not reliably improve worst-routing-tertile ECE; bandwidth-sensitivity checks (Scott multiples, CV-NLL, global-ECE oracle) do not change this. A full-vector MLP over $(c, H_1, \ldots, H_L)$ can appear to improve over a linear confidence baseline, but the apparent gain disappears once a capacity-matched confidence-only MLP is included as a control, and shuffled routing profiles achieve comparable performance. Apparent routing-aware calibration gains in this AR setting should not be read as internal-state calibration until matched-confidence, bandwidth, capacity, and permutation controls rule out common confounds.
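
A two-dimensional Nadaraya-Watson probe of the kind the abstract describes can be sketched as kernel regression of correctness on (confidence, routing-depth variance). The Gaussian product kernel, the function names, and the fallback to raw confidence are assumptions for illustration; the abstract does not state the kernel or the implementation details of AR-CondCal.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the normalizer cancels in the ratio below."""
    return math.exp(-0.5 * u * u)

def nw_calibrate(query, train_points, train_correct, bandwidths):
    """Estimate P(correct | confidence, routing feature) by kernel regression.

    query: (confidence, routing_feature) of the example to calibrate.
    train_points: list of (confidence, routing_feature) from a held-out set.
    train_correct: list of 0/1 correctness indicators for those points.
    bandwidths: (h_conf, h_route), one bandwidth per dimension.
    """
    qc, qv = query
    hc, hv = bandwidths
    num = den = 0.0
    for (c, v), y in zip(train_points, train_correct):
        w = gaussian_kernel((qc - c) / hc) * gaussian_kernel((qv - v) / hv)
        num += w * y
        den += w
    # Fall back to the raw confidence if all kernel weights underflow to zero.
    return num / den if den > 0 else qc

points = [(0.6, 0.1), (0.8, 0.1)]
correct = [0, 1]
midpoint = nw_calibrate((0.7, 0.1), points, correct, bandwidths=(0.1, 0.1))
```

Dropping the second coordinate (or making its bandwidth very large) reduces this to a confidence-only smoother, which is the kind of matched control the abstract compares against.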

2605.09830 2026-05-12 cs.IR cs.CV Version update

Loom: Hybrid Retrieval-Scoring Outfit Recommendation with Semantic Material Compatibility and Occasion-Aware Embedding Priors

Anushree Berlia

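AI summary Loom is an outfit recommendation system that combines neural embedding retrieval with structured domain scoring to generate complete, coherent outfits from fashion catalogs. The system performs constrained retrieval over FashionCLIP embeddings and scores candidates with a multi-objective function that takes into account embedding similarity, color harmony, formality consistency, occasion fit, and other factors. The work introduces two techniques, semantic material weight and occasion-prior embeddings, which improve material compatibility judgments and occasion fit, respectively. Experiments show that the system clearly outperforms a random baseline in outfit quality and violation rate and can quickly generate diverse outfits on commodity hardware.

Comments Code: https://github.com/anushreeberlia/loom

Details
English abstract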

We present Loom, an outfit recommendation system that combines neural embedding retrieval with structured domain scoring to generate complete, coherent outfits from fashion catalogs. Given an anchor clothing item, Loom retrieves complementary pieces via slot-constrained approximate nearest neighbor search over FashionCLIP embeddings, then scores candidate outfits using a multi-objective function that integrates six signals: embedding similarity, color harmony, formality consistency, occasion coherence, style direction, and within-outfit diversity. We introduce two techniques that address limitations of purely learned or purely rule-based approaches: (1) semantic material weight, which uses CLIP embedding geometry to infer garment heaviness for layer compatibility without hand-coded material taxonomies; and (2) vibe/anti-vibe occasion priors, which embed prose descriptions of occasion contexts as anchor vectors in CLIP space and score items by differential affinity. Ablation experiments on a catalog of 620 items show that each component contributes measurably to outfit quality: the full system achieves a mean outfit score of 0.179 with a 9.3% hard violation rate, compared to 0.054 score and 16.0% violations for a category-constrained random baseline, a 3.3x improvement in score and 42% reduction in violations. Direction reranking is the single indispensable component: removing it drops score to 0.052, essentially equal to random. The system generates three stylistically distinct outfits in under 5 seconds on commodity hardware.
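
The headline ratios in the abstract can be checked from the reported numbers. The short sketch below only redoes that arithmetic; the helper names are hypothetical and nothing here reflects Loom's own code.

```python
def improvement_ratio(new, old):
    """How many times larger the new value is than the old one."""
    return new / old

def relative_reduction(new, old):
    """Fractional reduction from old to new."""
    return (old - new) / old

# Reported values: mean outfit score and hard violation rate (percent).
score_ratio = improvement_ratio(0.179, 0.054)
violation_cut = relative_reduction(9.3, 16.0)

print(f"score ratio: {score_ratio:.2f}x")            # about 3.31x, reported as 3.3x
print(f"violation reduction: {violation_cut:.1%}")   # about 41.9%, reported as 42%
```

Note that the 42% figure is a relative reduction; in absolute terms the violation rate falls by 6.7 percentage points.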

2605.09827 2026-05-12 cs.CV cs.AI Version update

Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction

Anushree Berlia

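AI summary This paper presents Fashion Florence, a vision-language model based on Florence-2 that is fine-tuned with LoRA to extract structured attributes from clothing images. From a single clothing photograph, the model generates JSON output containing category, color, material, style tags, and occasion tags, suitable for downstream tasks such as recommendation. Experiments show that Fashion Florence outperforms GPT-4o-mini and Gemini 2.5 Flash on several metrics, runs on a single GPU with only 0.77B parameters, and has near-zero marginal inference cost.

Comments Model: https://huggingface.co/anushreeberlia/fashion-florence

Details
English abstract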

We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags, structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), where we collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Fashion Florence produces valid JSON in 99.8% of outputs while running at 0.77B parameters on a single GPU at zero marginal inference cost. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.
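
For readers unfamiliar with the LoRA hyperparameters quoted above, the sketch below shows the standard LoRA forward pass, in which the low-rank update is scaled by `alpha / r` (here 32 / 16 = 2). It is a toy pure-Python illustration with made-up matrices, not the training code used for Fashion Florence.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, r, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in).
    A: trainable down-projection, shape (r, d_in).
    B: trainable up-projection, shape (d_out, r); zero at initialization,
       so the adapted layer starts out identical to the frozen one.
    """
    scale = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

r, alpha = 16, 32
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]] + [[0.0, 0.0] for _ in range(r - 1)]   # toy values, shape (16, 2)
B_zero = [[0.0] * r, [0.0] * r]                         # initialization
B_toy = [[1.0] + [0.0] * (r - 1), [0.0] * r]            # after some hypothetical training
x = [1.0, 2.0]

y_init = lora_forward(W, A, B_zero, x, r, alpha)
y_toy = lora_forward(W, A, B_toy, x, r, alpha)
```

Only `A` and `B` are trained, which is why the number of trainable parameters grows with `r` rather than with the full weight shape.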

2605.09802 2026-05-12 cs.CV cs.AI cs.LG Version update

CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

Zhipeng Liu, Chunbo Luo

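Affiliations * Department of Computer Science, University of Exeter

AI summary This paper studies the drop in object detection performance of vision-language models (VLMs) in cross-view (e.g., ground and aerial) scenarios and proposes CrossVL, a framework that combines a complexity-aware feature routing mechanism with a paired curriculum learning strategy to improve adaptation to images from different viewpoints. By estimating scene complexity and dynamically routing visual features, and by training progressively on the semantic consistency of synchronized ground-aerial image pairs, the method improves detection accuracy and stability. Experiments show that CrossVL improves detection performance on the MAVREC dataset and narrows the performance gap between viewpoints.

Comments Accepted to CVPR 2026. Code available at https://github.com/1nyourlife/Crossvl_cvpr2026

Details
English abstract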

Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) to enhance cross-view detection with VLMs. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2's aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.

2605.09774 2026-05-12 cs.CV Version update

DRIVE-C: A Controlled Corruption Dataset for Autonomous Driving

Shiva Aher

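Affiliations * Georgia Institute of Technology

AI summary DRIVE-C is a controlled degradation dataset for evaluating the robustness of visual perception in autonomous driving systems, built from real-world driving videos across a variety of environments. Using physics-inspired synthetic degradations, the dataset provides 10 clean clips and 600 degraded clips, together with detailed metadata and sensor health index annotations. DRIVE-C offers a controlled, reproducible testbed for robustness evaluation, degradation-aware modeling, uncertainty estimation, and sensor health monitoring of autonomous driving perception systems.

Details
English abstract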

DRIVE-C is a controlled corruption dataset designed to evaluate visual perception robustness in autonomous driving systems. It is built from real-world forward-facing driving videos collected across daytime, nighttime, urban, rural, freeway, and parking environments. Clean clips are anonymized via localized face and license plate blurring, then transformed with physics-inspired synthetic degradations. The dataset contains 10 clean clips and 600 corrupted clips spanning 12 camera degradation types across five severity levels, with per-clip metadata and Global Sensor Health Index (GSHI) annotations. DRIVE-C supports robustness benchmarking, degradation-aware modeling, uncertainty estimation, out-of-distribution (OOD) detection, and sensor health monitoring for Advanced Driver Assistance Systems (ADAS). By providing pixel-aligned clean and degraded video clips with fully reproducible corruption parameters, DRIVE-C offers a structured testbed for studying perception reliability under controlled camera degradation.
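
The clip counts in the abstract are consistent with a full grid: 10 clean clips x 12 degradation types x 5 severity levels = 600 corrupted clips. The sketch below enumerates such a grid; the full-grid assumption is inferred from the numbers rather than stated in the abstract, and the integer identifiers are placeholders, not the dataset's real naming scheme.

```python
from itertools import product

def corruption_grid(num_clean=10, num_types=12, num_levels=5):
    """Enumerate (clean_clip, degradation_type, severity) triples.

    Assumes every clean clip is rendered under every degradation type at
    every severity level; severities are numbered from 1.
    """
    return list(product(range(num_clean), range(num_types), range(1, num_levels + 1)))

grid = corruption_grid()
print(len(grid))  # 600
```

Because each corrupted clip maps back to exactly one clean clip, a grid like this supports paired clean-versus-degraded comparisons at a fixed type and severity.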

2605.09750 2026-05-12 cs.CV Version update

Fetal Brain Imaging: A Composite Neural Network Approach for Keyframe Detection in Ultrasound Videos

Aleksander Zamojski, Kacper Jarczak, Radoslaw Roszczyk

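Affiliations * Warsaw University of Technology

AI summary This paper proposes a new method for keyframe detection in fetal brain ultrasound videos, aiming to improve the efficiency and accuracy of fetal brain image analysis. The method uses a composite neural network architecture that combines a convolutional neural network (CNN) and a recurrent neural network (RNN): the CNN extracts spatial features from individual video frames, and the RNN captures temporal dependencies between frames in the video sequence. The model may help detect and diagnose certain fetal brain conditions earlier, supporting more timely treatment planning.

Details
English abstract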

This article presents a novel approach to keyframe detection in ultrasound videos, with a particular focus on fetal brain imaging. The proposed model is a composite neural network architecture that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN). The CNN extracts spatial features from individual video frames, while the RNN captures temporal dependencies between consecutive frames within each video sequence. The proposed model may improve the efficiency and accuracy of fetal brain ultrasound analysis, thereby supporting earlier detection, diagnosis, and treatment planning for selected fetal brain conditions.

2605.09719 2026-05-12 cs.CV cs.AI Version update

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Alaa Asfour, Christopher Indris, Leihan Chen, Tejas Vyas, Guanghui Wang

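Affiliations * Department of Computer Science, Toronto Metropolitan University

AI summary This study proposes a knowledge distillation framework that transfers spatial reasoning ability from a large 3D vision-language model to a lighter model, substantially reducing computational cost. By introducing learnable latent reasoning tokens (Hidden CoT) and a multi-task distillation strategy, the method retains 54-72% of the teacher's performance while reducing model size by 3x and inference latency by 8.7x. The work is the first to apply a latent reasoning mechanism in a distilled 3D vision-language model, enabling efficient 3D scene question answering.

Details
English abstract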

Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.
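
The abstract mentions uncertainty-aware loss weighting across tasks without giving its form. One widely used formulation learns a log-variance `s_i` per task and minimizes the sum of `exp(-s_i) * L_i + s_i`; the sketch below shows that formulation as an assumption, and it may differ from what the authors actually use.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine task losses with learned log-variance weights.

    Each task i contributes exp(-s_i) * L_i + s_i, so a task with a large
    learned s_i is down-weighted, while the + s_i term penalizes inflating
    s_i without bound.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += math.exp(-s) * loss + s
    return total

# Hypothetical per-task losses: spatial description, depth, detection.
losses = [1.0, 2.0, 3.0]
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])
reweighted = uncertainty_weighted_loss(losses, [0.0, math.log(2.0), math.log(3.0)])
```

For a fixed loss value `L_i`, the per-task term is minimized at `s_i = log(L_i)`, which is why the second call uses those values for the larger losses.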

2605.09703 2026-05-12 cs.CV Version update

MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

Xiaoyu Yuan, Niklas Heikkala, Tiina Törmänen, Hanna Järvenoja, Guoying Zhao, Haoyu Chen

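Affiliations * University of Oulu

AI summary This paper presents MOTOR-Bench, a real-world dataset and multi-agent framework for zero-shot human mental state understanding. The dataset contains 1,440 multimodal video clips from collaborative learning scenarios, each labeled by educational experts based on self-regulated learning theory, and is intended to support structured analysis of complex interpersonal interactions. To address the weakness of existing methods in reasoning from observable behavior to deeper mental states, the study proposes MOTOR-MAS, a multi-agent framework that uses a structured coordination mechanism to improve prediction of behavior, cognition, and emotion labels; experiments show that it outperforms existing methods on several metrics.

Comments Accepted by CVPR 2026 workshop AI4RWC

Details
English abstract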

Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels, lacking structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully-designed benchmark with a real-world dataset MOTOR-dataset, containing 1,440 multimodal video clips in collaborative learning scenarios, reflecting key real-world data challenges including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems in a zero-shot setting on our MOTOR-Bench. However, their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a reasoning multi-agent framework, named MOTOR-MAS. It coordinates multiple agents through a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions. Experimental results show that our MOTOR-MAS outperforms the best single-model benchmark by 15.93 points in Macro-F1 scores for the three labels of behavior, cognition, and emotion, and outperforms the general multi-agent benchmark by 10.2 points in internal cognition prediction.

2605.09701 2026-05-12 cs.CV Version update

DriveFuture: Future-Aware Latent World Models for Autonomous Driving

Yufeng Hong, Xiaotian Zhou, Yingyan Li, Xiangpo Zhou, Lin Liu, Yadan Luo, Shaoqing Xu, Lei Yang, Ziying Song

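Affiliations * Beijing Institute of Technology; Institute of Automation, Chinese Academy of Sciences; Beihang University; Beijing Jiaotong University; The University of Queensland; University of Macau; Nanyang Technological University; School of Artificial Intelligence (School of Software), Yanshan University

AI summary DriveFuture is a future-aware latent world model for autonomous driving. Its core idea is to condition the modeling of the current latent state on future world states, so that planning-oriented foresight is learned explicitly. During training, the method predicts and refines future latent states, which provide an explicit condition for a diffusion-based trajectory planner, and it achieves leading performance on several public benchmarks. The results indicate that conditioning current decisions on future states improves driving intelligence more than merely predicting future states.

Comments 24 pages, 7 figures

Details
English abstract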

Existing latent world models for autonomous driving have opened a promising path toward future-aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future-aware latent world modeling framework for autonomous driving that explicitly learns planning-oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground-truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching 55.5 EPDMS on NAVSIM-v2 navhard, 89.9 EPDMS on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision-making on future states. Notably, as of April 2026, DriveFuture ranks 1st on the NAVSIM-v2 navhard leaderboard (https://huggingface.co/spaces/AGC2025/e2e-driving-navhard) and achieves SOTA performance on NAVSIM-v1 navtest (https://huggingface.co/spaces/AGC2024-P/e2e-driving-navtest).

2605.09699 2026-05-12 eess.IV cs.CV cs.GR cs.LG Version update

A Real-Calibrated Synthetic-First Data Engine

Yukang Shen

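Affiliations * Kennesaw State University

AI summary Modern computer vision systems often face performance limits in data-scarce domains. Synthetic data generation is promising, but applying it directly often gives unstable results because of data quality issues and insufficient feedback mechanisms. This paper proposes a real-calibrated, synthetic-first data engine that combines controllable diffusion models with multi-stage curation and filtering in a unified pipeline to systematically improve the practicality and reliability of synthetic augmentation. Experiments show that, on tasks such as human pose estimation, combining synthetic data with real data improves performance, highlighting the value of data-centric strategies in low-data settings.

Comments 7 pages, 6 figures

Details
English abstract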

Modern computer vision systems increasingly encounter performance limitations in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often leads to unstable performance gains due to dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation and multi-stage curation/filtering within a unified pipeline, with optional support for uncertainty-driven selection and human verification. Instead of introducing new generative algorithms, our approach focuses on systematic dataset construction for improving the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline, where generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. Through empirical evaluation centered on human pose estimation, we show that synthetic data improves a real-data baseline when used as near-zero-human-annotation-cost augmentation alongside real anchors, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for low-data augmentation.

2605.09693 2026-05-12 cs.CV cs.AI cs.LG Updated version

Do multimodal models imagine electric sheep?

Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun

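Affiliations * Apple

AI summary This study asks whether multimodal models form mental imagery when solving spatial puzzles. It finds that large multimodal models do develop an imagination-like process on tasks such as tangram and jigsaw puzzles, and that they even "imagine" sheep when solving sheep puzzles. The authors fine-tune a Qwen3.5 vision-language model on a range of visual reasoning tasks and find that, while carrying out actions, the model spontaneously forms visual representations of intermediate states. Building on this finding, they propose two ways to sharpen and use the model's internal visual representations, which improve the solve rate.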

English abstract

Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.

2605.09688 2026-05-12 cs.CV Updated version

ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

Rui Song, Tianhui Cai, Markus Gross, Xingcheng Zhou, Zewei Zhou, Zhiyu Huang, Olaf Wysocki, Jiaqi Ma

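Affiliations * University of California, Los Angeles; University of Cambridge; Technical University of Munich

AI summary This paper proposes ConFixGS, a method for repairing feedforward 3D Gaussian Splatting (3DGS) reconstructions in driving scenes. Using confidence-aware diffusion priors, it generates local pseudo-targets and validates them by reprojection-based cross-checking against support views, which makes reconstructed detail more reliable and suppresses inconsistent evidence. Experiments on multiple datasets show clear improvements in novel view synthesis, with PSNR gains of up to 3.68 dB and FID reduced by nearly half.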

Comments 28 pages, 12 figures

English abstract

Feedforward 3D Gaussian Splatting (3DGS) often struggles in trajectory-based sparse-view driving scenes. Existing Gaussian repair methods mainly target optimization-based 3DGS, while diffusion-based repair is typically restricted to iterative refinement near observed viewpoints, leaving feedforward 3DGS repair underexplored. We propose ConFixGS, a plug-and-play method that learns to fix feedforward 3DGS with confidence-aware diffusion priors. Starting from a pretrained feedforward model, ConFixGS generates diffusion-enhanced local pseudo-targets and validates them through reprojection-based cross-checking against support views. The resulting dense confidence maps guide refinement, enhancing reliable details while suppressing hallucinated or inconsistent evidence. On Waymo, nuScenes, and KITTI, ConFixGS improves challenging novel view synthesis, with PSNR gains of up to 3.68 dB and FID reduced by nearly half. Our results highlight confidence-aware fusion of generative priors and support-view consistency as a key principle for robust feedforward 3D driving scene reconstruction.

2605.09687 2026-05-12 cs.CV Updated version

Spatial-Frequency Gated Swin Transformer for Remote Sensing Single-Image Super-Resolution

Md Aminur Hossain, Parekh Valkesh, Ayush V. Patel, Yogesh Jethani, Sanjay K. Singh, Biplab Banerjee

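Affiliations * Space Applications Centre, ISRO, Ahmedabad, India; Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay, India; New L J Institute of Engineering and Technology, Ahmedabad, India; Pandit Deendayal Energy University, Gandhinagar, India; GLS University, Ahmedabad, India

AI summary This paper addresses single-image super-resolution for remote sensing, which aims to reconstruct high-resolution imagery from low-resolution observations while preserving fine spatial structure. To address the limited detail reconstruction of existing Swin Transformer models, the authors propose the Spatial-Frequency Gated Swin Transformer (SFG-SwinSR), which adds a spatial-frequency gating module to the feed-forward network to separate low-frequency structural content from high-frequency residual detail. Experiments on remote sensing datasets report improved PSNR and SSIM and better rendering of high-resolution detail.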

Comments 15 pages

English abstract

Remote Sensing (RS) single-image super-resolution aims to reconstruct high-resolution imagery from low-resolution observations while preserving fine spatial structures. Recent Swin Transformer-based models, including Swin2SR, provide strong spatial context modeling through shifted-window self-attention, but their feed-forward networks remain generic channel-mixing modules and do not separate low-frequency structural content from high-frequency residual detail. To address this limitation, we propose SFG-SwinSR, a Spatial-Frequency Gated Swin Transformer for single-image super-resolution in remote sensing. SFG-SwinSR modifies the original Swin2SR attention block by replacing each transformer block's standard feed-forward network with a lightweight Spatial-Frequency Gated Feed-Forward Network (SFG-FFN). The module estimates low-frequency content via a depthwise-blur branch, extracts high-frequency residuals by subtraction, refines them with a lightweight spatial branch, and adaptively injects detail through a bottleneck gate. Experiments on SpaceNet and SEN2VENμS show that SFG-SwinSR improves reconstruction quality under the evaluated settings. On SpaceNet, it achieves 45.19 dB PSNR and 0.9852 SSIM, indicating effective enhancement of high-frequency details. This demonstrates that spatial-frequency transformation within the transformer feed-forward network improves detail reconstruction in RS super-resolution.
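
The blur, subtract, and gate steps described for SFG-FFN can be illustrated with a small NumPy sketch. This is not the authors' implementation: the box filter, the two gate matrices `w_down` and `w_up`, and the final composition `low + gate * high` are all assumptions made for illustration, and the lightweight spatial refinement branch is omitted.

```python
import numpy as np


def box_blur_depthwise(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Blur each channel independently with a k x k box filter.

    x has shape (C, H, W). The box filter is a stand-in for the
    depthwise-blur branch; the actual kernel is not given in the abstract.
    """
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += xp[:, dy:dy + x.shape[1], dx:dx + x.shape[2]]
    return out / (k * k)


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def sfg_ffn_sketch(x: np.ndarray, w_down: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    """Hypothetical spatial-frequency gated block.

    w_down has shape (R, C) and w_up has shape (C, R) with R < C,
    forming a per-pixel bottleneck that produces a gate in (0, 1).
    """
    c, h, w = x.shape
    low = box_blur_depthwise(x)           # low-frequency estimate
    high = x - low                        # high-frequency residual by subtraction
    flat = high.reshape(c, -1)            # (C, H*W)
    hidden = np.maximum(w_down @ flat, 0.0)
    gate = sigmoid(w_up @ hidden).reshape(c, h, w)
    return low + gate * high              # assumed way of injecting detail
```

With a gate near 1 the block passes the input through almost unchanged, and with a gate near 0 it returns only the blurred content, which is one simple way to read "adaptively injects detail".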

2605.09681 2026-05-12 cs.CV Updated version

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Yicheng Ji, Zhizhou Zhong, Jun Zhang, Qin Yang, XiTai Jin, Ying Qin, Wenhan Luo, Shuiyang Mao, Wei Liu, Huan Li

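Affiliations * ZJU; Video Rebirth; HKUST; BJTU

AI summary This paper targets the high attention complexity and heavy memory overhead that redundant key-value (KV) caches cause in autoregressive video diffusion models, and proposes Forcing-KV, a hybrid KV cache compression method. By analyzing the functional roles of attention heads in mainstream models, it divides heads into static heads, which focus on intra-frame fidelity and transitions across chunks, and dynamic heads, which govern inter-frame motion and consistency. It applies structured pruning to the former and segment-similarity-based dynamic pruning to the latter. While maintaining generation quality, the method speeds up generation and reduces memory use, reaching over 29 frames per second on a single NVIDIA H200 GPU.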

Comments 10 pages

English abstract

Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.

2605.09679 2026-05-12 cs.CV cs.AI Updated version

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Boyan Wang, Liang He, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille

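Affiliations * Johns Hopkins University; University of Bologna; Istanbul Medipol University; Center for Biomolecular Nanotechnologies, Istituto Italiano di Tecnologia; The First Affiliated Hospital, Sun Yat-Sen University; Tongji University

AI summary DeepTumorVQA is a hierarchical 3D CT benchmark for stage-wise evaluation of medical vision-language models (VLMs) and tool-augmented agents. It decomposes the reasoning involved in tumor diagnosis into four stages (recognition, measurement, visual reasoning, and medical reasoning) so that model performance at each level can be assessed separately. It also provides tool-interaction environments in which a model can call external tools for segmentation, measurement, and medical knowledge, which is closer to real clinical practice. Experiments show that tool augmentation substantially improves performance on complex medical reasoning tasks.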

English abstract

Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.

2605.09677 2026-05-12 cs.CV Updated version

VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement

Qingyu Xian, Hao Cheng, Berend Jan van der Zwaag, Rolands Kromanis, Ozlem Durmaz Incel

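Affiliations * Pervasive Systems Research Group, Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente; Department of Earth Observation Science, Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente; Department of Civil Engineering and Management, Faculty of Engineering Technology, University of Twente

AI summary This paper proposes VFM-SDM, a structural displacement measurement framework based on vision foundation models (VFMs) that measures multi-directional structural displacement without contact and without task-specific training, on-site markers, or calibration. The method combines VFM-inferred camera parameter estimation with point tracking, reconstructs displacement by triangulation, and adds structural geometry constraints to make the estimates more physically plausible and consistent. Results on real-world data show good accuracy and stability, pointing toward automated, scalable structural health monitoring.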

English abstract

Reliable displacement measurement is fundamental for structural health monitoring and digital engineering workflows, as it provides direct structural response information. Vision-based measurement has emerged as a promising approach for low-cost, non-contact displacement monitoring. However, its deployment often remains constrained by task-specific model training or on-site preparation, such as marker installation or manual camera calibration. This study presents a Vision Foundation Model-based framework for Structural Displacement Measurement (VFM-SDM) that integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation, enabling efficient non-contact deployment in real-world applications. Structural geometry constraints are incorporated to suppress physically implausible deviations and improve estimation consistency. A multi-modal field dataset collected from an in-service pedestrian bridge is introduced alongside a unified benchmarking protocol to support reproducible evaluation. Representative results show low amplitude errors (NRMSE$_{\text{range}}$: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02) for vertical and lateral displacements, indicating robust performance under real-world conditions. The proposed framework advances automated, scalable displacement monitoring and lays the groundwork for VFM-enabled structural response measurements in digital twin and data-centric construction workflows.
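
The three reported quantities can be computed from an estimated and a reference displacement series roughly as follows. The abstract does not define them, so the formulas here are assumptions based on common usage: NRMSE normalized by the reference range, Pearson correlation, and a relative peak-to-peak amplitude error (RPPAE) taken as the relative difference of the two peak-to-peak amplitudes.

```python
import numpy as np


def nrmse_range(est, ref) -> float:
    """RMSE divided by the range of the reference signal (assumed definition)."""
    est = np.asarray(est, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    rmse = np.sqrt(np.mean((est - ref) ** 2))
    return float(rmse / (ref.max() - ref.min()))


def pearson_r(est, ref) -> float:
    """Pearson correlation coefficient between the two series."""
    return float(np.corrcoef(est, ref)[0, 1])


def rppae(est, ref) -> float:
    """Relative peak-to-peak amplitude error (assumed definition)."""
    est = np.asarray(est, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    return float(abs(np.ptp(est) - np.ptp(ref)) / np.ptp(ref))
```

Under these definitions, an estimate that tracks the shape of the reference but doubles its amplitude keeps a correlation of 1 while both error terms grow, which is why the three numbers are reported together.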

2605.09670 2026-05-12 cs.RO cs.CV Updated version

Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models

Aws Khalil, Jaerock Kwon

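Affiliations * Department of Electrical and Computer Engineering, University of Michigan - Dearborn

AI summary This paper studies generative predictive display for vision-based teleoperation, which aims to mitigate communication latency by generating estimated visual states. The authors present a zero-shot benchmark, without task-specific fine-tuning, of several off-the-shelf generative video models for short-horizon predictive display. Experiments show that no tested model simultaneously meets the requirements on prediction accuracy, inference latency, and error stability, revealing a gap between general-purpose generative video models and predictive display for teleoperation.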

English abstract

Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD
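
The accuracy and latency criteria can be sketched as below. The function names are hypothetical, and the real-time test (a rollout of `n_frames` must be produced in no more than `n_frames / source_fps` seconds) is one plausible reading of "real-time inference at the source frame rate", not the benchmark's stated rule.

```python
import numpy as np


def per_step_mad(pred, target) -> np.ndarray:
    """Mean absolute difference for each predicted frame.

    pred and target have shape (T, H, W, C) and share an intensity range.
    Returns an array of T per-step errors, which also shows how the
    error evolves across the prediction horizon.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return np.abs(pred - target).mean(axis=(1, 2, 3))


def meets_real_time(latency_s: float, n_frames: int, source_fps: float) -> bool:
    """Assumed criterion: the rollout is generated no slower than it is consumed."""
    return latency_s <= n_frames / source_fps
```

Averaging the output of `per_step_mad` gives a single rollout error, while inspecting it step by step reveals whether the error grows over the horizon.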

2605.09667 2026-05-12 cs.CV cs.AI Updated version

S2P-Net: A Spectral-Spatial Polar Network for Rotation-Invariant Object Recognition in Low-Data Regimes

Albert Heruth

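Affiliations * Unaffiliated Researcher

AI summary This paper proposes S2P-Net, a compact deep learning architecture for rotation-invariant object recognition in low-data regimes that guarantees rotation invariance mathematically, without data augmentation. As its name (Spectral-Spatial Polar Network) indicates, the network combines spectral and spatial information with a polar representation. The paper also reports a comparison against conventional convolutional neural networks.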

Comments 9 pages, 4 figures, 3 tables. Preprint. Code available from the author upon request

English abstract

We present S2P-Net (Spectral-Spatial Polar Network), a compact deep learning architecture that achieves mathematically guaranteed rotation invariance without data augmentation. In this paper, we also compare it with other neural network architectures (CNNs). Have a look at the results, and feel free to contact me with any questions. This is my first paper :) Made by Hackbert

2605.09666 2026-05-12 cs.CV cs.AI Updated version

Rethinking Evaluation of Multiple Sclerosis (MS) Lesion Segmentation Models

Abdul Basit, Ashir Rashid, Muhammad Abdullah Hanif, Muhammad Shafique

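Affiliations * eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi

AI summary This paper examines shortcomings in how multiple sclerosis (MS) lesion segmentation models are evaluated. Most are assessed with the Dice score, which does not account for lesion-wise detection and segmentation performance or for model behavior in cases that are complex or hard for human annotators. The authors analyze in detail what neurologists look for in brain MRI scans, propose evaluation metrics that better match practical needs, and analyze state-of-the-art models on two open-source datasets to assess their suitability for real clinical use.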

Comments 8 pages, 5 figures, Accepted to IJCNN 2026

English abstract

Multiple Sclerosis (MS) is a chronic autoimmune disease that can significantly reduce the quality of life of a patient. Existing treatment options can only help slow down the progression of the disease. Therefore, early detection and precise monitoring of disease progression are important. Deep learning offers state-of-the-art models for detecting and segmenting MS lesions in brain MRI scans. However, most of these models are evaluated using the Dice score, without accounting for lesion-wise detection and segmentation performance or other metrics that quantify model performance in cases that are complex or confusing for human annotators, or in cases that are essential for disease detection and progression monitoring. In this paper, we highlight the need to rethink the evaluation of MS lesion segmentation models. In this context, we first present problem fingerprinting in detail to highlight what neurologists look for in brain MRI scans for MS detection and progression monitoring, and which metrics are required to properly quantify model performance in these contexts. Additionally, we present an analysis of state-of-the-art models on two open-source datasets using these metrics to highlight their usability for real-world deployment in hospitals.
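
A toy example shows why a voxel-level Dice score can hide missed lesions. The lesion-wise recall below is a simple illustrative metric, not necessarily one the paper proposes. It assumes the ground truth is given as an integer label map with one id per lesion (0 is background), and it counts a lesion as detected if any predicted voxel overlaps it.

```python
import numpy as np


def dice(pred, gt) -> float:
    """Voxel-level Dice score between two binary masks."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * (pred & gt).sum() / denom)


def lesion_recall(pred, gt_labels) -> float:
    """Fraction of ground-truth lesions touched by at least one predicted voxel."""
    pred = np.asarray(pred).astype(bool)
    gt_labels = np.asarray(gt_labels)
    ids = [i for i in np.unique(gt_labels) if i != 0]
    if not ids:
        return 1.0
    hits = sum(bool((pred & (gt_labels == i)).any()) for i in ids)
    return hits / len(ids)


# One large lesion (100 voxels) and one small lesion (4 voxels).
gt_labels = np.zeros((20, 20), dtype=int)
gt_labels[0:10, 0:10] = 1
gt_labels[15:17, 15:17] = 2
pred = gt_labels == 1  # the model finds only the large lesion

print(dice(pred, gt_labels > 0))       # high, since the large lesion dominates
print(lesion_recall(pred, gt_labels))  # half of the lesions are missed
```

Here Dice is 200/204 even though one of the two lesions is missed entirely, which is the kind of case a lesion-wise metric is meant to expose.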

2605.09662 2026-05-12 cs.CV Updated version

BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction

Alessio Mazzucchelli, Maria Naranjo-Almeida, Jorge Bustos-Sanchez, Mariella Dimiccoli, Francesc Moreno-Noguer, Jordi Sanchez-Riera, Adrian Penate-Sanchez

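Affiliations * Arquimea Research Center; Institut de Robòtica i Informàtica Industrial (CSIC-UPC); Universidad de las Palmas de Gran Canaria (IUSIANI)

AI summary This paper proposes BEA-GS, a Gaussian Splatting method that goes beyond radiance supervision to extract objects more precisely. It introduces two new losses that optimize the geometry of visible and non-visible Gaussians respectively, so that geometry aligns more closely with semantic boundaries. Experiments on multiple datasets show the best boundary segmentation to date, improving the precision of object-level editing and asset extraction.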

Comments CVPR 2026 Highlight

English abstract

Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene do not optimize the underlying 3D geometry, making object-level editing or asset extraction challenging. Recent methods, such as COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the scene's geometry to represent the underlying semantics. We advance this concept further by proposing a novel solution that provides near-perfect boundaries in object extraction. We do so by introducing two new losses in the optimization: 1) a loss that modifies the geometry of visible Gaussians to respect semantic boundaries, and 2) a loss that adjusts the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization, allowing for seamless integration within the optimization of the Gaussian parameters. The second loss also propagates gradients to Gaussian parameters but does so without passing through the rasterization, enabling modification of the scene's geometry even when little transmittance reaches a Gaussian (partial or non-visible). Exhaustive comparisons with 12 state-of-the-art methods across 4 datasets, using six metrics, demonstrate that our approach produces overall the best boundary segmentation to date.

2605.09644 2026-05-12 cs.CV Updated version

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

Zichen Zou, Xiaosong Jia, Zuxuan Wu, Yu-Gang Jiang

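AI summary This paper proposes RetrieveVGGT, a training-free framework that addresses the memory overflow and quality degradation that Transformer-based 3D reconstruction suffers on long sequences because of the cost of attention. By recasting context construction as a retrieval problem, RetrieveVGGT retrieves only a small number of relevant frames at each step, keeping memory use controllable, and uses query-key similarity inside VGGT as the relevance signal without additional training. It also introduces Segment Sampling and a camera-pose-based spatial memory mechanism to improve information diversity and location awareness. Experiments show that it outperforms several existing methods.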

English abstract

Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via scalable Transformer architecture, but the quadratic complexity of global attention prevents long context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework, which formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget, which is close to its training context length. Interestingly, we find that the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity similar to a recommender system, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.
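
The core retrieval idea, scoring cached frames by how strongly the current frame's queries match their keys and keeping a fixed number, can be sketched as follows. The function name and the aggregation rule (the mean scaled dot product over all query-key pairs) are assumptions, since the abstract does not say how token-level similarities are pooled. Segment Sampling and the pose-aware memory are not covered.

```python
import numpy as np


def retrieve_frames(query_tokens: np.ndarray, cached_keys: list, k: int):
    """Pick the k history frames most similar to the current frame.

    query_tokens: (Tq, D) queries of the current frame.
    cached_keys:  list of (Tk_i, D) key arrays, one per history frame.
    Returns (indices in temporal order, per-frame scores).
    """
    d = query_tokens.shape[1]
    scores = np.array(
        [(query_tokens @ keys.T).mean() / np.sqrt(d) for keys in cached_keys]
    )
    k = min(k, len(cached_keys))
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top), scores
```

Because `k` is fixed, the attention context stays the same size however many frames have been cached, which is the property the abstract describes as a controllable memory budget.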

2605.09640 2026-05-12 cs.CV cs.LG Updated version

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Meng Lou, Hanzhong Guo, Linwei Chen, Yizhou Yu

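Affiliations * The University of Hong Kong; The Hong Kong University of Science and Technology; Hong Kong Generative AI Research and Development Center

AI summary This paper studies how to overcome catastrophic forgetting in visual continual learning and proposes RaPO, a new reinforcement fine-tuning method. The authors find that existing methods such as GRPO still forget noticeably under class-incremental and domain-incremental learning, and trace the cause to trajectory-level policy drift. RaPO introduces a retention reward and cross-task advantage normalization to mitigate the forgetting caused by this drift. Experiments show strong performance across several continual learning settings, and the work offers a systematic exploration of reinforcement fine-tuning in visual continual learning.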

English abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
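
The idea behind CTAN, keeping an exponential moving average of reward statistics that is not reset at task boundaries, might look like the sketch below. The class name, the momentum value, and the exact normalization formula are assumptions. The abstract states only that a persistent EMA of reward statistics is maintained.

```python
import math


class EmaAdvantageNormalizer:
    """Normalize rewards with running statistics that persist across tasks."""

    def __init__(self, momentum: float = 0.9, eps: float = 1e-6):
        self.momentum = momentum
        self.eps = eps
        self.mean = None
        self.var = None

    def update(self, rewards):
        """Fold one group of rollout rewards into the running mean and variance."""
        n = len(rewards)
        batch_mean = sum(rewards) / n
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / n
        if self.mean is None:
            # The first group initializes the statistics directly.
            self.mean, self.var = batch_mean, batch_var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1.0 - m) * batch_mean
            self.var = m * self.var + (1.0 - m) * batch_var

    def advantages(self, rewards):
        """Advantages relative to the running statistics (assumed form)."""
        std = math.sqrt(self.var + self.eps)
        return [(r - self.mean) / std for r in rewards]
```

The same normalizer instance would be reused when training moves to the next task, so a shift in reward scale between tasks is absorbed gradually instead of abruptly.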

2605.09628 2026-05-12 cs.CV Updated version

DegBins: Degradation-Driven Binning for Depth Super-Resolution

Zhiqiang Yan, Zhengxue Wang, Jian Yang, Gim Hee Lee

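Affiliations * Department of Computer Science, National University of Singapore; Nanjing University of Science and Technology

AI summary Depth super-resolution (DSR) aims to recover a high-resolution depth map from a low-resolution one. Conventional methods typically learn the residual between high and low resolution in a low-dimensional feature space, which struggles to model spatially varying degradations. This paper proposes DegBins, a DSR framework that uses degradation-driven binning to recast regression as a hybrid classification-regression problem: the residual depth is represented more flexibly as a weighted combination of discrete depth bins, and the degradation relationship is modeled in a high-dimensional feature space so that bin ranges and probability distributions adapt. Experiments show that DegBins outperforms existing methods on several benchmarks in accuracy and robustness.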

Comments 9 pages

English abstract

Depth super-resolution (DSR) aims to recover a high-resolution (HR) depth map from its low-resolution (LR) counterpart. With color image guidance, this task is typically formulated as learning the residual between HR and LR in a low-dimensional feature space. However, this additive formulation is insufficient to accurately capture the complex relationship between HR and LR, especially under spatially varying degradations. In this paper, we introduce DegBins, a novel DSR framework that leverages degradation-driven binning to adaptively enhance residual modeling. Specifically, DegBins reformulates the regression-based DSR as a hybrid classification-regression problem, where the residual depth is represented as a linear combination of discrete depth bins weighted by their learned probability distribution, yielding more flexible and expressive representations. Furthermore, DegBins models the degradation relationship between HR and LR in a high-dimensional feature space, enabling adaptive bin range adjustment and probability optimization conditioned on local degradation characteristics. To progressively improve reconstruction quality, DegBins adopts a multi-stage refinement scheme, where each stage performs finer-grained bin partitioning and probability updating based on the former estimation. This coarse-to-fine design facilitates more accurate depth recovery, particularly in regions with severe degradations or complex structural variations. Extensive experiments across five benchmarks demonstrate that DegBins consistently outperforms existing state-of-the-art methods in terms of accuracy, robustness, and generalization.
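
The hybrid classification-regression formulation can be illustrated with a short NumPy sketch: per-pixel logits over K bins become probabilities, and the residual is the probability-weighted sum of bin centers. The equal-width bins over a per-pixel range `[lo, hi]` and the function names are assumptions. In the paper the ranges and probabilities are predicted from degradation features, which this sketch takes as given inputs.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def residual_from_bins(logits: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Residual depth as a probability-weighted sum of bin centers.

    logits: (H, W, K) per-pixel scores over K bins.
    lo, hi: (H, W) per-pixel residual range (assumed to be predicted elsewhere).
    """
    k = logits.shape[-1]
    frac = (np.arange(k) + 0.5) / k                       # relative bin centers
    centers = lo[..., None] + (hi - lo)[..., None] * frac
    probs = softmax(logits, axis=-1)
    return (probs * centers).sum(axis=-1)
```

A high-resolution estimate would then be the upsampled low-resolution depth plus this residual, and a later refinement stage could narrow `[lo, hi]` around the current estimate to get finer bins.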

2605.09622 2026-05-12 cs.CV cs.AI Updated version

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

Yuhan Wang, Zihan Li, Han Liu, Simon Arberet, Martin Kraus, Yuyin Zhou, Florin-Cristian Ghesu, Dorin Comaniciu, Ali Kamen, Riqiang Gao

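Affiliations * UC Santa Cruz; Siemens Healthineers; University of Washington

AI summary Voxel-wise dose prediction is a critical but challenging task in radiotherapy planning, and existing models often fail to generalize across clinical settings. This paper proposes DiffKT3D, a unified Any2Any 3D diffusion framework that transfers knowledge from pretrained video diffusion models for efficient, clinically meaningful dose prediction. It introduces flexible conditional generation based on modality embeddings together with a clinically oriented reinforcement learning post-training strategy, improving dose prediction accuracy and image quality over the current best model.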

Comments Accepted by CVPR 2026 main conference. Compared to the CVPR version, minor updates are included here (e.g., combining the main text and appendix; clarifying the timing scenario in the appendix)

English abstract

Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically informed Scorecard explicitly tailored to institutional treatment preferences. Compared with the winner of the GDP-HMM challenge, DiffKT3D sets a new state of the art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.

2605.09614 2026-05-12 cs.CV Updated version

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

Xuan Gong, Hanbo Huang, Hao Zheng, Yiran Zhang, Wenbin Dai, Weishu Zhao, Shiyu Liang

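Affiliations * Shanghai Jiao Tong University; Lanzhou University

AI summary This paper studies the fading of visual information in long-chain multimodal reasoning. Using an information-theoretic analysis, it derives a lower bound on the downstream visual gain of an intervention point and, guided by it, designs reflection-anchor policy optimization (RAPO). RAPO selects high-entropy reflection anchors and optimizes a finite-window KL surrogate, strengthening the propagation and retention of visual information during generation. Experiments show substantial gains over existing methods on several vision-language benchmarks, and mechanism analyses indicate that RAPO increases contrastive visual-dependence signals along generated trajectories.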

Comments Under Review

English abstract

Long chain-of-thought (CoT) reasoning improves large vision--language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.
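
The first of the two factors, local branching room measured by token entropy, can be sketched as below. The function names are hypothetical, and selecting the top-k highest-entropy positions is an illustrative simplification. RAPO's actual anchor selection also weighs downstream visual propagation, which this sketch does not model.

```python
import numpy as np


def token_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each next-token distribution.

    probs: (T, V) array whose rows sum to 1.
    """
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)


def select_anchors(probs: np.ndarray, k: int):
    """Return the k highest-entropy positions, in sequence order, and all entropies."""
    ent = token_entropy(probs)
    k = min(k, len(ent))
    idx = np.argsort(-ent, kind="stable")[:k]
    return np.sort(idx), ent
```

Positions where the model is nearly certain contribute almost no entropy and are skipped, so interventions concentrate on steps where the generation could plausibly branch.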

2605.09613 2026-05-12 cs.RO cs.CV Updated version

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

Narsimha Menga, Parikshit Sakurikar, Amirreza Rouhi, Satya Sai Reddy, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi

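Affiliations * DreamVu

AI summary This study presents SABER, a high-fidelity action dataset for adapting robot vision-language-action (VLA) models to real-world retail settings. Built from many hours of real in-store capture, SABER records the fine-grained hand activity, whole-body motion, and scene dynamics of people in retail environments, with no staging or teleoperation. The dataset includes several action representations and is validated by post-training GR00T N1.6, where it raises the success rate on complex retail tasks, showing how much high-quality data matters for robot performance.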

English abstract

Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber

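2605.09606 2026-05-12 cs.CR cs.CV Version update

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

Yule Liu, Yilong Yang, Jiale Teng, Hanze Jia, Zeren Luo, Jingyi Zheng, Zifan Peng, Ke Li, Yifan Liao, Zhen Sun, Jiaheng Wei, Yang Liu, Zhuo Ma, Xinlei He

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Xidian University; Zhejiang University; Wuhan University

AI summary This paper studies the risk that image-to-3D models generate harmful geometry, and how to mitigate it. It shows that current models, given malicious inputs, can reconstruct 3D structures that are direct physical hazards, risky components, or deceptive replicas. A systematic evaluation of open-source and commercial models finds that they reconstruct harmful geometry effectively, while existing safeguards offer limited protection. The authors further develop a stacked, multi-layer defense that sharply reduces the share of harmful outputs but still incurs a high false-positive rate, which highlights the gaps in geometry-level safety protection in current systems.

Details
English abstract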

Recent advances in image-to-3D models have significantly improved the fidelity and accessibility of 3D content creation. The same powerful reconstruction capability that enables creative design can also be misused by adversaries to generate harmful geometries, which can then be fabricated with 3D printers and pose real-world risks. However, such risks are largely underexplored: it remains unclear how well current image-to-3D models can produce these harmful geometries, and whether existing safeguards can reliably prevent such generation. To fill this gap, we conduct a systematic measurement study of harmful geometry generation and mitigation. We first describe this risk through three unsafe categories: direct-use physical hazards, risky templates or components, and deceptive replicas. Each category is instantiated with representative objects. We evaluate both open-source and commercial image-to-3D models under original, degraded, viewpoint-shifted, and semantically camouflaged inputs. We consider different evaluation metrics, including geometric validity, multi-view VLM-based semantic scoring, targeted human validation, and controlled physical fabrication. The results reveal a concerning reality: current image-to-3D models can effectively reconstruct the harmful geometries, while fewer than 0.3% of such geometries trigger commercial moderation flags. As a first step toward mitigation, we evaluate three representative safeguard families: input moderation, model-level benign alignment, and output-level filtering. We find that existing safeguards have distinct weaknesses. We further develop a stacked defense that can reduce harmful retention to <1%, but still at an 11% overall false-positive cost. Taken together, our findings demonstrate the risk in current systems and encourage better geometry-aware safeguards for moderation.

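2605.09604 2026-05-12 cs.CV Version update

DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition

Jiaying Lin, Shiman Wu, Jinfu Liu, Can Wang, Mengyuan Liu

Affiliations * Peking University; Huazhong University of Science and Technology; DJI Technology Company Ltd.; Christian-Albrechts-Universität zu Kiel

AI summary This work addresses human action recognition (HAR) with millimeter-wave radar in heterogeneous settings. It introduces UniMM-HAR, the first large-scale heterogeneous multi-source mmWave point cloud dataset, and designs DAP-Net to handle the distribution shifts caused by different devices and frequency bands. DAP-Net combines multimodal information with a Doppler-aware mechanism to improve robustness to heterogeneous radar sources, and experiments show superior performance on cross-source recognition.

Details
English abstract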

Millimeter-wave (mmWave) radar provides privacy-preserving sensing and is valuable for human action recognition (HAR). Existing mmWave point cloud datasets are limited in scale and mostly collected under homogeneous single-source settings, preventing current methods from handling real-world distribution shifts caused by heterogeneous radar sources, such as different devices and frequency bands. To address this, we introduce UniMM-HAR, the largest and first mmWave point cloud HAR dataset for heterogeneous multi-source scenarios, standardizing three distinct radar configurations to realistically evaluate cross-source generalization. We further propose the Doppler-aware Point Cloud Network (DAP-Net) to tackle heterogeneity challenges. DAP-Net enhances intra-modal representations and performs cross-modal alignment to learn source-invariant action semantics. Leveraging action-consistent spatio-temporal Doppler patterns as anchors, the Dual-space Doppler Reparameterization (D2R) module performs sample-adaptive geometric densification and Doppler-guided feature recalibration, while the Text Alignment Module (TAM) provides stable semantic anchors via a pretrained textual space. Experiments show that DAP-Net significantly outperforms existing methods under heterogeneous radar settings, achieving state-of-the-art accuracy and strong cross-source robustness.

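2605.09591 2026-05-12 cs.CV Version update

From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Shuang Liang, Zeqing Wang, Yuxian Li, Xihui Liu, Han Wang

Affiliations * Department of Electrical and Computer Engineering, The University of Hong Kong; School of Computer Science and Engineering, Sun Yat-sen University; CASIC, The University of Hong Kong

AI summary This paper examines whether promptable segmentation models truly understand the concepts they segment, rather than relying on visually salient but semantically misleading cues. The authors propose CAFE, a new benchmark that evaluates concept faithfulness through attribute-level counterfactual modifications. Experiments show that although models produce accurate masks, they still show weak concept understanding when given misleading prompts, revealing a systematic gap between localization quality and semantic understanding.

Comments 30 pages, 8 figures

Details
English abstract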

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE (Counterfactual Attribute Factuality Evaluation), a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.

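2605.09581 2026-05-12 cs.CV Version update

FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision

Michal Filipkowski, Marcin Kowalczyk, Tomasz Kryjak

Affiliations * AGH University of Krakow, Poland; Embedded Vision Systems Group, Computer Vision Laboratory

AI summary This paper proposes an FPGA-based hardware architecture that implements the Contrast Maximization (CM) algorithm for event-based vision systems. Exploiting the parallelism of the FPGA, the architecture efficiently computes the contrast of images reconstructed from asynchronous event streams and optimizes it iteratively to estimate motion parameters. The paper details the design and optimization of the hardware modules and reports substantial speed and energy-efficiency gains, with execution over 200 times faster than CPU and GPU implementations, providing a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.

Comments Accepted for ARC 2026

Details
English abstract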

This paper presents a hardware architecture that implements the Contrast Maximization (CM) algorithm in Field-Programmable Gate Array (FPGA) resources for event-based vision systems. CM estimates motion parameters by maximizing the contrast of an Image of Warped Events (IWE) reconstructed from asynchronous event streams. Event-based vision sensors generate sparse data with high temporal resolution and low spatial redundancy, which makes them well suited for hardware processing. The deterministic, massively parallel structure of the FPGA is leveraged to design a deeply pipelined architecture capable of high-throughput, energy-efficient processing suitable for real-time embedded applications. This paper details the hardware modules responsible for event warping, contrast computation, and iterative optimization, discusses key implementation decisions, and presents the hardware-aware optimization method used in the design. Experimental results demonstrate a substantial speed and efficiency improvement over CPU- and GPU-based implementations, with motion parameter estimation executing over 200 times faster. To the best of our knowledge, this is the first hardware architecture enabling acceleration of CM algorithm computations. Its performance is evaluated in terms of processing speed, energy efficiency, and hardware resource utilization. The proposed design is validated using an event-based object tracking application. The results confirm that the architecture provides a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.
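
As a rough software illustration of the CM objective described in this abstract, the sketch below warps events back to a reference time under a candidate 2D velocity, accumulates them into an Image of Warped Events, and scores the candidate by the variance of that image. The event layout `(x, y, t)`, the constant-velocity motion model, the exhaustive search over candidates, and all function names are assumptions made for illustration; they are not taken from the paper's FPGA design.

```python
def iwe_variance(events, velocity, width, height, t_ref=0.0):
    """Warp events to t_ref under a constant velocity and return the IWE variance."""
    vx, vy = velocity
    iwe = [[0.0] * width for _ in range(height)]
    for x, y, t in events:
        # Move each event back along the candidate motion to the reference time.
        xw = round(x - vx * (t - t_ref))
        yw = round(y - vy * (t - t_ref))
        if 0 <= xw < width and 0 <= yw < height:
            iwe[yw][xw] += 1.0
    pixels = [p for row in iwe for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def best_velocity(events, candidates, width, height):
    """Pick the candidate with the highest contrast (stand-in for iterative optimization)."""
    return max(candidates, key=lambda v: iwe_variance(events, v, width, height))


# Synthetic events from a single point moving at 2 px per time unit along x.
events = [(5 + 2 * t, 4, float(t)) for t in range(5)]
candidates = [(vx, 0.0) for vx in (0.0, 1.0, 2.0, 3.0)]
estimate = best_velocity(events, candidates, width=32, height=8)
```

With the correct velocity all five events collapse onto one pixel, which gives the sharpest (highest-variance) image; wrong velocities spread them over several pixels.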

2605.09575 2026-05-12 eess.IV cs.CV Version update

Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI

Mingxuan Liu, Yingqi Hao, Yi Liao, Juncheng Zhu, Haoxiang Li, Hongjia Yang, Yifei Chen, Yijin Li, Kasidit Anmahapong, Zihan Li, Jialan Zheng, Min Kang, Yan Song, Hua Lai, Xiaoling Zhou, Nan Sun, Rong Hu, Gang Ning, Haibo Qu, Qiyuan Tian

Affiliations * Department of Radiology, West China Second University Hospital, Sichuan University; School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University; Department of Radiology, Sichuan Provincial Woman’s and Children’s Hospital, The Affiliated Women’s and Children’s Hospital of Chengdu Medical College; Chengdu Women’s and Children’s Central Hospital, School of Medicine, University of Electronic Science and Technology of China; Department of Radiology, The Third Affiliated Hospital of Zhengzhou University; Qujing Maternal and Child Health Hospital, Qujing, China

AI summary This study proposes FreeHemoSeg, an annotation-free deep learning framework for automatically detecting and segmenting germinal matrix-intraventricular hemorrhage (GMH-IVH) in fetal brain MRI. The method trains on pseudo-lesion images synthesized with the help of medical priors, which sidesteps the difficulty of obtaining annotated data. FreeHemoSeg shows superior detection and segmentation performance in both internal and external validation and significantly improves radiologists' diagnostic efficiency and accuracy.

Details
English abstract

Background: Prenatal germinal matrix-intraventricular hemorrhage (GMH-IVH) is a leading cause of infant mortality and neurodevelopmental impairment. Manual diagnosis and lesion segmentation are labor-intensive and error-prone. Deep learning models offer potential for automation but typically require large annotated datasets, which are challenging to obtain. Purpose: To develop and validate an annotation-free deep learning framework for automated detection and segmentation of GMH-IVH on brain MRI. Materials and Methods: This retrospective study analyzed 2D T2-weighted MRI data from pregnant women collected from October 2015 to October 2023 at one hospital (internal validation) and two hospitals (external validation). Eligible participants included healthy fetuses and those with GMH-IVH. FreeHemoSeg was developed and trained using pseudo GMH-IVH images synthesized from normal fetal data guided by medical priors. Primary outcomes included diagnostic accuracy (area under the ROC curve [AUROC], sensitivity, specificity) and segmentation accuracy (Dice similarity coefficient [DSC]). A reader study evaluated clinical utility. Results: A total of 1674 stacks from 558 pregnant women were analyzed. FreeHemoSeg achieved the highest performance in both internal (sensitivity: 0.914, 95% CI 0.869-0.945; specificity: 0.966, 95% CI 0.946-0.978; DSC: 0.559, 95% CI 0.546-0.571) and external validation (sensitivity: 0.824, 95% CI 0.739-0.885; specificity: 0.943, 95% CI 0.913-0.964; DSC: 0.512, 95% CI 0.497-0.526), outperforming supervised and unsupervised methods. FreeHemoSeg assistance improved radiologists' sensitivity (from 0.882 to 0.941-1.000) and diagnostic confidence while reducing interpretation time by 16.0-52.7%. Conclusion: FreeHemoSeg accurately detects and localizes fetal brain hemorrhages without annotated training data, enabling earlier diagnosis and supporting timely clinical management.
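
For readers unfamiliar with the outcome measures named in this abstract, the following minimal sketch computes the Dice similarity coefficient for a pair of binary masks and case-level sensitivity and specificity from binary predictions. The flat-list mask representation, the function names, and the toy inputs are illustrative assumptions; the study's actual evaluation pipeline is not described at this level of detail.

```python
def dice(pred_mask, true_mask):
    """Dice similarity coefficient for two binary masks given as flat 0/1 lists."""
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(pred_mask) + sum(true_mask)
    # Convention assumed here: two empty masks count as a perfect match.
    return 2.0 * overlap / total if total else 1.0


def sensitivity_specificity(pred_labels, true_labels):
    """Case-level sensitivity and specificity (needs at least one positive and one negative case)."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(1 for p, t in pairs if p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    return tp / (tp + fn), tn / (tn + fp)


# Toy inputs, not study data.
toy_dice = dice([1, 1, 0, 0], [1, 0, 0, 0])
toy_sens, toy_spec = sensitivity_specificity([1, 0, 1, 0], [1, 1, 0, 0])
```

Dice rewards voxel-level overlap, which is why it can be moderate (around 0.5 here) even when case-level detection, measured by sensitivity and specificity, is high.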

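2605.09572 2026-05-12 cs.CV cs.AI cs.MM Version update

KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

Guanyi Du, Lintao Wang, Kun Hu, Ziyang Wang

Affiliations * School of Computer Science, The University of Sydney; School of Science, Edith Cowan University; School of Computer Science and Digital Technologies, Aston University

AI summary This study explores using Kolmogorov-Arnold Networks (KAN) to generate sign language pose animation from symbolic notation. It proposes KANMultiSign, a multi-scale sequence generator that converts HamNoSys notation into 2D human pose sequences. A coarse-to-fine generation strategy with multi-scale supervision first produces the overall body structure and then refines hand articulation, and KAN modules are integrated into a Transformer architecture to model the non-linear mapping from symbols to continuous poses more compactly. Experiments on several sign language corpora show better performance than existing methods with far fewer parameters, and identify multi-scale supervision as the key factor in improving notation-conditioned pose generation.

Comments Accepted at Neurocomputing

Details
English abstract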

Sign language production from symbolic notation offers a scalable route to accessible sign animation. We present KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. Our framework makes two complementary contributions. First, we introduce a coarse-to-fine generation strategy with multi-scale supervision: the model is first guided by an intermediate body-hand-face scaffold to encourage global structural coherence, and then refines fine-grained hand articulation to improve finger-level detail. Second, we investigate integrating Kolmogorov-Arnold Network modules into a Transformer backbone, using learnable univariate function primitives to model the highly non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization. Experiments on multiple public corpora spanning Polish, German, Greek, and French sign languages show consistent reductions in dynamic-time-warping-based joint error compared with a strong notation-to-pose baseline, while using substantially fewer parameters. Controlled ablations further indicate that KAN-based variants substantially reduce parameter count while maintaining competitive performance when coupled with multi-scale supervision, rather than serving as the main driver of accuracy gains. These findings position multi-scale supervision as the key mechanism for improving notation-conditioned pose generation, with KAN offering a compact alternative for efficient modeling. Our code will be publicly available.
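
The evaluation above relies on a joint error based on dynamic time warping (DTW). As a hedged illustration of the general technique only, the sketch below computes the classic DTW alignment cost between two sequences of a single scalar joint coordinate. How the paper aggregates over joints, normalizes the cost, or handles 2D coordinates is not stated in the abstract, so the scalar sequences and the absolute-difference local cost are assumptions.

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost with absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or a diagonal match.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same motion at two speeds aligns with zero cost.
same_motion = dtw_distance([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
# A constant offset of 1 over two frames costs 2.
offset_motion = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

DTW is a natural fit for generated pose sequences because a prediction that is correct in shape but slightly early or late is not penalized as heavily as it would be under frame-by-frame error.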

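2605.09566 2026-05-12 cs.CV Version update

Dual-Path Hyperprior Informed Deep Unfolding Network for Image Compressive Sensing

Tianyi Lu, Wenxue Cui, Shaohui Liu

Affiliations * Harbin Institute of Technology

AI summary This paper proposes a Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) for image compressive sensing reconstruction. The method splits the measurements into two subsets and uses hyperprior information to guide reconstruction, improving quality across regions with different textures. Its core contributions are lightweight neural modules that generate multi-domain hyperprior knowledge, and adaptive step sizes and attention mechanisms generated dynamically during reconstruction to improve accuracy and robustness. Experiments show that the method outperforms existing compressive sensing methods on multiple benchmark datasets.

Details
English abstract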

Recent Deep Unfolding Networks (DUNs) have significantly advanced Compressive Sensing (CS) by integrating iterative optimization with deep networks. However, existing DUNs still suffer from two challenges: 1) Reliance on a single measurement stream, which limits effective information interaction across distinct measurement subsets. 2) Uniform processing of all image regions, which overlooks varying reconstruction difficulties induced by diverse textures. To address these limitations, a novel Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) is proposed, which partitions measurements into double subsets to enable hyperprior-guided reconstruction via a dual-path architecture. In the Deep Hyperprior Learning branch, a series of lightweight neural modules are designed to efficiently generate hyperprior knowledge of different domains, enabling collaborative guidance for the CS reconstruction. In the Hyperprior Informed Reconstruction branch, a deep unfolding framework with hyperprior guidance is constructed to iteratively refine reconstruction. Specifically, i) in the gradient descent step, a Hyperprior Informed Step Size Generation network is designed to dynamically generate spatially varying step maps, enabling adaptive fine-grained gradient updates. ii) In the proximal mapping step, two well-designed hyperprior informed attention mechanisms are introduced to dynamically focus on challenging regions via gradient-based hard and soft attentions, facilitating CS reconstruction accuracy. Extensive experiments demonstrate that the proposed DPH-DUN outperforms existing CS methods.
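
To make the two-step structure of a deep unfolding iteration concrete, here is a minimal sketch of one stage: a gradient descent step on the data-fidelity term with an element-wise step map, followed by a proximal mapping. This is plain ISTA with soft thresholding swapped in for the proximal step; in the paper the step map comes from a Hyperprior Informed Step Size Generation network and the proximal step uses learned attention modules, neither of which is reproduced here. The helper names and the tiny identity sensing matrix are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def soft_threshold(values, tau):
    """Proximal operator of the L1 norm (stand-in for a learned proximal network)."""
    out = []
    for v in values:
        magnitude = max(abs(v) - tau, 0.0)
        out.append(magnitude if v >= 0 else -magnitude)
    return out


def unfolded_step(x, y, sensing, step_map, tau):
    """One stage: x <- prox(x - step_map * A^T (A x - y))."""
    residual = [r - yi for r, yi in zip(matvec(sensing, x), y)]
    grad = matvec(transpose(sensing), residual)
    # Element-wise step sizes play the role of a spatially varying step map.
    z = [xi - s * g for xi, s, g in zip(x, step_map, grad)]
    return soft_threshold(z, tau)


identity = [[1.0, 0.0], [0.0, 1.0]]
x_next = unfolded_step([0.0, 0.0], [1.0, 0.0], identity, step_map=[1.0, 0.5], tau=0.1)
```

A deep unfolding network stacks a fixed number of such stages and learns the quantities that are hand-set here (step sizes, threshold, proximal operator) end to end.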

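2605.09554 2026-05-12 cs.CL cs.CV Version update

Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs

Kuanwei Chen, Mengfeng Tsai

Affiliations * Computer Science and Information Engineering, National Central University, Zhongli, Taiwan

AI summary This paper studies the trade-off between frame rate and model size in sign language translation (SLT), aiming for a more compact and efficient system. The authors propose a lightweight 77M-parameter pipeline that combines MMPose skeletal pose extraction with a single linear projection into T5-small; lowering the input frame rate substantially reduces computational cost while largely preserving translation quality. At 12 fps the BLEU-4 score drops only slightly relative to 24 fps, and the model is about one third the size of prior T5-base systems, showing that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.

Comments 2 pages, 1 figure, 2 tables

Details
English abstract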

Sign Language Translation (SLT) converts sign language videos into spoken-language text, bridging communication between Deaf and hearing communities. Current gloss-free approaches rely on large encoder-decoder models, limiting deployment. We propose a compact 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small. By varying the input frame rate, we expose a practical efficiency trade-off: at 12 fps the model halves its sequence length, achieving a 75% reduction in encoder quadratic self-attention computational complexity while incurring only a modest BLEU-4 drop (9.53 vs. 10.06 at 24 fps on How2Sign). Our system is roughly 3x smaller than prior T5-base systems, demonstrating that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.
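
The 75% figure follows from the quadratic dependence of self-attention cost on sequence length: halving the number of frames leaves one quarter of the pairwise interactions. The short sketch below spells out that arithmetic and the relative BLEU-4 drop using the numbers quoted in the abstract. It assumes that sequence length scales linearly with frame rate and ignores the non-quadratic parts of the encoder; the function name is illustrative.

```python
def attention_cost_ratio(fps_low, fps_high):
    """Ratio of quadratic self-attention cost when sequence length scales with frame rate."""
    return (fps_low / fps_high) ** 2


cost_ratio = attention_cost_ratio(12, 24)      # one quarter of the original cost
cost_reduction = 1.0 - cost_ratio              # fraction of attention cost removed
bleu_relative_drop = (10.06 - 9.53) / 10.06    # relative BLEU-4 loss at 12 fps
```

Under these assumptions the attention cost falls by three quarters while BLEU-4 falls by roughly 5% in relative terms, which is the trade-off the abstract highlights.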

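2605.09538 2026-05-12 cs.CV cs.AI cs.RO Version update

PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions

Jihyun Lee, Changmin Lee, Donghwan Kim, Tae-Kyun Kim

Affiliations * School of Computing, KAIST, Daejeon, South Korea

AI summary PhysHanDI is a physics-based framework for jointly reconstructing the 3D interaction between hands and non-rigid objects such as cloth and stuffed animals. It drives object deformation by simulating the forces induced by densely reconstructed hand motion, so that the reconstructed object dynamics are both physically plausible and consistent with the hand movements. The simulated deformation can in turn refine hand reconstruction through inverse physics, and experiments show that PhysHanDI outperforms the state-of-the-art baseline on both reconstruction and future prediction.

Comments Accepted to ICML 2026

Details
English abstract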

While existing methods for reconstructing hand-object interactions have made impressive progress, they either focus on rigid or part-wise rigid objects, which limits their ability to model real-world objects (e.g., cloth, stuffed animals) that exhibit highly non-rigid deformations, or model deformable objects without full 3D hand reconstruction. To bridge this gap, we present PhysHanDI (Physics-based Reconstruction of Hand and Deformable Object Interactions), a framework that enables full 3D reconstruction of both interacting hands and non-rigid objects. Our key idea is to physically simulate object deformations driven by forces induced from densely reconstructed 3D hand motions, ensuring that the reconstructed object dynamics are both physically plausible and coherent with the interacting hand movements. Furthermore, we demonstrate that such simulation of object deformations can, in turn, refine and improve hand reconstruction via inverse physics. In experiments, PhysHanDI outperforms the state-of-the-art baseline across reconstruction and future prediction.

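2605.09513 2026-05-12 cs.CV cs.RO Version update

QueST: Persistent Queries as Semantic Monitors for Drift Suppression in Long-Horizon Tracking

Mayank Anand, Mohammad Saqlain, Kyan Mahajan, Priya Shukla, Gora Chand Nandi, Andrew Melnik

Affiliations * Center for Intelligent Robotics; Indian Institute of Information Technology Allahabad; University of Bremen

AI summary This paper proposes QueST, a semantic monitoring framework for long-horizon tracking that targets the semantic drift caused by error accumulation in conventional frame-to-frame matching under challenging conditions. QueST treats interaction-relevant entities as persistent semantic queries rather than transient point tracks, and each query attends globally over spatio-temporal video features at every time step, providing a stable semantic anchor. With lightweight 3D physical constraints, QueST suppresses drift under occlusion, and experiments show clearly better tracking accuracy than existing methods on long-horizon articulated sequences.

Details
English abstract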

Tracking points in videos is typically formulated as frame-to-frame correspondence, where each point is matched locally to the next frame. While this works over short horizons, errors accumulate under articulation, occlusion, and viewpoint change, leading to silent semantic drift that existing trackers cannot detect or correct. In this work, we revisit long-horizon tracking from a monitoring perspective and introduce QueST, a monitoring-by-design framework that treats interaction-relevant entities as persistent semantic queries rather than transient point tracks. Instead of local propagation, each query attends globally over spatio-temporal video features at every time step, providing a stable semantic anchor across time. We further constrain query trajectories with lightweight 3D physical grounding, using geometric plausibility to suppress unbounded drift under occlusion. We evaluate QueST on long-horizon articulated sequences from PartNet-Mobility in SAPIEN and compare against RAFT-3D, CoTracker, and TAP-Net. QueST substantially reduces terminal drift, achieving a 67.7% Absolute Point Error (APE) improvement over TAP-Net, while better preserving identity over extended horizons. Our results show that embedding semantic monitoring directly into perception enables more reliable long-horizon tracking under distribution shift.

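2605.09507 2026-05-12 cs.CV Version update

Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization

Omer Tariq, Syed Muhammad Raza, Jeongbae Son

Affiliations * Perception AI Neubility Inc.

AI summary This paper proposes VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization, targeting the challenges posed by subjective annotations and discrete decoding. The method predicts probabilistic frame-level importance scores with a variational formulation, explicitly models the uncertainty that arises from multi-annotator supervision, and adds a decoder-aligned regularization that stabilizes summary selection. Experiments on multiple datasets show improved robustness with efficient inference, positioning the method as an alternative to conventional deterministic and diffusion-based approaches.

Comments Accepted for presentation at the 2026 International Joint Conference on Neural Networks (IJCNN 2026)

Details
English abstract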

Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from multi-annotator supervision. To account for subjectivity, particularly under binary annotations, we employ a supervision strategy that encourages alignment with plausible human annotation modes rather than enforcing a single consensus target. Furthermore, we introduce a decoder-aligned regularization that promotes stability of knapsack-based summary selection, reducing sensitivity to small perturbations in predicted scores. We evaluate the proposed framework on the SumMe and TVSum benchmarks using standard rank-based metrics. Experimental results show consistent and competitive Kendall and Spearman correlations across multiple data splits, demonstrating improved robustness under annotation disagreement while maintaining efficient single-forward inference. These results indicate that explicitly modeling uncertainty and aligning learning objectives with the decoding stage provide a principled alternative to both deterministic and diffusion-based video summarization methods.
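
The knapsack-based selection mentioned in this abstract can be sketched as a standard 0/1 knapsack over temporal segments: each segment has an importance score and a length, and the summary must fit a length budget. The dynamic program below is the textbook formulation, not VASTSum's implementation; the segment scores, lengths, and budget are made-up toy values, and the function name is an assumption.

```python
def knapsack_select(scores, lengths, budget):
    """0/1 knapsack: pick segments that maximize total score within a length budget."""
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # Option 1: skip segment i-1.
            if lengths[i - 1] <= b:
                candidate = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if candidate > best[i][b]:
                    best[i][b] = candidate  # Option 2: take segment i-1.
    # Backtrack to recover which segments were taken.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)


scores = [0.9, 0.2, 0.6, 0.5]   # toy segment importance
lengths = [4, 2, 3, 3]          # toy segment lengths in frames
selected = knapsack_select(scores, lengths, budget=7)
```

Because the selection is a discrete argmax, a small change in predicted scores can flip which segments are chosen, which is the sensitivity the decoder-aligned regularization is meant to reduce.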

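2605.09479 2026-05-12 eess.IV cs.CV cs.MM Version update

ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality

Feng Ding, Haisheng Fu, Jie Liang, Qihan Xu, Siyu Zhu, Jingning Han

Affiliations * Simon Fraser University; University of British Columbia; Eastern Institute of Technology; Xi’an Jiaotong University; Google Inc

AI summary This paper studies full-reference image quality assessment from a machine-centric perspective, evaluating how well images preserve information for multiple downstream models. It proposes ML-CLIPSim, a differentiable quality metric built on a CLIP visual encoder that approximates machine-perceived quality by aggregating intermediate feature similarities and global image embeddings. Experiments show that the metric aligns better with machine preferences than conventional fidelity and perceptual metrics, remains competitive for human quality prediction, and improves image compression when used as a distortion term.

Details
English abstract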

We study full-reference image quality assessment from a machine-centric perspective, where images are evaluated by how well they preserve information for downstream models. We formulate machine-oriented quality as a latent machine utility and approximate it through pairwise predictive-consistency comparisons. To this end, we construct PCMP, a dataset of PSNR-matched distortion pairs labeled by consistency votes from multiple pretrained models. We further propose ML-CLIPSim, a differentiable quality metric built on a frozen CLIP visual encoder, which aggregates intermediate patch-token similarities and global image embeddings. Experiments on machine-preference benchmarks, human-IQA datasets, and learned image compression show that ML-CLIPSim better aligns with machine-oriented preferences than conventional fidelity and perceptual metrics, while remaining competitive for human quality prediction. Used as a compression distortion term, it improves rate-task trade-offs across multiple downstream tasks.

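2605.09477 2026-05-12 cs.CV cs.AI Version update

Outlier-Robust Diffusion Solvers for Inverse Problems

Yang Zheng, Jiahua Liu, Tongyao Pang, Wen Li, Zhaoqiang Liu

Affiliations * School of Computer Science and Engineering, University of Electronic Science and Technology of China; Yau Mathematical Sciences Center, Tsinghua University

AI summary This paper studies how to solve inverse problems with diffusion models when the measurements contain outliers. To improve robustness, the authors first refine the measurements through explicit noise estimation and build an iteratively reweighted least squares objective based on the Huber loss. They then propose a gradient-descent-based solver and combine it with the conjugate gradient method to avoid learning-rate tuning. Experiments on multiple image datasets show strong robustness to outliers and better results than existing diffusion-based methods.

Comments Accepted by CVPR 2026

Details
English abstract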

Methods based on diffusion models (DMs) for solving inverse problems (IPs) have recently achieved remarkable performance. However, DM-based methods typically struggle against outliers, which are common in real-world measurements. In this work, to tackle IPs with outliers, we first refine the measurement via explicit noise estimation to mitigate the effect of noise. Subsequently, we formulate an iteratively reweighted least squares objective based on the Huber loss to address the outliers. We propose a method utilizing gradient descent to approximately solve the corresponding optimization problem for the robust objective. To avoid delicate tuning of the learning rate required by the gradient descent method, we further employ the conjugate gradient method with an efficient strategy for updating. Extensive experiments on multiple image datasets for linear and nonlinear tasks under various conditions demonstrate that our proposed methods exhibit robustness to outliers and outperform recent DM-based methods in most cases.
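
To show how a Huber-based iteratively reweighted least squares (IRLS) objective down-weights outliers, the toy sketch below estimates a single scalar from noisy measurements that include one gross outlier. Each iteration assigns Huber weights to the residuals and then solves the resulting weighted least squares problem, which for a scalar is just a weighted mean. This deliberately omits the diffusion prior, the forward operator, and the gradient descent and conjugate gradient solvers described in the abstract; the function names, the threshold `delta`, and the data are illustrative assumptions.

```python
def huber_weight(residual, delta):
    """IRLS weight for the Huber loss: 1 inside the threshold, delta/|r| outside."""
    magnitude = abs(residual)
    return 1.0 if magnitude <= delta else delta / magnitude


def irls_location(measurements, delta=1.0, iterations=50):
    """Robust scalar estimate via iteratively reweighted least squares."""
    estimate = sum(measurements) / len(measurements)  # start from the plain mean
    for _ in range(iterations):
        weights = [huber_weight(y - estimate, delta) for y in measurements]
        # Weighted least squares for a scalar reduces to a weighted mean.
        estimate = sum(w * y for w, y in zip(weights, measurements)) / sum(weights)
    return estimate


measurements = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0]  # the last value is an outlier
plain_mean = sum(measurements) / len(measurements)
robust_estimate = irls_location(measurements)
```

The plain mean is dragged above 9 by the outlier, while the Huber estimate settles near 1.2, since the outlier's influence is capped at `delta`.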

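2605.09460 2026-05-12 cs.CV cs.AI Version update

When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation

Dongqi Zheng

Affiliations * FLUX Diffusion Transformer; InfuseNet

AI summary This paper studies how to accelerate identity-preserved image generation by cutting the number of sampling steps. The authors propose a training-free approach that swaps in a distilled diffusion backbone under a frozen identity adapter and disables classifier-free guidance, which markedly improves generation efficiency while maintaining high identity similarity. Experiments show that identity fidelity is largely established in the early sampling steps, while later steps mainly refine detail, giving an efficient and practical optimization strategy for identity-preserved generation.

Details
English abstract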

Identity-preserved image generation is typically built on many-step diffusion backbones, making personalized generation expensive at deployment time. We show that this cost is often unnecessary for identity-conditioned FLUX generation. A frozen InfuseNet identity adapter trained with the dev backbone transfers directly to the distilled schnell backbone without retraining. This two-line replacement (changing the backbone path and disabling classifier-free guidance) reduces latency by 5.9x while improving ArcFace identity similarity by 0.028 and lowering LPIPS by 0.016 relative to the standard 28-step dev baseline. To explain why this works, we analyze the denoising trajectory and find that identity fidelity enters an early effective regime, often within 4-8 steps, while later steps primarily refine visual detail, sharpness, and contrast. Adapter ablations confirm that identity formation depends on the identity adapter, while attention-stream norm probes suggest that the relative conditioning contribution decreases as sampling proceeds. Preliminary style-adapter and object-adapter sweeps on SDXL and SD1.5 show similar diminishing returns after intermediate steps. These results position distilled backbone replacement as a simple, training-free strategy for improving the efficiency-fidelity tradeoff of identity-preserved generation.

2605.09455 2026-05-12 cs.CV Version update

Adaptive 3D Convolution for Remote Sensing Image Fusion

Siran Peng, Xiangyu Zhu, Shang-Qi Deng, Liang-Jian Deng, Zhen Lei

Affiliations * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Mathematical Sciences/Multi-Hazard Early Warning Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences

AI summary This paper studies remote sensing image fusion, which aims to produce a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with rich spectral data. To address the shortcomings of existing methods in spectral preservation and computational efficiency, the authors propose Adaptive 3D Convolution (Ada3D), which generates a unique set of 3D kernels for each input voxel by combining spatial and spectral information, and uses group convolution to reduce computational complexity. Experiments show state-of-the-art performance on five datasets.

Comments Accepted by IEEE Transactions on Image Processing (TIP), Early Access, 2026

Details
English abstract

Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves SOTA performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D.

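2605.09449 2026-05-12 cs.CV Version update

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

Bo Gu, Zhikang Zhang, Zizhuang Wei, Zhenyuan Chen, Lingyun Li, Zhuoyi Song

Affiliations * Fudan University; Huawei; Shenzhen Loop Area Institute

AI summary Current multimodal large language models (MLLMs) have made notable progress in visual understanding and language reasoning, but they lack a persistent, world-centered spatial representation of 3D environments. This work proposes SpaceMind++, a video MLLM architecture that builds a voxelized cognitive map from RGB video to preserve object permanence and spatial topology. The model introduces Coordinate-Guided Deep Iterative Fusion, which relays map-level spatial knowledge back into the original 2D visual features and so strengthens spatial reasoning without disrupting the native visual-token interface. Experiments show strong results on multiple benchmarks, with better generalization to unseen 3D environments.

Comments 14 pages, 3 figures

Details
English abstract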

Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.
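
To make the idea of a voxelized map concrete, here is a minimal sketch of how 3D points already expressed in a shared world frame could be bucketed into voxels. The function name `voxelize`, the 0.5-unit voxel size, and the dictionary layout are illustrative assumptions; the abstract does not describe how SpaceMind++ actually builds or stores its map.

```python
import math

def voxelize(points, voxel_size):
    """Group world-frame (x, y, z) points by integer voxel index (hypothetical layout)."""
    grid = {}
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        grid.setdefault(key, []).append((x, y, z))
    return grid

# Two nearby points fall in the same voxel; the third lands elsewhere.
grid = voxelize([(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (1.2, 0.0, -0.1)], voxel_size=0.5)
```

Because the keys are world coordinates rather than camera coordinates, observations of the same place from different viewpoints would end up in the same cell, which is the property an allocentric map needs.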

2605.09443 2026-05-12 cs.CV cs.CL Updated version

Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

Yihong Tang, Kehai Chen, Xuefeng Bai, Min Zhang

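Affiliations * Harbin Institute of Technology; Shenzhen Loop Area Institute (SLAI)

AI summary As multimodal large language models advance, role-playing agents (RPAs) are moving into visually grounded environments, but the generic visual features that existing models extract tend to mask character traits, causing Modality-Role Interference (MRI). The work proposes CAVI, a training-free character-aware visual intervention framework that mitigates MRI through character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering, significantly improving character-consistent multimodal interaction.

Details
English abstract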

The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.
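
The projection step attributed to OFM can be illustrated with ordinary linear algebra. The sketch below projects a token vector onto the span of an orthonormal basis; `project_onto_subspace` and the toy basis are hypothetical, and the abstract does not say how CAVI constructs its character-context subspace.

```python
def project_onto_subspace(token, basis):
    """Project `token` onto span(basis); basis vectors are assumed orthonormal."""
    out = [0.0] * len(token)
    for b in basis:
        coeff = sum(t * bi for t, bi in zip(token, b))   # component along b
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

# Toy 3-D token and a 2-D "character-context" subspace (assumed for illustration).
aligned = project_onto_subspace([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

The component outside the subspace (here the third coordinate) is discarded, which is one way to read "extract aligned facts".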

2605.09442 2026-05-12 cs.CV cs.AI Updated version

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

Shanwen Tan, Hao Li, Jingtao Zhang, Xiaosong Jia, Xue Yang, Shaofeng Zhang, Yanyong Zhang

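Affiliations * University of Science and Technology of China; Fudan University; Georgia Institute of Technology; Shanghai Jiao Tong University

AI summary SWIFT is an efficient framework for multi-prompt long-video generation that addresses the tension between semantic coherence and computational efficiency under continuous semantic switching. It introduces a lightweight Semantic Injection Cache and an Adaptive Dynamic Window, enabling efficient semantic switching without rebuilding cached content, and uses head-wise semantic injection and segment-level semantic anchors to maintain temporal consistency. Experiments show that SWIFT reaches 22.6 FPS on a single H100 GPU, significantly improving long-video generation efficiency.

Comments Code is available at https://github.com/ShanwenTan/SWIFT

Details
English abstract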

Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.
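
The Adaptive Dynamic Window idea (more context near prompt switches, less during stable segments) can be sketched as a simple scheduling rule. All names and numbers below (`window_size`, the 24/8-frame windows, the 16-frame radius) are assumptions for illustration and are not taken from the SWIFT code.

```python
def window_size(frame, switch_frames, large=24, small=8, radius=16):
    """Return a larger attention window when `frame` is close to a prompt switch."""
    near_switch = any(abs(frame - s) <= radius for s in switch_frames)
    return large if near_switch else small

def average_window(num_frames, switch_frames, **kwargs):
    """Mean window size over a clip, a rough proxy for average attention cost."""
    sizes = [window_size(f, switch_frames, **kwargs) for f in range(num_frames)]
    return sum(sizes) / num_frames

# With one switch at frame 0 and a radius of 1, only frames 0 and 1 use the large window.
avg = average_window(10, [0], large=4, small=2, radius=1)
```

The average stays close to the small window whenever switches are sparse, which is the sense in which such a schedule would reduce average inference cost.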

2605.09433 2026-05-12 cs.CV Updated version

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

Yunhong Lu, Qichao Wang, Hengyuan Cao, Xiaoyin Xu, Min Zhang

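Affiliations * Zhejiang University; Shanghai Institute for Advanced Study-Zhejiang University; Shanghai Institute for Mathematics and Interdisciplinary Sciences

AI summary Existing preference datasets for text-to-image models usually store only the final winner or loser images, which is insufficient for rectified flow (RF) models, whose generation depends on a specific prior noise sample and follows a nearly straight denoising trajectory. The paper proposes Prior Noise-Aware Preference Optimization (PNAPO), an offline preference optimization framework for RF models that retains the paired prior noises used to generate the winner/loser images, extends the standard triplet to a sextuple, and exploits the straight-line property of RF for noise-image interpolation, giving more accurate trajectory estimates and a tighter optimization objective. Experiments show that PNAPO significantly improves preference metrics on mainstream RF text-to-image models while reducing training compute.

Comments Accepted by ICML 2026

Details
English abstract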

Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise-image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones show that PNAPO consistently improves preference metrics while substantially reducing training compute.
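
Noise-image interpolation along a straight path is easy to write down. The sketch below assumes the common convention that t = 0 is the clean image and t = 1 is the prior noise; PNAPO may use the opposite convention, and the function names are hypothetical.

```python
def interpolate_state(image, noise, t):
    """Linear noise-image interpolation: t = 0 gives the image, t = 1 gives the prior noise."""
    return [(1.0 - t) * x + t * e for x, e in zip(image, noise)]

def straight_line_velocity(image, noise):
    """Constant velocity of the straight path from image to noise under the convention above."""
    return [e - x for x, e in zip(image, noise)]

# Intermediate state for a toy two-pixel "image" paired with the noise that generated it.
x_t = interpolate_state([1.0, 2.0], [0.0, 0.0], 0.25)
```

The point of storing the paired noise is that `noise` here is the sample that actually produced the image, rather than an independent draw from a forward noising process.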

2605.09429 2026-05-12 cs.CV cs.AI Updated version

Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

Jie Ma, Yihang Liu, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun

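Affiliations * Xiamen University

AI summary The study asks whether low-attention visual tokens in vision-language models are truly redundant, and argues that pruning by shallow attention scores can harm reasoning over complex scenes, leading to "Visual Aphasia". The authors propose COAST, a training-free framework for contrastive adaptive semantic token pruning that uses cross-modal attention to identify key tokens and balance semantic evidence against spatial context. Experiments show that COAST substantially reduces the number of visual tokens and speeds up inference on multiple benchmarks while retaining high model performance, and that it applies broadly across models and compression settings.

Details
English abstract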

Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the retention trade-off between semantic evidence and spatial context. It further uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context. Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. Beyond a single backbone or compression setting, COAST consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, showing that adaptive semantic routing is a robust alternative to one-shot scalar pruning.
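
Attention entropy as a dispersion measure can be sketched in a few lines. The entropy computation is standard; the budget-splitting rule `split_budget` and its `max_context_share` parameter are purely hypothetical stand-ins for whatever trade-off COAST actually uses.

```python
import math

def normalized_entropy(attn):
    """Shannon entropy of an attention vector, scaled to [0, 1]."""
    total = sum(attn)
    probs = [a / total for a in attn if a > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(attn)) if len(attn) > 1 else 0.0

def split_budget(budget, attn, max_context_share=0.5):
    """Hypothetical rule: spend more of the token budget on spatial context when attention is dispersed."""
    context = round(budget * max_context_share * normalized_entropy(attn))
    return budget - context, context   # (anchor-aligned tokens, context tokens)

focused = split_budget(100, [1.0, 0.0, 0.0, 0.0])      # attention concentrated on one token
dispersed = split_budget(100, [0.25, 0.25, 0.25, 0.25])  # attention spread evenly
```

A sharply peaked attention map keeps the whole budget for anchor-aligned tokens, while a flat one diverts part of it to context, which mirrors the adaptive retention trade-off described in the abstract.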

2605.09425 2026-05-12 cs.CV cs.AI Updated version

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

Shogo Noguchi

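Affiliations * Gunma University

AI summary The paper studies how condition conflicts in multi-condition diffusion models affect the structural fidelity of generated images, and proposes an attention-based conflict suppression method that improves high-level structural consistency. Using semantic segmentation, depth maps, and edges as joint conditions, the model generates high-quality images that preserve scene details, for data augmentation in autonomous-driving tasks. Besides addressing conflicts in multi-condition generation, the work builds a generation framework and evaluation protocol for driving tasks, supporting efforts to ease data scarcity in high-level autonomous driving.

Comments 44 pages, 20 figures. Code and project page available at: https://github.com/ShogoNoguchi/AtteConDA

Details
English abstract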

Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high-level driving tasks such as traffic-rule extraction and driving-behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high-level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi-condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi-condition generation and provides an important step toward mitigating data scarcity in high-level autonomous-driving tasks.

2605.09422 2026-05-12 cs.CL cs.CV Updated version

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

Jiafeng Liang, Zhihao Zhu, Zihan Zhang, Baoqi Ren, Shixin Jiang, Runxuan Liu, Tao Ren, Ming Liu, See-Kiong Ng, Bing Qin

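Affiliations * Harbin Institute of Technology; Pengcheng Laboratory; National University of Singapore; Peking University; Harvard University

AI summary Although large multimodal models (LMMs) perform well on video understanding, they tend to rely on textual priors during causal discovery, a deficit that is not yet well understood. The paper proposes ProCauEval, a perturbation-based evaluation protocol that systematically controls the visual and textual inputs to expose failure modes in causal reasoning. It finds that mainstream LMMs perceive video content accurately but underuse it in causal reasoning, and that stronger post-training increases reliance on textual priors. The authors therefore propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework that pushes the model to rely on visual evidence rather than textual shortcuts; experiments show that it improves visual engagement while preserving basic comprehension.

Comments 17 pages, 5 figures

Details
English abstract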

Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these issues, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
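
The quantity ADPO is said to maximize, a divergence between the policy's output distributions with and without visual corruption, can be illustrated with a KL divergence over answer options. The abstract does not name the divergence, so KL is an assumption here, as are the toy distributions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer options."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

corrupted = [0.6, 0.3, 0.1]       # hypothetical answer distribution when the video is corrupted
text_driven = [0.7, 0.2, 0.1]     # barely changes when the real video is shown
vision_driven = [0.9, 0.05, 0.05] # changes substantially when the real video is shown

gap_text = kl_divergence(text_driven, corrupted)
gap_vision = kl_divergence(vision_driven, corrupted)
```

A model leaning on textual priors answers almost the same way with or without the video, so its divergence stays small; rewarding a larger gap is one way to push the policy toward visual evidence.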

2605.09418 2026-05-12 cs.CV cs.RO Updated version

MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition

Zhengyi Xu, Yuhang Ming, Zhihao Zhan, Hanyu Zhu, Javier Civera, Wanzeng Kong

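Affiliations * Hangzhou Dianzi University; TopXGun Robotics; University of Zaragoza

AI summary Cross-view place recognition faces many challenges in computer vision and robotics, in particular the large viewpoint, modality, and structural gaps between ground observations and aerial references. The paper proposes the MAG-VLAQ framework, which fuses multi-modal features extracted by pre-trained foundation models and aligns ground and aerial images in a shared embedding space. Its core contribution is ODE-conditioned VLAQ, which dynamically adapts the query centers to multi-modal information, keeping global retrieval prototypes while improving scene-specific matching. Experiments show that the method significantly outperforms existing approaches on KITTI360-AG, reaching 61.1 Recall@1.

Comments 16 pages, 4 figures, 3 tables

Details
English abstract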

Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated queries (VLAQ). In this design, the VLAQ query centers are dynamically adapted according to the fused multi-modal state. This mechanism allows the final global descriptor to preserve globally learned retrieval prototypes while remaining responsive to scene-specific visual and geometric evidence, significantly improving aerial-ground matching. Extensive experiments on KITTI360-AG and nuScenes-AG validate the effectiveness of our proposed MAG-VLAQ. Notably, on KITTI360-AG, our MAG-VLAQ nearly doubles the state-of-the-art performance, achieving 61.1 Recall@1 in the satellite setting, compared with 34.5 from the closest competing approach.

2605.09417 2026-05-12 cs.CV Updated version

SAMOFT: Robust Multi-Object Tracking via Region and Flow

Yanchao Wang, Dawei Zhang, Chengzhuan Yang, Wei Liu, Minglu Li, Hua Wang, Zhonglong Zheng, Ming-Hsuan Yang

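Affiliations * School of Computer Science and Technology, Zhejiang Normal University; Institute for Sustainable Industries and Liveable Cities, College of Engineering and Science, Victoria University; School of Electrical Engineering and Computer Science, University of California at Merced

AI summary The paper proposes SAMOFT, a robust multi-object tracking method for complex motion scenarios involving object deformation, nonlinear motion, and occlusion. It introduces a Pixel Motion Matching (PMM) module that combines the Segment Anything Model (SAM) with dense optical flow to improve Kalman-filter-based motion prediction, and designs a Centroid Distance Matching (CDM) module and a Distribution-Based Correction (DBC) module, which respectively improve robustness to low-confidence detections and dynamically correct trajectory states online. Experiments show that SAMOFT consistently improves baseline trackers and is competitive with recent state-of-the-art methods on several benchmarks.

Details
English abstract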

Multi-object tracking (MOT) is a fundamental task in computer vision that requires continuously tracking multiple targets while maintaining consistent identities across frames. However, most existing approaches primarily rely on instance-level object features for trajectory association, which often leads to degraded performance under challenging conditions such as object deformation, nonlinear motion, and occlusion. In this work, we propose SAMOFT, a robust tracker that leverages pixel-level cues to improve robustness under complex motion scenarios. Specifically, we introduce a Pixel Motion Matching (PMM) module that integrates the Segment Anything Model (SAM) with dense optical flow to refine Kalman filter-based motion prediction using instantaneous foreground pixel motion. To further enhance robustness under unreliable detections, we design a Centroid Distance Matching (CDM) module that performs flexible mask-based centroid matching for low-confidence or partially occluded observations. Moreover, a Distribution-Based Correction (DBC) module models long-tailed motion patterns in a training-free manner using historical optical flow statistics and dynamically corrects trajectory states online. We also incorporate a Cluster-Aware ReID (CA-ReID) strategy to improve the stability and discriminative power of trajectory appearance features. Extensive experiments on the DanceTrack and MOTChallenge benchmarks demonstrate that SAMOFT consistently improves baseline trackers and achieves competitive performance compared with recent state-of-the-art methods, validating the effectiveness of leveraging pixel-level cues for robust multi-object tracking.
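
Mask-based centroid matching rests on two small computations: the centroid of a binary mask and the distance between centroids. The sketch below shows both on nested lists; the function names are hypothetical and the association logic that SAMOFT builds on top of them is not reproduced.

```python
import math

def mask_centroid(mask):
    """Centroid (row, col) of the nonzero pixels of a binary mask given as nested lists."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return None   # empty mask: no centroid to match
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)

def centroid_distance(a, b):
    """Euclidean distance between two (row, col) centroids."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

centroid = mask_centroid([[0, 1, 1],
                          [0, 1, 1]])
```

Unlike box IoU, a centroid distance remains defined when a mask is partly occluded or a detection has low confidence, which is presumably why it suits the unreliable-detection case.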

2605.09407 2026-05-12 cs.CV Updated version

AnyDepth-DETR/-YOLO: Any-depth object detection with a single network

Woochul Kang, Hyungseop Lee, Jiho Lee

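Affiliations * Incheon Nat’l Univ.

AI summary The paper proposes AnyDepth-DETR/-YOLO, an any-depth object detection framework in which a single network offers a continuous accuracy-efficiency trade-off by controlling depth at inference time, without retraining. The backbone and neck stages are each split into an essential path that always executes and a skippable refinement path, preserving the multi-scale feature hierarchy at every depth configuration. Self-distillation between the deepest and shallowest networks, combined with prediction-level and feature-level alignment losses, keeps the outputs of each stage compatible. Experiments on RT-DETR and YOLOv12 show performance on par with or better than the best existing models, and the most efficient configurations run up to 1.82x faster at a cost of only 2.0 AP.

Comments 16 pages, 5 figures, 9 tables

Details
English abstract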

Modern object detectors are static, fixed-depth networks optimized for a single operating point, requiring separate models for different deployment scenarios. We present an any-depth detection framework that enables a single network to span a continuous range of accuracy-efficiency trade-offs by controlling depth at inference time without retraining. Each backbone and neck stage is divided into an essential path, which always executes, and a skippable refinement path; this decomposition preserves the full multi-scale feature hierarchy at every depth configuration, unlike conventional early exiting that discards entire stages. To train such a network, jointly optimizing many sub-networks of varying depth introduces conflicting gradient signals. We address this via self-distillation between only the two extremes, with prediction-level and feature-level alignment losses that enforce stage-wise modularity, ensuring the outputs of each stage remain compatible regardless of the paths taken. Instantiated on RT-DETR and YOLOv12, our full-depth configurations match or surpass their respective SOTA baselines with negligible parameter overhead, while the most efficient configurations achieve up to 1.82x speedup at a cost of only 2.0 AP, all from a single set of weights.
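
The essential/skippable split can be sketched as a per-stage flag. The residual form `y + refinement(y)` and the scalar toy stages are assumptions made for illustration; the abstract only states that the essential path always runs and the refinement path may be skipped.

```python
def run_stage(x, essential, refinement, use_refinement):
    """Run one stage: the essential path always executes, the refinement path is optional."""
    y = essential(x)
    if use_refinement:
        y = y + refinement(y)   # assumed residual-style refinement
    return y

def run_network(x, stages, depth_config):
    """Apply every stage in order; depth_config holds one boolean per stage."""
    for (essential, refinement), flag in zip(stages, depth_config):
        x = run_stage(x, essential, refinement, flag)
    return x

# Toy scalar "stages" standing in for backbone/neck blocks.
stages = [(lambda v: v * 2, lambda v: 1), (lambda v: v + 3, lambda v: 10)]
shallow = run_network(1, stages, [False, False])
full = run_network(1, stages, [True, True])
```

Every stage still produces an output in the shallowest configuration, so downstream multi-scale consumers keep all their inputs, in contrast to early exiting, which would drop later stages entirely.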

2605.09404 2026-05-12 cs.LG cs.CL cs.CV Updated version

Let the Target Select for Itself: Data Selection via Target-Aligned Paths

Huitao Yang, Hengzhi He, Guang Cheng

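Affiliations * University of California, Los Angeles

AI summary The study addresses targeted data selection and proposes a new reference path to reduce the bias that conventional methods can incur on heterogeneous data pools. A short warmup on the target validation set produces a validation-induced reference path, and the endpoint loss drop along this path scores each candidate sample, giving a selection rule that needs no gradients or Hessian approximations. In several experiments the method performs on par with dynamic attribution methods while substantially reducing warmup and storage cost, and the warmup can be reused across different data pools.

Details
English abstract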

Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.
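
A zero-order score of this kind needs only two forward passes per candidate: its loss at the start and at the end of the warmup path. The sketch below assumes the normalization is division by the starting loss, which the abstract does not specify; the names and loss values are hypothetical.

```python
def endpoint_loss_drop(loss_start, loss_end, eps=1e-12):
    """Relative drop in a candidate's loss between the two ends of the reference path."""
    return (loss_start - loss_end) / (abs(loss_start) + eps)

def select_top_k(scores, k):
    """Return the k candidate ids with the largest score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical (start, end) losses for three candidates along a validation-induced warmup.
losses = {"a": (2.0, 1.0), "b": (2.0, 1.9), "c": (1.0, 0.2)}
scores = {name: endpoint_loss_drop(s, e) for name, (s, e) in losses.items()}
chosen = select_top_k(scores, 2)
```

Candidates whose loss falls most while the model is trained only on the target proxy are the ones the target "selects for itself"; no candidate gradients are computed at any point.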

2605.09392 2026-05-12 cs.CV Updated version

HyNeuralMap: Hyperbolic Mapping of Visual Semantics to Neural Hierarchies

Zihan Ma, Tian Xia, Kexin Wang, Xiao Li, Xiaowei He, Yudan Ren

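Affiliations * School of Electronic Information (School of Artificial Intelligence), the Xi’an Key Laboratory of Radiomics and Intelligent Perception, Northwest University

AI summary The paper proposes HyNeuralMap, a framework that maps visual semantics into a cross-subject neural hierarchy in order to better understand the complex mapping between visual stimuli and neural responses. It uses the hyperbolic Lorentz model, with the negative curvature of hyperbolic space as an inductive bias, to better capture the hierarchical structure of visual semantics and cross-subject neural similarities. Experiments show that HyNeuralMap outperforms existing Euclidean methods on multi-label semantic prediction and cross-modal retrieval, supporting the advantage of hyperbolic geometry for cross-modal semantic alignment and hierarchical modeling.

Comments 14 pages, 4 figures

Details
English abstract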

Understanding the intricate mappings between visual stimuli and neural responses is a fundamental challenge in cognitive neuroscience. While current approaches predominantly align images and functional magnetic resonance imaging (fMRI) responses in Euclidean space, this geometry often struggles to preserve fine-grained semantic relationships and latent hierarchical structures across visual and neural modalities. To overcome this, we propose HyNeuralMap, a framework that employs the hyperbolic Lorentz model to map visual semantics into a shared, cross-subject neural hierarchy. By leveraging the negative curvature of hyperbolic space as an inductive bias, the proposed framework better captures hierarchical semantic organization and cross-subject neural similarities. Specifically, visual and neural embeddings are jointly optimized through hyperbolic geometric alignment, where geodesic distances preserve semantic proximity and hierarchical relationships more effectively than Euclidean embeddings. Experiments demonstrate that HyNeuralMap consistently outperforms state-of-the-art Euclidean baselines in both multi-label semantic prediction and cross-modal retrieval tasks. This confirms hyperbolic geometry's superiority for cross-modal semantic alignment and hierarchical modeling, providing a new avenue for vision-neural representation learning.
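
For readers unfamiliar with the Lorentz model, the geodesic distance has a closed form. The sketch below uses the standard unit hyperboloid (curvature -1), where points satisfy <x, x>_L = -1 and d(x, y) = arccosh(-<x, y>_L); the curvature value and helper names are assumptions, since the abstract does not state which curvature HyNeuralMap uses.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v onto the unit hyperboloid by setting x0 = sqrt(1 + |v|^2)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: negative sign on the time-like coordinate x0."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance on the unit hyperboloid (curvature -1)."""
    # The clamp guards against rounding that pushes the argument just below 1.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])   # lies at geodesic distance 1 from the origin
d = lorentz_distance(origin, point)
```

Distances grow roughly exponentially with Euclidean norm in this model, which is what gives hyperbolic space room to embed tree-like hierarchies with low distortion.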

2605.09384 2026-05-12 cs.CV cs.AI q-bio.QM Updated version

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

Runze Ma, Shunbo Jia, Haonan Lyu, Guo Liu, Caizhi Liao

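Affiliations * School of Information Technology; Monash University Malaysia; Faculty of Innovation Engineering; Macau University of Science and Technology; Department of Bioelectronics; Faculty of Biomedical Engineering; Shenzhen University of Advanced Technology

AI summary The paper proposes LiteMedCoT-VL, a parameter-efficient adaptation method that aims to improve the reasoning ability of medical visual question answering (VQA) models on resource-constrained devices. Through LoRA-based fine-tuning it transfers the chain-of-thought reasoning of a large teacher model to a small student model, without relying on image captions, which is closer to real clinical settings. Experiments show that LiteMedCoT-VL reaches 64.9% accuracy on the PMC-VQA benchmark, significantly above existing baselines, indicating that a small model with reasoning distillation can match or exceed larger models.

Comments 17 pages, 5 figures

Details
English abstract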

The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2-4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT-VL.

2605.09378 2026-05-12 cs.CV cs.AI cs.CL Updated version

EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation

Xinyi Wu, Jayant Teotia, Shuai Zhao, Erik Cambria

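Affiliations * Nanyang Technological University; Shanghai Jiao Tong University

AI summary EduStory is a unified framework for generating pedagogically consistent multi-shot STEM instructional videos. By integrating pedagogical state modeling, script-guided structured control, and learning-oriented evaluation metrics, it improves knowledge consistency and the coherence of the pedagogical narrative. The study also introduces the EduVideoBench benchmark, which supports multi-granularity analysis and evaluation of generated videos; experiments show clear advantages in preserving instructional intent and knowledge accuracy.

Details
English abstract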

Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the significance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and trustworthy long-horizon video generation.

2605.09362 2026-05-12 cs.GR cs.CV Updated version

FrameTwin: Curve-Anchored Gaussian Alignment from Sparse Views for Adaptive Wireframe 3D Printing

Wenting Wang, Zhuo Huang, Kun Qian, Neelotpal Dutta, Yuhu Guo, Yingjun Tian, Yeung Yam, Charlie C. L. Wang

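Affiliations * Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong; Centre of Perceptual and Interactive Intelligence, Hong Kong; University of Manchester; Department of Mechanical and Aerospace Engineering, The University of Manchester, United Kingdom

AI summary The paper proposes FrameTwin, a framework for curve-anchored Gaussian alignment from sparse-view images for adaptive wireframe 3D printing. It captures the deformation of thin wireframe structures by anchoring Gaussian kernels to parametric curves, yielding a compact, geometry-aware encoding that explicitly represents strut topology. Unlike general Gaussian-splatting approaches, it constrains the kernels to lie along parametric curves, which substantially reduces the ambiguity of thin structures under sparse views, achieves globally consistent deformation-field alignment, and allows subsequent printing trajectories to be adjusted dynamically.

Details
English abstract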

We present FrameTwin, a curve-anchored Gaussian alignment framework that uses sparse-view images to close the control loop for adaptive wireframe 3D printing. Our key idea is to capture the deformation of thin wireframe structures from sparse-view images using Gaussian kernels anchored to parametric curves, yielding a compact and geometry-aware encoding that explicitly captures strut topology. Driven by a differentiable rendering pipeline, FrameTwin estimates a neural deformation field that aligns the partially printed target model with the deformed structure observed during fabrication, where the optimized curve-Gaussian representation serves as a digital twin of the evolving wireframe. Unlike general Gaussian-splatting approaches, our formulation constrains kernel placement along parametric curves, substantially reducing the ambiguity inherent in sparse-view observations of thin structures. The resultant deformation-field alignment enforces global consistency across all struts. By using the estimated deformation field to blend the distorted printed geometry with the remaining unprinted geometry, FrameTwin enables adaptive updates to future printing trajectories. We demonstrate that FrameTwin can robustly capture and compensate for deformation in wireframe models fabricated using a robotized 3D printing system.
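
Anchoring kernels to a parametric curve means their centers are functions of a curve parameter rather than free 3D positions. The sketch below places kernel centers at evenly spaced parameters along a cubic Bezier curve; the choice of Bezier curves, the control points, and the function names are all assumptions, as the abstract says only "parametric curves".

```python
def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve with 3D control points at parameter t in [0, 1]."""
    s = 1.0 - t
    w = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)   # Bernstein weights
    return tuple(w[0] * a + w[1] * b + w[2] * c + w[3] * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def anchor_kernels(ctrl, n):
    """Place n kernel centers at evenly spaced parameters along one strut (n >= 2)."""
    return [bezier_point(*ctrl, i / (n - 1)) for i in range(n)]

# One hypothetical strut described by four control points.
ctrl = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0))
centers = anchor_kernels(ctrl, 5)
```

With this parameterization a whole strut deforms by moving four control points, so the optimizer has far fewer degrees of freedom than with unconstrained Gaussians, which is consistent with the reduced ambiguity the abstract describes.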

2605.09339 2026-05-12 cs.CV cs.AI Updated version

Perceptual Asymmetry Between Hue Categories: Evidence from Human Color Categorization

Elnara Kadyrgali, Nuray Toganas, Muragul Muratbekova, Pakizar Shamoi

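Affiliations * School of Information Technology and Engineering; Kazakh-British Technical University

AI summary Human color categories are not uniformly distributed in perceptual space, yet most computational color models still assume fixed, uniform color representations. By analyzing large-scale human color categorization data, the paper extends the COLIBRI fuzzy color model with quantitative measures based on fuzzy membership functions and reveals a perceptual asymmetry between hue categories. It finds that yellow occupies a compact, well-defined region of hue space, whereas green covers a broader interval with a longer transition structure, indicating that human color categories are not only fuzzy but also highly non-uniform in their geometric organization, which offers a new perspective on linguistic color categorization and perceptually grounded color modeling.

Comments The paper has been submitted for consideration to ICICS 2026 (International Conference on Informatics and Computer Science)

Details
English abstract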

Human color categories are not uniformly distributed in perceptual space, yet most computational color models still assume fixed and evenly structured representations. In this paper, we present a focused analytical extension of the COLIBRI fuzzy color model by investigating perceptual asymmetry between hue categories. Using previously collected large-scale human color categorization data, we introduce quantitative measures of category extent and boundary uncertainty, namely Wideness and Boundary Width, derived from fuzzy membership functions at the α = 0.5 level. The analysis reveals a strong imbalance between the two categories: yellow occupies a compact and sharply constrained region of the hue space, whereas green spans a substantially broader interval and exhibits a more extended transition structure. The results show that perceptual color categories are not only fuzzy, but also highly non-uniform in their geometric organization. This asymmetry suggests that some categories behave as narrow, highly specific perceptual labels, while others function as broad, tolerant regions of human color naming. These findings provide a new perspective on linguistic color categorization and extend the interpretability of the COLIBRI framework for perceptually grounded color modeling.
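
Measures taken "at the α = 0.5 level" are typically read off an α-cut of the membership function. The sketch below computes the α-cut of a trapezoidal membership function and its width as a stand-in for category extent; the trapezoid shape, the hue values in degrees, and the use of α-cut width are all assumptions, since the abstract does not give the exact definitions of Wideness and Boundary Width.

```python
def alpha_cut(trapezoid, alpha):
    """Interval of hues whose membership is at least alpha, for a trapezoid (a, b, c, d)."""
    a, b, c, d = trapezoid          # membership rises on [a, b], is 1 on [b, c], falls on [c, d]
    left = a + alpha * (b - a)
    right = d - alpha * (d - c)
    return left, right

def cut_width(trapezoid, alpha=0.5):
    """Width of the alpha-cut, used here as an illustrative measure of category extent."""
    left, right = alpha_cut(trapezoid, alpha)
    return right - left

# Hypothetical hue memberships in degrees; not values from the paper.
yellow = (48.0, 52.0, 62.0, 66.0)
green = (60.0, 90.0, 150.0, 180.0)
```

Under these made-up parameters the green cut is several times wider than the yellow one, which is the kind of asymmetry the analysis reports.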

2605.09328 2026-05-12 cs.CV Version update

Noise-Started One-Step Real-World Super-Resolution via LR-Conditioned SplitMeanFlow and GAN Refinement

Wei Zhu, Kai Zhang, Yu Zheng, Lei Luo, Yong Guo, Jian Yang

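Affiliations: Nanjing University of Science and Technology; Nanjing University; Huawei

AI summary: This study proposes SMFSR, a diffusion-based one-step real-world image super-resolution method that aims to resolve the efficiency-quality trade-off of conventional diffusion models. While preserving the noise-started generation process, the method uses LR-conditioned SplitMeanFlow to map noise directly to a high-resolution image and adds a GAN refinement stage to improve detail realism and image naturalness. Experiments show that SMFSR retains efficient single-step inference while achieving state-of-the-art perceptual quality among one-step diffusion models for real-world super-resolution.

Abstract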

Pre-trained text-to-image (T2I) diffusion models have shown strong potential for real-world image super-resolution (Real-ISR), owing to their noise-started generation process that enables realistic texture synthesis and captures the one-to-many nature of super-resolution. However, diffusion-based Real-ISR methods still face a fundamental efficiency-quality trade-off. Multi-step methods generate high-quality results by iteratively denoising random Gaussian noise under LR conditioning, but suffer from slow sampling. Recent one-step methods greatly improve efficiency, yet they typically replace noise-started generation with direct LR-to-HR restoration, which weakens stochasticity and limits realistic detail synthesis. To address this issue, we propose SMFSR, a noise-started one-step Real-ISR framework via LR-conditioned SplitMeanFlow and GAN refinement. SMFSR preserves the random-noise starting point of diffusion models and learns a direct noise-to-HR mapping conditioned on the LR image. To this end, Interval Splitting Consistency distills the multi-step generative trajectory into a single average-velocity prediction, enabling efficient one-step generation. To compensate for the reduced opportunity for progressive refinement, we further introduce a GAN refinement stage, where a DINOv3-based discriminator enhances realistic texture synthesis and variational score distillation aligns the generated outputs with the natural image distribution under a frozen diffusion teacher. Extensive experiments demonstrate that SMFSR achieves state-of-the-art perceptual quality among one-step diffusion-based Real-ISR methods while retaining fast single-step inference.

2605.09319 2026-05-12 cs.CV cs.LG Version update

PGID: Progressive Guided Inversion and Denoising for Robust Watermark Detection

Minh Quoc Duong, Chun Tong Lei, Chun Pong Lau

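Affiliations: City University of Hong Kong

AI summary: With the spread of AI-generated images, digital watermarking has become an important means of protecting intellectual property and preventing malicious use. However, existing semantic watermarking methods rely on diffusion inversion for watermark detection and are vulnerable to imprint removal and forgery attacks. This paper proposes PGID, a progressive guided inversion and denoising framework that defends against these attacks without training. Through progressive inversion-denoising cycles, it projects perturbed latents back to their original region, thereby recovering removed watermarks and identifying forged instances.

Abstract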

With the proliferation of AI-generated images, digital watermarking has become an essential safeguard for protecting intellectual property and mitigating malicious exploitation. Recent works on semantic watermarking have enabled efficient copyright protection for diffusion models. However, the dependence of semantic watermarking on diffusion inversion for watermark detection creates a critical vulnerability. Imprint removal and forgery attacks exploit this weakness to produce deceptive results. Our analysis reveals that these attacks succeed by displacing watermarked latents into the unwatermarked region, while guiding unwatermarked latents into the watermarked region. Based on that, we propose Progressive Guided Inversion and Denoising (PGID), the first plug-and-play, training-free noise extraction framework designed to defend against both attack strategies. PGID effectively defends by projecting perturbed latents back to the region where they originally belong. The projection is achieved by eliminating intermediate latent deflections and mitigating adversarial perturbations through progressive inversion-denoising cycles. Comprehensive evaluations across multiple schemes demonstrate that PGID successfully restores detection reliability by recovering removed watermarks and identifying forged instances.

2605.09317 2026-05-12 cs.CL cs.CV cs.LG Version update

Mem-W: Latent Memory-Native GUI Agents

Guibin Zhang, Yaohui Ling, Fanci Meng, Kun Wang, Shuicheng Yan

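Affiliations: LV-NUS Lab

AI summary: This paper proposes Mem-W, a new kind of GUI agent whose core idea is to treat memory as part of the agent's continuous context rather than as a conventional external auxiliary structure. Using a shared trajectory-to-latent compressor, Mem-W encodes historical trajectories and in-session segments into compact memory tokens and fuses them with the current GUI observation into one continuous embedding sequence, giving the agent a unified view of task progress for decision-making. Experiments show that Mem-W substantially improves a range of backbones and memory-enhanced methods on several web and mobile navigation tasks, with gains of up to +30.0, demonstrating the effectiveness and scalability of latent-native memory for long-horizon GUI operation.

Abstract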

GUI agents are beginning to operate the web, mobile, and desktop as interactive worlds, where successful control depends on carrying forward visual, procedural, and task-level evidence beyond the fleeting present screen. Yet most agents still treat memory as an external, human-readable artifact: histories are summarized, categorized, retrieved, and reinserted as text or structured records before being encoded again by the policy. This creates a mismatch between the representational form in which experience is stored and the latent embedding sequence over which modern GUI policies actually act. We introduce Mem-W, a series of latent-memory-native GUI agents that treat memory as part of the agent's continuous context rather than as an auxiliary symbolic scaffold. Mem-W weaves both historical trajectories (as experiential memory) and in-session segments (as working memory) into compact memory tokens through a shared trajectory-to-latent compressor. These tokens are woven with the current GUI observation and local context into one continuous embedding sequence, allowing the agent to read successes, failures, and unfinished progress through the same machine-native interface. Mem-W is trained with self-distillation and outcome-aware supervision to preserve decision-relevant state while filtering memory toward evidence that truly supports task success. Across four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to +30.0, suggesting that latent-context-native memory can serve as a scalable foundation for long-horizon GUI agency.

2605.09312 2026-05-12 cs.CV Version update

Low-Cost Neural Radiance Fields

Alice Huang, Prathamesh Sonawane, Yashdeep Thorat, Yug Rao

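Affiliations: University of Illinois Urbana-Champaign

AI summary: This paper studies how to accelerate the training and inference of Neural Radiance Fields (NeRF) when compute and data are limited. The authors compare three accelerated NeRF models and run extension experiments for the low-compute, low-data regime, including adding a depth-supervision loss, simplifying the feature-decoding network, and designing HashNeRF variants with different architectures. Under equal training time, none of the extensions clearly outperforms the existing baselines, but the results indicate which extensions suit constrained settings and suggest directions for future work.

Comments 7 pages

Abstract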

Neural Radiance Fields (NeRF) achieve high-quality novel-view synthesis, but their long training times and reliance on dense input views limit accessibility. We present a comparative study of three accelerated NeRF variants (DS-NeRF, TensoRF, and HashNeRF) and explore extensions targeted at the low-compute, low-data regime. First, we add a depth-supervision loss derived from COLMAP keypoints to TensoRF (TensoRF-DS) and evaluate it on the LLFF dataset under reduced view counts. Second, we ablate the feature-decoding MLP of TensoRF and study the effect of input downsampling on PSNR and runtime on the synthetic Lego scene. Third, we propose four architectural variants of the HashNeRF color and density networks, including residual and convolutional designs, and report PSNR/training-time tradeoffs under matched iteration budgets. Under iso-time evaluation, none of our extensions conclusively outperforms the published baselines, but the experiments characterize which extensions transfer to constrained settings and surface design questions for future work.

2605.09302 2026-05-12 cs.LG cs.CV Version update

Discrete Langevin-Inspired Posterior Sampling

Chaitanya Amballa, Sattwik Basu, Jorge Vančo Sampedro, Romit Roy Choudhury

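Affiliations: University of Illinois Urbana-Champaign

AI summary: This paper studies posterior sampling for inverse problems in discrete state spaces, using discrete diffusion models as generative priors. Existing methods mostly rely on continuous relaxations, Gibbs updates, or mechanisms tied to particular corruption processes, which limits their scalability and generality. The authors propose ΔLPS, a posterior sampler based on discrete Langevin dynamics that uses gradient information to sample efficiently without leaving the discrete state space, supports parallel updates across all dimensions, and applies to discrete diffusion models trained in different ways. Experiments show that it outperforms existing discrete diffusion posterior samplers on tasks such as image restoration and spatial mapping and is competitive with continuous diffusion methods.

Abstract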

We study posterior sampling for inverse problems in discrete state spaces using discrete diffusion models as generative priors. While continuous diffusion models have become widely used for inverse problems, their discrete counterparts remain comparatively underexplored. Existing discrete posterior samplers often rely on continuous relaxations of discrete variables, Gibbs-style updates, or mechanisms specialized to particular corruption processes, which can limit scalability or generality. We propose ΔLPS, a Discrete Langevin-Inspired Posterior Sampler that uses gradient information to identify promising discrete moves without leaving the discrete state space. The resulting approach enables efficient parallel updates across all token dimensions and is agnostic to the training paradigm of the discrete diffusion prior, including masked and uniform-state diffusion. We evaluate our method on image restoration tasks across MNIST, CIFAR, and FFHQ, as well as spatial mapping, covering linear, nonlinear, and blind inverse problems. Across these settings, we improve over recent discrete diffusion posterior samplers and are competitive with strong continuous diffusion-based inverse solvers. Our results suggest that fully discrete, gradient-informed posterior samplers offer a scalable and general path toward solving inverse problems over discrete representations.

2605.09296 2026-05-12 cs.CV cs.AI cs.LG Version update

Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts

Boxuan Zhang, Jianing Zhu, Qifan Wang, Jiang Liu, Ruixiang Tang

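Affiliations: Rutgers University; The University of Texas at Austin; Meta AI; Advanced Micro Devices

AI summary: Recent generative models can produce highly realistic images, making it increasingly difficult to tell real images from AI-generated ones. Existing detectors based on pre-trained feature extractors tend to over-rely on global semantics and overlook critical micro-defects. This paper proposes MDMF, a detection framework based on local distributional differences that amplifies subtle statistical irregularities in an image into macro-level distributional discrepancies, substantially improving detection performance. Experiments show that MDMF outperforms existing methods on multiple benchmarks, confirming its effectiveness.

Comments 41 pages, 10 figures

Abstract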

Recent generative models can produce images that appear highly realistic, raising challenges in distinguishing real and AI-generated images. Yet existing detectors based on pre-trained feature extractors tend to over-rely on global semantics, limiting sensitivity to the critical micro-defects. In this work, we propose Micro-Defects expose Macro-Fakes (MDMF), a local distribution-aware detection framework that amplifies micro-scale statistical irregularities into macro-level distributional discrepancies. To avoid localized forensic cues being diluted by plain aggregation, we introduce a learnable Patch Forensic Signature that projects semantic patch embeddings into a compact forensic latent space. We then use Maximum Mean Discrepancy (MMD) to quantify distributional discrepancies between generated and real images. Our theory-grounded analysis shows that patch-wise modeling yields provably larger discrepancies when localized forensic signals are present in generated images, enabling more reliable separation from real images. Extensive experiments demonstrate that MDMF consistently outperforms baseline detectors across multiple benchmarks, validating its general effectiveness. Project page: https://zbox1005.github.io/MDMF-project/
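
As a rough illustration of the Maximum Mean Discrepancy step mentioned in the abstract, the sketch below computes a biased squared-MMD estimate between two sets of feature vectors. It is not the MDMF implementation: the RBF kernel, the bandwidth of 1.0, the 4-dimensional Gaussian "patch features", and all function names are assumptions chosen only to make the example self-contained.

```python
import math
import random

def rbf_kernel(x, y, bandwidth=1.0):
    # Gaussian (RBF) kernel; kernel choice and bandwidth are assumptions.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def sample(mean, n, dim, rng):
    # Hypothetical stand-in for patch-level forensic features.
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
real_a = sample(0.0, 100, 4, rng)
real_b = sample(0.0, 100, 4, rng)
fake = sample(1.0, 100, 4, rng)   # shifted distribution, standing in for generated patches

mmd_same = mmd_squared(real_a, real_b)   # two samples from the same distribution
mmd_diff = mmd_squared(real_a, fake)     # samples from different distributions
```

In this toy setting the estimate between the shifted and unshifted samples is larger than the estimate between two samples of the same distribution, which is the property a detector of this kind relies on.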

2605.09288 2026-05-12 cs.LG cs.AI cs.CE cs.CV cs.NA math.NA Version update

MC²: Monte Carlo Correction for Fast Elliptic PDE Solving

Ethan Hsu, Hong Meng Yam, Ivan Ge

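Affiliations: Stanford University

AI summary: This paper proposes MC², a hybrid solver that combines a Monte Carlo method (Walk-on-Spheres) with a neural network to solve elliptic partial differential equations (PDEs) efficiently. The method treats a low-budget Monte Carlo solution as a structured estimator and trains a neural network to correct it in a single forward pass, yielding a high-accuracy solution at much higher speed. The paper also releases PDEZoo, a standardized benchmark of two million elliptic PDEs, to support research on PDE solving under limited compute.

Abstract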

Partial differential equation (PDE) solvers underpin scientific computing, but real-world deployment is bounded by compute. Classical Monte Carlo solvers such as Walk-on-Spheres (WoS) are unbiased and geometry-agnostic but slow. Learned solvers are fast but biased and brittle under distribution shift. We present MC², a hybrid WoS-Neural Network (WoS-NN) PDE solver that treats a low-budget Monte Carlo solution as a structured estimator of the true field and learns a single-pass neural correction to recover a high-fidelity solution. MC² matches the accuracy of solutions using over 1000× more Monte Carlo compute, outperforming all evaluated classical, denoising, and neural-operator baselines. To enable reproducible study of finite-compute PDE solving, we additionally release PDEZoo, the largest standardized elliptic PDE benchmark to date: 2M PDEs spanning five elliptic families and unlimited geometric compositions, with analytic ground truth and multi-budget Monte Carlo trajectories. Together MC² and PDEZoo (1) empirically establish that finite-sample Monte Carlo error is structured, learnable, and correctable in a single forward pass, (2) show that we can solve PDEs roughly 1000× faster than with just WoS, and (3) provide the evaluation infrastructure the field has so far lacked.
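
For readers unfamiliar with Walk-on-Spheres, the following minimal sketch shows the classical estimator only, for Laplace's equation on the unit disk with Dirichlet boundary data. It does not include the neural correction described in the abstract. The domain, the boundary function `g`, the stopping tolerance `eps`, and the walk budgets are all assumptions made for illustration.

```python
import math
import random

def walk_on_spheres(x, y, boundary_value, rng, eps=1e-3, max_steps=10_000):
    """One WoS walk for Laplace's equation on the unit disk (assumed domain)."""
    for _ in range(max_steps):
        dist = 1.0 - math.hypot(x, y)       # distance to the unit circle
        if dist < eps:                       # close enough to the boundary: stop
            break
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += dist * math.cos(theta)          # jump to a uniform point on the
        y += dist * math.sin(theta)          # largest circle inside the domain
    norm = math.hypot(x, y)
    return boundary_value(x / norm, y / norm)  # read g at the nearest boundary point

def wos_estimate(x, y, boundary_value, n_walks, seed=0):
    # Average many independent walks; the error shrinks like 1/sqrt(n_walks).
    rng = random.Random(seed)
    total = sum(walk_on_spheres(x, y, boundary_value, rng) for _ in range(n_walks))
    return total / n_walks

g = lambda bx, by: bx   # boundary data whose harmonic extension is u(x, y) = x
low_budget = wos_estimate(0.3, 0.2, g, n_walks=16)     # cheap, noisy estimate
high_budget = wos_estimate(0.3, 0.2, g, n_walks=4000)  # costly, more accurate estimate
```

The exact solution at (0.3, 0.2) is 0.3, so the gap between `low_budget` and `high_budget` gives a feel for the finite-sample error that a learned correction would have to remove.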

2605.09279 2026-05-12 cs.GR cs.CV cs.MM cs.NI eess.IV Version update

CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting

Daheng Yin, Yili Jin, Jianxin Shi, Isaac Ding, Miao Zhang, Fangxin Wang, Zhaowu Huang, Cong Zhang, Jiangchuan Liu, Fang Dong

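Affiliations: Simon Fraser University; Jiangxing Intelligence Inc.; McGill University; Nankai University; The Chinese University of Hong Kong, Shenzhen; Fuzhou University; Southeast University

AI summary: This paper proposes CAGS, a color-adaptive volumetric video streaming system that addresses the bandwidth consumption and quality degradation of dynamic 3D Gaussian Splatting in real-time transmission. The method uses vector quantization to establish levels of detail (LoD) and corrects color with low-resolution reference images, effectively reducing color distortion. Experiments show that, under varying bandwidth conditions, CAGS improves PSNR by 5 to 20 dB over existing methods, with higher transmission efficiency and generality across Gaussian representations.

Comments SIGGRAPH 2026 Conference Paper. Code is available at https://github.com/yindaheng98/ColorAdaptiveGaussianSplatting

Journal ref ACM SIGGRAPH 2026

Abstract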

Volumetric video (VV) streaming enables real-time, immersive access to remote 3D environments, powering telepresence, ecological monitoring, and robotic teleoperation. These applications turn VV streaming into a real-time interface to remote physical environments, imposing new system-level demands for photorealistic scene representation, low-latency interaction, and robust performance under heterogeneous networks. 3D Gaussian Splatting (3DGS) has been widely used for real-time photorealistic rendering, offering superior visual quality and rendering performance, but it faces challenges due to bandwidth consumption. Furthermore, as the foundation of adaptive VV streaming, existing Levels of Detail (LoD) methods based on density are not well-suited to Gaussian representations, leading to visible gaps and severe quality degradation. Recent studies have also explored attribute compression techniques to reduce bandwidth consumption. Our preliminary studies reveal that aggressive attribute compression primarily causes color distortion, which can be effectively corrected in the rendered image using a reference image. Motivated by these findings, we propose a novel Color-Adaptive scheme for adaptive VV streaming that uses vector quantization (VQ) to establish LoDs and correct color distortions with low-resolution reference images. We further present CAGS, an adaptive VV streaming system compatible with diverse Gaussian representations, which integrates the Color-Adaptive scheme by rendering reference images on the streaming server and performing color restoration on the client. Extensive experiments on our prototype system demonstrate that CAGS outperforms the existing adaptive streaming systems in PSNR by 5 to 20 dB under fluctuating bandwidth, operates significantly faster than existing scalable Gaussian compression methods, and generalizes across different Gaussian representations.

2605.09276 2026-05-12 cs.LG cs.CV Version update

Uncertainty-Aware Token Importance Estimation in Spiking Transformers

Wenxuan Liu, Zecheng Hao, Tong Bu, Yuran Wang, Zhaofei Yu

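Affiliations: School of Computer Science, Peking University; Institute for Artificial Intelligence, Peking University; Peking University

AI summary: This paper studies how to estimate token importance more accurately in spiking transformers in order to reduce redundant computation and improve inference efficiency. Existing methods mainly rely on response-based cues such as activation magnitude or firing statistics, which do not reflect how a token's uncertainty changes over time. The authors propose Uncert, a training-free, plug-and-play framework that models token-wise class evidence and analyzes its temporal uncertainty patterns, providing a new basis for assessing token importance. Experiments show a favorable accuracy-efficiency balance on both static and neuromorphic benchmarks, with particularly strong results for token pruning.

Abstract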

Spiking transformers have shown strong potential for neuromorphic vision, yet their token processing across multiple spiking steps still introduces substantial redundancy and inference cost. Existing token reduction methods mainly rely on response-based cues, such as activation magnitude, firing statistics, or feature similarity. Although effective, these criteria do not explicitly characterize token importance from the perspective of temporally evolving class evidence. In spiking transformers, token representations are progressively formed across multiple spiking steps rather than determined at a single instant, suggesting that token importance should be evaluated not only by instantaneous responses but also by temporal uncertainty patterns. Our key observation is that tokens exhibit heterogeneous uncertainty trajectories over time, and that their temporally aggregated uncertainty statistics provide an effective cue for distinguishing informative tokens from redundant ones. Motivated by this, we propose Uncert, a training-free and plug-and-play token importance estimation framework for spiking transformers. Specifically, Uncert models token-wise class evidence with a Dirichlet distribution and summarizes each token's temporal uncertainty using its mean and fluctuation across spiking steps, yielding an uncertainty-aware importance score for token reduction during inference. Experiments on both static and neuromorphic benchmarks show that Uncert achieves favorable accuracy and efficiency tradeoffs, with the most consistent gains observed under token pruning. Further analysis reveals a clear empirical connection between temporal uncertainty patterns and token contribution, offering new insights into token dynamics in spiking transformers.
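
The abstract does not spell out how the Dirichlet-based uncertainty is computed, so the sketch below uses the common evidential formulation as an assumption: concentration parameters are evidence plus one, and uncertainty is the number of classes divided by the total concentration. The function names and the toy evidence values are hypothetical, and how the two statistics are combined into a final importance score is deliberately left out.

```python
import statistics

def dirichlet_uncertainty(evidence):
    """Uncertainty u = K / sum(alpha), with alpha_k = evidence_k + 1 (assumed form)."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    return num_classes / sum(alpha)

def temporal_uncertainty_stats(evidence_per_step):
    """Mean and fluctuation (population std) of one token's uncertainty over spiking steps."""
    per_step = [dirichlet_uncertainty(e) for e in evidence_per_step]
    return statistics.fmean(per_step), statistics.pstdev(per_step)

# Toy token with 3 classes observed over 2 spiking steps:
# no evidence at the first step, then evidence for class 0 at the second.
token_evidence = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
mean_u, fluct_u = temporal_uncertainty_stats(token_evidence)
```

With no evidence the uncertainty is 1.0, and it falls to 0.5 once three units of evidence arrive for one class, so this token has a mean uncertainty of 0.75 and a fluctuation of 0.25.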

2605.09272 2026-05-12 cs.AI cs.CL cs.CV Version update

Towards Conversational Medical AI with Eyes, Ears and a Voice

Meet Shah, Jason Gusdorf, Anil Palepu, Chunjong Park, Jack W. O'Sullivan, Vishnu Ravi, Tim Strother, Pavel Dubov, Aliya Rysbek, Toshiyuki Fukuzawa, Yana Lunts, Jan Freyberg, Michael B. Chang, Aniruddh Raghu, David Stutz, Devora Berlowitz, Eliseo Papa, Taylan Cemgil, JD Velasquez, Jack Chen, Arthur Chen, Doug Fritz, Charlie Taylor, Katya Tregubova, Jing Rong Lim, Richard Green, Sara Mahdavi, Mahvish Nagda, Jihyeon Lee, Craig Schiff, Liviu Panait, Sukhdeep Singh, Valentin Liévin, David G. T. Barrett, Hannah Gladman, Anna Cupani, Francesca Pietra, Uchechi Okereke, Katherine Tong, Clemens Meyer, Erwan Rolland, Mili Sanwalka, Michael D. Howell, Shixiang Shane Gu, Bibo Xu, Euan A. Ashley, S. M. Ali Eslami, Gregory Wayne, Pushmeet Kohli, Vivek Natarajan, Adam Rodman, Alan Karthikesalingam, Ryutaro Tanno

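Affiliations: Google DeepMind; Google Research; Beth Israel Deaconess Medical Center, Harvard Medical School; Stanford University

AI summary: This study presents AI co-clinician, a new conversational medical AI system that processes audio-visual data from doctor-patient conversations in real time to support clinical decisions. Built on Gemini's low-latency audio and video processing, the system uses a dual-agent architecture that balances deep clinical reasoning with the low-latency responses needed for natural dialogue. Experiments show that AI co-clinician approaches primary care physicians on several key evaluation dimensions and significantly outperforms GPT-Realtime on the general criteria, but gaps remain in physical examination and disease-specific reasoning, underscoring the importance of audio-visual information in medical consultation.

Comments Video examples are available on YouTube: https://youtu.be/y5Vaa_SN1t0, https://youtu.be/dC4icb75vLQ, and https://youtu.be/E7iEvWo-E6c

Abstract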

The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.

2605.09269 2026-05-12 cs.CL cs.CV Version update

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Rui Liu, Dian Yu, Zhenwen Liang, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

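Affiliations: Tencent Hunyuan; University of Maryland, College Park; University of North Carolina, Chapel Hill

AI summary: DeltaRubric is a generative approach to reward modeling for multimodal large language models that targets the bias of existing evaluators when judging visual details. It decomposes evaluation into two steps, planning and verification: the model dynamically generates an instance-specific checklist and then verifies it against the image and question, improving the accuracy and reliability of evaluation. Experiments show that DeltaRubric substantially improves reward modeling on multiple benchmarks, confirming its effectiveness on multimodal tasks.

Abstract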

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by +22.6 (4B) and +18.8 (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.

2605.09262 2026-05-12 cs.CV cs.CL Version update

Reinforcing Multimodal Reasoning Against Visual Degradation

Rui Liu, Dian Yu, Haolin Liu, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

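Affiliations: Tencent Hunyuan; University of Maryland, College Park; University of Virginia; University of North Carolina, Chapel Hill

AI summary: This study addresses the decline in reasoning ability of multimodal large language models under real-world visual degradations such as blur and compression artifacts, and proposes ROMA, a reinforcement-learning fine-tuning framework. Using a dual-forward-pass strategy, a distributional consistency constraint, and correctness-conditioned regularization, the method improves robustness to visual degradation without harming performance on clean inputs. Experiments show that ROMA outperforms existing methods on multiple multimodal reasoning benchmarks, improving reasoning accuracy under both seen and unseen degradations.

Abstract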

Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.

2605.09258 2026-05-12 cs.CV cs.AI Version update

Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models

R. James Cotton, Pouyan Firouzabadi, Wendy Murray

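Affiliations: Shirley Ryan AbilityLab; Department of PM&R, Northwestern University; Department of Biomedical Engineering, Northwestern University

AI summary: This study addresses accurate tracking of finger biomechanics from monocular video. It proposes a method that combines the SAM 3D Body foundation model with inverse kinematics optimization to extract anatomically constrained finger joint angles from single-view video. Porting the model to JAX and integrating it with MuJoCo-MJX enables efficient GPU-accelerated optimization, and a new mapping is established between Momentum Human Rig outputs and biomechanical model markers. Experiments on a variety of hand poses and object manipulation tasks show joint angle errors of about 10 degrees and hand position errors of about 6 mm, with good consistency across viewpoints and robustness, opening a new route to quantitative video-based analysis of hand movement.

Comments Accepted to EMBC 2026

Abstract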

Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.

2605.09242 2026-05-12 eess.IV cs.CV Version update

Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading

Yiqun Wang

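Affiliations: School of Software Engineering, Beijing Jiaotong University

AI summary: This paper proposes CGSD, a cross-modal semantic-enhanced diffusion framework that combines vision-language pretraining with diffusion probabilistic modeling for automated diabetic retinopathy grading. The method fine-tunes a domain-specific vision-language model with low-rank adaptation, narrowing the distribution gap between the pretrained model and the target dataset, and builds a cross-modal semantic conditioning vector from the dot products between image features and the text descriptions of each lesion grade. This vector conditions the diffusion denoising network, improving the model's sensitivity to fine-grained lesion features and clinical semantics. Experiments show that the method achieves higher accuracy and F1 score than existing methods on the APTOS 2019 dataset.

Comments 6 pages, 3 figures, 2 tables

Abstract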

Automated grading of diabetic retinopathy (DR) faces several critical challenges: subtle inter-grade visual distinctions in fine-grained lesion patterns, distributional discrepancies induced by heterogeneous imaging devices and acquisition conditions, and the inherent inability of purely visual approaches to exploit clinical semantic knowledge. In this paper, we propose CLIP-Guided Semantic Diffusion (CGSD), a DR grading framework that synergistically integrates vision-language pretraining with diffusion probabilistic modeling. We adopt a domain-specific vision-language model tailored for DR grading as the semantic guidance module and adapt it to the target domain via Low-Rank Adaptation (LoRA), effectively bridging the distributional gap between the pretrained model and the target dataset with only a minimal number of trainable parameters. Building on this foundation, we construct a cross-modal semantic conditioning vector by computing the dot product between image features and the text description features of each DR grade, yielding a joint representation that simultaneously encodes visual content and clinical-grade semantics. This vector serves as the conditioning signal for the diffusion denoising network, replacing the structurally complex dual-branch visual prior employed in existing diffusion-based classification methods. Experiments on the APTOS 2019 dataset demonstrate that the proposed approach achieves an accuracy of 87.5% and a macro-averaged F1 score of 0.731, outperforming a variety of representative methods. Ablation studies further validate the independent contribution of each constituent module.
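
The conditioning vector described in the abstract, built from dot products between image features and per-grade text features, can be sketched as follows. This is an illustration under assumptions, not the CGSD code: the L2 normalization (as is typical for CLIP-style models), the 2-dimensional toy features, and the function names are all hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a feature vector to unit length (assumed preprocessing step).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def semantic_condition(image_feat, grade_text_feats):
    """One similarity score per grade: dot product of normalized image and text features."""
    img = l2_normalize(image_feat)
    return [
        sum(a * b for a, b in zip(img, l2_normalize(text_feat)))
        for text_feat in grade_text_feats
    ]

# Toy example with three grade descriptions embedded in 2 dimensions.
image_feature = [1.0, 0.0]
grade_features = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
condition = semantic_condition(image_feature, grade_features)
```

The result has one entry per grade, here about 1.0, 0.0, and 0.707, and a vector of this form is what would be passed to the denoising network as its conditioning signal.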

2605.07910 2026-05-12 cs.CV Version update

One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction

Yulong Chen, Xiaoyun Dong, Haoyu Zhang, Zongxian Yang, Lewei Xie, Xinke Li, Yifan Zhang, Kai Wang, Jianping Wang

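Affiliations: City University of Hong Kong (Dongguan); City University of Hong Kong; SLAI

AI summary: This paper studies the reconstruction of dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data. It points out that existing Gaussian scene graph methods assume synchronized observations and therefore cannot handle the temporal asynchrony between vehicle and infrastructure cameras, which causes severe ghosting on dynamic objects. The authors propose DUST, a decoupled spatio-temporal Gaussian scene graph that maintains independent pose trajectories for each agent and shares a unified appearance representation, eliminating cross-source interference and achieving substantial gains on the V2X-Seq dataset.

Abstract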

Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agents, such as cars and pedestrians, at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose DUST (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. DUST shares a canonical Gaussian set per agent for appearance consistency, while maintaining decoupled pose trajectories aligned to each source's true capture timestamps. We prove that this decoupling makes the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make DUST practical, we further introduce a static anchor-based pose correction pipeline that corrects spatial misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, while remaining robust under larger temporal asynchrony.

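2605.07649 2026-05-12 cs.CV cs.AI cs.RO Version update

Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

Berkehan Ünal, Hauke Dierend, Dren Fazlija, Christopher Plachetka

Affiliations * Volkswagen Aktiengesellschaft; L3S Research Center; Faculty of Information Technology; MOIA GmbH; Motor AI GmbH

AI summary This paper studies how vision-language models (VLMs) can provide zero-shot perception of the Operational Design Domain (ODD) to support safety-critical applications such as automated driving systems. Through experiments on a custom dataset and Mapillary Vistas, the authors evaluate four VLMs on zero-shot classification and detection and analyze the effect of different optimization strategies. They propose definition-anchored chain-of-thought prompting combined with persona decomposition, which markedly improves perception performance and offers a practical route to transparent and effective ODD perception.

Comments 8 pages, 4 figures

Details
Abstract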

Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.

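2605.07399 2026-05-12 cs.CV Version update

GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

Yu Pan, Andi Zhang, Yi Wang, Sibei Yang, Wenjie Wang

Affiliations * ShanghaiTech University; University of Warwick; Sun Yat-sen University

AI summary This paper studies the safety of diffusion vision-language models (dVLMs) under jailbreak attacks and shows that their apparent robustness to conventional Fixed Prefix Optimization (FPO) attacks is illusory. The authors propose Global Probability Optimization (GPO), a jailbreak approach that manipulates the denoising trajectory of diffusion models to bypass their guardrails, and build on it to develop GPO-V, the first visual-modality jailbreak framework for dVLMs. Experiments show that GPO-V produces stealthy perturbations that transfer across models, exposing a critical security gap in non-sequential generative architectures and underlining the urgency of safety alignment for dVLMs.

Details
Abstract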

Diffusion Vision-Language Models (dVLMs), built upon the non-causal foundations of Diffusion Large Language Models (dLLMs), have demonstrated remarkable efficacy in multimodal tasks by departing from the traditional autoregressive generation paradigm. While dVLMs appear inherently robust against conventional jailbreak tactics, which we categorize as Fixed Prefix Optimization (FPO) (e.g., anchoring responses with "Sure, here is"), this perceived resilience is deceptive. Our investigation into the safety landscape of dVLMs reveals a unique refusal pattern: Immediate Refusal and Progressive Refusal. We find that while FPO-based attacks often fail by triggering the latter, the progressive refinement process itself uncovers a novel, latent attack surface. To exploit this vulnerability, we propose Global Probability Optimization (GPO), a general jailbreak paradigm designed specifically for the denoising trajectory of masked diffusion models. Unlike prefix-based methods, GPO manipulates the global generative dynamics to bypass guardrails in diffusion language models. Building on this, we introduce GPO-V, the first visual-modality jailbreak framework tailored for dVLMs. Empirical results demonstrate that GPO-V produces stealthy perturbations with exceptional cross-model transferability, revealing a critical security gap in non-sequential generative architectures. Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250.

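2605.07203 2026-05-12 cs.CV Version update

From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

Chamuditha Jayanga Galappaththige, Jason Lai, Timothy Patten, Donald Dansereau, Niko Suenderhauf, Dimity Miller

Affiliations * QUT Centre for Robotics; ARIAM ACFR, University of Sydney

AI summary This paper studies scene change detection with Gaussian splatting and proposes comparing scenes directly in the space of native Gaussian parameters rather than using the conventional render-then-compare approach. By analyzing native Gaussian attributes (position, anisotropic covariance, and color), the authors show that these attributes alone carry sufficient change signal, and they introduce anisotropic models of geometric and photometric drift together with a per-Gaussian observability term to handle the under-constrained nature of the representation. The method offers multi-view consistency and the ability to distinguish change types, and it outperforms prior methods by about 17% on real-world datasets.

Comments Project Page: https://chumsy0725.github.io/GS-DIFF/

Details
Abstract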

Scene change detection methods built on Gaussian splatting universally follow a render-then-compare paradigm: the pre-change scene is rendered into 2D and compared against post-change images via pixel or feature residuals. This change detection problem with Gaussian Splatting has been treated as a question about pixels; we treat it as a question about primitives. We provide direct evidence that native primitive attributes alone -- position, anisotropic covariance, and color -- carry sufficient signal for scene change detection. What makes primitive-space comparison hard is the under-constrained nature of the Gaussian splatting representation: independent optimizations yield primitive solutions whose count, positions, shapes, and colors differ even where nothing has changed. We address this challenge with anisotropic models of geometric and photometric drift, complemented by a per-primitive observability term that reflects the extent to which each Gaussian is constrained by the camera geometry. Operating directly on primitives gives our method, GS-DIFF, two properties that distinguish it from render-then-compare methods. First, change maps are multi-view consistent by construction, where prior work had to learn this through an additional optimization objective. Second, geometric and appearance changes are scored separately, identifying not just where but what kind of change occurred, distinguishing structural changes (e.g., an added object) from surface-level ones (e.g., a color change) without supervision or external model dependencies. On real-world benchmarks, GS-DIFF surpasses the prior state-of-the-art approach by approximately 17% in mean Intersection over Union.

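2605.06969 2026-05-12 cs.CV Version update

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

Yuchen Guo, Junli Gong, Yao Lu, Xintong Xu, Yiuming Cheung, Weifeng Su

Affiliations * Northwestern University; Northeastern University; University of Washington; Hong Kong Baptist University; Beijing Normal - Hong Kong Baptist University

AI summary This work aims to improve the accuracy of quality assessment for infrared-visible image fusion (IVIF). To address the over-reliance of existing methods on hand-crafted features and full-reference metrics, it proposes FuScore, an assessment method based on a multimodal large language model (MLLM). FuScore has the MLLM produce a continuous quality score instead of a discrete level prediction, which enables fine-grained discrimination among images of similar quality. It builds soft labels from the agreement across multiple sub-dimensions and adds a tripartite objective to make the assessment more comprehensive and robust. Experiments show that FuScore achieves state-of-the-art correlation with human visual preferences.

Details
Abstract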

Infrared-visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision. Naively introducing MLLMs with discrete one-hot supervision is no better, because it forces fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing a continuous quality score, rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.

2605.06681 2026-05-12 cs.LG cs.CV Version update

A Hierarchical Ensemble Pipeline for Anomaly Detection in ESA Satellite Telemetry

Lorenzo Riccardo Allegrini, Geremia Pompei

Affiliations * ContinualIST, Pisa, Italy; University of Pisa, Department of Computer Science, Pisa, Italy

AI summary This paper proposes a hierarchical ensemble pipeline for anomaly detection in European Space Agency (ESA) satellite telemetry. The method combines shapelet extraction, statistical feature analysis, per-channel modeling, intra-channel stacking, and cross-channel aggregation, and it is trained and validated with time-series cross-validation and a two-level masking strategy to prevent information leakage. Results on the ESA-ADB benchmark show strong generalization and effective detection of subtle anomalies in realistic satellite telemetry.

Comments 15 pages, 3 figures, 1 table. Submitted to the ML4ITS workshop at the ECML PKDD 2025 conference. Awarded 2nd place in the final round of the Spacecraft Anomaly Challenge on ESA dataset. (Ranked 1st on the Kaggle public leaderboard and 3rd on the private leaderboard)

Journal ref Communications in Computer and Information Science 2842 (2026) Chapter 7

Details
Abstract

A hierarchical ensemble pipeline is introduced to address anomaly detection in multivariate telemetry data provided by the European Space Agency (ESA). The method integrates shapelet-based and statistical feature extraction, per-channel modeling, intra-channel stacking, and a final cross-channel aggregation. The pipeline is trained and validated using time-series cross-validation and two-level masking strategies to prevent information leakage. Results on the European Space Agency Anomaly Detection Benchmark (ESA-ADB) challenge demonstrate strong generalization, highlighting the effectiveness of hierarchical modeling in detecting subtle anomalies in realistic satellite telemetry.
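
The abstract does not say which split scheme the authors use. One common form of time-series cross-validation is an expanding window, where every validation fold lies strictly after its training data so that no future samples leak into training. A minimal sketch under that assumption, with a hypothetical function name and parameters:

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_folds: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) pairs; validation always follows training."""
    fold_size = n_samples // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any remainder so every sample is used.
        val_end = n_samples if k == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))
```

For example, `list(expanding_window_splits(10, 4))` starts with training indices `[0, 1]` and validation indices `[2, 3]`, and each later fold extends the training window by one block.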

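2605.05831 2026-05-12 cs.CV Version update

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

Affiliations * IIIT Hyderabad; Microsoft Research India & IIT Hyderabad

AI summary Scientific communication is becoming increasingly multimodal: papers, slides, videos, and other materials jointly convey research results, yet they are rarely linked in a structured way. This paper introduces the Multimodal Conference Dataset (MCD), the first dataset to integrate research papers, presentation videos, explanatory videos, and slides, and evaluates a range of embedding-based and vision-language models on fine-grained cross-format correspondence. Vision-language models are robust overall but fall short on fine-grained alignment, while embedding-based models capture text-visual correspondences well but place equations and symbolic content in distinct clusters, which points to directions for future research in multimodal scientific understanding.

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings Track, 2026

Details
Abstract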

The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD

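2605.05072 2026-05-12 cs.CV Version update

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

Yuan Wu, Zhiqiang Yan, Jiawei Lian, Zhengxue Wang, Jian Yang

Affiliations * Nanjing University of Science and Technology; National University of Singapore

AI summary This paper studies accurate 3D occupancy prediction from camera and LiDAR data, focusing on the problem that conventional methods sample a fixed projection space and therefore adapt poorly to the height variation and sparsity of real scenes. The authors propose HiPR, a framework that uses a LiDAR-derived height prior to adjust the sampling range of each pillar dynamically, so that projected points fall in geometrically meaningful regions. Experiments show that HiPR clearly outperforms existing state-of-the-art methods while maintaining real-time inference.

Details
Abstract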

3D occupancy prediction aims to infer dense, voxel-wise scene semantics from sensor observations, where the 2D-to-3D view transformation serves as a crucial step in bridging image features and volumetric representations. Most previous methods rely on a fixed projection space, where 3D reference points are uniformly sampled along pillars. However, such sampling struggles to capture the sparsity and height variations of real-world scenes, leading to ambiguous correspondences and unreliable feature aggregation. To address these challenges, we propose HiPR, a camera-LiDAR occupancy framework with Height-Guided Projection Reparameterization. HiPR first encodes LiDAR into a BEV height map to capture the maximum height of the point cloud. HiPR then adjusts the sampling range of each pillar using the height prior, enabling adaptive reparameterization of the projection space. As a result, the projected points are redistributed into geometrically meaningful regions rather than fixed ranges. Meanwhile, we mask out the invalid parts of the height map to avoid misleading the feature aggregation. In addition, to alleviate the training instability caused by noisy LiDAR-derived heights, we introduce a training-time Progressive Height Conditioning strategy, which gradually transitions the conditioning signal from ground-truth heights to LiDAR heights. Extensive experiments demonstrate that HiPR consistently outperforms existing state-of-the-art methods while maintaining real-time inference. The code and pretrained models can be found at https://github.com/yanzq95/HiPR.
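
To make the idea of adjusting each pillar's sampling range concrete, the sketch below contrasts uniform sampling over a fixed range with sampling clamped to a height prior. It is a simplified assumption for illustration, not the HiPR implementation; the function name, the parameters, and the choice to clamp only the upper bound are all hypothetical.

```python
from typing import List, Optional


def pillar_sample_heights(
    z_min: float,
    z_max: float,
    num_points: int,
    height_prior: Optional[float] = None,
) -> List[float]:
    """Return reference-point heights along one pillar.

    Without a prior, points are spread uniformly over [z_min, z_max] (a
    fixed projection space). With a prior, the upper bound is clamped to
    the observed maximum height, so the same number of points covers a
    tighter range. Masked (invalid) pillars should pass height_prior=None.
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    top = z_max
    if height_prior is not None:
        # Keep the adjusted bound inside the valid vertical range.
        top = min(max(height_prior, z_min), z_max)
    step = (top - z_min) / (num_points - 1)
    return [z_min + i * step for i in range(num_points)]
```

With `z_min=-1.0`, `z_max=5.0`, and four points, the fixed range gives heights `[-1.0, 1.0, 3.0, 5.0]`, while a height prior of `2.0` gives `[-1.0, 0.0, 1.0, 2.0]`, so no reference points are spent above the observed surface.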

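2605.05045 2026-05-12 cs.CV cs.CL Version update

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson, Vijaykrishnan Narayanan

Affiliations * The Pennsylvania State University

AI summary This study analyzes relation hallucination in vision-language models under visual perturbations such as rotation and noise, and shows that even mild image distortions significantly degrade reasoning about relations between objects. It evaluates several prompt-based augmentation and preprocessing strategies and finds that they partially mitigate the problem but cannot eliminate relation hallucination. The results indicate a remaining gap between perceptual robustness and relational understanding and motivate more geometry-aware vision-language models.

Details
Abstract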

Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, since avoiding it requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.

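2605.03652 2026-05-12 cs.CV cs.AI Version update

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

Affiliations * Tencent HY Team

AI summary This paper presents AniMatrix, a video generation model designed specifically for anime's artistic style rather than relying on physical realism as its prior. Through a dual-channel conditioning mechanism and a three-step transition strategy, the model redefines what counts as correct, overrides the dependence on physical laws, and separates artistic expression from generation failure. In an evaluation scored by professional animators, AniMatrix performs strongly, with the largest gains over existing models on prompt understanding and artistic motion.

Comments 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released

Details
Abstract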

Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.

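2605.03438 2026-05-12 cs.CV Version update

Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

Zihao Guo, Jihua Zhu, Jian Liu, Ajmal Saeed Mian

Affiliations * Xi’an Jiaotong University; School of Artificial Intelligence and Robotics, Hunan University; University of Western Australia

AI summary This paper proposes Mantis, a parameter-efficient fine-tuning framework designed for Mamba-based 3D point cloud foundation models. It introduces a State-Aware Adapter (SAA) that enables fine-grained state-level adaptation while the pre-trained backbone stays frozen, and uses Dual-Serialization Consistency Distillation (DSCD) to reduce serialization-induced instability. Experiments show that Mantis achieves competitive performance on multiple benchmarks with only about 5% trainable parameters.

Details
Abstract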

Pre-trained 3D point cloud foundation models (PFMs) have demonstrated strong transferability across diverse downstream tasks. However, fully fine-tuning these models is computationally expensive and storage-intensive. Parameter-efficient fine-tuning (PEFT) offers a promising alternative, but existing PEFT approaches are primarily designed for Transformer-based backbones and rely on token-level prompting or feature transformation. Mamba-based backbones introduce a granularity mismatch between token-level adaptation and state-level sequence dynamics. Consequently, straightforward transfer of existing PEFT approaches to frozen Mamba backbones leads to substantial accuracy degradation and unstable optimization. To address this issue, we propose Mantis, the first Mamba-native PEFT framework for 3D PFMs. Specifically, a State-Aware Adapter (SAA) is introduced to inject lightweight task-conditioned control signals into selective state-space updates, enabling state-level adaptation while keeping the pre-trained backbone frozen. Moreover, different valid point cloud serializations are regularized by Dual-Serialization Consistency Distillation (DSCD), thereby reducing serialization-induced instability. Extensive experiments across multiple benchmarks demonstrate that Mantis achieves competitive performance with only about 5% trainable parameters. Our code is available at https://github.com/gzhhhhhhh/Mantis.

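2605.01402 2026-05-12 cs.CL cs.CV cs.LG Version update

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Yao Du, Shanshan Song, Xiaomeng Li

Affiliations * The Hong Kong University of Science and Technology

AI summary Multimodal large language models (MLLMs) perform poorly on numerical regression with long-tailed target distributions; existing token-level supervised fine-tuning is biased toward high-density regions, which leads to regression to the mean and degraded tail performance. This paper proposes a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization that introduces a reward based on the Concordance Correlation Coefficient, providing cross-sample comparative supervision at the batch level so that predictions align with the ground-truth distribution in correlation, scale, and mean. The method requires no architectural changes, and experiments show that it outperforms conventional fine-tuning on several long-tailed regression benchmarks, especially in medium- and few-shot regimes.

Comments Accepted by ICML 2026

Details
Abstract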

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
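
For readers unfamiliar with the Concordance Correlation Coefficient (CCC), the sketch below computes Lin's standard definition over one batch using population moments. It only illustrates why CCC is sensitive to correlation, scale, and mean at once; how the paper turns it into a GRPO reward is not stated in the abstract, and the function name is hypothetical.

```python
from statistics import fmean
from typing import Sequence


def concordance_correlation(
    pred: Sequence[float], target: Sequence[float]
) -> float:
    """Lin's concordance correlation coefficient for one batch.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    computed with population (1/n) moments.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = fmean(pred), fmean(target)
    var_x = fmean([(x - mx) ** 2 for x in pred])
    var_y = fmean([(y - my) ** 2 for y in target])
    cov = fmean([(x - mx) * (y - my) for x, y in zip(pred, target)])
    denom = var_x + var_y + (mx - my) ** 2
    # Both inputs constant and equal: treat as uninformative rather than divide by zero.
    return 0.0 if denom == 0.0 else 2.0 * cov / denom
```

A batch of predictions that all collapse to the target mean scores 0 even though its mean error is small, which is the regression-to-the-mean behaviour a point-wise reward would not penalise as strongly.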

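2605.00642 2026-05-12 cs.AI cs.CV Version update

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen, Can Ma, Yu Zhou

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; VCIP & TMCC & DISSec, College of Computer Science, Nankai University; School of Cyber Security, University of Chinese Academy of Sciences

AI summary This paper proposes GUI-SD, the first on-policy self-distillation (OPSD) framework for GUI grounding, which targets the training inefficiency and sparse signals of existing reinforcement learning methods. By constructing a visually enriched privileged context and introducing an entropy-guided distillation strategy, it obtains dense supervision from a single rollout and improves both grounding accuracy and training efficiency. Experiments show that GUI-SD outperforms existing methods on six representative benchmarks.

Comments under review

Details
Abstract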

Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at https://zhangyan-ucas.github.io/GUI-SD/.
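
The abstract mentions a Gaussian soft mask built from the target bounding box but gives no formula. One plausible reading, sketched below as an assumption, is a 2D Gaussian centred on the box whose spread follows the box size, so the teacher sees where the target roughly is without receiving exact coordinates. The function name and the `sigma_scale` parameter are hypothetical.

```python
import math
from typing import List, Tuple


def gaussian_soft_mask(
    width: int,
    height: int,
    bbox: Tuple[float, float, float, float],
    sigma_scale: float = 0.5,
) -> List[List[float]]:
    """Return a height x width mask with values in [0, 1], peaking at the box centre.

    bbox is (x0, y0, x1, y1) in pixel coordinates. The standard deviation
    on each axis is sigma_scale times the box extent (an assumed choice).
    """
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)  # guard against zero-width boxes
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    return [
        [
            math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]
```

The mask can then be blended with the screenshot before it is shown to the teacher; the blending step is omitted here because the abstract does not describe it.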

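2605.00548 2026-05-12 cs.CV cs.GR Version update

Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation

Nadav Z. Cohen, Ofir Abramovich, Ariel Shamir

Affiliations * Reichman University

AI summary This paper studies the properties of the input noise in diffusion models and finds that the low-frequency components of white noise mainly determine an image's global structure and color composition, while the high-frequency components control fine details. Building on this, the authors propose a training-free low-frequency noise manipulation method that steers generation through simple operations on the low-frequency noise, giving effective control over overall structure and color while preserving output diversity.

Comments SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/

Details
Abstract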

Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image's global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.
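
The frequency split described in the abstract can be illustrated on a 1D signal. The sketch below swaps the low-frequency DFT bins of a noise vector for those of a prior signal and keeps the noise's high-frequency bins. It is an assumption-laden toy, not the paper's method: the real approach works on 2D latent noise and the abstract does not specify its exact manipulation or scaling. All function names and the `cutoff` parameter are hypothetical.

```python
import cmath
import random
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, adequate for short signals."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spec: List[complex]) -> List[float]:
    """Inverse DFT that keeps only the real part of each sample."""
    n = len(spec)
    return [
        (sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def inject_low_frequencies(
    noise: List[float], prior: List[float], cutoff: int
) -> List[float]:
    """Replace noise bins with |frequency| <= cutoff by the prior's bins."""
    n = len(noise)
    if len(prior) != n:
        raise ValueError("noise and prior must have the same length")
    noise_spec, prior_spec = dft(noise), dft(prior)
    # Bins k and n-k are treated together so the mixed spectrum stays
    # conjugate-symmetric and the output signal stays real.
    mixed = [
        prior_spec[k] if min(k, n - k) <= cutoff else noise_spec[k]
        for k in range(n)
    ]
    return idft_real(mixed)
```

A typical call would draw `noise` with `random.gauss(0.0, 1.0)` per sample and pass a smooth signal as `prior`; the result follows the prior's coarse shape while its fine fluctuations still come from the noise.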

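2605.00408 2026-05-12 cs.CV Version update

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang, Wenjie Pei

Affiliations * Pengcheng Laboratory, Shenzhen; Harbin Institute of Technology, Shenzhen

AI summary This paper proposes LeGS, a learnable density control method for 3D Gaussian Splatting (3DGS) that removes the reliance on heuristic density control rules. It models density control as a parameterized policy network optimized with reinforcement learning and designs a reward function grounded in sensitivity analysis that quantifies the contribution of individual Gaussians to reconstruction quality. Experiments show that LeGS significantly outperforms existing methods on multiple datasets and achieves a better balance between reconstruction quality and computational efficiency.

Comments 9 pages, 5 figures

Details
Abstract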

While 3D Gaussian Splatting (3DGS) has demonstrated impressive real-time rendering performance, its efficacy remains constrained by a reliance on heuristic density control. Despite numerous refinements to these handcrafted rules, such methods inherently lack the flexibility to adapt to diverse scenes with complex geometries. In this paper, we propose a paradigm shift for density control from rigid heuristics to fully learnable policies. Specifically, we introduce LeGS, a framework that reformulates density control as a parameterized policy network optimized via Reinforcement Learning (RL). Central to our approach is a tailored reward function grounded in sensitivity analysis, which precisely quantifies the marginal contribution of individual Gaussians to reconstruction quality. To maintain computational tractability, we derive a closed-form solution that reduces the complexity of reward calculation from $O(N^2)$ to $O(N)$. Extensive experiments on the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that LeGS significantly outperforms state-of-the-art methods, striking a superior balance between reconstruction quality and efficiency. The code will be released at https://github.com/AaronNZH/LeGS

2604.24954 2026-05-12 cs.LG cs.AI cs.CV Version update

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas Voegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye, Yi Dong, Mingjie Liu, Yifan Peng, Piotr Zelasko, Zhehuai Chen, Nithin Rao Koluguri, Nune Tadevosyan, Lilit Grigoryan, Ehsan Hosseini Asl, Pritam Biswas, Leili Tavabi, Yuanhang Su, Zhiding Yu, Peter Jin, Alexandre Milesi, Netanel Haber, Yao Xu, Sarah Amiraslani, Nabin Mulepati, Eric Tramel, Jaehun Jung, Ximing Lu, Brandon Cui, Jin Xu, Zhiqi Li, Shihao Wang, Yuanguo Kuang, Shaokun Zhang, Huck Yang, Boyi Li, Hongxu Yin, Song Han, Bilal Kartal, Pavlo Molchanov, Adi Renduchintala, Charles Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Sreyan Ghosh, Yian Zhang, Alexander Bukharin, Venkat Srinivasan, Johnny Greco, Andre Manoel, Maarten Van Segbroeck, Suseella Panguliri, Rohit Watve, Divyanshu Kakwani, Shubham Pachori, Jeffrey Glick, Radha Sri-Tharan, Aileen Zaman, Khanh Nguyen, Shi Chen, Jiaheng Fang, Qing Miao, Wenfei Zhou, Yu Wang, Zaid Pervaiz Bhat, Varun Praveen, Arihant Jain, Ramanathan Arunachalam, Tomasz Kornuta, Ashton Sharabiani, Amy Shen, Wei Huang, Yi-Fu Wu, Ali Roshan Ghias, Huiying Li, Brian Yu, Nima Tajbakhsh, Chen Cui, Wenwen Gao, Li Ding, Terry Kong, Manoj Kilaru, Anahita Bhiwandiwalla, Marek Wawrzos, Daniel Korzekwa, Pablo Ribalta, Grzegorz Chlebus, Besmira Nushi, Ewa Dobrowolska, Maciej Jakub Mikulski, Kunal Dhawan, Steve Huang, Jagadeesh Balam, Yongqiang Wang, Nikolay Karpov, Valentin Mendelev, George Zelenfroynd, Meline Mkrtchyan, Qing Miao, Omri Almog, Bhavesh Pawar, Rameshwar Shivbhakta, Sudeep Sabnis, Ashrton Sharabiani, Negar Habibi, Geethapriya Venkataramani, Pamela Peng, Prerit Rodney, Serge Panev, Richard Mazzarese, Nicky Liu, Michael Fukuyama, Andrii Skliar, Roger Waleffe, Duncan Riach, Yunheng 
Zou, Jian Hu, Hao Zhang, Binfeng Xu, Yuhao Yang, Zuhair Ahmed, Alexandre Milesi, Carlo del Mundo, Chad Voegele, Zhiyu Cheng, Nave Assaf, Andrii Skliar, Daniel Afrimi, Natan Bagrov, Ran Zilberstein, Ofri Masad, Eugene Khvedchenia, Natan Bagrov, Borys Tymchenko, Tomer Asida, Daniel Afrimi, Parth Mannan, Victor Cui, Michael Evans, Katherine Luna, Jie Lou, Pinky Xu, Guyue Huang, Negar Habibi, Michael Boone, Pradeep Thalasta, Adeola Adesoba, Dina Yared, Christopher Parisien, Leon Derczynski, Shaona Ghosh, Wes Feely, Micah Schaffer, Radha Sri-Tharan, Jeffrey Glick, Barnaby Simkin, George Zelenfroynd, Tomasz Grzegorzek, Rishabh Garg, Aastha Jhunjhunwala, Sergei Kolchenko, Farzan Memarian, Haran Kumar, Shiv Kumar, Isabel Hulseman, Anjali Shah, Kari Briski, Padmavathy Subramanian, Joey Conway, Udi Karpas, Jane Polak Scowcroft, Annie Surla, Shilpa Ammireddy, Ellie Evans, Jesse Oliver, Tom Balough, Chia-Chih Chen, Sandip Bhaskar, Alejandra Rico, Bardiya Sadeghi, Seph Mard, Katherine Cheung, Meredith Price, Laya Sleiman, Saori Kaji, Wesley Helmholz, Wendy Quan, Michael Lightstone, Jonathan Cohen, Jian Zhang, Oleksii Kuchaiev, Boris Ginsburg, Jan Kautz, Eileen Long, Mohammad Shoeybi, Mostofa Patwary, Oluwatobi Olabiyi, Andrew Tao, Bryan Catanzaro, Udi Karpas

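Affiliations * NVIDIA

AI summary This paper introduces Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio input alongside text, images, and video. With improvements in architecture, training data, and training recipes, the model is more accurate across modalities, with particularly strong results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the efficient Nemotron 3 Nano 30B-A3B backbone, it adds multimodal token-reduction techniques that substantially lower inference latency and raise throughput, and the release includes model weights in several precision formats plus portions of the training data and code.

Details
Abstract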

We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.

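2604.19923 2026-05-12 cs.CV Version update

UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video

Tanuj Sur, Shashank Tripathi, Nikos Athanasiou, Ha Linh Nguyen, Kai Xu, Michael J. Black, Angela Yao

Affiliations * National University of Singapore; Max Planck Institute for Intelligent Systems

AI summary This paper proposes UniCon3R, a unified feed-forward framework for online 4D human-scene reconstruction from monocular video. It explicitly models contact between the human and the scene and uses that contact as a corrective cue to improve human mesh reconstruction, recovering scene geometry and aligned 4D humans while keeping inference fast. Experiments show that UniCon3R outperforms existing methods on physical plausibility and human motion estimation, supporting contact as a strong prior for joint reconstruction.

Comments Project page: https://surtantheta.github.io/UniCon3R

Details
Abstract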

We introduce UniCon3R, a unified feed-forward framework for online human-scene 4D reconstruction from monocular video. Current feed-forward human-scene reconstruction methods suffer from artifacts, where bodies float above the ground or penetrate parts of the scene. A key reason is the lack of effective interaction modelling between the human and the environment. Our goal is to exploit contact between the human and the scene during inference to actively improve the human mesh reconstruction. To that end, we explicitly model interaction by inferring 4D contact from the human pose and scene geometry and use the contact as a corrective cue for generating the pose. This enables UniCon3R to jointly recover scene geometry and spatially aligned 4D humans within the scene. Experiments on standard human-centric video benchmarks show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while preserving fast, feed-forward inference speeds. The results validate our central claim: contact serves as a powerful internal prior, thus establishing a new paradigm for physically grounded joint human-scene reconstruction. The project page is available at https://surtantheta.github.io/UniCon3R.

2604.19748 2026-05-12 cs.CV updated version

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng

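Affiliations * Alibaba Group; Taobao App

AI summary Tstars-Tryon 1.0 is an efficient, realistic, and robust virtual try-on system that handles challenging real-world conditions such as extreme poses, illumination changes, and motion blur. It supports multiple garment categories and flexible combinations of multiple reference images, producing high-quality images with fine detail and realistic materials while avoiding common AI-generated artifacts. With an end-to-end model architecture and optimized inference speed, the system achieves near real-time generation and has been deployed at scale in the Taobao App, serving millions of users.

Comments 24 pages, model evaluation report

Details
English abstract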

Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.

2604.14125 2026-05-12 cs.CV cs.AI cs.RO updated version

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

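Affiliations * The University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; The Chinese University of Hong Kong

AI summary This paper proposes HiVLA, a visual-grounded-centric hierarchical manipulation system that addresses the loss of base vision-language model reasoning ability when end-to-end vision-language-action models are fine-tuned on fine-grained control data. The method decouples high-level semantic planning from low-level motor control: a vision-language model performs task decomposition and visual grounding to produce structured plans, and a Diffusion Transformer with cascaded cross-attention executes precise actions, preserving high-level reasoning while improving manipulation precision. Experiments show that HiVLA significantly outperforms existing end-to-end methods on long-horizon skill composition and fine-grained manipulation in complex scenes.

Comments Project Page: https://tianshuoy.github.io/HiVLA-page/

Details
English abstract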

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce in the low-level part a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.

2604.11808 2026-05-12 cs.CV updated version

Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

Xingjian Ran, Shujie Zhang, Weipeng Zhong, Li Luo, Bo Dai

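Affiliations * The University of Hong Kong; Tsinghua University; Shanghai Jiao Tong University; Shenzhen Loop Area Institute

AI summary Generating high-fidelity indoor 3D scenes remains a major challenge because data are scarce and complex spatial relations are hard to model. This paper proposes Pair2Scene, a procedural scene generation framework based on learned local object relations that combines local rules, scene hierarchies, and physics-based algorithms to capture two key kinds of inter-object interaction: support relations and functional relations. The model is trained on a self-built 3D-Pairs dataset; at inference it is applied recursively with collision-aware rejection sampling to generate complex scenes that are physically and semantically plausible, clearly outperforming existing methods.

Comments ICML 2026

Details
English abstract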

Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond the training distribution to dense scenes or rely on LLMs/VLMs that lack the ability for precise spatial reasoning. Building on the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on the position and geometry of the anchor objects. Accordingly, we curate a dataset, 3D-Pairs, from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.
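
The collision-aware rejection sampling step mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 2D `Box` footprint, the Gaussian `sample_offset` standing in for the learned pairwise position distribution, and the `max_tries` cutoff are hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """Axis-aligned 2D footprint: center (x, y), width w, depth d."""
    x: float
    y: float
    w: float
    d: float


def overlaps(a: Box, b: Box) -> bool:
    # Boxes collide only if their extents overlap on both axes.
    return abs(a.x - b.x) * 2 < a.w + b.w and abs(a.y - b.y) * 2 < a.d + b.d


def sample_offset(rng: random.Random) -> Tuple[float, float]:
    # Hypothetical stand-in for the learned pairwise position distribution.
    return rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)


def place(anchor: Box, size: Tuple[float, float], placed: List[Box],
          rng: random.Random, max_tries: int = 200) -> Optional[Box]:
    """Draw positions relative to the anchor until one is collision-free."""
    for _ in range(max_tries):
        dx, dy = sample_offset(rng)
        cand = Box(anchor.x + dx, anchor.y + dy, size[0], size[1])
        if not any(overlaps(cand, other) for other in placed):
            return cand
    return None  # give up; the caller may skip or resample the object
```

Calling `place` once per dependent object and appending each accepted box to `placed` yields a layout in which no two footprints overlap; a hierarchical generator would repeat this with each placed object acting as the next anchor.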

2604.07780 2026-05-12 eess.IV cs.CV updated version

MonoUNet: A Robust Tiny Neural Network for Automated Knee Cartilage Segmentation on Point-of-Care Ultrasound Devices

Alvin Kimbowa, Arjun Parmar, Ibrahim Mujtaba, Will Wei, Maziar Badii, Matthew Harkey, David Liu, Ilker Hacihaliloglu

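Affiliations * School of Biomedical Engineering, The University of British Columbia; Department of Kinesiology, Michigan State University; Department of Rheumatology, The University of British Columbia; Department of Radiology, The University of British Columbia; Department of Medicine, The University of British Columbia

AI summary This study proposes MonoUNet, a lightweight deep learning model for automated knee cartilage segmentation on portable ultrasound devices. The model introduces a trainable monogenic block that extracts multi-scale local phase features and combines it with a gating mechanism to improve robustness to variations in ultrasound image appearance, while substantially reducing parameter count and computational cost. Experiments show that MonoUNet achieves strong segmentation performance on a multi-site, multi-device dataset, with Dice scores from 92.62% to 94.82%, and high agreement and reliability relative to manual measurements.

Comments 17 pages, 4 figures. Published in Ultrasound in Medicine & Biology (2026)

Journal ref Ultrasound in Medicine & Biology, 2026, ISSN 0301-5629

Details
English abstract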

Objective: To develop a robust and compact deep learning model for automated knee cartilage segmentation on point-of-care ultrasound (POCUS) devices. Methods: We propose MonoUNet, a novel, highly compact segmentation model consisting of (i) an aggressively reduced U-Net backbone, (ii) a trainable monogenic block that extracts multi-scale local phase features from the input, and (iii) a gating mechanism that injects these features into the encoder stages to reduce sensitivity to variations in ultrasound image appearance. MonoUNet segmentation performance was evaluated on a multi-site, multi-device knee cartilage ultrasound dataset using Dice score and mean average surface distance (MASD). Agreement between MonoUNet and manual cartilage outcomes (thickness and echo intensity) was assessed using Bland-Altman analysis with 95% limits of agreement, and reliability was assessed using the intraclass correlation coefficient (ICC$_{2,k}$). Results: Overall, MonoUNet outperformed existing lightweight segmentation models, with average Dice scores ranging from 92.62% to 94.82% and MASD values between 0.133 mm and 0.254 mm. MonoUNet reduces the number of parameters by 10x to 700x and computational cost by 14x to 2000x relative to existing lightweight models. MonoUNet cartilage outcomes showed excellent reliability and agreement with the manual outcomes: ICC$_{2,k}$ = 0.96 and bias = 2.00% (0.047 mm) for average thickness, and ICC$_{2,k}$ = 0.99 and bias = 0.80% (0.328 a.u.) for echo intensity. Conclusion: Incorporating trainable local phase features improves the robustness of highly compact neural networks for knee cartilage segmentation across varying acquisition settings and could support scalable ultrasound-based assessment and monitoring of knee osteoarthritis using POCUS devices. The code is publicly available at https://github.com/alvinkimbowa/monounet.
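
For readers unfamiliar with two of the evaluation quantities named above, the following is a minimal sketch of the Dice score and of Bland-Altman bias with 95% limits of agreement. The function names and the toy inputs are illustrative assumptions; this is not the authors' evaluation code, and MASD and ICC are not covered.

```python
from statistics import mean, stdev


def dice_score(pred, target):
    """Dice = 2|P & T| / (|P| + |T|) for binary masks given as nested 0/1 lists."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * inter / total


def bland_altman(auto, manual):
    """Return (bias, lower, upper): mean difference and 95% limits of agreement."""
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread
```

The bias is the mean difference between automated and manual measurements, and the limits of agreement bracket the range in which about 95% of differences are expected to fall if the differences are roughly normal.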

2604.03928 2026-05-12 cs.LG cs.AI cs.CV stat.ML updated version

Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look

Indar Kumar, Girish Karhana, Sai Krishna Jasti, Ankit Hemant Lade

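Affiliations * Independent researchers

AI summary This paper revisits the effectiveness of supervised dimensionality reduction, in particular Linear Discriminant Analysis (LDA), applied to frozen pretrained convolutional neural network features. The study compares several dimensionality-reduction strategies across multiple vision tasks and finds that LDA markedly improves classification accuracy and greatly reduces feature dimensionality on coarse-grained classification tasks, but performs worse on fine-grained tasks. The experiments indicate that LDA works well when inter-class structure is pronounced and can backfire on tasks that require subtle distinctions, giving practical guidance for dimensionality reduction in frozen-feature classification pipelines.

Comments 11 pages, 5 figures, 5 tables. Code available at https://github.com/IndarKarhana/lda-image-classification

Details
English abstract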

Frozen pretrained image representations are widely used for transfer learning: a backbone is kept fixed, feature vectors are extracted, and a lightweight classifier is trained on top. This pipeline usually feeds the full feature vector to the classifier, even when the target task has far fewer classes than the pretraining task. We revisit a classical alternative: supervised dimensionality reduction with Linear Discriminant Analysis (LDA) before linear probing. We evaluate ten dimensionality-reduction strategies on frozen features from six backbones -- ResNet-18, ResNet-50, MobileNetV3-Small, EfficientNet-B0, ViT-B/16, and DINOv2-ViT-S/14 -- across CIFAR-100, Tiny ImageNet, and CUB-200-2011. Under a fixed logistic-regression protocol, LDA improves accuracy over full features in 11 of 12 coarse-grained configurations, with gains up to 4.5 percentage points while reducing feature dimensionality by 48-87%. The same projection consistently hurts on fine-grained CUB-200, where full features win across all six backbones. This establishes a practical boundary condition: LDA is useful when class-level structure is coarse enough to be captured by mean-separating directions, but it can discard subtle cues needed for fine-grained recognition. We also compare LDA with PCA, PCA+LDA, regularized LDA, Local Fisher Discriminant Analysis, Neighbourhood Components Analysis, and three lightweight LDA extensions. The results show that plain LDA offers the best accuracy-cost tradeoff for most coarse-grained settings, while more complex supervised reduction methods rarely justify their additional cost. Overall, the study provides concrete guidance for when post-hoc supervised projection should, and should not, be inserted into frozen-feature image classification pipelines.
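
The idea of projecting onto mean-separating directions can be seen in the smallest possible case. The sketch below is a two-class, two-dimensional Fisher discriminant written in plain Python; the toy data, the function names, and the small `ridge` term are assumptions for illustration, and the paper's multi-class pipeline on CNN features is not reproduced here.

```python
def mean2(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter(points, mu):
    # 2x2 within-class scatter: sum of outer products of centered points.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class0, class1, ridge=1e-6):
    """Return w = Sw^{-1} (m1 - m0), the two-class LDA projection direction."""
    m0, m1 = mean2(class0), mean2(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    a = s0[0][0] + s1[0][0] + ridge  # ridge keeps Sw invertible
    b = s0[0][1] + s1[0][1]
    c = s0[1][0] + s1[1][0]
    d = s0[1][1] + s1[1][1] + ridge
    det = a * d - b * c
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    # Closed-form inverse of the 2x2 matrix [[a, b], [c, d]] applied to dm.
    return [(d * dm[0] - b * dm[1]) / det, (-c * dm[0] + a * dm[1]) / det]


def project(points, w):
    return [p[0] * w[0] + p[1] * w[1] for p in points]
```

With C classes, LDA yields at most C - 1 discriminant directions, which is where the dimensionality reduction comes from; here two classes collapse two features into a single score that a linear probe could then threshold.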

2604.01824 2026-05-12 cs.CV updated version

STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, Mohsen Fayyaz

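Affiliations * University of Bonn; Microsoft; Meta; Lamarr Institute for Machine Learning and Artificial Intelligence

AI summary STRIVE is a structured spatiotemporal reinforcement learning framework for video question answering that targets the low reward variance and unstable policy updates of existing methods. It constructs multiple spatiotemporal variants of each input video and normalizes jointly across text generations and visual variants, which enriches the reward signal and stabilizes policy updates. STRIVE also introduces an importance-aware sampling mechanism that keeps exploration semantically relevant while preserving temporal coverage; experiments show that it outperforms existing reinforcement learning methods on multiple video reasoning benchmarks.

Details
English abstract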

We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
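
A small sketch can show why normalizing jointly across visual variants helps when one variant's responses all receive the same reward. The exact normalization STRIVE uses is not given in the abstract; the mean and standard deviation form below, the `eps` constant, and both function names are assumptions.

```python
from statistics import mean, pstdev


def per_variant_advantages(rewards, eps=1e-6):
    """Normalize each variant's group of generations separately."""
    out = []
    for row in rewards:
        mu, sigma = mean(row), pstdev(row)
        out.append([(r - mu) / (sigma + eps) for r in row])
    return out


def joint_advantages(rewards, eps=1e-6):
    """Normalize over all generations of all variants at once.

    rewards[v][g] is the reward of generation g on visual variant v.
    """
    flat = [r for row in rewards for r in row]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in row] for row in rewards]
```

For `rewards = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]`, per-variant normalization gives the first variant zero advantage everywhere because its rewards are identical, while joint normalization assigns those correct responses a positive advantage relative to the pooled mean.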

2603.25074 2026-05-12 cs.CV updated version

Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers

Nanxiang Jiang, Zhaoxin Fan, Baisen Wang, Daiheng Gao, Junhang Cheng, Jifeng Guo, Yalan Qin, Yeying Jin, Hongwei Zheng, Faguo Wu, Wenjun Wu

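Affiliations * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing; School of Artificial Intelligence, Beihang University; University of Science; Shanghai University; Tencent; Beijing Academy of Blockchain

AI summary Z-Erase is a concept erasure method designed for single-stream diffusion transformers (such as Z-Image) that aims to safely remove unwanted concepts from text-to-image models. It proposes a Stream Disentangled Concept Erasure Framework and a Lagrangian-Guided Adaptive Erasure Modulation algorithm, which resolve the generation collapse caused by directly applying conventional erasure methods to single-stream models, and it achieves state-of-the-art performance across multiple tasks.

Details
English abstract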

Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recently emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods to run on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.

2603.16869 2026-05-12 cs.CV updated version

SegviGen: Repurposing 3D Generative Model for Part Segmentation

Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng

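Affiliations * Renmin University of China; Tsinghua University; Beihang University; Beijing Jiaotong University; Bambu Lab

AI summary This paper proposes SegviGen, a framework that repurposes pretrained 3D generative models for efficient 3D part segmentation. The method exploits the structural priors encoded in the generative model and guides segmentation through a distinctive part colorization strategy, avoiding the cross-view inconsistency and blurred boundaries of conventional approaches. Experiments show that SegviGen improves over the prior best methods by 40% on interactive segmentation and 15% on full segmentation while using only a very small amount of labeled data, demonstrating how well pretrained 3D generative models transfer to part segmentation.

Comments Project page: https://fenghora.github.io/SegviGen-Page/

Details
English abstract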

We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in a pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.
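
One way to turn part-indicative colors back into discrete part labels is nearest-color assignment against a fixed palette. The sketch below assumes this decoding rule, RGB values in [0, 1], and a hypothetical `colors_to_parts` helper; the abstract does not state how SegviGen converts predicted colors into labels.

```python
def colors_to_parts(voxel_colors, palette):
    """Assign each voxel the index of the nearest palette color (squared RGB distance)."""
    def nearest(color):
        return min(
            range(len(palette)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(color, palette[i])),
        )
    return [nearest(c) for c in voxel_colors]
```

With a palette of pure red, green, and blue, slightly noisy predicted colors still map to the intended part index, which is the property that makes colorization usable as a segmentation signal.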

2603.12800 2026-05-12 eess.IV cs.CV updated version

GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification

Jiao Wang, Chi Liu, Yiying Zhang, Hongchen Luo, Zhifen Guo, Ying Hu, Ke Xu, Jing Zhou, Hongyan Xu, Ruiting Zhou, Man Tang

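Affiliations * Department of Ophthalmology, Shenyang Fourth People’s Hospital

AI summary This paper presents GLEAM, a publicly available glaucoma dataset covering three imaging modalities: scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, which supports combined use of multimodal information for accurate diagnosis. To integrate cross-modal information effectively, the study proposes hierarchical attentive masked modeling (HAMM), which uses hierarchical attentive encoders and light decoders to focus on cross-modal representation learning and improve glaucoma classification accuracy. The work offers a new approach and a practical tool for multimodal medical image analysis.

Details
English abstract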

We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder.

2603.07686 2026-05-12 cs.RO cs.CV updated version

UniUncer: Unified Dynamic Static Uncertainty for End to End Driving

Yu Gao, Jijun Wang, Zongzheng Zhang, Anqing Jiang, Yiru Wang, Yuwen Heng, Shuo Wang, Hao Sun, Zhangfeng Hu, Hao Zhao

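Affiliations * Bosch Corporate Research

AI summary This paper proposes UniUncer, a unified dynamic-static uncertainty framework for end-to-end autonomous driving that aims to improve the system's ability to perceive and respond to environmental uncertainty. The method converts deterministic models into probabilistic regressors and introduces an uncertainty-fusion module and an uncertainty-aware gating mechanism, jointly modeling and using the uncertainty of static map elements and dynamic traffic participants. Experiments show that UniUncer improves trajectory prediction and driving-decision performance on multiple benchmarks with minimal computational overhead.

Comments Accepted ICRA 2026

Details
English abstract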

End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only $\sim$0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8%, with notable stage-two gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
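
A Laplace regressor that outputs a location and a scale per vertex is commonly trained with the Laplace negative log-likelihood, log(2b) + |y - mu| / b. The sketch below shows that loss in plain Python; whether UniUncer uses exactly this objective, and the log-scale parameterization, are assumptions, and `laplace_nll` is a hypothetical name.

```python
import math


def laplace_nll(pred_loc, pred_log_scale, target):
    """Mean Laplace negative log-likelihood over polyline vertices.

    Each vertex carries an (x, y) location and one log-scale per coordinate.
    """
    total, count = 0.0, 0
    for loc, log_b, tgt in zip(pred_loc, pred_log_scale, target):
        for m, lb, y in zip(loc, log_b, tgt):
            b = math.exp(lb)  # exponentiate so the scale stays positive
            total += math.log(2.0 * b) + abs(y - m) / b
            count += 1
    return total / count
```

For a fixed absolute error, this loss is minimized when the predicted scale equals the error, so the scale output behaves as a learned per-vertex uncertainty estimate: overconfident and underconfident scales are both penalized.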

2603.00918 2026-05-12 cs.CV cs.AI updated version

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Seungwook Kim, Minsu Cho

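Affiliations * POSTECH; RLWRLD; GenGenAI

AI summary This paper proposes SOLACE, a post-training framework for improving text-to-image generation quality. The method has the model re-noise its own generated images and measures how accurately it recovers the noise, producing an intrinsic self-confidence signal that serves as the reinforcement learning reward with no external reward model or human annotation. Experiments show that SOLACE brings consistent gains in compositional generation, text rendering, and text-image alignment, and that combining it with external rewards yields complementary improvements.

Comments 22 pages, accepted to CVPR 2026. Project page https://wookiekim.github.io/SOLACE/

Details
English abstract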

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to improve human preference alignment, factuality, and aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal: we re-noise the model's own outputs and measure how accurately it recovers the injected noise, treating low reconstruction error as high self-confidence. SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models, annotators, or preference data. By reinforcing high-confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering, and text-image alignment. Integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.
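
The re-noise-and-recover signal described above can be sketched with a toy noise predictor. The variance-preserving mixing rule, the number of probes, the negative mean squared error as the reward, and the `predict_noise` callable are all assumptions for illustration; the paper's actual schedule and reward scaling are not given in the abstract.

```python
import math
import random


def self_confidence_reward(x0, predict_noise, t, rng, n_probes=4):
    """Re-noise a generated sample x0 and score how well the noise is recovered."""
    errors = []
    for _ in range(n_probes):
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        # Assumed forward step: x_t = sqrt(1 - t) * x0 + sqrt(t) * eps.
        xt = [math.sqrt(1 - t) * a + math.sqrt(t) * e for a, e in zip(x0, eps)]
        eps_hat = predict_noise(xt, t)  # hypothetical model call
        errors.append(sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0))
    # Low reconstruction error means high self-confidence, hence the sign flip.
    return -sum(errors) / n_probes
```

A predictor that recovers the injected noise exactly earns a reward near zero, while one that always outputs zeros earns a clearly negative reward, so ranking samples by this scalar favors generations the model can explain well.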

2603.00166 2026-05-12 cs.CV cs.AI updated version

Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?

Hongyu Li, Kuan Liu, Yuan Chen, Juntao Hu, Huimin Lu, Guanjie Chen, Xue Liu, Guangming Lu, Hong Huang

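Affiliations * Harbin Institute of Technology, Shenzhen; City University of Hong Kong; Chinese University of Hong Kong; McGill University

AI summary This paper examines the "Paradox of Simplicity" that generative AI exhibits on simple tasks: models excel at generating complex scenes yet struggle with simple tasks such as producing a pure color image. The study proposes the concept of "AI Obedience", builds a hierarchical evaluation framework, and designs Violin, the first systematic benchmark for assessing a model's ability to move from probabilistic approximation to pixel-level determinism. Experiments show that closed-source models outperform open-source ones on deterministic tasks and that this performance correlates with natural image generation ability, providing a foundational framework and tools for understanding instruction alignment.

Details
English abstract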

Recent advances in generative AI have shown human-level performance in complex content creation. However, we identify a "Paradox of Simplicity": models that can render complex scenes often fail at trivial, low-entropy tasks, such as generating a uniform pure color image. We argue this is a systemic failure related to uncontrollable emergent abilities. As models scale, strong priors for aesthetics and complexity override deterministic simplicity, creating an "aesthetic bias" that hinders the model's transition from data simulation to true intellectual abstraction. To better investigate this problem, we formalize the concept of AI Obedience, a hierarchical framework that grades a model's ability to transition from probabilistic approximation to pixel-level determinism (Levels 1 to 5). We introduce Violin, the first systematic benchmark designed to evaluate Level 4 Obedience through three deterministic tasks: color purity, image masking, and geometric shape generation. Using Violin, we evaluate several state-of-the-art models and reveal that closed-source models generally outperform open-source ones in deterministic precision. Interestingly, performance on our benchmark correlates with performance on natural image generation benchmarks. Our work provides a foundational framework and tools for achieving better alignment between human instructions and model outputs.

2602.21581 2026-05-12 cs.CV updated version

MultiAnimate: Pose-Guided Image Animation Made Extensible

Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu, Songhua Liu

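Affiliations * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; ShanghaiTech University; Shanghai Jiao Tong University

AI summary This paper proposes MultiAnimate, an extensible multi-character image animation framework that addresses identity confusion and implausible occlusions in pose-guided multi-character video generation. Built on modern Diffusion Transformers (DiTs), the method introduces two key components, Identifier Assigner and Identifier Adapter, which capture per-person positional information and inter-character spatial relationships, improving flexibility and generalization. Experiments show state-of-the-art performance on multi-character image animation, surpassing existing diffusion-based models.

Comments CVPR2026 Accepted. Project page at https://hyc001.github.io/MultiAnimate/

Details
English abstract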

Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, Identifier Assigner and Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.

2602.09534 2026-05-12 cs.CV updated version

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

Jiayi Lyu, Leigang Qu, Wenjing Zhang, Hanyu Jiang, Kai Liu, Zhenglin Zhou, Xiaobo Xia, Jian Xue, Tat-Seng Chua

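Affiliations * University of the Chinese Academy of Sciences; National University of Singapore; Zhejiang University; State Key Laboratory of Communication Content Cognition, People’s Daily Online

AI summary This paper proposes AUHead, a new method for generating talking-head videos with realistic emotional expression. By disentangling fine-grained emotion control, in the form of Action Units (AUs), from audio, it enables precise control over emotional expression. The two-stage framework first uses a large audio-language model to generate AU sequences and then uses an AU-driven diffusion model to generate high-quality video, improving emotional realism and visual coherence.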

Comments https://openreview.net/forum?id=dmzlAUkulz&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DICLR.cc%2F2026%2FConference%2FAuthors%23your-submissions) Accepted at the 14th International Conference on Learning Representations (ICLR 2026)

Details
English abstract

Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs) through spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. This stage aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into a structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR

2602.09016 2026-05-12 cs.CV updated version

Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

Hao Phung, Hadar Averbuch-Elor

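Affiliations * Cornell University

AI summary This paper proposes Raster2Seq, a method for reconstructing structured vector-graphics representations from rasterized floorplan images. It treats floorplan reconstruction as a sequence-to-sequence task in which elements such as rooms, windows, and doors are represented as labeled polygon sequences carrying both geometry and semantics. With an autoregressive decoder guided by learnable anchors, the model predicts the next corner from image features and previously generated corners, which lets it handle complex floorplans with diverse polygon structures more effectively. Experiments show state-of-the-art performance on several standard datasets and good generalization to more challenging ones.

Comments Accepted to SIGGRAPH 2026. Project page: https://cornell-vailab.github.io/Raster2Seq/

Details
English abstract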

Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements, such as rooms, windows, and doors, are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners using guidance from learnable anchors. These anchors represent spatial coordinates in image space, allowing the attention mechanism to be directed toward informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling efficient handling of complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structured3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.
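
To make the labeled polygon sequence representation concrete, here is a minimal sketch that flattens polygons into a token list and parses it back. The token kinds (`label`, `corner`, `sep`, `eos`), the coordinate quantization into `bins`, and both function names are assumptions; the paper's actual vocabulary and ordering are not specified in the abstract.

```python
def polygons_to_sequence(polygons, bins=256, size=1.0):
    """Flatten labeled polygons: label token, quantized corners, separator."""
    seq = []
    for label, corners in polygons:
        seq.append(("label", label))
        for x, y in corners:
            # Quantize continuous coordinates in [0, size] to integer bins.
            qx = min(bins - 1, int(x / size * bins))
            qy = min(bins - 1, int(y / size * bins))
            seq.append(("corner", (qx, qy)))
        seq.append(("sep", None))
    seq.append(("eos", None))
    return seq


def sequence_to_polygons(seq):
    """Inverse parse: rebuild (label, corners) pairs until the end token."""
    polygons, current = [], None
    for kind, value in seq:
        if kind == "label":
            current = (value, [])
        elif kind == "corner":
            current[1].append(value)
        elif kind == "sep":
            polygons.append(current)
            current = None
        elif kind == "eos":
            break
    return polygons
```

Because each polygon ends with its own separator, the same sequence format covers a four-corner room and a two-point door without fixing the number of corners in advance, which is the flexibility an autoregressive decoder can exploit.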

2602.05243 2026-05-12 cs.LG cs.CV Updated version

CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers

Boxiang Zhang, Baijian Yang

Affiliations * Purdue University

AI summary This paper proposes CORP, a closed-form one-shot structured pruning method that removes MLP and attention substructures from Transformers without gradients or fine-tuning. It formulates structured pruning as a representation recovery problem and derives closed-form ridge regression solutions that compensate the model weights analytically, compressing the model efficiently while keeping accuracy high. Experiments show that DeiT models retain high ImageNet classification accuracy after aggressive pruning.

Details
English abstract

Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning reduces inference cost, but most methods rely on retraining or multi-stage optimization, which limits post-training deployment. We propose CORP, a closed-form one-shot structured pruning method that removes MLP dimensions and attention substructures using only unlabeled calibration data without gradients or fine-tuning. CORP formulates structured pruning as a representation recovery problem. It models removed components as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes a layer-local affine/logit reconstruction objective under the calibration distribution. Experiments on ImageNet with DeiT reveal strong redundancy in both MLP and attention representations. With CORP, models retain high accuracy under aggressive sparsity. On DeiT-Huge, CORP achieves 83.27% Top-1 accuracy after pruning 50% of both MLP and attention structures.
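
The compensation step can be illustrated on a toy case: fit one removed hidden unit as an affine function of the retained units by ridge regression, then fold that fit into the next linear layer. This is a minimal pure-Python sketch under simplifying assumptions (a single removed unit, a scalar-output layer, an unpenalized bias); the helper names `solve`, `ridge_affine`, and `fold` are hypothetical and this is not the CORP implementation.

```python
from typing import List, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_affine(kept: List[List[float]], removed: List[float],
                 lam: float) -> List[float]:
    """Closed-form fit of removed ~ kept . beta[:-1] + beta[-1]."""
    rows = [k + [1.0] for k in kept]           # append a bias column
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d)]
            for i in range(d)]
    for i in range(d - 1):                     # penalize weights, not bias
        gram[i][i] += lam
    rhs = [sum(r[i] * y for r, y in zip(rows, removed)) for i in range(d)]
    return solve(gram, rhs)


def fold(w_kept: List[float], w_removed: float, bias: float,
         beta: List[float]) -> Tuple[List[float], float]:
    """Fold the fit into y = kept . w_kept + removed * w_removed + bias."""
    new_w = [wk + b * w_removed for wk, b in zip(w_kept, beta[:-1])]
    new_bias = bias + beta[-1] * w_removed
    return new_w, new_bias
```

After folding, the layer reads only the retained units, so the removed unit can be dropped at inference time with no extra compute; how well the output is preserved depends on how well the affine fit holds on the calibration data.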

2602.03916 2026-05-12 cs.CV cs.CE cs.CL cs.LG Updated version

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez

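Affiliations * Computational Intelligence and Operations Laboratory; Shahjalal University of Science and Technology; BRAC University; North South University; Monash University; Qatar Computing Research Institute

AI summary SpatiaLab is a comprehensive benchmark for evaluating the spatial reasoning of vision-language models (VLMs) in real-world scenes. The study finds that existing models still fall well short on complex spatial relationships, depth perception, navigation, and 3D geometry. SpatiaLab contains 1,400 visual question-answer pairs covering six major categories and 30 task types, and experiments show that current state-of-the-art VLMs perform far below humans on spatial reasoning.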

Comments Accepted to ICLR 2026 (https://openreview.net/forum?id=fWWUPOb0CT). 92 Pages. 42 Figures and 29 Tables

Journal ref ICLR 2026

Details
English abstract

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
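
The counts and gaps quoted in the abstract can be cross-checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are the ones stated above.

```python
CATEGORIES = 6
SUBCATEGORIES_PER_CATEGORY = 5
TOTAL_QUESTIONS = 1400

# 6 categories x 5 subcategories each
task_types = CATEGORIES * SUBCATEGORIES_PER_CATEGORY

# Lower bounds implied by the stated minimums
floor_from_subcategories = task_types * 25   # at least 25 per subcategory
floor_from_categories = CATEGORIES * 200     # at least 200 per category

# Gap between human accuracy and the best reported model, in percentage points
gap_multiple_choice = round(87.57 - 54.93, 2)
gap_open_ended = round(64.93 - 40.93, 2)

print(task_types, floor_from_subcategories, floor_from_categories)
print(gap_multiple_choice, gap_open_ended)
```

Both lower bounds sit below the stated total of 1,400 questions, so the figures are mutually consistent.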

2601.15065 2026-05-12 cs.CV Updated version

Enhancing Few-Shot Out-of-Distribution Detection via the Refinement of Foreground and Background

Tianyu Li, Zongqian Wu, Songyue Cai, Ping Hu, Xiaofeng Zhu

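Affiliations * School of Computer Science and Engineering, University of Electronic Science and Technology of China; School of Computer Science and Technology, Hainan University

AI summary This paper addresses shortcomings of foreground-background decomposition methods for few-shot out-of-distribution (OOD) detection and proposes a new plug-and-play framework. Its two core modules, Adaptive Background Suppression and Confusable Foreground Rectification, respectively adapt the classification-entropy weights of background patches and rectify foreground patches that resemble other classes. Experiments show that the framework improves the few-shot OOD detection performance of existing methods.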

Comments arXiv preprint arXiv:2601.15065 (2026)

Details
English abstract

CLIP-based foreground-background (FG-BG) decomposition methods have demonstrated remarkable effectiveness in improving few-shot out-of-distribution (OOD) detection performance. However, existing approaches still suffer from several limitations. For background regions obtained from decomposition, existing methods adopt a uniform suppression strategy for all patches, overlooking the varying contributions of different patches to the prediction. For foreground regions, existing methods fail to adequately consider that some local patches may exhibit appearance or semantic similarity to other classes, which may mislead the training process. To address these issues, we propose a new plug-and-play framework. This framework consists of three core components: (1) a Foreground-Background Decomposition module, which follows previous FG-BG methods to separate an image into foreground and background regions; (2) an Adaptive Background Suppression module, which adaptively weights patch classification entropy; and (3) a Confusable Foreground Rectification module, which identifies and rectifies confusable foreground patches. Extensive experimental results demonstrate that the proposed plug-and-play framework significantly improves the performance of existing FG-BG decomposition methods. Code is available at: https://github.com/lounwb/FoBoR.

2512.19219 2026-05-12 cs.CV cs.AI Updated version

Selective LoRA for Visual Tokens and Attention Heads

Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee

Affiliations * University of Michigan; LG AI Research

AI summary This paper proposes Image-LoRA, a parameter-efficient fine-tuning method for vision tasks. To account for the heterogeneity of vision-language model (VLM) inputs, it restricts LoRA updates to visual tokens and to the value path of a subset of attention heads, reducing trainable parameters and compute. It performs well on visual localization tasks and, especially when visual tokens make up a large share of the input, offers a better balance of performance and efficiency than standard LoRA; its generality and the stability of pure-text processing are verified on several tasks.

Details
English abstract

Low-rank adaptation (LoRA) is widely used for parameter-efficient fine-tuning, but its standard all-token, all-head design ignores the heterogeneous structure of vision-language model (VLM) inputs. We introduce Image-LoRA, a vision-oriented PEFT recipe that views LoRA as a token-level residual update and applies this update only to visual tokens. Image-LoRA further restricts adaptation to the value path of a compact subset of attention heads, selected using a one-pass influence estimate from a rank-1 visual-token-only probe. This token-, head-, and value-selective design reduces trainable parameters and adapter-only training FLOPs while leaving the pure-text forward pass of the frozen backbone unchanged when no visual tokens are present. Across visual localization benchmarks with controlled text:image token ratios, Image-LoRA matches or closely approaches standard LoRA, while showing especially favorable trade-offs in image-token-heavy regimes. We further validate its generality on TextVQA and VideoQA, verify pure-text preservation on GSM8K, and show on ViLP that a stronger information bottleneck can yield gains over standard LoRA.
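
The "token-level residual update applied only to visual tokens" can be sketched as a frozen value projection plus a low-rank term that is gated by a per-token mask. This is a simplified illustration in plain Python for a single head; the function names, argument layout, and `scale` parameter are assumptions, and head selection is left out entirely.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def value_projection(tokens: List[Vector], is_visual: List[bool],
                     w_v: Matrix, lora_a: Matrix, lora_b: Matrix,
                     scale: float = 1.0) -> List[Vector]:
    """Frozen value projection plus a low-rank residual on visual tokens only.

    lora_a has shape (rank x d_in) and lora_b has shape (d_out x rank).
    """
    out = []
    for x, visual in zip(tokens, is_visual):
        v = matvec(w_v, x)                             # frozen path, all tokens
        if visual:
            delta = matvec(lora_b, matvec(lora_a, x))  # low-rank update B(Ax)
            v = [vi + scale * di for vi, di in zip(v, delta)]
        out.append(v)
    return out
```

With an all-`False` mask the function reduces to the frozen projection, which mirrors the abstract's point that the pure-text forward pass is unchanged when no visual tokens are present.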

2511.12878 2026-05-12 cs.CV cs.RO Updated version

Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng, Erhang Zhang, Xieyuanli Chen, Hesheng Wang

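Affiliations * IRMV Lab, Shanghai Jiao Tong University; Meta Reality Labs; Department of Electronic Engineering, Shanghai Jiao Tong University; China University of Mining and Technology; College of Intelligence Science and Technology, National University of Defense Technology

AI summary This paper proposes Uni-Hand, a universal hand motion forecasting framework that targets four problems in egocentric hand motion forecasting: insufficient prediction targets, modality gaps, entangled hand and head motion, and limited validation on downstream tasks. By fusing vision and language and injecting global context and task-aware text embeddings, it forecasts multiple hand keypoint targets in both 2D and 3D, and it introduces prediction of hand-object interaction states to improve downstream performance. Experiments show state-of-the-art forecasting performance on several public datasets and newly built benchmarks, along with strong potential in tasks such as robot policy transfer and action recognition.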

Comments Accepted by T-PAMI 2026. Code and data: https://github.com/IRMVLab/UniHand

Details
English abstract

Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate possible future hand waypoints, but they still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to predict hand-object interaction states (contact/separation) to better facilitate downstream tasks. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also demonstrates impressive human-robot policy transfer for robotic manipulation and effective feature enhancement for action anticipation/recognition.

2511.00560 2026-05-12 cs.CV Updated version

4D Neural Voxel Splatting: Dynamic Scene Rendering with Voxelized Gaussian Splatting

Chun-Tin Wu, Jun-Cheng Chen

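Affiliations * National Taiwan University; Academia Sinica

AI summary Although 3D Gaussian Splatting (3D-GS) renders novel views efficiently, extending it to dynamic scenes still incurs substantial memory overhead from replicating Gaussians for every frame. This paper proposes 4D Neural Voxel Splatting (4D-NVS), which combines voxel representations with neural Gaussian splatting to model dynamic scenes efficiently. It models temporal dynamics with a compact set of neural voxels and learned deformation fields, which markedly reduces memory use and speeds up training while preserving high-quality rendering. Experiments show that the method beats existing approaches in memory footprint and training speed while achieving real-time rendering and better visual quality.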

Comments 10 pages, 7 figures

Details
English abstract

Although 3D Gaussian Splatting (3D-GS) achieves efficient rendering for novel view synthesis, extending it to dynamic scenes still results in substantial memory overhead from replicating Gaussians across frames. To address this challenge, we propose 4D Neural Voxel Splatting (4D-NVS), which combines voxel-based representations with neural Gaussian splatting for efficient dynamic scene modeling. Instead of generating separate Gaussian sets per timestamp, our method employs a compact set of neural voxels with learned deformation fields to model temporal dynamics. The design greatly reduces memory consumption and accelerates training while preserving high image quality. We further introduce a novel view refinement stage that selectively improves challenging viewpoints through targeted optimization, maintaining global efficiency while enhancing rendering quality for difficult viewing angles. Experiments demonstrate that our method outperforms state-of-the-art approaches with significant memory reduction and faster training, enabling real-time rendering with superior visual fidelity.

2509.13484 2026-05-12 cs.CV cs.CY Updated version

MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Liu Liu, Alexandra Kudaeva, Marco Cipriano, Fatimeh Al Ghannam, Freya Tan, Gerard de Melo, Andres Sevtsuk

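Affiliations * Massachusetts Institute of Technology; Hasso Plattner Institute

AI summary This paper proposes MINGLE, a vision-language-model approach for detecting semantically complex social group regions in urban scenes. It identifies and localizes regions of social interaction in images by combining human detection, depth estimation, VLM reasoning, and a spatial aggregation algorithm. The study also builds a new dataset of 100K urban street-view images annotated with bounding boxes and labels for individuals and social groups.

Comments 13 pages, 4 figures. Updated with the camera-ready version after acceptance.

Details
English abstract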

Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement - semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
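
Stage (3) of the pipeline, turning pairwise affiliation decisions into localized group regions, can be approximated by connected components over the "affiliated" pairs followed by a union of member boxes. The sketch below uses a union-find structure for this; treating aggregation as plain connected components is an assumption made for illustration, and the abstract does not specify the actual algorithm.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def group_regions(boxes: List[Box],
                  affiliated_pairs: List[Tuple[int, int]]) -> List[Box]:
    """Merge people linked by pairwise affiliation into group bounding boxes."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in affiliated_pairs:
        parent[find(a)] = find(b)          # union the two components

    members: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        members.setdefault(find(i), []).append(i)

    regions = []
    for idxs in members.values():
        if len(idxs) < 2:
            continue                       # a lone person is not a group
        regions.append((
            min(boxes[i][0] for i in idxs),
            min(boxes[i][1] for i in idxs),
            max(boxes[i][2] for i in idxs),
            max(boxes[i][3] for i in idxs),
        ))
    return regions
```

In this sketch the person boxes would come from stage (1) and the affiliated pairs from the VLM classifier in stage (2).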

2505.23617 2026-05-12 cs.CV cs.AI cs.GR cs.LG Updated version

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

Chenhao Zheng, Jieyu Zhang, Mohammadreza Salehi, Ziqi Gao, Vishnu Iyengar, Norimasa Kobori, Quan Kong, Ranjay Krishna

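Affiliations * University of Washington; Allen Institute for Artificial Intelligence; Woven by Toyota, Inc

AI summary This paper proposes a video tokenization method based on panoptic sub-object trajectories, aimed at the redundant tokens and poor computational efficiency that conventional space-time patch tokenization causes on long videos. It organizes video content into object trajectories and turns them into semantic tokens, reducing the token count while keeping temporal coherence. The proposed TrajViT model clearly outperforms existing methods on several video understanding tasks at lower computational cost.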

Comments ICCV 2025

Details
English abstract

Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% top-5 recall on average on the video-text retrieval task with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average 5.2% performance improvement across 6 VideoQA benchmarks while training 4x faster and using 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.

2505.18184 2026-05-12 eess.SP cs.CV Updated version

AI-Enhanced Stethoscope in Remote Diagnostics for Cardiopulmonary Diseases

Hania Ghouse, Juveria Tanveen, Abdul Muqtadir Ahmed, Uma N. Dulhare

Affiliations * Department of Computer Science and Artificial Intelligence, Muffakham Jah College of Engineering and Technology

AI summary Targeting the growing worldwide difficulty of diagnosing cardiovascular and pulmonary diseases, this paper proposes a low-cost AI-enhanced stethoscope system for remote cardiopulmonary diagnosis. It extracts and processes MFCC features from auscultation sounds and uses a hybrid CNN-GRU model to automatically classify six pulmonary and five cardiovascular diseases. The system can run on low-cost embedded devices in under-resourced areas and provide real-time diagnostic support, offering an innovative route toward standardized healthcare.

Details
English abstract

The increase in cardiac and pulmonary diseases presents an alarming and pervasive global health challenge, responsible for unexpected and premature deaths. Despite the seriousness of these conditions, existing methods of detection and treatment encounter challenges, particularly in achieving timely diagnosis for effective medical intervention. Manual screening processes commonly used for primary detection of cardiac and respiratory problems face inherent limitations, compounded by a scarcity of skilled medical practitioners in remote or under-resourced areas. To address this, our study introduces an innovative yet efficient model that integrates AI for diagnosing lung and heart conditions concurrently from auscultation sounds. Unlike high-priced digital stethoscopes, our proposed model is designed to be deployed on low-cost embedded devices, ensuring applicability in under-developed regions that have difficulty accessing medical care. The model uses MFCC feature extraction and feature engineering so that the signal is analyzed well enough for accurate diagnostics, and a hybrid model combining a Gated Recurrent Unit with a CNN processes audio signals recorded from the low-cost stethoscope. Beyond its diagnostic capabilities, the model generates digital audio records that facilitate classifying six pulmonary and five cardiovascular diseases. Hence, the integration of a cost-effective stethoscope with an efficient AI-empowered model deployed on a web app that provides real-time analysis represents a transformative step towards standardized healthcare.

2505.07349 2026-05-12 eess.IV cs.CV Updated version

Multi-Plane Vision Transformer for Hemorrhage Classification Using Axial and Sagittal MRI Data

Badhan Kumar Das, Gengyan Zhao, Boris Mailhe, Thomas J. Re, Dorin Comaniciu, Eli Gibson, Andreas Maier

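Affiliations * Digital Technology and Innovation, Siemens Healthineers; Pattern Recognition Lab, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg

AI summary This paper proposes a multi-plane vision transformer (MP-ViT) for brain hemorrhage classification, addressing the information loss that arises when MRI data with different orientations (such as axial and sagittal) are used for hemorrhage detection. It processes each orientation with a separate transformer encoder, fuses information across orientations with cross-attention, and adds a modality indication vector to supply missing contrast information. On a clinical dataset with 10,084 training subjects, MP-ViT improves AUC by 5.5% over a standard ViT and by 1.8% over CNN models.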

Comments 10 pages

Details
English abstract

Identifying brain hemorrhages from magnetic resonance imaging (MRI) is a critical task for healthcare professionals. The diverse nature of MRI acquisitions, with varying contrasts and orientations, introduces complexity in identifying hemorrhage using neural networks. For acquisitions with varying orientations, traditional methods often involve resampling images to a fixed plane, which can lead to information loss. To address this, we propose a 3D multi-plane vision transformer (MP-ViT) for hemorrhage classification with varying orientation data. It employs two separate transformer encoders for axial and sagittal contrasts, using cross-attention to integrate information across orientations. MP-ViT also includes a modality indication vector to provide missing contrast information to the model. The effectiveness of the proposed model is demonstrated with extensive experiments on a real-world clinical dataset consisting of 10,084 training, 1,289 validation, and 1,496 test subjects. MP-ViT achieved substantial improvement in area under the curve (AUC), outperforming the vision transformer (ViT) by 5.5% and CNN-based architectures by 1.8%. These results highlight the potential of MP-ViT in improving performance for hemorrhage detection when different orientation contrasts are needed.
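
The cross-attention used to combine the two orientations can be sketched with single-head scaled dot-product attention, where tokens from one encoder act as queries and tokens from the other act as keys and values. This is a bare illustration in plain Python: the learned projections, multiple heads, and the choice of which orientation supplies the queries are all assumptions not given in the abstract.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Each query token becomes a weighted mix of the other stream's values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

For example, `cross_attention(axial_tokens, sagittal_tokens, sagittal_tokens)` would let every axial token read from the sagittal stream without resampling either volume to a common plane; the two argument names are placeholders.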

2505.05209 2026-05-12 cs.CV Updated version

EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

Haizhen Xie, Kunpeng Du, Qiangyu Yan, Sen Lu, Jianhong Han, Hanting Chen, Hailin Hu, Jie Hu

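Affiliations * Huawei Noah’s Ark Lab

AI summary This paper proposes EAM, a blind super-resolution method based on Diffusion Transformers (DiT). It introduces a new $Ψ$-DiT block whose triple-flow architecture makes effective use of the prior knowledge in a pre-trained DiT, and combines it with a progressive Masked Image Modeling strategy and a subject-aware prompt generation strategy to improve generalization and training efficiency. Experiments show that EAM outperforms existing methods in quantitative metrics and visual quality on multiple datasets.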

Comments Revision of Section 4.1

Details
English abstract

Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $Ψ$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.

2503.09336 2026-05-12 cs.CV Updated version

Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness

Yu Feng, Dingxin Zhang, Runkai Zhao, Yong Xia, Heng Huang, Weidong Cai

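Affiliations * The University of Sydney; Northwestern Polytechnical University; University of Maryland College Park

AI summary This paper proposes SPBA, a stealthy patch-wise backdoor attack on 3D point cloud models. It partitions a point cloud into patches, uses local curvature variation to select the patches where a perturbation is least noticeable, and uses them as trigger regions, so the backdoor is implanted efficiently and stealthily without noticeably changing the point cloud's structure. Compared with conventional sample-wise triggers, it greatly reduces computational cost and improves stealthiness, with strong results on several benchmark datasets.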

Comments 12 pages, 6 figures, 11 tables

Details
English abstract

Backdoor attacks pose a severe threat to deep neural networks (DNNs) by implanting hidden backdoors that can be activated with predefined triggers to manipulate model behaviors maliciously. Recent studies have extended backdoor attacks to 3D point clouds, but most existing triggers are sample-wise and often cause visible geometric artifacts or high optimization cost. To address these limitations, we propose the Stealthy Patch-Wise Backdoor Attack (SPBA), a patch-wise backdoor attack framework for 3D point clouds. Specifically, SPBA decomposes a point cloud into local patches, where each patch is formed by a Farthest Point Sampling (FPS) center and its K-nearest neighbors (KNN). Candidate patches are ranked using a patch imperceptibility score derived from local curvature variation, and a unified spectral trigger is injected into the selected patches by perturbing only the coordinates of existing points while preserving the original point cardinality. Extensive experiments on ModelNet40 and ShapeNetPart further demonstrate that SPBA achieves state-of-the-art stealthiness among prior methods and reduces spectral-trigger computation by 98.43% relative to a sample-wise spectral baseline, while maintaining competitive attack performance. These results support localized spectral design as an effective and efficient approach to stealthy backdoor attacks in 3D point cloud models. Code is available at https://github.com/HazardFY/SPBA.
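
The patch decomposition described above (a Farthest Point Sampling center plus its K nearest neighbors) can be sketched in a few lines of standard-library Python. Only the generic FPS and KNN steps are shown; the curvature-based imperceptibility score and the spectral trigger are not reproduced, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def farthest_point_sampling(points: Sequence[Point], m: int,
                            start: int = 0) -> List[int]:
    """Greedily pick m indices, each farthest from those already chosen."""
    chosen = [start]
    dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        # Keep, for every point, its distance to the nearest chosen center.
        dist = [min(d, math.dist(p, points[nxt]))
                for d, p in zip(dist, points)]
    return chosen


def knn_patch(points: Sequence[Point], center: int, k: int) -> List[int]:
    """Indices of the k points closest to the center (center included)."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], points[center]))
    return order[:k]
```

Each patch is a list of indices into the original cloud, so a perturbation applied to a patch moves existing points and leaves the point count unchanged, in line with the cardinality-preserving constraint in the abstract.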

2407.11906 2026-05-12 cs.CV cs.RO Updated version

SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge

Hao Ding, Yuqian Zhang, Tuxun Lu, Ruixing Liang, Hongchao Shu, Lalithkumar Seenivasan, Yonghao Long, Qi Dou, Cong Gao, Yicheng Leng, Seok Bong Yoo, Eung-Joo Lee, Negin Ghamsarian, Klaus Schoeffmann, Raphael Sznitman, Zijian Wu, Yuxin Chen, Septimiu E. Salcudean, Samra Irshad, Shadi Albarqouni, Seong Tae Kim, Yueyi Sun, An Wang, Long Bai, Hongliang Ren, Ihsan Ullah, Ho-Gun Ha, Attaullah Khan, Hyunki Lee, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Sita Tailor, Ricardo Sanchez-Matilla, Imanol Luengo, Tianhao Fu, Jun Ma, Bo Wang, Marcos Fernández-Rodríguez, Estevao Lima, João L. Vilaça, Mathias Unberath

Affiliations * Johns Hopkins University; The Chinese University of Hong Kong; Intuitive Surgical Inc.; University of Arizona; Chonnam National University; University of British Columbia; Kyung Hee University; University H

AI summary SegSTRONG-C is a challenge aimed at improving the robustness of surgical tool segmentation models to non-adversarial corruptions. Built on a dataset generated through counterfactual robotic replay, it provides paired clean and corrupted samples for evaluating models. Participants train on uncorrupted data and are evaluated on test sets corrupted by bleeding, smoke, and low brightness, which reveals key causes of model failure and effective ways to improve robustness. The top methods reach high segmentation accuracy across the corruption types, underscoring the roles of prior knowledge, customized training strategies, and architectural choice.

Details
English abstract

Surgical data science has seen rapid advancement with the excellent performance of end-to-end deep neural networks (DNNs). Despite their successes, DNNs have been proven susceptible to minor "corruptions," introducing a major concern for the translation of cutting-edge technology, especially in high-stakes scenarios. We introduce the SegSTRONG-C challenge dedicated to better understanding model deterioration under unforeseen but plausible non-adversarial "corruption" and the capabilities of contemporary methods that seek to improve it. Built on a dataset generated through counterfactual robotic replay, SegSTRONG-C provides paired clean and "corrupted" samples, enabling reproducible evaluation of model robustness. Participants are challenged to train tool segmentation algorithms on "uncorrupted" data and evaluate them on "corrupted" test domains for the binary robot tool segmentation task. Through comprehensive baseline experiments and participating submissions from widespread community engagement, SegSTRONG-C reveals key themes for model failure and identifies promising directions for improving robustness. The challenge winners achieve an average of 0.9394 DSC and 0.9301 NSD across the unreleased test sets with three "corruption" types: bleeding, smoke, and low brightness. This highlights how prior knowledge, customized training strategies, and architectural choice can be leveraged to improve robustness. In conclusion, the SegSTRONG-C challenge has identified practical approaches for enhancing model robustness. However, most approaches rely on conventional techniques that have known limitations. Looking ahead, we advocate for expanding intellectual diversity and creativity in non-adversarial robustness beyond data augmentation, calling for new paradigms that enhance universal robustness to unforeseen "corruptions" to facilitate richer applications in surgical data science.
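
For readers unfamiliar with the DSC figure quoted above, the Dice similarity coefficient for binary masks is 2|A∩B| / (|A| + |B|). The sketch below computes it for masks given as nested lists of 0/1; returning 1.0 when both masks are empty is a convention assumed here, and the challenge's own evaluation code may differ. NSD is not shown.

```python
from typing import Sequence


def dice(pred: Sequence[Sequence[int]],
         target: Sequence[Sequence[int]]) -> float:
    """Dice similarity coefficient between two binary masks of equal shape."""
    intersection = sum(p & t
                       for pred_row, target_row in zip(pred, target)
                       for p, t in zip(pred_row, target_row))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total
```

A robustness comparison in this setting would compute the score once on the clean frame and once on its paired corrupted frame, both against the same ground-truth mask.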

2605.09245 2026-05-12 cs.CV Updated version

CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking

Ruiqi Xian, Deep Patel, Iain Melvin, Sanjoy Kundu, Martin Renqiang Min, Dinesh Manocha

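Affiliations * University of Maryland, College Park, MD, USA; NEC Laboratories America, Princeton, NJ, USA; University of North Carolina, Greensboro, NC, USA

AI summary Multi-camera multi-object tracking (MCMOT) struggles to keep object identities consistent across views, especially when it depends on precise calibration and extensive annotation. This paper proposes CalibFree, a self-supervised representation learning framework that needs neither calibration nor manual labels; it uses single-view distillation and cross-view reconstruction to separate view-agnostic from view-specific features and so adapts to complex, dynamic scenes. Experiments show better tracking performance than existing methods on multiple datasets, confirming its effectiveness and adaptability without calibration.

Details
English abstract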

Multi-camera multi-object tracking (MCMOT) faces significant challenges in maintaining consistent object identities across varying camera perspectives, particularly when precise calibration and extensive annotations are required. In this paper, we present CalibFree, a self-supervised representation learning framework that does not need any calibration or manual labeling for the MCMOT task. By promoting feature separation between view-agnostic and view-specific representations through single-view distillation and cross-view reconstruction, our method adapts to complex, dynamic scenarios with minimal overhead. Experiments on the MMP-MvMHAT dataset show a 3% improvement in overall accuracy and a 7.5% increase in the average F1 score over state-of-the-art approaches, confirming the effectiveness of our calibration-free design. Moreover, on the more diverse MvMHAT dataset, our approach demonstrates superior over-time tracking and strong cross-view performance, highlighting its adaptability to a wide range of camera configurations. Code will be publicly available upon acceptance.

2605.09218 2026-05-12 cs.CV cs.AI cs.LG cs.RO Updated version

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

Sagar Bharadwaj, Ziyong Ma, Anurag Ghosh, Srinivasan Seshan, Anthony Rowe

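Affiliations * Carnegie Mellon University

AI summary Flame3D is a training-free 3D scene understanding framework that pairs editable visual-textual 3D memories with an off-the-shelf multimodal large language model (MLLM) to reason zero-shot about complex spatial relationships and objects not yet in the scene. At inference time it can synthesize custom spatial programs, supporting open-ended reasoning over layouts, empty space, and new objects, and its memory can be updated with external data without retraining. Experiments show strong results on 3D question answering and compositional spatial reasoning, and highlight how much complex 3D reasoning depends on synthesizing spatial operations on the fly.

Details
English abstract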

3D scene understanding spans reasoning about free space, object grounding, hypothetical object insertions, complex geometric relationships, and integrating all of these with external tools and data sources. Existing 3D understanding methods typically rely on large-scale 3D-language training or focus on object grounding and simple spatial relationships. We argue that the broad generalization that motivates 3D-language training can be achieved at inference time, without 3D-specific training. We propose Flame3D, a training-free framework that represents scenes as editable visual-textual 3D memories and exposes them to an off-the-shelf MLLM through composable spatial tools. Flame3D also lets the agent synthesize custom spatial programs at inference time, enabling open-ended reasoning over layouts, empty space, and objects not yet present in the scene. External data and corrections can be added to the memory without retraining. In addition to showing competitive performance to finetuned 3D-LMM methods on ScanQA, we study multi-hop 3D reasoning capabilities of Flame3D by evaluating it on a curated compositional spatial-reasoning benchmark, Compose3D. We find that fixed tools fall short and that the agent's ability to synthesize spatial operations at inference time is essential. These results invite the question: should future progress in 3D scene understanding focus on richer scene memories and expressive compositional abstractions?

2605.09196 2026-05-12 cs.CV cs.AI cs.GR cs.LG cs.RO Updated version

RigidFormer: Learning Rigid Dynamics using Transformers

Zhiyang Dou, Minghao Guo, Haixu Wu, Doug Roble, Tuur Stuyck, Wojciech Matusik

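Affiliations * MIT; Meta

AI summary This paper proposes RigidFormer, a Transformer-based model that learns multi-object rigid-body dynamics and is particularly suited to mesh-free representations such as point clouds. It models dynamics through object-level anchors and combines Anchor-Vertex Pooling with Anchor-based RoPE attention to simulate rigid motion efficiently and with high fidelity. On multiple benchmarks, RigidFormer matches or outperforms conventional mesh-based methods, is more computationally efficient, and handles large numbers of objects and inputs at different point resolutions.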

Comments Project Page: https://people.csail.mit.edu/frankzydou/projects/RigidFormer/index.html

Details
English abstract

Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations, therefore, remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.
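
Projecting a predicted update onto the rigid-body manifold amounts to replacing the predicted points with the best-fitting rotation and translation of the rest shape. The full Kabsch algorithm uses an SVD in 3D; the sketch below is a 2D analogue, where the optimal rotation angle has a closed form, written in plain Python for illustration. It is not the differentiable implementation the paper refers to, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point2 = Tuple[float, float]


def centroid(pts: List[Point2]) -> Point2:
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def project_to_rigid(rest: List[Point2],
                     predicted: List[Point2]) -> List[Point2]:
    """Least-squares rigid fit of the rest shape to the predicted points."""
    cr, cp = centroid(rest), centroid(predicted)
    s_cos = s_sin = 0.0
    for (rx, ry), (px, py) in zip(rest, predicted):
        ax, ay = rx - cr[0], ry - cr[1]    # centered rest point
        bx, by = px - cp[0], py - cp[1]    # centered predicted point
        s_cos += ax * bx + ay * by         # accumulates the cosine term
        s_sin += ax * by - ay * bx         # accumulates the sine term
    theta = math.atan2(s_sin, s_cos)       # optimal rotation angle in 2D
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (rx - cr[0]) - s * (ry - cr[1]) + cp[0],
             s * (rx - cr[0]) + c * (ry - cr[1]) + cp[1])
            for rx, ry in rest]
```

Whatever the network predicts, the output is an exact rigid transform of the rest shape, so pairwise distances within an object cannot drift over a long rollout.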

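2605.09190 2026-05-12 cs.CV Version update

AQMP: Image compression through Adaptive Quadtree Refinement and Matching Pursuit with Hyperparameter Optimization

Franco Cerino, Emmanuel Tassone, Manuel Tiglio

Affiliations * CONICET; Facultad de Matemática, Astronomía, Física y Computación, Universidad Nacional de Córdoba

AI summary This paper proposes AQMP, a new image codec that combines adaptive quadtree partitioning with Matching Pursuit. By adapting block sizes to local image structure, it reaches higher compression ratios at equivalent image quality. Hyperparameters are tuned by multi-objective optimization with the Tree-Structured Parzen Estimator to balance compression efficiency against visual quality. Experiments show that AQMP achieves up to 4x higher compression rates than JPEG at comparable structural similarity (SSIM) and performs well across compression regimes.

Comments 34 pages, 18 figures

Details
English abstract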

We present AQMP, a novel image codec combining Adaptive Quadtree Refinement with Matching Pursuit. Unlike conventional Matching Pursuit methods that operate on fixed-size sub-images, AQMP dynamically adapts block sizes to local image structure, allocating finer partitions where the image is complex and coarser ones where it is smooth. This adaptivity yields superior compression ratios compared to fixed-size block Matching Pursuit at equivalent image quality, while offering significant parallelization opportunities at both the tree-leaf level and during compression of individual nodes. The algorithm is governed by user-specified accuracy and sparsity parameters alongside a small set of additional hyperparameters. To navigate the trade-off between compression efficiency and visual quality, we perform multi-objective hyperparameter optimization using the Tree-Structured Parzen Estimator, producing comprehensive Pareto fronts. Experimental results show that AQMP achieves up to $4\times$ higher compression rates than JPEG at comparable SSIM values, while maintaining competitive quality across a broad range of compression regimes. Performance evaluation is provided using a representative set of test images. To ensure reproducibility and promote adoption, we have made our implementation publicly available on GitHub under the MIT license.
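
The adaptive partitioning idea can be sketched in a few lines. The code below is only an illustration under stated assumptions: it splits a block when its pixel variance exceeds a tolerance, which is a stand-in criterion (the abstract says AQMP is governed by accuracy and sparsity parameters tied to Matching Pursuit, and does not give the exact rule). The function name `quadtree_leaves`, the parameters `tol` and `min_size`, and the sample image are hypothetical, and the image side is assumed to be a power of two.

```python
def quadtree_leaves(img, x, y, size, tol, min_size=1):
    """Split a square block until its pixel variance is at most `tol`."""
    pixels = [img[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    # Smooth enough, or cannot split further: keep this block as a leaf.
    if var <= tol or size <= min_size:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    # Recurse into the four quadrants.
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, x + dx, y + dy, half, tol, min_size)
    return leaves

# Hypothetical 4x4 image: flat everywhere except one bright pixel.
img = [[0] * 4 for _ in range(4)]
img[0][0] = 9
leaves = quadtree_leaves(img, 0, 0, 4, tol=0.0)
```

In this example only the quadrant containing the bright pixel is refined down to single pixels, while the three smooth quadrants stay as 2x2 leaves. Each leaf is independent of the others, which is the property behind the leaf-level parallelism the abstract mentions.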

2605.09181 2026-05-12 cs.CV cs.ET eess.IV Version update

Establishing Robust Retinal Eye Tracking: A Weakly Supervised Algorithmic Framework

Bo Wen, Dillon Lohr, Yatong An, Pushkar Anand, Alexander Fix, Ruobing Qian, Catherine A. Fromm, Yimin Ding, Truong Nguyen, Mohamed El-Haddad, Francesco La Rocca

Affiliations * Meta Reality Labs; University of California, San Diego; Independent Researcher

AI summary This paper proposes a new weakly supervised learning framework for robust retinal eye tracking. The method addresses the weaknesses of classical template matching under retinal feature variability and real-world imaging conditions. Initial experiments on 6 participants show a 95th-percentile gaze error below 0.45 degrees. The work offers a new technical path for eye tracking in ophthalmic imaging and vision science.

Comments 2026 IEEE International Conference on Image Processing (Accepted for Publication)

Details
English abstract

Retinal image-based eye tracking is widely used in ophthalmic imaging and vision science, and is a promising path to deliver higher gaze accuracy than the pupil- and cornea-based approaches commonly used in modern AR/VR devices. Nevertheless, existing retinal tracking algorithms still primarily rely on classical template-matching registration, which can be insufficiently robust to retinal feature variability and real-world imaging conditions. In this work, we propose a novel weakly-supervised, learning-based framework for robust retinal eye tracking. Initial studies demonstrate high accuracy, achieving a 95th-percentile gaze error below 0.45 deg across a cohort of 6 participants.

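2605.09151 2026-05-12 cs.CV Version update

MultiMedVision: Multi-Modal Medical Vision Framework

Frank Li, Bardia Khosravi, Mohammadreza Chavoshi, Young Seok Jeon, Theo Dapamede, Hari Trivedi, Janice Newsome, Judy Gichoya

Affiliations * Emory University; Yale University

AI summary This paper proposes MultiMedVision, a multi-modal medical vision framework that handles 2D (e.g. X-ray) and 3D (e.g. CT) medical imaging data in a unified way. Built on a Sparse Vision Transformer, it uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality data directly in a shared latent space, without modality-specific adapters and without treating 3D volumes as 2D slice sequences. Experiments show competitive results on several medical imaging benchmarks, supporting the feasibility of unified cross-dimensional representation learning.

Comments 9 pages, 2 figures

Details
English abstract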

Multi-modal medical imaging enables comprehensive diagnostics, yet current foundation models process 2D (e.g. X-ray) and 3D (e.g. CT) data with separate, dimensionality-specific architectures. We present MultiMedVision, a unified framework for joint 2D/3D representation learning built on a Sparse Vision Transformer. Our model uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality batches natively within a shared latent space, without modality-specific adapters or treating 3D volumes as 2D slice sequences. Trained with a self-supervised objective on chest X-rays (MIMIC-CXR) and CT scans (CT-RATE), and using a single shared encoder with 5x less data, MultiMedVision achieves competitive performance on both 2D benchmarks (Macro AUROC 0.82 on MIMIC, 0.84 on CheXpert) and 3D tasks (0.85 on CT-RATE). Analysis of the learned representations reveals coexisting modality-specific and shared feature subspaces, demonstrating that unified cross-dimensional representation learning is feasible without sacrificing modality-specific performance.

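2605.09146 2026-05-12 cs.CV Version update

Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

Jingdong Zhang, Yizhou Wang, Zhengzhong Tu, Xin Li, Wenping Wang, Xiaohang Zhan

Affiliations * Texas A&M University; Adobe

AI summary This paper studies Humanoid Visual Search (HVS), in which an agent actively explores a 360-degree immersive environment to find a target. Existing methods depend on cumbersome multi-turn Chain-of-Thought (CoT) reasoning, which imposes a heavy cognitive burden and high annotation cost. The authors propose Imagining in 360°, a framework that decouples exploration into two modules, an Imaginator and an Actor. The Imaginator predicts the semantic layout of the environment in a single inference step and gives the Actor a distribution of spatial information, enabling efficient search under uncertainty. The method greatly reduces data engineering cost and markedly improves search efficiency and success rate in complex real-world environments.

Details
English abstract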

Humanoid Visual Search (HVS) requires agents to actively explore immersive 360$^\circ$ environments. While prior methods treat this as a monolithic task relying on cumulative, multi-turn Chain-of-Thought (CoT) reasoning, they impose heavy cognitive burdens and require expensive trajectory-level annotations. In this paper, we propose Imagining in 360$^\circ$, a novel framework that decouples the exploration process into a specialized Imaginator and an Actor. The Imaginator functions as a probabilistic predictor of spatial priors; instead of maintaining a cumulative reasoning chain, it infers the semantic layout of both observed and unobserved regions in a single step. By sampling multiple hypotheses within this semantic space, we provide the Actor with a distribution of effective spatial information, offering robust guidance that hedges against uncertainty during active search. This decoupled architecture significantly lowers data engineering costs by eliminating the need for full-trajectory CoT annotations, enabling the generation of over 1.96 million curated training samples. Extensive experiments demonstrate that explicitly modeling semantic spatial priors drastically improves search efficiency and success rates in complex, in-the-wild environments.

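2605.09132 2026-05-12 cs.CV Version update

KEPIL: Knowledge-Enhanced Prompt-Image Learning for Prompt-Robust Disease Detection

Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Robert Berke, Mauricio Reyes

Affiliations * University of Bern; Shanghai Jiao Tong University; Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern

AI summary This work proposes KEPIL, a knowledge-enhanced prompt-image learning framework that aims to improve prompt-based zero-shot inference in medical image diagnosis. Current vision-language models are sensitive to prompt variations and lack reliable external knowledge. KEPIL addresses this by incorporating structured medical knowledge through dynamic prompt enrichment, a semantic-aware contrastive loss, and entity-centric report standardization, which strengthens robustness and generalization. Experiments show state-of-the-art zero-shot performance on several benchmarks and markedly better diagnostic accuracy under prompt variation.

Details
English abstract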

Vision-language models (VLMs) show promise for clinical decision support in radiology because they enable joint reasoning over radiological images and clinical text, thereby leveraging complementary clinical information. However, radiological findings are long-tailed in practice, leaving some conditions underrepresented and making zero-shot inference essential. Yet current CLIP-style medical VLMs are sensitive to prompt variations and often lack trustworthy external knowledge at inference time, which hinders reliable clinical deployment. We present KEPIL, a prompt-robust framework that integrates curated medical knowledge to stabilize zero-shot generalization. KEPIL comprises: (i) dynamic prompt enrichment using ontologies with LLM assistance, (ii) a semantic-aware contrastive loss aligning embeddings of equivalent prompt variants via a dual-embedding objective, and (iii) entity-centric report standardization to yield ontology-aligned representations. Across seven benchmarks, KEPIL achieves state-of-the-art zero-shot inference performance; under prompt-variation tests, it improves AUC by 6.37% on CheXpert and by 4.11% on average. These results suggest that structured knowledge and robust prompt design are key to clinically reliable radiology-facing VLMs. Code will be released at https://github.com/Roypic/KEPIL.

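2605.09090 2026-05-12 cs.CV cs.AI Version update

Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

Gabriele Lombardo, Luigi Maiorana, Liliana Lo Presti, Marco La Cascia

Affiliations * Department of Engineering

AI summary This study investigates anisotropy in visual grounding models under controlled counterfactual perturbations, analyzing how models behave when given semantically mismatched captions. It introduces a similarity-controlled counterfactual caption generation protocol that systematically perturbs the object or contextual components of a caption, so that grounding behavior can be analyzed at different levels of alignment. Experiments indicate that embedding-space anisotropy is not the main cause of counterfactual errors, and that understanding robustness requires examining finer-grained geometric properties of the embedding space.

Comments To be published in the proceedings of the 5th Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2026

Details
English abstract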

Visual Grounding benchmarks assume that the object described by a referring expression is always present in the image, and grounding models are therefore rarely evaluated under semantically mismatched captions. In such cases, models frequently exhibit approximation behavior, producing a plausible bounding box that satisfies only part of the expression (e.g., preserving the original object while ignoring modified contextual cues). Because mismatched captions represent realistic edge cases, this behavior compromises reliability and raises concerns from an explainability perspective. Identifying its underlying causes is thus essential for improving model faithfulness and interpretability. Adopting a mechanistic interpretability viewpoint, this work examines whether embedding anisotropy contributes to counterfactual failures. A similarity-controlled counterfactual caption generation protocol is introduced to systematically perturb object or contextual components within predefined embedding similarity intervals, enabling a fine-grained analysis of grounding behavior as a function of alignment. Experiments on two Transformer-based models with markedly different embedding geometries (BERT-based TransVG and CLIP-based SwimVG) reveal no meaningful correlation between cosine similarity and approximation. These findings suggest that anisotropy alone does not account for counterfactual errors, and that robustness requires investigating finer-grained geometric properties of the embedding space.

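2605.09089 2026-05-12 cs.CV cs.AI Version update

Field-Localized Forgery Detection for Digital Identity Documents

Abhishek Kumar, Riya Tapwal, Carsten Maple, Mark Hooper

Affiliations * The Alan Turing Institute; IIT Mandi

AI summary This paper proposes FLiD, a lightweight field-localized forgery detection framework for digital identity documents used in remote identity verification, targeting localized tampering of key fields such as facial photographs and textual information. The method localizes key regions with an object detector, extracts compact feature embeddings with a frozen MobileNetV3-Small network, and classifies them with a lightweight network. Experiments show that FLiD clearly outperforms existing general-purpose forgery detectors on several metrics while using far fewer parameters and much less computation.

Details
English abstract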

Digital identity verification systems used in remote onboarding rely on document images to authenticate users, making them vulnerable to localized manipulations of key identity fields such as facial photographs and textual information. Existing forgery detection methods, developed primarily for natural-image forensics, show limited transferability to structured identity documents. We propose FLiD, a lightweight field-localized framework that targets critical identity regions rather than processing full-document images. A fine-tuned object detector first localizes face and text fields; a frozen MobileNetV3-Small backbone then extracts compact field-level embeddings, which are classified by a lightweight neural network with only 191K trainable parameters. FLiD achieves AUC scores of 0.880 (face), 0.954 (text), and 0.923 (both-field attacks), with corresponding EERs of 18.05%, 11.61%, and 15.16%, representing absolute reductions of 29-35 percentage points over a full-document baseline trained from scratch. FLiD also consistently outperforms general-purpose manipulation detectors (TruFor, MMFusion, UniVAD) across all attack scenarios while requiring 13x fewer parameters and 21x fewer FLOPs.

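2605.09071 2026-05-12 cs.CV Version update

Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation

Rohith Ramanan, A. N. Rajagopalan

Affiliations * Department of Physics; Indian Institute of Technology Madras; Department of Electrical Engineering

AI summary This paper proposes Probability-Flow Distillation (PFD), a new method that addresses the mode collapse and loss of detail affecting existing text-to-3D methods such as Score Distillation Sampling (SDS) and its variants. By formulating distillation as an exact Wasserstein gradient flow, PFD achieves more accurate distribution matching and can generate 3D models with fine-grained, high-fidelity details, improving generation quality.

Details
English abstract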

Score Distillation Sampling (SDS) and its variants have been widely used for text-to-3D generation by distilling 2D image diffusion priors. However, the standard SDS objective is prone to severe mode collapse, frequently yielding over-smoothed and over-saturated results. Although recent advancements, such as Score Distillation via Inversion (SDI), mitigate these artifacts and produce visually sharper models, they ultimately fail to faithfully capture the full target distribution. In this work, we show that the bottleneck limiting the sampling capacity of SDI stems from its reliance on the posterior mean estimator, which is mathematically equivalent to a single-step Euler approximation of the deterministic reverse DDIM trajectory. To address this, we propose a naturally motivated extension termed Probability-Flow Distillation (PFD). We establish that PFD corresponds exactly to a Wasserstein gradient flow, thereby inducing principled distribution-matching dynamics. Finally, we show that PFD can synthesize 3D assets with fine-grained, high-fidelity details and achieve improved quality compared to existing methods.

2605.09067 2026-05-12 cs.CV Version update

Reducing Annotation Burden for Femoral Cartilage Segmentation in Knee MRI via Cross-Sequence Transfer Learning

Francesco Chiumento, Gianluigi Crimi, Elisa Moretta, Rocco Milieri, Alberto Bazzocchi, Giulio Vara, Giacomo Dal Fabbro, Stefano Zaffagnini, Fulvia Taddei, Serena Bonaretti

Affiliations * School of Electronic Engineering, Dublin City University; Bioengineering and Computing Laboratory, IRCCS Istituto Ortopedico Rizzoli; Dipartimento di Scienze Mediche e Chirurgiche (DIMEC), Alma Mater Studiorum – Università di Bologna; Diagnostic and Interventional Radiology, IRCCS Istituto Ortopedico Rizzoli; Department of Biomedical and Neuromotor Sciences, University of Bologna; 2nd Orthopedics and Trauma Unit, IRCCS Istituto Ortopedico Rizzoli; Independent Researcher

AI summary This study aims to reduce the manual annotation burden of femoral cartilage segmentation in knee MRI through cross-sequence transfer learning, testing bidirectional transfer between dual-echo steady-state (DESS) and sagittal proton density-weighted 3D fast spin-echo (Cube) sequences. A modified 2D U-Net was pretrained on 507 DESS images from the OAI dataset and then transferred across sequences. Cube-to-DESS transfer matched same-sequence training, whereas DESS-to-Cube transfer needed more annotated data, and lesions affected segmentation differently in the two sequences. The results offer an effective way to reduce annotation effort in medical image segmentation.

Details
English abstract

Purpose: To develop and evaluate cross-sequence transfer learning for automatic femoral cartilage segmentation, testing bidirectional transfer between dual-echo steady-state (DESS) and sagittal proton density-weighted 3D fast spin-echo (Cube) sequences. Materials and Methods: We optimized a modified 2D U-Net on 507 DESS images from the Osteoarthritis Initiative (OAI). We then established same-sequence baselines using subject-level cross-validation on a subset of 44 OAI DESS images and 44 Cube images acquired at the Istituto Ortopedico Rizzoli, Bologna, Italy. Each subset included 22 non-lesioned and 22 lesioned subjects. Finally, we performed transfer learning across sequences by fine-tuning the pretrained models on the target sequence with increasing training set sizes to study convergence, while keeping validation and test sets fixed. Segmentations were evaluated using Dice similarity coefficient (DSC) and average surface distance (ASD). Lesion effects were assessed with two-sided Mann-Whitney U tests with Bonferroni correction. Results: Same-sequence training yielded higher accuracy on DESS than Cube (DSC, $0.900$ vs $0.830$; $P < .001$). Cube-to-DESS transfer matched DESS performance (DSC, $0.903 \pm 0.032$ vs $0.900 \pm 0.027$), reaching a performance plateau at 9 training subjects. DESS-to-Cube yielded a lower combined DSC ($0.802 \pm 0.049$ vs $0.830 \pm 0.042$), reaching a plateau at 24 training subjects. Lesions did not affect DESS ($P \ge .39$) but reduced Cube accuracy (DSC, $0.805$ vs $0.856$; $P < .001$). Conclusion: Transfer learning across sequences can substantially reduce target-sequence annotation requirements for femoral cartilage segmentation, but performance is direction- and sequence-dependent, and the effects of lesions on segmentation may vary across MRI sequences.
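
For readers unfamiliar with the Dice similarity coefficient used above, a minimal sketch follows. It computes DSC = 2|A ∩ B| / (|A| + |B|) on flattened binary masks. The function name `dice` and the sample masks are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two flattened binary masks."""
    # Voxels labelled as cartilage in both masks.
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels in each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total

# Hypothetical masks: one voxel agrees, two foreground voxels in each mask.
pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
score = dice(pred, truth)
```

A DSC of 0.900 therefore means that twice the overlap volume equals 90% of the combined predicted and reference volumes.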

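2605.09065 2026-05-12 cs.CV cs.LG Version update

Dependency-Aware Discrete Diffusion for Scene Graph Generation

Rajalaxmi Rajagopalan, Romit Roy Choudhury

Affiliations * University of Illinois, Urbana-Champaign

AI summary This study proposes a dependency-aware discrete diffusion model for scene graph generation, addressing the challenge of producing structured scene graphs from natural language. The method decouples structure and semantics across the forward and reverse processes and captures the conditional dependencies between objects, edges, and relations, yielding scene graphs that better match the text. Experiments show improvements over existing continuous and discrete graph generation methods on standard benchmarks, and better compositional alignment in downstream image generation, particularly in multi-object scenes.

Details
English abstract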

Scene graphs (SGs) represent objects and their relationships as structured graphs, enabling applications in image generation, robotics, and 3D understanding. Recent work suggests that conditioning image generation on scene graphs improves compositional fidelity compared to text-only prompting. However, since users typically provide text rather than structured graphs, a key challenge is to generate scene graphs from natural language. Prior work on discrete diffusion has demonstrated success in generating generic graphs such as molecules and circuits, but fails to account for the hierarchical structure and strong dependencies between objects, edges, and relations in scene graphs. We address this limitation by introducing a dependency-aware, hierarchically constrained discrete diffusion model for scene graph generation. Our approach decouples structure and semantics across the forward and reverse processes, enabling the model to capture conditional dependencies. At inference time, we perform training-free conditioning to sample text-aligned scene graphs. We evaluate our method on standard SG benchmarks and demonstrate improvements over both continuous and discrete graph generation baselines across graph and layout metrics. When fed to downstream image generation, our approach yields improved compositional alignment compared to text-to-image models, particularly in multi-object scenarios.

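2605.09053 2026-05-12 cs.CV Version update

LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation

Jiankun Peng, Jianyuan Guo, Yiguang Yang, Yue Liu, Jiashuang Yan, Ying Xu

Affiliations * The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing; The School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing; The Department of Computer Science, City University of Hong Kong, Hong Kong, SAR, China

AI summary In Vision-Language Navigation in Continuous Environments (VLN-CE), existing online topological planning methods still suffer from redundant local depth information and from weakened focus on current candidate nodes as the topological graph grows. This paper proposes LCGNav, a modular local geometric enhancement framework that converts candidate depth views into 3D point clouds and applies physical truncation based on the reachable range, giving more compact local geometric modeling. LCGNav also introduces a dimension-preserving local fusion strategy that applies geometric enhancement only to the currently relevant ghost nodes, without changing the original planner interface. Experiments show that LCGNav is an effective cross-architecture enhancement module that improves key metrics of several representative online topological methods at low training cost, and that it achieves the best performance among the compared online topological methods on the val-unseen splits of R2R-CE and RxR-CE.

Details
English abstract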

Online topological planning has become an effective paradigm for Vision-Language Navigation in Continuous Environments (VLN-CE), but existing methods still suffer from two limitations: redundant local depth information and weakened focus on current frontier candidates as the topological graph grows. To address this, we propose LCGNav, a modular local geometric enhancement framework for topological VLN. LCGNav explicitly converts candidate depth views into 3D point clouds and applies physical truncation based on the agent's reachable range, enabling more compact local geometric modeling. It further introduces a dimension-preserving local fusion strategy with transient state degradation, so that geometric enhancement is applied only to the currently relevant ghost nodes without changing the original planner interface. Experiments on R2R-CE and RxR-CE show that LCGNav serves as an effective cross-architecture enhancement module, consistently improving multiple key metrics of representative online topological baselines with low additional training cost. When integrated with ETP-R1, LCGNav achieves the best performance among the compared online topological methods on the val-unseen splits of the R2R-CE and RxR-CE benchmarks. The code is available at https://github.com/shannanshouyin/LCGNav.

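2605.09050 2026-05-12 cs.RO cs.CV Version update

Automated Robotic Moisture Monitoring in Agricultural Fields

Senthil Palanisamy, Akila I. S

Affiliations * Coimbatore Institute of Technology

AI summary This paper develops an automated robotic system for soil moisture monitoring in large agricultural fields. The system combines on-field moisture sensors with a robot, plans paths with Dijkstra's algorithm, and computes soil moisture with image processing, giving efficient and economical monitoring. The authors set up a small study field and tested a prototype to show that the approach is feasible.

Comments 2018 International Seminar on Intelligent Technology and Its Applications (ISITIA)

Journal ref 2018 International Seminar on Intelligent Technology and Its Applications (ISITIA)

Details
English abstract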

Monitoring the moisture level of land in a large-scale plantation is tedious. The main objective of this project is to use a robotic kit in collaboration with on-field moisture sensor circuits, thereby creating an efficient and economical moisture monitoring system. A large agricultural field is divided into smaller grids, and a moisture sensor is placed in each grid. Whenever a sensor reports the soil to be dry, the robot goes to the concerned grid for inspection. The path to that grid is found by applying Dijkstra's shortest path algorithm to the aerial image of the field. The total moisture content of the field is then calculated by the robot using suitable image processing algorithms and reported accordingly. For developing and testing this work, a small study field was set up, above which a camera was mounted at an appropriate height to capture its aerial view. Thus a prototype for an automated system of monitoring agricultural fields' moisture has been developed through this work.
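
The path-planning step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the aerial image has already been converted into an occupancy grid (0 for free cells, 1 for blocked cells), uses 4-connected moves with unit cost, and the function name `shortest_path` and the sample grid are hypothetical.

```python
import heapq

def shortest_path(grid, start, goal):
    """Dijkstra's algorithm on a 4-connected grid; returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale queue entry, a shorter route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None  # the goal cell cannot be reached
    # Walk back from the goal to the start to recover the route.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical field: a blocked strip forces the robot around the right side.
field = [[0, 0, 0],
         [1, 1, 0],
         [0, 0, 0]]
route = shortest_path(field, (0, 0), (2, 0))
```

With unit step costs this behaves like breadth-first search; Dijkstra's algorithm becomes useful once cells carry different traversal costs, for example if muddy ground is more expensive to cross than dry ground.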

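2605.09039 2026-05-12 cs.CV Version update

SeasonScapes: Learning Large-scale Re-lightable 3D Landscapes with Seasonal Variation from Sparse Webcams

Timo Kleger, Qi Ma, Deheng Zhang, Luc Van Gool, Danda Pani Paudel

Affiliations * ETH Zurich

AI summary This paper presents the SeasonScapes framework and a large-scale dataset of 3D landscapes with seasonal variation. The dataset contains more than 85,000 webcam images from 32 locations at 13 timestamps and covers over 50 km x 60 km of Swiss mountain terrain. Timestamp-specific images are projected onto a 3D mesh to build seasonal 3D landscapes that reflect how natural appearance changes over time. To handle occlusions and missing data, the work uses conditional diffusion models for image-guided inpainting on the mesh, and the completed meshes can be relit with a standard physically based renderer.

Details
English abstract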

We introduce the SeasonScapes framework and the SeasonScapes dataset: Swiss Sparse-view Mountain Scenes with Seasonal Changes, which covers over 50 km x 60 km and is composed of more than 85,000 webcam images captured from 32 different locations across 13 timestamps throughout a full year. By projecting these timestamp-specific images onto a 3D mesh, we construct seasonal 3D landscapes that reflect natural appearance changes over time. To address occlusions and missing data, we leverage conditional diffusion models for image-guided inpainting directly on the mesh. The resulting completed meshes can be further relit using a standard physically based renderer.

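2605.09030 2026-05-12 cs.CV cs.LG Version update

When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation

Jörg Frochte

Affiliations * Bochum University of Applied Sciences

AI summary This paper examines the limitations of using Contrastive Style Descriptor (CSD) cosine similarity as an absolute style-fidelity metric in artist-style evaluation, and proposes a diagnostic called the discrimination gap that tests whether the metric can separate same-style from different-style pairs on a given artist corpus. Raw CSD cosine yields negative point-estimate gaps for a number of artists, so it cannot be used as an absolute score. A CSLS readout and positional-embedding interpolation markedly improve evaluation accuracy. The paper recommends running the diagnostic before using CSD cosine as a style score and recommends the improved CSD+ protocol for greater reliability.

Comments 24 pages, 7 figures, 19 tables

Details
English abstract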

Raw cosine in the 768-dimensional output space of the Contrastive Style Descriptor (CSD) is now widely read as an absolute, calibrated style-fidelity score for text-to-image and style-imitation evaluation. We introduce the discrimination gap, a corpus-internal, prototype-free and threshold-free diagnostic that tests whether contrastive style cosines admit an absolute same-versus-different interpretation on a candidate artist corpus. On a 1799-artwork, 91-artist public-domain corpus, raw CSD cosine yields negative point-estimate gaps for $23/91$ artists at the pairwise level ($2/91$ robust under bootstrap) and for $15/91$ in the aggregated-pool scoring regime style-fidelity evaluations typically use. CSLS readout on the frozen backbone reduces the aggregated negative-gap count to $4/91$; combined with positional-embedding interpolation to $336$ pixels it raises unsupervised pair-verification AUC from $0.883$ to $0.905$ across $25$ artist-disjoint splits. We refer to this diagnostic-driven readout protocol on the frozen backbone (CSLS as default, pos-interp $336$ as the stronger optional setting) as CSD+, not a new encoder. A cross-backbone check on CLIP-ViT-L/14, SigLIP-large and DINOv2-Large reproduces the same shared-tradition failure pattern, providing evidence that the residual reflects a shared limitation of the four backbones we tested rather than a CSD-specific artefact. Practical implication: before reporting CSD cosine as an absolute style-fidelity score, run the diagnostic on the candidate corpus; CSLS is the minimal correction when it fails.
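
To make the CSLS readout concrete, here is a small sketch of the generic Cross-domain Similarity Local Scaling formula, CSLS(i, j) = 2 cos(i, j) - r(i) - r(j), where r(i) is the mean cosine between item i and its k nearest neighbours. The abstract does not say how neighbourhoods are formed in the paper, so using a single shared pool of embeddings is an assumption of this sketch, and the names `cosine`, `csls_matrix`, and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def csls_matrix(embs, k=2):
    """CSLS scores for all pairs in one pool of embeddings (requires len(embs) > k)."""
    n = len(embs)
    cos = [[cosine(embs[i], embs[j]) for j in range(n)] for i in range(n)]
    # r[i]: mean cosine of item i to its k nearest neighbours, excluding itself.
    r = []
    for i in range(n):
        others = sorted((cos[i][j] for j in range(n) if j != i), reverse=True)
        r.append(sum(others[:k]) / k)
    # Penalise items that are close to everything ("hubs").
    return [[2 * cos[i][j] - r[i] - r[j] for j in range(n)] for i in range(n)]

# Toy 2D embeddings standing in for the 768-dimensional style descriptors.
embs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
scores = csls_matrix(embs, k=1)
```

The correction subtracts each item's local neighbourhood density, so a pair only scores highly when the two items are closer to each other than they are to their typical neighbours. That is one plausible reason a relative readout can behave better than raw cosine on a corpus where many artists share a tradition.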

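2605.09025 2026-05-12 cs.CV cs.LG Version update

MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift

Kiran Naseer, Naveed Anwer Butt

Affiliations * University of Gujrat, Pakistan

AI summary This paper proposes MedFL-Stress, a systematic stress-testing framework for evaluating how robust federated brain tumor segmentation models are to cross-hospital MRI appearance shift. By applying graded MRI appearance shifts, the study exposes performance differences between hospitals that existing federated evaluation hides, and compares FedAvg, FedProx, and FedBN. Experiments show that FedBN does better at raising worst-hospital segmentation performance and narrowing the gap between hospitals, underlining the importance of robustness evaluation in federated medical imaging.

Details
English abstract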

Federated learning enables hospitals to collaboratively train segmentation models without sharing patient data. However, current evaluation protocols report only average performance across clients, masking failures at individual sites. In clinical deployment, a model that fails consistently at one hospital is a real safety risk that a good mean score can hide entirely. We introduce MedFL-Stress, a controlled stress-testing framework that exposes exactly this failure mode. Using 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, we apply graded MRI appearance shifts (gamma contrast, scale-shift, and noise-plus-blur) reflecting scanner and acquisition variability in real multi-site deployments. Three federated baselines are evaluated: FedAvg, FedProx, and FedBN. Worst-hospital Dice and inter-hospital disparity are treated as primary metrics, not supplementary observations. FedAvg achieves the highest global mean Dice (0.8159) but conceals a 0.0850 gap between its best and worst-performing hospital. FedBN closes that gap by 41% (0.0850 to 0.0503) while sacrificing less than half a Dice point in mean accuracy (0.8159 to 0.8109), and the weakest hospital gains 3.5 Dice points outright (0.7309 to 0.7656). These findings demonstrate that robustness-oriented evaluation protocols are essential for reliable federated medical imaging deployment.
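
The primary metrics named in this abstract are simple to compute once per-hospital Dice scores are available. The sketch below is an assumption-laden illustration: it defines disparity as best minus worst hospital Dice (matching the abstract's wording about the gap between best and worst hospital), uses an unweighted mean across hospitals, and the function name `robustness_summary` and the per-hospital scores are hypothetical. The final line rechecks the reported 41% gap reduction from the two disparity values quoted in the abstract.

```python
def robustness_summary(dice_by_hospital):
    """Summarise per-hospital Dice scores with mean, worst case, and disparity."""
    scores = list(dice_by_hospital.values())
    worst = min(scores)
    best = max(scores)
    return {
        "mean": sum(scores) / len(scores),  # unweighted mean across hospitals (assumed)
        "worst": worst,                     # worst-hospital Dice
        "disparity": best - worst,          # inter-hospital gap
    }

# Hypothetical per-hospital scores, not taken from the paper.
summary = robustness_summary({"A": 0.80, "B": 0.75, "C": 0.85, "D": 0.70})

# Relative gap reduction from the disparities reported in the abstract.
gap_reduction = (0.0850 - 0.0503) / 0.0850
```

The relative reduction evaluates to roughly 0.41, consistent with the 41% figure in the abstract.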

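2605.09024 2026-05-12 cs.CV cs.GR cs.MM eess.IV Version update

Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

Adrian Azzarelli, Nantheera Anantrasirichai, James Pollock, David R. Bull

Affiliations * University of Bristol, UK; Lux Aeterna, Bristol, UK

AI summary This study proposes a 3D reconstruction and relighting framework designed for virtual production (VP) with high-resolution image-based lighting, addressing the coupling of background and lighting and the low resolution of environment maps in conventional approaches. The method uses Gaussian Splatting and conditions relighting on the known background imagery, so it does not rely on environment maps and reduces compositing to a background-image editing task. Using a dataset of real VP scenes, it decomposes a scene into fixed appearance and variable lighting components, giving efficient, controllable, high-quality 3D reconstruction and relighting with support for several output variables.

Details
English abstract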

Virtual production (VP) uses LED walls to provide both background imagery and image-based lighting. While this enables on-set compositing, it couples lighting to background and scene appearance, limiting flexibility for downstream editing. In addition, inverse rendering conventionally relies on physically-based rendering to estimate 3D geometry and lighting, using environment maps. However, these maps are typically low-resolution and assume far-field lighting. In VP, with near-field and high-resolution image-based lighting, this can lead to inaccuracies and introduce complexities when editing. To address this, we propose a VP-specific framework for 3D reconstruction and relighting using Gaussian Splatting. This uses the known background imagery to condition the relighting process. This avoids relying on environment maps and reduces compositing to a background-image editing task. To realize our framework, we introduce a process (and associated dataset) that captures real VP scenes under varying background content and illumination conditions. This data is used to decompose a 3D scene into fixed appearance and variable lighting components. The variable lighting process simulates light transport by parameterizing each primitive with a UV coordinate, intensity value and resolution modifier. Using mipmaps, these directly sample the background texture in image space, implicitly capturing reflections and refractions without physically-based rendering. Combined with the fixed appearance component, this allows us to render relit scenes using a Gaussian Splatting rasterizer. Compared to baselines, our approach achieves higher-quality 3D reconstruction and controllable relighting. The method is efficient (<3 GB RAM, <5 GB VRAM, <2 hours training, ~35 FPS) and supports rendering useful arbitrary output variables including depth, lighting intensity, lighting color, and unlit renders.

2605.09002 2026-05-12 cs.CV cs.AI Version update

CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification

Lavsen Dahal, Joseph Y. Lo

Affiliations * Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories; Department of Radiology, Duke University School of Medicine; Electrical and Computer Engineering, Pratt School of Engineering, Duke University; Medical Physics Graduate Program, Duke University

AI summary This paper proposes CT-IDP, an interpretable disease classification framework based on abdominal CT segmentation. It generates multi-organ segmentations and extracts more than 900 quantitative phenotype features for disease classification. The method was trained and validated on the MERLIN dataset and externally evaluated on two independent datasets, where CT-IDP achieved a higher macro-AUC than a DINOv3-based vision-transformer baseline, indicating that it is both effective and interpretable.

Details
English abstract

In this retrospective multi-institutional study, a quantitative phenotyping framework, CT-IDP (CT Image-Derived Phenotypes), was developed on the MERLIN abdominal CT benchmark (training, validation, and test sets: 15,175, 5,018, and 5,082 studies, respectively) and externally evaluated on two independent datasets: Duke-Abdomen (2,000) and AMOS (1,107). Multi-organ segmentations were generated with TotalSegmentator and used to derive over 900 organ and compartment-level descriptors spanning morphometry, attenuation, and contextual/burden findings. Sparse disease-specific logistic regression with elastic-net regularization was trained on MERLIN and externally validated under a frozen specification. Performance was compared against a DINOv3-based vision-transformer baseline using AUC and average precision (AP), supported by phenotype-stratified audits and coefficient-level inspection. Macro-AUC for CT-IDP versus the baseline was 0.897 versus 0.880 on MERLIN, 0.877 versus 0.857 on the Duke-Abdomen dataset, and 0.780 versus 0.756 on AMOS.

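2605.08985 2026-05-12 cs.CV Version update

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Kechen Fang, Yihua Qin, Chongyi Wang, Wenshuo Ma, Tianyu Yu, Yuan Yao

Affiliations * Tsinghua University; ModelBest

AI summary This study targets the visual-encoding compute bottleneck that high-resolution image inputs create in Multimodal Large Language Models (MLLMs) and proposes LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme. Controlled experiments show that slice-based encoding, which preserves local detail, outperforms conventional global encoding. The work also introduces early compression in shallow ViT layers, which substantially lowers computation without hurting downstream performance. Across several benchmarks the method reduces visual-encoding FLOPs by 55.8% while matching or surpassing the baseline.

Details
English abstract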

Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
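
To see why compressing tokens inside the ViT can save more than compressing after it, a back-of-the-envelope cost model helps. The sketch below uses a standard per-layer multiply-accumulate estimate for a transformer block; the depth, width, token count, compression layer, and compression ratio are all hypothetical values chosen for illustration, not the LLaVA-UHD v4 configuration, and the result is not expected to reproduce the 55.8% figure.

```python
def layer_macs(n_tokens, width):
    """Approximate multiply-accumulates for one transformer block.

    12 * n * d^2 covers the Q/K/V/output projections (4 * n * d^2) plus an MLP
    with 4x expansion (8 * n * d^2); 2 * n^2 * d covers QK^T and the
    attention-weighted sum over V.
    """
    return 12 * n_tokens * width**2 + 2 * n_tokens**2 * width


def encoder_macs(tokens_per_layer, width):
    """Total cost when layer i processes tokens_per_layer[i] tokens."""
    return sum(layer_macs(n, width) for n in tokens_per_layer)


# Hypothetical encoder: 24 layers, width 1024, 1024 input tokens.
depth, width, n_tokens = 24, 1024, 1024

# Post-ViT compression: every layer sees the full token sequence.
post_vit_cost = encoder_macs([n_tokens] * depth, width)

# Intra-ViT early compression: 4x token reduction after layer 6 (assumed).
early_cost = encoder_macs([n_tokens] * 6 + [n_tokens // 4] * (depth - 6), width)

saving = 1 - early_cost / post_vit_cost
print(f"relative saving under this toy model: {saving:.1%}")
```

Moving the compression point earlier or raising the ratio increases the saving in this model; the abstract's claim is that this can be done without degrading downstream accuracy.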

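2605.08974 2026-05-12 cs.CV cs.AI version update

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

Tri Cao, Khoi Le, Thong Nguyen, Cong-Duy Nguyen, Quynh Vo, Anh Tuan Luu, Chunyan Miao, See-Kiong Ng, Shuicheng Yan, Bryan Hooi

Affiliations * National University of Singapore; VinUniversity; Nanyang Technological University

AI summary Although multimodal large language models have advanced video understanding, they still hallucinate readily in dynamic scenes. This paper attributes the problem to a lack of persistent spatio-temporal monitoring, i.e., the inability to track how object identities, states, and relations change over time. The authors propose the STEMO-Bench benchmark, which evaluates intermediate reasoning over object-centric facts, and the STEMO-Track framework, which significantly improves the accuracy and consistency of spatio-temporal reasoning through structured trajectory construction and temporal aggregation.

Comments Code: https://github.com/nguyentthong/video_hallucination

Details
English abstract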
While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.

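2605.08971 2026-05-12 cs.CV cs.AI version update

Extrusion Segmentation Strategy to improve CAD Reconstruction from Point Cloud

Said Harb, Mehdi Maboudi, Markus Gerke

Affiliations * Institute of Geodesy and Photogrammetry

AI summary This paper studies CAD model reconstruction from point clouds and proposes an extrusion-segmentation strategy that decomposes complex shapes into basic extrusion parts, improving the reconstruction performance of deep learning models. By increasing data diversity, the approach improves generalization and robustness, offering a simple yet effective way to produce structured CAD models from unordered point clouds.

Comments Conference: ISPRS Toronto 2026

Details
English abstract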
Computer-Aided Design is ubiquitous in today's world, as almost every manufactured object, across industries, begins as a digital model. At the same time, advances in 3D sensing have made point clouds a dominant form of raw 3D data. Recovering the CAD model of a physical object from its point cloud scan has two major applications: reverse engineering, where physical or hand-crafted prototypes need to be reconstructed automatically as editable digital models, and quality control, where recovering the CAD description of a manufactured object helps quantify and understand deviations introduced during the production process. Thus, converting unordered point clouds into structured CAD models is increasingly important for modern applications. Deep learning has enabled major progress in computer vision for both 2D and 3D data, and new datasets facilitate data-driven CAD reconstruction. Building on this foundation, we develop an end-to-end model that reconstructs CAD models from point clouds and introduce a segmentation approach that decomposes them into individual extrusions. These partial shapes increase data diversity, improving the generalization and robustness of deep learning models. Our strategy thereby provides a simple yet effective way to increase the reconstruction performance of deep learning models.

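2605.08965 2026-05-12 cs.CV version update

Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

Naeun Lee, Hyunjong Kim, Sunghwan Choi, Injin Kong, Yohan Jo

Affiliations * Seoul National University

AI summary Although multimodal large language models (MLLMs) perform well on multimodal tasks, predicting whether and why an image is persuasive remains challenging. The paper finds that having MLLMs reason before predicting does not consistently help and can even hurt, suggesting that the generated rationales are unreliable. It therefore proposes supervised fine-tuning on diverse teacher-generated rationales, which improves visual persuasiveness prediction, and introduces a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. The framework reveals a gap between prediction performance and rationale faithfulness and points to new directions for training more faithful visual-persuasion models.

Details
English abstract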
Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.

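2605.08952 2026-05-12 cs.CV version update

FugSeg: Fast Uncertainty-aware Ground Segmentation for 3D Point Cloud

Yu Li, Volker Schwieger

Affiliations * Institute of Engineering Geodesy, University of Stuttgart; Daimler Truck AG

AI summary In LiDAR-based environment perception systems, ground segmentation is a key preprocessing step for applications such as mapping and navigation. To address challenges such as reflection noise and isolated ground points, this paper proposes FugSeg, a fast, uncertainty-aware ground segmentation method. It represents the point cloud as a polar grid map and introduces an adaptive slope and a mechanism for handling noisy ground points, improving segmentation reliability in complex terrain. Experiments show that FugSeg outperforms existing non-learning methods on several public datasets and runs efficiently on a single CPU thread, making it suitable for resource-limited systems.

Comments Accepted for publication in IEEE Transactions on Intelligent Transportation Systems

Journal ref IEEE Transactions on Intelligent Transportation Systems (Early Access), 2026

Details
English abstract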
In LiDAR-based environment perception systems, ground segmentation is a key preprocessing step supporting various applications such as mapping and navigation. Although extensively studied, problems such as reflection noise and isolated ground remain challenging. To address these issues, we propose FugSeg, a fast uncertainty-aware ground segmentation method. A polar grid map is adopted as the point cloud representation to ensure generalizability across LiDAR types. Building on that, we develop a within- and cross-segment ground labeling strategy that identifies not only directly visible ground cells but also those that are isolated or occluded. During this process, an adaptive slope is introduced, which incorporates measurement uncertainties to enhance its reliability under complex terrain. Finally, to achieve point-level ground segmentation, a fine-grained ground elevation estimation method is introduced. Throughout the complete workflow, reflection noise is explicitly handled via the proposed noisy ground cells. We conduct comprehensive evaluations on four public datasets covering both structured and unstructured environments. Results show that FugSeg outperforms state-of-the-art non-learning methods, achieving the highest F1, accuracy, and mIoU across all datasets, while maintaining the fastest runtime (135 Hz and 487 Hz for 64- and 32-layer LiDARs) using a single CPU thread, making it suitable for resource-limited systems. The code will be available at https://github.com/Leo-YuLi/FugSeg.
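
The polar grid map mentioned above can be illustrated with a small binning routine. This is only a sketch of the general representation, assuming uniform ring widths and equal-angle sectors; the function names, the `ring_width` and `n_sectors` parameters, and the toy points are hypothetical, and FugSeg's actual cell layout, adaptive slope, and noise handling are not reproduced here.

```python
import math


def polar_cell(x, y, ring_width=1.0, n_sectors=180):
    """Map a point's horizontal position to a (ring, sector) cell index."""
    radius = math.hypot(x, y)
    azimuth = math.atan2(y, x) % (2 * math.pi)  # wrap into [0, 2*pi]
    ring = int(radius // ring_width)
    # Clamp guards against azimuth rounding up to exactly 2*pi.
    sector = min(int(azimuth / (2 * math.pi) * n_sectors), n_sectors - 1)
    return ring, sector


def build_grid(points, ring_width=1.0, n_sectors=180):
    """Group point heights (z) by polar cell; points are (x, y, z) tuples."""
    grid = {}
    for x, y, z in points:
        cell = polar_cell(x, y, ring_width, n_sectors)
        grid.setdefault(cell, []).append(z)
    return grid


# Hypothetical toy scan with a coarse 4-sector grid.
toy_points = [(2.5, 0.1, -1.7), (2.6, 0.1, -1.6), (-1.0, -1.0, 0.3)]
toy_grid = build_grid(toy_points, ring_width=1.0, n_sectors=4)
for cell, heights in sorted(toy_grid.items()):
    print(cell, "lowest z:", min(heights))
```

Because the cell index depends only on range and azimuth, the same binning applies to sensors with different numbers of layers, which is consistent with the generalizability argument in the abstract.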

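2605.08945 2026-05-12 cs.CV version update

PIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment

Qiqi Li, Pengfei Wang, Nenggan Zheng

Affiliations * Qiushi Academy for Advanced Studies (QAAS), Zhejiang University; College of Computer Science and Technology, Zhejiang University; School of Software Technology, Zhejiang University; State Key Lab of Brain-Machine Intelligence; Collaborative Innovation Center for Artificial Intelligence by MOE and Zhejiang Provincial Government (ZJU); Zhejiang Lab

AI summary This paper proposes PIDNet, a progressive implicit decoupling network for multimodal action quality assessment. It progressively fuses modality-specific information, cross-modal complementary cues, and global quality semantics to improve assessment accuracy. The core iMambaWave module combines a bidirectional Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local detail variations, respectively, with a gated aggregation mechanism that adaptively fuses temporal and frequency-domain information. Experiments show that PIDNet outperforms existing unimodal and multimodal methods on multiple datasets and offers good generality and modularity.

Comments 14 pages, 6 figures, 11 tables

Details
English abstract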
Action quality assessment (AQA) aims to automatically quantify the execution quality of human actions in videos and is valuable for applications such as competitive sports judging. In multimodal AQA, quality evidence from different modalities is heterogeneous, and quality cues evolve progressively over time. Existing methods often rely on coarse fusion or unified temporal modeling, which may blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. To address these issues, we propose a progressive implicit decoupling and fusion network (PIDNet) that progressively integrates modality-specific information, cross-modal complementary cues, and global quality semantics for accurate assessment. Specifically, we design an iMambaWave module that maps RGB, optical flow, and audio features into a shared latent space and disentangles them with a Bi-Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local perturbation details, respectively. A gated aggregation mechanism adaptively fuses temporal and frequency-domain information. We further build a three-stage progressive fusion network using Group3M blocks, where modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich feature representations. Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control compared with existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each component. Moreover, iMambaWave consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.

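2605.08911 2026-05-12 cs.CV version update

Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning

Han Li, Yulu Gao, Si Liu, Yuhang Wang, Bo Liu, Beipeng Mu

Affiliations * School of Artificial Intelligence, Beihang University; Zhongguancun Academy, Beijing, China; Hangzhou International Innovation Institute, Beihang University; Meituan, Beijing, China

AI summary Autonomous vehicles must perceive not only physical elements of the driving scene, such as lane lines and traffic lights, but also logical information such as lane centerlines and their topology. This paper proposes UniTopo, a method that models lanes and lane topology in a unified way: by representing topological relationships between lanes as connections, it obtains lane positions and topology within a single perception pipeline, establishing a new paradigm for perceiving lane topology directly from raw image features. Experiments show that it significantly outperforms existing state-of-the-art methods on the OpenLane-V2 benchmark.

Comments Accepted by IEEE TCSVT

Details
English abstract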
Autonomous vehicles need to perceive not only physical elements in the driving scene, such as lane lines and traffic lights, but also logical elements like lane centerlines and their topology. Existing lane topology reasoning methods typically follow a reasoning-by-detection paradigm, where lane topological relationships are primarily derived from lane detection results. In this paper, we propose an innovative method called Unified Modeling of Lane and Lane Topology (UniTopo), which represents the topological relationships between lanes as connected lanes, encompassing predecessor lanes, successor lanes, and their interconnections. This unified representation of lanes and lane topology allows us to simultaneously obtain both the positions and topological information of lanes within a shared perception pipeline, establishing a new paradigm for directly perceiving lane topology from original image features. We validate our method on the driving scene reasoning benchmark OpenLane-V2, which consists of two subsets, built based on Argoverse2 and nuScenes, respectively. Our method achieves TOP_ll of 30.1% and 31.8% on the two subsets, significantly surpassing the existing state-of-the-art method T^2SG by 6.0% and 8.6%.

2605.08902 2026-05-12 cs.CV cs.AI version update

DAPE: Dynamic Non-uniform Alignment and Progressive Detail Enhancement Techniques for Improving the Performance of Efficient Visual Language Models

Mengyuan Tian, Qiyan Zhao, Yanan Wang, Da-Han Wang

Affiliations * The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen 361005, China; Fujian Key Laboratory of Pattern Recognition and Image Understanding, School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China

AI summary This paper proposes DAPE, a new framework for improving the performance of efficient vision-language models. Using dynamic non-uniform alignment and progressive detail enhancement, it addresses the uneven distribution of information density between text and images and enables more precise cross-modal interaction. Experiments show significant accuracy gains on downstream tasks across multiple benchmarks, along with reduced computational overhead.

Comments Accepted in ICIC 2026 Oral

Details
English abstract
In recent years, pre-trained visual-linguistic models have demonstrated tremendous potential, becoming a crucial foundational framework for numerous downstream tasks. However, the information density between text and images is not uniformly distributed. Existing methods often overlook the inherent and dynamic differences in information density and semantic scope between text tags and image blocks. These common uniform alignment strategies result in coarse-grained cross-modal interactions and loss of fine semantic details. Moreover, pursuing finer alignment typically requires substantial computational overhead, limiting practical model deployment. To address this challenge, this paper proposes a novel framework for dynamic cross-modal alignment with continuous detail introduction. First, we design a dynamically adaptive cross-modal matching mechanism that uses a learnable matching function to dynamically assign varying numbers and sizes of image tags to text tags of the same size but different information density, enabling more precise attention interaction. Second, we develop a continuous detail introduction module to progressively incorporate high-resolution visual feature enhancement into the alignment process. Extensive experiments across multiple benchmarks demonstrate significant improvements in the accuracy of various downstream tasks while reducing computational overhead.

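2605.08874 2026-05-12 cs.CV version update

Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation

Hoang M. Truong, Hai Nguyen-Truong, Dang Huynh

Affiliations * Fulbright University Vietnam

AI summary This work addresses semantic alignment between image-level vision-language models and pixel-level prediction in open-vocabulary semantic segmentation and proposes HyRo, a fine-tuning framework in hyperbolic space. HyRo decouples hierarchical structure from semantic alignment in the Poincaré ball model: it aligns hierarchy by adjusting the hyperbolic radius and performs angular alignment with an orthogonal transformation to refine semantic relationships among embeddings at the same level. Experiments show that HyRo achieves state-of-the-art performance on multiple benchmark datasets.

Comments Accepted to the PVUW Workshop at CVPR 2026. Project page: https://tmhoanggg.github.io/HyRo/

Details
English abstract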
Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance over prior methods.
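
The claim that an orthogonal transformation preserves the hyperbolic radius follows from a standard fact: in the Poincaré ball with curvature -1, the distance from the origin to a point x is 2·artanh(‖x‖), which depends only on the Euclidean norm, and orthogonal maps leave that norm unchanged. The sketch below checks this numerically, assuming curvature -1 and using a plain 2D rotation as a stand-in for the learned orthogonal transformation; the helper names and the sample point are hypothetical.

```python
import math


def hyperbolic_radius(x):
    """Distance from the origin in the Poincare ball (curvature -1 assumed)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 2.0 * math.atanh(norm)


def rotate_2d(x, theta):
    """Apply a 2D rotation, the simplest example of an orthogonal map."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])


embedding = (0.3, 0.4)                 # hypothetical embedding, norm 0.5
rotated = rotate_2d(embedding, 1.1)    # changes the angle only

print(hyperbolic_radius(embedding), hyperbolic_radius(rotated))
```

Under this reading, the radius can encode the hierarchical level while rotations adjust angular (semantic) relationships among embeddings at that level without disturbing it.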

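2605.08854 2026-05-12 cs.CV version update

Restoration-Aligned Generative Flow Models for Blind Motion Deblurring

Insoo Kim, Jinwoo Shin

Affiliations * NAVER Cloud; KAIST AI; Samsung Electronics

AI summary This paper proposes DeblurFlow, a generative flow model framework for blind motion deblurring. By replacing the noise endpoint of the flow trajectory with the blur observation, it aligns the training objective with the deblurring task and avoids the fidelity loss that conventional generative flow models suffer on restoration tasks. The work also introduces r-space, a latent space designed for residual decoding that greatly reduces computational cost, and demonstrates DeblurFlow's strong restoration fidelity and perceptual realism on multiple datasets.

Details
English abstract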
Generative flow models offer powerful priors learned from large-scale natural images, but directly adapting them to restoration tasks such as motion deblurring causes severe fidelity degradation, as their training objective is inherently misaligned with restoration. We present DeblurFlow, a framework that resolves this misalignment by reformulating the flow trajectory itself: we replace the noise endpoint with the blur observation, which makes the underlying vector field coincide with the residual error between blur and clean images. Under this formulation, the standard flow matching loss naturally takes the form of a residual loss, allowing pretrained flow models to be optimized under restoration-aligned objectives via LoRA adaptation. This formulation further enables a dual-expert sampling strategy: a fidelity expert provides a high-fidelity initialization, e.g., PSNR 33.69 dB, and DeblurFlow enhances perceptual quality with only a marginal fidelity reduction to 33.05 dB, whereas directly applying a generative model on top of a fidelity expert decreases PSNR to 27.60 dB. To make this practical, we further introduce r-space, a latent space tailored for residual decoding rather than image reconstruction, which reduces encoder-decoder cost by up to 9$\times$ over standard VAE latents. Extensive experiments on GoPro, HIDE, RealBlur, and RWBI demonstrate that DeblurFlow achieves strong restoration fidelity and perceptual realism, while remaining computationally practical.
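
The statement that the vector field coincides with the blur-clean residual can be illustrated with a straight-line flow path. The sketch below assumes a linear interpolation with t = 0 at the clean image and t = 1 at the blur observation; that time convention, the function names, and the three-pixel toy "images" are assumptions made for illustration, and the paper's actual parameterization may differ.

```python
def interpolate(clean, blur, t):
    """Point on the straight path from the clean image (t=0) to the blur (t=1)."""
    return [(1.0 - t) * c + t * b for c, b in zip(clean, blur)]


def target_velocity(clean, blur):
    """Time derivative of the straight path: the blur-minus-clean residual."""
    return [b - c for c, b in zip(clean, blur)]


def residual_flow_loss(predicted_velocity, clean, blur):
    """Mean squared error between a predicted velocity and the residual target."""
    target = target_velocity(clean, blur)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


# Hypothetical three-pixel signals standing in for images.
clean_px = [0.2, 0.8, 0.5]
blur_px = [0.4, 0.6, 0.5]

midpoint = interpolate(clean_px, blur_px, 0.5)
perfect = target_velocity(clean_px, blur_px)
print(midpoint, residual_flow_loss(perfect, clean_px, blur_px))
```

Because the target velocity does not depend on t along a straight path, regressing it is the same as regressing the residual, which is one way to read the abstract's remark that the flow matching loss takes the form of a residual loss.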

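2605.08841 2026-05-12 cs.CV version update

Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models

Junli Zha, Jiahui Wang, Xinkai Lu, Jinbo Wang

Affiliations * SF Technology Co., Ltd.

AI summary This work addresses the tendency of vision-language models (VLMs) to rely on memorized knowledge rather than actual visual perception when interpreting classic visual illusions, and proposes a training-free framework that requires no fine-tuning. Three complementary strategies (illusion-aware image preprocessing, anti-illusion prompt engineering, and multi-vote ensembling) improve the models' handling of visual illusions. The method reaches 90.48% accuracy on the official test set and 98.41% on a human-verified subset, and placed second in the challenge.

Comments Accepted at CVPR 2026 Workshop on 5th DataCV Challenge

Details
English abstract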
Vision-Language Models (VLMs) exhibit systematic bias toward visual illusions, recalling memorized facts rather than perceiving actual visual differences. This paper presents a training-free framework for the 5th DataCV Challenge Task 1 at CVPR 2026, addressing this perception-versus-memory conflict through three complementary strategies: (1) illusion-aware image preprocessing that weakens illusion-inducing context via type-specific transformations (edge extraction, color isolation, morphological processing, and reference-line overlay), (2) anti-illusion prompt engineering guiding VLMs toward qualitative visual comparison, and (3) multi-vote ensemble that further improves robustness. Our method achieves 90.48% accuracy on the official 630-image test set using Claude (claude-opus-4-6) with 5-vote majority ensemble, and 98.41% on a human-verified subset. The approach requires no finetuning, relying solely on visual manipulation and prompt design. Our solution secured 2nd place in the challenge, only 0.47% behind the 1st-place solution. Code is available at https://github.com/jasminezz/sf-illusion-aware-vlm.git.
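
The 5-vote majority ensemble described above amounts to querying the model several times and keeping the most frequent answer. A minimal sketch follows; the tie-breaking rule (first answer seen among the tied ones) and the sample answers are assumptions, since the abstract does not say how ties are resolved, and the model calls themselves are omitted.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest tied answer."""
    if not answers:
        raise ValueError("need at least one vote")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


# Hypothetical answers from five independent queries on one illusion image.
votes = ["same length", "left longer", "same length", "same length", "left longer"]
print(majority_vote(votes))
```

With an odd number of votes and a binary question a tie cannot occur, which is one practical reason to use five votes rather than four.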

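2605.08839 2026-05-12 cs.CV version update

Cross-Sample Relational Fusion: Unifying Domain Generalization and Class-Incremental Learning

Zhen-Hao Xie, Yan Wang, Hao Sun, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou

Affiliations * School of Artificial Intelligence and the State Key Laboratory of Novel Software Technology, Nanjing University

AI summary This paper proposes CORF, a unified framework that handles domain shift and catastrophic forgetting together in incremental learning. It selectively refines training samples using spatial contribution maps and adaptively reweights samples by predictive confidence to improve generalization. CORF also introduces a cascaded knowledge distillation mechanism that captures cross-sample relational dependencies for multi-grained knowledge transfer, alleviating forgetting; it can be integrated seamlessly into existing incremental learning algorithms and achieves good experimental results.

Comments Accepted by IEEE Transactions on Multimedia (TMM 2026). Code is available at https://github.com/LAMDA-CL/TMM26-CORF

Details
English abstract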
Class-Incremental Learning (CIL) requires a learning system to learn new classes while retaining previously learned knowledge. However, in real-world scenarios such as autonomous driving, a system trained on urban roads in sunny weather may later need to operate in rural or highway environments with different traffic patterns and weather conditions. This requires the model not only to overcome catastrophic forgetting, but also to effectively handle domain shifts. In this paper, we propose CrOss-sample Relational Fusion (CORF), a unified framework to address domain shift and catastrophic forgetting simultaneously. To enhance generalizability, we perform selective refinement of training samples by leveraging spatial contribution maps to highlight semantically informative regions. Furthermore, we incorporate predictive confidence to adaptively weigh samples, thereby facilitating the learning of domain-agnostic representations. To alleviate forgetting, we propose a cascaded distillation framework that captures cross-sample relational dependencies across multiple feature hierarchies, enabling multi-grained knowledge transfer from previous tasks. CORF can be seamlessly integrated into existing CIL algorithms to enhance their generalizability, achieving competitive performance across various benchmark datasets. Code is available at https://github.com/LAMDA-CL/TMM26-CORF .

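2605.08824 2026-05-12 cs.GR cs.CV version update

HairGPT: Strand-as-Language Autoregressive Modeling for Realistic 3D Hairstyle Synthesis

Haimin Luo, Min Ouyang, Lan Xu, Jingyi Yu

Affiliations * ShanghaiTech University; ShanghaiTech University and Deemos Technology Co., Ltd.; Deemos Technology Co., Ltd.

AI summary HairGPT is an autoregressive generative model that treats hair strands as language units, aiming to resolve the entanglement of structure and texture in realistic 3D hairstyle synthesis. It formulates hairstyle generation as a dual-decoupled sequence modeling problem over semantic regions and structural hierarchy, and uses a geometric tokenizer and semantic annotations to guide strand-level generation, enabling synthesis and editing of complex hairstyles. HairGPT turns hairstyle generation from conventional texture synthesis into a structured, semantically controllable authoring process and supports high-fidelity hairstyles in both realistic and stylized settings.

Comments Accepted to SIGGRAPH 2026 (Journal Track)

Details
English abstract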
Hair is a rich medium of visual and cultural expression, yet its digital modeling remains challenging due to the duality of fluidity and structure. Many existing generative approaches rely primarily on continuous diffusion fields, which entangle global topology with local texture and obscure the semantic and structural organization of hairstyles. To address this, we propose HairGPT, a strand-centric framework that treats strands as generative primitives and formulates realistic 3D hairstyle synthesis as a dual-decoupled autoregressive sequence modeling problem. Our method applies spatial decoupling across semantic scalp regions and structural decoupling along a hierarchical strand representation, progressing from global layout to fine-grained style. We further introduce a geometric tokenizer and region-aware semantic annotations to guide strand-level generation, enabling compositional editing, synthesis of rare and complex hairstyles, and adaptation to stylized domains. By aligning generative modeling with the workflow of digital grooming, HairGPT turns hair generation from opaque texture synthesis into a structured and semantically controllable authoring process, supporting robust semantic conditioning and high-fidelity results across realistic and stylized domains. Project Page: https://haiminluo.github.io/hairgpt/

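2605.08820 2026-05-12 cs.CV cs.AI cs.CR version update

FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

Xinyu Yan, Boyang Chen, Jiaming Zhang, Tiantong Wu, Hong Xi Tae, Yichen He, Tiantong Wang, Yachun Mi, Yurong Hao, Yilei Zhao, Lei Xiao, Longtao Huang, Pengjun Xie, Wei Liu, Wei Yang Bryan Lim

Affiliations * College of Computing and Data Science, Nanyang Technological University; Alibaba-NTU Global e-Sustainability CorpLab (ANGEL); Alibaba Group

AI summary As AI-generated images grow more realistic, detecting AI-generated evidence for fraudulent refund claims has become a new challenge. The authors propose FraudBench, a multimodal benchmark dedicated to detecting AI-generated fake refund evidence. It is built from real scenarios in e-commerce, food delivery, and travel services, includes images, reviews, and product metadata, separates genuine damaged from undamaged evidence via model-assisted filtering and human annotation, and synthesizes fake-damage images with state-of-the-art image generation models. Experiments show that existing models still fall well short at detecting AI-generated fake-damage evidence, revealing a clear gap between generic image detection and fraud-evidence verification in real scenarios.

Details
English abstract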
Artificial Intelligence (AI)-generated images have become increasingly realistic and readily adaptable to concrete real-world claims, creating new challenges for verifying visual evidence. A concrete emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.

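2605.08819 2026-05-12 cs.CV cs.LG version update

From pre-training to downstream performance: Does domain-specific pre-training make sense?

Felix Krones

Affiliations * Oxford Internet Institute, University of Oxford, Oxford, UK

AI summary This study examines whether domain-specific pre-training effectively improves downstream performance in medical imaging. It systematically compares convolutional neural networks and Transformers and analyzes the effect of several pre-training approaches (supervised and self-supervised) and data modalities, finding that performance improves significantly only when the pre-training data closely matches the target modality. The study stresses the importance of pre-training strategy for the reliability of deep learning models in medical imaging and offers guidance for building more accurate and dependable diagnostic tools.

Details
English abstract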
Deep learning techniques have revolutionised medical imaging, improving diagnostic accuracy and enabling both more accurate and earlier disease detection. However, the relationship between pre-training strategies and downstream performance in medical imaging models requires further exploration. Here, we systematically compare convolutional neural networks and transformers, examining various pre-training approaches, including supervised and self-supervised learning, as well as different initialisations and data modalities. Models are evaluated on natural images, chest X-rays, chest CT and retina OCT images, considering the effects of matching pre-training data with target modalities. Our findings indicate that only pre-training on data closely matching the target modality significantly improves downstream performance. While self-supervised learning can outperform supervised methods, its effectiveness varies with context. The study underscores the importance of pre-training strategies to enhance the reliability and effectiveness of deep learning models in medical imaging. By addressing these key factors, our research aims to contribute to the development of more accurate and dependable diagnostic tools, ultimately improving patient outcomes in clinical settings.

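2605.08814 2026-05-12 cs.CV version update

Zero-Shot Chinese Character Recognition via Global-Local Dual-Branch Alignment and Hierarchical Inference

Wei Cao, Hao Xu, Xiaolei Diao

Affiliations * Jilin University, China; University College London, UK

AI summary This paper studies the challenging problem of recognizing unseen Chinese characters in open-world scenarios and proposes a zero-shot Chinese character recognition method based on global-local dual-branch alignment and hierarchical inference. Within a unified cross-modal alignment framework, it jointly learns global and local representations of character images and character structure descriptions, uses a structure filtering mask to suppress noisy operators in local similarity, and applies a coarse-to-fine hierarchical inference strategy, improving both recognition performance and inference efficiency. Experiments show strong results across multiple zero-shot splits, with a particular advantage under low-resource conditions.

Comments 9 pages

Details
English abstract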
Chinese character categories are extremely large, and unseen characters frequently arise in open-world scenarios, making zero-shot Chinese character recognition an important yet challenging problem. Existing IDS-based retrieval methods usually encode a character image and its ideographic description sequence into a single global vector for matching. Although efficient, such holistic alignment often under-models local component differences. Moreover, directly introducing patch-token level fine-grained interaction suffers from both the noise of structural operators in IDS and the high cost of full-candidate retrieval. To address these issues, we propose a Global-Local Hierarchical Perception Network (GL-HPN), which jointly learns global and local representations of character images and IDS sequences within a unified cross-modal alignment framework. The global branch supports efficient coarse recall, while the local branch improves component-level discrimination through patch-token interaction. We further introduce a structure filtering mask to suppress structurally meaningful but visually non-entity IDS operators in local similarity aggregation. On top of this, we design a coarse-to-fine hierarchical inference strategy that performs global retrieval over the full candidate set and local reranking only on Top-$K$ candidates, followed by parameter-free multiplicative fusion of normalized posterior scores. Experimental results show that GL-HPN achieves competitive performance across multiple zero-shot splits, performs especially well under low-resource settings, and substantially reduces the inference cost of large-scale candidate retrieval.
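
The coarse-to-fine inference described above (global retrieval over all candidates, local reranking on the Top-K, then multiplicative fusion) can be sketched as follows. This assumes that "normalized posterior scores" means a softmax over the shortlisted candidates, which the abstract does not specify; the candidate names, score values, and the `local_score_fn` callback are hypothetical placeholders for the two branches.

```python
import math


def softmax(scores):
    """Normalize a {candidate: score} dict into probabilities."""
    peak = max(scores.values())
    exps = {k: math.exp(v - peak) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def hierarchical_predict(global_scores, local_score_fn, top_k):
    """Coarse recall with global scores, then local rerank on the shortlist."""
    # Stage 1: cheap global similarity over the full candidate set.
    shortlist = sorted(global_scores, key=global_scores.get, reverse=True)[:top_k]
    # Stage 2: expensive local similarity, computed only for the shortlist.
    p_global = softmax({c: global_scores[c] for c in shortlist})
    p_local = softmax({c: local_score_fn(c) for c in shortlist})
    # Parameter-free multiplicative fusion of the two normalized scores.
    fused = {c: p_global[c] * p_local[c] for c in shortlist}
    return max(fused, key=fused.get), shortlist


# Hypothetical similarity scores for four candidate characters.
toy_global = {"mu": 2.0, "ben": 2.2, "wei": 1.9, "ri": -1.0}
toy_local = {"mu": 0.5, "ben": 0.1, "wei": 3.0, "ri": 9.0}

prediction, shortlist = hierarchical_predict(toy_global, toy_local.get, top_k=3)
print(prediction, shortlist)
```

In the toy data the local branch overturns the global ranking within the shortlist, while the candidate excluded at the coarse stage is never scored locally; that exclusion is where the inference savings come from.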

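2605.08808 2026-05-12 cs.CV cs.AI cs.LG version update

Curvature-Aware Captioning:Leveraging Geodesic Attention for 3D Scene Understanding

Ziyao He, Yingjie Liu, ZhangYangRui, Mingsong Chen, Xuan Tang, Xian Wei

Affiliations * East China Normal University

AI summary This paper proposes a new framework, Curvature-Aware Captioning, for accurately describing sparse point cloud data in 3D scene understanding. It introduces non-Euclidean geodesic attention: self-attention computed in Oblique space and bidirectional geodesic cross-attention in Lorentz space jointly model local geometric detail and global semantic hierarchy. Theoretical analysis indicates that the method mitigates the conflict between Euclidean and hyperbolic spaces, and experiments on the ScanRefer and Nr3D datasets show superior localization accuracy and descriptive richness.

Comments CVPR2026 Highlight!

Details
English abstract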
Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel Curvature-Aware Captioning framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions. Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.

2605.08805 2026-05-12 cs.CV Updated version

LightAVSeg: Lightweight Audio-Visual Segmentation

Qing Zhong, Guodong Ding, Lingqiao Liu, Zaiwen Feng, Lin Yuanbo Wu, Angela Yao

Affiliations College of Informatics, Huazhong Agricultural University, Wuhan, China; School of Computing, National University of Singapore, Singapore; School of Computer Science, Adelaide University, Australia; School of Engineering, University of Warwick, Coventry, UK; Zhejiang Yuexiu University, Shaoxing, China

AI summary LightAVSeg is a lightweight audio-visual segmentation framework that targets the high computational complexity and deployment cost of existing models. It replaces conventional dense cross-modal attention with a decoupled design, so interaction cost grows linearly with spatial resolution, and adds an auxiliary alignment loss to improve semantic consistency. Experiments show that, with only about 1/7 the parameters of AVSegFormer, LightAVSeg reaches 50.4 mIoU on the MS3 dataset and supports efficient mobile inference.

Comments 15 pages, 8 figures, 6 tables, Accepted to ICML 2026

Details
Abstract

Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.

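2605.08800 2026-05-12 cs.CV cs.AI Updated version

PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

Jiahui Guang, Zexun Zhan, Zhenlin Xu, Cuiyun Gao, Haiyan Wang, Jing Li, Zhaoquan Gu, Yanchun Zhang

Affiliations Harbin Institute of Technology; Pengcheng Laboratory; The Hong Kong Polytechnic University; Sichuan University; Zhejiang Normal University

AI summary This paper proposes PPU-Bench, a real-world benchmark for personalized partial unlearning in vision-language models that addresses existing benchmarks' reliance on synthetic data or complete deletion. The benchmark contains 24,000 samples across three progressively harder settings and evaluates whether a model can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. The study also proposes Boundary-Aware Optimization (BAO), which strengthens control over intra-subject factual boundaries.

Details
Abstract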

Multimodal Large Language Models (MLLMs) may memorize sensitive cross-modal information during pretraining. However, existing MLLM unlearning benchmarks rely on synthetic knowledge injection or complete subject-level deletion, which fail to capture realistic, personalized deletion requests that require fine-grained factual control. In this paper, we introduce PPU-Bench, a real-world and fine-tuning-free benchmark for personalized partial unlearning in MLLMs. PPU-Bench contains 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures under three progressively challenging settings: Complete, Selective, and Personalized unlearning. The benchmark evaluates whether methods can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. Extensive experiments show that Complete Unlearning often suppresses visual identity rather than factual knowledge, while Selective and Personalized Unlearning expose significant forget-retain trade-offs and challenges in intra-subject factual boundaries. Robustness analysis under cross-image and prompt-based attacks reveals distinct vulnerabilities across different unlearning settings. Motivated by these findings, we propose Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries. Experimental results on two representative methods demonstrate that BAO can effectively enforce intra-subject factual boundaries.

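2605.08787 2026-05-12 cs.CV Updated version

Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models

Mashrafi Monon, Umaima Rahman, Asif Hanif, Numan Saeed, Mohammad Yaqub

Affiliations Mohamed Bin Zayed University of Artificial Intelligence; New York University Abu Dhabi

AI summary This paper proposes CT-SpatialVQA, a new benchmark for evaluating the semantic-spatial understanding of 3D medical vision-language models. Built from 1601 radiology reports and CT volumes, it contains 9077 clinically grounded question-answer pairs that require anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. Experiments show that existing models perform poorly on these tasks, averaging only 34% accuracy, which highlights the need for deeper integration of volumetric medical evidence for trustworthy clinical use.

Details
Abstract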

Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs, finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random, highlighting the need for deeper integration of volumetric evidence for trustworthy clinical use.

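2605.08784 2026-05-12 cs.CV Updated version

SimplePoster: A Simple Baseline for Product Poster Generation

Benlei Cui, Fangao Zeng, Weitao Jiang, Yuwen Zhai, Haiwen Hong, Longtao Huang, Hui Xue, Wenxiang Shang, Pipei Huang

Affiliations Alibaba Group

AI summary This paper proposes SimplePoster, a simple yet effective framework for product poster generation that tackles preserving product appearance while precisely controlling dense, multi-line text layouts. Unlike prior methods that rely on complex modules such as ControlNet and OCR encoders, SimplePoster uses full-parameter fine-tuning and character-level position encoding to achieve high-fidelity subject preservation and accurate text rendering without external controllers. Experiments show that SimplePoster outperforms existing methods in both subject preservation rate and text rendering accuracy.

Comments CVPR 2026

Details
Abstract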

Product poster generation poses distinct challenges beyond general poster design, requiring both faithful preservation of product appearance and precise control over dense, multi-line text layouts. Prior methods typically adopt inpainting frameworks augmented with auxiliary modules such as ControlNet and OCR encoders. However, these approaches introduce architectural complexity and computational overhead while still suffering from text errors and subject extension artifacts. We present SimplePoster, a simple yet effective inpainting-based framework that achieves faithful subject preservation and accurate, position-controllable text rendering without external controllers. Our approach builds on two observations: (1) full-parameter fine-tuning of the base model effectively suppresses subject extension, outperforming ControlNet-based alternatives; and (2) a zero-cost character-level position encoding enables geometry-aware text generation without dedicated layout modules. Experiments show that SimplePoster achieves a $98.7\%$ subject preservation rate, compared to $55.2\%$ for SeedEdit 3.0 and $85.3\%$ for PosterMaker, while also improving text rendering accuracy. Code, models, benchmark and a part of training data will be available at https://github.com/Alibaba-YuFeng/SIMPLEPOSTER

2605.08781 2026-05-12 cs.CV Updated version

Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours

Jin Liu, Wang Wang, Hongxu Pu, Zhen Cao, Yasong Wang, Hu Wang, Kunming Luo

Affiliations State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University; Sustainability X-Lab, The University of Hong Kong; Department of Cyber Security, Southeast University; School of Computer Science and Engineering, University of Electronic Science and Technology of China; Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology

AI summary This paper studies how to represent bridge defect detections as compact, recoverable contour vectors instead of coarse bounding boxes or storage-heavy raster masks. It proposes Frequency-Supervised Fourier Series Detection (FS-FSD), which directly regresses Fourier contour descriptors and evaluates boxes, masks, and contours under a unified polygon-space protocol. Experiments on a large set of UAV-collected bridge images show higher polygon-space detection accuracy and better matched true-positive geometric quality, giving engineering review and downstream information workflows a more efficient and precise representation of defect boundaries.

Comments 46 pages, 13 figures

Details
Abstract

AI-assisted bridge defect inspection often produces bounding boxes with crude geometry or raster masks that are costly to store, transmit, and reuse. This study investigates how detected defects can be represented as compact, recoverable contour-level vector records in image space. We propose Frequency-Supervised Fourier Series Detection (FS-FSD), which directly regresses Fourier contour descriptors and evaluates boxes, masks, and contours under a unified polygon-space protocol. On 3,767 UAV-collected bridge images with 42,346 defect instances, FS-FSD achieves higher polygon-space accuracy and better matched-TP geometric quality than representative detection, segmentation, and contour baselines. These results show that, compared with bounding boxes and raster masks, Fourier contour records preserve defect-boundary geometry in a more compact, recoverable, and shareable form for engineering review and downstream information workflows. Future work will study the modeling of multi-region, fragmented, and adjacent bridge-defect boundaries and extend the framework toward long-term bridge-defect tracking and lifecycle-oriented management.
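To make the idea of a compact, recoverable contour record concrete, the sketch below encodes a closed contour as a few low-order Fourier coefficients and reconstructs it at an arbitrary resolution. It is an illustration only: FS-FSD regresses such descriptors with a detector, whereas this sketch computes them directly from sampled contour points, and the function names and the complex-number parameterization are assumptions made here.

```python
import cmath
import math


def fourier_descriptors(points, order):
    """Return coefficients c_k, k in [-order, order], of a closed contour.

    points: list of (x, y) samples taken in order along the contour.
    """
    n = len(points)
    z = [complex(x, y) for x, y in points]
    coeffs = {}
    for k in range(-order, order + 1):
        # Discrete Fourier coefficient of the complex contour signal.
        coeffs[k] = sum(
            z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        ) / n
    return coeffs


def reconstruct(coeffs, num_points):
    """Rebuild num_points contour samples from the stored coefficients."""
    contour = []
    for t in range(num_points):
        s = sum(
            c * cmath.exp(2j * math.pi * k * t / num_points)
            for k, c in coeffs.items()
        )
        contour.append((s.real, s.imag))
    return contour
```

A record of order `m` stores `2 * m + 1` complex numbers regardless of how many pixels the defect covers, and the contour can be resampled at any density, which is the sense in which such a record is both compact and recoverable.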

2605.08764 2026-05-12 cs.LG cs.CV eess.IV Updated version

Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning

Nikhil J. Dhinagar, Vidhi Chhatbar, Chirag Jagad, Pavithra Senthilkumar, Sophia I. Thomopoulos, Mahir H. Khan, Sook-Lei Liew, the ENIGMA-Stroke Recovery Working Group, Paul M. Thompson

Affiliations Imaging Genetics Center, Mark & Mary Stevens Neuroimaging & Informatics Institute, Keck School of Medicine, University of Southern California; Neuroscience Graduate Program, Mark & Mary Stevens Neuroimaging & Informatics Institute, Chan Division of Occupational Science & Occupational Therapy, Biomedical Engineering, University of Southern California

AI summary This paper investigates the root cause of the performance drop of deep vision models under data scarcity: noise in the embedding covariance matrix caused by limited samples compresses the eigengap and limits the number of recoverable signal modes. The authors develop a spectral theory of finite-sample representation learning that quantifies the recoverable dimension $K(N)$ and use perturbation theory and concentration inequalities to derive a criterion for reliable eigenmodes. The study further shows that multimodal learning (e.g., vision-language models) suppresses noise directions and preserves the eigengap through low-rank constraints, improving data efficiency and classification performance, with notable advantages in small-sample settings such as medical imaging.

Details
Abstract

Deep vision models degrade sharply in low-data regimes, particularly in medical imaging where labeled samples are scarce. We show this arises not merely from overfitting but from a geometric failure: finite-sample noise corrupts the embedding covariance, collapsing the eigengap and limiting the number of recoverable signal-bearing modes. We develop a spectral theory of finite-sample representation learning that quantifies the recoverable dimension K(N), the number of eigenmodes that can be stably estimated from N samples. Using perturbation theory and concentration bounds, we show that only modes with eigenvalues above the noise floor $\|\hat{\Sigma} - \Sigma\|_{\mathrm{op}} \sim \sqrt{D/N}$ are reliable, yielding a truncated Mahalanobis energy that governs classification performance. Under a power-law spectral model, this energy can be approximated by a truncated Riemann zeta function, linking eigenvalue decay to data efficiency and AUC. Within this framework, multimodal learning acts as spectral stabilization: vision-language models impose low-rank constraints that suppress noise-dominated directions and preserve the eigengap, increasing K(N) under data scarcity. Across MNIST and multi-disease neuroimaging, we show that multimodal training maintains more stable modes and improves class separation, even when unimodal models achieve comparable few-shot accuracy. These results identify spectral collapse as a fundamental bottleneck in low-data learning. We use truncated Mahalanobis energy and K(N) to diagnose encoder quality, and introduce zeta-based spectral filtering as a principled approach to improve data efficiency.
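The recoverable dimension K(N) can be illustrated by counting the eigenvalues that sit above a noise floor proportional to $\sqrt{D/N}$. The sketch below is a simplified reading of the abstract, not the paper's estimator: the proportionality constant, the function names, and the synthetic power-law spectrum are assumptions introduced for illustration.

```python
import math


def recoverable_dimension(eigenvalues, num_samples, constant=1.0):
    """Count eigenmodes whose eigenvalue exceeds constant * sqrt(D / N).

    eigenvalues: covariance eigenvalues; D is taken to be their count.
    num_samples: N, the number of training samples.
    constant: hypothetical scale factor of the noise floor.
    """
    dim = len(eigenvalues)
    noise_floor = constant * math.sqrt(dim / num_samples)
    return sum(1 for lam in eigenvalues if lam > noise_floor)


def power_law_spectrum(dim, alpha):
    """Synthetic spectrum with eigenvalues k ** (-alpha), k = 1..dim."""
    return [k ** (-alpha) for k in range(1, dim + 1)]


if __name__ == "__main__":
    spectrum = power_law_spectrum(100, 2)
    for n in (400, 10_000, 40_000):
        print(n, recoverable_dimension(spectrum, n))
```

With this synthetic spectrum the count grows slowly as N increases, which matches the abstract's point that few modes are stably estimable when samples are scarce.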

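2605.08753 2026-05-12 cs.CV stat.ML Updated version

Simultaneous Monitoring of Shape and Surface Color via 4D Point Clouds: A Registration-free Approach

Mariafrancesca Patalano, Giovanna Capizzi, Kamran Paynabar

Affiliations Department of Statistical Sciences, University of Padua; School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary This paper proposes SMAC, a registration-free framework based on 4D point clouds for simultaneously monitoring changes in an object's shape and surface color. The method uses spectral properties of the Laplace-Beltrami operator to capture the relationship between shape and color, and a combined monitoring scheme detects shape deformations and color anomalies. It also introduces a spatially aware post-signal diagnostic procedure to locate the source of an anomaly. The method is computationally efficient and needs neither registration nor mesh reconstruction, and experiments show strong performance in detecting subtle defects.

Comments 38 pages, 11 figures

Details
Abstract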

Advanced manufacturing technologies allow for the production of intricate parts featuring high shape complexity and spatially-varying material composition. Data fusion of point clouds with chromatic attributes provides 4D point clouds, a compact and informative representation that encodes both shape and material information. In this paper, we present a registration-free framework for Simultaneous Monitoring of shApe and Color (SMAC) via 4D point clouds. The proposed framework leverages Laplace-Beltrami operator spectral properties to capture and monitor geometric features and the relationship between shape and surface color. A combined monitoring scheme is proposed to effectively detect shape deformations and color anomalies, along with a spatially-aware post-signal diagnostic procedure to determine the source of change and localize color anomalies. Importantly, neither component relies on registration or mesh reconstruction, eliminating error-prone and computationally expensive preprocessing steps. A Monte Carlo simulation study and a case study on functionally graded materials demonstrate that SMAC achieves effective detection performance, particularly for subtle defects, while providing diagnostic capabilities to identify the source and location of anomalies.

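2605.08739 2026-05-12 cs.CV Updated version

ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

Luchao Wang, Kaimin Liao, Qian Ren, Hua Wang, Zhi Chen, Yaohua Tang

AI summary This paper proposes ReorgGS, a method that addresses parameterization degeneration in converged 3D Gaussian Splatting (3DGS) models. It treats the existing Gaussian set as an empirical probability field, resamples centers, and estimates anisotropic covariances, rebuilding a better distribution structure and improving gradient accessibility for subsequent optimization. Unlike a simple opacity reset, ReorgGS reconstructs the distribution and visibility structure of the Gaussians; it preserves scene representation capacity while reducing redundant overlap, improving optimization quality and rendering efficiency.

Details
Abstract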

A converged 3D Gaussian Splatting (3DGS) model may approximate the target scene while remaining poorly parameterized for further optimization. We identify this failure mode as parameterization degeneration: high-opacity floaters attenuate gradients to true surfaces through alpha compositing, and redundant overlapping clusters create strongly coupled parameter blocks with nearly collinear Jacobian responses. These effects explain why continued optimization can plateau even when the model still contains removable artifacts. We propose ReorgGS, an equivalent distribution reorganization method for converged 3DGS models. ReorgGS treats the existing Gaussian set as an empirical probability field, resamples centers from it, estimates local anisotropic covariances with kNN, initializes low opacity, and continues optimization with the original 3DGS renderer and loss. Unlike opacity reset, which only rescales opacity on the old overlap graph, ReorgGS rebuilds centers, covariances, and visibility structure, thereby changing the graph itself. Our analysis shows that distributional equivalence is not optimization equivalence. The reorganized model preserves scene support while improving gradient accessibility under alpha compositing and reducing opacity-weighted overlap, thereby weakening local parameter coupling during subsequent optimization. Under the same additional optimization budget, ReorgGS improves fitting quality at a fixed Gaussian count, suppresses persistent floaters, and reduces rendering overhead from redundant overlap.
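The reorganization steps listed in the abstract (resample centers, estimate local covariances with kNN, initialize low opacity) can be sketched as below. This is a rough illustration under loud assumptions: the old Gaussians are simplified to isotropic components with a per-component `sigma`, opacities are used as mixture weights, and all function names and default values are hypothetical. The paper's actual sampling and covariance estimation may differ.

```python
import random


def resample_centers(means, weights, sigmas, count, seed=0):
    """Draw new centers from a mixture of isotropic Gaussians (assumption)."""
    rng = random.Random(seed)
    picks = rng.choices(range(len(means)), weights=weights, k=count)
    return [tuple(rng.gauss(m, sigmas[i]) for m in means[i]) for i in picks]


def knn_covariance(points, index, k):
    """3x3 covariance of the k nearest neighbours of points[index]."""
    p = points[index]
    others = [q for j, q in enumerate(points) if j != index]
    others.sort(key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))
    near = others[:k]
    mean = [sum(q[d] for q in near) / len(near) for d in range(3)]
    return [
        [
            sum((q[a] - mean[a]) * (q[b] - mean[b]) for q in near) / len(near)
            for b in range(3)
        ]
        for a in range(3)
    ]


def reorganize(means, opacities, sigmas, count, k=4, init_opacity=0.1, seed=0):
    """Build a new Gaussian set: resampled centers, kNN covariances, low opacity."""
    centers = resample_centers(means, opacities, sigmas, count, seed)
    return [
        {"center": c, "cov": knn_covariance(centers, i, k), "opacity": init_opacity}
        for i, c in enumerate(centers)
    ]
```

The new set would then be handed back to the unchanged renderer and loss for further optimization, as the abstract describes; that step is outside the scope of this sketch.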

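2605.08735 2026-05-12 cs.CV Updated version

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

Joowon Kim, Seungho Shin, Joonhyung Park, Eunho Yang

Affiliations KAIST; Kyung Hee University; AITRICS

AI summary This paper proposes CollabVR, a collaborative video reasoning framework that addresses long-horizon drift and mid-clip simulation errors of video generation models (VGMs) on multi-step tasks. The method couples a vision-language model (VLM) with the VGM at the step level: at each step the VLM plans an action, then inspects and corrects the clip generated by the VGM, improving reasoning accuracy and robustness. Experiments show that CollabVR clearly outperforms existing methods on multiple benchmarks, especially on complex tasks, and yields further gains when combined with a reasoning-fine-tuned VGM.

Details
Abstract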

Recent "Thinking with Video" approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound. Both stem from the absence of explicit reasoning built upon the VGM's short-horizon visual prior, a role naturally filled by Vision-Language Models (VLMs), but where to place the VLM is non-trivial: upfront plans commit before any frame is generated and post-hoc critiques over whole videos intervene too late. We propose VLM-VGM Collaborative Video Reasoning (CollabVR), a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and folds the verifier's diagnosis directly into the next action prompt to repair detected failures. On Gen-ViRe and VBVR-Bench, CollabVR improves both open-source and closed-source VGMs over single-inference, Pass@$k$, and prior test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to and stackable with reasoning-oriented fine-tuning. We provide video samples and additional qualitative results at our project page: https://joow0n-kim.github.io/collabvr-project-page.

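2605.08729 2026-05-12 cs.CV cs.GR cs.MM cs.SD Updated version

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

Shihao Cheng, Jiaxu Zhang, Quanyue Song, Shansong Liu, Zhizhi Guo, Xiaolei Zhang, Chi Zhang, Xuelong Li, Zhigang Tu

Affiliations DeepMind OpenAI

AI summary Unison is a unified framework that addresses the alignment difficulty caused by the asynchronous characteristics of motion, speech, and sound in human-centric video generation. It uses a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components, with bidirectional audio cross-attention and semantic-conditioned gating to improve acoustic clarity and reduce speech dominance. Unison also proposes a bidirectional cross-modal forcing strategy that synchronizes motion and audio through decoupled denoising schedules, substantially improving audio perceptual quality and cross-modal synchronization in generated videos.

Details
Abstract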

Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.

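2605.08727 2026-05-12 cs.CV cs.AI cs.LG Updated version

Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression

Jiaming Liang, Chi-Man Pun, Weisi Lin, Greta Seng Peng Mok

Affiliations University of Macau; Nanyang Technological University

AI summary This paper studies high-resolution global semantic manipulation (GSM) in learned image compression systems and notes that existing methods have limited effect at high resolution. Through theoretical and empirical analysis, the authors show that a high-resolution GSM attack passes through three stages, Lazying, Oscillating, and Refining, and propose a Periodic Geometric Decay step-size schedule that enables $\ell_{\infty}$-bounded high-resolution GSM. Building on this, they modify PGD to obtain PGD$^{2}$-GSM, which is the first to achieve stable and efficient high-resolution GSM on the Kodak dataset, revealing a new security threat to learned image compression systems.

Details
Abstract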

Learned image compression (LIC) integrates deep neural networks (DNNs) to map high-dimensional images into compact latent representations, reducing redundancy and achieving superior rate-distortion (RD) performance in benign settings. Unfortunately, due to inherent vulnerabilities in DNNs, LIC systems are susceptible to adversarial perturbations that lead to downstream deterioration, compression rate degradation, untargeted distortion, and both local semantic manipulation (LSM) and low-resolution ($3\times28\times28$) global semantic manipulation (GSM). However, high-resolution GSM remains unexplored due to its intractability. Notably, the existing projected gradient descent (PGD) method achieves near-perfect white-box attacks for classification, segmentation, and other tasks, yet fails to generalize to high-resolution GSM. Our theoretical and empirical analyses reveal that well-performing GSM drives adversarial examples from the Identity Region to the Amplification Region through the Lazying-Oscillating-Refining stages. General $\ell_{\infty}$-bounded attacks fail on high-resolution GSM because their step-size schedules cannot accommodate both the Oscillating and Refining stages. Based on this, we propose the Periodic Geometric Decay schedule that enables $\ell_{\infty}$-bounded high-resolution GSM. To verify our approach, we integrate it with PGD, yielding a minimal variant, PGD$^{2}$-GSM. Extensive experiments on the Kodak dataset ($3\times768\times512$) demonstrate that our PGD$^{2}$-GSM is the first to stably achieve high-resolution GSM, thereby exposing a novel threat to LIC systems. Code is available at https://github.com/chinaliangjiaming/PGD2-GSM.
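The abstract names a Periodic Geometric Decay step-size schedule but does not give its formula. The sketch below shows one plausible reading of the name, in which the step size shrinks geometrically inside each period and then resets, combined with a standard $\ell_{\infty}$-projected sign-gradient step. The schedule's form, its parameters (`base_step`, `decay`, `period`), and the function names are assumptions and may not match the released PGD$^{2}$-GSM code.

```python
def periodic_geometric_decay(step, base_step, decay, period):
    """Assumed schedule: geometric decay within a period, reset afterwards."""
    return base_step * decay ** (step % period)


def sign(value):
    """Return -1.0, 0.0 or 1.0 according to the sign of value."""
    return (value > 0) - (value < 0) + 0.0


def pgd_step(x, grad, x_clean, epsilon, step_size):
    """One sign-gradient step followed by projection onto the l_inf ball.

    x, grad, x_clean: flat lists of equal length.
    The result stays within epsilon of x_clean in every coordinate.
    """
    updated = []
    for value, g, clean in zip(x, grad, x_clean):
        moved = value + step_size * sign(g)
        lower, upper = clean - epsilon, clean + epsilon
        updated.append(min(max(moved, lower), upper))
    return updated
```

Under this reading, the large steps at the start of each period would suit an oscillating phase and the small steps at the end would suit refinement, which is how the abstract motivates the schedule; the real schedule should be taken from the authors' repository.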

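2605.08724 2026-05-12 cs.CV Updated version

SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment

Weiren Zhao, Yi Dong, Cheng Chen

Affiliations The University of Hong Kong, Hong Kong, China

AI summary This paper proposes SynerMedGen, a framework that unifies medical multimodal understanding and generation through task alignment, addressing the separation of understanding and generation objectives in existing models. It introduces three generation-aligned understanding tasks and a two-stage training strategy, so that generation-beneficial representations learned during understanding training support medical image synthesis. Experiments show that SynerMedGen performs well on multiple medical image generation tasks and generalizes well. The authors also release the SynerMed dataset, containing 1 million paired synthesis samples and 2 million generation-derived understanding instances, to support related research.

Comments Accepted by ICML 2026

Details
Abstract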

Unifying multimodal understanding and generation is a compelling frontier that is beginning to emerge in the medical field. However, the limited existing unified medical models typically treat understanding and generation as disjoint objectives, lacking a meaningful functional synergy. In this work, we identify and address a critical question in unified medical modeling: what form of understanding truly benefits generation. We present SynerMedGen, a unified framework built on the proposed principle of generation-aligned understanding, which synergizes understanding objectives with generation tasks via task alignment. SynerMedGen introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers generation-beneficial representations learned during understanding training to medical image synthesis. Remarkably, even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and demonstrates robust generalization to unseen datasets. When combined with generation training, SynerMedGen consistently outperforms state-of-the-art specialized medical image synthesis models as well as recent unified medical models. We also release a large-scale dataset named SynerMed consisting of 1M paired synthesis samples and 2M generation-derived understanding instances to support further research on understanding-generation synergy. Our project can be accessed at https://github.com/Mhilab/SynerMedGen.

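2605.08723 2026-05-12 cs.CV cs.MM Updated version

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Huilai Li, Xiaomeng Di, Ying Xing, Yonghao Dang, Yiming Wang, Jianqin Yin

Affiliations School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications; State Grid Corporation of China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications

AI summary This paper studies weakly supervised audio-visual video parsing (AVVP), which aims to recognize and localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Existing methods focus on multimodal fusion but neglect guiding and preserving uni-modal semantics, leading to noisy pseudo-labels and suboptimal parsing. The paper therefore proposes a new framework that enhances uni-modal representations: a similarity-based label migration approach improves the pseudo-label generator's understanding of uni-modal events, and soft constraints refine uni-modal feature modeling in parallel with multimodal fusion, improving event localization. Experiments show that the method outperforms existing state-of-the-art methods in both pseudo-label generation and AVVP.

Details
Abstract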

Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Given this challenging setting, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, accurate video parsing fundamentally relies on precise perception of uni-modal events. These multi-modal strategies overemphasize fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also apply soft constraints to refine the modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting event localization performance. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.

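2605.08712 2026-05-12 cs.CV Updated version

From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

Bohan Li, Shuojue Yang, Baorui Peng, Xianda Guo, Erli Zhang, Youqi Tao, Junfeng Duan, Daguang Xu, Qi Dou, Xin Jin, Wenjun Zeng, Hao Zhao, Yueming Jin

Affiliations SJTU; NUS; THU; EIT; WHU; Harvard; NVIDIA; CUHK

AI summary This paper studies action-conditioned surgical video generation, whose core challenge is precisely governing complex image-space evolution with low-dimensional control vectors. The authors propose a framework that lifts articulated kinematics to visual control, converting the kinematics of the robotic arm into five image-aligned control modalities, and design a hierarchically routed visual control scheme that dynamically selects the most relevant control modalities and motion scales, improving generation efficiency and control accuracy. They also construct a surgical video dataset with fine-grained annotations and experimentally verify the method's advantages in action faithfulness, visual fidelity, and cross-domain generalization.

Details
Abstract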

Action-conditioned surgical video generation is a critical yet highly challenging problem for robotic surgery. The core difficulty is that low-dimensional control vectors must precisely govern complex image-space evolution. In this work, we propose a kinematic-to-visual lifting paradigm that converts articulated kinematics into a unified set of five image-aligned control modalities. Building on this representation, we introduce a hierarchically routed visual control framework that selectively activates the most relevant control modalities and motion scales. Instead of uniformly applying all control signals, our model performs hierarchical routing to dynamically allocate conditioning capacity. We further design kinematic-prior-guided routing loss functions to ensure physically meaningful, temporally stable, and efficient expert utilization. To improve efficiency, we propose a budgeted training and inference scheme that leverages routing-induced sparsity. By selectively discarding low-significance control pathways during training and execution, our approach enables adaptive computation that is complementary to standard distillation. We additionally construct a new benchmark with curated articulated annotations, obtained through human-in-the-loop semantic labeling and differentiable pose tracking, providing realistic supervision for action-conditioned surgical video generation. Extensive experiments demonstrate that our method consistently improves action faithfulness, visual fidelity, and cross-domain generalization over diverse baselines. Moreover, our efficient variant achieves substantial reductions in latency while maintaining strong control accuracy.

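2605.08709 2026-05-12 cs.CV Updated version

UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

Hongrui Li, Yichen Shi, Hongyang Wang, Yuhao Gao, Hui Ma, Jun Feng, Zitong Yu

Affiliations Shijiazhuang Tiedao University; Shanghai Jiao Tong University; Ningbo Institute of Digital Twin; Great Bay University

AI summary This paper proposes UniShield, a knowledge-guided multimodal reasoning framework for unified face attack detection that aims to identify both physical spoofing and digital forgery attacks. The method builds a Face Attack Knowledge Graph (FAKG), generates a large number of training samples for Attack-Graph Instruction Tuning (AGIT), and introduces Graph-Consistent Reasoning Optimization (GCRO) to improve reasoning consistency. Experiments show that UniShield performs strongly under multiple detection protocols, improving detection accuracy and reasoning reliability.

Details
Abstract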

Unified face attack detection (UAD) requires recognizing physical spoofing and digital forgery within a shared decision space, yet existing discriminative or prompt-based methods largely rely on appearance correlations and provide limited evidence-grounded reasoning. We propose UniShield, a knowledge-grounded multimodal reasoning framework for unified face attack defense. UniShield constructs a Face Attack Knowledge Graph (FAKG) that links attack categories to diagnostic visual cues and attack-conditioned relations, and uses it to synthesize 52,025 FAKG-QA examples for Attack-Graph Instruction Tuning (AGIT). To improve rationale consistency, we further introduce Graph-Consistent Reasoning Optimization (GCRO), a GRPO-based objective with a KG-consistency reward that encourages generated rationales to match graph-supported cues while penalizing incompatible claims. Experiments on our multimodal UAD benchmark show that UniShield achieves strong performance across binary, coarse-grained, and fine-grained protocols, with consistently high ACC and low HTER. These results suggest that structured attack knowledge can improve both detection accuracy and reasoning reliability over discriminative baselines and general-purpose MLLMs. Our code will be released at https://anonymous.4open.science/r/Unishield-A6A3/.

2605.08703 2026-05-12 cs.AI cs.CL cs.CV cs.LG Updated version

RewardHarness: Self-Evolving Agentic Post-Training

Yuxuan Zhang, Penghui Du, Bo Li, Cong Wei, Junwen Miao, Huaisong Zhang, Songcheng Cai, Yubo Wang, Dongfu Jiang, Yuyu Zhang, Ping Nie, Wenhu Chen, Changqian Yu, Kelsey R. Allen

Affiliations University of British Columbia; Vector Institute; Kolors Team, Kuaishou Technology; Carnegie Mellon University; University of Waterloo; Etude AI; Tsinghua University; Georgia Institute of Technology

AI summary This study proposes RewardHarness, a self-evolving agentic reward framework that addresses the reliance of reward models on large-scale human annotation when evaluating instruction-guided image edits. The method iteratively evolves a library of tools and skills from a small number of examples and aligns with human preferences without additional training, greatly improving data efficiency. Experiments show that, using only 0.05% of the annotated data, RewardHarness outperforms GPT-5 on image-editing evaluation benchmarks, demonstrating its efficiency and effectiveness for reward modeling.

Comments Project page: https://rewardharness.com

Details
Abstract

Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.

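2605.08702 2026-05-12 cs.CV cs.AI Updated version

Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models

Guodong Ding, Angela Yao

Affiliations * National University of Singapore

AI summary This paper studies compositional personalization of vision-language models, in which multiple user-defined concepts must be recognized or described jointly at test time. It proposes Gate-and-Merge, a zero-shot framework that achieves compositional personalization without co-occurrence training. Each concept is learned independently as a lightweight LoRA adapter paired with a concept token; at inference the relevant updates are merged directly in weight space, and a gating mechanism suppresses irrelevant activations, improving performance in both single-concept and compositional settings.

Details
English abstract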

This paper tackles compositional personalization of vision-language models (VLMs). In this problem, multiple user-defined concepts must be recognized or described jointly at test time. We introduce Gate-and-Merge, a zero-shot framework that enables compositional personalization without the need for co-occurrence training. During personalization, each concept is learned independently as a lightweight LoRA adapter, paired with a concept token. The base model remains unchanged and concepts are kept disentangled. At inference, we enable composition by merging concept-specific LoRA updates directly in weight space. To suppress irrelevant activations and prevent interference, a gating mechanism is employed to estimate textual and visual cues and select only the modules that contribute to the prediction. We further stabilize composition by combining only the most meaningful and mutually consistent updates, helping preserve each concept's identity. Our quantitative and qualitative analyses show consistent gains in performance across multiple personalization tasks in both single-concept and compositional settings.
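
The weight-space merge described in this abstract can be illustrated with a minimal sketch. It assumes each concept contributes a low-rank update `B @ A` and that a scalar gate per concept is already available; the function names and the gate values are hypothetical, and the paper's actual gating and consistency rules are not reproduced here.

```python
# Sketch: merge independently trained LoRA updates in weight space.
# Gate values are hypothetical inputs, not the paper's gating mechanism.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(base, adapters, gates):
    """Return base + sum_i gates[i] * (B_i @ A_i) without modifying base."""
    merged = [row[:] for row in base]
    for (b_mat, a_mat), gate in zip(adapters, gates):
        if gate == 0.0:  # a gated-off concept contributes nothing
            continue
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += gate * value
    return merged

base = [[1.0, 0.0], [0.0, 1.0]]
concept_a = ([[1.0], [0.0]], [[0.5, 0.5]])  # rank-1 update B_a @ A_a
concept_b = ([[0.0], [1.0]], [[2.0, 0.0]])  # rank-1 update B_b @ A_b
merged = merge_lora(base, [concept_a, concept_b], gates=[1.0, 0.0])
```

Because the base matrix is copied rather than edited in place, the frozen base model stays unchanged, which matches the abstract's description of personalization.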

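2605.08695 2026-05-12 cs.CV Updated version

EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics

Van-Loc Nguyen, AprilPyone MaungMaung, Minh-Triet Tran, Isao Echizen

Affiliations * University of Science, Vietnam National University Ho Chi Minh City; National Institute of Informatics

AI summary EditSleuth is a new dataset for image-edit forensics containing 257,725 image-edit triplets. Each example includes an edited image, the source image, an edit mask, an edit-type label, a difficulty score, and a six-step reasoning chain. The dataset is constructed deterministically, with every step of the reasoning chain grounded in computable visual evidence, and is intended to support visually grounded edit localization and semantic identification. Experiments show that the dataset can guide models to learn edit reasoning and to produce explanatory forensic descriptions.

Details
English abstract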

Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.

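2605.08664 2026-05-12 cs.CV Updated version

IPAD-CLIP: Teaching CLIP to Detect Image Local Perceptual Artifacts

Juan Wang, Xinyu Sun, Ke Zhang, Jin Wang, Bing Li, Weiming Hu, Liang Wang

Affiliations * Minzu University of China; OPPO Co., Ltd.

AI summary Current image quality assessment methods focus mainly on global distortions such as noise and blur and neglect the detection of local perceptual artifacts such as ghosting, lens flare, and moiré. To address this, the paper proposes the Image Perceptual Artifact Detection (IPAD) task and builds a benchmark dataset of 3,520 annotated images. Building on CLIP, the authors design the IPAD-CLIP framework, which learns artifact-related semantic embeddings to strengthen recognition of subtle local artifacts. Experiments show that the method outperforms existing advanced methods in both resource efficiency and detection performance.

Comments 14 pages, 6 figures

Details
English abstract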

Current image quality assessment methods are heavily biased towards global distortions (e.g., noise, blur), neglecting local perceptual artifacts such as ghosting, lens flare, and moiré effects. Although significant progress has been made in artifact removal, the fundamental problem of automatic artifact detection remains largely unexplored. In this paper, we formalize the Image Perceptual Artifact Detection (IPAD) task to address this gap. We contribute a benchmark dataset comprising 3,520 artifact images, including 520 real-captured and 3,000 synthetic samples, each paired with pixel-level masks across three representative artifact categories. The core challenge of IPAD lies in the localized, subtle, and semantically weak nature of these artifacts, which makes them prone to missed detection. To overcome this, we introduce IPAD-CLIP, a novel framework built upon CLIP that enhances artifact discrimination in both textual and visual spaces while preserving generalization capabilities. Our key insight is that local artifacts often exhibit strong correlations with specific semantic contexts. Accordingly, we learn artifact-aware text embeddings to explicitly model the object-artifact relationships, resulting in enhanced representations that clearly differentiate between clean and artifact prompts. These text embeddings are then used as anchors to shift the visual encoder's attention from high-level semantics to subtle, low-level artifacts. Extensive experiments demonstrate that IPAD-CLIP offers a resource-efficient adaptation of CLIP for detection, significantly outperforming advanced image anomaly detection and manipulation detection methods on our benchmark. To the best of our knowledge, this is the first study addressing multi-class local perceptual artifact detection in terms of both dataset and model.

2605.08663 2026-05-12 cs.CV Updated version

CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition

Md. Shakhoyat Rahman Shujon, Sheikh Md. Galib Mahim, Md. Milon Islam, Md Rezwanul Haque, Md Rabiul Islam, Hamdi Altaheri, Fakhri Karray

Affiliations * Department of Computer Science and Engineering, Khulna University of Engineering & Technology; Department of Electrical and Computer Engineering, University of Waterloo; Department of Electrical and Computer Engineering, Texas A&M University; College of Applied Computer Science, King Saud University; Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence

AI summary This paper proposes CAST, a dual-stream architecture for isolated sign language recognition from magnitude-only 60 GHz radar returns. The method combines three physics-aware modules with pretrained vision backbones and uses channel-aware spatial transfer learning to improve the representation of radar signals. Its core components are an inversion of the log-compressed signal, a cross-antenna spatial attention mechanism, and cross-attention fusion of heterogeneous backbones. Experiments show a Top-1 accuracy of 80.5% under 5-fold cross-validation, exceeding the best single-model baseline.

Comments Accepted for the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), MSLR Workshop @ CVPR 2026 in Denver (Colorado, USA)

Details
English abstract

We propose CAST, a dual-stream architecture that uses channel-aware spatial transfer learning for isolated sign language recognition from magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware components with pretrained vision backbones, which operate under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to raw antenna channels before the convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Extensive experiments reveal that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, a 3.3-percentage-point improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations form a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: https://github.com/Shakhoyat/CAST-at-SignEval2026.
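
The first component, inverting the decibel compression before spectral analysis, can be illustrated on a synthetic trace. The sketch below assumes amplitude decibels (`20 * log10`) and a Hann window, neither of which is stated in the abstract; it uses a plain O(n^2) DFT on one trace and is not the authors' CVD pipeline.

```python
import cmath
import math

def db_to_linear(db_values):
    """Invert 20*log10 compression (assumes amplitude dB, not power dB)."""
    return [10.0 ** (v / 20.0) for v in db_values]

def windowed_dft_magnitude(samples):
    """Mean-removed, Hann-windowed DFT magnitudes (O(n^2), for illustration)."""
    n = len(samples)
    mean = sum(samples) / n
    tapered = [(s - mean) * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
               for i, s in enumerate(samples)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(tapered)))
            for k in range(n // 2 + 1)]

# Synthetic magnitude trace: an offset plus a 4-cycle oscillation, stored in dB.
n = 32
linear = [2.0 + math.cos(2.0 * math.pi * 4 * i / n) for i in range(n)]
stored_db = [20.0 * math.log10(v) for v in linear]

linear_spectrum = windowed_dft_magnitude(db_to_linear(stored_db))
db_spectrum = windowed_dft_magnitude(stored_db)  # spectrum of the log-compressed trace
peak_bin = max(range(1, len(linear_spectrum)), key=lambda k: linear_spectrum[k])
```

On this toy signal the linear-domain spectrum has its peak at the true oscillation bin, while the spectrum of the log-compressed trace also carries energy at the second harmonic (bin 8), which is the kind of artifact the abstract refers to.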

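2605.08651 2026-05-12 cs.CV cs.AI cs.LG Updated version

Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection

Lei Wang, Wenxiang Diao, Andrew Busch, Jun Zhou, Yongsheng Gao

Affiliations * School of Engineering and Built Environment, Griffith University; School of Computer Science and Engineering, University of New South Wales; School of Information and Communication Technology, Griffith University

AI summary This paper studies privacy-aware video anomaly detection and proposes a new approach that protects privacy through orthogonal subspace projection. The core components are the Orthogonal Projection Layer (OPL) and the Guided Orthogonal Projection Layer (G-OPL), which remove task-irrelevant feature variations and suppress facial attribute information while preserving non-identifying features such as motion and pose. The paper also introduces a privacy-aware evaluation framework; experiments show that the method filters sensitive information while maintaining or improving detection accuracy.

Comments Accepted as a Spotlight paper at the Forty-Third International Conference on Machine Learning (ICML 2026)

Details
English abstract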

Video anomaly detection (VAD) systems often prioritize accuracy while overlooking privacy concerns, limiting their suitability for real-world deployment. We propose the Orthogonal Projection Layer (OPL), a lightweight module that removes task-irrelevant variations to produce representations focused on anomaly-relevant cues. To address privacy risks in human-centered scenarios, we introduce Guided OPL (G-OPL), which suppresses facial attributes using weak supervision from face-presence signals while preserving non-identifying features such as pose and motion. A cosine alignment objective enforces consistent capture and removal of facial information without identity labels or adversarial training. We further present a privacy-aware evaluation framework that jointly assesses detection performance and privacy preservation, and enables analysis of how sensitive information is filtered. Experiments show that embedding privacy constraints into model design reduces sensitive information while maintaining or improving detection accuracy, supporting projection-based architectures as a principled approach for privacy-aware VAD.
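
The basic operation behind an orthogonal projection layer, removing the component of a feature vector that lies along an unwanted direction, can be sketched as follows. The single fixed direction and the names `project_out` and `face_direction` are assumptions for illustration; how OPL and G-OPL learn the subspace is not described in the abstract and is not reproduced here.

```python
import math

def project_out(features, direction):
    """Remove the component of `features` along `direction`: x - (x . u) u."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(f * u for f, u in zip(features, unit))
    return [f - coeff * u for f, u in zip(features, unit)]

face_direction = [0.0, 3.0, 4.0]  # hypothetical "facial attribute" direction
feature = [2.0, 1.0, 1.0]
cleaned = project_out(feature, face_direction)

# After projection, nothing remains along the removed direction.
residual = sum(c * d for c, d in zip(cleaned, face_direction))
```

Components orthogonal to the removed direction (here the first coordinate) pass through unchanged, which is why such a projection can suppress one attribute while preserving others.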

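2605.08640 2026-05-12 cs.CV Updated version

FlowADMM: Plug-and-play ADMM with Flow-based Renoise-Denoise Priors

Hendrik Sommerhoff, Michael Moeller

Affiliations * Computer Vision Group, University of Siegen

AI summary This paper proposes FlowADMM, a flow-model-based plug-and-play ADMM algorithm for solving inverse problems. It formalizes the deterministic renoise-denoise operation in flow models and integrates it into the classical ADMM framework, which supports a convergence analysis. Experiments show that FlowADMM performs well on denoising, deblurring, super-resolution, and inpainting while requiring fewer data-consistency evaluations.

Details
English abstract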

Plug-and-play (PnP) methods for solving inverse problems have recently achieved strong performance by leveraging denoising priors based on powerful generative diffusion and flow models. However, existing diffusion- and flow-based PnP methods typically rely on stochastic renoise-denoise operations, which complicate the analysis of their convergence behavior. In this work, we identify and formalize the deterministic renoise-denoise operator underlying flow-based plug-and-play methods. This perspective reveals that these methods implicitly define a deterministic operator given by the expectation of a denoiser over the latent noise distribution. Building on this insight, we propose FlowADMM, a PnP algorithm that integrates the renoise-denoise operator into the classical alternating direction method of multipliers (ADMM) framework. We establish convergence guarantees for FlowADMM under weak Lipschitz conditions on the underlying flow network, and extend the analysis to non-stationary time schedules. Empirically, FlowADMM achieves state-of-the-art performance among flow-based PnP methods on a range of inverse problems, including denoising, deblurring, super-resolution, and inpainting, while requiring fewer data consistency evaluations than prior approaches.
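
For readers unfamiliar with plug-and-play ADMM, the generic iteration can be sketched on a toy denoising problem. This is a minimal sketch of standard PnP-ADMM in scaled form, assuming the data term 0.5 * ||x - y||^2; the circular moving average is a stand-in denoiser chosen for illustration and is not the flow-based renoise-denoise operator from the paper.

```python
def circular_smooth(v):
    """Stand-in denoiser: circular [1/4, 1/2, 1/4] moving average."""
    n = len(v)
    return [0.25 * v[i - 1] + 0.5 * v[i] + 0.25 * v[(i + 1) % n] for i in range(n)]

def pnp_admm(y, denoiser, rho=1.0, iterations=50):
    """PnP-ADMM for the denoising data term 0.5 * ||x - y||^2."""
    n = len(y)
    x, z, u = list(y), [0.0] * n, [0.0] * n
    for _ in range(iterations):
        # x-update: closed-form minimizer of 0.5||x - y||^2 + (rho/2)||x - z + u||^2
        x = [(yi + rho * (zi - ui)) / (1.0 + rho) for yi, zi, ui in zip(y, z, u)]
        # z-update: the proximal step of the prior is replaced by the denoiser
        z = denoiser([xi + ui for xi, ui in zip(x, u)])
        # dual update on the scaled multiplier
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return x

noisy = [0.0, 1.0] * 4  # alternating pattern around a mean of 0.5
restored = pnp_admm(noisy, circular_smooth)
```

With this linear, symmetric smoother and `rho=1.0`, the iteration settles on the smoothed input, so the alternating pattern is flattened to its mean. Swapping in a learned denoiser changes only the `denoiser` argument.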

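2605.08635 2026-05-12 cs.CV Updated version

Kinematics-Driven Gaussian Shape Deformation for Blurry Monocular Dynamic Scenes

Yeon-Ji Song, Kiyoung Kwon, Junoh Lee, Jin-Hwa Kim, Byoung-Tak Zhang

Affiliations * AI Institute, Seoul National University; Interdisciplinary Program in Neuroscience, Seoul National University; Gwangju Institute of Science; NAVER AI Lab

AI summary This paper studies the reconstruction of dynamic 3D scenes from blurry monocular video. To address the entanglement of motion and geometry caused by motion blur, it proposes Kinematics-GS, a kinematics-based Gaussian shape deformation framework. The method treats blur as motion-aligned deformation and introduces a kinematic prior to reparameterize Gaussian shapes, which avoids shape degeneration without auxiliary motion supervision. It also decomposes the scene into dynamic and static parts using temporal deformation variance and adopts a coarse-to-fine deformation strategy, improving the stability and detail of the reconstruction. Experiments show that it clearly outperforms existing methods on real-world scenes.

Comments 20 pages, 9 figures, 13 tables

Details
English abstract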

Reconstructing dynamic 3D scenes from blurry monocular videos is challenging as motion-induced blur entangles object motion and geometry, hindering geometric consistency. We present Kinematics-GS, a kinematics-aware framework that models blur as motion-aligned deformation and introduces a kinematic prior to reparameterize Gaussian shapes along motion trajectories, thereby mitigating degenerate shape collapse without auxiliary motion supervision. To stabilize optimization, we decompose scenes into dynamic and static components using temporal deformation variance and employ a coarse-to-fine deformation strategy to capture both global motion and fine-grained details. We also introduce a challenging real-world dataset of deformable and elastic objects exhibiting non-rigid motion with spatially non-uniform motion blur that obscures geometric cues. Extensive experiments on real-world benchmarks with realistic motion blur demonstrate that Kinematics-GS outperforms prior methods by a clear margin in monocular dynamic scene reconstruction, highlighting its effectiveness in handling complex and non-rigid motion scenarios.

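2605.08633 2026-05-12 cs.DC cs.CV Updated version

Transforming the Use of Earth Observation Data: Exascale Training of a Generative Compression Model with Historical Priors for up to 10,000x Data Reduction

Jinxiao Zhang, Runmin Dong, Xiyong Wu, Xihan Huang, Shenggan Cheng, Yunkai Yang, Zheng Zhou, Yunpu Xu, Zhaoyang Luo, Miao Yang, Fan Wei, Mengxuan Chen, Yang You, Juepeng Zheng, Weijia Li, Yutong Lu, Haohuan Fu

Affiliations * Institute of Data and Information, Tsinghua Shenzhen International Graduate School; Department of Earth System Science, Tsinghua University; Sun Yat-Sen University; National University of Singapore; National Supercomputing Center in Shenzhen

AI summary This work proposes a generative compression framework based on historical priors, which aims to turn Earth observation data compression from a conventional storage and transmission tool into a new way of using data, with data reduction of up to 10,000x. Through exascale training on the LineShine Armv9 supercomputer, the team co-optimized model design, kernels, memory hierarchy, runtime, and parallelism, sustaining 1.54 EFLOP/s and peaking at 2.16 EFLOP/s in end-to-end training. The method exploits the fact that Earth observation repeatedly measures the same planet, which makes extreme compression feasible, and shows the potential of historical-prior generative compression for data acquisition, delivery, storage, and scientific use.

Details
English abstract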

Earth observation is becoming one of the largest data-producing activities in science, yet current pipelines still treat compression as a storage and transmission tool rather than a new way to use data. We present a generative compression framework that learns from historical Earth observation archives and enables on-demand 100x to 10,000x data reduction across downstream tasks. Unlike general visual data, Earth observation repeatedly measures the same evolving planet, making historical-prior learning feasible for extreme compression. To realize this paradigm, we train large generative compression models at exascale on the LineShine Armv9 CPU supercomputer, with co-optimization across model design, kernels, memory hierarchy, runtime, and parallelism. Our implementation sustains 1.54 EFLOP/s and peaks at 2.16 EFLOP/s in end-to-end training. This work shows that historical-prior generative compression can turn Earth observation data into an active, task-adaptive foundation for acquisition, delivery, storage, and scientific use.

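2605.08627 2026-05-12 cs.CV Updated version

DRNet: All-in-One Image Restoration via Prior-Guided Dynamic Reparameterization

Ao Li, Xiaoning Liu, Sheng Li, Yapeng Du, Zhen Long, Lei Luo, Le Zhang, Ce Zhu

Affiliations * School of Information and Communication Engineering, University of Electronic Science and Technology of China; School of Communications and Information Engineering, Chongqing University of Posts and Telecommunication

AI summary This paper proposes DRNet, a new image restoration framework that handles multiple degradations with a single model. The method introduces a dynamic reparameterization mechanism combined with a Task-Specific Modulator and a Continuous Wavelet Transform Encoder, addressing high computational overhead, the difficulty of optimizing heterogeneous tasks, and inefficient encoder design. Experiments show that DRNet achieves state-of-the-art performance on five restoration tasks with good parameter efficiency and flexible use, either as a foundation model for blind restoration or as a user-guided specialist model.

Comments Accepted by IEEE TMM

Details
English abstract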

All-in-one image restoration aims to handle diverse degradations within a single model. However, existing methods often suffer from three key limitations: 1) per-input computational overhead from dynamic degradation estimation; 2) optimization challenges due to task heterogeneity; and 3) inefficient, frequency-agnostic encoder designs. To overcome these, we introduce the Dynamic Reparameterization Network (DRNet), a novel framework operating on an initialization-stage reconfiguration paradigm that fundamentally eliminates per-input overhead. At its core is a Dynamic Reparameterization MLP (DRMLP) guided by a Task-Specific Modulator (TSM), which effectively mitigates task heterogeneity by orchestrating both specific restoration goals and a versatile general-purpose mode within a unified architecture. Furthermore, we incorporate a Continuous Wavelet Transform Encoder (CWTE) that explicitly leverages frequency characteristics via wavelet decomposition for a lightweight yet powerful design. Extensive experiments demonstrate that DRNet achieves state-of-the-art performance across five restoration tasks with superior parameter efficiency. Crucially, it showcases unique flexibility, excelling as both a highly competitive foundation model for blind restoration and a top-performing user-guided specialist.

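2605.08618 2026-05-12 cs.CV cs.LG Updated version

Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification

Devesh Shah

Affiliations * Independent Researcher

AI summary This study systematically evaluates six out-of-distribution (OOD) detection methods on a plant pathology classification task, focusing on distribution shift in realistic settings. Experiments on the Plant Pathology 2021 dataset show that energy-based fine-tuning improves OOD detection while preserving in-distribution accuracy, with the gains coming from a restructuring of the embedding space and calibration of the scoring function. The study also documents training instabilities that can arise when constrained optimization methods are applied to moderate-sized datasets, providing a useful reference for practical use.

Details
English abstract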

Out-of-distribution (OOD) detection is essential for reliable deployment of deep learning systems, yet the majority of existing methods are evaluated on small, visually homogeneous benchmarks. In this work, we study six OOD detection methods spanning post-hoc scoring, auxiliary objectives, energy-based models, and constrained optimization on the Plant Pathology 2021 dataset, a fine-grained task with natural distribution shifts. Energy-based fine-tuning performs best across OOD settings, improving detection over the softmax baseline while preserving in-distribution accuracy. Analysis shows these gains stem from both a restructuring of the embedding space and a calibration of the scoring function. We further document practical training instabilities that arise when scaling constrained optimization methods to moderate-sized datasets, findings that are largely absent from existing literature. Our results demonstrate that principled OOD detection is achievable on real-world domain-specific data and that benchmark evaluations alone may not capture the challenges that emerge in practice.
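
As background for the energy-based methods mentioned in this abstract, the commonly used logit-based energy score can be sketched as below. The abstract does not state which energy formulation the study uses, so the `-T * logsumexp(logits / T)` form and the `temperature` parameter are assumptions; the fine-tuning objective itself is not shown.

```python
import math

def energy_score(logits, temperature=1.0):
    """E(x) = -T * logsumexp(logits / T); lower energy suggests in-distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    lse = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * lse

confident = energy_score([10.0, 0.0, 0.0])  # one dominant class
uncertain = energy_score([1.0, 1.0, 1.0])   # flat logits
```

A detector would then threshold this score, flagging inputs whose energy is above a value chosen on held-out data.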

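2605.08606 2026-05-12 cs.CV Updated version

Egocentric Whole-Body Human Mesh Recovery with Prior-Guided Learning

Soyeon Na, Seung Young Noh, Ju Yong Chang

AI summary This paper proposes a prior-guided learning framework that improves the accuracy of egocentric whole-body human mesh recovery by constructing more accurate optimization-based pseudo-labels and leveraging multiple priors; experiments show that it outperforms existing methods.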

Comments Accepted to ICIP 2026. This is the author-formatted version of the paper

Details
English abstract

Egocentric human mesh recovery (HMR) from monocular head-mounted cameras is increasingly important for AR/VR applications, but remains challenging due to the lack of reliable ground-truth (GT) annotations based on parametric human body models such as SMPL and SMPL-X for real egocentric images. Existing egocentric HMR methods typically rely on pseudo-GT and focus on body pose estimation, which limits their ability to recover fine-grained whole-body details such as hands and face. We study egocentric whole-body human mesh recovery and propose a prior-guided learning framework that reconstructs whole-body meshes from a single egocentric image. We construct more accurate optimization-based pseudo-GT aligned with 3D joint supervision, and leverage multiple priors by adapting an exocentric HMR foundation model together with a diffusion-based pose prior. A deterministic undistortion module is further adopted to handle fisheye distortions in egocentric images. Experiments across multiple egocentric benchmarks demonstrate improved whole-body reconstruction compared to state-of-the-art methods, and show that our optimization-based pseudo-GT is substantially more accurate than existing regression-based pseudo-GT. To facilitate reproducibility, the code and dataset annotations are publicly available at https://github.com/naso06/EgoSMPLX.

2605.08592 2026-05-12 cs.CV Updated version

Cross-Modal RGB-D Fusion Transformer for 6D Pose Estimation of Non-Cooperative Spacecraft with Stereo-Derived Depth

Yongliang Zhen, Bo LÜ, Hang Yang, Xiaotian WU

Affiliations * School of Physics, Northeast Normal University; Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Science

AI summary This paper proposes a 6D pose estimation method based on passive stereo vision. The TSCA-Stereo network handles the weak textures and harsh lighting of space imagery, and a cross-modal Transformer fuses RGB and stereo-derived depth information to achieve accurate pose estimation.

Details
English abstract

On-orbit servicing and active debris removal involving non-cooperative spacecraft require reliable pose estimation to supply accurate position and orientation data for autonomous visual navigation. Learning-based monocular methods have seen widespread adoption in spacecraft pose estimation, yet they suffer from an intrinsic depth ambiguity problem and tend to fail under the harsh illumination conditions routinely encountered in orbit. Active depth sensors could in principle address the geometric ambiguity, but their power and mass requirements make them poorly suited to most spacecraft platforms. This work addresses these issues through a passive stereo vision framework for six-degree-of-freedom (6-DOF) pose estimation of non-cooperative spacecraft. A binocular stereo matching network called TSCA-Stereo is developed to cope with weak-texture surfaces, specular highlights, and severe lighting variations typical of space imagery. A cross-modal fusion Transformer is introduced to combine RGB appearance information with stereo depth features in an adaptive manner, supporting reliable pose recovery. A synthetic binocular multimodal dataset is also built for the experiments, covering stereo disparity maps and 6-DOF pose annotations across a range of illumination scenarios, attitude configurations, and noise levels. Experimental results show that TSCA-Stereo outperforms the baseline across every evaluated metric on this space-specific dataset. The full pose estimation pipeline achieves a mean translation error of 0.0419 m and a mean orientation error of 0.8632° under varied imaging conditions, confirming that the passive stereo approach is both effective and resilient when operating under the demanding visual conditions of the space environment.

2605.08589 2026-05-12 cs.CV Updated version

S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain

Baoquan Zhang, Zhehao Yu, Lisai Zhang, Kenghong Lin, Tianran Chen, Yuxi Sun, Yunming Ye, Yao He

Affiliations * Harbin Institute of Technology, Shenzhen; ShenZhen SiFar Co., Ltd.; Bilibili. Inc; Shenzhen University

AI summary This paper proposes S2FT, which fine-tunes a small number of spectral coefficients in a sparse spectrum domain to address the non-sparse spectrum encountered in conventional spectral fine-tuning; experiments show strong performance with only 0.08% of the training parameters.

Comments Accepted by CVPR 2026

Details
English abstract

Parameter-Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the number of fine-tuned parameters by fine-tuning only a few spectral coefficients. Their basic assumption is that the weight change δW is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of the weight change is not sparse, but is instead distributed approximately uniformly in power. This implies that fine-tuning only a few spectral coefficients is insufficient to accurately model a weight change with a uniform spectrum. To address this issue, we propose to seek an invertible transformation that maps a latent spatial-domain matrix with a sparse spectrum to the weight change, and then to perform PEFT in this sparse spectrum domain with few spectral coefficients; we call the method S2FT. To find such a transformation, we first pre-estimate a coarse weight change as a prior. Then, motivated by the observation that a sparse spectrum often corresponds to locally smooth spatial structure, we treat the transformation as a row and column rearrangement of the pre-estimated weight change that smooths its spatial structure while keeping the structural information of neurons. Finally, we solve the rearrangement search problem with a simple nearest-neighbor search, thereby obtaining the invertible transformation. Extensive results show that S2FT achieves superior performance using only 0.08% training parameters.
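
One plausible reading of the rearrangement step is sketched below: rows of a pre-estimated weight change are reordered greedily so that each row is followed by its nearest unused neighbor, and the permutation is inverted afterwards. The greedy rule, the restriction to rows, and all function names are assumptions for illustration; the abstract does not specify the search procedure in this detail.

```python
def greedy_row_order(matrix):
    """Order rows so each next row is the nearest unused row (squared L2)."""
    remaining = list(range(1, len(matrix)))
    order = [0]
    while remaining:
        last = matrix[order[-1]]
        nearest = min(remaining,
                      key=lambda r: sum((a - b) ** 2 for a, b in zip(matrix[r], last)))
        order.append(nearest)
        remaining.remove(nearest)
    return order

def permute_rows(matrix, order):
    """Return the rows of `matrix` in the given order."""
    return [matrix[i] for i in order]

def invert_order(order):
    """Return the permutation that undoes `order`."""
    inverse = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos
    return inverse

def roughness(matrix):
    """Sum of squared differences between vertically adjacent entries."""
    return sum((a - b) ** 2
               for r0, r1 in zip(matrix, matrix[1:]) for a, b in zip(r0, r1))

delta_w = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.3], [5.2, 5.3]]  # toy pre-estimated change
order = greedy_row_order(delta_w)
smooth = permute_rows(delta_w, order)
restored = permute_rows(smooth, invert_order(order))
```

Because a permutation is exactly invertible, the reordered matrix can be tuned in a domain where it is smoother and then mapped back to the original neuron layout without loss.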

2605.08585 2026-05-12 cs.CV cs.AI Updated version

PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

Lujia Zhong, Yihao Xia, Shuo Huang, Jianwei Zhang, Yonggang Shi

Affiliations * Stevens Neuroimaging and Informatics Institute, Keck School of Medicine, University of Southern California; Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California; Alfred E. Mann Department of Biomedical Engineering, Viterbi School of Engineering, University of Southern California

AI summary PromptDx uses a Differentiable Prompt Tuning mechanism together with a pre-trained TabPFN to perform multimodal in-context diagnosis, improving data efficiency and clinical applicability.

Details
English abstract

Deep learning models in medical imaging typically operate as parametric memory, diagnosing patients by recalling fixed knowledge learned during training. This contrasts sharply with clinical practice, where physicians employ analogical reasoning to diagnose new cases by referencing similar records from past exemplars. While In-Context Learning (ICL) frameworks such as Tabular Prior-Fitted Networks (TabPFN) offer a promising diagnosis-by-reference paradigm, they are designed with tabular-specific inductive priors and rely on non-differentiable preprocessing pipelines, leading to manifold mismatch and gradient fracture when applied to heterogeneous multimodal data. To address these limitations, we propose PromptDx, a novel diagnosis-by-reference framework that leverages a pre-trained TabPFN as an ICL engine while enabling seamless integration with multimodal representations. Our core contribution is a Differentiable Prompt Tuning (DPT) mechanism that aligns a Masked Multimodal Modeling module with the pre-trained ICL engine. By training a lightweight adapter as a differentiable surrogate for the engine's non-differentiable preprocessors, we enable an end-to-end optimization of multimodal prompts within the ICL paradigm. We validate our method on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset using 3D MRI and tabular biomarkers. Experiments demonstrate that our approach outperforms traditional parametric baselines. Notably, our method achieves superior performance using only 1% context samples compared to 30% in standard ICL, demonstrating exceptional manifold condensation ability. We further validate the generalizability of our DPT framework across six tabular datasets with diverse scales. Overall, our method offers a more data-efficient and clinically aligned paradigm for Alzheimer's Disease diagnosis.

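2605.08577 2026-05-12 cs.CV cs.LG Updated version

Improving Generative Adversarial Networks with Self-Distillation

Antoni Nowinowski, Krzysztof Krawiec

Affiliations * Poznan University of Technology

AI summary This paper proposes Self-Distilled GAN (SD-GAN), which uses the exponential-moving-average generator as a teacher that guides the trained generator through a perceptual loss, improving image quality and stabilizing the optimization trajectory.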

Details
English abstract

In modern GANs, maintaining an Exponential Moving Average (EMA) of the generator's weights is standard practice, as the averaged model consistently outperforms the actively trained generator. However, the EMA generator is used for final deployment only and does not influence the training process. To address this missed opportunity, we introduce Self-Distilled GAN (SD-GAN), which employs the EMA generator as a teacher to guide the active generator (student) via a perceptual loss. We prove the local asymptotic stability of SD-GAN in the Dirac-GAN setting and show that it dampens the parasitic cycling behavior that plagues conventional GANs. Empirical evaluations across established architectures and datasets demonstrate that SD-GAN improves the final image quality on several metrics (FID and random-FID in particular), stabilizes the optimization trajectory, and provides additional learning guidance that is not trivially correlated with the conventional adversarial loss. It also proves effective for fine-tuning pretrained GAN models.
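
The two ingredients named in this abstract, an EMA copy of the generator weights and a loss pulling the student toward the teacher's output, can be sketched on plain lists of numbers. The mean squared distance below is a stand-in for the perceptual loss, and the `decay` value and function names are hypothetical; this is not the authors' training loop.

```python
def ema_update(teacher, student, decay=0.999):
    """One EMA step: decay * teacher + (1 - decay) * student, element-wise."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def distillation_loss(student_out, teacher_out):
    """Mean squared distance; a stand-in for the perceptual loss in the abstract."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)

teacher = [0.0, 0.0]   # EMA (teacher) weights
student = [1.0, -1.0]  # actively trained (student) weights, held fixed here
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)

loss = distillation_loss([1.0, 2.0], [1.0, 0.0])
```

In a real training loop the distillation term would be added to the student's adversarial loss, with no gradient flowing into the teacher.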

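2605.08574 2026-05-12 cs.CV cs.LG Updated version

Post-hoc Selective Classification for Reliable Synthetic Image Detection

Kaixiang Zheng, Jacob H. Seidman

Affiliations * University of Waterloo; Reality Defender

AI summary This paper proposes the ReSIDe framework, which improves the selective classification strategy to make synthetic image detection more reliable under covariate shift; experiments show clear gains under common shifts.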

Details
English abstract

As synthetic images become increasingly realistic, reliable synthetic image detection techniques are of pressing need to prevent their misuse. Despite satisfactory in-distribution performance, deep neural network-based synthetic image detectors (SIDs) lack reliability in deployment and often fail in the presence of common covariate shifts, resulting in poor detection accuracy. To avoid the risk caused by potential errors, we adopt a selective classification (SC) strategy by allowing SIDs to abstain from making low confidence predictions. For practicality, we focus on post-hoc methods which perform confidence estimation on a given SID without retraining. However, we show that conventional logit-based confidence score functions (CSFs) exhibit pathological behavior under covariate shifts, leading to SC performance close to or even worse than random guessing. To address this, we propose a simple yet effective SC framework for Reliable Synthetic Image Detection (ReSIDe). First, we generalize the notion of logits to an SID's intermediate layers from a centroid matching perspective, extending the use of logit-based CSFs to any layer of an SID. Then, we introduce a preference optimization algorithm that aggregates confidence scores extracted from different layers to a final confidence estimate by minimizing an upper bound of the area under the risk-coverage curve (AURC). Extensive experimental results show that ReSIDe significantly boosts the SC performance of various logit-based CSFs under common covariate shifts, achieving up to 69.55% AURC reduction.
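
For readers unfamiliar with the metric, the following is a minimal sketch of how an empirical risk-coverage curve and its AURC are commonly computed from per-sample confidence scores. It is not the ReSIDe method itself: the function names are hypothetical, and the AURC is taken as the mean selective risk over all coverage levels, which is one standard empirical estimate.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, selective risk) pairs, keeping the most confident samples first."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equally sized, non-empty inputs")
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for kept, idx in enumerate(order, start=1):
        if not correct[idx]:
            errors += 1
        # Risk is the error rate among the samples the detector did not abstain on.
        curve.append((kept / len(order), errors / kept))
    return curve


def aurc(confidences, correct):
    """Empirical area under the risk-coverage curve (lower is better)."""
    curve = risk_coverage_curve(confidences, correct)
    return sum(risk for _, risk in curve) / len(curve)
```

A confidence score that ranks wrong predictions above correct ones yields a higher AURC than one that ranks them below, which is the failure mode the abstract reports for logit-based scores under covariate shift.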

2605.08572 2026-05-12 cs.CV 版本更新

Enhancing Consistency Models for Multi-Agent Trajectory Prediction

增强多智能体轨迹预测的一致性模型

Alen Mrdovic, Qingze Liu, Danrui Li, Mathew Schwartz, Kaidong Hu, Sejong Yoon, Mubbasir Kapadia, Vladimir Pavlovic

发表机构 * Rutgers University - New Brunswick(罗格斯大学-新不伦瑞克分校) New Jersey Institute of Technology(新泽西理工学院) The College of New Jersey(新泽西学院)

AI总结 本文提出ECTraj,通过改进训练和条件生成方法,提升多智能体轨迹预测的一致性模型性能,实现更快推理和更准确预测,建立新的基准。

详情
AI中文摘要

多智能体轨迹预测的扩散模型受限于迭代去噪,导致推理延迟,阻碍其在自动驾驶等时间敏感场景中的应用。快速采样变种使用DDIM和知情初始噪声分布部分缓解此问题,但要么无法实现真正的单步生成,要么受所选噪声分布限制。一致性模型(CMs)通过将噪声直接映射到数据实现高质量单步生成,但难以从头训练。我们提出ECTraj,一种增强的CM流程,具有改进的训练和条件生成能力。我们的框架扩展了学生-教师一致性训练方案:学生生成标准输出,而教师显式融合其预测与部分真实数据以提供更强的监督。我们还利用CMs的直接去噪能力进行训练中的Top-K多步生成。结合条件生成与增强的一致性目标,实现了更快的推理和改进的预测准确性,在大规模Argoverse 2数据集上建立了具有竞争力的新基准。

英文摘要

Diffusion models for multi-agent trajectory prediction are limited by iterative denoising, which causes inference latency that hinders their use in time-critical settings like autonomous driving. Fast-sampling variants using DDIM and informed initial noise distributions partially alleviate this issue, but they either fail to achieve true single-step generation or are constrained by the chosen noise distribution. Consistency Models (CMs) offer high-quality one-step generation by mapping noise directly to data, but are difficult to train from scratch. We propose ECTraj, an enhanced CM pipeline with improved training and conditional generation for trajectory prediction. Our framework extends the student-teacher consistency training scheme: the student produces standard outputs, while the teacher explicitly fuses its predictions with parts of the ground truth to give stronger supervision. We also exploit CMs' direct denoising for top-K multi-shot generation during training. Combining conditional generation with this enhanced consistency objective yields faster inference and improved prediction accuracy, establishing competitive new benchmarks on the large-scale Argoverse 2 dataset.

2605.08566 2026-05-12 cs.CV cs.LG q-bio.QM 版本更新

MicroDiffuse3D: A Foundation Model for 3D Microscopy Imaging Restoration

MicroDiffuse3D:一种用于3D显微成像修复的基础模型

Yongkang Li, Brian Wong, King Wai Chiu, Hanwen Xu, Tangqi Fang, Erin Dunnington, Dan Fu, Sheng Wang

发表机构 * Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA(保罗·G·艾伦计算机科学与工程学院,华盛顿大学,西雅图,华盛顿州,美国) Department of Chemistry, University of Washington, Seattle, WA, USA(化学系,华盛顿大学,西雅图,华盛顿州,美国)

AI总结 本文提出MicroDiffuse3D,一种预训练的3D显微成像修复模型,从高通量采集的降质测量中恢复高质量体积结构,提升3D化学成像的分辨率和信噪比。

详情
AI中文摘要

化学成像能够对细胞、组织和活体系统进行无标记可视化,同时提供传统荧光显微镜难以获得的直接生化信息。尽管其在术中诊断、药物反应分析等应用中前景广阔,但其更广泛的应用仍受限于缓慢的数据采集,三维成像尤其如此。本文提出MicroDiffuse3D,一种用于3D显微图像修复的预训练基础模型,能够从以显著更高通量采集的降质低分辨率测量中恢复高质量的体积结构。我们在三种具有挑战性的修复设置中评估了MicroDiffuse3D,包括16倍体积稀疏下的3D超分辨率、分辨率与噪声的联合退化,以及低信噪比(SNR)条件下的3D去噪;在这些设置中,模型相对强基线均取得明显提升。在稀疏3D超分辨率设置下,MicroDiffuse3D在深度方向上的连续性更清晰、伪影更少,分割质量提高了10.58%,线剖面一致性提高了15.59%。总体而言,我们的结果表明,预训练3D修复是克服体积化学成像中通量和SNR限制的广泛适用策略,使高分辨率分析能够以此前难以实现的规模和速度进行。

英文摘要

Chemical imaging enables label-free visualization of cells, tissues and living systems while providing direct biochemical information that is difficult to obtain with conventional fluorescence microscopy. Despite its promise in applications ranging from intraoperative diagnosis to drug-response analysis, its broader use remains limited by slow data acquisition, particularly for three-dimensional imaging. Here we present MicroDiffuse3D, a pretrained foundation model for 3D microscopy image restoration that recovers high-quality volumetric structure from degraded low-resolution measurements acquired at substantially higher throughput. We evaluated MicroDiffuse3D across three challenging restoration settings, including 3D super-resolution under 16-fold volumetric sparsity, joint degradation in resolution and noise, and 3D denoising in the low signal-to-noise ratio (SNR) regime, where the model delivered clear gains over strong baselines. Under the sparse 3D super-resolution setting, MicroDiffuse3D produced clearer continuity across depth with fewer artifacts and improved segmentation quality by 10.58% and line-profile concordance by 15.59%. Together, our results establish pretrained 3D restoration as a broadly applicable strategy for overcoming the throughput and SNR limitations in volumetric chemical imaging, enabling high-resolution analysis at scales and speeds that were previously difficult to achieve.

2605.08564 2026-05-12 cs.AI cs.CV cs.LG 版本更新

Biological Plausibility and Representational Alignment of Feedback Alignment in Convolutional Networks

反馈对齐在卷积网络中的生物合理性与表征对齐

Jake Lance, Larry Kieu

发表机构 * University of Toronto(多伦多大学)

AI总结 本文比较了反馈对齐与反向传播在卷积网络中的生物合理性、可解释性和计算复杂性,发现修改后的反馈对齐能产生与反向传播相似的内部表征。

详情
AI中文摘要

反馈对齐(FA)算法为训练神经网络提供了一种比反向传播(BP)更具生物合理性的替代方案,但明显难以扩展到卷积架构。已有改进方法试图解决这一局限,但其对生物合理性的代价存疑。本文在相同的卷积架构和CIFAR-10数据集上评估了包括改进FA和标准BP在内的五种学习算法,并围绕生物合理性、可解释性和计算复杂性三个方面进行比较分析。结果表明,改进FA算法收敛到的内部表征在结构上与反向传播产生的表征相似。特别是,改进FA算法的功能性成功似乎可能源于其模仿反向传播表征几何的能力:尽管依赖根本不同的权重更新机制,仍能收敛到相似的表征。

英文摘要

The feedback alignment (FA) algorithm offers a biologically plausible alternative to backpropagation (BP) for training neural networks yet notably fails to scale to convolutional architectures. Modifications have been proposed to address this limitation, but at questionable cost to biological plausibility. In this paper, we evaluate five learning algorithms including modified FA and standard BP, applied to the same convolutional architecture with the CIFAR-10 dataset. We provide a tripartite comparative analysis focusing on biological plausibility, interpretability, and computational complexity. Our results indicate that modified FA algorithms converge on internal representations that are structurally similar to those produced by backpropagation. In particular, it appears the functional success of modified FA algorithms may be rooted in their ability to mimic the representational geometry of backpropagation, converging on similar representations despite relying on fundamentally different weight update mechanisms.
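
As background for the "fundamentally different weight update mechanisms" mentioned above, the sketch below contrasts how the error signal reaches a hidden layer under backpropagation and under basic feedback alignment. It is an illustration of the general idea only, not of the modified FA variants evaluated in the paper: the function names are hypothetical, and the activation derivative is omitted for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def hidden_error_bp(w_out, output_error):
    """Backpropagation: send the error back through the transposed forward weights."""
    return matvec(transpose(w_out), output_error)


def hidden_error_fa(feedback, output_error):
    """Feedback alignment: send the error back through a fixed feedback matrix.

    The feedback matrix is typically random and is not tied to w_out,
    which avoids the weight-symmetry requirement of backpropagation.
    """
    return matvec(feedback, output_error)
```

Here `w_out` has one row per output unit and `feedback` has one row per hidden unit, so both functions return one error value per hidden unit.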

2605.08560 2026-05-12 cs.CV cs.AI 版本更新

ZAYA1-VL-8B Technical Report

ZAYA1-VL-8B 技术报告

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

发表机构 * Zyphra

AI总结 ZAYA1-VL-8B 是基于内部语言模型 ZAYA1-8B 构建的紧凑型混合专家视觉语言模型,在多个图像理解、推理和计数基准上超越了 Qwen2.5-VL-3B、PLM-3B 和 MolmoE-1B 等模型。

Comments 20 pages, 7 figures, 3 appendices (with 31 figures)

详情
AI中文摘要

我们介绍了 ZAYA1-VL-8B,一个基于我们内部语言模型 ZAYA1-8B 构建的紧凑型混合专家视觉语言模型。尽管体积紧凑,ZAYA1-VL 的性能可与 Molmo2-4B 和 InternVL3.5-4B 等领先的基座模型相媲美,同时在一系列图像理解、推理和计数基准上超越了 Qwen2.5-VL-3B、PLM-3B 和 MolmoE-1B 等模型。该架构包含两个关键创新:(1) 集成到 LLM 中的视觉专用 LoRA 适配器,无需增加专家数量即可提高模态特定容量;(2) 在 LLM 内部对图像标记采用双向注意力,以增强视觉理解。我们详细描述了完整的训练流程,包括每个阶段的数据组成、序列打包和注意力掩码方案。该模型共有 9.2B 个参数,其中包含视觉编码器在内的活跃参数为 1.4B,并在 https://huggingface.co/Zyphra/ZAYA1-VL 上公开发布。

英文摘要

We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at https://huggingface.co/Zyphra/ZAYA1-VL.

2605.08557 2026-05-12 cs.CV cs.AI cs.LG 版本更新

MC-RFM: Geometry-Aware Few-Shot Adaptation via Mixed-Curvature Riemannian Flow Matching

MC-RFM:通过混合曲率黎曼流匹配实现的几何感知少样本适应

Salim Khazem, Ibrahim Mohamed Serouis, Zakaria Ezzahed

发表机构 * Talan Research Center(塔兰研究中心)

AI总结 本文提出MC-RFM框架,通过混合曲率黎曼流匹配实现少样本适应,结合双曲因子和欧几里得因子构建乘积流形,提升视觉识别性能。

Comments Submitted to NeurIPS (Under Review)

详情
AI中文摘要

预训练视觉模型的参数高效适应通常通过线性探针、提示、低秩更新或轻量残差模块实现。这些方法虽然有效,但通常将适应视为对冻结表示的离散欧几里得扰动,而未显式建模任务诱导的特征位移的几何结构。我们提出MC-RFM,一种混合曲率黎曼流匹配框架,用于冻结视觉主干的少样本适应。其关键思想是在乘积流形上表示适应后的特征,该流形结合了双曲因子(捕捉对层次敏感的语义结构)和欧几里得因子(保留局部判别性的视觉变化)。适应被表述为从冻结特征到支持集原型的任务条件化连续传输,采用流匹配目标训练,并与混合的原型-线性分类器耦合。该方法轻量、与主干无关,并完全在缓存的冻结特征上运行。在七个视觉识别基准、五个冻结主干和1/4/16-shot设置下,MC-RFM在多数评估设置中表现最佳,在Transformer主干和细粒度数据集上收益最大。消融研究表明,混合曲率头、任务条件化、自适应分支门控、原型收缩和判别式监督均对性能有贡献。这些结果表明,少样本适应不仅受益于决定更新哪些参数,还受益于建模表示应如何在与下游任务结构相匹配的几何中移动。

英文摘要

Parameter-efficient adaptation of pretrained vision models is commonly performed through linear probes, prompts, low-rank updates, or lightweight residual modules. While effective, these methods usually treat adaptation as a discrete Euclidean perturbation of frozen representations, without explicitly modeling the geometry of the task-induced feature displacement. We propose MC-RFM, a mixed-curvature Riemannian flow-matching framework for few-shot adaptation of frozen visual backbones. The key idea is to represent adapted features on a product manifold combining a hyperbolic factor, which captures hierarchy-sensitive semantic structure, and a Euclidean factor, which preserves locally discriminative visual variation. Adaptation is formulated as a task-conditioned continuous transport from frozen features to support-set prototypes, trained with a flow-matching objective and coupled to a hybrid prototype-linear classifier. The method is lightweight, backbone-agnostic, and operates entirely on cached frozen features. Across seven visual recognition benchmarks, five frozen backbones, and 1/4/16-shot regimes, MC-RFM is the best-performing method in a majority of evaluated settings, with the strongest gains on Transformer backbones and fine-grained datasets. Ablations show that the mixed-curvature head, task conditioning, adaptive branch gating, prototype shrinkage, and discriminative supervision each contribute to performance. These results suggest that few-shot adaptation benefits not only from deciding which parameters to update, but also from modeling how representations should move through a geometry matched to the structure of the downstream task.
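
To make the "product manifold" idea concrete, here is a minimal sketch of a distance on a space with one hyperbolic and one Euclidean factor. The choices below are assumptions, since the abstract does not specify them: the hyperbolic factor is modeled as the unit Poincaré ball with curvature -1, the two factor distances are combined as the square root of the sum of squares (the standard product metric), and all function names are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def euclidean_distance(u, v):
    """Ordinary straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def product_distance(p, q):
    """Distance on the product manifold; p and q are (hyperbolic_part, euclidean_part)."""
    d_h = poincare_distance(p[0], q[0])
    d_e = euclidean_distance(p[1], q[1])
    return math.sqrt(d_h ** 2 + d_e ** 2)
```

A prototype classifier on such a space could, for instance, assign a feature to the class whose prototype has the smallest `product_distance`; whether the paper does exactly this is not stated in the abstract.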

2605.08530 2026-05-12 cs.CV 版本更新

A Two-Stage Motion-Aware Framework for mmWave-based Human Mesh Recovery

基于毫米波的双阶段运动感知人体网格恢复框架

Hoang Hai Pham, Shuntian Zheng, Jiaqi Li, Yu Guan

发表机构 * Department of Computer Science, University of Warwick, Coventry, UK(沃里克大学计算机科学系,英国考文特里)

AI总结 本文提出双阶段框架,通过雷达反射提取模块和运动感知网格恢复网络,提升毫米波雷达下人体3D网格恢复的精度与效率。

详情
AI中文摘要

毫米波雷达因其在恶劣环境下的鲁棒性和隐私保护特性,成为人体感知的有前景的传感模态。然而,从雷达观测中恢复准确的3D人体网格仍然困难,由于严重的信号杂波和雷达测量的固有不完整性。先前的工作通常采用端到端框架,直接从原始雷达数据回归人体参数,而未解耦信号解释与几何推理或利用时间运动线索,限制了学习性能。为此,我们提出了一种针对雷达的人体重建双阶段框架。首先,我们引入了人体反射提取模块,通过粗到细定位和体素级分割,生成带有置信度加权的雷达体积编码体素级人体可能性。其次,我们设计了运动感知的网格恢复网络,通过双分支架构联合建模每帧几何和跨帧动态,重建人体。大量实验表明,所提方法在保持计算效率的同时优于现有方法。

英文摘要

Millimeter-wave (mmWave) radar has emerged as a promising sensing modality for human perception due to its robustness under challenging environmental conditions and strong privacy-preserving properties. However, recovering accurate 3D human body meshes from radar observations remains difficult due to severe signal clutter and the inherently partial nature of radar measurements. Previous works typically adopt end-to-end frameworks that directly regress human body parameters from raw radar data, without decoupling signal interpretation from geometric reasoning or exploiting temporal motion cues, limiting learning performance. To address this, we propose a two-stage framework for radar-based human body reconstruction. First, we introduce a human reflection extraction module that performs coarse-to-fine localization and voxel-wise segmentation to produce a confidence-weighted radar volume encoding voxel-level human likelihood. Second, we design a motion-aware mesh recovery network that reconstructs the human body by jointly modeling per-frame geometry and inter-frame dynamics using a dual-branch architecture. Extensive experiments demonstrate that the proposed method outperforms existing approaches while maintaining computational efficiency.

2605.08521 2026-05-12 cs.CV cs.LG 版本更新

Geometric Flood Depth Estimation: Fusing Transformer-Based Segmentation with Digital Elevation Models

几何洪水深度估计:融合基于变换器的分割与数字高程模型

Nhut Le, Ehsan Karimi, Maryam Rahnemoonfar

发表机构 * Lehigh University(理海大学)

AI总结 本文提出几何'水面高程'方法,通过融合基于变换器的分割模型与数字高程模型,从单目航空影像估计洪水深度,提升洪水范围和体积评估能力。

Comments Accepted by the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

详情
AI中文摘要

灾后态势感知在很大程度上依赖于对洪水范围和体积的了解。尽管2D语义分割能够提供准确的洪水掩膜,但它缺乏评估可通行性和结构风险所需的垂直维度。本文提出一种几何'水面高程'方法,用于从单目航空影像估计洪水深度。我们的流程利用最先进的基于变换器的分割模型Mask2Former生成精确的2D洪水掩膜,再将这些掩膜与数字高程模型(DEM)融合,以识别水陆边界、计算全局水面高程($Z_{water}$),并基于局部流体静力平衡原理计算逐像素深度。我们使用FloodNet和CRASAR-U-DROIDS数据集评估该流程,展示了如何利用高性能分割从2D影像中提取3D体积数据,而无需承受水动力模拟的延迟。

英文摘要

Post-disaster situational awareness relies heavily on understanding both the extent and the volume of floodwaters. While 2D semantic segmentation provides accurate flood masking, it lacks the vertical dimension required to assess navigability and structural risk. This paper presents a geometric "Water Surface Elevation" approach for estimating flood depth from monocular aerial imagery. Our pipeline utilizes Mask2Former, a state-of-the-art transformer-based segmentation model, to generate precise 2D flood masks. These masks are fused with Digital Elevation Models (DEMs) to identify the water-land boundary, calculate a global water surface elevation ($Z_{water}$), and compute per-pixel depth based on the principle of local hydrostatic equilibrium. We evaluate this workflow using the FloodNet and CRASAR-U-DROIDS datasets, demonstrating how high-performance segmentation can be leveraged to extract 3D volumetric data from 2D imagery without the latency of hydrodynamic simulations.
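
The geometric step described above (mask plus DEM to per-pixel depth) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's pipeline: the mask and DEM are small co-registered grids, the water-land boundary is taken to be flooded cells with at least one dry 4-neighbour, $Z_{water}$ is estimated as the median DEM elevation on that boundary, and the function names are hypothetical.

```python
from statistics import median


def boundary_cells(mask):
    """Return flooded cells that touch at least one dry 4-neighbour."""
    rows, cols = len(mask), len(mask[0])
    cells = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                    cells.append((r, c))
                    break
    return cells


def flood_depth(mask, dem):
    """Estimate a single water surface elevation and the depth of every cell."""
    edge = boundary_cells(mask)
    if not edge:
        raise ValueError("mask has no water-land boundary")
    # Assumption: the water surface is flat, so one elevation describes it.
    z_water = median(dem[r][c] for r, c in edge)
    depth = [
        [max(z_water - dem[r][c], 0.0) if mask[r][c] else 0.0
         for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]
    return z_water, depth
```

Depth is clamped at zero so that flooded cells whose terrain lies slightly above the estimated surface do not receive negative values; dry cells are always reported as zero.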

2605.08517 2026-05-12 cs.LG cs.CV physics.med-ph 版本更新

A Deep Risk Estimator for Known Operator Learning

已知算子学习的深度风险估计器

Andreas Maier, Md Hasan, Paulina Conrad, Paula Andrea Perez-Toro

发表机构 * Pattern Recognition Lab(模式识别实验室) Friedrich-Alexander-Universität Erlangen-Nürnberg(埃朗根-纽伦堡弗里德里希-亚历山大大学)

AI总结 本文提出一种深度风险估计器,用于估计包含学习和已知算子的深度网络的统计风险,通过分解总风险并分析各层贡献,验证了替换学习层为已知算子可降低界,并应用于CT重建和物理引导神经网络。

Comments In Review

详情
AI中文摘要

我们描述了一种方法,用于估计同时包含学习算子和已知算子的深度网络的统计风险。基于此前为已知算子学习建立的最大训练误差界,我们推导出一个深度风险估计器,将分层网络的期望误差与训练样本量联系起来。该估计器将总风险分解为对各学习层的求和;每个已知算子对该和的贡献为零,而每个学习层贡献一个受Barron经典工作启发的逼近项,以及一个随训练样本数增加而减小的估计项。我们证明,每当一个学习层被已知算子替换时,该界都会缩小,且相应的样本需求随被替换层的可训练参数数量而缩放。作为应用,我们以计算机断层扫描(CT)为例,比较了算子感知的滤波反投影网络与一个全连接替代网络,后者将整个重建流程压缩为单个学习得到的稠密矩阵。预测的参数比与解析分解为循环滤波器和稀疏反投影所揭示的结构稀疏性一致。我们在CPU上以小图像规模、在GPU上以中等图像规模验证了预测的缩放关系,二者遵循同一缩放规律。除CT重建外,该估计器还适用于在架构中硬编码已知物理运算的物理信息神经网络,我们预计该结果会引起从事算子感知深度学习的广泛研究群体的兴趣。在每次扫描实验中校准各层常数后,所得的界在每个训练集规模下都能以不超过2倍的因子跟踪经验测试均方误差,因此可以反过来利用该估计器预测达到目标误差所需的训练样本数。

英文摘要

We describe an approach for estimating the statistical risk of deep networks that contain a mix of learned and known operators. Building on the maximal training error bounds previously established for known operator learning, we derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample. The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum, while every learned layer adds an approximation term inspired by Barron's classic work and an estimation term that decreases with the number of training samples. We are able to show that the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced. As an application, we use computed tomography as an example and compare an operator-aware filtered backprojection network with a fully connected substitute that collapses the entire reconstruction pipeline into a single learned dense matrix. The predicted parameter ratio coincides with the structural sparsity that the analytic decomposition into a circulant filter and a sparse backprojection exposes. We confirm the predicted scaling on CPU at small image scale and on GPU at medium image scale, all on the same scaling law. Beyond CT reconstruction, the estimator applies to physics-informed neural networks that hardcode a known physical operation in its architecture, and we expect the result to be of interest for a broad community working on operator-aware deep learning. Calibrating the per-layer constants on each sweep yields a bound that tracks the empirical test MSE within a factor of two at every training-set size, so the estimator can be inverted to predict how many training samples are required to reach a target error.

2605.08493 2026-05-12 cs.CV 版本更新

CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis

CapCLIP:一种用于无线胶囊内镜分析的视觉-语言表示对齐方法

Haroon Wahab, Irfan Mehmood, Hassan Ugail

发表机构 * School of Computer Science, AI and Electronics Faculty of Engineering and Digital Technologies(计算机科学与电子工程学院,工程与数字技术学院) School of Management Faculty of Mgmt, Law & Social Sciences(管理学院,管理、法律与社会科学学院) Centre for Visual Computing and Intelligent Systems(视觉计算与智能系统中心)

AI总结 CapCLIP通过将内镜图像与临床标准术语生成的文本描述对齐,提升无线胶囊内镜分析的泛化能力和语义解释性,优于现有方法。

详情
AI中文摘要

无线胶囊内镜(WCE)可对小肠进行非侵入性的可视化评估,但其临床效用受限于每次检查产生的大量帧,以及在高度多变的成像条件下识别细微异常的困难。现有基于学习的WCE方法大多仅依赖视觉,通常局限于狭窄的病理类别集,且在不同数据集和中心之间的迁移能力有限。为解决这些局限,本研究提出CapCLIP,一种面向WCE的领域专用视觉-语言表示学习框架。CapCLIP将胶囊内镜帧与源自标准化命名法和病理感知描述模板的临床文本描述对齐,从而学习到既具语义信息又可迁移的嵌入。该框架在严格的零样本条件下,使用未见过的WCE数据集,与相关的开源视觉及视觉-语言基础模型进行比较。评估涵盖三个下游任务:K近邻分类、CLIP风格的图像-文本分类和文本到图像检索。在这些设置中,CapCLIP始终优于所比较的基线,在分布外数据集上的零样本图像-文本分类和跨模态检索中提升尤为明显。结果表明,语言引导的表示学习可以同时提升WCE分析的泛化能力和语义可解释性。这些发现表明CapCLIP是迈向胶囊内镜专用基础模型的一步,并支持采用基于语言的WCE分析。

英文摘要

Wireless capsule endoscopy (WCE) enables non-invasive visual assessment of the small bowel, but its clinical utility is constrained by the large volume of frames generated per examination and the difficulty of recognising subtle abnormalities under highly variable imaging conditions. Existing learning-based approaches for WCE are predominantly vision-only, often confined to narrow pathology sets, and show limited transfer across datasets and centres. To address these limitations, this study introduces CapCLIP, a domain-specific vision-language representation learning framework for WCE. CapCLIP aligns capsule endoscopy frames with clinically grounded textual descriptions derived from standardised nomenclature and pathology-aware caption templates, thereby learning embeddings that are both semantically informed and transferable. The proposed framework is evaluated against relevant open-source vision and vision-language foundation models under strict zero-shot conditions using unseen WCE datasets. Evaluation covers three downstream tasks: K-nearest neighbour classification, CLIP-style image-text classification, and text-to-image retrieval. Across these settings, CapCLIP consistently outperforms the compared baselines, with particularly strong gains in zero-shot image-text classification and cross-modal retrieval on out-of-distribution datasets. The results indicate that language-guided representation learning can improve both generalisation and semantic interpretability in WCE analysis. These findings position CapCLIP as a step toward foundation models tailored to capsule endoscopy and support the use of language-grounded WCE analysis.

2605.08452 2026-05-12 cs.CV 版本更新

NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics

NICE FACT: 诊断和校准VLMs在运动物理定量推理中的能力

Jian Lan, Zhicheng Liu, Xinpeng Wang, Yuhao Zhou, Haokun Chen, Jiancheng Lv, Barbara Plank, Thomas Seidl

发表机构 * University of Munich (LMU)(慕尼黑大学(LMU)) Munich Center of Machine Learning(慕尼黑机器学习中心) Sichuan University(四川大学)

AI总结 本文研究VLMs在运动物理定量推理中的表现,提出NICE和FACT双诊断框架,分析视觉真实性、物理定律理解和时间定位,揭示模型在识别视觉前提和应用物理定律方面的不足。

详情
AI中文摘要

获得精确的空间和物理洞察的能力是视觉语言模型(VLMs)的基石,但它们在物理推理等相关空间智能任务上的糟糕表现仍是一个根本障碍。学界严重缺乏科学分析来揭示VLMs究竟是忠实地推导出答案,还是只是做出看似合理的猜测。本文旨在从根本上理解VLMs如何感知物理世界并运用物理定律,同时评估模型置信度的可靠性。我们提出NICE和FACT这一双诊断范式,对运动学物理的定量推理进行显式分解:FACT诊断视觉保真度、物理定律理解和时间定位;NICE研究我们提出的邻域感知校准方法和新指标,用于评估和校准置信度的可靠性。在6个最新的最先进VLMs上的评估表明,模型无法识别视觉前提条件,也无法运用必要的物理定律得出答案。本文强调并建立了一个标准化的诊断范式,以指导开发忠实且以物理为基础的VLMs。

英文摘要

The ability to derive precise spatial and physical insights is a cornerstone of vision-language models (VLMs), yet their poor performances in related spatial intelligence tasks such as physical reasoning remain a fundamental barrier. The community critically lacks a scientific analysis revealing whether VLMs faithfully reach answers or plausibly make guesses. This work aims to provide a fundamental understanding of how VLMs perceive the physical world, and utilize physical laws, while assessing the reliability of model confidence. We propose NICE and FACT, a dual-diagnostic paradigm that explicitly decomposes quantitative reasoning for kinematic physics: FACT diagnoses visual fidelity, physical law comprehension, and temporal grounding. NICE studies our novel neighborhood-informed calibration method and novel metrics to evaluate and calibrate confidence reliability. Evaluated across 6 latest state-of-the-art VLMs, we uncover that models fail to identify visual preconditions or utilize necessary physical laws to reach answers. This work highlights and establishes a standardized diagnostic paradigm to guide the development of faithful, physically-grounded VLMs.

2605.06356 2026-05-12 cs.CV 版本更新

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

SwiftI2V: 通过条件分段生成实现高效的高分辨率图像到视频生成

YaoYang Liu, Yuechen Zhang, Wenbo Li, Yufei Zhao, Rui Liu, Long Chen

发表机构 * HKUST(香港科技大学) CUHK(香港中文大学) Joy Future Academy HKU(香港大学) HUAWEI Research(华为研究)

AI总结 本文提出SwiftI2V框架,通过分段生成和双向上下文交互,在2K分辨率下实现高效图像到视频生成,相比端到端模型减少GPU时间202倍。

Comments 27 pages, 17 figures

详情
AI中文摘要

高分辨率图像到视频(I2V)生成旨在合成逼真的时间动态同时保持输入图像的细粒度外观细节。在2K分辨率下,这变得极其具有挑战性,现有解决方案存在多种缺陷:1)端到端模型通常在内存和延迟上成本过高;2)级联低分辨率生成与通用视频超分辨率相结合,容易产生细节幻觉并偏离输入特定的局部结构,因为超分辨率阶段未明确针对输入图像进行条件处理。为此,我们提出了SwiftI2V,一个专为高分辨率I2V设计的高效框架。遵循广泛使用的两阶段设计,它通过首先生成低分辨率运动参考以减少令牌成本并减轻建模负担,然后执行强图像条件的2K合成,以恢复输入忠实的细节,同时受控开销。具体来说,为了使生成更具可扩展性,SwiftI2V引入了条件分段生成(CSG)来分段合成视频,每个步骤有受限制的令牌预算,并在每个分段内采用双向上下文交互以提高跨分段的一致性和输入保真度。在VBench-I2V上,SwiftI2V在2K分辨率下实现了与端到端基线相当的性能,同时将总GPU时间减少了202倍。特别是,它使在单个数据中心GPU(如H800)或消费级GPU(如RTX 4090)上实现实用的2K I2V生成成为可能。

英文摘要

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency--fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).

2605.03456 2026-05-12 cs.CV 版本更新

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

VL-SAM-v3:基于记忆的开放世界目标检测视觉先验

Chih-Chung Liu, Zhiwei Lin, Yongtao Wang

发表机构 * Wangxuan Institute of Computer Technology, Peking University, China(北京大学计算机科学技术研究院)

AI总结 VL-SAM-v3通过引入检索驱动的外部视觉记忆,提升开放世界目标检测的性能,尤其在稀有类别和复杂场景中表现优异。

详情
AI中文摘要

开放世界目标检测旨在定位和识别超出固定封闭集标签空间的对象。它通常分为两类:开放词汇检测,假设测试时有预定义类别列表;以及开放端检测,需在推理过程中生成候选类别。现有方法主要依赖粗粒度文本语义和参数化知识,常无法提供足够的视觉证据以处理细粒度外观变化、稀有类别和杂乱场景。本文提出VL-SAM-v3,一个统一框架,通过检索驱动的外部视觉记忆增强开放世界检测。具体而言,一旦候选类别可用,VL-SAM-v3从非参数化记忆库中检索相关视觉原型,并将其转换为两种互补的视觉先验:即稀疏先验用于实例级空间锚定,密集先验用于类别感知的局部上下文。这些先验通过记忆引导的提示细化与原始检测提示结合,实现共享的检索与细化机制,支持开放词汇和开放端推理。在LVIS上的广泛零样本实验表明,VL-SAM-v3在开放词汇和开放端推理中均能显著提升检测性能,尤其在稀有类别上表现突出。此外,与更强的开放词汇检测器(即SAM3)的实验验证了所提检索与细化机制的通用性。

英文摘要

Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two categories, i.e., open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during the inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (i.e., SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.

2605.03276 2026-05-12 cs.CV 版本更新

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

VEBench: 多模态大模型在真实世界视频编辑中的基准测试

Andong Deng, Dawei Du, Zhenfang Chen, Wen Zhong, Fan Chen, Guang Chen, Chia-Wen Kuo, Longyin Wen, Chen Chen, Sijie Zhu

发表机构 * ByteDance Intelligent Creation(字节跳动智能创作) CRCV, University of Central Florida(中央佛罗里达大学CRCV)

AI总结 VEBench通过高质量视频和人工验证问题对多模态大模型在视频编辑中的知识理解和操作推理能力进行评估,揭示当前模型与人类水平间的显著差距。

Comments CVPR Findings 2026

详情
AI中文摘要

真实世界的视频编辑不仅需要电影技法方面的专业知识,还需要多模态推理来选择、对齐和组合素材,形成连贯的叙事。尽管近期的大型多模态模型(LMMs)在通用视频理解上取得了显著进展,但其在多视频推理和操作性编辑流程中的能力在很大程度上仍未被探索。我们提出VEBENCH,这是首个旨在同时评估真实视频编辑场景中编辑知识理解和操作推理能力的综合基准。VEBENCH包含3.9K个高质量编辑视频(超过257小时)和3,080个经人工验证的问答对,通过三轮人机协作标注流程构建,确保精确的时间标注和语义一致性。它包含两个互补的问答任务:1)视频编辑技术识别,评估模型利用多模态线索识别7种编辑技术的能力;2)视频编辑操作模拟,通过要求从多个候选中选择相关片段并进行时间定位来模拟真实的编辑流程。在闭源模型(如Gemini-2.5-Pro)和开源LMMs上的大量实验表明,当前模型性能与人类水平的编辑认知之间存在巨大差距。这些结果凸显了将视频理解与创造性操作推理相衔接的迫切需求。我们期望VEBENCH成为推进智能视频编辑系统、推动复杂推理未来研究的基础。

英文摘要

Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.

2605.02743 2026-05-12 cs.AI cs.CV cs.HC 版本更新

Triple Spectral Fusion for Sensor-based Human Activity Recognition

三频谱融合用于基于传感器的人体活动识别

Ye Zhang, Longguang Wang, Qing Gao, Chaocan Xiang, Mohammed Bennamoun, Yulan Guo

发表机构 * School of Electronics and Communication Engineering, the Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区电子与通信工程学院) Aviation University of Air Force(空军航空大学) College of Computer Science, Chongqing University(重庆大学计算机学院) Department of Computer Science and Software Engineering, the University of Western Australia(西澳大学计算机科学与软件工程系)

AI总结 本文提出三频谱融合框架,通过自适应滤波和图傅里叶变换提升传感器数据融合与长期上下文关联性,实验表明其在多个基准数据集上表现优异。

详情
AI中文摘要

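基于传感器的人体活动识别(HAR)主要利用惯性测量单元(IMU)的姿态、运动和上下文数据来识别日常活动。尽管基于学习的方法不断进步,但由于异构传感器数据融合和长期上下文关联建立的复杂性,从时间角度进行信息融合仍具挑战。本文提出一种专为HAR设计的新型三频谱融合框架。首先,我们开发了一种自适应互补滤波技术用于噪声抑制,并将每个IMU的传感器组织为姿态和运动模态节点。鉴于IMU节点构成动态异构图,我们随后在图傅里叶域中应用自适应滤波,以融合同构和异构节点信息。此外,我们实现了一种自适应小波频率选择方法,以抑制上下文冗余并缩短特征长度。该方法同时增强了基于时间戳的图聚合和长期上下文的关联。我们的框架在傅里叶域、图傅里叶域和小波域中使用自适应滤波,实现了有效的多传感器融合和上下文关联。在十个基准数据集上的大量实验证明了该框架的优越性能。项目主页:https://github.com/crocodilegogogo/TSF-TPAMI2026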

英文摘要

The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long-term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR. First, we develop an adaptive complementary filtering technique for noise suppression and organize each IMU's sensors into posture and motion modality nodes. Given that IMU nodes form a dynamic heterogeneous graph, we then apply adaptive filtering within the graph Fourier domain to merge both homogeneous and heterogeneous node information. Furthermore, an adaptive wavelet frequency selection approach is implemented to suppress context redundancy and shorten the length of features. This approach enhances both timestamp-based graph aggregation and the correlation of long-term contexts. Our framework uses adaptive filtering in the Fourier, graph Fourier, and wavelet domains, enabling effective multi-sensor fusion and context correlation. Extensive experiments on ten benchmark datasets demonstrate the superior performance of our framework. Project page: https://github.com/crocodilegogogo/TSF-TPAMI2026.

2605.02169 2026-05-12 cs.CV cs.DC cs.LG 版本更新

Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation

异构模型融合用于隐私保护多摄像头监控的合成领域适应

Peggy Joy Lu, Wei-Yu Chen, Yao-Tsung Huang, Vincent Shin-Mu Tseng

发表机构 * Department of Computer Science and Information Engineering, National Chung Cheng University(国立中正大学资讯工程系) Department of Computer Science, National Yang Ming Chiao Tung University(国立阳明交通大学计算机科学系) National Center for High-Performance Computing(国家高速计算中心)

AI总结 本文提出HeroCrystal框架,通过合成领域适应解决多摄像头目标检测中的隐私、类别不平衡和异构架构问题,提升定位精度并实现模型融合,实验表明其在多类别隐私保护设置下优于现有方法,mAP提升2.1%达33.4%。

Comments 42 pages, 13 figures. Published in Information Fusion (Elsevier). DOI: 10.1016/j.inffus.2026.104413

Journal ref Information Fusion, 2026

详情
AI中文摘要

我们提出HeroCrystal,一种新型隐私保护多摄像头领域自适应目标检测框架,解决数据隐私、类别不平衡和异构架构等挑战。框架包含三个关键阶段:生成阶段引入单次目标感知扩散生成模块,通过提示控制合成特定目标实例;联邦阶段采用概率Faster R-CNN提升定位精度,动态模型对比策略抑制领域偏见;服务器端在不访问原始数据的情况下融合异构架构模型。最后提出不一致类别整合算法解决标签不一致和架构异质性问题。在多个跨领域检测基准上的实验表明,该方法在多类别隐私保护设置下优于现有多源领域适应和联邦学习基线,mAP比现有隐私保护方法提升2.1%,达到33.4%的新SOTA,证明HeroCrystal在实现实用多摄像头AI监控系统中的有效性。源代码可在https://github.com/ccuvislab/HeroCrystal公开获取。

英文摘要

We propose HeroCrystal, a novel privacy-preserving framework for multi-camera domain-adaptive object detection, addressing challenges such as data privacy, class imbalance, and heterogeneous architectures. Our framework consists of three key stages. In the Generated Stage, we introduce a one-shot, target-aware diffusion-based generation module that learns visual style from a single target-domain image while leveraging prompt-based control to synthesize specific object instances. Unlike conventional style transfer-based methods that require large target datasets and ignore semantic-level discrepancies, our approach enables privacy-preserving augmentation to reduce ethical concerns, and introduces controllable rare object generation to mitigate long-tailed category degradation. In the Federated Stage, we employ probabilistic Faster R-CNN on the client side to improve localization accuracy, and a dynamic model contrastive strategy to suppress domain-specific bias. The server side performs model fusion across heterogeneous architectures without accessing raw data. Finally, in the Distilled Stage, we propose an inconsistent categories integration algorithm to resolve label inconsistency and architecture heterogeneity across clients. Extensive experiments on multiple cross-domain detection benchmarks demonstrate that our method outperforms existing multi-source domain adaptation and federated learning baselines under multi-class, privacy-preserving settings. Our method improves mAP by +2.1% over prior privacy-preserving approaches and achieves a new state-of-the-art mAP of 33.4%, highlighting the effectiveness of HeroCrystal in enabling practical multi-camera AI surveillance systems. The source code is publicly available at https://github.com/ccuvislab/HeroCrystal.

2605.01345 2026-05-12 cs.CV cs.AI cs.LG 版本更新

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

视觉语言模型中的感知带宽瓶颈:通过顺序实验设计实现主动视觉推理

Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei, Jun Wang

发表机构 * The Hong Kong University of Science(香港科学与技术大学) University College London, London, United Kingdom(伦敦大学学院, 英国伦敦) ShanghaiTech University, Shanghai, China(上海科技大学, 中国上海) AI Lab, The Yangtze River Delta, China(人工智能实验室, 长江三角洲, 中国)

AI总结 本文提出通过顺序贝叶斯最优实验设计实现主动视觉推理,解决视觉语言模型中感知带宽瓶颈问题,提升高分辨率视觉推理能力。

Comments 27 pages, 5 figures, accepted at ICML 2026

详情
AI中文摘要

现代视觉语言模型(VLMs)的视觉感知受到感知带宽瓶颈的限制:广视野保留了全局上下文但牺牲了复杂推理所需的细粒度细节。我们主张高分辨率视觉推理不仅是语义推理,也是在有限感知带宽下获取任务相关的证据。受主动视觉和信息觅食启发,我们将这一过程形式化为顺序贝叶斯最优实验设计(S-BOED),其中智能体在回答前决定获取哪些视觉证据。由于在连续十亿像素空间中精确的贝叶斯推断不可行,我们推导出一个可计算的覆盖-分辨率目标作为任务相关信息增益的代理。我们通过FOVEA,一种无需训练的程序,通过证据导向的探测来优化VLM的裁剪提案。在高分辨率基准测试中,实验结果在直接和ReAct风格基线中表现出一致的提升,特别是在以搜索为主的遥感设置中表现尤为突出。

英文摘要

Visual perception in modern Vision-Language Models (VLMs) is constrained by a perceptual bandwidth bottleneck: a broad field of view preserves global context but sacrifices the fine-grained details required for complex reasoning. We argue that high-resolution visual reasoning is therefore not only semantic reasoning but also task-relevant evidence acquisition under limited perceptual bandwidth. Inspired by active vision and information foraging, we formalise this process as sequential Bayesian optimal experimental design (S-BOED), where an agent decides which visual evidence to acquire before answering. Since exact Bayesian inference is intractable in continuous gigapixel spaces, we derive a tractable coverage-resolution objective as a proxy for task-relevant information gain. We instantiate this framework with FOVEA, a training-free procedure that refines VLM crop proposals through evidence-oriented probing. Experiments on high-resolution benchmarks show consistent gains over direct and ReAct-style baselines, with particularly strong improvements in search-dominated remote-sensing settings.

2605.00884 2026-05-12 cs.CV 版本更新

LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception

LiteVLA-H: 双速率视觉-语言-动作推断用于机载空中引导与语义感知

Justin Williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar

发表机构 * Department of Cyber Physical Systems, Clark Atlanta University(克拉克亚特兰大大学网络物理系统系)

AI总结 LiteVLA-H通过双速率操作在边缘设备上实现高效视觉-语言-动作推断,兼顾快速动作输出与语义理解,提升空中引导与场景感知性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在 manipulation 中表现出强大的语义接地和任务泛化能力,但空中部署仍面临挑战,因为无人机需要在严格的机载计算和通信限制下实现低延迟闭环引导。我们提出了LiteVLA-H,一个紧凑的256M参数VLA系统,专为在NVIDIA Jetson AGX Orin上双速率操作而设计:一种快速的外环引导模式用于短动作令牌输出,一种较慢的语义模式用于场景理解、危险描述和操作员面向的叙述。核心经验观察是,在这种紧凑的边缘环境中,端到端延迟主要由多模态预填充而非解码少量额外令牌的边际成本主导。这促使一种调度器,在同一嵌入式平台上,以50.65毫秒(19.74Hz)的速率发出反应性动作令牌,同时仍支持句子级别的语义输出,速度为149.90-164.57毫秒(6.08-6.67Hz)。为了在不降低描述能力的情况下专门化模型,我们使用一种知识保持的微调配方,混合了反应性飞行数据、空中语义数据和通用标题/VQA监督。除了报告当前的延迟测量外,我们还将该系统与最近的最先进架构(AnywhereVLA、FutureVLA和ReMem-VLA)进行对比,显示在我们的部署条件下,所测动作分支在边缘推理速率上更高,同时保留了周期性语义意识。

英文摘要

Vision-language-action (VLA) models have shown strong semantic grounding and task generalization in manipulation, but aerial deployment remains difficult because drones require low-latency closed-loop guidance under strict onboard compute and communication constraints. We present LiteVLA-H, a compact 256M-parameter VLA system designed for dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer-loop guidance mode for short action-token outputs and a slower semantic mode for scene understanding, hazard description, and operator-facing narration. The central empirical observation is that, in this compact edge regime, end-to-end latency is dominated by multimodal pre-fill rather than by the marginal cost of decoding a few extra tokens. This motivates a scheduler that issues reactive action tokens at 50.65 ms (19.74 Hz) while still supporting sentence-level semantic outputs at 149.90-164.57 ms (6.08-6.67 Hz) on the same embedded platform. To specialize the model without collapsing its descriptive competence, we use a knowledge-preserving fine-tuning recipe that mixes reactive flight data, aerial semantic data, and generic caption/VQA supervision. Beyond reporting current latency measurements, we position the system against recent state-of-the-art architectures, including AnywhereVLA, FutureVLA, and ReMem-VLA, showing that the measured action branch reaches a higher edge inference rate under our deployment conditions while retaining periodic semantic awareness.
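The latency and rate figures quoted in the abstract are related by rate = 1000 / latency (with latency in milliseconds). The short sketch below only re-derives the reported Hz values from the reported latencies; the function name and the dictionary keys are illustrative and do not come from the paper.

```python
def latency_ms_to_hz(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to a rate in Hz."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latencies as reported in the abstract; the key names are hypothetical labels.
reported_latency_ms = {
    "action": 50.65,          # reactive action-token branch
    "semantic_fast": 149.90,  # lower end of the semantic-output range
    "semantic_slow": 164.57,  # upper end of the semantic-output range
}

rates_hz = {
    name: round(latency_ms_to_hz(ms), 2)
    for name, ms in reported_latency_ms.items()
}

for name, hz in rates_hz.items():
    print(f"{name}: {hz:.2f} Hz")
```

The rounded results match the 19.74 Hz and 6.08-6.67 Hz figures stated above, so the two sets of numbers are consistent with each other.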

2604.23789 2026-05-12 cs.CV 版本更新

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

MuSS:用于多镜头主体到视频生成的大规模数据集和电影叙事基准

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang

发表机构 * South China University of Technology(华南理工大学) Fudan University(复旦大学) Yunnan Normal University(云南师范大学)

AI总结 本文提出MuSS数据集和电影叙事基准,解决多镜头视频生成中的叙事逻辑、时空对齐和复制粘贴问题,通过改进的标注流程和反复制粘贴指标提升生成质量。

Comments 17 pages, 9 figures

详情
AI中文摘要

尽管视频基础模型在单镜头生成上表现优异,但现实中的电影叙事本质上依赖复杂的多镜头序列。进一步进展受限于缺乏解决三个核心挑战的数据集:真实叙事逻辑、时空文本-视频对齐冲突以及主体到视频(S2V)生成中的“复制粘贴”难题。为此,我们引入MuSS,一个大规模双轨数据集,专门用于多镜头视频和S2V生成。该数据集源自超过3000部电影,支持复杂的蒙太奇过渡和以主体为中心的叙事。为构建该数据集,我们首创了一种渐进式标注流程,通过确保局部镜头级准确性后再强制全球叙事一致性来消除上下文冲突。关键在于我们实现了跨镜头匹配机制,从根本上消除S2V中的复制粘贴捷径。此外,我们提出了电影叙事基准,包含视觉逻辑驱动范式和新颖的反复制粘贴方差(ACP-Var)度量,以严格评估连续叙事和3D结构一致性。大量实验表明,尽管当前基线在连续叙事逻辑或退化为2D贴纸生成器方面挣扎,但我们的MuSS增强模型在叙事效果和跨镜头身份保持方面达到了最先进的水平。

英文摘要

While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.

2604.22942 2026-05-12 cs.CV cs.AI cs.LG 版本更新

VS-DDPM: Efficient Low-Cost Diffusion Model for Medical Modality Translation

VS-DDPM:高效的低成本扩散模型用于医学模态转换

Nikoo Moradi, Gijs Luijten, Behrus Hinrichs-Puladi, Jens Kleesiek, Victor Alves, Jan Egger, André Ferreira

发表机构 * Center Algoritmi / LASI, University of Minho(中心Algoritmi / LASI,明霍大学) Institute for Artificial Intelligence in Medicine (IKIM)(医学人工智能研究所) Essen University Hospital (AöR)(埃森大学医院(AöR)) University of Duisburg-Essen(杜伊斯堡-埃森大学) Cancer Research Center Cologne Essen (CCCE)(科隆-埃森癌症研究中心(CCCE)) West German Cancer Center(西德癌症中心) University Hospital Essen (AöR)(埃森大学医院(AöR)) German Cancer Consortium (DKTK)(德国癌症联盟(DKTK)) Partner site University Hospital Essen (AöR)(埃森大学医院(AöR)合作站点) Institute of Computer Graphics and Vision(计算机图形与视觉研究所) Graz University of Technology(格拉茨技术大学) Institute of Medical Informatics(医学信息学研究所) University Hospital RWTH Aachen(亚琛RWTH大学医院) Department of Oral and Maxillofacial Surgery(口腔和颌面外科部门) TU Dortmund University(多特蒙德技术大学) Center for Virtual and Extended Reality in Medicine (ZvRM)(医学虚拟和扩展现实中心(ZvRM)) Faculty of Computer Science(计算机科学学院)

AI总结 本文提出VS-DDPM模型,通过改进的去噪扩散概率模型提升医学图像生成效率与质量,在MRI缺失合成、肿瘤移除等任务中表现优异,但部分任务未达最佳性能。

详情
AI中文摘要

扩散模型能生成高质量合成数据但推理速度慢。本文提出3D可变步长去噪扩散概率模型(VS-DDPM),旨在在保持生成质量的同时加速推理。我们在BraTS2025和SynthRAD2025挑战中测试了四种任务(缺失MRI、肿瘤移除、MRI到sCT、CBCT到sCT)。该模型在缺失MRI合成任务中达到SOTA性能,Dice分数分别为0.80、0.83和0.88,SSIM为0.95。MRI肿瘤移除任务中,RMSE为0.053,PSNR为26.77,SSIM为0.918。尽管框架在MRI到sCT和CBCT到sCT任务中表现竞争,但未达最佳基准,可能由于数据预处理和后处理管道的敏感性或特定损失函数配置。这些结果表明,VS-DDPM为高保真3D医学图像合成提供了稳健且可调节的解决方案。代码可在https://github.com/andre-fs-ferreira/SynthRAD_by_Faking_it获取。

英文摘要

Diffusion models produce high-quality synthetic data but suffer from slow inference. We propose the 3D Variable-Step Denoising Diffusion Probabilistic Model (VS-DDPM), a framework engineered to maintain generative quality while accelerating inference by several factors. We tested our approach on four tasks (missing MRI, tumor removal, MRI-to-sCT, and CBCT-to-sCT) within the BraTS2025 and SynthRAD2025 challenges. Designed for high efficiency under the hardware and time constraints imposed by both challenges, VS-DDPM achieved state-of-the-art (SOTA) performance in missing MRI synthesis, yielding Dice scores of 0.80, 0.83, and 0.88 for the enhancing tumor, tumor core, and whole tumor regions, respectively, alongside a structural similarity index (SSIM) of 0.95. For MRI tumor removal, the model attained a root mean squared error (RMSE) of 0.053, a peak signal-to-noise ratio (PSNR) of 26.77, and an SSIM of 0.918. While the framework demonstrated competitive performance in MRI-to-sCT and CBCT-to-sCT tasks, it did not reach SOTA benchmarks, potentially due to sensitivities in data pre- and post-processing pipelines or specific loss function configurations. These results demonstrate that VS-DDPM provides a robust and tunable solution for high-fidelity 3D medical image synthesis. The code is available at https://github.com/andre-fs-ferreira/SynthRAD_by_Faking_it.
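For readers unfamiliar with the Dice scores quoted above, the following is a minimal sketch of the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on flattened binary masks. It is not taken from the VS-DDPM code or from the challenge evaluation scripts, whose exact conventions (for example, how empty masks are handled) are not stated in the abstract; the masks below are toy data.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Toy masks, purely illustrative.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
dice = dice_score(pred_mask, true_mask)
print(f"Dice = {dice:.4f}")
```

In this toy case the masks share two foreground voxels out of three each, so the score is 4/6.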

2604.13710 2026-05-12 cs.CV 版本更新

SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

SLQ:通过共享潜在查询桥接模态以实现冻结MLLMs的检索

Haoran Lou, Ziyan Liu, Chunxiao Fan, Yuexin Wu, Yue Ming, Hao Wu, Kai Zuo, Yibo Chen, Xu Tang

发表机构 * School of Electronic Engineering, Beijing University of Posts(北京邮电大学电子工程学院) Xiaohongshu Inc., China(小红书公司)

AI总结 SLQ通过冻结MLLMs的主干,引入共享潜在查询提升多模态检索性能,实验显示其在多个基准上表现优异。

Comments Accepted to ICML-2026

详情
AI中文摘要

多模态大语言模型(MLLMs)具备内在推理和世界知识能力,但将其应用于密集检索仍具挑战性。现有方法依赖侵入性参数更新,如全微调和LoRA,可能破坏预训练语义空间并损害推理所需的结构化知识。为此,我们提出SLQ,一种参数高效调优框架,通过冻结主干来适应MLLMs进行检索。SLQ引入少量共享潜在查询,附加于文本和图像令牌,利用模型原生因果注意力将多模态上下文聚合到统一嵌入空间。此外,为更评估检索超越表层模式匹配,我们构建了KARR-Bench基准,专为知识感知推理检索设计。大量实验表明,SLQ在COCO和Flickr30K上优于全微调和LoRA,在MMEB上表现竞争,在KARR-Bench上取得显著提升,验证了通过非侵入性适应保留预训练表示在MLLMs基于检索中的有效性。代码可在https://github.com/CnFaker/SLQ获取。

英文摘要

Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench, validating that preserving the pre-trained representations via non-invasive adaptation is an effective strategy for MLLM-based retrieval. The code is available under: https://github.com/CnFaker/SLQ.

2604.12592 2026-05-12 cs.CV 版本更新

ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction

ELoG-GS:双分支高斯点云生成与亮度引导增强用于极端低光3D重建

Yuhao Liu, Dingju Wang, Ziyang Zheng

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出ELoG-GS方法,结合学习点云初始化和亮度引导色彩增强,提升极端低光环境下3D重建的几何一致性和真实感。实验表明其在NTIRE 2026挑战赛中优于基线方法,达到更高的PSNR和SSIM。

Comments Our method achieved a ranking of 9 out of 148 participants in Track 1 of the NTIRE 3DRR Challenge, as reported on the official competition website: https://www.codabench.org/competitions/13854/

详情
AI中文摘要

本文介绍了针对NTIRE 2026 3D修复与重建挑战(Track 1)的解决方案,旨在从退化的多视角输入中重建高质量的3D表示。挑战涉及在极端低光环境中恢复几何一致且逼真的3D场景。为此,我们提出了极端低光优化的高斯点云生成(ELoG-GS),一种稳健的低光3D重建管道,集成了基于学习的点云初始化和亮度引导的颜色增强,以实现稳定且逼真的高斯点云生成。我们的方法结合了几何感知的初始化和光度适应策略,以在挑战条件下提高重建保真度。在NTIRE Track 1基准上的广泛实验表明,我们的方法在重建质量上显著优于基线方法,实现了更优的视觉保真度和几何一致性。所提出的方法为现实世界退化场景中的稳健3D重建提供了实用解决方案。在最终测试阶段,我们的方法在官方平台排行榜上实现了PSNR为18.6626和SSIM为0.6855。代码可在https://github.com/lyh120/FSGS_EAPGS上获得。

英文摘要

This paper presents our approach to the NTIRE 2026 3D Restoration and Reconstruction Challenge (Track 1), which focuses on reconstructing high-quality 3D representations from degraded multi-view inputs. The challenge involves recovering geometrically consistent and photorealistic 3D scenes in extreme low-light environments. To address this task, we propose Extreme Low-light Optimized Gaussian Splatting (ELoG-GS), a robust low-light 3D reconstruction pipeline that integrates learning-based point cloud initialization and luminance-guided color enhancement for stable and photorealistic Gaussian Splatting. Our method incorporates both geometry-aware initialization and photometric adaptation strategies to improve reconstruction fidelity under challenging conditions. Extensive experiments on the NTIRE Track 1 benchmark demonstrate that our approach significantly improves reconstruction quality over the baselines, achieving superior visual fidelity and geometric consistency. The proposed method provides a practical solution for robust 3D reconstruction in real-world degraded scenarios. In the final testing phase, our method achieved a PSNR of 18.6626 and an SSIM of 0.6855 on the official platform leaderboard. Code is available at https://github.com/lyh120/FSGS_EAPGS.

2604.07522 2026-05-12 cs.CV 版本更新

Training-free Spatially Grounded Geometric Shape Encoding (Technical Report)

无需训练的空间几何形状编码(技术报告)

Yuhang He

发表机构 * Microsoft Research(微软研究院)

AI总结 本文提出XShapeEnc,一种无需训练的通用空间几何形状编码策略,通过分解几何形状和姿态,利用正交Zernike基和频率传播操作,实现高效、可逆且频率丰富的紧凑表示,适用于多种二维空间任务。

Comments Training-Free 2D Geometric Shape Encoding

详情
AI中文摘要

位置编码已成为将深度神经网络接地于离散点位置的标准方法,并在可表示为一维序列的任务中取得显著成功。然而,将其扩展到二维空间几何形状需要精心设计的编码策略,不仅考虑形状几何和姿态,还需与神经网络学习兼容。本文通过引入无需训练的通用编码策略XShapeEnc,将任意空间接地的二维几何形状编码为紧凑表示,具有可逆性、适应性和频率丰富性等五种有利特性。具体而言,将二维空间接地几何形状分解为单位圆内的标准化几何和姿态向量,其中姿态进一步转换为谐波姿态场,同样位于单位圆内。构建一组正交Zernike基来编码形状几何和姿态,单独或联合编码,随后通过频率传播操作引入高频内容。通过广泛分析和实验,展示了XShapeEnc在多种形状感知任务和自建的XShapeCorpus中的理论有效性、效率、判别性和适用性。本文认为XShapeEnc是超越一维序列数据向二维空间智能前沿研究的基础工具。

英文摘要

Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.

2604.06518 2026-05-12 eess.IV cs.AI cs.CV 版本更新

ADP-FL-MedSeg: Adaptive Differential Privacy for Federated Medical Segmentation Across Diverse Modalities

ADP-FL-MedSeg:适应性差分隐私用于跨多样模态的联邦医疗分割

Puja Saha, Eranga Ukwatta

发表机构 * College of Engineering, University of Guelph(圭尔夫大学工程学院)

AI总结 本文提出ADP-FL框架,通过动态调整隐私机制,在联邦学习中平衡隐私与效用,提升医疗图像分割的准确性与稳定性。

Comments 10 pages, 8 figures. Accepted in SPIE Medical Imaging 2026. Recipient of CAD Best Paper Award: 1st Place, and Robert F. Wagner All-Conference Best Paper Award: Finalist

Journal ref Proceedings Volume 13926, SPIE Medical Imaging 2026: Computer-Aided Diagnosis

详情
AI中文摘要

大量医学数据因隐私法规和机构限制难以集中使用。现有模型在不同临床中心泛化能力差,因成像协议异质性和数据分布变化。联邦学习提供无需共享原始数据的协作训练方法。然而,将差分隐私融入联邦学习常导致精度下降、收敛不稳定和泛化能力减弱。本文提出一种适应性差分隐私联邦学习(ADP-FL)框架,用于医疗图像分割,动态调整隐私机制以更好地平衡隐私-效用权衡。该方法稳定训练,显著提高Dice分数和分割边界质量,同时保持严格隐私保障。在多种成像模态和分割任务中评估ADP-FL,包括皮肤病变分割、肾脏肿瘤分割和脑肿瘤分割。与传统联邦学习和标准差分隐私联邦学习相比,ADP-FL在准确性、边界界定、收敛速度和训练稳定性方面均表现更优,性能接近非隐私联邦学习在同一隐私预算下的表现。这些结果证明了ADP-FL在真实联邦设置中实现高性能、隐私保护医疗图像分割的实用性。

英文摘要

Large volumes of medical data remain underutilized because centralizing distributed data is often infeasible due to strict privacy regulations and institutional constraints. In addition, models trained in centralized settings frequently fail to generalize across clinical sites because of heterogeneity in imaging protocols and continuously evolving data distributions arising from differences in scanners, acquisition parameters, and patient populations. Federated learning offers a promising solution by enabling collaborative model training without sharing raw data. However, incorporating differential privacy into federated learning, while essential for privacy guarantees, often leads to degraded accuracy, unstable convergence, and reduced generalization. In this work, we propose an adaptive differentially private federated learning (ADP-FL) framework for medical image segmentation that dynamically adjusts privacy mechanisms to better balance the privacy-utility trade-off. The proposed approach stabilizes training, significantly improves Dice scores and segmentation boundary quality, and maintains rigorous privacy guarantees. We evaluated ADP-FL across diverse imaging modalities and segmentation tasks, including skin lesion segmentation in dermoscopic images, kidney tumor segmentation in 3D CT scans, and brain tumor segmentation in multi-parametric MRI. Compared with conventional federated learning and standard differentially private federated learning, ADP-FL consistently achieves higher accuracy, improved boundary delineation, faster convergence, and greater training stability, with performance approaching that of non-private federated learning under the same privacy budgets. These results demonstrate the practical viability of ADP-FL for high-performance, privacy-preserving medical image segmentation in real-world federated settings.

2604.03687 2026-05-12 cs.CV 版本更新

SciLT: Long-tailed Image Classification under Scientific Image Domains

SciLT:在科学图像领域进行长尾图像分类

Jiahao Chen, Bing Su

发表机构 * Gaoling School of Artificial Intelligence(高瓴人工智能学院)

AI总结 本文研究了科学图像领域的长尾识别问题,发现基础模型微调效果有限,提出SciLT框架通过多级表征和双监督学习实现头尾类平衡,实验表明其在科学长尾识别中表现优异。

详情
AI中文摘要

长尾识别受益于基础模型和微调范式,但现有研究和基准主要局限于自然图像领域,其中预训练和微调数据分布相似。相比之下,科学图像具有不同的视觉特征和监督信号,这引发了对在这些设置中微调基础模型有效性的疑问。本文在纯视觉和微调范式下研究了科学长尾识别。在三个科学基准上的实验表明,微调基础模型收益有限,揭示了penultimate-layer特征在尾类中的重要作用。受这些发现启发,我们提出了SciLT框架,通过自适应特征融合和双监督学习利用多级表征。通过联合利用penultimate-和final-layer特征,SciLT在头尾类上实现了平衡性能。大量实验表明,SciLT在科学长尾识别中一致优于现有方法,建立了强大的基准,并为适应具有显著领域偏移的科学数据提供了有价值的指导。

英文摘要

Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natural image domains, where pre-training and fine-tuning data share similar distributions. In contrast, scientific images exhibit distinct visual characteristics and supervision signals, raising questions about the effectiveness of fine-tuning foundation models in such settings. In this work, we investigate scientific long-tailed recognition under a purely visual and fine-tuning paradigm. Experiments on three scientific benchmarks show that fine-tuning foundation models yields limited gains, and reveal that penultimate-layer features play an important role, particularly for tail classes. Motivated by these findings, we propose SciLT, a framework that exploits multi-level representations through adaptive feature fusion and dual-supervision learning. By jointly leveraging penultimate- and final-layer features, SciLT achieves balanced performance across head and tail classes. Extensive experiments demonstrate that SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition and providing valuable guidance for adapting foundation models to scientific data with substantial domain shifts.

2604.02564 2026-05-12 eess.IV cs.CV 版本更新

Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It

为何不变性不足以实现生物医学领域泛化以及如何修复它

Sebo Diaz, Polina Golland, Elfar Adalsteinsson, Neel Dey

发表机构 * MIT(麻省理工学院) MGH(麻省总医院) HMS(哈佛医学院)

AI总结 MaskGen通过结合源域图像强度和领域稳定的基模型表示,提出了一种简单有效的领域泛化策略,在生物医学图像分割中实现了更强的泛化能力。

Comments Project GitHub https://github.com/sebodiaz/MaskGen

详情
AI中文摘要

我们提出了MaskGen,一种理论上有根据且刻意简单的领域泛化方法,用于3D生物医学图像分割。现代分割模型在模态变化、疾病严重程度、临床地点等变化时会急剧退化,限制了其可靠应用。现有泛化方法使用极端增强、手工工程领域统计混合或架构重设计,这些方法带来了显著的实现开销,但效果在生物医学设置中不一致。MaskGen则提出了一种原理性的学习策略,具有边际开销,利用源域图像强度和领域稳定的基模型表示来训练鲁棒的分割模型。结果,MaskGen在完全监督和少样本分割中实现了广泛临床变化下的显著提升。与先前方法不同,MaskGen不依赖特定架构和损失函数,兼容标准增强流水线,易于实现,并能处理任意解剖区域。其实现可在https://github.com/sebodiaz/MaskGen免费获取。

英文摘要

We present MaskGen, a theoretically grounded and deliberately simple approach for domain generalization in 3D biomedical image segmentation. Modern segmentation models degrade sharply under shifts in modality, disease severity, clinical sites, and more, limiting their reliable adoption. Existing generalization methods address this using extreme augmentations, hand-engineered domain statistics mixing, or architectural redesigns that add significant implementation overhead while yielding inconsistent performance across biomedical settings. MaskGen instead presents a principled learning strategy with marginal overhead that utilizes both source-domain image intensities and domain-stable foundation model representations to train robust segmentation models. As a result, MaskGen achieves strong gains in both fully supervised and few-shot segmentation across broad clinical shifts in biomedical studies. Unlike prior approaches, MaskGen is architecture- and loss-agnostic, compatible with standard augmentation pipelines, easy to implement, and tackles arbitrary anatomical regions. Its implementation is freely available at https://github.com/sebodiaz/MaskGen.

2603.19222 2026-05-12 cs.CV cs.LG 版本更新

Spectrally-Guided Diffusion Noise Schedules

基于光谱引导的扩散噪声调度

Carlos Esteves, Ameesh Makadia

发表机构 * Google Research(谷歌研究)

AI总结 本文提出基于图像光谱特性设计实例特定噪声调度的方法,通过理论界限优化噪声级别,提升单阶段像素扩散模型生成质量,尤其在低步数情况下表现更佳。

Comments Accepted to ICML'26

详情
AI中文摘要

去噪扩散模型广泛用于高质量图像和视频生成,其性能依赖于噪声调度,定义训练和采样过程中的噪声分布及噪声级别序列。噪声调度通常手动设计并需跨不同分辨率手动调整。在本文中,我们提出一种系统的方法,基于图像的光谱特性为像素扩散设计实例特定的噪声调度。通过推导最小和最大噪声级别的有效性界限,我们设计出"紧致"的噪声调度,消除冗余步骤。在推理过程中,我们提出条件采样此类噪声调度。实验表明,我们的噪声调度提高了单阶段像素扩散模型的生成质量,特别是在低步数情况下表现更佳。

英文摘要

Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.

2603.16253 2026-05-12 cs.CV cs.AI 版本更新

Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

让评分有据可依:面向可靠视觉语言过程奖励模型的显式视觉前提验证

Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai, Zhengyi Yang, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

发表机构 * Qwen Large Model Application Team, Alibaba(阿里巴巴大模型应用团队) Alibaba Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所阿里巴巴分所) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 本文提出EVPV方法,通过显式验证视觉前提提升视觉语言过程奖励模型的可靠性,实验表明其在多模态推理基准上提升了重排序准确率。

Comments 27 pages, 4 figures, 10 tables. Evaluated on VisualProcessBench and six multimodal reasoning benchmarks (LogicVista, MMMU, MathVerse-VO, MathVision, MathVista, WeMath). Includes ablations and causal analysis via controlled constraint corruption. Code: https://github.com/Qwen-Applications/EVPV-PRM

详情
AI中文摘要

视觉语言处理奖励模型(VL-PRMs)被广泛用于评分中间推理步骤并重排序候选答案。然而,它们通常作为黑盒裁判:低步骤评分可能反映真实的推理错误或仅仅是验证者对图像的误解。这种感知与推理的纠缠导致系统性假阳性(奖励虚构的视觉前提)和假阴性(惩罚正确的 grounded 陈述),削弱了重排序和错误定位。我们引入显式视觉前提验证(EVPV),一种轻量级验证接口,将步骤评分与所依赖的视觉前提的可靠性联系起来。策略被提示生成分步的视觉检查表,使所需视觉事实显式化,同时约束提取器独立从输入图像中推导出结构化的视觉约束。EVPV将检查表声明与这些约束匹配以计算标量视觉可靠性信号,并通过可靠性门控校准PRM步骤奖励:当可靠性低时,依赖视觉的步骤奖励被衰减,当可靠性高时则保持不变。这在不进行每步工具调用的情况下解耦了感知不确定性与逻辑评估。在VisualProcessBench和六个多模态推理基准上的实验表明,EVPV提升了步骤级验证并一致地提高了Best-of-N重排序准确性。此外,向提取的约束中注入受控的损坏会产生单调的性能下降,提供了因果证据,表明收益来自约束的准确性以及显式前提验证,而非偶然的提示效应。代码可在:https://github.com/Qwen-Applications/EVPV-PRM

英文摘要

Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM
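The reliability-gating idea described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not specify how checklist claims are matched to constraints or what functional form the gate takes. This sketch uses exact string matching as a stand-in for the matching step and a simple multiplicative gate, with hypothetical function names and toy data.

```python
def visual_reliability(checklist_claims, extracted_constraints):
    """Fraction of checklist claims supported by the extracted constraints.

    Exact string matching is a stand-in for the paper's matching step.
    """
    if not checklist_claims:
        return 1.0  # no visual claims, nothing to distrust (assumed convention)
    constraints = set(extracted_constraints)
    supported = sum(1 for claim in checklist_claims if claim in constraints)
    return supported / len(checklist_claims)


def gate_step_rewards(step_rewards, visually_dependent, reliability):
    """Attenuate rewards of visually dependent steps by the reliability signal.

    Steps that do not depend on visual premises keep their original reward.
    """
    return [
        reward * reliability if depends else reward
        for reward, depends in zip(step_rewards, visually_dependent)
    ]


# Toy example: one of three visual claims conflicts with the extracted constraints.
claims = ["triangle ABC is right-angled", "AB = 3", "BC = 5"]
constraints = ["AB = 3", "triangle ABC is right-angled", "BC = 4"]
reliability = visual_reliability(claims, constraints)

raw_rewards = [0.9, 0.6, 0.8]
depends_on_image = [True, False, True]
gated = gate_step_rewards(raw_rewards, depends_on_image, reliability)
print(reliability, gated)
```

With a multiplicative gate, a reliability of 1.0 leaves every reward unchanged, which mirrors the "preserved when reliability is high" behaviour in the abstract; other gate shapes (thresholds, soft clipping) would satisfy the same description.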

2603.14694 2026-05-12 cs.CV cs.AI cs.LG 版本更新

Robust Building Damage Detection in Cross-Disaster Settings Using Domain Adaptation

利用领域自适应进行跨灾害场景的建筑损坏检测

Asmae Mouradi, Shruti Kshirsagar

发表机构 * School of Computing(计算机学院)

AI总结 本文提出一种两阶段集成方法,利用监督领域自适应(SDA)在四个严重程度类别中进行建筑损坏分类,通过SDA适应xView2方法到Ida-BD数据集,验证了SDA在跨区域检测中的关键作用,最终达到宏F1值0.5552。

Comments accepted for publication IEEE ICHMS

详情
AI中文摘要

从遥感影像快速评估结构损坏对及时灾害响应至关重要。在用于灾害管理的人机系统(HMS)中,自动化损坏检测为决策者提供可操作的态势感知。然而,在多灾害基准上训练的模型在未见地理区域往往表现不佳,原因是领域偏移,即训练数据与部署数据之间的分布不匹配,这会削弱人类对自动化评估的信任。我们探索了一种两阶段集成方法,利用监督领域自适应(SDA)在四个严重程度类别中进行建筑损坏分类。该流程使用SDA将xView2第一名方法适配到Ida-BD数据集,并系统地研究了各个增强组件对分类性能的影响。在未见的Ida-BD测试分割上的全面消融实验表明,SDA是必不可少的:移除它会导致损坏检测完全失败。我们的流程在使用SDA并以反锐化掩模(unsharp)增强的RGB图像作为输入时取得了最稳健的性能,宏F1值达到0.5552。这些结果强调了领域自适应在构建可信的自动化损坏评估模块中的关键作用,用于HMS集成的灾害响应。

英文摘要

Rapid structural damage assessment from remote sensing imagery is essential for timely disaster response. Within human-machine systems (HMS) for disaster management, automated damage detection provides decision-makers with actionable situational awareness. However, models trained on multi-disaster benchmarks often underperform in unseen geographic regions due to domain shift - a distributional mismatch between training and deployment data that undermines human trust in automated assessments. We explore a two-stage ensemble approach using supervised domain adaptation (SDA) for building damage classification across four severity classes. The pipeline adapts the xView2 first-place method to the Ida-BD dataset using SDA and systematically investigates the effect of individual augmentation components on classification performance. Comprehensive ablation experiments on the unseen Ida-BD test split demonstrate that SDA is indispensable: removing it causes damage detection to fail entirely. Our pipeline achieves the most robust performance using SDA with unsharp-enhanced RGB input, attaining a Macro-F1 of 0.5552. These results underscore the critical role of domain adaptation in building trustworthy automated damage assessment modules for HMS-integrated disaster response.
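The Macro-F1 metric reported above averages per-class F1 scores with equal weight, so rare damage classes count as much as common ones. Below is a minimal sketch of the standard definition for a four-class problem; the integer class labels and the toy predictions are illustrative and are not data from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0 (assumed convention).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Four severity classes encoded as 0..3 (hypothetical encoding), toy labels.
severity_labels = [0, 1, 2, 3]
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 0]
score = macro_f1(y_true, y_pred, severity_labels)
print(f"Macro-F1 = {score:.4f}")
```

Here the per-class F1 values are 0.8, 2/3, 0.8, and 2/3, giving a macro average of about 0.7333.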

2603.13224 2026-05-12 cs.CV cs.AI 版本更新

Visual-ERM: Reward Modeling for Visual Equivalence

视觉-ERM:用于视觉等价性的奖励建模

Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) Fudan University(复旦大学) CUHK(香港中文大学)

AI总结 本文提出视觉-ERM模型,通过细粒度的视觉反馈提升视觉到代码任务的奖励学习效果,实验显示其在图表到代码、表格和SVG解析任务中均取得显著提升。

Comments Project: https://github.com/InternLM/Visual-ERM

详情
AI中文摘要

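视觉到代码任务要求模型将图表、表格和SVG等结构化视觉输入重建为可执行或结构化的表示,并保持较高的视觉保真度。尽管近期的大型视觉语言模型(LVLMs)通过监督微调取得了强劲的结果,但由于奖励信号未对齐,强化学习仍然具有挑战性。现有奖励要么依赖文本规则,要么依赖粗粒度的视觉嵌入相似度,二者都无法捕捉细粒度的视觉差异,且容易受到奖励黑客攻击。我们提出视觉等价奖励模型(Visual-ERM),这是一种多模态生成式奖励模型,可直接在渲染后的视觉空间中评估视觉到代码的质量,并提供细粒度、可解释且与任务无关的反馈。将Visual-ERM集成到强化学习后,Qwen3-VL-8B-Instruct在图表到代码任务上提升8.4,在表格和SVG解析上也获得一致的增益(平均分别为+2.7和+4.1),并通过反思与修订进一步增强了测试时扩展能力。我们还提出VisualCritic-RewardBench(VC-RewardBench),这是一个用于评判结构化视觉数据上细粒度图像到图像差异的基准;在该基准上,8B规模的Visual-ERM明显优于Qwen3-VL-235B-Instruct,并接近领先的闭源模型。结果表明,无论任务是否特定,细粒度的视觉奖励监督对视觉到代码的强化学习既必要又充分。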

英文摘要

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

2603.04415 2026-05-12 cs.CL cs.CV 版本更新

Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training

双调优:面向多模态大语言模型训练的推理效能驱动的数据筛选

Ruobing Zheng, Tianqi Li, Jianing Li, Qingpei Guo, Yi Yuan, Jingdong Chen

发表机构 * Ant Group(蚂蚁集团)

AI总结 本文提出双调优框架,通过评估训练数据对推理训练的收益及推理训练的效能,指导多模态大语言模型的数据筛选与训练策略匹配。

Comments Project Page: https://digital-avatar.github.io/ai/ThinkingBoundary/

详情
AI中文摘要

推理后训练能提升大语言模型(LLMs)在数学和编程等复杂任务上的表现,但其在多样化多模态任务中的收益仍不确定。领先团队同时发布并行的“Instruct”和“Thinking”模型的趋势既耗费资源,又对用户不友好。先前研究发现,推理训练的收益受多种因素影响,如基础模型能力、任务特征和思维链(CoT)数据质量。然而,目前仍缺乏系统性的标准来判断推理后训练何时有益,以及应使用哪些数据来支持它。本文提出双调优(Dual Tuning),一种面向多模态LLM训练的、由推理效能驱动的数据筛选框架。给定目标任务和基础模型,双调优联合评估训练数据是否有益,以及使用当前CoT内容进行推理训练相对于非推理替代方案是否带来正向收益。我们在空间、数学和多学科任务上应用双调优,并进一步分析强化学习和思考模式如何影响推理效能。双调优的结果可指导数据筛选:识别有益于推理训练的数据、更适合直接回答训练的数据,以及在两种训练模式下都有害的数据。我们的工作为选择合适的训练数据和匹配后训练策略提供了定量标准。

英文摘要

Reasoning post-training improves Large Language Models (LLMs) on complex tasks such as mathematics and coding, but its benefits across diverse multimodal tasks remain uncertain. The trend among leading teams of releasing parallel "Instruct" and "Thinking" models is both resource-intensive and user-unfriendly. Prior work finds that the gains from reasoning training are influenced by multiple factors, such as base model capabilities, task characteristics, and Chain-of-Thought (CoT) data quality. However, principled criteria for determining when reasoning post-training is beneficial and which data should support it are still lacking. In this paper, we propose Dual Tuning, a reasoning efficacy-driven data curation framework for multimodal LLM training. Given a target task and a base model, Dual Tuning jointly evaluates whether the training data is beneficial and whether reasoning training with current CoT content yields positive gains over non-reasoning alternatives. We apply Dual Tuning across spatial, mathematical, and multi-disciplinary tasks, and further analyze how reinforcement learning and thinking patterns affect reasoning efficacy. The Dual Tuning results guide data curation by identifying data that benefit reasoning training, data better suited to direct-answer training, and data that are detrimental under both training modes. Our work provides quantitative criteria for selecting appropriate training data and matching post-training strategies.

2603.01743 2026-05-12 cs.CV 版本更新

Action-Guided Attention for Video Action Anticipation

基于动作的注意力机制用于视频动作预见

Tsung-Ming Tai, Sofia Casarin, Andrea Pilzer, Werner Nutt, Oswald Lanz

发表机构 * Free University of Bozen-Bolzano(博尔扎诺自由大学) NVIDIA(NVIDIA公司)

AI总结 本文提出Action-Guided Attention机制,通过预测动作序列引导注意力,提升视频动作预见的泛化能力与可解释性。

Comments Accepted by ICLR 2026

详情
AI中文摘要

在视频中预见未来动作具有挑战性,因为观察到的帧仅提供过去活动的证据,需要推断潜在意图以预测后续动作。现有基于transformer的方法依赖于像素表示上的点积注意力,往往缺乏必要的高层语义来有效建模视频序列。为此,我们提出Action-Guided Attention (AGA),一种利用预测动作序列作为查询和键的注意力机制,通过专门的门控函数将相关信息与当前帧嵌入结合。AGA的设计使训练后可以分析训练集发现的知识。在广泛采用的EPIC-Kitchens-100基准测试中,AGA在验证集到未见测试集上表现良好。训练后分析可以进一步检查模型捕获的动作依赖性和内部化的反事实证据,提供透明且可解释的预见预测见解。

英文摘要

Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming activity and to combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.

2602.09524 2026-05-12 cs.CV 版本更新

HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection

HLGFA:高-低分辨率引导的特征对齐用于无监督异常检测

Han Zhou, Yuxuan Gao, Yinchao Du, Xuezhe Zheng

发表机构 * Innolight Technology Research Institute(Innolight技术研究院)

AI总结 本文提出HLGFA框架,通过高-低分辨率特征一致性学习正常样本的正常性,不依赖像素级重建,利用双分辨率输入提取多级特征,并通过条件调制和门控残差修正引导低分辨率特征细化,有效识别异常区域。

Comments 14 pages, 6 figures, references added

详情
AI中文摘要

无监督工业异常检测(UAD)对现代制造检测至关重要,其中缺陷样本稀缺且需要可靠检测。本文提出HLGFA,一种高-低分辨率引导的特征对齐框架,通过建模正常样本高分辨率和低分辨率表示之间的跨分辨率特征一致性来学习正常性,而非依赖像素级重建。双分辨率输入通过共享冻结的主干网络提取多级特征,高分辨率表示被分解为结构和细节先验,通过条件调制和门控残差修正引导低分辨率特征的细化。在推理过程中,异常自然地被识别为跨分辨率对齐失效的区域。此外,引入了一种噪声感知的数据增强策略,以抑制工业环境中常见的干扰响应。在标准基准上的广泛实验表明,HLGFA的有效性,其在MVTec AD数据集上实现了97.9%的像素级AUROC和97.5%的图像级AUROC,优于代表性的重建和特征方法。

英文摘要

Unsupervised industrial anomaly detection (UAD) is essential for modern manufacturing inspection, where defect samples are scarce and reliable detection is required. In this paper, we propose HLGFA, a high-low resolution guided feature alignment framework that learns normality by modeling cross-resolution feature consistency between high-resolution and low-resolution representations of normal samples, instead of relying on pixel-level reconstruction. Dual-resolution inputs are processed by a shared frozen backbone to extract multi-level features, and high-resolution representations are decomposed into structure and detail priors to guide the refinement of low-resolution features through conditional modulation and gated residual correction. During inference, anomalies are naturally identified as regions where cross-resolution alignment breaks down. In addition, a noise-aware data augmentation strategy is introduced to suppress nuisance-induced responses commonly observed in industrial environments. Extensive experiments on standard benchmarks demonstrate the effectiveness of HLGFA, achieving 97.9% pixel-level AUROC and 97.5% image-level AUROC on the MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.

2602.07052 2026-05-12 cs.CV eess.IV 版本更新

Markerless Head Tracking for Accurate and Accessible Neuronavigation

无标记头部追踪用于精确且可及的神经导航

Ziye Xie, Oded Schlesinger, Raj Kundu, Jessica Y. Choi, Pablo Iturralde, Dennis A. Turner, Stefan M. Goetz, Guillermo Sapiro, Angel V. Peterchev, J. Matias Di Martino

发表机构 * Department of Electrical and Computer Engineering, Duke University(电气与计算机工程系,杜克大学) Department of Psychiatry & Behavioral Sciences, Duke University(精神病学与行为科学系,杜克大学) Duke Institute for Brain Sciences(杜克脑科学研究所)

AI总结 本文提出无标记方法,利用低成本摄像头和算法建模实现高精度神经导航,实验显示其在精度和成本效益上的优势。

详情
AI中文摘要

神经导航广泛应用于生物医学研究和干预,用于指导头周围仪器的精确放置以支持如经颅磁刺激等程序。然而,传统系统依赖于佩戴在受试者身上的标记,需要手动注册,且在程序中可能移位并引起不适。我们介绍了并评估了无标记方法,用低成本的可见和红外摄像头结合立体和深度感应,以及面部几何的算法建模来替代昂贵的硬件和物理标记。对50名人类受试者的验证显示,最佳无标记算法的跟踪偏差中位数仅为2.32毫米和2.01度,相较于传统标记系统,表明足够精确用于经颅磁刺激,并显著优于先前的无标记结果。该研究还表明,整合各种摄像头传感器的数据可以进一步提高整体精度。所提出的无标记神经导航方法可以降低设置成本和复杂性,提高患者舒适度,并在临床和研究环境中扩大神经导航的可及性。

英文摘要

Neuronavigation is widely used in biomedical research and interventions to guide the precise placement of instruments around the head to support procedures such as transcranial magnetic stimulation. Traditional systems, however, rely on subject-mounted markers that require manual registration, may shift during procedures, and can cause discomfort. We introduce and evaluate markerless approaches that replace expensive hardware and physical markers with low-cost visible and infrared light cameras incorporating stereo and depth sensing, combined with algorithmic modeling of the facial geometry. Validation with 50 human subjects yielded a median tracking discrepancy of only 2.32 mm and 2.01° for the best markerless algorithm compared to a conventional marker-based system, which indicates sufficient accuracy for transcranial magnetic stimulation and a substantial improvement over prior markerless results. The study also suggests that integration of the data from the various camera sensors can improve the overall accuracy further. The proposed markerless neuronavigation methods can reduce setup cost and complexity, improve patient comfort, and expand access to neuronavigation in clinical and research settings.
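A tracking discrepancy reported in millimetres and degrees can be read as the distance between two estimated head positions plus the angle of the relative rotation between two estimated orientations. The sketch below illustrates that reading under assumed conventions (positions in mm, orientations as 3x3 rotation matrices); the abstract does not say how the authors compute the discrepancy, and the function names are hypothetical.

```python
import math
import statistics


def translation_error_mm(p_a, p_b):
    """Euclidean distance between two 3-D positions (same units as the input)."""
    return math.dist(p_a, p_b)


def rotation_error_deg(R_a, R_b):
    """Angle of the relative rotation R_a^T R_b, in degrees."""
    # trace(R_a^T R_b) equals the element-wise (Frobenius) inner product.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_theta))


def rot_z(deg):
    """Rotation about the z-axis, used only to build toy inputs."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
angle_errors = [rotation_error_deg(identity, rot_z(a)) for a in (1.0, 2.0, 4.0)]
print(statistics.median(angle_errors))
print(translation_error_mm((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```

Reporting the median rather than the mean, as the abstract does, keeps a few badly tracked frames from dominating the summary.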

2602.04549 2026-05-12 cs.CV 版本更新

Nix and Fix: Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models

Nix and Fix:借助扩散模型瞄准 3D 高斯泼溅的 1000 倍压缩

Cem Eteke, Enzo Tartaglione

发表机构 * Chair of Media Technology, Munich Institute of Robotics and Machine Intelligence, School of Computation, Information and Technology, Technical University of Munich, 80333 Munich, Germany; LTCI, Télécom Paris, Institut Polytechnique de Paris, France

AI总结 本文提出 NiFi 方法,通过扩散模型在极低码率下实现高质量的 3DGS 压缩,并以接近 1000 倍的码率改进为目标。

详情
AI中文摘要

3D 高斯泼溅(3DGS)革新了新视角渲染。与隐式表示从密集空间点进行推断不同,3DGS 使用稀疏高斯。这使实时性能成为可能,但增加了存储需求,阻碍了码率受限的应用。3DGS 压缩作为缓解这一问题的研究方向应运而生。尽管已取得显著进展,但在低码率下,压缩会引入伪影,明显降低视觉质量。我们提出 NiFi,一种通过伪影感知的、基于扩散的一步蒸馏进行修复,从而实现极端 3DGS 压缩的方法。我们表明,该方法在极低码率(低至 0.1 MB)下达到了最先进的感知质量,并在感知性能相当的情况下,相对 3DGS 的码率改进接近 1000 倍。代码可在 https://github.com/ceteke/nifi 获得。

英文摘要

3D Gaussian Splatting (3DGS) revolutionized novel view rendering. Instead of inferring from dense spatial points, as implicit representations do, 3DGS uses sparse Gaussians. This enables real-time performance but increases space requirements, hindering rate-constrained applications. 3DGS compression emerged as a field aimed at alleviating this issue. While impressive progress has been made, at low rates, compression introduces artifacts that degrade visual quality significantly. We introduce NiFi, a method for extreme 3DGS compression through restoration via artifact-aware, diffusion-based one-step distillation. We show that our method achieves state-of-the-art perceptual quality at extremely low rates, down to 0.1 MB, and approaches a 1000x rate improvement over 3DGS at comparable perceptual performance. Code is available at: https://github.com/ceteke/nifi

2602.04054 2026-05-12 cs.LG cs.CV 版本更新

SEIS: Subspace-based Equivariance and Invariance Scores for Neural Representations

SEIS:基于子空间的神经表示等变性和不变性评分

Huahua Lin, Katayoun Farrahi, Xiaohao Cai

发表机构 * University of Southampton(南安普顿大学)

AI总结 本文提出SEIS方法,用于分析神经表示在几何变换下的特性,揭示卷积编码器和解码器在不同层的等变性和不变性变化规律,以及训练策略对这些特性的影响。

详情
AI中文摘要

理解神经表示如何响应几何变换对于评估学习特征是否保持有意义的空间结构至关重要。现有方法主要通过比较变换输入下的模型输出来评估鲁棒性,仅能有限地揭示几何信息在内部表示中的组织方式,并无法区分信息损失与重新编码。在本文中,我们引入SEIS(基于子空间的等变性和不变性评分),一种用于分析几何变换下各层特征表示的子空间度量方法,无需标签或显式了解变换即可区分等变性与不变性。通过在多样化架构上的受控实验,我们揭示了若干一致模式。首先,卷积编码器表现出深度方向上从强等变性到增加不变性的转变,两种属性在前几个训练周期内趋于稳定。然而,在分割解码器中,等变性则倾向于在后期层中恢复。其次,这种权衡并非内在的,而是由训练决策塑造的:数据增强主动同时增强等变性和不变性,而多任务学习在两者上产生协同增益,超过单独任务的收益。将分析扩展到卷积网络之外,我们发现变换器模型表现出不同的几何行为,而MLP-Mixers则显示中间特性。

英文摘要

Understanding how neural representations respond to geometric transformations is essential for evaluating whether learned features preserve meaningful spatial structure. Existing approaches assess robustness primarily by comparing model outputs under transformed inputs, offering limited insight into how geometric information is organized within internal representations and failing to distinguish between information loss and re-encoding. In this work, we introduce SEIS (Subspace-based Equivariance and Invariance Scores), a subspace metric for analyzing layer-wise feature representations under geometric transformations, disentangling equivariance from invariance without requiring labels or explicit knowledge of the transformation. Through controlled experiments across diverse architectures, we uncover several consistent patterns. First, convolutional encoders exhibit a depth-wise transition from strong equivariance to increasing invariance, with both properties stabilizing within the first few training epochs. In segmentation decoders, however, equivariance tends to recover in later layers. Second, this trade-off is not intrinsic but is shaped by training decisions: data augmentation actively strengthens both equivariance and invariance simultaneously, and multi-task learning induces synergistic gains in both properties beyond what either task achieves alone. Extending our analysis beyond convolutional networks, we find that transformer-based models exhibit distinct geometric behaviors, while MLP-Mixers display intermediate characteristics.

2602.01219 2026-05-12 cs.LG cs.CV 版本更新

Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights

混合top-k注意力:通过可扩展的快速权重实现高效的注意力

Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 本文提出MiTA,通过小规模地标查询收集top-k的键值对作为查询感知的变形路由专家,同时压缩N宽度MLP为共享专家,提升注意力机制的灵活性和可扩展性。

Comments Code is available at https://github.com/QishuaiWen/MiTA

详情
AI中文摘要

Transformer中的 vanilla 自注意力机制可以视为一个两层快速权重MLP,其权重由输入动态诱导,隐藏维度等于序列长度N。随着上下文扩展,这样的N宽度MLP的表达能力增加,但对极长序列不可扩展。最近,这种快速权重视角激励了混合专家(MoE)注意力机制,将序列划分为刚性块,将其视为快速权重专家,并稀疏地将token路由到它们。在本文中,我们将此视角提升为一个统一的高效注意力机制框架,将它们解释为通过路由或压缩使快速权重可扩展,并将其组织成五维分类法。然后,我们提出混合top-k注意力(MiTA),它使用少量地标查询收集top-k的注意力键值对作为查询感知和变形路由专家,同时将N宽度MLP压缩为更窄的共享专家。因此,我们的MiTA提升了先前MoE注意力的灵活性,从刚性到变形快速权重专家,以及先前top-k注意力的可扩展性,从查询特定集到可重用的top-k集。我们在视觉任务上进行了广泛实验,展示了MiTA的优越效果和效率,并揭示了有趣性质,如涌现的token修剪效应和从标准注意力的易于泛化。代码可在https://github.com/QishuaiWen/MiTA获取。

英文摘要

The vanilla self-attention mechanism in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically induced by inputs and whose hidden dimension is equal to the sequence length $N$. As the context extends, the expressive capacity of such an $N$-width MLP increases, but it becomes unscalable for extremely long sequences. Recently, this fast-weight perspective has motivated the Mixture-of-Experts (MoE) attention mechanism, which partitions the sequence into rigid blocks, treats them as fast-weight experts, and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for efficient attention mechanisms, interpreting them as making fast weights scalable through either routing or compression, and organizing them into a five-dimensional taxonomy. Then, we propose Mixture-of-Top-$k$ Attention (MiTA), which employs a small set of landmark queries to gather top-$k$ attended key-value pairs as query-aware and deformable routed experts, while compressing the $N$-width MLP into a narrower shared expert. Consequently, our MiTA improves the flexibility of prior MoE attention from rigid to deformable fast-weight experts, as well as the scalability of prior top-$k$ attention from query-specific set to reusable top-$k$ set. We conduct extensive experiments on vision tasks showing the superior effectiveness and efficiency of our MiTA, and also uncovering intriguing properties such as an emergent token-pruning effect and easy generalization from standard attention. Code is available at https://github.com/QishuaiWen/MiTA.
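For orientation, here is a minimal sketch of plain per-query top-k attention, the query-specific baseline the abstract contrasts MiTA with. It is not MiTA itself: the landmark queries, reusable top-k sets, and compressed shared expert are not reproduced, and the function name is hypothetical.

```python
import math


def topk_attention(query, keys, values, k):
    """Softmax attention restricted to the k highest-scoring keys for one query."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k largest scores: the "routed" subset for this query.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, top)) / total
        for d in range(dim)
    ]
    return output, top


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out, picked = topk_attention([1.0, 0.0], keys, values, k=1)
print(out, picked)
```

With k equal to the sequence length N this reduces to standard softmax attention, which matches the abstract's view of full attention as an N-width fast-weight MLP. Smaller k shrinks the effective hidden width per query.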

2602.01194 2026-05-12 cs.CV 版本更新

EMFormer: Efficient Multi-Scale Transformer for Accumulative Context Weather Forecasting

EMFormer:高效的多尺度Transformer用于累积上下文天气预测

Hao Chen, Tao Han, Jie Zhang, Song Guo, Fenghua Ling, Lei Bai

发表机构 * Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China(香港科技大学计算机科学与工程系) Shanghai AI Laboratory, Shanghai, China(上海人工智能实验室)

AI总结 本文提出EMFormer,通过单次卷积提取多尺度特征,结合累积上下文微调和复合损失函数,提升天气预测精度并减少计算开销。

Comments This paper has been accepted by ICML 2026

详情
AI中文摘要

长期天气预测对社会经济规划和灾害准备至关重要。尽管近期方法通过微调延长预测范围,但受限于灾难性遗忘、误差累积和高训练开销。为解决这些问题,我们提出一个包含预训练、微调和预测的新型流程,以增强长上下文建模并降低计算开销。首先,我们引入高效的多尺度Transformer(EMFormer)通过单次卷积在训练和推理中提取多尺度特征。基于新架构,我们进一步采用累积上下文微调以提高时间一致性而不降低短期准确性。此外,我们提出一个复合损失函数,通过正弦波加权动态平衡不同项,从而在预训练和微调过程中自适应引导优化轨迹。实验表明,我们的方法在天气预测和极端事件预测中表现优异,显著提高了长期预测精度。此外,EMFormer在视觉基准(ImageNet-1K和ADE20K)上表现出色,同时比传统多尺度模块快5.69倍。代码:https://github.com/chenhao-zju/emformer

英文摘要

Long-term weather forecasting is critical for socioeconomic planning and disaster preparedness. While recent approaches employ finetuning to extend prediction horizons, they remain constrained by the issues of catastrophic forgetting, error accumulation, and high training overhead. To address these limitations, we present a novel pipeline across pretraining, finetuning and forecasting to enhance long-context modeling while reducing computational overhead. First, we introduce an Efficient Multi-scale Transformer (EMFormer) to extract multi-scale features through a single convolution in both training and inference. Based on the new architecture, we further employ an accumulative context finetuning to improve temporal consistency without degrading short-term accuracy. Additionally, we propose a composite loss that dynamically balances different terms via a sinusoidal weighting, thereby adaptively guiding the optimization trajectory throughout pretraining and finetuning. Experiments show that our approach achieves strong performance in weather forecasting and extreme event prediction, substantially improving long-term forecast accuracy. Moreover, EMFormer demonstrates strong generalization on vision benchmarks (ImageNet-1K and ADE20K) while delivering a 5.69x speedup over conventional multi-scale modules. Code: https://github.com/chenhao-zju/emformer

2601.22904 2026-05-12 cs.CV cs.AI cs.LG 版本更新

Hyperspherical Autoencoder for High-Fidelity Image Reconstruction and Generation

超球面自编码器用于高保真图像重建与生成

Hun Chang, Byunghee Cha, Jong Chul Ye

发表机构 * KAIST AI(韩国科学技术院人工智能实验室)

AI总结 本文提出超球面自编码器,通过方向特征对齐和层次卷积补丁嵌入模块,提升图像重建和生成的保真度,实验结果显示其在gFID、rFID和PSNR指标上均优于现有方法。

Comments 22 pages, 20 figures

详情
AI中文摘要

近期研究探讨了使用预训练视觉基础模型(如DINO)进行生成自编码器,显示出强大的生成性能。然而,现有方法常因高频细节丢失而重建保真度有限。本文提出超球面自编码器(HAE),通过方向特征对齐和层次卷积补丁嵌入模块,在语义表示与像素级重建之间建立桥梁。进一步观察到SSL表示本质上位于超球面,采用黎曼流匹配训练扩散变换器(DiT)直接在球面潜在流形上进行训练。实验表明,我们的球面感知DiT在gFID、rFID和PSNR指标上均表现出色,验证了球面感知方法的优势。

英文摘要

Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the Hyperspherical Autoencoder (HAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that while semantic information in contrastive representations is primarily directional, enforcing strict magnitude matching hinders the preservation of fine-grained details. To address this, we introduce a Directional Feature Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention, alongside a Hierarchical Convolutional Patch Embedding module to enhance local structure preservation. Furthermore, observing that SSL-based representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Notably, our manifold-aware DiT exhibits highly efficient convergence, achieving an exceptional gFID of 1.96 alongside a reconstruction rFID of 0.78 and a PSNR of 25.2 dB, validating the advantages of our manifold-aware approach.
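The idea of aligning direction while leaving magnitude free can be illustrated with a cosine-based loss. This is only a sketch under the assumption that "directional" means cosine similarity; the abstract does not give the exact form of the Directional Feature Alignment objective, and the function name is hypothetical.

```python
import math


def directional_alignment_loss(feature, target):
    """1 - cosine similarity: zero when directions match, regardless of norms."""
    dot = sum(f * t for f, t in zip(feature, target))
    norm_f = math.sqrt(sum(f * f for f in feature))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_f * norm_t)


target = [3.0, 4.0]
print(directional_alignment_loss([0.3, 0.4], target))   # same direction, small norm
print(directional_alignment_loss([30.0, 40.0], target))  # same direction, large norm
print(directional_alignment_loss([4.0, -3.0], target))   # orthogonal direction
```

Because the loss ignores feature norms, an encoder trained this way stays free to use magnitude for other information, which is the flexibility the abstract credits for detail retention.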

2601.22158 2026-05-12 cs.CV 版本更新

One-step Latent-free Image Generation with Pixel Mean Flows

无需潜在变量的一步图像生成:像素均值流

Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, Kaiming He

发表机构 * MIT(麻省理工学院)

AI总结 本文提出像素均值流(pMF),通过分离网络输出空间与损失空间,实现无需潜在变量的一步图像生成,在ImageNet上取得优异结果。

Comments Tech report. Code at https://github.com/Lyy-iiis/pMF

详情
AI中文摘要

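现代基于扩散/流的图像生成模型通常具有两个核心特征:(i) 使用多步采样,(ii) 在潜在空间中运行。最近的进展在这两个方面各自取得了令人鼓舞的进步,为实现无需潜在变量的一步扩散/流模型铺平了道路。在本文中,我们朝着这一目标更进一步,提出“像素均值流”(pixel MeanFlow,pMF)。我们的核心准则是将网络输出空间与损失空间分开构建:网络目标被设计为位于假定的低维图像流形上(即x-预测),而损失则通过MeanFlow在速度空间中定义。我们在图像流形与平均速度场之间引入了一个简单的变换。实验中,pMF在ImageNet上的一步无潜在变量生成取得了强劲结果:256x256分辨率下FID为2.22,512x512分辨率下FID为2.48,填补了这一设定中的关键空白。我们希望这项研究能进一步拓展基于扩散/流的生成模型的边界。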

英文摘要

Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
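One way to see why an image-space prediction and an average-velocity field can be converted into each other is the following worked relation. It assumes the usual MeanFlow convention (data at $t=0$, noise $\epsilon$ at $t=1$, linear path) and only covers the case $r=0$. The transformation actually used in pMF may be more general, so treat this as an illustrative sketch rather than the paper's definition.

```latex
% Assumed convention: z_t = (1-t)\,x + t\,\epsilon, \quad v = \epsilon - x.
% Average velocity over [r, t] along the flow trajectory:
\[
  u(z_t, r, t) \;=\; \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,d\tau
  \quad\Longrightarrow\quad
  z_r \;=\; z_t - (t-r)\,u(z_t, r, t).
\]
% Setting r = 0 links the trajectory's clean endpoint and the average velocity:
\[
  z_0 \;=\; z_t - t\,u(z_t, 0, t)
  \qquad\Longleftrightarrow\qquad
  u(z_t, 0, t) \;=\; \frac{z_t - z_0}{t}.
\]
% One-step generation from pure noise (t = 1):
\[
  \hat{x} \;=\; \epsilon - u(\epsilon, 0, 1).
\]
```

Under this reading, a network that outputs an image-like quantity can still be trained with a velocity-space loss, because the two are related by a simple affine map in $z_t$.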

2601.16836 2026-05-12 cs.CV cs.CL 版本更新

ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models

ColorConceptBench:一种用于文本到图像模型中概率颜色-概念理解的基准

Chenxi Ruan, Yihan Hou, Yu Xiao, Guosheng Hu, Wei Zeng

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) China Academy of Art(中国美术学院)

AI总结 本文提出ColorConceptBench,通过概率颜色分布系统评估颜色-概念关联,研究文本到图像模型对隐含概念如情感和视觉状态的理解能力,发现模型在抽象语义上的敏感性不足。

Comments 9 pages, 6 figures

详情
AI中文摘要

文本到图像(T2I)模型在生成高质量图像方面已取得显著进展。然而,其将颜色与概念关联的能力仍主要受限于显式颜色名称或代码,而处理隐含概念(如情感和视觉状态)的能力尚待探索。为填补这一空白,我们引入ColorConceptBench,一个由专家标注的基准,通过概率颜色分布系统评估颜色-概念关联。ColorConceptBench超越显式颜色规范,考察模型如何解释1,281个隐含颜色概念,这些概念基于6,584个人类注释。对九种领先的T2I模型的评估表明,性能在语义类别上差异显著,且模型在抽象语义上的敏感性显著不足。这些限制即使在应用分类器自由引导扩展时仍存在,表明实现人类级颜色理解需要模型在学习和表示隐含语义意义的方式上发生转变。

英文摘要

Text-to-image (T2I) models have advanced considerably in generating high-quality images from textual descriptions. However, their ability to associate colors with concepts remains largely constrained to explicit color names or codes, while their capacity to handle implicit concepts, such as emotions and visual states, remains underexplored. To address this gap, we introduce ColorConceptBench, an expert-annotated benchmark that systematically evaluates color-concept associations through probabilistic color distributions. ColorConceptBench moves beyond explicit color specifications by examining how models interpret 1,281 implicit color concepts, grounded in 6,584 human annotations. Our evaluation of nine leading T2I models reveals that performance varies substantially across semantic categories, and models exhibit a significant lack of sensitivity to abstract semantics. These limitations persist even when applying classifier-free guidance scaling at inference time, suggesting that achieving human-like color understanding demands a shift in how models learn and represent implicit semantic meaning.

2512.24552 2026-05-12 cs.CV math.OC 版本更新

OCP-GN: A Scalable Second-order Optimizer for Stochastic Optimization

OCP-GN:一种适用于随机优化的可扩展二阶优化器

Jindi Zhong, Congyaohui Yin, Zhaorong Zhang, Huanshui Zhang

AI总结 本文提出基于最优控制原理的OCP-GN算法,用于神经网络训练中的大规模优化问题,具有O(d)的计算复杂度和强鲁棒性,实验验证其显著优势。

详情
AI中文摘要

本文提出一种基于最优控制原理(OCP)的新型二阶优化算法,适用于神经网络训练中的大规模优化问题。该算法具有O(d)的计算复杂度和强鲁棒性。在多个基准测试中的广泛实验验证了所提方法的显著优越性。

英文摘要

This paper proposes a novel second-order optimization algorithm based on the Optimal Control Principle (OCP), applicable to large-scale optimization problems in neural network training. The algorithm has a computational complexity of O(d) and strong robustness. Extensive experiments on multiple benchmarks demonstrate the significant superiority of the proposed method.

2512.19115 2026-05-12 cs.CV 版本更新

Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

生成巨人,检索弱者:为何多模态大语言模型在多模态检索中表现不佳?

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Yang Li, Wentao Zhang

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Peking University(北京大学) Tencent Inc(腾讯公司) Zhongguancun Academy(中关村学院)

AI总结 研究揭示多模态大语言模型在多模态检索中表现不佳的原因,通过稀疏自编码器分析发现其表示空间主要由文本语义构成,视觉语义不足,导致检索性能下降,提出ReAlign方法提升检索效果。

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)在生成任务中表现出色,但在零样本多模态检索任务中却表现出反直觉的不足。本文研究了阻碍MLLMs成为有效检索器的内在机制。借助稀疏自编码器(SAEs),我们将MLLM的输出表示分解为可解释的语义概念,以探测其内在行为。分析发现,MLLM的表示空间绝大部分由文本语义主导,而多模态检索所必需的视觉语义仅占一小部分。我们发现,MLLM对图像-文本模态桥接的高度关注进一步加剧了这种不平衡:它有利于生成,却使嵌入同质化,最终削弱了多模态检索所需的判别能力。我们还发现,对MLLM相似度计算贡献最大的特定特征分量实际上是干扰项,会大幅降低检索性能。基于这些见解,我们提出ReAlign,一种测试时适应方法,通过白化变换调整MLLM表示空间的几何结构。实验结果表明,这一简单的干预无需微调即可在多种MLLM上一致地提升零样本多模态检索性能。代码可在https://github.com/Heinz217/mllm-retrieval-analysis获取。

英文摘要

Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from being effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; and the visual semantics essential for multimodal retrieval only constitute a small portion. We find that this imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations of MLLMs are actually distractors that greatly reduce retrieval performance. Building on these insights, we propose ReAlign, a test-time adaptation approach that applies a whitening transformation to adjust the geometry of MLLM representation spaces. Empirical results show that this simple intervention consistently improves zero-shot multimodal retrieval performance across diverse MLLMs without fine-tuning efforts. The code is available at https://github.com/Heinz217/mllm-retrieval-analysis.
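A whitening transformation rescales embeddings so that their covariance becomes the identity, which stops a few dominant directions from swamping similarity scores. The sketch below uses Cholesky whitening on a toy 2-D set. It is an assumed variant for illustration: the abstract does not say which whitening ReAlign uses or on which statistics it is fitted, and the function names are hypothetical.

```python
import math


def fit_whitening(vectors, eps=1e-6):
    """Return the mean and the Cholesky factor L of the (regularized) covariance."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [
        [sum(c[i] * c[j] for c in centered) / n + (eps if i == j else 0.0)
         for j in range(d)]
        for i in range(d)
    ]
    # Cholesky decomposition: cov = L L^T with L lower triangular.
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return mean, L


def whiten(vector, mean, L):
    """Solve L y = (vector - mean) by forward substitution, i.e. y = L^{-1}(x - mean)."""
    y = [0.0] * len(vector)
    for i in range(len(vector)):
        s = vector[i] - mean[i] - sum(L[i][k] * y[k] for k in range(i))
        y[i] = s / L[i][i]
    return y


embeddings = [[0.0, 0.0], [2.0, 1.0], [1.0, 3.0], [4.0, 2.0], [3.0, 5.0]]
mean, L = fit_whitening(embeddings)
whitened = [whiten(e, mean, L) for e in embeddings]
print(whitened[0])
```

After whitening, cosine or dot-product similarity is computed on the transformed vectors. Since no model weights change, the step fits the test-time, no-fine-tuning setting the abstract describes.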

2512.08984 2026-05-12 cs.CV cs.AI 版本更新

RAG-HAR: Retrieval Augmented Generation-based Human Activity Recognition

RAG-HAR:基于检索增强生成的人类活动识别

Nirhoshan Sivaroopan, Hansi Karunarathna, Chamara Madarasingha, Anura Jayasumana, Kanchana Thilakarathna

发表机构 * University of Sydney, Australia(澳大利亚悉尼大学) University of Sri Jayewardenepura, Sri Lanka(斯里兰卡斯里贾亚瓦德纳普拉大学) Curtin University, Australia(澳大利亚科廷大学) Colorado State University, USA(美国科罗拉多州立大学)

AI总结 RAG-HAR提出一种无需训练的检索增强框架,利用大语言模型实现人类活动识别,通过轻量统计描述符和语义相似样本检索提升识别准确性和实用性。

Comments Accepted to IEEE PerCom 2026 (Pervasive computing and communications)

详情
AI中文摘要

人类活动识别(HAR)在医疗、康复、健身追踪和智能环境中有广泛应用,但现有深度学习方法需要特定数据集训练、大量标注数据和大量计算资源。本文提出RAG-HAR,一种无需训练的检索增强框架,利用大语言模型(LLMs)进行HAR。RAG-HAR计算轻量统计描述符,从向量数据库检索语义相似样本,并利用此上下文证据进行LLM基于的活动识别。我们进一步通过提示优化和引入LLM基于的活动描述符,生成上下文丰富的向量数据库,以提供准确且高度相关的上下文信息。此外,RAG-HAR在六个多样化的HAR基准测试中实现了最先进的性能。最重要的是,RAG-HAR在无需模型训练或微调的情况下实现了这些改进,强调了其鲁棒性和实用性。RAG-HAR超越已知行为,能够识别和有意义地标记多种未见过的人类活动。

英文摘要

Human Activity Recognition (HAR) underpins applications in healthcare, rehabilitation, fitness tracking, and smart environments, yet existing deep learning approaches demand dataset-specific training, large labeled corpora, and significant computational resources. We introduce RAG-HAR, a training-free retrieval-augmented framework that leverages large language models (LLMs) for HAR. RAG-HAR computes lightweight statistical descriptors, retrieves semantically similar samples from a vector database, and uses this contextual evidence to perform LLM-based activity identification. We further enhance RAG-HAR by first applying prompt optimization and then introducing an LLM-based activity descriptor that generates context-enriched vector databases for delivering accurate and highly relevant contextual information. With these mechanisms, RAG-HAR achieves state-of-the-art performance across six diverse HAR benchmarks. Most importantly, RAG-HAR attains these improvements without requiring model training or fine-tuning, emphasizing its robustness and practical applicability. RAG-HAR moves beyond known behaviors, enabling the recognition and meaningful labelling of multiple unseen human activities.
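The descriptor-then-retrieve stage of such a pipeline can be sketched as follows. Everything concrete here is an assumption for illustration: the four statistics, cosine similarity as the retrieval score, the toy sensor windows, and the prompt wording are not taken from the paper, and no LLM is called.

```python
import math


def describe(window):
    """Lightweight statistical descriptor of a 1-D sensor window."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    return [mean, std, min(window), max(window)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_window, database, k=2):
    """Return labels of the k stored descriptors most similar to the query."""
    q = describe(query_window)
    ranked = sorted(database, key=lambda item: cosine(q, item[0]), reverse=True)
    return [label for _, label in ranked[:k]]


def build_prompt(query_window, neighbor_labels):
    """Assemble retrieved evidence into a text prompt for a downstream LLM."""
    stats = ", ".join(f"{v:.3f}" for v in describe(query_window))
    return (f"Sensor statistics (mean, std, min, max): {stats}\n"
            f"Most similar labeled examples: {', '.join(neighbor_labels)}\n"
            "Which activity is this?")


# Toy "vector database" of (descriptor, label) pairs built from made-up windows.
database = [
    (describe([1.0, 1.02, 0.98, 1.0]), "sitting"),
    (describe([0.99, 1.01, 1.0, 1.0]), "sitting"),
    (describe([0.2, 1.8, -0.1, 2.1]), "walking"),
    (describe([0.0, 2.0, 0.3, 1.7]), "walking"),
]
query = [1.01, 0.99, 1.0, 1.0]
neighbors = retrieve(query, database)
print(build_prompt(query, neighbors))
```

Because the labeled examples live in the database rather than in model weights, adding a new activity only requires inserting new descriptors, consistent with the training-free claim.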

2512.06673 2026-05-12 cs.CV 版本更新

Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding

基于检测器的视频大语言模型用于高效的时空定位

Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Zhaowen Lin, Haiyang Zhang, Teng Long, Nicu Sebe, Yihua Shao, Haozhe Wang, Wei Wang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) University of Trento(特伦托大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Hong Kong University of Science and Technology(香港科技大学) ZTE Corporation(中兴通讯)

AI总结 本文提出DEViL模型,通过将密集空间定位任务转移给训练良好的检测器,提升视频时空定位的效率与性能,实现43.1%的m_vIoU和14.33 FPS的高效表现。

详情
AI中文摘要

多模态大语言模型(MLLMs)正从通用视频理解扩展到更细粒度的理解,如时空视频定位(STVG)和推理。在这些任务中,MLLM必须在时间和空间上定位用户查询的目标,并将结果作为推理的证据。现有MLLM方法主要遵循两种范式:(1)直接定位,通过额外对齐模块或专用解码器输出STVG结果;(2)基于候选的选择,首先构建管级候选,然后通过MLLM选择相关项。然而,两者都面临严重的效率瓶颈:前者随着查询时间跨度的增加导致线性增长的解码成本,后者依赖于昂贵的候选构建。为突破这一瓶颈,我们提出DEViL,一种由检测器赋能的视频-LLM,其核心思想是将密集空间定位从MLLM转移给完全并行化的、已训练良好的检测器。具体而言,DEViL将查询转化为与检测器兼容的参考语义标记,该标记取代检测器的文本嵌入,以实现单次通过的空间定位。然后,我们设计时间一致性正则化来跨帧匹配对象并强制其在时间上的连贯性。通过这种方式,DEViL避免了长坐标解码和重候选管道。大量实验表明,DEViL在HC-STVG上实现了43.1%的m_vIoU,具有优越的效率(14.33 FPS),同时保持了MLLM基础架构的一般推理能力。

英文摘要

Multimodal large language models (MLLMs) are rapidly expanding from general video understanding to finer-grained understanding such as spatio-temporal video grounding (STVG) and reasoning. In these tasks, an MLLM must localize the user-queried target in time and space and take the results as evidence for reasoning. Existing MLLM methods mainly follow two paradigms: (1) Direct Localization, which outputs STVG results with extra alignment modules or specialized decoders; and (2) Candidate-based Selection, which first constructs tube-level candidates and then selects the relevant one by an MLLM. However, both suffer from a serious efficiency bottleneck: the former incurs linearly growing decoding cost as the queried temporal span increases, while the latter relies on costly candidate construction. To break this bottleneck, we propose DEViL, a detector-empowered Video-LLM with a simple key idea: offloading dense spatial grounding from the MLLM to a fully parallelizable, well-trained detector. Specifically, DEViL distills the query into a detector-compatible reference-semantic token, which replaces the detector's text embedding to enable spatial grounding in a single pass. Then, we design temporal consistency regularization to match objects across frames and enforce their coherence over time. In this way, DEViL avoids long coordinate decoding and heavy candidate pipelines. Extensive experiments show that DEViL achieves strong performance (43.1% m_vIoU on HC-STVG) with superior efficiency (14.33 FPS), while preserving the general reasoning capacity of the MLLM backbone.
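The m_vIoU figure averages a per-video spatio-temporal IoU. The sketch below assumes the definition commonly used for STVG benchmarks: box IoU summed over frames where prediction and ground truth overlap in time, divided by the number of frames in their temporal union. The abstract does not restate the metric, so treat this as an assumed form.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred, gt):
    """pred, gt: dicts mapping frame index -> box for the predicted / true tube."""
    shared = pred.keys() & gt.keys()   # temporal intersection
    union = pred.keys() | gt.keys()    # temporal union
    if not union:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(union)


def m_v_iou(samples):
    """Mean vIoU over (pred, gt) pairs."""
    return sum(v_iou(p, g) for p, g in samples) / len(samples)


box = (0.0, 0.0, 10.0, 10.0)
gt_tube = {t: box for t in range(0, 4)}    # frames 0..3
pred_tube = {t: box for t in range(2, 6)}  # frames 2..5, spatially perfect
print(v_iou(pred_tube, gt_tube))
```

Even with perfect boxes, the toy prediction scores only 1/3 because it overlaps the ground truth on two of six frames, which shows how strongly temporal localization weighs on the metric.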

2512.02012 2026-05-12 cs.CV cs.LG 版本更新

Improved Mean Flows: On the Challenges of Fastforward Generative Models

改进的均值流:关于快速前向生成模型挑战的研究

Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter, Kaiming He

发表机构 * CMU(卡内基梅隆大学) MIT(麻省理工学院) Adobe(Adobe公司) THU(清华大学)

AI总结 本文改进了MeanFlow框架,通过重新参数化训练目标和指导机制,提升了训练稳定性与灵活性,实现了在ImageNet上的1.72 FID成绩。

Comments Technical report. Code at https://github.com/Lyy-iiis/imeanflow

详情
AI中文摘要

MeanFlow(MF)近期已被确立为一步生成建模的框架。然而,其“fastforward”特性在训练目标和引导机制中引入了关键挑战。首先,原始MF的训练目标不仅依赖于底层真实场,还依赖于网络本身。为解决这一问题,我们将目标重新表述为对瞬时速度$v$的损失,通过一个预测平均速度$u$的网络进行重新参数化。这种重新表述产生了一个更标准的回归问题,并提高了训练稳定性。其次,原始MF在训练期间固定了无分类器引导(classifier-free guidance)尺度,这牺牲了灵活性。我们通过将引导显式地作为条件变量来解决这一问题,从而在测试时保留灵活性。多种条件通过上下文条件化(in-context conditioning)处理,减小了模型规模并提升了性能。总体而言,我们的改进MeanFlow(iMF)方法完全从头训练,在ImageNet 256×256上以单次函数评估(1-NFE)取得了1.72的FID。iMF显著优于此类先前方法,并在不使用蒸馏的情况下缩小了与多步方法的差距。我们希望我们的工作能进一步推动fastforward生成建模作为独立范式的发展。

英文摘要

MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256×256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
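
The link between the average velocity $u$ and the instantaneous velocity $v$ can be written out explicitly. The sketch below assumes the usual definition of an average velocity over an interval $[r, t]$ along a trajectory $z_\tau$ with $dz_\tau/d\tau = v(z_\tau, \tau)$; the symbols $r$, $t$ and $z$ are notation introduced here for illustration, and the abstract does not state the exact loss the authors use.

```latex
\begin{align}
  % Assumed definition: average velocity over the interval [r, t]
  u(z_t, r, t) &= \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau \\
  % Multiply by (t - r) and differentiate with respect to t
  v(z_t, t) &= u(z_t, r, t) + (t - r)\, \frac{d}{dt}\, u(z_t, r, t) \\
  % Total derivative along the trajectory
  \frac{d}{dt}\, u(z_t, r, t) &= \partial_z u(z_t, r, t)\, v(z_t, t) + \partial_t u(z_t, r, t)
\end{align}
```

Under this reading, a network that predicts $u$ induces a prediction of $v$ through the right-hand side of the second line, which can then be regressed against a target velocity. Whether iMF uses exactly this form is not something the abstract confirms.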

2511.13415 2026-05-12 cs.IR cs.CL cs.CV 版本更新

Attention Grounded Enhancement for Visual Document Retrieval

基于注意力的视觉文档检索增强

Wanqing Cui, Wei Huang, Yazhi Guo, Yibo Hu, Meiguang Jin, Junfeng Ma, Keping Bi

发表机构 * Alibaba Group(阿里巴巴集团) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文提出AGREE框架,通过多模态大语言模型的注意力机制引导检索器识别相关文档区域,提升细粒度相关性建模,实验表明在ViDoRe V2基准上显著优于传统方法。

Comments Published as a conference paper at SIGIR 2026

详情
AI中文摘要

视觉文档检索需要理解异构和多模态内容以满足隐式信息需求。近期进展利用基于截图的文档编码结合细粒度后期交互来编码整体信息并捕捉细微对齐,显著提升检索性能。然而,检索器仍使用粗粒度全局相关性标签进行训练,未揭示哪些区域支持匹配。因此,检索器倾向于依赖表面特征,难以捕捉隐含语义连接,阻碍其处理非抽取式查询的能力。为改进细粒度相关性建模,我们提出注意力引导的检索器增强(AGREE)框架。AGREE利用多模态大语言模型(MLLMs)的跨模态注意力作为代理监督,引导检索器识别相关文档区域。具体而言,AGREE从MLLM中提取注意力图,突出基于查询所关注的文档区域。这些注意力分数作为局部、区域级的相关性信号。在训练过程中,AGREE将局部信号与全局文档级相关性标签结合,共同优化检索器。这种双层监督使模型不仅能学习文档是否匹配,还能学习哪些内容驱动相关性。在具有挑战性的视觉文档检索基准ViDoRe V2上,实验表明AGREE在平均nDCG@1和nDCG@5上分别比仅使用全局监督的基线高出12.82%和5.03%。定量和定性分析进一步表明,AGREE促进了查询词与文档区域之间的深度对齐,超越了表面匹配,向更准确和可解释的检索迈进。我们的代码可在 https://github.com/VickiCui/AGREE 获取。

英文摘要

Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy implicit information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction to encode holistic information and capture nuanced alignments, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To improve fine-grained relevance modeling, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models (MLLMs) as proxy supervision to guide the retriever in identifying relevant document regions. Specifically, AGREE extracts attention maps from the MLLM that highlight which document regions are attended to based on the query. These attention scores serve as local, region-level relevance signals. During training, AGREE combines local signals with the global document-level relevance label to jointly optimize the retriever. This dual-level supervision enables the model to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging visual document retrieval benchmark, ViDoRe V2, show that AGREE significantly outperforms the global-supervision-only baseline by 12.82% and 5.03% in terms of average nDCG@1 and nDCG@5, respectively. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://github.com/VickiCui/AGREE.
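
The reported gains are measured in nDCG@k. As a reminder of what that metric computes, here is a minimal sketch using the linear-gain form of DCG. The function names are illustrative, and the benchmark's own evaluation code may use a different gain function or tie-handling convention.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the retriever ranked the documents.
print(ndcg_at_k([0, 1, 0], k=1))  # the relevant page is not ranked first
print(ndcg_at_k([1, 0, 0], k=1))  # the relevant page is ranked first
```

nDCG@1 only rewards placing a relevant page at the very top, which is why it is the more sensitive of the two reported numbers.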

2510.13896 2026-05-12 q-bio.QM cs.AI cs.CV cs.MA 版本更新

GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents

GenCellAgent:基于大语言模型代理的通用、无需训练的细胞图像分割

Xi Yu, Yang Yang, Qun Liu, Yonghua Du, Sean McSweeney, Yuewei Lin

发表机构 * Artificial Intelligence Department, Brookhaven National Laboratory, Upton, 11973, NY, US(布鲁克海文国家实验室人工智能部,美国纽约州厄普顿,11973) National Synchrotron Light Source II, Brookhaven National Laboratory, Upton, 11973, NY, US(布鲁克海文国家实验室国家同步辐射光源II,美国纽约州厄普顿,11973) Biology Department, Brookhaven National Laboratory, Upton, 11973, NY, US(布鲁克海文国家实验室生物学部,美国纽约州厄普顿,11973)

AI总结 GenCellAgent通过规划-执行-评估循环实现无需训练的细胞图像分割,自动分配最佳工具并支持文本引导分割,优于现有方法且减少标注负担。

Comments 43 pages

详情
AI中文摘要

细胞图像分割对于定量生物学至关重要,但受限于异质模态、形态变异和标注有限。我们提出GenCellAgent,一种无需训练的多代理框架,通过规划-执行-评估循环(选择工具→运行→质量检查)结合长期记忆,协调专门分割器和通用视觉-语言模型。系统可自动路由图像至最佳工具,根据成像条件即时适应,支持文本引导分割未覆盖的细胞器,并将专家编辑记录在记忆中,实现自我进化和个性化流程。在七个细胞分割基准上,该系统在所有数据集上匹配或超过最佳单个工具,总体准确率优于所有基线。在分布外细胞器数据上,GenCellAgent显著优于未在目标领域训练的专门模型,恢复专用工具无法检测的结构。它还通过迭代文本引导细化分割新对象如高尔基体,轻量人工校正进一步提升性能。这些能力为无需重新训练的稳健、适应性强的细胞图像分割提供了实用路径,同时减少标注负担并匹配用户偏好。

英文摘要

Cellular image segmentation is essential for quantitative biology yet remains difficult due to heterogeneous modalities, morphological variability, and limited annotations. We present GenCellAgent, a training-free multi-agent framework that orchestrates specialist segmenters and generalist vision-language models via a planner-executor-evaluator loop (choose tool → run → quality-check) with long-term memory. The system (i) automatically routes images to the best tool, (ii) adapts on the fly using a few reference images when imaging conditions differ from what a tool expects, (iii) supports text-guided segmentation of organelles not covered by existing models, and (iv) commits expert edits to memory, enabling self-evolution and personalized workflows. Across seven cell-segmentation benchmarks spanning diverse microscopy modalities (4,718 images), this routing consistently matches or exceeds the best individual tool on every dataset and outperforms all baselines in overall accuracy. On out-of-distribution organelle data, GenCellAgent substantially outperforms specialist models that were not trained on the target domain, recovering structures that dedicated tools fail to detect. It also segments novel objects such as the Golgi apparatus via iterative text-guided refinement, with light human correction further boosting performance. Together, these capabilities provide a practical path to robust, adaptable cellular image segmentation without retraining, while reducing annotation burden and matching user preferences.

2510.06637 2026-05-12 cs.LG cs.AI cs.CV 版本更新

Control-Augmented Autoregressive Diffusion for Data Assimilation

增强控制的自回归扩散用于数据同化

Prakhar Srivastava, Farrin Marouf Sofian, Francesco Immorlano, Kushagra Pandey, Stephan Mandt

发表机构 * University of California, Irvine(加州大学尔湾分校)

AI总结 本文提出一种增强控制的自回归扩散模型,通过引入预训练模型与离线训练控制器,提升数据同化中混沌时空偏微分方程的稳定性和准确性。

详情
AI中文摘要

尽管测试时扩展和扩散微调已有进展,针对自回归扩散模型(ARDMs)的引导仍未得到充分探索。我们提出一个摊销式(amortized)框架,为预训练的ARDM配备一个离线训练的控制器。通过预览未来的推演(rollout),控制器学习逐步修正,在终端成本目标下预先顾及观测,从而得到可重复用于引导生成的策略。受ARDM轨迹的随机最优控制观点启发,我们的方法在每个去噪子步骤中注入小幅控制量,同时保持接近预训练的动力学。我们将该方法用于混沌时空偏微分方程(PDEs)中的数据同化(DA),现有方法在该场景下往往计算成本高,且在稀疏观测下容易出现预测漂移。在推理时,DA变为带有实时修正的前馈推演,相较于强大的基于扩散的基线方法实现了数量级的加速。在两个典型PDE以及一个涵盖六种观测情形的紧凑ECMWF再分析v5(ERA5)试点研究中,我们的方法在稳定性和准确性上始终优于现有最先进方法,在更大规模的GenCast研究中也观察到类似的改进。

英文摘要

Despite advances in test-time scaling and diffusion finetuning, guidance for Auto-Regressive Diffusion Models (ARDMs) remains underexplored. We introduce an amortized framework that augments a pretrained ARDM with an offline-trained controller. By previewing future rollouts, the controller learns stepwise corrections that anticipate observations under a terminal-cost objective, yielding a reusable policy for guided generation. Motivated by a stochastic optimal control view of ARDM trajectories, our method injects small controls within each denoising sub-step while staying close to the pretrained dynamics. We study this approach for data assimilation (DA) in chaotic spatiotemporal partial differential equations (PDEs), where existing methods are often computationally expensive and susceptible to forecast drift under sparse observations. At inference, DA becomes a feed-forward rollout with on-the-fly corrections, achieving an order-of-magnitude speedup over strong diffusion-based baselines. Across two canonical PDEs and a compact ECMWF Reanalysis v5 (ERA5) pilot spanning six observation regimes, our method consistently improves stability and accuracy over state-of-the-art alternatives, with similar improvements observed in a larger-scale GenCast study.

2510.05635 2026-05-12 cs.LG cs.CV 版本更新

NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering

NEO: 通过潜在空间重新定位实现无优化的测试时间适应

Alexander Murphy, Michal Danilowski, Soumyajit Chatterjee, Abhirup Ghosh

发表机构 * University of Birmingham(伯明翰大学) Brave Software Research(Brave软件研究) University of Cambridge(剑桥大学)

AI总结 NEO通过潜在空间重新定位实现无超参数的测试时间适应,仅需少量计算即可显著提升分类准确率,适用于多个数据集和设备。

Comments ICLR 2026

详情
AI中文摘要

测试时间适应(TTA)方法通常计算成本高、需要大量数据或对超参数敏感。基于潜在空间几何理论,通过将目标数据嵌入重新定位到原点,显著提升源与分布偏移样本的对齐程度。NEO是一种无需超参数的完全TTA方法,相比传统推理无显著计算开销。NEO在ImageNet-C上通过仅适应一个批次的64个样本,将ViT-Base的分类准确率从55.6%提升至59.2%。当使用512个样本进行适应时,NEO在ImageNet-C、ImageNet-R和ImageNet-S上超越了所有7种比较的TTA方法,在CIFAR-10-C上超越6/7种方法。NEO在模型校准指标上表现良好,并能从1个类别适应以提升ImageNet-C上999个其他类别的准确性。在Raspberry Pi和Jetson Orin Nano设备上,NEO相比基线减少了63%的推理时间及9%的内存使用。基于三种ViT架构和四个数据集的实验结果表明,NEO在TTA中具有高效且有效的应用潜力。

英文摘要

Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO, a hyperparameter-free fully TTA method that adds no significant compute compared to vanilla inference. NEO is able to improve the classification accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on just one batch of 64 samples. When adapting on 512 samples, NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least amount of compute. NEO performs well on model calibration metrics and additionally is able to adapt from 1 class to improve accuracy on 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63% and memory usage by 9% compared to baselines. Our results based on 3 ViT architectures and 4 datasets show that NEO can be used efficiently and effectively for TTA.
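
The core operation named in the abstract, re-centering target embeddings at the origin, can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the centre is estimated as the mean of one batch of penultimate-layer embeddings and that a frozen linear head follows. The names `recenter`, `predict`, `head_weight` and `head_bias` are hypothetical.

```python
import numpy as np


def recenter(embeddings):
    """Subtract the batch mean so the target embeddings are centred at the origin.

    Assumption: the centre is the mean of the current batch; the abstract does
    not say how the centre is actually estimated.
    """
    return embeddings - embeddings.mean(axis=0, keepdims=True)


def predict(embeddings, head_weight, head_bias):
    """Apply a frozen linear classification head to the re-centred embeddings."""
    logits = recenter(embeddings) @ head_weight.T + head_bias
    return logits.argmax(axis=1)


rng = np.random.default_rng(0)
shifted = rng.normal(size=(64, 8)) + 5.0  # a batch of 64 embeddings with an offset
head_weight = rng.normal(size=(10, 8))    # stand-in for a 10-class frozen head
head_bias = np.zeros(10)
print(predict(shifted, head_weight, head_bias).shape)
```

No gradient step or tunable parameter appears in the sketch, which matches the abstract's description of the method as optimization-free and hyperparameter-free.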

2509.08670 2026-05-12 cs.CV 版本更新

FractalPINN-Flow: A Fractal-Inspired Network for Unsupervised Optical Flow Estimation with Total Variation Regularization

FractalPINN-Flow:一种受分形启发的网络用于无监督光流估计与总变分正则化

Sara Behnamian, Rasoul Khaksarinezhad, Andreas Langer

发表机构 * Globe Institute, University of Copenhagen(全球研究所,哥本哈根大学) Centre for Mathematical Sciences, Lund University(数学科学中心,隆德大学)

AI总结 本文提出FractalPINN-Flow,一种无监督深度学习框架,通过连续灰度帧直接学习光流,无需真实数据。基于分形几何和自相似性设计的分形变形网络,结合总变分正则化,实现高精度、平滑且边缘保留的光流估计。

Journal ref In Proceedings of the 2nd Sorbonne-Heidelberg Workshop on AI in Medicine: Machine Learning for Multi-modal Data, Heidelberg University Library, 2025

详情
AI中文摘要

我们提出了FractalPINN-Flow,一种无监督深度学习框架,用于密集光流估计,直接从连续灰度帧中学习,无需真实数据。该架构的核心是分形变形网络(FDN)——一种受分形几何和自相似性启发的递归编码器-解码器。与传统CNN不同,FDN使用重复的编码器-解码器嵌套和跳跃连接,以捕捉细粒度细节和长程运动模式。训练目标基于经典的变分公式,使用总变分(TV)正则化。具体来说,我们最小化一个能量函数,结合L1和L2数据保真项以强制亮度恒定性,以及一个TV项以促进空间平滑性和一致的光流场。在合成和基准数据集上的实验表明,FractalPINN-Flow能够生成准确、平滑且边缘保留的光流场。该模型在高分辨率数据和标注有限的场景中表现尤为出色。

英文摘要

We present FractalPINN-Flow, an unsupervised deep learning framework for dense optical flow estimation that learns directly from consecutive grayscale frames without requiring ground truth. The architecture centers on the Fractal Deformation Network (FDN), a recursive encoder-decoder inspired by fractal geometry and self-similarity. Unlike traditional CNNs with sequential downsampling, FDN uses repeated encoder-decoder nesting with skip connections to capture both fine-grained details and long-range motion patterns. The training objective is based on a classical variational formulation using total variation (TV) regularization. Specifically, we minimize an energy functional that combines $L^1$ and $L^2$ data fidelity terms to enforce brightness constancy, along with a TV term that promotes spatial smoothness and coherent flow fields. Experiments on synthetic and benchmark datasets show that FractalPINN-Flow produces accurate, smooth, and edge-preserving optical flow fields. The model is especially effective for high-resolution data and scenarios with limited annotations.
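
An energy of the kind described, combining $L^1$ and $L^2$ brightness-constancy terms with a TV regularizer, could take the following generic form. The weights $\lambda_1$, $\lambda_2$, $\alpha$, the flow symbol $w$, and the use of the warped residual $I_2(x + w(x)) - I_1(x)$ are assumptions made for illustration; the abstract does not give the exact functional.

```latex
\begin{align}
  % Brightness-constancy residual for frames I_1, I_2 and flow field w (assumed form)
  \rho_w(x) &= I_2\bigl(x + w(x)\bigr) - I_1(x) \\
  % Energy: L^1 and L^2 data fidelity plus total variation of the flow
  E(w) &= \lambda_1 \int_\Omega \lvert \rho_w(x) \rvert \, dx
        + \frac{\lambda_2}{2} \int_\Omega \lvert \rho_w(x) \rvert^2 \, dx
        + \alpha \, \mathrm{TV}(w) \\
  % Total variation, written formally for a differentiable flow field
  \mathrm{TV}(w) &= \int_\Omega \lvert \nabla w(x) \rvert \, dx
\end{align}
```

In an unsupervised setting, the network output plays the role of $w$ and $E(w)$ is the training loss, so no ground-truth flow is needed.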

2507.04465 2026-05-12 cs.CV 版本更新

Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions

基于深度学习的视觉手部手势识别:方法、数据集、挑战及未来研究方向的全面综述

Konstantinos Foteinos, Manousos Linardakis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

发表机构 * Department of Informatics and Telematics, Harokopio University of Athens(雅典哈罗科皮奥大学信息学与远程信息处理系)

AI总结 本文综述了视觉手部手势识别领域的方法、数据集、挑战及未来方向,系统分析了当前最先进的方法和评估指标,为研究者提供改进指南。

Comments Submitted to Neurocomputing. Rewritten abstract, due to limited space

详情
AI中文摘要

深度学习模型的快速发展和可用数据集的持续增长,使视觉手部手势识别(VHGR)这一重要领域受到研究社区的广泛关注,并在手语理解和人机交互等方面得到广泛应用。尽管该领域已有大量研究,但缺乏系统性和完整的综述,导致研究者需在数百篇论文中寻找当前最先进的方法(SOTA)。本文旨在填补这一空白,通过系统的方法和结构化的呈现,全面概述该计算机视觉领域。本文重点探讨四个核心问题:VHGR的主要方面、当前最先进的方法、方法和任务之间的比较洞察,以及塑造未来研究的挑战。通过系统的方法定位相关文献,本文以分类方式识别并组织了关键的VHGR方法。SOTA方法被分为三个主要任务:静态、孤立动态和连续手势识别。对于每个任务,列出了架构趋势和学习策略。为了支持未来方法的实验评估,本文回顾了常用的数据集并展示了标准性能指标。本文最后识别了VHGR中的主要挑战,包括通用计算机视觉问题和领域特定的障碍,并概述了未来研究的有希望方向。

英文摘要

The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always-important field of visual hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the current state-of-the-art (SOTA). The current survey aims to fill this gap by presenting a comprehensive overview of this computer vision field. With a systematic research methodology and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to propose improvements. Specifically, this survey focuses on four fundamental questions: what are the main VHGR aspects, what are the current SOTA methods, what comparative insights can be drawn across methods and tasks, and which challenges shape future research. Starting with the methodology used to locate the related literature, the survey identifies and organizes the key VHGR approaches in a taxonomy-based format. The SOTA methods are grouped across three primary VHGR tasks: static, isolated dynamic and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. To support the experimental evaluation of future methods in the field, the study reviews commonly used datasets and presents the standard performance metrics. Our survey concludes by identifying the major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.

2507.04277 2026-05-12 cs.CV 版本更新

Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices

面向移动设备的最轻量级低光照图像增强架构

Guangrui Bai, Hailong Yan, Wenhai Liu, Yahui Deng, Erbao Dong

发表机构 * Key Laboratory of Precision and Intelligent Chemistry, Department of Precision Machinery and Precision Instrumentation, University of Science and Technology of China(精密与智能化学重点实验室,精密机械与精密仪器系,中国科学技术大学) School of Information and Communication Engineering, University of Electronic Science and Technology of China(信息与通信工程学院,电子科技大学)

AI总结 本文提出LiteIE框架,通过轻量级网络和无监督训练,实现低光照图像增强的高效解决方案,在资源受限设备上达到高PSNR和实时处理能力。

Comments Accepted by ESWA

详情
AI中文摘要

实时低光照图像增强在移动和嵌入式设备上需要在视觉质量和计算效率之间取得平衡。现有深度学习方法常依赖大型网络和标注数据集,限制了其在资源受限平台上的部署。本文提出LiteIE,一种超轻量级无监督增强框架,消除了对大规模监督的依赖,并在多种条件下表现良好。我们设计了一个与主干网络无关的特征提取器,仅使用两个卷积层来生成紧凑的图像特征增强张量。此外,我们开发了一个无参数的迭代修复模块,通过重用提取的特征逐步恢复早期增强步骤中丢失的细节,不引入任何额外的可学习参数。我们进一步提出一个无监督训练目标,整合了曝光控制、边缘感知平滑性和多尺度颜色一致性损失。在LOL数据集上,LiteIE的PSNR达到19.04 dB,比SOTA高1.4 dB,同时仅使用其0.07%的参数。在Snapdragon 8 Gen 3移动处理器上,LiteIE在4K图像上以30 FPS运行,仅需58个参数,可在边缘设备上实时部署。这些结果证明LiteIE是资源受限平台上高效且实用的低光照增强解决方案。

英文摘要

Real-time low-light image enhancement on mobile and embedded devices requires models that balance visual quality and computational efficiency. Existing deep learning methods often rely on large networks and labeled datasets, limiting their deployment on resource-constrained platforms. In this paper, we propose LiteIE, an ultra-lightweight unsupervised enhancement framework that eliminates dependence on large-scale supervision and generalizes well across diverse conditions. We design a backbone-agnostic feature extractor with only two convolutional layers to produce compact image-feature enhancement tensors. In addition, we develop a parameter-free Iterative Restoration Module, which reuses the extracted features to progressively recover fine details lost in earlier enhancement steps, without introducing any additional learnable parameters. We further propose an unsupervised training objective that integrates exposure control, edge-aware smoothness, and multi-scale color consistency losses. On the LOL dataset, LiteIE achieves 19.04 dB PSNR, surpassing SOTA by 1.4 dB while using only 0.07% of its parameters. On a Snapdragon 8 Gen 3 mobile processor, LiteIE runs at 30 FPS for 4K images with just 58 parameters, enabling real-time deployment on edge devices. These results establish LiteIE as an efficient and practical solution for low-light enhancement on resource-limited platforms.

2506.12542 2026-05-12 cs.LG cs.AI cs.CV stat.ML 版本更新

PLD: A Choice-Theoretic List-Wise Knowledge Distillation

PLD: 一种基于选择理论的列表级知识蒸馏

Ejafa Bassam, Dawei Zhu, Kaigui Bian

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院)

AI总结 本文提出PLD,一种基于Plackett-Luce模型的列表级知识蒸馏方法,通过将教师logits视为'价值'评分,直接优化教师最优排名,实现凸且平移不变的替代目标,涵盖加权交叉熵。

Journal ref Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 136090--136112 (2026)

详情
AI中文摘要

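知识蒸馏是一种模型压缩技术,通过训练紧凑的“学生”网络来复现更大的“教师”网络的预测行为。在基于logit的知识蒸馏中,在交叉熵之外附加一个蒸馏项已成为事实上的标准做法。该项通常是匹配边际概率的KL散度,或是刻画类内与类间关系的基于相关性的损失。无论哪种情况,它都是交叉熵之外的附加项,并带有需要仔细调节的独立权重。本文采用选择理论的视角,将教师logits解释为“价值”评分,在Plackett-Luce模型下重新表述知识蒸馏。我们提出“Plackett-Luce蒸馏(PLD)”,一种加权的列表级排序损失。在PLD中,教师模型传递其对各类别完整排序的知识,并以自身置信度对每个排序选择加权。PLD直接优化单一的“教师最优”排序:真实标签排在首位,其余类别按教师置信度降序排列。由此得到一个凸且平移不变的替代目标,并涵盖加权交叉熵。实验表明,在CIFAR-100、ImageNet-1K和MS-COCO上,PLD在多种架构和蒸馏目标(包括基于散度、基于相关性和基于特征的方法)下,于同构和异构的师生组合中均取得一致的提升。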

英文摘要

Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, it has become the de facto approach to augment cross-entropy with a distillation term. Typically, this term is either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships. In every case, it acts as an additional term to cross-entropy. This term has its own weight, which must be carefully tuned. In this paper, we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce "Plackett-Luce Distillation (PLD)", a weighted list-wise ranking loss. In PLD, the teacher model transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single "teacher-optimal" ranking. The true label is placed first, followed by the remaining classes in descending teacher confidence. This process yields a convex and translation-invariant surrogate that subsumes weighted cross-entropy. Empirically, across CIFAR-100, ImageNet-1K, and MS-COCO, PLD achieves consistent gains across diverse architectures and distillation objectives, including divergence-based, correlation-based, and feature-based methods, in both homogeneous and heterogeneous teacher-student pairs.
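
The ranking construction and list-wise loss described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's reference implementation: it assumes the per-position weight is the teacher's softmax probability of the class at that position, and the function names are hypothetical.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def teacher_optimal_ranking(teacher_logits, true_label):
    """True label first, then the remaining classes by descending teacher logit."""
    rest = [int(i) for i in np.argsort(-teacher_logits, kind="stable") if i != true_label]
    return [int(true_label)] + rest


def pld_loss(student_logits, teacher_logits, true_label):
    """Weighted Plackett-Luce negative log-likelihood of the teacher-optimal ranking."""
    ranking = teacher_optimal_ranking(teacher_logits, true_label)
    weights = softmax(teacher_logits)  # assumed weighting: teacher confidence
    s = student_logits[ranking]        # student logits in ranked order
    loss = 0.0
    for k in range(len(ranking)):
        # -log P(class at position k is chosen among the classes not yet ranked)
        loss += weights[ranking[k]] * (logsumexp(s[k:]) - s[k])
    return float(loss)


teacher = np.array([2.0, 1.0, 3.0, 0.0])
student = np.array([0.5, 1.5, 0.2, -1.0])
print(teacher_optimal_ranking(teacher, true_label=1))
print(pld_loss(student, teacher, true_label=1))
```

In this sketch the `k = 0` term is a cross-entropy on the true label scaled by a weight, which is one way to read the claim that the objective subsumes weighted cross-entropy. Each term is a log-sum-exp minus a linear function of the student logits, so the sum is convex and unchanged when a constant is added to all student logits.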

2506.07436 2026-05-12 cs.CV cs.AI cs.ET 版本更新

Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition

提示到保护:多模态大语言模型在建筑危险识别中的比较研究

Nishi Chaudhary, S M Jamil Uddin, Sathvik Sharath Chandra, Anto Ovid, Alex Albert

发表机构 * Department of Construction Management, Colorado State University(科罗拉多州立大学建设管理系) Department of Civil, Construction, and Environmental Engineering, North Carolina State University(北卡罗来纳州立大学土木、建设与环境工程系)

AI总结 本文比较了五种先进多模态大语言模型在建筑危险识别中的表现,发现提示策略显著影响性能,CoT提示效果最佳,GPT-4.5和GPT-o3表现突出,强调了提示设计在提升建筑安全应用准确性中的关键作用。

详情
AI中文摘要

最近多模态大语言模型(LLMs)的出现为改进施工现场的视觉危险识别提供了新机遇。不同于传统计算机视觉模型依赖领域特定训练和大量数据集,现代LLMs能通过简单的自然语言提示解释和描述复杂视觉场景。然而,尽管对其应用的兴趣日益增长,但在建筑领域安全关键视觉任务中不同LLMs的表现仍有待研究。为此,本文对五种最先进的LLMs:Claude-3 Opus、GPT-4.5、GPT-4o、GPT-o3和Gemini 2.0 Pro进行了比较评估,以评估其从真实世界建筑图像中识别潜在危险的能力。每个模型在三种提示策略下进行测试:零样本、少样本和思维链(CoT)。零样本提示涉及最少指令,少样本结合基本安全上下文和危险源记忆法,而CoT提供逐步推理示例以支撑模型思维。使用精确率、召回率和F1分数指标进行定量分析。结果表明,提示策略显著影响性能,CoT提示在所有模型中均产生更高的准确性。此外,LLM在不同条件下的表现各异,GPT-4.5和GPT-o3在大多数设置中表现最佳。研究还展示了提示设计在提升多模态LLMs在建筑安全应用中的准确性和一致性中的关键作用。本研究为提示工程与LLMs的整合提供了可行见解,有助于开发更可靠的AI辅助安全系统。

英文摘要

The recent emergence of multimodal large language models (LLMs) has introduced new opportunities for improving visual hazard recognition on construction sites. Unlike traditional computer vision models that rely on domain-specific training and extensive datasets, modern LLMs can interpret and describe complex visual scenes using simple natural language prompts. However, despite growing interest in their applications, there has been limited investigation into how different LLMs perform in safety-critical visual tasks within the construction domain. To address this gap, this study conducts a comparative evaluation of five state-of-the-art LLMs: Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, and Gemini 2.0 Pro, to assess their ability to identify potential hazards from real-world construction images. Each model was tested under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). Zero-shot prompting involved minimal instruction, few-shot incorporated basic safety context and a hazard source mnemonic, and CoT provided step-by-step reasoning examples to scaffold model thinking. Quantitative analysis was performed using precision, recall, and F1-score metrics across all conditions. Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models. Additionally, LLM performance varied under different conditions, with GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also demonstrate the critical role of prompt design in enhancing the accuracy and consistency of multimodal LLMs for construction safety applications. This study offers actionable insights into the integration of prompt engineering and LLMs for practical hazard recognition, contributing to the development of more reliable AI-assisted safety systems.
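
For readers less familiar with the metrics, the sketch below shows how precision, recall, and F1 could be computed for a single image when a model's output and the annotation are both treated as sets of hazard labels. The set-based matching and the example labels are assumptions for illustration; the abstract does not describe how model responses were matched to ground truth.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 for one image, treating hazards as label sets."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels: the model names three hazards, two of which are annotated.
model_output = ["fall", "electrical", "struck-by"]
annotation = ["fall", "electrical", "chemical", "noise"]
print(precision_recall_f1(model_output, annotation))
```

Precision penalizes hazards the model reports that are not annotated, while recall penalizes annotated hazards the model misses; F1 balances the two.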

2505.16025 2026-05-12 cs.CV cs.MM eess.IV 版本更新

Context and Pixel Aware Large Language Model for Video Quality Assessment

面向视频质量评估的上下文与像素感知大型语言模型

Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang

发表机构 * City University of Hong Kong(香港城市大学) Google Inc.(谷歌公司)

AI总结 本文提出CP-LLM,通过双视觉编码器和语言解码器同时生成视频质量评分和可解释描述,提升对像素失真敏感度,实验显示其在视频质量评估基准上表现优异。

Comments Accepted to ICIP 2026

详情
AI中文摘要

视频质量评估(VQA)是一个具有广泛应用的挑战性研究课题。传统的手工设计和基于判别式学习的VQA模型主要关注像素级失真,缺乏上下文理解,而近期的多模态大型语言模型(MLLMs)要么对细微失真不够敏感,要么将质量评分与质量描述作为相互独立的任务处理。为了解决这些不足,我们引入CP-LLM:一个上下文与像素感知的大型语言模型。CP-LLM是一种新颖的多模态LLM架构,其双视觉编码器分别在高层(视频上下文)和低层(像素失真)粒度上独立分析感知质量,语言解码器随后对这些方面的相互作用进行推理。这种设计使CP-LLM能够同时生成稳健的质量评分和可解释的质量描述,并对像素失真(例如压缩伪影)更为敏感。实验结果表明,CP-LLM在VQA基准上实现了最先进的跨数据集性能,并且对像素失真具有更强的鲁棒性。

英文摘要

Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on VQA benchmarks and superior robustness to pixel distortions.

2505.15879 2026-05-12 cs.CV cs.AI cs.CL 版本更新

GRIT: Teaching MLLMs to Think with Images

GRIT: 教会多模态大语言模型用图像思考

Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang

发表机构 * UC Santa Cruz(加州大学圣克鲁兹分校) UC Santa Barbara(加州大学圣芭芭拉分校) eBay

AI总结 GRIT提出一种结合图像和文本的接地推理(grounded reasoning)方法,通过强化学习实现数据高效的训练,使多模态大语言模型生成基于视觉证据的推理链。

Journal ref NeurIPS 2025

详情
AI中文摘要

近期研究表明,使用强化学习(RL)构建在给出最终答案之前先阐述思维链的推理模型是有效的。然而,尽管面向视觉-语言任务的推理研究不断推进,现有开源视觉推理模型通常仅用纯自然语言生成推理内容,缺乏对视觉信息的显式整合,这限制了它们生成表述清晰且基于视觉证据的推理链的能力。为此,我们提出了Grounded Reasoning with Images and Texts(GRIT),一种训练多模态大语言模型(MLLMs)用图像思考的新方法。GRIT引入了一种接地推理(grounded reasoning)范式,模型生成的推理链交替包含自然语言和显式边界框坐标。这些坐标指向模型在推理过程中参考的输入图像区域。此外,GRIT配备了基于GRPO算法的强化学习方法GRPO-GR。GRPO-GR采用关注最终答案准确性和接地推理输出格式的鲁棒奖励,从而无需带有推理链标注或显式边界框标签的数据。因此,GRIT实现了卓越的数据效率,最少仅需现有数据集中的20个图像-问题-答案三元组。全面评估显示,GRIT能有效训练MLLMs生成连贯且基于视觉证据的推理链,成功统一了推理与接地能力。

英文摘要

Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thought prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
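
To make the idea of a reasoning chain that interleaves text with bounding boxes concrete, the sketch below parses boxes out of a chain and computes a simple format check. The `[x1, y1, x2, y2]` integer syntax, the validity rule, and the binary reward are all assumptions for illustration; the abstract does not specify GRIT's output syntax or the exact form of the GRPO-GR rewards.

```python
import re

# Assumed box syntax: four non-negative integers in square brackets.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(chain):
    """Return every (x1, y1, x2, y2) tuple that appears in the reasoning chain."""
    return [tuple(int(v) for v in match.groups()) for match in BOX_PATTERN.finditer(chain)]


def format_reward(chain):
    """1.0 if the chain has at least one box and every box is well-formed, else 0.0."""
    boxes = extract_boxes(chain)
    well_formed = [b for b in boxes if b[0] < b[2] and b[1] < b[3]]
    return 1.0 if boxes and len(well_formed) == len(boxes) else 0.0


chain = (
    "The mug sits left of the laptop [10, 20, 110, 220], "
    "and the laptop occupies [150, 30, 400, 260], so the mug is on the left."
)
print(extract_boxes(chain))
print(format_reward(chain))
```

A reward of this kind needs only the model's own output, which is consistent with the abstract's point that no box annotations are required for training.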

2503.09158 2026-05-12 cs.CV 版本更新

FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

FaVChat: 基于数据高效GRPO的层次提示-查询引导的面部视频理解

Fufangchen Zhao, Songbai Tan, Xuerui Qiu, Linrui Xun, Wenhao Jiang, Jinkai Zheng, Hehe Fan, Jian Gao, Danfeng Yan, Ming Li

发表机构 * State Key Laboratory of Networking and Switching Technology, BUPT(北京邮电大学网络与交换技术国家重点实验室) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室(深圳)) Institute of automation, Chinese Academy of Sciences(中国科学院自动化研究所) Zhongguancun Academy(中关村学院) Central South University(中南大学) Zhejiang University(浙江大学) College of communication Engineering, Hangzhou Dianzi University(杭州电子科技大学通信工程学院)

AI总结 FaVChat通过层次化提示引导框架提取面部动态细节信息,结合数据高效GRPO策略提升学习效率,实现在面部视频理解任务中的优越性能。

详情
AI中文摘要

现有视频大语言模型(VLLMs)主要依赖与提示无关的视觉编码器,提取的是无针对性的面部表示,无法感知所查询的信息,导致任务关键线索的丢失。为解决这一挑战,我们提出了FaVChat,首个针对细微视觉和动态面部线索推理设计的VLLM。FaVChat引入了层次化、提示引导的视觉特征提取框架,在三个互补层次上强调问题相关的信息。这些多级特征动态融合并注入LLM,以实现更精确的面部细节推理。为进一步提高数据稀缺下的学习效率,我们提出数据高效GRPO,一种强化学习策略,通过迭代识别高效用样本和逐实例效用估计,最大化每个实例的贡献,显著提升有限监督下的性能。我们构建了一个大规模基准数据集FaVChat 170K,包含约6万个高质量面部视频和17万个聚焦于细粒度面部细节的问答对。大量实验,包括在四个面部理解任务上的零样本评估,证明FaVChat在性能上始终优于现有VLLMs。

英文摘要

Existing video large language models (VLLMs) primarily leverage prompt-agnostic visual encoders, which extract untargeted facial representations without awareness of the queried information, leading to the loss of task-critical cues. To address this challenge, we propose FaVChat, the first VLLM designed for reasoning over subtle visual and dynamic facial cues. FaVChat introduces a hierarchical, prompt-guided visual feature extraction framework that emphasizes question-relevant information at three complementary levels. These multi-level features are dynamically fused and injected into the LLM, enabling more accurate reasoning about facial details. To further improve learning efficiency under data scarcity, we propose Data-Efficient GRPO, a reinforcement learning strategy that iteratively identifies high-utility samples and maximizes the contribution of each instance via per-instance utility estimation, substantially enhancing performance gains under limited supervision. We construct a large-scale benchmark dataset, FaVChat 170K, comprising approximately 60K high-quality facial videos and 170K question-answer pairs focusing on fine-grained facial details. Extensive experiments, including zero-shot evaluations on four facial understanding tasks, demonstrate that FaVChat consistently outperforms existing VLLMs.

2501.03544 2026-05-12 cs.CV cs.AI cs.CR 版本更新

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

PromptGuard: 基于软提示的文本到图像模型不安全内容审查

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Bo Li

发表机构 * Department of Computer Science, University of Maryland(计算机科学系,马里兰大学) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Kahlert School of Computing, The University of Utah(犹他大学Kahlert计算学院) Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校Siebel计算与数据科学学院)

AI总结 PromptGuard通过引入软提示机制,有效抑制文本到图像模型生成不安全内容,同时保持生成质量,且比现有方法快3.8倍。

Comments Accepted for publication in IEEE Transactions on Information Forensics and Security (TIFS)

详情
AI中文摘要

最近的文本到图像(T2I)模型在根据文本描述生成高质量图像方面表现出色,但容易被滥用,特别是生成不适宜工作场合(NSFW)的内容,如色情、暴力、政治和令人不安的图像,引发严重伦理问题。本文提出PromptGuard,一种新的内容审查技术,借鉴大语言模型(LLMs)中的系统提示机制以实现安全对齐。与LLMs不同,T2I模型缺乏直接执行行为规范的接口。我们的核心思想是优化一个安全软提示,作为T2I模型文本嵌入空间中的隐式系统提示。这个通用软提示(P*)直接调节NSFW输入,在不影响推理效率、也无需代理模型的情况下实现安全且逼真的图像生成。我们进一步通过分而治之的策略提升其可靠性和有用性:优化类别特定的软提示,并将其组合成统一的安全指导。在五个数据集上的广泛实验表明,PromptGuard有效抑制NSFW内容生成,同时保持高质量的良性输出。PromptGuard比现有内容审查方法快3.8倍,并优于八种最先进的防御措施。使用多头安全分类器和基于VLM的护栏进行的评估进一步验证了其鲁棒性,平均不安全比率分别为5.84%和6.18%。我们的代码和数据集可在https://t2i-promptguard.github.io/上获得。

英文摘要

Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without affecting inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy that optimizes category-specific soft prompts and combines them into unified safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods while outperforming eight state-of-the-art defenses. Evaluations using both a multi-head safety classifier and a VLM-based guardrail further confirm its robustness, with average unsafe ratios of 5.84% and 6.18%, respectively. Our code and dataset are available at https://t2i-promptguard.github.io/.

2412.13547 2026-05-12 cs.CV 版本更新

Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields

Turbo-GS: 加速3D高斯拟合以实现高质量辐射场

Ankit Dhiman, Tao Lu, R Srinath, Emre Arslan, Angela Xing, Yuanbo Xiangli, R Venkatesh Babu, Srinath Sridhar

发表机构 * Indian Institute of Science(印度科学研究院) Brown University(布朗大学) Cornell University(康奈尔大学) Samsung R&D Institute India - Bangalore(三星印度研发中心-班加罗尔)

AI总结 本文提出Turbo-GS方法,通过减少计算开销和提升学习效率加速3D高斯拟合,实现4K分辨率快速拟合并保持高质量视图渲染。

Comments Accepted to CVPR 2026. Project page: https://ivl.cs.brown.edu/research/turbo-gs

详情
AI中文摘要

新颖视角合成在计算机视觉中扮演关键角色,应用于3D重建、混合现实和机器人领域。最近的3D高斯泼溅(3DGS)等方法成为最先进的解决方案,提供实时高质量新颖视图合成。然而,训练3DGS模型仍较慢,特别是对于高分辨率图像,往往需要数小时才能拟合一个包含200个视图的场景。在本文中,我们旨在通过减少计算开销和提升学习效率来加速拟合过程。具体来说,我们引入了一种扩张渲染技术,仅渲染像素子集而非完整图像,显著降低计算成本。为提升学习效率,我们开发了一种收敛感知的预算控制机制,平衡新高斯的添加与现有高斯的优化。此外,为提高密集化效率并防止梯度消失,我们结合位置和外观误差以提高密集化的有效性。通过这些改进,我们实现了快速4K分辨率拟合,同时保持或甚至提升新颖视图渲染质量。大量实验表明,我们的方法在优化速度上显著优于现有方法,同时保持高渲染保真度。

英文摘要

Novel-view synthesis plays a crucial role in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent approaches, such as 3D Gaussian Splatting (3DGS), have emerged as state-of-the-art solutions, offering high-quality novel view synthesis in real time. However, training 3DGS models remains slow, particularly for high-resolution images, often requiring hours to fit a scene with 200 views. In this work, we aim to accelerate the fitting process by reducing computational overhead and improving learning efficiency. Specifically, we introduce a dilated rendering technique that renders only a subset of pixels instead of the full image, significantly reducing computational costs. To enhance learning efficiency, we develop a convergence-aware budget control mechanism that balances the addition of new Gaussians with the optimization of existing ones. Additionally, to improve densification efficiency and prevent gradient vanishing, we incorporate both positional and appearance errors to improve the effectiveness of densification. With these improvements, we achieve fast 4K-resolution fitting while maintaining, or even improving, novel view rendering quality. Extensive experiments demonstrate that our method achieves significantly faster optimization than existing approaches while preserving high rendering fidelity.
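
The dilated rendering idea above (render only a subset of pixels per step) can be illustrated with a small sketch. The abstract does not specify the sampling pattern, so the regular strided grid with a rotating offset used here is an assumption, and `dilated_pixel_subset` is a hypothetical helper name rather than anything from the Turbo-GS code.

```python
def dilated_pixel_subset(height, width, stride, iteration):
    """Return the (row, col) pixels to render at this iteration.

    Assumption (not from the paper): a regular grid with spacing `stride`
    whose offset cycles with the iteration index, so every pixel is
    visited exactly once per stride**2 consecutive iterations.
    """
    phase = iteration % (stride * stride)
    row_offset, col_offset = divmod(phase, stride)
    return [
        (r, c)
        for r in range(row_offset, height, stride)
        for c in range(col_offset, width, stride)
    ]


# A 4x4 image with stride 2 renders 4 of its 16 pixels per iteration.
subset = dilated_pixel_subset(4, 4, stride=2, iteration=0)
print(subset)
```

Under this assumed pattern the per-iteration rendering cost drops by roughly a factor of `stride**2`, while the rotating offset still lets every pixel contribute to the loss over time.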

2412.10433 2026-05-12 cs.CV cs.LG eess.SP 版本更新

Implicit Neural Compression of Point Clouds

隐式神经网络点云压缩

Hongning Ruan, Yulin Shao, Qianqian Yang, Liang Zhao, Zhaoyang Zhang, Dusit Niyato

发表机构 * College of Information Science and Electronic Engineering, Zhejiang University(浙江大学信息科学与电子工程学院) Department of Electrical and Electronic Engineering, The University of Hong Kong(香港大学电子与电气工程系) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院)

AI总结 本文提出NeRC³框架,利用隐式神经表示实现点云的高效压缩,通过坐标基神经网络编码几何与属性,并扩展至动态点云压缩,实验验证其在静态和动态点云压缩中的优越性能。

Journal ref IEEE Transactions on Image Processing, vol. 35, pp. 260-275, 2026

详情
AI中文摘要

点云因其能准确表示3D物体和场景而在众多应用中崭露头角。然而,高效压缩无结构、高精度点云数据仍是一个重大挑战。本文提出NeRC³,一种新颖的点云压缩框架,利用隐式神经表示(INRs)来编码密集点云的几何和属性。我们的方法采用两个坐标基神经网络:一个将空间坐标映射到体素占用,另一个将占用的体素映射到其属性,从而隐式表示体素化的点云几何和属性。编码器量化并压缩网络参数及重建所需的辅助信息,而解码器通过将体素坐标输入神经网络来重建原始点云。此外,我们通过减少时间冗余的技术将方法扩展到动态点云压缩,包括一种称为4D-NeRC³的4维时空表示。实验结果验证了我们的方法有效性:对于静态点云,NeRC³优于基于八叉树的G-PCC标准和现有INR方法。对于动态点云,4D-NeRC³在几何压缩性能上优于最新G-PCC和V-PCC标准,同时与最先进学习方法相当。它在几何和属性联合压缩中也表现出竞争力。

英文摘要

Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC$^3$. Experimental results validate the effectiveness of our approach: for static point clouds, NeRC$^3$ outperforms the octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.
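
The decoding step described above (query the occupancy network at voxel coordinates, then query the attribute network at occupied voxels) can be sketched as follows. This is a toy illustration only: the two "networks" are plain Python functions standing in for trained models, and the function names, the 0.5 threshold and the exhaustive grid scan are all assumptions, not details from the NeRC$^3$ paper.

```python
from itertools import product


def decode_point_cloud(occupancy_fn, attribute_fn, grid_size, threshold=0.5):
    """Rebuild a voxelized point cloud from two coordinate-based functions.

    occupancy_fn(voxel) -> occupancy score in [0, 1]   (stand-in for network 1)
    attribute_fn(voxel) -> attribute tuple, e.g. RGB   (stand-in for network 2)
    """
    points = []
    for voxel in product(range(grid_size), repeat=3):
        if occupancy_fn(voxel) > threshold:
            # Attributes are only queried for voxels predicted as occupied.
            points.append((voxel, attribute_fn(voxel)))
    return points


# Toy stand-ins for the trained networks (hypothetical).
def toy_occupancy(voxel):
    return 1.0 if sum(voxel) <= 1 else 0.0


def toy_attribute(voxel):
    x, y, z = voxel
    return (x * 255, y * 255, z * 255)


cloud = decode_point_cloud(toy_occupancy, toy_attribute, grid_size=2)
print(len(cloud), cloud[0])
```

In this framing the bitstream only needs to carry the (quantized) parameters of the two functions plus auxiliary information such as the grid size, which is the sense in which the representation is implicit.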

2411.08443 2026-05-12 cs.LG cs.CV 版本更新

Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA

通过LoRA实现预训练模型的残差特征对齐机器去学习

Laiqiao Qin, Tianqing Zhu, Linlin Wang, Wanlei Zhou

发表机构 * City University of Macau(澳门城市大学)

AI总结 本文提出一种高效的预训练模型去学习方法,通过LoRA分解中间特征并调整残差特征,实现去学习与保留目标的对齐,实验验证了方法的有效性。

Comments v2: corrected a sign typo in Algorithm 1 line 13

Journal ref IEEE Transactions on Dependable and Secure Computing, 2026

详情
AI中文摘要

机器去学习是一种新兴技术,旨在从已训练模型中移除一部分训练数据,而不显著影响模型在剩余数据上的性能。该技术在保护用户隐私和消除有害或过时数据方面变得越来越重要。关键挑战在于有效且高效地去学习特定信息而不影响模型在保留数据上的实用性。对于预训练模型,微调是实现去学习目标的重要方法。以往的工作通常微调整个模型的参数,这带来了显著的计算成本。此外,微调过程可能导致中间层特征的偏移,影响模型的整体实用性。在本文中,我们提出了一种新颖且高效的预训练模型去学习方法。我们称之为残差特征对齐去学习。具体而言,我们利用LoRA(低秩适应)将模型的中间特征分解为预训练特征和残差特征。通过调整残差特征,我们对齐去学习模型与预训练模型在中间特征层面,以实现去学习和保留目标。该方法旨在学习保留集上的零残差和去学习集上的偏移残差。在多个数据集上的广泛实验验证了我们方法的有效性。

英文摘要

Machine unlearning is an emerging technology that removes a subset of the training data from a trained model without significantly affecting the model performance on the remaining data. This topic is becoming increasingly important in protecting user privacy and eliminating harmful or outdated data. The key challenge lies in effectively and efficiently unlearning specific information without compromising the model's utility on the retained data. For pre-trained models, fine-tuning is an important way to achieve the unlearning target. Previous work typically fine-tuned the entire model's parameters, which incurred significant computational costs. In addition, the fine-tuning process may cause shifts in the intermediate layer features, affecting the model's overall utility. In this work, we propose a novel and efficient machine unlearning method for pre-trained models. We term the method Residual Feature Alignment Unlearning. Specifically, we leverage LoRA (Low-Rank Adaptation) to decompose the model's intermediate features into pre-trained features and residual features. By adjusting the residual features, we align the unlearned model with the pre-trained model at the intermediate feature level to achieve both unlearning and remaining targets. The method aims to learn zero residuals on the retained set and shifted residuals on the unlearning set. Extensive experiments on numerous datasets validate the effectiveness of our approach.
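
The feature decomposition mentioned above can be made concrete with the standard LoRA formulation, in which a frozen weight $W_0$ is augmented by a low-rank update $BA$, so that the layer output splits into a pre-trained part $W_0 x$ and a residual part $BAx$. The sketch below only illustrates that split on tiny hand-written matrices; the paper's actual loss and training procedure are not given in the abstract, and all names here are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_features(w0, a, b, x):
    """Split a LoRA layer output into pre-trained and residual features.

    w0: frozen pre-trained weight (d_out x d_in)
    a:  LoRA down-projection (r x d_in)
    b:  LoRA up-projection (d_out x r)
    """
    pretrained = matvec(w0, x)          # W0 x
    residual = matvec(b, matvec(a, x))  # B (A x)
    total = [p + r for p, r in zip(pretrained, residual)]
    return pretrained, residual, total


w0 = [[1.0, 2.0], [3.0, 4.0]]
a = [[1.0, 1.0]]            # rank r = 1
x = [1.0, 2.0]

# Zero residual: the adapted layer reproduces the pre-trained feature.
print(lora_features(w0, a, [[0.0], [0.0]], x))
# Non-zero residual: the feature is shifted away from the pre-trained one.
print(lora_features(w0, a, [[0.5], [-1.0]], x))
```

Read against the abstract, the first call corresponds to the behaviour targeted on retained samples (zero residual) and the second to the behaviour targeted on samples to be unlearned (shifted residual).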

2407.12173 2026-05-12 cs.CV cs.AI 版本更新

Beta Sampling is All You Need: Efficient Image Generation Strategy for Diffusion Models using Stepwise Spectral Analysis

贝塔采样是全部所需:利用分步频谱分析的扩散模型高效图像生成策略

Haeil Lee, Hansang Lee, Seoyeon Gye, Junmo Kim

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电子工程学院)

AI总结 本文提出基于频谱分析的贝塔采样方法,优化扩散模型去噪过程,通过重点投入关键步骤提升生成效率与质量,实验显示其在FID和IS评分上优于传统均匀采样。

Comments 8 pages, 9 figures, WACV 2025

Journal ref Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 4215-4224, 2025

详情
AI中文摘要

生成扩散模型已成为高质量图像合成的强大工具,但其迭代性质需要大量计算资源。本文提出基于扩散过程图像频谱分析的高效时间步采样方法,旨在优化去噪过程。我们引入了一种类似于贝塔分布的采样技术,优先考虑过程早期和晚期的关键步骤。我们的假设是某些步骤在图像内容上有显著变化,而其他步骤则贡献较小。通过傅里叶变换测量每一步的频率响应变化,发现早期有显著的低频变化,后期有高频调整。ADM和Stable Diffusion的实验表明,我们的贝塔采样方法在FID和IS评分上优于均匀采样,并在效率上与AutoDiffusion等先进方法具有竞争力。本文提供了一个实用框架,通过聚焦最有效的步骤来提高扩散模型的效率,具有进一步优化和广泛应用的潜力。

英文摘要

Generative diffusion models have emerged as a powerful tool for high-quality image synthesis, yet their iterative nature demands significant computational resources. This paper proposes an efficient time step sampling method based on an image spectral analysis of the diffusion process, aimed at optimizing the denoising process. Instead of the traditional uniform distribution-based time step sampling, we introduce a Beta distribution-like sampling technique that prioritizes critical steps in the early and late stages of the process. Our hypothesis is that certain steps exhibit significant changes in image content, while others contribute minimally. We validated our approach using Fourier transforms to measure frequency response changes at each step, revealing substantial low-frequency changes early on and high-frequency adjustments later. Experiments with ADM and Stable Diffusion demonstrated that our Beta Sampling method consistently outperforms uniform sampling, achieving better FID and IS scores, and offers competitive efficiency relative to state-of-the-art methods like AutoDiffusion. This work provides a practical framework for enhancing diffusion model efficiency by focusing computational resources on the most impactful steps, with potential for further optimization and broader application.
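
To illustrate what a Beta-distribution-like timestep schedule looks like, the sketch below places timesteps at evenly spaced quantiles of a Beta(1/2, 1/2) distribution, whose inverse CDF has the closed form $x = \sin^2(\pi u / 2)$. The choice of Beta(1/2, 1/2), the 1000-step range and the function name are assumptions made for illustration; the abstract does not state the parameters the authors use.

```python
import math


def beta_like_timesteps(num_steps, total_steps=1000):
    """Pick timesteps from [0, total_steps - 1], denser near both ends.

    Uses evenly spaced quantiles of Beta(1/2, 1/2) (an assumed, U-shaped
    choice) instead of evenly spaced timesteps.
    """
    if num_steps < 2:
        raise ValueError("need at least two steps")
    steps = []
    for i in range(num_steps):
        u = i / (num_steps - 1)             # quantile level in [0, 1]
        x = math.sin(math.pi * u / 2) ** 2  # inverse CDF of Beta(1/2, 1/2)
        steps.append(round(x * (total_steps - 1)))
    # Denoising runs from high noise to low noise; drop any duplicates.
    return sorted(set(steps), reverse=True)


steps = beta_like_timesteps(10)
gaps = [hi - lo for hi, lo in zip(steps, steps[1:])]
print(steps)
print(gaps)
```

Compared with uniform spacing, the gaps between consecutive timesteps are smallest at the start and end of the trajectory, which matches the abstract's emphasis on early (low-frequency) and late (high-frequency) stages.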

2605.08440 2026-05-12 cs.LG cs.CV 版本更新

TARO: Temporal Adversarial Rectification Optimization Using Diffusion Models as Purifiers

TARO:利用扩散模型作为净化器的时序对抗修正优化

Daniel Wesego, Pedram Rooshenas

发表机构 * Department of Computer Science(计算机科学系) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 TARO通过构建时序引导的分数先验,结合多视角去噪,实现对抗样本的鲁棒修正与语义保留,提升零样本下的鲁棒性。

详情
AI中文摘要

对抗净化利用扩散模型将对抗样本投影回数据流形,但在语义保留与对自适应攻击的鲁棒性之间取得平衡仍具挑战。近期研究表明标准扩散净化在自适应评估下可能失效,而测试时基于分数的优化更具鲁棒性。然而,现有优化防御通常依赖单一扩散噪声水平,或对各时间步一视同仁,忽视了粗、细去噪尺度的不同作用。我们提出时序对抗修正优化(TARO),一种推理时净化方法,从扩散轨迹上的多个去噪视角构建时序引导的分数先验。TARO形成由粗到细的残差目标:高噪声专家提供全局平滑的结构并降低对抗敏感性,低噪声专家恢复图像特定、类别相关的细节。引导强度控制这一时序修正,使TARO在鲁棒的全局修正与语义保留之间取得平衡。实证表明,TARO在零样本设置下提升了跨数据集和自适应威胁模型的鲁棒准确率,同时兼容互补的对抗似然目标以进一步提升鲁棒性。

英文摘要

Adversarial purification with diffusion models seeks to project adversarial examples back toward the data manifold, but balancing semantic preservation and robustness against adaptive attacks remains challenging. Recent work shows that standard diffusion purification can fail under adaptive evaluation, while test-time score-based optimization is more resilient. Existing optimization defenses, however, typically rely on a single diffusion noise regime or treat timesteps uniformly, overlooking the distinct roles of coarse and fine denoising scales. We propose Temporal Adversarial Rectification Optimization (TARO), an inference-time purification method that builds a temporally guided score prior from multiple denoising views along the diffusion trajectory. TARO forms a coarse-to-fine residual target: high-noise experts provide globally smoothed structure with reduced adversarial sensitivity, while low-noise experts restore image-specific, class-relevant details. A guidance strength controls this temporal correction, allowing TARO to balance robust global rectification with semantic preservation. Empirically, TARO improves robust accuracy across datasets and adaptive threat models in a zero-shot setting, while remaining compatible with complementary adversarial-likelihood objectives for further robustness gains.

2605.08421 2026-05-12 cs.CV 版本更新

Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

超越图像块:通过文本监督学习全局布局用于晚期交互视觉文档检索

Pascal Tilli, Mohsen Mesgar

发表机构 * University of Stuttgart(斯图加特大学) Bosch Center for Artificial Intelligence(博世人工智能中心)

AI总结 本文提出通过文本监督学习文档全局布局,改进视觉文档检索的晚期交互架构,提升检索性能。

详情
AI中文摘要

视觉文档检索(VDR)模型大多依赖晚期交互架构,其中文档通过一组局部块嵌入表示,然后与查询token匹配。尽管高效,这种架构在估计文档与查询的相关性时优先考虑局部相似性,而非文档的全局布局结构。在实践中,这会导致错误,因为对于包含图、表和文本的异构布局文档,相关性来源于其布局结构。我们在不改变推理过程的前提下使文档布局可学习:提出一种多模态编码器,用全局布局嵌入增强局部块表示,该嵌入通过编码文档布局信息的文本描述进行训练。在四个ViDoRe-v2数据集上,我们的模型相较架构可比的最强ColPali/ColQwen基线,nDCG@5提升2.4、MAP@5提升2.3,且相对ColQwen在各数据集上的提升均具有统计显著性。

英文摘要

Visual Document Retrieval (VDR) models mostly rely on late-interaction architectures, in which documents are represented by a set of local patch embeddings and then matched against query tokens. While efficient, this architecture prioritizes local similarity over the global layout structure of documents when estimating the relevance between a document and a query. In practice, this leads to errors, because for documents with heterogeneous layouts combining figures, tables, and text, relevance originates from the layout structure. We make document layout learnable without changing inference. We propose a multimodal encoder that augments local patch representations with a global layout embedding, trained via textual descriptions encoding document layout information. Across four ViDoRe-v2 datasets, our model improves over the strongest architecturally comparable ColPali/ColQwen baseline by +2.4 nDCG@5 and +2.3 MAP@5, with statistically significant per-dataset gains over ColQwen.
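
Late-interaction scoring of the kind described above is commonly computed as a MaxSim sum: for each query token, take the best-matching document patch and add up those maxima. The sketch below assumes dot-product similarity and toy 2-D embeddings; `maxsim_score` is a hypothetical name, and the code is not taken from the paper or from ColPali/ColQwen.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_patches):
    """Late-interaction relevance: sum over query tokens of the best patch match."""
    return sum(max(dot(q, p) for p in doc_patches) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.5, 0.5]]

score = maxsim_score(query, patches)
shuffled_score = maxsim_score(query, list(reversed(patches)))
print(score, shuffled_score)
```

Because each query token only looks for its single best patch, reordering the patches leaves the score unchanged. That order-invariance is the "bag-of-patches" behaviour the title refers to, and it is why layout information has to be injected separately.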

2605.08412 2026-05-12 cs.CV 版本更新

SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

SYNCR: 一个具有合成接地的跨视频推理基准

Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami

发表机构 * New York University(纽约大学)

AI总结 SYNCR通过合成数据评估多视频推理能力,发现现有模型在物理空间推理上表现不足,参数扩展和训练后优化能提升时间对齐能力,但无法可靠解决细粒度物理跟踪和全局空间合成问题。

详情
AI中文摘要

多模态大语言模型(MLLMs)在单视频理解上取得快速进展,但其跨多个独立视频流推理的能力仍不明确。现有多视频基准主要依赖人工标注的真实视频,限制了空间、时间和物理真实性的精度,难以诊断模型失败。我们引入SYNCR,一个受控的合成基准,用于跨视频推理,具有程序验证的接地。使用Habitat、Kubric和CLEVRER模拟器引擎构建,SYNCR包含8,163个多视频问题-答案对,基于9,650个唯一视频。它评估MLLMs在八个任务上的表现,涵盖四个诊断支柱:时间对齐、空间跟踪、比较推理和整体合成。我们对领先的开放权重和封闭权重MLLMs进行零样本评估,发现当前模型与人类之间存在显著差距:最佳模型仅达到52.5%的平均准确率,而人类基线为89.5%。模型在时间排序上表现较好,但在精确物理和空间推理上挣扎,最佳模型在运动学比较上仅达到26.0%的准确率。我们进一步发现,参数扩展和推理专用的训练后优化能提升时间对齐能力,但无法可靠解决细粒度物理跟踪或全局空间合成问题。最后,探索性sim-to-real相关性分析表明,若干SYNCR任务能够跟踪现实多视频基准中模型层面的趋势,同时暴露了现有评估未充分涵盖的推理能力。代码可在https://github.com/SaraGhazanfari/SYNCR获取。

英文摘要

Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Built using Habitat, Kubric, and CLEVRER simulator engines, SYNCR contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs across eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Our zero-shot evaluation of leading open- and closed-weight MLLMs reveals a substantial gap between current models and humans: the best model achieves only 52.5% average accuracy, compared to an 89.5% human baseline. Models perform relatively well on temporal ordering but struggle with precise physical and spatial reasoning, with the best model reaching only 26.0% accuracy on Kinematic Comparison. We further find that parameter scaling and reasoning-specialized post-training improve temporal alignment capabilities, but do not reliably address fine-grained physical tracking or global spatial synthesis. Finally, an exploratory sim-to-real correlation analysis suggests that several SYNCR tasks track model-level trends on real-world multi-video benchmarks, while also exposing reasoning capabilities underrepresented by existing evaluations. Code available at https://github.com/SaraGhazanfari/SYNCR.

2605.08404 2026-05-12 cs.CL cs.AI cs.CV cs.ET 版本更新

Built Environment Reasoning from Remote Sensing Imagery Using Large Vision-Language Models

利用大视觉-语言模型从遥感图像中推导建成环境

Dongdong Wang, Deepak Balakrishnan, Ravi Srinivasan, Shenhao Wang

发表机构 * Department of Urban and Regional Planning, University of Florida(佛罗里达大学城市与区域规划系)

AI总结 本文研究了大语言模型在智慧城市中的应用,通过多尺度遥感图像输入多模态语言模型,评估其对建成环境推理的影响,并比较了InternVL和Qwen等先进模型在生成建成环境建议中的准确性和可靠性。

Comments Published in the International Conference on Industrialized Construction 2026

详情
AI中文摘要

本文研究了大语言模型(LLMs)在智慧城市中的应用。核心思想是利用遥感图像来表征建成环境,包括设计建议、施工可行性评估、土地使用模式和风险识别。我们研究了多尺度的遥感图像作为多模态语言模型的输入,并评估其对建成环境相关推理的影响。此外,我们比较了最先进的LLMs,包括InternVL和Qwen,在生成建成环境建议时的准确性和可靠性。结果表明,将遥感图像与大语言模型结合,有助于协助智慧城市和决策制定。

英文摘要

This work investigates the use of large language models (LLMs) for tasks in smart cities. The core idea is to leverage remote sensing imagery to characterize the built environment, including design suggestions, constructability assessment, land-use patterns, and risk identification. We examine remote sensing imagery at multiple spatial scales as inputs for multimodal language modeling and evaluate their effects on built-environment-related reasoning. In addition, we compare state-of-the-art LLMs, including InternVL and Qwen, in terms of accuracy and reliability when generating built environment recommendations. The results demonstrate the potential of integrating remote sensing imagery with large language models to assist smart cities and decision-making.

2605.08396 2026-05-12 cs.CV 版本更新

Delivering Science as a Service: Sci-Orchestra's Cloud-Native Approach to HPC

作为服务交付科学:Sci-Orchestra的云原生方法用于HPC

Harinarayan Krishnan, Shubhabrata Mukerjee, Jeffrey Donatelli, Daniela Ushizima

发表机构 * Lawrence Berkeley National Laboratory(劳伦斯伯克利国家实验室) Applied Math & Computational Research Division(应用数学与计算研究部) Bakar Computational Health Sciences Institute(巴卡计算健康科学研究所) UC San Francisco(加州大学旧金山分校) Berkeley Institute for Data Science(伯克利数据科学研究所) UC Berkeley(加州大学伯克利分校)

AI总结 Sci-Orchestra通过云原生方法自动化HPC实验流程,提供安全的执行环境,促进跨机构合作,加速实验室原型到工业应用的转化。

详情
AI中文摘要

现代计算环境的复杂性常使研究人员面临基础设施管理、认证协议和容器部署的负担。我们提出了Sci-Orchestra,一种分层编排框架,旨在完全自动化实验流程,使科学家能专注于科学发现而非后台操作。通过API驱动的接口抽象执行,系统负责安全认证、资源管理和在多样化的高性能计算环境中使用Kubernetes架构进行可扩展部署。Sci-Orchestra的关键创新是其自主市场,促进跨机构合作。通过直观的用户界面,研究人员可通过简单选择快速部署和共享专用服务,无需复杂安装和技术设置。这种模块化基础设施专门设计用于促进产业合作,因为它提供安全的执行环境,允许外部合作者测试和验证专有工具而无需源代码交换。这种“黑盒”互操作性保护知识产权,同时使无缝集成到更广泛的科学流程中,最终加速从实验室原型到工业规模应用的过渡。

英文摘要

The increasing complexity of modern computational environments often burdens researchers with infrastructure management, authentication protocols, and container deployments. We present Sci-Orchestra, a layered orchestration framework designed to fully automate experimental workflows, allowing scientists to prioritize scientific discovery over backend operations. By abstracting execution through an API-driven interface, the system assumes responsibility for secure authentication, resource management, and scalable deployment across diverse high-performance computing environments using Kubernetes architectures. A key innovation of Sci-Orchestra is its autonomous marketplace, which serves as a catalyst for cross-institutional collaboration. Through an intuitive user interface, researchers can rapidly deploy and share specialized services via simple selections, eliminating the need for complex installations and technical setups. This modular infrastructure is specifically designed to facilitate industry partnerships as it provides a secure execution environment and allows external collaborators to test and validate proprietary tools without the need for source-code exchange. This "black-box" interoperability protects intellectual property while enabling seamless integration into broader scientific pipelines, ultimately accelerating the transition from laboratory prototypes to industrial-scale applications.

2605.08376 2026-05-12 cs.CV 版本更新

UIESNN: A Scale-Aware Spiking Network for Underwater Image Enhancement

UIESNN:一种具有尺度感知的脉冲网络用于水下图像增强

Shuang Chen, Ruochen Li, Zihan Zhu, Ronald Thenius, Farshad Arvin, Amir Atapour-Abarghouei

发表机构 * Institute of Biology, University of Graz, Graz, Austria(格拉茨大学生物研究所,格拉茨,奥地利) University of Cambridge(剑桥大学)

AI总结 本文提出UIESNN,一种具有尺度感知的脉冲网络,用于水下图像增强,通过多尺度池化LIF块和脉冲残差架构,提升颜色保真度和空间一致性。

详情
AI中文摘要

水下图像增强(UIE)是脉冲神经网络(SNNs)应用中一个实际重要但研究较少的领域,其中主导的退化包括大尺度和低频退化,如波长依赖性颜色偏差和散射引起的失真。现有SNN恢复设计依赖于局部有界脉冲感知,这会限制全局校正并导致饱和或不一致的表示。为了解决这些挑战,我们提出了一种名为UIESNN的具有尺度感知的SNN框架用于UIE。其核心是一个多尺度池化LIF块(MPLB),它将多尺度池化响应注入膜动力学,从而扩大有效感受野,同时保留细粒度细节并诱导异质尺度依赖激活。基于MPLB,我们设计了一种脉冲残差架构,该架构在完全脉冲驱动的管道中集成了频率分解和基于注意力的细化。在EUVP和LSUI基准上的大量实验表明,UIESNN在SNN方法中实现了最先进的性能,实现了改进的颜色保真度和空间一致性,同时具有竞争力的能量成本。

英文摘要

Underwater image enhancement (UIE) is a practically important yet underexplored application of spiking neural networks (SNNs), where the dominant degradations are large-scale and low-frequency, such as wavelength-dependent colour casts and scattering-induced veiling. Existing SNN restoration designs rely on locally bounded spiking perception, which can limit global correction and lead to saturated or inconsistent representations. To address these challenges, we propose a scale-aware SNN framework for UIE named UIESNN. At its core is a Multi-scale Pooling LIF Block (MPLB) that injects hierarchical multi-scale pooling responses into membrane dynamics, thereby enlarging the effective receptive field while preserving fine-grained details and inducing heterogeneous scale-dependent activations. Building on MPLB, we design a spiking residual architecture that integrates frequency decomposition and attention-based refinement in a fully spike-driven pipeline. Extensive experiments on the EUVP and LSUI benchmarks demonstrate that UIESNN achieves state-of-the-art performance among SNN-based methods, delivering improved colour fidelity and spatial coherence with competitive energy cost.

2605.08373 2026-05-12 cs.CV cs.AI 版本更新

NeuroGAN-3D: Enhancing Intrinsic Functional Brain Networks via High-Fidelity 3D Generative Super-Resolution

NeuroGAN-3D: 通过高保真3D生成超分辨率增强内在功能脑网络

M. Moein Esfahani, Sepehr Salem Ghahfarokhi, Mohammed Alser, Jingyu Liu, Vince Calhoun

发表机构 * Georgia State University(佐治亚州立大学) Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS)(神经影像与数据科学转化研究三机构中心(TReNDS)) Georgia Institute of Technology(佐治亚理工学院) Emory University(埃默里大学)

AI总结 本文提出NeuroGAN-3D,利用生成对抗网络提升rs-fMRI空间图的分辨率,以更精确地局部化功能单元和检测神经生物学变化。

Comments Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences

详情
AI中文摘要

近期神经影像学的进步加深了我们对大脑复杂功能和结构组织的理解。其中,功能性磁共振成像(fMRI)特别是静息态fMRI(rs-fMRI)已成为识别内在脑连接生物标志物和界定大规模神经网络的工具。这些网络通常表示为体积空间图,捕捉功能上连贯的脑区并反映个体差异在脑活动和结构中的表现。这些图的空间分辨率起着重要作用,因为它决定了局部化功能单元的精度、执行可靠的脑分割以及检测与发育、衰老或疾病相关的细微空间特异性神经生物学变化的能力。因此,提高神经影像学衍生图的有效分辨率对更深入理解大脑结构及其与行为和病理的关系具有重要意义。为此,我们提出NeuroGAN-3D,一种针对体积神经影像计算需求设计的新型3D生成超分辨率模型。我们的模型利用生成对抗网络架构来增强rs-fMRI空间图的分辨率,显著优于传统基线方法。

英文摘要

Recent advances in neuroimaging have deepened our understanding of the brain's complex functional and structural organization. Among these, functional Magnetic Resonance Imaging (fMRI), particularly resting-state fMRI (rs-fMRI), has emerged as a tool for identifying biomarkers of intrinsic brain connectivity and delineating large-scale neural networks. These networks are typically represented as volumetric spatial maps that capture functionally coherent brain regions and reflect individual differences in brain activity and structure. The spatial resolution of these maps plays an important role, as it determines the ability to localize functional units with precision, perform reliable brain parcellation, and detect subtle, spatially specific neurobiological alterations associated with development, aging, or disease. Therefore, improving the effective resolution of neuroimaging-derived maps holds significant promise for enabling more detailed insights into brain architecture and its relationship to behavior and pathology. To address this need, we propose NeuroGAN-3D, a novel 3D generative super-resolution model tailored to the computational demands of volumetric neuroimaging. Our model leverages a generative adversarial network architecture to enhance the spatial resolution of rs-fMRI spatial maps, significantly outperforming a conventional baseline.

2605.08371 2026-05-12 cs.CV 版本更新

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

PaceVGGT: 面向视觉几何变换器的交替注意力前令牌剪枝

Haotang Li, Zhenyu Qi, Shaohan Henry Wang, Kebin Peng, Zi Wang, Qing Guo, Sen He, Huanrui Yang

发表机构 * University of Arizona(亚利桑那大学) East Carolina University(东卡罗来纳大学) Augusta University(奥古斯塔大学) Nankai University(南开大学)

AI总结 PaceVGGT通过在冻结的VGGT首个交替注意力块前修剪DINO令牌,减少推理延迟,同时保持重建质量,在ScanNet-50和7-Scenes数据集上实现显著加速。

详情
AI中文摘要

视觉几何变换器(VGGT)是一种强大的前馈模型,适用于多种3D任务,但其交替注意力(AA)堆叠的计算量随总令牌数量呈二次增长,使长片段的处理代价高昂。现有令牌减少加速器在AA内部操作,导致进入AA的补丁网格未被压缩。我们引入PaceVGGT,一种AA前(pre-AA)令牌剪枝框架,在冻结的VGGT首个AA块之前剪除DINO补丁令牌。PaceVGGT训练了一个轻量级的令牌评分器,从DINO特征估计每个令牌的重要性。评分器首先以未剪枝主干的AA内部注意力为目标进行蒸馏,然后在下游相机、深度和点图损失下进行细化。每帧保留预算固定了主干可见的序列长度,而重要性自适应的合并/剪枝分配在固定的总合并预算下保留高显著性帧的残余内容。一个特征引导的恢复模块重建预测头所需的密集空间网格。在ScanNet-50和7-Scenes上,PaceVGGT在降低推理延迟的同时保持在重建质量-延迟前沿。在ScanNet-50上,它在$N=300$时相较未修改的VGGT将延迟降低$5.1\times$,在$N=1000$时相较LiteVGGT降低$1.47\times$。这些结果表明pre-AA剪枝是冻结VGGT式几何变换器的可行加速途径。

英文摘要

Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality-latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by $5.1\times$ over unmodified VGGT at $N=300$ and $1.47\times$ over LiteVGGT at $N=1000$. These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
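
The per-frame keep budget described above can be pictured as a top-k selection over token importance scores, applied independently to each frame. The sketch below is a simplified illustration under stated assumptions: the scores are given as plain numbers (in the paper they come from a learned Token Scorer), the merge path and the restoration module are omitted, and both function names are hypothetical.

```python
def prune_tokens_per_frame(scores, keep):
    """Keep the `keep` highest-scoring tokens in every frame.

    scores: list of frames, each a list of per-token importance scores.
    Returns, per frame, the indices of the kept tokens in original order.
    """
    kept = []
    for frame_scores in scores:
        ranked = sorted(
            range(len(frame_scores)),
            key=lambda i: frame_scores[i],
            reverse=True,
        )
        kept.append(sorted(ranked[:keep]))
    return kept


def attention_cost_ratio(tokens_before, tokens_after):
    """Relative cost of self-attention, assuming it scales with tokens squared."""
    return (tokens_after / tokens_before) ** 2


frame_scores = [[0.1, 0.9, 0.5, 0.3], [0.8, 0.2, 0.7, 0.6]]
kept = prune_tokens_per_frame(frame_scores, keep=2)
print(kept)
print(attention_cost_ratio(4, 2))
```

Fixing the same `keep` for every frame makes the sequence length seen by the backbone predictable, and under the quadratic-cost assumption halving the token count cuts the attention term to about a quarter. Measured end-to-end speedups, such as those reported in the abstract, also depend on everything outside attention.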

2605.08329 2026-05-12 cs.CV eess.IV 版本更新

An Efficient Token Compression Framework for Visual Object Tracking

一种高效的视觉目标跟踪token压缩框架

Weijing Wu, Qihua Liang, Bineng Zhong, Haiying Xia, Zhiyi Mo, Shuxiang Song

发表机构 * Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University(教育区块链与智能技术教育部重点实验室,广西师范大学) Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University(广西多源信息挖掘与安全重点实验室,广西师范大学) University Engineering Research Center of Educational Intelligent Technology, Guangxi Normal University(教育智能技术高校工程研究中心,广西师范大学) Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University(广西机器视觉与智能控制重点实验室,梧州学院)

AI总结 本文提出ETCTrack框架,通过压缩历史模板帧的token以提升跟踪性能和效率,实验表明在七个基准上优于现有方法,减少60%的token数量并降低21.4%的MACs,仅损失0.4%的精度。

Comments Accepted by CVPR 2026

详情
AI中文摘要

通过消除内部特征层面的冗余来优化视觉表示对于同时优化视觉跟踪模型的性能和计算成本至关重要。为提高性能,许多基于Transformer的跟踪器利用更多历史模板帧来捕捉更丰富的时空线索。然而,这种策略导致大量输入视觉token,从而产生两个关键问题:它导致二次计算负担,并可能降低跟踪器的整体性能。为此,我们提出了一种压缩-交互跟踪框架ETCTrack,该框架学习从历史模板帧中高效压缩模板token到稳健的目标表示,超越了手工规则。我们的方法首先使用自适应token压缩器动态构建紧凑且高度判别性的模板token,通过过滤冗余视觉token。这些精炼的模板token随后由分层交互编码器处理,以实现与搜索特征的深入适应性交互。精炼的搜索特征确保后续的精确目标定位。在七个基准上的实验表明,我们的方法优于当前最先进的跟踪器。ETCTrack-B224将模板token数量减少60%,导致MACs减少21.4%,仅损失0.4%的精度。源代码可在https://github.com/PJD-WJ/ETCTrack上获得。

英文摘要

Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. Refined search features ensure subsequent precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code is available at https://github.com/PJD-WJ/ETCTrack.

2605.08311 2026-05-12 cs.LG cs.CV 版本更新

Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning

重燃开端:避免存储依赖于持续学习中的模型合并

Xi Wang, Cheng Deng

发表机构 * School of Electronic Engineering, Xidian University, Xi'an, Shaanxi, China(西安电子科技大学电子工程学院)

AI总结 本文提出TRM框架,通过优化方法解决持续学习中模型合并的存储限制问题,提升模型稳定性和优化动态。

详情
AI中文摘要

模型合并为整合专业领域知识到统一多任务模型提供了有吸引力的范式,这与持续学习(CL)中的顺序知识获取自然一致。然而,保留多样化先前知识的需求与CL固有的存储限制冲突。本文系统分析了现有模型合并方法在CL约束下的表现。发现当前方法优先考虑全局对齐,导致连续数据流中任务特定误差的积累和放大;后续任务初期消失的梯度常导致优化停滞。这些使合并模型在下一训练阶段开始时处于亚优状态。为解决这些挑战,我们提出轨迹正则化合并(TRM),将合并过程重新表述为增强轨迹子空间内的优化过程。我们的框架整合了三个协同目标,包括任务对齐、预测一致性以及梯度响应性,以同时保持合并模型的历史稳定性并重新激活优化动态。广泛实验结果表明,我们的方法在多个基准上实现了最先进的性能。

英文摘要

Model merging provides a compelling paradigm for integrating specialized expertise into a unified multi-task model, a goal that aligns naturally with the sequential knowledge acquisition in continual learning (CL). However, the requirement for preserving diverse forms of previous knowledge conflicts with the storage limitations inherent to CL. In this paper, we systematically analyze existing model merging methods under the constraints of CL. We find that current methods prioritize global alignment, which often leads to the accumulation and amplification of task-specific errors within the continuous data stream. In addition, vanishing gradients at the onset of subsequent tasks frequently cause optimization to stagnate. Together, these effects leave the merged model in a suboptimal state at the beginning of the next training phase. To address these challenges, we propose Trajectory Regularized Merging (TRM), a framework that reformulates the merging phase as an optimization process within an augmented trajectory subspace. Our framework integrates three synergistic objectives, namely task alignment, prediction consistency, and gradient responsiveness, to concurrently preserve the merged model's historical stability and re-activate optimization dynamics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple benchmarks.

2605.08296 2026-05-12 cs.CV eess.SP 版本更新

BenchHAR: Benchmarking Self-Supervised Learning for Generalizable Sensor-based Activity Recognition

BenchHAR: 用于通用传感器活动识别的自监督学习基准测试

Yize Cai, Rui Feng, Anlan Yu, Baoshen Guo, Zhiqing Hong

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Peking University; Singapore-MIT Alliance for Research and Technology

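AI summary Targeting data heterogeneity and the scarcity of labeled data in sensor-based activity recognition, BenchHAR systematically compares the generalization performance of self-supervised learning methods. It finds that the hybrid paradigm and CNN encoders perform best, and that increasing the amount of pretraining data improves generalization.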

Comments 25 pages

详情
AI中文摘要

人体活动识别(HAR)从可穿戴传感器支持广泛医疗和行为科学应用。然而,数据异质性和标注数据稀缺限制了其现实中的泛化能力。最近在视觉和语言领域自监督学习(SSL)的进展显示了从无标签数据中学习通用表示的强大能力。然而,很少有研究系统比较SSL方法的泛化性能或探索如何适应通用HAR。为解决这些差距,我们提出了BenchHAR,一个统一框架,用于评估SSL方法在未见目标分布上的泛化能力。BenchHAR整理了一个大规模数据集(约258,000个样本)并评估了八个代表性的SSL方法在12种编码器-分类器架构上的表现。我们的结果表明,现有SSL方法难以达到满意的泛化性能。我们发现:(1)对于HAR模型,混合范式(结合重建和对比预训练)实现了整体最佳性能。CNN编码器表现出最强的学习通用表示能力,而更具表现力的分类器架构进一步提升了泛化能力。(2)对于数据规模,增加从下游活动类别预训练数据量的一致性提高了泛化能力,而增加更多标注数据收益有限。有趣的是,纳入非下游活动类别的无标签数据并未提升泛化能力。(3)来自定制设备的传感器数据比研究级设备的数据泛化更好,且肢体转移数据更有效地转移到躯干位置。BenchHAR为通用传感器基于HAR系统提供了统一的基准和可操作的见解。我们的代码可在https://github.com/saiketa/HAR-Bench获取。

英文摘要

Human Activity Recognition (HAR) from wearable sensors supports broad healthcare and behavior science applications. However, data heterogeneity and the scarcity of labeled data limit its real-world generalization. Recent advances in self-supervised learning (SSL) in vision and language domains have shown strong capability for learning generalizable representations from unlabeled data. Yet, few studies have systematically compared the generalization performance of SSL methods or explored how to adapt them for generalizable HAR. To address these gaps, we present BenchHAR, a unified framework for evaluating the generalization capability of SSL methods for sensor-based HAR on unseen target distributions. BenchHAR curates a large-scale dataset (~258K samples) and evaluates eight representative SSL methods across 12 encoder-classifier architectures. Our results reveal that existing SSL methods struggle to achieve satisfactory generalization performance. We find that: (1) For HAR models, the hybrid paradigm (combining reconstruction and contrastive pretraining) achieves the best overall performance. The CNN encoder exhibits the strongest ability to learn generalizable representations, while more expressive classifier architectures further improve generalization. (2) For data scale, increasing the amount of pretraining data from downstream activity classes consistently improves generalization, while adding more labeled data yields limited gains. Interestingly, incorporating unlabeled data from non-downstream activity classes does not improve generalization. (3) Sensor data collected from custom-grade devices generalizes better than that from research-grade devices, and data from limb transfers more effectively to trunk positions. BenchHAR provides a unified benchmark and actionable insights for generalizable sensor-based HAR systems. Our code is available at https://github.com/saiketa/HAR-Bench.

2605.08282 2026-05-12 eess.IV cs.AI cs.CV 版本更新

A Paired Point-of-Care Ultrasound Dataset for Image Quality Enhancement and Benchmarking via a cGAN Baseline

一种配对的点即护理超声数据集用于通过cGAN基线的图像质量增强和基准测试

Lennard M. van Karnenbeek, Hilde G. A. van der Pol, Mark Wijkhuizen, Eva Poelman, Caroline A. Drukker, Theo Ruers, Freija Geldof, Behdad Dashtbozorg

Affiliations * Department of Nanobiophysics, Faculty of Science and Technology, University of Twente; Image-Guided Surgery, Department of Surgery, Netherlands Cancer Institute; Netherlands Cancer Institute; Center for Early Cancer Detection, Netherlands Cancer Institute

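AI summary This paper presents a new paired dataset for enhancing point-of-care ultrasound image quality via a cGAN baseline, and demonstrates its diagnostic value in low-resource settings.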

详情
AI中文摘要

目的:我们旨在利用深度学习和一种新的配对数据集来提升点即护理超声(POCUS)设备的图像质量。方法:我们使用定制的自动化架系统收集了第一个准确配对的数据集,结合低端POCUS和高端超声图像。基于pix2pix架构,使用带有L1和结构相似性指数(SSIM)损失的U-Net生成器的条件生成对抗网络(cGAN)被利用。在模拟数据集上预训练进一步提升了性能。评估是在1064对体外组织和仿真实超声图像集上进行的。结果:我们的方法将SSIM从0.29提升到0.54,PSNR从19.16 dB提升到22.41 dB。无参考指标也表明了显著的提升,自然图像质量评估器(NIQE)和基于感知的图像质量评估器(PIQE)得分分别从7.95降至4.44和31.12降至19.99。结论:本文提出了第一个公开可用的低端POCUS到高端超声图像的准确配对数据集。此外,我们的结果展示了所提框架在克服手持POCUS硬件限制方面的潜力,从而在低资源和点即护理环境中提升其诊断价值。POCUS-IQ数据集可在https://github.com/NKI-MedTech-AI/POCUS-IQ上公开获取。

英文摘要

Purpose: We aim to enhance the image quality of point-of-care ultrasound (POCUS) devices using deep learning and a novel paired dataset of POCUS and high-end ultrasound images. Approach: Using a custom-built automated gantry system, we collected the first accurately paired dataset of low-end POCUS and high-end ultrasound images. We used a conditional generative adversarial network (cGAN) based on the pix2pix architecture, with a U-Net generator that incorporates both L1 and structural similarity index (SSIM) losses to improve perceptual quality. Pretraining on a simulation dataset further boosts performance. Evaluation was performed on 1064 paired ex vivo tissue and phantom ultrasound image sets. Results: Our approach improves the SSIM from 0.29 to 0.54 and PSNR from 19.16 dB to 22.41 dB. No-reference metrics also indicate substantial enhancement, with the Natural Image Quality Evaluator (NIQE) and Perception-based Image Quality Evaluator (PIQE) scores dropping from 7.95 to 4.44 and from 31.12 to 19.99, respectively. Conclusions: This work presents the first publicly available, accurately paired dataset of low-end POCUS and high-end ultrasound images. Additionally, our results demonstrate the potential of the proposed framework to overcome hardware limitations of handheld POCUS, enhancing its diagnostic value in low-resource and point-of-care settings. The POCUS-IQ Dataset is publicly available at https://github.com/NKI-MedTech-AI/POCUS-IQ.
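Editor's sketch: the PSNR values quoted above (19.16 dB to 22.41 dB, a gain of 3.25 dB) follow the standard definition PSNR = 10 log10(peak^2 / MSE). The snippet below is a minimal illustration of that formula, assuming 8-bit intensities (peak 255); the function name `psnr` and the toy pixel values are hypothetical and are not taken from the paper or its repository.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images
    given as nested lists of pixel intensities."""
    ref_flat = [p for row in reference for p in row]
    test_flat = [p for row in test for p in row]
    if len(ref_flat) != len(test_flat):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, test_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy 2x2 "images" (illustrative values only).
high_end = [[100, 120], [140, 160]]
enhanced = [[102, 118], [141, 157]]
toy_psnr = psnr(high_end, enhanced)

# Gain reported in the abstract, in dB.
reported_gain_db = 22.41 - 19.16
```

Because PSNR is logarithmic, a 3.25 dB gain corresponds to roughly halving the mean squared error (10^(3.25/10) is about 2.1).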

2605.08281 2026-05-12 cs.CV 版本更新

Is Class Signal Clustered or Routed in Task-Induced Implicit Neural Representation Weight Spaces?

隐式神经表示中的类别信号是聚类还是路由?

Xinyi Guo, Mingyi He, Haobin Ding, Weiming Chen, Xinrui Chen, Jiawen Li, Di Zhang, Minxi Ouyang, Yizhi Wang, Xitong Ling

Affiliations * South China Normal University; Beijing University of Chemical Technology; Tsinghua University; Xi'an Jiaotong University

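AI summary The study examines the geometric structure of class signal in implicit neural representations and finds that it is not simply clustered; classifiability instead arises because the signal is routed through the reader.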

详情
AI中文摘要

隐式神经表示(INR)将图像编码为神经网络权重,使图像分类成为权重空间可分类性的问题。一个自然的几何假设是,分类反馈应使图像特定的权重在共享锚坐标下按类别聚类。我们在此SIREN基础的Meta Weight Transformer(MWT)框架中测试这一假设,发现该预测失败。暴露的权重空间几何和监督聚类压力无法可靠追踪训练读者的准确性;聚类甚至可能使局部邻域更类一致,同时使训练读者更差。关键在于,读者构建而非继承类对齐的几何结构:令牌流诊断显示,类对齐的邻域只有在晚期读者交互后才强烈预测训练读者的准确性,而非输入坐标。我们进一步识别了增强权重令牌中的原生SIREN偏置列作为训练读者的低维、样本依赖的因果读出路由;针对性控制排除了通用标量列和边际分布伪影。该诊断促使干预措施,如强化读者路由、添加显式偏置路由或使用更密集的内环拟合;在本文使用的车道特定训练惯例下,路由导向的变体通常优于共享锚基线,但交互非加性。任务诱导的INR权重可分类并非因为形成原始几何聚类,而是因为其类别信号通过读者路由。

英文摘要

Implicit neural representations (INRs) encode images as neural-network weights, making image classification a problem of weight-space classifiability. A natural geometric hypothesis is that classifier feedback should make image-specific weights cluster by class in the shared-anchor coordinate. We test this hypothesis in the SIREN-based Meta Weight Transformer (MWT) regime, where end-to-end training meta-learns a shared initialization and inner-loop update schedule for fitting image-specific SIRENs. We find that this prediction fails. Exposed weight-space geometry and supervised clustering pressure do not reliably track trained-reader accuracy; clustering can even make local neighborhoods more class-consistent while making the trained reader worse. Crucially, the reader constructs rather than inherits class-aligned geometry: token-flow diagnostics show that class-aligned neighborhoods become strongly predictive of trained-reader accuracy only after late reader interactions, not in the input coordinate. We further identify the native SIREN bias column in the augmented weight token as a low-dimensional, sample-dependent causal readout route for the trained reader; targeted controls rule out generic scalar-column and marginal-distribution artifacts. The diagnosis motivates interventions that strengthen reader routing, add an explicit bias route, or use denser inner-loop fitting; under the lane-specific training conventions used here, route-directed variants often outperform the shared-anchor baseline but interact non-additively. Task-induced INR weights are classifiable not because they form raw geometric clusters, but because their class signal is routed through the reader.

2605.08276 2026-05-12 cs.CV 版本更新

Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction

超越ViT标记:面向细胞级密集预测的掩码扩散预训练卷积病理基础模型

Weiming Chen, Xitong Ling, Zhenyang Cai, Xidong Wang, Jiawen Li, Tian Guan, Benyou Wang, Yonghong He

Affiliations * Tsinghua Shenzhen International Graduate School, Tsinghua University; Research Institute of Tsinghua, Pearl River Delta; The Chinese University of Hong Kong, Shenzhen

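AI summary This paper proposes the ConvNeXt Masked-Diffusion model, which uses a convolutional generative pretraining framework to improve cell-level dense prediction on pathology images. Experiments show that it performs particularly well under limited annotation, outperforming existing ViT models and end-to-end segmentation methods.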

详情
AI中文摘要

细胞级密集预测是计算病理学的核心,但受限于细粒度组织结构、强域偏移和昂贵的密集标注。现有基于ViT的病理基础模型依赖于补丁标记化,可能破坏空间连续性并削弱局部形态细节。为此,我们提出掩码扩散卷积基础模型,即ConvNeXt Masked-Diffusion(CMD),一种自监督卷积生成预训练框架。CMD采用全卷积ConvNeXt-UNet主干网络,在像素空间进行掩码扩散预训练,并通过自适应归一化融合冻结的病理基础模型特征。实验结果表明,CMD在多个病理密集预测任务中优于现有ViT模型,并在微调少量任务特定参数时超越了最先进的端到端分割方法。在标注有限的情况下,CMD表现出更强的鲁棒性和泛化能力。我们的发现表明,纯卷积架构也可作为竞争性的病理基础模型,实现当前ViT主导范式中的领先性能,并提供一个可扩展、高性能的解决方案,更好地保留组织结构先验知识以实现细粒度病理理解。

英文摘要

Cell-level dense prediction is central to computational pathology, but remains challenging due to fine-grained histological structures, strong domain shifts, and costly dense annotations. Existing ViT-based pathology foundation models rely on patch tokenization, which can disrupt spatial continuity and weaken local morphological details needed for cell-level prediction. To address this, we propose Masked-Diffusion Convolutional Foundation Models, termed ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework for dense pathology representation learning. CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. Experimental results demonstrate that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters across multiple pathology dense prediction tasks. The advantage is particularly pronounced under limited annotation settings, where CMD exhibits stronger robustness and generalization ability. Our findings suggest that purely convolutional architectures can also serve as competitive pathology foundation models for cell-level dense prediction, achieving leading performance within the current ViT-dominated paradigm and providing a scalable, high-performance solution that better preserves histological structural priors for fine-grained pathology understanding.

2605.08275 2026-05-12 eess.IV cs.CV 版本更新

Model-based Dynamic 3D MRI Reconstructions using Neural Fields and Tensor Product Expansions

基于神经场和张量积展开的模型动态3D MRI重建

Ray Sheombarsing, Max van Riel, David Heesterbeek, Nico van den Berg, Alessandro Sbrizzi

Affiliations * Computational Imaging Group for MRI Therapy & Diagnostics, Department of Radiotherapy, University Medical Center Utrecht

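AI summary This paper proposes a discretization-free, efficient MRI reconstruction framework that exploits a tensor product structure to reconstruct dynamic 2D and 3D images accurately in high-dimensional spatiotemporal settings, preserving structure and motion even under strong undersampling.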

详情
AI中文摘要

传统MRI重建方法将图像和线圈灵敏度视为离散对象,导致高内存需求和有限的结构感知能力,限制了有效正则化。这些限制阻碍了在高度欠采样场景(如动态3D心脏磁共振)中的准确重建。我们引入了一种无需离散化的、内存高效的、基于模型的框架,用于从高度欠采样数据中重建动态2D和3D MRI。我们利用单变量神经场的张量积表示磁化和线圈灵敏度作为连续对象——可微函数。这种张量积结构使在高维时空设置中实现可扩展的优化成为可能。我们的方法在动态2D和3D MR设置中优于最先进的基于模型的重建方法,即使在极端欠采样(例如加速因子16)下也能保持结构和运动信息。

英文摘要

Conventional MRI reconstruction methods treat images and coil sensitivities as discrete objects, leading to high memory demands and limited structural awareness that hamper effective regularization. These limitations hinder accurate reconstruction in highly undersampled scenarios, such as dynamic 3D cardiac magnetic resonance (CMR). We introduce a discretization-free, memory-efficient, model-based framework for dynamic 2D and 3D MRI reconstruction from highly undersampled data. We represent magnetization and coil sensitivities as continuous objects -- differentiable functions -- using tensor products of univariate neural fields. This tensor product structure enables scalable optimization in high-dimensional spatiotemporal settings. Our method outperforms state-of-the-art model-based reconstructions in dynamic 2D and 3D MR settings, preserving structure and motion even under aggressive undersampling (e.g., acceleration factor 16).
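Editor's sketch: the abstract does not spell out the expansion, so the snippet below assumes a generic separable form, f(x, y, t) = sum over r of a_r(x) * b_r(y) * c_r(t), to illustrate why a tensor product of univariate fields is memory-efficient. Fixed sine functions stand in for the small univariate neural networks; `make_field`, `magnetization`, `storage`, and the rank of 3 are hypothetical choices, not the authors' implementation.

```python
import math

def make_field(freq, phase):
    # Stand-in for a small univariate neural field: any smooth 1D function works here.
    return lambda u: math.sin(freq * u + phase)

RANK = 3  # number of product terms (an assumption for illustration)
fields_x = [make_field(r + 1, 0.0) for r in range(RANK)]
fields_y = [make_field(r + 1, 0.5) for r in range(RANK)]
fields_t = [make_field(r + 1, 1.0) for r in range(RANK)]

def magnetization(x, y, t):
    """Evaluate the separable expansion at an arbitrary continuous coordinate,
    so no voxel grid has to be stored."""
    return sum(fx(x) * fy(y) * ft(t)
               for fx, fy, ft in zip(fields_x, fields_y, fields_t))

def storage(nx, ny, nt, rank):
    """Values needed to tabulate the signal on a grid: dense vs. factored."""
    dense = nx * ny * nt
    factored = rank * (nx + ny + nt)
    return dense, factored

value = magnetization(0.25, 0.5, 0.1)
dense, factored = storage(256, 256, 32, RANK)
```

In this toy setting, tabulating a 256 x 256 x 32 grid densely takes 2,097,152 values, while the factored form needs 1,632; the gap widens with every added dimension, which is the scalability argument the abstract makes.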

2605.08271 2026-05-12 cs.CV cs.AI 版本更新

Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

跨模态、跨时间:结构化记忆用于超长代理视频推理

Jiazheng Li, Chi-Hao Wu, Yunze Liu, Kaize Ding, Jundong Li, Chuxu Zhang

Affiliations * University of Connecticut; Northwestern University; University of Virginia

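AI summary This paper proposes the MAGIC-Video framework, which uses a multimodal memory graph and an interleaved narrative chain to enable cross-modal retrieval and long-horizon reasoning over ultra-long videos, outperforming existing baselines.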

详情
AI中文摘要

理解超长视频(如第一视角录像、直播或监控录像)仍具挑战性。现有多模态大语言模型即使有百万token上下文窗口,帧预算仅覆盖几十分钟密集采样视频,大部分证据在推理前被丢弃。记忆增强和代理方法有助于扩展规模,但其检索仍碎片化且缺乏跨天或周的长篇叙事摘要。我们提出MAGIC-Video,一个无需训练的框架,围绕多模态记忆图和交织叙事链:图通过六种类型边统一事件、语义和视觉内容,支持跨模态检索;链提炼长时程实体生物和重复活动事件。在推理时,代理循环交织图检索与叙事事实注入,覆盖超长视频的模态和时间维度。在EgoLifeQA、Ego-R1和MM-Lifelong数据集上,MAGIC-Video在强通用、长视频和代理基线中表现优异,分别优于先前最佳代理系统10.1、7.4和5.9分。代码可在https://github.com/lijiazheng0917/MAGIC-video获取。

英文摘要

Understanding ultra-long videos, such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge for current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose MAGIC-Video, a training-free framework built around a multimodal memory graph with an interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra-long video in a single retrieval pipeline. On EgoLifeQA, Ego-R1, and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points, respectively, over the prior best agentic system on each benchmark. Code is available at https://github.com/lijiazheng0917/MAGIC-video.

2605.08266 2026-05-12 eess.IV cs.CV 版本更新

Coarse-to-Fine: Progressive Image Compression for Semantically Hierarchical Classification

由粗到细:用于语义分层分类的渐进图像压缩

Jungwoo Kim, Jun-Hyuk Kim, Jong-Seok Lee

Affiliations * School of Integrated Technology, Yonsei University; BK21 Graduate Program in Intelligent Semiconductor Technology, Yonsei University; Department of Artificial Intelligence, Chung-Ang University

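AI summary This paper proposes a semantic hierarchy-aware progressive codec that achieves semantic scalability from a single bitstream. It groups ImageNet-1K classes into semantic hierarchies based on CLIP embeddings and decomposes the latent representation within a channel-wise autoregressive framework, improving coarse-level recognition at low bitrates while maintaining fine-grained accuracy at higher bitrates.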

Comments Accepted at ICIP 2026

详情
AI中文摘要

最近的学得图像压缩(LIC)进展使实际部署成为可能,推动了图像压缩和渐进编码方案的研究。然而,其整合仍处于探索阶段:先前关于渐进机器编码器的研究主要针对样本层面的难度适应(即易到难),而未考虑语义层面的可扩展性。在本文中,我们引入了一种语义层次感知的渐进编码器,从单个比特流中实现语义可扩展性(即由粗到细)。我们首先系统地将ImageNet-1K类别分类到基于CLIP嵌入的语义层次中。基于通道自回归框架,我们将潜在表示分解为层次有序的通道块,每个块专门优化以适应相应的语义层次。大量实验表明,我们的方法在低比特率下显著提升了粗粒度识别性能,同时在高比特率下保持细粒度准确性。通过将渐进传输重新框架化为语义可扩展性的视角,我们的工作为任务自适应图像编码提供了高效且可解释的解决方案,在层次评估下优于现有渐进编码器。

英文摘要

Recent advances in learned image compression (LIC) have enabled practical deployments, spurring active research into image compression for machines and progressive coding schemes. However, their integration remains under-explored: prior work on progressive machine codecs predominantly targets sample-level difficulty adaptation (i.e., easy-to-hard), without considering semantic-level scalability. In this work, we introduce a semantic hierarchy-aware progressive codec that enables semantic scalability (i.e., coarse-to-fine) from a single bitstream. We first systematically categorize ImageNet-1K classes into CLIP embedding-based semantic hierarchies. Based on a channel-wise autoregressive framework, we decompose latent representations into hierarchically ordered channel blocks, each explicitly optimized for a corresponding semantic hierarchy. Extensive experiments demonstrate that our approach substantially improves coarse-level recognition at low bitrates while maintaining fine-grained accuracy at higher bitrates. By reframing progressive transmission through the lens of semantic scalability, our work provides an efficient and interpretable solution for task-adaptive image coding, outperforming existing progressive codecs under hierarchical evaluation.

2605.08252 2026-05-12 cs.CV 版本更新

Multimodal Emotion Recognition via Causal-Diffusion Bridge (Affect-Diff)

通过因果-扩散桥进行多模态情绪识别(Affect-Diff)

Ankit Sanjyal

Affiliations * Department of Computer Science -- Data Science; Fordham University

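AI summary To address emotion class imbalance in the CMU-MOSEI dataset, Affect-Diff improves recognition through a causal graph, regularized latent compression, and a diffusion prior, with demonstrated gains in balanced accuracy and minority-class sensitivity.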

Comments 10 Pages, 12 Figures, 6 Tables

详情
AI中文摘要

多模态情绪识别在CMU-MOSEI数据集面临极端不平衡问题,因为快乐占样本的65.9%,而三个Ekman类别合计仅占7%。标准融合模型通过忽略少数情绪类别来最大化准确率。我们提出Affect-Diff,一种因果-扩散桥,通过三个联合训练机制解决此问题:一个通过NOTEARS学习的因果图,在融合前重新加权模态贡献;一个beta-VAE瓶颈用于正则化的潜在压缩;以及一个带有stop-gradient的1D DDPM先验,用于对抗多数类崩溃。在3,292个对齐的CMU-MOSEI样本上,Affect-Diff达到验证平衡准确率0.384,比最强基线(TETFN: 0.324)提高了18%。所有评估基线在恐惧、厌恶和惊讶类别上产生零F1。消融研究证实扩散先验和因果图的独立、非冗余贡献。值得注意的是,只有确定性编码器变体能检测所有六个情绪类别,揭示KL正则化强度作为少数类敏感性的直接杠杆。

英文摘要

Multimodal emotion recognition on CMU-MOSEI faces an extreme imbalance as Happy accounts for 65.9% of samples while three Ekman categories collectively represent under 7%, causing standard fusion models to maximize accuracy by ignoring minority emotions entirely. We present Affect-Diff, a Causal-Diffusion Bridge that addresses this through three jointly trained mechanisms: a NOTEARS-learned causal graph that re-weights modality contributions before fusion, a beta-VAE bottleneck for regularized latent compression, and a stop-gradiented 1D DDPM prior that structures the latent space against majority-class collapse. On 3,292 aligned CMU-MOSEI samples, Affect-Diff achieves validation balanced accuracy 0.384, an 18% relative improvement over the strongest baseline (TETFN: 0.324), while all evaluated baselines produce zero F1 on Fear, Disgust, and Surprise. Ablation studies confirm independent, non-redundant contributions from the diffusion prior (-24% without it) and causal graph (-13%). Notably, only the deterministic-encoder variant detects all six emotion classes, revealing KL regularization strength as a direct lever for minority-class sensitivity.
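Editor's sketch: the abstract reports balanced accuracy rather than plain accuracy because a model that predicts only the majority class can still score high on the latter. The snippet below illustrates this with a toy label set and checks the quoted relative improvement; the helper `balanced_accuracy` uses the standard definition (mean per-class recall), and the toy labels are invented for illustration rather than drawn from CMU-MOSEI.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy imbalanced set: 8 "happy", 1 "fear", 1 "surprise" (illustrative only).
y_true = ["happy"] * 8 + ["fear", "surprise"]
majority_only = ["happy"] * 10  # a model that ignores minority classes

plain_accuracy = sum(t == p for t, p in zip(y_true, majority_only)) / len(y_true)
balanced = balanced_accuracy(y_true, majority_only)

# Relative improvement quoted in the abstract: 0.384 vs. TETFN's 0.324.
relative_gain = (0.384 - 0.324) / 0.324
```

Here the majority-only predictor reaches 80% plain accuracy but only one third balanced accuracy, since two of the three classes have zero recall. The relative gain of Affect-Diff over TETFN works out to about 18.5%, consistent with the 18% stated above.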

2605.08250 2026-05-12 cs.CV cs.AI 版本更新

Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space

为何DiT编辑器会漂移?VAE潜在空间中的插件式低频对齐

Xiaoce Wang, Sifan Zhou, Kaifei Wang, Leli Xu, Xuerui Qiu, Tao He, Ming Li

Affiliations * Tsinghua University; Carnegie Mellon University; Peking University; CASIA; University of Electronic Science and Technology of China; Guangming Laboratory

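AI summary This paper studies drift in DiT editors from a latent-space frequency perspective and proposes VAE-LFA, a training-free method that suppresses semantic drift through low-frequency alignment while preserving high-frequency detail; it applies to both white-box and black-box DiT editors.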

Comments 9 pages main paper, 12 figures, 25 pages in total

详情
AI中文摘要

最近扩散变换器(DiTs)的进展使单轮图像编辑能力显著提升。然而,多轮编辑常导致渐进式语义漂移和质量退化。本文从潜在空间频率角度研究该问题,将编辑过程分解为VAE和DiT两个功能组件。通过系统分析VAE潜在空间,发现DiT引入主导的低频漂移,随着编辑轮次增加而累积,而VAE贡献相对稳定的重建偏差。基于此,提出VAE-LFA(低频对齐),一种无需训练、即插即用的方法,在VAE潜在空间中进行对齐。VAE-LFA通过低通滤波分解编辑轮次间的潜在差异,并将低频统计对齐到先前轮次的指数移动平均,有效抑制累积语义漂移并保留高频细节。该方法无需重新训练、真实先验或扩散参数访问,适用于白盒和黑盒DiT编辑器。对于白盒模型,VAE-LFA无缝集成到编辑流程中,通过消除冗余的VAE轮次;对于黑盒模型,通过现成的VAE进行轮次间潜在对齐。大量实验表明,VAE-LFA在多样化的多轮编辑场景中提升了语义一致性和视觉保真度,包括受控和真实场景中的图像。

英文摘要

Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation. In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstruction bias. Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an exponential moving average of previous rounds, effectively suppressing accumulated semantic drift while preserving high-frequency details. Our method requires no retraining, ground-truth priors, or access to diffusion parameters, making it applicable to both white-box and black-box DiT editors. For white-box models, VAE-LFA is seamlessly integrated into the editing pipeline by eliminating redundant VAE round trips; for black-box models, it operates via an off-the-shelf VAE to perform inter-round latent alignment. Extensive experiments demonstrate that VAE-LFA improves semantic consistency and visual fidelity across diverse multi-turn editing scenarios, including both controlled and in-the-wild images.
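Editor's sketch: the abstract describes three steps (low-pass filtering the latent, aligning low-frequency statistics to an exponential moving average of earlier rounds, and leaving high-frequency detail untouched) without giving the filter or the statistics used. The snippet below is one possible reading on a 1D toy latent, assuming a moving-average filter and mean/standard-deviation matching; `low_pass`, `align_round`, and the momentum value are hypothetical and should not be taken as the authors' method.

```python
import statistics

def low_pass(signal, radius=1):
    """Moving-average low-pass filter; windows are truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def align_round(latent, ema_mean, ema_std, momentum=0.9, eps=1e-8):
    """Match the low-frequency mean/std of `latent` to the running reference,
    then add the untouched high-frequency residual back."""
    low = low_pass(latent)
    high = [z - l for z, l in zip(latent, low)]
    mean, std = statistics.fmean(low), statistics.pstdev(low)
    if ema_mean is None:
        # First round: initialise the reference and return the latent unchanged.
        return list(latent), mean, std
    aligned_low = [(l - mean) / (std + eps) * ema_std + ema_mean for l in low]
    new_mean = momentum * ema_mean + (1 - momentum) * mean
    new_std = momentum * ema_std + (1 - momentum) * std
    return [l + h for l, h in zip(aligned_low, high)], new_mean, new_std

round1 = [0.0, 1.0, 0.0, -1.0, 0.5, 2.0]
_, ref_mean, ref_std = align_round(round1, None, None)

# Simulate a second editing round that adds a constant (purely low-frequency) drift.
round2 = [z + 0.5 for z in round1]
aligned, ref_mean, ref_std = align_round(round2, ref_mean, ref_std)
```

In this toy case the drift is a constant offset, so alignment recovers the first-round latent almost exactly while the high-frequency residual passes through unchanged. Real latents are multi-channel 2D tensors and real drift is not a pure offset, so this only conveys the decomposition idea.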

2605.08249 2026-05-12 cs.CV eess.IV eess.SP 版本更新

Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models

维度共激活用于冻结视觉基础模型中的表征一致性

Izaldein Al-Zyoud, Abdulmotaleb El Saddik

Affiliations * MCRLab, School of Electrical Engineering and Computer Science, University of Ottawa

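AI summary This paper proposes Dimensional Coactivation (DCA) for assessing intra-sample representational consistency in frozen vision foundation models. By comparing feature-dimension coactivation across semantic subregions, DCA performs strongly on deepfake detection, supporting its usefulness with frozen models.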

详情
AI中文摘要

冻结视觉基础模型不仅提取特征,还通过学习的坐标系统组织图像。本文探讨该坐标系统在单个输入内部是否保持一致性,提出表征一致性概念。引入维度共激活(DCA),一种用于测量这种一致性的工具。DCA通过比较语义区域是否在同一特征维度上共激活来评估一致性。与经典相似性度量不同,DCA有意避免中心化、L2归一化和完全Gram耦合。深度伪造检测提供了自然的验证任务。合成面孔可能在眼睛、鼻子和嘴巴区域生成合理外观,但破坏真实面孔中这些区域的表征结构。使用冻结的DINOv3特征,DCA揭示了这种破坏:眼-嘴-鼻指纹在CelebDF-v2和DFD上分别达到0.9106和0.9289的AUC。设计还通过消融实验验证:重新引入中心化使CelebDF-v2 AUC降至0.459,L2归一化降至0.862,跨维度耦合降至0.478。最后,用FaRL替代DINOv3使CelebDF-v2 AUC降至0.582。DCA因此依赖于稳定的每维度坐标系统,而非单纯的区域提取。这些结果将DCA定位为测量冻结基础模型中样本内部表征一致性的工具,深度伪造检测是首个验证任务。

英文摘要

Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.
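Editor's sketch: the abstract states what DCA avoids (centering, L2 normalization, full Gram coupling) but not its exact formula. The snippet below shows one way to read "the same feature dimensions coactivate across regions", assuming each region is summarized by a raw pooled feature vector and coactivation is the per-dimension product. The names `coactivation` and `fingerprint` and the three-dimensional toy features are hypothetical.

```python
from itertools import combinations

def coactivation(region_a, region_b):
    """Per-dimension product of raw activations. Dimension d of one region is
    only ever paired with dimension d of the other: no centering, no L2
    normalization, and no cross-dimension (Gram) terms."""
    return [a * b for a, b in zip(region_a, region_b)]

def fingerprint(regions):
    """Concatenate pairwise per-dimension coactivations over all region pairs,
    visiting region names in sorted order for reproducibility."""
    names = sorted(regions)
    out = []
    for first, second in combinations(names, 2):
        out.extend(coactivation(regions[first], regions[second]))
    return out

# Toy pooled features for three facial regions (illustrative values only).
regions = {
    "eyes": [2.0, 0.0, 1.0],
    "mouth": [1.5, 0.0, -1.0],
    "nose": [2.0, 0.5, 0.0],
}
fp = fingerprint(regions)
```

With D feature dimensions and three regions, this yields a 3D-length vector, in contrast to the D x D entries per pair that a full Gram matrix would produce. How the fingerprint is then scored or classified is not described in the abstract and is left out here.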

2605.08246 2026-05-12 cs.CV cs.CR cs.LG 版本更新

Smart Railway Obstruction Detection System using IoT and Computer Vision

基于物联网和计算机视觉的智能铁路障碍检测系统

Pravin Kumar, Mritunjay Shall Peelam, Ramakant Kumar, Sanjay Kumar, Vinay Chamola

Affiliations * School of Computer Science, University of Petroleum and Energy Studies (UPES); Department of Computer Science, Galgotias College of Engineering & Technology; Department of Computer Science, NIT Jamshedpur; Department of Electrical and Electronics Engineering, BITS-Pilani, Pilani Campus; APPCAIR, BITS-Pilani, Pilani Campus

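AI summary This paper proposes the NETRA system, which uses edge computing platforms for low-cost, low-power railway intrusion detection, combining probabilistic sensor fusion with edge-AI classification to raise detection accuracy and reduce deployment cost.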

详情
AI中文摘要

铁路轨道入侵对印度铁路安全构成重大挑战,包括野生动物入侵和故意障碍。2025年12月阿萨姆发生的碰撞事件凸显了实时检测的紧迫性。现有解决方案如基于光纤的Gajraj系统成本高昂(1000美元/公里)且误报率高,限制了在101条大象走廊中的部署仅限于20条。本文提出NETRA,一种成本效益高、独立于互联网的入侵检测系统,部署在Raspberry Pi Zero W和Raspberry Pi 4边缘平台上。NETRA采用概率传感器融合,整合PIR运动传感器和HC-SR04超声波距离传感器,通过可调阈值(tau_c=0.65)实现事件驱动的相机激活,减少不必要的视觉处理52%。确认入侵后,使用MobileNet-SSD(Pi Zero)或YOLOv5 ONNX(Pi 4)的边缘AI分类识别威胁,包括人类、大型动物和轨道障碍。确认的威胁通过LoRa(868 MHz)传输,2.4秒内通知机车司机。113个运动事件的实验评估显示,概率融合方法的检测准确率为95%,无误报,优于二元方法的85%。Raspberry Pi 4与YOLOv5实现83.5%的象类F1分数,比Pi Zero的启发法方法(14.8%)提高5.6倍。LoRa通信在1-2公里范围内实现100%的数据包交付。NETRA将部署成本降低75%(247美元/公里 vs Gajraj的1000美元/公里),同时提供对野生动物和障碍威胁的统一检测。

英文摘要

Railway track intrusions pose a critical safety challenge for Indian Railways, encompassing wildlife incursions and deliberate malicious obstructions. The December 2025 collision in Assam, in which seven elephants were killed by the Rajdhani Express, underscores the urgency of effective real-time detection. Existing solutions such as the optical fiber-based Gajraj system suffer from prohibitive costs ($1000/km) and high false alarm rates, limiting deployment to only 20 of India's 101 elephant corridors. This paper proposes NETRA, a cost-effective, internet-independent intrusion detection system deployed on Raspberry Pi Zero W and Raspberry Pi 4 edge platforms. NETRA employs probabilistic sensor fusion integrating a PIR motion sensor and an HC-SR04 ultrasonic distance sensor with a tunable threshold (tau_c = 0.65), enabling event-driven camera activation that reduces unnecessary visual processing by 52%. Upon confirmed intrusion, edge-AI classification using MobileNet-SSD (Pi Zero) or YOLOv5 ONNX (Pi 4) identifies threats including humans, large animals, and track obstructions. Confirmed threats are transmitted via LoRa (868 MHz) to alert the locomotive driver within 2.4 seconds end-to-end. Experimental evaluation across 113 motion events demonstrated 95% detection accuracy with zero false alarms through probabilistic fusion, compared to 85% for binary methods. Raspberry Pi 4 with YOLOv5 achieved 83.5% elephant F1-score, a 5.6x improvement over Pi Zero's heuristic approach (14.8%). LoRa communication achieved 100% packet delivery across 1-2 km in field trials. NETRA reduces deployment cost by 75% ($247/km vs $1000/km for Gajraj) while providing unified detection of both wildlife and obstruction threats.
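Editor's sketch: the abstract gives the fusion threshold (tau_c = 0.65) but not the fusion rule itself. The snippet below assumes a simple weighted sum of a binary PIR reading and an ultrasonic proximity score to show how such a threshold can gate camera activation. The weights, the 400 cm maximum range, and the function names are hypothetical and not taken from the paper. The last line checks the quoted cost reduction.

```python
TAU_C = 0.65  # threshold reported in the abstract

def fused_confidence(pir_triggered, distance_cm,
                     max_range_cm=400.0, w_pir=0.6, w_ultra=0.4):
    """Hypothetical fusion rule: weighted sum of a binary PIR reading and an
    ultrasonic proximity score in [0, 1]. Weights and range are assumptions."""
    proximity = max(0.0, 1.0 - distance_cm / max_range_cm)
    return w_pir * (1.0 if pir_triggered else 0.0) + w_ultra * proximity

def should_wake_camera(pir_triggered, distance_cm):
    """Event-driven activation: run vision only when fused confidence is high."""
    return fused_confidence(pir_triggered, distance_cm) >= TAU_C

near_and_moving = should_wake_camera(True, 100.0)   # motion plus a close object
far_and_moving = should_wake_camera(True, 380.0)    # motion, but object barely in range
near_no_motion = should_wake_camera(False, 50.0)    # close object, no motion

# Deployment cost reduction quoted in the abstract: $247/km vs. $1000/km.
cost_reduction = 1 - 247 / 1000
```

Under these assumed weights, only the first case wakes the camera; a purely binary rule (PIR alone) would also fire on the second. The cost figures give a reduction of 75.3%, matching the stated 75%.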

2605.08241 2026-05-12 cs.CV cs.AI 版本更新

TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models

TinySSL:用于子兆字节MCU模型的蒸馏自监督预训练

Bibin Wilson

发表机构 * Bibin Wilson

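AI summary This paper proposes the CA-DSSL framework, which combines distillation and self-supervised learning to learn representations efficiently for sub-megabyte MCU models, outperforming SimCLR-Tiny and matching SEED with fewer projection parameters.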

详情
AI中文摘要

自监督学习(SSL)已改变大模型的表示学习,但在微控制器(MCU)类模型中仍无探索。本文识别了三个障碍:投影头主导、表示瓶颈和增强敏感性,并提出容量感知蒸馏自监督学习(CA-DSSL),一种教师引导框架,无需标签或文本监督即可克服这些障碍。CA-DSSL结合了不对称蒸馏、多尺度特征蒸馏和渐进增强课程。在MobileNetV2-0.35主干上预训练CIFAR-100,CA-DSSL达到62.7 0.5%线性探针准确率(3种子均值),优于SimCLR-Tiny 18个百分点,与SEED(61.7%)匹配,但参数更少。标准SSL方法(BYOL-Tiny、DINO-Tiny)在该规模完全崩溃。在Pascal VOC检测中,CA-DSSL达到随机初始化的2.3倍mAP,并在SEED上高出3个百分点,尽管SimCLR-Tiny在检测mAP上匹配CA-DSSL。部署主干占用378 KB(INT8)且无推理开销。初步ImageNet-100实验表明,CA-DSSL的优势特定于小数据集;扩展到ImageNet-1K作为未来工作。

英文摘要

Self-supervised learning (SSL) has transformed representation learning for large models, yet remains unexplored for microcontroller (MCU)-class models with fewer than 500K parameters. We identify three obstacles at this scale -- projection head dominance, representation bottleneck, and augmentation sensitivity -- and propose Capacity-Aware Distilled Self-Supervised Learning (CA-DSSL), a teacher-guided framework that overcomes them without labels or text supervision. CA-DSSL combines asymmetric distillation from a frozen DINO ViT-S/16 teacher, multi-scale feature distillation for spatial representations, and a progressive augmentation curriculum. On a MobileNetV2-0.35 backbone (396K parameters) pretrained on CIFAR-100, CA-DSSL reaches 62.7 ± 0.5% linear-probe accuracy (3-seed mean) -- surpassing SimCLR-Tiny by 18 pp, matching SEED (61.7%) with 10× fewer projection parameters (426K vs. 3.15M), and reaching 94.0% of a supervised upper bound. Standard SSL methods (BYOL-Tiny, DINO-Tiny) collapse entirely at this scale. On Pascal VOC detection, CA-DSSL achieves 2.3× the mAP of random initialization and +3 pp over SEED, though SimCLR-Tiny matches CA-DSSL on detection mAP. The deployed backbone occupies 378 KB (INT8) with no inference overhead from pretraining. Preliminary ImageNet-100 experiments reveal that CA-DSSL's advantage is specific to small-data regimes; scaling to ImageNet-1K is discussed as future work.

2605.08238 2026-05-12 cs.CV cs.AI cs.ET cs.LG 版本更新

Resource-Aware Evolutionary Neural Architecture Search for Cardiac MRI Segmentation

面向资源的进化神经架构搜索用于心脏MRI分割

Farhana Yasmin, Mahade Hasan, Haipeng Liu, Amjad Ali, Ghulam Muhammad, Yu Xue

Affiliations * School of Computer Science, Nanjing University of Information Science and Technology; School of Software, Nanjing University of Information Science and Technology; Eastern University; Research Centre for Intelligent Healthcare, Coventry University; National Medical Research Association; Faculty of Engineering and Technology, Muscat University; College of Computer and Information Sciences, King Saud University

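AI summary This paper proposes CardiacNAS, an evolutionary neural architecture search framework that couples a UNet-like supernet with a cardiac-aware search space and balances Dice similarity coefficient and Hausdorff distance against model size and FLOPs, achieving accurate and efficient cardiac MRI segmentation.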

Journal ref F. Yasmin et al., "Resource-Aware Evolutionary Neural Architecture Search for Cardiac MRI Segmentation," 28th International Conference on Computer and Information Technology (ICCIT), 2025, pp. 2819-2824

详情
AI中文摘要

心脏磁共振(CMR)分割是定量评估心室结构和功能的基础,但由于组织对比度低、边界模糊以及扫描间差异,可靠的分割仍然困难。本文提出CardiacNAS,一种进化神经架构搜索(NAS)框架,将类UNet超网络(supernet)与心脏感知搜索空间相结合,该搜索空间涵盖深度、宽度、核大小、滤波器大小、注意力、融合、激活、dropout和残差缩放。搜索过程显式考虑资源约束,在固定计算预算下,在Dice相似系数(DSC)和95百分位Hausdorff距离(HD95)与模型大小和浮点运算量(FLOPs)之间进行联合优化。候选架构从超网络中实例化,以代理预算进行训练,并通过交叉、变异和精英选择进行进化。我们在ACDC数据集上进行评估,与六种最先进的方法进行比较,并辅以定性比较、学习曲线分析和设计因素相关性研究。所得模型以3.58M参数和14.56 GFLOPs达到93.22%的平均DSC和4.73 mm的HD95,展现了良好的精度与效率权衡。分析表明,搜索得到的注意力和融合选择以及残差缩放有助于提升边界保真度和稳定性。CardiacNAS为可部署的CMR分割提供了一种有原则、资源感知的方法,并透明地报告架构复杂度和计算预算。

英文摘要

Cardiac magnetic resonance (CMR) segmentation underpins quantitative assessment of ventricular structure and function, yet reliable delineation remains difficult due to low tissue contrast, fuzzy boundaries, and inter-scan variability. We present CardiacNAS, an evolutionary neural architecture search (NAS) framework that couples a UNet-like supernet with a cardiac-aware search space spanning depth, width, kernel size, filter size, attention, fusion, activation, dropout, and residual scaling. The search is explicitly resource-aware, jointly optimizing Dice similarity coefficient (DSC) and 95th-percentile Hausdorff distance (HD95) against model size and floating-point operations (FLOPs) under fixed compute budgets. Candidate architectures are instantiated from the supernet, trained with proxy budgets, and evolved through crossover, mutation, and elitist selection. We evaluate on the ACDC dataset and compare against six state-of-the-art methods, using qualitative comparisons, learning-curve analyses, and design-factor correlation studies. The resulting model attains 93.22% average DSC and 4.73 mm HD95 with 3.58M parameters and 14.56 GFLOPs, demonstrating a favorable accuracy-efficiency trade-off. Analyses indicate that searched attention and fusion choices, together with residual scaling, contribute to improved boundary fidelity and stability. CardiacNAS offers a principled, resource-aware approach to deployable CMR segmentation with transparent reporting of architectural complexity and compute budgets.
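The search loop described in this abstract (instantiate candidates, score them, then evolve through crossover, mutation, and elitist selection) can be sketched roughly as follows. This is only an illustrative toy, not the CardiacNAS implementation: the search space, the `toy_fitness` function, and all hyperparameters are hypothetical stand-ins for the paper's supernet, proxy training, and resource-aware objective.

```python
import random

# Hypothetical search space, far smaller than the one described in the paper.
SEARCH_SPACE = {
    "depth": [3, 4, 5],
    "width": [16, 32, 64],
    "kernel": [3, 5],
    "attention": [False, True],
}


def random_arch(rng):
    """Sample one candidate architecture uniformly from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each gene is inherited from one of the two parents."""
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}


def mutate(arch, rng, rate=0.2):
    """Resample each gene with probability `rate`."""
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in arch.items()
    }


def toy_fitness(arch):
    """Assumed placeholder for 'proxy-trained quality minus resource cost'."""
    quality = arch["depth"] + arch["width"] / 16 + (1 if arch["attention"] else 0)
    cost = arch["depth"] * arch["width"] * arch["kernel"] ** 2 / 1000
    return quality - cost


def evolve(fitness, generations=10, pop_size=8, n_elite=2, seed=0):
    """Evolutionary search with elitist selection; returns the best candidate."""
    rng = random.Random(seed)
    population = [random_arch(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]               # survivors copied unchanged
        parents = population[: pop_size // 2]      # fitter half may reproduce
        children = []
        while len(children) < pop_size - n_elite:
            mother, father = rng.sample(parents, 2)
            children.append(mutate(crossover(mother, father, rng), rng))
        population = elite + children
    return max(population, key=fitness)
```

Because the elite are carried over unchanged, the best fitness in the population never decreases from one generation to the next. In a real setting `fitness` would wrap proxy training plus a parameter and FLOPs penalty, which is where almost all of the cost lies.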

2605.08226 2026-05-12 cs.CV 版本更新

SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection

SPECTRA-Net:面向AI生成图像检测的可扩展、可解释跨域张量表示流水线

Sarra Arab, Anfal Achouri, Seif Eddine Bouziane

发表机构 * The National School of Artificial Intelligence(国家人工智能学院)

AI总结 本文提出SPECTRA-Net,通过多视角图像表示结合全局语义特征、频谱分析、局部补丁异常检测和统计描述符,实现跨域AI生成图像检测的高准确性和泛化能力。

Comments 13 pages, 2 figures, submitted to a journal

详情
AI中文摘要

AI生成图像(AIGI)的快速普及对数字信息完整性构成重大挑战。随着人类观察者和现有检测模型越来越难以跟上生成模型日益提升的复杂程度,对稳健、实时检测系统的需求变得至关重要。本文介绍了SPECTRA-Net,一种可扩展的可解释跨域张量表示流水线,用于AI生成图像检测。我们的方法利用图像的多视角表示,结合来自视觉基础模型(VFM)的全局语义特征、频谱分析、基于局部图像块的异常检测和统计描述符。通过融合这些互补的数据流,SPECTRA-Net在域内和跨域设置中均实现了最先进的性能,在包括WildFake、Chameleon和RRDataset在内的多个具有挑战性的数据集上展示了高准确性和泛化能力。所提出的流水线不仅为AI生成图像检测提供了稳健的解决方案,还通过伪影定位提供可解释性,为现实应用中更可信、更可靠的内容验证铺平了道路。

英文摘要

The rapid proliferation of AI-generated images (AIGI) presents a significant challenge to digital information integrity. While human observers and existing detection models struggle to keep pace with the increasing sophistication of generative models, the need for robust, real-time detection systems has become critical. This paper introduces SPECTRA-Net, a scalable pipeline for explainable, cross-domain tensor representations for AIGI detection. Our approach leverages a multi-view representation of images, combining global semantic features from a Vision Foundation Model (VFM), spectral analysis, local patch-based anomaly detection, and statistical descriptors. By fusing these complementary data streams, SPECTRA-Net achieves state-of-the-art performance in both in-domain and cross-domain settings, demonstrating high accuracy and generalization capabilities across a wide range of challenging datasets, including WildFake, Chameleon, and RRDataset. The proposed pipeline not only provides a robust solution for AIGI detection but also offers explainability through artifact localization, paving the way for more trustworthy and reliable content verification in real-world applications.

2605.08222 2026-05-12 cs.CV cs.AI cs.IR 版本更新

From Historical Tabular Image to Knowledge Graphs: A Provenance-Aware Modular Pipeline

从历史表格图像到知识图谱:一个具有溯源意识的模块化流水线

Sarah Binta Alam Shoilee, Victor de Boer, Jacco van Ossenbruggen, Susan Legêne

发表机构 * Vrije Universiteit Amsterdam(阿姆斯特丹自由大学)

AI总结 本文提出一个模块化、具有溯源意识的流水线,将手写表格图像转换为知识图谱,支持人机协作。通过分解为表格重建、信息提取和知识图谱构建三个阶段,确保提取实体和文字可追溯至其视觉和文本来源。

Comments Shorter version of this paper has been accepted in the 5th International Conference on Hybrid Human-Artificial Intelligence (HHAI 2026)

详情
AI中文摘要

手写档案表格包含丰富的历史信息,但将其转换为知识图谱等结构化表示,需要整合表格结构识别、手写识别和语义解释,这是一个复杂的多模态过程。端到端的AI实现可能会掩盖这些步骤,导致算法操作不透明,阻碍人类监督、批判性评估和信任。为此,我们提出一个模块化、具有溯源意识的流水线,将手写表格图像转换为知识图谱,并支持人机协作。该流水线将工作流程分解为三个阶段(表格重建、信息提取和知识图谱构建),同时暴露中间表示以供检查、评估和修正。我们方法的一个关键贡献是在每个阶段系统地整合数据溯源,确保提取的所有实体和字面量都能追溯到其视觉和文本来源。我们在有关军事生涯的真实档案材料上通过多项实验展示了所提出的流水线。三种不同表格重建变体的结果突显了模块化的重要性。通过将模块化与数据溯源相结合,我们的工作推进了面向复杂历史数据的透明且可协作控制的图像到知识图谱流水线。

英文摘要

Handwritten archival tables contain rich historical information, yet transforming them into structured representations, such as Knowledge Graphs, requires integrating table structure recognition, handwriting recognition, and semantic interpretation - a complex multimodal process. End-to-end AI implementations can obscure these steps, resulting in opaque algorithmic operations that hinder human oversight, critical assessment, and trust. To address this, we present a modular, provenance-aware pipeline to convert handwritten tabular images into KGs supporting human-AI collaboration. The pipeline decomposes the workflow into three stages - table reconstruction, information extraction, and KG construction - while exposing intermediate representations for inspection, evaluation, and correction. A key contribution of our approach is the systematic integration of data provenance at every stage, ensuring that all extracted entities and literals remain traceable to their visual and textual origins. The proposed pipeline is demonstrated through a number of experiments on real-world archival material concerning military careers. The results across three different table reconstruction variants highlight the importance of modularisation. By coupling modularity with data provenance, our work advances transparent and collaboratively controllable image-to-KG pipelines for complex historical data.

2605.08220 2026-05-12 cs.AI cs.CE cs.CL cs.CV cs.SE 版本更新

Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction

空间提示优于语义提示:一种基于网格的方法以提高LLM在图表数据提取上的准确性

Andrei Lazarev, Dmitrii Sedov, Alexander Galkin

发表机构 * Russian Federation(俄罗斯联邦)

AI总结 本文通过对比空间提示与语义提示,发现基于网格的提示方法能显著降低图表数据提取误差,优于传统语义引导策略。

Comments This is the version of the article accepted for publication in SUMMA 2025 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/SUMMA68668.2025.11302248

Journal ref 2025 7th International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA), Lipetsk, Russian Federation, 2025, pp. 799-804

详情
AI中文摘要

自动从科学图表中提取数据是大规模文献分析的关键任务。尽管多模态大语言模型(LLMs)展现出潜力,但其在非标准化图表上的准确性仍面临挑战。本文探讨了两种策略:高阶语义提示和低阶空间提示。我们发现,基于网格的提示方法在合成数据集上显著降低了数据提取误差(SMAPE从25.5%降至19.5%,p < 0.05),优于传统语义方法。结论表明,对于当前多模态模型而言,提供明确的空间上下文比高阶语义指导更有效且可靠。

英文摘要

The automated extraction of data from scientific charts is a critical task for large-scale literature analysis. While multimodal Large Language Models (LLMs) show promise, their accuracy on non-standardized charts remains a challenge. This raises a key research question: which strategy improves model performance more effectively, high-level semantic priming or low-level spatial priming? This paper presents a comparative investigation of these two distinct strategies. We describe our exploratory experiments with semantic methods, such as a two-stage metadata-first framework and Chain-of-Thought, which failed to produce a statistically significant improvement. In contrast, we present a simple but highly effective spatial priming method: overlaying a coordinate grid onto the chart image before analysis. Our quantitative experiment on a synthetic dataset demonstrates that this grid-based approach provides a statistically significant reduction in data extraction error (SMAPE reduced from 25.5% to 19.5%, p < 0.05) compared to a baseline. We conclude that for the current generation of multimodal models, providing explicit spatial context is a more effective and reliable strategy than high-level semantic guidance for this class of tasks.
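To make the two ingredients of this abstract concrete, the sketch below shows one common definition of SMAPE and a minimal grid overlay on a grayscale image stored as a list of rows. Both are assumptions for illustration: the abstract does not state which SMAPE variant the authors use, nor how their grid is drawn or labelled, and the function names here are hypothetical.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent.

    Uses the variant with (|a| + |p|) / 2 in the denominator, so values
    range from 0 to 200. A pair of zeros is counted as zero error.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100.0 * total / len(actual)


def overlay_grid(image, step, value=0):
    """Return a copy of a 2D grayscale image with grid lines every `step` pixels.

    `image` is a list of rows of pixel intensities; grid pixels are set to
    `value`, all other pixels are copied unchanged.
    """
    return [
        [value if (r % step == 0 or c % step == 0) else pixel
         for c, pixel in enumerate(row)]
        for r, row in enumerate(image)
    ]
```

In practice the grid would be drawn on the rendered chart with an imaging library and annotated with coordinate labels, and the extracted series would be compared against ground truth with `smape` for each chart.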

2605.08218 2026-05-12 cs.LG cs.CV 版本更新

Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

深度梦境由此构成:在扩散模型中可视化单义特征

Adam Szokalski, Mateusz Modrzejewski

AI总结 本文提出通过优化的潜在可视化(LVO)技术,扩展了用于卷积神经网络的特征可视化方法至潜在扩散模型。通过稀疏自编码器(SAEs)将多义层表示分解为单义特征,展示了在Stable Diffusion 1.5模型上可视化可识别概念的效果,相比基线方法更清晰且具有相关性。

详情
AI中文摘要

本文提出潜在可视化通过优化(LVO),一种机制可解释性技术,将最初为卷积神经网络开发的特征可视化通过优化方法扩展到潜在扩散模型。LVO利用稀疏自编码器(SAEs)将多义层表示分解为单义特征。关键贡献包括潜在空间优化、时间步活动分析、匹配计划的噪声注入、通过特征引导的先验初始化以及合适的正则化策略。我们在Stable Diffusion 1.5模型上进行Style50数据集微调,展示了SAE特征能产生清晰的可识别概念可视化,如对角线构图、人物、玫瑰、电缆和瀑布泡沫,这些与数据集示例相关联,而无解纠缠的基线方法则产生不连贯的结果。我们进一步表明,从像素空间特征可视化转移来的正则化技术也适用于潜在域,尽管需要为原始层和SAE变体进行不同的配置。与数据集示例和引导相比,LVO通过直接揭示激活特征的内容而非其下游效果提供互补见解。

英文摘要

This paper proposes latent visualization by optimization (LVO), a mechanistic interpretability technique that extends feature visualization by optimization - originally developed for convolutional neural networks - to latent diffusion models. LVO employs sparse autoencoders (SAEs) to disentangle polysemantic layer representations into monosemantic features. Key contributions include latent-space optimization, time-step activity analysis, schedule-matched noise injection, prior initialization through feature steering, and suitable regularization strategies. We demonstrate the method on Stable Diffusion 1.5 fine-tuned on the Style50 dataset, showing that SAE features produce clear visualizations of recognizable concepts - including diagonal compositions, human figures, roses, cables, and waterfall foam - that correlate with dataset examples, while the baseline without disentanglement produces less coherent results. We further show that regularization techniques from pixel-space feature visualization transfer to the latent domain, though they require different configurations for the raw-layer and SAE variants. Compared to dataset examples and steering, LVO provides complementary insights by directly revealing what activates a feature rather than its downstream effects.

2605.08213 2026-05-12 cs.CV 版本更新

Low-Cost Stereo Vision for Robust 3D Positioning of Thin Radiata Pine Branches in Autonomous Drone Pruning

低成本立体视觉用于自主无人机修剪中薄辐射松枝的稳健三维定位

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

发表机构 * Centre of Data Science and Artificial Intelligence & School of Engineering and Computer Science(数据科学与人工智能中心及工程与计算机科学学院)

AI总结 本文研究利用低成本立体相机实现自主无人机修剪薄枝的三维定位,通过分支分割与深度估计方法,解决森林场景中纹理稀疏、结构细长和噪声干扰等问题。

详情
AI中文摘要

辐射松是对新西兰林业具有重要经济价值的树种,其人工修剪危险、劳动强度大,且日益受到劳动力短缺的制约。现有自主修剪平台通常依赖LiDAR等昂贵传感器,且仅适用于粗枝,限制了其广泛应用。本文研究安装在无人机上的单个低成本立体相机能否提供足够准确的枝条检测和三维定位,以支持对细至10 mm的枝条进行自主修剪。所提流程包括两个阶段:枝条分割和深度估计。对于分割,比较了Mask R-CNN变体和YOLOv8、YOLOv9家族在自定义数据集上的表现;YOLOv8和YOLOv9被选为数据采集时具有代表性的最先进实时分割器,框架设计保持与新版本YOLO兼容。对于深度估计,评估了传统方法(SGBM加WLS滤波)和深度学习方法(PSMNet、ACVNet、GWCNet、MobileStereoNet、RAFT-Stereo和NeRF-Supervised Deep Stereo),包括跨数据集微调实验,揭示了城市驾驶基准与自然林业场景之间的领域差距。本文的主要创新点在于将立体分割与基于质心的三角化算法和中位数绝对偏差(MAD)异常值剔除相结合,将分割掩码和视差图转换为单一稳健的枝条到相机距离,以应对森林场景中纹理稀疏、结构细长和视差值噪声大的挑战。在1-2米距离的定性评估中,基于学习的立体方法产生更连贯的深度估计...

英文摘要

Manual pruning of radiata pine, a species of major economic importance to New Zealand forestry, is hazardous, labour-intensive, and increasingly constrained by workforce shortages. Existing autonomous pruning platforms typically rely on expensive sensors such as LiDAR and are limited to thick branches, which restricts their wider adoption. This paper investigates whether a single low-cost stereo camera mounted on a drone can provide sufficiently accurate branch detection and three-dimensional positioning to support autonomous pruning of branches as thin as 10 mm, thereby removing the need for auxiliary depth sensors. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, Mask R-CNN variants and the YOLOv8 and YOLOv9 families are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera; YOLOv8 and YOLOv9 are selected as representative state-of-the-art real-time segmentors at the time of data collection, and the framework is designed to remain compatible with newer YOLO releases. For depth estimation, a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated, including cross-dataset fine-tuning experiments that expose the domain gap between urban driving benchmarks and natural forestry scenes. The main novelty of this work lies in coupling stereo segmentation with a centroid-based triangulation algorithm and Median-Absolute-Deviation outlier rejection that converts a segmentation mask and disparity map into a single robust branch-to-camera distance, addressing the challenges of sparse texture, thin structures, and noisy disparity values typical of forest scenes. Qualitative evaluations at distances of 1-2 m show that the learning-based stereo methods produce more coherent depth es...
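The step that turns a segmentation mask and a disparity map into one robust distance can be illustrated with a small sketch. It applies Median-Absolute-Deviation rejection to the disparities inside the mask and then the standard rectified-stereo relation Z = f * B / d. Note the assumptions: the paper's centroid-based triangulation is not reproduced here (a median over the inliers is used instead), and the threshold `k` and the function name are hypothetical.

```python
from statistics import median


def branch_distance(mask, disparity, focal_px, baseline_m, k=3.0):
    """Estimate one branch-to-camera distance in metres.

    `mask` and `disparity` are lists of rows with the same shape. Pixels
    with mask == 0 or non-positive disparity are ignored. Returns None if
    no valid pixel remains.
    """
    values = [
        d
        for mask_row, disp_row in zip(mask, disparity)
        for m, d in zip(mask_row, disp_row)
        if m and d > 0
    ]
    if not values:
        return None

    med = median(values)
    mad = median(abs(v - med) for v in values)
    # 1.4826 * MAD approximates the standard deviation for Gaussian data.
    scale = 1.4826 * mad
    # With zero spread there is nothing to reject, so keep every value.
    inliers = [v for v in values if scale == 0 or abs(v - med) <= k * scale]

    robust_disparity = median(inliers)
    # Rectified stereo geometry: depth = focal length * baseline / disparity.
    return focal_px * baseline_m / robust_disparity
```

The median and MAD are preferred over mean and standard deviation here because a few wildly wrong disparities, which are common on thin structures against sky or foliage, would otherwise dominate the estimate.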

2605.08210 2026-05-12 cs.CV 版本更新

Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation

谐化特征条件与频率提示个性化用于多评分者医学分割

Sanaz Karimijafarbigloo, Armin Khosravi, Alireza Kheyrkhah, Reza Azad, Mauricio Reyes, Dorit Merhof

发表机构 * Faculty of Informatics and Data Science, University of Regensburg(雷根斯堡大学信息学与数据科学学院) Sharif University of Technology(谢里夫理工大学) Iran University of Science and Technology(伊朗科学技术大学) ARTORG Center for Biomedical Research, University of Bern(伯尔尼大学ARTORG生物医学研究中心) Department of Radiation Oncology, University Hospital Bern, University of Bern(伯尔尼大学附属医院放射肿瘤科) Fraunhofer Institute for Digital Medicine MEVIS, Bremen(弗劳恩霍夫数字医学研究所MEVIS,不莱梅)

AI总结 本文提出谐化概率框架,通过自适应特征条件和频域个性化分离成像伪影与标注者差异,提升多评分者医学分割的准确性和可解释性。

Comments Accepted in main CVPR 2026

详情
AI中文摘要

多评分者医学图像分割捕捉了临床解释的固有模糊性,其中诊断边界在专家和成像设备间变化。现有方法常将这种多样性归为共识标签或视为噪声,导致模型过于自信且校准不佳。我们提出谐化概率框架,通过自适应特征条件和频域个性化分离成像伪影与真实标注者差异。轻量级Harmonizer网络隐式建模设备特定伪影,并进行动态特征调制以标准化潜在表示,确保不确定性反映解剖而非噪声。为表示评分者特定风格,我们引入新型高频提示模块,在频域内编码标注者依赖的边界精度和纹理敏感性。这些提示自适应地调节谐化特征以产生个性化但解剖一致的分割。此外,基于广义能量距离的正则化对齐生成分布与经验标注差异,促进专家分歧处的多样性与共识处的一致性。在LIDC-IDRI和NPC-170上的实验显示,模型在聚合和个性化分割上均达到SOTA,具有显著的GED减少和改进的Dice分数,尤其在噪声情况下。除准确性外,模型表现出临床有意义的不确定性。置信度在一致区域升高,在模糊区域下降,支持其作为可靠且可解释的多专家临床工作流程工具。

英文摘要

Multi-rater medical image segmentation captures the inherent ambiguity of clinical interpretation, where diagnostic boundaries vary across experts and imaging devices. Existing approaches often reduce this diversity to consensus labels or treat rater differences as noise, resulting in overconfident and poorly calibrated models. We propose a harmonized probabilistic framework that disentangles acquisition artifacts from genuine annotator variability through adaptive feature conditioning and frequency-domain personalization. A lightweight Harmonizer Network implicitly models scanner-specific artifacts and performs dynamic feature modulation to standardize latent representations, ensuring that uncertainty reflects anatomy rather than noise. To represent rater-specific styles, we introduce novel High-Frequency Prompt Modules that operate in the spectral domain to encode annotator-dependent boundary precision and textural sensitivity. These prompts adaptively modulate harmonized features to produce personalized yet anatomically consistent segmentations. Furthermore, a regularization based on the Generalized Energy Distance (GED) aligns the generative distribution with empirical annotation variability, promoting diversity where experts disagree and consensus where they converge. Experiments on LIDC-IDRI and NPC-170 show state-of-the-art aggregated and individualized segmentation, with notable GED reductions and improved Dice scores, especially on noisy cases. Beyond accuracy, the model exhibits clinically meaningful uncertainty. Confidence rises in agreement regions and declines in ambiguous areas, supporting its use as a reliable and interpretable tool for multi-expert clinical workflows.
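For readers unfamiliar with the Generalized Energy Distance mentioned above, the following sketch computes its squared form as commonly used to evaluate multi-rater segmentation: GED² = 2·E[d(S, Y)] − E[d(S, S′)] − E[d(Y, Y′)], with d = 1 − IoU. This is the usual evaluation-metric form, written under the assumption that masks are flat lists of 0/1; how the paper turns it into a differentiable regularizer is not specified in the abstract and is not shown here.

```python
from itertools import product


def iou_distance(a, b):
    """Distance 1 - IoU between two binary masks given as flat 0/1 lists.

    Two empty masks are treated as identical (distance 0).
    """
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if union == 0 else 1.0 - intersection / union


def mean_distance(xs, ys):
    """Average pairwise distance over all (x, y) pairs, self-pairs included."""
    pairs = list(product(xs, ys))
    return sum(iou_distance(x, y) for x, y in pairs) / len(pairs)


def ged_squared(samples, annotations):
    """Squared Generalized Energy Distance between model samples and
    expert annotations, both given as lists of binary masks."""
    return (
        2.0 * mean_distance(samples, annotations)
        - mean_distance(samples, samples)
        - mean_distance(annotations, annotations)
    )
```

The first term rewards samples that match the annotations, while the two subtracted terms account for the diversity within each set, so a model is not penalized for disagreeing with itself in the same way the experts disagree.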

2605.08207 2026-05-12 cs.CV 版本更新

A Breast Vision Pathology Foundation Model for Real-world Clinical Utility

面向真实世界临床应用的乳腺视觉病理基础模型

Yingxue Xu, Zhengyu Zhang, Xiuming Zhang, Mengwei Xu, Fengtao Zhou, Yihui Wang, Jiabo Ma, Yi Xin, Danyi Li, Chengyu Lu, Zhijian Cen, Ying Tan, Qingbing Yao, Qi Wang, Zizhao Gao, Yong Zhang, Jingjing Chen, Feifei Liu, Qian Xu, Yi Dai, Hongxuan Tan, Cheng Jin, Huajun Zhou, Zhengrui Guo, Ling Liang, Hongyi Wang, Yingcong Chen, Xi Wang, Zhenhui Li, Ronald Cheong Kin Chan, Ning Mao, Muyan Cai, Zhe Wang, Li Liang, Hao Chen

发表机构 * Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China(香港科技大学计算机科学与工程系) Department of Pathology, Nanfang Hospital, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China(南方医科大学南方医院病理科、基础医学院,广州) Department of Pathology, The First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China(浙江大学医学院附属第一医院病理科,杭州) State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers, Department of Pathology, School of Basic Medicine and Xijing Hospital, Fourth Military Medical University, Xi’an, China(胃肠癌整体整合管理国家重点实验室,第四军医大学基础医学院及西京医院病理科,西安)

AI总结 本文提出BRAVE模型,通过101638张乳腺全切片图像评估其在乳腺癌诊断中的临床实用性,展示了在不同阶段的病理评估中提升诊断效率和准确性的能力。

Comments 60 pages

详情
AI中文摘要

病理科基础模型在回顾性表现上表现出色,但其是否能支持临床相关应用仍存疑。乳腺癌中,病理评估是诊断的金标准,指导治疗规划、手术决策和风险分层。本文提出BRAVE,一个乳腺适应性病理科基础模型,利用来自亚洲、欧洲和北美的32个来源的101,638张乳腺全切片图像进行开发和评估。BRAVE在82个队列中进行了34项任务评估,包括术前活检、术中冰冻切片和术后切除。通过回顾性基准测试、临床挑战场景、以工作流程为导向的临床影响模拟、前瞻性观察验证(阈值锁定在回顾性队列中)和病理医生-AI交互研究,BRAVE在临床工作流程中发挥了实用作用,包括安全排除低风险病例、AI辅助的二次审查挽救最初遗漏的阳性病例以及优先处理进一步评估的病例。在三个中心的前瞻性验证中,BRAVE排除了76.9%的阴性活检病例(NPV 0.953)和70.1%的阴性冰冻切片病例(NPV 0.973),并将78.8%的术后亚型病例归类为高置信度明确病例(NPV 1.000)。在读者研究中,AI辅助将平衡准确率从88.5%提高到95.1%(OR 3.14,P<0.001),效率、信心和评分者间的一致性均有所提升。BRAVE衍生的评分也独立预测了无病生存率(调整后HR 4.79,P<0.001)和总生存率(调整后HR 8.14,P<0.001)

英文摘要

Pathology foundation models have shown strong retrospective performance, but whether such systems can support clinically relevant use remains unclear. This challenge is particularly important in breast cancer, where pathological assessment serves as the gold standard for diagnosis and guides treatment planning, surgical decision-making and risk stratification across pre-, intra- and post-operative stages. Here we present BRAVE, a breast-adaptive pathology foundation model developed and evaluated using a total resource of 101,638 breast whole-slide images from 32 sources across Asia, Europe and North America. We assessed BRAVE across 34 tasks in 82 cohorts spanning pre-operative biopsy, intra-operative frozen section and post-operative resection, using an evidence chain comprising retrospective benchmarking, clinically challenging scenarios, workflow-oriented clinical impact simulations, prospective observational validation with the thresholds locked in the retrospective cohorts and crossover pathologist-AI interaction studies. Across these settings, BRAVE supported practical roles in the clinical workflow, including safe exclusion of low-risk cases from routine review, AI-assisted second-review rescue of initially missed positives and prioritization of cases for further assessment. In prospective validation across three centres, BRAVE excluded 76.9% of negative biopsy cases (NPV 0.953) and 70.1% of negative frozen-section cases (NPV 0.973), and triaged 78.8% of post-operative subtyping cases as high-confidence clear-cut cases (NPV 1.000). In reader studies, AI assistance improved balanced accuracy from 88.5% to 95.1% (OR 3.14, P<0.001), with better efficiency, confidence and inter-rater agreement. BRAVE-derived scores also independently predicted disease-free survival (adjusted HR 4.79, P<0.001) and overall survival (adjusted HR 8.14, P<0.001).

2605.08201 2026-05-12 cs.LG cs.AI cs.CV 版本更新

Weakly Supervised Concept Learning for Object-centric Visual Reasoning

弱监督概念学习用于以对象为中心的视觉推理

Sparsh Tiwari, Bettina Finzel, Gesina Schwalbe

发表机构 * University of Lübeck(吕贝克大学) University of Bamberg(班贝格大学) University of Ulm(乌尔姆大学)

AI总结 本文提出一种高效的弱监督方案,结合槽式架构和VAE实现对象中心推理任务中的符号 grounding,减少监督至1%并提升域泛化能力。

详情
AI中文摘要

神经符号系统旨在结合深度神经网络对原始传感器输入的处理与符号人工智能的少样本性能。两阶段方法显式解耦基于DNN的感知与后续基于规则的推理。这避免了端到端可微方法的优化和可解释性问题,但需要昂贵的感知输出标签。本文介绍了一种高效的弱监督方案,用于感知阶段以将输出符号 grounding 用于逻辑归纳的对象中心推理任务。它结合了用于对象中心性的槽式架构和变分自编码器(VAE)进行自监督,与概念指导在潜在维度上竞争,以实现人类可解释的 grounding。所得到的预测被转换为符号背景知识,供推理框架使用,如归纳逻辑编程(ILP)、决策树和贝叶斯网络。我们在合成和现实世界数据集上的广泛实证评估表明,我们的方法能够发现复杂的抽象规则用于对象中心推理,同时将监督减少到标签的1%,并且即使在显著的领域转移下也表现出鲁棒性。值得注意的是,在1%的监督下,它甚至在领域泛化上超过了最先进的基础模型基线。

英文摘要

Neurosymbolic systems promise to combine the ability of deep neural networks (DNNs) to process raw sensor inputs with the few-shot performance of symbolic artificial intelligence. Two-stage approaches explicitly decouple DNN-based perception from subsequent rule-based reasoning. This avoids the optimization and interpretability issues of end-to-end differentiable approaches, but requires costly labels for the perception output. This paper introduces an efficient weak supervision scheme for the perception stage to ground its output symbols for logical induction in object-centric reasoning tasks. It combines a slot-based architecture for object-centricity with a Variational Autoencoder (VAE) for self-supervision, competing with concept guidance on latent dimensions for human-interpretable grounding. The resulting predictions are translated into symbolic background knowledge for reasoning frameworks, such as Inductive Logic Programming (ILP), Decision Trees, and Bayesian Networks. Our extensive empirical evaluation on synthetic and real-world datasets shows that our approach can discover complex, abstract rules for object-centric reasoning while reducing supervision to as little as 1% of labels, and remains robust even under substantial domain shift. Notably, at 1% supervision it even outperforms state-of-the-art foundation model baselines in domain generalization.

2605.08200 2026-05-12 cs.AI cs.CV cs.LG 版本更新

Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

在视觉-语言模型中可靠性在哪里存在:注意力、隐藏状态和因果回路的机制研究

Logan Mann, Ajit Saravanan, Ishan Dave, Shikhar Shiromani, Saadullah Ismail, Yi Xia, Emily Huang

发表机构 * UC Santa Barbara(加州大学圣巴巴拉分校) UC Berkeley(加州大学伯克利分校) NVIDIA(英伟达) Algoverse AI Research(Algoverse人工智能研究) Brown University(布朗大学)

AI总结 本文通过机制性研究发现,视觉-语言模型的可靠性主要体现在隐藏状态几何、分层边际形成和稀疏晚层回路,而非注意力图的锐度。

Comments 15 pages, 4 figures, 10 tables. Accepted at the ICLR 2026 Workshop on Multimodal Reasoning. Code and probe-training pipelines: https://github.com/itsloganmann/VLM-Reliability-Probe

详情
AI中文摘要

一种普遍的直觉认为,视觉-语言模型(VLMs)在注意力图看起来锐利时最为可信:集中在被查询区域的注意力应意味着自信且校准良好的回答。我们直接检验了这一“注意力-信心假设”。我们用统一的机制分析流程VLM可靠性探针(VRP)对三个开放权重VLM家族(LLaVA-1.5、PaliGemma、Qwen2-VL;3-7B参数)进行插桩分析,将注意力结构、生成动态和隐藏状态几何分别与同一个正确性标签进行对照。得到三个结果。(i)注意力结构对正确性几乎没有预测力(R_pb(C_k,y)=0.001,95% CI [-0.034,0.036];R_pb(H_s,y)=-0.012,[-0.047,0.024],基于合并的n=3,090划分),尽管注意力对特征提取仍具有因果必要性(遮蔽前30%的图像块使准确率下降8.2-11.3个百分点,p<0.001)。(ii)可靠性在计算的较后阶段才变得可读:单个隐藏状态线性探针在POPE上对三个家族中的两个达到AUROC>0.95,而K=10的自一致性是我们测得的最强行为预测因子,代价是10倍推理成本(R_pb=0.43)。(iii)神经元级因果消融揭示出明显的架构分化,对监控器设计有直接影响:晚期融合的LLaVA将可靠性集中在脆弱的晚层瓶颈中(消融前5个探针神经元后,物体识别准确率下降8.3个百分点),而早期融合的PaliGemma和Qwen2-VL则将其广泛分布,即使破坏约50%的峰值层隐藏维度,性能下降也不超过1个百分点。结论范围有限但影响重大:在3-7B的VLM中,从隐藏状态几何、逐层间隔(margin)形成和稀疏的晚层回路读取可靠性,比从注意力图的锐度读取更为可靠。

英文摘要

A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention-Confidence Assumption directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) with a unified mechanistic pipeline -- the VLM Reliability Probe (VRP) -- that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top-30% patch masking drops accuracy by 8.2-11.3 pp, p<0.001). (ii) Reliability becomes legible later in the computation: a single hidden-state linear probe reaches AUROC>0.95 on POPE for two of three families, and self-consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron-level ablations expose a sharp architectural split with direct monitor-design implications: late-fusion LLaVA concentrates reliability in a fragile late bottleneck (-8.3 pp object-identification accuracy after top-5 probe-neuron ablation), whereas early-fusion PaliGemma and Qwen2-VL distribute it widely and absorb destruction of ~50% of their peak-layer hidden dimension with <=1 pp degradation. The takeaway is narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness.
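The R_pb values quoted above appear to be point-biserial correlations between a continuous signal (for example an attention statistic) and the binary correctness label. A minimal sketch of that statistic follows, assuming labels are coded 0/1; the function name is hypothetical and nothing here reproduces the VRP pipeline itself.

```python
from math import sqrt


def point_biserial(x, y):
    """Point-biserial correlation between continuous x and binary y (0/1).

    Equivalent to the Pearson correlation when y is coded as 0/1.
    """
    n = len(x)
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    if not ones or not zeros:
        raise ValueError("both classes must be present")

    mean = sum(x) / n
    std = sqrt(sum((xi - mean) ** 2 for xi in x) / n)  # population std
    if std == 0:
        raise ValueError("x has zero variance")

    mean_correct = sum(ones) / len(ones)
    mean_wrong = sum(zeros) / len(zeros)
    class_balance = sqrt(len(ones) * len(zeros) / n ** 2)
    return (mean_correct - mean_wrong) / std * class_balance
```

A value near zero, as reported for the attention statistics, means that the average signal is almost the same for correct and incorrect answers once its overall spread is taken into account.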

2605.08196 2026-05-12 cs.CV 版本更新

Survey on Disaster Management Datasets for Remote Sensing Based Emergency Applications

基于遥感应急应用的灾害管理数据集综述

Alain P. Ndigande, Josiah Wiggins, Sedat Ozer

发表机构 * Ozer Lab, Dept. of Computer Science, Ozyegin University(厄兹耶因大学计算机科学系Ozer实验室) Dept. of Electrical and Computer Engineering, California State Polytechnic University, Pomona(加州州立理工大学波莫纳分校电气与计算机工程系)

AI总结 本文综述了用于机器学习和深度学习灾害管理流程的公开图像数据集,重点在于支持遥感任务的高质量数据集,以促进灾害响应解决方案的快速开发与部署。

Comments This work has been accepted for publication at IEEE Transactions on Geoscience and Remote Sensing

详情
AI中文摘要

近期自然灾害凸显了高效数据驱动灾害管理的紧迫需求。机器学习(ML)和深度学习(DL)技术在灾害管理的关键阶段,包括缓解、准备、检测、响应和恢复中显示出巨大潜力。然而,成功的ML或DL应用的关键在于可访问性和质量的注释数据集。随着无人机(UAV)和卫星高分辨率影像的日益可用,计算机视觉和遥感算法已成为灾害场景中快速检测、态势评估和决策的重要工具。本文全面概述了与ML/DL灾害管理流程相关的公开图像数据集,强调支持遥感任务的数据集,涵盖灾害事件的所有阶段,包括灾前、灾中和灾后。本文旨在为寻求高质量数据集以快速开发和部署遥感驱动灾害响应解决方案的研究人员和实践者提供集中参考。

英文摘要

Recent natural disasters have highlighted the urgent need for efficient data-driven approaches to disaster management. Machine learning (ML) and deep learning (DL) techniques have shown considerable promise in enhancing the key phases of disaster management including mitigation, preparedness, detection, response, and recovery. A critical enabler of successful ML or DL based applications in remote sensing, however, is the accessibility and quality of annotated datasets. With the growing availability of high-resolution imagery from unmanned aerial vehicles (UAVs) and satellites, computer vision and remote sensing algorithms have become essential tools for rapid detection, situational assessment, and decision-making in disaster scenarios. This survey provides a comprehensive overview of publicly available image-based datasets relevant to ML/DL-based disaster management pipelines. Emphasis is placed on datasets that support computer vision and remote sensing tasks across all phases of disaster events including pre-disaster, during, and post-disaster. The goal of this work is to serve as a centralized reference for researchers and practitioners seeking high-quality datasets for rapid development and deployment of remote sensing-driven disaster response solutions.

2605.08191 2026-05-12 cs.CV cs.AI 版本更新

A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing

通过协同平滑实现稳健的分布外检测框架

Maria Stoica, Abdelrahman Hekal, Alessio Lomuscio

发表机构 * Imperial College London(伦敦帝国学院) Zeroth Research(Zeroth研究)

AI总结 本文提出ROSS框架,通过基线分数的不稳定性区分分布内和分布外样本,实现对对抗攻击的强鲁棒性,实验表明其在多个数据集上表现优异。

Comments Accepted to CVPR Findings 2026

详情
AI中文摘要

可靠的分布外(OOD)检测是安全部署机器学习系统的关键要求。尽管近期有进展,最先进的OOD检测器对对抗攻击高度敏感,这影响了其在自动化系统中的可信度。为了解决这一漏洞,我们应用中位数平滑处理基线OOD检测分数,平衡干净和对抗准确率。我们的关键见解是,用于中位数平滑生成的噪声样本可以被重新利用来量化基线分数的局部不稳定性。我们观察到,OOD样本在扰动下表现出更高的不稳定性。基于此,我们提出ROSS,一种新颖且稳健的后处理OOD检测器,利用基线分数的不稳定性进一步区分ID和OOD样本。ROSS实现了对称鲁棒性,对分数最小化和最大化攻击均表现强劲,不同于先前工作。这种对称防御导致了最先进的鲁棒性,比先前方法高出多达40 AUROC点。我们在CIFAR-10、CIFAR-100和ImageNet上进行了广泛的实验。代码可在:https://github.com/Abdu-Hekal/ROSS获取。

英文摘要

Reliable out-of-distribution (OOD) detection is a critical requirement for the safe deployment of machine learning systems. Despite recent progress, state-of-the-art OOD detectors are highly susceptible to adversarial attacks, which undermines their trustworthiness in automated systems. To address this vulnerability, we apply median smoothing to baseline OOD detection scores, balancing clean and adversarial accuracies. Our key insight is that the noisy samples generated for median smoothing can be repurposed to quantify the local instability of the base score. We observe that OOD samples exhibit higher instability under perturbation. Based on this, we propose ROSS, a novel and robust post-hoc OOD detector that leverages the instability of baseline scores to further distinguish between in-distribution (ID) and OOD samples. ROSS achieves symmetric robustness, performing strongly against both score-minimising and score-maximising attacks, unlike prior work. This symmetric defence leads to state-of-the-art robustness, outperforming prior methods by up to 40 AUROC points. We demonstrate ROSS's effectiveness on extensive experiments across CIFAR-10, CIFAR-100, and ImageNet. Code is available at: https://github.com/Abdu-Hekal/ROSS.
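The key observation in this abstract, that the noisy samples drawn for median smoothing also reveal how unstable the base score is, can be sketched as below. This is a rough illustration only: the Gaussian noise model, the interquartile range as the instability measure, and the function name are assumptions, and the way ROSS actually combines the two quantities into a final detector score is not described in the abstract.

```python
import random
from statistics import median


def smoothed_score_and_instability(score_fn, x, sigma=0.1, n_samples=101, seed=0):
    """Return (median-smoothed score, instability) for input vector `x`.

    `score_fn` maps a list of floats to a scalar OOD score. The same noisy
    copies of `x` are used for both outputs, so instability comes at no
    extra model evaluations.
    """
    rng = random.Random(seed)
    scores = sorted(
        score_fn([xi + rng.gauss(0.0, sigma) for xi in x])
        for _ in range(n_samples)
    )
    smoothed = median(scores)
    # Interquartile range of the noisy scores as a simple instability proxy.
    instability = scores[(3 * n_samples) // 4] - scores[n_samples // 4]
    return smoothed, instability
```

Under the abstract's observation, inputs whose scores swing widely under perturbation would be more likely to be OOD, so the second return value carries information that the median alone discards.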

2605.08188 2026-05-12 cs.CV cs.AI 版本更新

Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

受神经科学启发的多模态Transformer中视觉趣味性分析

Mathis Immertreu, Fitim Abdullahu, Thomas Kinfe, Helmut Grabner, Patrick Krauss, Achim Schilling

发表机构 * Cognitive Computational Neuroscience Group, Pattern Recognition Lab, University Erlangen-Nürnberg(认知计算神经科学组、模式识别实验室、埃尔兰根-纽伦堡大学) IDS Institut für Data Science, ZHAW School of Engineering, Winterthur, Switzerland(IDS数据科学研究所、ZHAW工程学院、温特图尔,瑞士) Mannheim Center for Neuromodulation and Neuroprosthetics, University Hospital Mannheim, Heidelberg University(曼海姆神经调制与神经假体中心、曼海姆大学医院、海德堡大学) BGU Ludwigshafen, Germany(BGU路德维希港,德国) Physics and Cognition Group, MCNN, University Hospital Mannheim, Heidelberg University(物理与认知组、MCNN、曼海姆大学医院、海德堡大学) NeuroAI and BCI Group, MCNN, University Hospital Mannheim, Heidelberg University(神经AI与BCI组、MCNN、曼海姆大学医院、海德堡大学)

AI总结 研究通过多模态视觉语言模型Qwen3-VL-8B探讨视觉趣味性编码机制,发现CI信息可从最终层嵌入中线性解码,揭示了无监督下视觉趣味性的结构化编码。

详情
AI中文摘要

人类注意力是意识感知、记忆和决策的门户,然而其在现代Transformer模型中的作用尚不明确。本文借助基于照片分享平台Flickr上大规模用户互动数据预先定义的Common Interestingness (CI)评分,分析多模态视觉语言模型Qwen3-VL-8B中的内部表示,发现CI信息可从最终层嵌入中线性解码,表明其与源自人类的视觉趣味性度量一致。降维和Generalized Discrimination Value (GDV)分析显示,与CI相关的隐藏表示在中间的视觉Transformer层中出现,并在语言模型各层中逐渐变得更易区分。通过几何、探针和稀疏自编码器方法得到的概念向量在高层趋于一致,表征相似性分析证实了这一点。这表明视觉趣味性在没有显式监督的情况下具有稳健且结构化的编码。未来工作将寻找人类大脑动态与Transformer架构之间的共同计算原理,以揭示生物和人工系统中注意力和兴趣的组织机制。

英文摘要

Human attention is the gateway to conscious perception, memory and decision-making. However, its role in modern transformer models remains largely unexplored. As these systems increasingly influence what people see, prefer and buy, the question arises as to whether they encode principles of human interest or merely exploit large-scale correlations. Addressing this issue is crucial for understanding cognition and ensuring the responsible use of AI in communication and marketing. To this end, we examined the concept of visual interest within the multimodal vision-language model Qwen3-VL-8B, using a pre-defined Common Interestingness (CI) score derived from large-scale human engagement data on the photo-sharing platform Flickr. We analyzed internal representations across vision and language components using methods from the neurosciences. Our analyses revealed that CI information is linearly decodable from final-layer embeddings, indicating that it is aligned with human-derived measures of visual interestingness. Dimensionality reduction and Generalized Discrimination Value (GDV) analyses demonstrate that CI-related hidden representations emerge in intermediate vision transformer layers and become progressively more distinguishable across language model layers. Concept vectors derived using geometric, probe-based, and Sparse Auto-Encoder-based methods converge in higher layers, as confirmed by representational similarity analysis. This indicates a robust and structured encoding of visual interestingness without explicit supervision. Future work will seek to identify shared computational principles linking human brain dynamics and transformer architectures, with the ultimate goal of uncovering the organizing mechanisms that give rise to attention and interest in both biological and artificial systems.

2605.08183 2026-05-12 cs.CV cs.LG 版本更新

Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery

稀疏性受损:简单的线性适配器可提升通用类别发现

Bo Ye, Kai Gan, Tong Wei, Min-Ling Zhang

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China(计算机网络与信息集成重点实验室(东南大学),教育部,中国)

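AI summary: This paper proposes LAGCD, which embeds a residual linear adapter into each ViT block to address the limited flexibility and overfitting of conventional methods, improving generalized category discovery performance.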

Comments Submitted to IEEE TPAMI

详情
AI中文摘要

通用类别发现(GCD)旨在从未标记数据中识别新类别,同时保持已见类别的分类能力。传统GCD方法通常利用预训练模型的可转移表示,通过部分微调(仅更新最终ViT块)和视觉提示微调(向输入附加可学习向量)进行适应。然而,传统部分微调缺乏灵活性,因为它无法适应整个模型;同时,视觉提示微调容易过拟合,因其对初始化敏感且容量受限。为了解决这些限制,我们提出LAGCD,一种简单而有效的GCD方法,将残差线性适配器嵌入每个ViT块。从特征稀疏性的角度,我们系统地表明传统适配器中的非线性会损害性能,而我们的线性适配器通过使模型容量更灵活而提升性能。我们进一步引入辅助分布对齐损失,以减轻已见类别和新类别之间偏置预测的负面影响。在通用和细粒度数据集上的广泛实验证实,LAGCD在许多复杂基线上一致提升性能。源代码可在https://github.com/yebo0216best/LAGCD获取。

英文摘要

Generalized Category Discovery (GCD) seeks to identify novel categories from unlabeled data while retaining the classification ability of seen categories. Prior GCD methods commonly leverage transferable representations from pre-trained models, adapting to downstream datasets via partial fine-tuning (updating only the final ViT block) and visual prompt tuning (appending learnable vectors to inputs). However, conventional partial fine-tuning offers limited flexibility, as it fails to adapt the entire model; meanwhile, visual prompt tuning is prone to overfitting, due to its sensitivity to initialization and inherently constrained capacity. To address these limitations, we propose LAGCD, a simple yet effective GCD approach that embeds a residual linear adapter into each ViT block. From the perspective of feature sparsity, we systematically show that non-linearity in conventional adapters impairs performance, whereas our linear adapter enhances it by enabling more flexible model capacity. We further introduce an auxiliary distribution alignment loss to mitigate the negative impact of biased predictions between seen and novel categories. Extensive experiments on both generic and fine-grained datasets confirm that LAGCD consistently improves performance over many sophisticated baselines. The source code is available at https://github.com/yebo0216best/LAGCD
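The residual linear adapter described in the abstract above can be illustrated with a minimal pure-Python sketch. The function name `linear_adapter`, the `scale` parameter, and the zero initialisation are assumptions made for illustration; the abstract does not say where the adapter sits inside a ViT block or how it is initialised. The only property shown is that the update is linear, with no activation between the input and the residual sum.

```python
def linear_adapter(x, weight, scale=1.0):
    """Residual linear adapter: returns x + scale * (x @ weight), with no activation."""
    d = len(x)
    # Plain matrix-vector product; weight is a d x d list of lists (hypothetical layout).
    update = [sum(x[i] * weight[i][j] for i in range(d)) for j in range(d)]
    return [xi + scale * ui for xi, ui in zip(x, update)]


features = [1.0, -2.0]

# Assumption: zero-initialised weights, so the adapter starts as an identity map.
zero_w = [[0.0, 0.0], [0.0, 0.0]]
identity_out = linear_adapter(features, zero_w)

# A non-zero weight shifts the features linearly; negative values are not clipped,
# so the output stays dense, unlike after a ReLU-style non-linearity.
w = [[0.5, 0.0], [0.0, 0.5]]
adapted = linear_adapter(features, w)
```

Because nothing is clipped to zero, the negative component of `features` survives the adapter, which is the contrast with non-linear adapters that the abstract frames in terms of feature sparsity.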

2605.08181 2026-05-12 cs.CV cs.AI cs.LG 版本更新

Text-Guided Multi-Scale Frequency Representation Adaptation

基于文本的多尺度频率表示适应

Weicai Yan, Xinhua Ma, Wang Lin, Tao Jin

发表机构 * Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学)

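AI summary: This paper proposes FreqAdapter, which performs multi-scale fine-tuning of signals in the frequency domain and incorporates textual information to reduce information redundancy, improving the performance and efficiency of multimodal models.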

Comments ACL 2026 Main

详情
AI中文摘要

参数高效微调方法通过引入少量训练参数,使预训练模型能够快速适应新数据分布。尽管这些方法表现出色,但存在显著局限:首先,现有方法大多在信号空间域操作,导致信息冗余;其次,现有方法使用固定提示或适应层,未能充分考虑信号的多尺度特性。为解决这些问题,我们提出了多尺度频率适配器(FreqAdapter),整合文本信息并在频域进行多尺度信号微调。此外,我们引入多尺度适应策略以优化不同频率范围的接收场,进一步提升模型的表征能力。在多模态模型(包括CLIP和LLaVA)上的广泛实验表明,FreqAdapter显著提高了性能和效率。FreqAdapter以最小成本和在单个epoch内快速收敛的能力提升了性能。代码可在https://github.com/Kelvin-ywc/FreqAdapter获取。

英文摘要

Parameter-efficient fine-tuning methods introduce a small number of training parameters, enabling pre-trained models to adapt rapidly to new data distributions. While these methods have shown promising results, they exhibit notable limitations. First, most existing methods operate in the signal space domain, which results in substantial information redundancy. Second, most existing methods utilize fixed prompts or adaptation layers, failing to fully account for the multi-scale characteristics of signals. To address these challenges, we propose the Multi-Scale Frequency Adapter (FreqAdapter), which integrates textual information and performs multi-scale fine-tuning of signals in the frequency domain. Additionally, we introduce a multi-scale adaptation strategy to optimize receptive fields across different frequency ranges, further enhancing the model's representational capacity. Extensive experiments on multimodal models, including CLIP and LLaVA, demonstrate that FreqAdapter significantly improves both performance and efficiency. FreqAdapter improves performance with minimal cost and fast convergence within one epoch. Code is available at https://github.com/Kelvin-ywc/FreqAdapter.

2605.08175 2026-05-12 cs.CV cs.AI 版本更新

KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

KARMA-MV:音乐视频上的因果问答基准

Archishman Ghosh, Abhinaba Roy, Dorien Herremans

发表机构 * AMAAI Lab, Singapore University of Technology and Design(新加坡科技设计大学AMAAI实验室)

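AI summary: KARMA-MV is a large-scale multiple-choice QA dataset built from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and reason about visual-to-musical influence; a causal knowledge graph approach improves causal reasoning on music videos.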

详情
AI中文摘要

尽管在视频问答和跨模态理解方面取得了显著进展,但关于视觉动态如何驱动音乐结构的因果推理在音乐视频中仍被低估。我们介绍了KARMA-MV,一个从2682个YouTube音乐视频衍生出的大型多选问答数据集,旨在测试模型整合时序音频视觉线索并推理视觉到音乐影响的能力。与需要人工标注的传统数据集不同,KARMA-MV利用LLM推理进行可扩展的生成和验证,产生37737个多选问题。我们提出了一种因果知识图谱(CKG)方法,通过结构化检索跨模态依赖性增强视觉语言模型(VLMs)。在最先进的VLMs和LLMs上的实验表明,CKG基础带来了持续的提升,尤其是对较小的模型,确立了显式因果结构在音乐视频推理中的价值。KARMA-MV为超越相关性的因果音频视觉理解提供了新的基准。

英文摘要

While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding -- especially for smaller models -- establishing the value of explicit causal structure for music-video reasoning. KARMA-MV provides a new benchmark for advancing causal audio-visual understanding beyond correlation.

2605.08174 2026-05-12 cs.LG cs.AI cs.CV 版本更新

CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning

CERSA:累积能量保留子空间适应用于内存高效的微调

Jingze Ge, Xue Geng, Yun Liu, Wanqi Dong, Wang Zhe Mark, Min Wu, Ngai-Man Cheung, Bharadwaj Veeravalli, Xulei Yang

发表机构 * National University of Singapore(新加坡国立大学) Nankai University(南开大学) Singapore University of Technology and Design(新加坡科技设计大学)

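AI summary: CERSA uses singular value decomposition to retain only the principal components, reducing memory consumption; it outperforms existing PEFT methods and applies to models across multiple domains.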

Comments 10 pages, 7 figures, supplementary material included

详情
AI中文摘要

为了解决微调大型预训练模型时的内存限制,现有的参数高效微调(PEFT)方法,如LoRA,依赖于低秩更新。然而,此类更新未能充分捕捉到全参数微调中权重修改的秩特性,导致性能差距。此外,LoRA和其他现有PEFT方法仍然需要大量内存来存储完整的冻结权重,限制了其在资源受限环境中的效率。为解决这些限制,我们引入了累积能量保留子空间适应(CERSA),一种新的微调范式,利用奇异值分解(SVD)仅保留负责90%至95%谱能量的主成分。通过在此主子空间中微调低秩表示,CERSA显著降低了内存消耗。我们对不同规模和领域的模型进行了广泛评估,包括图像识别、文本到图像生成和自然语言理解。实验证果表明,CERSA在性能上始终优于最先进的PEFT方法,同时实现了显著更低的内存需求。代码将公开发布。

英文摘要

To mitigate the memory constraints associated with fine-tuning large pre-trained models, existing parameter-efficient fine-tuning (PEFT) methods, such as LoRA, rely on low-rank updates. However, such updates fail to fully capture the rank characteristics of the weight modifications observed in full-parameter fine-tuning, resulting in a performance gap. Furthermore, LoRA and other existing PEFT methods still require substantial memory to store the full set of frozen weights, limiting their efficiency in resource-constrained settings. To address these limitations, we introduce Cumulative Energy-Retaining Subspace Adaptation (CERSA), a novel fine-tuning paradigm that leverages singular value decomposition (SVD) to retain only the principal components responsible for 90% to 95% of the spectral energy. By fine-tuning low-rank representations derived from this principal subspace, CERSA significantly reduces memory consumption. We conduct extensive evaluations of CERSA across models of varying scales and domains, including image recognition, text-to-image generation, and natural language understanding. Empirical results demonstrate that CERSA consistently outperforms state-of-the-art PEFT methods while achieving substantially lower memory requirements. The code will be publicly released.
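The idea of keeping only the components that carry 90% to 95% of the spectral energy can be sketched as a rank-selection rule. This is an illustration, not the authors' implementation: the helper name `rank_for_energy` is hypothetical, and defining spectral energy as the sum of squared singular values is an assumption, since the abstract does not state the definition.

```python
def rank_for_energy(singular_values, energy=0.90):
    """Smallest k whose top-k singular values hold at least `energy` of the total squared mass."""
    # Assumption: energy of a component is its squared singular value.
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running / total >= energy:
            return k
    return len(squared)


# Toy spectrum: squared values are 9, 4, 1, 1 (total 15).
k90 = rank_for_energy([3.0, 2.0, 1.0, 1.0], energy=0.90)
k95 = rank_for_energy([3.0, 2.0, 1.0, 1.0], energy=0.95)
```

On this toy spectrum three components reach 14/15 of the energy, so the 90% threshold keeps three and the 95% threshold needs all four. Real weight matrices typically have much faster-decaying spectra, which is where the memory saving would come from.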

2605.08173 2026-05-12 cs.CV cs.LG 版本更新

CASISR: Circular Arbitrary-Scale Image Super-Resolution

CASISR: 循环任意尺度图像超分辨率

Honggui Li, Zhengyang Zhang, Dingtai Li, Sinan Chen, Nahid Md Lokman Hossain, Xinfeng Xu, Yinlu Qin, Ruobing Wang, Hantao Lu, Yuting Feng, Maria Trocan, Dimitri Galayko, Amara Amara, Mohamad Sawan

发表机构 * School of Information and Artificial Intelligence, Yangzhou University(扬州大学信息与人工智能学院) Shanghai Qigong Research Institute, Shanghai University of Traditional Chinese Medicine(上海青工研究院,上海中医药大学) LISITE Research Laboratory, Institut Supérieur d'Électronique de Paris(LISITE研究实验室,巴黎电子高级学院) Laboratoire d'Informatique de Paris 6, Sorbonne University(巴黎第六大学信息学实验室) International Campus (Hangzhou), Beijing University of Aeronautics and Astronautics(北京航空航天大学杭州国际校区) School of Engineering, Westlake University(西湖大学工程学院) Polystim Neurotech Laboratory, Polytechnique Montreal(蒙特利尔Polytechnique神经科技实验室)

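AI summary: This paper proposes CASISR, a closed-loop architecture described by a nonlinear loop equation, to improve image reconstruction; experiments show strong results for fractional SR scale factors and for text and stripe images.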

详情
AI中文摘要

深度学习基于任意尺度图像超分辨率(ASISR)方法的泛化性能(GP)受限于有限的训练数据集和无限的测试数据集。通过充分利用测试样本可增强预训练ASISR模型的GP。ASISR模型通常采用从低分辨率(LR)图像到超分辨率(SR)图像的开环架构。从SR样本到LR样本的退化模型在经典ASISR中为双三次下采样,在盲ASISR中假设为带有加性随机噪声的下采样,在现实世界ASISR中为可学习模型。结合ASISR和退化模型,可基于自动控制理论采用闭环架构强化ASISR方法的GP。因此,本文提出闭环架构CASISR,以提升图像重建能力。建立数学非线性循环方程描述CASISR,通过条件概率理论证明其合理性,通过泰勒级数近似证明其稳定性。定义一阶和二阶绝对差图像以比较ASISR和CASISR方法的图像重建性能。综合仿真实验表明,所提CASISR方法在图像重建质量上优于八种最先进的ASISR方法。特别是,所提CASISR特别适合分数SR尺度因子,并在边缘大幅变化的文本和条纹图像中效果显著。

英文摘要

The generalization performance (GP) of deep learning-based arbitrary-scale image super-resolution (ASISR) methods is constrained by the gap between limited training datasets and unlimited testing datasets. It is therefore important to enhance the GP of pretrained ASISR models by making full use of the testing samples. ASISR models usually employ an open-loop architecture that maps low-resolution (LR) images to super-resolution (SR) images. The degradation model from SR samples to LR samples is known to be bicubic down-sampling for classical ASISR, is assumed to be down-sampling with additive random noise for blind ASISR, and is learnable for real-world ASISR. By combining the ASISR and degradation models, a closed-loop architecture based on automatic control theory can potentially be adopted to strengthen the GP of ASISR methods. Therefore, this paper proposes a closed-loop architecture, circular ASISR (CASISR), to improve image reconstruction. A mathematical nonlinear loop equation is established to describe the CASISR, the reasonableness of the CASISR is proven by conditional probability theory, and the stability of the CASISR is proven by Taylor series approximation. First-order and second-order absolute difference images are defined to compare the image reconstruction performance of the ASISR and CASISR methods. Comprehensive simulation experiments show that the proposed CASISR approach outperforms eight state-of-the-art ASISR approaches in image reconstruction quality. In particular, the proposed CASISR is especially well suited to fractional SR scale factors and is highly effective for text and stripe images with sharply changing edges.

2605.08172 2026-05-12 cs.CV cs.LG 版本更新

Augmented Equivariant Mesh Networks for Anatomical Segmentation

增强等变网格网络用于解剖分割

Daniel Saragih

发表机构 * Department of Pathology and Molecular Medicine(病理学与分子医学系)

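AI summary: This paper proposes EAMS, built on equivariant mesh neural networks and combining intrinsic mesh descriptors with anatomy-aware priors, to achieve robust anatomical mesh segmentation that performs well across different supervision types.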

Comments 21 pages, 7 figures, 14 tables

详情
AI中文摘要

解剖网格分割需要能够直接操作不规则表面几何并鲁棒于任意患者姿态和网格分辨率变化的模型。现有任务特定的网格和点云方法不具备等变性,在测试时易退化,例如在40度倾斜扫描分割中IoU分数下降25-26个百分点。本文提出EAMS,基于等变网格神经网络(EMNN),在四个临床不同的任务上进行评估,涵盖边、顶点和面级监督。我们结合内在网格描述符和解剖感知先验,包括基于PCA的牙弓和肝脏表面框架,并增强消息传递以提供轻量级全局上下文。在颅内动脉瘤和口内分割中,EAMS变体在未扰动输入上与专用基线竞争,同时在几何扰动下保持稳定,在肝脏表面暴露了在标准姿态准确性与旋转鲁棒性之间的有利权衡。这些结果表明,轻量级(<2M参数)的等变框架可以在不同监督类型下实现鲁棒的解剖网格分割,而无需特定任务架构。

英文摘要

Anatomical mesh segmentation requires models that operate directly on irregular surface geometry while remaining robust to arbitrary patient pose and mesh resolution variation. Existing task-specific mesh and point-cloud methods are not equivariant, and can degrade sharply under test-time perturbation, for example dropping by 25-26 IoU points on intraoral scan segmentation at $40^\circ$ tilt. We present EAMS, an Equivariant Anatomical Mesh Segmentor built on Equivariant Mesh Neural Networks (EMNN), and evaluate it across four clinically distinct tasks spanning edge-, vertex-, and face-level supervision. We combine intrinsic mesh descriptors with anatomy-aware priors, including PCA-derived frames for dental arches and liver surfaces, and augment message passing to provide lightweight global context. Across intracranial aneurysm and intraoral segmentation, EAMS variants are competitive with specialized baselines on unperturbed inputs while remaining stable under geometric perturbations, and on liver surfaces they expose a favorable trade-off between canonical-pose accuracy and rotation robustness. These results show that a lightweight ($<2$M parameters) equivariant framework can deliver robust anatomical mesh segmentation across diverse supervision types without task-specific architectures.

2605.08169 2026-05-12 cs.CV cs.AI 版本更新

Optimized Culprit Identification Using Mobilenet and Attention Mechanisms

基于MobileNet和注意力机制的优化罪犯识别

Savitha N J, Lata B T

发表机构 * Department of Computer Science and Engineering, University Visvesvaraya College of Engineering (UVCE), Bangalore University, CMR Institute of Technology, Bengaluru, India(计算机科学与工程系,Visvesvaraya工程学院(UVCE),班加罗尔大学,CMR技术学院,班加罗尔,印度) Department of Computer Science and Engineering, University Visvesvaraya College of Engineering (UVCE), Bangalore University, Bengaluru, India(计算机科学与工程系,Visvesvaraya工程学院(UVCE),班加罗尔大学,班加罗尔,印度)

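AI summary: This paper proposes a deep learning framework that combines a lightweight MobileNet architecture with channel and spatial attention mechanisms, focusing on key regions to improve identification; it reports 97.8% accuracy on datasets including LFW and CASIA-WebFace.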

Journal ref ISSN No: 2096-3246, Link: https://advancedengineeringscience.com/article/2026/2216.html

详情
AI中文摘要

在监控系统中自动识别罪犯是一项需要高精度和计算效率的关键任务。本文提出了一种优化的深度学习框架,结合轻量级MobileNet架构与通道和空间注意力机制。该模型通过选择性聚焦最判别性的区域并抑制无关背景信息,从而提升识别性能。框架包括高效的预处理、基于注意力的特征细化和优化Adam优化器的稳健分类策略。实验在LFW、CASIA-WebFace和VGGFace2子集上进行,考虑光照、姿态和遮挡变化。结果表明,所提模型在分类准确率上达到97.8%,优于基线CNN、ResNet和标准MobileNet。混淆矩阵分析显示强类间区分能力,ROC-AUC评估确认了跨类的稳健性能。此外,该方法保持低计算复杂度和减少的推理时间,适用于实时监控和边缘应用。

英文摘要

Automated culprit identification in surveillance systems is a critical task that requires high accuracy along with computational efficiency for real-time deployment. In this paper, an optimized deep learning framework is proposed using a lightweight MobileNet architecture integrated with channel and spatial attention mechanisms. The proposed model enhances feature representation by selectively focusing on the most discriminative regions while suppressing irrelevant background information, thereby improving identification performance. The framework incorporates efficient preprocessing, attention-based feature refinement, and a robust classification strategy optimized using the Adam optimizer. Experiments were conducted on benchmark face recognition datasets, including Labeled Faces in the Wild (LFW), CASIA-WebFace, and a subset of VGGFace2, under realistic conditions with variations in illumination, pose, and occlusion. The results demonstrate that the proposed model achieves a high classification accuracy of 97.8%, outperforming conventional models such as baseline CNN, ResNet, and standard MobileNet. The confusion matrix analysis indicates strong class-wise discrimination with minimal misclassification, while ROC-AUC evaluation confirms robust performance across all classes. Additionally, the proposed approach maintains low computational complexity and reduced inference time, making it suitable for real-time surveillance and edge-based applications.

2605.08167 2026-05-12 cs.CV cs.AI 版本更新

Digital Image Forgery Detection Using Transfer Learning

利用迁移学习的数字图像伪造检测

Fatma Betul Buyuk, Gozde Karatas Baydogmus, Ali Buldu, Ayaulym Tulendiyeva, Zhuldyz Baizhumanova

发表机构 * Marmara University, Department of Computer Engineering(马尔马拉大学计算机工程系) Loyola University Chicago, Department of Computer Science(芝加哥洛伊拉大学计算机科学系) Biruni University, Department of Computer Engineering(比鲁尼大学计算机工程系)

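AI summary: This paper proposes a transfer learning-based framework for digital image forgery detection that combines compression-aware feature enhancement with deep convolutional neural networks, using a hybrid input representation and an adaptive threshold optimization strategy to improve detection robustness.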

详情
AI中文摘要

随着高级图像编辑工具的普及, manipulated digital content 的数量显著增加,给数字取证和信息安全带来严峻挑战。本文提出一种基于迁移学习的数字图像伪造检测框架,整合压缩感知特征增强与深度卷积神经网络(CNN)架构。所提出的方法引入混合输入表示,结合RGB图像与基于压缩差异的特征(FDIFF),显式突出难以检测的细微篡改痕迹。此外,基于Youden指数的模型特定自适应阈值优化策略用于提高分类可靠性,通过在真阳性率和假阳性率之间取得更好平衡。在CASIA v2.0数据集上使用多个预训练CNN架构(包括DenseNet121、VGG16、ResNet50、EfficientNetB0、MobileNet和InceptionV3)进行实验,验证了所提框架的有效性和鲁棒性。模型使用准确率、精确率、召回率、F1分数、Matthews相关系数(MCC)和ROC曲线下的面积(AUC)等综合性能指标进行评估。结果表明,DenseNet121在准确率和AUC上表现最佳,而ResNet50提供了最平衡和可靠的预测,具有最高的MCC。研究强调,仅依赖准确率不足以满足取证应用需求,因为最小化假阴性至关重要。总体而言,所提框架提高了篡改痕迹的可见性并增强了分类鲁棒性,使其适用于实际的数字图像伪造检测场景。

英文摘要

The increasing availability of advanced image editing tools has led to a significant rise in manipulated digital content, posing serious challenges for digital forensics and information security. This study presents a transfer learning-based framework for digital image forgery detection that integrates compression-aware feature enhancement with deep convolutional neural network (CNN) architectures. The proposed approach introduces a hybrid input representation that combines RGB images with compression difference-based features (FDIFF), explicitly highlighting subtle manipulation artifacts that are often difficult to detect. In addition, a model-specific adaptive threshold optimization strategy based on the Youden Index is employed to improve classification reliability by achieving a better balance between true positive and false positive rates. Experiments conducted on the CASIA v2.0 dataset using multiple pretrained CNN architectures, including DenseNet121, VGG16, ResNet50, EfficientNetB0, MobileNet, and InceptionV3, demonstrate the effectiveness and robustness of the proposed framework. The models are evaluated using comprehensive performance metrics such as accuracy, precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). The results show that DenseNet121 achieves the highest accuracy and AUC, while ResNet50 provides the most balanced and reliable predictions with the highest MCC. The findings emphasize that relying solely on accuracy is insufficient for forensic applications, where minimizing false negatives is critical. Overall, the proposed framework improves the visibility of manipulation artifacts and enhances classification robustness, making it suitable for real-world digital image forgery detection scenarios.
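The Youden Index mentioned above is J = TPR - FPR, and the chosen threshold is the one that maximises J. The sketch below shows that selection rule on toy data. The function name, the convention that label 1 means "forged", and the rule of predicting positive when the score is at or above the threshold are illustrative assumptions; the abstract does not describe the per-model procedure in detail.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising J = TPR - FPR; predict positive when score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    best_threshold, best_j = None, -1.0
    # Every distinct score is a candidate threshold; ties keep the lowest candidate.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy scores; label 1 = forged (assumed convention), label 0 = authentic.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
threshold, j = youden_threshold(scores, labels)
```

Here the threshold 0.35 catches both forged samples at the cost of one false positive, giving J = 1.0 - 0.5 = 0.5. Running the same search separately on each model's validation scores is one way a "model-specific" threshold could be obtained.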

2605.08161 2026-05-12 cs.CV 版本更新

Advanced Tumor Segmentation in PET/CT Imaging: A Training Strategy Study with nnU-Net for AutoPET III

PET/CT全身成像中肿瘤分割的进阶研究:基于nnU-Net的AutoPET III自动训练策略研究

Hussain Alasmawi

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

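AI summary: This paper presents a whole-body PET/CT tumor segmentation method that uses the nnU-Net framework to study how training strategies affect model performance, improving segmentation robustness and accuracy and ranking third in the AutoPET III challenge.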

详情
AI中文摘要

全身PET/CT影像中的肿瘤分割对于精确疾病评估和治疗计划至关重要。然而,由于病变大小、对比度和解剖分布的差异,这一任务仍然具有挑战性。依赖手动分割过程耗时且易受观察者间和观察者内变异影响。本文提出了一种针对AutoPET III挑战的全身肿瘤分割方法,目标是构建能够跨示踪剂和多中心数据泛化模型。我们采用基于ResNet的编码器的nnU-Net框架作为基线,并系统研究了训练策略的影响,包括强度归一化、批次Dice优化和使用CraveMix的数据增强。实验表明,这些策略显著影响模型性能,特别是在减少假阳性并提高对病变变异的鲁棒性方面。最佳配置在初步测试阶段达到Dice分数高达0.80,且我们的方法在AutoPET III挑战中排名第三。代码已公开。

英文摘要

Tumor segmentation in whole-body PET/CT imaging is crucial for precise disease evaluation and treatment planning. However, it remains challenging due to variability in lesion size, contrast, and anatomical distribution. Relying on manual segmentation makes the process time-consuming and prone to intra- and inter-observer variability. This work presents a whole-body tumor segmentation method developed for the AutoPET III challenge, where the goal is to build models that generalize across tracers and multi-center data. We employ the nnU-Net framework with a ResNet-based encoder as our baseline and systematically investigate the impact of training strategies, including intensity normalization, batch dice optimization, and data augmentation using CraveMix. Our experiments show that these strategies significantly influence model performance, particularly in reducing false positives and improving robustness to lesion variability. The best-performing configuration achieves a Dice score of up to 0.80 on the preliminary test phase, and our method ranked third in the AutoPET III challenge. The code is publicly available.
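For readers unfamiliar with the metric, the Dice score reported above is 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a reference mask T. The sketch below computes it on flat binary masks. Returning 1.0 when both masks are empty is an assumed convention for illustration; the challenge's own evaluation code may treat that case differently.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks; 1.0 when both masks are empty."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if size == 0 else 2.0 * intersection / size


# Toy 1-D masks: 4 predicted voxels, 4 reference voxels, 3 shared.
pred = [1, 1, 1, 1, 0, 0]
target = [0, 1, 1, 1, 1, 0]
score = dice_score(pred, target)
```

With three shared voxels out of four in each mask, the score is 2 * 3 / 8 = 0.75. A false-positive voxel lowers the score just as a missed voxel does, which is why reducing false positives matters for this metric.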

2605.08160 2026-05-12 cs.CV cs.AI 版本更新

WATCH: Wide-Area Archaeological Site Tracking for Change Detection

WATCH:大范围考古遗址跟踪用于变化检测

Girmaw Abebe Tadesse, Titien Bartette, Andrew Hassanali, Allen Kim, Jonathan Chemla, Andrew Zolli, Yves Ubelmann, Caleb Robinson, Inbal Becker-Reshef, Juan Lavista Ferres

发表机构 * Microsoft AI for Good Research Lab(微软AI for Good研究实验室) Iconem, Paris, France(Iconem公司,巴黎,法国) Planet Labs PBC(Planet Labs公司)

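AI summary: WATCH uses three approaches to detect change at archaeological sites at scale, drawing on satellite imagery and foundation-model embeddings to improve the efficiency and accuracy of site protection.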

详情
AI中文摘要

大规模监测考古遗址对保护文化遗产至关重要,但确定扰动发生时间困难,因为视觉线索微妙且地面真实数据稀少。我们介绍了WATCH,一个用于PlanetScope卫星拼接图(2017-2024,4.7米/像素)的月级变化事件定位框架,支持三种互补的评分方法:(i)时间嵌入距离(TED),一种无需训练的方法,评分月与月之间的偏差;(ii)自监督变化检测(SSCD),一个包含重建、预测和潜在新颖性信号的集合;(iii)弱监督(WS)时间定位模型,使用稀疏事件-月份标签进行训练。我们在阿富汗1943个考古遗址上基准测试WATCH,使用六个基础模型(CLIP、GeoRSCLIP、SatMAE、Prithvi-EO-2.0、DINOv3和Satlas-Pretrain)的嵌入以及手工制作的光谱和纹理基线,并在叙利亚、土耳其、巴基斯坦和埃及的遗址上评估跨区域泛化能力。无监督方法(TED、SSCD)始终优于弱监督方法。TED与SatMAE实现最高精确月召回率(m=0时55%),而TED与GeoRSCLIP、CLIP或Satlas-Pretrain在三个月容忍度内(m=3)达到92.5%。手工制作的特征在弱监督下仍具竞争力。我们的方向边际分析揭示了系统的时间偏见:SSCD与GeoRSCLIP或Prithvi-EO-2.0配对表现出最强的预警轮廓,能在记录事件前检测异常,而TED更倾向于在变化发生后进行确认检测。这些结果表明,卫星影像结合基础模型嵌入能够实现可扩展、决策相关的遗产监测。代码:https://github.com/microsoft/WATCH

英文摘要

Monitoring archaeological sites at scale is vital for protecting cultural heritage, yet pinpointing when disturbances occur remains difficult because visual cues are subtle and ground-truth data are sparse. We introduce WATCH, a framework for month-level change-event localization over PlanetScope satellite mosaics (2017-2024, 4.7 m/px) that supports three complementary scoring approaches: (i) Temporal Embedding Distance (TED), a training-free method that scores month-to-month deviations from a local temporal reference; (ii) Self-Supervised Change Detection (SSCD), an ensemble of reconstruction, forecasting, and latent-novelty signals; and (iii) a Weakly Supervised (WS) temporal localization model trained with sparse event-month labels. We benchmark WATCH on 1,943 archaeological sites in Afghanistan using embeddings from six foundation models (CLIP, GeoRSCLIP, SatMAE, Prithvi-EO-2.0, DINOv3, and Satlas-Pretrain) alongside a handcrafted spectral and texture baseline, and assess cross-regional generalization on sites in Syria, Turkey, Pakistan, and Egypt. The unsupervised approaches (TED, SSCD) consistently outperform the weakly supervised alternative. TED with SatMAE achieves the highest exact-month recall (55% at m=0), while TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5% within a three-month tolerance (m=3). Handcrafted features remain competitive for exact-month detection under weak supervision. Our directional margin analysis reveals systematic temporal biases: SSCD paired with GeoRSCLIP or Prithvi-EO-2.0 exhibits the strongest early-warning profile, detecting anomalies before the recorded event, while TED favors confirmation-oriented detection after a change has materialized. These results show that satellite imagery combined with foundation-model embeddings enables scalable, decision-relevant heritage monitoring. Code: https://github.com/microsoft/WATCH
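The recall figures quoted at m=0 and m=3 can be read as the share of sites whose predicted event month falls within m months of the recorded one. The sketch below encodes that reading. The function name, the use of integer month indices, and the symmetric tolerance window are assumptions made for illustration; the abstract does not give the exact metric definition.

```python
def recall_within_tolerance(predicted_months, true_months, m=0):
    """Fraction of sites whose predicted month index lies within m months of the recorded event."""
    if len(predicted_months) != len(true_months) or not true_months:
        raise ValueError("need equally sized, non-empty month lists")
    # Assumption: a symmetric window, so early and late detections count equally.
    hits = sum(1 for p, t in zip(predicted_months, true_months) if abs(p - t) <= m)
    return hits / len(true_months)


# Toy month indices for four sites (offsets from the recorded event: 0, 2, 4, 1).
predicted = [10, 25, 40, 7]
recorded = [10, 27, 44, 8]
exact = recall_within_tolerance(predicted, recorded, m=0)
within_three = recall_within_tolerance(predicted, recorded, m=3)
```

On this toy data only one site is matched to the exact month, while three of four fall within three months. The signed offset `p - t`, rather than its absolute value, is what a directional analysis of early-warning versus confirmation behaviour would examine.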

2605.08158 2026-05-12 cs.CV cs.AI 版本更新

HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

HY-Himmel技术报告:分层交错多流运动编码用于长视频理解

Haopeng Jin, Hongzhu Yi, Wenlong Zhao, Jinwen Luo, Shani Ye, Zhenyu Guan, Shiquan Dong, Tiankun Yang, Tao Yu

发表机构 * Tencent(腾讯) University of Chinese Academy of Sciences(中国科学院大学) Beijing Forestry University(北京林业大学)

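AI summary: HY-Himmel is a hierarchical video-language framework that addresses three bottlenecks of multimodal language models in long-video understanding; a lightweight compressed-domain tri-stream adapter encodes the dense inter-frame intervals to strengthen motion perception, and on Video-MME it outperforms the dense 32-frame baseline.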

Comments 59 pages, 42 figures. Technical report

详情
AI中文摘要

长视频理解多模态语言模型面临三个叠加瓶颈:获取密集RGB帧的高解码成本、帧数增长导致的二次令牌增长以及稀疏关键帧采样下的弱运动感知。本文提出HY-Himmel,一种分层视频语言框架,分别分配语义和运动能力。少量稀疏锚定I帧被路由到昂贵的ViT主机,以确定物体身份和场景布局,而更密集的帧间间隔由轻量级压缩域三流适配器编码,通过运动矢量图、残差图和I帧上下文提炼出对齐的运动令牌。这些令牌通过可微置换机制注入LLM,在专用的Stage-1对比对齐后,使运动表示与冻结的视觉主干兼容。在Video-MME上,HY-Himmel在使用3.6倍更少上下文令牌的情况下,超越32帧基准,性能提升2.3个百分点(61.2至63.5%)。对流组成、运动编码器家族、融合模式、对齐目标、锚定数量、LoRA秩和视频持续时间的广泛消融验证了完整三流的必要性和充分性。

英文摘要

Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2 to 63.5%) while using 3.6x fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.

2605.08156 2026-05-12 cs.CV cs.AI 版本更新

LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

LAGO:基于语言引导的自适应对象区域聚焦用于零样本视觉-文本对齐

Junyi Hu, Qiji Zhou, Lei Zhang, Yue Zhang

发表机构 * Beijing Jiaotong University(北京交通大学) Westlake University(西湖大学) Rochester Institute of Technology(罗切斯特理工学院)

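AI summary: LAGO is an efficient and robust framework for zero-shot localized visual-text alignment that uses adaptive language-guided refinement and a dual-channel aggregation strategy to improve localization and zero-shot classification.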

Comments 37 pages, 26 figures, including appendix. Preprint

详情
AI中文摘要

零样本识别旨在通过从候选类别中选择最兼容的标签描述来对图像进行分类,而无需任何任务特定的监督。在细粒度设置中,相关证据往往位于局部区域、属性或纹理,而非整个图像,使全图像对齐效果不佳。最近的局部视觉-文本对齐方法通过比较类别描述与多个图像区域进行改进,但通常依赖大量随机或冗余的裁剪,增加推理成本并引入大量高度冗余或弱相关候选。此外,过早引入语义引导会创建一个误差放大反馈过程,使不准确的中间预测偏移后续的局部化并强化后续错误;我们将其称为预测循环。我们提出LAGO(LAnguage-Guided adaptive Object-region focus),一种高效的零样本局部视觉-文本对齐框架。LAGO首先执行类别无关的对象中心候选发现以获得稳定的视觉初始化,然后应用自适应语言引导的精修,通过中间置信度控制语义引导的强度。它进一步通过有效的对象-上下文双通道聚合策略结合对象级、上下文级和全图像证据。大量实验表明,LAGO在标准零样本基准和具有挑战性的分布偏移设置中均取得最佳性能,同时在推理时需要显著更少的候选区域。

英文摘要

Zero-shot recognition aims to classify an image by selecting the most compatible label description from a set of candidate classes without any task-specific supervision. In fine-grained settings, however, the relevant evidence often lies in localized parts, attributes, or textures rather than in the full image, making whole-image alignment suboptimal. Recent localized visual-text alignment methods address this by comparing class descriptions with multiple image regions, but they typically rely on large sets of random or redundant crops, increasing inference cost and introducing many highly redundant or weakly relevant candidates. Moreover, introducing semantic guidance too early can create an error-amplifying feedback process in which inaccurate intermediate predictions bias later localization and reinforce subsequent mistakes; we refer to this failure mode as the prediction loop. We propose LAGO (LAnguage-Guided adaptive Object-region focus), a framework for efficient and robust zero-shot localized visual-text alignment. LAGO first performs class-agnostic object-centric candidate discovery to obtain a stable visual initialization, and then applies adaptive language-guided refinement with the strength of semantic guidance controlled by intermediate confidence. It further combines object-level, contextual, and full-image evidence through an effective object-context dual-channel aggregation strategy. Extensive experiments show that LAGO consistently achieves state-of-the-art performance on standard zero-shot benchmarks and challenging distribution-shift settings, while requiring substantially fewer candidate regions at inference time.

2605.08144 2026-05-12 cs.LG cs.AI cs.CV 版本更新

NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

NoiseRater: 用于扩散模型训练的元学习噪声估值

Fang Wu, Haokai Zhao, Da Xing, Hanqun Cao, Tinson Xu, Yanchao Li, Xiangru Tang, Zehong Wang, Aaron Tu, Kuan Pang, Hanchen Wang, Hongbin Lin, Zeqi Zhou, Yinxi Li, Peng Xia, Li Erran Li, Molei Tao, Jure Leskovec, Aditya Joshi, Yejin Choi

发表机构 * Stanford University(斯坦福大学) UNSW(新南威尔士大学) UCL(伦敦大学学院) The University of Chicago(芝加哥大学) CUHK(香港中文大学) Nanjing University(南京大学) Brown University(布朗大学) Yale University(耶鲁大学) University of Notre Dame(Notre Dame 大学) University of Waterloo(滑铁卢大学) UCB(加州大学伯克利分校) Georgia Technology(佐治亚理工学院) Amazon(亚马逊)

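AI summary: This paper proposes NoiseRater, which uses meta-learning for instance-level noise valuation to improve the training efficiency and generation quality of diffusion models.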

详情
AI中文摘要

扩散模型在生成任务中取得显著成功,但其训练范式通常将注入噪声视为均匀信息。本文挑战这一假设,引入NoiseRater,一种用于扩散模型训练的元学习框架,提出参数化的噪声评估器,根据数据和时间步对个体噪声进行重要性评分,实现训练目标的自适应重新加权。评估器通过双层优化训练,以提升下游验证性能。为实现高效部署,进一步设计解耦的两阶段流程,从元训练期间的软加权过渡到标准训练期间的硬噪声选择。在FFHQ和ImageNet上的实验表明,并非所有噪声样本贡献相同,优先选择信息量大的噪声可提升训练效率和生成质量。结果表明,噪声估值成为提升扩散模型训练的互补且此前未被充分探索的维度。代码可在https://anonymous.4open.science/r/NoiseRater-DEB116获取。

英文摘要

Diffusion models have achieved remarkable success across a wide range of generative tasks, yet their training paradigm largely treats injected noise as uniformly informative. In this work, we challenge this assumption and introduce NoiseRater, a meta-learning framework for instance-level noise valuation in diffusion model training. We propose a parametric noise rater that assigns importance scores to individual noise realizations conditioned on data and timestep, enabling adaptive reweighting of the training objective. The rater is trained via bilevel optimization to improve downstream validation performance after inner-loop diffusion updates. To enable efficient deployment, we further design a decoupled two-stage pipeline that transitions from soft weighting during meta-training to hard noise selection during standard training. Extensive experiments on FFHQ and ImageNet demonstrate that not all noise samples contribute equally, and that prioritizing informative noise improves both training efficiency and generation quality. Our results establish noise valuation as a complementary and previously underexplored axis for improving diffusion model training. Our code is available at: https://anonymous.4open.science/r/NoiseRater-DEB116.

2605.08142 2026-05-12 cs.LG cs.CL cs.CV 版本更新

Reasoning emerges from constrained inference manifolds in large language models

推理源自大型语言模型中受限推断流形

Yanbiao Ma, Fei Luo, Linfeng Zhang, Chuangxin Zhao, Mingxuan Wang, Yinan Wu, Zhe Qian, Yang Lu, Long Chen, Zhao Cao, Xiaoshuai Hao, Ji-Rong Wen, Jungong Han

发表机构 * Renmin University of China(中国人民大学) Tsinghua University(清华大学) Xiaomi EV(小米电动车) Xiamen University(厦门大学)

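AI summary: By analyzing how internal representations evolve during inference, this study finds that reasoning dynamics self-organize into low-dimensional manifolds in high-dimensional space, but that stable reasoning additionally requires adequate representational expressivity, manifold compression, and non-degenerate information volume.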

详情
AI中文摘要

大型语言模型中的推理主要通过标注基准评估,将任务表现与内部推断质量混淆。本文通过研究推理作为内在动态过程,分析推理过程中内部表示的演变。发现推理时间动态一致自组织为嵌入在高维表示空间中的低维流形。尽管这种几何压缩普遍存在,但不足以实现稳定或可靠的推理。有效的推理动态出现在由三个条件定义的受限结构域内:足够的表示表达性、自发的流形压缩以及压缩子空间内非退化的信息体积。域外模型表现出特征性的病态推理动态。基于这些见解,我们引入了一个统一的、无标注的诊断方法,仅从内部动态计算。这些发现表明,LLM中的推理本质上由几何和信息约束决定,为基于基准的评估提供了补充框架。

英文摘要

Reasoning in large language models is predominantly evaluated through labeled benchmarks, conflating task performance with the quality of internal inference. Here we study reasoning as an intrinsic dynamical process by examining the evolution of internal representations during inference. We find that inference-time dynamics consistently self-organize into low-dimensional manifolds embedded within high-dimensional representation spaces. However, such geometric compression, although pervasive, is not sufficient for stable or reliable reasoning. Instead, effective reasoning dynamics emerge within a constrained structural regime characterized by three conditions: adequate representational expressivity, spontaneous manifold compression, and preservation of non-degenerate information volume within the compressed subspace. Models outside this regime exhibit characteristic pathological inference dynamics. Based on these insights, we introduce a unified, label-free diagnostic computed solely from internal dynamics. These findings suggest that reasoning in LLMs is fundamentally governed by geometric and informational constraints, offering a complementary framework to benchmark-centric assessment.

2605.08136 2026-05-12 cs.CV cs.AI cs.RO version update

Benchmarking ResNet Backbones in RT-DETR: Impact of Depth and Regularization under environmental conditions

在RT-DETR中评估ResNet主干:环境条件下深度和正则化的影响

Pamela Barboza, Víctor Castelli, Belén Pereira, Ricardo Grando, Bruna de Vargas, Augusto Calfani

Affiliations * Robotics and Artificial Intelligence Laboratory, Technological University of Uruguay; Artificial Intelligence Laboratory, Technological University of Uruguay

AI summary This paper evaluates RT-DETR for round-object detection under environmental and hyperparameter variations, comparing confidence and accuracy for four ResNet backbones at different dropout rates; ResNet50 performs best under illumination changes and ResNet34 under background changes.

Comments Accepted at the International Conference on Data Science, Technology and Applications (DATA) 2026

Details
AI Chinese abstract

视觉感知在竞争性机器人中起着核心作用,其中环境变化可能直接影响实时检测性能。现有的基于变换器的检测器文献缺乏关于主干规模和环境设置对模型性能影响的信息。本文提出了对RT-DETR的比较评估,以检测环境和超参数变化相关的圆物体。通过比较四种ResNet主干(ResNet18、ResNet34、ResNet50和ResNet101)的丢弃率,分析了它们对置信度和准确率的影响。所有模型均在相同配置下训练,并在光照和背景对比变化下进行评估。环境条件主要影响预测置信度,而推理延迟基本不受影响,分类准确率保持一致较高,大多数情况下接近或超过1.00。观察到两种不同的行为。在光照变化下,ResNet50实现了最佳的权衡,结合接近完美的准确率、置信值高达约0.869和延迟约0.058-0.059 ms。在背景变化下,ResNet34提供了最平衡的性能,达到接近完美的准确率和更高的置信值,高达约0.887。这些结果表明,最佳架构取决于环境变化类型,中间深度模型在性能和效率之间提供了最佳平衡。

English abstract

Visual perception plays a central role in competitive robotics, where environmental variations can directly affect real-time detection performance. The related literature on transformer-based detectors lacks information regarding the impact of backbone scale and environmental settings on model performance. This work presents a comparative evaluation of RT-DETR for detecting round objects under environmental and hyperparameter variations relevant to competitive robotics. Four ResNet backbones (ResNet18, ResNet34, ResNet50, and ResNet101) were compared across different dropout rates, analyzing their effect on confidence and accuracy. All models were trained under the same configuration and evaluated under changes in lighting and background contrast. Environmental conditions primarily impact prediction confidence, while inference latency remains largely unaffected and classification accuracy stays consistently high, approaching or above 1.00 in most cases. Two distinct behaviors were observed. Under illumination variation, ResNet50 achieves the best trade-off, combining near-perfect accuracy, confidence values up to approximately 0.869, and latency around 0.058-0.059 ms. Under background variation, ResNet34 provides the most balanced performance, reaching near-perfect accuracy and higher confidence values up to approximately 0.887. These results indicate that the optimal architecture depends on the type of environmental variation, with intermediate-depth models offering the best balance between performance and efficiency.

2605.08117 2026-05-12 eess.SP cs.CV cs.LG version update

Modular Retrieval-Augmented Generalization for Human Action Recognition

模块化检索增强泛化用于人类动作识别

Peng Liao, Shangsong Liang, Lin Chen, Peijia Zheng

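Affiliations * School of Computer Science and Engineering; Sun Yat-sen University; Macao Polytechnic University

AI summary This paper proposes MoRA, a retrieval-augmented module that improves IMU-based activity recognition by addressing limited training samples and static knowledge utilization; experiments on multiple real-world datasets show effective recognition gains.

Comments ICME 2026

Details
AI Chinese abstract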

基于惯性测量单元(IMU)的人类活动识别(HAR)旨在从时间运动信号中解释和分类用户行为。最近,深度学习框架通过学习和提取判别时空表示提高了该任务的性能。然而,IMU-based HAR仍面临几个关键挑战,特别是训练样本有限和静态知识利用不足,这严重阻碍了其大规模部署。本文引入MoRA,第一个专门设计用于运动序列的检索增强模块。它可以灵活地集成到任何现有的HAR模型中,在保持推理效率的同时提高识别性能。为了解决检索结果中的信息冗余和刚性融合策略问题,我们在MoRA中提出了一个不确定性自适应融合单元。该单元利用IMU信号的先前物理知识,动态调整原始输出与检索信息之间的融合策略,从而实现更稳健的识别。在十个真实数据集上的广泛实验表明,MoRA显著提高了现有IMU-based HAR模型的性能,始终提供稳定和有效的提升。MoRA的源代码可在:https://github.com/liavonpenn/mora获取。

English abstract

Inertial Measurement Unit (IMU)-based Human Activity Recognition (HAR) aims to interpret and classify user behaviors from temporal motion signals. Recently, deep learning frameworks have advanced this task by learning and extracting discriminative spatiotemporal representations, significantly improving recognition performance. However, IMU-based HAR still faces several critical challenges, particularly limited training samples and static knowledge utilization, both of which severely hinder its large-scale deployment. In this paper, we introduce MoRA, the first Retrieval-Augmented Module specifically designed for motion series. It can be flexibly integrated into any existing HAR model, enhancing recognition performance while maintaining inference efficiency. To address issues such as information redundancy in retrieval results and rigid fusion strategies, we propose an uncertainty-adaptive fusion unit within MoRA. This unit leverages prior physical knowledge from IMU signals to dynamically adjust the fusion strategy between original outputs and retrieved information, enabling more robust recognition. Extensive experiments on ten real-world datasets demonstrate that MoRA significantly improves the performance of existing IMU-based HAR models, consistently delivering stable and effective gains. The source code of MoRA is available at: https://github.com/liavonpenn/mora.
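
The idea of adjusting how much retrieved information contributes, depending on how uncertain the base model is, can be sketched as follows. This is not MoRA's fusion unit: the abstract says the real unit draws on physical knowledge from IMU signals, whereas this stand-in uses only the normalized entropy of the model's own prediction as the uncertainty signal. The function names, the label-vote retrieval distribution, and the weighting rule are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse(model_probs, retrieved_labels, num_classes):
    """Blend model probabilities with label votes from retrieved neighbours.

    The retrieval weight alpha is the model's normalised entropy, so a
    confident model keeps its own output and an uncertain one leans on
    the retrieved samples. Assumes num_classes >= 2.
    """
    if not retrieved_labels:
        return list(model_probs)  # nothing retrieved, nothing to fuse
    counts = [0] * num_classes
    for label in retrieved_labels:
        counts[label] += 1
    retrieved = [c / len(retrieved_labels) for c in counts]
    alpha = entropy(model_probs) / math.log(num_classes)
    alpha = min(1.0, max(0.0, alpha))  # guard against rounding drift
    return [(1.0 - alpha) * m + alpha * r for m, r in zip(model_probs, retrieved)]
```

With a one-hot model prediction the fused output equals the model output, and with a uniform model prediction it equals the retrieved vote distribution.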

2605.08115 2026-05-12 cs.GR cs.CV cs.LG version update

Alice v1: Distillation-Enhanced Video Generation Surpassing Closed-Source Models

Alice v1:通过一致性蒸馏增强的视频生成超越闭源模型

Wang Xiaoyu, Phong Nguyen, Chen Zhao

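Affiliations * Mirage Team Open Source Research

AI summary Alice v1 uses consistency distillation with score regularization to improve video generation quality, surpassing closed-source systems on automated benchmarks and achieving competitive results in human preference studies.

Details
AI Chinese abstract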

我们提出了Alice v1,一个140亿参数的开源视频生成模型,通过一致性蒸馏与分数正则化(rCM)实现最先进的质量。与以质量换取速度的传统蒸馏方法不同,我们证明基于rCM的蒸馏可以超越教师模型质量。我们将其归因于三个机制:(1)分数正则化项作为模式寻找目标,将概率质量集中在高质量输出而非覆盖完整教师分布;(2)我们的定向合成数据管道与难例挖掘提供针对失败模式(物理、手部、面孔)的训练信号,这些模式教师处理不一致;(3)一致性强制作为隐式正则化,消除对特定噪声样本的“幸运路径”依赖。Alice v1以4次去噪步骤(在H100上约8秒)生成24fps的5秒720p视频,比50步教师模型快7倍,同时将VBench分数从84.0(Wan2.2)提升到91.2。这在自动化基准测试中超越了教师模型和闭源系统,包括Veo3(~90)和Sora2(~88),在人类偏好研究中也取得了有竞争力的结果。我们发布所有模型权重、训练代码、合成数据管道和评估脚本,以推动视频生成领域的开放研究。

English abstract

We present Alice v1, a 14-billion parameter open-source video generation model that achieves state-of-the-art quality through consistency distillation with score regularization (rCM). Contrary to conventional distillation, which trades quality for speed, we demonstrate that rCM-based distillation can exceed teacher model quality. We attribute this to three mechanisms: (1) the score regularization term acts as a mode-seeking objective that concentrates probability mass on high-quality outputs rather than covering the full teacher distribution, (2) our targeted synthetic data pipeline with hard example mining provides training signal specifically for failure modes (physics, hands, faces) that the teacher handles inconsistently, and (3) consistency enforcement acts as implicit regularization, eliminating "lucky path" dependence on specific noise samples. Alice v1 generates 5-second 720p videos at 24fps in 4 denoising steps (~8 seconds on H100), a 7x speedup over the 50-step teacher while improving VBench score from 84.0 (Wan2.2) to 91.2. This surpasses both the teacher and closed-source systems including Veo3 (~90) and Sora2 (~88) on automated benchmarks, with competitive results in human preference studies. We release all model weights, training code, synthetic data pipelines, and evaluation scripts to advance open research in video generation.

2605.08113 2026-05-12 cs.LG cs.CV version update

Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa

基础模型嵌入是否能提升跨国家作物产量泛化?一种留一国家法的撒哈拉以南非洲评估

Yaw Osei Adjei

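Affiliations * Department of Computer Science, Kwame Nkrumah University of Science and Technology

AI summary This paper evaluates geospatial foundation model embeddings under Leave-One-Country-Out cross-validation and finds that cross-country prediction performs poorly, limited mainly by differences in yield distribution between countries.

Comments 9 pages, 10 figures, appendix, code and processed results released publicly

Details
AI Chinese abstract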

准确预测小农户玉米产量跨国家边界对于撒哈拉以南非洲粮食安全规划至关重要,但大多数发布的基准报告国内表现过度夸大真实泛化能力。本文评估了地理空间基础模型嵌入(Prithvi-EO-1.0-100M和ViT-Base)是否在留一国家法交叉验证中优于传统Sentinel-2光谱特征,基于五国6404个玉米田观测数据。结果表明存在明显的泛化差距:国内随机交叉验证获得中等R²值,但所有特征集在跨国家测试中表现糟糕,R²普遍为负。冻结Prithvi-EO嵌入在该设置下对跨国家预测无明显优势。本文认为主要限制是国家间产量分布变化而非表征质量,并释放可复现的负面基准供未来工作使用。

English abstract

Accurate predictions of smallholder maize yields across national boundaries are critical for food security planning in sub-Saharan Africa, yet most published benchmarks report within-country performance that overstates true generalisability. This paper evaluates whether geospatial foundation model embeddings, specifically Prithvi-EO-1.0-100M and ViT-Base, outperform traditional Sentinel-2 spectral features under a Leave-One-Country-Out cross-validation scheme on 6,404 maize field observations from five African countries. The results show a clear generalisability gap: within-country random cross-validation yields moderate R^2 values, but all feature sets perform poorly under cross-country testing, with universally negative R^2. Frozen Prithvi-EO embeddings provide no meaningful advantage over engineered spectral features for cross-country prediction in this setting. The paper argues that the main limitation is a shift in yield distribution between countries rather than representation quality and releases a reproducible negative benchmark for future work.
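
To make the negative R^2 result concrete, the sketch below runs a Leave-One-Country-Out split on made-up yields and scores a trivial baseline that predicts the training-set mean. The country labels, yield values, and function names are hypothetical and do not come from the paper. The point is only that when the held-out country's yield distribution is shifted, even a reasonable-looking predictor can score below zero, because R^2 compares against the held-out country's own mean.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than the test mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def leave_one_country_out(records):
    """Yield (held_out_country, train_yields, test_yields) for each country."""
    countries = sorted({country for country, _ in records})
    for held_out in countries:
        train = [y for country, y in records if country != held_out]
        test = [y for country, y in records if country == held_out]
        yield held_out, train, test


def loco_baseline_scores(records):
    """R^2 of a predict-the-training-mean baseline on each held-out country."""
    scores = {}
    for country, train, test in leave_one_country_out(records):
        baseline = sum(train) / len(train)
        scores[country] = r2_score(test, [baseline] * len(test))
    return scores


# Made-up yields (t/ha) for two hypothetical countries with shifted means.
TOY_RECORDS = [
    ("A", 1.0), ("A", 1.2), ("A", 0.8),
    ("B", 3.0), ("B", 3.2), ("B", 2.8),
]
```

Calling `loco_baseline_scores(TOY_RECORDS)` gives a strongly negative score for both held-out countries, since the training mean sits far from every held-out observation.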

2604.18486 2026-05-12 cs.CV cs.CL cs.RO version update

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

小米OneVL:基于视觉-语言解释的一步潜在推理与规划

Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyan Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen

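Affiliations * Xiaomi Embodied Intelligence Team

AI summary OneVL performs one-step latent reasoning and planning with vision-language explanations using compact latent tokens supervised by dual auxiliary decoders, and is the first latent CoT method to surpass explicit CoT while running at answer-only latency.

Comments Technical Report; 49 pages, 22 figures, 10 tables; Project Page at https://xiaomi-embodied-intelligence.github.io/OneVL; GitHub at https://github.com/xiaomi-research/onevl

Details
AI Chinese abstract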

思维链(CoT)推理已成为基于VLA的自动驾驶轨迹预测中的强大驱动力,但其自回归性质带来了延迟成本,这在实时部署中是不可接受的。潜在CoT方法试图通过将推理压缩到连续隐藏状态中来弥合这一差距,但始终无法达到其显式对应方法的水平。我们认为,这是因为纯粹的语言潜在表示压缩的是世界的符号抽象,而不是实际上支配驾驶的因果动态。因此,我们提出了OneVL(基于视觉-语言解释的一步潜在推理与规划),这是一个统一的VLA和世界模型框架,通过由双辅助解码器监督的紧凑潜在标记进行推理。除了一个重建文本CoT的语言解码器外,我们还引入了一个视觉世界模型解码器,预测未来帧标记,迫使潜在空间内化道路几何、代理运动和环境变化的因果动态。一个三阶段训练流程逐步将这些潜在表示与轨迹、语言和视觉目标对齐,确保稳定的联合优化。在推理中,辅助解码器被丢弃,所有潜在标记在单次并行传递中被预填充,与仅回答预测的速度相匹配。在四个基准测试中,OneVL成为首个超越显式CoT的潜在CoT方法,在仅回答预测的延迟下提供更优的准确性。这些结果表明,通过世界模型监督,潜在CoT生成的表示比逐个标记的冗长推理更具泛化能力。代码已向社区开源。项目页面:https://xiaomi-embodied-intelligence.github.io/OneVL

English abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. In inference, the auxiliary decoders are discarded, and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering superior accuracy at answer-only latency. These results show that with world model supervision, latent CoT produces more generalizable representations than verbose token-by-token reasoning. Code has been open-sourced to the community. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL

2603.22421 2026-05-12 cs.CV version update

OsteoFlow: Lyapunov-Guided Flow Distillation for Predicting Bone Remodeling after Mandibular Reconstruction

OsteoFlow: 基于Lyapunov引导的流式蒸馏用于预测下颌重建后的骨重塑

Hamidreza Aftabi, Faye Yu, Brooke Switzer, Zachary Fishman, Eitan Prisman, Antony Hodgson, Cari Whyne, Sidney Fels, Michael Hardisty

Affiliations * Department of Electrical and Computer Engineering, University of British Columbia, Canada; Department of Mechanical Engineering, University of British Columbia, Canada; Department of Surgery, University of British Columbia, Canada; Sunnybrook Research Institute, University of Toronto, Canada

AI summary OsteoFlow uses Lyapunov-guided trajectory distillation to predict Year-1 bone remodeling from Day-5 post-operative CT scans, reducing mean absolute error in the surgical resection zone.

Details
AI Chinese abstract

预测下颌重建后的长期骨重塑在临床上有重要价值,但标准生成模型难以维持长时间的轨迹一致性与解剖保真度。我们引入OsteoFlow,一种基于流的框架,可从术后第5天的CT扫描预测第1年的CT扫描。我们的核心贡献是Lyapunov引导的轨迹蒸馏:不同于一步蒸馏,我们的方法从注册导出的静止速度场教师中蒸馏出连续轨迹。结合手术切除意识的图像损失,这在不牺牲生成能力的情况下强制几何对应。在344个配对感兴趣区域上评估,OsteoFlow显著优于最先进的基线,将手术切除区的平均绝对误差减少了约20%。这展示了轨迹蒸馏在长期预测中的潜力。代码可在GitHub上获得:OsteoFlow。

English abstract

Predicting long-term bone remodeling after mandibular reconstruction would be of great clinical benefit, yet standard generative models struggle to maintain trajectory-level consistency and anatomical fidelity over long horizons. We introduce OsteoFlow, a flow-based framework predicting Year-1 post-operative CT scans from Day-5 scans. Our core contribution is Lyapunov-guided trajectory distillation: unlike one-step distillation, our method distills a continuous trajectory over transport time from a registration-derived stationary velocity field teacher. Combined with a resection-aware image loss, this enforces geometric correspondence without sacrificing generative capacity. Evaluated on 344 paired regions of interest, OsteoFlow significantly outperforms state-of-the-art baselines, reducing mean absolute error in the surgical resection zone by ~20%. This highlights the promise of trajectory distillation for long-term prediction. Code is available on GitHub: OsteoFlow.
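
The headline metric is a mean absolute error restricted to the surgical resection zone. A minimal sketch of such a masked error is given below, assuming the predicted and reference scans have been flattened to equal-length lists of voxel intensities and the zone is given as a binary mask. The function name and data layout are hypothetical and are not taken from the OsteoFlow code.

```python
def masked_mae(pred, target, mask):
    """Mean absolute error over voxels where mask is truthy.

    pred, target and mask are equal-length flat sequences; voxels
    outside the mask do not contribute to the error.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    if not errors:
        raise ValueError("mask selects no voxels")
    return sum(errors) / len(errors)
```

Restricting the average to the mask keeps large, unchanged regions of the scan from diluting the error where remodeling actually occurs.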

2402.02286 2026-05-12 cs.CV cs.AI cs.LG version update

Attention-Mamba: A Mamba-Enhanced Multi-Scale Parallel Inference Network for Medical Image Segmentation

Attention-Mamba: 一种增强Mamba的多尺度并行推理网络用于医学图像分割

Yanhua Zhang, Ke Zhang, Jingyu Wang, Gabriella Balestra, Samanta Rosati, Yulin Wu, Wuwei Wang, Valentina Giannini

Affiliations * School of Astronautics, Northwestern Polytechnical University; Department of Electronics and Telecommunications, Politecnico di Torino; Beijing Aerospace Automatic Control Research Institute; Xi'an University of Posts and Telecommunications; Department of Oncology, University of Turin; Candiolo Cancer Institute, FPO-IRCCS

AI summary This paper proposes a Mamba-based multi-scale parallel network that improves medical image segmentation through multi-scale feature extraction and a recursive alignment module, achieving efficient and accurate segmentation.

Comments 14 pages, 9 figures and 8 tables

Details
AI Chinese abstract

U-shaped架构长期以来主导医学图像分割领域,而Transformer被广泛用于建模长距离依赖。前者通常通过聚合多级特征隐式处理尺度变化,而后者效率受限于二次计算和内存复杂度。本文提出一种有效的传统U-shaped架构替代方案,通过在不同层次构建并行分支以获得多尺度特征和相应预测。此外,我们通过整合Mamba,一种捕捉长距离依赖的态空间模型,来增强网络。首先,双路径架构通过横向连接在每个分支中聚合高层语义信息和低层空间细节。然后,我们引入递归对齐模块(RAM),通过逐步对齐恢复低分辨率特征的空间细节,优化后续全局特征学习和多尺度融合。我们进一步在对齐特征上构建并行Mamba分支,建立层次化全局表示。最后,我们提出基于Mamba的注意力机制,用于自适应多尺度预测融合;该机制利用Mamba增强通道和空间维度上的信息交换。在三种成像模态(MRI、CT和皮肤镜)上的实验验证了所提网络的优越泛化能力。与最先进的2D CNN、Transformer和基于Mamba的网络相比,我们的模型在Synapse、ACDC、ISIC-2018和PH2数据集上实现了最高的分割性能,同时保持高效率,参数量为第二小(14.05 M),计算复杂度适中(8.94 GFLOPs).

English abstract

U-shaped architectures have long dominated the field of medical image segmentation, while Transformers are widely employed for modeling long-range dependencies. The former typically handles scale variations implicitly by aggregating multi-level features, whereas the efficiency of the latter is constrained by its quadratic computational and memory complexity. In this work, we propose an effective alternative to traditional U-shaped architectures by constructing parallel branches at different levels to obtain multi-scale features and corresponding predictions. Furthermore, we enhance our network by integrating Mamba, a state space model that captures long-range dependencies with linear complexity. First, a dual-path architecture with lateral connections aggregates high-level semantic information and low-level spatial details at each branch. Then, we introduce a Recursive Alignment Module (RAM) that restores spatial details in low-resolution features through stepwise alignment, optimizing them for subsequent global feature learning and multi-scale fusion. We further build parallel Mamba branches upon aligned features to establish hierarchical global representations. Finally, we propose a Mamba-based attention mechanism for adaptive multi-scale prediction fusion; this mechanism utilizes Mamba to enhance information exchange across scales along both the channel and spatial dimensions. Experiments across three imaging modalities (MRI, CT, and dermoscopy) underscore the superior generalization of the proposed network. Compared to state-of-the-art 2D CNN, Transformer, and Mamba-based networks, our model achieves the highest segmentation performance on the Synapse, ACDC, ISIC-2018, and PH2 datasets while maintaining high efficiency, with the second-smallest parameter count (14.05 M) and moderate computational complexity (8.94 GFLOPs).
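
The linear-complexity claim for state space models can be illustrated with a deliberately simplified scalar recurrence, h_t = a * h_{t-1} + b * x_t with output y_t = c * h_t. This is not Mamba and not the paper's network: Mamba makes its parameters input-dependent and operates on vector states, whereas the sketch below uses fixed scalars chosen only for illustration. What it shows is that the whole sequence is processed in a single pass, so cost grows linearly with length rather than with the number of token pairs.

```python
def ssm_scan(xs, a, b, c):
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t, starting from h = 0.
    One update per input element, so the cost is linear in len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state carries information from all earlier inputs
        ys.append(c * h)
    return ys
```

Feeding a unit impulse with a = 0.5 shows the state decaying geometrically, which is how an early input can still influence much later outputs.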