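2605.20158 2026-05-20 cs.CV cs.AI cs.CL Version update

Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

Affiliations: University of Virginia; National Institutes of Health

AI summary: This paper addresses the reliability of visual attribution for large vision language models (LVLMs) in chest X-ray reasoning. It proposes a causal evaluation framework that retains only CXR-VQA samples whose expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Across 11 attribution methods, six open-source LVLMs, and two output modes, existing attribution methods often fail to identify the evidence the LVLMs use. The paper therefore proposes MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions; it substantially outperforms existing methods.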

Abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.

2605.20147 2026-05-20 cs.CV Version update

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Haojun Chen, Haoyang He, Chengming Xu, Qingdong He, Junwei Zhu, Yabiao Wang, Zhucun Xue, Xianfang Zeng, Zhennan Chen, Xiaobin Hu, Hao Zhao, Yong Liu, Jiangning Zhang, Dacheng Tao

Affiliations: Zhejiang University; Fudan University; Nanjing University; National University of Singapore; Tsinghua University; Nanyang Technological University

AI summary: This paper presents PixVerve-95K, a dataset built with a carefully designed data pipeline that contains 95K high-resolution images with seven-dimensional annotations, to advance ultra-high-resolution image generation. It extends T2I foundation models to 100MP generation with three training schemes and establishes the PixVerve-Bench evaluation protocol.

Comments: Project page is available at https://haojunchen663.github.io/projects/PixVerve/

Abstract

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.

2605.20110 2026-05-20 cs.CV Version update

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Zhixiong Zhang, Yizhuo Li, Shuangrui Ding, Yuhang Zang, Shengyuan Ding, Long Xing, Yibin Wang, Qiaosheng Zhang, Jiaqi Wang

Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory; Fudan University; University of Science and Technology of China

AI summary: This paper proposes SetCon, which recasts open-ended referring segmentation as set-level concept prediction, using LVLM-generated natural-language concepts as semantic conditions for joint mask-set decoding to improve the completeness and mutual exclusivity of the segmented set.

Abstract

Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.

2605.20090 2026-05-20 cs.CV Version update

MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi

Affiliations: Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Shenyuan Honors College, Beihang University

AI summary: This paper proposes MetaEarth-MM, a unified framework for multi-modal remote sensing image generation that supports joint generation of multiple modalities and translation between any pair of modalities, and reports strong generative capability and broad applicability to multi-modal remote sensing observation.

Abstract

Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.

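2605.20085 2026-05-20 cs.CV Version update

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Yifan Li, Xinyu Zhou, Yunhao Ge, Yu Kong

Affiliations: Michigan State University; NVIDIA Research

AI summary: This paper formalizes SP-VTP, a visual trajectory prediction setting in which spatial prompts define the task objective, and proposes SPOT, a method combining a task encoder, an observation encoder, and a trajectory generator that improves cross-scene trajectory prediction for egocentric manipulation.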

Abstract

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.

2605.20082 2026-05-20 cs.CV cs.AI Version update

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled S. Refaat

Affiliations: Waymo

AI summary: This paper proposes VL-DPO, a framework that uses a vision-language model as a zero-shot reasoner to generate preference pairs for finetuning autonomous driving models toward human driving preferences; experiments show improvements over the baseline model on both RFS and ADE.

Comments: Published in International Conference on Robotics and Automation (ICRA), 2026. 8 pages, 6 figures, 4 tables

Abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
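
The DPO objective named in the abstract can be sketched for a single preference pair. This is a minimal illustration of the standard DPO loss, not the authors' training code: the function names, the default `beta`, and the example values are assumptions made for illustration, and the inputs are assumed to be sequence-level log-probabilities of the preferred and dispreferred trajectories under the finetuned model and under the frozen pretrained (reference) model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are log-probabilities of a whole trajectory.
    `beta` (hypothetical default) controls how strongly the policy
    is pushed away from the reference model.
    """
    # Implicit reward of each trajectory: log-ratio of policy to reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy already favours the chosen trajectory: the loss is lower.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this formulation the loss falls below log 2 exactly when the policy, relative to the reference, assigns a larger log-ratio to the chosen trajectory than to the rejected one.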

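2605.20079 2026-05-20 cs.CV cs.AI cs.LG eess.IV Version update

Probability-Conserving Flow Guidance

Parsa Esmati, Junha Hyung, Amirhossein Dadashzadeh, Jaegul Choo, Majid Mirmehdi

Affiliations: University of Bristol; KAIST

AI summary: This paper proposes AdaMaG, a probability-conserving flow guidance method. Analysing guidance through the continuity equation, it decomposes the guidance effect into a divergence term and a score-parallel term and controls both with a time-dependent schedule and score-parallel attenuation, improving generation quality and reducing hallucinations at no additional inference cost.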

Abstract

Diffusion and flow-based generative models dominate visual synthesis, with guidance aligning samples to user input and improving perceptual quality. However, Classifier-Free Guidance (CFG) and extrapolation-based methods are heuristic linear combinations of velocities/scores that ignore the generative manifold geometry, breaking probability conservation and driving samples off the learned manifold under strong guidance. We analyse guidance through the continuity equation and show its effect decomposes into a divergence term and a score-parallel term defined invariantly across parameterisations. We prove the divergence term blows up structurally as sampling approaches the data manifold, motivating a time-dependent schedule alongside score-parallel attenuation. The resulting plug-and-play rule, Adaptive Manifold Guidance (AdaMaG), bounds both terms at no additional inference cost. Finally, we show that most empirical heuristics for reducing saturation or improving generation quality correspond directly to the two terms in our decomposition. Across image generation benchmarks, AdaMaG improves realism, reduces hallucinations, and induces controlled desaturation in high-guidance regimes.
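
For reference, the classifier-free guidance rule that the abstract describes as a heuristic linear combination fits in a few lines. The sketch below shows only that standard baseline, not AdaMaG. The function name, the list-of-floats representation of a velocity or score, and the scale convention (a scale of 1 recovers the conditional prediction) are assumptions made for illustration.

```python
def cfg_combine(v_uncond, v_cond, guidance_scale):
    """Standard classifier-free guidance as a linear extrapolation.

    v_uncond, v_cond: unconditional and conditional predictions
    (velocities or scores) as equal-length lists of floats.
    guidance_scale: 0 gives the unconditional prediction, 1 the
    conditional one, and values above 1 extrapolate past it.
    """
    if len(v_uncond) != len(v_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 1.0]
    for scale in (0.0, 1.0, 3.0):
        print(scale, cfg_combine(uncond, cond, scale))
```

With a large scale the output moves far along the direction `cond - uncond`, which is the strong-guidance regime where the abstract reports samples being driven off the learned manifold.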

2605.20073 2026-05-20 cs.CV Version update

X-Ray cardiac angiographic vessel segmentation based on pixel classification using machine learning and region growing

E O Rodrigues, L O Rodrigues, J J Lima, D Casanova, F Favarim, E R Dosciatti, V Pegorini, L S N Oliveira, F F C Morais

Affiliations: Department of Academic Informatics (DAINF), Universidade Tecnologica Federal do Parana (UTFPR); Graduate Program of Applied Sciences to Health Products, Universidade Federal Fluminense (UFF); Primary Health Care, Pato Branco Prefecture, Parana, Brazil; Innovation Office, Mass General Brigham Hospital, Cambridge, Massachusetts, United States of America

AI summary: This paper proposes a pixel-classification approach to X-ray vessel segmentation that combines texture features with region growing and uses a random forest classifier to identify vessels, reaching 95.48% accuracy.

Journal ref: Biomedical Physics & Engineering Express 2021

2605.20064 2026-05-20 cs.CV Version update

Cardiac fat segmentation using computed tomography and an image-to-image conditional generative adversarial neural network

Guilherme Santos da Silva, Dalcimar Casanova, Jefferson Tales Oliva, Erick Oliveira Rodrigues

Affiliations: Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR)

AI summary: This study proposes a deep learning approach that uses the pix2pix network to automatically segment and quantify cardiac fat, achieving high-accuracy segmentation of epicardial and mediastinal fat and outperforming existing methods in F1-score and run time.

Journal ref: Medical Engineering & Physics 2024

DOI: 10.1016/j.medengphy.2024.104104

Abstract

In recent years, research has highlighted the association between increased adipose tissue surrounding the human heart and elevated susceptibility to cardiovascular diseases such as atrial fibrillation and coronary heart disease. However, the manual segmentation of these fat deposits has not been widely implemented in clinical practice due to the substantial workload it entails for medical professionals and the associated costs. Consequently, the demand for more precise and time-efficient quantitative analysis has driven the emergence of novel computational methods for fat segmentation. This study presents a novel deep learning-based methodology that offers autonomous segmentation and quantification of two distinct types of cardiac fat deposits. The proposed approach leverages the pix2pix network, a generative conditional adversarial network primarily designed for image-to-image translation tasks. By applying this network architecture, we aim to investigate its efficacy in tackling the specific challenge of cardiac fat segmentation, despite not being originally tailored for this purpose. The two types of fat deposits of interest in this study are referred to as epicardial and mediastinal fats, which are spatially separated by the pericardium. The experimental results demonstrated an average accuracy of 99.08% and f1-score 98.73 for the segmentation of the epicardial fat and 97.90% of accuracy and f1-score of 98.40 for the mediastinal fat. These findings represent the high precision and overlap agreement achieved by the proposed methodology. In comparison to existing studies, our approach exhibited superior performance in terms of f1-score and run time, enabling the images to be segmented in real time.
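
The accuracy and F1-score figures quoted in the abstract are standard classification metrics. The sketch below shows how they are commonly computed from per-pixel confusion counts for one fat class. It is an illustration only: the abstract does not say how the authors aggregate pixels or images, and the function name and example counts are hypothetical.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1-score from binary confusion counts.

    tp, fp, fn, tn: true positives, false positives, false negatives
    and true negatives (for example, pixels of one fat class).
    Returns fractions in [0, 1]; multiply by 100 for percentages.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no samples")
    accuracy = (tp + tn) / total
    # F1 is the harmonic mean of precision and recall, which
    # simplifies to 2*TP / (2*TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Hypothetical counts for a 100-pixel patch.
    acc, f1 = accuracy_and_f1(tp=8, fp=2, fn=2, tn=88)
    print(f"accuracy={acc:.2%}, f1={f1:.2%}")
```

Because background pixels usually dominate a CT slice, accuracy can be high even when the fat class is segmented poorly, which is one reason to report F1 alongside it.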

2605.20044 2026-05-20 cs.CV Version update

OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives

Guiyu Liu, Niklas Vaara, Janne Mustaniemi, Juho Kannala, Janne Heikkilä

Affiliations: Center for Machine Vision and Signal Analysis, University of Oulu, Finland; Aalto University, Finland

AI summary: OP2GS introduces a dual-opacity mechanism that augments each primitive with an explicit instance identity and a dedicated instance opacity σ*, addressing the lack of object-level identity in 3D Gaussian Splatting and supporting open-vocabulary scene understanding.

Comments: Under review

Abstract

3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $σ^{*}$ for object-mask rendering. The original opacity $σ$ remains responsible for visual reconstruction, while $σ^{*}$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.

2605.20035 2026-05-20 cs.CV Version update

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Zijie Xin, Jie Yang, Ruixiang Zhao, Tianyi Wang, Fengyun Rao, Jing Lyu, Xirong Li

Affiliations: Renmin University of China; WeChat Vision, Tencent Inc.

AI summary: This paper proposes SEATS, a stage-adaptive token selection method that improves the inference efficiency of omni-modal large language models, achieving a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.

Comments: Code Link: https://github.com/xxayt/SEATS

Abstract

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
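
To make the idea of a retention budget concrete, the sketch below keeps a fixed fraction of tokens ranked by a relevance score. It is a generic top-k illustration under stated assumptions, not the SEATS procedure: the staged pruning, the attention-weighted diversity selection, and the window-to-modality budget allocation described in the abstract are not reproduced, and the function name and scores are hypothetical.

```python
def keep_top_tokens(scores, keep_ratio=0.1):
    """Return indices of the tokens to keep, in original order.

    scores: one relevance score per non-textual token (higher means
    more relevant to the query). keep_ratio: fraction of tokens to
    retain; at least one token is always kept.
    """
    if not scores:
        return []
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the kept tokens stay a valid sequence.
    return sorted(ranked[:k])


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.6, 0.7, 0.0]
    print(keep_top_tokens(scores, keep_ratio=0.2))
```

Sorting the surviving indices preserves the interleaved temporal order of the video and audio tokens, which matters because the model's positional information depends on it.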

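2605.20033 2026-05-20 cs.CV cs.GT Version update

A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

Rohit Sinha, Kunal Tilaganji, Tanuja Ganu, Nagarajan Natarajan, Amit Sharma, Vineeth N. Balasubramanian

Affiliations: Microsoft Research India; Indian Institute of Technology Hyderabad

AI summary: This paper proposes a training-free method for multimodal step verification that treats verification as a coordination problem among specialized judges and formalizes their interaction as a Nash equilibrium game. Equilibrium scores are computed in closed form, enabling disagreement-aware filtering and stability-conscious ranking; experiments indicate that cross-modal agreement, rather than average confidence, provides a robust verification signal.

Comments: ICLR 2026 Workshop VerifAI-2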

Abstract

Multimodal large language models often generate reasoning chains containing subtle errors that lead to incorrect answers. Current verification approaches have notable limitations. Learned critics need extensive labeled data and show inconsistent performance across different tasks. Meanwhile, existing training-free methods simply average scores from different sources, missing a key insight: when these scores disagree, that disagreement itself carries important information about whether a reasoning step is truly valid or not. We propose a training-free verification approach that treats step-wise verification as a coordination problem among specialized judges. We formalize these judges' interaction as a Nash equilibrium game where agreement signals valid steps while disagreement reveals instability. Our method computes equilibrium scores through a closed-form solution, enabling both disagreement-aware filtering and stability-conscious ranking of reasoning steps. Evaluated across six benchmarks, our approach achieves consistent improvements of 2.4% to 5.2% over baseline models and shows competitive performance against learned critics, demonstrating that cross-modal agreement (not just average confidence) provides robust verification signals without task-specific adaptation.

2605.20016 2026-05-20 eess.IV cs.CV Version update

FGSVQA: Frequency-Guided Short-form Video Quality Assessment

Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull

Affiliations: School of Computer Science, University of Bristol

AI summary: This paper proposes an end-to-end video quality assessment framework that uses a CLIP-based dense visual encoder and compression priors from the frequency domain to generate artifact- and structure-aware weight maps for efficient video quality prediction.

Comments: 4 pages, 1 figure

Abstract

Short-form video poses new challenges to the quality assessment of user-generated content (UGC) due to its complex generation pipeline, rapid content variation, and mixed distortions. To address this challenge, we propose an end-to-end video quality assessment (VQA) framework that employs a dense visual encoder based on CLIP, and incorporates compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. By explicitly decomposing artifact, structure, and original visual feature branches and adaptively fusing them over time through a learned gating module, the proposed method achieves accurate and efficient quality prediction. Experimental results show that our method achieves strong performance on short-form video datasets in terms of average rank and linear correlation (SRCC: 0.736, PLCC: 0.787), while maintaining efficient inference runtime. The code and additional results are available at: https://github.com/xinyiW915/FGSVQA.
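
The two correlation metrics reported in the abstract, SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation), compare predicted quality scores with subjective scores. The sketch below is a plain-Python illustration with hypothetical function names and data. It omits the nonlinear mapping that video quality papers often fit before computing PLCC, and it is not the authors' evaluation script.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0, 4.0]    # hypothetical model outputs
    subjective = [1.0, 4.0, 9.0, 16.0]  # hypothetical opinion scores
    print(spearman(predicted, subjective), pearson(predicted, subjective))
```

SRCC depends only on the ordering of the scores, so any monotonic relationship yields a value of 1, while PLCC also penalizes departures from a straight line.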

2605.19995 2026-05-20 cs.CV Version update

RECIPE: Procedural Planning via Grounding in Instructional Videos

Luigi Seminara, Antonino Furnari, Lorenzo Torresani

Affiliations: Khoury College of Computer Sciences, Northeastern University, Boston; Department of Mathematics and Computer Science, University of Catania, Italy

AI summary: This work proposes RECIPE, which improves procedural planning by exploiting grounding in instructional videos. Verification against precomputed text embeddings scales to large video corpora, improving planning accuracy and robustness.

Abstract

Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.
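
The abstract uses grounding quality as the reward for GRPO. The core of GRPO is a group-relative advantage: several plans are sampled for the same input, each is scored, and each reward is normalized against the group's mean and standard deviation. The sketch below illustrates only that normalization step. The function name, the population standard deviation, the `eps` guard, and the example rewards are assumptions, and the grounding reward itself is not implemented.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one group of sampled plans.

    rewards: one scalar reward per sampled plan for the same input
    (here imagined as a grounding-quality score).
    Returns (reward - group mean) / (group std + eps) for each plan.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical grounding scores for four sampled plans.
    print(grpo_advantages([0.2, 0.8, 0.5, 0.5]))
```

Plans that ground better than the group average receive positive advantages and are reinforced, so no learned value model or clean step labels are needed, only a way to score each sampled plan.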


2605.19974 2026-05-20 cs.CV Version update

SphericalDreamer: Generating Navigable Immersive 3D Worlds with Panorama Fusion

Antoine Schnepf, Karim Kassab, Flavian Vasile, Andrew Comport

Affiliations * Université Côte d'Azur, CNRS, I3S, France; Criteo AI Lab, Paris, France

AI summary This study proposes SphericalDreamer, which generates multiple panoramic images, lifts them into 3D space, and fuses them to produce highly detailed, navigable, immersive 3D outdoor environments, markedly improving scale and navigability.

Comments Accepted at ICML 2026. Project page available at https://sphericaldreamer.github.io

2605.19957 2026-05-20 cs.CV cs.AI cs.RO Version update

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Zuyao Lin, Jianhui Zhang, Peidong Jia, Xiaoguang Zhao, Shanghang Zhang, Xingyu Chen

Affiliations * Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peking University

AI summary This paper proposes a new world-ego modeling paradigm that decomposes future evolution into world and ego components to address degradation in long-horizon embodied scenarios with hybrid tasks, and validates it on the HTEWorld benchmark.

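Details

AI abstract (translated from Chinese)

World models are widely studied in embodied intelligence, but they usually predict the distinct evolutions of the world and the ego in a single stream, where the world captures persistent, instruction-agnostic scene regularities and the ego captures robot-centric, instruction-conditioned dynamics. This world-ego entanglement causes degradation in long-horizon embodied scenarios, especially in hybrid tasks where navigation and manipulation behaviors alternate. In this paper we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives (motion, semantics, and intention) and analyze three disentanglement strategies (post-, pre-, and full disentanglement). We further instantiate the paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we also build HTEWorld, the first long-horizon world-modeling benchmark, containing 125,000 video clips (over 4.5 million frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2,000 instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while staying competitive on existing manipulation-only benchmarks.

English abstract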

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.


2605.19956 2026-05-20 cs.CV Version update

Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

Jia-Wei Hai, Yijun Wang, Xiu-Shen Wei

Affiliations * School of Computer Science and Engineering; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications; Southeast University; School of Intelligence Science and Engineering

AI summary This paper proposes Attention-Guided Test-Time Prompt Tuning (A-TPT) to improve the robustness of vision-language models under adversarial attack, using a refined gradient attention mechanism and spatially varying augmentation intensities to improve performance in fine-grained scenarios.

Comments Accepted by ICML 2026. Project page: this https URL. Code: this https URL.

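Details

AI abstract (translated from Chinese)

Vision-language models (VLMs) such as CLIP achieve strong zero-shot performance on downstream tasks with various fine-tuning and adaptation methods. However, recent studies show that adversarial attacks can significantly degrade the inference ability of VLMs, posing major risks for practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, but they struggle to identify semantic information and tend to destroy discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically important regions that survive adversarial attacks. We then use these regions to guide spatially varying augmentation intensities and multi-view ensembling for prompt tuning and inference. Extensive experiments show that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Code is available at https://github.com/SEU-VIPGroup/A-TPT.

English abstract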

Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have shown that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Code is available at https://github.com/SEU-VIPGroup/A-TPT.
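
The abstract builds on gradient attention rollout but does not spell out the refinement. As a point of reference only, the following is a minimal sketch of plain attention rollout, without the gradient weighting and without any of A-TPT's changes. It assumes head-averaged attention matrices whose rows sum to 1; the function names and the example values are illustrative and are not taken from the A-TPT code.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Plain attention rollout over head-averaged attention matrices.

    Each layer's attention A is mixed with the identity (0.5*A + 0.5*I) to
    account for the residual connection, then the mixed matrices are
    multiplied from the first layer to the last.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


# Two tokens, two layers (hypothetical values; each row sums to 1).
layers = [
    [[0.5, 0.5], [0.0, 1.0]],
    [[1.0, 0.0], [0.5, 0.5]],
]
rollout = attention_rollout(layers)
```

Each row of `rollout` still sums to 1 and can be read as how much every input token contributes to that output token after all layers.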


2605.19950 2026-05-20 cs.CV Version update

AffectVerse: Emotional World Models for Multimodal Affective Computing

Bo Zhao, Fanghua Ye, Yixin Ji, Sicheng Zhao, Xiaojiang Peng, Zitong YU

Affiliations * Great Bay University; Tencent; Tsinghua University; Shenzhen Technology University

AI summary This study proposes AffectVerse, a Qwen2.5-Omni-based multimodal affective computing model that adds an Emotion World Module for short-horizon latent affective prediction and uses future prediction as a self-supervised signal to improve accuracy.

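Details

AI abstract (translated from Chinese)

Humans infer emotions by combining observed multimodal cues with expectations about how affective states may evolve. Existing multimodal large language models (MLLMs), however, usually treat emotion recognition as static fusion over complete audiovisual-text inputs and overlook affective dynamics. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free, representation-level module for short-horizon latent affective prediction. EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video and audio representations through multi-step rollout; 2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation compresses the imagined tokens into modality-aware belief tokens; 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a past-conditioned self-supervised signal: it neither replaces modeling of the observed history nor requires unseen signals, but it forces the current belief state to encode transition cues that predict subsequent affective change. Across nine benchmarks, AffectVerse improves on other models by at least 2.57%, and controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest that predictive belief-state modeling is a practical alternative for affective computing.

English abstract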

Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout. 2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation compresses imagined tokens into modality-aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a past-conditioned self-supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves by at least 2.57% over other models, while controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest predictive belief-state modeling is a practical alternative for affective computing.


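2605.19949 2026-05-20 cs.CV Version update

Feed-Forward Gaussian Splatting from Sparse Aerial Views

Dongli Wu, Zhuoxiao Li, Tongyan Hua, Yinrui Ren, Xiaobao Wei, Rongjun Qin, Wufan Zhao

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Peking University; The Ohio State University

AI summary This paper proposes AnyCity, an observation-grounded generative reconstruction framework that addresses evidence imbalance when reconstructing large-scale urban scenes from sparse aerial views, using an observation-supported geometry latent and scaffold-conditioned aerial completion tokens to reconstruct high-quality 3D Gaussian scenes.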

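Details

AI abstract (translated from Chinese)

Reconstructing large-scale urban scenes from sparse aerial views is a crucial but challenging task. Because camera poses are biased toward top-down and shallow-oblique angles, sparse aerial captures show strong evidence imbalance: roofs and open areas are observed repeatedly, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods regress a deterministic representation directly from sparse inputs, which often produces ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors, but they usually lack a clear separation between observed geometry and prior-driven content, which can yield plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior supplies fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.

English abstract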

Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.


2605.19931 2026-05-20 cs.CV cs.AI cs.LG Version update

StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

Reza M. Asiyabi, Juan Alberto Molina-Valero, The SEOSAW Partnership, Steven Hancock, Casey M. Ryan

Affiliations * School of Geosciences, University of Edinburgh, UK; National Centre for Earth Observation (NCEO), UK; Department of Spatial Sciences, Faculty of Environmental Sciences, Czech University of Life Sciences Prague, Praha, Czech Republic

AI summary This paper addresses multi-task dense regression under disjoint partial supervision with MNAR labels and proposes StruMPL, which combines a shared encoder, a learnable physics module, and an Augmented IPW loss to improve estimates of forest aboveground biomass.

Comments 10 pages with 3 figures and 4 tables, References and Appendix 12 pages with 1 figure and 4 tables

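Details

AI abstract (translated from Chinese)

Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimates, while ground plots provide biomass at thousands of biased locations but no structure metrics. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalize this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address these jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, together with a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and the imputation baseline; we show analytically and empirically that both are necessary for joint optimization to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, and a stratified analysis shows that AIPW reduces high-AGB bias by about 54%.

English abstract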

Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.
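
For readers unfamiliar with the AIPW pseudo-outcome mentioned above, the textbook doubly robust form is ỹ = m(x) + (r / π(x)) · (y − m(x)), where r indicates whether the label was observed, π(x) is the propensity, and m(x) is the imputation baseline. The sketch below implements only that textbook form for a single sample. It is not StruMPL's loss; the function name, the clipping threshold `eps`, and the example numbers are assumptions made for illustration.

```python
def aipw_pseudo_outcome(y, observed, propensity, imputed, eps=1e-3):
    """Textbook AIPW pseudo-outcome for one sample.

    y          : observed label (ignored, and may be None, when not observed)
    observed   : True if the label is available for this sample
    propensity : estimated probability that the label is observed
    imputed    : imputation baseline m(x)
    eps        : hypothetical lower clip on the propensity, to bound the weight
    """
    if not observed:
        # Unlabeled samples fall back to the imputation baseline.
        return imputed
    pi = max(propensity, eps)
    # Labeled samples add the inverse-propensity-weighted residual.
    return imputed + (y - imputed) / pi


# A plot observed with probability 0.5, label 10.0, imputation baseline 6.0.
labeled = aipw_pseudo_outcome(10.0, True, 0.5, 6.0)
unlabeled = aipw_pseudo_outcome(None, False, 0.5, 6.0)
```

In an autodiff framework, the stop-gradients described in the abstract would correspond to treating `propensity` and `imputed` as constants when differentiating a loss built on this value.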


2605.19929 2026-05-20 cs.CV cs.AI Version update

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang, Guolei Sun

Affiliations * VCIP, College of Computer Science, Nankai University; D-ITET, ETH Zürich; OPPO Research Institute; Department of Computing, Hong Kong Polytechnic University

AI summary This paper proposes SplitQ, a framework that uses channel splitting and an adaptive cross-modal calibration module to counter the accuracy loss that modality heterogeneity causes in low-bit quantization of large vision-language models, markedly improving performance on several multimodal datasets.

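Details

AI abstract (translated from Chinese)

Low-bit post-training quantization (PTQ) is a key technique for deploying vision-language models (VLMs) on resource-constrained devices. However, existing PTQ methods reduce VLM accuracy because the text and vision modalities have heterogeneous activation distributions during quantization. We find that this cross-modal heterogeneity is unevenly distributed across channels: a small subset of channels contains most modality-specific outliers, and these outliers usually sit in different channels for each modality. Motivated by this, we propose SplitQ, a channel-splitting-driven post-training quantization framework. At its core is a new Modality-specific Outlier Channel Decoupling (MOCD) module that isolates salient modality-specific outlier channels with minimal overhead. To address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that uses dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs show that SplitQ significantly outperforms existing methods on 6 popular multimodal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ.

English abstract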

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ
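
To see why a handful of outlier channels matters at low bit widths, the toy example below applies symmetric uniform quantize-dequantize once with a single shared scale and once with the outlier channel kept out of the group. It illustrates the general motivation for channel splitting only; it does not implement SplitQ's MOCD or ACC modules, and the activation values and helper names are hypothetical.

```python
def fake_quantize(values, bits):
    """Symmetric uniform quantize-dequantize using one scale for all values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical activations: three ordinary channels and one outlier channel.
activations = [0.1, -0.2, 0.3, 40.0]

# Shared scale: the outlier sets the step size, so the small values collapse to 0.
shared = fake_quantize(activations, bits=4)
shared_error = mean_abs_error(activations[:3], shared[:3])

# Split: the ordinary channels get their own, much finer scale.
split = fake_quantize(activations[:3], bits=4)
split_error = mean_abs_error(activations[:3], split)

# Retention figure quoted in the abstract: 69.5 vs. 74.3 under W3A3.
retention = round(69.5 / 74.3 * 100, 1)
```

With these numbers the error on the ordinary channels drops by more than an order of magnitude once the outlier no longer dictates the scale, and `retention` reproduces the 93.5% figure from the two scores in the abstract.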


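2605.19890 2026-05-20 cs.CV Version update

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse-Weather Perception in Autonomous Driving

Sherif Khairy, Catherine M. Elias

Affiliations * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt

AI summary This paper proposes CADENet, a training-free three-thread system that detects objects in adverse weather for autonomous driving through condition-adaptive enhancement and entropy-guided NMS fusion, with no retraining or additional hardware.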

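Details

AI abstract (translated from Chinese)

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhance-then-detect approaches stall the safety-critical perception loop and violate hard real-time requirements. Progress on this problem is also limited by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a nearly flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with no added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so a new weather category needs only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro) and F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the headline metric that is immune to the annotation gap. Thread S sustains about 44 FPS under enhancement load. No model retraining or additional sensor hardware is required.

English abstract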

Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.


2605.19824 2026-05-20 cs.AI cs.CL cs.CV cs.RO Version update

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

Affiliations * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt; C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; M.Eng. Robotics Candidate at Deggendorf Institute of Technology, Germany; IAV GmbH, Berlin, Germany

AI summary This study asks whether temporal conditioning in inter-agent communication can preserve or improve reasoning coherence without degrading semantic or logical consistency, and evaluates three planner architectures with progressively increasing temporal integration on curated subsets of the BDD-X dataset. Temporal conditioning changes reasoning style but yields no statistically significant improvement on standard NLP correctness metrics, although qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence.

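Details

AI abstract (translated from Chinese)

Recent attempts to support high-level scene interpretation and planning in autonomous vehicles (AVs) with ensembles of large language models (LLMs) and large multimodal models (LMMs) still treat time as a secondary property. This lack of temporal grounding leads to inconsistencies when reasoning about continuous actions, which affects safety and interpretability. This paper explores whether temporal conditioning in inter-agent communication can preserve or enhance coherence without degrading semantic or logical consistency. To this end, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. The results show that temporal conditioning changes reasoning style but yields no statistically significant improvement on standard NLP-based correctness metrics. Qualitative analysis, however, reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

English abstract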

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.


2605.19821 2026-05-20 cs.CV Version update

LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition

Jiaxin Wang, Muwei Jian, Hui Yu, Junyu Dong, Yifan Xia

Affiliations * School of Airspace and Engineering, Shandong University; School of Computer Science and Technology, Shandong University of Finance and Economics; School of Psychology and Neuroscience, University of Glasgow; Faculty of Information Science and Engineering, Ocean University of China

AI summary This paper proposes LaCoVL-FER, a landmark-guided contrastive learning network with vision-language enhancement that introduces geometric priors from facial landmarks and semantic priors from a vision-language model to address facial expression recognition in the wild and improve robustness and generalization.

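Details

AI abstract (translated from Chinese)

Facial expression recognition (FER) in real-world environments remains challenging because of uncontrolled variations in pose, occlusion, and illumination. Existing attention-based methods rely mainly on visual appearance cues, which leads to redundant and unstable attention and limits performance in complex scenes. To address these problems, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks with semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) introduces geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism that adaptively fuses landmark-based geometric features with visual appearance features to produce expression-relevant features, focusing on key facial regions and suppressing noise. In parallel, a Vision-Language Enhancement Strategy (VLES) uses the expression-relevant features to refine the general visual features extracted by the frozen pretrained CLIP image encoder, producing expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism further adapts the text features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual and textual representations are aligned as semantic priors to strengthen the robustness and generalization of FER. Quantitative and qualitative experiments show that LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets (RAF-DB, FERPlus, and AffectNet). Code is available at https://github.com/ylin06804/LaCoVL-FER.

English abstract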

Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at https://github.com/ylin06804/LaCoVL-FER.


2605.19804 2026-05-20 cs.CV cs.AI cs.LG Version update

Stitched Value Model for Diffusion Alignment

Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler

Affiliations * ETH Zurich; Google; University of Copenhagen

AI summary This paper proposes StitchVM, a stitching framework that transfers reward models pretrained on clean images to the noisy latent space through efficient transfer and fine-tuning, improving the efficiency and effectiveness of diffusion alignment.

Comments Project page: https://gohyojun15.github.io/StitchVM/

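Details

AI abstract (translated from Chinese)

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards such as prompt fidelity or aesthetic preference. This alignment is challenging because the reward is defined on clean output images, while the alignment procedure needs value-function estimates at noisy intermediate latents. Existing methods fall back on Tweedie-style or Monte Carlo approximations, trading estimator bias against computational cost: Tweedie estimates are efficient but biased, whereas Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative is a learned value function, but how to effectively train a strong, general value model specifically for noisy latents remains an open question. This paper proposes StitchVM, a model-stitching framework that efficiently transfers reward models pretrained on clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits the native ability to handle noisy latents. The stitching procedure is exceptionally lightweight: for example, stitching and fine-tuning CLIP ViT-L with SD 3.5 Medium takes only 10 GPU-hours. By lifting strong pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of a rough yet costly per-sample approximation of the value function, the correct function for the actual noisy latents is built once and then amortized over many samples and iterations. We show that this approach improves a broad range of downstream steering and post-training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.

English abstract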

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.


2605.19799 2026-05-20 cs.CV cs.AI Version update

Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

Tonghao Zhuang, Shanglong Hu, Yongsheng Luo, Zhiqi Zhang, Yu Li

Affiliations * Zhuhai College of Science and Technology

AI summary This paper proposes a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images, combining SAM-Med2D for boundary refinement with DINOv3 for semantic enhancement to improve screening for fetal congenital heart disease.

Comments Accepted to the ISBI 2026 Fetal HearT UltraSound Segmentation and Diagnosis (FETUS) Challenge

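2605.19797 2026-05-20 cs.CV Version update

The Effect of Pretraining Objectives in Extreme Low-Data Fine-Grained Visual Classification: A Backbone-Controlled Study

Alexander Hackett, Srikanth Thudumu, Ginny Fisher, Jason Fisher

Affiliations * Santa Clara University; IAAIR

AI summary This paper studies how the pretraining objective affects downstream representation quality in extreme low-data fine-grained visual classification by comparing four frozen ViT-B/16 encoders, and concludes that margin-enforcing pretraining objectives should be preferred when data is scarce.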

Comments Presented at the 13th Workshop on Fine-Grained Visual Categorization (FGVC13) at CVPR 2026

Journal ref 13th Workshop on Fine-Grained Visual Categorization (FGVC13), CVPR 2026

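Details

AI abstract (translated from Chinese)

Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for choosing a pretrained encoder. Using a custom dataset of labeled images across three classes, we study how the pretraining objective affects downstream representation quality under matched backbone capacity. We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with linear and nonlinear probes under leave-one-out cross-validation. To control statistical noise in the low-N regime, we apply permutation testing (N=1000) to macro one-vs-rest AUC. The supervised and contrastive encoders give the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE does better under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 performs poorly overall in this domain. These results support a practical recommendation for extreme low-data fine-grained visual classification: prefer margin-enforcing pretraining objectives when data scarcity limits probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible.

English abstract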

Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (N=1000) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE improves under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 underperforms across probe families in this domain. These results support a practical recommendation for extreme low-data FGVC: prioritize margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible given dataset constraints.
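
The permutation test on macro one-vs-rest AUC described above can be sketched in a few lines of standard-library Python. This is a generic label-shuffling test and may differ from the authors' exact protocol. The function names, the seed, and the toy scores are assumptions; class labels are assumed to be integers 0..K-1 that index each sample's score list; and the leave-one-out step that would produce the scores is not shown.

```python
import random


def binary_auc(labels, scores):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(labels, class_scores):
    """Mean one-vs-rest AUC; class_scores[i][c] is sample i's score for class c."""
    classes = sorted(set(labels))
    aucs = []
    for c in classes:
        one_vs_rest = [1 if l == c else 0 for l in labels]
        aucs.append(binary_auc(one_vs_rest, [row[c] for row in class_scores]))
    return sum(aucs) / len(aucs)


def permutation_p_value(labels, class_scores, n_perm=1000, seed=0):
    """Shuffle labels n_perm times; p is the share of shuffles reaching the observed AUC."""
    rng = random.Random(seed)
    observed = macro_ovr_auc(labels, class_scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if macro_ovr_auc(shuffled, class_scores) >= observed:
            hits += 1
    # The +1 terms count the observed labeling itself, so p is never exactly 0.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 9 samples, 3 classes, scores that separate the classes perfectly.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
class_scores = [[0.8 if c == l else 0.1 for c in range(3)] for l in labels]
observed, p_value = permutation_p_value(labels, class_scores)
```

The default `n_perm=1000` matches the N=1000 quoted in the abstract. With so few samples the smallest attainable p-value is 1/(n_perm + 1), which is one reason the number of permutations matters in the low-N regime.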


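2605.13193 2026-05-20 cs.CV Version update

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Geng Li, Yuxin Peng

Affiliations * Wangxuan Institute of Computer Technology, Peking University

AI summary This paper proposes FIKA-Bench, a fine-grained knowledge acquisition benchmark of 311 public-source and real-life instances whose quality is ensured by filtering and auditing; evaluating the latest multimodal models and agents shows that the task remains challenging and that better agent designs are needed to improve knowledge acquisition.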

Comments Project page with code: https://ligeng0197.github.io/FIKA-Bench.github.io/

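Details

AI abstract (translated from Chinese)

Fine-grained recognition in everyday life is often not a closed-book classification problem: when people encounter an unfamiliar object, they actively search, compare visual details, and verify evidence before deciding. Existing benchmarks mainly evaluate visual recognition and neglect this ability to actively acquire external knowledge. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware, evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every instance is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, keeping only samples supported by verified evidence. Our evaluation of the latest large multimodal models (LMMs) and agents shows that the task remains challenging: the best system reaches only 25.1% accuracy, and no model exceeds 30%. A key finding is that simply equipping models with tools is not enough to close this gap; agent failures are driven mainly by wrong entity retrieval and poor visual judgment. These results indicate that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

English abstract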

Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.


2605.12640 2026-05-20 cs.CV 版本更新

MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation

MambaPanoptic：基于视觉Mamba的结构状态空间框架用于全景分割

Qing Cheng, Damiano Bertolini, Wei Zhang, Dong Wang, Niclas Zeller, Daniel Cremers

发表机构 * Technical University of Munich（慕尼黑技术大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）； Polytechnic University of Milan（米兰理工大学）； University of Stuttgart（斯图加特大学）； Wuhan University（武汉大学）； Karlsruhe University of Applied Sciences（卡尔斯鲁厄应用科学大学）

AI总结本研究提出MambaPanoptic，一种基于视觉Mamba的结构状态空间框架，旨在解决全景分割中长程上下文建模、多尺度特征表示和高效密集预测的挑战，通过引入MambaFPN和改进的PanopticFCN风格核生成器实现统一的实例和物质预测。

Comments Accepted to ISPRS Congress 2026, camera-ready version

详情

AI中文摘要

全景分割要求同时识别可计数的实例和无形态的物质区域，对长程上下文建模、多尺度特征表示和高效密集预测提出了联合需求。现有的卷积和Transformer方法难以同时满足这三个要求：卷积架构在建模长程依赖方面能力有限，而基于Transformer的方法在高分辨率下会带来二次计算成本。在本文中，我们提出MambaPanoptic，一种完全基于Mamba的全景分割框架，通过两个主要贡献来解决这些限制。首先，我们引入MambaFPN，一种自上而下的特征金字塔，利用Mamba块生成具有线性计算复杂度的全局一致、多尺度特征表示。其次，我们采用PanopticFCN风格的核生成器，产生统一的实例和物质核用于无提案的全景预测，并通过在多个网络阶段应用QuadMamba基于的特征细化模块进行增强。在Cityscapes和COCO全景分割基准测试中，实验表明MambaPanoptic在同等模型大小下一致优于PanopticDeepLab和PanopticFCN，并在Cityscapes上以更少的参数匹配或超越Mask2Former在PQ和AP上的表现。

英文摘要

Panoptic segmentation requires the simultaneous recognition of countable thing instances and amorphous stuff regions, placing joint demands on long-range context modelling, multi-scale feature representation, and efficient dense prediction. Existing convolutional and transformer-based methods struggle to satisfy all three requirements concurrently: convolutional architectures are limited in their capacity to model long-range dependencies, while transformer-based methods incur quadratic computational cost that is prohibitive at high resolutions. In this paper, we propose MambaPanoptic, a fully Mamba-based panoptic segmentation framework that addresses these limitations through two principal contributions. First, we introduce MambaFPN, a top-down feature pyramid that leverages Mamba blocks to generate globally coherent, multi-scale feature representations with linear computational complexity. Second, we adopt a PanopticFCN-style kernel generator that produces unified thing and stuff kernels for proposal-free panoptic prediction, enhanced by a QuadMamba-based feature refinement module applied at multiple network stages. Experiments on the Cityscapes and COCO panoptic segmentation benchmarks demonstrate that MambaPanoptic consistently outperforms PanopticDeepLab and PanopticFCN under comparable model sizes, and matches or surpasses Mask2Former on Cityscapes in PQ and AP while requiring fewer parameters.


2605.12320 2026-05-20 cs.CV 版本更新

Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos

在噪声时间自监督下利用对比学习进行结肠镜视频处理

Luca Parolari, Pietro Gori, Lamberto Ballan, Carlo Biffi, Loic Le Folgoc

发表机构 * Department of Mathematics, University of Padova, Padova, Italy（帕多瓦大学数学系）； LTCI, Telecom Paris, Institut Polytechnique de Paris, Palaiseau, France（巴黎电信学院）； Cosmo Intelligent Medical Devices, Dublin, Ireland（都柏林智能医疗设备公司）

AI总结本文提出一种在噪声时间自监督下利用对比学习进行结肠镜视频处理的方法，通过利用结肠镜检查的顺序流程来推导自监督关联，引入噪声感知的对比损失以处理噪声关联，从而在多项下游任务中取得了优于现有自监督和监督基线方法的性能。

Comments Accepted to MICCAI 2026

详情

AI中文摘要

学习鲁棒的息肉轨迹表示对于启用多项AI辅助结肠镜应用至关重要，从息肉特征化到自动化报告和检索。监督对比学习是学习此类表示的有效方法，但通常依赖于正确的正负定义。收集这些标签需要链接在整个视频中描绘相同基础息肉实体的轨迹，这成本高昂且需要专门的临床专业知识。在本工作中，我们利用结肠镜检查的顺序流程推导出自监督关联。由于时间推导的关联不保证正确，我们引入了噪声感知的对比损失以处理噪声关联。我们展示了所学表示在多项下游任务中的有效性，包括息肉检索和重识别、大小估计和组织学分类。我们的方法在多项任务中优于先前的自监督和监督基线方法，并且在所有任务中与最近的基座模型相匹配或超过，使用了一个仅在27个视频上训练的轻量级编码器。代码可在https://github.com/lparolari/ntssl上获得。

英文摘要

Learning robust representations of polyp tracklets is key to enabling multiple AI-assisted colonoscopy applications, from polyp characterization to automated reporting and retrieval. Supervised contrastive learning is an effective approach for learning such representations, but it typically relies on correct positive and negative definitions. Collecting these labels requires linking tracklets that depict the same underlying polyp entity throughout the video, which is costly and demands specialized clinical expertise. In this work, we leverage the sequential workflow of colonoscopy procedures to derive self-supervised associations from temporal structure. Since temporally derived associations are not guaranteed to be correct, we introduce a noise-aware contrastive loss to account for noisy associations. We demonstrate the effectiveness of the learned representations across multiple downstream tasks, including polyp retrieval and re-identification, size estimation, and histology classification. Our method outperforms prior self-supervised and supervised baselines, and matches or exceeds recent foundation models across all tasks, using a lightweight encoder trained on only 27 videos. Code is available at https://github.com/lparolari/ntssl.


2605.10180 2026-05-20 cs.CV cs.CR 版本更新

CD-TWINSAFE：一种基于ROS的数字孪生用于场景理解和安全新兴V2I技术

Amro Khaled, Farah Khaled, Omar Riad, Catherine M. Elias

发表机构 * C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt（认知驾驶研究与车辆系统实验室，埃及开罗）； Computer Science and Engineering Department - Faculty of Media Engineering and Technology（计算机科学与工程系-媒体工程与技术学院）； German University in Cairo, Egypt（埃及开罗德国大学）

AI总结本文提出了一种基于V2I的数字孪生系统CD-TWINSAFE，用于自动驾驶车辆的场景理解和安全监控，通过同时运行的两个栈结构实现车辆侧的驾驶模块和数字孪生模块，利用立体相机和Unreal Engine 5构建场景复现，并通过ROS架构实现V2I通信。

详情

DOI: 10.1109/MELECON64486.2026.11418830

AI中文摘要

本文介绍了CD-TWINSAFE，一种基于V2I的自动驾驶车辆数字孪生系统。所提出的架构由两个同时运行的栈组成，一个是车载驾驶栈，包含立体相机用于场景理解，另一个是数字孪生栈，运行Unreal Engine 5的场景复制品并返回安全警报至驾驶舱。车载栈在车辆侧实现，包括两个主要自主模块：定位和感知。通过车载传感器获取车辆的位置和方向。此外，感知模块负责处理立体相机的20fps图像，并通过两个互补的管道理解场景，包括物体检测和特征提取，包括物体速度、偏转角以及安全指标时间到碰撞和时间头道。收集的数据通过ROS架构以自定义ROS2消息的形式发送到基础设施侧，并通过UDP链接在4G调制解调器上进行V2I通信。通过数字孪生监控环境，共享消息更新生成的ego车辆和检测到的对象的信息，基于实时的定位和感知数据。通过不同驾驶场景的测试来验证所提出架构的有效性和实时响应能力。

英文摘要

This paper introduces CD-TWINSAFE, a V2I-based digital twin for autonomous vehicles. The proposed architecture is composed of two stacks running simultaneously: an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera and returns safety alerts to the cockpit. The on-board stack is implemented on the vehicle side and includes two main autonomous modules: localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. The perception module processes 20-fps images from the stereo camera and understands the scene through two complementary pipelines, which perform object detection and feature extraction, including object velocity, yaw, and the safety metrics time-to-collision and time-headway. The data collected from the driving stack are sent to the infrastructure side through the ROS-enabled architecture as custom ROS2 messages over UDP links that ride a 4G modem for V2I communication. The environment is monitored via the digital twin through the shared messages, which update the information of the spawned ego vehicle and detected objects based on the real-time localization and perception data. Several tests with different driving scenarios confirm the validity and real-time response of the proposed architecture.
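
The two safety metrics named in the abstract, time-to-collision and time-headway, have standard constant-speed definitions that can be sketched in a few lines. The abstract does not say how CD-TWINSAFE computes them, so the function names, the units (metres, metres per second), and the choice to return infinity when the gap is not closing are all assumptions made for illustration.

```python
import math


def time_to_collision(gap_m: float, ego_speed_mps: float, lead_speed_mps: float) -> float:
    """Seconds until the ego vehicle reaches the lead object at constant speeds.

    Returns math.inf when the ego vehicle is not closing the gap
    (an assumed convention, not taken from the paper).
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


def time_headway(gap_m: float, ego_speed_mps: float) -> float:
    """Seconds the ego vehicle needs to cover the current gap at its own speed."""
    if ego_speed_mps <= 0.0:
        return math.inf
    return gap_m / ego_speed_mps
```

For instance, a 30 m gap with the ego vehicle at 20 m/s and the lead object at 10 m/s closes at 10 m/s, while the headway depends only on the ego speed and the gap.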


2601.12358 2026-05-20 cs.CV cs.AI cs.RO 版本更新

From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

从提示到道路：基于大语言模型的代理行为树生成框架用于自动驾驶车辆

Omar Y. Goba, Ahmed Y. Gado, Catherine M. Elias, Ahmed Hussein

发表机构 * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt（埃及开罗德国大学计算机科学与工程系）； C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt（认知驾驶系统实验室（车辆系统中的认知驾驶研究），开罗，埃及）； IAV GmbH, Berlin, Germany（IAV GmbH，柏林，德国）

AI总结本文提出了一种基于大语言模型和多模态视觉模型的代理行为树生成框架，用于自动驾驶车辆在复杂环境中自适应导航。该框架通过链式符号提示评估场景关键性，通过上下文学习构建高层子目标，并通过生成器合成可执行的BT子树，实现了在CARLA+Nav2模拟中对突发障碍物（如道路堵塞）的成功绕行。

详情

DOI: 10.1109/ITSC60802.2025.11423726

AI中文摘要

自动驾驶车辆（AVs）需要适应性行为规划器来安全地导航不可预测的现实环境。传统的行为树（BTs）提供结构化决策逻辑，但本质上是静态的，并且需要大量人工调优，限制了其在SAE Level 5自主性中的应用。本文提出了一种代理框架，利用大语言模型（LLMs）和多模态视觉模型（LVMs）来实时生成和适应BTs。一个专门的Descriptor代理使用链式符号提示来评估场景关键性，一个Planner代理通过上下文学习构建高层子目标，一个Generator代理合成可执行的BT子树。该系统集成到CARLA+Nav2模拟中，仅在基线BT失败时触发，展示了成功绕过突发障碍物（例如道路堵塞）的能力，无需人工干预。与静态BT基线相比，该方法是一种概念验证，能够扩展到多样的驾驶场景。

英文摘要

Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.


2512.03869 2026-05-20 cs.CV cs.CY 版本更新

An Automated Framework for Large-Scale Graph-Based Cerebrovascular Analysis

一种用于大规模基于图的脑血管分析的自动化框架

Daniele Falcetta, Liane S. Canas, Lorenzo Suppa, Matteo Pentassuglia, Jon Cleary, Marc Modat, Sébastien Ourselin, Maria A. Zuluaga

发表机构 * EURECOM, Sophia Antipolis, France； School of Biomedical Engineering & Imaging Sciences, King's College London, UK； Politecnico di Torino, Torino, Italy

AI总结本文提出了一种自动化脑血管分析框架，通过骨架化生成的图表示建模血管形态，并通过区域划分、中心线提取和图构建计算15种形态学、拓扑学、分形和几何特征，以多尺度方式表征脑血管组织。

Comments Accepted at IEEE ISBI 2026

详情

AI中文摘要

我们提出了CaravelMetrics，一种用于自动化脑血管分析的计算框架，通过骨架化生成的图表示建模血管形态。该框架整合了基于图谱的区域划分、中心线提取和图构建，以计算15种形态学、拓扑学、分形和几何特征。这些特征可以全局从完整的血管网络或区域内动脉territories估计，从而实现脑血管组织的多尺度表征。应用于IXI数据集中的570个3D TOF-MRA扫描（年龄20-86岁），CaravelMetrics产生可重复的血管图，捕捉年龄和性别相关变化以及教育程度相关的血管复杂性增加，与文献中的发现一致。该框架提供了一种可扩展且完全自动的定量脑血管特征提取方法，支持规范建模和群体水平的血管健康和衰老研究。

英文摘要

We present CaravelMetrics, a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations. The framework integrates atlas-based regional parcellation, centerline extraction, and graph construction to compute fifteen morphometric, topological, fractal, and geometric features. The features can be estimated globally from the complete vascular network or regionally within arterial territories, enabling multiscale characterization of cerebrovascular organization. Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), CaravelMetrics yields reproducible vessel graphs capturing age- and sex-related variations and education-associated increases in vascular complexity, consistent with findings reported in the literature. The framework provides a scalable and fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.


2511.22940 2026-05-20 cs.CV 版本更新

One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

一对一动画：无对齐角色动画和图像姿态转换

Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang

发表机构 * Jiangnan University（江南大学）； University of Science and Technology of China（中国科学技术大学）； Chinese Academy of Sciences（中国科学院）； Beijing University of Posts and Telecommunications（北京邮电大学）； Zhejiang University（浙江大学）

AI总结本文提出了一种统一框架，用于高保真角色动画和图像姿态转换，解决了参考姿态错位问题，通过自监督补全任务和混合参考融合注意力机制提升生成质量。

Comments Project Page: https://ssj9596.github.io/one-to-all-animation-project/

详情

AI中文摘要

LLaMA-XR: 一种基于LLaMA和QLoRA微调的新型放射科报告生成框架

Md. Zihad Bin Jahangir, Muhammad Ashad Kabir, Sumaiya Akter, Israt Jahan, Minh Chau

发表机构 * Department of Computer Science and Engineering, Southeast University（计算机科学与工程系，东南大学）； School of Computing, Mathematics and Engineering, Charles Sturt University（计算、数学与工程学院，查尔斯·斯特劳特大学）； Department of Computer Science and Engineering, University of Liberal Arts Bangladesh（计算机科学与工程系，孟加拉国自由大学）； Medical Imaging Group, School of Dentistry and Medical Sciences, Charles Sturt University（医学影像组，牙科学院与医学科学学院，查尔斯·斯特劳特大学）

AI总结本文提出LLaMA-XR框架，结合LLaMA 3.1与DenseNet-121图像嵌入及QLoRA微调，提升放射科报告生成的准确性和临床相关性，同时保持计算效率。

Comments 25 pages

Journal ref Bioengineering 2026, 13(5), 493

详情

DOI: 10.3390/bioengineering13050493

AI中文摘要

自动化放射科报告生成具有减少放射科医生工作负担和提高诊断准确性的潜力。然而，从胸部X光片生成精确且具有临床意义的报告仍然具有挑战性，因为医学语言的复杂性和对上下文理解的需求。现有模型在保持准确性和上下文相关性方面存在困难。在本文中，我们提出了LLaMA-XR，一种新型框架，整合了LLaMA 3.1与基于DenseNet-121的图像嵌入以及量化低秩适应（QLoRA）微调。LLaMA-XR在保持计算效率的同时实现了改进的连贯性和临床准确性。这种效率是由一种优化策略驱动的，该策略增强了参数利用并减少了内存开销，使报告生成速度更快，计算资源需求更低。在IU X光基准数据集上进行的广泛实验表明，LLaMA-XR优于一系列最先进的方法。我们的模型在ROUGE-L得分上达到0.433，在METEOR得分上达到0.336，建立了该领域的性能新基准。这些结果突显了LLaMA-XR作为自动化放射科报告的有效且高效的AI系统潜力，提供了增强的临床效用和可靠性。

英文摘要

Automated radiology report generation holds significant potential to reduce radiologists' workload and enhance diagnostic accuracy. However, generating precise and clinically meaningful reports from chest radiographs remains challenging due to the complexity of medical language and the need for contextual understanding. Existing models often struggle with maintaining both accuracy and contextual relevance. In this paper, we present LLaMA-XR, a novel framework that integrates LLaMA 3.1 with DenseNet-121-based image embeddings and Quantized Low-Rank Adaptation (QLoRA) fine-tuning. LLaMA-XR achieves improved coherence and clinical accuracy while maintaining computational efficiency. This efficiency is driven by an optimization strategy that enhances parameter utilization and reduces memory overhead, enabling faster report generation with lower computational resource demands. Extensive experiments conducted on the IU X-ray benchmark dataset demonstrate that LLaMA-XR outperforms a range of state-of-the-art methods. Our model achieves a ROUGE-L score of 0.433 and a METEOR score of 0.336, establishing new performance benchmarks in the domain. These results underscore LLaMA-XR's potential as an effective and efficient AI system for automated radiology reporting, offering enhanced clinical utility and reliability.
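
The ROUGE-L score quoted above is built on the longest common subsequence (LCS) between a generated report and a reference report. The sketch below shows that computation under stated assumptions: whitespace tokenization, lowercasing, and a balanced F1 combination of LCS precision and recall. The abstract does not specify which ROUGE implementation or weighting LLaMA-XR used, so this illustrates the metric rather than the authors' evaluation code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed weighting)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the LCS rewards matching words in the same order without requiring them to be adjacent, a report that preserves the reference's sequence of findings scores higher than one that contains the same words shuffled.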


2505.16819 2026-05-20 cs.CV 版本更新

Character-Centered Dialogue Generation from Scene-Level Prompts

从场景级提示生成以角色为中心的对话

Taewon Kang, Ming C. Lin

发表机构 * University of Maryland at College Park, United States（马里兰大学学院市分校，美国）

AI总结本研究提出了一种模块化流程，将动作级提示转化为视觉和听觉上一致的对话，丰富了基于场景的故事叙述。通过预训练的视觉-语言编码器提取高级视觉语义，并结合结构化提示引导大型语言模型生成对话。引入递归叙述银行以保持跨场景的上下文和情感一致性，最终生成具有表现力的角色条件语音，产生完整的视听叙事。

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026). 18 pages, 5 figures

详情

AI中文摘要

最近的场景基于视频生成技术使结构化提示能够生成连贯的视觉叙述，但故事叙述中的关键方面--角色驱动的对话和言语--仍被忽视。我们提出了一种模块化流程，将动作级提示转化为视觉和听觉上一致的对话，从而丰富基于场景的故事叙述，增加自然语音和角色表达。我们的方法每场景使用一对提示，定义场景和角色行为。虽然故事生成模型如Text2Story生成视觉场景，我们专注于生成具有表现力且角色一致的陈述，这些陈述基于提示和代表性的场景图像。预训练的视觉-语言编码器提取高级视觉语义，这些语义与结构化提示结合，引导大型语言模型进行对话合成。为了在跨场景中保持上下文和情感一致性，我们引入递归叙述银行，这是一种说话者感知、时间结构化的记忆，用于积累每个角色的对话历史。受脚本理论启发，这种设计使对话能够反映不断变化的目标、社会情境和叙事角色。最后，我们将每个陈述渲染为具有表现力的角色条件语音，产生完整的视听叙述。我们的训练自由框架能够跨多样化的故事情境泛化，提供了一种可扩展的解决方案，用于连贯且以角色为中心的音频视觉叙述。

英文摘要

Recent advances in scene-based video generation enable coherent visual narratives from structured prompts, yet a key aspect of storytelling -- character-driven dialogue and speech -- remains underexplored. We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling with natural voice and character expression. Our method takes a pair of prompts per scene, defining the setting and character behavior. While a story generation model such as Text2Story produces the visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and a representative scene image. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis. To maintain contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank, a speaker-aware, temporally structured memory that accumulates each character's dialogue history. Inspired by Script Theory, this design enables dialogue that reflects evolving goals, social context, and narrative roles. Finally, we render each utterance as expressive, character-conditioned speech, producing fully voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings, providing a scalable solution for coherent, character-grounded audiovisual storytelling.


2503.16309 2026-05-20 eess.IV cs.CV physics.med-ph 版本更新

CPC-VAR：视觉自回归模型中的持续个性化与组合生成

Junhao Li, Xinhao Zhong, Yi sun, Yuxia Qiao, Bin Chen, Shu-Tao Xia, Yaowei Wang

发表机构 * Harbin Institute of Technology, Shenzhen（哈尔滨工业大学（深圳））； Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Peng Cheng Laboratory（鹏城实验室）； South China University of Technology（华南理工大学）

AI总结本文研究了视觉自回归模型中的持续个性化生成问题，提出了一种统一框架，通过梯度基概念神经元选择和上下文感知组合策略，解决了连续单概念学习和多概念合成中的关键挑战，提升了长序列持续个性化和多概念图像合成的性能。

详情

AI中文摘要

视觉自回归（VAR）模型最近涌现出作为一种高效的文本到图像生成范式。尽管其强大的生成能力，现有的基于VAR的个性化方法仍局限于静态设置，无法适应不断变化的用户需求。特别是，序列概念学习导致严重的灾难性遗忘，而多概念合成常遭受特征纠缠和属性不一致的问题。在本文中，我们首次系统研究了VAR模型中的持续个性化生成。我们识别出两个关键挑战：（i）在连续定制过程中保持已学习的概念，以及（ii）以可控的方式组合多个个性化概念。为了解决这些问题，我们提出了一种统一框架，包含两个核心组件。对于持续单概念学习，我们引入了基于梯度的概念神经元选择（GCNS），该方法识别出与概念相关的神经元，并仅约束跨任务的冲突参数，从而有效缓解遗忘而不增加模型规模。对于多概念合成，我们提出了一种上下文感知的组合策略，通过多分支特征建模和局部跨注意力融合，由空间条件引导，实现了精确且解耦的概念组合。大量实验表明，我们的方法在长序列持续个性化中显著提高了性能，并在多概念图像合成中优于现有基线。这些发现突显了VAR模型在可扩展和可控个性化生成中的潜力。

英文摘要

Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.


2605.19744 2026-05-20 cs.CV 版本更新

Real-World On-Vehicle Evaluation of Embedding-Based Anomaly Detection

车载场景中基于嵌入的异常检测实测

Albert Schotschneider, Daniel Bogdoll, Svetlana Pavlitska, Ahmed Abouelazm, Johann Marius Zoellner

发表机构 * FZI Research Center for Information Technology（FZI信息科技研究中心）； KIT Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

AI总结本文提出了一种适应性强的实时异常检测方法，利用预训练视觉变换器嵌入来检测潜在异常，通过在潜在语义特征空间中使用最近邻相似性检测偏差，并在真实世界场景中评估了该方法的性能。

Comments Accepted at CVPR 2026 Workshop AUTOPILOT-NA

详情

AI中文摘要

在自动驾驶中检测交通场景中的异常对于确保安全至关重要，但收集具有代表性的异常数据仍然具有挑战性。现有的异常检测方法高度专业化，并且依赖于抽象语义Cityscapes类定义的正常性，这使得难以适应多样的现实世界场景。我们提出了一种适应性强的实时异常检测方法，该方法利用预训练的视觉变换器嵌入作为基础模型，通过潜在语义特征空间中的最近邻相似性来检测偏差。基于逐块处理，该算法生成密集的异常掩码，允许定位检测到的异常。该方法通过单个参考图像稳健地建模正常性。这种形式避免了显式监督和数据集特定的训练，使其适合现实世界部署。我们在标准基准和自动化车辆的真实场景中评估了该方法。尽管其简单性，该方法在Road Anomaly基准上表现良好，并在实践中表现出一致的定性行为，成功地在多样化的场景中突出显示语义上不寻常的对象。这些结果表明，在现实操作条件下，简单的基于参考的方法可以提供有用的异常信号。

英文摘要

Detecting anomalies in traffic scenes is crucial for ensuring safety in autonomous driving, yet collecting representative anomalous data remains challenging. Existing anomaly detection methods are highly specialized and rely on normality as defined by the abstract semantic Cityscapes classes, making it difficult to adapt to diverse real-world scenarios. We propose an adaptable real-time anomaly detection method that leverages foundation models in the form of pretrained vision transformer embeddings to detect deviations via nearest-neighbor similarity in the latent semantic feature space. Based on patch-wise processing, the algorithm produces dense anomaly masks, allowing for the localization of detected anomalies. The method robustly models normality through a single reference image. This formulation avoids explicit supervision and dataset-specific training, making it suitable for real-world deployment. We evaluate the method on standard benchmarks and on an automated vehicle in real-world scenarios. Despite its simplicity, the method achieves good performance on the Road Anomaly benchmark and demonstrates consistent qualitative behavior in practice, successfully highlighting semantically unusual objects in diverse scenes. These results suggest that simple, reference-based methods can provide useful anomaly signals under realistic operating conditions.
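
The core idea described above, scoring each image patch by how far its embedding lies from the nearest patch embedding of a single reference image, can be sketched as follows. This is a minimal illustration and not the authors' implementation: patch embeddings are assumed to be precomputed (for example by a pretrained vision transformer), and the cosine similarity measure, the `1 - similarity` score, and the fixed threshold are assumptions chosen for clarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def patch_anomaly_scores(query_patches, reference_patches):
    """Score each query patch as 1 - (max cosine similarity to any reference patch).

    reference_patches must be non-empty; it holds the patch embeddings
    of the single "normal" reference image.
    """
    scores = []
    for q in query_patches:
        best = max(cosine_similarity(q, r) for r in reference_patches)
        scores.append(1.0 - best)
    return scores


def anomaly_mask(scores, grid_shape, threshold):
    """Reshape row-major patch scores into a boolean grid of anomalous patches."""
    rows, cols = grid_shape
    return [[scores[r * cols + c] > threshold for c in range(cols)] for r in range(rows)]
```

A patch whose embedding matches something in the reference image scores near zero, while a patch with no close neighbour scores high, which is what yields a dense, localizable mask without any anomaly-specific training.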


2605.19737 2026-05-20 cs.GR cs.CV 版本更新

Decentralized Direct Volume Rendering: A Browser-Native GPU Architecture for MRI Digital Twins in Resource-Constrained Settings

去中心化直接体渲染：一种浏览器原生的GPU架构，用于资源受限环境中的MRI数字孪生

Oserebameh Augustine Beckley

发表机构 * Lagos State University（拉各斯州大学）

AI总结本研究提出了一种去中心化的浏览器原生GPU架构，用于在资源受限环境中实现高保真的MRI数字孪生，通过在低成本集成边缘GPU上执行确定性的单次通过射线投射和形态学梯度计算，实现了快速的像素生成和稳定的交互性能。

Comments 10 pages, 4 figures. Live interactive browser demo available at: https://webgpu-mri.vercel.app/ . Source code repository: https://github.com/Bahdmanbabzo/webgpu-mri

详情

AI中文摘要

数字孪体（DT）技术在手术计划和个性化医学中具有巨大潜力。然而，生成交互式、患者特异性的解剖孪体目前依赖于计算密集型的服务器端渲染（SSR）或昂贵的本地工作站，这在资源受限环境中（RCS）构成了显著的部署障碍。本文提出了一种去中心化的、客户端侧的WebGPU架构，以民主化高保真解剖数字孪体的访问。通过绕过标准的服务器端渲染管线，该框架在低成本的集成边缘GPU上执行确定性的单次通过射线投射和形态学梯度计算。消除云渲染解决方案固有的网络延迟，系统实现了小于920.0毫秒的首次像素时间（TTFP）并在>=82.0 FPS的稳定交互性。通过统一缓冲区维持连续交互保真度，实现了零延迟的组织参数操控，以支持动态临床决策。通过证明复杂的患者特异性MRI扫描的3D医学模拟可以在浏览器中原生执行，无需深度学习或外部计算依赖，该架构提供了一种可扩展且经济的平台，以促进医疗数字孪体的广泛临床应用。

英文摘要

Digital Twin (DT) technology holds immense potential for surgical planning and personalized medicine. However, generating interactive, patient-specific anatomical twins currently relies on computationally heavy Server-Side Rendering (SSR) or expensive local workstations, creating significant barriers to deployment, especially in resource-constrained settings (RCS). This paper presents a decentralized, client-side WebGPU architecture that democratizes access to high-fidelity anatomical Digital Twins. By bypassing standard server-side rendering pipelines, the framework executes deterministic single-pass raymarching and morphological gradient calculations directly on low-cost integrated edge GPUs. By eliminating the network latency inherent to cloud-rendered solutions, the system achieves a Time to First Pixel (TTFP) of under 920.0 ms and maintains stable interactivity at >= 82.0 FPS. Continuous Interaction Fidelity is maintained via uniform buffers, enabling zero-latency manipulation of tissue parameters for dynamic clinical decision-making. By proving that complex 3D medical simulations of patient-specific MRI scans can be executed natively in the browser without deep learning or external computational dependencies, this architecture provides a scalable, affordable foundation for the widespread clinical adoption of healthcare Digital Twins.


2605.19734 2026-05-20 cs.CV 版本更新

GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

GeoMamba: 一种基于几何的MambaVision框架及数据集，用于细粒度光学-雷达目标检索

Tiantong Fang, Xiuwei Wang, Jing Xiao, Wujie Zhou, Liang Liao, Mi Wang

发表机构 * School of Artificial Intelligence, Wuhan University（武汉大学人工智能学院）； School of Artificial Intelligence and Information Engineering, Zhejiang University of Science & Technology（浙江科技大学人工智能与信息工程学院）； Hangzhou Institute of Technology, Xidian University（西安电子科技大学杭州研究院）； State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University（武汉大学测绘遥感信息工程国家重点实验室）

AI总结本文提出GeoMamba框架，通过引入几何特征注入模块和几何一致性约束模块，提升光学-雷达细粒度目标检索的鲁棒性，并构建了新的FGOS-as数据集来评估跨模态检索性能。

详情

AI中文摘要

多源遥感能够互补地观测地面物体，但跨模态细粒度目标检索仍具有挑战性，尤其是在光学和雷达条件不一致的情况下。与传统的依赖配对或空间对齐样本的检索设置不同，实际的光学-雷达检索受到显著的模态差异、斑点噪声和结构不一致的影响，限制了跨模态表示学习的鲁棒性。为此，我们提出GeoMamba，一种针对光学-雷达细粒度检索的几何驱动框架。具体而言，GeoMamba引入了一个几何特征注入（GFI）模块，以增强跨模态特征交互，并结合结构先验，从而提高雷达表示的鲁棒性并促进几何一致的特征学习。此外，几何一致性约束（GCC）模块与深度监督（DS）策略一起，利用经典操作符施加层次化的几何约束，帮助在表示学习过程中保留信息丰富的物体结构。我们进一步构建了一个新的数据集FGOS-as，包含11个航空航天和海洋类别，用于评估在现实遥感场景中的不一致跨模态细粒度目标检索性能。在FGOS-as上的大量实验表明，GeoMamba在所有对所有检索设置中优于现有方法，达到了63.3%的mAP和77.0%的Rank-1准确率。

英文摘要

Multi-source remote sensing enables complementary observation of ground objects, but cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting.
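
The two retrieval metrics reported above, mAP and Rank-1 accuracy, can be computed from ranked gallery labels as in the sketch below. It assumes each query comes with the class labels of the full gallery sorted by descending similarity, and that a gallery item counts as relevant when its label equals the query label. The exact all-to-all protocol used for FGOS-as is not described in the abstract, so the data layout and function names here are hypothetical.

```python
def average_precision(ranked_labels, query_label):
    """Mean of precision@k over the ranks k where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate_retrieval(queries):
    """Return (mAP, Rank-1 accuracy) for a non-empty list of queries.

    queries: list of (query_label, ranked_gallery_labels) pairs, where the
    gallery labels are ordered from most to least similar to the query.
    """
    aps = [average_precision(ranked, label) for label, ranked in queries]
    rank1 = [1.0 if ranked and ranked[0] == label else 0.0 for label, ranked in queries]
    return sum(aps) / len(aps), sum(rank1) / len(rank1)
```

Rank-1 only checks the top result, whereas mAP rewards placing every relevant gallery item early, which is why the two numbers can differ substantially on the same system.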


2605.19728 2026-05-20 cs.CV 版本更新

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

Aero-World: 从惯性控制生成动作条件的空中视频

Abdul Mohaimen Al Radi, Kunyang Li, Yuzhang Shang, Mubarak Shah, Yu Tian

发表机构 * Institute of Artificial Intelligence, University of Central Florida（中央佛罗里达大学人工智能研究所）

AI总结本文提出Aero-World，一种将预训练图像到视频扩散模型转换为可控空中视频生成器的方法，通过注入加速度和角速度序列，利用冻结的物理探测器提供惯性一致性监督，从而提高生成视频对低级动作信号的符合度和时间稳定性。

详情

AI中文摘要

基础视频模型能够生成视觉逼真的结果，但其在具身AI中的应用受限，因为它们主要在自然语言上训练而不是低级控制信号。这种限制在空中飞行中尤为明显，因为运动发生在无约束的6自由度空间中，微小的自我运动误差会产生大的轨迹漂移。生成遵循精细惯性动作的空中视频可以支持可扩展的空中代理训练和评估，通过提供可控的现实世界或昂贵模拟数据代理。为此，我们提出了Aero-World，一种将预训练图像到视频扩散模型转换为可控空中视频生成器的方法。Aero-World通过动作令牌流将加速度和角速度序列注入到预训练的潜在扩散变换器中。一个冻结的潜在空间物理探测器，独立在真实视频-IMU配对上训练，通过LoRA微调期间提供可微的惯性一致性监督，同时避免计算昂贵的视频解码。我们进一步提出了AeroBench，一个评估生成无人机视频是否符合低级动作信号的基准。AeroBench使用动作对齐分数（AAS）测量与命令惯性动作的一致性，使用物理一致性率（PCR）测量时间运动稳定性。在AeroBench上，Aero-World将平均AAS从57.7提高到63.6，比仅动作微调有更高的质量控制权衡，与AirScape相比，FVD更低（596.5 vs. 1058.6），SSIM更高（0.595 vs. 0.505），Flow-IMU相关性更高（0.44 vs. 0.20）。这些结果表明，冻结的物理探测器监督是一种将预训练视频生成器适应更动作对齐的空中运动的实用机制。

英文摘要

Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.


2605.19727 2026-05-20 cs.CV 版本更新

具有物理信息的模拟框架用于真实声纳图像生成和统计验证

Kamal Basha S, Athira Nambiar

发表机构 * Department of Computational Intelligence, SRM Institute of Science

AI总结本文提出了一种基于物理的模拟框架ACOUSIM，用于生成真实声纳图像并进行统计验证，通过比较合成与真实声纳图像的统计特性，建立了可重复的分布级基准。

详情

AI中文摘要

合成声纳数据集为昂贵的实地采集提供了可扩展的替代方案，但其效用仍受缺乏严格定量验证的限制。我们提出了ACOUSIM（ACOustic SIMulation and Validation Platform），一个具有物理信息的框架，该框架在不依赖生成模型的情况下评估合成与真实声纳图像之间的统计一致性。基于Gazebo的环境通过显式控制海底纹理、光照驱动的阴影、平台高度和噪声生成声纳样图像。真实性通过两个公开声纳数据集SeabedObjects-KLSG-II和Sonar Common Target Detection（SCTD）进行量化，使用KL散度、JS散度和地球移动距离评估全局强度和局部纹理（LBP）分布。结果表明，在所有类别中纹理一致性都很强（KL < 0.07），其中平面类强度一致性优于船舶类，因为阴影几何复杂性。ACOUSIM为sim-to-real声纳评估建立了可重复的分布级基准，并直接支持水下图像分析的可靠数据集验证。

英文摘要

Synthetic sonar datasets offer a scalable alternative to costly real-world acquisition, yet their utility remains limited by the absence of rigorous quantitative validation. We present ACOUSIM (ACOustic SIMulation and Validation Platform), a physics-informed framework that evaluates the statistical alignment between synthetic and real sonar imagery without relying on generative models. A Gazebo-based environment generates sonar-like images by explicitly controlling seabed texture, illumination-driven shadowing, platform altitude, and noise. Realism is quantified against two public sonar datasets, SeabedObjects-KLSG-II and Sonar Common Target Detection (SCTD), using global intensity and local texture (LBP) distributions assessed via Kullback-Leibler divergence, Jensen-Shannon divergence, and Earth Mover's Distance. Results show strong texture alignment (KL < 0.07) across all classes, with plane-class intensity alignment outperforming ship-class due to shadow geometry complexity. ACOUSIM establishes a reproducible, distribution-level baseline for sim-to-real sonar evaluation and directly supports reliable dataset validation for underwater image analysis.
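
The three distribution measures named above (Kullback-Leibler divergence, Jensen-Shannon divergence, and Earth Mover's Distance) can be sketched for a pair of one-dimensional histograms, such as intensity or LBP histograms of a synthetic and a real sonar image. This is an illustration under assumptions, not ACOUSIM's code: the natural logarithm, the small smoothing constant `eps`, and unit spacing between bins are choices made here, and histogram extraction from images is left out.

```python
import math


def normalize(hist, eps=1e-12):
    """Turn raw bin counts into a probability vector, smoothing empty bins."""
    smoothed = [h + eps for h in hist]
    total = sum(smoothed)
    return [h / total for h in smoothed]


def kl_divergence(p, q):
    """KL(p || q) in nats for strictly positive probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence, bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def earth_movers_distance(p, q):
    """1-D EMD on shared unit-spaced bins: sum of absolute CDF differences."""
    cdf_gap = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total
```

KL is asymmetric and sensitive to bins where the second histogram is nearly empty, JS is symmetric and bounded, and EMD also accounts for how far mass must move between bins, so reporting all three gives complementary views of the same mismatch.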


2605.19692 2026-05-20 cs.CV Version update

WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

Satoshi Tsutsui, Winnie Pang, Shuting He, Bihan Wen

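Affiliations * Rapid-Rich Object Search (ROSE) Lab, School of Electrical and Electronic Engineering, Nanyang Technological University; Shanghai University of Finance and Economics

AI summary: This paper presents the WBCAtt+ dataset, which densely annotates white blood cell images with 11 morphological attributes and five pixel-level cell components. It provides baseline models for attribute recognition and semantic segmentation and demonstrates applications such as explainable AI models.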

Comments Accepted to Medical Image Analysis. arXiv admin note: substantial text overlap with arXiv:2306.13531

English abstract

The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. The dataset and code are publicly available at https://doi.org/10.57967/hf/8143.

2605.19688 2026-05-20 cs.CV Version update

DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

Kylian Ronfleux-Corail, Guillaume Bernard, Mickaël Coustaty, Nicolas Sidère

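Affiliations * MAIF, Niort, France; L3i Laboratory, La Rochelle University, La Rochelle, France

AI summary: This paper introduces the DocQT dataset. By comparing architectures trained under different quantization tables, it shows that standard quality-factor augmentation does not represent real-world compression diversity and that architectures explicitly conditioned on the quantization table are more robust in deployment.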

English abstract

Document manipulation localization models achieve strong performance on public benchmarks yet fail to generalize to operational document workflows. We identify a critical and overlooked source of this gap: the mismatch between the narrow distribution of JPEG quantization tables used during training (restricted to standard libjpeg quality factors) and the heterogeneous compression profiles encountered in real-world insurance document pipelines. To isolate this factor, we conduct a controlled factorial study comparing two architectures with contrasting levels of quantization table awareness, FFDN [2] and Mesorch [20], each trained under either standard quality factor augmentation (Standard-QT) or operationally calibrated quantization tables sampled from DocQT, a quantization-table bank derived from a MAIF operational image corpus (Real-QT), and evaluated under three recompression conditions. Training under Real-QT yields substantial localization gains on DocTamper [15] and significantly reduces the pixel-level false positive rate on authentic operational documents, but only for architectures that explicitly ingest the quantization table as input. The released DocQT quantization-table dataset and compression-reproduction material are directly available at https://github.com/Kyliroco/Improving-Document-Forgery-Localization-Robustness-via-Diverse-JPEG-Quantization-Tables. These results demonstrate that standard quality factor augmentation does not adequately proxy operational compression diversity, and that architectural choices explicitly conditioning on the quantization table provide a meaningful robustness advantage for real-world deployment.
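
To make concrete why "standard libjpeg quality factors" span only a narrow family of quantization tables, here is a minimal sketch of the IJG libjpeg quality-scaling convention applied to the standard luminance table from Annex K of the JPEG specification. The function names are chosen here for illustration; DocQT's operational tables are not reproduced, and nothing below is taken from the authors' code.

```python
# Standard luminance quantization table (JPEG spec, Annex K), row-major order.
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def quality_to_scale(quality):
    """Map a 1..100 quality factor to a percentage scale (libjpeg convention)."""
    quality = max(1, min(100, quality))
    return 5000 // quality if quality < 50 else 200 - 2 * quality


def scaled_table(quality, base=BASE_LUMA):
    """Scale the base table and clamp each entry to the baseline range 1..255."""
    scale = quality_to_scale(quality)
    return [max(1, min(255, (value * scale + 50) // 100)) for value in base]


# Every standard table is the same base table multiplied by one scalar, so the
# whole family has a single degree of freedom; a custom table from a scanner or
# camera pipeline can vary all 64 entries independently.
standard_family = {q: scaled_table(q) for q in range(1, 101)}
```

Under this convention, augmentation over quality factors alone samples at most 100 luminance tables, which is one way to read the abstract's claim that it does not cover operational compression diversity.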

2605.19656 2026-05-20 cs.CV Version update

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Matias Turkulainen, Akshay Krishnan, Filippo Aleotti, Mohamed Sayed, Guillermo Garcia-Hernando, Juho Kannala, Arno Solin, Gabriel Brostow, Daniyar Turmukhambetov

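Affiliations * Aalto University; Georgia Tech; Niantic Spatial; University of Oulu; ELLIS Institute Finland; UCL

AI summary: This paper proposes a feed-forward view synthesis method based on georeferenced images. It fuses orthorectified satellite imagery with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame, improving scene coverage and novel-view synthesis.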

Comments Submitted to CVPR 2026. 8 figures, 3 tables. Project page: https://nianticspatial.github.io/cross-view-splatter/

English abstract

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured both at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery, allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

2605.19639 2026-05-20 cs.CV Version update

Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation

Junjie Wang, Xinghua Lou, Jason Li, Ye Tian, Keyu Chen, Yulin Li, Bin Kang, Jacky Mai, Yanwei Li, Zhuotao Tian, Liqiang Nie

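AI summary: This paper proposes the R^3-Bench benchmark and the R^3-Refiner framework for evaluating and improving reflective visual generation. By strengthening iterative reasoning and rectification, it raises the generation quality of text-to-image models.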

English abstract

Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at https://github.com/xiaomoguhz/R3-Bench.
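
The abstract names Group Relative Policy Optimization (GRPO) as a building block. A minimal sketch of the group-relative advantage at the core of GRPO follows: each sampled response is scored against the mean and standard deviation of its own group rather than against a learned value function. The reward values below are hypothetical, the use of the population standard deviation is an assumption, and the paper's Hierarchical Reward Mechanism is not modeled because the abstract does not specify it.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Returns (r_i - mean) / (std + eps) for each reward, so responses better
    than the group average get positive advantages and worse ones negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for four rectification instructions
# sampled for the same reflection context.
group_rewards = [0.9, 0.2, 0.4, 0.7]
advantages = group_relative_advantages(group_rewards)
```

Because the baseline is the group mean, a group in which all responses earn the same reward yields zero advantage everywhere and contributes no gradient signal.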

2605.19634 2026-05-20 cs.CV cs.AI Version update

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

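Affiliations * Department of Control Science and Engineering, Tongji University

AI summary: This paper proposes P2DNav, a framework that addresses directional reasoning and local grounding in zero-shot vision-and-language navigation through panorama-to-downview decomposition, a sliding-window dialogue memory, and a reflective reorientation mechanism. Experiments show strong performance on the R2R-CE benchmark.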

English abstract

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.

2605.19631 2026-05-20 cs.RO cs.CV Version update

deadtrees.earth-aerial: A Multi-Resolution Aerial Imagery Dataset for Tree Cover and Mortality Detection

Ayushi Sharma, Clemens Mosig, Lukas Drees, Salim Soltani, Janusch Vajna-Jehle, Aaron Sheppard, Belqis Ahmadi, Jonathan Schmid, Paul Neumeier, Nathan Jacobs, Jan Dirk Wegner, Teja Kattenborn

Affiliations * Chair of Sensor-based Geoinformatics, University of Freiburg; EcoVision Lab, DM3L, University of Zurich; Institute for Earth System Science and Remote Sensing, Leipzig University; Washington University, St. Louis

AI summary: This paper introduces two new open datasets for joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery. They address the lack of harmonized global datasets and yield clear performance gains across multiple biomes.

English abstract

Forests worldwide are increasingly threatened by climate change and disturbances such as fire, pests, and pathogens, creating an urgent need for scalable monitoring of tree cover and tree mortality. Aerial imagery from drones and aircraft is a key data source for detailed and large-scale mapping of tree crowns and mortality. However, related progress is limited by the lack of globally representative, harmonized datasets for joint segmentation of tree cover and mortality. We introduce two novel, open, machine-learning-ready datasets to enable joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery for the first time at global scales. With DTE-aerial-train, we provide a training dataset comprising 385K image patches of size 1024x1024 pixels, with resolutions ranging from 2.5 to 20 cm. It includes multi-class expert-annotated and -audited pseudo-labels for tree cover and mortality. With DTE-aerial-bench, we provide a geographically balanced benchmark test set of 25 globally distributed orthoimages totaling 525 patches with high-quality expert annotations for both tree cover and mortality. Both the training and benchmark datasets span tropical, temperate, boreal, and dryland biomes and cover a wide range of forest structures and mortality patterns. Using the benchmark test set for evaluation, we establish strong reference baselines that improve mortality segmentation across all biomes and scales with significant gains in challenging regions, such as boreal forests, where the F1 score increases from 0.40 to 0.58 with around 45% relative improvement. All data, models, and code will be publicly released under permissive open-source licenses. An interactive visualization of the benchmark dataset is available at deadtrees.earth/releases/dte-aerial-bench.

2605.19595 2026-05-20 cs.CV cs.AI Version update

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

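Affiliations * Department of Automation and Systems Engineering, Federal University of Santa Catarina, Florianópolis, Brazil; Applications Lab, Faculty of Science, University of Salamanca, Plaza de los Caídos s/n, 37008 Salamanca, Spain

AI summary: This paper proposes an optimized YOLO26-MoE model that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector, adapting to subtle and diverse fault patterns while keeping the efficiency of a one-stage detector. Hyperparameter optimization is carried out by an LLM agent, and the model reaches 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95 on UAV images, outperforming the latest YOLO versions.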

English abstract

The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.

2605.19559 2026-05-20 cs.CV cs.AI Version update

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang

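Affiliations * Zhejiang University

AI summary: This paper presents EgoCoT-Bench, a benchmark for evaluating fine-grained, operation-centric reasoning of MLLMs from a first-person perspective. It contains 3,172 verifiable QA pairs covering perception, anticipation, and high-level reasoning, and targets the shortcomings of existing benchmarks in fine-grained reasoning and evidence verification.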

English abstract

The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability of MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from limited grounded rationale evaluation, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce EgoCoT-Bench, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graph (STSG) guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance, and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations that reach the correct answer but cite evidence inconsistent with that answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.

2605.19556 2026-05-20 cs.CV Version update

EpiDiffVO: Geometry-Aware Epipolar Diffusion for Robust Visual Odometry

Prateeth Rao

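Affiliations * International Institute of Information Technology Bangalore

AI summary: This paper proposes a sparse epipolar matching framework that reduces redundancy by optimizing for geometric consistency, and combines an epipolar diffusion process with a graph neural network for efficient visual odometry.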

Comments 8 pages, 5 figures, in revision to be submitted to IEEE RA-L

English abstract

Estimating relative pose from image pairs fundamentally requires only a minimal subset of geometrically consistent correspondences. However, most learning-based approaches rely on dense matching or direct regression, leading to redundancy and reduced geometric interpretability. In this work, we propose a sparse epipolar matching framework that predicts a compact set of correspondences optimized for geometric consistency across varying temporal baselines. To address residual noise and misalignment, we introduce an epipolar diffusion process that models correspondence uncertainty and refines keypoints toward epipolar consistency. The refined correspondences, along with depth cues, are lifted into a graph representation forming a Steiner graph that encodes relational structure between points. A graph neural network learns a compact subset of informative correspondences, which are passed to a differentiable singular value decomposition solver for end-to-end geometric estimation. Relative pose is recovered from the resulting essential matrix and evaluated in a visual odometry setting on the TartanAir and KITTI SLAM datasets. Experimental results demonstrate that combining sparse matching, diffusion-based refinement, and graph-based subset selection reduces correspondence redundancy while maintaining robust pose estimation across challenging baselines.
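
The "epipolar consistency" that the abstract refines correspondences toward can be illustrated with the standard epipolar constraint. Under the convention that a point maps between cameras as X2 = R X1 + t, the essential matrix is E = [t]x R, and a correct correspondence in normalized image coordinates satisfies x2^T E x1 = 0. The sketch below is plain textbook geometry with helper names chosen here; it does not reproduce the paper's diffusion process, graph network, or differentiable SVD solver.

```python
import math


def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_matrix(rotation, translation):
    """E = [t]x R for the convention X2 = R X1 + t."""
    return matmul(skew(translation), rotation)


def epipolar_residual(e, x1, x2):
    """Algebraic residual x2^T E x1; zero for a geometrically exact match."""
    ex1 = matvec(e, x1)
    return sum(x2[i] * ex1[i] for i in range(3))


def project(point):
    """Normalized homogeneous image coordinates of a 3D point."""
    x, y, z = point
    return [x / z, y / z, 1.0]


def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Hypothetical relative pose and 3D point, used only to exercise the constraint.
R, t = rot_y(0.1), [1.0, 0.0, 0.2]
X1 = [0.5, -0.3, 4.0]
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
E = essential_matrix(R, t)
residual = epipolar_residual(E, project(X1), project(X2))
```

A refinement step that moves keypoints "toward epipolar consistency" can be read as reducing the magnitude of this residual for each correspondence.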

2605.19554 2026-05-20 cs.CV Version update

Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting

Yue Yu, Haibo Chen, Shuo Chen, Jian Yang, Jun Li

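Affiliations * Nanjing University of Science and Technology

AI summary: This paper proposes SCDiff, a self-creative diffusion model that uses a learnable spatial weighting module and a visual-semantic mixing loss to improve the creativity and semantic alignment of text-to-image generation.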

English abstract

Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain the generation to high-probability regions, consequently generating outputs that lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generation featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module designs a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains each new image to align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.
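
As a rough illustration of how a parametric Kaiser-Bessel window can emphasize the image center, the sketch below builds a separable 2-D weight map from the standard 1-D Kaiser window. Treating the shape parameter `beta` as the tunable quantity, and forming the 2-D map as an outer product, are assumptions made here; the abstract does not say how the LSW module parameterizes or learns the window.

```python
import math


def bessel_i0(x, terms=50):
    """Zeroth-order modified Bessel function via its power series:
    I0(x) = sum_k ((x/2)^k / k!)^2."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2.0) / k
        total += term * term
    return total


def kaiser_window(length, beta):
    """Standard Kaiser window: 1.0 at the center, 1/I0(beta) at the ends."""
    if length == 1:
        return [1.0]
    denom = bessel_i0(beta)
    window = []
    for n in range(length):
        r = 2.0 * n / (length - 1) - 1.0  # position in [-1, 1]
        arg = beta * math.sqrt(max(0.0, 1.0 - r * r))
        window.append(bessel_i0(arg) / denom)
    return window


def spatial_weights(height, width, beta):
    """Separable 2-D weight map (outer product of two 1-D windows)."""
    wy = kaiser_window(height, beta)
    wx = kaiser_window(width, beta)
    return [[a * b for b in wx] for a in wy]


# Larger beta concentrates weight more tightly around the center;
# beta = 0 gives a flat map of ones.
weights = spatial_weights(5, 5, beta=4.0)
```

Multiplying a feature map elementwise by such weights keeps central features near full strength while attenuating the borders, which matches the abstract's description of reinforcing central image features.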

2605.19551 2026-05-20 cs.GR cs.CV Version update

iDiff: An Interpretable Difference-Aware Framework for Pairwise Image Quality Assessment

Xinli Yue, JianHui Sun, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng

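Affiliations * Tencent

AI summary: This paper proposes iDiff, a dual-branch framework that combines interpretable difference modeling with structured multimodal reasoning to improve the robustness and interpretability of pairwise image quality assessment. It took first place in the NTIRE 2026 RAIM challenge.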

Comments Accepted to CVPR 2026 Workshop

English abstract

Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.

2605.19511 2026-05-20 cs.CV Version update

Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing

Xiaodong Wu, Qi Li, Xiangman Li, Zelin Zhang, Lingshuang Liu, Jianbing Ni

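Affiliations * Queen’s University; University of Waterloo

AI summary: This paper studies a fundamental but underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework for watermark-preserving text-guided image editing that explicitly incorporates watermark integrity into the editing process. Specifically, SafeMark adds a thresholded watermark decoding loss directly to the training objective of a diffusion editor and fine-tunes the editor so that semantically valid edits also preserve the embedded watermark in the final output. The design has a clear information-theoretic motivation: maintaining high bit accuracy on edited images lower-bounds the mutual information between the watermark and the edited output that survives the editing channel, a quantity that fundamentally governs watermark recoverability. SafeMark is compatible with differentiable diffusion editors and requires no architectural modification. Extensive evaluation across multiple datasets, text-guided editing methods, and post-edit distortion settings shows that SafeMark achieves high watermark bit accuracy across diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results indicate that semantic editability and watermark integrity are fundamentally compatible, making image provenance in generative editing pipelines trustworthy.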

CEPO: RLVR Self-Distillation with Contrastive Evidence Policy Optimization

Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan

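Affiliations * MBZUAI; Linköping University; Australian National University

AI summary: This paper proposes CEPO, which uses contrastive evidence policy optimization to address the problems of self-distillation in RLVR, improving model performance by separating decisive reasoning steps from filler.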

Comments 9 pages

English abstract

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.

2605.19435 2026-05-20 cs.CV cs.AI Version update

LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

Chaoyue Li, Yongxue Xu, Jie Feng, Jiayu Ding

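Affiliations * Huazhong University of Science and Technology; Sun Yat-sen University; Beihang University; Peking University

AI summary: This paper formulates trajectory-grounded multi-turn spatiotemporal dialogue as a task and proposes LMM-Track4D, which combines RTGE, the TRK state token, and an OSK-RA decoder to improve 4D dynamic reasoning in LMMs. Experiments indicate that explicit dynamic state modeling is an effective design principle.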

English abstract

Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray-Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at https://github.com/mikubaka88/LMM-Track4D.

2605.19386 2026-05-20 cs.CV Updated version

MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos

Yang Yang, Yiyan Wang, Zheming Liu, Naoya Iwamoto

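Affiliations: The University of Osaka; The University of Tokyo; Huawei Technologies Japan K.K.

AI summary: This paper proposes MatPhys, which predicts spring-mass parameters from a single-view video. It addresses the limits of existing methods regarding material assumptions and cross-scene consistency, improving the accuracy and generalization of deformable object simulation.

Comments: Submitted to SIGGRAPH Asia 2026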

Abstract

Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.

2605.19378 2026-05-20 cs.CV Updated version

Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

Haiying Sha

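Affiliations: Haiying Sha (as listed; no institution given)

AI summary: This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. From time series of routing decisions over more than 65 million tokens, it proposes the Functional Redundancy Hypothesis and outlines a three-step evolutionary roadmap from visual unification to a world model.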

Abstract

This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model.
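Illustrative sketch (not from the paper): failure mode (5) in the abstract above can be reproduced numerically. bfloat16 keeps 8 bits of significand precision, so near a weight of 1.0 the spacing between representable values is 2^-7 = 0.0078125, and an additive update much smaller than that is rounded away. The helper names `to_bfloat16` and `repeated_updates` are hypothetical, the conversion simulates round-to-nearest-even for finite inputs with the standard library only, and the higher-precision accumulator is shown as a common general remedy, not as the solution the authors describe.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32: sign, 8 exponent bits, 7 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def repeated_updates(weight: float, delta: float, steps: int,
                     high_precision_master: bool) -> float:
    """Apply `steps` additive updates and return the final weight as seen in bfloat16."""
    w = weight
    for _ in range(steps):
        if high_precision_master:
            w = w + delta  # accumulate in a higher-precision copy (assumed remedy)
        else:
            w = to_bfloat16(to_bfloat16(w) + delta)  # round to bfloat16 after every step
    return to_bfloat16(w)


if __name__ == "__main__":
    print(repeated_updates(1.0, 1e-3, 10, high_precision_master=False))
    print(repeated_updates(1.0, 1e-3, 10, high_precision_master=True))
```

With these inputs the first call returns 1.0, because every individual update of 0.001 is rounded away, while the second returns 1.0078125, because the ten updates accumulate to about 0.01 before the final rounding.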

2605.19374 2026-05-20 cs.CV cs.AI cs.LG Updated version

Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin

Affiliations: The Center for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong, China; Research Institute for Smart Ageing, The Hong Kong Polytechnic University, Hong Kong, China; School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University, Beijing, China; Queen Mary Hospital, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China

AI summary: This paper proposes CoNNS, a concept-guided noisy-negative suppression framework that builds a hierarchical concept ontology to handle the noisy negatives caused by similar findings across different patients, improving performance on zero-shot understanding tasks.

Comments: Early accepted by MICCAI 2026

Abstract

Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.
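Illustrative sketch (not from the paper): the abstract above does not give the form of the Concept-Aware NCE loss, so the following is only a generic masked InfoNCE showing what suppressing flagged noisy negatives can mean in a contrastive objective. The function name `masked_info_nce`, the boolean `noisy_mask` input, and the temperature value are assumptions; in the paper the flags would come from the ontology-based relabeling step.

```python
import math


def masked_info_nce(sim, noisy_mask, temperature=0.07):
    """Image-to-report InfoNCE that skips pairs flagged as noisy negatives.

    sim[i][j]        similarity between image i and report j (positives on the diagonal)
    noisy_mask[i][j] True when the cross-patient pair (i, j) should not count as a negative
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i] / temperature
        # Keep the positive plus every negative that was not flagged as noisy.
        logits = [sim[i][j] / temperature
                  for j in range(n) if j == i or not noisy_mask[i][j]]
        peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - positive
    return total / n


if __name__ == "__main__":
    sim = [[0.9, 0.8], [0.8, 0.9]]  # two patients with very similar findings
    keep_all = [[False, False], [False, False]]
    suppress = [[False, True], [True, False]]
    print(masked_info_nce(sim, keep_all), masked_info_nce(sim, suppress))
```

In the toy input both off-diagonal pairs are highly similar, so treating them as negatives penalizes a model that is arguably behaving correctly; flagging them removes that penalty and the loss drops.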

2605.19371 2026-05-20 cs.CV cs.AI Updated version

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

Jun Ma, Hanquan Zhang, Yanjun Qin, Haoyuan Guan, Ke Zhang

Affiliations: Department of Systems Science, Faculty of Arts and Sciences, Beijing Normal University; School of Computer Science and Technology, Xinjiang University; International Academic Center of Complex Systems, Beijing Normal University; School of Systems Science, Beijing Normal University

AI summary: This paper proposes Heat Dissipation Flow Matching (HDFM), which injects multi-scale priors through a continuous blur (heat-dissipation) process. It moves blur-based models beyond SDE-based frameworks into ODE-based frameworks such as Flow Matching, aiming to better preserve multi-scale detail and color budgets.

Abstract

Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, better preserving color budgets and multi-scale detail by providing multi-scale priors. However, blur-based models remain in SDE-based frameworks and are not integrated into ODE-based frameworks, such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classical inverse heat-dissipation (IHD) process faces an ill-posed challenge. Moreover, under the data-manifold assumption, regressing blurred images from high-dimensional noise (or velocity) space is also difficult. We propose Heat Dissipation Flow Matching (HDFM), which introduces a continuous blurred (heat-dissipation) process into FM to inject multi-scale priors. HDFM aligns an interpolated heat-dissipation path to address ill-posedness and adopts $x$-prediction to mitigate high-dimensional regression difficulty. Toy experiments and ablation studies show that HDFM consistently benefits from both blur and $x$-prediction. HDFM outperforms most baseline methods on all datasets.

2605.19360 2026-05-20 cs.CV cs.LG cs.NE physics.app-ph physics.optics Updated version

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan

Affiliations: Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA; Bioengineering Department, University of California, Los Angeles, CA, 90095, USA; California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA

AI summary: This paper presents a hybrid deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end. A programmable spatial light modulator enables massively parallel analog inference, raising the throughput and accuracy of video authenticity prediction while reducing computational cost.

Comments: 30 pages, 8 figures

Abstract

The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness, three properties that are difficult to achieve together in purely digital systems.

2605.19359 2026-05-20 cs.CV cs.LG Updated version

DynaTok: Temporally Adaptive and Positional-Bias-Aware Token Compression for Video Large Language Models

Minyoung Park, Taehun Kong, Sangjun Ahn

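Affiliations: LG Electronics, Seoul, South Korea

AI summary: This paper proposes DynaTok, a training-free, temporally adaptive and positional-bias-aware token compression framework. By allocating token budgets across the temporal and spatial dimensions, it reduces redundant spatio-temporal coverage and improves the efficiency and robustness of Video-LLMs.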

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.
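Illustrative sketch (not from the paper): the Temporal Budget Allocation idea in the abstract above, fewer tokens for redundant frames and more for novel ones via an EMA memory, can be sketched as follows. The novelty score (one minus cosine similarity to the EMA memory), the proportional split, and the names `allocate_temporal_budget`, `ema_decay`, and `min_tokens` are all assumptions made for this example; the abstract does not specify how DynaTok scores or splits the budget.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def allocate_temporal_budget(frame_features, total_budget, ema_decay=0.9, min_tokens=1):
    """Split `total_budget` tokens across frames in proportion to their novelty."""
    n = len(frame_features)
    if total_budget < n * min_tokens:
        raise ValueError("budget too small to give every frame min_tokens")
    memory = list(frame_features[0])
    novelty = [1.0]  # the first frame has nothing to be redundant with
    for feat in frame_features[1:]:
        novelty.append(max(0.0, 1.0 - cosine(feat, memory)))
        # The EMA memory summarizes what has been seen over a long horizon.
        memory = [ema_decay * m + (1.0 - ema_decay) * f for m, f in zip(memory, feat)]
    spare = total_budget - n * min_tokens
    shares = [spare * v / sum(novelty) for v in novelty]
    budgets = [min_tokens + int(s) for s in shares]
    # Hand out tokens lost to flooring, largest fractional share first.
    leftover = max(0, total_budget - sum(budgets))
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # last frame is a scene change
    print(allocate_temporal_budget(frames, total_budget=20))
```

For the toy clip, the two repeated frames receive only the minimum of one token each, and the first frame and the scene change share the remaining budget equally, giving `[9, 1, 1, 9]`.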

2605.19319 2026-05-20 cs.CV Updated version

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou

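Affiliations: Show Lab, National University of Singapore; Central South University

AI summary: This paper asks whether image editing models can serve as sparse visual world models for robot manipulation, predicting task-level future states rather than generating dense video. The proposed SWEET framework performs sparse visual planning: it generates keyframes conditioned on language instructions and spatial guidance, and a diffusion action predictor turns them into executable actions. Experiments show improved keyframe prediction across different scenes.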

Abstract

Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

2605.19307 2026-05-20 cs.CV Updated version

iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

Xuezhi Cui, Dongbo Zhou, Wang Guo, Zeyuan Wang, Ziyu Li, Gaozhi Zhou, Xian Li, Ling Zhao, Wentao Yang, Chao Tao, Haifeng Li

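Affiliations: School of Geosciences and Info-Physics, Central South University; School of Earth Sciences and Spatial Information Engineering, Hunan University of Science and Technology

AI summary: This paper proposes iGSP, a framework for efficient continual learning of vision-language models via implicit gradient subspace projection. It addresses the shortcomings of earlier methods in parameter efficiency and cross-task alignment consistency, and markedly improves training efficiency and knowledge reuse.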

Abstract

Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue that alignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately maximizing knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7% compared to current SOTA methods, and decreasing the final total parameters by 86.9% relative to counterparts. The source code is available at https://github.com/GeoX-Lab/iGSP.

2605.19289 2026-05-20 cs.CV Updated version

What Makes Synthetic Data Effective in Image Segmentation

Jinjin Zhang, Xiefan Guo, Yizhou Jin, Nan Zhou, Di Huang

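Affiliations: State Key Laboratory of Complex and Critical Software Environment; Beihang University; School of Computer Science and Engineering

AI summary: This paper studies what makes synthetic data effective for image segmentation. Analyzing synthetic images from state-of-the-art diffusion models, it finds that dense scene composition and fine instance fidelity are the key factors, and proposes SENSE, a unified framework for improving segmentation performance.

Comments: Accepted to ICML 2026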

Abstract

Driven by rapid advances in large-scale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code is available at https://github.com/zhang0jhon/SENSE.

2605.19279 2026-05-20 cs.CV Updated version

FPED: A Functional-Network Prior-Guided Mixture-of-Experts Framework for Interpretable Brain Decoding

Yudan Ren, Pengcheng Shi, Zihan Ma, Xiaowei He, Xiao Li

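Affiliations: School of Electronic Information (School of Artificial Intelligence), Northwest University

AI summary: This paper proposes FPED, a framework that models distinct functional brain networks as experts and uses an adaptive routing mechanism to capture their complementary contributions to visual semantic understanding, enabling interpretable brain decoding.

Comments: 15 pages, 4 figures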

Abstract

Visual image reconstruction from functional Magnetic Resonance Imaging (fMRI) is a fundamental task in brain decoding, providing a crucial pathway for understanding human perceptual mechanisms and developing advanced brain-computer interfaces (BCIs). However, most current methods simply flatten fMRI signals from localized visual cortices into one-dimensional (1D) vectors, mapping them directly into latent spaces such as that of Contrastive Language-Image Pre-training (CLIP). This paradigm not only disrupts the inherent network topology of the brain, which limits neuroscientific interpretability, but also overlooks the synergistic contributions of other distributed functional networks in processing high-level visual semantics. To address these limitations, we propose FPED, a Functional-Network Prior-Guided Mixture of Experts (MoE) framework for interpretable brain decoding. FPED explicitly models different functional brain networks as specialized experts and employs adaptive routing to capture their complementary contributions to visual semantic understanding. Unlike conventional homogeneous decoding paradigms, our framework incorporates neurobiologically grounded priors to enable structured and interpretable network-level representation learning. Experimental results demonstrate that FPED achieves highly competitive semantic reconstruction performance with only 0.68B parameters. The learned routing dynamics reveal biologically meaningful correspondence between functional brain networks and modality-specific semantic processing, providing transparent neuroscientific interpretability. This suggests that brain network-aware expert modeling is a promising direction for bridging neural decoding and biologically inspired artificial intelligence.

2605.19260 2026-05-20 cs.AI cs.CV cs.MA Updated version

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

Yuankai Li, Tinghui Zhu, Ha Min Son, Zhe Zhao, Xin Liu, Muhao Chen

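Affiliations: UC Davis

AI summary: This paper proposes AQuaUI, a training-free, inference-time visual token reduction method for GUI agent models. It exploits the non-uniform information density of screenshots with an adaptive quadtree, keeps retained tokens at their original positions for consistency, and uses a conditional quadtree algorithm to improve temporal consistency across multi-step GUI interactions. Experiments show improved accuracy-efficiency trade-offs.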

Abstract

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.
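Illustrative sketch (not from the paper): the adaptive quadtree described in the abstract above can be pictured on a small grid of patch values. A square region is kept as one merged token when its patches are nearly uniform and is split into four quadrants otherwise. The split criterion (value range against a threshold), the mean as the merged token, the power-of-two square grid, and the name `build_quadtree` are assumptions for this example; the abstract does not state the criterion AQuaUI uses. Each leaf keeps its `(row, col, size)` so that position encoding could still refer to original coordinates.

```python
def build_quadtree(grid, threshold, row=0, col=0, size=None):
    """Return quadtree leaves as (row, col, size, mean_value) tuples.

    grid      square 2D list of scalar patch values, side length a power of two
    threshold maximum value range allowed inside a region that is kept as one leaf
    """
    if size is None:
        size = len(grid)
    cells = [grid[r][c]
             for r in range(row, row + size)
             for c in range(col, col + size)]
    if size == 1 or max(cells) - min(cells) <= threshold:
        # Homogeneous region: one representative merged token for the whole block.
        return [(row, col, size, sum(cells) / len(cells))]
    half = size // 2
    leaves = []
    for d_row in (0, half):
        for d_col in (0, half):
            leaves += build_quadtree(grid, threshold, row + d_row, col + d_col, half)
    return leaves


if __name__ == "__main__":
    # A 4x4 "screenshot": blank background with one informative patch (an icon).
    screenshot = [[0.0] * 4 for _ in range(4)]
    screenshot[0][0] = 1.0
    leaves = build_quadtree(screenshot, threshold=0.1)
    print(len(leaves), "tokens instead of", 16)
    for leaf in leaves:
        print(leaf)
```

On the toy grid, the three blank quadrants collapse to one token each and only the quadrant containing the icon is split down to single patches, so 7 tokens cover the 16 original patches.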

2605.19256 2026-05-20 cs.CV Updated version

Distribution Matching Distillation without Fake Score Network

Youngjoong Kim, Deokyeong Lee, Jaesik Park

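Affiliations: Department of Computer Science and Engineering, Seoul National University; Department of Computer Science and Engineering, Sogang University

AI summary: This paper proposes Fake-Score-network-Free DMD (FSF-DMD), which replaces the conventional fake-score network with a pseudo-velocity induced by the flow-map generator itself to obtain a distribution-level correction, and validates it on ImageNet-1K.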

Abstract

Distribution Matching Distillation (DMD) provides an effective distribution-level correction for few-step generation, while relying on an auxiliary fake-score network to track the evolving generative distribution. Recent work combines DMD-style objectives with flow-map generators to exploit both forward-divergence training and reverse-divergence correction. The fake-score estimator remains an additional component with memory and update overhead. In this work, we study whether this explicit tracker can be avoided when the generator itself has a flow-map structure. We propose Fake-Score-network-Free DMD (FSF-DMD), a DMD formulation for flow-map generators that replaces the auxiliary fake-score estimator with a generator-induced pseudo-velocity surrogate. The key observation is that the endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal. Building on this observation, we derive a practical objective, extend it with flow-map-consistent backward simulation, and introduce a self-teacher variant for training from scratch. In our ImageNet-1K $256 \times 256$ experiments, FSF-DMD improves flow-map baselines, reaches lower FID than the listed DMD2 comparisons in the flow-map-initialized setting, and remains effective under flow-matching initialization and training from scratch.

2605.19247 2026-05-20 cs.CV version update

Structuring Open-Ended NAS: Semi-Automated Design Knowledge Structuring with LLMs for Efficient Neural Architecture Search

Yuiko Sakuma, Masakazu Yoshimura, Marcel Gröpl, Zitang Sun, Junji Otsuka, Atsushi Irie, Takeshi Ohashi

Affiliations * Sony Group Corporation; ETH Zurich

AI summary: This paper proposes a semi-automated method that uses an LLM to structure model design knowledge and guide neural architecture search. By defining a high-level structural template and introducing the FairNAD algorithm, it explores an open-ended search space efficiently and improves performance on several datasets.

Comments 42 pages

Abstract

Current neural architecture search (NAS) methods are often limited by their predefined, restrictive search spaces. While recent large language model (LLM)-assisted NAS methods enable open-ended search spaces, they often suffer from inefficient exploration due to biased or low-quality design ideas. To address these issues, we propose to semi-automatically structure model design knowledge to guide the search process. Our approach first defines a high-level structural template of architectural attributes. An LLM then populates this template by analyzing papers, creating a rich and diverse search space that embodies this structured design knowledge. To efficiently explore this vast space, we introduce FairNAD, using a multi-type mutation that enables broad exploration through mutation with fair idea sampling, Pareto-aware mutation, LLM-driven iterative mutation, and a fine-grained feedback loop. We demonstrate the effectiveness of FairNAD in discovering high-performing architectures that yield 0.84, 2.17, and 2.35 points improvement on CIFAR-10, CIFAR-100, and ImageNet16-120, respectively, compared to current state-of-the-art methods.
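
As a rough illustration of what "Pareto-aware" selection could mean here, the sketch below keeps only the candidate architectures that no other candidate dominates. The two objectives (accuracy to maximize, cost to minimize) and the function name `pareto_front` are assumptions made for illustration; the abstract does not specify FairNAD's objectives or its mutation procedure.

```python
from typing import List, Tuple

# Hypothetical candidate representation: (accuracy, cost).
# Higher accuracy is better, lower cost is better.
Candidate = Tuple[float, float]


def pareto_front(points: List[Candidate]) -> List[Candidate]:
    """Return the candidates that are not dominated by any other candidate."""
    front = []
    for i, (acc_i, cost_i) in enumerate(points):
        dominated = False
        for j, (acc_j, cost_j) in enumerate(points):
            if j == i:
                continue
            # j dominates i if it is no worse on both objectives
            # and strictly better on at least one.
            no_worse = acc_j >= acc_i and cost_j <= cost_i
            strictly_better = acc_j > acc_i or cost_j < cost_i
            if no_worse and strictly_better:
                dominated = True
                break
        if not dominated:
            front.append((acc_i, cost_i))
    return front
```

A search loop could then sample mutation parents from this front only, which is one plausible reading of "Pareto-aware mutation".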

2605.19242 2026-05-20 cs.CV cs.AI cs.ET cs.LG cs.MM version update

PhyWorld: Physics-Faithful World Model for Video Generation

Pu Zhao, Juyi Lin, Timothy Rupprecht, Arash Akbari, Chence Yang, Rahul Chowdhury, Elaheh Motamedi, Arman Akbari, Yumei He, Chen Wang, Geng Yuan, Weiwei Chen, Yanzhi Wang

Affiliations * Northeastern University; University of Georgia; Tulane University; EmbodyX

AI summary: This paper proposes PhyWorld, which improves the physical faithfulness of a video generation model through two-stage post-training so that it serves as a more effective world simulator for Physical AI systems.

Abstract

World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.
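
For reference, the standard DPO objective from the original language-model formulation is reproduced below. The abstract does not say how PhyWorld adapts it to a flow-matching video model, so this is background and not necessarily the paper's loss. Here $x$ is the conditioning input, $y_w$ and $y_l$ are the physically preferred and dispreferred continuations, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ is a temperature.

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Minimizing this loss raises the likelihood of $y_w$ relative to $y_l$ while keeping $\pi_\theta$ close to $\pi_{\mathrm{ref}}$.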

2605.19230 2026-05-20 cs.CV cs.LG version update

Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation

Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread, Lyle J. Palmer

Affiliations * Australian Institute for Machine Learning; Adelaide University

AI summary: This paper proposes a robust framework that mitigates age-dependent confounding by targeting spurious age-linked trends rather than enforcing invariance. It models sample difficulty and decorrelates age from the dominant age-difficulty trend, reducing age-related disparities in true and false positive rates while preserving clinically meaningful nonlinear age information.

Comments 10 pages, 3 figures

Abstract

Age-dependent performance disparities in medical image classification often arise because age acts as a confounder, linking imaging morphology with disease prevalence. In practice, disparities can manifest as overdiagnosis at ages where disease prevalence is higher and underdiagnosis at ages where prevalence is lower, and can worsen under train-test shifts in the age distribution. Conventional mitigation approaches that enforce strict age invariance may suppress diagnostically meaningful information encoded in age. We therefore propose a robust framework that mitigates the effects of age-dependent confounding by targeting spurious age-linked trends rather than enforcing invariance. Following a warm-up phase, we characterize sample difficulty and model its age-dependent trends in a label-conditioned manner. We decorrelate age from dominant age-difficulty trends using robust, Huber-weighted affinity weights, attenuating confounding-driven shortcuts while preserving clinically meaningful, nonlinear age information. We further introduce an Age Coverage Score that scales the decorrelation penalty by minibatch age variance to ensure stable optimization under limited age diversity. Across two radiology datasets, our approach reduces age-dependent true and false positive disparities with minimal AUC impact and remains robust to increasing train-test age distribution shifts.

2605.19223 2026-05-20 cs.CV version update

HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

Mengqi Shi, Haopeng Zhang

Affiliations * Department of Information and Computer Sciences

AI summary: This paper proposes HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. It addresses the under-evaluation of multimodal large language models on summarizing and reasoning over complex narratives by introducing a fully granular, fully multimodal dataset architecture that serves as a rigorous, standardized testbed.

Abstract

While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.

2605.19218 2026-05-20 cs.CV cs.AI version update

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim

Affiliations * Seoul National University

AI summary: This paper proposes rotation-aligned Key channel pruning, which compresses the channel dimension so that more visual tokens are retained under a fixed KV cache budget. It addresses the degradation of conventional token pruning on fine-grained perception tasks while improving decoding efficiency.

Abstract

Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
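
The reason a rotation can precede channel pruning without changing attention is a short identity, sketched here under generic assumptions. The symbols $q$, $k$ (row-vector query and key for one head, dimension $d$), $R$ (an orthogonal $d \times d$ matrix, for example PCA directions of the keys), and $R_r$ (its first $r$ columns) are notation introduced for this note; the abstract does not give RotateK's exact construction.

$$
(qR)(kR)^{\top} = q\,R R^{\top} k^{\top} = q\,k^{\top}
\qquad \text{since } R R^{\top} = I .
$$

Keeping only $r < d$ rotated channels replaces the identity with the projector $R_r R_r^{\top}$, and the score error is bounded by how much of the key lies outside the retained subspace:

$$
\left| q\,k^{\top} - (qR_r)(kR_r)^{\top} \right|
= \left| q\,(I - R_r R_r^{\top})\,k^{\top} \right|
\le \lVert q \rVert \,\bigl\lVert (I - R_r R_r^{\top})\,k^{\top} \bigr\rVert .
$$

If the keys concentrate in a shared low-dimensional subspace after rotation, the same $r$ channels can be kept for every token in a head, which is what makes a head-wise mask sufficient.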

2605.19214 2026-05-20 cs.LG cs.CV version update

Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification

Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread, Lauren Oakden-Rayner, Robert Vandersluis, Jessica Schrouff, Lyle J. Palmer, Mark Jenkinson

Affiliations * Australian Institute for Machine Learning, Adelaide University; GlaxoSmithKline (GSK)

AI summary: This paper proposes a worst-group equalized-odds regularizer for assessing and mitigating systematic disparities in medical image classification across multiple demographic attributes simultaneously. It penalizes subgroup-level deviations in true and false positive rates at the inference-time operating point, reducing Equalized Odds and Equalized Opportunity disparities with minimal impact on AUC.

Comments 11 pages, 2 figures

Abstract

Diagnostic performance in medical AI varies systematically across demographic groups, yet subgroup AUC can mask clinically important disparities. At a fixed inference-time operating point, some groups may exhibit over-diagnostic behaviour, characterized by elevated true and false positive rates, while others show under-diagnostic patterns with reduced true and false positive rates. These opposing tendencies can cancel in aggregate AUCs while producing meaningful inequities in clinical decision-making. Motivated by the need to assess and mitigate such disparities at the operating point and across multiple demographic attributes simultaneously, we propose a worst-group equalized-odds margin regularizer. The proposed regularizer explicitly targets subgroup-level deviations on both the true positive and false positive sides at inference. At each update, the method identifies subgroups defined by explicit demographic attributes (e.g., age, sex, and race) that exhibit the most extreme margin deviations and applies a unified penalty, enabling fairness optimization across multiple demographic axes without requiring explicit intersectional constraints. Across two medical imaging datasets in realistic multi-label settings, our method consistently reduces disparities in Equalized Odds and Equalized Opportunity with minimal impact on AUC, preserving diagnostic performance while improving fairness.
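
To make the "worst group" idea concrete, the sketch below computes each subgroup's true and false positive rates at a fixed operating point and reports the subgroup that deviates most from the overall rates. This is an evaluation-style diagnostic written for illustration; the paper's regularizer works on margins during training, and its exact form is not given in the abstract. The function names and the max-of-absolute-gaps criterion are assumptions.

```python
from typing import List, Tuple


def rates(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Return (TPR, FPR) for binary labels and thresholded predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def worst_group_gap(
    y_true: List[int], y_pred: List[int], groups: List[str]
) -> Tuple[str, float]:
    """Return the subgroup with the largest TPR or FPR deviation from overall."""
    overall_tpr, overall_fpr = rates(y_true, y_pred)
    worst_name, worst_gap = "", -1.0
    for g in sorted(set(groups)):
        idx = [i for i, name in enumerate(groups) if name == g]
        tpr, fpr = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
        # Assumed criterion: the larger of the two absolute deviations.
        gap = max(abs(tpr - overall_tpr), abs(fpr - overall_fpr))
        if gap > worst_gap:
            worst_name, worst_gap = g, gap
    return worst_name, worst_gap
```

Groups from several attributes (age band, sex, race) can be passed in turn, so the worst group is tracked across axes without enumerating intersections.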

2605.19213 2026-05-20 cs.CV version update

Smartphone-based Circular Plot Sampling for Forest Inventory

Su Sun, Jui-Cheng Chiu, Nabin Khanal, Songlin Fei, Yingjie Victor Chen

Affiliations * School of Applied and Creative Computing, Purdue University; Department of Forestry and Natural Resources, Purdue University

AI summary: This paper proposes a lightweight smartphone-based pipeline that performs complete circular-plot tree measurement from a single walkthrough video without additional specialized hardware. It combines pretrained monocular depth estimation and tree instance segmentation with a SLAM framework to jointly refine camera trajectories and depth, yielding tree positions and DBH estimates with good accuracy and scalability.

Abstract

Circular sample plots are a cornerstone of forest inventory, yet accurate measurement of tree diameter at breast height (DBH) and spatial location within such plots remains challenging. Conventional approaches rely either on costly terrestrial LiDAR systems or labor-intensive manual methods involving calipers and compass bearings, limiting their scalability and accessibility in large-scale environments. We present a lightweight, smartphone-based pipeline that enables complete plot-sampling-based tree measurement from a single walkthrough video, requiring no specialized hardware beyond a consumer smartphone mounted on a portable stand. The proposed method integrates pretrained monocular depth estimation and tree instance segmentation with a simultaneous localization and mapping (SLAM) framework to jointly refine camera trajectories and depth across the video sequence. Tree positions and DBH estimates are recovered by fusing SLAM-derived camera poses with segmented depth maps, with absolute real-world scale anchored via a calibrated reference length. The system was evaluated in both managed forest plots and a natural forest plot, achieving a mean absolute error of 1.51 cm (MARE 3.98%) and 2.30 cm (MARE 5.69%), respectively, with consistent performance across varying starting directions and positions. Cross-video consistency analysis further demonstrated stable and reproducible tree localization across measurements initiated from different starting positions. The proposed approach achieves accuracy comparable to established field methods while substantially reducing equipment cost and operational complexity, making it accessible to both professional researchers and non-expert forest managers in diverse operational settings.
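
The two error figures quoted above can be computed as in the sketch below. It assumes MARE stands for mean absolute relative error expressed in percent, which the abstract does not spell out; the function name and the sample values in the usage note are illustrative, not data from the paper.

```python
from typing import List, Tuple


def mae_and_mare(
    estimated_cm: List[float], reference_cm: List[float]
) -> Tuple[float, float]:
    """Return (MAE in cm, MARE in percent) for paired DBH measurements.

    MARE is assumed to be the mean of |estimate - reference| / reference.
    """
    if len(estimated_cm) != len(reference_cm) or not reference_cm:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(e - r) for e, r in zip(estimated_cm, reference_cm)]
    n = len(abs_errors)
    mae = sum(abs_errors) / n
    mare = 100.0 * sum(err / r for err, r in zip(abs_errors, reference_cm)) / n
    return mae, mare
```

For two hypothetical trees estimated at 31.0 cm and 19.0 cm against caliper references of 30.0 cm and 20.0 cm, the MAE is 1.0 cm while the MARE weights the thinner tree's error more heavily.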

2605.19210 2026-05-20 cs.CV version update

D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation

Shengzhe Chen, Hao Yan

Affiliations * School of Computing and Augmented Intelligence, Arizona State University

AI summary: This paper proposes a unified, threshold-free, differentiable convex shape prior for data-driven image segmentation based on the quasi-concavity of the network's output mask function u. Requiring every super-level set of u to be convex turns the global shape constraint into local differentiable inequalities and improves shape regularity.

Comments Accepted by CVPR 2026

Abstract

Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on the quasi-concavity of the network's output mask function u. Instead of constraining a single binary segmentation, we require all super-level sets of u to be convex, transforming global shape constraints into local, differentiable inequalities on u and its derivatives. From this principle, we derive zero, first, and second-order characterizations, yielding respectively a local midpoint convexification algorithm, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed as a quadratic form on the tangent plane. The first and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing previous shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1-0-1 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors, within a single continuous and differentiable framework.
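
The link between quasi-concavity and convex super-level sets that the abstract relies on is the standard textbook equivalence, restated here for orientation. The domain $\Omega$ (assumed convex) and the threshold $t$ are notation introduced for this note. A function $u : \Omega \to [0, 1]$ is quasi-concave if

$$
u\bigl(\lambda x + (1 - \lambda) y\bigr) \ge \min\{u(x),\, u(y)\}
\qquad \text{for all } x, y \in \Omega,\ \lambda \in [0, 1],
$$

which holds if and only if every super-level set

$$
S_t = \{\, x \in \Omega : u(x) \ge t \,\}
$$

is convex. Thresholding $u$ at any $t$ therefore yields a convex mask, which is why the prior needs no fixed threshold. The zeroth-order characterization mentioned in the abstract presumably builds on the midpoint case $\lambda = \tfrac{1}{2}$, though the abstract does not state its exact form.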

2605.19207 2026-05-20 cs.CV cs.AI cs.LG version update

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

Affiliations * Meta Superintelligence Labs; Stanford University; Meta Reality Labs; The University of Tokyo

AI summary: This study examines how children robustly acquire language grounding from limited visuo-linguistic input and introduces the EgoBabyVLM Challenge to drive models toward grounded language learning from naturalistic data.

Abstract

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

2605.19111 2026-05-20 cs.CV cs.AI version update

FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram

Affiliations * Boston University; Adobe; IBM Research

AI summary: This paper proposes FAGER, a framework for evaluating and improving the factual accuracy of text-to-image models. It builds a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, evaluates images with a VLM, outperforms prior metrics on a factuality test, and can refine T2I outputs without training.

Comments Accepted for oral presentation at the 2nd Workshop on the Evaluation of Generative Foundation Models (EVGENFM2026) at CVPR 2026. 8 pages (1 page of references), 5 figures

Abstract

Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.
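
The Factual A/B test described above reduces to a preference rate, sketched below. The function name, the `(prompt, reference, generated)` tuple layout, and the strict-inequality tie rule are assumptions for illustration; the abstract only says the test measures whether a metric prefers the factual reference image.

```python
from typing import Any, Callable, List, Tuple

# A metric is any callable scoring how well an image matches a prompt.
Metric = Callable[[str, Any], float]
Pair = Tuple[str, Any, Any]  # (prompt, reference_image, generated_image)


def factual_ab_accuracy(metric: Metric, pairs: List[Pair]) -> float:
    """Fraction of pairs where the metric scores the reference strictly higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    wins = sum(
        1
        for prompt, reference, generated in pairs
        if metric(prompt, reference) > metric(prompt, generated)
    )
    return wins / len(pairs)
```

A metric that only checks facts stated explicitly in the prompt would tie on images that differ in an implied fact, and under this rule a tie counts against it.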

2605.19075 2026-05-20 cs.CV cs.AI version update

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan, Akhil Gorugantu, David Doermann

Affiliations * University at Buffalo; New York University

AI summary: This study proposes CRAFT, which combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop that iteratively verifies and repairs claims, yielding accurate evidence aggregation for multimodal video question answering.

Comments Accepted at ACL 2026 Multimodal Augmented Generation via MultimodAl Retrieval Workshop

Abstract

Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.
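
The final citation-merging stage ("emits each fact once with all supporting source identifiers") can be pictured with the sketch below. Matching facts by case- and whitespace-normalized text is a simplifying assumption; the actual pipeline may well group claims by entailment instead, and the function name and source IDs are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def merge_citations(
    claims: Iterable[Tuple[str, str]]
) -> List[Tuple[str, List[str]]]:
    """Collapse (fact, source_id) pairs into one entry per fact.

    Facts are matched on lower-cased, whitespace-normalized text (an
    assumption); the first-seen wording and source order are preserved.
    """
    merged: Dict[str, Tuple[str, List[str]]] = {}
    for fact, source_id in claims:
        key = " ".join(fact.lower().split())
        if key not in merged:
            merged[key] = (fact, [])
        sources = merged[key][1]
        if source_id not in sources:
            sources.append(source_id)
    return list(merged.values())
```

Each merged entry can then be rendered as one sentence followed by all of its source identifiers, which avoids repeating a fact once per video.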

2605.19074 2026-05-20 cs.CV cs.AI version update

Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

Sumit Laha, Ankit Sharma, Hassan Foroosh

Affiliations * Department of Computer Science, University of Central Florida, Orlando, Florida, United States

AI summary: This paper proposes a multi-horizon forecasting framework in which jointly optimizing over several future values helps deep neural networks capture latent inter-step temporal dependencies, improving the accuracy and robustness of photovoltaic power output prediction.

Abstract

The rapid global expansion of solar photovoltaic (PV) capacity, which reached a record 597 GW in 2024, highlights the urgent need for robust forecasting models to mitigate the grid instability caused by the intermittent nature of solar irradiance. While deep learning-based direct forecasting using ground-based sky images (GSI) has emerged as a dominant approach, existing literature is often constrained by single-architecture evaluations and an exclusive focus on single-horizon (point) prediction. This paper proposes a transition from traditional single-horizon estimation toward a multi-horizon forecasting framework, leading to an architecture-independent improvement in accuracy. We hypothesize and demonstrate experimentally that joint optimization over a sequence of future values allows deep neural networks to better capture latent inter-step temporal dependencies by avoiding premature convergence of the network in terms of both weight gradients and filter diversity. Leveraging this architecture-independent improvement that integrates sequential sky imagery with historical PV generation data, we evaluate the models' abilities to predict power output across multiple discrete future time steps simultaneously. Our methodology is validated through a comparative analysis across diverse deep learning architectures. The results demonstrate that this multi-horizon approach significantly enhances predictive accuracy and robustness across the entire forecast horizon while maintaining computational parsimony. By achieving superior performance with negligible overhead compared to single-horizon models, this work provides a scalable and efficient solution to improve the resilience of modern power grids.
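
The difference between single-horizon and multi-horizon training comes down to the loss, as the sketch below shows. A weighted mean squared error over horizons is assumed here for illustration; the abstract does not state the paper's loss or any per-horizon weighting, and both function names are hypothetical.

```python
from typing import List, Optional


def single_horizon_loss(pred: float, target: float) -> float:
    """Squared error for one future time step (point prediction)."""
    return (pred - target) ** 2


def multi_horizon_loss(
    preds: List[float],
    targets: List[float],
    weights: Optional[List[float]] = None,
) -> float:
    """Weighted mean squared error over a sequence of future time steps."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal length")
    if weights is None:
        weights = [1.0] * len(preds)  # assumed default: equal weight per horizon
    total_weight = sum(weights)
    weighted = sum(w * (p - t) ** 2 for w, p, t in zip(weights, preds, targets))
    return weighted / total_weight
```

Because every horizon contributes to one scalar, a single backward pass carries gradient from all future steps into the shared layers, which is the joint optimization the abstract refers to. The only extra cost is a wider output layer.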

2605.19060 2026-05-20 cs.CV cs.AI eess.IV version update

LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

Xinhe Zhang, Yuyang Zhang, Pengfei Jin, Arnau Marin-Llobet, Na Li, Quanzheng Li

Affiliations * School of Engineering and Applied Sciences, Harvard University; Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School; Kempner Institute, Harvard University

AI summary: This paper proposes LiFT, a framework that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. It targets two problems in high-resolution 3D medical image generation: the high computational cost of volumetric models and the failure of 2D slice generators to preserve anatomical consistency along the third dimension.

Abstract

High-resolution 3D medical image generation remains challenging because fully volumetric models are computationally expensive, while efficient 2D slice generators often fail to preserve anatomical consistency across the third dimension. We propose LiFT, a framework for Lifted inter-slice Feature Trajectories that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. Rather than modeling the volumetric distribution end-to-end, LiFT treats a volume as an ordered trajectory in feature space, capturing how anatomical structures appear, transform, and disappear across depth. A tri-planar drifting loss aligns the trajectory of generated slices with the trajectories of real volumes, enabling distributional learning over inter-slice progressions in unconditional generation; in paired translation, a bidirectional $z$-context mixer trained against the registered target supplies through-plane coherence while preserving per-slice fidelity. We evaluate LiFT on BraTS 2023 (unconditional and missing-modality MR) and SynthRAD2023 (MR-to-CT). Across these settings, LiFT preserves per-slice quality, approaches the reported cWDM missing-MR reconstruction quality at $\sim 135\times$ lower inference cost (without formal equivalence testing), and improves through-plane coherence on MR-to-CT relative to a no-mapper ablation, demonstrating that lightweight inter-slice trajectory learning is a viable route to high-resolution 3D medical synthesis.

2605.19033 2026-05-20 cs.RO cs.AI cs.CV cs.LG cs.MA (updated version)

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee

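Affiliations: University of Alberta; Huawei Technologies Canada; York University; Canada CIFAR AI Chair, Amii

AI summary: This paper proposes RLFTSim, a framework that improves the realism of traffic simulation scenarios through reinforcement learning fine-tuning and distills controllability into the simulator through goal conditioning. Experiments show that it outperforms other heuristic search-based methods in both realism and controllability.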

Comments CVPR 2026 Highlight; Project page at https://ehsan-ami.github.io/rlftsim

Abstract

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.

2605.19032 2026-05-20 cs.CV (updated version)

Personalized Face Privacy Protection From a Single Image

Zachary Yahn, Fatih Ilhan, Tiansheng Huang, Selim Tekin, Sihao Hu, Yichang Xu, Margaret Loper, Ling Liu

Affiliations: Georgia Institute of Technology

AI summary: This paper proposes FaceCloak, a system that generates a personalized face privacy mask from a single image to prevent facial recognition. Experiments on multiple datasets show that it outperforms other methods.

Abstract

Photos of faces uploaded online are vulnerable to malicious actors who can scrape facial images from online sources and intrude on personal privacy via unauthorized use of facial recognition models. This paper presents FaceCloak, a novel personalized face privacy protection system, which can generate defensive identity-specific universal face privacy masks from a single image of a user, causing facial recognition to fail. FaceCloak introduces a three-stage personalized face perturbation learning methodology: (1) It generates a small set of high-variety synthetic face images of a person based on a single image of the person. (2) It learns face cloaking by adding more protection to key facial-identity leakage regions through iterative perturbation generation over the small set of synthetic images, effectively shifting a user's identity embedding towards a distant anchor identity and away from a similar one. (3) It generates a personalized identity-protective mask in the form of pixel-wise cloaking, which is light-weight and can be efficiently applied to any facial image of a user while maintaining good perceptual quality. Extensive experiments on three popular face datasets across ten recognition models show the effectiveness of FaceCloak compared to 29 other existing representative methods. Code is available at https://github.com/zacharyyahn/FaceCloak
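
The embedding shift described in stage (2) can be illustrated with a toy example. This is a minimal sketch and not FaceCloak's actual loss: it assumes identity embeddings are compared by cosine similarity, and `cloak_objective` is a hypothetical name for an objective that decreases as the cloaked embedding moves toward the anchor identity and away from the similar one.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cloak_objective(cloaked, anchor, similar):
    """Hypothetical objective (an assumption, not the paper's loss).

    Lower values mean the cloaked embedding is closer to the distant
    anchor identity and farther from the similar identity.
    """
    return cosine_similarity(cloaked, similar) - cosine_similarity(cloaked, anchor)
```

In an iterative perturbation scheme, such an objective would be minimized with respect to the pixel perturbation; the optimizer, the embedding model, and the region weighting are left out here because the abstract does not specify them.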

2605.18464 2026-05-20 cs.CV (updated version)

PERL: Parameter Efficient Reasoning in CLIP Latent Space

Simone Carnemolla, Salvatore Calcagno, Daniela Giordano, Concetto Spampinato, Matteo Pennisi

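Affiliations: University of Catania

AI summary: This paper proposes PERL, a framework for parameter-efficient adaptation through iterative latent reasoning in CLIP's latent space. It achieves the best parameter-performance trade-off across multiple benchmarks, reaching strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters.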

Comments Submitted to NeurIPS 2026

Abstract

Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP's pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.

2605.18445 2026-05-20 cs.CV cs.AI cs.CL cs.LG (updated version)

What's Holding Back Latent Visual Reasoning?

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

Affiliations: Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect; Carnegie Mellon University

AI summary: This study examines how existing models use latent tokens and finds that they play only a limited role in the final prediction. The main problems are that latent tokens in the training data carry limited information and that latent tokens generated at inference time deviate from the oracle representations. Progress requires higher-quality data and more precise latent token prediction.

Abstract

Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning. First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypass them at inference time. When models are fine-tuned on a diagnostic dataset in which latent tokens provide sufficient support for the final prediction, we show that they can causally rely on these tokens. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.

2605.18431 2026-05-20 cs.CV (updated version)

Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

Kunyu Peng, Zhikun Zhou, Kailun Yang, Di Wen, Ruiping Liu, Yufan Chen, Junwei Zheng, Hao Shi, Yi Zhou, M. Saquib Sarfraz, Danda Pani Paudel, Luc Van Gool

Affiliations: Karlsruhe Institute of Technology; Hunan University; University of Oxford; Zhejiang University; ETH Zurich; Ant Group

AI summary: This paper studies multi-robot cooperative dynamic spatial reasoning. It introduces CoopSR, the first benchmark for the task, and EgoTeam, a multi-robot egocentric question-answering dataset, and proposes the SP-CoR framework for fine-grained cooperative spatial reasoning, which substantially improves multi-robot cooperative reasoning performance.

Abstract

Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, enabling the model to benefit from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Across 22 MLLM baselines, SP-CoR consistently improves cooperative reasoning, outperforming the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson. It also shows stronger generalization to unseen team sizes and real-world robot tests. Code can be found at https://github.com/KPeng9510/seeing-together.git.

2605.18413 2026-05-20 cs.CV (updated version)

EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution

Hanli Zhao, Binhao Wang, Shihao Zhao, Tao Wang, Kaihao Zhang, Wanglong Lu

Affiliations: College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325000, China; vivo BlueImage Lab, vivo Mobile Communication Co., Ltd, Shanghai 200100, China; College of Engineering and Computer Science, Australian National University, Canberra, Australia; The AI/Analytics Team, Nasdaq, St. John's, Canada

AI summary: This paper proposes EchoSR, a framework that unifies multi-scale receptive field modeling and hierarchical context fusion to improve the efficiency and quality of lightweight image super-resolution. It outperforms existing methods on multiple benchmarks while running about twice as fast.

Comments Accepted by Information Fusion; 20 pages, 17 figures

DOI: 10.1016/j.inffus.2026.104471

Abstract

Image super-resolution (SR) aims to reconstruct high-quality, high-resolution (HR) images from low-resolution (LR) inputs and plays a critical role in various downstream applications. Despite recent advancements, balancing reconstruction fidelity and computational efficiency remains a fundamental challenge, particularly in resource-constrained scenarios. While existing lightweight methods attempt to expand receptive fields, many of them either incur substantial computational overhead, naively scale up kernel sizes, or lack mechanisms for coherent multi-scale integration, limiting their overall effectiveness and scalability. To address these limitations, we propose EchoSR, an efficient context-harnessing framework for lightweight image super-resolution, which unifies multi-scale receptive field modeling and hierarchical context fusion. EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages through an efficient context-harnessing strategy, and further promotes seamless cross-scale integration via a cross-scale overlapping fusion mechanism. Extensive experiments have shown that EchoSR consistently outperforms state-of-the-art lightweight super-resolution methods across multiple benchmarks, while also achieving a faster speed $(\sim 2\times)$. The source code is available at https://github.com/funnyWang-Echoes/EchoSR.

2605.16736 2026-05-20 cs.CV (updated version)

CAB: Accelerating Flow and Diffusion Sampling via Rectification and Corrected Adams-Bashforth

Anuska Roy, Pravin Nair

Affiliations: Department of Electrical Engineering

AI summary: This paper proposes CAB, a training-free sampler that transforms the sampling dynamics into a common rectified coordinate system and applies a multistep Adams-Bashforth predictor with a simple correction term based on past velocity evaluations, accelerating flow and diffusion models without additional function evaluations.

Abstract

Flow and diffusion models achieve high-fidelity, high-resolution image synthesis, but often require many function evaluations (NFEs) at sampling time. Existing acceleration methods either require additional training through distillation or rely on training-free high-order solvers, and both can degrade sample quality at low NFE budgets. We propose CAB (Corrected Adams-Bashforth), a training-free sampler that accelerates both flow and diffusion models. CAB first transforms the sampling dynamics to a common rectified coordinate system, and then applies a multistep Adams-Bashforth predictor augmented with a simple correction term based on past velocity evaluations and therefore incurs no additional NFEs. The resulting method is simple, has the same algorithmic form across model classes, and has at least third-order local truncation error and second-order global error. Experiments on pretrained flow and diffusion models, including class-conditional and large-scale text-to-image benchmarks, show that CAB improves quality-NFE trade-offs in the low-step regime of 6-20 NFEs. It also remains competitive with strong training-free samplers at higher step counts across most tested models. The official implementation is available at https://github.com/Anuska-Roy/CAB.
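
For readers unfamiliar with multistep predictors, the sketch below shows the classical two-step Adams-Bashforth rule on a generic ODE. It is not CAB itself: the rectified coordinate change and the correction term are not specified in the abstract and are omitted. The function names and the test ODE are illustrative assumptions. The point it demonstrates is that each step reuses the previous velocity, so only one new function evaluation is spent per step.

```python
import math


def euler_integrate(velocity, x0, t0, t1, num_steps):
    """Baseline: explicit Euler, one velocity evaluation per step."""
    h = (t1 - t0) / num_steps
    x, t = x0, t0
    for _ in range(num_steps):
        x = x + h * velocity(x, t)
        t += h
    return x


def ab2_integrate(velocity, x0, t0, t1, num_steps):
    """Classical 2-step Adams-Bashforth, also one evaluation per step."""
    h = (t1 - t0) / num_steps
    x, t = x0, t0
    v_prev = None
    for _ in range(num_steps):
        v = velocity(x, t)  # the only new function evaluation in this step
        if v_prev is None:
            x = x + h * v  # no history yet: bootstrap with one Euler step
        else:
            x = x + h * (1.5 * v - 0.5 * v_prev)  # reuse the cached velocity
        v_prev = v
        t += h
    return x


def decay(x, t):
    """Toy velocity field dx/dt = -x, whose exact solution is x0 * exp(-t)."""
    return -x
```

With 10 steps on the toy problem, `ab2_integrate(decay, 1.0, 0.0, 1.0, 10)` lands closer to `math.exp(-1.0)` than `euler_integrate` does at the same number of evaluations, which is the kind of quality-per-NFE gain that multistep samplers aim for.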

2605.16353 2026-05-20 cs.CV cs.AI (updated version)

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

Chang Che, Ziqi Wang, Hui Ma, Cheems Wang, Zenglin Shi

Affiliations: Hefei University of Technology; Tsinghua University

AI summary: This paper proposes StrLoRA, a method for streaming continual visual instruction tuning that addresses continual learning over dynamic task streams, using a task-aware expert routing framework to improve model performance on continuously changing data streams.

Abstract

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing the model's abilities from continuously evolving data streams. The code is available at https://github.com/chanceche/StrCVIT.
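
The routing-stability idea can be sketched in a few lines. This is an illustration under stated assumptions, not StrLoRA's implementation: the abstract does not say which divergence is used or how the reference is updated, so the KL divergence, the decay value of 0.9, and all function names below are hypothetical choices.

```python
import math


def ema_update(reference, current, decay=0.9):
    """Exponential moving average of routing distributions over experts."""
    return [decay * r + (1.0 - decay) * c for r, c in zip(reference, current)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def routing_stability_penalty(current, reference):
    """Assumed regularizer: penalize drift of the current routing
    distribution away from the historical EMA reference."""
    return kl_divergence(current, reference)
```

In a training loop of this shape, the penalty would be added to the task loss for each data chunk, and the reference would then be refreshed with `ema_update` so that it tracks routing behavior slowly across the stream.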

2605.15497 2026-05-20 cs.CV cs.GR (updated version)

AnyAct: Towards Human Reenactment of Character Motion From Video

Liuhan Chen, Lei Zhong, Jiewei Wang, Qin Shuai, Li Yuan, Leidong Fan, Qing Li, Kanglin Liu

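Affiliations: Peking University; Nankai University; The University of Hong Kong; Zhejiang University; Pengcheng Laboratory

AI summary: This paper studies how to derive an initial human reenactment directly from monocular video, with the goal of reinterpreting a non-human character's motion as an editable human performance for downstream animation authoring. The core idea is that sparse local articulated motion cues preserve essential dynamics across large structural differences. The proposed AnyAct model performs conditional human motion generation from transferable sparse local 2D articulated motion.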

Comments 12 pages

Abstract

We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.

2605.15186 2026-05-20 cs.CV cs.AI (updated version)

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

Kaixin Zhu, Yiwen Tang, Yifan Yang, Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang

Affiliations: Peking University; Tencent; The Chinese University of Hong Kong; Shanghai AI Lab; NTU Singapore; Zhongguancun Academy; Beijing Key Lab of Data Intel. & Security (PKU)

AI summary: This paper proposes VGGT-Edit, a text-conditioned feed-forward framework for native 3D scene editing that introduces depth-synchronized text injection and a residual transformation head to achieve high-quality 3D scene editing, and constructs the DeltaScene dataset to improve editing quality and inference speed.

Abstract

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed. The project page is https://chriszkxxx.github.io/VGGT-Edit/.

2605.14530 2026-05-20 cs.CV (updated version)

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Sujung Hong, Chanyong Yoon, Seong Jae Hwang

Affiliations: Department of Artificial Intelligence, Yonsei University, Seoul, Republic of Korea

AI summary: This paper studies repetitive generation and degraded visual grounding in large diffusion vision-language models during long-form generation, and proposes a training-free solution that mitigates mask prior drift and positional attention collapse.

Abstract

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

2605.10525 2026-05-20 cs.CV (updated version)

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

Yuecheng Liu, Junda Cheng, Longliang Liu, Wenjing Liao, Hanrui Cheng, Yuzhou Wang, Xin Yang

Affiliations: Huazhong University of Science & Technology; Optics Valley Laboratory

AI summary: This paper proposes GemDepth, a framework that introduces a geometry-embedding module and an alternating spatio-temporal transformer to address spatial blurring and temporal inconsistency in video depth estimation, achieving high accuracy and robust 3D consistency.

Abstract

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth.

2605.08830 2026-05-20 cs.CV cs.AI cs.RO (updated version)

VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

Rui Zhao, Jianlin Yu, Zhenhai Gao, Jiaqiao Liu, Fei Gao

Affiliations: College of Automotive Engineering, Jilin University; The National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University; ReeFocus AI Technology

AI summary: This paper proposes VECTOR-DRIVE, a framework that uses tightly coupled vision-language and trajectory expert routing to resolve the coupling trade-off between vision-language understanding and trajectory prediction in end-to-end autonomous driving, achieving higher task performance.

Abstract

End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves 88.91 Driving Score and outperforms representative end-to-end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.
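
The routing rule described above (shared attention for all tokens, then a feed-forward expert chosen by token type) can be sketched as follows. This is a toy illustration, not the VECTOR-DRIVE code: the type labels, the `route_ffn` function, and the use of plain callables as experts are assumptions, and the shared self-attention step is omitted.

```python
# Token-type groups taken from the abstract; the string labels are assumed.
VL_TYPES = {"vision", "language"}
TRAJ_TYPES = {"target_point", "ego_state", "noisy_action"}


def route_ffn(tokens, token_types, vl_expert, traj_expert):
    """Apply one of two feed-forward experts to each token by its type.

    tokens: list of feature vectors (lists of floats), assumed to have
            already passed through a shared self-attention layer.
    token_types: list of type labels, one per token.
    """
    outputs = []
    for token, kind in zip(tokens, token_types):
        if kind in VL_TYPES:
            outputs.append(vl_expert(token))
        elif kind in TRAJ_TYPES:
            outputs.append(traj_expert(token))
        else:
            raise ValueError(f"unknown token type: {kind}")
    return outputs


def double(vec):
    """Stand-in for a vision-language expert."""
    return [2.0 * x for x in vec]


def shift(vec):
    """Stand-in for a trajectory expert."""
    return [x + 1.0 for x in vec]
```

Because the routing depends only on a fixed token type rather than a learned gate, every token has a deterministic expert assignment, which is one simple way to keep semantic and motion-specific computation separate inside a single Transformer block.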

2605.07379 2026-05-20 cs.CV cs.AI 版本更新

RELO: Reinforcement Learning to Localize for Visual Object Tracking

RELO：用于视觉目标跟踪的强化学习定位

Xin Chen, Chuanyu Sun, Jiao Xu, Houwen Peng, Dong Wang, Huchuan Lu, Kede Ma

Affiliations: City University of Hong Kong; Hunyuan Team, Tencent; Dalian University of Technology

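AI summary: This paper proposes RELO, which models target localization as a Markov decision process and uses reinforcement learning in place of conventional handcrafted spatial priors to improve tracking performance and consistency.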

Comments ICML 2026 paper

详情

AI中文摘要

传统视觉目标跟踪方法通常使用手工设计的空间先验（如热图）来定位目标，但这些先验只能提供替代监督，并且与跟踪优化和评估指标（如交并比IoU和成功曲线下的面积AUC）不匹配。本文引入RELO，一种用于视觉目标跟踪的强化学习定位方法，将目标定位建模为马尔可夫决策过程。具体而言，RELO用强化学习学习的空间位置策略替代手工设计的空间先验，奖励结合帧级IoU和序列级AUC。此外，我们还引入层对齐的时间令牌传播以提高帧间语义一致性，计算开销极低。在多个基准测试中，RELO取得了优异的性能，无需模板更新，在LaSOText上达到了57.5%的AUC。这证实了基于奖励的定位为视觉目标跟踪提供了一种有效的替代方法。

英文摘要

Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOText without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.
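The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes boxes in `(x1, y1, x2, y2)` format and the common success-plot convention of averaging success rates over evenly spaced overlap thresholds. The `sequence_reward` function and its `auc_weight` parameter are hypothetical illustrations of "combining frame-level IoU and sequence-level AUC"; the abstract does not give the actual reward formula.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def success_auc(ious, num_thresholds=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1].

    A frame counts as a success at threshold t when its IoU exceeds t.
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)


def sequence_reward(pred_boxes, gt_boxes, auc_weight=0.5):
    """Hypothetical reward: each frame's IoU plus a weighted sequence-level AUC."""
    ious = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    auc = success_auc(ious)
    return [v + auc_weight * auc for v in ious], auc
```

Because the AUC term is shared by every frame in a sequence, a reward of this shape credits each per-frame decision with the quality of the whole track, which is one plausible reading of why a sequence-level term would be added to a frame-level one.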

2605.06270 2026-05-20 cs.CV 版本更新

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Spark3R: 非对称令牌缩减使快速前馈3D重建

Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu, Jiaqi Zhang, Siwei Ma, Jian Zhang

Affiliations: School of Electronic and Computer Engineering, Peking University; School of Computer Science, Peking University

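AI summary: This paper proposes Spark3R, a framework that uses asymmetric token reduction to accelerate feed-forward 3D reconstruction models without retraining, achieving up to 28× speedup while maintaining high-quality reconstruction.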

详情

AI中文摘要

基于视觉Transformer的前馈3D重建模型可以直接从少量输入图像估计场景几何和相机姿态，但将其扩展到具有数百或数千帧的视频输入仍然具有挑战性，因为全局注意力层的二次成本。最近的令牌合并方法通过在全局注意力层内压缩令牌序列来加速这些模型，但它们对查询令牌和键值令牌应用均匀的缩减，忽略了它们在3D重建中功能不同的角色。在本文中，我们识别出前馈3D重建模型的一个关键属性：查询令牌编码视图特定的几何请求并且对压缩敏感，而键值令牌代表共享的场景上下文并且可以容忍剧烈压缩。受这一见解的启发，我们提出了Spark3R，一个无需训练的加速框架，通过为查询令牌和键值令牌分配不同的缩减因子来解耦压缩，对查询令牌应用组内令牌合并，对键值令牌应用轻量级令牌剪枝。此外，Spark3R在不同层之间自适应调整键值缩减因子，进一步改进质量-效率权衡。作为一种即插即用的框架，无需重新训练，Spark3R直接集成到多个预训练的前馈3D重建模型中，包括VGGT、π³、Depth-Anything-3和VGGT-Ω，并在1000帧输入上实现了高达28倍的速度提升，同时保持有竞争力的重建质量。

英文摘要

Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $π^3$, Depth-Anything-3, and VGGT-$Ω$, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
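The asymmetry described above can be sketched in a few lines of plain Python. This is a toy illustration, not the Spark3R implementation: averaging consecutive queries stands in for intra-group token merging, a fixed stride stands in for the paper's pruning criterion, and all function names are hypothetical. The step that restores one output per original token after merging is omitted.

```python
import math


def merge_queries(queries, group):
    """Average each run of `group` consecutive query vectors into one."""
    merged = []
    for start in range(0, len(queries), group):
        chunk = queries[start:start + group]
        merged.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return merged


def prune_keys(keys, values, keep_every):
    """Keep every `keep_every`-th key-value pair (stand-in for a pruning score)."""
    return keys[::keep_every], values[::keep_every]


def attention(queries, keys, values):
    """Plain softmax attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


def score_matrix_size(n_tokens, q_factor, kv_factor):
    """Number of query-key scores after reducing each side by its own factor."""
    return math.ceil(n_tokens / q_factor) * math.ceil(n_tokens / kv_factor)
```

With 1,000 tokens, a mild query factor of 2 and an aggressive key-value factor of 8 shrink the score matrix from 1,000,000 entries to 62,500, the same saving as a uniform factor of 4 on both sides, while compressing the sensitive query side only half as much.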

2605.02223 2026-05-20 cs.SD cs.CV 版本更新

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

迈向细粒度语音修补取证：一个多区域篡改定位的数据集、方法和度量标准

Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran

Affiliations: Posts and Telecommunications Institute of Technology

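AI summary: This paper introduces the MIST dataset, the ISA method, and the SF1@tau metric for multi-region speech inpainting detection, and shows that existing deepfake detectors fall short on fine-grained speech inpainting.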

详情

DOI: 10.1109/ACCESS.2026.3694045

AI中文摘要

近年来，语音克隆和文本到语音合成技术的进步使部分语音操纵——即攻击者在语音中替换几个词以改变其含义同时保持说话者身份——成为一种日益现实的威胁。现有音频深度伪造检测基准主要集中在句级二元分类或单区域篡改，无法检测和定位未知数量的多区域修补内容。我们通过三个贡献填补这一空白：首先，我们引入MIST（多区域修补语音篡改），一个覆盖6种语言、每句包含1-3个独立修补词级段的大型多语言数据集，通过LLM引导的语义替换和神经语音克隆生成，其中虚假内容仅占每句的2-7%。其次，我们提出了ISA（迭代段分析），一种与backbone无关的框架，通过粗到细的滑动窗口分类，结合容差区域提议和边界细化，无需先验知识即可恢复所有篡改区域。第三，我们定义了SF1@tau，一个基于时间IoU匹配的段级F1度量标准，联合评估区域计数准确性和定位精度。零样本评估显示，细粒度语音修补仍无法被现有深度伪造检测器解决：句级分类器在完全合成语音上对MIST句的伪造概率接近零，而ISA在这一具有挑战性的设置中始终优于非迭代基线，且数据集、代码和评估工具包已公开发布。

英文摘要

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
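A segment-level F1 based on temporal IoU matching can be sketched as follows. The abstract does not specify the matching procedure, so the greedy one-to-one matching by descending IoU used here is an assumption, and the function names are hypothetical. Segments are `(start, end)` pairs in seconds.

```python
def temporal_iou(a, b):
    """IoU of two time intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred, truth, tau=0.5):
    """Segment-level F1 at IoU threshold tau.

    A predicted segment is a true positive when it is matched one-to-one
    with a ground-truth segment at temporal IoU >= tau. Pairs are matched
    greedily in order of decreasing IoU (an assumed matching rule).
    """
    pairs = sorted(
        ((temporal_iou(p, t), i, j)
         for i, p in enumerate(pred)
         for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth = set(), set()
    for score, i, j in pairs:
        if score < tau:
            break  # remaining pairs overlap even less
        if i not in used_pred and j not in used_truth:
            used_pred.add(i)
            used_truth.add(j)
    true_pos = len(used_pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)
```

A metric of this form penalizes both kinds of count error: extra predicted segments lower precision, missed tampered segments lower recall, and a segment in roughly the right place but with loose boundaries fails the IoU threshold entirely.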

2605.00578 2026-05-20 cs.CV 版本更新

Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration

通过高斯混合特征对齐和课程整合实现全切片图像的联邦蒸馏

Luru Jing, Cong Cong, Yanyuan Chen, Yongzhi Cao

Affiliations: School of Computer Science, Peking University, Beijing, China; Center for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Sydney, NSW 2113, Australia; School of Data Science, University of Virginia, Charlottesville, VA, USA

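AI summary: This paper proposes FedHD, a federated learning framework for whole slide image analysis that combines Gaussian-mixture feature alignment with a curriculum-based integration strategy. Locally distilled, semantically rich synthetic feature representations improve model performance while preserving diagnostic diversity.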

Comments Accepted by ICML 2026, Camera-Ready version updated

详情

AI中文摘要

联邦学习（FL）提供了一个有前景的框架，用于通过跨机构进行模型训练来实现协作数字病理学。然而，现实部署面临异质性问题，源于不同机构中多样化的多实例学习（MIL）架构和异构特征提取器。我们提出FedHD，一种新的FL框架，通过针对WSI分析进行本地高斯混合特征对齐。不同于交换模型参数，每个客户端独立地蒸馏语义丰富的合成特征表示，这些表示与真实WSI的分布对齐。为保持诊断多样性，FedHD采用一对一蒸馏策略，为每个真实切片生成一个合成对应物，以避免过度压缩。在联邦过程中，采用基于课程的整合策略，一旦性能达到平台期，逐步将跨站点的合成特征整合到本地训练中。此外，一个可选的解释模块从合成嵌入中重建伪块，提高透明度。FedHD是架构无关的、隐私保护的，并支持在不同机构之间进行个性化但协作的训练。在TCGA-IDH、CAMELYON16和CAMELYON17上的实验表明，FedHD在联邦和蒸馏基线中表现一致优于最先进的方法。

英文摘要

Federated learning (FL) offers a promising framework for collaborative digital pathology by enabling model training across institutions. However, real-world deployments face heterogeneity arising from diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors across institutions. We propose FedHD, a novel FL framework that performs local Gaussian-mixture feature alignment tailored for WSI analysis. Instead of exchanging model parameters, each client independently distills semantically rich synthetic feature representations aligned with the distribution of real WSIs. To preserve diagnostic diversity, FedHD adopts a one-to-one distillation strategy, generating a synthetic counterpart for each real slide to avoid over-compression. During federation, a curriculum-based integration strategy progressively incorporates cross-site synthetic features into local training once performance plateaus. Furthermore, an optional interpretation module reconstructs pseudo-patches from synthetic embeddings, enhancing transparency. FedHD is architecture-agnostic, privacy-preserving, and supports personalized yet collaborative training across diverse institutions. Experiments on TCGA-IDH, CAMELYON16, and CAMELYON17 show that FedHD consistently outperforms state-of-the-art federated and distillation baselines.

2604.25646 2026-05-20 cs.CV cs.RO 版本更新

SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

SAMe：一种用于机器人超声的语义解剖映射引擎

Jing Zhang, Duojie Chen, Wentao Jiang, Zihan Lou, Jianxin Liu, Xinwu Cui, Qinghong Zhao, Bo Du, Christoph F. Dietrich, Dacheng Tao

Affiliations: School of Computer Science, Wuhan University; Hubei Center for Applied Mathematics, Wuhan University; Department of Ultrasound, The Central Hospital of Wuhan; Department of Medical Ultrasound, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology; Department of Ultrasound in Medicine, Renmin Hospital of Wuhan University; University Hospital, Johann-Wolfgang-Goethe University Frankfurt am Main; College of Computing and Data Science, Nanyang Technological University

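AI summary: This study presents SAMe, a semantic anatomy mapping engine that addresses scan initialization for robotic ultrasound by providing an explicit anatomical prior layer. It grounds clinical complaints into anatomical targets and produces control-facing probe initialization states, improving the accuracy and efficiency of automated scanning.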

Comments Supplementary information included. Code will be released at https://github.com/MiliLab/Echo-SAMe

详情

AI中文摘要

机器人超声已经实现了局部图像驱动控制、接触调节和视图优化，但当前系统缺乏必要的解剖学理解，无法确定应扫描什么、从哪里开始以及如何适应个体患者解剖结构。这些差距使得系统仍依赖专家干预来启动扫描。本文提出SAMe，一种语义解剖映射引擎，为机器人超声提供显式的解剖先验层。SAMe将扫描初始化视为目标到解剖到动作的过程：它将不明确的临床症状转化为结构化的目标器官，从单张外部身体图像中为这些目标生成患者特定的解剖表示，并将这种表示转换为面向控制的6自由度探头初始化状态，无需使用术前CT或MRI进行额外的配准。SAMe维护的解剖表示是显式的、轻量的（单器官推断在0.08秒内完成），并且设计上与下游控制兼容。在语义接地、解剖生成和真实机器人评估中，SAMe在完整的初始化流程中表现出色。在真实机器人实验中，基于质心的SAMe初始化在单目标设置下，对于肝脏（86.7% vs 46.7%）和肾脏（80.0% vs 73.3%）初始化均优于基于身体关键点的启发式基线。此外，当多个候选目标可用时，试验级别的器官命中率达到了肝脏97.3%和肾脏83.3%。这些结果建立了一个显式的解剖先验层，解决了扫描初始化问题，并为更广泛的下游自主扫描流程提供了解剖基础，为基于症状驱动和解剖信息的机器人超声提供了基础。

英文摘要

Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps leave systems reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, instantiates a patient-specific anatomical representation for the grounded targets from a single external body image, and translates this representation into control-facing 6-DoF probe initialization states without any additional registration using preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference in 0.08s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe shows strong performance across the full initialization pipeline. In real-robot experiments, centroid-based SAMe initialization outperformed the body-keypoint-based heuristic baseline under a budget-matched single-target setting for both liver (86.7% versus 46.7%) and kidney (80.0% versus 73.3%) initialization. Furthermore, the trial-level organ-hit rate reached 97.3% for liver and 83.3% for kidney when multiple candidate targets were available. These results establish an explicit anatomical prior layer that addresses scan initialization and is designed to support broader downstream autonomous scanning pipelines, providing the anatomical foundation for complaint-driven, anatomically informed robotic ultrasonography.

2604.18225 2026-05-20 cs.CV cs.AI 版本更新

Is SAM3 ready for pathology segmentation?

SAM3是否准备好进行病理分割？

Qiuyu Kong, Shakiba Sharifi, Yiming Wang, Marco Cristani, Zanxi Ruan

Affiliations: Sapienza University of Rome; University of Verona; Fondazione Bruno Kessler

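AI summary: This paper evaluates SAM3 on pathology image segmentation and finds that text prompts have limited effect, that performance depends strongly on visual prompt type and budget, that few-shot learning helps but robustness is lacking, and that a significant gap remains between prompt-based usage and task-trained adapter-based methods.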

Comments Accepted to ICIP 2026

详情

英文摘要

Is Segment Anything Model 3 (SAM3) capable of segmenting Any Pathology Images? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. With this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings, including zero-shot, few-shot, and supervised, with varying prompting strategies. Our extensive evaluation on pathological datasets including NuInsSeg, PanNuke and GlaS reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.

2604.16503 2026-05-20 cs.CV cs.AI 版本更新

Motif-Video 2B: Technical Report

Motif-Video 2B：技术报告

Junghwan Lim, Wai Ting Cheung, Minsu Ha, Beomgyu Kim, Taewhan Kim, Haesol Lee, Dongpin Oh, Jeesoo Lee, Taehyun Kim, Minjae Kim, Sungmin Lee, Hyeyeon Cho, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Dongseok Kim, Jangwoong Kim, Youngrok Kim, Hyukjin Kweon, Hongjoo Lee, Jeongdoo Lee, Junhyeok Lee, Eunhwan Park, Yeongjae Park, Bokki Ryu, Dongjoo Weon

Affiliations: Motif Technologies

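AI summary: This study asks whether a high-quality text-to-video model can be trained on a limited budget. It improves performance through architectural design rather than scale alone, combining Shared Cross-Attention with a three-part backbone to achieve high-quality video generation with fewer parameters and less data.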

详情

AI中文摘要

训练强大的视频生成模型通常需要大规模数据集、大量参数和大量计算资源。在本工作中，我们探讨在更小的预算下（少于1000万片段和少于10万H200 GPU小时）是否能够实现高质量的文本到视频生成。我们的核心观点是，模型容量的组织方式，而不仅仅是其规模，是关键因素。在视频生成中，提示对齐、时间一致性以及细节恢复在通过相同路径处理时可能会相互干扰。Motif-Video 2B通过在架构上分离这些角色，而不是仅依赖规模来解决这一问题。该模型结合了两个关键思想：首先，共享交叉注意力在视频令牌序列变长时增强了文本控制；其次，三部分主干网络分离了早期融合、联合表征学习和细节细化。为了使这种设计在有限计算预算下有效，我们将其与基于动态令牌路由和早期阶段特征对齐到冻结预训练视频编码器的高效训练方案相结合。我们的分析显示，后期块比标准单流基线发展出更清晰的跨帧注意力结构。在VBench上，Motif-Video 2B达到了83.76%的性能，超越了Wan2.1 14B模型，使用7倍更少的参数和显著更少的训练数据。这些结果表明，通过精心的架构专门化和以效率为导向的训练方案，可以缩小或超越通常与更大视频模型相关联的质量差距。

英文摘要

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7$\times$ fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.

2604.16491 2026-05-20 cs.CV cs.AI 版本更新

A Lightweight Transformer for Pain Recognition from Brain Activity

一种轻量级变压器用于从脑活动识别疼痛

Stefanos Gkikas, Christian Arzate Cruz, Yu Fang, Lu Cao, Muhammad Umar Khan, Thomas Kassiotis, Giorgos Giannakakis, Raul Fernandez Rojas, Randy Gomez

Affiliations: Honda Research Institute Japan, Wako City, Japan; BioSIS (Biosensing & Intelligent Systems) Lab, Centre for Intelligent Computing Systems, University of Canberra, Canberra, Australia; Department of Electronic Engineering, Hellenic Mediterranean University, Chania, Greece

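AI summary: This paper presents a lightweight Transformer that fuses multiple fNIRS representations through a unified tokenization mechanism, jointly modeling complementary signal views without modality-specific adaptations or added architectural complexity, and achieves competitive pain recognition performance while remaining computationally compact.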

详情

AI中文摘要

疼痛是一种复杂且广泛的现象，具有显著的临床和社会负担，使其可靠的自动化评估成为关键目标。本文提出了一种轻量级变压器架构，通过统一的标记机制融合多种fNIRS表示，实现了互补信号视图的联合建模，而无需特定模态的适应或增加架构复杂性。所提出的标记混合策略通过将异构输入投影到共享的潜在表示中，保留了空间、时间和时间-频率特性，并使用结构化的分段方案来控制局部聚合和全局交互的粒度。该模型在AI4Pain数据集上使用堆叠的原始波形和功率谱密度表示进行评估。实验结果表明，该方法在保持计算紧凑性的同时实现了竞争性的疼痛识别性能，使其适用于GPU和CPU硬件上的实时推断。

英文摘要

Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.

2604.11089 2026-05-20 cs.CV 版本更新

Structured State-Space Regularization for Generation-Friendly Image Tokenization

结构化状态空间正则化用于生成友好的图像标记化

Jinsung Lee, Jaemin Oh, Namhun Kim, Dongwon Kim, Byung-Jun Yoon, Suha Kwak

Affiliations: POSTECH; Brown University; KAIST; Texas A&M University; Brookhaven National Laboratory

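AI summary: This paper proposes a structured state-space regularization method that improves the generation performance of image tokenization by inducing spectral structure in the latent space while preserving reconstruction fidelity.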

Comments Related blog posts in https://jinsingsangsung.github.io/collections/blog/ : Towards 2-Dimensional State-Space Models series

2604.08503 2026-05-20 cs.CV 版本更新

Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Phantom：通过联合建模视觉和潜在物理动态实现物理 infused 的视频生成

Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou

Affiliations: University of Illinois Urbana-Champaign

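AI summary: This paper proposes Phantom, which jointly models visual content and latent physical dynamics so that video generation is physically consistent, producing videos that are both visually realistic and physically plausible.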

Comments 15 pages, 6 figures, CVPR 2026

详情

AI中文摘要

近期生成视频建模的进展，受到大规模数据集和强大架构的推动，已经取得了显著的视觉真实效果。然而，越来越多的证据表明，仅仅扩大数据和模型规模并不能使这些系统理解支配现实世界动态的底层物理定律。现有方法往往无法捕捉或强制执行这种物理一致性，导致不真实的运动和动态。在本文中，我们探讨是否将潜在物理属性的推断直接整合到视频生成过程中，可以赋予模型生成物理合理视频的能力。为此，我们提出了Phantom，一个物理 infused 的视频生成模型，该模型联合建模视觉内容和潜在物理动态。在观察到的视频帧和推断出的物理状态条件下，Phantom联合预测潜在物理动态并生成未来的视频帧。Phantom利用一种物理感知的视频表示，作为底层物理的抽象但信息丰富的嵌入，从而在不需显式指定复杂物理动态和属性集的情况下，联合预测物理动态和视频内容。通过将物理感知视频表示的推断直接整合到视频生成过程中，Phantom生成的视频序列既具有视觉真实性又具有物理一致性。在标准视频生成和物理感知基准上的定量和定性结果表明，Phantom不仅在遵守物理动态方面优于现有方法，还提供了具有竞争力的感知保真度。

英文摘要

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

2604.02784 2026-05-20 cs.CV cs.CL 版本更新

EduVQA: 向概念感知的教育AI生成视频评估迈进

Baoliang Chen, Xinlong Bu, Hanwei Zhu, Lingyu Zhu, Jieyu Zhan

Affiliations: College of Computing and Data Science, Nanyang Technological University; Department of Computer Science, South China Normal University, China; School of Computer Science, City University of Hong Kong

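AI summary: This study proposes EduVQA, a framework that introduces a Structured 2D Mixture-of-Experts architecture to assess concept correctness in educational AI-generated videos, addressing the tendency of conventional methods to overlook concept correctness in educational scenarios.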

详情

AI中文摘要

现有的AI生成视频质量评估（AIGVQA）方法主要关注全局感知真实性和粗略的文本-视频对齐，而忽视了教育场景中的关键要求：概念正确性。在早期数学教育中，即使视觉上合理，数值量、几何关系或空间配置中的细微错误也可能从根本上改变传达的知识。为了解决这个问题，我们引入了EduAVQABench，这是首个概念感知的教育AIGV评估基准，包含1,130个由十种最先进的T2V模型生成的视频，以及超过310,650个精细的人工标注，涵盖感知质量和语义对齐。基于此基准，我们进一步提出了EduVQA，一个概念感知的AIGVQA框架，配备了结构化2D混合专家（S2D-MoE）架构。通过通过共享专家和自适应二维路由联合建模细粒度概念评估和整体质量预测，EduVQA有效地捕捉了传统全局评分方法所忽略的细微概念层面不一致。广泛的实验表明，EduVQA在感知和语义评估任务中均优于现有AIGVQA方法，并在未见过的基准上表现出强大的泛化能力。代码和数据集将在：https://github.com/EduVQA/EduVQA 公开。

英文摘要

Existing AI-generated video quality assessment (AIGVQA) methods mainly focus on global perceptual realism and coarse text-video alignment, while overlooking a critical requirement in educational scenarios: concept correctness. In early mathematics education, subtle errors in numerical quantities, geometric relations, or spatial configurations may fundamentally alter the conveyed knowledge despite visually plausible generation. To address this problem, we introduce EduAVQABench, the first benchmark for concept-aware educational AIGV assessment, containing 1,130 videos generated by ten state-of-the-art T2V models together with over 310,650 fine-grained human annotations spanning perceptual quality and semantic alignment. Built upon this benchmark, we further propose EduVQA, a concept-aware AIGVQA framework equipped with a Structured 2D Mixture-of-Experts (S2D-MoE) architecture. By jointly modeling fine-grained concept assessment and overall quality prediction through shared experts and adaptive two-dimensional routing, EduVQA effectively captures subtle concept-level inconsistencies overlooked by conventional global scoring methods. Extensive experiments demonstrate that EduVQA consistently outperforms existing AIGVQA approaches across both perceptual and semantic evaluation tasks while exhibiting strong generalization capability on unseen benchmarks. Code and dataset will be publicly available at: https://github.com/EduVQA/EduVQA.

2602.23622 2026-05-20 cs.CV cs.AI 版本更新

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

DLEBench: 评估基于指令的图像编辑模型在小规模物体编辑能力

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao

Affiliations: College of Computer Science and Artificial Intelligence, Fudan University

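AI summary: This paper introduces DLEBench, the first benchmark dedicated to evaluating how well instruction-based image editing models edit small-scale objects. Its 1,889 samples cover complex scenarios and reveal a performance gap on small-object editing, underscoring the need for specialized benchmarks.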

详情

AI中文摘要

在基于指令的图像编辑模型（IIEMs）领域已取得显著进展。然而，尽管这些模型在当前基准上表现出对指令的合理遵循和强大的推理能力，但它们在编辑小物体方面的能力仍缺乏深入探索，尽管这对精确局部编辑和生成图像中细节的细化至关重要。本文介绍了DeepLookEditBench（DLEBench），首个专门评估IIEMs在编辑小规模物体能力的基准。具体而言，我们构建了一个包含七个指令类型的挑战性测试平台，共1889个样本。在这些样本中，目标物体仅占图像面积的1%-10%，涵盖了部分遮挡和多物体编辑等复杂场景。为确保对本基准的稳健评估，我们提出了一种评估协议，包含细化的评分标准，以最小化在“指令遵循”和“视觉一致性”两个标准中的主观性和歧义性。该协议还引入了双模式评估框架（工具驱动模式和Oracle引导模式），以解决DLEBench中LMM-as-a-Judge与人类判断之间的不一致问题。在10个IIEMs上的实证结果揭示了小规模物体编辑上的显著性能差距，突显了专用基准在推动该能力发展方面的重要性。

英文摘要

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.

2602.09872 2026-05-20 cs.CV cs.HC 版本更新

BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices

BabyMamba-HAR：轻量级选择性状态空间模型用于资源受限设备上高效的人体活动识别

Mridankan Mandal

Affiliations: Department of Information Technology, Indian Institute of Information Technology, Allahabad, Prayagraj

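AI summary: This paper proposes BabyMamba-HAR, a lightweight selective state space model for efficient human activity recognition on resource-constrained devices, with two lightweight architectures that achieve high accuracy at low resource cost.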

详情

AI中文摘要

在资源受限的设备上进行人体活动识别（HAR）需要在多样化的传感器设置下保持高精度。选择性状态空间模型（SSMs）提供了高效的线性时间序列处理，成为注意力机制的一种有吸引力的替代方案。然而，其TinyML设计空间仍待探索。本文介绍了BabyMamba-HAR，包含两种轻量级架构：（1）CI-BabyMamba-HAR，利用通道独立的茎部以提高噪声鲁棒性；（2）Crossover-BiDir-BabyMamba-HAR，利用早期融合的茎部以实现通道计数独立的复杂度。两者都集成了权重绑定的双向扫描和门控时间注意力池化。在八个基准测试中，Crossover-BiDir-BabyMamba-HAR平均达到86.52%的F1分数，使用27K参数和2.21M MACs，与TinyHAR（86.16%）相当，但要求在高通道数据集上减少11倍的MACs。在设备上部署到Raspberry Pi Pico 2和ESP32上使用混合精度C++运行时（INT8投影，float32状态）。融合计算策略与生命周期感知内存管理将峰值内存足迹从O(B*dmodel*L*dstate)减少到O(B*dmodel*dstate)，适应于支持权重绑定的双向和通道流执行。两种架构均实现了完整的8/8数据集覆盖，与PyTorch的>99.2%的兼容性，而INT8量化TFLite基线显示了退化的覆盖和兼容性（TinyHAR：7/8和4/8覆盖，60.4%和88.6%兼容性，TinierHAR：8/8和6/8在54.2%和90.8%兼容性，DeepConvLSTM：1/8和0/8在Pico 2和ESP32上）。Crossover-BiDir-BabyMamba-HAR在ESP32上平均延迟为154.4 ms，在Pico 2上为481.9 ms。消融实验确认双向扫描和门控注意力分别将F1分数提高高达8.42%和8.94%，建立了TinyML SSM部署的实用原则。

英文摘要

Human activity recognition (HAR) on resource constrained devices requires high accuracy across diverse sensor setups. Selective state space models (SSMs) offer efficient linear time sequence processing, presenting a compelling alternative to attention mechanisms. However, their TinyML design space remains unexplored. This paper introduces BabyMamba-HAR, comprising two lightweight architectures: (1) CI-BabyMamba-HAR, utilizing a channel independent stem for noise robustness, and (2) Crossover-BiDir-BabyMamba-HAR, utilizing an early fusion stem for channel count independent complexity. Both integrate weight tied bidirectional scanning and gated temporal attention pooling. Across eight benchmarks, Crossover-BiDir-BabyMamba-HAR averages an 86.52% F1-score with 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. On-device deployment on the Raspberry Pi Pico 2 and ESP32 utilized a mixed precision C++ runtime (INT8 projections, float32 states). A fused computation strategy with lifetime aware memory management reduces peak memory footprint from O(B*dmodel*L*dstate) to O(B*dmodel*dstate), adapting to support weight-tied bidirectional and channel-streaming execution. Both architectures achieved full 8/8 dataset coverage with >99.2% PyTorch parity, whereas INT8 quantized TFLite baselines showed degraded coverage and parity (TinyHAR: 7/8 and 4/8 coverage at 60.4% and 88.6% parity, TinierHAR: 8/8 and 6/8 at 54.2% and 90.8%, DeepConvLSTM: 1/8 and 0/8 on Pico 2 and ESP32, respectively). Crossover-BiDir-BabyMamba-HAR averages 154.4 ms latency on ESP32 and 481.9 ms on Pico 2. Ablations confirm bidirectional scanning and gated attention improve F1-scores by up to 8.42% and 8.94%, respectively, establishing practical principles for TinyML SSM deployment.
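The drop in peak memory from O(B*dmodel*L*dstate) to O(B*dmodel*dstate) corresponds to not storing the hidden state for every time step. A minimal sketch of that idea, assuming a single channel, batch size 1, and a diagonal recurrence h_t = a_t * h_{t-1} + b_t * x_t with output y_t = c_t · h_t; the function names are hypothetical, and the paper's fused kernel and lifetime-aware memory management are not reproduced here.

```python
def scan_materialized(a, b, c, x):
    """Reference scan that keeps every hidden state: memory grows with L."""
    n = len(a[0])
    states = [[0.0] * n]
    for t in range(len(x)):
        prev = states[-1]
        states.append([a[t][i] * prev[i] + b[t][i] * x[t] for i in range(n)])
    return [
        sum(ci * hi for ci, hi in zip(c[t], states[t + 1]))
        for t in range(len(x))
    ]


def scan_streaming(a, b, c, x):
    """Same recurrence, but only the current state is kept: memory is O(dstate)."""
    n = len(a[0])
    h = [0.0] * n
    outputs = []
    for t in range(len(x)):
        # Update the state and emit the output in the same step,
        # so the previous state can be discarded immediately.
        h = [a[t][i] * h[i] + b[t][i] * x[t] for i in range(n)]
        outputs.append(sum(ci * hi for ci, hi in zip(c[t], h)))
    return outputs
```

Both functions compute the same outputs. The streaming version trades the ability to revisit past states for a state buffer whose size is independent of sequence length, which is the property that matters on microcontrollers with a few hundred kilobytes of RAM.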

2602.07570 2026-05-20 q-bio.NC cs.AI cs.CV cs.LG Updated version

How does longer temporal context enhance multimodal narrative video processing in the brain?

Prachi Jindal, Anant Khandelwal, Manish Gupta, Bapi S. Raju, Subba Reddy Oota, Tanmoy Chakraborty

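Affiliations: Technische Universität Berlin; Microsoft Research; IIT Delhi; Microsoft; IIIT-Hyderabad

AI summary: This study examines how video clip duration and narrative-task prompts affect brain-model alignment for multimodal large language models (MLLMs) during naturalistic movie watching. It finds that increasing clip duration substantially improves brain alignment for MLLMs, whereas unimodal video models show little to no gain.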

Comments 22 pages, 15 figures

Abstract

Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3--24 s clips) and the narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, experiments with four narrative-task prompts show that they elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Our work positions long-form narrative movies as a principled testbed for studying long-timescale temporal integration in long-context MLLMs and its relationship to cortical responses during narrative comprehension.

2602.07008 2026-05-20 cs.CV cs.LG Updated version

Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

Tianhe Wu, Ruibin Li, Lei Zhang, Kede Ma

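AI summary: This paper proposes diversity-preserved distribution matching distillation (DP-DMD), a role-separated distillation strategy that preserves sample diversity under few-step sampling while maintaining competitive visual quality, offering a simple and stable alternative to other DMD variants.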

Abstract

Distribution matching distillation (DMD) facilitates few-step image generation by aligning a distilled student with a reference multi-step teacher. In practice, however, optimizing DMD can reduce sample diversity in few-step synthesis, and existing remedies typically rely on perceptual or adversarial regularization, leading to stability and scalability challenges during training. Here, we describe diversity-preserved DMD (DP-DMD), a role-separated distillation method inspired by the complementary roles of early and late denoising steps. Specifically, the first distillation step is trained with a teacher-derived target-prediction objective (e.g., v-prediction) to preserve sample diversity, while the remaining steps are optimized with the standard DMD loss to refine perceptual quality. DP-DMD, with no perceptual or adversarial regularization, no additional modules, and no teacher-generated reference samples, preserves sample diversity while maintaining competitive visual quality under few-step sampling, providing a simple and stable alternative to other DMD variants.

2601.20308 2026-05-20 cs.CV cs.GR Updated version

Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao, Huihui Bai

Affiliations: Institute of Information Science, Beijing Jiaotong University; Visual Intelligence + X International Cooperation Joint Laboratory of MOE, Beijing; Innovation School of Artificial Intelligence, Hefei University of Technology; School of Control Science and Engineering, Shandong University

AI summary: This paper proposes OSDEnhancer, a framework that achieves robust space-time video super-resolution with one-step diffusion under complex, unknown real-world degradations. It uses a linear initialization and a divide-and-conquer strategy to improve the recovery of spatiotemporal dynamics and texture.

Comments 12 pages, 9 figures

Abstract

Diffusion models have demonstrated exceptional success in video super-resolution (VSR), exhibiting powerful capabilities for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic high-resolution visual content but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simple degradation assumptions, thus failing in real-world scenarios with complex unknown degradations. To address these challenges, we propose OSDEnhancer, the first framework that achieves robust STVSR in one-step diffusion. OSDEnhancer begins with a linear initialization to establish essential spatiotemporal structures and adapt the model for one-step reconstruction. It then applies a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference for enhanced overall performance. A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation. Experimental results demonstrate that the proposed method attains state-of-the-art performance with superior generalization in real-world scenarios. The code is available at https://github.com/W-Shuoyan/OSDEnhancer.

2601.18993 2026-05-20 cs.CV cs.AI cs.GR Updated version

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, Yaoyao Liu

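Affiliations: University of Illinois Urbana-Champaign; University of Pennsylvania; Eyeline Labs

AI summary: This paper proposes FreeOrbit4D, a training-free framework that resolves the geometric ambiguity of large-angle camera redirection by recovering a foreground-complete 4D proxy, producing more faithful and temporally coherent videos.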

Comments 12 pages, 10 figures. Accepted to SIGGRAPH Conference Papers 2026

DOI: 10.1145/3799902.3811122

Abstract

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

2601.14822 2026-05-20 cs.CV cs.AI Updated version

Multimodal system for skin cancer detection

Volodymyr Sydorskyi, Igor Krashenyi, Oleksii Yakubenko

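AI summary: This paper proposes a multimodal skin cancer detection system that combines conventional photographs with tabular metadata (such as patient demographics and lesion characteristics). It uses a multimodal neural network and a two-step model to improve detection accuracy, refines predictions with a three-stage pipeline, and reports significant performance gains on an imbalanced dataset.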

Comments Accepted to System research and information technologies

Journal ref System Research and Information Technologies, no. 1, pp. 33-57, 2026

DOI: 10.20535/SRIT.2308-8893.2026.1.03

Abstract

Melanoma detection is vital for early diagnosis and effective treatment. While deep learning models on dermoscopic images have shown promise, they require specialized equipment, limiting their use in broader clinical settings. This study introduces a multi-modal melanoma detection system using conventional photo images, making it more accessible and versatile. Our system integrates image data with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network combining image and metadata processing and supports a two-step model for cases with or without metadata. A three-stage pipeline further refines predictions by boosting algorithms and enhancing performance. To address the challenges of a highly imbalanced dataset, specific techniques were implemented to ensure robust training. An ablation study evaluated recent vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Results demonstrate that integrating photo images with metadata in a structured, multi-stage pipeline yields significant performance improvements. This system advances melanoma detection by providing a scalable, equipment-independent solution suitable for diverse healthcare environments, bridging the gap between specialized and general clinical practices.
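
The parenthetical "0.2 maximum" suggests that the partial ROC AUC is measured only over the part of the ROC curve where the true positive rate is at least 0.8; this threshold is an assumption, since the abstract does not state it. Under that assumption, a minimal Python sketch of the metric looks as follows. The function names are hypothetical and this is not the authors' evaluation code.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr), sweeping the threshold from high to low scores."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc_above_tpr(labels, scores, min_tpr=0.8):
    """Area between the ROC curve and the line tpr = min_tpr (at most 1 - min_tpr)."""
    points = roc_points(labels, scores)
    area = 0.0
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        if t1 <= min_tpr or f1 == f0:
            continue  # segment lies below the threshold or is vertical
        if t0 < min_tpr:
            # Clip the segment at the point where it crosses tpr = min_tpr.
            frac = (min_tpr - t0) / (t1 - t0)
            f0 = f0 + frac * (f1 - f0)
            t0 = min_tpr
        area += (f1 - f0) * ((t0 - min_tpr) + (t1 - min_tpr)) / 2.0
    return area


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    print(partial_auc_above_tpr(labels, [0.9, 0.8, 0.2, 0.1]))  # perfect ranking
    print(partial_auc_above_tpr(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
```

With this definition a perfect ranking reaches the ceiling of 1 - 0.8 = 0.2, which is why a score of 0.18068 is close to the best attainable value.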

2512.11234 2026-05-20 cs.CV Updated version

RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing

Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

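Affiliations: School of Information Science and Engineering, Hunan University

AI summary: This work proposes RoomPilot, a framework for controllable indoor scene synthesis via multimodal semantic parsing. It addresses the limited input modalities and implicit generation processes of existing methods and improves control over scene structure and semantics.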

Comments 30 pages, 8 figures

Abstract

Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency across multi-room layouts. Moreover, RoomPilot constructs a curated asset dataset with rich semantic annotations to support high-quality scene synthesis, improving visual realism and appearance consistency. Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity, marking a significant step toward controllable 3D indoor scene synthesis. Code and model will be available.

2512.08237 2026-05-20 cs.CV Updated version

Fast-BEV++: Fast by Algorithm, Deployable by Design

Yuanpeng Chen, Hui Song, Sheng Yang, Wei Tao, Shanhui Mo, Shuang Zhang, Xiao Hua, Tiankun Zhao

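Affiliations: iMotion Automotive Technology (Suzhou) Co., Ltd; School of Data Science, Fudan University

AI summary: This paper proposes Fast-BEV++, which follows two principles, Fast by Algorithm and Deployable by Design, to resolve the tension between accuracy and deployment efficiency in low-cost bird's-eye-view perception for autonomous driving. It achieves at least a 3x speedup over the Fast-BEV baseline, a new state-of-the-art result of 0.488 NDS on the nuScenes benchmark, and real-time inference at more than 134 FPS.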

Comments most up-to-date version

Abstract

The advancement of vision-only Bird's-Eye-View (BEV) perception, a core paradigm for cost-effective autonomous driving, is hindered by the long-standing fundamental trade-off between perception accuracy and on-device deployment efficiency. In this work, we introduce Fast-BEV++, a BEV perception framework that resolves this tension through two fundamental design principles: Fast by Algorithm and Deployable by Design. By decomposing the core view transformation module into a hardware-oriented standard Index-Gather-Reshape pipeline, Fast-BEV++ eliminates dependencies on custom kernels while achieving no less than 3 times speedup over the Fast-BEV baseline across mainstream edge platforms. Empirically, Fast-BEV++ establishes a new state-of-the-art result of 0.488 NDS on the nuScenes 3D object detection benchmark, simultaneously delivering real-time inference at more than 134 FPS via our acceleration design. In particular, our integrated, learnable depth module yields consistent performance gains, maintaining the highest accuracy among comparable methods. Overall, this inherently decomposed architecture enables seamless real-time deployment across diverse production-grade automotive platforms, alleviating hardware limitations without compromising perception accuracy or inference efficiency.

2512.04556 2026-05-20 cs.GR cs.CV Updated version

DISK: Differentiable Sparse Kernel Complex for Efficient Spatially-Variant Convolution

Zhizhen Wu, Zhe Cao, Yuchi Huo

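Affiliations: State Key Lab of CAD&CG, Zhejiang University, China

AI summary: This paper proposes a differentiable sparse kernel decomposition framework for efficient spatially-variant convolution. It represents a target spatially-variant, dense, complex kernel with a set of sparse kernel samples, yielding an efficient and differentiable optimization method suited to mobile imaging and real-time rendering.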

Comments Accepted as a conference paper at ICLR 2026. OpenReview: https://openreview.net/forum?id=bbuxDoRD2D

Abstract

Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining or additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.

2512.01152 2026-05-20 cs.LG cs.AI cs.CV Updated version

Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

Shravan Chaudhari, Yoav Wald, Suchi Saria

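Affiliations: Department of Computer Science, Johns Hopkins University; Faculty of Data and Decision Sciences, Technion; Center for Data Science, New York University; Bayesian Health

AI summary: This paper studies the challenges of open-set domain adaptation under background distribution shift and proposes CoLOR, a provably efficient solution. Theoretical analysis shows that it outperforms a representative baseline in a simplified overparameterized setting, and experiments demonstrate its applicability to image and text data.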

Comments Project page at https://github.com/Shra1-25/CoLOR

Journal ref Transactions on Machine Learning Research (TMLR) 2026/May ISSN: 2835-8856

Abstract

As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call the background distribution, is fixed. In this paper we develop CoLOR, a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make CoLOR scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that CoLOR significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influence performance, an aspect that has not been extensively explored in prior work.

2512.00281 2026-05-20 cs.CV q-bio.NC Updated version

Beyond Size and Growth: Rethinking Lung Cancer Screening with AI Based Nodule Detection and Diagnosis

Sylvain Bodard, Pierre Baudot, Benjamin Renoust, Charles Voyton, Gwendoline De Bie, Ezequiel Geremia, Van-Khoa Le, Danny Francis, Pierre-Henri Siot, Yousra Haddou, Vincent Bobin, Jean-Christophe Brisset, Carey C. Thomson, Valerie Bourdes, Benoit Huet

Affiliations: Université de Paris Cité, AP-HP, Hôpital Universitaire Necker Enfants Malades, Service d’Imagerie Adulte; Memorial Sloan Kettering Cancer Center, Department of Radiology; Sorbonne Université, CNRS UMR 7371, INSERM U 1146, Laboratoire d’Imagerie Biomédicale (LIB); Median Technologies, eyonis; Mount Auburn Hospital/Beth Israel Lahey Health, Cambridge MA, USA; Harvard Medical School, Boston MA, USA

AI summary: This paper presents an integrated AI system that performs nodule detection and malignancy assessment directly at the nodule level from low-dose CT scans, moving beyond conventional size- and growth-based screening criteria to improve the accuracy and efficiency of lung cancer screening.

Comments 25 pages, 8 figures, with supplementary information containing 11 figures

Abstract

Early detection of malignant lung nodules remains constrained by size and growth based screening criteria, often delaying diagnosis. We present an integrated AI system that jointly performs nodule detection and malignancy assessment directly at the nodule level from low dose CT scans, within a unified CADe/CADx framework. Unlike conventional pipelines separating detection and diagnosis, our approach targets malignant nodules directly, redefining evaluation at the point where clinical decisions are made. To address limitations in dataset scale and explainability, the system consists of a Large Ensemble Model (LEM) combining ensembles of shallow deep learning and feature based models. It was trained and evaluated on 25,709 scans with 69,449 annotated nodules, with external validation on an independent cohort. It achieved an AUC of 0.98 internally and 0.945 externally, outperforming all growth based metrics, Lung RADS size based triage, European volume and VDT based screening criteria, radiologists, and leading AI models. The model maintains high sensitivity at low false positive rates, excels for small and early stage cancers, and enables malignancy assessment up to one year earlier than radiologists for indeterminate and slow growing nodules. This approach has the potential to streamline lung cancer screening workflows and support earlier, more actionable clinical decision making.

2511.16766 2026-05-20 cs.CV Updated version

SVG360: Editable Multiview Vector Graphics from a Single SVG

Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang

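Affiliations: Technical University of Darmstadt; University of Stuttgart

AI summary: This paper proposes SVG360, a framework that converts a single SVG into geometrically and visually consistent multiview SVG assets through a view-consistent vectorization pipeline. It addresses path fragmentation and color instability across views, improving multiview consistency and editability.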

Abstract

Scalable Vector Graphics are a standard representation for editable visual design, yet they are usually authored as single view two dimensional illustrations. This limits their use in applications that require object level assets to remain coherent when observed, edited, or animated from different viewpoints. We present SVG360, a framework that converts a single input SVG into geometrically and visually consistent multiview SVG assets. The key challenge is that direct per view generation or vectorization produces view dependent regions, fragmented paths, and unstable colors, making the resulting SVGs difficult to edit as a coherent object. SVG360 addresses this problem through a view consistent vectorization pipeline. It first lifts the rasterized input into a view conditioned object representation and renders target views under prescribed cameras. It then propagates part identity across neighboring views using a spatial memory mechanism adapted from video segmentation, establishing consistent region decomposition, path correspondence, and color assignment without task specific retraining. Finally, each view is reconstructed as an editable SVG through structure aware vectorization, where redundant paths are consolidated and local geometry is optimized while preserving boundaries and semantic parts. Experiments on object level SVG assets show that SVG360 improves multiview consistency, reduces path redundancy, and better preserves fine structures compared with direct per view vectorization. By turning a single view SVG into a coherent 360 degree vector asset, SVG360 expands vector graphics from static illustration toward editable multiview content for design, animation, and structured visual editing.

2511.13864 2026-05-20 cs.CV Updated version

PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data

Ayushi Sharma, Johanna Trost, Daniel Lusk, Johannes Dollinger, Julian Schrader, Christian Rossi, Javier Lopatin, Etienne Laliberté, Simon Haberstroh, Jana Eichel, Daniel Mederer, Jose Miguel Cerda-Paredes, Shyam S. Phartyal, Lisa-Maricia Schwarz, Anja Linstädter, Maria Conceição Caldeira, Teja Kattenborn

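Affiliations: GeoSense-Freiburg

AI summary: This study proposes PlantTraitNet, a multi-modal, multi-task, uncertainty-aware deep learning framework that uses weak supervision to predict four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos and aggregates the predictions spatially into global trait distribution maps. Validation shows that it outperforms existing trait maps on all evaluated traits.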

Comments Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI-26). Link: https://ojs.aaai.org/index.php/AAAI/article/view/41272

DOI: 10.1609/aaai.v40i46.41272

Abstract

Global maps of plant traits, such as leaf nitrogen or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps remain limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer a largely untapped resource to overcome these limitations, with over 50 million geotagged plant photographs worldwide capturing valuable visual information on plant morphology and physiology. In this study, we introduce PlantTraitNet, a multi-modal, multi-task uncertainty-aware deep learning framework that predicts four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos using weak supervision. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet consistently outperforms existing trait maps across all evaluated traits, demonstrating that citizen science imagery, when integrated with computer vision and geospatial AI, enables not only scalable but also more accurate global trait mapping. This approach offers a powerful new pathway for ecological research and Earth system modeling.

2510.21464 2026-05-20 cs.CV Updated version

CXR-LanIC: Language-Grounded Interpretable Classifier for Chest X-Ray Diagnosis

Yiming Tang, Wenjia Zhong, Rushi Shah, Dianbo Liu

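Affiliations: National University of Singapore

AI summary: This paper proposes CXR-LanIC, a language-grounded interpretable classifier that addresses the interpretability challenge in chest X-ray diagnosis through task-aligned pattern discovery. It trains sparse autoencoders to extract interpretable visual patterns, achieving competitive diagnostic accuracy while laying the groundwork for natural language explanations.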

Abstract

Deep learning models have achieved remarkable accuracy in chest X-ray diagnosis, yet their widespread clinical adoption remains limited by the black-box nature of their predictions. Clinicians require transparent, verifiable explanations to trust automated diagnoses and identify potential failure modes. We introduce CXR-LanIC (Language-Grounded Interpretable Classifier for Chest X-rays), a novel framework that addresses this interpretability challenge through task-aligned pattern discovery. Our approach trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier to decompose medical image representations into interpretable visual patterns. By training an ensemble of 100 transcoders on multimodal embeddings from the MIMIC-CXR dataset, we discover approximately 5,000 monosemantic patterns spanning cardiac, pulmonary, pleural, structural, device, and artifact categories. Each pattern exhibits consistent activation behavior across images sharing specific radiological features, enabling transparent attribution where predictions decompose into 20-50 interpretable patterns with verifiable activation galleries. CXR-LanIC achieves competitive diagnostic accuracy on five key findings while providing the foundation for natural language explanations through planned large multimodal model annotation. Our key innovation lies in extracting interpretable features from a classifier trained on specific diagnostic objectives rather than general-purpose embeddings, ensuring discovered patterns are directly relevant to clinical decision-making, demonstrating that medical AI systems can be both accurate and interpretable, supporting safer clinical deployment through transparent, clinically grounded explanations.

2510.16814 2026-05-20 cs.LG cs.AI cs.CV 版本更新

Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

景观中的针：在标签稀缺条件下用于考古遗址发现的半监督伪标签方法

Simon Jaxy, Anton Theys, Patrick Willett, W. Chris Carleton, Ralf Vandam, Pieter Libin

发表机构 * Sensors, Royal Military Academy, Brussels, Belgium； AMGC (Archaeology, Environmental Changes & Geo-Chemistry), Vrije Universiteit Brussel； Max Planck Institute of Geoanthropology, Jena, Germany

AI总结本文提出了一种非对称双伪标签（DPL）方法，通过端到端深度学习直接从多波段遥感影像中学习稀疏正样本，无需人工特征工程或对遗址不存在的假设，在两个著名的考古数据集上进行了评估。DPL在Sagalassos数据集上优于LAMAP基线，在F1和召回率上分别提高了12%和29%，而在Cyprus数据集上，DPL在无确认负样本的纯PU设置中恢复了判别能力。DPL的集成产生可解释的概率表面，支持调查规划，从最小的标记数据中有效发现遗址。

详情

AI中文摘要

考古预测建模通过结合已知遗址位置与环境和地理空间变量来估计未发现遗址的可能位置，这构成了一个正-无标签（PU）学习问题：已确认的遗址稀少，而大多数位置只是未标记，并非真正的负样本。为克服这一问题，我们提出了非对称双伪标签（DPL），一种端到端深度学习方法，直接从多波段地理空间影像中学习稀疏正样本，无需人工特征工程或对遗址不存在的假设，并在两个著名的考古数据集上进行了评估。在Sagalassos数据集上，以独立的留出实地调查为评估基准，DPL在F1和召回率上分别优于LAMAP基线12%和29%，而LAMAP在概率排序上保持优势。当负样本不确定时，标准监督基线会严重失败；仅用正样本训练则会退化为将所有位置都预测为遗址，二者共同给出了经验界限。Cyprus数据集是没有确认负样本的纯PU设置，在该数据集上SL使概率排序发生反转，而DPL恢复了判别能力。DPL集成产生可解释的概率表面，可支持调查规划，从而以极少的标注数据实现有效的遗址发现。

英文摘要

Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental and geospatial variables, presenting a positive-unlabeled (PU) learning challenge where confirmed sites are rare and most locations are unlabeled rather than truly negative. To overcome this, we propose asymmetric dual pseudolabeling (DPL), an end-to-end deep learning method that learns from sparse positives directly from multi-band geospatial imagery without hand-crafted feature engineering or assumptions about site absence, and evaluate it on two prominent archaeological datasets. On the Sagalassos dataset, evaluated against an independent, held-out field survey, DPL outperforms the LAMAP baseline by 12% in F1 and 29% in Recall, while LAMAP maintains advantages in probability ranking. Standard supervised baselines fail catastrophically when negatives are uncertain; positive-only training collapses to predicting everywhere, establishing empirical bounds. On the Cyprus dataset, a pure PU setting without confirmed negatives, the supervised-learning baseline (SL) inverts probability rankings while DPL recovers discrimination. DPL ensembles produce interpretable probability surfaces supporting survey planning, enabling effective site discovery from minimal labeled data.

2510.11344 2026-05-20 cs.CV 版本更新

MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression

MMAP: 一种多倍率和原型感知架构，用于预测空间基因表达

Hai Dang Nguyen, Nguyen Dang Huy Pham, The Minh Duc Nguyen, Dac Thai Nguyen, Hang Thi Nguyen, Duong M. Nguyen

发表机构 * Institute for AI Innovation and Societal Impact（人工智能创新与社会影响研究所）； Hanoi University of Science and Technology（河内科学技术大学）； Amsterdam High School for the Gifted（阿姆斯特丹天才高中）； Anatomic Pathology Division, Laboratory Department, Vinmec Times City International Hospital（Vinmec国际医院解剖病理科实验室部门）； Vinmec Healthcare System（Vinmec医疗系统）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出MMAP架构，通过多倍率和原型增强方法，解决空间基因表达预测中的局部特征粒度不足和全局空间上下文覆盖不足的问题，实验表明其在多个评估指标上均优于现有最先进方法。

Comments Received Best Paper Award at the 2025 Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025)

详情

DOI: 10.1007/978-981-95-7084-3_24

AI中文摘要

空间转录组学（ST）能够在保留空间信息的同时测量基因表达，为组织结构和疾病病理提供关键见解。近期研究探索了利用苏木精-伊红（H&E）染色的全切片图像（WSI），通过深度神经网络预测全转录组范围的基因表达谱。这项任务通常被表述为回归问题，其中每个输入对应从WSI中提取的局部图像块。然而，由于视觉特征与分子信号之间存在显著的模态差距，从组织学图像预测空间基因表达仍是一个具有挑战性的问题。最近的研究尝试将局部和全局信息纳入预测模型。然而，现有方法仍存在两个关键限制：（1）局部特征提取的粒度不足，（2）全局空间上下文的覆盖不足。在本工作中，我们提出了一种新的框架MMAP（多倍率和原型增强架构），同时解决这两个挑战。为了增强局部特征的粒度，MMAP利用多倍率图像块表示来捕捉精细的组织学细节。为了提高对全局上下文的理解，它学习一组潜在原型嵌入，作为切片级信息的紧凑表示。大量实验结果表明，MMAP在多个评估指标上均优于所有现有最先进方法，包括平均绝对误差（MAE）、均方误差（MSE）和皮尔逊相关系数（PCC）。

英文摘要

Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).
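
The three evaluation metrics named in the abstract above (MAE, MSE, PCC) have standard definitions. The following is a minimal sketch of those definitions in plain Python, assuming one list of predicted and one list of measured expression values for a single gene; the function names and the sample values are illustrative and are not taken from the MMAP code.

```python
import math


def mae(pred, true):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mse(pred, true):
    """Mean Squared Error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pcc(pred, true):
    """Pearson Correlation Coefficient (undefined if either input is constant)."""
    n = len(true)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    norm_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    return cov / (norm_p * norm_t)


# Hypothetical expression values for one gene across four spots.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [1.5, 2.0, 2.5, 5.0]
print(mae(predicted, measured), mse(predicted, measured), pcc(predicted, measured))
```

Lower is better for MAE and MSE, while PCC rewards predictions that track the measured trend even when their scale is off, which is why the three are usually reported together.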

2510.07538 2026-05-20 cs.CV 版本更新

Low-Compute Watermark Removal via Dual-Domain Natural Projection

基于双域自然投影的低计算量水印移除

Pragati Shuddhodhan Meshram, Varun Chandrasekaran

发表机构 * Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, USA（伊利诺伊大学厄巴纳-香槟分校电子与计算机工程系）

AI总结本文提出了一种轻量级且无需训练的攻击方法DAWN，通过在互补频率和语义空间中投影水印图像，以低计算成本实现高效的水印移除，同时保持结构和语义的完整性。

详情

AI中文摘要

有效的语义水印移除需要在三个竞争性目标之间取得平衡：高移除成功率、低感知失真和低计算成本。然而，现有的单图像攻击通常只优化前两个目标，实现强大的水印抑制，但依赖于昂贵的多步骤优化，限制了实际部署。在本文中，我们证明这种权衡是根本性的：目前没有任何方法能够同时实现这三个属性。我们引入DAWN，一种轻量级、无需训练的攻击方法，专门针对低计算成本的领域，同时保持竞争性的移除性能。DAWN通过将带水印的图像投影到自然图像先验上，在互补的频率和语义空间中压制偏离自然统计的水印信号，然后应用解耦的感知对齐步骤以最小化伪影来恢复视觉一致性。在多样化的像素、频率和潜在空间水印方案中，DAWN一致地降低了可检测性，同时保持结构和语义的保真度，证明了仅通过适度的感知退化即可实现高效的、低资源水印移除。我们的代码可在https://github.com/Pragati-Meshram/DAWN上获得。

英文摘要

Effective removal of semantic watermarks requires balancing three competing objectives: high removal success, low perceptual distortion, and low computational cost. However, existing single-image attacks typically optimize only for the first two, achieving strong watermark suppression but relying on expensive, multi-step optimization that limits practical deployment. In this work, we show that this trade-off is fundamental: no current approach achieves all three properties simultaneously. We introduce DAWN, a lightweight, training-free attack that explicitly targets the low-cost regime while maintaining competitive removal performance. DAWN works by projecting a watermarked image onto natural-image priors in complementary frequency and semantic spaces, suppressing watermark signals that deviate from natural statistics, and then applying a decoupled perceptual-alignment step to restore visual consistency with minimal artifacts. Across diverse pixel-, frequency-, and latent-space watermarking schemes, DAWN consistently reduces detectability while preserving structural and semantic fidelity, demonstrating that efficient, low-resource watermark removal is feasible with only modest perceptual degradation. Our code is available at https://github.com/Pragati-Meshram/DAWN.

2510.00660 2026-05-20 cs.CV 版本更新

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

超越分类准确度：Neural-MedBench与更深层次推理基准的需求

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li

发表机构 * School of Physics Science and Technology, Beijing University of Posts and Telecommunications（北京邮电大学物理科学与技术学院）； Guangdong Institute of Intelligence Science and Technology（广东智能科学技术研究院）； Beijing Chaoyang Hospital, Capital Medical University（北京朝阳医院）； Sleep Medical Center, Huzhou Third Municipal Hospital, Affiliated Hospital of Wenzhou Medical University（湖州第三人民医院睡眠医学中心，温州医科大学附属医院）； University of Macau（澳门大学）； Renyixun Health Technology Co., Ltd（仁颐讯健康科技有限公司）； Academy for Advanced Interdisciplinary Studies, Peking University（北京大学交叉学科研究院）

AI总结本文提出Neural-MedBench，一个专门用于测试多模态神经病学推理能力的基准，揭示现有医疗数据集过于强调分类准确度的问题，并通过系统评估发现模型推理失败而非感知误差主导性能下降，强调需要兼顾广度与深度的评估框架。

Comments 23 pages, 12 figures

Journal ref ICLR'2026

详情

AI中文摘要

近期视觉-语言模型（VLMs）在标准医疗基准上取得了显著进展，但其真正的临床推理能力仍不清楚。现有数据集主要强调分类准确度，导致模型在高风险诊断推理上仍存在不足。我们引入Neural-MedBench，一个紧凑且推理密集的基准，专门用于探测多模态临床推理在神经病学中的极限。Neural-MedBench整合多序列MRI扫描、结构化电子健康记录和临床笔记，并涵盖三大核心任务家族：鉴别诊断、病变识别和推理生成。为确保可靠评估，我们开发了结合LLM评分、临床验证和语义相似度指标的混合评分流程。通过系统评估最先进的VLMs，包括GPT-4o、Claude-4和MedGemma，我们发现其性能相比传统数据集显著下降。错误分析显示，推理失败而非感知误差主导模型不足。我们的发现强调了需要双轴评估框架：以广度为导向的大数据集用于统计泛化，以深度为导向的紧凑基准如Neural-MedBench用于推理保真度。我们发布Neural-MedBench于https://neuromedbench.github.io/作为开放且可扩展的诊断测试床，引导未来基准的扩展，并实现严谨而成本有效的临床可信AI评估。

英文摘要

Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

2509.21196 2026-05-20 cs.LG cs.CV 版本更新

Differential-Integral Neural Operator for Long-Term Turbulence Forecasting

微分-积分神经算子用于长期湍流预测

Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu

发表机构 * Tsinghua University（清华大学）； University of Science and Technology of China（中国科学技术大学）； The Chinese University of Hong Kong（香港中文大学）； Nanyang Technological University（南洋理工大学）； Tencent（腾讯）

AI总结本文提出了一种基于物理原理的微分-积分神经算子，通过并行分支学习不同的物理算子，以提高长期湍流预测的稳定性与鲁棒性，从而在2D Kolmogorov流基准测试中实现了更精确的预测。

详情

AI中文摘要

准确预测湍流的长期演变是科学计算中的重大挑战，对气候建模和航空航天工程等应用至关重要。现有的深度学习方法，特别是神经算子，在长期自回归预测中常常失败，导致灾难性误差累积和物理保真度的丧失。这种失败源于它们无法同时捕捉湍流动力学所支配的不同的数学结构：局部、耗散效应和全局、非局部相互作用。在本文中，我们提出了微分-积分神经算子（DINO），一种基于算子分解的原理方法。DINO通过并行分支显式建模湍流的演变，学习不同的物理算子：一个局部微分算子，由一个受约束的卷积网络实现，该网络可以证明收敛于导数；以及一个全局积分算子，由Transformer架构捕捉，学习数据驱动的全局核。这种基于物理的分解使DINO具有卓越的稳定性和鲁棒性。通过在具有挑战性的2D Kolmogorov流基准测试中的广泛实验，我们证明DINO在长期预测中显著优于最先进的模型。它能够抑制数百个时间步上的误差累积，保持涡旋场和能量谱的高保真度，并建立了物理一致、长程湍流预测的新基准。

英文摘要

Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the Differential-Integral Neural Operator (DINO), a novel framework designed from a first-principles approach of operator decomposition. DINO explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows DINO with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that DINO significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecasting.
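
The abstract above mentions a constrained convolution that converges to a derivative. The sketch below illustrates only the classical idea behind such a constraint, not the paper's architecture: a 3-tap kernel whose weights sum to zero and whose first moment equals one is the central-difference stencil, and its error shrinks roughly fourfold each time the grid spacing is halved. All function names here are hypothetical.

```python
import math


def central_difference_kernel(h):
    """Return (offsets, weights) of the 3-tap central-difference stencil."""
    return [-1, 0, 1], [-1.0 / (2.0 * h), 0.0, 1.0 / (2.0 * h)]


def stencil_moment(weights, offsets, h, order):
    """Moment sum_k w_k * (offset_k * h)**order; order 0 must be 0, order 1 must be 1."""
    return sum(w * (o * h) ** order for w, o in zip(weights, offsets))


def apply_periodic(values, weights, offsets):
    """Apply the stencil as a 1D convolution with periodic boundaries."""
    n = len(values)
    return [
        sum(w * values[(i + o) % n] for w, o in zip(weights, offsets))
        for i in range(n)
    ]


def max_derivative_error(n):
    """Max error of the stencil when differentiating sin(x) on n periodic grid points."""
    h = 2.0 * math.pi / n
    offsets, weights = central_difference_kernel(h)
    xs = [i * h for i in range(n)]
    approx = apply_periodic([math.sin(x) for x in xs], weights, offsets)
    return max(abs(a - math.cos(x)) for a, x in zip(approx, xs))


# Halving the grid spacing should cut the error by about a factor of four.
print(max_derivative_error(32), max_derivative_error(64))
```

A learned kernel that is forced to satisfy the same two moment conditions stays a consistent first-derivative approximation regardless of the values its remaining degrees of freedom take, which is one plausible reading of "constrained" in the abstract.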

2509.14839 2026-05-20 cs.CV 版本更新

MapAnything: Evaluating Monocular Metric Depth Models for 3D Urban Asset Localization

MapAnything: 评估单目度量深度模型用于3D城市资产定位

Miriam Louise Carnot, Jonas Kunze, Erik Quinten Fastermann, Eric Peukert, André Ludwig, Bogdan Franczyk

发表机构 * ScaDS.AI (University of Leipzig)（ScaDS.AI（莱比锡大学））； University of Leipzig（莱比锡大学）； Kühne Logistics University（屈内物流大学）； Wrocław University of Economics（弗罗茨瓦夫经济大学）

AI总结本文提出MapAnything框架，通过单目图像自动映射城市物体和事件，利用度量深度估计模型计算物体坐标，验证其在复杂城市环境中的精度，展示其在交通标志和道路损坏等实际应用中的有效性。

详情

AI中文摘要

城市管理部门越来越依赖涵盖交通标志、树木等城市资产以及涂鸦、道路损坏等事件的综合数据库和城市数字孪生，以有效掌握城市状况。数字化提高了对持续更新的空间数据集的需求，但当前的数据采集和维护过程仍涉及大量人工劳动，带来了显著的可扩展性挑战。本文介绍了MapAnything，一种新颖的地理定位框架，能够从单张单目图像自动完成城市物体和事件的空间映射。通过利用先进的度量深度估计模型，MapAnything准确计算物体的地理坐标，将2D图像数据转换为有价值的3D空间信息。该方法将估计的相机到物体距离与几何原理和已知相机参数相结合。我们对该框架进行了详细验证，在复杂城市环境中将其距离估计精度与高精度LiDAR点云进行对比。我们的评估对不同距离区间和语义区域（如道路和植被）上的空间性能进行了细致分析。最后，我们通过具体的使用案例，如映射交通标志和道路路面损坏，展示了该框架的实际有效性，并提供了将其整合到自动化城市资产清查系统中的建议。

英文摘要

City administrations increasingly rely on comprehensive databases and urban digital twins of city assets, such as traffic signs and trees, as well as incidents like graffiti or road damage, to maintain an effective overview of urban conditions. Digitization has increased the demand for continuously updated spatial datasets, yet current data acquisition and maintenance processes still involve considerable manual effort, posing significant scalability challenges. This paper introduces MapAnything, a novel geo-localization framework that automates the spatial mapping of urban objects and incidents from a single monocular image. By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information. The methodology integrates the estimated camera-to-object distance with geometric principles and known camera specifications. We present a detailed validation of the framework, comparing its distance-estimation accuracy against high-precision LiDAR point clouds in complex urban environments. Our evaluation provides a granular analysis of spatial performance across various distance intervals and semantic areas, such as roads and vegetation. Finally, we demonstrate the framework's practical efficacy through specific use cases, including mapping traffic signs and road pavement damage, and provide recommendations for its integration into automated urban inventory systems.
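
The abstract above says the estimated camera-to-object distance is combined with geometric principles and known camera specifications. One way this could look is sketched below, under several assumptions that are not from the paper: a pinhole camera with a known horizontal field of view, a camera heading given in degrees clockwise from north, a distance treated as horizontal ground range, and a flat-earth (equirectangular) offset that is only reasonable over short ranges. The function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def pixel_bearing_offset(u, image_width, hfov_deg):
    """Horizontal angle (degrees) between the optical axis and pixel column u."""
    focal_px = (image_width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    return math.degrees(math.atan((u - image_width / 2.0) / focal_px))


def locate_object(cam_lat, cam_lon, heading_deg, u, image_width, hfov_deg, distance_m):
    """Estimate (lat, lon) of an object seen at pixel column u, distance_m away."""
    bearing = math.radians(
        heading_deg + pixel_bearing_offset(u, image_width, hfov_deg)
    )
    north_m = distance_m * math.cos(bearing)
    east_m = distance_m * math.sin(bearing)
    lat = cam_lat + math.degrees(north_m / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat)))
    )
    return lat, lon


# Hypothetical camera pose: facing north, object in the image centre, 25 m away.
print(locate_object(51.34, 12.37, 0.0, 960, 1920, 90.0, 25.0))
```

Because the depth error enters the result linearly through `distance_m`, a validation against LiDAR ranges, as described in the abstract, directly bounds the positional error of this kind of projection.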

2507.10492 2026-05-20 cs.CV cs.AI cs.LG 版本更新

BenchReAD: A systematic benchmark for retinal anomaly detection

BenchReAD: 一种系统性的视网膜异常检测基准

Chenyu Lian, Hong-Yu Zhou, Zhanli Hu, Jing Qin

发表机构 * The Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China（香港理工大学护理学院智能健康中心）； School of Biomedical Engineering, Tsinghua University, Beijing, China（清华大学生物医学工程学院）； Research Center for Medical AI, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China（中国科学院深圳先进技术研究院医学人工智能研究中心）

AI总结本研究提出BenchReAD基准，旨在解决视网膜异常检测领域缺乏全面且公开的评估标准的问题，通过系统化的数据和算法分类，引入了全监督方法DRA，并改进为NFM-DRA，实现了SOTA性能。

Comments MICCAI 2025

详情

DOI: 10.1007/978-3-032-04937-7_4

AI中文摘要

视网膜异常检测在筛查眼部和系统性疾病中起着关键作用。尽管其重要性，该领域的进展受到缺乏全面且公开可用的基准的阻碍，这对于公平评估和推进方法至关重要。由于这一限制，与视网膜图像相关的先前异常检测工作受到（1）异常类型有限且过于简单的限制，（2）测试集几乎饱和，以及（3）缺乏泛化评估的影响，导致实验设置说服力不足。此外，现有医学异常检测基准大多专注于单类监督方法（仅使用负样本训练），忽视了临床实践中大量可用的标记异常数据和未标记数据。为了填补这些差距，我们引入了视网膜异常检测的基准，该基准在数据和算法上都是全面且系统的。通过分类和评估先前方法，我们发现利用解耦异常表示的全监督方法（DRA）取得了最佳性能，但在遇到某些未见异常时性能显著下降。受单类监督学习中记忆库机制的启发，我们提出了NFM-DRA，将其与正常特征记忆结合，以缓解性能下降，建立新的SOTA。该基准可在https://github.com/DopamineLcy/BenchReAD上公开获取。

英文摘要

Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.

2507.05843 2026-05-20 cs.CV 版本更新

USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining

USIGAN: 用于弱配对图像IHC虚拟染色的不平衡自信息特征传输

Yue Peng, Bing Xiong, Fuqiang Chen, De Eybo, RanRan Zhang, Wanming Hu, Jing Cai, Wenjian Qin

发表机构 * Shenzhen Institutes of Advanced Technology, University of Chinese Academy of Sciences（深圳先进技术研究院，中国科学院）

AI总结本文提出USIGAN方法，通过提取全局形态学语义来解决弱配对条件下IHC虚拟染色的不一致问题，改进生成结果的病理语义一致性。

详情

DOI: 10.1109/TIP.2026.3679993

AI中文摘要

免疫组化（IHC）虚拟染色任务旨在从H&E图像生成虚拟IHC图像，同时保持与相邻切片的病理语义一致性。该任务通过生成模型实现形态结构与染色模式的跨域映射，为病理分析提供高效且经济的解决方案。然而，在弱配对条件下，相邻切片之间的空间异质性带来了显著挑战，可能导致不准确的一对多映射并生成与相邻切片病理语义不一致的结果。为了解决这个问题，我们提出了一种新的IHC虚拟染色的不平衡自信息特征传输方法，称为USIGAN，该方法在不依赖位置对应的情况下提取全局形态学语义。通过在联合边缘分布中移除弱配对项，我们有效减轻了弱配对对联合分布的影响，从而显著提高了生成结果的内容一致性和病理语义一致性。此外，我们设计了不平衡最优传输一致性（UOT-CTM）机制和病理自对应（PC-SCM）机制，以构建H&E与生成IHC在图像级别以及真实IHC与生成IHC图像集内的相关矩阵。在两个公开数据集上的实验表明，我们的方法在多个临床相关指标上表现优异，如IoD和Pearson-R相关性，证明了更好的临床相关性。

英文摘要

Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H&E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges. This can lead to inaccurate one-to-many mappings and generate results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional correspondence. By removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. Moreover, we design the Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism and the Pathology Self-Correspondence (PC-SCM) mechanism to construct correlation matrices between H&E and generated IHC images at the image level, and between real and generated IHC image sets at the intra-group level. Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance.
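
Unbalanced optimal transport, which the abstract above builds on, relaxes the requirement that the transport plan match both marginals exactly. The sketch below is not USIGAN's implementation; it shows the generic entropic scaling iteration for KL-relaxed marginals, where `eps` (entropic regularization) and `rho` (marginal penalty strength) are assumed hyperparameters and the function name is hypothetical. As `rho` grows, the update approaches ordinary balanced Sinkhorn.

```python
import math


def unbalanced_sinkhorn(a, b, cost, eps=0.1, rho=1.0, iters=500):
    """Entropic unbalanced OT with KL-relaxed marginals.

    a, b  : source and target masses (need not sum to the same total)
    cost  : len(a) x len(b) cost matrix as nested lists
    Returns the transport plan as nested lists.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    power = rho / (rho + eps)  # exponent < 1 softens the marginal constraints
    for _ in range(iters):
        kv = [sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        u = [(a[i] / kv[i]) ** power for i in range(n)]
        ktu = [sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        v = [(b[j] / ktu[j]) ** power for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical toy problem: the source carries twice the mass of the target.
plan = unbalanced_sinkhorn([1.0, 1.0], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
print(sum(sum(row) for row in plan))
```

In the toy problem the plan settles on a total mass between the two marginal totals instead of forcing every unit of source mass to be matched, which is the property that makes the unbalanced formulation attractive when pairs are only weakly aligned.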

2507.01123 2026-05-20 cs.CV cs.LG eess.IV 版本更新

Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions

利用多源卫星数据和地理区域的深度学习进行滑坡检测与制图

Rahul A. Burange, Harsh K. Shinde, Omkar Mutyalwar

发表机构 * Department of Electronics & Telecommunication, KDK College of Engineering（电子与电信系，KDK工程学院）

AI总结本文提出了一种综合方法，结合多源卫星影像和深度学习模型，以提高滑坡识别和预测的准确性，通过Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度和数字高程模型（DEM）层来捕捉影响滑坡发生的关键环境特征，并评估多种地理空间分析技术对检测精度的影响，同时评估了多种先进的深度学习分割模型，如U-Net、DeepLabV3+和Res-Net，以确定其在滑坡检测中的有效性。

Comments 17 pages, 22 figures

Journal ref JETIR March 2025, Volume 12, Issue 3

详情

DOI: 10.2139/ssrn.5225437

AI中文摘要

滑坡对基础设施、经济和人类生命构成严重威胁，需要在多样化的地理区域中进行准确的检测和预测制图。随着深度学习和遥感技术的进步，自动化滑坡检测已变得更加有效。本文提出了一种综合方法，整合多源卫星影像和深度学习模型，以增强滑坡识别和预测。我们利用Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度和数字高程模型（DEM）层来捕捉影响滑坡发生的关键环境特征。各种地理空间分析技术被用来评估地形特征、植被覆盖和降雨对检测精度的影响。此外，我们评估了多种先进的深度学习分割模型，包括U-Net、DeepLabV3+和Res-Net，以确定其在滑坡检测中的有效性。所提出的框架有助于发展可靠的早期预警系统，改进灾害风险管理，并促进可持续的土地利用规划。我们的发现为深度学习和多源遥感在创建稳健、可扩展和可转移的滑坡预测模型中的潜力提供了有价值的见解。

英文摘要

Landslides pose severe threats to infrastructure, economies, and human lives, necessitating accurate detection and predictive mapping across diverse geographic regions. With advancements in deep learning and remote sensing, automated landslide detection has become increasingly effective. This study presents a comprehensive approach integrating multi-source satellite imagery and deep learning models to enhance landslide identification and prediction. We leverage Sentinel-2 multispectral data and ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers to capture critical environmental features influencing landslide occurrences. Various geospatial analysis techniques are employed to assess the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy. Additionally, we evaluate the performance of multiple state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, and Res-Net, to determine their effectiveness in landslide detection. The proposed framework contributes to the development of reliable early warning systems, improved disaster risk management, and sustainable land-use planning. Our findings provide valuable insights into the potential of deep learning and multi-source remote sensing in creating robust, scalable, and transferable landslide prediction models.
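
The abstract above uses a slope layer alongside the DEM. How the authors' slope layer was produced is not stated, so the following is only a minimal sketch of one common way to derive slope from a gridded DEM: finite differences of elevation in the two grid directions, converted to an angle. The function name is hypothetical, the grid is assumed to have square cells of known size, and at least two rows and two columns are required.

```python
import math


def slope_degrees(dem, cell_size):
    """Per-cell slope angle (degrees) from a DEM given as nested lists of elevations.

    Uses central differences in the interior and one-sided differences at the
    border. Requires at least 2 rows and 2 columns; cell_size is in the same
    unit as the elevations (for example metres).
    """
    rows, cols = len(dem), len(dem[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            dz_dy = (dem[r1][c] - dem[r0][c]) / ((r1 - r0) * cell_size)
            dz_dx = (dem[r][c1] - dem[r][c0]) / ((c1 - c0) * cell_size)
            out[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return out


# Hypothetical 3x3 tile rising 10 m per 10 m cell towards the east.
tile = [[0.0, 10.0, 20.0], [0.0, 10.0, 20.0], [0.0, 10.0, 20.0]]
print(slope_degrees(tile, 10.0))
```

A slope raster computed this way can be stacked with the multispectral bands as an additional input channel for a segmentation model, which is the kind of multi-source input the abstract describes.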

2506.08618 2026-05-20 cs.LG cond-mat.mes-hall cond-mat.other cs.AI cs.CV 版本更新

HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals

HSG-12M: 一种大规模空间多图基准，源自非厄密晶体能量谱

Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee

发表机构 * National University of Singapore（新加坡国立大学）； NUS Centre for Bioimaging Sciences（新加坡国立大学生物成像科学中心）

AI总结本文提出HSG-12M，一个包含1160万静态和510万动态哈密顿量谱图的数据集，用于研究非厄密量子物理中的复杂几何结构，填补了现有图基准在空间多边学习方面的空白。

Comments Accepted to ICLR 2026, OpenReview: [https://openreview.net/forum?id=YxuKCME576]. 49 pages, 13 figures, 14 tables. Code & pipeline: [https://github.com/sarinstein-yan/Poly2Graph]. Dataset: [https://github.com/sarinstein-yan/HSG-12M]. Dataset released under CC BY 4.0.

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026)

详情

AI中文摘要

人工智能正通过揭示理解复杂物理系统的新方法改变科学研究，但其影响仍受限于缺乏大规模、高质量的领域专用数据集。非厄密量子物理中蕴藏着丰富的资源，其中晶体的能量谱在复平面上形成复杂的几何结构，称为哈密顿量谱图。尽管这些谱图作为电子行为的指纹具有重要意义，但其系统研究一直受限于手动提取的依赖。为释放这一潜力，我们引入Poly2Graph：一个高性能、开源的管道，自动化将一维晶体哈密顿量映射到谱图。使用该工具，我们提出了HSG-12M：一个包含1160万静态和510万动态哈密顿量谱图的数据集，涵盖1401个特征多项式类别，源自177TB的谱势数据。关键的是，HSG-12M是首个大规模空间多图数据集——图嵌入在度量空间中，其中两个节点之间不同的几何轨迹被保留为单独的边。这同时填补了现有图基准在空间多边学习方面的空白。流行的GNN基准测试揭示了在大规模学习空间多边时的新挑战。除了其实际用途外，我们还表明谱图是多项式、向量和矩阵的通用拓扑指纹，建立了新的代数到图的联系。HSG-12M为凝聚态物理的数据驱动科学发现奠定了基础，为几何感知图学习的新机会以及更广泛领域铺平了道路。

英文摘要

AI is transforming scientific research by revealing new ways to understand complex physical systems, but its impact remains constrained by the lack of large, high-quality domain-specific datasets. A rich, largely untapped resource lies in non-Hermitian quantum physics, where the energy spectra of crystals form intricate geometries on the complex plane, termed Hamiltonian spectral graphs. Despite their significance as fingerprints for electronic behavior, their systematic study has been intractable due to the reliance on manual extraction. To unlock this potential, we introduce Poly2Graph: a high-performance, open-source pipeline that automates the mapping of 1-D crystal Hamiltonians to spectral graphs. Using this tool, we present HSG-12M: a dataset containing 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, distilled from 177 TB of spectral potential data. Crucially, HSG-12M is the first large-scale dataset of spatial multigraphs: graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. This addresses a critical gap, as existing graph benchmarks overwhelmingly assume simple, non-spatial edges, discarding vital geometric information. Benchmarks with popular GNNs expose new challenges in learning spatial multi-edges at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for data-driven scientific discovery in condensed matter physics and opens new opportunities in geometry-aware graph learning and beyond.
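
The abstract above defines a spatial multigraph as a graph embedded in a metric space in which geometrically distinct trajectories between the same two nodes are kept as separate edges. The sketch below shows one way such a structure could be represented; it is an illustrative container, not the HSG-12M or Poly2Graph data format, and the class and method names are hypothetical. Coordinates can be read as (Re E, Im E) points on the complex energy plane.

```python
import math
from collections import defaultdict


class SpatialMultigraph:
    """Undirected multigraph whose edges are polylines in the plane."""

    def __init__(self):
        self.positions = {}             # node name -> (x, y)
        self.edges = defaultdict(list)  # sorted (u, v) -> list of polylines

    def add_node(self, name, x, y):
        self.positions[name] = (x, y)

    def add_edge(self, u, v, waypoints=()):
        """Add one edge from u to v passing through optional intermediate points."""
        path = [self.positions[u], *waypoints, self.positions[v]]
        key = tuple(sorted((u, v)))
        if key != (u, v):
            path.reverse()  # store every path oriented from key[0] to key[1]
        self.edges[key].append(path)

    def multiplicity(self, u, v):
        """Number of distinct edges stored between u and v."""
        return len(self.edges.get(tuple(sorted((u, v))), []))

    def edge_lengths(self, u, v):
        """Sorted polyline lengths of all edges between u and v."""
        paths = self.edges.get(tuple(sorted((u, v))), [])
        return sorted(
            sum(math.dist(p, q) for p, q in zip(path, path[1:])) for path in paths
        )


graph = SpatialMultigraph()
graph.add_node("A", 0.0, 0.0)
graph.add_node("B", 2.0, 0.0)
graph.add_edge("A", "B")                          # straight segment
graph.add_edge("B", "A", waypoints=[(1.0, 1.0)])  # detour through (1, 1)
print(graph.multiplicity("A", "B"), graph.edge_lengths("A", "B"))
```

A simple-graph representation would merge the two edges above into one and lose the detour's geometry, which is the information loss the abstract attributes to existing graph benchmarks.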

2506.05317 2026-05-20 cs.CV 版本更新

ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

ProJo4D：渐进式联合优化用于稀疏视图逆物理估计

Daniel Rho, Jun Myeong Choi, Biswadip Dey, Roni Sengupta

发表机构 * University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Meta Reality Labs（Meta现实实验室）

AI总结本文提出ProJo4D，一种渐进式联合优化框架，用于解决稀疏视图下逆物理参数估计问题，通过逐步扩展联合优化参数集，提高了4D未来状态预测和物理参数估计的准确性，达到几何精度提升10倍的性能。

Comments TMLR 2026

详情

AI中文摘要

神经渲染在3D重建和新视图合成方面已取得显著进展，将物理整合到这些框架中开辟了新的应用，如机器人和XR中的物理准确数字孪生。然而，从视觉观测中估计物理参数的逆问题仍具挑战性。现有物理感知神经渲染方法通常需要密集多视角视频，使其在可扩展的实际部署中不切实际。在稀疏视图设置下，当前方法采用的顺序优化策略导致严重误差累积：初始3D重建的不准确性会传播到后续阶段，降低物理状态和材料参数估计。另一方面，同时优化所有参数失败，因为问题高度非凸且通常非可微。我们提出ProJo4D，一种渐进式联合优化框架，逐步扩展联合优化的参数集。这种设计使物理感知梯度能够细化几何，同时避免直接对所有参数进行联合优化的不稳定性。在合成和真实世界数据集上的评估表明，ProJo4D在4D未来状态预测和物理参数估计方面显著优于先前工作，实现几何精度提升高达10倍，同时保持计算效率。请访问项目网页：https://daniel03c1.github.io/ProJo4D/

英文摘要

Neural rendering has advanced significantly in 3D reconstruction and novel view synthesis, and integrating physics into these frameworks opens new applications such as physically accurate digital twins for robotics and XR. However, the inverse problem of estimating physical parameters from visual observations remains challenging. Existing physics-aware neural rendering methods typically require dense multi-view videos, making them impractical for scalable, real-world deployment. Under sparse-view settings, the sequential optimization strategies employed by current approaches suffer from severe error accumulation: inaccuracies in initial 3D reconstruction propagate to subsequent stages, degrading physical state and material parameter estimates. On the other hand, simultaneous optimization of all parameters fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually expands the set of jointly optimized parameters. This design enables physics-informed gradients to refine geometry while avoiding the instability of direct joint optimization over all parameters. Evaluations on synthetic and real-world datasets demonstrate that ProJo4D substantially outperforms prior work in 4D future state prediction and physical parameter estimation, achieving up to 10x improvement in geometric accuracy while maintaining computational efficiency. Please visit the project webpage: https://daniel03c1.github.io/ProJo4D/
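
The abstract above describes a schedule that gradually expands the set of jointly optimized parameters. The toy sketch below illustrates only that scheduling idea with plain gradient descent on a hand-written quadratic loss; the parameter names (`geometry`, `state`, `material`), the loss, and the stage layout are all hypothetical stand-ins and have nothing to do with ProJo4D's actual differentiable rendering or physics pipeline.

```python
def toy_gradients(p):
    """Gradients of a toy coupled loss with its minimum at geometry=1, state=2, material=3.

    loss = (g - 1)^2 + (s - 2g)^2 + (m - s - g)^2
    """
    g, s, m = p["geometry"], p["state"], p["material"]
    return {
        "geometry": 2 * (g - 1) - 4 * (s - 2 * g) - 2 * (m - s - g),
        "state": 2 * (s - 2 * g) - 2 * (m - s - g),
        "material": 2 * (m - s - g),
    }


def progressive_descent(grad_fn, params, stages, steps_per_stage=2000, lr=0.1):
    """Run gradient descent stage by stage, updating only the active parameters."""
    params = dict(params)
    for active in stages:
        for _ in range(steps_per_stage):
            grads = grad_fn(params)
            for name in active:
                params[name] -= lr * grads[name]
    return params


# Each stage adds one more parameter group to the jointly optimized set.
stages = [
    ["geometry"],
    ["geometry", "state"],
    ["geometry", "state", "material"],
]
start = {"geometry": 0.0, "state": 0.0, "material": 0.0}
print(progressive_descent(toy_gradients, start, stages))
```

Earlier stages hand the later ones a better starting point, while the final stage still lets gradients from the downstream terms flow back into `geometry`, which mirrors the abstract's point about refining geometry without the instability of optimizing everything from the start.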

2506.01418 2026-05-20 cs.RO cs.CV 版本更新

SEMNAV: Enhancing Visual Semantic Navigation in Robotics through Semantic Segmentation

SEMNAV: 通过语义分割增强机器人中的视觉语义导航

Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre

发表机构 * University of Alcalá（阿尔卡拉大学）； CAM-UAH ； Ministry of Science and Innovation of Spain（西班牙科学与创新部）

AI总结本文提出SEMNAV，一种利用语义分割作为环境主要视觉输入表示的方法，以增强机器人代理的感知和决策能力，通过引入高层面的语义信息，提升模型在未知环境中的泛化能力，并引入SEMNAV数据集进行训练。

Journal ref Applied Intelligence, 2026

详情

DOI: 10.1007/s10489-026-07275-1

AI中文摘要

视觉语义导航（VSN）是机器人学中的基本问题，其中智能体必须在未知环境中导航至目标对象，主要依靠视觉信息。大多数最先进的VSN模型是在模拟环境中训练的，其中使用的是现实世界的渲染场景，最理想的情况。这些方法通常依赖于虚拟场景的原始RGB数据，这限制了它们在真实世界环境中的泛化能力，由于域适应问题。为了解决这个问题，本文提出了SEMNAV，一种新的方法，利用语义分割作为环境的主要视觉输入表示，以增强代理的感知和决策能力。通过显式地引入这种高层语义信息，我们的模型学习到稳健的导航策略，提高了在未见过的环境中泛化的能力，无论是模拟还是真实世界。我们还引入了SEMNAV数据集，这是一个新编纂的数据集，用于训练如SEMNAV这样的语义分割感知导航模型。我们的方法在模拟环境和真实世界机器人平台上进行了广泛的评估。实验结果表明，SEMNAV优于现有的最先进VSN模型，在Habitat 2.0模拟环境使用HM3D数据集时实现了更高的成功率。此外，我们的实际实验突显了语义分割在缓解仿真到现实差距方面的有效性，使我们的模型成为实用VSN基于机器人应用的有希望的解决方案。代码和数据集可在https://github.com/gramuah/semnav访问。

英文摘要

Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating this type of high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, in both simulated and real-world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

2505.23747 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Spatial-MLLM: 提升基于视觉的空域智能的MLLM能力

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

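Affiliations: Tsinghua University

AI summary: This paper proposes Spatial-MLLM, a framework for visual-based spatial reasoning from purely 2D observations. A dual-encoder architecture and a space-aware frame sampling strategy improve spatial understanding, and experiments show state-of-the-art performance on a range of visual-based spatial tasks.

Comments 22 pages

Details

AI Chinese abstract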

近年来，多模态大语言模型（MLLMs）在2D视觉任务上的性能显著提升。然而，提高其空间智能仍是一个挑战。现有的3D MLLMs总是依赖额外的3D或2.5D数据来整合空间意识，限制了它们在只有2D输入（如图像或视频）场景中的实用性。在本文中，我们提出了Spatial-MLLM，一种新颖的框架，用于从纯2D观测中进行基于视觉的空间推理。与传统视频MLLMs依赖CLIP-based视觉编码器优化语义理解不同，我们的关键见解是释放来自前馈视觉几何基础模型的强大结构先验。具体来说，我们提出了双编码器架构：一个预训练的2D视觉编码器用于提取语义特征，以及一个3D空间编码器，从视觉几何模型的主干初始化以提取3D结构特征。然后，一个连接器将两种特征整合到统一的视觉标记中以增强空间理解。此外，我们提出了一种在推理时间的空间感知帧采样策略，该策略选择视频序列中具有空间信息的帧，确保在有限的token长度下，模型专注于对空间推理至关重要的帧。除了架构改进外，我们从多个来源构建了一个训练数据集，并使用监督微调和GRPO对其进行训练。在各种真实世界数据集上的广泛实验表明，Spatial-MLLM在广泛的基于视觉的空间理解和推理任务中实现了SOTA性能。项目页面：https://diankun-wu.github.io/Spatial-MLLM/.

English abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
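
The idea of selecting spatially informative frames under a fixed budget can be illustrated with a small sketch. It assumes, purely for illustration, that each frame has already been mapped to the set of scene cells it observes, and it picks frames by greedy maximum coverage. The names `greedy_frame_selection`, `frame_cover`, and `budget` are hypothetical, and the abstract does not state the selection rule the authors actually use.

```python
from typing import List, Set


def greedy_frame_selection(frame_cover: List[Set[int]], budget: int) -> List[int]:
    """Pick up to `budget` frame indices that greedily maximize covered cells.

    frame_cover[i] is a hypothetical set of scene-cell ids visible in frame i.
    """
    covered: Set[int] = set()
    chosen: List[int] = []
    remaining = set(range(len(frame_cover)))
    while remaining and len(chosen) < budget:
        # Frame adding the most not-yet-covered cells; ties go to the lower index.
        best = max(sorted(remaining), key=lambda i: len(frame_cover[i] - covered))
        if not frame_cover[best] - covered:
            break  # no remaining frame adds new coverage
        chosen.append(best)
        covered |= frame_cover[best]
        remaining.remove(best)
    return sorted(chosen)


# Toy input: five frames, each described by the cells it sees.
frames = [{1, 2, 3}, {3, 4}, {1, 2}, {5, 6, 7, 8}, {4, 5}]
selected = greedy_frame_selection(frames, budget=3)
print(selected)
```

Greedy selection is used here only because it is the simplest rule that favors frames showing new parts of the scene over redundant ones.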

2505.17726 2026-05-20 cs.CV cs.AI 版本更新

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Slot-MLLM: 多模态大语言模型中的面向对象视觉标记化

Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, Sungwoong Kim

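Affiliations: Department of Artificial Intelligence, Korea University; Kakao Corp; School of Computing, KAIST

AI summary: This paper proposes Slot-MLLM, an object-centric visual tokenization method. Its Slot Attention-based tokenizer encodes local visual details while preserving high-level semantics, improving multimodal LLMs on visual understanding and generation.

Details

AI Chinese abstract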

近年来，多模态大语言模型（MLLMs）已成为实现人工通用智能的关键方法。特别是，视觉语言MLLMs已被开发用于从多模态输入中生成文本和视觉输出。这一进展需要高效的图像标记，使LLMs能够有效处理输入和输出。然而，现有的图像标记方法通常只能捕捉全局抽象概念或均匀分割的图像块，限制了MLLMs在理解和生成细节视觉内容方面的能力，尤其是在对象层面。为了解决这一限制，我们提出了一种基于Slot Attention的面向对象视觉标记器，专门针对MLLMs。具体而言，基于Q-Former编码器、扩散解码器和残差向量量化，我们提出的离散化槽标记能够编码局部视觉细节，同时保持高层语义，并与文本数据对齐，无缝集成到LLMs的统一下一个标记预测框架中。所得到的Slot-MLLM在各种涉及局部详细理解和生成的视觉语言任务中，相对于先前视觉标记器的基线表现显著提升。值得注意的是，这项工作是首次展示了使用MLLMs和真实自然图像进行面向对象槽注意力的可行性。

English abstract

Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.
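
Residual vector quantization, one of the components named in the abstract, can be sketched generically as follows. This is a textbook-style illustration in pure Python, not the Slot-MLLM tokenizer: the codebooks are tiny hand-written examples, and `nearest`, `rvq_encode`, `coarse`, and `fine` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x in squared Euclidean distance."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], List[float]]:
    """Quantize x stage by stage; each stage encodes the residual left so far."""
    residual = list(x)
    recon = [0.0] * len(x)
    codes: List[int] = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        recon = [r + c for r, c in zip(recon, cb[idx])]
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, recon


# Two illustrative stages: a coarse codebook, then a finer one for the residual.
coarse = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
fine = [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25)]
codes, recon = rvq_encode((1.3, 0.9), [coarse, fine])
print(codes, recon)
```

Each extra stage adds one discrete code per vector and moves the reconstruction closer to the input, which is why stacked codebooks can keep fine detail with a small vocabulary per stage.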

2505.12217 2026-05-20 cs.CV 版本更新

HyperCap: Hyperspectral Land Cover Captioning Dataset for Vision Language Models

HyperCap：面向视觉语言模型的超光谱土地覆盖描述数据集

Aryan Das, Tanishq Rachamalla, Pravendra Singh, Koushik Biswas, Vinay Kumar Verma, Salvador Garcia, Antonio Plaza, Swalpa Kumar Roy

Affiliations: Department of Computer Science and Engineering, Vellore Institute of Technology; Department of Information Technology, Siddhartha Academy of Higher Education; Department of Computer Science and Engineering, Indian Institute of Technology, Roorkee; Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi; Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur; Department of Computer Science and Artificial Intelligence, University of Granada; Hyperspectral Computing Laboratory, Department of Computers and Communications, University of Extremadura

AI summary: This paper presents the HyperCap dataset, which pairs spectral data with pixel-wise textual annotations to improve model performance in remote sensing applications and to serve as a foundational resource for future research.

Comments Accepted for publication in IEEE Geoscience and Remote Sensing Magazine (GRSM), 2026

Details

DOI: 10.1109/MGRS.2026.3693613

AI Chinese abstract

我们介绍了HyperCap，首个大规模超光谱描述数据集，旨在提升模型在遥感应用中的性能和有效性。与传统超光谱成像（HSI）基准不同，HyperCap将光谱数据与像素级文本标注相结合，实现更深入的语义理解。该数据集通过结合自动和手动方法对四个基准数据集进行标注，确保准确性和一致性。使用最先进的编码器和多样的融合技术进行实证评估，显示出显著的分类性能提升。这些结果突显了视觉-语言学习在HSI中的潜力，并将HyperCap定位为未来研究的基础数据集。代码和数据集可在https://github.com/arya-domain/HyperCap获取。

English abstract

We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) benchmarks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field. The code and dataset are available at https://github.com/arya-domain/HyperCap.

2504.04065 2026-05-20 cs.CV cs.IR cs.MM 版本更新

Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering

使检索增强的视觉问答实现协作参数知识校准

Jiaqi Deng, Kaize Shi, Zonghan Wu, Huan Huo, Dingxian Wang, Guandong Xu

Affiliations: University of Technology Sydney; East China Normal University; The Education University of Hong Kong

AI summary: This paper proposes a unified retrieval-augmented VQA framework that uses collaborative parametric knowledge calibration to exploit the cross-task synergy in KB-VQA and thereby improve answering accuracy.

Comments 10 pages, 5 figures, Under Review

Journal ref Knowledge-Based Systems, 8 July 2026, Volume 346

Details

DOI: 10.1016/j.knosys.2026.116157

AI Chinese abstract

基于知识的视觉问答（KB-VQA）系统通过从外部知识库检索的知识来解决复杂的视觉-地面化问题。知识检索和答案生成任务都要求对问题上下文和外部知识进行精确的多模态理解。然而，现有方法将这两个阶段视为独立模块，在训练过程中交互有限，这阻碍了双向参数知识共享，最终导致性能不佳。为充分利用KB-VQA中的跨任务协同效应，我们提出了一种统一的检索增强VQA框架，具有协作参数知识校准。所提出的框架可以有效地将通用多模态预训练模型适应于细粒度、知识密集型任务，同时在训练和推理过程中使检索器和生成器能够协作增强和共享其参数知识。为了增强对问题和外部文档的细粒度理解，我们还将晚期交互机制整合到所提出的训练框架中。此外，我们引入了一种反思-回答机制，使模型能够显式评估并细化其知识边界。我们的方法在与最先进的模型竞争中取得了竞争力的表现，实现了回答准确率的显著4.7%的提升，并为基础MLLMs的VQA性能带来了平均7.5%的提升。

English abstract

Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions with knowledge retrieved from external knowledge bases. Knowledge retrieval and answer generation both require precise multimodal understanding of the question context and external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing and ultimately leads to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework can effectively adapt general multimodal pre-trained models for fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate a late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.

2504.03758 2026-05-20 cs.CY cs.CV cs.GR 版本更新

Improved visual-information-driven model for crowd simulation and its modular application

改进的视觉信息驱动模型用于人群模拟及其模块化应用

Xuanwen Liang, Jiayu Chen, Eric Wai Ming Lee, Wei Xie

Affiliations: Department of Architecture and Civil Engineering; Department of Construction Management; Sichuan University-The Hong Kong Polytechnic University Institute for Disaster Management and Reconstruction

AI summary: This paper proposes a data-driven crowd simulation model that uses refined visual-information extraction and explicit exit cues to improve flexibility across scenarios. Tested on four fundamental modules and a composite scenario, the model performs well and outperforms the classical knowledge-driven model.

Journal ref Xuanwen Liang, Jiayu Chen, Eric Wai Ming Lee, & Wei Xie (2026). Improved visual-information-driven model for crowd simulation and its modular application. Chaos, Solitons & Fractals, 209, 118481

Details

DOI: 10.1016/j.chaos.2026.118481

AI Chinese abstract

人群运动模拟对行人安全管理及设施设计至关重要。数据驱动模型有潜力提高真实性和预测准确性，但大多数模型仅适用于单一场景，限制了其灵活性。我们提出了一种数据驱动的人群模拟模型，结合了精细化的视觉信息提取和显式出口提示，旨在通过更有效地捕捉核心导航特征，提高在多个场景中的灵活性。该模型在四个基本模块（瓶颈、走廊、拐角和T形交叉口）上进行了测试，并进一步在复合场景中使用模块化方法进行评估。结果表明，该模型在这些场景中表现良好，与现实世界实验中的行人运动一致，并在这些场景中优于传统知识驱动模型。研究结果可为数据驱动的人群模拟模型发展提供启发，并推进数据驱动方法的应用。

English abstract

Crowd movement simulation is crucial for pedestrian safety management and facility design. Data-driven models offer the potential to improve realism and predictive accuracy, but most are developed for a single scenario, limiting their flexibility. We propose a data-driven crowd simulation model that incorporates refined visual-information extraction and explicit exit cues, aiming to improve flexibility across multiple scenarios by more effectively capturing core navigational features. The model is tested on four fundamental modules (bottleneck, corridor, corner, and T-junction) and further evaluated in a composite scenario using a modular approach. Results show that our model performs well across these scenarios, aligning with pedestrian movement in real-world experiments, and outperforms the classical knowledge-driven model in these scenarios. The research outcomes can provide inspiration for the development of data-driven crowd simulation models and advance the application of data-driven approaches.

2504.00470 2026-05-20 cs.LG cs.CV 版本更新

Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection

少即是多：通过最小可解释子集选择实现高效的黑盒属性分析

Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences; College of Computing and Data Science, Nanyang Technological University; School of Artificial Intelligence, University of Science and Technology Beijing; Department of Mechanical Engineering, Imperial College London; Center for Machine Vision and Signal Analysis (CMVS), University of Oulu; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

AI summary: This paper proposes LiMA, an efficient black-box attribution method that reformulates the attribution of important regions as a submodular subset selection problem, giving more faithful explanations with fewer regions and marked improvements across multiple models.

Details

AI Chinese abstract

为了开发一个可信的AI系统，目标是识别对模型决策影响最大的输入区域。现有属性方法的主要任务是高效且准确地识别输入-预测交互关系。特别是当输入数据是离散的，如图像时，分析输入和输出之间的关系由于组合爆炸而成为重大挑战。在本文中，我们提出了一种新颖且高效的黑盒属性机制LiMA（Less input is More faithful for Attribution），它将重要区域的属性分析重新表述为一个子模子集选择的优化问题。首先，为了准确评估交互，我们设计了一个子模函数，该函数量化子集的重要性并有效捕捉其对决策结果的影响。然后，通过一种新的双向贪心搜索算法，高效地对输入子区域按重要性进行排序。LiMA能够识别最和最不重要的样本，同时确保一个最优的属性边界，以最小化误差。在八个基础模型上的广泛实验表明，我们的方法在更少的区域上提供了忠实的解释，并表现出强大的泛化能力，插入和删除任务的平均改进分别为36.3%和39.6%。我们的方法在属性效率方面也优于朴素的贪心搜索，速度提高了1.6倍。此外，当解释模型预测错误的原因时，我们的方法平均最高置信度比最先进的属性算法高86.1%。代码可在https://github.com/RuoyuChen10/LIMA上获得。

English abstract

To develop a trustworthy AI system, the aim is to identify the input regions that most influence the model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to rank input sub-regions efficiently by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
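
The Insertion and Deletion scores mentioned above are commonly computed as areas under curves obtained by revealing or removing regions in ranked order. The sketch below follows that common definition on a toy model; it is not LiMA's evaluation code, and `insertion_deletion_auc`, `toy_score`, and the region weights are assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def insertion_deletion_auc(
    ranking: Sequence[int],
    score: Callable[[FrozenSet[int]], float],
) -> Tuple[float, float]:
    """Return (insertion_auc, deletion_auc) for a ranking, most important first."""
    n = len(ranking)
    all_regions = frozenset(ranking)
    # Insertion: reveal the top-k regions. Deletion: remove the top-k regions.
    ins = [score(frozenset(ranking[:k])) for k in range(n + 1)]
    dele = [score(all_regions - frozenset(ranking[:k])) for k in range(n + 1)]

    def auc(ys: List[float]) -> float:
        # Trapezoidal rule on a uniform grid over [0, 1].
        return sum((ys[i] + ys[i + 1]) / 2 for i in range(n)) / n

    return auc(ins), auc(dele)


# Toy "model": confidence is the share of total weight that is still visible.
weights = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}


def toy_score(visible: FrozenSet[int]) -> float:
    return sum(weights[r] for r in visible)


good = insertion_deletion_auc([0, 1, 2, 3], toy_score)  # ranked by true weight
bad = insertion_deletion_auc([3, 2, 1, 0], toy_score)   # reversed ranking
print(good, bad)
```

A faithful ranking raises the Insertion area and lowers the Deletion area, which is the direction of improvement the abstract reports.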

2503.12172 2026-05-20 cs.LG cs.CR cs.CV 版本更新

3D Modeling and Automatic Measurement of Concrete Cracks via Segment Anything Refinement and Visual-Inertial LiDAR Fusion

Pengru Deng, Jiapeng Yao, Chun Li, Su Wang, Xinrun Li, Varun Ojha, Xuhui He

Affiliations: School of Civil Engineering; Central South University; Hunan Provincial Key Laboratory for Disaster Prevention and Mitigation of Rail Transit Engineering Structures; Nvidia; School of Computing; Newcastle University

AI summary: This paper proposes a framework that integrates computer vision techniques with multi-modal simultaneous localization and mapping (SLAM) for 2D crack detection, 3D reconstruction, and automatic 3D crack measurement. It addresses the limited adaptability and robustness of existing methods, particularly on curved or complex geometries.

Comments Title and author list updated

Journal ref Computer-Aided Civil and Infrastructure Engineering, Volume 45, 2026, 100019, ISSN 1093-9687

Details

DOI: 10.1016/j.cacaie.2026.100019

AI Chinese abstract

视觉-空间系统在混凝土裂缝检测中变得越来越关键。然而，现有方法往往缺乏对多样化场景的适应性，在基于图像的方法中表现出有限的鲁棒性，并且在处理曲线或复杂几何形状时存在困难。为了解决这些限制，本文提出了一种创新的框架，通过整合计算机视觉技术和多模态同时定位与建图（SLAM），用于二维（2D）裂缝检测、三维（3D）重建和三维自动裂缝测量。首先，基于基础的DeepLabv3+分割模型，并结合特定的改进利用基础模型Segment Anything Model（SAM），我们开发了一种具有强泛化能力的裂缝分割方法，能够在不熟悉的场景中生成精确的2D裂缝掩码。为了提高三维重建的准确性和鲁棒性，利用Light Detection and Ranging（LiDAR）点云与图像数据和分割掩码。通过利用图像和LiDAR-SLAM，我们开发了多帧和多模态融合框架，产生密集、着色的点云，有效捕捉裂缝语义在三维现实尺度上。此外，裂缝几何属性在三维密集点云空间中自动且直接地进行测量，超越了传统二维图像测量方法的限制。这一进步使该方法适用于具有曲线和复杂三维几何结构的结构部件。在各种混凝土结构上的实验结果突显了所提出方法的显著改进和独特优势，展示了其在现实应用中的有效性、准确性和鲁棒性。

English abstract

Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement that integrates computer vision technologies and multi-modal simultaneous localization and mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements that use the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were used together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the geometric attributes of cracks were measured automatically and directly within the dense 3D point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

2412.13111 2026-05-20 cs.CV cs.GR 版本更新

Motion-2-To-3: Leveraging 2D Motion Data for 3D Motion Generations

Motion-2-To-3: 利用2D运动数据进行3D运动生成

Ruoxi Guo, Huaijin Pi, Zehong Shen, Qing Shuai, Zechen Hu, Zhumei Wang, Yajiao Dong, Ruizhen Hu, Taku Komura, Sida Peng, Xiaowei Zhou

Affiliations: Zhejiang University; Deep Glint; The University of Hong Kong; Shenzhen University

AI summary: This paper proposes using motion data extracted from 2D videos to improve text-driven 3D motion generation. By disentangling local joint motion from global movement, the method learns local motion priors efficiently and improves the realism and diversity of generated 3D human motion.

Comments Project page: https://zju3dv.github.io/Motion-2-to-3/

Journal ref 2025 IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 2025, pp. 14305-14316

Details

DOI: 10.1109/ICCV51701.2025.01327

AI Chinese abstract

文本驱动的人体运动合成已展现出在电影和游戏行业颠覆性设计的潜力。现有方法通常依赖于3D运动捕捉数据，这需要特殊设置，导致数据采集成本高，最终限制了人体运动的多样性和范围。相比之下，2D人体视频提供了一种广泛且易于获取的运动数据源，涵盖了更广泛风格和活动。在本文中，我们探索了从视频中提取的2D人体运动作为替代数据源，以改进基于文本的3D运动生成。我们的方法引入了一个新颖的框架，将局部关节运动与全局运动解耦，从而能够高效地从2D数据中学习局部运动先验。我们首先在大量文本-2D运动配对数据集上训练了一个单视角的2D局部运动生成器。然后，我们用3D数据对生成器进行微调，将其转换为多视角生成器，该生成器能够预测视图一致的局部关节运动和根动力学。在知名数据集和新文本提示上的评估表明，我们的方法能够高效利用2D数据，支持更广泛的真实3D人体运动生成。我们的代码在https://zju3dv.github.io/Motion-2-to-3/上公开提供。

English abstract

Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on the well-acknowledged dataset and novel text prompts demonstrate that our method can efficiently utilize 2D data, supporting a wider range of realistic 3D human motion generation. Our code is publicly available at https://zju3dv.github.io/Motion-2-to-3/.

2412.00404 2026-05-20 cs.CV 版本更新

Hard-Label Black-Box Attacks on 3D Point Clouds

针对3D点云的硬标签黑盒攻击

Daizong Liu, Yunbo Tao, Junhao Dong, Keke Tang, Pan Zhou, Wei Hu, Yew-Soon Ong

Affiliations: Institute for Math & AI; Wuhan University; Huazhong University of Science and Technology; Shenzhen Huazhong University of Science and Technology Research Institute; College of Computing and Data Science; Nanyang Technological University; Cyberspace Institute of Advanced Technology; Guangzhou University; Wangxuan Institute of Computer Technology; Peking University

AI summary: This paper proposes a hard-label black-box attack on 3D point clouds that uses a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples, improving both attack performance and adversarial quality.

Details

AI Chinese abstract

随着深度传感器在各种3D安全关键应用中的成熟，3D点云模型已被证明对对抗攻击脆弱。几乎所有的现有3D攻击者只是遵循白盒或黑盒设置，通过反向传播或估计的梯度迭代更新坐标扰动。然而，这些方法很难在现实世界中部署（没有提供模型细节），因为它们严重依赖于受害者模型的参数或输出logits。为此，我们提出了一种更具实际应用的攻击方法，即硬标签黑盒攻击，其中攻击者只能访问3D输入的预测标签。我们引入了一种基于新频谱感知决策边界算法的新型3D攻击方法，以生成高质量的对抗样本。具体而言，我们首先构建了一个类感知的模型决策边界，通过开发一种可学习的频谱融合策略，适应性地在频谱域中融合不同类别的点云，旨在在不扭曲原始几何的情况下制造其中间样本。然后，我们设计了一种迭代坐标-频谱优化方法，带有曲率感知的边界搜索，以沿决策边界移动中间样本，生成具有微小扰动的对抗点云。实验表明，我们的攻击在攻击性能和对抗质量方面优于现有的白盒/黑盒攻击者。

English abstract

With the maturity of depth sensors in various 3D safety-critical applications, 3D point cloud models have been shown to be vulnerable to adversarial attacks. Almost all existing 3D attackers follow the white-box or black-box setting and iteratively update coordinate perturbations based on back-propagated or estimated gradients. However, these methods are hard to deploy in real-world scenarios, where no model details are provided, because they rely heavily on the parameters or output logits of victim models. To this end, we propose point cloud attacks in a more practical setting, i.e., the hard-label black-box attack, in which attackers can access only the predicted label of a 3D input. We introduce a novel 3D attack method based on a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples. In particular, we first construct a class-aware model decision boundary by developing a learnable spectrum-fusion strategy that adaptively fuses point clouds of different classes in the spectral domain, aiming to craft their intermediate samples without distorting the original geometry. Then, we devise an iterative coordinate-spectrum optimization method with curvature-aware boundary search to move the intermediate sample along the decision boundary, generating adversarial point clouds with trivial perturbations. Experiments demonstrate that our attack outperforms existing white/black-box attackers in terms of attack performance and adversary quality.
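
The hard-label setting, in which only predicted labels can be queried, can be illustrated with the plain bisection step that many decision-based attacks use to locate a decision boundary. This is a generic sketch, not the spectrum-aware algorithm described above; `boundary_search` and `toy_predict` are hypothetical names, and the two-coordinate "classifier" is a toy stand-in for a point cloud model.

```python
from typing import Callable, List, Sequence


def boundary_search(
    predict: Callable[[Sequence[float]], int],
    source: Sequence[float],
    other: Sequence[float],
    steps: int = 20,
) -> List[float]:
    """Bisect the segment source -> other and return a point just past the boundary.

    Only predicted labels are queried, mirroring the hard-label setting.
    """
    label = predict(source)
    assert predict(other) != label, "endpoints must be classified differently"
    lo, hi = 0.0, 1.0  # interpolation weights: lo keeps the source label, hi does not
    for _ in range(steps):
        mid = (lo + hi) / 2
        point = [(1 - mid) * s + mid * o for s, o in zip(source, other)]
        if predict(point) == label:
            lo = mid
        else:
            hi = mid
    return [(1 - hi) * s + hi * o for s, o in zip(source, other)]


# Toy classifier: label 1 when the coordinates sum to more than 1.
def toy_predict(p: Sequence[float]) -> int:
    return int(sum(p) > 1.0)


adv = boundary_search(toy_predict, source=[0.0, 0.0], other=[2.0, 2.0])
print(adv, toy_predict(adv))
```

Each bisection step costs one label query, so the query budget grows only logarithmically with the desired closeness to the boundary.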

2409.08248 2026-05-20 cs.CV 版本更新

TextBoost: Boosting Text Encoder for Personalized Text-to-Image Generation

TextBoost: 通过文本编码器提升文本到图像生成的个性化

NaHyeon Park, Kunhee Kim, Hyunjung Shim

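Affiliations: KAIST

AI summary: This paper proposes TextBoost, an efficient one-shot personalization method for text-to-image diffusion models. Fine-tuning only the text encoder improves computational and storage efficiency while preserving semantic integrity, giving faster convergence and lower storage needs without sacrificing generation quality.

Comments Project page: https://textboost.github.io. Accepted to TMLR

Details

AI Chinese abstract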

在本文中，我们介绍了TextBoost，一种高效的文本到图像扩散模型单次个性化方法。传统个性化方法通常涉及微调模型的大量部分，导致存储需求大且收敛慢。相反，我们提出仅选择性地微调文本编码器，显著提高了计算和存储效率。为了保持原始语义完整性，我们开发了一种新颖的因果保持适应机制。此外，轻量级适配器被用于在文本嵌入与交叉注意层交互之前局部细化文本嵌入，从而在极小的计算开销下显著增强文本嵌入的表达能力。在多样化的概念上进行的实证评估表明，TextBoost通过减少可训练参数的数量实现了更快的收敛速度和显著的存储需求降低。此外，TextBoost在主体保真度、文本保真度和生成多样性方面与现有方法相比具有可比性。我们展示所提出的方法为高质量文本到图像个性化提供了一种高效、可扩展且实用的解决方案，尤其在资源受限的环境中具有优势。

English abstract

In this paper, we introduce TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models. Traditional personalization methods typically involve fine-tuning extensive portions of the model, leading to substantial storage requirements and slow convergence. In contrast, we propose selectively fine-tuning only the text encoder, significantly improving computational and storage efficiency. To preserve the original semantic integrity, we develop a novel causality-preserving adaptation mechanism. Additionally, lightweight adapters are employed to locally refine text embeddings immediately before their interaction with cross-attention layers, greatly enhancing the expressiveness of text embeddings with minimal computational overhead. Empirical evaluations across diverse concepts demonstrate that TextBoost achieves faster convergence and substantially reduces storage demands by minimizing the number of trainable parameters. Furthermore, TextBoost maintains comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods. We show that our proposed method offers an efficient, scalable, and practically applicable solution for high-quality text-to-image personalization, particularly beneficial in resource-constrained environments.

2409.03192 2026-05-20 cs.CV 版本更新

PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning

PEPL: 精度增强的伪标签法用于半监督学习中的细粒度图像分类

Bowen Tian, Songning Lai, Lujundong Li, Zhihao Shuai, Runwei Guan, Tian Wu, Yutao Yue

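Affiliations: HKUST(GZ); Institute of Deep Perception Technology, JITRI; University of Liverpool; Nanchang University; DI^2 Lab

AI summary: This paper proposes PEPL, which addresses the scarcity of labeled data in fine-grained image classification by generating high-quality pseudo-labels. It uses class activation maps (CAMs) for semantic-mixed pseudo-label generation, improving classification accuracy and robustness.

Comments Accepted by ICASSP 2025

Details

DOI: 10.1109/ICASSP49660.2025.10889037

AI Chinese abstract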

细粒度图像分类随着深度学习和计算机视觉技术的发展取得了显著进步。然而，详细的标注数据稀缺仍然是一个主要挑战，尤其是在获取高质量标注数据成本高或耗时的情况下。为了解决这一限制，我们引入了Precision-Enhanced Pseudo-Labeling（PEPL）方法，专门设计用于半监督学习框架下的细粒度图像分类。我们的方法通过生成高质量的伪标签，利用大量未标注数据，通过两个关键阶段：初始伪标签生成和语义混合伪标签生成，逐步细化伪标签。这些阶段利用类激活图（CAMs）准确估计语义内容，并生成捕获细粒度分类所需关键细节的精炼标签。通过聚焦语义层面的信息，我们的方法有效克服了标准数据增强和图像混合技术在保留关键细粒度特征方面的局限性。我们在基准数据集上实现了最先进的性能，证明了与现有半监督策略相比，在准确性和鲁棒性上有了显著提升。

English abstract

Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce the Precision-Enhanced Pseudo-Labeling (PEPL) approach, specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness.

2002.09053 2026-05-20 cs.CV 版本更新

Adapted Center and Scale Prediction: More Stable and More Accurate

适应中心和尺度预测：更加稳定和准确

Wenhao Wang, Jusheng Zhang

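Affiliations: University of Technology Sydney; Sun Yat-sen University

AI summary: This paper proposes adaptations to Center and Scale Prediction (CSP) that aim to combine the simplicity of anchor-free detectors with the accuracy of two-stage detectors. It improves the robustness of CSP, introduces a new width prediction method called compressing width, achieves the second-best performance on the CityPersons benchmark, and explores capabilities of Switchable Normalization.

Comments 14 pages, 7 figures

Details

AI Chinese abstract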

行人检测受益于深度学习技术，在近年来迅速发展。大多数检测器遵循通用目标检测框架，即默认框和两阶段过程。最近，无锚点和单阶段检测器被引入到这一领域。然而，它们的准确性并不令人满意。因此，为了同时享受无锚点检测器的简洁性和两阶段检测器的准确性，我们基于检测器提出了一些改进，即中心和尺度预测（CSP）。本文的主要贡献包括：（1）我们改进了CSP的鲁棒性，使其更容易训练。（2）我们提出了一种新的方法来预测宽度，即压缩宽度。（3）我们在CityPersons基准上取得了第二好的性能，即在合理集上9.3%的log-average miss rate（MR），在部分集上8.7%的MR，在裸集上5.6%的MR，这表明无锚点和单阶段检测器仍能保持高精度。（4）我们探索了可切换归一化的一些能力，这些能力在原始论文中未被提及。代码可在https://github.com/WangWenhao0716/Adapted-Center-and-Scale-Prediction上公开获取。

English abstract

Pedestrian detection has benefited from deep learning technology and developed rapidly in recent years. Most detectors follow the general object detection framework, i.e., default boxes and a two-stage process. Recently, anchor-free and one-stage detectors have been introduced into this area. However, their accuracy is unsatisfactory. Therefore, to enjoy the simplicity of anchor-free detectors and the accuracy of two-stage ones simultaneously, we propose some adaptations based on a detector, Center and Scale Prediction (CSP). The main contributions of our paper are: (1) We improve the robustness of CSP and make it easier to train. (2) We propose a novel method to predict width, namely compressing width. (3) We achieve the second-best performance on the CityPersons benchmark, i.e., 9.3% log-average miss rate (MR) on the reasonable set, 8.7% MR on the partial set, and 5.6% MR on the bare set, which shows that an anchor-free, one-stage detector can still achieve high accuracy. (4) We explore some capabilities of Switchable Normalization that are not mentioned in its original paper. The code is publicly available at https://github.com/WangWenhao0716/Adapted-Center-and-Scale-Prediction.
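
For readers unfamiliar with the metric, the log-average miss rate is conventionally the geometric mean of miss rates sampled at nine false-positives-per-image (FPPI) points spaced evenly in log space between 10^-2 and 10^0. The sketch below follows that convention under stated assumptions: the curve is given as two aligned lists sorted by FPPI, a reference point with no curve sample at or below it counts as a miss rate of 1.0, and `log_average_miss_rate` is a hypothetical helper rather than code from the paper's repository.

```python
import math
from typing import Sequence


def log_average_miss_rate(fppi: Sequence[float], miss_rate: Sequence[float]) -> float:
    """Geometric mean of miss rates at 9 FPPI points log-spaced in [1e-2, 1e0].

    `fppi` must be sorted in increasing order, with `miss_rate` aligned to it.
    """
    refs = [10 ** (-2 + 0.25 * k) for k in range(9)]
    sampled = []
    for ref in refs:
        # Miss rate at the last curve point whose FPPI does not exceed the reference.
        candidates = [m for f, m in zip(fppi, miss_rate) if f <= ref]
        sampled.append(candidates[-1] if candidates else 1.0)
    # Clamp to avoid log(0) when the curve reaches a zero miss rate.
    return math.exp(sum(math.log(max(m, 1e-10)) for m in sampled) / len(sampled))


# Toy curve: the detector misses 10% of pedestrians at every operating point.
flat = log_average_miss_rate([0.001, 1.0], [0.1, 0.1])
print(round(flat, 6))
```

Averaging in log space keeps a single poor operating point from dominating the score, which is why lower values such as those quoted above indicate a better detector across the whole FPPI range.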

2605.19020 2026-05-20 cs.CV 版本更新

A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection

对用于开放集虹膜呈现攻击检测的视觉基础模型系统性失败分析

Rahul Anand, Siddharth Singh, Dileep A D, Mahadeva Prasanna, Raghavendra Ramachandra

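Affiliations: Indian Institute of Technology, Dharwad, India; Indian Institute of Information Technology Dharwad, India; SAFE Center, Norwegian University of Science and Technology (NTNU)

AI summary: This paper systematically analyzes vision foundation models for open-set iris presentation attack detection. The models fail to generalize to unseen presentation attack instruments and degrade under cross-spectral transfer, underscoring the need for more robust PAD representations.

Details

AI Chinese abstract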

视觉基础模型在多种视觉识别任务中表现出强大的迁移能力，并日益被应用于生物识别领域。然而，其在开放集条件下用于虹膜呈现攻击检测（PAD）的适用性仍不够充分。本文系统分析了通用视觉基础模型在开放集虹膜PAD中的表现，使用周缘视觉图像进行评估。在三个明确分离不同分布偏移的开放集协议下，评估了五个代表性基础模型：未见过的呈现攻击设备（PAIs）、使用不同传感器捕获的未见数据集以及近红外（NIR）到可见光（VIS）光谱的跨光谱转移。在统一的实验框架内，评估了冻结的特征表示和参数高效的LoRA任务适应方法。结果表明，基础模型能够在具有相似传感特征的数据集之间迁移，但无法可靠地推广到未见过的攻击设备，并在跨光谱评估中急剧退化。尽管LoRA在某些跨数据集设置中提高了性能，但在攻击级别和光谱偏移下经常放大失败。额外的验证实验使用分段虹膜输入、完整主干微调、联合跨数据集和跨PAI偏移以及反向VIS到NIR转移进一步证实，这些失败并非仅仅是周缘视觉输入、弱适应或单向光谱评估的产物。这些发现表明，强闭合集或跨数据集性能不应被视为开放集安全性的证据，并突显了需要虹膜检测表示方法在保持对呈现伪影的敏感性的同时，在现实部署变化下保持稳定性的需求。

English abstract

Vision foundation models have demonstrated strong transferability across diverse visual recognition tasks and are increasingly considered for biometric applications. Their suitability for iris Presentation Attack Detection (PAD), particularly under realistic open-set operating conditions, remains insufficiently examined. This work presents a systematic failure analysis of general-purpose vision foundation models for open-set iris PAD using periocular imagery. Five representative foundation models are evaluated under three open-set protocols that explicitly separate different sources of distribution shift: unseen Presentation Attack Instruments (PAIs), unseen datasets captured with different sensors and cross-spectral transfer from near-infrared (NIR) to visible spectrum (VIS) imagery. Both frozen feature representations and parameter-efficient task adaptation using Low-Rank Adaptation (LoRA) are assessed within a unified experimental framework. The results indicate that foundation models can transfer across datasets with similar sensing characteristics, but fail to generalise reliably to unseen attack instruments and degrade sharply under cross-spectral evaluation. While LoRA improves performance in certain cross-dataset settings, it frequently amplifies failure under attack-level and spectral shifts. Additional validation experiments using segmented iris inputs, full backbone fine-tuning, joint cross-dataset and cross-PAI shifts, and reverse VIS to NIR transfer further confirm that these failures are not simply artefacts of periocular input, weak adaptation, or one-directional spectral evaluation. These findings show that strong closed-set or cross-dataset performance should not be treated as evidence of robust open-set security, and highlight the need for PAD representations that maintain sensitivity to presentation artefacts while remaining stable under realistic deployment variation.


2605.19004 2026-05-20 cs.CV cs.LG cs.RO 版本更新

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

EgoTraj: 用于多模态预测的真实世界第一人称人类轨迹数据集

Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian, Junfeng Jiao, Christian Claudel

发表机构 * Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin（土木、建筑与环境工程系，德克萨斯大学奥斯汀分校）； Meta Reality Labs（Meta现实实验室）； School of Architecture, The University of Texas at Austin（建筑学院，德克萨斯大学奥斯汀分校）

AI总结本文提出用于多模态预测的EgoTraj数据集，包含在真实城市环境中采集的75段人类导航序列，提供同步的RGB视频和真值数据（6自由度头部姿态、3D眼动注视向量和场景标注），并展示了该数据集在基于AR的感知、导航和辅助系统中的应用价值。

Comments 21 pages, 14 figures. Project page: https://github.com/yehiahmad/EgoTraj

详情

AI中文摘要

从第一人称视角准确预测人类轨迹，在人形机器人、可穿戴传感系统和辅助导航等应用中起着核心作用。然而，由于在真实环境中采集的第一人称轨迹数据集十分稀缺，这一方向的进展仍然有限。针对这一需求，我们提出EgoTraj，一个使用Meta Quest Pro（MQPro）录制的第一人称多模态开放数据集。EgoTraj包含由多名MQPro佩戴者在真实城市环境中采集的75段人类导航序列。每段记录都提供同步的RGB视频以及真值数据，包括连续且时间同步的6自由度头部姿态、逐帧的3D眼动注视向量和场景标注。据我们所知，EgoTraj与典型的第一人称轨迹数据集不同，它记录了在多样化城市路线上进行的长时程、自主导航，且参与者多样性广泛。为展示该数据集的潜力，我们对几种最先进的第一人称轨迹预测方法进行了基准测试，并通过消融研究分析注视、场景和运动线索的贡献。结果凸显了EgoTraj在基于AR的感知、导航和辅助系统中的实用性。EgoTraj数据集、代码和EgoViz Dashboard已在https://github.com/yehiahmad/EgoTraj公开。

英文摘要

Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.


2605.18984 2026-05-20 cs.CV 版本更新

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Artifact-Bench: 评估MLLMs在检测和评估AI生成视频中的伪影

Yuqi Tang, Yang Shi, Zhuoran Zhang, Qixun Wang, Xuehai Bai, Yue Ding, Ruizhe Chen, Bohan Zeng, Xinlong Chen, Xuanyu Zhu, Bozhou Li, Yuran Wang, Yifan Dai, Chengzhuo Tong, Xinyu Liu, Yiyan Ji, Yujie Wei, Yuhao Dong, Shilin Yan, Fengxiang Wang, Yi-Fan Zhang, Haotian Wang, Yuanxing Zhang, Pengfei Wan

发表机构 * Kling Team（Kling团队）； Shanghai AI Lab（上海人工智能实验室）

AI总结本文提出Artifact-Bench，一个用于评估多模态大语言模型在检测和分析AI生成视频伪影能力的基准，揭示了现有模型在伪影感知和推理上的显著局限性。

详情

AI中文摘要

近年来，视频生成模型在提高AI生成视频的真实感方面取得了显著进步，但其输出仍存在时间不一致、结构失真和语义不连贯等伪影。尽管多模态大语言模型（MLLMs）在视觉理解方面表现出色，但其感知和推理这些伪影的能力仍不明确。现有基准缺乏对伪影感知和细粒度诊断推理的系统评估，尤其是在超越逼真内容的多样化AI生成视频领域。为解决这一差距，我们引入Artifact-Bench，一个全面的基准，用于评估MLLMs在AI生成视频伪影检测和分析上的能力。我们首先建立了涵盖逼真、动画和CG风格视频的三级层次化伪影分类法。基于此分类法，Artifact-Bench定义了三个互补任务：真实与AI生成视频分类、成对真实感比较和细粒度伪影识别。在19种领先MLLMs上的实验揭示了伪影感知和推理的显著局限性，许多模型在挑战性设置中接近随机甚至低于随机表现。我们进一步观察到MLLM判断与人类感知偏好之间存在显著不一致，突显了其作为AI生成视频真实感一般评估者的有限可靠性。

英文摘要

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.


2605.18974 2026-05-20 cs.CV cs.AI cs.MM 版本更新

Harnessing Self-Supervised Features for Art Classification

利用自监督特征进行艺术分类

Federico Melis, Davide Bilardello, Emanuele Prato, Evelyn Turri, Lorenzo Baraldi

发表机构 * University of Modena and Reggio Emilia（摩德纳和雷吉奥艾米利亚大学）

AI总结本文研究了监督和自监督主干作为特征提取器在艺术分类和检索中的有效性，特别是绘画，通过DINO家族和CLIP模型的实验评估，证明自监督主干在艺术分类中能带来一致的性能提升，并为现实应用如虚拟现实中的博物馆导航提供了见解。

Comments IRCDL 2026

2605.18956 2026-05-20 cs.CV 版本更新

MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

MotionMERGE: 一种用于人体动作编辑、推理、生成和解释的多粒度框架

Bizhu Wu, Jinheng Xie, Wenting Chen, Zhe Kong, Jianfeng Ren, Linlin Shen, Ruibin Bai, Rong Qu

发表机构 * Computer Vision Institute, School of Computer Science and Software Engineering, Shenzhen University（计算机视觉研究所，计算机科学与软件工程学院，深圳大学）； Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University（广东省智能信息处理重点实验室，深圳大学）； School of Computer Science, University of Nottingham Ningbo China（计算机科学学院，宁波诺丁汉大学）； Department of Electrical and Computer Engineering, National University of Singapore（电气与计算机工程系，新加坡国立大学）； Department of Radiation Oncology, Stanford University（放射肿瘤科，斯坦福大学）； Sun Yat-sen University（中山大学）； School of Computer Science, University of Nottingham（计算机科学学院，诺丁汉大学）

AI总结本文提出MotionMERGE框架，通过细粒度语言引导的动作控制、跨粒度协同预训练和细粒度动作-语言对齐，实现了更精确的动作生成、理解和编辑，并建立了新的细粒度文本驱动动作编辑和动作引导推理基准。

详情

英文摘要

Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking the fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data: the model cannot focus on localized motion patterns, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion at finer granularity and with human-like reasoning.

一种用于识别先天性免疫缺陷的多维聚类方法

Nishad Kulkarni, Alexandra K. Martinson, Nicholas L. Rider, Michael Keller, Syed Muhammad Anwar

发表机构 * Sheikh Zayed Institute for Pediatric Surgical Innovation, Children’s National Hospital, Washington, DC（Sheikh Zayed儿科外科创新研究所，国家儿童医院，华盛顿特区）； Children’s National Hospital, Washington, DC（国家儿童医院，华盛顿特区）； Department of Health Systems & Implementation Science, Division of Allergy & Immunology, Virginia Tech Carilion School of Medicine, Roanoke, VA（健康系统与实施科学系过敏与免疫学科，弗吉尼亚理工大学Carilion医学院，弗吉尼亚州罗阿诺克）； Division of Allergy & Immunology, Children’s National Hospital, Washington, DC（过敏与免疫学科，国家儿童医院，华盛顿特区）； School of Medicine and Health Sciences, George Washington University, Washington, DC（医学与健康科学学院，乔治华盛顿大学，华盛顿特区）

AI总结本文提出一种多维聚类方法，用于从全国性数据注册库中识别新的罕见疾病模式并提取与先天性免疫缺陷（IEI）相关的特征；该研究加深了对IEI特征的认识，开发了用于罕见病人群分析的数据工具包，并推进了将复杂医疗记录转换为无监督ML可解释的数据结构的工作。

Comments Accepted at EMBC 2026

详情

AI中文摘要

先天性免疫缺陷（IEI）等罕见疾病需要早期诊断，以防止终末器官损伤并提高生活质量。获取和整理大规模电子健康记录（EHR）数据方面的障碍，使常规的数据驱动分析难以紧跟IEI及其他罕见疾病的最新趋势。目前，面向IEI模式识别的机器学习（ML）算法开发，以及关于如何系统地处理和整合复杂医疗数据的已发表方法学研究都很有限。我们提出的流程包括数据整理和ML聚类算法，旨在从全国性数据注册库中识别新的罕见疾病模式并提取IEI相关特征。我们的EHR数据格式化与处理方法给出了一个将原始免疫学实验室数据转换为向量的流程，并进一步结合超参数调优，通过聚类进行疾病模式识别。本研究加深了对IEI特征的认识，开发了用于罕见病人群分析的数据工具包，并推进了将复杂医疗记录转换为无监督ML可解释的数据结构的工作。

英文摘要

Rare diseases such as inborn errors of immunity (IEI) require early diagnosis to prevent end-organ damage and improve quality of life. Hurdles in accessing and curating large-scale electronic health record (EHR) data make it difficult for routine data-driven analyses to remain at the forefront of IEI and other rare disease trends. Development of machine learning (ML) algorithms for pattern recognition in IEI, as well as published methodology examining how to systematically process and integrate complex medical data, is limited. Our proposed pipeline, including data curation and ML clustering algorithms, is designed to recognize novel rare disease patterns and extract IEI-associated features from a national data registry. Our methodology for EHR data formatting and processing presents a pipeline that transforms raw immunologic lab data into vectors. This is further combined with hyperparameter tuning for disease pattern recognition via clustering. This study refines IEI feature awareness, develops data toolkits for rare disease population analysis, and expands on transforming complex medical records into data structures interpretable by unsupervised ML.


2605.18878 2026-05-20 eess.SP cs.CV cs.LG eess.IV 版本更新

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

充血性心力衰竭再入院风险的肺部超声生物标志物预后价值：一项试点数据驱动分析

Jana Armouti, Laura Hutchins, Jacob Duplantis, Thomas Deiss, Thales Nogueira Gomes, Keyur H. Patel, Seema Walvekar, Shane Guillory, Thomas H. Fox, Amita Krishnan, Ricardo Rodriguez, Bennett DeBoisblanc, Deva Ramanan, John Galeotti, Gautam Gare

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； LSUHSC Internal Medicine（路易斯安那州立大学健康科学中心内科）； Cosmetic Surgery Facility LLC（美容外科诊所有限公司）

AI总结本研究采用数据驱动方法，利用住院期间采集的B模式肺部超声（LUS）数据预测充血性心力衰竭患者30天再入院风险，发现重力依赖的下肺区域、连续检查间的时间差特征以及多视图特征拼接的预测效果最佳，展示了超声生物标志物在无创心力衰竭风险分层中的实用性。

详情

AI中文摘要

出院后30天内再入院是充血性心力衰竭（CHF）发病率、死亡率和可避免医疗支出的主要驱动因素。现有的临床风险分层工具主要依赖非影像数据，预测性能有限。床旁肺部超声（LUS）为观察CHF失代偿所特有的肺淤血提供了一个敏感、无创的窗口，但其在再入院预测中的预后价值在很大程度上仍未被探索。我们开展了一项试点可行性研究，这是首个利用住院期间采集的B模式LUS、以机器学习预测30天CHF再入院的系统性研究。我们从预训练的Temporal Shift Module（TSM）ResNet-18编码器中提取定量时空嵌入，并单独评估可解释的生物标志物特征。通过针对肺部视图、时间表示、多视图融合和跨肺增强的结构化消融实验，我们识别出驱动再入院风险的关键影像因素。结果表明：（1）重力依赖的下肺区域（Left-3、Right-3）携带最强的预后信号，这与其更易发生静水压性淤血一致；（2）连续检查之间的时间差特征显著优于单时间点表示，凸显了捕捉疾病轨迹的重要性；（3）多视图特征拼接取得了最佳整体性能，我们最优的MLP模型的F1得分为0.80（95% CI: 0.62-0.96）。生物标志物分析进一步表明，胸膜线异常（包括断裂和凹陷）的信息量与经典的A线和B线标志物相当。这些结果支持将POCUS衍生的生物标志物作为实用、可解释的无创CHF风险分层工具。

英文摘要

Hospital readmission within 30 days of discharge is a leading driver of morbidity, mortality, and avoidable healthcare expenditure in congestive heart failure (CHF). Current clinical risk stratification tools rely primarily on non-imaging data and exhibit limited predictive performance. Point-of-care lung ultrasound (LUS) offers a sensitive, noninvasive window into the pulmonary congestion that characterizes CHF decompensation, yet its prognostic utility for readmission prediction remains largely unexplored. We present a pilot feasibility study, the first systematic machine learning study using B-mode LUS acquired during hospitalization to predict 30-day CHF readmission. Quantitative spatiotemporal embeddings are extracted from a pretrained Temporal Shift Module (TSM) ResNet-18 encoder, and interpretable biomarker features are separately evaluated. Through structured ablations over lung view, temporal representation, multi-view fusion, and cross-lung augmentation, we identify the key imaging factors driving readmission risk. Our findings reveal that (1) dependent lower-lung regions (Left-3, Right-3) carry the strongest prognostic signal, consistent with their greater susceptibility to hydrostatic congestion; (2) temporal difference features between sequential examinations substantially outperform single-timepoint representations, highlighting the importance of capturing disease trajectory; and (3) multi-view feature concatenation yields the best overall performance, with our top MLP model achieving an F1 score of 0.80 (95% CI: 0.62-0.96). Biomarker analysis further reveals that pleural-line abnormalities, including breaks and indentations, are as informative as the canonical A-line and B-line markers. These results support POCUS-derived biomarkers as practical, interpretable tools for noninvasive CHF risk stratification.
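Findings (2) and (3) combine two simple feature operations: differencing embeddings between sequential examinations, and concatenating the result across lung views. The sketch below illustrates that combination with NumPy. The helper name `temporal_difference_features`, the 4-dimensional toy embeddings, and the dictionary layout are assumptions made for illustration; the paper's actual feature pipeline may differ.

```python
import numpy as np

def temporal_difference_features(exam_first, exam_second):
    """Build a per-patient feature vector from two sequential exams.

    exam_first, exam_second: dict mapping view name -> 1-D embedding.
    Returns the per-view difference (second - first), concatenated
    across views in sorted view-name order.
    """
    views = sorted(exam_first)
    if views != sorted(exam_second):
        raise ValueError("both exams must contain the same lung views")
    return np.concatenate([exam_second[v] - exam_first[v] for v in views])

# Toy embeddings for the two lower-lung views named in the abstract.
exam_first = {"Left-3": np.ones(4), "Right-3": np.zeros(4)}
exam_second = {"Left-3": 3.0 * np.ones(4), "Right-3": 0.5 * np.ones(4)}

features = temporal_difference_features(exam_first, exam_second)
```

The resulting vector has one block per view, so a downstream classifier such as an MLP sees both how much each region changed and which region the change came from.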


2605.18868 2026-05-20 cs.CR cs.AI cs.CV cs.LG 版本更新

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

DarkLLM: 利用大语言模型学习语言驱动的对抗攻击

Ye Sun, Xin Wang, Jiaming Zhang, Yifeng Gao, Yixu Wang, Yifan Ding, Qixian Zhang, Henghui Ding, Xingjun Ma, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Nanyang Technological University（南洋理工大学）； Tongji University（同济大学）

AI总结本文提出DarkLLM，一种基于大语言模型的对抗攻击框架，通过将自然语言攻击指令转换为潜在攻击向量，生成有效的对抗扰动，统一了多种攻击类型并实现了灵活可控的对抗生成。

Comments 23 pages, 13 figures

详情

AI中文摘要

尽管视觉和多模态基础模型支撑着从感知到复杂推理的关键任务，它们仍然极易受到对抗攻击。然而，传统对抗攻击通常局限于单一、预定义的目标，每种攻击都与特定模型或任务紧密耦合，这限制了其在现实场景中的可扩展性和灵活性。本文提出DarkLLM，一种新的攻击框架：它训练一个大语言模型（LLM）将自然语言攻击指令转换为潜在攻击向量，再将其解码为视觉对抗扰动。借助自然语言指令微调，DarkLLM不仅在单一框架内统一了有目标攻击、无目标攻击、分割攻击和多模型攻击，还实现了灵活可控的对抗生成，使每条指令都能生成一种在异构模型上诱发期望行为的扰动。通过在4个任务、13个数据集和15个模型上的大量实验，我们证明仅有1B参数的DarkLLM即可遵循攻击者的指令，对CLIP、SAM和前沿LLM生成高度有效的攻击，揭示了现代基础模型中的系统性脆弱性。

英文摘要

While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.


2605.18855 2026-05-20 cs.LG cs.CV 版本更新

Delta Attention Residuals

Cheng Luo, Zefan Cai, Junjie Hu

发表机构 * Independent Researcher（独立研究者）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）

AI总结本文提出Delta Attention Residuals，在残差连接中对每个子层引入的变化量（delta）施加注意力，解决了传统注意力残差因累积隐藏状态冗余而导致的路由崩溃问题，从而提升模型跨层选择信息的能力。

详情

AI中文摘要

XFlowMap：大规模起讫点（OD）数据的跨尺度综合与制图

Diansheng Guo, Hai Jin

发表机构 * PolyU

AI总结本文提出XFlowMap框架，用于大规模起讫点（OD）数据的跨尺度综合与制图，通过整合跨尺度流模式检测、自动化流向图综合和新的制图表达方法，实现对复杂OD流结构的分析与可视化。

详情

AI中文摘要

大规模起讫点（OD）数据集的制图仍具挑战性：流向图容易变得杂乱，有意义的模式出现在多个空间尺度上，而现有的流制图方法往往依赖预定义的聚合单元或人工综合。本文提出XFlowMap，一个用于海量OD数据跨尺度综合与制图的框架。具体而言，该框架整合了跨尺度流模式（聚类）检测、自动化流向图综合，以及一种用于分析和可视化复杂OD流结构的新制图表达方法。该方法在合适的起点和终点尺度上检测显著的流模式，提取高层结构，并生成一种新的流向图表达，以支持对复杂OD流模式的整体解读。本文开发了一种基于扫描统计量的方法来评估和综合跨尺度流聚类。检测到的聚类随后用一种新的流符号进行可视化，该符号将位置、方向、强度和OD尺度整合在单一表达中。该框架同时支持基于区域和基于点的OD数据，对稀疏和含噪数据具有鲁棒性，并能够对分层流数据进行对比制图。使用合成数据和美国人口迁移数据的实验表明，该方法能有效提取有意义的跨尺度流模式，并为大规模移动性数据集生成清晰且信息丰富的流向图，同时支持静态展示和交互式探索。

英文摘要

Mapping large origin-destination (OD) datasets remains challenging because flow maps become cluttered, meaningful patterns occur at multiple spatial scales, and existing flow-mapping approaches frequently rely on predefined aggregation units or manual generalization. This paper presents XFlowMap, a framework for the cross-scale generalization and mapping of massive OD data. Specifically, the framework integrates cross-scale flow pattern (cluster) detection, automated flow map generalization, and a new cartographic representation for analyzing and visualizing complex origin-destination flow structures. The approach detects salient flow patterns at their appropriate origin and destination scales, extracts high-level structures, and generates a new flow map representation that supports holistic interpretation of complex origin-destination flow patterns. A scan-statistic-based procedure is developed to evaluate and generalize cross-scale flow clusters. The detected clusters are then visualized using a novel flow symbol that integrates location, direction, strength, and OD scales in a single representation. The framework supports both area-based and point-based OD data, is robust to sparse and noisy datasets, and enables comparative mapping of stratified flow data. Experiments with synthetic data and U.S. migration data demonstrate that the method effectively extracts meaningful cross-scale flow patterns and produces clear, information-rich flow maps for large mobility datasets, supporting both static presentation and interactive exploration.
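To make the scale issue concrete, the sketch below shows the fixed-unit aggregation the abstract says existing approaches rely on: the same point-based OD records summarized on square grids of different cell sizes. It is not XFlowMap's scan-statistic procedure. The function name `aggregate_od`, the coordinates, and the cell sizes are hypothetical.

```python
from collections import Counter

def aggregate_od(flows, cell_size):
    """Sum flow counts after snapping origins and destinations to grid cells.

    flows: iterable of ((ox, oy), (dx, dy), count) records.
    cell_size: side length of the square aggregation cell.
    Returns a Counter keyed by (origin_cell, destination_cell).
    """
    agg = Counter()
    for (ox, oy), (dx, dy), count in flows:
        o_cell = (int(ox // cell_size), int(oy // cell_size))
        d_cell = (int(dx // cell_size), int(dy // cell_size))
        agg[(o_cell, d_cell)] += count
    return agg

flows = [
    ((0.5, 0.5), (5.5, 5.5), 3),
    ((1.5, 0.2), (6.1, 5.9), 2),
    ((0.9, 0.9), (5.1, 5.2), 4),
]

fine = aggregate_od(flows, cell_size=1)     # two separate flow bundles
coarse = aggregate_od(flows, cell_size=4)   # all records merge into one bundle
```

Which bundles appear depends entirely on the chosen cell size, which is the limitation that motivates detecting each pattern at its own origin and destination scale.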
