2605.18719 2026-05-19 cs.CV Version update

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar

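Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Michigan State University

AI summary: This paper proposes an online reinforcement learning framework that post-trains on both negative and positive text prompts, using Group Relative Policy Optimization (GRPO) to address data scarcity and model degradation. It introduces a steering reward mechanism to improve the safety of diffusion models; experiments show that it reduces inappropriate content and improves generation quality.

Comments: 28 pages, 20 figures, 6 tables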

Abstract

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe text paired with safe-image ground truth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 for the baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
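
The two ingredients named in the abstract, a CLIP-based steering reward and GRPO's group-relative scoring, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the reward form (cosine similarity to a safe text embedding minus similarity to an unsafe one), the function names, and the toy 2-D vectors are all hypothetical. Only the group normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO recipe.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def steering_reward(image_emb, safe_emb, unsafe_emb):
    # Assumed reward form (not taken from the paper): favour images whose
    # embedding is closer to the safe text direction than to the unsafe one.
    return cosine(image_emb, safe_emb) - cosine(image_emb, unsafe_emb)


def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style normalization: each sample is scored relative to the other
    # samples generated for the same prompt. Population std is used here;
    # implementations differ on this detail.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy embeddings standing in for CLIP outputs (hypothetical values).
safe, unsafe = [1.0, 0.0], [0.0, 1.0]
group = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]  # three images sampled for one prompt
rewards = [steering_reward(img, safe, unsafe) for img in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the image closest to the safe direction receives a positive advantage and the one closest to the unsafe direction a negative one, so no separately fine-tuned safety classifier is needed to rank the samples.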

2605.18714 2026-05-19 cs.CV cs.AI Version update

Semantic Generative Tuning for Unified Multimodal Models

Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li

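Affiliations: Shanghai Jiao Tong University; Tencent ARCLab

AI summary: This paper proposes Semantic Generative Tuning (SGT), which uses high-level semantic tasks as generative proxies to unify the perception and generation capabilities of multimodal models, improving multimodal understanding and generation quality.

Comments: 14 pages, 13 figures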

Abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.

2605.18700 2026-05-19 cs.CV Version update

A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition

Edwin Arkel Rios, Augusto Christian Surya, Oswin Gosal, Fernando Mikael, Mary Madeline Nicole, Kisoon Jang, Bo-Cheng Lai, Min-Chun Hu

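Affiliations: National Yang Ming Chiao Tung University, Taiwan; National Tsing Hua University, Taiwan

AI summary: Through more than 2,000 experiments, this paper examines the trade-off between model accuracy and cost under different training and evaluation settings, proposes an extension of Counterfactual Attention Learning, and provides an efficient evaluation variant that reduces inference cost.

Comments: Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 4 figures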

Abstract

Prior work on fine-grained image recognition (FGIR) has established the importance of backbone selection, but has neglected the accuracy-vs-cost trade-offs under different training and evaluation settings. In this work we conduct a large-scale study with over 2000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with a cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference costs by forfeiting the forward pass on discriminative crops that is normally used by CAL and similar FGIR methods. Our results show that data-aware augmentations applied only during training can enable a model to achieve excellent accuracy even without crops, significantly reducing inference costs. To support future research we share our code and checkpoints at: https://github.com/arkel23/FGIR-Backbones

2605.18680 2026-05-19 cs.CV Version update

CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation

Rajeev Goel, Jason Ding, Phani Harish Wajjala, Pavan Turaga, Tejaswi Gowda, Krishna C. Garikipati

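Affiliations: Arizona State University; Roblox Corporation

AI summary: This paper proposes the CMAG framework, which uses concept-scaffolded retrieval and verified composition to address retrieval problems in marketplace avatar generation caused by ambiguous text, noisy metadata, and inconsistent components, improving the topological consistency and compositional correctness of generated avatars.

Comments: Accepted to CVPR 2026 Workshop (GRAIL-V)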

Abstract

Metaverse platforms rely on creator-driven marketplaces where avatars are assembled from discrete, taxonomy-labeled 3D assets (e.g., tops, bottoms, shoes, accessories) under strict category and topology constraints. While users increasingly expect free-form text control, text-only retrieval is brittle: natural language is ambiguous with respect to platform taxonomies, metadata is often noisy or informal, and independently retrieved components can be stylistically inconsistent or geometrically incompatible. We propose CMAG, a concept-scaffolded retrieval and verified composition framework for marketplace avatar generation. Given a prompt, CMAG first synthesizes an intermediate 3D concept scaffold that disambiguates intent beyond text by providing global spatial and stylistic context. In parallel, a view-aware part discovery module extracts localized visual evidence via prompt decomposition and text-grounded segmentation. A prompt-conditioned taxonomy router enforces category coverage and resolves semantic-to-taxonomic mismatch, after which a hybrid category-wise retriever combines part-based fusion with a concept-residual fallback using feature suppression. Finally, an agentic vision-language model filters and re-ranks candidates across categories and drives an iterative verification loop to assemble prompt-faithful, topologically consistent avatars from catalog assets. We evaluate CMAG on diverse compositional prompts and demonstrate improved retrieval robustness and compositional correctness compared to strong baselines, highlighting the importance of 3D concept scaffolding under prompt ambiguity.

2605.18667 2026-05-19 cs.CV cs.LG Version update

Better Together: Evaluating the Complementarity of Earth Embedding Models

Thijs L van der Plas, Jacob JW Bakermans, Vishal Nedungadi, Gabrielė Tijūnaitytė, Marc Rußwurm, Ioannis N Athanasiadis

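Affiliations: Wageningen University; University College London; University of Bonn

AI summary: This paper studies the complementarity of Earth embedding models, proposes fusing embeddings to improve performance, and evaluates four models across different tasks, finding that complementarity depends on both task and location.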

Abstract

Earth embedding models transform Earth observation data into embeddings uniquely tied to locations on the Earth's surface. These models are typically evaluated in isolation, comparing the downstream task performance across different Earth embeddings. However, spatially aligned embeddings can naturally be fused, providing richer information per location, a capability that isolated evaluations fail to capture. We therefore propose assessing Earth embeddings by their complementarity: the performance gain of fused embeddings over the best single-model baseline. To operationalise this, we introduce an embedding complementarity index applicable to any embedding and task, and evaluate four Earth embedding models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) in isolation, in all pairs, and jointly across six downstream tasks. Fused embeddings outperform the best single model in four out of six tasks, confirming that single-embedding evaluations often underestimate Earth embedding capabilities. Complementarity proves both task- and location-dependent. Further, for a land cover regression task, we find that complementarity is partially determined by the spatial scale of land cover classes. Complementarity reframes Earth embeddings: the greatest future gains may come not from any single Earth embedding model, but from combinations that are better together.
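
The notion of complementarity described in the abstract, the gain of fused embeddings over the best single-model baseline, can be illustrated with a short sketch. The abstract does not give the exact form of the embedding complementarity index, so the plain difference used here is an assumption, as are the function name and the scores, which are made-up placeholders rather than results from the paper. A higher-is-better metric is assumed.

```python
def complementarity_gain(single_scores, fused_score):
    """Gain of the fused embedding over the best single-model baseline.

    single_scores maps a model name to its score on one task; fused_score is
    the score obtained with the fused embeddings on the same task.
    """
    best_single = max(single_scores.values())
    return fused_score - best_single


# Placeholder numbers for one hypothetical task (not from the paper).
scores = {"AlphaEarth": 0.70, "Tessera": 0.75, "GeoCLIP": 0.60, "SatCLIP": 0.55}
gain = complementarity_gain(scores, fused_score=0.80)
```

A positive gain means fusion helped on that task; a value at or below zero means the best single model was already sufficient there, which is the situation the abstract reports for two of the six tasks.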

2605.18652 2026-05-19 cs.CV Version update

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo

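Affiliations: University of Rochester; MIT-IBM Watson AI Lab; University of Wisconsin-Madison

AI summary: This paper proposes MementoGUI, a learned agentic multimodal memory-control framework that improves the ability of long-horizon GUI agents to maintain task state, using modular memory control and a scalable data pipeline to improve memory retrieval and decision-making.

Comments: Preprint, 15 pages, 4 figures, 5 tables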

Abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

2605.18645 2026-05-19 cs.CV Version update

Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video

Arslan Artykov, Tom Ravaud, Nicolás Violante-Grezzi, Vincent Lepetit

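Affiliations: LIGM, CNRS, Univ Gustave Eiffel, ENPC, Institut Polytechnique de Paris

AI summary: This paper proposes a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives avoid the pitfalls of unstable point tracking, and a new mechanism organizes the primitives into coherent parts constrained by revolute and prismatic joints, recovering complex kinematics from a single casually captured video.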

Abstract

Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods. Project page: https://aartykov.github.io/Articulation-in-Prime/

2605.18641 2026-05-19 cs.CV Version update

Leveraging Latent Visual Reasoning in Silence

Dongyao Zhu, Zhen Wang, Xi Xiao, Han Jiang, Saeed Vahidian, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su, Raju Vatsavai, Jianyang Gu

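Affiliations: North Carolina State University; UC San Diego; University of Alabama at Birmingham; Johns Hopkins University; Duke University; Boston University; The Ohio State University

AI summary: This paper examines whether continuous latent tokens are needed at inference time and finds that removing them or replacing them with random noise has little effect on performance. It proposes an attention-based reward that encourages latent tokens to interact with subsequent text tokens, improving performance on visual perception and visual reasoning tasks.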

Abstract

Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by these findings, we propose an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL. This reward promotes latent utilization when the latent mode is activated while preserving the flexibility to use pure-text reasoning. Experiments show that our method improves performance across perception and visual reasoning benchmarks, even when latent tokens are rarely generated after post-training. Our results highlight that, without explicit expression at inference, latent visual reasoning can shape better visual grounding and more accurate textual reasoning in silence. Our code and trained models are publicly available on GitHub (https://github.com/ddydyd32/silent-lvr/tree/master) and Hugging Face (https://huggingface.co/collections/cornuHGF/silent-lvr).

2605.18636 2026-05-19 cs.CV Version update

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

Wencan Jiang, Jiangning Zhang, Jianbiao Mei, Jinzhuo Liu, Yu Yang, Xiaobin Hu, Zhucun Xue, Yong Liu, Dacheng Tao

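Affiliations: Zhejiang University; National University of Singapore; Nanyang Technological University

AI summary: This paper proposes SPIKE, a dual-controller framework for cost-efficient long-horizon game agents. An event-triggering mechanism and a hierarchical memory structure improve goal-directedness and task completion; experiments on StarDojo show a substantially higher success rate with lower resource consumption.

Comments: https://wencanjiang.github.io/projects/SPIKE/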

Abstract

Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.
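
The Event Trigger described in the abstract monitors visual change, task progress, repeated actions, and failure signals to decide between reactive and strategic control. A rough sketch of such a decision rule is given below. It is only an illustration under stated assumptions: the `StepSignals` fields, the rule ordering, and both thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    visual_change: float    # hypothetical 0..1 score for how much the screen changed
    progress_delta: float   # change in a task-progress estimate since the last step
    repeated_actions: int   # length of the current run of identical actions
    failed: bool            # whether the last action reported a failure


def should_escalate(signals, change_threshold=0.5, max_repeats=3):
    """Return True when control should move from reactive to strategic reasoning.

    The thresholds are illustrative defaults, not values from the paper.
    """
    if signals.failed:
        return True
    if signals.repeated_actions >= max_repeats:
        return True
    if signals.visual_change >= change_threshold:
        return True
    # Negative progress suggests the current plan has gone stale.
    return signals.progress_delta < 0


calm = StepSignals(visual_change=0.1, progress_delta=0.2, repeated_actions=1, failed=False)
stuck = StepSignals(visual_change=0.0, progress_delta=0.0, repeated_actions=4, failed=False)
decisions = (should_escalate(calm), should_escalate(stuck))
```

Keeping the trigger a cheap rule over a few signals is what would let the expensive strategic controller run only at event boundaries, which is the cost-saving idea the abstract emphasizes.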

2605.18621 2026-05-19 cs.CV cs.AI Version update

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

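Affiliations: Zhejiang University

AI summary: This work proposes CrossView Suite, developing three components (CrossViewSet, CrossViewBench, and CrossViewer) to address data scarcity, insufficient evaluation, and missing alignment mechanisms in cross-view reasoning, improving multi-view spatial understanding.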

Abstract

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite, which comprises three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method uses an adaptive spatial region tokenizer to capture fine-grained object representations, explicitly aligns objects across views, and then fuses the aligned features to boost the cross-view inference capacity of MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.

2605.18617 2026-05-19 cs.RO cs.AI cs.CV Version update

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu

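Affiliations: Beihang University; National University of Singapore; Hangzhou Innovation Institute, Beihang University

AI summary: This paper presents ManiSoft, a benchmark for studying vision-language manipulation with soft continuum robots. A tailored simulator couples realistic soft-body dynamics with contact-rich interactions; four tasks highlight different aspects of deformable control; an automated pipeline generates 6,300 diverse scenes with expert trajectories; and three representative policy models are evaluated.

Comments: Accepted at ICML 2026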

Abstract

Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.

2605.18610 2026-05-19 cs.CV cs.AI cs.LG Version update

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

Shen Lin, Junhao Dong, Rongjie Chen, Xiaoyu Zhang, Li Xu, Xiaofeng Chen

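Affiliations: Fujian Normal University; Nanyang Technological University; Xidian University

AI summary: This paper presents the first study of continual unlearning for vision-language models and proposes CATA, which uses conflict-averse task arithmetic to address the challenges of effectiveness, model fidelity, and persistence in unlearning.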

Abstract

Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.
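
One way to picture the sign-aware conflict-averse aggregation mentioned in the abstract is sketched below. The abstract does not specify the rule, so this is an assumed variant rather than CATA itself: each forget request is a task vector (here a plain list of floats), the history is summed coordinate-wise, and any coordinate of the new vector whose sign opposes the accumulated history is dropped so that it cannot cancel earlier forgetting. The function name and the example vectors are hypothetical.

```python
def conflict_averse_merge(history, new_vector):
    """Merge a new unlearning task vector into the accumulated history.

    Coordinates of new_vector that point against the accumulated direction
    are suppressed (an assumed rule, not the paper's exact procedure).
    """
    if history:
        accumulated = [sum(column) for column in zip(*history)]
    else:
        accumulated = [0.0] * len(new_vector)

    merged = []
    for acc, new in zip(accumulated, new_vector):
        conflicts = acc * new < 0  # opposite signs: the update would undo past forgetting
        merged.append(acc if conflicts else acc + new)
    return merged


# Toy three-parameter example with one earlier forget request.
past_requests = [[1.0, -2.0, 0.0]]
merged = conflict_averse_merge(past_requests, [0.5, 1.0, 3.0])
```

In this toy case the second coordinate of the new request is discarded because it opposes the earlier update, while the agreeing and neutral coordinates are added as usual. How the merged vector is scaled and applied to the model weights is not stated in the abstract and is left out here.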

2605.18608 2026-05-19 cs.CV Version update

Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging

Zhilin Zhu, Yabin Wang, Zhiheng Ma, Yaguang Song, Yaowei Wang, Xiaopeng Hong

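Affiliations: Harbin Institute of Technology; Pengcheng Laboratory; Shenzhen University of Advanced Technology; Guangdong Provincial Key Laboratory of Computility Microelectronics

AI summary: This paper proposes a new forward-facilitation approach to continual test-time adaptation. Using a dynamic style bridging mechanism, it builds a compact knowledge base before deployment and dynamically injects incoming data styles at test time to provide reliable supervisory signals, enabling stable adaptation under continual shifts.

Comments: Accepted by CVPR 2026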

Abstract

Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle dynamic distribution shifts encountered after deployment. Existing methods predominantly follow a backward-alignment paradigm, which rigidly aligns incoming data with supervisory surrogates derived from the source domain. Consequently, they struggle with unreliable supervision and evolving distribution shifts. To overcome these limitations, we introduce a novel forward-facilitation paradigm through a method termed Dynamic Style Bridging. Prior to deployment, we construct a compact knowledge base of generated class exemplars. During test time, to mitigate inherent generative bias and adapt these proxies to incoming data, we propose a multi-level bridging mechanism. This mechanism dynamically injects the proxies with incoming data styles at the input, statistical, and representation levels, while preserving the original semantics of the proxies. These high-fidelity proxies are then used to provide reliable, on-demand supervisory signals, enabling stable adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate that our method achieves consistent and substantial improvements over recent state-of-the-art approaches. Code is available at https://github.com/z1358/DAS.


2605.18603 2026-05-19 cs.CV 版本更新

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

Starve to Perceive: 通过受限视觉带宽驯服VLMs中的懒惰感知

Yuhuan Wu, Cong Wei, Fangzhen Lin, Wenhu Chen, Haozhe Wang

发表机构 * Hong Kong University of Science and Technology（香港科技大学）； University of Waterloo（滑铁卢大学）

AI总结本文提出了一种新的训练方法，通过限制视觉带宽迫使视觉语言模型主动感知，从而提升其在高分辨率视觉环境中的表现。

详情

AI中文摘要

视觉语言模型（VLMs）作为处于环境中的智能体，在高分辨率视觉环境中需要主动感知——即通过缩放、裁剪和平移等操作动态决定观察方向的能力。然而，当前的训练范式产生的是模仿这些操作表面形式而没有功能性依赖的模型，我们称之为懒惰感知。我们将其归因于一个根本的学习不对称性：当粗略的全局视图结合语言先验足以达到中等准确度时，模型没有动力学习更复杂的多步骤视觉搜索。如果模型可以不主动观察就成功，它将永远学不会主动观察。这促使我们提出Starve to Perceive，一种限制视觉带宽的训练范式——限制每个观察到紧贴令牌预算，使得单个视角不足以完成任务，使主动感知成为唯一可行的策略。尽管不需要辅助损失、奖励塑造或架构变化——作为标准后训练流程的最小、即插即用修改——在感知饥饿下训练的模型在多种基准上实现了显著的提升，平均相对改进达5%。

英文摘要

Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes -- serving as a minimal, plug-in modification to standard post-training pipelines -- models trained under perceptual starvation achieve substantial gains of 5% average relative improvement across diverse benchmarks.


2605.18601 2026-05-19 cs.CV 版本更新

LESSViT: 在光谱配置偏移下鲁棒的高光谱表示学习

Haozhe Si, Yuxuan Wan, Yuqing Wang, Minh Do, Han Zhao

发表机构 * Department of Electrical and Computer Engineering（电气与计算机工程系）； Siebel School of Computing and Data Science（计算与数据科学学院）

AI总结本文提出LESSViT，一种灵活的跨光谱泛化架构，通过低秩高效空间-光谱ViT，解决不同传感器下的高光谱图像建模问题，提升鲁棒性和效率。

详情

AI中文摘要

对不同传感器的高光谱图像（HSI）建模面临波长覆盖、波段采样和通道维度的变化带来的基本挑战。因此，基于固定光谱配置训练的模型往往无法泛化到其他传感器。现有的Vision Transformer（ViT）方法要么依赖于隐式光谱建模和固定通道假设，要么采用显式的空间-光谱注意力机制，但计算成本过高，导致效率与表达能力之间存在根本性的权衡。在本文中，我们引入了低秩高效空间-光谱ViT（LESSViT），一种用于跨光谱泛化的灵活架构。LESSViT基于LESS注意力，一种结构化的低秩因子分解，通过可分离的空间和光谱组件建模联合空间-光谱交互，将全空间-光谱注意力的复杂度从O(N²C²)降低到O(rNC)，其中N是空间标记的数量，C是光谱通道的数量，r是低秩近似等级。我们进一步结合通道无关的补丁嵌入和波长感知的位置编码，以支持灵活的光谱输入。为了实现高效且稳健的预训练，我们引入了高光谱掩码自编码器（HyperMAE），具有解耦的空间-光谱掩码和分层通道采样。我们在跨光谱泛化设置下评估LESSViT，该设置模拟了跨传感器变化。在SpectralEarth基准测试中，实验表明LESSViT在光谱偏移下提高了鲁棒性，同时在分布内保持竞争力，显式且高效的空间-光谱建模对于可扩展和可泛化的高光谱表示学习至关重要。

英文摘要

Modeling hyperspectral imagery (HSI) across different sensors presents a fundamental challenge due to variations in wavelength coverage, band sampling, and channel dimensionality. As a result, models trained under a fixed spectral configuration often fail to generalize to other sensors. Existing Vision Transformer (ViT) approaches either rely on implicit spectral modeling with fixed channel assumptions or adopt explicit spatial-spectral attention with prohibitive computational cost, leading to a fundamental trade-off between efficiency and expressiveness. In this work, we introduce Low-rank Efficient Spatial-Spectral ViT (LESSViT), a sensor-flexible architecture for cross-spectral generalization. LESSViT is built on LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components, reducing the complexity of full spatial-spectral attention from $O(N^2 C^2)$ to $O(rNC)$, where $N$ is the number of spatial tokens, $C$ is the number of spectral channels, and $r$ is the rank of the low-rank approximation. We further incorporate channel-agnostic patch embedding and wavelength-aware positional encoding to support flexible spectral inputs. To enable efficient and robust pretraining, we introduce a hyperspectral masked autoencoder (HyperMAE) with decoupled spatial-spectral masking and hierarchical channel sampling. We evaluate LESSViT under a cross-spectral generalization setting that simulates cross-sensor variability. Experiments on the SpectralEarth benchmark demonstrate that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution, and that explicit and efficient spatial-spectral modeling is essential for scalable and generalizable hyperspectral representation learning.
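To make the complexity claim concrete, the following is a small arithmetic sketch that compares the two leading terms quoted in the abstract, $O(N^2 C^2)$ and $O(rNC)$. It ignores constant factors and hidden dimensions, and the sizes `N`, `C`, and `r` are hypothetical values chosen only for illustration; it is not the LESSViT implementation.

```python
def full_attention_cost(n_tokens: int, n_channels: int) -> int:
    """Leading term for attention over all N*C spatial-spectral tokens: (N*C)^2."""
    return (n_tokens * n_channels) ** 2


def low_rank_cost(n_tokens: int, n_channels: int, rank: int) -> int:
    """Leading term r*N*C quoted in the abstract for the low-rank factorization."""
    return rank * n_tokens * n_channels


# Hypothetical sizes: 196 spatial tokens, 200 spectral channels, rank 8.
N, C, r = 196, 200, 8
full = full_attention_cost(N, C)
low = low_rank_cost(N, C, r)
print(f"full: {full}, low-rank: {low}, ratio: {full // low}")
```

Under these assumed sizes the ratio of the two terms equals `N * C / r`, which is why the gap widens quickly as the number of channels grows.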


2605.18522 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

超越形态学：量化颜色特征在癌症分类中的诊断能力

Farnaz Kheiri, Shahryar Rahnamayan, Masoud Makrehchi

发表机构 * Dept. of Electrical, Computer and Software Engineering（电气、计算机与软件工程系）； Ontario Tech University（安大略技术大学）； Dept. of Engineering（工程系）； Brock University（布鲁克大学）

AI总结本文研究了颜色特征在癌症分类中的诊断能力，通过排除形态学信息，评估了全局颜色特征的判别力，发现颜色特征在二分类任务中可达到高达89%的准确率，表明颜色分布包含非随机的诊断信号。

详情

AI中文摘要

在组织病理学中，人类专家主要依靠颜色增强对比度来解读组织形态，而机器视觉模型则将颜色视为原始统计信息。这一区别提出了一个根本性问题：像素强度本身，独立于结构和形态学线索，能支持多少癌症分类？为了解决这个问题，我们系统评估了全局颜色特征的独立判别力，同时刻意排除所有形态学信息。具体而言，我们提取了统计颜色矩，并对RGB和HSV颜色直方图进行离散化处理，然后在十个不同的实验设置中使用经典机器学习分类器评估其性能。我们的结果表明，在二元诊断任务（例如良性与恶性）中，仅颜色特征即可实现强劲的性能，分类准确率可达到89%。这种性能很可能归因于与恶性相关的全局色度变化。重要的是，这些简单的颜色基表示在很大程度上优于随机基线，表明原始颜色分布编码了非随机且具有诊断意义的信号用于癌症检测。因此，本研究表明，简单的、计算高效的色彩特征可以作为一种有效的预筛选工具。通过识别具有强色度指示恶性特征的样本，这些轻量模型可以作为第一道筛选系统，减少对复杂深度学习架构的计算负担。

英文摘要

In histopathology, human experts primarily rely on color as a means of enhancing contrast to interpret tissue morphology, whereas machine vision models process color as raw statistical information. This distinction raises a fundamental question: to what extent can pixel intensity alone, independent of structural and morphological cues, support cancer classification? To address this question, we systematically evaluated the standalone discriminative power of global color features while deliberately excluding all morphological information. Specifically, we extracted statistical color moments and discretized RGB and HSV color histograms, and assessed their performance across ten diverse experimental settings using classical machine learning classifiers. Our results demonstrate that color features alone can achieve strong performance in binary diagnostic tasks (e.g., benign versus malignant), with classification accuracies reaching up to 89%. This performance is likely attributable to global chromatic shifts associated with malignancy. Importantly, these simple color-based representations consistently outperformed random baselines by a substantial margin, indicating that raw color distributions encode a non-random and diagnostically relevant signal for cancer detection. Consequently, this study suggests that simple, computationally efficient color features can serve as an effective pre-screening tool. By identifying samples with strong chromatic indicators of malignancy, these lightweight models could function as a first-pass triage system, reducing the computational burden on complex deep learning architectures.
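The global color features named in the abstract (statistical color moments and discretized histograms) can be sketched as follows. This is a minimal illustration in plain Python, assuming 8-bit RGB pixels given as tuples; the function names, the choice of three moments (mean, standard deviation, skewness), and the bin count are assumptions, not the authors' exact pipeline.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # one 8-bit RGB pixel


def color_moments(pixels: Sequence[Pixel]) -> List[float]:
    """Return mean, standard deviation and skewness for each of the 3 channels."""
    feats: List[float] = []
    n = len(pixels)
    for ch in range(3):
        vals = [p[ch] for p in pixels]
        mean = sum(vals) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)
        third = sum((v - mean) ** 3 for v in vals) / n
        skew = third / std ** 3 if std > 0 else 0.0  # flat channel: no skew
        feats.extend([mean, std, skew])
    return feats


def channel_histogram(pixels: Sequence[Pixel], channel: int, bins: int = 8) -> List[float]:
    """Normalized histogram of one 8-bit channel, discretized into `bins` bins."""
    counts = [0] * bins
    width = 256 / bins
    for p in pixels:
        counts[min(int(p[channel] / width), bins - 1)] += 1
    return [c / len(pixels) for c in counts]


# Toy "image" of two pixels; no spatial layout is used, so morphology is excluded.
toy = [(0, 0, 0), (255, 0, 0)]
features = color_moments(toy) + channel_histogram(toy, channel=0, bins=4)
print(features)
```

Because both functions consume an unordered collection of pixels, the resulting vector carries no structural information, which matches the abstract's goal of isolating color from morphology. HSV histograms could be built the same way after converting each pixel with the standard-library `colorsys.rgb_to_hsv`.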


2605.18491 2026-05-19 cs.CV 版本更新

Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

对SSL预训练在相同和不同模态分割任务中转移性的基准测试

Jue Jiang, Harini Veeraraghavan

发表机构 * Department of Medical Physics, Memorial Sloan Kettering Cancer Center（医学物理系，纪念斯隆凯特勒癌症中心）

AI总结本文通过九种SSL方法在相同和不同模态的分割任务中进行基准测试，评估了预训练模型的迁移能力和效率，发现自蒸馏masked image transformer在分割精度、收敛速度和少量样本到大量样本的性能差距方面表现最佳。

Comments Paper submitted to Medical Physics for review

详情

AI中文摘要

方法：九种覆盖四种预训练任务家族的SSL方法使用相同的10,412个3D CT扫描（1.89 M个2D轴向切片）从头开始预训练，这些扫描涵盖不同的疾病部位。每个方法的预训练Swin Transformer编码器被整合到SwinUNETR风格的分割网络中（Swin编码器与3D CNN解码器和跳跃连接），并在九个公开的分割任务上进行微调，包括大腹腔器官、头颈结构和CT和MRI中的肿瘤。性能通过Dice相似系数（DSC）评估。微调收敛速度、跨模态（CT到MRI）的迁移性以及少量样本和大量样本微调之间的特征重用模式进一步通过中心化核对齐分析。结果：自蒸馏masked image transformer（SMIT），结合masked image modeling（MIM）和局部和全局自蒸馏，在九个任务中实现了最高的分割精度、最快的微调收敛速度和最小的少量样本到大量样本性能差距，表明最强的数据效率。SMIT还显示了在少量样本和大量样本微调之间最一致的特征重用模式。基于MIM的SimMIM和自蒸馏方法（DINO、iBOT）优于依赖图像级全局表示的对比学习和旋转预测。SSL方法之间的差异在少量样本设置中最大，随着标记微调数据集大小的增加而缩小，表明在有限标注预算下SSL预训练的选择最为关键。

英文摘要

Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89 M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine-tuning were further analyzed using centered kernel alignment. Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine-tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets.
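For readers unfamiliar with the evaluation metric, the Dice similarity coefficient for two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened masks in plain Python; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration, not details taken from the paper.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


# Toy 4-voxel masks: the prediction marks two voxels, the reference marks one of them.
prediction = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(dice_coefficient(prediction, reference))
```

A 3D segmentation would be flattened voxel by voxel before calling the function, and multi-class results are typically reported by averaging the per-structure scores.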


2605.18467 2026-05-19 cs.CV 版本更新

InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

InstructAV2AV：基于指令的音频视频联合编辑

Haojie Zheng, Yixin Yang, Siqi Yang, Shuchen Weng, Boxin Shi

发表机构 * Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Peking University（北京大学）

AI总结本文提出InstructAV2AV，首个端到端的指令引导音频视频联合编辑框架，通过构建大规模音频视频编辑数据集InsAVE-80K和改进的生成模型，实现了更高质量的音频视频联合编辑。

详情

AI中文摘要

最近的扩散基方法在视频内容操控方面取得了显著进展。然而，它们通常忽视伴随的音频，导致音频与编辑结果脱节。在本文中，我们提出了InstructAV2AV，首个端到端的指令引导音频视频联合编辑框架。我们首先开发了一个可扩展的数据合成管道，并构建了InsAVE-80K，首个大规模音频视频编辑数据集，包含高质量的源到目标配对。借助这一数据基础，我们适配了一个音频视频生成骨干网络，以利用其强大的先验知识。我们将音频视频输入与噪声潜在代码结合，以锚定源上下文，提出源指令门控注意力以提高指令遵循和内容保持，并引入两阶段训练策略以有效转移这些预训练的先验知识。广泛的实验表明，InstructAV2AV在两个评估集上，跨11个指标覆盖三个方面，均优于现有最先进方法，凸显了其在可控内容创作中的潜力。项目页面：https://hjzheng.net/projects/InstructAV2AV/.

英文摘要

Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive experiments demonstrate that InstructAV2AV outperforms state-of-the-art methods across 11 metrics spanning three aspects on two evaluation sets, highlighting its potential for controllable content creation. Project page: https://hjzheng.net/projects/InstructAV2AV/.


2605.18466 2026-05-19 cs.CV 版本更新

Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI

基于语音引导的多模态学习用于实时MRI中的声道分割

Daiqi Liu, Lukas Mulzer, Md Hasan, Nyvenn de Castro, Fangxu Xing, Xingjian Kang, Chengze Ye, Siyuan Mei, Yipeng Sun, Tomás Arias-Vergara, Jana Hutter, Jonghye Woo, Andreas Maier, Paula Andrea Pérez-Toro

发表机构 * Harvard Medical School / Massachusetts General Hospital（哈佛医学院/麻省总医院）； Institute for Information Processing, Leibniz University Hannover（汉诺威莱布尼茨信息处理研究所）； GITA Lab, Facultad de Ingeniería. Universidad de Antioquia UdeA（安提奥基亚大学工程学院GITA实验室）

AI总结本文提出了一种三阶段框架，利用语音和语音学监督进行训练，仅需实时MRI图像进行推理，通过将语音学表示转换为空间边界框先验进行发音器官定位，通过双级跨模态对比预训练对视觉和音频编码器对齐，并通过跨注意力解码器融合学习的表示，有效将多模态知识转移到单模态推理管道中，实验表明该方法在75-Speaker Annot-16和USC-TIMIT数据集上优于现有单模态和多模态方法。

Comments under review

详情

AI中文摘要

在实时MRI（rtMRI）中对发音器官进行分割是一个具有低对比度、快速运动和有限空间分辨率的动态图像分割难题。然而，尽管rtMRI采集可能提供同步的声学信号，现有方法却丢弃了这一信息，而能结合音频的少数多模态方法在音频不可用时无法部署。我们提出了一种三阶段框架，在训练过程中利用音频和语音学监督，而在推理时仅需rtMRI图像：语音学表示被转换为空间边界框先验以用于发音器官定位，视觉和音频编码器通过双级跨模态对比预训练对齐，学习的表示通过跨注意力解码器融合，有效将多模态知识转移到单模态推理管道中。在75-Speaker Annot-16和USC-TIMIT数据集上的评估表明，我们的方法优于现有单模态和多模态方法，证明了多模态监督对精确且可临床部署的声道分割提供了可转移的益处。

英文摘要

Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. However, while rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a single-modality inference pipeline. Evaluated on 75-Speaker Annot-16 and USC-TIMIT datasets, our method outperforms existing unimodal and multimodal methods, demonstrating that multimodal supervision provides transferable benefits for precise and clinically deployable vocal tract segmentation.


2605.18451 2026-05-19 cs.CV cs.GR 版本更新

Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Code-as-Room: 通过代理代码合成从俯视图图像生成3D房间

Yixuan Yang, Zhen Luo, Wanshui Gan, Jinkun Hao, Junru Lu, Jinghao Yan, Zhaoyang Lyu, Xudong Xu

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Shanghai Innovation Institute（上海创新研究院）； Southern University of Science and Technology（南方科技大学）； University of Warwick（沃里克大学）

AI总结本文提出Code-as-Room框架，通过结构化执行 harness 生成3D房间，利用Blender代码表示房间，并引入专门的代码基3D房间合成基准进行评估。

详情

AI中文摘要

设计逼真且功能性的3D室内房间对于广泛的应用至关重要，包括室内设计、虚拟现实、游戏和具身AI。尽管最近基于大语言模型（MLLM）的方法在从文本描述或参考图像生成3D房间方面展现出巨大潜力，但基于文本的方法难以精确捕捉空间信息，而现有的图像条件化代理在从俯视图生成整体房间时面临不稳定性和无限循环的问题。为了解决这些限制，我们提出了Code-as-Room，一种基于MLLM的代理框架，配备了结构化执行harness，用Blender代码表示3D房间。给定一个俯视房间图像，该框架解析参考图像以提取场景元素及其空间关系，并在有原则的多阶段管道中合成用于几何、材料和照明的可执行Blender代码。在整个过程中维护一个跨阶段的记忆模块，以缓解现有基于代理框架固有的上下文遗忘问题。我们进一步引入了一个专门的代码基3D房间合成基准，涵盖各种评估协议。基于我们的基准，对现有基于代理的方法进行了全面比较，以验证我们提出的执行harness的有效性。

在n维平面基础几何代数中推广交叉比

Enzo Harquin, Stephane Breuils, Pascal Monasse, Venceslas Biri, Vincent Nozick

AI总结本文研究了n维平面基础几何代数中投影交叉比的完整理论，建立了各类几何对象的显式交叉比公式，并证明其恢复了相应的经典不变量，同时识别了标准的成对测量算子。

详情

AI中文摘要

我们发展了n维平面基础几何代数（PGA，R(n,0,1)）中射影交叉比的完整理论，涵盖了所有等级的几何对象：有限点和理想点、超平面以及中间扁面。对于每种对象类型和配置，我们建立了显式的交叉比公式，证明其恢复了适当的经典不变量，并确定了标准的成对测量算子。系统性的对偶分析进一步揭示了所有八种配置在Hodge对偶下组织成四对对偶配置，并且所有测量算子仅根据几何配置而非对象等级，归结为交换子或交换子对偶。在每种情况下，公式恢复了适当的经典不变量：平行配置的有符号距离比和割线配置的正弦交叉比。这些结果确立了交叉比作为PGA中与等级无关的射影不变量，并为从给定不变量直接定义n维单应映射提供了构造性基础。

英文摘要

We develop a complete theory of projective cross-ratios in n-dimensional Plane-Based Geometric Algebra (PGA), R(n,0,1), covering geometric objects of every grade: finite and ideal points, hyperplanes, and intermediate flats. For each object type and configuration, we establish an explicit cross-ratio formula, prove that it recovers the appropriate classical invariant, and identify the canonical pairwise measurement operator. A systematic duality analysis further reveals that all eight configurations organize into four dual pairs under the Hodge dual, and that all measurement operators reduce to either the commutator or the commutator dual, depending solely on the geometric configuration rather than on object grade. In each case the formula recovers the appropriate classical invariant: signed distance ratios for parallel configurations and sine cross-ratios for secant ones. These results establish the cross-ratio as a grade-agnostic projective invariant within PGA, and provide a constructive foundation for defining n-dimensional homographies directly from prescribed invariants.
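As background for the classical invariant that the paper generalizes, the sketch below checks numerically that the cross-ratio of four collinear points is unchanged by a one-dimensional projective map. It uses the convention (a, b; c, d) = ((c - a)(d - b)) / ((c - b)(d - a)), which is one of several in use, and exact rational arithmetic. It illustrates only the classical point case, not the PGA formulas of the paper; the function names and the sample map are assumptions.

```python
from fractions import Fraction


def cross_ratio(a: Fraction, b: Fraction, c: Fraction, d: Fraction) -> Fraction:
    """Classical cross-ratio (a, b; c, d) of four distinct collinear points."""
    return Fraction((c - a) * (d - b), (c - b) * (d - a))


def homography_1d(x: Fraction, p: int, q: int, r: int, s: int) -> Fraction:
    """Projective map x -> (p*x + q) / (r*x + s); requires p*s - q*r != 0."""
    return Fraction(p * x + q, r * x + s)


# Four points on a line, given by their 1-D coordinates.
points = [Fraction(0), Fraction(1), Fraction(3), Fraction(7)]
# Hypothetical map with determinant 2*3 - 1*1 = 5, so it is invertible.
mapped = [homography_1d(x, 2, 1, 1, 3) for x in points]

before = cross_ratio(*points)
after = cross_ratio(*mapped)
print(before, after, before == after)
```

Exact fractions are used so that the equality check is not blurred by floating-point rounding.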


2605.18390 2026-05-19 cs.CV 版本更新

Vision Foundation Models as Generalist Tokenizers for Image Generation

视觉基础模型作为图像生成的通用标记器

Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

发表机构 * University of Hong Kong（香港大学）； StepFun

AI总结本文提出了一种基于冻结视觉基础模型（VFM）的通用图像标记器VFMTok，通过区域自适应量化框架和语义重建目标，提升了图像生成的质量和效率，同时在离散和连续潜在空间中实现了高保真度的类别条件合成。

Comments 4 figures and 14 tables

详情

AI中文摘要

在本文中，我们探索了一个尚未被充分研究的方向：直接在冻结的视觉基础模型（VFM）之上构建通用图像标记器。为了构建此标记器，我们利用冻结的VFM作为编码器，并引入两个关键创新：（1）区域自适应量化框架，用于消除标准2D网格特征中的空间冗余；（2）语义重建目标，使解码输出与VFM的表示对齐，以保持语义保真度。基于这些设计，我们提出了VFMTok，一种能够无缝在离散和连续潜在空间中运行的通用视觉标记器。VFMTok在合成质量上取得了显著提升，同时大幅提高了标记效率。对于离散自回归（AR）生成，它将模型收敛速度提升3倍，并在ImageNet类别条件合成上实现了最先进的gFID值1.36。同样，对于连续空间生成，将VFMTok与去噪模型结合，可获得极佳的gFID值1.25。此外，由于潜在空间本身捕捉了丰富的空间语义，VFMTok能够在两种生成范式中无需无分类器引导（w/o CFG）即可实现高保真度的类别条件合成，显著加快了推理速度。除了这些显著的实证结果外，我们还系统地研究了我们方法的底层机制。我们发现，在VFM预训练过程中使用的特定自监督学习目标决定了其作为标记器的有效性。具体来说，一个联合优化全局对比学习和潜在掩码图像建模的VFM提供了最佳的图像标记表示。这些见解为未来图像标记器的设计奠定了坚实的基础，并提供了有价值的指导。

英文摘要

In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID of 1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of 1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.


2605.18365 2026-05-19 cs.CV 版本更新

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

GeoFlow: 在视频生成中强制隐式几何一致性

Jan Ackermann, Shengqu Cai, Boyang Deng, Zhengfei Kuang, Songyou Peng, Gordon Wetzstein

发表机构 * Stanford University（斯坦福大学）； Google DeepMind（谷歌DeepMind）

AI总结本文提出GeoFlow，一种通过强化学习微调来增强视频生成中几何一致性的方法，通过引入几何一致性奖励，有效减少时间上的几何伪影，同时保持感知质量。

Comments Project Page: https://geometryflow.github.io/

详情

AI中文摘要

生成几何上一致的视频仍然是一个开放性挑战：基于网络级数据训练的文本到视频扩散模型仅隐式处理几何，导致在相机运动下出现物体形变、纹理漂移和非刚性背景。现有解决方案要么作为副产品改进一致性，要么仅适用于静态场景或完全重新对齐模型的潜在空间。我们引入了一个几何一致性奖励，直接衡量生成视频中的运动是否与一致的场景兼容。我们的关键见解是，在物理一致的视频中，背景运动应能由刚性相机诱导的流解释，而独立移动的物体应沿运动轨迹保持外观身份。我们使用光流、深度-姿态预测和基于特征的对应关系来分离刚性和动态区域并评估它们各自的一致性。将此奖励与强化学习微调结合，将几何一致性从一种涌现属性转化为视频生成器的显式优化目标。该方法对模型不敏感，适用于包含相机和物体运动的多样化动态场景。实验显示，在强基线模型上显著减少了时间上的几何伪影，同时保持感知质量。代码和模型权重已发布。

英文摘要

Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth--pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.


2605.18349 2026-05-19 cs.CV cs.AI 版本更新

Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

通过参数自由注意力机制优化CSRNet以实现公共交通中的人群计数

Aida Rostamza, Enrico Del Re, Joshua Cherian Varughese, Cristina Olaverri-Monreal

发表机构 * Johannes Kepler University Linz（林茨约翰内斯·开普勒大学）； Department Intelligent Transport Systems（智能交通系统系）

AI总结本文研究了参数自由注意力机制在密集场景中的人群计数和密度图估计中的有效性，提出了一种结合PFCA和SA的新型注意力机制PFCASA，并在ShanghaiTech数据集上验证了其在公共交通视频流中的性能。

详情

AI中文摘要

占用估计和人群计数是设计智能高效公共交通车辆的关键任务。鉴于公共交通载客量可能从稀疏到拥挤变化，传统的占用估计模型必须适应这一目的。注意力机制在增强深度神经网络在拥挤场景中的人群计数能力方面表现出显著优势，尤其是在存在遮挡、复杂背景和透视畸变的情况下。然而，传统方法通常作为卷积层中的参数化子网络实现，不可避免地增加了模型大小和计算成本，限制了在资源受限的边缘设备上的部署。本文研究了最先进的参数自由注意力机制在高度拥挤场景中的人群计数和密度图估计中的有效性。我们评估了通道级（PFCA）、空间级（SA）和三维级（SimAM）模块，并将其性能与参数化注意力模块进行比较，后者限制引入不超过1%的额外参数。此外，我们提出了一种新的注意力机制组合，结合PFCA和SA（PFCASA）以分析公共交通系统内的视频流。使用CSRNet作为骨干网络，在ShanghaiTech数据集上的实验表明，参数自由注意力机制在不引入额外模型参数的情况下实现了可比或更优的准确性。详细的性能分析进一步揭示，PFCASA在少于40人的场景中优于其他注意力模块，而PFCA在人群密度增加时表现出更大的有效性，凸显了其在智能公共交通模式中的应用潜力。

英文摘要

Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.
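To illustrate what "parameter-free" means for one of the evaluated modules, the sketch below applies SimAM-style attention to a single feature-map channel in plain Python. It assumes the commonly cited SimAM formulation, in which each activation is reweighted by a sigmoid of its squared deviation from the channel mean, normalized by the channel variance plus a small constant `lam`. It does not reproduce PFCA, SA, or the proposed PFCASA combination, whose details are not given in the abstract.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def simam_channel(channel: List[List[float]], lam: float = 1e-4) -> List[List[float]]:
    """Reweight one H x W channel with SimAM-style attention (no learned weights)."""
    flat = [v for row in channel for v in row]
    if len(flat) < 2:
        raise ValueError("need at least two activations per channel")
    n = len(flat) - 1  # H*W - 1, as in the assumed formulation
    mean = sum(flat) / len(flat)
    sq_dev_sum = sum((v - mean) ** 2 for v in flat)
    denom = 4.0 * (sq_dev_sum / n + lam)
    # Activations far from the channel mean receive a larger attention weight.
    return [[v * sigmoid((v - mean) ** 2 / denom + 0.5) for v in row] for row in channel]


# Toy 2 x 2 channel with one strongly activated position.
feature_map = [[0.0, 0.0], [0.0, 4.0]]
print(simam_channel(feature_map))
```

The only constant is `lam`, a fixed hyperparameter rather than a trained weight, which is why a module of this kind adds no parameters to the CSRNet backbone.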


2605.18346 2026-05-19 cs.CV cs.AI 版本更新

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

聚焦强制：面向内容的每帧KV选择用于高效的自回归视频扩散

Peiliang Cai, Evelyn Zhang, Jiacheng Liu, Hao Lin, Ruiqi Zhang, Weile Mo, Yue Ma, Shikang Zheng, Jiehang Huang, Dongrui Liu, Linfeng Zhang

发表机构 * SJTU（上海交通大学）； SDU（山东大学）； HUST（华中科技大学）； UTokyo（东京大学）； HKUST（香港科技大学）； SCUT（华南理工大学）； Shanghai AI Lab（上海人工智能实验室）

AI总结本文提出了一种无需训练的KV选择方法，通过结合注意力分数和历史帧的多样性分数，保留最相关和有区别的历史帧，从而在不牺牲质量的情况下提高自回归视频扩散的效率。

详情

AI中文摘要

近期在自回归视频扩散领域的进展使得序列和流式视频生成成为可能。然而，长视界生成需要越来越大的KV缓存，这使得在不牺牲质量的情况下实现高效的压缩具有挑战性。现有方法大多基于注意力分数选择历史帧，但它们的上下文决策仍然粗略。当同一块中生成多个帧时，这些方法通常对整个块应用共享的历史选择，仅通过注意力对历史帧评分，并将头预算均匀或通过注意力模式启发式分配，而不是显式估计头重要性。我们发现同一生成块中的帧可能依赖于不同的历史帧，同一历史帧在与当前帧的相对时间距离变化时可能获得不同的注意力分数，且屏蔽不同头会引发不均等的生成退化。受这些发现的启发，我们提出了Focused Forcing，一种无需训练的KV选择方法，该方法在生成帧和头维度上聚焦缓存历史。对于每个生成帧，Focused Forcing通过结合注意力分数和历史帧的多样性分数保留最相关和有区别的历史帧，同时将较大的预算分配给估计重要性更高的头。在多个自回归生成范式中，Focused Forcing在不训练的情况下实现了高达1.48倍的端到端加速，同时提高了视觉质量和文本对齐。

英文摘要

Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose Focused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to 1.48× end-to-end acceleration without training, while improving visual quality and text alignment. Our code will be released on GitHub.
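The abstract does not specify how attention and diversity scores are combined, so the following is only a hedged illustration of the general idea using a greedy, maximal-marginal-relevance-style rule: a historical frame gains from its attention score and is penalized for resembling frames already kept. The function names, the cosine-similarity penalty, and `diversity_weight` are all hypothetical and should not be read as the Focused Forcing algorithm.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm > 0 else 0.0


def select_history(attn_scores: Sequence[float],
                   features: Sequence[Sequence[float]],
                   budget: int,
                   diversity_weight: float = 0.5) -> List[int]:
    """Greedily keep up to `budget` historical frames for one generated frame."""
    selected: List[int] = []
    candidates = list(range(len(attn_scores)))
    while candidates and len(selected) < budget:
        best, best_gain = candidates[0], float("-inf")
        for i in candidates:
            # Redundancy: similarity to the closest frame that is already kept.
            redundancy = max((cosine(features[i], features[j]) for j in selected), default=0.0)
            gain = attn_scores[i] - diversity_weight * redundancy
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)


# Frames 0 and 1 are near-duplicates with high attention; frame 2 is distinct.
scores = [0.9, 0.85, 0.3]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(select_history(scores, feats, budget=2, diversity_weight=1.0))
```

Setting `diversity_weight` to zero reduces the rule to plain top-k selection by attention score, which corresponds to the attention-only baseline the abstract criticizes.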


2605.18334 2026-05-19 cs.CV cs.GR 版本更新

3D Skew Gaussian Splatting with Any Camera Trajectory Visualization Engine

具有任意相机轨迹可视化引擎的3D斜高斯散射

Beizhen Zhao, Yifan Zhou, Gaochao Song, Ziran Yin, Hao Wang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Zhejiang University（浙江大学）； The University of Hong Kong（香港大学）

AI总结本文提出3D斜高斯散射（3DSGS），通过引入斜高斯分布来提升3D高斯散射的结构保真度和紧凑性，以解决对称高斯分布在捕捉形状和颜色不连续性方面的不足，从而提高可视化效果和空间数据探索的准确性。

Comments 16 pages

详情

AI中文摘要

尽管3D高斯散射（3DGS）已经革新了实时逼真视角合成，但其对称高斯分布的根本依赖引入了视觉伪影，阻碍了准确的空间数据探索。具体而言，对称内核难以捕捉形状和颜色不连续性，导致模糊和基本元素冗余，这在视觉分析中会误导人类感知。为了解决这些可视化障碍，我们引入了3D斜高斯散射（3DSGS），一种新的框架，显著增强了显式场景表示的结构保真度和紧凑性。我们的关键见解在于将标准基本元素扩展为一般斜高斯对应物。这种通用基本元素继承了标准高斯的高效光栅化特性，同时获得了内在的非对称建模能力。我们将其与增强的不透明度表示相结合，以更好地处理复杂的透明度，同时结合一种深度感知的密集化策略，智能管理基本元素的分配。此外，为了使这些进步能够应用于实际的视觉分析，我们重新推导了CUDA光栅化管线，使其普遍支持对称和斜高斯，将其整合到一个解耦的自由相机交互可视化引擎中。广泛的实验表明，3DSGS在复杂细节区域实现了更优的渲染质量和结构紧凑性，同时保持了实时帧率，以支持流畅的交互探索。补充推导和视觉结果可在https://3d-skew-gs.github.io/上获得。

英文摘要

While 3D Gaussian Splatting (3DGS) has revolutionized real-time photorealistic view synthesis, its fundamental reliance on symmetric Gaussian distributions introduces visual artifacts that hinder accurate spatial data exploration. Specifically, symmetric kernels struggle to capture shape and color discontinuities, which causes blurriness and primitive redundancy that mislead human perception during visual analysis. To address these visualization barriers, we introduce 3D Skew Gaussian Splatting (3DSGS), a novel framework that significantly enhances the structural fidelity and compactness of explicit scene representations. Our key insight lies in extending the standard primitive to a general Skew Gaussian counterpart. This generalized primitive inherits the highly efficient rasterization properties of standard Gaussians while gaining intrinsic asymmetric modeling capabilities. We couple this with an enhanced opacity representation to better handle complex transparency, alongside a depth-aware densification strategy that intelligently manages primitive allocation. Furthermore, to make these advancements actionable for real-world visual analytics, we re-derive the CUDA rasterization pipeline to universally support both symmetric and skew Gaussians, integrating it into a decoupled, free-camera interactive visualization engine. Extensive experiments demonstrate that 3DSGS achieves superior rendering quality and structural compactness, particularly in regions with intricate details, while maintaining the real-time frame rates necessary for fluid interactive exploration. Supplementary derivations and visual results are available at https://3d-skew-gs.github.io/.


2605.18328 2026-05-19 cs.CV 版本更新

CineMatte: Background Matting for Virtual Production and Beyond

CineMatte：虚拟制作及其他场景的背景抠图

Yuanjian He, Chen Zhang, Fasheng Chen, Jiangbo Cao

发表机构 * Online Video Business Unit, Tencent PCG Shenzhen, China（腾讯PCG深圳在线视频事业部）

AI总结本文提出CineMatte，一种用于虚拟制作及其他场景的鲁棒背景抠图框架。该方法采用交叉注意力条件设计，通过共享权重的冻结DINOv3 Vision Transformer编码输入帧和捕获的背景，并利用交叉注意力模块预测前景，从而保留预训练语义并提高对背景位移的鲁棒性。此外，还引入了CineMatte-4K数据集，包含4K HDR图像视频，为虚拟制作抠图提供了首个非合成的数据集。

详情

AI中文摘要

LED虚拟制作（VP）利用大型LED屏幕空间实时渲染背景，使镜头内视觉效果成为可能，但使拍摄后的修改变得费力。我们通过CineMatte，一种用于VP及其他场景的鲁棒背景抠图框架来解决这一问题。CineMatte采用交叉注意力条件设计。不同于将背景与输入拼接，CineMatte采用一个共享权重的孪生冻结DINOv3 Vision Transformer，分别对输入帧和捕获的背景进行编码。交叉注意力模块比较两个流以预测前景，保留预训练语义并提高对背景位移的鲁棒性。先前基于ViT的抠图模型使用并行卷积“细节分支”来恢复细节，这在实际样本中可能由于与主干的语义未对齐而导致边界伪影。我们改用预训练的图像引导特征上采样器，这在很大程度上缓解了该问题。我们还引入了CineMatte-4K，一个在专业LED VP舞台上拍摄的4K HDR图像视频数据集。据我们所知，图像子集是首个VP抠图数据集，且为非合成数据，通过绿幕插入获得；视频子集包含相机运动和跟踪轨迹，以便后续可以以正确的视差渲染任意背景。在CineMatte-4K和公共基准（VideoMatte240K，YouTubeMatte）上，CineMatte不仅在VP中表现出色，而且对真实世界素材也具有强大的泛化能力。

英文摘要

LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional "detail branch" to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.

2605.18303 2026-05-19 cs.LG cs.AI cs.CV cs.RO Updated version

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter, Jonghye Woo, Moritz Zaiss, Andreas Maier, Paula A. Perez-Toro

Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institute of Radiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institut für Informationsverarbeitung, Leibniz Universität Hannover; Department of Radiology, Harvard Medical School and Massachusetts General Hospital

AI summary: This paper proposes SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. It exploits the correlation between vocal-tract configuration and the produced acoustics to predict part of the image content from audio, and reconstructs anatomically plausible structure at higher throughput than iterative methods.

Abstract

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM
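
The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes the spatial weighting map acts as a per-pixel convex combination of the two branches; the abstract does not state the exact fusion rule, and the function name `fuse_frame` and the toy values are hypothetical.

```python
def fuse_frame(audio_img, mri_img, weight_map):
    """Blend two equally sized 2D images pixel by pixel.

    weight_map[r][c] in [0, 1] is the share taken from the audio branch;
    the remainder comes from the MRI branch. The convex combination is an
    assumption, not the paper's confirmed formulation.
    """
    fused = []
    for a_row, m_row, w_row in zip(audio_img, mri_img, weight_map):
        fused.append([w * a + (1.0 - w) * m
                      for a, m, w in zip(a_row, m_row, w_row)])
    return fused


# Toy 2x2 frame: the audio branch predicts 1.0 everywhere, the MRI branch 0.0.
audio_img = [[1.0, 1.0], [1.0, 1.0]]
mri_img = [[0.0, 0.0], [0.0, 0.0]]
weight_map = [[1.0, 0.5], [0.25, 0.0]]
fused = fuse_frame(audio_img, mri_img, weight_map)
```

With these inputs each fused pixel equals its weight, which makes it easy to see where the audio-driven component dominates.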

2605.18209 2026-05-19 cs.CV cs.AI Updated version

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

Pawat Chunhachatrachai, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

Affiliations: National Taiwan University; Delta Robotics Innovation Center

AI summary: This paper proposes SpatioRoute, a dynamic prompt generation method that routes each question to a semantically tailored prompt template. It needs no additional training or 3D sensor input, improves zero-shot spatial reasoning, and finds that Chain-of-Thought prompting performs poorly for spatial video understanding.

Comments 10 pages, 2 figures, 2nd Workshop on 3D-LLM/VLA, CVPR 2026

Abstract

Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template -- without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains up to 5% over fixed prompt baselines, establishing a new state-of-the-art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
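
The rule-based mode (SpatioRoute-R) can be pictured as a lookup from the leading question word to a template. The sketch below is only an illustration: the template wording, the fallback template, and the `route` function are hypothetical, since the abstract names the question types but not the prompts.

```python
import re

# Hypothetical templates; the paper's actual prompt wording is not given here.
TEMPLATES = {
    "what": "Name the object or attribute asked about. {context} Question: {question}",
    "is": "Answer yes or no after checking the spatial relation. {context} Question: {question}",
    "how": "Describe the count or manner requested. {context} Question: {question}",
    "can": "Judge whether the action is feasible in the scene. {context} Question: {question}",
    "which": "Pick the option that matches the described direction. {context} Question: {question}",
}
DEFAULT_TEMPLATE = "{context} Question: {question}"


def route(question, context):
    """Return (route_key, prompt) chosen deterministically from the first word."""
    match = re.match(r"\s*([A-Za-z]+)", question)
    key = match.group(1).lower() if match else ""
    if key not in TEMPLATES:
        return "default", DEFAULT_TEMPLATE.format(context=context, question=question)
    return key, TEMPLATES[key].format(context=context, question=question)


key, prompt = route("Is the chair behind me?", "I am facing the desk.")
```

Because routing depends only on the question text, it needs no video input, matching the abstract's description of routing without video at routing time.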

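2605.18197 2026-05-19 cs.RO cs.AI cs.CV Updated version

Dual-Rate Diffusion: Accelerating Diffusion Models by Interleaving Heavy and Light Networks

Grigory Bartosh, David Ruhe, Emiel Hoogeboom, Jonathan Heek, Thomas Mensink, Tim Salimans

Affiliations: Google DeepMind Amsterdam; University of Amsterdam

AI summary: This paper proposes Dual-Rate Diffusion, which speeds up diffusion-model inference by interleaving a high-capacity context encoder with a lightweight denoising model. It maintains sample quality and matches baseline performance on ImageNet benchmarks at lower computational cost.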

Abstract

Diffusion models achieve state-of-the-art generative performance but suffer from high computational costs during inference due to the repeated evaluation of a heavy neural network. In this work, we propose Dual-Rate Diffusion, a method to accelerate sampling by interleaving the execution of a heavy high-capacity context encoder and a light efficient denoising model. The context encoder is evaluated sparsely to extract high-dimensional features, which are effectively reused by the light denoising model at every step to refine the sample efficiently. This approach significantly accelerates inference without compromising sample quality. On ImageNet benchmarks, Dual-Rate Diffusion matches the performance of standard baselines while reducing computational cost by a factor of $2$-$4$. Furthermore, we demonstrate that our method is compatible with distillation techniques, such as Moment Matching Distillation, enabling further efficiency gains in few-step generation.
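
The interleaving idea can be sketched as a sampling loop in which the heavy encoder runs only every few steps and its output is cached for the light denoiser. Everything below is a toy under stated assumptions: the fixed refresh interval `encoder_every`, the function names, and the scalar stand-in networks are hypothetical and are not taken from the paper.

```python
def dual_rate_sample(x, num_steps, encoder_every, heavy_encoder, light_denoiser):
    """Run num_steps denoising steps, refreshing the context sparsely."""
    calls = {"heavy": 0, "light": 0}
    context = None
    for step in range(num_steps):
        if step % encoder_every == 0:
            # Sparse, expensive call: recompute the cached context features.
            context = heavy_encoder(x, step)
            calls["heavy"] += 1
        # Cheap call at every step, reusing the cached context.
        x = light_denoiser(x, context, step)
        calls["light"] += 1
    return x, calls


def toy_encoder(x, step):
    """Stand-in for the heavy network: returns a fixed target value."""
    return 0.0


def toy_denoiser(x, context, step):
    """Stand-in for the light network: moves halfway toward the context."""
    return x + 0.5 * (context - x)


sample, calls = dual_rate_sample(8.0, num_steps=8, encoder_every=4,
                                 heavy_encoder=toy_encoder,
                                 light_denoiser=toy_denoiser)
```

In this toy run the heavy network is evaluated twice and the light one eight times, which shows where the compute saving would come from if the heavy network dominates the cost.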

2605.18184 2026-05-19 cs.RO cs.AI cs.CV Updated version

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

Affiliations: Mobile Robotics Group (MRG); Visual and Multimodal Applied Learning Lab (VANDAL)

AI summary: This paper uses fixed external RGB cameras as Common Prior Maps for active, incremental 3D scene graph generation. It fuses observations from the robot's onboard cameras and the fixed external cameras to improve the efficiency and accuracy of scene understanding.

Abstract

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras - onboard or external - identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

2605.18177 2026-05-19 cs.CV Updated version

Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

Calvin Galagain, Martyna Poreba, François Goulette

Affiliations: Université Paris-Saclay, CEA List; U2IS, ENSTA Paris, Institut Polytechnique de Paris

AI summary: This paper proposes TokenMask, which computes mask logits directly from query-token affinities and interpolates in logit space. This reduces compute and memory requirements and improves segmentation efficiency while maintaining accuracy.

Comments CVPR, EVW 2026

2605.18176 2026-05-19 cs.CV cs.AI Updated version

MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie

Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University

AI summary: This paper presents MARS, a system for the CASTLE Challenge at EgoVis 2026 that uses multimodal agentic reasoning to answer complex questions requiring multiple information sources. Its core method is multimodal evidence selection, which enables effective reasoning over multi-source data.

Comments The Runner-up Solution for CASTLE Challenge @ EgoVis 2026

Abstract

This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heartrate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources, videos and transcripts, and four auxiliary sources, gaze, heartrate, photos, and thermal imagery. Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our codes are available at https://github.com/Hyu-Zhang/MARS.

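2605.18173 2026-05-19 cs.CV Updated version

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

Antonio Colombo, Giovanni Bianchi

Affiliations: School of Information, Polytechnic University of Turin

AI summary: This paper proposes a Soft Attention Mask Embedding (SAME) module that uses the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, then embeds them hierarchically with predicted masks to produce refined, text-boundary-aware masks that suppress background noise. Building on this module, it presents SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor an auxiliary text rectification module.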

Abstract

End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02\% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02\% in full-lexicon accuracy without additional training data, and obtains competitive 83.4\% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.

2605.18162 2026-05-19 cs.CV cs.AI Updated version

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang, Piotr Koniusz, Yirong Chen, Ding Wang

Affiliations: Shanghai Artificial Intelligence Laboratory; The City University of New York; The Chinese University of Hong Kong; Data61 CSIRO; University of New South Wales; Australian National University

AI summary: This paper proposes SAGE, a framework that achieves self-evolving spatial reasoning in vision-language models through geometric and linguistic duality operations, improving robustness and generalization on spatial reasoning tasks.

Comments 23 pages, 7 figures, 3 tables

Abstract

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.

2605.18156 2026-05-19 cs.CV Updated version

ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow

Jiekai Wu, Rong Fu, Chuangqi Li, Zijian Zhang, Guangxin Wu, Hao Zhang, Shiyin Lin, Jianyuan Ni, Yang Li, Dongxu Zhang, Amir H. Gandomi, Simon Fong, Pengbin Feng

Affiliations: Faculty of Health Data Science, Juntendo University; The Institute of Collaborative Innovation, University of Macau; Department of Information and Computing Sciences, Faculty of Science, Utrecht University; Department of Computer and Information Science, University of Pennsylvania; School of Computer Science, University of Chinese Academy of Sciences; Department of Computer & Information Science & Engineering, University of Florida; Department of Computer Science, Juniata College; National Engineering Research Center for Beijing Biochip Technology; CapitalBio Corporation; Faculty of Engineering & Information Technology, University of Technology Sydney; University Research and Innovation Center (EKIK), Obuda University; Faculty of Science and Technology, University of Macau; Department of Mathematics, University of Southern California

AI summary: This paper proposes ProtoFlow, a time-aware prototype dynamics framework that models class prototypes as trajectories and learns their evolution to mitigate forgetting in remote sensing segmentation. Experiments show consistent gains on several benchmarks.

Abstract

Remote sensing segmentation in real deployment is inherently continual: new semantic categories emerge, and acquisition conditions shift across seasons, cities, and sensors. Despite recent progress, many incremental approaches still treat training steps as isolated updates, which leaves representation drift and forgetting insufficiently controlled. We present ProtoFlow, a time-aware prototype dynamics framework that models class prototypes as trajectories and learns their evolution with an explicit temporal vector field. By jointly enforcing low-curvature motion and inter-class separation, ProtoFlow stabilizes prototype geometry throughout incremental learning. Experiments on standard class- and domain-incremental remote sensing benchmarks show consistent gains over strong baselines, including improvements of up to 1.5-2.0 points in mIoUall, together with reduced forgetting. These results suggest that explicitly modeling temporal prototype evolution is a practical and interpretable strategy for robust continual remote sensing segmentation. Open-source code: https://github.com/dudududke/protoflow.

2604.02060 2026-05-19 cs.CV cs.RO Updated version

CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou, Xiangyu Chen, Shilin Shan, Boyu Ma, Bofan Lyu, Gen Li, Jianfei Yang

Affiliations: MARS Lab, Nanyang Technological University, Singapore

AI summary: This work introduces a new 3D affordance setting, Intent-Driven Confusable Affordance Grounding, which predicts a per-point affordance mask on the correct object in a multi-object point cloud from implicit natural-language intent. It builds the CompassAD benchmark, reports state-of-the-art results on confusing multi-object compositions with implicit intent, and validates real-world grasping on a robotic manipulator.

Abstract

When told to "cut the cake," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Intent-Driven Confusable Affordance Grounding, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusing multi-object compositions. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 compositions, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object compositions.

2604.00634 2026-05-19 cs.RO cs.CV Updated version

LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics

Calvin Galagain, Martyna Poreba, François Goulette, Cyrill Stachniss

Affiliations: Université Paris-Saclay, CEA LIST; U2IS, ENSTA, Institut Polytechnique de Paris; University of Bonn, Center for Robotics

AI summary: This paper proposes LiPS, a lightweight panoptic segmentation method that streamlines the feature extraction and fusion pathways while retaining query-based decoding. It substantially lowers compute requirements and achieves accuracy comparable to heavier models at higher throughput.

Comments Accepted to IEEE International Conference on Image Processing (ICIP) 2026, Paper #2070

2603.23194 2026-05-19 cs.GR cs.CV cs.LG Updated version

PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning

Yuanhang Lei, Tao Cheng, Xingxuan Li, Boming Zhao, Siyuan Huang, Ruizhen Hu, Peter Yichen Chen, Hujun Bao, Zhaopeng Cui

Affiliations: State Key Laboratory of CAD&CG; BIGAI; Shenzhen University; University of British Columbia

AI summary: This paper proposes PhysSkin, a framework that uses a self-supervised learning strategy to achieve real-time physics-based animation across diverse 3D shapes and discretizations. Its core components are a neural skinning fields autoencoder and a physics-informed learning strategy.

Comments Accepted by CVPR 2026 Highlight. Project Page: https://zju3dv.github.io/PhysSkin/

Abstract

Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer-based encoder and a cross-attention decoder. Furthermore, we also develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints. PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.

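2603.14936 2026-05-19 cs.CV Updated version

Bridging the Intention-Expression Gap: Aligning Multi-Dimensional Preferences via Hierarchical Relevance Feedback in Text-to-Image Diffusion

Wenxi Wang, Hongbin Liu, Mingqian Li, Junyan Yuan, Junqi Zhang

Affiliations: Tongji University

AI summary: This paper proposes a framework driven by hierarchical relevance feedback that aligns multi-dimensional features in text-to-image diffusion models. It addresses the gap between user intention and expression and improves the model's ability to identify multi-dimensional preferences.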

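Are Multimodal Large Language Models Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection

Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi

Affiliations: Electrical and Computer Engineering Department

AI summary: This paper studies the real-world zero-shot anomaly detection performance of multimodal large language models and finds a conservative bias. Class-specific instructions raise the F1 score, but recall remains the key bottleneck.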

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
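
The precision-recall trade-off discussed above follows directly from the standard definitions, as the short sketch below shows. The confusion counts are illustrative values chosen for this example only; they are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only (not from the paper), with 100 anomalous clips.
# A conservative model flags very few clips: perfect precision, tiny recall.
conservative = precision_recall_f1(tp=5, fp=0, fn=95)
# A less conservative model flags more clips: lower precision, higher recall.
balanced = precision_recall_f1(tp=60, fp=30, fn=40)
```

Because F1 is a harmonic mean, collapsed recall drags the score down even when precision is perfect, which is why shifting the decision boundary toward the anomaly class can raise F1 substantially.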

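2602.22941 2026-05-19 cs.CV Updated version

LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models

Mengyu Sun, Ziyuan Yang, Andrew Beng Jin Teoh, Junxu Liu, Haibo Hu, Yi Zhang

Affiliations: Sichuan University; The Hong Kong Polytechnic University; Nanyang Technological University; Yonsei University

AI summary: This paper proposes LURE, which reconstructs the latent space and guides the sampling trajectory to achieve high-fidelity reawakening of multiple erased concepts. It resolves the gradient conflicts and feature entanglement that existing methods face in multi-concept scenarios.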

Abstract

Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification. Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.

2601.13839 2026-05-19 cs.CV Updated version

DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

Aisha Al-Mohannadi, Ayisha Firoz, Yin Yang, Muhammad Imran, Ferda Ofli

Affiliations: Qatar Computing Research Institute; Hamad Bin Khalifa University; College of Science & Engineering; Qatar University

AI summary: This paper presents DisasterVQA, a dataset for perception and reasoning in disaster scenes with 1,395 real-world images and 4,405 expert-curated question-answer pairs. It evaluates seven state-of-the-art vision-language models for disaster response and finds weaknesses in fine-grained quantitative reasoning, object counting, and context-sensitive interpretation.

Comments Accepted at ICWSM 2026

Abstract

Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://doi.org/10.5281/zenodo.18267769.

2512.18953 2026-05-19 cs.CV 版本更新

Symmetry Matters: Auditing and Symmetrizing 3D Generative Models

对称性至关重要：审计和对称化3D生成模型

Nicolas Caytuiro, Ivan Sipiran

Affiliations University of Chile

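AI summary This paper studies symmetry preservation in unconditional point cloud generation. It audits the symmetry of shapes produced by several 3D generative models with a normalized symmetry score based on the Chamfer Distance and finds a persistent symmetry gap under a symmetry-aware evaluation protocol. After analyzing the training data, the authors propose a symmetry-aware intervention: train generative models on a half-objects dataset and reconstruct full objects by reflection at sampling time, which improves geometric consistency and visual plausibility.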

Comments 12 pages, 8 figures, 4 tables

详情

AI中文摘要

对称性是许多物体类别中强有力的先验知识，但标准的3D生成模型基准很少报告这一先验是否被保留。我们研究了无条件点云生成中的对称性保持问题。我们首先通过几种3D生成模型审计生成形状的对称性，并基于Chamfer距离（CD）计算归一化对称性分数。我们表明，尽管当前3D生成模型在标准评估下取得竞争性结果，但当应用对称性意识评估协议时，它们显示出持续的对称性差距。为了测试这个差距是否仅仅继承自训练数据，我们评估了这些模型在由ShapeNet衍生的镜像物体数据集上的表现，并分析了训练过程中的对称性动态。通过机制可解释性技术，在采样和潜在空间层面进一步表明，反射对称性在学习的生成过程中并不可靠地编码。最后，为了解决这个差距，我们提出了一种数据导向的对称性意识干预：在半对象数据集上训练生成模型，并在采样时通过反射重建完整物体。在多个模型架构上，这种干预显著提高了几何一致性和视觉合理性，同时在标准度量下仍具竞争力。这些发现表明，需要伴随标准基准进行对称性意识评估，未来的3D生成模型应显式地将这一先验纳入训练或采样过程中。

英文摘要

Symmetry is a strong prior present in many object categories, yet standard benchmarks for 3D generative models rarely report whether this prior is preserved. We study symmetry preservation in unconditional point cloud generation. We first audit the symmetry of shapes generated by several 3D generative models and compute a normalized symmetry score based on the Chamfer Distance (CD). We show that although current 3D generative models achieve competitive results under standard evaluation, they reveal a persistent symmetry gap when a symmetry-aware evaluation protocol is applied. To test whether this gap is merely inherited from the training data, we evaluate these models on a mirrored-objects dataset derived from ShapeNet and analyze symmetry dynamics during training. We further apply mechanistic interpretability techniques at the sampling and latent levels, which show that reflection symmetry is not reliably encoded in the learned generative process. Finally, to address this gap, we propose a data-centric symmetry-aware intervention: training generative models on a half-objects dataset and reconstructing full objects by reflection during sampling. Across multiple backbones, this intervention substantially improves geometric consistency and visual plausibility while remaining competitive under standard metrics. These findings suggest that symmetry-aware evaluation is needed alongside standard benchmarks, and that future 3D generative models should incorporate this prior explicitly, either during training or during sampling.
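
The abstract does not spell out how the Chamfer-based symmetry score is normalized, so the following is only a minimal sketch under stated assumptions: the cloud is centered at its mean, mirrored across one coordinate plane, and the Chamfer distance to the mirror image is divided by the squared bounding-box diagonal. The function names `chamfer_distance` and `symmetry_score` and the choice of normalizer are hypothetical, not the authors' protocol.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Mean nearest-neighbour squared distance, summed over both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def symmetry_score(points, axis=0):
    """Chamfer distance between a cloud and its mirror image.

    Assumed normalization: divide by the squared bounding-box diagonal so the
    score does not depend on object scale. 0 means perfectly symmetric.
    """
    centered = points - points.mean(axis=0)
    mirrored = centered.copy()
    mirrored[:, axis] *= -1.0  # reflect across the plane through the centroid
    diag2 = np.sum((centered.max(axis=0) - centered.min(axis=0)) ** 2)
    return chamfer_distance(centered, mirrored) / diag2

rng = np.random.default_rng(0)
half = rng.random((50, 3))
full = np.concatenate([half, half * np.array([-1.0, 1.0, 1.0])])
print(symmetry_score(full), symmetry_score(half))
```

A cloud built by reflecting a half object, as in the proposed intervention, scores essentially zero by construction, whereas an arbitrary cloud scores strictly above zero.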

2512.11446 2026-05-19 cs.CV 版本更新

YawDD+: Frame-level Annotations for Accurate Yawn Prediction

YawDD+: 用于准确打哈欠预测的帧级标注

Ahmed Mujtaba, Gleb Radchenko, Marc Masana, Radu Prodan

Affiliations Embedded Systems Division, Silicon Austria Labs; Institute of Visual Computing, Graz University of Technology; Department of Computer Science, University of Innsbruck

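AI summary This paper proposes a semi-automated labeling pipeline with human-in-the-loop verification that annotates YawDD videos with more accurate frame-level labels, improving model training on edge devices and enabling more efficient driver fatigue detection.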

Comments This paper is accepted in the 33rd IEEE International Conference on Image Processing (ICIP) 2026

详情

AI中文摘要

驾驶员疲劳仍然是道路事故的主要原因，导致24%的碰撞事故。尽管打哈欠是疲劳的早期行为指标，但现有方法面临挑战，因为视频标注数据集中存在系统性噪声，源于粗略的时间标注。训练稳健的机器学习（ML）模型需要丰富的监督标签，以帮助从训练数据中学习显著特征。此外，在边缘设备上高效训练和推断模型对于疲劳驾驶检测任务至关重要，以在不依赖云基础设施的情况下实现车辆上的准确实时决策。为了解决这个问题，我们开发了一种半自动标注流程，通过人工在循环验证来标注YawDD视频以获得更准确的帧级标注，从而在边缘平台如NVIDIA Jetson NANO上更准确地训练模型。在YawDD+上训练已建立的MNasNet分类器和YOLOv11检测器架构，比视频级监督提高了多达6%的帧准确率和5%的mAP，分别在Jetson NANO和AGX上实现了99.34%的分类准确率和95.69%的检测mAP。此外，MNasNet在AGX上仅用8.69分钟/epoch完成一个周期，同时提供高达115帧/秒（FPS）的推断时间，证明了增强的数据质量本身支持边缘设备上的驾驶员疲劳监测系统，而无需服务器端计算。YawDD+数据集和训练好的模型已在线上提供。

英文摘要

Driver fatigue remains a leading cause of road accidents, responsible for 24% of crashes. While yawning serves as an early behavioral indicator of fatigue, existing approaches face significant challenges because coarse temporal annotations introduce systematic noise into video-annotated datasets. Training robust machine learning (ML) models requires rich supervisory labels that help learn salient features from the training data. Moreover, efficient on-device training and inference on edge devices is crucial in driver fatigue detection to enable accurate real-time decisions in vehicles without reliance on cloud infrastructure. To address this issue, we develop a semi-automated labeling pipeline with human-in-the-loop verification that converts the video-level labels of YawDD into the frame-level annotations of YawDD+, enabling more accurate model training on edge platforms such as NVIDIA Jetson NANO. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP on Jetson NANO and AGX. Moreover, MNasNet trains in just 8.69 min/epoch while delivering inference at up to 115 frames per second (FPS) on AGX, confirming that enhanced data quality alone supports on-device driver fatigue monitoring systems without server-side computation. The YawDD+ dataset and trained models are available online.

2512.05136 2026-05-19 cs.CV cs.AI 版本更新

DocReward: A Document Reward Model for Structuring and Stylizing Documents

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

Affiliations CUHK; UCAS; XJTU; UMich; Microsoft

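AI summary This paper proposes DocReward, a reward model that evaluates document structure and style. It is trained with the Bradley-Terry loss on DocPair, a dataset of 117,000 paired documents, and improves the structural and stylistic professionalism of generated documents.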

详情

AI中文摘要

近期的代理工作流程自动化了专业文档生成，但主要关注文本质量，忽视了结构和风格的专业性，这对于可读性同样至关重要。这一差距主要源于缺乏有效的奖励模型，无法引导代理生成结构和风格专业的文档。我们引入DocReward，一种评估文档结构和风格的文档奖励模型。为此，我们提出了一种文本质量无关的框架，确保评估不受内容质量的影响，并构建了包含117,000对文档的DocPair数据集，涵盖32个领域和267种类型。每对文档内容相同，但结构和风格专业性不同。DocReward使用Bradley-Terry损失进行训练。在人工标注的基准测试中，DocReward在相同设置下比GPT-5高出14.6个百分点。强化学习实验进一步表明，DocReward能有效引导代理生成具有更一致结构和风格专业性的文档，突显了其实际应用价值。

英文摘要

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.
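
The abstract says only that DocReward is trained with the Bradley-Terry loss. As a reminder of what that objective computes, here is a minimal NumPy sketch of the standard pairwise form, the mean of -log sigmoid(r_preferred - r_other). The function name and the scalar-reward inputs are illustrative assumptions; the authors' model and training code are not shown in the source.

```python
import numpy as np

def bradley_terry_loss(r_preferred, r_other):
    """Mean negative log-likelihood that the preferred item of each pair wins.

    Under the Bradley-Terry model, P(preferred beats other) equals
    sigmoid(r_preferred - r_other), so the loss is -log of that probability.
    """
    diff = np.asarray(r_preferred, dtype=float) - np.asarray(r_other, dtype=float)
    # -log(sigmoid(x)) = log(1 + exp(-x)); logaddexp keeps this numerically stable.
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Hypothetical rewards for the more and less professional document of each pair.
print(bradley_terry_loss([2.0, 1.5], [0.5, 1.0]))
```

When both documents in a pair receive the same reward, the loss equals log 2. It falls toward zero as the more professional document is scored increasingly higher than its counterpart.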

2510.06809 2026-05-19 cs.CV 版本更新

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

VA-Adapter：将超声基础模型适应于超声心动图探头引导

Teng Wang, Haojun Jiang, Yuxuan Wang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang

Affiliations Department of Automation, BNRist, Tsinghua University, Beijing, China; School of Computer Science and Technology, Xidian University; Beijing Academy of Artificial Intelligence; Chinese PLA General Hospital

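AI summary This paper proposes VA-Adapter, which equips an ultrasound foundation model with the ability to understand individual 3D structures, improving the accuracy and efficiency of echocardiography probe guidance. Experiments show that it outperforms existing models with fewer parameters.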

Comments MICCAI2026 Early Accept Paper

详情

AI中文摘要

超声心动图是检测心脏疾病的关键工具，但其操作难度高导致专业人员短缺。探头引导系统通过辅助获取高质量图像，提供了降低操作门槛的有前景的解决方案。然而，由于显著的个体差异，稳健的探头引导仍具挑战性。这种差异表现为二维图像中低级特征的差异，这使得图像特征理解复杂化，以及个体三维结构的差异，这给精确导航带来挑战。为了解决这些挑战，我们首先提出利用超声基础模型从大量数据集中学习的稳健图像表示。然而，将这些模型应用于探头导航是困难的，因为它们缺乏对个体三维结构的理解。为此，我们精心设计了视觉-动作适配器（VA-Adapter）以在线注入理解个体三维结构的能力。具体来说，通过将VA-Adapter嵌入基础模型的图像编码器中，模型可以从历史视觉-动作序列中推断心脏解剖结构，模拟超声技师的认知过程。在包含超过131万样本的数据集上进行的广泛实验表明，VA-Adapter在参数量少约33倍的情况下优于现有探头引导模型。代码可在https://github.com/LeapLabTHU/VA-Adapter上获得。

英文摘要

Echocardiography is a critical tool for detecting heart diseases, yet its steep operational difficulty causes a shortage of skilled personnel. Probe guidance systems, which assist in acquiring high-quality images, offer a promising solution to lower this operational barrier. However, robust probe guidance remains challenging due to significant individual variability. This variability manifests as differences in low-level features within two-dimensional (2D) images, which complicate image feature understanding, and differences in individual three-dimensional (3D) structures, which pose challenges for precise navigation. To address these challenges, we first propose leveraging the robust image representations learned by ultrasound foundation models from vast datasets. Yet applying these models to probe navigation is non-trivial because they lack an understanding of individual 3D structures. To this end, we design a Vision-Action Adapter (VA-Adapter) that injects, online, the capability of understanding individual 3D structures. Specifically, by embedding the VA-Adapter into the foundation model's image encoder, the model can infer cardiac anatomy from historical vision-action sequences, mimicking the cognitive process of a sonographer. Extensive experiments on a dataset with over 1.31M samples demonstrate that the VA-Adapter outperforms strong probe guidance models while requiring approximately 33 times fewer trained parameters. Code is available at https://github.com/LeapLabTHU/VA-Adapter.

2510.04382 2026-05-19 eess.IV cs.CV cs.NA math.NA 版本更新

Adaptive double-phase Rudin–Osher–Fatemi denoising model

自适应双相Rudin-Osher-Fatemi去噪模型

Wojciech Górny, Michał Łasica, Alexandros Matsoukas

Affiliations Faculty of Mathematics, Universität Wien; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw; Institute of Mathematics of the Polish Academy of Sciences; Department of Mathematics, School of Applied Mathematical and Physical Sciences, National Technical University of Athens

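AI summary This paper proposes an adaptive ROF denoising model based on a double-phase integral functional, designed to reduce staircasing while preserving image edges. Tests on synthetic and natural images show improved or similar performance relative to established models on the similarity metrics SSIM, PSNR, and LPIPS.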

Comments 23 pages, 16 figures, supplementary material available at: https://github.com/wojciechgorny/double-phase-ROF-model/

详情

AI中文摘要

尽管自 seminal Rudin--Osher--Fatemi (ROF) 关于总变分 (TV) 去噪的论文发表超过30年后，该模型在科学应用如天文成像中仍然具有相关性。然而，它已知会产生诸如阶梯效应之类的问题。许多该模型的变体已被提出，旨在对抗这些问题。最近，在数学分析社区对双相问题的大量研究背景下，提出了一种双相类型积分函数，包含TV和一个加权二次增长项，作为图像恢复的正则化器。在此，我们提出了一种基于该正则化的ROF去噪模型的自适应变体。它旨在相对于经典ROF模型减少阶梯效应，同时以类似的方式保留图像的边缘。我们实现了该模型，并在不同噪声水平下的合成和自然图像上测试其性能。与具有类似可解释性的已建立模型相比，我们在SSIM、PSNR以及LPIPS等相似性度量上观察到改进或相似的表现，同时阶梯效应明显减少。

英文摘要

Even though more than 30 years have passed since the seminal Rudin–Osher–Fatemi (ROF) paper on total variation (TV) denoising, the model remains relevant, in particular in scientific applications such as astronomical imaging. However, it is known to suffer from artifacts such as the staircasing effect. Many variants of the model have been proposed with the aim of countering this. Recently, against the backdrop of immense research output on double-phase problems in the mathematical analysis community, a double-phase type integral functional, comprising TV and a weighted term of quadratic growth, was suggested as a regularizer for image restoration. Here, we propose an adaptive variant of the ROF denoising model based on that regularizer. It is designed to reduce staircasing relative to the classical ROF model while preserving the edges of the image in a similar fashion. We implement the model and test its performance on synthetic and natural images over a range of noise levels. Compared to established models with similar interpretability to ROF, we observe improved or similar performance in terms of the similarity metrics SSIM, PSNR, and LPIPS, while the staircasing effect is visibly reduced.

2509.22244 2026-05-19 cs.CV 版本更新

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

FlashEdit: 解耦速度、结构和语义以实现精确图像编辑

Junyi Wu, Zhiteng Li, Haotong Qin, Yulun Zhang, Xiaokang Yang

Affiliations Shanghai Jiao Tong University; ETH Zürich

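AI summary This paper proposes FlashEdit, an efficient localized image editing framework that decouples speed, structure, and semantics to achieve precise edits. Experiments show a favorable trade-off between preservation and efficiency.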

Comments Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit

详情

AI中文摘要

基于文本的图像编辑使用扩散模型已取得了显著的高质量成果，但往往面临可接受的延迟问题。我们介绍了FlashEdit，一种针对标准反向编辑设置的实时局部图像编辑框架。其效率和精度源于三个关键创新：（1）一个循环一致的一步反向（COSI）管道，通过循环一致性鼓励流形对齐的一步反向；（2）一种背景屏蔽（BG-Shield）技术，通过结构自注意干预提高非编辑区域的保真度；（3）一种稀疏的空间交叉注意（SSCA）机制，通过抑制语义泄漏促进精确编辑。在PIE-Bench上的实验表明，FlashEdit在保真度和效率之间取得了良好的权衡，编辑可在0.2秒内完成，比基于DDIM的多步编辑快超过150倍。我们的代码将在https://github.com/JunyiWuCode/FlashEdit上公开发布。

英文摘要

Text-guided image editing with diffusion models has achieved remarkable quality but often suffers from prohibitive latency. We introduce FlashEdit, a real-time localized image editing framework for the standard inversion-based editing setting. Its efficiency and precision stem from three key innovations: (1) a Cycle-Consistent One-Step Inversion (COSI) pipeline that encourages manifold-aligned one-step inversion through cycle consistency; (2) a Background Shield (BG-Shield) technique that improves preservation of non-edited regions via structural self-attention intervention; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that promotes precise edits by suppressing semantic leakage. Experiments on PIE-Bench demonstrate a strong preservation-efficiency trade-off, with edits completed in under 0.2 seconds and an over 150× speedup over DDIM-based multi-step editing. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.

2508.17431 2026-05-19 cs.CV cs.AI cs.LG 版本更新

FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

FedKLPR: 基于KL引导的剪枝感知联邦学习用于人重识别

Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

Affiliations Media IC and System Lab, Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University

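AI summary This paper proposes FedKLPR, a framework that combines KL-divergence-guided training, unstructured pruning, and Cross-Round Recovery to address statistical heterogeneity and communication overhead in federated person re-identification. Experiments show lower communication cost and better overall performance than existing methods.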

Comments 10 pages, 3 figures, 5 tables, submitted to IEEE Transactions on Multimedia

详情

AI中文摘要

人重识别（re-ID）是智能监控和公共安全中的基本任务。联邦学习（FL）提供了一种隐私保护的协同模型训练范式，无需集中数据收集。然而，由于非独立同分布（non-IID）客户端数据导致的统计异质性和频繁传输大规模模型带来的通信开销，将FL应用于现实世界中的re-ID系统仍然具有挑战性。为了解决这些挑战，我们提出了FedKLPR，一种轻量且通信高效的联邦学习框架用于人重识别。FedKLPR包含三个关键组件。首先，KL散度引导训练，包括KL散度正则化损失（KLL）和KL散度聚合权重（KLAW），用于缓解统计异质性和在非IID设置下提高收敛稳定性。其次，引入无结构剪枝以减少通信开销，并提出剪枝率聚合权重（PRAW）以衡量剪枝后客户端参数的相对重要性。与KLAW结合，PRAW形成KL散度-剪枝权重聚合（KLPWA），使在异构数据分布下能够有效聚合剪枝后的本地模型。第三，跨轮次恢复（CRR）适应性地控制剪枝跨通信轮次以防止过度压缩并保持模型准确性。在八个基准数据集上的实验表明，FedKLPR在保持竞争性准确性的同时实现了显著的通信节省。与现有最先进方法相比，FedKLPR在ResNet-50上将通信成本减少了40%--42%，并实现了更优异的总体性能。

英文摘要

Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm for collaborative model training without centralized data collection. However, deploying FL in real-world re-ID systems remains challenging due to statistical heterogeneity caused by non-IID client data and the substantial communication overhead incurred by frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, KL-Divergence-Guided training, including the KL-Divergence Regularization Loss (KLL) and KL-Divergence-aggregation Weight (KLAW), is introduced to mitigate statistical heterogeneity and improve convergence stability under non-IID settings. Second, unstructured pruning is incorporated to reduce communication overhead, and the Pruning-ratio-aggregation Weight (PRAW) is proposed to measure the relative importance of client parameters after pruning. Together with KLAW, PRAW forms KL-Divergence-Prune Weighted Aggregation (KLPWA), enabling effective aggregation of pruned local models under heterogeneous data distributions. Third, Cross-Round Recovery (CRR) adaptively controls pruning across communication rounds to prevent excessive compression and preserve model accuracy. Experiments on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40%–42% on ResNet-50 while achieving better overall performance.

2508.16663 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

Affiliations LUMIA Lab; School of Artificial Intelligence; Shanghai Jiao Tong University; Shanghai Innovation Institute; Noah's Ark Lab; Huawei Technologies Ltd.; School of Computer Science

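AI summary This paper proposes a frequency-domain visual token compression strategy that uses the Fourier transform to cut computational overhead and improve efficiency while preserving semantic fidelity. Experiments show strong results on both image and video tasks.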

详情

AI中文摘要

视觉-语言模型（VLMs）由于高分辨率图像和视频输入引入的大量视觉令牌，导致计算开销和推理延迟显著增加。现有的无参数令牌压缩方法通常依赖于令牌选择或合并，但可能丢弃大量视觉信息或扭曲原始表示分布，导致在高压缩比下性能下降。为此，我们探索了一种更有效且高效的视觉令牌压缩策略，重点在频域方向。受图像压缩中频域变换（如JPEG）的成功启发，我们系统分析了视觉表示中的频域冗余，并揭示了不同频带中语义信息的非均匀分布。基于此，我们引入了傅里叶压缩器，一种有效、无参数且高度通用的模块，通过FFT（复杂度为O(n² log n））在频域内去除视觉表示的冗余。实现过程中无额外参数，计算开销极小且保持语义保真度。在图像基准测试中，我们的方法在保留超过96%原始准确率的同时，将推理FLOPs减少高达83.8%，生成速度提升31.2%。它在图像和视频理解任务中均表现出色，且在LLaVA和Qwen-VL架构中均能稳定泛化，证明其在高效VLMs中的实用价值。

英文摘要

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we aim to explore a more effective and efficient visual token compression strategy, with a promising direction in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n^2 \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.
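
The abstract does not describe which frequencies Fourier Compressor keeps, so the following is only an illustrative sketch of one parameter-free way to shrink a token grid in the frequency domain: take a 2D FFT over the spatial axes, keep the central low-frequency block, and invert at the smaller size. The function name `fourier_lowpass_compress`, the low-pass choice, and the grid sizes are assumptions, not the authors' module.

```python
import numpy as np

def fourier_lowpass_compress(tokens, out_h, out_w):
    """Shrink an (H, W, C) grid of vision tokens to (out_h, out_w, C).

    Assumed scheme: keep only the lowest spatial frequencies of each channel.
    """
    h, w, _ = tokens.shape
    # 2D FFT over the spatial axes, with the DC term shifted to the center.
    spec = np.fft.fftshift(np.fft.fft2(tokens, axes=(0, 1)), axes=(0, 1))
    # Crop the central block so that DC lands at the center of the crop.
    top = h // 2 - out_h // 2
    left = w // 2 - out_w // 2
    low = spec[top:top + out_h, left:left + out_w]
    # Undo the shift and invert at the reduced resolution.
    small = np.fft.ifft2(np.fft.ifftshift(low, axes=(0, 1)), axes=(0, 1)).real
    # Rescale so the mean token value is preserved across resolutions.
    return small * (out_h * out_w) / (h * w)

grid = np.random.default_rng(0).normal(size=(24, 24, 8))
print(fourier_lowpass_compress(grid, 12, 12).shape)  # 576 tokens reduced to 144
```

The sketch adds no learned parameters, which matches the parameter-free property the abstract emphasizes, but the number of retained tokens here is an arbitrary choice.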

2507.22136 2026-05-19 cs.CV 版本更新

Color as the Impetus: Transforming Few-Shot Learner

颜色作为动力：转换少样本学习者

Chaofei Qi, Zhitai Liu, Jianbin Qiu

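AI summary This paper proposes a few-shot learning framework based on a color perception mechanism that emphasizes distinct color information across channels to improve feature extraction and classification, and introduces a knowledge distillation method to strengthen meta-learning capacity.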

Comments This work is currently being redone. It requires significant revisions and polishing. Additionally, the title will also be revised. Therefore, this version is no longer needed.

详情

AI中文摘要

人类具备天生的元学习能力，部分归因于其出色的色彩感知能力。在本文中，我们开创性地从模拟人类色彩感知机制的角度出发，提出了少样本学习的新视角。我们提出了ColorSense Learner，一种生物启发的元学习框架，利用跨通道特征提取和交互学习。通过在不同通道中战略强调不同的颜色信息，我们的方法有效过滤了无关特征，同时捕捉到判别性特征。颜色信息代表了最直观的视觉特征，但传统元学习方法大多忽略了这一方面，而专注于类别间的抽象特征区分。我们的框架通过协同的色彩通道交互弥合了这一差距，使能够更好地提取类内共同性并扩大类间差异。此外，我们引入了基于知识蒸馏的元蒸馏器ColorSense Distiller，该方法利用先验教师知识来增强学生网络的元学习能力。我们对十一个多少样本基准进行了全面的粗粒度/细粒度和跨域实验进行验证。大量实验表明，我们的方法具有极强的泛化能力、鲁棒性和可迁移性，并且能够轻松地从颜色感知的角度处理少样本分类。

英文摘要

Humans possess innate meta-learning capabilities, partly attributable to their exceptional color perception. In this paper, we offer a new viewpoint on few-shot learning by simulating human color perception mechanisms. We propose the ColorSense Learner, a bio-inspired meta-learning framework that capitalizes on inter-channel feature extraction and interactive learning. By strategically emphasizing distinct color information across different channels, our approach effectively filters irrelevant features while capturing discriminative characteristics. Color information represents the most intuitive visual feature, yet conventional meta-learning methods have predominantly neglected this aspect, focusing instead on abstract feature differentiation across categories. Our framework bridges the gap via synergistic color-channel interactions, enabling better intra-class commonality extraction and larger inter-class differences. Furthermore, we introduce a meta-distiller based on knowledge distillation, ColorSense Distiller, which incorporates prior teacher knowledge to augment the student network's meta-learning capacity. We conduct comprehensive coarse-grained, fine-grained, and cross-domain experiments on eleven few-shot benchmarks for validation. The experiments show that our methods have strong generalization ability, robustness, and transferability, and that they effortlessly handle few-shot classification from the perspective of color perception.

2507.22057 2026-05-19 cs.CV 版本更新

MetaLab: Few-Shot Game Changer for Image Recognition

MetaLab: 图像识别中的少样本突破

Chaofei Qi, Zhitai Liu, Jianbin Qiu

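AI summary This paper proposes MetaLab, an efficient few-shot image recognition method built on a CIELab-guided coherent meta-learning framework, and reports high accuracy, robustness, and effective generalization that approach human-level recognition.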

Comments This work is currently being redone. It requires significant revisions and polishing. Additionally, the title will also be revised. Therefore, this version is no longer needed.

详情

AI中文摘要

困难的少样本图像识别具有显著的应用前景，但与传统大规模图像识别相比仍存在显著的技术差距。本文提出了一种高效的原生方法，称为CIELab引导的相干元学习（MetaLab）。结构上，我们的MetaLab由两个协作的神经网络组成：LabNet，能够对CIELab颜色空间进行域转换并提取丰富的分组特征，以及相干LabGNN，能够促进亮度图和颜色图之间的相互学习。为了充分验证，我们在四个粗粒度基准、四个细粒度基准和四个跨域少样本基准上进行了广泛的比较研究。具体而言，我们的方法在每个类别仅使用一个样本时能够实现高准确率、鲁棒性能和有效的泛化能力。总体而言，所有实验都表明，我们的MetaLab可以达到99%的准确率，接近人类识别水平，仅需少量的视觉偏差。

英文摘要

Difficult few-shot image recognition has significant application prospects, yet substantial technical gaps remain relative to conventional large-scale image recognition. In this paper, we propose an efficient original method for few-shot image recognition, called CIELab-Guided Coherent Meta-Learning (MetaLab). Structurally, MetaLab comprises two collaborative neural networks: LabNet, which performs domain transformation into the CIELab color space and extracts rich grouped features, and coherent LabGNN, which facilitates mutual learning between the lightness graph and the color graph. For validation, we conduct extensive comparative studies on four coarse-grained benchmarks, four fine-grained benchmarks, and four cross-domain few-shot benchmarks. Specifically, our method achieves high accuracy, robust performance, and effective generalization with a single sample per class. Overall, the experiments show that MetaLab can approach 99% accuracy, reaching the human recognition ceiling with little visual deviation.

2507.22041 2026-05-19 cs.CV 版本更新

3D Densification for Multi-Map Monocular Visual SLAM in Endoscopy

X. Anadón, Javier Rodríguez-Puigvert, J. M. M. Montiel

Affiliations Universidad de Zaragoza

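AI summary This paper proposes a method that removes outliers and densifies the maps, improving 3D environment representation in endoscopic multi-map monocular visual SLAM and yielding more accurate 3D map reconstruction for clinical applications.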

详情

AI中文摘要

多地图稀疏单目视觉同时定位与建图应用于单目内窥镜序列已被证明在内窥镜中频繁的损失（如运动模糊、时间遮挡、工具交互或水喷射）后能够稳健地恢复跟踪。稀疏多地图对于稳健的相机定位是足够的，但它们在环境表示方面非常差，它们是嘈杂的，有高比例的不准确重建的3D点，包括显著的异常值，更重要的是在临床应用中具有不可接受的低密度。我们提出了一种方法来去除异常值并增强状态-of-the-art稀疏内窥镜多地图CudaSIFT-SLAM的地图。通过使用鲁棒的LMedS将NN LightDepth用于到尺度的深度密集预测对齐稀疏CudaSIFT子地图。我们的系统缓解了单目深度估计中的固有尺度模糊问题，同时过滤异常值，导致可靠的致密3D地图。我们在C3VD幻影结肠数据集中提供了准确致密地图的实验证据，4.15毫米RMS精度在可接受的计算时间内。我们还报告了在Endomapper数据集上的真实结肠镜的定性结果。

英文摘要

Multi-map sparse monocular visual Simultaneous Localization and Mapping (SLAM) applied to monocular endoscopic sequences has proven efficient at robustly recovering tracking after the frequent losses in endoscopy caused by motion blur, temporal occlusion, tool interaction, or water jets. The sparse multi-maps are adequate for robust camera localization; however, they represent the environment poorly: they are noisy, contain a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, have an unacceptably low density for clinical applications. We propose a method to remove outliers and densify the maps of CudaSIFT-SLAM, the state-of-the-art sparse multi-map system for endoscopy. Up-to-scale dense depth predictions from the LightDepth neural network are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at affordable computing time, on the C3VD phantom colon dataset. We also report qualitative results on real colonoscopies from the Endomapper dataset.
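
To illustrate the alignment idea, the sketch below estimates a single scale factor between up-to-scale depth predictions and sparse map depths with least median of squares (LMedS), then flags outliers. It is a simplification under stated assumptions: one scalar scale, depths sampled at matching pixels, and hypothetical names (`lmeds_scale`, `inlier_k`). The paper's actual alignment between LightDepth predictions and CudaSIFT submaps may differ.

```python
import numpy as np

def lmeds_scale(pred_depth, sparse_depth, inlier_k=2.5):
    """Estimate the scale s with sparse_depth ~ s * pred_depth using LMedS.

    Every sparse point proposes one scale hypothesis; the hypothesis with the
    smallest median squared residual wins, so up to roughly half of the
    points can be outliers without affecting the estimate.
    """
    pred = np.asarray(pred_depth, dtype=float)
    ref = np.asarray(sparse_depth, dtype=float)
    candidates = ref / pred  # one hypothesis per sparse point
    # residuals[i, j]: residual of point j under hypothesis i.
    residuals = candidates[:, None] * pred[None, :] - ref[None, :]
    med = np.median(residuals ** 2, axis=1)
    best = int(np.argmin(med))
    # Common robust spread estimate derived from the median squared residual.
    sigma = 1.4826 * np.sqrt(med[best])
    inliers = np.abs(residuals[best]) <= inlier_k * sigma + 1e-12
    return candidates[best], inliers

pred = np.linspace(1.0, 5.0, 20)   # hypothetical up-to-scale network depths
sparse = 2.0 * pred                # hypothetical sparse-map depths
sparse[:5] += 10.0                 # five badly reconstructed points
scale, inliers = lmeds_scale(pred, sparse)
print(scale, int(inliers.sum()))
```

Because the median ignores the worst half of the residuals, the five corrupted points neither bias the recovered scale nor pass the inlier test, which is the property that lets one step both resolve scale and filter outliers.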

2502.07360 2026-05-19 q-bio.QM cs.CV 版本更新

Supervised contrastive learning for cell stage classification of animal embryos

基于监督对比学习的动物胚胎细胞阶段分类

Yasmine Hachani, Patrick Bouthemy, Elisa Fromont, Sylvie Ruffini, Ludivine Laffont, Alline de Paula Reis

Affiliations Inria center at Rennes University, France; University of Rennes, IRISA, France; Paris-Saclay University, UVSQ, INRAE, BREED, France; The National Veterinary School of Alfort (EnvA), France

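AI summary This paper proposes a deep learning method that combines supervised contrastive learning with focal loss to automatically classify the cell stages of animal embryos. It addresses low-quality images, class ambiguity, and imbalanced data distribution, and outperforms existing methods on bovine and mouse embryo datasets.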

Journal ref Scientific Reports, 2026

详情

AI中文摘要

视频显微镜结合机器学习为研究体外生成（IVP）胚胎的早期发育提供了有前景的方法。然而，手动标注发育事件，特别是细胞分裂，对于生物学家来说是耗时的，且无法扩展到实际应用。我们旨在利用深度学习方法自动分类来自2D时间延时显微镜视频的胚胎细胞阶段。我们专注于牛胚胎发育的分析，因为我们的主要应用是牛养殖，并创建了牛胚胎细胞阶段（ECS）数据集。挑战有三个：（1）低质量图像和牛暗细胞使细胞阶段识别困难，（2）发育阶段边界处的类别模糊，以及（3）数据分布不平衡。为了解决这些挑战，我们引入了CLEmbryo，一种结合监督对比学习和焦点损失的新型方法，并使用轻量级3D神经网络CSN-50作为编码器。我们还展示了我们的方法具有良好的泛化能力。CLEmbryo在我们的牛ECS数据集和公开可用的NYU小鼠胚胎数据集上均优于现有最先进的方法。

英文摘要

Videomicroscopy, when combined with machine learning, offers a promising approach for studying the early development of in vitro produced (IVP) embryos. However, manually annotating developmental events, and more specifically cell divisions, is time-consuming for a biologist and cannot scale up for practical applications. We aim to automatically classify the cell stages of embryos from 2D time-lapse microscopy videos with a deep learning approach. We focus on the analysis of bovine embryonic development using video microscopy, as we are primarily interested in the application of cattle breeding, and we have created a Bovine Embryos Cell Stages (ECS) dataset. The challenges are three-fold: (1) low-quality images and bovine dark cells that make the identification of cell stages difficult, (2) class ambiguity at the boundaries of developmental stages, and (3) imbalanced data distribution. To address these challenges, we introduce CLEmbryo, a novel method that leverages supervised contrastive learning combined with focal loss for training, and the lightweight 3D neural network CSN-50 as an encoder. We also show that our method generalizes well. CLEmbryo outperforms state-of-the-art methods on both our Bovine ECS dataset and the publicly available NYU Mouse Embryos dataset.
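
The abstract names focal loss as the ingredient aimed at imbalanced data. The sketch below shows only that term in its standard multi-class form, the mean of -(1 - p_t)^gamma * log(p_t), and does not attempt to reproduce the supervised contrastive part or the CSN-50 encoder. The function name, the example probabilities, and gamma = 2 are illustrative assumptions.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Multi-class focal loss on predicted class probabilities.

    probs: (N, K) rows of class probabilities; labels: (N,) integer classes.
    The factor (1 - p_t)**gamma down-weights examples that are already
    classified confidently, so hard or rare stages dominate the gradient.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_true = probs[np.arange(len(labels)), labels]
    p_true = np.clip(p_true, 1e-12, 1.0)  # avoid log(0)
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))

probs = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical two-stage predictions
labels = [0, 1]
print(focal_loss(probs, labels, gamma=0.0), focal_loss(probs, labels, gamma=2.0))
```

With gamma = 0 the expression reduces to ordinary cross-entropy, so gamma controls how strongly easy examples are discounted.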

2605.18132 2026-05-19 cs.CV cs.AI 版本更新

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

谁生成了这个3D资产？学习生成3D模型的来源归属

Sihan Ma, Siyuan Liang, Dacheng Tao

Affiliations College of Computing & Data Science, Nanyang Technological University, Singapore

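AI summary This study proposes a method for determining which generative model created a given 3D asset. By building the first passive source attribution benchmark, it finds that generative 3D models leave stable fingerprints, establishing a new benchmark for trustworthy 3D content provenance.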

详情

AI中文摘要

生成3D模型被应用于游戏、机器人和沉浸式创作，因此来源归属至关重要：给定一个3D资产，我们能否确定并识别出是哪种生成模型创建的？该问题面临两个核心挑战：分散的归属信号，其中3D指纹分布在多视角、几何和频率域提示中；以及现实部署约束，其中稀少的标签、退化的提示和混合真实/合成资产会破坏归属的可靠性。为了系统研究该问题，我们构建了迄今为止首个被动来源归属基准，涵盖22种代表性的3D生成器，在标准、少样本和现实部署协议下。基于此基准，我们发现生成3D模型留下两种稳定的指纹：跨视角不一致性和体现在几何统计和频率域提示中的结构伪影。为了捕捉这些分散的信号，我们提出了一种层次多视角多模态Transformer，融合每个视角的外观、几何和频率域特征，并在跨视角建模全局关系。大量实验表明性能优异，在全监督下达到97.22%的准确率，在仅有1%训练数据时达到77.17%的准确率，对应每个生成器少于五个样本。这些结果表明现代3D生成器留下稳定且可归属的指纹，建立了可信3D内容来源的新基准和方法论基础。

英文摘要

Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.

2605.18130 2026-05-19 cs.CV 版本更新

Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis

Rad-VLSM：一种结合语义辅助提示的跨模态框架用于医学分割与诊断

Fengyi Zhang, Xujie Zeng, Mohan Liu, Zengyi Wang, Yalong Jiang

Affiliations Student Member, IEEE; Member, IEEE

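AI summary This paper proposes Rad-VLSM, a framework whose semantics-guided prompting mechanism improves the accuracy of medical image segmentation and diagnosis and mitigates the tendency of existing models to be distracted by background tissues and irrelevant visual correlations.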

详情

AI中文摘要

医学图像分割在支持诊断而非仅仅生成病变掩码时更具临床价值。然而，诊断相关的病变线索往往微妙且局部化，而现有模型可能受背景组织、声学伪影和无关视觉相关性干扰。为了解决这个问题，我们提出了Rad-VLSM，一种两阶段跨模态框架，用于语义辅助的病变聚焦、鲁棒分割和视觉基础诊断。第一阶段中，基于BLIP-2的视觉-语言对齐模块在语义引导下识别病变相关候选区域，并将其转换为框提示。第二阶段中，这些提示被输入基于SAM的多任务网络，其中多候选区域聚合策略提高提示稳定性并引导病变分割。预测的掩码随后用作诊断的空间先验，视觉-放射组学融合头将病变感知的视觉特征与选定的放射组学描述符整合。通过使用语义信息进行定位而非直接预测，Rad-VLSM减少了文本到诊断的依赖，并将诊断基于病变层面的证据。在私有临床乳腺超声数据集和公共基准测试中，Rad-VLSM在分割和诊断性能方面表现强劲，具有良好的泛化能力。

英文摘要

Medical image segmentation is more clinically valuable when it supports diagnosis rather than merely producing lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, while existing models may be distracted by background tissues, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks are then used as spatial priors for diagnosis, and a visual-radiomics fusion head integrates lesion-aware visual features with selected radiomics descriptors. By using semantic information for localization rather than direct prediction, Rad-VLSM reduces text-to-diagnosis dependence and grounds diagnosis in lesion-level evidence. Experiments on a private clinical breast ultrasound dataset and public benchmarks show that Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.

2605.18115 2026-05-19 cs.CV updated version

An Ensemble of Embedded ConvNets: A Lightweight Approach to Arabic Handwritten Character Recognition

Mohsine El Khayati, Rachid Elouahbi, Abdelillah Semma

Affiliations: Systems theory and informatics laboratory; Moulay Ismail University of Meknes; Laboratory of Computer Science and Applications; Computer Science Dept.

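AI summary: This paper proposes a method that combines lightweight embedded convolutional networks with ensemble learning for Arabic handwritten character recognition. Experiments verify the accuracy advantage of the lightweight models and the performance gain from ensembling.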

Comments Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026

2605.18058 2026-05-19 cs.CV updated version

Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models

Mohsine EL Khayati, Abdelillah Semma, Abdelaziz Courr, Rachid Elouahbi

Affiliations: Systems theory and informatics laboratory; Moulay Ismail University of Meknes; Department of Computer Science; EST of Sidi Bennour; Chouaib Doukkali University; Faculty of Education Sciences; University Mohammed V; Laboratory of Computer Science and Applications

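AI summary: This study examines the vulnerability of Arabic handwriting recognition systems to black-box adversarial attacks. Experiments show that high-accuracy models are easily attacked, underscoring the need to strengthen model security and reliability.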

Comments Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026

Abstract

Arabic handwriting recognition (AHR) has made significant progress with deep learning models. AHR research has largely focused on performance, with security receiving little attention. This study opens what appears to be a new line of inquiry by demonstrating the vulnerability of high-performing models to adversarial black-box attacks. The focus on black-box attacks reflects real-world scenarios where the attacker has no prior knowledge of the model architecture. Extensive experiments were conducted on two benchmark AHR datasets containing Arabic handwritten characters. Results demonstrated the effectiveness of the attacks, with the Pixle attack achieving an attack success rate of 99-100\% on most models. Other, less aggressive attacks achieved success rates of 50-96\% across most experiments. Despite the high attack success rates, the attacks preserve the structural integrity of the characters, making the perturbations almost imperceptible to the human eye. The findings indicate that the studied models are highly vulnerable to adversarial manipulation. This underscores the need to strengthen efforts to secure these models and ensure their reliability in real-world AHR applications.

2605.18054 2026-05-19 eess.IV cs.CV cs.MM updated version

CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery

Tung-I Chen, Lingdong Wang, Subhransu Maji, Ramesh K. Sitaraman

Affiliations: University of Massachusetts Amherst

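AI summary: This paper proposes CATRF, a codec-adaptive tri-plane radiance field method for volumetric content delivery. During training, 2D feature planes are quantized and packed into codec-friendly canvases and passed through a standard codec roundtrip, so the non-differentiable codec pipeline sits inside the training loop and the radiance-field features adapt directly to client-side codec distortions without any learned codec parameters.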

Abstract

Volumetric media promises next-generation content delivery applications, but its bandwidth demand remains a key bottleneck. Implicit and hybrid volumetric representations reduce model sizes, yet still require careful coding to reach 2D video-like bitrates. We present CATRF, a standard-codec-in-the-loop compression framework for plane-factorized radiance fields. During training, we quantize and pack 2D feature planes into codec-friendly canvases, run a standard codec roundtrip (JPEG/VP9/HEVC/AV1), then unpack and dequantize the decoded features before volume rendering. We use a straight-through estimator (STE) to insert the non-differentiable, standard codec pipeline into the training loop, allowing radiance-field features to adapt directly to the real, client-side codec distortions without introducing any learned codec parameters. On both static and dynamic benchmarks, CATRF consistently achieves a better rate-distortion trade-off over codec-agnostic and learned-codec-in-the-loop baselines, and also outperforms recent compressed 3DGS methods in both compression efficiency and decoding speed. These results highlight a practical path toward low-bitrate, compression-resilient volumetric representations for free-viewpoint video streaming.
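
The straight-through estimator described in this abstract can be illustrated with a minimal sketch. It assumes a uniform scalar quantizer as a stand-in for the real codec roundtrip (the abstract does not specify the packing or quantization details), and the names `codec_roundtrip`, `ste_backward`, `training_step`, `step`, and `lr` are hypothetical, not taken from the paper.

```python
def codec_roundtrip(features, step=0.1):
    """Stand-in for a lossy codec roundtrip: uniform quantize, then dequantize."""
    return [round(f / step) * step for f in features]


def ste_backward(grad_decoded):
    """Straight-through estimator: treat the roundtrip as identity in the backward pass."""
    return list(grad_decoded)


def training_step(features, target, lr=0.5, step=0.1):
    """One gradient step on features that the loss only sees after the lossy roundtrip."""
    decoded = codec_roundtrip(features, step)
    n = len(features)
    # Mean squared error computed on the decoded (distorted) features.
    loss = sum((d - t) ** 2 for d, t in zip(decoded, target)) / n
    grad_decoded = [2.0 * (d - t) / n for d, t in zip(decoded, target)]
    # The quantizer has zero gradient almost everywhere, so the STE copies
    # the gradient w.r.t. the decoded features onto the raw features.
    grad_features = ste_backward(grad_decoded)
    updated = [f - lr * g for f, g in zip(features, grad_features)]
    return updated, loss
```

Calling `training_step` repeatedly moves the raw features toward values whose decoded version matches the target, which is the sense in which features can adapt to codec distortion without any learned codec parameters.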

2605.18052 2026-05-19 cs.CV updated version

Efficient 3D Content Reconstruction and Generation

Jiahao Li

Affiliations: Toyota Technological Institute at Chicago

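AI summary: This thesis presents efficient methods for 3D content generation and reconstruction. It combines multi-view diffusion with sparse-view 3D reconstruction to generate high-quality 3D assets, and develops the FastMap pipeline to speed up 3D reconstruction while maintaining comparable pose accuracy.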

Abstract

Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to 10x speedup over prior state-of-the-art by using first-order optimization with fused GPU kernels extensively, while maintaining comparable pose accuracy and downstream novel view synthesis quality.

2605.18041 2026-05-19 cs.CV updated version

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

Morunliu Yang, Ruotao Xu, Le Li, Yue Wang, Jianxin Zhang, Juntao Li, Yihang Lou, Siwei Feng, Peifeng Li

Affiliations: Soochow University; Peking University

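AI summary: This paper proposes OmniSelect, a training-free, modality-adaptive token pruning framework that improves the efficiency of omni-modal large language models by dynamically selecting a compression strategy. A lightweight AudioCLIP model estimates cross-modal relevance, and fine-grained token pruning is then performed within each temporal group according to the relevance scores, giving efficient multimodal token compression with no additional training cost.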

Abstract

Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose $\textbf{OmniSelect}$, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.
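
The regime selection and per-group pruning described in this abstract might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the `margin` threshold, the `base` and `shift` keep ratios, and the function names are hypothetical, and the relevance scores are taken as given rather than computed by AudioCLIP.

```python
def select_regime(audio_rel, video_rel, margin=0.1):
    """Pick a pruning regime from query-to-modality relevance scores (threshold is assumed)."""
    if audio_rel - video_rel > margin:
        return "audio-centric"
    if video_rel - audio_rel > margin:
        return "video-centric"
    return "uniform"


def keep_ratios(regime, base=0.5, shift=0.25):
    """Return (audio_keep_ratio, video_keep_ratio) for a regime (values are assumed)."""
    if regime == "audio-centric":
        return min(1.0, base + shift), max(0.0, base - shift)
    if regime == "video-centric":
        return max(0.0, base - shift), min(1.0, base + shift)
    return base, base


def prune_group(scores, keep_ratio):
    """Keep the highest-scoring tokens of one temporal group, preserving their order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

For example, a query with audio relevance 0.8 and video relevance 0.3 falls in the audio-centric regime, so audio tokens in each temporal group are pruned less aggressively than video tokens.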

2605.18039 2026-05-19 cs.CV updated version

Functionalization via Structural Completion and Motion Rectification

Mingrui Zhao, Sai Raj Kishore Perla, Kai Wang, Sauradip Nag, Duc Anh Nguyen, Jiayi Peng, Ruiqi Wang, Angel X. Chang, Manolis Savva, Ali Mahdavi-Amiri, Hao Zhang

Affiliations: Simon Fraser University; ShanghaiTech University

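AI summary: This paper introduces a new task, object functionalization, which aims to turn visually plausible but non-functional 3D models into functional, physically operable ones. Functionalization is formulated as graph completion over a new functional graph representation, and a neural Graph Functionalizer (GraFu) completes the incomplete graph; the completed graph then drives 3D geometry generation and rectifies erroneous human-annotated and predicted motions.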

Abstract

Acquisition and creation of 3D assets have been largely view- or appearance-driven. As a result, existing digital 3D models often lack the requisite structural components to function as intended, such as joints, supports, interiors, or interaction elements. At the same time, even human-annotated motions are frequently error-prone, leading to physically implausible behavior. We introduce object functionalization, a novel task aimed at transforming visually plausible but non-functional 3D models into functional and physically operable ones. We formulate functionalization as a graph completion problem over a new functional graph representation, where labeled nodes represent object parts, labeled edges encode functional and contact relations, and movable nodes carry motion attributes, so that structural functional deficiencies manifest as missing nodes or incorrect edges. We develop a neural Graph Functionalizer (GraFu) to complete an incomplete graph representing a non-functional 3D object. The completed graph then drives a geometry realization stage that instantiates predicted connectors and structural elements in 3D, with the compelling side effect of rectifying erroneous human-annotated and predicted motions. To support training and evaluation, focusing on furniture as a rich and challenging target category, we introduce FurFun-233, a dataset of 233 paired non-functional and functionalized furniture models. On PartNet-Mobility ("zero-shot") and HSSD test sets, our method matches state-of-the-art methods in motion prediction accuracy while substantially improving functionality in terms of collision and connectivity.

2605.18006 2026-05-19 eess.IV cs.CV cs.MM updated version

Inter-LPCM: Learning-based Inter-Frame Predictive Coding for LiDAR Point Cloud Compression

Chang Sun, Hui Yuan, Shiqi Jiang, Chongzhen Tian, Guanghui Zhang, Raouf Hamzaoui

Affiliations: School of Control Science and Engineering, Shandong University; Key Laboratory of Machine Intelligence and System Control, Ministry of Education; School of Computer Science and Technology, Shandong University; School of Engineering and Sustainable Development, De Montfort University

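AI summary: This paper proposes Inter-LPCM, a learning-based inter-frame predictive coding method that improves geometry redundancy removal in LiDAR point cloud compression. It combines a delta coding strategy, an inter-frame radius prediction model, and a lightweight attention-based prediction model with RD-optimized quantization and a separate entropy coding model for each spherical coordinate component to improve compression efficiency and quality.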

Comments 14 pages, 12 figures

Abstract

Because LiDAR sensors acquire point clouds with a fixed angular resolution, the resulting data can be systematically parameterized and efficiently compressed in the spherical coordinate system. Traditional spherical coordinate-based point cloud compression methods have demonstrated strong rate-distortion (RD) performance, with the predictive geometry coding (PredGeom) method in the geometry-based point cloud compression (G-PCC) standard being a prominent example. Although PredGeom includes an inter-frame prediction mode, it relies on a simple linear model, which limits its ability to capture complex motion patterns and structural dependencies. Meanwhile, existing learning-based compression methods in the spherical domain do not exploit inter-frame correlations to reduce geometry redundancy. To address these limitations, we propose a learning-based inter-frame predictive coding method, termed Inter-LPCM. For azimuth prediction, we employ a delta coding strategy based on the predefined angular resolution. To improve radius compression, we introduce an inter-frame radius predictive (Inter-RP) model that estimates the current point's radius using neighboring points from both the current frame and the registered reference frame. In addition, we design a lightweight attention-based prediction (LAEP) model to predict elevation angles by capturing long-range geometric correlations across different coordinates. For quantization, we propose an RD-optimized method to select quantization steps in the spherical coordinate system. For entropy coding, we design distinct models for each spherical coordinate component. These models are adapted to the statistical priors of each coordinate, enabling more accurate probability estimation. Our source code is publicly available at https://github.com/SDUChangSun/Inter-LPCM
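
The spherical parameterization and the azimuth delta coding mentioned in this abstract can be sketched as follows. This is only an illustration of the general idea under simple assumptions: azimuth is quantized to an integer index by dividing by a fixed angular resolution, and consecutive indices are differenced. The function names and the axis convention are hypothetical and are not taken from the Inter-LPCM source code.

```python
import math


def to_spherical(x, y, z):
    """Convert a Cartesian point to (radius, azimuth, elevation) in radians."""
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    elevation = math.asin(z / r) if r > 0 else 0.0
    return r, azimuth, elevation


def azimuth_indices(azimuths, resolution):
    """Quantize azimuth angles to integer steps of the sensor's angular resolution."""
    return [round(a / resolution) for a in azimuths]


def delta_encode(indices):
    """Store each index as the difference from its predecessor (first is relative to 0)."""
    deltas, prev = [], 0
    for i in indices:
        deltas.append(i - prev)
        prev = i
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    indices, prev = [], 0
    for d in deltas:
        prev += d
        indices.append(prev)
    return indices
```

Because a spinning LiDAR samples azimuth at nearly constant steps, most deltas are small repeated integers, which are cheap for an entropy coder to represent.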

2605.17997 2026-05-19 cs.LG cs.AI cs.CV updated version

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

Le Su, Xing Luo, Zhi Jin

Affiliations: Peng Cheng Laboratory

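AI summary: This paper proposes MARR, a module-adaptive residual reconstruction method that assigns each module its own scaling coefficient to balance residual-related HA bias against accumulated-error correction, improving performance under low-bit quantization.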

Abstract

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, we show that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over state-of-the-art residual reconstruction methods. Code will be made publicly available upon acceptance.
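
A generic PID update of a per-module coefficient, driven by reconstruction error, could be sketched as below. This is not the authors' update rule: the gains, the clamping range, the class name `PIDCoefficient`, and in particular the sign convention (a positive error lowers the coefficient) are all assumptions made for illustration, since the abstract does not state them.

```python
class PIDCoefficient:
    """Hypothetical PID controller for one module's residual scaling coefficient."""

    def __init__(self, init=1.0, kp=0.5, ki=0.1, kd=0.05, lo=0.0, hi=1.0):
        self.value = init
        self.kp, self.ki, self.kd = kp, ki, kd
        self.lo, self.hi = lo, hi
        self.integral = 0.0
        self.prev_error = None

    def update(self, recon_error, target=0.0):
        """Refine the coefficient from the latest reconstruction error."""
        e = recon_error - target
        self.integral += e  # accumulated error (I term)
        derivative = 0.0 if self.prev_error is None else e - self.prev_error  # D term
        self.prev_error = e
        correction = self.kp * e + self.ki * self.integral + self.kd * derivative
        # Assumed sign convention: a positive error reduces the coefficient.
        self.value = min(self.hi, max(self.lo, self.value - correction))
        return self.value
```

One controller instance per module would give each module its own coefficient trajectory, avoiding a grid search over coefficients.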

2605.17990 2026-05-19 cs.CV cs.HC updated version

Low Latency Gaze Tracking via Latent Optical Sensing

Yidan Zheng, Matheus Souza, Kaizhang Kang, Qiang Fu, Hadi Amata, Wolfgang Heidrich

Affiliations: King Abdullah University of Science and Technology

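AI summary: This paper presents a real-time gaze tracking system that acquires task-relevant latent features directly through a fully passive optical encoder. A microlens array with a co-designed binary chromium mask performs spatially multiplexed optical encoding, producing a compact set of measurements sufficient for estimating gaze direction, which reduces computational overhead and latency.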

Abstract

We present a real-time gaze tracking system that directly acquires task-relevant latent features using a fully passive optical encoder. Instead of forming and processing full-resolution images, our approach leverages a microlens array with a co-designed binary chromium mask to perform spatially multiplexed optical encoding, producing a compact set of measurements sufficient for gaze estimation. By integrating sensing and feature extraction in the optical domain, the proposed system eliminates the need for high-bandwidth image readout and substantially reduces computational overhead. The encoded measurements are captured by a 4 x 4 phototransistor array and mapped to gaze direction using a lightweight neural network. Our proof-of-concept prototype enables an end-to-end sensing-to-inference latency of 3.4 ms, outperforming published research systems. We demonstrate the effectiveness of our approach on both simulated and real-world data, achieving competitive gaze estimation accuracy while significantly improving latency and energy efficiency compared to conventional camera-based pipelines. This work highlights the potential of task-driven optical sensing for ultra-low-latency, computationally efficient human-computer interaction systems.

2605.17984 2026-05-19 eess.IV cs.CV cs.RO updated version

See Silhouettes in Motion with Neuromorphic Vision

Pei Zhang, Shijie Lin, Zhou Ge, Jinpeng Chen, Wei Pu

Affiliations: School of Electrical Engineering, Guangxi University; Department of Computer Science, The University of Hong Kong; School of Mechatronic Engineering and Automation, Shanghai University; SHU General Intelligent Robotics Research Institute; School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications; School of Information and Communication Engineering, University of Electronic Science and Technology of China

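AI summary: This paper proposes a dual-modal approach that exploits the synergy between frames and events to achieve real-time, high-frame-rate binarization on CPU-only devices. It reduces motion blur and improves performance under harsh lighting, paving the way for lightweight perception and interaction on resource-constrained edge platforms.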

Comments 12 pages, 12 figures, and 3 tables. This work is under review. Project page: https://github.com/pz-even/event_binarization

Abstract

Quasi-bimodal objects, such as text, road signs, and barcodes, play a basic yet vital role in daily visual communication. By boiling these down to clear silhouettes, binarization uses a minimal language to convey essential vision cues for maximum downstream efficiency. The catch is that frame-based imaging often struggles on mobile platforms like drones, self-driving cars, and underwater vehicles. In these dynamic scenes, rapid motion and harsh lighting can blind the camera, causing severe motion blur and erasing crucial details. To overcome these limits, neuromorphic vision via event cameras, featuring microsecond-level temporal resolution and high dynamic range, steps in as a natural solution. Building upon this event-driven sensing paradigm, we introduce a simple yet effective dual-modal approach that harnesses the synergy between frames and events to achieve real-time, high-frame-rate binarization on CPU-only devices. Extensive evaluations show that it achieves competitive performance against leading techniques in reducing motion blur, while delivering notable improvements under challenging illumination. In addition, our asynchronous workflow bypasses the event scarcity that breaks traditional time-binning reconstruction, maintaining clear target shapes even at extreme kilohertz frame rates. Its binary results further serve as reliable representations that facilitate a range of downstream tasks. This work paves the way towards lightweight perception and interaction in embodied intelligence on resource-constrained edge platforms.

2605.17980 2026-05-19 cs.CV updated version

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

Bin Luo, Runmin Dong, Zhaoyang Luo, Jinxiao Zhang, Jiyao Zhao, Fan Wei, Haohuan Fu

Affiliations: Tsinghua Shenzhen International Graduate School, Shenzhen, China; Sun Yat-sen University, Zhuhai, China; National Supercomputing Center in Shenzhen, Shenzhen, China; Tsinghua University, Beijing, China

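AI summary: This paper proposes DS-DiT, a Decoupled Siamese Diffusion Transformer that decouples low-resolution and reference interactions at the attention level, addressing both over-reliance on and underutilization of reference information in reference-based super-resolution and improving generation quality.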

Abstract

Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, existing methods often suffer from a trade-off between over-reliance on reference information, which leads to texture artifacts, and underutilization, which results in insufficient detail recovery. To address these issues, we propose DS-DiT, a Decoupled Siamese Diffusion Transformer method that decouples low-resolution and reference interactions at the attention level. By enabling low-resolution structural priors and reference texture information to interact independently with the noisy latent, the framework effectively mitigates inter-source competition. Furthermore, to compensate for the limited local modeling ability of global attention, we introduce a Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources. In addition, this siamese architecture facilitates an autoguidance strategy during inference, which enhances reconstruction by exploiting the prediction discrepancy between strong and weak reference conditions. This approach boosts generation quality without additional training. Experimental results across multiple datasets and scaling factors demonstrate that DS-DiT outperforms existing methods in both quantitative metrics and visual fidelity.
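
One common way to exploit the discrepancy between a strongly and a weakly conditioned prediction is linear extrapolation, as in classifier-free-guidance-style samplers. The sketch below assumes that form; the abstract does not give the actual autoguidance formula, so `autoguide` and its `scale` parameter are hypothetical.

```python
def autoguide(pred_strong, pred_weak, scale=1.5):
    """Extrapolate from the weak-reference prediction toward the strong-reference one.

    scale = 0 returns the weak prediction, scale = 1 the strong one,
    and scale > 1 amplifies the difference between them (assumed form).
    """
    return [w + scale * (s - w) for s, w in zip(pred_strong, pred_weak)]
```

Both predictions would come from the same network evaluated under two conditioning strengths, which is why this kind of guidance needs no additional training.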

2605.17969 2026-05-19 cs.CV updated version

Generation Navigator: A State-Aware Agentic Framework for Image Generation

Jinming Liu, Ruoyu Feng, Yuqi Wang, Wenjun Zeng, Xin Jin

Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology; Independent

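AI summary: This paper proposes Generation Navigator, a state-aware agentic framework for image generation that reformulates image generation as a state-conditioned action-making problem. Its PRE-GRPO algorithm addresses the credit assignment problem that arises in reinforcement learning training, improving generation quality and reasoning accuracy.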

Abstract

Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.
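
A trajectory-level reward with peak, retention, and efficiency terms, followed by group-relative normalization, could be sketched as below. The exact PRE-GRPO objective is not given in the abstract, so the additive form, the weights `w_ret` and `w_eff`, and the function names are assumptions for illustration only.

```python
import statistics


def pre_reward(qualities, w_ret=0.5, w_eff=0.05):
    """Score one multi-turn rollout from its per-turn image quality scores.

    Peak: best quality reached in any turn.
    Retention: penalty for ending below the peak.
    Efficiency: penalty for each turn beyond the first.
    """
    peak = max(qualities)
    retention_loss = peak - qualities[-1]
    extra_turns = len(qualities) - 1
    return peak - w_ret * retention_loss - w_eff * extra_turns


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within a group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this sketch, a rollout that reaches quality 0.9 in one turn scores higher than one that reaches 0.9 and then degrades to 0.7 over two more turns, which is the distinction a single final-state reward would miss.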

2605.17954 2026-05-19 cs.CV cs.AI cs.LG updated version

A More Word-like Image Tokenization for MLLMs

Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee

Affiliations: Seoul National University; Ewha Womans University

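AI summary: This paper proposes a decoupled visual tokenization method (DiVT) that clusters image patch embeddings into semantic units so that each token corresponds to a distinct visual concept, improving the performance and efficiency of multimodal models.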

Journal ref Proceedings of the IEEE/CVF International Conference on Pattern Recognition and Computer Vision (CVPR), 2026

SurgLQA: Scalable Long-Horizon Surgical Video Question Answering

Diandian Guo, Xikai Yang, Ruiyang Li, Jialun Pei, Pheng-Ann Heng

Affiliations: The Chinese University of Hong Kong

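AI summary: This paper proposes the SurgLQA framework, which combines Faithful Temporal Consolidation with Temporally-Grounded Multi-Policy Scaling to model long-range dynamics in long-horizon surgical video question answering and improve reasoning over surgical workflows.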

Comments MICCAI 2026 Early Accept

Abstract

Surgical Video Question Answering (VideoQA) provides a promising paradigm for dynamic intraoperative interpretation, enabling real-time decision support and context-aware retrieval in clinical environments. Nevertheless, existing approaches are predominantly restricted to images or short clips, limiting their ability to model long-range procedural dynamics and causal dependencies across extended surgical workflows. To address this challenge, we propose SurgLQA, a unified long-horizon VideoQA framework for scalable surgical reasoning. This framework incorporates Faithful Temporal Consolidation (FTC), which leverages intrinsic temporal cues to construct compact long-range representations while preserving fine-grained temporal fidelity. Further, we develop Temporally-Grounded Multi-Policy Scaling (TMS), an adaptive test-time inference paradigm that strategically adjusts policy-level reasoning capacity within temporally grounded contexts. To facilitate systematic evaluation, we restructured a long-duration colonoscopy VideoQA benchmark, Colon-LQA, and conducted extensive experiments on Colon-LQA and REAL-Colon-VQA. Experimental results demonstrate that our approach achieves consistent performance gains in long-range reasoning with temporally grounded inference. Code link: https://github.com/RascalGdd/SurgLQA.

2605.17912 2026-05-19 cs.RO cs.CV updated version

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Yu Shang, Yinzhou Tang, Yiding Ma, Zhuohang Li, Lei Jin, Weikang Su, Xin Jin, Zhaolu Wang, Ziyou Wang, Xin Zhang, Haisheng Su, Weizhen He, Wei Wu, Haoyi Duan, Gordon Wetzstein, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Chen Gao, Yong Li

Affiliations: Tsinghua University; Shanghai Jiao Tong University; Zhejiang University; Stanford University; The University of Hong Kong; Princeton University; Chinese Academy of Sciences; University of Science and Technology of China; Peking University; National University of Singapore

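AI summary: This paper introduces WorldArena 2.0, which extends embodied world model evaluation along three dimensions (modality, functionality, and platform) and provides a comprehensive testbed for tracking progress in embodied world models.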

英文摘要

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned future and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 comprehensively evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

URL PDF HTML ☆

赞 0 踩 0

2605.17907 2026-05-19 cs.CV cs.AI 版本更新

One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

一个模型翻译它们所有：面向异构协作感知的通用任意到任意翻译

Yang Li, Weize Li, Quan Yuan, Congzhang Shao, Guiyang Luo, Yunqi Ba, Xuanhan Zhu, Xinyuan Ding, Xiaoyuan Fu, Jinglin Li

发表机构 * State Key Laboratory of Networking and Switching Technology（网络与交换技术国家重点实验室）

AI总结本文提出UniTrans，一种通用任意到任意特征模态翻译模型，通过预训练一组翻译专家参数并学习其组合系数来实现零样本翻译，从而在OPV2V-H和DAIR-V2X数据集上实现了优于现有方法的性能。

Comments 19 pages, accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

通过共享中间特征，协作感知扩展了每个代理的感知能力，但现实世界中的特征模态异质性仍然是有效融合的关键障碍。大多数现有方法，包括直接适应和协议基于的转换，通常依赖于为新出现的特征模态训练适配器，往往需要额外的重新训练或微调。这种重复训练成本高，并且由于模型和数据隐私限制，在跨制造商之间不可行，限制了现实世界的可扩展性。为了解决这个问题，我们提出了UniTrans，一种通用的任意到任意特征模态翻译模型，该模型可以即时实例化任意模态的翻译器。UniTrans预训练了一组翻译专家参数，并学习其组合系数作为源到目标模态映射的函数。映射是在模态内在的潜在空间中进行测量，其中内在编码器从单帧中间特征中提取模态特定但场景不变的代码，使UniTrans能够以零样本的方式实例化翻译器。在OPV2V-H和DAIR-V2X上的实验表明，UniTrans在模拟和现实世界中均优于现有方法，通过通用模型实现了高效的任意到任意翻译。代码可在https://github.com/CheeryLeeyy/UniTrans上获得。

英文摘要

By sharing intermediate features, collaborative perception extends each agent's sensing beyond standalone limits, but real-world feature modality heterogeneity remains a key barrier to effective fusion. Most existing methods, including direct adaption and protocol-based transformation, typically rely on training adapters for newly emerging feature modalities and often require additional retraining or fine-tuning. Such repeated training is costly and is often infeasible across manufacturers due to model and data privacy constraints, limiting real-world scalability. To address this issue, we propose UniTrans, a universal any-to-any feature modality translation model that instantiates translators on the fly for arbitrary modalities. UniTrans pretrains a bank of translator expert parameters and learns their combination coefficients as a function of source-to-target modality mapping. The mapping is measured in a modality-intrinsic latent space, where an intrinsic encoder extracts modality-specific yet scene-invariant codes from single-frame intermediate features, enabling UniTrans to instantiate translators in a zero-shot manner. Experiments on OPV2V-H and DAIR-V2X demonstrate that UniTrans consistently outperforms state-of-the-art methods in both simulated and real-world settings, enabling efficient any-to-any translation through a universal model. The code is available at https://github.com/CheeryLeeyy/UniTrans.

URL PDF HTML ☆

赞 0 踩 0

2605.17904 2026-05-19 cs.CV 版本更新

通过运动诱导采样用消费级LiDAR成像隐藏物体

Siddharth Somasundaram, Aaron Young, Akshat Dave, Adithya Pediredla, Ramesh Raskar

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Dartmouth College（达特茅斯学院）

AI总结本文提出了一种多帧融合策略，利用运动诱导孔径采样模型，在消费级LiDAR上实现了非线视成像，实现了隐藏物体的3D重建、多物体跟踪和相机定位，并展示了消费级硬件无需额外设置即可实现非线视成像的潜力。

详情

DOI: 10.1038/s41586-026-10502-x

英文摘要

LiDARs are being increasingly deployed for consumer imaging in handheld, wearable, and robotic applications. These sensors can capture the time-of-flight of light at picosecond resolution, which in principle, enables them to capture information about objects hidden from their field of view. While such non-line-of-sight (NLOS) imaging capabilities have been shown on research-grade LiDARs, they are challenging to achieve on consumer devices due to poor signal quality resulting from low laser power, low spatial resolution, and object and camera motion. Inspired by burst photography and synthetic aperture radar, we propose a multi-frame fusion strategy to overcome these challenges and demonstrate NLOS imaging on consumer LiDAR. We first introduce the motion-induced aperture sampling model to unify the effects of object shape, object motion, and camera motion under a single measurement model. Using this model, we demonstrate several NLOS capabilities on a smartphone-grade LiDAR: (1) 3D reconstruction, (2) single and multi-object tracking, and (3) camera localization using hidden objects. Previously, NLOS imaging capabilities were largely restricted to bulky and expensive research-grade hardware that requires extensive setup and calibration. Our results represent a shift towards plug-and-play NLOS imaging, where anyone can image hidden objects with off-the-shelf hardware ($<100) and no additional setup. We believe that democratization of such capabilities will advance consumer applications of NLOS imaging.

URL PDF HTML ☆

赞 0 踩 0

2605.17850 2026-05-19 stat.ML cs.CV cs.LG cs.NA math.NA math.PR 版本更新

Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures

通过路径测度的序列蒙特卡洛实现扩散模型的简单近似与无导数推理时间缩放

Chenyang Wang, Weizhong Wang, Yinuo Ren, Jose Blanchet, Yiping Lu

发表机构 * School of Mathematical Sciences, Peking University, Beijing, China ； School of Mathematical Sciences, Fudan University, Shanghai, China ； Department of Industrial Engineering & Management Sciences, Northwestern University, Evanston, IL, United States ； Institute for Computational & Mathematical Engineering, Stanford University, Stanford, CA, United States ； Management Science & Engineering, Stanford University, Stanford, CA, United States

AI总结本文提出URGE算法，一种无需梯度的推理时间缩放方法，通过路径重要性重加权提升扩散模型样本质量，同时在合成测试和扩散模型基准中表现出色，且实现简单且无梯度依赖。

Comments accepted by ICML 2026

详情

AI中文摘要

扩散生成模型越来越多地依赖推理时引导（添加漂移项或对专家混合进行重加权）来提高针对任务特定目标的样本质量。然而，大多数现有技术需要重复评估分数或梯度，从而引入偏差、高计算开销或两者兼有。我们提出URGE（Unbiased Resampling via Girsanov Estimation），一种无导数的推理时间缩放算法，通过Girsanov测度变换进行路径级重要性重加权。不同于先前工作中计算基于梯度的粒子权重，URGE为每条模拟轨迹附加一个简单的乘性权重，并定期重采样，无需评估分数、Hessian或偏微分方程。我们建立了路径级与粒子级SMC之间的等价性：Girsanov路径权重的后向条件期望可恢复先前的粒子级权重，从而保证两种方案产生相同的无偏终端分布。在实验上，URGE在合成测试和扩散模型基准上优于现有的推理时引导基线，生成质量更好，同时实现显著更简单且完全无需梯度。

英文摘要

Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting a mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce \texttt{URGE}, Unbiased Resampling via Girsanov Estimation, a derivative-free inference-time scaling algorithm that performs path-wise importance reweighting via a Girsanov change of measure. Instead of computing gradient-based particle weights as in previous work, \texttt{URGE} attaches a simple multiplicative weight to each simulated trajectory and periodically resamples. No score, no Hessian, and no PDE evaluation is required. We establish an equivalence between path-wise and particle-wise SMC: the Girsanov path weight admits a backward conditional expectation that recovers the previous particle-level weights, guaranteeing that both schemes produce the same unbiased terminal law. Empirically, \texttt{URGE} outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving better generation quality, while being significantly simpler to implement and fully gradient-free.
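The weight-and-resample loop described in the abstract can be illustrated with a generic sequential Monte Carlo skeleton. This is only a minimal sketch under stated assumptions: `run_smc`, `resample`, `step_fn`, and `weight_fn` are hypothetical names, multinomial resampling is one common choice rather than necessarily the paper's, and `weight_fn` is a placeholder for whatever per-step multiplicative factor is used (the Girsanov path weight itself is not reproduced here).

```python
import random


def resample(particles, weights, rng):
    """Multinomial resampling: draw particles in proportion to their weights,
    then reset all weights to 1 (assumed scheme, not taken from the paper)."""
    total = sum(weights)
    probs = [w / total for w in weights]
    chosen = rng.choices(particles, weights=probs, k=len(particles))
    return chosen, [1.0] * len(particles)


def run_smc(init_particles, step_fn, weight_fn, n_steps, resample_every, seed=0):
    """Propagate particles, multiply each trajectory's weight by a per-step
    factor, and resample every `resample_every` steps."""
    rng = random.Random(seed)
    particles = list(init_particles)
    weights = [1.0] * len(particles)
    for t in range(1, n_steps + 1):
        # Simulate one step of each trajectory (no gradients involved).
        new_particles = [step_fn(x, rng) for x in particles]
        # Attach the multiplicative weight increment to each trajectory.
        weights = [
            w * weight_fn(old, new)
            for w, old, new in zip(weights, particles, new_particles)
        ]
        particles = new_particles
        if t % resample_every == 0:
            particles, weights = resample(particles, weights, rng)
    return particles, weights


if __name__ == "__main__":
    # Toy usage: a random walk whose upward moves are up-weighted.
    final, w = run_smc(
        [0.0] * 8,
        step_fn=lambda x, rng: x + rng.gauss(0.0, 1.0),
        weight_fn=lambda old, new: 2.0 if new > old else 1.0,
        n_steps=4,
        resample_every=2,
    )
    print(final, w)
```

Resetting the weights to 1 after each resampling step keeps the particle set equally weighted between resampling times, which is the usual convention for this kind of scheme.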

URL PDF HTML ☆

赞 0 踩 0

2605.17834 2026-05-19 cs.CV 版本更新

Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation

稳定、扩展与增强MeanFlow用于大规模扩散蒸馏

Xiao He, Yang Li, Peizhen Zhang, Songtao Liu, Zhao Zhong, Nannan Wang

发表机构 * State Key Laboratory of Integrated Services Networks（综合业务网理论及关键技术国家重点实验室）； Xidian University（西安电子科技大学）； Tencent Hunyuan（腾讯混元）

AI总结本文提出了一种稳定MeanFlow的方法，通过引入暖启动技术并结合轨迹分布对齐，提高了大规模工业模型蒸馏的性能和泛化能力。

Comments 10 pages

详情

AI中文摘要

扩散模型表现出卓越的生成能力，但其高延迟限制了实际部署。许多研究尝试减少采样步骤以加速推理。其中，MeanFlow因其简洁的公式和显著的性能而受到关注。然而，其优化目标的不稳定性以及'均值偏置'限制了其在蒸馏大规模工业模型中的应用。为了稳定MeanFlow用于蒸馏大规模模型，我们首先引入了暖启动技术，其中MeanFlow的原始微分解法被替换为离散解。这种设计避免了由于MeanFlow目标包含来自未充分训练模型的stop-gradient项而导致的训练崩溃。一旦模型获得初步能力以拟合平均速度场，我们将其优化目标切换回微分解法，以实现进一步的细化。同时，为了缓解在极少数步推理中复杂目标分布下的'均值偏置'，我们将其纳入轨迹分布对齐作为辅助目标，鼓励学生模型的轨迹分布更接近教师模型的轨迹分布。我们提出的蒸馏框架在应用于文本到图像（T2I）模型FLUX.1-dev（高达12B参数）时，相比现有蒸馏方法表现更优。此外，当扩展到80B参数的最新状态（SOTA）T2I模型HunyuanImage 3.0时，我们的方法继续表现出稳健的泛化能力和强性能。

英文摘要

Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to reduce sampling steps to accelerate inference. Among them, MeanFlow has attracted considerable attention due to its concise formulation and remarkable performance. Nevertheless, the instability of its optimization objective and the ''mean-seeking bias'' have limited its applicability to distill large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique, in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids training collapse caused by the MeanFlow target containing a stop-gradient term from an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the ''mean-seeking bias'' of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model's trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework achieves superior performance compared to existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (up to 12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.

URL PDF HTML ☆

赞 0 踩 0

2605.17826 2026-05-19 cs.CV cs.AI 版本更新

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

CounterCount: 一种用于视觉语言模型计数偏差诊断的框架

Reem Alzahrani, Hassan Alshanqiti, Bushra Bin Hemid, Zaid Alyafeai, Abdelrahman Eldesokey, Bernard Ghanem

发表机构 * KAUST（阿卜杜拉国王科技大学）； University of Edinburgh（爱丁堡大学）； King Abdullah University of Science and Technology（阿卜杜拉国王科技大学）

AI总结本文提出CounterCount框架，通过对比事实性与反事实性图像来诊断视觉语言模型在计数任务中的偏差问题，揭示模型对物体级先验知识的依赖，并提出统一的注意力调节策略提升反事实计数准确性。

详情

AI中文摘要

视觉语言模型（VLMs）在多模态推理方面表现出色，但尚不清楚其答案是基于视觉证据还是由学习的语言和世界先验知识驱动。计数提供了一个精确的测试环境：当视觉证据与常识物体知识冲突时，模型必须依赖图像而非典型计数。我们引入CounterCount，一种用于VLMs的反事实计数诊断框架，包含配对的事实性和反事实性图像、编辑过的计数相关属性、验证答案和局部化证据注释。评估最近的VLMs，我们发现其在事实性图像上表现强劲，但在反事实属性变化下持续退化，表明即使存在矛盾的视觉证据，模型仍依赖物体级先验知识。利用局部化注释，我们发现这些失败不仅由于缺失或模糊的视觉证据，而是由于模型对计数相关视觉token的注意力权重不足。我们引入一种统一的推理时间注意力调节策略，重新加权所选的视觉token，使多个VLMs的反事实计数准确率提高高达8%。总体而言，CounterCount揭示了先验驱动的计数失败，并为设计未来的VLMs提供了诊断见解。

英文摘要

Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.

URL PDF HTML ☆

赞 0 踩 0

2605.17823 2026-05-19 cs.CV cs.AI 版本更新

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

我们为何注视所注视之处：最大化场景理解的中央凹视觉语言模型涌现出类人注视模式

Shravan Murlidaran, Ziqi Wen, Sana Shehabi, Miguel P. Eckstein

发表机构 * Psychological & Brain Sciences, University of California, Santa Barbara（加州大学圣芭芭拉分校心理学与脑科学系）； Electrical and Computer Engineering, University of California, Santa Barbara（加州大学圣芭芭拉分校电气与计算机工程系）； Computer Science, University of California, Santa Barbara（加州大学圣芭芭拉分校计算机科学系）

AI总结研究探讨了人类自由观看时注视模式的形成机制，发现最大化场景理解的中央凹视觉语言模型能够产生类似人类的注视模式，表明这种模式可能是优化场景理解的副产品。

2605.17822 2026-05-19 cs.CV 版本更新

Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection

释放傅里叶形状的表示能力以攻击红外目标检测

Yixing Yong, Jian Wang, Ming Lei, Lijun He, Fan Li

发表机构 * School of Information and Communications Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China（信息与通信工程学院，电子与信息工程学院，西安交通大学，西安，中国）； School of Physics, Xi'an Jiaotong University, Xi'an, China（物理学院，西安交通大学，西安，中国）

AI总结本文提出了一种基于傅里叶形状的红外目标检测攻击方法，通过引入可学习的傅里叶形状，克服了传统形状方法在表示能力和优化能力之间的根本权衡问题，实现了高效的梯度优化生成具有欺骗性的形状，使人类目标逃避检测。

详情

AI中文摘要

红外目标检测在自动驾驶和监控中至关重要，但仍然容易受到物理对抗攻击的威胁。与RGB域不同，攻击必须操控热信号，使得热阻材料的几何形状成为主要的对抗信息载体。当前基于形状的方法在表示能力和优化能力之间存在根本性的权衡，限制了攻击效果。在本文中，我们通过将可学习的傅里叶形状引入红外域，克服了这一困境。我们利用端到端可微框架，将一组紧凑的傅里叶系数，定义形状边界，通过 winding number theorem 解析地映射到像素空间的掩码。这使得能够通过梯度优化高效生成具有欺骗性的形状，使人类目标逃避检测。广泛的数字和物理实验提供了全面的评估，并验证了我们的优越性能。我们得到的物理贴片实现了惊人的鲁棒性，成功逃避了不同距离、角度、姿态和个体的检测器，且在距离大于25米（置信度=0.5）时攻击成功率超过88%。代码可在 https://github.com/Yongyx99/Fourier-shape-attack 上获得。

英文摘要

Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometric shape of heat-blocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack effectiveness. In this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5). Code is available at https://github.com/Yongyx99/Fourier-shape-attack.

URL PDF HTML ☆

赞 0 踩 0

2605.17818 2026-05-19 cs.CV 版本更新

Evidence-Guided Unknown Rejection for High-Confidence Near-Known Unknowns

基于证据的未知拒绝用于高置信度近似未知物

Xi Chen, Yingjun Xiao, Gang Fang

AI总结本文提出EGUR-A方法，通过改变决策方式从判断样本得分是否足够高到判断预测已知类别是否有足够证据接受样本，从而减少高置信度的误判接受。

Comments 8 pages, 2 figures, 8 tables

详情

AI中文摘要

开放集识别系统面临一个被忽视的失败模式：高置信度的近似未知物，这些样本位于已知标签集之外，但足够接近已知类别，使得闭合集分类器以高置信度接受它们。我们证明这种失败在标量阈值方法中普遍存在，包括最近的后处理检测器，并且更强的编码器可能放大而非消除风险。我们提出EGUR-A，将决策从『这个样本的得分是否足够高？』转变为『这个预测的已知类别是否有足够的证据来接受这个样本？』EGUR-A结合类别条件的局部接受证据与全局残差证据，并从已知样本统计中选择其相对权重，而无需未知验证数据。在CUB、FGVC-Aircraft和ImageNet-hard上，EGUR-A显著减少了在匹配已知拒绝操作点处的高置信度误判接受。结果不是更强的阈值，而是不同的问题：已知类别是否有权接受样本。

英文摘要

Open-set recognition systems face a neglected failure mode: high-confidence near-known unknowns, which lie outside the known label set but are close enough to known classes that a closed-set classifier accepts them with high confidence. We show that this failure is widespread across scalar-threshold methods, including recent post-hoc detectors, and that stronger encoders can amplify rather than remove the risk. We propose EGUR-A, which changes the decision from ``is this sample's score high enough?'' to ``does this predicted known class have sufficient evidence to accept this sample?'' EGUR-A combines class-conditional local acceptance evidence with global residual evidence, and selects their relative weight from known-sample statistics without unknown validation data. Across CUB, FGVC-Aircraft, and ImageNet-hard, EGUR-A substantially reduces high-confidence false known acceptance at matched known-rejection operating points. The result is not a stronger threshold; it is a different question: whether a known class is entitled to accept a sample.

URL PDF HTML ☆

赞 0 踩 0

2605.17807 2026-05-19 cs.CV cs.AI 版本更新

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

课程组策略优化：适应性采样以释放文本到图像生成的潜力

Baoteng Li, Xianghao Zang, Xinran Wang, Xiangyu Na, Zhixiang He, Hao Sun, Chi Zhang, Zhongjiang He, Tianwei Cao, Kongming Liang, Zhanyu Ma

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）； Institute of Artificial Intelligence (TeleAI), China Telecom（中国电信人工智能研究院）； Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance（北京多模态数据智能感知与治理重点实验室）

AI总结本文提出了一种适应性课程训练框架CGPO，通过动态调整采样策略来提高文本到图像生成的训练效率，同时解决多类别数据集中的数据不平衡问题。

详情

AI中文摘要

文本到图像（T2I）生成在近年来取得了显著进展。同时，基于组相对策略优化（GRPO）的强化学习方法引起了广泛关注，并已成功应用于T2I任务。然而，训练过程中常用的均匀采样策略往往忽略了样本难度与模型当前学习能力之间的匹配，导致训练效率低下。我们主张，提高训练效率需要持续优先选择与模型 evolving 能力匹配且仍能主动学习的提示。为此，我们提出了课程组策略优化（CGPO），一种适应性课程训练框架。在训练过程中，每个提示生成一组由奖励模型评分的图像。我们使用组奖励的方差作为在线代理来衡量提示的一致性。较高的方差表明模型部分捕捉了提示要求，但尚未达到稳定的掌握。此类提示更可能提供有用的训练信号，因此相应增加其采样概率。此外，为了解决多类别数据集中的数据不平衡问题，我们设计了一种基于比例公平优化的类别校准方法，以平衡各类别之间的训练难度。在GenEval、T2I-CompBench++和DPG Bench上的实验表明，我们的框架有效提高了生成性能。

英文摘要

Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model's current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model's evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.
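The variance-based prioritization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sampling_probabilities`, the additive `floor` term, and the plain proportional normalization are all hypothetical choices, and the category calibration step is omitted.

```python
import statistics


def sampling_probabilities(group_rewards, floor=1e-3):
    """Turn per-prompt reward groups into sampling probabilities.

    group_rewards maps each prompt to the rewards of its generated image group.
    Prompts whose rewards vary more within the group get a higher probability.
    `floor` (an assumed smoothing constant) keeps zero-variance prompts sampleable.
    """
    weights = {
        prompt: statistics.pvariance(rewards) + floor
        for prompt, rewards in group_rewards.items()
    }
    total = sum(weights.values())
    return {prompt: w / total for prompt, w in weights.items()}


if __name__ == "__main__":
    rewards = {
        "easy": [0.9, 0.9, 0.9, 0.9],       # consistently solved: low variance
        "learnable": [0.1, 0.9, 0.2, 0.8],  # partially solved: high variance
        "hard": [0.0, 0.0, 0.0, 0.0],       # consistently failed: low variance
    }
    print(sampling_probabilities(rewards))
```

In this toy setup, prompts that are always solved and prompts that are always failed both have zero reward variance, so both are down-weighted relative to the partially solved prompt, matching the intuition stated in the abstract.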

URL PDF HTML ☆

赞 0 踩 0

2605.17799 2026-05-19 cs.CV cs.LG 版本更新

Is Complex Training Necessary for Long-Tailed OOD Detection? A Re-think from Feature Geometry

长尾分布外检测是否需要复杂的训练？从特征几何角度的重新思考

Ningkang Peng, Xuanming Chen, Yanhui Gu

发表机构 * Nanjing Normal University（南京师范大学）

AI总结本文重新审视长尾分布外检测问题，提出通过特征几何方法简化检测过程，改进Mahalanobis距离计算，提升检测性能。

详情

AI中文摘要

长尾分布外检测通常通过专门的训练方法解决，包括引入分布外数据、回避头、对比目标、能量损失或梯度冲突控制。我们表明这些训练机制可能掩盖了一个更简单的问题：冻结的长尾表示可能已经包含有用的分布外证据，但原始Mahalanobis距离受到频率耦合特征半径和不充分支持的尾部协方差的影响。我们提出了超球面池化Mahalanobis（HPM）方法，一种后处理检测器，将特征归一化到单位球面，并用池化、岭正则化的度量替换类特定协方差，同时保持类均值作为语义锚点。在CIFAR-LT实验和ImageNet-100-LT近分布外边界分析中，HPM提高了原始Mahalanobis评分；对于先验校准经验风险最小化（PC-ERM），在CIFAR-10-LT上将AUROC从46.49提升到85.67，在CIFAR-100-LT上从50.40提升到78.35。这个简单的PC-ERM+HPM流程在CIFAR-100-LT上实现了最佳对数效率分数（LES；3.08），在显著降低训练时间成本的情况下，保留了约95%的最佳CIFAR-100-LT AUROC观测值。这些结果表明，在长尾分布外检测中应分别评估表示质量、检测器几何和训练复杂性。

英文摘要

Long-tailed out-of-distribution (LT-OOD) detection is often addressed with specialized training, including auxiliary out-of-distribution (OOD) data, abstention heads, contrastive objectives, energy losses, or gradient-conflict control. We show that these training mechanisms can obscure a simpler issue: frozen long-tailed representations may already contain useful OOD evidence, but raw Mahalanobis distance is distorted by frequency-coupled feature radius and poorly supported tail covariance. We propose Hyperspherical Pooled Mahalanobis (HPM), a post-hoc detector that normalizes features onto the unit sphere and replaces class-specific covariance with a pooled, ridge-regularized metric while keeping class means as semantic anchors. In CIFAR-LT experiments and an ImageNet-100-LT near-OOD boundary analysis, HPM improves raw Mahalanobis scoring; for Prior-Calibrated ERM (PC-ERM), it raises AUROC from 46.49 to 85.67 on CIFAR-10-LT and from 50.40 to 78.35 on CIFAR-100-LT. This simple PC-ERM+HPM pipeline also achieves the best Log Efficiency Score (LES; 3.08) on CIFAR-100-LT, retaining roughly 95% of the best CIFAR-100-LT AUROC observed among the compared post-hoc scores at substantially lower training-time cost. These results argue for evaluating representation quality, detector geometry, and training complexity as separate factors in LT-OOD detection.
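The detector described above (unit-sphere normalization, class means as anchors, one pooled ridge-regularized covariance) can be sketched with NumPy as follows. This is a rough sketch under assumptions: the names `fit_hpm` and `hpm_score`, the ridge value `1e-2`, and the use of the minimum squared distance over classes as the OOD score are illustrative choices, and the paper's exact estimator may differ.

```python
import numpy as np


def fit_hpm(features, labels, ridge=1e-2):
    """Fit class means and a pooled, ridge-regularized precision matrix
    on L2-normalized features (ridge value is an assumed hyperparameter)."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    classes = np.unique(labels)
    # Class means stay class-specific and act as semantic anchors.
    means = np.stack([z[labels == c].mean(axis=0) for c in classes])
    # Center every sample by its own class mean, then pool one covariance.
    centered = z - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(z)
    precision = np.linalg.inv(cov + ridge * np.eye(z.shape[1]))
    return means, precision


def hpm_score(x, means, precision):
    """Return the minimum squared Mahalanobis distance to any class mean.
    Larger values suggest the sample is more likely out-of-distribution."""
    z = x / np.linalg.norm(x, axis=1, keepdims=True)
    diff = z[:, None, :] - means[None, :, :]  # shape (n, classes, dim)
    d2 = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return d2.min(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((50, 3))
    b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((50, 3))
    feats = np.vstack([a, b])
    labels = np.array([0] * 50 + [1] * 50)
    means, precision = fit_hpm(feats, labels)
    print(hpm_score(np.array([[0.0, 0.0, 1.0]]), means, precision))
```

Because the covariance is pooled over all classes, tail classes with very few samples borrow their metric from the whole training set instead of relying on a poorly supported class-specific estimate.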

URL PDF HTML ☆

赞 0 踩 0

2605.17795 2026-05-19 cs.LG cs.CV 版本更新

When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection

当准确性不够时：噪声标签学习与分布外检测之间的不确定性崩溃

Ningkang Peng, Jingyang Mao, Runhan Zhou, Peirong Ma, Yanhui Gu

发表机构 * Nanjing Normal University（南京师范大学）

AI总结本文研究了噪声标签学习与分布外检测之间的不确定性崩溃问题，提出了一种通用的ACC-OOD基准，揭示了高准确率并不保证分布外可靠性，提出虚拟边距正则化方法来缓解这一问题。

详情

LatentUMM: 双重潜在对齐用于统一多模态模型

Yinyi Luo, Wenwen Wang, Hayes Bai, Marios Savvides, Jindong Wang

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； William & Mary（威廉与玛丽学院）

AI总结本文提出LatentUMM，通过构建增强的共享潜在空间，显式对齐映射到和从潜在空间的转换，提高跨模态一致性。实验表明，该方法在多种架构上一致提升了多模态一致性。

详情

AI中文摘要

统一多模态模型（UMMs）通过学习共享的潜在空间，在理解和生成方面取得优异表现，但往往在这些能力之间存在功能不一致。我们发现，这一问题并非源于共享表示的不足，而是源于映射到和从潜在空间的转换之间缺乏显式对齐。因此，生成和重新编码可能遵循不一致的轨迹，在模态转换时导致语义漂移。在本文中，我们提出了LatentUMM，一个构建增强共享潜在空间的框架，以显式对齐这些转换并提高跨模态一致性。LatentUMM包含两个阶段。第一阶段，双潜在对齐在模态和容量层面强制一致性：跨模态对齐使用更强的嵌入模型来施加结构化的跨模态语义，而双容量对齐在生成和重新编码下强制双向一致性。第二阶段，潜在动态稳定化通过随机潜在滚动和偏好优化提高鲁棒性，倾向于保留语义一致性的轨迹。实验表明，LatentUMM在多种架构上一致提高了多模态一致性。代码可在：https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM。

英文摘要

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.

URL PDF HTML ☆

赞 0 踩 0

2605.17759 2026-05-19 cs.CV 版本更新

FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

FrequencyBooster: 高保真像素扩散的全频建模

Lichen Ma, Zipeng Guo, Yu He, Xiaolong Fu, Luohang Liu, Jingling Fu, Junshi Huang, Yan Li

AI总结本文提出FrequencyBooster，一种能够提升像素扩散模型全频建模能力的框架，通过高容量解码器提取高频细节和低频语义，从而在保持全局结构的同时实现更精确的像素生成。

详情

AI中文摘要

为克服基于VAE的潜在扩散模型固有的保真度瓶颈和优化偏差，像素空间扩散模型作为一种具有吸引力的端到端范式而出现。然而，现有的像素扩散模型往往难以在计算效率与高频率细节保留之间取得平衡。它们通常依赖于基于块的压缩或受限的局部解码，导致一种'频谱妥协'，即高频和精细像素信息被抑制。为了解决这些挑战，我们提出了FrequencyBooster，一种新的框架，旨在为像素扩散模型赋予全频建模能力，而无需显著的开销。该方法的核心是一个高容量解码器，专门用于提取详尽的高频细节和低频语义，后者来源于Diffusion Transformer (DiT) 主干网络。与以往牺牲全局上下文以换取局部细化的工作不同，FrequencyBooster利用高维特征表示，在保持全局结构完整性的同时实现了更优的像素级精度。在ImageNet上的大量实验表明，我们的方法效果显著：在仅320个epoch内，我们的模型在256×256分辨率下达到最先进的FID为1.60。此外，在512×512分辨率下，FrequencyBooster达到FID为1.69，显著优于现有的像素空间和潜在空间生成模型。

英文摘要

To circumvent the inherent fidelity bottlenecks and optimization misalignment of VAE-based latent diffusion, pixel-space diffusion models have emerged as a compelling end-to-end paradigm. However, existing pixel diffusion models often struggle to balance computational efficiency with the preservation of high-frequency details. They frequently resort to patch-based compression or restricted local decoding, leading to a "spectral compromise" where high-frequency and fine-grained pixel information are suppressed. To address these challenges, we propose \textbf{FrequencyBooster}, a novel framework designed to empower pixel diffusion with full-frequency modeling capabilities without prohibitive overhead. The core of our method is a high-capacity decoder that specializes in extracting exhaustive high-frequency details and low-frequency semantics, the latter of which is derived from a Diffusion Transformer (DiT) backbone. Unlike prior works that sacrifice global context for local refinement, FrequencyBooster leverages high-dimensional feature representations to maintain global structural integrity while achieving superior pixel-level precision. Extensive experiments on ImageNet demonstrate the effectiveness of our approach: our model achieves a state-of-the-art FID of \textbf{1.60} at $256 \times 256$ resolution within only 320 epochs. Furthermore, at $512 \times 512$ resolution, FrequencyBooster attains an FID of \textbf{1.69}, significantly outperforming existing pixel-space and latent-space generative models.

URL PDF HTML ☆

赞 0 踩 0

2605.17748 2026-05-19 cs.CV 版本更新

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

通过全局-局部自适应交互释放视觉Transformer在图像质量评估中的潜力

Yu Li, Puchao Zhou, Yachun Mi, Yanfeng Wu, Xiaoming Wang, Shaohui Liu

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Meituan（美团）

AI总结本文提出了一种全局-局部自适应交互框架，通过双流特征提取机制和交互式全局-局部融合，提升图像质量评估的预测精度和鲁棒性，同时减少可训练参数数量。

Journal ref Proceedings of the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10567-10571, 2026

详情

AI中文摘要

在盲图像质量评估（BIQA）领域，准确预测自然环境中真实失真图像的感知质量仍然极具挑战性，因为存在多样的复杂失真。尽管现有方法已取得显著准确性，但其可扩展性常受限于主观注释的高成本和可用数据集的有限规模。近年来，大规模预训练视觉模型的进步引入了强大的语义和表征能力，但其在IQA任务中的应用受到显著的计算需求和次优微调效率的阻碍。为克服这些限制，我们引入了全局-局部交互适配器（GLIA），一种新的框架，通过双流特征提取机制与交互式全局-局部融合有效利用预训练的视觉Transformer。通过同时保留全局语义信息和细粒度局部细节，我们的方法在显著减少可训练参数的同时，实现了优越的预测精度和鲁棒性。在多个基准上的广泛实验验证了我们方法的有效性和优越性。

英文摘要

In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.

URL PDF HTML ☆

赞 0 踩 0

2605.17743 2026-05-19 cs.CV 版本更新

Patch-MoE Mamba: 一种用于医学图像分割的基于补丁顺序的专家混合状态空间架构

Diego Adame, Fabian Vazquez, Jose A. Nunez, Huimin Li, Jinghao Yang, Erik Enriquez, DongChul Kim, Haoteng Tang, Bin Fu, Pengfei Gu

发表机构 * University of Texas Rio Grande Valley（德克萨斯大学里奥格兰德河谷分校）

AI总结本文提出了一种基于补丁顺序的专家混合状态空间架构Patch-MoE Mamba，以解决现有Mamba分割模型在像素级方向扫描破坏局部二维空间结构以及简单求和融合方向无法适应多样物体大小、形状和边界的问题。

详情

AI中文摘要

基于CNN和Transformer的架构在医学图像分割中已取得优异性能，但CNN在建模长距离依赖性方面存在限制，而Transformer则常面临二次计算和内存复杂度的问题。状态空间模型，尤其是基于Mamba的网络，提供了一种高效的替代方案，具有线性序列复杂度。然而，现有的Mamba分割模型仍面临两个限制：像素级方向扫描会破坏局部二维空间结构，而简单的求和融合方向无法适应多样化的物体大小、形状和边界。为了解决这些问题，我们提出了Patch-MoE Mamba，一种用于医学图像分割的基于补丁顺序的专家混合状态空间架构。它引入了一种分层的补丁顺序扫描机制，能够在保留局部空间邻域的同时捕捉多尺度上下文，并引入了基于MoE的方向融合模块，通过四个方向专家、一个可学习的连接专家和残差方向聚合，自适应地结合多个Mamba扫描器输出。在五个公开的息肉分割基准和ISIC 2017/2018皮肤病变分割数据集上的实验表明了Patch-MoE Mamba的有效性和通用性。

英文摘要

CNN- and Transformer-based architectures have achieved strong performance in medical image segmentation, but CNNs are limited in modeling long-range dependencies, while Transformers often suffer from quadratic computational and memory complexity. State space models, especially Mamba-based networks, offer an efficient alternative with linear sequence complexity. However, existing Mamba segmentation models still face two limitations: pixel-wise directional scanning can disrupt local 2D spatial structure, and simple summation-based fusion of scan directions cannot adapt well to diverse object sizes, shapes, and boundaries. To address these issues, we propose \textit{Patch-MoE Mamba}, a patch-ordered mixture-of-experts state space architecture for medical image segmentation. It introduces a hierarchical patch-ordered scanning mechanism that preserves local spatial neighborhoods while capturing multi-scale context, and an MoE-based directional fusion module that adaptively combines multiple Mamba scanner outputs using four directional experts, a learnable concatenation expert, and residual directional aggregation. Experiments on five public polyp segmentation benchmarks and the ISIC 2017/2018 skin lesion segmentation datasets demonstrate the effectiveness and generality of Patch-MoE Mamba.

2605.17686 2026-05-19 cs.CV 版本更新

Brain-inspired spike-timing plasticity for reliable label-efficient event-camera vision

脑启发式脉冲时间依赖性可塑性用于可靠的标签高效事件相机视觉

Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad

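Affiliations: School of Electrical and Computer Engineering, University of Oklahoma

AI summary: This paper proposes a brain-inspired spike-timing-dependent plasticity (STDP) approach to event-camera vision, in which three local STDP modules run on a single CPU thread without GPU support, improving label efficiency and detection performance.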

AI中文摘要

部署事件相机目标检测器受到每帧标注需求和GPU计算需求的限制。本文引入了三个局部脉冲时间依赖性可塑性（STDP）模块，包括序列、候选和管可靠性模块，这些模块在单个CPU线程上运行而无需GPU支持。在FRED无人机基准测试中，所提出的框架覆盖了三个标签高效监督层级。严格零标签检测器实现了53.8%的mAP@30，约26个训练衍生位实现76.9%的mAP@30，而STDP候选可靠性门实现了78.60±0.42%的mAP@30。在获取顺序漂移下，群体门比流式k-means高出2.03±0.58个百分点（20次试验全部为正），而无漂移对照组则否定了其效果。STDP将单模型方差减少了6.6倍，一个训练好的门与44种子集合界线相当。门在Intel Lava上实现了89%的前两名一致性。在EVUAV基准测试中，管级STDP层将误报率从454降至331e-4（Pd≥88%）。密集梯度训练检测器无法提供这种梯度训练、密集矩阵乘法和无局部可塑性操作的组合。

英文摘要

Deploying event-camera object detectors is constrained by per-frame labeling requirements and GPU compute demands. This work introduces three local spike-timing-dependent plasticity (STDP) modules, including sequence, candidate, and tube-reliability modules, that operate on a single CPU thread without GPU support. On the FRED drone benchmark, the proposed framework spans three label-efficient supervision tiers. A strict zero-label detector achieves 53.8% mAP@30, approximately 26 train-derived bits achieve 76.9% mAP@30, and an STDP candidate-reliability gate achieves 78.60 +/- 0.42% mAP@30. Under acquisition-order drift, the cohort gate outperforms streaming k-means by 2.03 +/- 0.58 percentage points across 20 of 20 positive trials, while a no-drift control falsifies the effect. STDP reduces single-model variance by 6.6 times, and one trained gate matches a 44-seed ensemble bound. The gate transfers to Intel Lava with 89% top-2 agreement. On the EVUAV benchmark, a tube-level STDP layer reduces false alarms from 454 to 331e-4 at Pd >= 88%. Dense gradient-trained detectors cannot provide this combination of gradient training, dense matrix multiplication, and local plasticity-free operation by construction.

2605.17685 2026-05-19 cs.CV cs.AI cs.CR cs.SY eess.SP eess.SY 版本更新

Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition

基于注意力引导的1D和2D CNN融合用于鲁棒的基于ECG的生物识别

Islameddine Arioua, Amir Benzaoui, Abdelhafid Zeroual, Lotfi Houam

Affiliations: PIMIS Laboratory, Electronics and Telecommunications Department; Université du 8 Mai 1945; Electrical Engineering Department, University of 20 August 1955; Department of Electrical Engineering, Faculty of Science and Applied Sciences; Larbi Ben M'hidi University; Department of Electronics and Communications, University of Larbi Tebessi

AI summary: This paper proposes a hybrid framework that combines 1D and 2D CNNs through an attention-guided fusion mechanism to improve the robustness and performance of ECG biometric recognition; experiments report high identification accuracy on several datasets.

Journal ref Digital Signal Processing 2026

DOI: 10.1016/j.dsp.2026.106252

AI中文摘要

基于心电图（ECG）的生物识别已作为一种安全的身份验证和活体检测的有希望的解决方案。然而，大多数现有方法依赖于单模深度学习架构，单独处理一维（1D）时间信号或二维（2D）时频表示，限制了鲁棒性和泛化能力。为了解决这个问题，本文提出了一种将1D和2D卷积神经网络（CNNs）整合到统一端到端架构中的混合框架。1D分支从原始ECG信号中提取时序和形态学特征，而2D分支从时频表示中捕获判别性的频谱信息。注意力引导的融合机制根据输入特性动态加权两种模态，克服了传统静态融合策略的局限性。该框架在三个基准数据集（ECG-ID、MIT-BIH和PTB）上进行了评估，包括健康受试者和患有心脏病理学的患者，分别实现了99.56%、100.00%和99.89%的识别准确率。为了评估长期生物稳定性，还进行了多会话Heartprint数据集的实验，该数据集跨越十年。所提出的方法在相同会话中实现了98.54%（S1）、99.09%（S2）、94.93%（S3R）和96.08%（S3L）的准确率，跨会话评估达到了56.33%（S1-S2）和53.27%（S2-S3R），证明了其在时间上的稳定生物特征捕获能力。最优配置结合了InceptionTime用于1D处理，ResNet-34用于2D分析，以及基于注意力的融合。消融研究证实，所提出的注意力机制在传统融合方法中始终表现更优。总体而言，所提出的框架为ECG生物识别提供了一种稳健、可扩展且高性能的解决方案。

英文摘要

Electrocardiogram (ECG)-based biometric recognition has emerged as a promising solution for secure authentication and liveness detection. However, most existing methods rely on unimodal deep learning architectures that independently process either one-dimensional (1D) temporal signals or two-dimensional (2D) time-frequency representations, limiting robustness and generalization. To address this issue, this paper proposes a hybrid framework integrating 1D and 2D convolutional neural networks (CNNs) within a unified end-to-end architecture. The 1D branch extracts temporal and morphological features from raw ECG signals, while the 2D branch captures discriminative spectral information from time-frequency representations. An attention-guided fusion mechanism dynamically weights both modalities according to input characteristics, overcoming the limitations of conventional static fusion strategies. The framework was evaluated on three benchmark datasets (ECG-ID, MIT-BIH, and PTB), including healthy subjects and patients with cardiac pathologies, achieving identification accuracies of 99.56%, 100.00%, and 99.89%, respectively. To assess long-term biometric permanence, experiments were also conducted on the multi-session Heartprint dataset spanning ten years. The proposed approach achieved same-session accuracies of 98.54% (S1), 99.09% (S2), 94.93% (S3R), and 96.08% (S3L), while cross-session evaluations reached 56.33% (S1-S2) and 53.27% (S2-S3R), demonstrating the ability to capture stable biometric signatures over time. The optimal configuration combines InceptionTime for 1D processing, ResNet-34 for 2D analysis, and attention-based fusion. Ablation studies confirm that the proposed attention mechanism consistently outperforms conventional fusion approaches. Overall, the proposed framework provides a robust, scalable, and high-performance solution for ECG biometric recognition.

2605.17682 2026-05-19 cs.CV 版本更新

GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning

GEM：用于占用预测和运动规划的高斯演化模型

Cheng Chen, Hao Huang, Saurabh Bagchi

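Affiliations: Purdue University; New York University Abu Dhabi

AI summary: This work proposes GEM, a Gaussian Evolution Model for efficient occupancy forecasting and motion planning that addresses the shortcomings of existing methods in temporal flexibility, visibility of scene evolution, and matching continuous-time dynamics.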

AI中文摘要

未来3D语义占用预测和运动规划是自动驾驶的核心，需要模型能够推断周围场景的演变和车辆的行动。现有占用世界模型通常将场景离散化为潜在嵌入、体素特征或量化标记，并通过固定步长自回归生成预测未来状态。这限制了时间灵活性，掩盖了场景演变，长时间预测会积累误差，并且难以匹配真实驾驶场景的连续时间动态。我们提出了GEM，一种用于非自回归占用世界建模的高斯演化模型，其中驾驶场景被表示为学习的动态显式连续4D高斯原语。与逐步推演未来占用状态不同，GEM可以直接查询高斯世界表示中的任意时间戳，并将相应的条件3D高斯分布投射到语义占用体积中。这使得能够高效地进行全时间范围预测，同时保留紧凑且可解释的场景表示。通过解耦空间几何、时间支持和原语运动，GEM使预测的世界更容易检查，因为每个原语的演变可以连续随时间跟踪。相同表示也支持运动规划，通过从学习的高斯世界预测未来的车辆轨迹。大量实验表明，GEM在未来的语义占用预测和强大的运动规划性能方面均达到最先进的水平，同时提供灵活的时间查询。

英文摘要

Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed-step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous-time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive's evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state-of-the-art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.
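
One standard way to read "querying a 4D Gaussian at an arbitrary timestamp" is ordinary Gaussian conditioning. The abstract does not give GEM's parameterization, so the block partition and every symbol below are assumptions used only for illustration. Writing a space-time primitive with mean and covariance split into a spatial part $x \in \mathbb{R}^3$ and a scalar time part $t$:

```latex
\[
\mu = \begin{pmatrix} \mu_x \\ \mu_t \end{pmatrix},
\qquad
\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xt} \\ \Sigma_{tx} & \sigma_{tt} \end{pmatrix},
\]
\[
\mu_{x \mid t} = \mu_x + \Sigma_{xt}\,\sigma_{tt}^{-1}\,(t - \mu_t),
\qquad
\Sigma_{x \mid t} = \Sigma_{xx} - \Sigma_{xt}\,\sigma_{tt}^{-1}\,\Sigma_{tx}.
\]
```

Under this reading, the conditional mean moves linearly in $t$ while the conditional covariance stays fixed, and the time marginal $\mathcal{N}(t;\, \mu_t, \sigma_{tt})$ could serve as a weight describing when the primitive is active. Whether GEM uses exactly this form is not stated in the abstract.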

2605.17673 2026-05-19 cs.CV 版本更新

A simple approach for biometrics: Finger-knuckle prints recognition based on a Sobel filter and similarity measures

一种简单的生物识别方法：基于Sobel滤波器和相似性度量的指节纹识别

E. O. Rodrigues, T. M. Porcino, Aura Conci, Aristofanes C. Silva

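Affiliations: Department of Computer Science, Universidade Federal Fluminense; Department of Electrical Engineering, Universidade Federal do Maranhão

AI summary: This paper proposes a simple finger-knuckle-print recognition method based on a Sobel filter and similarity measures. Edge detection and noise reduction yield binary images that are efficient to process and store, and the summary reports a 17.02% correct recognition rate on a large dataset.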

Journal ref 2016 International Conference on Systems, Signals and Image Processing (IWSSIP)

2605.17668 2026-05-19 cs.CV 版本更新

TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts in Hospitals

Sophokles Ktistakis, Rui Wang, Bastian Grande, Hugo Sax

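Affiliations: ETH Zurich; Institute for Anesthesiology and Perioperative Medicine, University Hospital Zurich; Department of Public and Global Health, University of Zurich

AI summary: This paper presents TouchMap-OR, a multi-view RGB-D vision system for identity-resolved reconstruction of hand-surface contacts in operating rooms. It fuses multi-view hand reconstructions with tracked clinicians to obtain consistent hand trajectories, and builds a semantic 3D model of the operating room so that those trajectories can be mapped to specific surfaces to infer when and where contacts occur.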

AI中文摘要

临床医生、患者和医疗设备之间的手-表面互动在医疗程序中起着核心作用，在病原体传播中起关键作用。然而，这些互动仍然大多未被观察到，因为目前的感染预防实践依赖于手动观察，无法重建详细的接触历史。在本工作中，我们提出了在手术室中身份分辨的手-表面互动重建问题，并引入了TouchMap-OR，一种多视角RGB-D视觉系统，该系统能够建模医生、可变形手部几何结构以及临床环境的语义结构，以推断接触发生的时间和位置。该系统在多摄像机之间重建全局一致的多个人3D骨骼轨迹，同时从RGB观测与深度数据对齐的数据中估计可变形MANO手部网格。多视角手部重建被融合并关联到追踪的医生，以获得一致的左右手轨迹。通过多视角分割和深度融合构建手术室的语义3D模型，使重建的手部轨迹能够映射到特定表面，包括医疗设备、可移动物体和患者身体部位。利用时间手-表面接近性推断接触事件，描述了哪位医生接触了哪个表面以及何时。我们在三个真实的麻醉诱导记录上评估了TouchMap-OR，手动标注了接触事件。TouchMap-OR在二元接触F1值上达到0.75，优于基于跟踪的基线方法，同时保持了可比的多个人跟踪精度，并实现了0.96的身份分配精度。

英文摘要

Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.
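
The last step described in this abstract, turning per-frame hand-surface proximity into contact episodes, can be illustrated with a minimal sketch. The distance threshold, the minimum episode length, and the name `contact_episodes` are assumptions for illustration; the abstract does not state the rule TouchMap-OR actually applies.

```python
from typing import List, Tuple


def contact_episodes(
    distances: List[float],
    threshold: float,
    min_frames: int = 1,
) -> List[Tuple[int, int]]:
    """Group consecutive frames whose hand-surface distance is within
    `threshold` into (start_frame, end_frame) episodes, both inclusive.

    Runs shorter than `min_frames` are discarded (hypothetical filter).
    """
    episodes: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the current run, if any
    for i, d in enumerate(distances):
        if d <= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended on the previous frame.
            if i - start >= min_frames:
                episodes.append((start, i - 1))
            start = None
    # Close a run that extends to the final frame.
    if start is not None and len(distances) - start >= min_frames:
        episodes.append((start, len(distances) - 1))
    return episodes
```

Each returned pair would then be tagged with the clinician identity and surface label that the tracking and semantic-mapping stages provide.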

2605.17633 2026-05-19 cs.CV cs.AI 版本更新

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

SparseSAM: Segment Anything模型中激活的结构稀疏化

Hoai-Chau Tran, Chi H. Nguyen, Duy M. H. Nguyen, Mathias Niepert, Fan Lai, Khoa D. Doan

Affiliations: University of Illinois at Urbana-Champaign; College of Engineering & Computer Science, VinUniversity; VinUni-Illinois Smart Health Center, VinUniversity; DFKI; Max Planck Research School for Intelligent Systems (IMPRS-IS); University of Stuttgart

AI summary: This paper proposes SparseSAM, a training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity, improving inference speed and reducing memory use while maintaining quality.

AI中文摘要

Segment Anything Model (SAM) 实现了强大的开放词汇分割，但其基于ViT的图像编码器在推理延迟和内存方面占主导地位。现有的激活压缩方法，如标记合并，通过减少标记长度来处理，但引入了非平凡的运行时开销，并在高压缩下导致灾难性质量下降。其他应用稀疏注意力的方法仅关注注意力本身，使MLP完全密集，并限制了可达到的速度提升。我们提出了SparseSAM，一种（i）无需训练的结构稀疏化框架，该框架在加速注意力和MLP层的同时保持token身份。SparseSAM引入了（ii）Stripe-Sort Attention，它使用确定性的Z序排列将密集注意力转换为静态的硬件友好的稀疏模式，消除了动态掩码的开销。SparseSAM进一步引入了（iii）残差一致性MLP，只将信息性token路由通过MLP，同时通过残差路径传播剩余token。在四个分割基准测试中，SparseSAM在0.4密度下仅损失0.004 mIoU，在0.3密度下损失0.021 mIoU，相较于标记合并方法的改进，准确率损失减少了2.10倍，同时实现了2倍更快的推理速度和2.8倍的内存减少。

英文摘要

The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the token length to process, yet introduce non-trivial runtime overhead and encounter catastrophic quality drop under high compression. Other methods applying Sparse Attention focus on attention alone, leaving the MLP fully dense and capping achievable speedup. We propose SparseSAM, a (i) training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces a (iii) Residual-Consistency MLP that routes only informative tokens through the MLP while propagating remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at a 0.4 density and 0.021 mIoU at 0.3, a 2.10x reduction in accuracy loss versus token merging advances, while achieving 2x faster inference and 2.8x memory reduction.
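
The deterministic Z-order permutation mentioned in this abstract can be sketched in a few lines. This shows only the generic Morton ordering of a token grid; how SparseSAM builds its stripe pattern on top of it is not described in the abstract, and the function names here are hypothetical.

```python
from typing import List


def morton_key(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of (row, col): column bits go to even
    positions, row bits to odd positions."""
    key = 0
    for b in range(bits):
        key |= ((col >> b) & 1) << (2 * b)
        key |= ((row >> b) & 1) << (2 * b + 1)
    return key


def z_order_permutation(height: int, width: int) -> List[int]:
    """Return row-major token indices sorted along the Z-order curve."""
    indices = range(height * width)
    return sorted(indices, key=lambda i: morton_key(i // width, i % width))
```

Because spatially adjacent patches tend to land close together in the resulting sequence, a fixed band or block mask over the permuted tokens covers local neighborhoods without any per-input mask computation, which is the general motivation for static sparse patterns.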

2605.17624 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning

通过不变/等变半监督学习进行部分标注数据集上的多任务学习

Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki

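Affiliations: KTH Royal Institute of Technology; Univrses AB

AI summary: This paper investigates invariant and equivariant semi-supervised learning for training multi-task models on partially labeled datasets. It evaluates FixMatch and its equivariant extension Dense FixMatch on Cityscapes and BDD100K for object detection and semantic segmentation, and finds that both outperform supervised baselines in most settings, especially when few labeled samples are available.

Comments https://github.com/miquelmarti/DenseFixMatch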

AI中文摘要

我们研究了不变和等变半监督学习在处理部分标注数据集上多任务模型训练挑战的潜力。具体而言，我们使用流行的FixMatch方法进行不变半监督学习，并采用其等变扩展Dense FixMatch。我们在Cityscapes和BDD100K数据集上评估了它们在计算机视觉中普遍的目标检测和语义分割任务中的性能。我们考虑了每个任务标注子集的不同大小以及它们之间的不同重叠情况。我们的结果表明，对于不变和等变半监督学习，大多数情况下都优于监督基线，特别是在任务中可用标注样本较少时，改进最为显著，且后者方法通常表现更好。我们的研究表明，不变/等变学习是有限标注数据下多任务学习的一个有前途的方向。

英文摘要

We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.
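
For readers unfamiliar with FixMatch, its unlabeled-data term can be sketched as follows: a prediction on a weakly augmented view becomes a hard pseudo-label only if it is confident enough, and that label supervises the prediction on a strongly augmented view. This is a plain-Python illustration of the classification case, not the authors' code; the threshold value and the function name are assumptions, and the dense extension to detection and segmentation is not shown.

```python
import math
from typing import List


def fixmatch_unlabeled_loss(
    weak_probs: List[List[float]],
    strong_probs: List[List[float]],
    tau: float = 0.95,
) -> float:
    """Mean over all unlabeled samples of the masked cross-entropy.

    weak_probs / strong_probs hold class probabilities per sample for the
    weakly and strongly augmented views. Samples whose weak-view
    confidence is below `tau` contribute zero.
    """
    total = 0.0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        confidence = max(p_weak)
        if confidence >= tau:
            pseudo_label = p_weak.index(confidence)
            # Clamp to avoid log(0) on degenerate predictions.
            total += -math.log(max(p_strong[pseudo_label], 1e-12))
    return total / len(weak_probs)
```

The invariant variant asks for the same label under both augmentations; an equivariant variant would additionally transform dense pseudo-labels by the same geometric augmentation before comparing them.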

2605.17620 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

SynVA：一种用于血管生成和动脉瘤编辑的模块化工具包

Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk

Affiliations: Visual Computing and Artificial Intelligence, Kiel University, Germany; Institute for Medical Informatics and Statistics, Kiel University, Germany; Clinic for Neuroradiology, Medical Faculty, Magdeburg University, Germany; Department of Radiology and Neuroradiology, University Hospital Schleswig-Holstein, Germany; Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poland

AI summary: This paper presents SynVA, a modular toolkit for vessel mesh generation and anatomically consistent aneurysm synthesis. It combines new flow-matching methods with learning-based approaches to generate realistic vessel geometries and anatomically plausible aneurysms, and provides a large-scale labeled dataset intended to support medical image analysis.

AI中文摘要

颅内动脉瘤（IAs）以不可预测的生长和破裂风险为特征，是导致中风的主要原因，可能引发致命性出血，具有高死亡率和长期残疾。随着人口老龄化，脑血管疾病的发病率和整体负担预计会增加，凸显了需要可扩展的方法来分析复杂的医疗数据并提高对这些疾病的群体层面理解的必要性。尽管数字孪生和深度学习为提高诊断、预后和治疗提供了有希望的途径，但其效果受到大规模高质量医疗数据和相应标签稀缺的限制。我们提出了SynVA，一种用于血管网格生成和解剖学一致动脉瘤合成的模块化工具包。SynVA结合了基于流匹配的新型方法生成健康血管网格与基于学习的方法生成解剖条件下的动脉瘤网格——动脉瘤是从已有的血管几何结构计算而来的，而不是孤立生成。此外，我们引入了基于生理学原理和统计先验的SynVA过程模型，用于血管和动脉瘤合成，从而能够生成大规模数据集（例如用于训练基于网格的生成模型）。为此，我们发布了包含50,000个完全标注网格样本的数据集，用于各种下游视觉任务，如语义分割。广泛的定量和定性评估证明了SynVA能够生成逼真的血管几何和解剖学合理的动脉瘤。具体而言，我们的实验表明，某些方法生成的动脉瘤形状更符合专家人类感知，而其他方法在定量相似性度量上与真实动脉瘤的重建表现更优。

英文摘要

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

2605.17610 2026-05-19 cs.CV cs.CL 版本更新

TAME: Test-Time Adversarial Prompt Tuning of Vision-Language Models via a Mixture-of-Experts Architecture

Xin Wang, Yixu Wang, Jiaming Zhang, Ruofan Wang, Jiaqi Yu, Kai Chen, Jingjing Chen, Xingjun Ma, Yu-Gang Jiang

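Affiliations: Fellow, IEEE

AI summary: This paper proposes TAME, a test-time defense built on a mixture-of-experts architecture that improves the robustness of vision-language models to adversarial perturbations while preserving generalization on clean samples.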

AI中文摘要

大规模预训练的视觉语言模型（VLMs），如CLIP，在零样本泛化方面表现强大，但对不可察觉的对抗扰动高度敏感，这在开放世界部署中引发了严重安全问题。为了在不需下游任务特定重新训练的情况下增强鲁棒性，我们提出了TAME，一种新颖的测试时防御方法。基于我们之前的测试时对抗提示调优（TAPT），TAME通过将TAPT的单一自适应提示替换为输入条件化的混合专家（MoE）框架进行架构重构，从而实现更表达力和适应性的防御。具体而言，TAME维护一个可学习的专家提示库，并利用输入依赖的路由机制，在推理时为每个未标记的测试样本聚合定制化的提示混合。这种测试时防御机制由三个无监督目标驱动：（1）多视图预测熵最小化，（2）逐层对齐视觉标记统计到预计算的干净和对抗参考分布，以及（3）MoE正则化以实现平衡的专家利用和提示多样性。我们在11个基准数据集上评估了TAME，包括ImageNet和10个额外的零样本数据集。结果表明，TAME在AutoAttack下将原始CLIP的零样本对抗鲁棒性提高了至少49.1%，同时在清洁样本上保持了良好的泛化能力。TAME还普遍优于现有对抗提示调优方法，平均鲁棒性提升至少30.2%。

英文摘要

Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT's single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.
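
The first of the three unsupervised objectives, multi-view prediction entropy minimization, is commonly implemented by averaging class probabilities over augmented views of one test sample and minimizing the entropy of that average. The sketch below assumes this averaging form; the abstract does not say whether TAME averages before or after taking the entropy, and the function name is hypothetical.

```python
import math
from typing import List


def marginal_entropy(view_probs: List[List[float]]) -> float:
    """Entropy (in nats) of the class distribution averaged over views.

    view_probs[v][c] is the predicted probability of class c for
    augmented view v of a single test sample.
    """
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    avg = [sum(p[c] for p in view_probs) / n_views for c in range(n_classes)]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(q * math.log(q) for q in avg if q > 0.0)
```

Views that disagree produce a high value, while views that agree on one confident class produce a value near zero, so minimizing this quantity with respect to the prompt parameters pushes the model toward consistent, confident predictions.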

2605.17573 2026-05-19 cs.CV cs.CR 版本更新

Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks

社交媒体中的深度伪造检测：利用3D卷积神经网络进行时序特征分析

Mohammadreza Rashidi, Raja Hashim Ali, Sami Ur Rahman

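Affiliations: Department of Computer Science AI; Media Analysis Lab Berlin, Germany

AI summary: This paper proposes a 3D convolutional neural network detector based on R3D-18 that combines a binary cross-entropy loss with a temporal-consistency regularization loss to improve deepfake detection accuracy in high-resolution and cross-dataset settings, and argues that temporal artifacts are more robust than spatial ones to social-media re-encoding.

Comments 13 pages, 6 figures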

AI中文摘要

合成面部视频在社交媒体上传播的速度比平台审核速度更快，导致虚假信息和身份攻击的成本上升。帧级深度伪造检测器在生成器质量增加时性能急剧下降；高质量的128x128 GAN输出在空间仅准确性上减少五个百分点，而时间不一致性的特征基本保持不变。我们通过基于R3D-18的3D卷积神经网络检测器解决这一差距，该检测器使用复合损失函数，结合二元交叉熵与时间一致性正则化。模型处理来自DeepfakeTIMIT数据集的16帧片段，并初始化自Kinetics-400动作识别权重。我们在128x128分辨率的内数据集评估中报告了92.8%的准确率；在不微调的情况下跨数据集转移到FaceForensics++达到76.4%，微调后有所提升。消融研究显示，迁移学习贡献了7.2个百分点，面部跟踪增加了3.5个百分点，而时间一致性正则化在高质量伪造中提供了额外的增益。结果表明，时间特征比空间特征在社交媒体重编码中更具泛化能力，提供了一个能够存活的检测信号。

英文摘要

Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.

2605.17571 2026-05-19 cs.CV cs.LG 版本更新

GCE-MIL: Faithful and Recoverable Evidence for Whole-Slide Imaging in Multiple Instance Learning

Xiangyu Li, Ran Su

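Affiliations: College of Intelligence and Computing

AI summary: This work proposes GCE-MIL, which directly optimizes the S/N/R (Sufficiency, Necessity, Recoverability) criteria to improve both predictive performance and evidence quality in multiple instance learning on whole-slide images, raising Macro-F1 and C-index and narrowing the continuous-discrete gap.

Comments 10 pages, 17 figures, 24 tables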

AI中文摘要

多实例学习（MIL）是全滑片图像（WSI）分类和生存预测的标准方法，其中基于注意力的模型将图像块特征聚合为滑片级预测。这些模型将注意力权重视为预测的证据，但注意力被优化用于分类，而非识别支持诊断的实际图像块。这种混淆导致三个失败：选择的图像块不足（单独保留它们会降低宏F1分数0.078）、多余（移除它们几乎不影响预测）以及不可恢复（连续的注意力分数与推理中使用的离散图像块子集不一致）。核心前提是证据质量应通过显式标准直接优化——充分性、必要性和可恢复性（S/N/R）——而不是作为分类的副产品继承。GCE-MIL是一种背骨无关的封装器，通过三种注入模式和三种证据组件实现：一个将选择与领域特定概念对齐的 grounding 机制，一个作为可微分代理的 noisy-OR 覆盖，以及一个通过边缘引导修复将连续选择器转换为离散子集的阈值加修复恢复。在9个背骨和9个数据集（81种配置）上，GCE-MIL将平均宏F1分数提高了0.024，C-index提高了0.014，减少了连续-离散差距4-7，增加了补集退化2-4。通过在离散恢复后可选的图像块预过滤，推理速度可提高高达5倍，同时保留0.989的完整袋效用。

英文摘要

Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, where attention-based models aggregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classification, not for identifying which patches actually support the diagnosis. This conflation leads to three failures: selected patches are insufficient (keeping them alone drops Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria (Sufficiency, Necessity, and Recoverability; S/N/R) rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that acts as a differentiable proxy for interventional evidence search, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional tile prefiltering after discrete recovery, inference runs up to 5x faster while retaining 0.989 full-bag utility.
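
The noisy-OR coverage term named in this abstract has a simple closed form: the probability that at least one selected patch supports a concept, treating patches as independent. The sketch below shows only that generic formula; how GCE-MIL defines the per-patch probabilities and combines coverage across concepts is not given in the abstract, and the function name is an assumption.

```python
from typing import Iterable


def noisy_or(probs: Iterable[float]) -> float:
    """Return 1 - prod(1 - p_i): the chance that at least one of several
    independent patch-level events occurs."""
    none_fire = 1.0  # running probability that no patch supports the concept
    for p in probs:
        none_fire *= 1.0 - p
    return 1.0 - none_fire
```

The expression is a smooth function of every input, so it can stand in for a discrete "is this concept covered by the selected patches" check during gradient-based training.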

2605.17451 2026-05-19 cs.CV 版本更新

DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking

DeTrack：一种无人机具身跟踪的基准及海拔感知双世界模型

Guyue Hu, Haoming Liu, Siyuan Song, Chenglong Li, Feng Chen, Jin Tang

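Affiliations: Hefei Si Valley Technology Development Co., Ltd; Institute of Embodied Intelligence, Anhui University

AI summary: This paper defines the DeTrack task, in which a drone tracks a target in interactive 3D environments using online egocentric observations and active flight control, and proposes the AaDWorlds framework to ease the altitude-related conflict between target visibility and flight safety.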

AI中文摘要

空中目标跟踪在公共安全、应急救援、野生动物监测等领域有广泛应用。然而，现有空中跟踪基准主要基于固定摄像头位置或预设飞行路径的被动2D视频序列，其中无人机被视为被动相机而非具身代理，无法主动感知、交互和控制其在动态3D场景中的运动。本文定义了新的无人机具身跟踪任务DeTrack，要求无人机利用在线自体观察和主动飞行控制在闭环中跟踪目标。我们构建了一个包含11,368条目标轨迹的大型基准，涵盖多样化的场景、渲染条件、语义区域和移动干扰物，并提供了针对目标可见性、跟踪准确性和轨迹成功的评估指标。我们进一步提出了AaDWorlds，一种用于无人机具身跟踪的海拔感知双世界模型框架。AaDWorlds包含一个海拔感知感知模块和双世界模型，分别在高海拔和低海拔环境下预测未来状态。通过结合伪海拔感知观察和预测的未来状态，AaDWorlds缓解了目标可见性与飞行安全之间的固有矛盾。在DeTrack基准上的实验表明，AaDWorlds在所有评估指标上均提升了闭环跟踪性能。

英文摘要

Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.

2605.17449 2026-05-19 cs.CV cs.AI 版本更新

Spatial Blindness in Whole-Slide Multiple Instance Learning

全切片多实例学习中的空间盲区

Xiangyu Li, Ran Su

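Affiliations: College of Intelligence and Computing

AI summary: This paper studies how whole-slide multiple instance learning models can largely ignore spatial information, and proposes ResTopoMIL, which uses a permutation-invariant prototype histogram and a coordinate-shuffling constraint to make the model sensitive to spatial relations, improving classification and survival prediction on several public datasets.

Comments 28 pages, 8 figures, 16 tables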

AI中文摘要

全切片MIL模型通常被称为上下文感知模型，当将图网络、Transformer或状态空间模块置于补丁嵌入之上时。我们证明这种标签可能具有误导性。在病理任务中，组织结构是诊断信号的一部分，几个强大的MIL基线在补丁坐标随机排列后，滑片级别AUC几乎未变。它们的预测准确，但大多具有组合性。我们将其失败模式称为空间盲区。我们的解释是基于优化的：在滑片级监督下，密集的外观统计信息被早期学习，留下弱梯度用于稀疏的空间关系。ResTopoMIL通过首先拟合一个排列不变的原型直方图，然后冻结它，同时一个轻量级图分支在坐标洗牌约束下学习残差来解决这个问题。该架构设计简单；干预在于如何训练空间分支。在9个公开WSI基准上，ResTopoMIL在1.15M参数下提升了分类和生存预测性能，恢复了对坐标扰动的敏感性，并在CAMELLYON-16上提供了更强的局部化证据。

英文摘要

Whole-slide MIL models are often called context-aware once graphs, Transformers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide-level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.
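
The diagnostic behind "spatial blindness" is a permutation test: keep the patch features fixed, reassign their coordinates, and see whether the slide-level output moves. The sketch below illustrates the idea with two toy scoring functions; both models, the scalar features, and the name `shuffle_sensitivity` are hypothetical stand-ins, not the paper's protocol.

```python
from typing import Callable, List, Sequence, Tuple

Coord = Tuple[float, float]
Model = Callable[[Sequence[float], Sequence[Coord]], float]


def shuffle_sensitivity(
    model: Model,
    features: Sequence[float],
    coords: Sequence[Coord],
    perm: Sequence[int],
) -> float:
    """Absolute change in the model output when patch i is given the
    coordinate of patch perm[i]; features are left untouched."""
    shuffled: List[Coord] = [coords[j] for j in perm]
    return abs(model(features, coords) - model(features, shuffled))


def bag_model(features: Sequence[float], coords: Sequence[Coord]) -> float:
    """Toy spatially blind model: mean of features, ignores coordinates."""
    return sum(features) / len(features)


def spatial_model(features: Sequence[float], coords: Sequence[Coord]) -> float:
    """Toy spatially aware model: features weighted by x position."""
    return sum(f * x for f, (x, _y) in zip(features, coords))
```

A sensitivity of zero across many permutations indicates that the prediction depends only on which patches are present, not on where they are.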

2605.17447 2026-05-19 cs.CV cs.CL 版本更新

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

FastOCR: 通过KV缓存剪枝实现高效的动态视觉聚焦文档解析

Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

发表机构 * Tsinghua University（清华大学）

AI总结本文提出FastOCR，一种无需训练的框架，利用动态视觉聚焦现象，在每个解码步骤只关注少量视觉令牌而不从KV缓存中驱逐任何令牌，在保留未剪枝模型98%准确率的同时将注意力延迟降低3.0倍。

AI中文摘要

视觉-语言模型（VLMs）在光学字符识别（OCR）中展现出强大潜力，但编码密集文档所需的大量视觉令牌导致推理成本过高。现有剪枝方法依赖物理驱逐，例如在prefill阶段永久丢弃视觉令牌。尽管在自然图像上有效，但此策略在OCR中失效，因为几乎每个视觉令牌都可能对应一个字符或结构元素，任何不可逆的损失都会导致准确性急剧下降。我们观察到，尽管文档图像看似全局密集且难以剪枝，模型对它们的注意力实际上在时间上是稀疏的：在每个解码步骤中，它集中在一小块区域，并随着步骤逐渐移动，就像人类读者依次注视各个词语而不是一次性感知整页内容一样。受此动态视觉聚焦现象的启发，我们将不可行的全局剪枝问题转化为可处理的局部动态问题，并提出FastOCR，一种无需训练的框架，包含两个互补模块。具体而言，Focal-Guided Pruning识别少量焦点层，并在每一步从中选择与任务最相关的视觉令牌；Cross-Step Fixation Reuse利用注视点的逐渐移动，以上一步的结果热启动每一步。通过动态调整哪些令牌被关注而不是驱逐任何缓存中的令牌，FastOCR避免了永久信息丢失。广泛实验表明，FastOCR作为一种即插即用的加速模块，在五个不同大小和架构的VLMs上表现出一致的泛化能力。在Qwen2.5-VL上，FastOCR在每个解码步骤只关注5%的视觉令牌，保留了未剪枝模型98%的准确性，同时将注意力延迟减少了3.0倍。

英文摘要

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.
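The idea of attending to a small, step-dependent subset of cached tokens without evicting anything can be sketched as follows. This is a minimal illustration under stated assumptions, not FastOCR's algorithm: `attend_topk` is a hypothetical name, and the sketch scores every cached key before selecting, whereas the abstract describes selecting tokens from a few focal layers and reusing the previous step's fixation.

```python
import math

def attend_topk(query, keys, values, keep_ratio=0.05):
    """One decoding step of scaled dot-product attention restricted to the
    highest-scoring fraction of cached tokens. The cache (keys, values) is
    only read, never modified, so later steps can select other tokens."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    k = max(1, round(keep_ratio * len(keys)))
    # Indices of the k largest logits (stable for ties).
    selected = sorted(range(len(keys)), key=lambda i: logits[i],
                      reverse=True)[:k]
    # Numerically stable softmax over the selected tokens only.
    m = max(logits[i] for i in selected)
    weights = [math.exp(logits[i] - m) for i in selected]
    z = sum(weights)
    out = [sum(w / z * values[i][c] for w, i in zip(weights, selected))
           for c in range(len(values[0]))]
    return out, selected
```

Because the selection is recomputed per query, a token ignored at one step remains available at the next, which is the property the abstract contrasts with prefill-stage eviction.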

2605.17436 2026-05-19 cs.CV cs.CL 版本更新

Medical Context Distorts Decisions in Clinical Vision Language Models

医学语境扭曲了临床视觉语言模型的决策

David Restrepo, Ira Ktena, Maria Vakalopoulou, Stergios Christodoulidis, Enzo Ferrante

发表机构 * MICS ； CentraleSupélec - Université Paris-Saclay（巴黎中央理工-高等电力学院 - 巴黎萨克雷大学）； Cancer Data Science Unit（癌症数据科学单元）； IHU PRISM ； National Institute in Precision Oncology（精准肿瘤学国家研究所）； University Paris-Saclay（巴黎萨克雷大学）； CentraleSupelec（巴黎中央理工-高等电力学院）； Gustave Roussy（古斯塔夫·鲁西研究所）； INSERM（法国国家健康与医学研究院）； CONICET（阿根廷国家科学与技术研究理事会）； Universidad de Buenos Aires（布宜诺斯艾利斯大学）

AI总结本文研究了医学语境对临床视觉语言模型决策的影响，发现模型在整合医学记录的视觉和文本信息时存在模态依赖、无关历史依赖和提示敏感性等问题，强调了在临床应用前需要建立明确的保障措施。

AI中文摘要

视觉-语言模型（VLMs）越来越多地被提出用于临床决策支持，但其在需要整合医学记录中视觉和文本信息的现实场景中的可靠性仍缺乏充分了解。本文识别了三种失败模式：（1）在模态上过度依赖文本而非图像，（2）对无关临床历史的虚假依赖，以及（3）在语义等价输入上的提示敏感性。我们在基于MIMIC-CXR的胸片任务上评估了多种通用领域和医学调优的开源和闭源VLMs。通过系统地操纵图像-文本对齐、临床历史和提示表述，我们发现即使有视觉证据可用，VLM的决策仍由文本模态主导。此外，我们发现VLMs受到无关报告的强烈影响，而微小的提示变化可以逆转原本正确的基于图像的预测。我们的发现强调了在考虑将这些模型用于临床实践之前，需要建立明确的保障措施并进行压力测试。

英文摘要

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.

2605.17433 2026-05-19 cs.CV 版本更新

VISTA: Variance-Gated Inter-Sequence Test-Time Adaptation for Multi-Sequence MRI Segmentation

VISTA: 用于多序列MRI分割的方差门控跨序列测试时间适应

Zhipeng Deng, Jiale Zhou, Wenhan Jiang, Haolin Wang, Xun Lin, Yafei Ou, Yefeng Zheng

发表机构 * Westlake University（西湖大学）； Hokkaido University（北海道大学）； The Chinese University of Hong Kong（香港中文大学）； RIKEN（理化学研究所）

AI总结本文提出VISTA框架，解决多序列MRI分割中模态交互偏移问题，通过设计跨序列干预生成器和跨视图分歧感知伪标签方法，提升模型在临床环境下的适应能力，实验表明在不同群体上性能优于现有方法。

Comments MICCAI2026 early accept

AI中文摘要

在新的临床环境中部署多序列磁共振成像（MRI）分割模型具有挑战性，因为存在扫描仪和采集协议的差异。尽管现有的TTA方法能够处理基本的单模态偏移，但它们在根本性的双偏移问题下常常失效，因为其适应信号无法捕捉模态交互偏移，这会破坏跨序列一致性。为了解决这个问题，我们提出了方差门控跨序列测试时间适应（VISTA），一种无源框架，用于解决模态交互偏移问题。首先，我们设计了一个跨序列干预生成器（ISIG），通过交换低频谱和熵局部化的补丁跨序列生成一组一致性探针，保持解剖语义的同时挑战跨序列依赖性。其次，我们引入了跨视图分歧感知伪标签（CDPL），通过跨视图分歧方差建立体素级可靠性度量，动态门控自我训练并强制干预一致性，促使网络依赖于稳健的解剖语义。大量实验将模型从标准成人MRI（BraTS-GLI-Pre）适应到非洲低场（BraTS-SSA）和儿童（BraTS-PED）群体，在临床偏移下优于竞争方法，实现了绝对Dice改进+1.89%（SSA）和+2.82%（PED）超过源模型。代码可在https://github.com/dzp2095/VISTA获取。

英文摘要

Deploying multi-sequence magnetic resonance imaging (MRI) segmentation models to new clinical environments is challenging due to variations in scanners and acquisition protocols. Although existing TTA methods handle basic per-modality shifts, they often fail under a fundamental dual-shift problem, as their adaptation signals fail to capture modality-interaction shifts that disrupt inter-sequence consistency. To address this, we propose Variance-gated Inter-Sequence Test-time Adaptation (VISTA), a source-free framework that tackles modality-interaction shifts. First, we design an Inter-Sequence Intervention Generator (ISIG) that generates a set of consistency probes by swapping low-frequency spectra and entropy-localized patches across sequences, preserving anatomical semantics while challenging inter-sequence dependencies. Second, we introduce Cross-View Disagreement-Aware Pseudo Labeling (CDPL), which establishes a voxel-wise reliability metric using cross-view disagreement variance to dynamically gate self-training and enforce interventional consistency, encouraging the network to rely on robust anatomical semantics. Extensive experiments adapting from standard adult MRI (BraTS-GLI-Pre) to African low-field (BraTS-SSA) and pediatric (BraTS-PED) cohorts show improved performance over competing methods under clinical shifts, achieving absolute Dice improvements of +1.89% (SSA) and +2.82% (PED) over the source model. The code is available at https://github.com/dzp2095/VISTA.

2605.17429 2026-05-19 cs.LG cs.CV 版本更新

Radial-Angular Geometry for Reliable Update Diagnosis in Noisy-Label Learning

径向-角向几何用于噪声标签学习中的可靠更新诊断

Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Weiguang Qu, Yanhui Gu

发表机构 * Nanjing Normal University（南京师范大学）； Nanjing University of Chinese Medicine（南京中医药大学）

AI总结本文提出了一种基于径向-角向几何的方法，用于在噪声标签学习中可靠地诊断更新，通过比较观测标签梯度与EMA教师诱导的参考梯度，区分困难干净样本产生的对齐更新与由损坏标签引起的冲突更新。

AI中文摘要

噪声标签方法通常从损失、置信度或熵等前向空间信号来估计样本可靠性。这些信号表明样本是否难以预测，但它们并不直接检验其观测标签是否会带来可靠的参数更新。这个差距很重要，因为困难的干净样本和错误标记的样本可能具有相似的损失，却会诱导不同的更新。我们将可靠性估计重新表述为对观测标签更新的诊断。样本级经验Fisher迹提供了一个反向空间的更新能量度量：对于分类器层，它分解为一个预测残差项和一个特征敏感性项，因此捕获了标量损失之外的信息。然而，迹仍是一个径向幅度信号，无法判断一次大更新是有益还是有害。因此，我们提出了相对几何冲突（RGC），它将观测标签梯度与由EMA教师诱导的参考梯度进行比较。冲突项有助于区分困难干净样本产生的大而对齐的更新与由损坏标签引起的大而冲突的更新。在合成和现实世界的噪声标签基准上，RGC在我们的评估协议下提高了困难干净样本的保留率和准确性。

英文摘要

Noisy-label methods often estimate sample reliability from forward-space signals such as loss, confidence, or entropy. These signals indicate whether a sample is difficult to predict, but they do not directly test whether its observed label induces a reliable parameter update. This gap matters because hard clean samples and mislabeled samples can have similar loss while inducing different updates. We recast reliability estimation as diagnosis of the observed-label update. The sample-wise empirical Fisher trace gives a backward-space measure of update energy: for the classifier layer, it factorizes into a prediction-residual term and a feature-sensitivity term, so it captures information beyond scalar loss. Trace, however, is still a radial magnitude signal and cannot decide whether a large update is useful or harmful. We therefore propose Relative Geometric Conflict (RGC), which compares the observed-label gradient with a reference gradient induced by an EMA teacher. The conflict term helps distinguish large but aligned hard-clean updates from large conflicting updates caused by corrupted labels. Across synthetic and real-world noisy-label benchmarks, RGC improves hard-clean preservation and accuracy under our evaluation protocol.
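The factorization mentioned above follows from a standard identity: for a softmax classifier layer with logits z = W h and cross-entropy loss, the per-sample gradient with respect to W is the outer product (p - y) hᵀ, so its squared norm is ‖p - y‖² · ‖h‖². The sketch below checks this numerically and computes a cosine between the observed-label gradient and a reference gradient. The function names are hypothetical, and the cosine is only one plausible way to express "conflict"; the abstract does not give the exact RGC formula.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(p, label):
    """Prediction residual p - onehot(label)."""
    return [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]

def outer(r, h):
    """Classifier-layer gradient of cross-entropy: (p - y) h^T."""
    return [[ri * hj for hj in h] for ri in r]

def frobenius_sq(m):
    """Squared gradient norm, i.e. the sample-wise empirical Fisher trace
    restricted to the classifier layer."""
    return sum(v * v for row in m for v in row)

def gradient_cosine(p, observed_label, reference_label):
    """Cosine between the observed-label and reference-label gradients.
    The shared feature factor h cancels, leaving the residuals' cosine."""
    r_obs = residual(p, observed_label)
    r_ref = residual(p, reference_label)
    return dot(r_obs, r_ref) / math.sqrt(dot(r_obs, r_obs) * dot(r_ref, r_ref))
```

In this toy setting, a sample whose observed label agrees with the reference gives a cosine of 1 regardless of how large the update is, while a label that contradicts a confident reference gives a negative cosine. Magnitude alone (the trace) cannot separate these two cases.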

2605.17423 2026-05-19 cs.CV 版本更新

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

Soap2Soap：通过多智能体协作实现长篇电影级视频重制

Yiren Song, Huilin Zhong, Kevin Qinghong Lin, Haofan Wang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore（新加坡国立大学Show实验室）； University of Oxford（牛津大学）； Lovart AI（Lovart人工智能）

AI总结本研究提出 Soap2Soap 框架，通过多智能体协作实现长篇电影级视频重制，解决视频到视频生成中长期一致性与叙事保真度的问题。

AI中文摘要

我们研究系列级电影重制，这是一个长时域视频到视频生成问题：通过风格化或演员替换对完整剧集或电影进行本地化，同时在数百个镜头中严格保持叙事结构、动作编排和角色身份。现有视频生成和编辑管道在此场景下常常失效，因为在大幅相机运动和视角变化下，身份漂移、背景突变和语义侵蚀会不断累积。我们提出 Soap2Soap，一个通过双桥一致性机制强制长期语言-视觉一致性的多智能体框架：一个场景感知的 JSON 剧本作为持久的语义骨架，以及在场景和镜头两个级别动态分配的视觉参考锚点。为在视频合成前抑制漂移，我们引入批量关键帧一致性，通过基于网格的公式在共享的潜在上下文中联合生成多个关键帧。一个闭环验证智能体进一步审计身份、稳定性和对齐度，以触发选择性再生。在 SoapBench 上的实验显示，与商业视频生成 API 相比，该方法在长期一致性和叙事保真度方面有显著提升。

英文摘要

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.

2605.15735 2026-05-19 cs.CV cs.AI 版本更新

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

UAM：VLA训练中遗忘的双流视角

Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

发表机构 * Tsinghua University（清华大学）

AI总结本文提出UAM模型，通过双流架构解决VLA训练中因单一编码器导致的多模态能力下降问题，展示了通过架构分离而非冻结权重或辅助数据可实现语义保留，并在多种任务中取得高成功率。

AI中文摘要

视觉-语言-动作（VLA）模型通常通过在动作数据上微调预训练的视觉-语言模型（VLM）来构建。然而，我们证明这种标准做法会系统性地削弱VLM的多模态能力，我们将这种副作用称为“具身税”。但VLA是否必须遗忘？受生物视觉双流组织的启发，我们将这种退化归因于结构性瓶颈：当前VLA要求单一编码器同时支持基于语言的语义和与控制相关的视觉特征，而生物视觉将识别与视觉运动控制分为不同的通路。基于此观点，我们提出了统一动作模型（UAM），它添加了一个并行的背侧专家，作为大脑背侧通路的类比。为了使背侧专家成为有效的第二通路并减轻VLM的控制学习负担，我们从预训练的生成模型初始化它，并用预测视觉动态的中层推理目标对其进行训练。这种设计使我们能够仅用动作数据端到端地训练整个VLA：无需参数冻结、无需梯度停止、无需辅助VL共训练，UAM保留了底层VLM超过95%的多模态能力，同时在多种探测分布外泛化的操作任务（包括未见物体、新的物体-目标组合和指令变化）上取得了各基线中最高的平均成功率。这些结果表明，VLA中的语义保留可以源自架构分离本身，而非依靠冻结权重或辅助数据重放来强制实现，并且这种保留下来的语义能力可以自然地从VLM迁移到动作层面的语义泛化。

英文摘要

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

2605.15586 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

拥抱偏置转移矩阵以实现多类互补标签学习

Tan-Ha Mai, Chao-Kai Chiang, Han-Hwa Shih, Gang Niu, Masashi Sugiyama, Hsuan-Tien Lin

发表机构 * National Taiwan University（国立台湾大学）； The University of Tokyo（东京大学）； RIKEN Center for Advanced Intelligence Project（日本理化学研究院先进智能项目中心）

AI总结本文提出了一种新的框架BICL，通过设计偏置的标签生成过程来克服传统互补标签学习在多类设置中的限制，从而在CIFAR-100和TinyImageNet-200上实现了传统方法的七倍以上准确率提升。

Comments 33 pages, 16 figures, 18 tables

AI中文摘要

互补标签学习（CLL）是一种弱监督范式，其中实例被标记为不属于其类别的标签。尽管已有十年的研究，CLL方法主要在10类分类任务中具有竞争力，而扩展到大规模标签空间仍然是一个持久的瓶颈。这种限制源于传统方法对均匀标签生成的假设，这在多类设置中严重稀释了学习信号。在本文中，我们证明通过故意设计偏置（非均匀）的生成过程，将互补标签限制在类别的子集，可以克服这一长期存在的障碍。这一发现促使我们提出Bias-Induced Constrained Labeling（BICL），一个涵盖数据收集到训练的原理性框架，利用这种偏置。BICL在CIFAR-100和TinyImageNet-200上实现了有效学习，比传统方法的准确率提高了超过七倍。我们的发现为在现实应用中使CLL适用于多类问题开辟了新的道路。

英文摘要

Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.

2605.15487 2026-05-19 cs.LG cs.CV eess.IV 版本更新

Learning Normalized Energy Models for Linear Inverse Problems

学习归一化能量模型以解决线性逆问题

Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth

发表机构 * Rice University（莱斯大学）； Flatiron Institute（Flatiron研究所）； New York University（纽约大学）

AI总结本文提出了一种新的能量模型，用于解决线性逆问题，通过引入基于协方差的正则化项来提高不同测量条件下的一致性，从而计算出归一化的后验密度，无需额外训练或微调，同时实现了能量引导的自适应采样、无偏的Metropolis-Hastings修正步骤以及通过贝叶斯规则估计退化算子。

Comments ICML 2026

Journal ref Int'l Conf Machine Learning (ICML), Jul 2026. https://openreview.net/forum?id=PlFJwgaaDK

AI中文摘要

生成扩散模型可以为成像中的逆问题提供强大的先验概率模型，但现有实现存在两个关键限制：(i) 先验密度以隐式方式表示，(ii) 它们依赖于似然近似，这会引入采样偏差。我们通过引入一种新的能量模型来解决这些挑战，该模型以去噪为目标进行训练，并带有基于协方差的正则化项，以确保在不同测量条件下的一致性。训练后的模型能够为各种线性逆问题计算归一化的后验密度，而无需额外的重新训练或微调。除了保留扩散模型的采样能力外，这还带来了以前无法实现的能力：可实时调整采样计划的能量引导自适应采样、无偏的Metropolis-Hastings修正步骤，以及通过贝叶斯规则对退化算子进行盲估计。我们在多个数据集（ImageNet、CelebA、AFHQ）和任务（修复、去模糊）上验证了该方法，证明其性能与现有基线相当或更优。

英文摘要

Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: $(i)$ the prior density is represented implicitly, and $(ii)$ they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on-the-fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.

2605.11817 2026-05-19 cs.RO cs.CV 版本更新

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia, Chengbin Du, Yunke Wang, Chang Xu

发表机构 * The University of Sydney（悉尼大学）； City University of Hong Kong（香港城市大学）

AI总结本文提出了一种基于可微网格采样的视觉-语言-动作模型压缩方法，通过连续的token重采样保留关键空间信息，在仅使用不到10%原始视觉token的情况下实现76%的FLOPs减少，且成功率没有下降。

Journal ref Proceedings of the Forty-third International Conference on Machine Learning, 2026

AI中文摘要

视觉-语言-动作（VLA）模型在机器人操作中表现出色，但其高计算成本限制了实时部署。现有token剪枝方法面临根本性的权衡：使用剪枝进行激进压缩会不可避免地丢弃接触点等关键几何细节，导致性能严重下降。这迫使人们做出妥协，限制了可达到的压缩率以及潜在的加速。我们主张，要打破这种权衡，需要将压缩重新理解为视觉编码器中几何感知的连续token重采样。为此，我们提出了可微网格采样器（GridS），一个即插即用的模块，用于在VLA中对视觉token进行任务感知的连续重采样。通过自适应预测最小的显著坐标集并利用可微插值提取特征，GridS在保留关键空间信息的同时实现了大幅压缩（少于10%的原始视觉token）。在LIBERO基准和真实机器人平台上的实验表明，GridS实现了76%的FLOPs减少，且成功率没有下降。代码可在https://github.com/Fediory/Grid-Sampler上获得。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in the success rate, validating the lowest feasible visual token count reported to date. The code is available at https://github.com/Fediory/Grid-Sampler.
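Feature extraction by differentiable interpolation at continuous coordinates can be sketched with plain bilinear sampling. This is an illustrative assumption about what such a sampler might look like, not the GridS implementation: `bilinear_sample` and `resample` are hypothetical names, the feature map is a single-channel grid, and the coordinates are given rather than predicted by a network.

```python
import math

def bilinear_sample(fmap, x, y):
    """Sample a 2D feature map at continuous (x=column, y=row) coordinates.
    The result is a weighted sum of the four surrounding cells, so it varies
    smoothly with x and y, which is what lets gradients reach the
    coordinate predictor in a learned setting."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1)  # clamp to the valid range
    y = min(max(y, 0.0), h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom

def resample(fmap, coords):
    """Replace the full grid of tokens by one sampled value per coordinate."""
    return [bilinear_sample(fmap, x, y) for x, y in coords]
```

Unlike hard pruning, which keeps or drops whole grid cells, a sampled token can sit between cells and blend their features, so a small set of well-placed coordinates can still cover fine geometric detail.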

2605.11567 2026-05-19 cs.CV 版本更新

Dynamic Execution Commitment of Vision-Language-Action Models

视觉-语言-动作模型的动态执行承诺

Feng Chen, Xianghui Wang, Yuxuan Chen, Boying Li, Yefei He, Zeyu Zhang, Yicheng Wu

发表机构 * University of Adelaide（阿德莱德大学）； Sichuan University（四川大学）； Shanghai Jiao Tong University（上海交通大学）； Monash University（莫纳什大学）； Zhejiang University（浙江大学）； Imperial College London（伦敦帝国理工学院）

AI总结本文提出A3机制，通过将动态执行承诺重新定义为自推测前缀验证问题，解决了视觉-语言-动作模型在动态或分布外情况下执行鲁棒性和推理吞吐量之间的平衡问题。

Comments code is available at https://inceptionwang.github.io/A3/

AI中文摘要

视觉-语言-动作（VLA）模型主要采用动作分块方法，即在单次前向传递中预测并承诺执行一小段连续的低层动作，以摊销大规模主干网络的推理成本并减少每步延迟。然而，将这些多步预测提交到现实世界执行需要在成功率和推理效率之间进行平衡，这一决策通常由针对每个任务调整的固定执行时间范围控制。此类启发式方法忽略了预测可靠性随状态变化的特性，导致在动态或分布外情况下表现脆弱。在本文中，我们引入了A3，一种自适应动作接受机制，将动态执行承诺重新定义为自推测前缀验证问题。A3首先通过分组采样计算轨迹级的动作共识分数，然后选择一个代表性的草稿并优先验证下游部分。具体而言，它强制执行：（1）共识有序的条件不变性，通过判断低共识动作在以高共识动作为条件重新解码后是否保持一致来验证低共识动作；以及（2）前缀封闭的序列一致性，通过只接受从起点开始的最长连续已验证动作序列来保证物理执行的完整性。因此，执行时间范围自然成为同时满足模型内部逻辑和序列执行约束的最长可验证前缀。在多种VLA模型和基准测试中，实验表明A3消除了手动调整时间范围的需要，同时在执行鲁棒性和推理吞吐量之间实现了更优的平衡。

英文摘要

Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
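The prefix-closed acceptance rule described above reduces to taking the longest run of verified actions that starts at the first action of the chunk. The sketch below shows only that rule. The names are hypothetical, the per-action verification flags are assumed to be computed elsewhere, and the `min_commit` floor is an added assumption so that a toy controller always makes progress; the abstract does not say how A3 handles a chunk with no verified prefix.

```python
def longest_verified_prefix(verified):
    """Length of the longest run of True values starting at index 0."""
    n = 0
    for ok in verified:
        if not ok:
            break
        n += 1
    return n

def commit_actions(chunk, verified, min_commit=1):
    """Return the actions to execute before re-planning.
    A verified action after the first failure is not executed, because
    executing it would skip over an unverified step."""
    n = max(min_commit, longest_verified_prefix(verified))
    return chunk[:n]
```

For example, with flags `[True, True, False, True]` only the first two actions are committed, so the execution horizon adapts per chunk instead of being a fixed, hand-tuned constant.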

2605.10239 2026-05-19 cs.CV 版本更新

SlimDiffSR: 向轻量高效遥感图像超分辨率迈进：通过扩散模型蒸馏

Ce Wang, Zhenyu Hu, Wanjie Sun

发表机构 * School of Remote Sensing and Information Engineering, Wuhan University（武汉大学遥感与信息工程学院）

AI总结本文提出SlimDiffSR，一种轻量高效的基于扩散模型的遥感图像超分辨率框架，通过引入不确定性引导的时间步分配策略和结构化剪枝策略，提升模型效率和重建质量。

AI中文摘要

扩散模型最近在图像超分辨率（SR）中取得了显著性能，但其高计算成本限制了在遥感应用中的实际部署。为了解决这个问题，我们提出了SlimDiffSR，一种轻量高效的基于扩散模型的框架，用于真实场景的遥感图像超分辨率。与依赖固定时间步的现有单步扩散方法不同，我们首先引入了不确定性引导的时间步分配策略，以构建一个更强的单步教师模型，其中重建难度与扩散时间步显式关联，从而实现自适应的生成强度。在此教师模型的基础上，我们进一步提出了一种针对遥感图像的结构化剪枝策略，系统地移除冗余的语义模块，并用轻量级设计替换标准操作，包括频域可分离卷积、方向可分离卷积以及查询驱动的全局聚合模块。这些组件显式利用了遥感数据的独特特性，如稀疏的高频细节、强方向模式和长距离空间依赖性。为了增强知识迁移，我们在蒸馏过程中引入最大均值差异（MMD），以对齐教师和学生模型之间的特征分布。在多个遥感基准上的广泛实验表明，SlimDiffSR在效率和重建质量之间实现了良好的平衡。特别是，与多步扩散模型相比，它实现了高达200倍的推理加速和20倍的模型参数减少，同时在感知质量方面具有竞争力，并在效率上明显优于现有的轻量级扩散基线。代码可在：https://github.com/wwangcece/SlimDiffSR获取。

英文摘要

Diffusion models have recently achieved remarkable performance in image super-resolution (SR), but their high computational cost limits practical deployment in remote sensing applications. To address this issue, we propose SlimDiffSR, a lightweight and efficient diffusion-based framework for real-world remote sensing image super-resolution. Unlike existing single-step diffusion methods that rely on fixed timesteps, we first introduce an uncertainty-guided timestep assignment strategy to construct a stronger single-step teacher model, where reconstruction difficulty is explicitly linked to diffusion timesteps, enabling adaptive generative strength. Building upon this teacher, we further present a structured pruning strategy tailored to remote sensing imagery, which systematically removes redundant semantic modules and replaces standard operations with lightweight designs, including frequency-separable convolution, direction-separable convolution, and a query-driven global aggregation module. These components explicitly exploit the unique characteristics of remote sensing data, such as sparse high-frequency details, strong directional patterns, and long-range spatial dependencies. To enhance knowledge transfer, we incorporate Maximum Mean Discrepancy (MMD) into the distillation process to align feature distributions between the teacher and student models. Extensive experiments on multiple remote sensing benchmarks demonstrate that SlimDiffSR achieves a favorable balance between efficiency and reconstruction quality. In particular, it attains up to $200\times$ inference acceleration and a $20\times$ reduction in model parameters compared with multi-step diffusion models, while achieving competitive perceptual quality and clearly outperforming existing lightweight diffusion baselines in efficiency. The code is available at: https://github.com/wwangcece/SlimDiffSR.
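Maximum Mean Discrepancy, which the abstract uses to align teacher and student feature distributions, can be written down compactly. The sketch below is the standard biased estimator of squared MMD with a Gaussian (RBF) kernel; the kernel choice, the bandwidth `sigma`, and the function names are assumptions, since the abstract does not specify how MMD is configured in SlimDiffSR.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2(teacher_feats, student_feats, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors:
    E[k(t, t')] + E[k(s, s')] - 2 E[k(t, s)]."""
    return (mean_kernel(teacher_feats, teacher_feats, sigma)
            + mean_kernel(student_feats, student_feats, sigma)
            - 2.0 * mean_kernel(teacher_feats, student_feats, sigma))
```

The estimate is zero when the two feature sets coincide and grows as they move apart, so it can serve as a distillation loss that matches distributions rather than individual activations.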

2604.24763 2026-05-19 cs.CV 版本更新

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Tuna-2：像素嵌入在多模态理解和生成中优于视觉编码器

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong

发表机构 * Meta AI ； The University of Hong Kong（香港大学）； University of Waterloo（滑铁卢大学）

AI总结本文提出Tuna-2，一种基于像素嵌入的统一多模态模型，通过直接使用像素嵌入进行多模态理解和生成，展示了统一像素空间建模在高质量图像生成中可以与潜在空间方法竞争，并证明了预训练视觉编码器在多模态建模中并非必要。

Comments Project page: https://tuna-ai.org/tuna-2

AI中文摘要

统一多模态模型通常依赖于预训练的视觉编码器，并使用独立的视觉表示进行理解和生成，导致两种任务之间存在不一致，阻碍了从原始像素进行端到端优化。我们引入Tuna-2，一种原生统一多模态模型，直接基于像素嵌入进行视觉理解和生成。Tuna-2通过使用简单的补丁嵌入层来编码视觉输入，大幅简化了模型架构，完全摒弃了诸如VAE或表示编码器等模块化视觉编码器设计。实验表明，Tuna-2在多模态基准测试中实现了最先进的性能，证明了统一像素空间建模能够与潜在空间方法在高质量图像生成中竞争。此外，虽然基于编码器的变体在早期预训练中收敛更快，但Tuna-2的无编码器设计在大规模情况下实现了更强的多模态理解，特别是在需要细粒度视觉感知的任务中。这些结果表明，预训练视觉编码器在多模态建模中并非必要，端到端的像素空间学习为生成和感知的更强视觉表示提供了一条可扩展的路径。

英文摘要

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

2604.20155 2026-05-19 cs.CV 版本更新

GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

GSCompleter: 一种无需蒸馏的插件，用于在几秒钟内进行基于度量的3D高斯溅射完成

Ao Gao, Jingyu Gong, Xin Tan, Zhizhong Zhang, Lizhuang Ma, Yuan Xie

发表机构 * School of Computer Science and Technology, East China Normal University（华东师范大学计算机科学与技术学院）； Shanghai Innovation Institute（上海创新研究院）； Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University（重庆精密光学重点实验室，华东师范大学重庆研究院）； Shanghai Key Laboratory of Computer Software Evaluating and Testing（上海计算机软件评测测试重点实验室）； Department of Computer Science and Engineering, Shanghai Jiao Tong University（上海交通大学计算机科学与工程系）

AI总结本文提出了一种无需蒸馏的GSCompleter插件，通过稳定的'生成-注册'流程实现基于度量的3D高斯溅射完成，提高了完成质量和效率，并在三个基准上取得了新的最先进的结果。

AI中文摘要

3D高斯溅射（3DGS）凭借其显式表示和效率，已彻底改变了高保真神经渲染。然而，从稀疏视角重建场景会因覆盖范围有限而出现严重的几何空洞和漂浮物。当前的场景补全方法通常依赖于迭代的“修复-蒸馏”范式，这种范式计算成本高，优化容易不稳定，并且容易过拟合。为了解决这些限制，我们提出了GSCompleter，一种无需蒸馏的插件，将场景补全转变为稳定的“生成-注册”流程。具体而言，GSCompleter合成视觉上合理的2D参考图像，并通过稳健的立体锚点视角选择机制将其显式提升为具有一致度量尺度的3D高斯基元。这些新生成的基元随后通过新颖的射线约束注册策略无缝集成到全局场景中。通过用快速的几何注册替代不稳定的蒸馏，GSCompleter在三个基准上表现出优越的3DGS补全性能，相比各种基线在质量和效率上均有提升，并取得了新的最先进（SOTA）结果。

英文摘要

3D Gaussian Splatting (3DGS) has revolutionized high-fidelity neural rendering with its explicit representation and efficiency. However, reconstructing scenes from sparse viewpoints suffers from severe geometric voids and floaters due to limited coverage. Current scene completion methods typically rely on an iterative "Repair-then-Distill" paradigm, which is computationally intensive, prone to unstable optimization, and susceptible to overfitting. To address these limitations, we propose GSCompleter, a distillation-free plugin that shifts scene completion to a stable "Generate-then-Register" workflow. Specifically, GSCompleter synthesizes visually plausible 2D reference images and explicitly lifts them into 3D Gaussian primitives with a consistent metric scale via a robust Stereo-Anchor View Selection mechanism. These newly generated primitives are then seamlessly integrated into the global scene using a novel Ray-Constrained Registration strategy. By replacing unstable distillation with rapid geometric registration, GSCompleter exhibits superior 3DGS completion performance across three benchmarks, enhancing both quality and efficiency over various baselines and achieving new state-of-the-art (SOTA) results.

2604.16429 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph 版本更新

(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models

(稀疏) 注意细节：在基于机器学习的天气预测模型中保持频谱保真度

Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent

发表机构 * AMLab（AM实验室）； University of Amsterdam（阿姆斯特丹大学）

AI总结本文提出Mosaic模型，通过学习到的函数扰动生成集合成员，并利用网格对齐的块稀疏注意力机制，在原生分辨率网格上操作，以线性成本捕捉长距离依赖关系，从而在1.5°分辨率下达到或超越更精细分辨率模型的性能，在1.5°模型中取得了最先进的结果。

Comments Accepted to ICML 2026

AI中文摘要

我们介绍Mosaic，一种概率天气预测模型，旨在解决基于机器学习的天气预测中频谱退化的三种失败模式：频谱阻尼（统计性）、高频混叠（架构性）和残余高频泄漏（参数性）。Mosaic通过学习到的函数扰动生成集合成员，并通过网格对齐的块稀疏注意力在原生分辨率网格上操作；这是一种与硬件对齐的机制，通过在空间相邻的查询之间共享键和值，以线性成本捕捉长距离依赖关系。在1.5°分辨率和214M参数下，Mosaic在关键变量上达到或超越了在精细6倍的分辨率上训练的模型的性能，并在1.5°模型中取得了最先进的结果，生成了校准良好的集合，其各个成员在所有可解析频率上表现出近乎完美的频谱对齐。一个24成员、10天的预测在单个H100 GPU上耗时不到12秒。代码可在https://github.com/maxxxzdn/mosaic上获得。

英文摘要

We introduce Mosaic, a probabilistic weather forecasting model that addresses three failure modes of spectral degradation in ML-based weather prediction: spectral damping (statistical), high-frequency aliasing (architectural), and residual high-frequency leakage (parametric). Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via mesh-aligned block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on 6$\times$ finer resolution on key variables and achieves state-of-the-art results among 1.5° models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12s on a single H100 GPU. Code is available at https://github.com/maxxxzdn/mosaic.

2603.27341 2026-05-19 cs.AI cs.CV cs.LG (updated version)

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

Affiliations: Center for Applied AI, Chicago Booth; Surgical Data Science Collective; Children’s National Hospital; Operations Management & Tolan Center for Healthcare, Chicago Booth

AI summary: Using state-of-the-art AI methods available in 2026, this paper studies the performance and limitations of surgical tool detection. It finds that even with multi-billion-parameter models and extensive training, current vision-language models fall short on neurosurgical tool detection, and that increasing model size and training time yields only limited gains, indicating that current AI still faces significant challenges in surgical applications.

Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites. Since surgery requires integrating disparate tasks, generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

2603.23672 2026-05-19 cs.RO cs.CV (updated version)

Bio-Inspired Event-Based Visual Servoing for Ground Robots

Maral Mordad, Kian Behzad, Debojyoti Biswas, Noah J. Cowan, Milad Siami

Affiliations: Department of Electrical & Computer Engineering, Northeastern University; Laboratory for Computational Sensing and Robotics, Johns Hopkins University; Department of Mechanical Engineering, Johns Hopkins University

AI summary: This paper proposes a bio-inspired 1D event-based visual servoing framework for ground robots operating in structured environments. Using a Dynamic Vision Sensor and a multi-pattern stimulus, it directly synthesizes a nonlinear state feedback term, achieving efficient, low-latency control.

Abstract

Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper introduces a principled 1D event-based visual servoing framework for ground robots operating in structured environments. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific combinations of kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot's velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state feedback term without any traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extremely low latency, and computational efficiency of the proposed direct-sensing approach.

2603.21787 2026-05-19 cs.CV (updated version)

Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTevent

Lokeshwaran Manohar, Moritz Roidl

Affiliations: Chair of Material Handling and Warehousing, TU Dortmund University, Dortmund, Germany

AI summary: This paper benchmarks the recurrent ReYOLOv8s detector on the MTevent dataset for industrial multi-class recognition, using a non-recurrent YOLOv8s as a baseline to analyze the effect of temporal memory, and finds that event-domain pretraining yields larger gains.

Comments Accepted at the Neuromorphic Field Robotics and Automation Workshop, ICRA 2026

Abstract

Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTevent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTevent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the non-recurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
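The 9.6% figure in the abstract above is a relative (not absolute) mAP50 gain. A minimal sketch of that arithmetic follows; the helper name `relative_improvement` is hypothetical, and only the 9.6% value is reported by the authors. The percentages printed for the GEN1 and PEDRo initializations are derived here from the quoted mAP50 scores against the same 0.260 baseline and are not figures stated in the abstract.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# mAP50 values quoted in the abstract
baseline_map50 = 0.260  # non-recurrent YOLOv8s
results = {
    "scratch recurrent (C21)": 0.285,
    "GEN1-initialized (clip length 21)": 0.329,
    "PEDRo-initialized": 0.251,
}

for name, score in results.items():
    change = relative_improvement(score, baseline_map50)
    print(f"{name}: {change:+.1f}% relative to baseline")
```

The first line reproduces the reported 9.6%; a negative value for PEDRo is consistent with the abstract's remark that mismatched pretraining can underperform training from scratch.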

2603.13652 2026-05-19 cs.CV (updated version)

Causal Attribution via Activation Patching

Amirmohammad Izadi, Mohammadali Banayeeanzade, Alireza Mirrokni, Hosein Hasani, Mobin Bagherian, Faridoun Mehri, Mahdieh Soleymani Baghshah

Affiliations: Sharif University of Technology

AI summary: This paper proposes CAAP, a causal attribution method that estimates the contribution of image patches to a Vision Transformer's prediction by intervening directly on internal activations, yielding more faithful and better-localized attributions.

Abstract

Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing attribution methods face several limitations, with gradient-based, relevance-propagation, and attention-based methods relying on local approximations, while perturbation or optimization-based methods intervene on inputs, tokens, or surrogates rather than internal patch representations. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers; methods that operate only on input changes, attention weights, or backward relevance signals may therefore provide indirect proxies for patch importance rather than directly testing the predictive effect of contextualized patch representations. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP consistently outperforms existing methods in various settings and produces more faithful and localized attributions.
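The patching procedure described above can be illustrated with a toy sketch. This is not the authors' implementation: the `layer` function is a hypothetical stand-in for a ViT block (each token is mixed with the token mean), the "target-class score" is simply the mean of the final tokens, and the all-zero neutral context and the layer range are assumptions made for illustration. What the sketch does show is the general pattern: cache source activations, run a neutral input, overwrite one patch token over an intermediate layer range, and read the resulting score as that patch's attribution.

```python
from typing import List, Optional, Tuple

Tokens = List[float]

def layer(tokens: Tokens) -> Tokens:
    """Toy stand-in for one ViT block: mixes each token with the token mean."""
    mean = sum(tokens) / len(tokens)
    return [0.9 * t + 0.1 * mean for t in tokens]

def forward(tokens: Tokens, n_layers: int,
            source_acts: Optional[List[Tokens]] = None,
            patch_idx: Optional[int] = None,
            patch_layers: range = range(0)) -> Tuple[float, List[Tokens]]:
    """Run the toy model, optionally overwriting one token with cached source
    activations at the given layers. Returns (score, per-layer activations)."""
    tokens = list(tokens)
    acts: List[Tokens] = []
    for depth in range(n_layers):
        tokens = layer(tokens)
        if source_acts is not None and patch_idx is not None and depth in patch_layers:
            tokens[patch_idx] = source_acts[depth][patch_idx]
        acts.append(list(tokens))
    score = sum(tokens) / len(tokens)  # hypothetical "target-class score"
    return score, acts

def patch_attribution(source: Tokens, neutral: Tokens, n_layers: int,
                      patch_layers: range) -> List[float]:
    """Attribution per patch: score after patching that patch's source
    activations into the neutral context."""
    _, source_acts = forward(source, n_layers)
    return [forward(neutral, n_layers, source_acts, i, patch_layers)[0]
            for i in range(len(source))]

source = [0.0, 0.0, 5.0, 0.0]   # only patch 2 carries "evidence"
neutral = [0.0, 0.0, 0.0, 0.0]  # assumed neutral target context
attributions = patch_attribution(source, neutral, n_layers=6,
                                 patch_layers=range(1, 4))
print(attributions)
```

In this toy setting the patch that carries the signal receives the largest attribution, while the other patches receive small non-zero values because mixing has already leaked some of the signal into their source activations.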

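2603.10935 2026-05-19 cs.LG cs.AI cs.CV (updated version)

Spherical VAE with Cluster-Aware Feasible Regions: Guaranteed Prevention of Posterior Collapse

Zegu Zhang, Jian Zhang

Affiliations: Independent Researcher

AI summary: This paper proposes a framework that theoretically guarantees non-collapsed solutions, using spherical shell geometry and cluster-aware constraints to prevent posterior collapse in VAEs, and reports 100% collapse prevention on synthetic and real-world datasets.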

Comments 8 pages, 6 figures

Abstract

Variational autoencoders (VAEs) frequently suffer from posterior collapse, where the latent variables become uninformative as the approximate posterior degenerates to the prior. While recent work has characterized collapse as a phase transition determined by data covariance properties, existing approaches primarily aim to avoid rather than eliminate collapse. We introduce a novel framework that theoretically guarantees non-collapsed solutions by leveraging spherical shell geometry and cluster-aware constraints. Our method transforms data to a spherical shell, computes optimal cluster assignments via K-means, and defines a feasible region between the within-cluster variance $W$ and collapse loss $\delta_{\text{collapse}}$. We prove that when the reconstruction loss is constrained to this region, the collapsed solution is mathematically excluded from the feasible parameter space. Critically, we introduce norm constraint mechanisms that ensure decoder outputs remain compatible with the spherical shell geometry without restricting representational capacity. Unlike prior approaches, our method provides a strict theoretical guarantee with minimal computational overhead without imposing constraints on decoder outputs. Experiments on synthetic and real-world datasets demonstrate 100% collapse prevention under conditions where conventional VAEs completely fail, with reconstruction quality matching or exceeding state-of-the-art methods. Our approach requires no explicit stability conditions (e.g., $\sigma^2 < \lambda_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/spherical-vae-with-Cluster.

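2603.00607 2026-05-19 cs.CV cs.AI (updated version)

Meltdown: Circuits and Bifurcations in Point-Cloud-Conditioned 3D Diffusion Transformers

Maximilian Plattner, Fabian Paischer, Johannes Brandstetter, Arturs Berzins

Affiliations: Institute for Machine Learning, JKU Linz

AI summary: This study examines failure modes of point-cloud-conditioned 3D diffusion transformers under input variation, identifies the Meltdown phenomenon, explains its cause through a mechanistic case study, and proposes PowerRemap to suppress it.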

Abstract

Sparse point clouds are a common input modality for 3D surface reconstruction, including in safety-critical settings such as surgical navigation and autonomous perception. Recent point-cloud-conditioned 3D diffusion transformers achieve state-of-the-art results in this regime by leveraging learned priors. We show that these models can fail catastrophically under realistic input variation, and present a mechanistic case study of why. We identify a failure mode we call Meltdown: tiny on-surface perturbations to a sparse input point cloud can fracture the reconstructed output into hundreds of disconnected pieces. Adversarial search recovers Meltdown in 89.9-100% of shapes across the two open-weight state-of-the-art architectures we study (WaLa, Make-a-Shape) on real-world datasets (GSO, SimJEB) and under both DDPM and DDIM sampling. We trace Meltdown along the forward pass: it is governed by how uniformly the points are distributed on the surface, faithfully transduced through the point-cloud encoder, and committed by a single early-denoising cross-attention write in the diffusion backbone. Diffusion-trajectory ensembles exhibit symmetry-breaking near this commit step, consistent with a bifurcation of the reverse process. Through a suite of matched-magnitude controls, we show that the variable on which the model commits is directional, concentrated in a low-rank subspace of the write's perturbation drift. Motivated by this finding, we introduce PowerRemap, a test-time control that reshapes the singular spectrum of the localized write to suppress this drift, with rescue rates of 98.3% on WaLa and 84.6% on Make-a-Shape. Together, these results link a circuit-level cross-attention mechanism to a trajectory-level account of the failure, demonstrating how mechanistic analysis can explain and guide behavior in conditional diffusion transformers.

2601.14568 2026-05-19 cs.CV cs.AI (updated version)

Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement

Wei Ma, Shaowu Chen, Junjie Ye, Peichang Zhang, Lei Huang

Affiliations: State Key Laboratory of Radio Frequency Heterogeneous Integration (Shenzhen University); Institute of Applied Artificial Intelligence of the Guangdong–HongKong–Macao Greater Bay; Henan Academy of Science Applied Physics Institute Co.,Ltd.

AI summary: This paper proposes a lightweight adaptive video inference enhancement framework that balances resource usage and inference performance by dynamically switching between models of different sizes.

Comments 5 pages, 5 figures

2601.06943 2026-05-19 cs.CV cs.AI (updated version)

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Jisheng Dang, Rui Xu, Sen Hu, Jianheng Hou, Chengwei Qin, Xiaobin Hu, Kunyi Wang, Zhi Yang, Hao Peng, Hong Peng, Ronghao Chen, Huacan Wang

Affiliations: LZU; HKUST(GZ); UBC; FDU; PKU; USC; NUS; UCAS; HKUST; QuantaAlpha

AI summary: This paper presents VideoDR, a benchmark for studying agentic video reasoning in open-web settings. It requires cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning-based verification, and reveals key challenges: maintaining the initial video anchors over long retrieval chains, goal drift, and long-horizon consistency.

Abstract

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

2601.06163 2026-05-19 cs.CV cs.LG (updated version)

Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking

Kaiyuan Deng, Bo Hui, Gen Li, Jie Ji, Minghai Qin, Geng Yuan, Xiaolong Ma

Affiliations: The University of Arizona; The University of Tulsa; Clemson University; Western Digital Corporation; University of Georgia

AI summary: This study proposes the Forget-It-All framework, which exploits model sparsity to address multi-concept unlearning, improving unlearning effectiveness while preserving generation quality.

Comments Accepted to ICML 2026

Journal ref Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

The widespread adoption of text-to-image (T2I) diffusion models has raised concerns about their potential to generate copyrighted, inappropriate, or sensitive imagery. As a practical solution, machine unlearning aims to erase unwanted concepts without retraining from scratch. While most existing methods are effective for single-concept unlearning, they often struggle when removing multiple concepts, causing significant challenges in unlearning effectiveness, generation quality, and sensitivity to hyperparameters and datasets. We take a unique perspective on multi-concept unlearning by leveraging model sparsity and propose the Forget It All (FIA) framework. FIA first introduces Contrastive Concept Saliency to quantify each weight connection's contribution to a target concept. It then identifies Concept Sensitive Neurons by combining temporal and spatial information, ensuring that only neurons consistently responsive to the target concept are selected. Finally, FIA constructs masks from the identified neurons and fuses them into a unified multi-concept mask, where Concept Agnostic Neurons that broadly support general content generation are preserved while concept-specific neurons are pruned to remove the targets. FIA is training-free and requires minimal hyperparameter tuning for new tasks, enabling plug-and-play use. Extensive experiments across three distinct unlearning tasks demonstrate that FIA achieves more reliable multi-concept unlearning, improving forgetting effectiveness while maintaining generation fidelity and quality. Code is available at https://github.com/kaiyuan02415/Forget-It-All
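The mask-fusion step described above can be sketched with plain sets. This is a simplified illustration rather than the FIA implementation: neurons are represented by hypothetical integer ids, the per-concept sensitive sets and the concept-agnostic set are assumed to be given (the abstract does not specify how the agnostic set is computed), and `fuse_masks` is a name introduced here.

```python
from typing import Dict, Set

def fuse_masks(concept_neurons: Dict[str, Set[int]],
               agnostic: Set[int]) -> Set[int]:
    """Union the per-concept neuron sets, then spare concept-agnostic neurons.
    The returned set holds the neuron ids that would be pruned."""
    candidates: Set[int] = set()
    for neurons in concept_neurons.values():
        candidates |= neurons
    return candidates - agnostic

# Hypothetical neuron ids flagged as sensitive to each target concept
concept_neurons = {"concept_a": {1, 2, 3}, "concept_b": {3, 4}}
# Hypothetical neurons that broadly support general generation
agnostic = {2, 7}

pruned = fuse_masks(concept_neurons, agnostic)
print(sorted(pruned))  # [1, 3, 4]
```

Neuron 2 is flagged for `concept_a` but preserved because it is also in the agnostic set, which mirrors the abstract's distinction between pruned concept-specific neurons and preserved Concept Agnostic Neurons.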

2601.06162 2026-05-19 cs.LG cs.CV (updated version)

Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models

Kaiyuan Deng, Gen Li, Yang Xiao, Bo Hui, Xiaolong Ma

Affiliations: The University of Arizona; Clemson University; The University of Tulsa

AI summary: This paper proposes ScaPre, a unified framework for precise concept unlearning at scale in diffusion models. By addressing conflicting updates, imprecise mechanisms, and reliance on additional data, it improves the efficiency and precision of unlearning.

Comments Accepted at ICLR 2026

Journal ref International Conference on Learning Representations (ICLR) 2026

Abstract

Text-to-image diffusion models have achieved remarkable progress, yet their use raises copyright and misuse concerns, prompting research into machine unlearning. However, extending multi-concept unlearning to large-scale scenarios remains difficult due to three challenges: (i) conflicting weight updates that hinder unlearning or degrade generation; (ii) imprecise mechanisms that cause collateral damage to similar content; and (iii) reliance on additional data or modules, creating scalability bottlenecks. To address these, we propose Scalable-Precise Concept Unlearning (ScaPre), a unified framework tailored for large-scale unlearning. ScaPre introduces a conflict-aware stable design, integrating spectral trace regularization and geometry alignment to stabilize optimization, suppress conflicts, and preserve global structure. Furthermore, an Informax Decoupler identifies concept-relevant parameters and adaptively reweights updates, strictly confining unlearning to the target subspace. ScaPre yields an efficient closed-form solution without requiring auxiliary data or sub-models. Comprehensive experiments on objects, styles, and explicit content demonstrate that ScaPre effectively removes target concepts while maintaining generation quality. It forgets up to 5× more concepts than the best baseline within acceptable quality limits, achieving state-of-the-art precision and efficiency for large-scale unlearning. Code is available at https://github.com/kaiyuan02415/scapre

2601.01593 2026-05-19 cs.CV cs.MM (updated version)

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

Haonan Cai, Yuxuan Luo, Zhouhui Lian

Affiliations: Wangxuan Institute of Computer Technology, Peking University; School of Electronics Engineering and Computer Science, Peking University

AI summary: This paper proposes GAR-Font, an autoregressive framework for multimodal few-shot font generation that improves global style consistency and quality through a global-aware tokenizer, a multimodal style encoder, and a post-refinement pipeline.

Comments 28 pages, Accepted as CVPR 2026 Conference Paper

Abstract

Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.

2512.12598 2026-05-19 cs.CV (updated version)

Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

Zhengxue Wang, Zhiqiang Yan, Yuan Wu, Guangwei Gao, Xiang Li, Jian Yang

Affiliations: PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology; School of Computing, National University of Singapore; School of Computer Science, Nankai University; PCA Lab, School of Intelligence Science and Technology, Nanjing University

AI summary: This paper proposes the Multi-Order Matching Network (MOMNet), an alignment-free framework that uses a multi-order matching mechanism and a multi-order aggregation strategy to retrieve and select the most relevant information from misaligned RGB data, improving the performance and generalization of depth super-resolution.

Abstract

Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves superior performance and generalization across both unaligned and aligned datasets.

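2511.16309 2026-05-19 cs.CV cs.LG (updated version)

Sparse Autoencoders are Topic Models

Leander Girrbach, Zeynep Akata

Affiliations: Technical University of Munich (TUM), Munich Center for Machine Learning (MCML), Helmholtz Munich

AI summary: This paper offers a new perspective that treats sparse autoencoders (SAEs) as topic models: it constructs a continuous topic model (CTM) for embedding spaces and derives the SAE objective as a maximum a posteriori estimator, implying that SAE features are thematic components rather than steerable directions.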

Comments ICML 2026

Abstract

Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We propose a continuous topic model (CTM) inspired by Latent Dirichlet Allocation (LDA) for embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. To confirm our theoretical findings, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code is available at https://github.com/ExplainableML/SAE-TM .

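2511.11934 2026-05-19 cs.LG cs.CV (updated version)

A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts

Claudio César Claros Olivares, Austin J. Brockmeier

Affiliations: Department of Electrical & Computer Engineering, University of Delaware

AI summary: This paper systematically evaluates CSFs for out-of-distribution detection through a representation-centric lens, analyzes the effect of architectures, training paradigms, and datasets, and proposes a PCA-based projection-filtering procedure and a Neural Collapse-based prediction approach to improve detection performance.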

Abstract

We present a systematic benchmark of out-of-distribution (OOD) detection CSFs through a representation-centric lens. Our study spans CNN and ViT backbones, multiple training paradigms, four image-classification source datasets (CIFAR-10, CIFAR-100, SuperCIFAR-100, and TinyImageNet), and OOD datasets grouped into near, mid, and far regimes using CLIP-derived semantic distances. To compare CSFs across these settings, we employ a multiple-comparison-controlled rank pipeline that identifies top cliques of statistically indistinguishable winners under threshold-free ranking metrics (AURC and AUGRC). The main empirical finding is that the competitive detector family depends more on the learned representation than on score design alone. For both CNNs and ViTs, simple probabilistic scores dominate misclassification detection. On CNNs, margin-based scores are strongest in near-OOD regimes, while geometry-aware scores such as NNGuide, fDBD, and CTM become more competitive as shift severity increases. On fine-tuned ViTs, the top cliques are led mainly by reconstruction- and residual-based scores. To interpret these ranking shifts, we analyze the last-layer representation using Neural Collapse (NC) metrics. The resulting picture is consistent across architectures: prototype- and boundary-aware scores become stronger when the representation is more collapsed and better aligned with classifier weights, whereas weaker-collapse regimes favor gradient- and manifold-based scores. Building on these insights, we propose two contributions: a simple PCA-based projection-filtering procedure that improves detector performance, and an approach that uses NC measurements computed from a trained classifier to predict its competitive out-of-distribution detector shortlist, without requiring any additional OOD data.
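The threshold-free metric AURC mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: "CSF" is taken to mean a confidence scoring function (the abstract does not expand the acronym), AURC is computed with a common empirical estimator (the mean selective risk over all coverage levels), and the function name `aurc` and the toy inputs are hypothetical. The paper's exact evaluation pipeline may differ, and AUGRC is not covered here.

```python
from typing import Sequence

def aurc(confidences: Sequence[float], errors: Sequence[int]) -> float:
    """Empirical area under the risk-coverage curve.

    Samples are ranked by decreasing confidence; at coverage k/n the
    selective risk is the error rate among the k most confident samples.
    AURC is the mean of these risks over k = 1..n (lower is better)."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    cumulative_errors = 0
    risk_sum = 0.0
    for k, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        risk_sum += cumulative_errors / k
    return risk_sum / len(order)

confidences = [0.9, 0.8, 0.6, 0.3]
good_ranking = aurc(confidences, errors=[0, 0, 1, 1])  # errors get low confidence
bad_ranking = aurc(confidences, errors=[1, 1, 0, 0])   # errors get high confidence
print(good_ranking, bad_ranking)
```

A scoring function that assigns low confidence to the misclassified samples yields a lower AURC than one that ranks them first, which is why the metric can compare scoring functions without choosing a rejection threshold.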


2511.08704 2026-05-19 cs.CV cs.LG 版本更新

Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?

重新思考生成图像预训练：我们离扩大下一步像素预测还有多远？

Xinchen Yan, Chen Liang, Lijun Yu, Adams Wei Yu, Yifeng Lu, Quoc V. Le

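Affiliations: Google DeepMind

AI summary: This paper studies the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end, yet under-explored framework for unified vision models. Transformers are trained on 32x32 images and evaluated on three targets: the next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. The optimal scaling strategy is strongly task-dependent, and as image resolution increases, model size must grow faster than data size. Projecting these findings, the main bottleneck is compute rather than training data; with compute growing four to five times annually, pixel-by-pixel image modeling is forecast to become feasible within five years.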

Comments Accepted by ICML2026

详情

AI中文摘要

本文研究了自回归下一步像素预测的扩展特性，一种简单、端到端但尚未充分探索的统一视觉模型框架。从32x32分辨率的图像开始，我们训练了一系列Transformer模型，使用IsoFlops配置在计算预算高达7e19 FLOPs的情况下进行训练，并评估了三个不同的目标指标：下一步像素预测目标、ImageNet分类准确率和基于生成的完成度（通过Fr'echet距离测量）。首先，最优扩展策略高度依赖于任务。在固定的32x32分辨率下，图像分类和图像生成的最优扩展特性不同，其中生成最优设置要求数据量增长是分类最优设置的三到五倍。其次，随着图像分辨率的增加，最优扩展策略表明模型大小必须比数据量增长得更快。令人惊讶的是，通过投影我们的发现，我们发现主要瓶颈是计算能力，而不是训练数据量。随着计算能力每年增长四到五倍，我们预测在五年内可以实现像素级图像建模。

英文摘要

This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at a resolution of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: the next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. Even at a fixed resolution of 32x32, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than the data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
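
To make the closing forecast concrete, the arithmetic below compounds the stated annual compute growth (four to five times per year) over the five-year horizon. It is an illustration of the growth rate only and does not reproduce the paper's projection; the function name is hypothetical.

```python
def compute_growth(annual_factor: float, years: int) -> float:
    """Total multiplicative growth after `years` of compounding."""
    return annual_factor ** years


low = compute_growth(4.0, 5)   # 4x per year for five years
high = compute_growth(5.0, 5)  # 5x per year for five years
print(f"Five-year compute growth: {low:.0f}x to {high:.0f}x")
```

At these rates, available compute grows by roughly three orders of magnitude in five years.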


2510.26635 2026-05-19 eess.IV cs.CV 版本更新

SAMRI: Segment Any MRI

SAMRI：分割任何MRI

Zhao Wang, Wei Dai, Thuy Thanh Dao, Steffen Bollmann, Hongfu Sun, Craig Engstrom, Shekhar S. Chandra

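Affiliations: School of Electrical Engineering and Computer Science, The University of Queensland, Australia

AI summary: SAMRI is an MRI-specialized adaptation of the Segment Anything Model that uses box and point prompts for whole-body MRI segmentation, with the largest gains on small, clinically important structures.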

详情

AI中文摘要

摘要：SAMRI是针对MRI优化的Segment Anything Model，实现了优越的全身体部MRI分割，特别是在小而临床重要的结构上，通过框和点提示实现快速标注。目的：现有SAM的适应版本将MRI视为通用模态，忽略了变量组织对比、强度不均匀和临床重要的小结构。我们提出了一种MRI专用的基础模型，具有强大的全身体部分割和零样本泛化能力，可直接用于任何MRI标注任务。方法：SAMRI仅微调SAM的掩码解码器（ViT-B/16），保持编码器冻结以保留预训练表示并消除冗余传递，从而减少训练时间94%，可训练参数96%，FLOPs约99%。训练使用了来自30个数据集的110万张2D切片-掩码对，涵盖47个目标、T1/T2/FLAIR/DWI对比度和全身体部解剖结构，使用焦点Dice损失和边界框（可选点）提示。按掩码面积分层（小：<0.5%；中：0.5-3.5%；大：>3.5%），并通过Wilcoxon符号秩检验评估显著性。结果：SAMRI在框+点提示下在47个目标上实现了平均DSC 0.87±0.11，优于MedSAM（0.74±0.24）17.6%（p < 0.05），对小结构（+42.4%）和中等结构（+26.9%）的提升最大。在六个零样本数据集上，SAMRI实现了平均DSC 0.85，优于基线。推理仅需约4.5 GB VRAM通过标准硬件上的交互界面。结论：在大规模MRI特定语料库上微调解码器，实现了优越的全身体部分割，具有强大的零样本泛化能力，特别是在小而临床重要的结构上。公开代码、预训练模型和交互界面使SAMRI可用于MRI分割研究和临床工作流程。

英文摘要

Summary: SAMRI is an MRI-specialized adaptation of the Segment Anything Model achieving superior whole-body MRI segmentation, particularly for small and clinically critical structures, through box and point prompts for rapid annotation. Purpose: Existing SAM adaptations treat MRI as a generic modality, overlooking variable tissue contrast, intensity inhomogeneity, and clinically important small structures. We propose an MRI-specialized foundation model with strong whole-body segmentation and zero-shot generalization for direct use on any MRI annotation task. Methods: SAMRI fine-tunes only the mask decoder of SAM (ViT-B/16), keeping encoders frozen to preserve pretrained representations and eliminate redundant passes, reducing training time by 94%, trainable parameters by 96%, and FLOPs by ~99% versus full-model retraining. Training used 1.1 million 2D slice-mask pairs from 30 datasets spanning 47 targets, T1/T2/FLAIR/DWI contrasts, and whole-body anatomy, with focal-Dice loss and bounding-box (with optional point) prompts. Sizes were stratified by mask area (small: <0.5%; medium: 0.5-3.5%; large: >3.5%), and significance assessed by the Wilcoxon signed-rank test. Results: SAMRI with box+point prompts achieved mean DSC 0.87 ± 0.11 across 47 targets, outperforming MedSAM (0.74 ± 0.24) by 17.6% (p < 0.05), with largest gains for small (+42.4%) and medium (+26.9%) structures. On six zero-shot datasets, SAMRI achieved mean DSC 0.85, outperforming baselines. Inference requires only ~4.5 GB VRAM through an interactive interface on standard hardware. Conclusion: Decoder-only fine-tuning on a large, MRI-specific corpus delivers superior whole-body segmentation with strong zero-shot generalization, particularly for small and clinically salient structures. Public code, pretrained models, and an interactive interface make SAMRI deployable for MRI segmentation research and clinical workflows.
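
The evaluation relies on the Dice similarity coefficient (DSC) and on stratifying targets by mask area. The sketch below shows both on a toy binary mask. It assumes the area thresholds are fractions of the 2D slice area, which the abstract does not state; the helper names are hypothetical. It also checks that the reported 17.6% gain is consistent with a relative improvement of 0.87 over 0.74.

```python
from typing import List

Mask = List[List[int]]  # binary 2D mask, 1 = foreground


def dice(pred: Mask, target: Mask) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * inter / total


def size_bucket(mask: Mask) -> str:
    """Bucket a mask by foreground area as a percentage of the slice (assumed)."""
    percent = 100.0 * sum(map(sum, mask)) / (len(mask) * len(mask[0]))
    if percent < 0.5:
        return "small"
    if percent <= 3.5:
        return "medium"
    return "large"


# Toy 10x10 slice: the target is a 2x2 block, the prediction misses one pixel.
target = [[1 if r < 2 and c < 2 else 0 for c in range(10)] for r in range(10)]
pred = [row[:] for row in target]
pred[1][1] = 0

print(dice(pred, target), size_bucket(target))

# Relative improvement of mean DSC 0.87 over 0.74, in percent.
rel_gain = 100.0 * (0.87 - 0.74) / 0.74
print(round(rel_gain, 1))
```

The toy prediction overlaps 3 of the 4 target pixels, so its Dice score is 6/7.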


2510.18822 2026-05-19 cs.CV 版本更新

SAM 2++: Tracking Anything at Any Granularity

SAM 2++: 任意粒度下的任意目标跟踪

Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang

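Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Tencent; Shanghai Artificial Intelligence Laboratory

AI summary: This paper proposes SAM 2++, a framework that unifies tracking of target states at different granularities (masks, boxes, and points) through an integrated design of prompt encoding, output decoding, and memory representation. It also introduces the Tracking-Any-Granularity dataset for training and evaluating unified tracking models.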

Comments 14 pages

详情

AI中文摘要

由于不同任务中目标状态的粒度差异，现有跟踪器多针对单一任务进行设计，这种特异性限制了其泛化能力，无法有效利用多任务训练数据，导致模型设计和参数冗余。尽管最近的统一视觉模型在不同任务间共享部分架构，但通常保留任务特定的接口，并忽视不同粒度背后共同的跟踪原理，留下真正统一视频跟踪的空白。为统一视频跟踪任务，我们提出了SAM 2++，一个能够处理不同粒度目标状态的统一框架，包括掩码、框和点，通过集成设计的提示编码、输出解码和记忆表示。首先，为处理不同目标粒度，我们设计了任务特定的提示，将多样化的任务输入映射到通用的提示嵌入，同时引入统一解码器，以共同的输出形式生成任务结果，而无需重新设计整体流程。其次，为满足记忆匹配，跟踪的核心操作，我们引入了任务自适应的记忆机制，统一不同粒度的记忆同时保持其不同的状态语义，防止全参数共享导致粒度间的干扰。最后，我们引入Tracking-Any-Granularity，第一个大规模且多样化的视频跟踪数据集，具有丰富的三粒度注释。它通过定制的数据引擎，结合分阶段的手动标注和模型辅助完成，提供全面的资源用于训练、基准测试和分析统一跟踪模型。全面的实验表明，SAM 2++在不同粒度的多样化跟踪任务中设定了新的状态-of-the-art，建立了统一且稳健的跟踪框架。

英文摘要

Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task, and this specificity limits their generalization, preventing them from effectively utilizing multi-task training data and leading to redundancy in both model design and parameters. Although recent unified vision models share partial architectures across tasks, they usually retain task-specific interfaces and overlook the common tracking principle behind different granularities, leaving a gap for truly unified video tracking. To unify video tracking tasks, we present SAM 2++, a unified framework that can handle target states at different granularities, including masks, boxes, and points, through an integrated design of prompt encoding, output decoding, and memory representation. First, to handle different target granularities, we design task-specific prompts that map diverse task inputs into general prompt embeddings, together with a Unified Decoder that produces task results in a common output form without redesigning the overall pipeline. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities while preserving their distinct state semantics, preventing full parameter sharing from causing interference across granularities. Finally, we introduce Tracking-Any-Granularity, the first large and diverse video tracking dataset with rich annotations at three granularities. It is constructed through a customized data engine with phased manual annotation and model-assisted completion, providing a comprehensive resource for training, benchmarking, and analyzing unified tracking models. Comprehensive experiments confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.


2510.17363 2026-05-19 cs.CV cs.LG cs.RO 版本更新

M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

M2H：基于高效窗口交叉任务注意力的多任务学习用于单目空间感知

U. V. B. L Udugama, George Vosselman, Francesco Nex

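Affiliations: Department of Earth Observation Science

AI summary: This paper proposes M2H, a multi-task framework that uses an efficient window-based cross-task attention module to perform semantic segmentation, depth estimation, edge detection, and surface normal estimation from a single monocular image, while remaining more computationally efficient than existing methods.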

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

详情

DOI: 10.1109/IROS60139.2025.11246974

AI中文摘要

在边缘设备上部署实时空间感知需要高效的多任务模型，这些模型能够在利用互补任务信息的同时最小化计算开销。本文介绍了Multi-Mono-Hydra（M2H），一种新的多任务学习框架，用于从单张单目图像中进行语义分割、深度、边缘和表面法线估计。与传统方法依赖独立单任务模型或共享编码器-解码器架构不同，M2H引入了基于窗口的跨任务注意力模块，实现了结构化的特征交换同时保留任务特定的细节，提高了任务间预测的一致性。M2H基于轻量级的ViT-based DINOv2主干网络，优化了实时部署，并作为支持动态环境中3D场景图构建的单目空间感知系统的基础。全面评估显示，M2H在NYUDv2上优于最先进的多任务模型，在Hypersim上超越了单任务深度和语义基线，在Cityscapes数据集上实现了更优的性能，同时在笔记本硬件上保持计算效率。除了基准测试外，M2H还在真实世界数据上得到了验证，证明了其在空间感知任务中的实用性。

英文摘要

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.


2509.25969 2026-05-19 cs.CV 版本更新

A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments

一种用于挑战性环境中鲑鱼福利监测的多用途跟踪框架

Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester

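Affiliations: Norwegian University of Science and Technology; SINTEF Ocean

AI summary: This paper proposes a multi-purpose tracking framework for automated salmon welfare monitoring in challenging environments. A pose estimation network extracts bounding boxes for each salmon and its body parts, and this body-part information is used to address challenges specific to underwater salmon scenes. Two new datasets are constructed to evaluate salmon tracking challenges.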

Comments Accepted to the Joint Workshop on Marine Vision 2025 (CVAUI & AAMVEM), held in conjunction with ICCV 2025

详情

DOI: 10.1109/ICCVW69036.2025.00225

AI中文摘要

基于计算机视觉（CV）的连续、自动化和精确的鲑鱼福利监测是减少工业网箱养鱼中鲑鱼死亡率和改善鲑鱼福利的关键步骤。现有的CV方法用于确定福利指标主要集中在单一指标上，并依赖于其他应用领域的对象检测器和跟踪器来帮助其福利指标计算算法。这在实际应用中带来了高资源需求，因为每个指标必须单独计算。此外，这些方法在水下鲑鱼场景中容易受到物体遮挡、相似物体外观和相似物体运动等困难的影响。为了解决这些挑战，我们提出了一种灵活的跟踪框架，该框架使用姿态估计网络提取鲑鱼及其对应身体部位的边界框，并利用身体部位的信息，通过专门的模块，来解决水下鲑鱼场景中的特定挑战。随后，高细节的身体部位跟踪被用于计算福利指标。我们构建了两个新的数据集，评估两个鲑鱼跟踪挑战：拥挤场景中的鲑鱼ID转移和转弯期间的鲑鱼ID切换。我们的方法在两个鲑鱼跟踪挑战中均优于当前最先进的行人跟踪器BoostTrack。此外，我们创建了一个用于计算鲑鱼尾鳍拍打波长的数据集，证明了我们的身体部位跟踪方法适合基于尾鳍分析的自动化福利监测。数据集和代码可在https://github.com/espenbh/BoostCompTrack上获得。

英文摘要

Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at https://github.com/espenbh/BoostCompTrack.


2509.19102 2026-05-19 cs.RO cs.AI cs.CV 版本更新

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon: 通过功能对象规范化学习姿态感知的动作原语以实现通用的机器人操作

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

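Affiliations: TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg; Technical University of Munich; Agile Robots SE

AI summary: This paper proposes FUNCanon, a framework that learns pose-aware action primitives through functional object canonicalization for generalizable robotic manipulation. Long-horizon manipulation tasks are decomposed into action chunks defined by an actor, a verb, and an object, which improves the compositionality and reusability of policies.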

Comments project website: https://sites.google.com/view/funcanon, 11 pages

详情

AI中文摘要

通用机器人技能从端到端演示中通常会导致任务特定的策略，这些策略难以超越训练分布进行泛化。因此，我们引入FUNCanon框架，将长周期操作任务转换为一系列动作片段，每个片段由主体、动词和对象定义。这些片段将策略学习聚焦于动作本身，而不是孤立的任务，从而实现组合性和重用性。为了使策略具有姿态感知和类别通用性，我们对功能对象进行规范化，通过功能对齐和自动操作轨迹转移，利用大型视觉语言模型的 affordance 信息将对象映射到共享的功能框架中。一个以对象为中心和动作为中心的扩散策略FuncDiffuser在对齐的数据上进行训练，自然尊重对象的 affordances 和姿态，简化了学习并提高了泛化能力。在模拟和现实基准上的实验表明，该方法在类别层面实现了泛化，跨任务行为重用和鲁棒的sim2real部署，显示功能规范化为复杂操作领域可扩展模仿学习提供了强大的归纳偏置。演示细节和补充材料可在我们的项目网站上获得：https://sites.google.com/view/funcanon。

英文摘要

Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision-language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.


2509.16391 2026-05-19 cs.LG cs.AI cs.CV 版本更新

CoUn: Empowering Machine Unlearning via Contrastive Learning

CoUn: 通过对比学习赋能机器无学习

Yasser H. Khalil, Mehdi Setayesh, Hongliang Li

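Affiliations: Huawei Noah’s Ark Lab

AI summary: This paper proposes CoUn, a machine unlearning framework that adjusts learned representations by applying contrastive learning and supervised learning to retain data only. Experiments across multiple datasets and model architectures show that it outperforms existing methods.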

详情

AI中文摘要

机器无学习（MU）旨在从已训练模型中移除特定'遗忘'数据的影响，同时保持对剩余'保留'数据的知识。现有的基于标签操纵或模型权重扰动的MU方法往往效果有限。为此，我们引入了CoUn，一种受观察启发的新MU框架：当模型仅使用保留数据重新训练时，它会根据保留数据的语义相似性对遗忘数据进行分类。CoUn通过对比学习（CL）和监督学习调整学习的数据表示，仅应用于保留数据。具体而言，CoUn（1）利用数据样本之间的语义相似性，通过CL间接调整遗忘表示，（2）通过监督学习保持保留表示在其各自聚类内。在各种数据集和模型架构上的广泛实验表明，CoUn在无学习有效性上 consistently 超过最先进的MU基线。此外，将我们的CL模块集成到现有基线中可以增强其无学习有效性。

英文摘要

Machine unlearning (MU) aims to remove the influence of specific "forget" data from a trained model while preserving its knowledge of the remaining "retain" data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines improves their unlearning effectiveness.


2509.02351 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

序数自适应校正：一种数据导向的带有噪声标签的序数图像分类方法

Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi

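Affiliations: School of Computer Engineering, Iran University of Science and Technology

AI summary: This paper proposes ORDAC, a data-centric method for ordinal image classification that uses label distribution learning to model the inherent ambiguity and uncertainty of ordinal labels. It dynamically adjusts the mean and standard deviation of each sample's label distribution, correcting noisy labels and improving model performance.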

Comments 10 pages, 5 figures, 5 tables

详情

AI中文摘要

标记数据是训练计算机视觉任务中监督深度学习模型的基本组成部分。然而，尤其是在序数图像分类中，类边界往往具有模糊性，因此标注过程容易产生错误和噪声。此类标签噪声会显著降低机器学习模型的性能和可靠性。本文针对序数图像分类任务中检测和校正标签噪声的问题，提出了一种新的数据导向方法，称为ORDinal Adaptive Correction（ORDAC）。该方法利用标签分布学习（LDL）的能力来建模序数标签的内在模糊性和不确定性。在训练过程中，ORDAC动态调整每个样本的标签分布的均值和标准差。与其丢弃可能含有噪声的样本不同，该方法旨在校正这些样本并充分利用整个训练数据集。所提出方法在年龄估计（Adience）和疾病严重程度检测（糖尿病视网膜病变）基准数据集上，针对各种不对称高斯噪声场景进行了评估。结果表明，ORDAC及其扩展版本（ORDAC_C和ORDAC_R）在模型性能上取得了显著提升。例如，在Adience数据集上40%的噪声情况下，ORDAC_R将均方误差从0.86降低到0.62，并将召回指标从0.37提高到0.49。该方法还展示了其在原始数据集中固有噪声的校正效果。这项研究表明，使用标签分布进行自适应标签校正是增强在存在噪声数据时序数分类模型鲁棒性和准确性的一种有效策略。

英文摘要

Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.
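
The abstract describes each sample's label as a distribution with a mean and a standard deviation. As a minimal sketch of that representation, the code below builds a discretized Gaussian over ordinal classes. It does not implement ORDAC's update rule, which the abstract does not specify; the function name and the choice of eight classes are assumptions for illustration.

```python
import math
from typing import List


def ordinal_label_distribution(mean: float, std: float, num_classes: int) -> List[float]:
    """Discretized Gaussian over class indices 0..num_classes-1, normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((k - mean) / std) ** 2) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


# A label centered on class 3 with moderate uncertainty (hypothetical values).
dist = ordinal_label_distribution(mean=3.0, std=1.0, num_classes=8)

# Shifting the mean, as a correction step might, moves the mode to class 4.
shifted = ordinal_label_distribution(mean=4.0, std=1.0, num_classes=8)

print([round(p, 3) for p in dist])
print(shifted.index(max(shifted)))
```

Because neighboring classes receive non-zero mass, a label that is off by one class is penalized less than one that is off by several, which suits ordinal targets.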


2508.13977 2026-05-19 cs.CV 版本更新

Lightweight Physics-Aware Zero-Shot Ultrasound Plane-Wave Denoising

Hojat Asgariandehkordi, Mostafa Sharifzadeh, Morteza Rezanejad, Hassan Rivaz

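Affiliations: Department of Electrical and Computer Engineering, Concordia University

AI summary: This paper proposes a lightweight, physics-aware, zero-shot denoising framework for low-angle CPWC ultrasound imaging that needs no external training dataset or clean reference images. The available steering angles are split into two disjoint subsets, each used to reconstruct a compounded image with different angle-dependent artifacts and noise characteristics. A lightweight convolutional neural network is then trained on these pairs in a self-supervised residual learning framework, so the method adapts to different anatomical regions and acquisition settings without domain-specific fine-tuning or paired datasets.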

详情

AI中文摘要

超声相干平面波成像（CPWC）通过结合多个定向传输的回声来增强图像对比度。尽管增加定向角度的数量通常能提高图像质量，但会显著降低帧率，并可能在快速移动目标中引入模糊伪影。此外，复合图像仍易受噪声影响，尤其是在使用有限数量传输获取时。在本工作中，我们提出了一种轻量级物理感知零样本去噪框架，用于低角度CPWC超声成像，以在不需外部训练数据集或干净参考图像的情况下提高图像质量。所提出的方法将可用的定向角度分为两个不相交子集，每个子集用于重建具有不同角度依赖性伪影和噪声特征的复合图像。这些重建的图像随后作为伪对，在自监督残差学习框架中用于训练一个轻量级卷积神经网络，直接在测试样本上进行训练。由于底层组织结构在子集之间保持一致，而非相干伪影随定向角度选择变化，所提出的物理感知配对策略使网络能够区分解剖信息与不一致的噪声和伪影。与监督方法不同，所提出的方法不需要领域特定的微调或配对数据集，使其能够适应不同的解剖区域和采集设置。此外，所提出的框架采用仅包含两个卷积层的高效架构，使训练快速且计算成本低廉。

英文摘要

Ultrasound Coherent Plane-Wave Compounding (CPWC) enhances image contrast by combining echoes from multiple steered transmissions. While increasing the number of steering angles generally improves image quality, it significantly reduces frame rate and may introduce blurring artifacts in fast-moving targets. In addition, compounded images remain susceptible to noise, particularly when acquired using a limited number of transmissions. In this work, we propose a lightweight physics-aware zero-shot denoising framework for low-angle CPWC ultrasound imaging that improves image quality without requiring external training datasets or clean reference images. The proposed approach partitions the available steering angles into two disjoint subsets, each used to reconstruct compounded images with different angle-dependent artifacts and noise characteristics. These reconstructed images are then used as pseudo-pairs within a self-supervised residual learning framework to train a lightweight convolutional neural network directly on the test sample. Because the underlying tissue structures remain consistent across the subsets while the incoherent artifacts vary with steering angle selection, the proposed physics-aware pairing strategy enables the network to distinguish anatomical information from inconsistent noise and artifacts. Unlike supervised approaches, the proposed method does not require domain-specific fine-tuning or paired datasets, making it adaptable across different anatomical regions and acquisition settings. Furthermore, the proposed framework employs an efficient architecture composed of only two convolutional layers, enabling fast and computationally inexpensive training.
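
The pairing strategy can be sketched as follows: split the steering-angle indices into two disjoint subsets and compound each subset into its own image. The interleaved split and the plain averaging of real-valued frames are simplifying assumptions, since the abstract does not say how angles are assigned and real compounding operates on beamformed data. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # one beamformed frame per steering angle (toy values)


def split_angles(n_angles: int) -> Tuple[List[int], List[int]]:
    """Interleaved split of angle indices into two disjoint subsets (assumed scheme)."""
    idx = list(range(n_angles))
    return idx[0::2], idx[1::2]


def compound(frames: Sequence[Image], indices: Sequence[int]) -> Image:
    """Average the selected frames pixel by pixel."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [sum(frames[i][r][c] for i in indices) / len(indices) for c in range(cols)]
        for r in range(rows)
    ]


# Four toy 1x2 frames, one per steering angle.
frames = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
subset_a, subset_b = split_angles(len(frames))
image_a = compound(frames, subset_a)
image_b = compound(frames, subset_b)
print(subset_a, subset_b, image_a, image_b)
```

The two compounded images then serve as the pseudo-pair: tissue structure is shared between them, while angle-dependent noise and artifacts differ.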


2506.20522 2026-05-19 cs.CV 版本更新

AI-assisted radiographic analysis in detecting alveolar bone-loss severity and patterns

辅助人工智能的放射学分析用于检测牙槽骨丧失的严重程度和模式

Chathura Wimalasiri, Piumal Rathnayake, Shamod Wijerathne, Sumudu Rasnayaka, Dhanushka Leuke Bandara, Roshan Ragel, Vajira Thambawita, Isuru Nawinne

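Affiliations: Faculty of Engineering, University of Peradeniya; Faculty of Dental Sciences, University of Peradeniya; Simula Metropolitan Center for Digital Engineering

AI summary: This study proposes an AI-based deep learning framework that automatically detects and quantifies alveolar bone loss and its patterns from intraoral periapical (IOPA) radiographs. YOLOv8 tooth detection is combined with Keypoint R-CNN models that identify anatomical landmarks to calculate bone loss severity, and geometric analysis distinguishes horizontal from angular bone loss. On 1000 expert-annotated radiographs, it reaches an intra-class correlation coefficient of up to 0.80 for severity and 87% accuracy for pattern classification.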

Comments This manuscript is 17 pages with 5 tables and 12 figures. The manuscript is under review at Nature Scientific Reports

详情

DOI: 10.1038/s41598-026-38061-1

AI中文摘要

牙周炎是一种慢性炎症性疾病，导致牙槽骨丧失，显著影响口腔健康和生活质量。准确评估骨丧失的严重程度和模式对于诊断和治疗计划至关重要。在本研究中，我们提出了一种新型的基于人工智能的深度学习框架，利用牙内窥镜根尖放射图像自动检测和量化牙槽骨丧失及其模式。我们的方法结合YOLOv8进行牙齿检测，与Keypoint R-CNN模型识别解剖标志物，从而实现对骨丧失严重程度的精确计算。此外，YOLOv8x-seg模型用于分割骨水平和牙齿掩码，通过几何分析确定骨丧失模式（水平 vs. 角状）。在1000张大规模、专家标注的放射图像上进行评估，我们的方法在检测骨丧失严重程度（类内相关系数高达0.80）和骨丧失模式分类（准确率87%）方面取得了高准确率。这种自动化系统提供了一种快速、客观且可重复的牙周评估工具，减少了对主观手动评估的依赖。通过将人工智能整合到牙科放射学分析中，我们的框架有潜力提高牙周炎的早期诊断和个性化治疗计划，最终改善患者护理和临床结果。

英文摘要

Periodontitis, a chronic inflammatory disease causing alveolar bone loss, significantly affects oral health and quality of life. Accurate assessment of bone loss severity and pattern is critical for diagnosis and treatment planning. In this study, we propose a novel AI-based deep learning framework to automatically detect and quantify alveolar bone loss and its patterns using intraoral periapical (IOPA) radiographs. Our method combines YOLOv8 for tooth detection with Keypoint R-CNN models to identify anatomical landmarks, enabling precise calculation of bone loss severity. Additionally, YOLOv8x-seg models segment bone levels and tooth masks to determine bone loss patterns (horizontal vs. angular) via geometric analysis. Evaluated on a large, expertly annotated dataset of 1000 radiographs, our approach achieved high accuracy in detecting bone loss severity (intra-class correlation coefficient up to 0.80) and bone loss pattern classification (accuracy 87%). This automated system offers a rapid, objective, and reproducible tool for periodontal assessment, reducing reliance on subjective manual evaluation. By integrating AI into dental radiographic analysis, our framework has the potential to improve early diagnosis and personalized treatment planning for periodontitis, ultimately enhancing patient care and clinical outcomes.


2505.19155 2026-05-19 cs.CV cs.CL 版本更新

Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

稀疏到密集：一种无损加速视频理解的LLM免费午餐

Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu

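Affiliations: Singapore Management University; Sea AI Lab; National University of Singapore

AI summary: This paper proposes Sparse-to-Dense (StD), a decoding strategy that combines a sparse top-K attention module with a dense full-attention module to accelerate video large language models (Video-LLMs) without loss, reaching up to a 1.94× wall-time speedup on long video sequences.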

Comments Accepted by ACL 2025

详情

AI中文摘要

由于当前视频大语言模型（Video-LLMs）的自回归性质，输入序列长度的增长会导致推理延迟增加，这给处理通常非常长的视频序列带来了挑战。我们发现，在解码过程中，Video-LLMs中大多数标记的注意力分数趋于稀疏和集中，只有某些标记需要全面的全注意力。基于这一见解，我们引入了Sparse-to-Dense（StD），一种新颖的解码策略，集成了两个不同的模块：一个利用稀疏top-K注意力，另一个采用密集全注意力。这些模块协同工作，以在不损失的情况下加速Video-LLMs。快速（稀疏）模型推测解码多个标记，而缓慢（密集）模型并行验证它们。StD是一种无调优、即插即用的解决方案，可在视频处理中实现高达1.94倍的壁时加速。它在保持模型性能的同时，使从标准Video-LLM无缝过渡到稀疏Video-LLM变得可能，只需最小的代码修改。

英文摘要

Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94× wall-time speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
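
The draft-then-verify loop described here can be illustrated with generic greedy speculative decoding. The sketch below is not StD's implementation: the two "models" are toy functions standing in for the sparse and dense attention paths, and verification is written as a loop although in practice it is one batched pass. All names are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next token for a given prefix


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """One round: the draft proposes k tokens, the target verifies them."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every proposal matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, max_new: int) -> List[int]:
    """Repeat speculative rounds until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


def dense(prefix: List[int]) -> int:
    """Toy 'slow' model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def sparse(prefix: List[int]) -> int:
    """Toy 'fast' model: agrees with dense except right after token 4."""
    return 0 if prefix[-1] == 4 else dense(prefix)


print(generate([0], sparse, dense, k=3, max_new=8))
```

Every emitted token is either confirmed or supplied by the dense model, so the output matches plain greedy decoding with the dense model alone. That is the sense in which this style of acceleration is lossless.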


2505.07813 2026-05-19 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

DexWild：面向真实场景的机器人策略的灵巧交互

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

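Affiliations: Carnegie Mellon University

AI summary: This paper proposes DexWild, a framework that co-trains on human and robot demonstration data to improve robot generalization across diverse environments. It reaches a 68.5% success rate in unseen environments, nearly four times that of policies trained on robot data only.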

Comments In RSS 2025. Website at https://dexwild.github.io

详情

AI中文摘要

大规模、多样化的机器人数据集已成为使灵巧操作策略泛化到新环境的有希望途径，但获取此类数据集存在诸多挑战。虽然远程操作能提供高保真的数据集，但其高成本限制了可扩展性。相反，如果人们可以像在日常生活中一样使用自己的手来收集数据呢？在DexWild中，一个多样化的数据收集团队使用他们的手在多种环境和物体上收集数小时的交互数据。为了记录这些数据，我们创建了DexWild-System，一种低成本、移动且易于使用的设备。DexWild学习框架在人类和机器人示范数据上共同训练，相较于单独训练每个数据集，其性能得到提升。这种组合产生了能够泛化到新环境、任务和形态的稳健机器人策略，只需少量额外的机器人特定数据。实验结果表明，DexWild显著提高了性能，在未见环境中实现了68.5%的成功率，几乎是仅使用机器人数据训练的策略的四倍，并提供了5.8倍更好的跨形态泛化能力。视频结果、代码库和说明可在https://dexwild.github.io上找到。

英文摘要

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments, nearly four times higher than policies trained with robot data only, and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io


2505.06907 2026-05-19 cs.AI cs.CV cs.NE 版本更新

A Survey on Foundation Models for Personalized Federated Intelligence

面向个性化联邦智能的基础模型综述

Yu Qiao, Huy Q. Le, Avi Deb Raha, Phuong-Nam Tran, Apurba Adhikary, Mengchun Zhang, Loc X. Nguyen, Eui-Nam Huh, Dusit Niyato, Choong Seon Hong

Affiliations: School of Computing, Kyung Hee University; Noakhali Science and Technology University; Korea Advanced Institute of Science and Technology; College of Computing and Data Science, Nanyang Technological University

AI summary: This survey reviews foundation models for personalized federated intelligence (PFI), a paradigm that combines the privacy benefits of federated learning with the generalization capabilities of foundation models, and positions it as a basis for artificial personalized intelligence.

Comments Accepted ACM Computing Survey

详情

AI中文摘要

大语言模型（如ChatGPT、Gemini和Grok）的兴起重塑了人工智能领域。作为基础模型（FMs）的典型实例，它们在生成类人内容方面表现出色，推动人工智能向通用人工智能（AGI）迈进。然而，它们的规模庞大、隐私敏感和计算需求高，给个性化定制带来了挑战。为此，我们提出了人工智能个性化（API）的愿景，专注于将FMs适应到个体用户，同时确保隐私。作为API的核心赋能者，我们提出个性化联邦智能（PFI），这是一种新的范式，不仅整合了联邦学习（FL）的隐私优势和FMs的泛化能力，还将个性化置于核心。为此，我们首先回顾了最近的FL和FMs进展，为PFI奠定基础。然后，我们探讨了PFI流水线的核心阶段：边缘的高效个性化、可信的适应和通过检索增强生成的自适应细化。最后，我们强调了实现PFI的未来方向。总体而言，本文的综述旨在为API的发展奠定基础，作为AGI的补充方向，PFI是关键的赋能范式。

英文摘要

The rise of large language models (LLMs), such as ChatGPT, Gemini, and Grok, has reshaped the AI landscape. As prominent instances of foundational models (FMs), they exhibit remarkable capabilities in generating human-like content, pushing the boundaries towards artificial general intelligence (AGI). However, their large-scale nature, privacy sensitivity, and substantial computational demands pose significant challenges for personalized customization for end users. To bridge this gap, we present the vision of artificial personalized intelligence (API), which focuses on adapting FMs to individual users while ensuring privacy. As a central enabler of API, we propose personalized federated intelligence (PFI), a new paradigm that not only integrates the privacy benefits of federated learning (FL) with the generalization capabilities of FMs but also places personalization at its core. To this end, we first survey recent advances in FL and FMs that lay the foundation for PFI. We then explore core stages of the PFI pipeline: efficient personalization at the edge, trustworthy adaptation, and adaptive refinement via retrieval-augmented generation. Finally, we highlight future directions for enabling PFI. Overall, this survey aims to lay a foundation for the development of API as a complementary direction to AGI, with PFI as a key enabling paradigm.


2501.13795 2026-05-19 cs.CV (version update)

Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

Chaolei Han, Hongsong Wang, Jidong Kuang, Lei Zhang, Jie Gui

Affiliations: School of Cyber Science and Engineering, Southeast University; School of Computer Science and Engineering, Southeast University; School of Electrical Engineering and Automation, Nanjing Normal University

AI summary: This paper proposes FreeZAD, a training-free zero-shot temporal action detection method that uses existing vision-language models to classify and localize unseen activities in untrimmed videos without additional fine-tuning or adaptation, and improves performance with LogOIC, frequency-based actionness calibration, and a test-time adaptation strategy.

Journal ref IEEE Transactions on Multimedia, 2026


Abstract

Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.
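As a rough illustration of the outer-inner contrast idea that LogOIC builds on, the sketch below scores a candidate segment by comparing the mean actionness inside it with the mean actionness in a small margin around it. This is not the paper's formulation: the logarithmic decay weighting is left out, and the function name `oic_score`, the `inflate` ratio, and the toy per-frame scores are all assumptions made for illustration.

```python
def oic_score(actionness, start, end, inflate=0.25):
    """Mean actionness inside [start, end) minus mean actionness in the margin around it."""
    length = end - start
    if length <= 0:
        raise ValueError("segment must contain at least one frame")
    # Margin width on each side, as a fraction of the segment length (assumed value).
    margin = max(1, int(round(length * inflate)))
    outer_lo = max(0, start - margin)
    outer_hi = min(len(actionness), end + margin)

    inner = actionness[start:end]
    outer = actionness[outer_lo:start] + actionness[end:outer_hi]

    inner_mean = sum(inner) / len(inner)
    outer_mean = sum(outer) / len(outer) if outer else 0.0
    return inner_mean - outer_mean


# Toy per-frame actionness: the action occupies frames 2..4.
scores = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1, 0.1]
tight = oic_score(scores, 2, 5)   # boundaries match the action
loose = oic_score(scores, 1, 6)   # boundaries include background frames
```

A segment whose boundaries match the action gets a higher score than a looser one, which is why a contrast of this kind can rank proposals without a trained localization head.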


2412.18158 2026-05-19 cs.CV eess.IV (version update)

Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion

Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

Affiliations: Shanghai Jiao Tong University; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative; Ningbo Institute of Digital Twin; Eastern Institute of Technology; University of Science and Technology of China; Yokohama National University

AI summary: This paper proposes UniCodec, a universal image coding framework built on semantic disentanglement and compositional generation, which uses efficient LLM reasoning and a generative diffusion model to serve both human and machine needs with a single codec and without retraining.


Abstract

Learned image compression methods have shown impressive performance but are often highly specialized for either human perception or specific machine vision tasks. This specialization limits their versatility and requires costly retraining for new applications. To address this, we introduce UniCodec, a universal codec built on a novel paradigm of semantic disentanglement at the encoder and compositional generation at the decoder. This framework is designed to simultaneously serve both human and machine needs, eliminating the need for task-specific retraining. At the encoder, UniCodec leverages pre-generated, task-specific label codebooks created by a Large Language Model (LLM). For any given task, a grounding model uses the corresponding codebook to perform task-aware disentanglement, compressing only the most relevant image regions. This mechanism not only saves significant bits but is also the key to our system's rapid, zero-retraining adaptation: switching to a new task is as simple as selecting a new codebook. The decoder then performs compositional generation: it combines the compact, disentangled components with powerful priors from a generative diffusion model. This process reconstructs a high-quality, complete image optimized with rich detail for human perception and precise features for machine vision tasks. Extensive experiments demonstrate that UniCodec consistently outperforms existing methods, effectively bridging the gap between human-centric and machine-centric compression.


2409.15980 2026-05-19 cs.CV cs.AI (version update)

Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection

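Yunbo Long, Zhengyang Ling, Sam Brook, Duncan McFarlane, Alexandra Brintrup

Affiliations: Department of Engineering, University of Cambridge

AI summary: This study presents a low-cost visual anomaly detection system that combines pre-trained models with inexpensive hardware and reaches high detection accuracy from a small amount of data, making it suitable for small and medium-sized enterprises.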


Abstract

Traditional machine learning-based visual inspection systems require extensive data collection and repetitive model training to improve accuracy. These systems typically require expensive cameras, computing equipment, and significant machine learning expertise, which can substantially burden small and medium-sized enterprises. This study explores leveraging unsupervised learning methods with pre-trained models and low-cost hardware to create a cost-effective visual anomaly detection system. The research aims to develop a low-cost visual anomaly detection solution that uses minimal data for model training while maintaining generalizability and scalability. The system utilises unsupervised learning models from Anomalib and is deployed on affordable Raspberry Pi hardware through OpenVINO. The results show that this cost-effective system can complete anomaly detection training and inference on a Raspberry Pi in just 90 seconds using only 10 normal product images, achieving an F1 macro score exceeding 0.95. While the system is slightly sensitive to environmental changes like lighting, product positioning, or background, it remains a swift and economical method for factory automation inspection for small and medium-sized manufacturers. The code is available at https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning.
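For readers unfamiliar with the reported metric, the following minimal sketch shows how a macro-averaged F1 score is computed: F1 is evaluated per class and then averaged with equal weight, so the rare anomaly class counts as much as the normal class. The labels below are toy values invented for illustration and are not the paper's data.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example: 8 normal parts, 2 anomalous parts, one anomaly missed.
y_true = ["normal"] * 8 + ["anomaly"] * 2
y_pred = ["normal"] * 8 + ["anomaly", "normal"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = f1_macro(y_true, y_pred)
```

In this toy case accuracy is 0.9 while macro F1 is noticeably lower, which is why macro F1 is the more demanding figure to report on imbalanced inspection data.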


2409.12190 2026-05-19 cs.RO cs.CV (version update)

Bundle Adjustment in the Eager Mode

Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang

Affiliations: Spatial AI & Robotics (SAIR) Lab, University at Buffalo; Georgia Institute of Technology; Purdue University; Carnegie Mellon University

AI summary: This paper presents an efficient eager-mode bundle adjustment library that integrates seamlessly with PyTorch; a sparsity-aware auto-differentiation design and GPU-accelerated sparse operations improve the runtime efficiency of bundle adjustment in robotic applications.


Abstract

Bundle adjustment (BA) is a critical technique in various robotic applications such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA libraries, such as GTSAM, g²o, and Ceres Solver, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA library seamlessly integrated with PyTorch with high efficiency. Our approach includes a sparsity-aware auto-differentiation design and GPU-accelerated sparse operations designed for 2nd-order optimization. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5×, 22×, and 23× across all benchmarks compared to GTSAM, g²o, and Ceres, respectively.
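To make "aligning poses and landmarks with observations" concrete, the sketch below computes the reprojection residual that bundle adjustment minimizes in sum-of-squares form over all camera-landmark pairs. It assumes a simple pinhole camera with no distortion and uses plain Python rather than the library described above; the helper names and the numeric values are hypothetical.

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_residual(rotation, translation, landmark, observation, intrinsics):
    """Predicted pixel minus observed pixel for one camera-landmark pair."""
    # Transform the landmark into the camera frame: p_cam = R * p_world + t.
    p_cam = tuple(
        sum(rotation[i][j] * landmark[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    u, v = project(p_cam, *intrinsics)
    return (u - observation[0], v - observation[1])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (assumed values)

residual = reprojection_residual(
    identity, (0.0, 0.0, 0.0), (0.2, -0.1, 2.0), (371.0, 215.0), intrinsics
)
```

Each residual depends on only one camera and one landmark, so the Jacobian of the full problem is sparse; that structure is what sparsity-aware differentiation and sparse solvers exploit.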


2308.06197 2026-05-19 cs.CV cs.AI cs.LG (version update)

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features

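Angus Maiden, Bahareh Nakisa

Affiliations: School of Information Technology, Deakin University

AI summary: This paper proposes a continual learning method that uses knowledge distillation and a novel Predictive Sorting Memory Replay to reach the state of the art in complex facial expression recognition, accurately recognising new compound expression classes from few samples.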

Comments 13 pages, 9 figures, 6 tables, 3 algorithms. Code available at https://github.com/AngusMaiden/complex-FER

DOI: 10.1109/DICTA68720.2025.11302420

Abstract

Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.
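The knowledge distillation mentioned above can be pictured with the generic temperature-scaled formulation sketched below, in which a student is penalized for diverging from a teacher's softened class probabilities. This is the textbook soft-target loss, not necessarily the exact loss used in the paper; the function names, the temperature, and the logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [1.0, 2.0, 3.0]
matched = distillation_loss([1.0, 2.0, 3.0], teacher)
mismatched = distillation_loss([3.0, 2.0, 1.0], teacher)
```

A higher temperature flattens the teacher distribution, exposing how it ranks the non-target classes; in a continual learning setting that signal is one way to retain what was learned about the basic expressions while new compound classes are added.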


2303.11675 2026-05-19 cs.CV (version update)

ReBaR: Reference-Based Reasoning for Robust Pose Estimation from Monocular Images

Yongkang Cheng, Mingjiang Liang, Jifeng Ning, Gaoge Han, Wei Liu, Shaoli Huang

Affiliations: College of Information Engineering, Northwest A&F University; University of Technology Sydney; Tencent AI Lab; City University of Hong Kong

AI summary: This paper proposes ReBaR, which learns reference features to address occlusion and depth ambiguity, enabling robust human pose and shape estimation from monocular images.

Comments Accepted by Pattern Recognition

DOI: 10.1016/j.patcog.2025.112096

Abstract

ReBaR (Reference-Based Reasoning for Robust Human Pose and Shape Estimation) is designed to estimate human body shape and pose from single-view images. ReBaR effectively addresses the challenges of occlusions and depth ambiguity by learning reference features for part regression reasoning. Our approach starts by extracting features from both body and part regions using an attention-guided mechanism. Subsequently, these features are used to encode additional part-body dependencies for individual part regression, with part features serving as queries and the body feature as a reference. This reference-based reasoning allows our network to infer the spatial relationships of occluded parts with the body, utilizing visible parts and body reference information. ReBaR outperforms contemporary methods on three benchmark datasets and remains competitive with more recent approaches, demonstrating significant improvement in handling depth ambiguity and occlusion. These results strongly support the effectiveness of our reference-based framework for estimating human body shape and pose from single-view images.


2605.17368 2026-05-19 cs.CV (version update)

RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection

Shuchang Ye, Mingyuan Meng, Hao Wang, Usman Naseem, Jinman Kim

Affiliations: The University of Sydney; Zhongguancun Academy; Shanghai Jiao Tong University; Macquarie University

AI summary: This paper introduces RadGenome-Anatomy, a large anatomy-labeled chest radiograph dataset with over 10 million segmentation masks, built by physically grounded projection of volumetric anatomy, to support medical image segmentation and diagnostic tasks.


Abstract

Anatomical structure labels for chest radiographs are essential for medical image segmentation and a broad range of downstream diagnostic tasks. However, annotating anatomy directly on 2D chest radiographs is labor-intensive and intrinsically ambiguous, as 3D anatomical structures are projected onto a single 2D plane where boundaries may overlap, be occluded, or appear only partially visible. Consequently, existing anatomy-labeled chest radiograph datasets remain limited in scale, anatomy coverage, and label reliability. To address these limitations, we introduce RadGenome-Anatomy, the largest anatomy-labeled chest radiograph dataset, containing over 10 million segmentation masks across 210 anatomical structures in 25,692 studies. It is constructed by projecting large-scale 3D anatomical masks from CT volumes into 2D radiographic space through canonical radiographic geometry. This shifts annotation from directly tracing uncertain 2D boundaries to defining anatomy in volumetric space, where structures that overlap or become partially invisible in radiographs remain spatially separable. As a result, each 2D mask represents the physically grounded projected footprint of a volumetrically defined structure. The scale and broad anatomical coverage of RadGenome-Anatomy, including structures that are overlapping, partially visible, or difficult to delineate directly, enable research on geometric measurements as explicit evidence for chest radiograph interpretation. We demonstrate this by training XAnatomy to predict structure-specific masks and derive clinically relevant measurements, achieving diagnostic accuracies of 96.4%, 95.6%, and 89.2% for cardiomegaly, kyphosis, and scoliosis, respectively.
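The point that structures stay separable in the volume even when their radiographic footprints overlap can be seen in a toy projection. The sketch below collapses a binary 3D mask along one axis with a logical OR; it assumes parallel rays along the y axis, which is a simplification of the canonical radiographic geometry the abstract refers to, and the two tiny masks are invented for illustration.

```python
def project_mask(volume_mask):
    """Project a binary mask indexed [z][y][x] along y, giving a [z][x] footprint."""
    footprint = []
    for slab in volume_mask:  # each slab is indexed [y][x]
        width = len(slab[0])
        footprint.append(
            [1 if any(row[x] for row in slab) else 0 for x in range(width)]
        )
    return footprint


# Two hypothetical structures at different depths (different y rows).
structure_a = [[[0, 0, 0], [0, 1, 1], [0, 0, 0]]]
structure_b = [[[0, 0, 0], [0, 0, 0], [1, 1, 0]]]

overlap_3d = any(
    a and b
    for slab_a, slab_b in zip(structure_a, structure_b)
    for row_a, row_b in zip(slab_a, slab_b)
    for a, b in zip(row_a, row_b)
)

footprint_a = project_mask(structure_a)
footprint_b = project_mask(structure_b)
overlap_2d = any(
    a and b
    for row_a, row_b in zip(footprint_a, footprint_b)
    for a, b in zip(row_a, row_b)
)
```

The masks share no voxel in 3D but their 2D footprints intersect, so labeling in volumetric space and projecting afterwards gives each structure a complete footprint that would be hard to trace directly on the radiograph.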


2605.17367 2026-05-19 cs.CV (version update)

Bridging Data Trials and Task Barriers: A Unified Framework for Sketch Biometric Identification

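Decheng Liu, Bin Hu, Xinbo Gao, Dawei Zhou, Chunlei Peng, Nannan Wang, Ruimin Hu

Affiliations: IEEE

AI summary: This paper proposes a unified framework for the cross-modality and cross-task challenges of sketch biometric identification, using efficient synthetic sketch generation and task-sequential continual learning to improve model robustness and generalization.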

Comments The source code and models are publicly available at https://github.com/sHanbIgsUn/UFSB


Abstract

Unlike existing cross-modality identification tasks (e.g., heterogeneous face recognition and sketch re-identification), we introduce a novel yet practical setting for these related identification tasks, named sketch biometric identification, which aims to continually train a unified model across different data domains and even across diverse identification tasks. Sketch biometric identification faces several challenges, including scarce real sketch data, high annotation costs, privacy risks, and the insufficient generalization ability of cross-task models. Existing methods usually rely on limited real data or single-task optimization, making it difficult to address the joint challenges of cross-modality and cross-task identification. This paper proposes a unified framework that integrates efficient synthetic sketch generation and task-sequential continual learning. First, we design an efficient pipeline to generate large-scale, high-quality synthetic person and face sketch data, which significantly reduces costs and avoids privacy risks. Meanwhile, we enhance the model's robustness by fusing in real data. Second, we construct a universal unified framework for sketch biometric identification, which adopts a task-sequential training strategy: the model first completes sketch person re-identification learning on the person dataset; subsequently, it maintains the acquired person recognition capability through a trusted sample replay technique and seamlessly performs incremental training on the face dataset. This enables a single model to handle multiple sketch biometric identification tasks simultaneously. To support the study of sketch biometric identification, we built a new large-scale benchmark, SketchUnified-BioID, with several practical evaluation protocols.


2605.17365 2026-05-19 cs.CV (version update)

Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

Xianke Chen, Daizong Liu, Yushuo Lou, Xin Tan, Xun Yang, Shuhui Wang, Xun Wang, Jianfeng Dong

Affiliations: School of Computer Science and Technology, and the School of Statistics and Mathematics, Zhejiang Gongshang University; School of Computer Science and Technology, Zhejiang Gongshang University, and the Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology; Wangxuan Institute of Computer Technology, Peking University; School of Information and Electronic Engineering, Zhejiang Gongshang University; School of Computer Science and Technology, East China Normal University; Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, CAS; School of Information Science and Technology, University of Science and Technology of China

AI summary: This paper proposes MAQIU, an efficient memory-augmented query intent understanding framework for chat-based image retrieval that dynamically aggregates and evolves the semantic representation of query intent, prevents intent forgetting, and strengthens long-term semantic integrity, yielding substantial performance gains while keeping computation efficient.


Abstract

Different from traditional text-to-image retrieval tasks, chat-based image retrieval allows the human-interactive system to iteratively clarify and refine user intent through multi-round dialogue, thereby achieving more fine-grained retrieval results. The key challenge in this task lies in dynamically understanding and updating the user's query intent across dialogue rounds. Although existing works have achieved great performance on this new task, they simply handle history query information either by directly concatenating all previous queries into a long textual sequence or by relying on large language models to reconstruct the current query from history. Such strategies are computationally redundant and easily lead to inconsistent intent representations as the dialogue progresses. To alleviate these issues, this paper proposes a novel and efficient memory-based user intent updating framework for the chat-based image retrieval task, called Memory-Augmented Query Intent Understanding (MAQIU). It introduces a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues, while a memory recall mechanism is further employed to prevent intent forgetting and enhance long-term semantic integrity. In addition, MAQIU also integrates historical image retrieval results as visual guidance, allowing the model to strengthen cross-round correlations and refine current visual understanding. Extensive experiments demonstrate that MAQIU achieves substantial performance gains while maintaining high computational efficiency, reducing dialogue encoding FLOPs by 86.4% compared with the prior baseline ChatIR. Source code is available at https://github.com/HuiGuanLab/MAQIU.


2605.17360 2026-05-19 cs.CV (version update)

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

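Chaoqun He, Mingyang Xiang, Yingjing Xu, Bokai Xu, Junbo Cui, Jie Zhou, Yuan Yao, Lijie Wen

Affiliations: Tsinghua University; Tongji University; ModelBest Inc.

AI summary: This paper proposes Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction; two complementary scenarios test a model's ability to generate continuous responses and to give proactive reminders, and the results expose the difficulty existing models have in balancing response timeliness with content coherence.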

Comments 22 pages, 6 figures


Abstract

Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.


2605.17356 2026-05-19 cs.CV (version update)

UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings

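Bo Zhao, Maosheng Pang, Chen Zhang, Huan Yang, Yixin Cao, Wei Ji

AI summary: This paper presents UniPPTBench, a unified benchmark for presentation generation across four representative input settings (vague-prompt, long-document, multimodal-document, and multi-source generation), together with UniPPTEval, an evaluation protocol that combines shared metrics for cross-setting comparison with metrics tailored to the core requirements of each setting.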


Abstract

Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.


2605.17354 2026-05-19 cs.CV (version update)

GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction

Weiquan Lin, Yaoqing Hu, Liangchen Dai, Xu Tang, Xingyu Chen

Affiliations: School of Artificial Intelligence, Xidian University; Zhongguancun Academy; School of Automation, Beijing Institute of Technology

AI summary: This paper proposes GeoHand, a framework that unlocks high-quality geometric priors from the frozen foundational monocular geometry estimator MoGe2 and combines a map-level GeoAdapter with a gated cross-modal token fusion strategy to achieve high-accuracy hand reconstruction, performing especially well under severe occlusion and hand-object interaction.


Abstract

Monocular 3D hand reconstruction is intrinsically a geometric problem, yet RGB appearance features alone often struggle to resolve severe ambiguities caused by self-occlusions and hand-object interactions. While introducing depth can explicitly provide spatial cues, raw sensor-captured depth maps are extensively noisy and incomplete, limiting their usefulness for fine-grained hand reconstruction. To bridge this gap, we propose GeoHand, a novel framework that unlocks high-quality geometric priors from a frozen foundational monocular geometry estimator (MoGe2). Recognizing that these priors are oriented toward general scenes, we introduce a map-level GeoAdapter to recalibrate the spatial features, specifically adapting them for detailed hand reconstruction. Furthermore, to systematically integrate these adapted priors without overwhelming intrinsic RGB appearance cues, we employ a gated cross-modal token fusion strategy. Finally, to secure precise local articulation, we design a Keypoint-Queried Iterative Refiner (KQIR) that uses projected joint locations to query geometry-aware image features for spatial correction. By combining global geometric disambiguation with local refinement in a unified pipeline, GeoHand achieves state-of-the-art performance on FreiHAND, DexYCB, and HO3Dv3, especially under severe occlusions and hand-object interactions.


2605.17347 2026-05-19 cs.CY cs.CV cs.LG (version update)

Position: Age Estimation Models Do Not Process Biometric Data

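Nikita Marshalkin

Affiliations: Sumsub GmbH, Berlin, Germany

AI summary: This position paper examines whether age estimation models process biometric data; its experiments indicate that these models cannot reach identification thresholds and therefore do not perform identification, and it calls on researchers and regulators to improve transparency.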

Comments 11 pages, 3 figures, 3 tables. Accepted as a position paper at the 43rd International Conference on Machine Learning (ICML 2026)

2605.17345 2026-05-19 cs.CV (version update)

VoxShield: Protecting 3D Medical Datasets from Unauthorized Training via Frequency-Aware Inter-Slice Disruption

Xinyao Liu, Zhipeng Deng, Wenhan Jiang, Haolin Wang, Xun Lin, Yafei Ou, Yefeng Zheng

Affiliations: Westlake University; Dalian University of Technology; Hokkaido University; The Chinese University of Hong Kong; RIKEN

AI summary: This paper proposes VoxShield, which uses a frequency-aware inter-slice disruption mechanism to target the volumetric inductive biases that 3D segmentation networks rely on, effectively degrading the performance of 3D segmentation networks trained on protected medical datasets while preserving visual quality.

Comments Submitted version to MICCAI 2026 (Provisional Accept)

Abstract

The release of public 3D medical image segmentation (MIS) datasets accelerates clinical research but simultaneously heightens risks of unauthorized AI model training. While Unlearnable Examples (UE) offer protection by injecting imperceptible perturbations to prevent effective model learning, existing methods primarily target 2D scenarios. They neglect the volumetric spatial correlations and inter-slice anatomical consistency inherent in 3D medical volumes, which serve as critical learning priors for 3D segmentation networks. To bridge this gap, we propose VoxShield, a UE framework that explicitly targets the volumetric inductive biases of 3D networks. Our core insight is that by systematically dismantling the cross-slice continuity that 3D architectures rely on, we can fundamentally impair their spatial aggregation process. Specifically, we introduce an Inter-Slice Frequency Consistency Disruption mechanism that maximizes the spectral divergence between adjacent slices, injecting structural incoherence along the $z$-axis. Complementing this structural attack, a Semantic Prediction Disruption module is incorporated. By maximizing the $\ell_1$ divergence between clean and perturbed logits, it forces the injected noise to penetrate the entire network and corrupt the final semantic mapping. Experiments on BraTS19 and FLARE21 demonstrate that VoxShield successfully degrades 3D segmentation performance, reducing the DSC from 80.0% to near 0.0% and from 88.6% to 6.8%, respectively. All protections are achieved with minimal perturbation ($ε=4/255$) to preserve high visual fidelity. The code is available at https://github.com/KK266299/VoxShield.
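The perturbation budget quoted above ($ε=4/255$) is commonly read as a per-voxel L-infinity bound, and the Semantic Prediction Disruption term is described as an $\ell_1$ divergence between logits. The sketch below illustrates both ideas under stated assumptions: intensities are normalized to [0, 1], the budget is an L-infinity bound, and the helper names `project_perturbation` and `l1_divergence` are hypothetical rather than taken from the VoxShield repository.

```python
EPSILON = 4 / 255  # budget quoted in the abstract; the L-infinity reading is an assumption


def project_perturbation(clean, delta, eps=EPSILON):
    """Clip each perturbation to [-eps, eps], then keep clean + delta inside [0, 1]."""
    projected = []
    for x, d in zip(clean, delta):
        d = max(-eps, min(eps, d))          # stay inside the L-infinity ball
        x_adv = max(0.0, min(1.0, x + d))   # stay inside the valid intensity range
        projected.append(x_adv - x)
    return projected


def l1_divergence(logits_a, logits_b):
    """Sum of absolute differences between two logit vectors."""
    return sum(abs(a - b) for a, b in zip(logits_a, logits_b))
```

An attack loop would typically apply such a projection after every gradient step so that the protected volume never leaves the stated budget.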


2605.17343 2026-05-19 cs.CV Updated version

GraphMAR: Geometry-Aware Graph Learning Framework for Spatially Adaptive CT Metal Artifact Reduction

Zilong Li, Chenglong Ma, Yiming Lei, Yuanlin Li, Jing Han, Jiannan Liu, Huidong Xie, Junping Zhang, Yi Zhang, Hongming Shan

Affiliations: Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University; Institute of Science and Technology for Brain-inspired Intelligence, Fudan University; College of Computer Science and Technology, Qingdao University; Department of Oral Maxillofacial Head and Neck Oncology, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine; School of Cyber Science and Engineering, Sichuan University

AI summary: This paper proposes GraphMAR, a geometry-aware graph learning framework for spatially adaptive CT metal artifact reduction in the image domain. It introduces graph-based geometric modeling to identify artifacts explicitly, improving restoration quality and interpretability.

Abstract

Computed tomography (CT) metal artifact reduction (MAR) aims to reduce the severe streaking artifacts induced by metallic implants and other high-density objects. Effective MAR generally requires both accurate artifact localization and artifact removal. Sinogram-domain methods can exploit explicit geometric cues, such as metal traces, to identify metal-corrupted measurements, while requiring raw projection data, which is often unavailable in clinical and practical scenarios. Image-domain methods are more flexible and widely applicable, yet they usually lack comparable geometric guidance, limiting their ability to localize artifacts and leading to suboptimal results. To address this limitation, we propose GraphMAR, a geometry-aware learning framework for explicit artifact identification and spatially adaptive MAR in the image domain. The key idea is to introduce graph-based geometric modeling as an image-domain analogue of sinogram metal traces. Specifically, we first construct a geometric graph from the metal mask and derive a geometric density graph that coarsely localizes artifact-prone regions according to inter-implant geometry. We then design GraphMoE, a graph-routed mixture-of-experts module that builds a polar-coordinate artifact graph in feature space and adaptively routes different experts to different spatial regions for MAR. By aligning the learned routing maps with the geometric density graph, GraphMAR provides explicit and interpretable artifact localization while enabling region-adaptive artifact reduction. Experiments on both simulated and real-world datasets demonstrate that GraphMAR achieves superior MAR performance compared with existing methods. To the best of our knowledge, this is the first work to introduce graph-based modeling for CT MAR and to enable explicit artifact identification in the image domain, improving both restoration quality and interpretability.


2605.17341 2026-05-19 cs.CV cs.AI Updated version

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

Jiaqing Li, Yajuan Lu, Xiaochuan Shi, Gang Wu, ZhongYuan Wang, Chao Liang

Affiliations: Wuhan University; Tarim University

AI summary: This paper proposes a new membership inference attack framework based on cross-modal semantic alignment to assess the data security risks of vision-language models in single-sample, black-box settings. By quantifying alignment in a joint embedding space, it substantially improves attack performance.

Abstract

Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box methods depend on large-scale statistical distributions and therefore struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment, and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for strict black-box, single-sample settings that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conduct extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
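The abstract scores membership by image-caption alignment in a joint embedding space and reports an AUC. A minimal sketch of those two pieces follows, assuming cosine similarity as the alignment score and embeddings that have already been produced by some joint image-text encoder; the abstract does not specify the paper's actual scoring function, so both function names and the choice of cosine similarity are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Alignment score between an image embedding and a caption embedding."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def auc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))
```

Because the score is computed per sample, no reference distribution over many queries is needed, which is what makes a threshold on it usable in the single-sample setting.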


2605.17336 2026-05-19 cs.RO cs.CV eess.SP Updated version

Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

Zhixiang Cao, Di Tian, Runwei Guan, Yanzhou Mu, Xiaolou Sun, Shaofeng Liang, Daizong Liu, Tao Huang, Yutao Yue, Henghui Ding, Bin Fang, Alex Zhou, Qing-Long Han, Hui Xiong

Affiliations: School of Electronic Science and Engineering, Xi’an Jiaotong University, China; Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China; State Key Laboratory for Novel Software Technology, Nanjing University, China; Purple Mountain Laboratory, China; Institute for Math & AI, Wuhan University, China; Centre for AI and Data Science Innovation and the School of Science and Engineering, James Cook University, Australia; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China; Institute of Big Data, Fudan University, China; Linkerbot (Beijing) Technology Co., Ltd, China; School of Engineering, Swinburne University of Technology, Melbourne

AI summary: This paper surveys research on multimodal tactile fusion in embodied intelligence, discussing how integrating vision, language, and tactile information can better connect physical interaction with semantic reasoning. It proposes a hierarchical taxonomy and summarizes current research challenges and future directions.

Comments 20 pages, 8 figures

Abstract

Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile sensing with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, lacking a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.


2605.17327 2026-05-19 cs.RO cs.AI cs.CV Updated version

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo

Affiliations: MBZUAI; HKUST (GZ); Zhejiang University

AI summary: This paper proposes an initialization framework that needs no visual feature tracking and instead uses point clouds predicted by a feed-forward 3D model, improving the reliability and efficiency of initialization for monocular visual-inertial navigation systems. Experiments show an initialization success rate above 90% with substantially less data required.

Abstract

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.


2605.17312 2026-05-19 cs.CV Updated version

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers

Yiren Song, Wangzi Yao, Haofan Wang, Mike Zheng Shou

Affiliations: Show Lab, National University of Singapore; Institute of Automation, Chinese Academy of Sciences; Lovart AI

AI summary: This paper proposes VISTA, which introduces large-scale triplet data and a diffusion-transformer-based framework to address the joint modeling and disentanglement of style, content, and motion in video style transfer, achieving high-quality style transfer.

Abstract

Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate state-of-the-art performance in style fidelity, temporal consistency, and content preservation.


2605.17311 2026-05-19 cs.CV Updated version

SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

Zixi Wei, Huixuaun Zhang, Xiaojun Wan

Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI summary: This paper proposes the SpecSem-Net framework, which introduces a semantic-guided spectral denoising mechanism to detect high-fidelity AI-generated videos. Experiments report accuracies of 87.25% on the authors' benchmark and 95.59% on public datasets.

Abstract

The remarkable visual fidelity of recent commercial video generative models, such as Sora and Veo, renders robust AI-generated video detection increasingly essential to prevent synthetic content from being indistinguishable from real videos and exploited for disinformation. However, existing detectors often fail due to an over-reliance on increasingly realistic semantic features, neglecting subtle spectral artifacts. In this paper, we propose SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. Specifically, we design a spectral module to extract high-frequency features via Fourier-Transform based filtering. Furthermore, to reduce misjudgments arising from spectral noise, we employ a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. Additionally, to evaluate detector performance on the latest top-tier generative models, we construct a comprehensive benchmark comprising 5 SOTA commercial generators. Extensive experiments demonstrate that SpecSem-Net outperforms existing methods, achieving accuracies of 87.25% and 95.59% on our benchmark and public datasets, respectively.


2605.17310 2026-05-19 cs.CV cs.AI Updated version

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

Zhiqiang Wang, Dongrui Liu, Yan Li, Zonghao Ying, Wei Xue, Wenhan Luo, Yike Guo

Affiliations: Hong Kong University of Science and Technology; Shanghai Jiao Tong University; Beihang University

AI summary: This paper studies cross-query response manipulation in vision-language models and proposes Attention Hijacking, a new adversarial attack that steers internal attention distributions to maintain an image-dominant pattern, making the attack more effective across different queries.

Abstract

Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different textual queries. This paper studies cross-query response manipulation, where a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with preserving an image-dominant attention pattern during response generation. Motivated by this observation, we propose Attention Hijacking, a novel adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of textual tokens, our method reduces the dependence of the manipulated output on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insights into the role of attention stability in transferable response manipulation for VLMs.


2605.17309 2026-05-19 cs.CV cs.AI Updated version

StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting

Aleksandr Simonyan, Nipun Jindal

Affiliations: Adobe Inc.

AI summary: This paper presents StyleText, a large-scale dataset and benchmark for style-preserving scene-text inpainting that enables controlled evaluation of text legibility and visual consistency under shared scene context.

Comments Accepted at the SynData4CV Workshop, CVPR 2026. 8 pages + 1 page of references, 5 figures, 4 tables

Abstract

We present StyleText, a large-scale dataset and benchmark for localized scene-text inpainting with style preservation. StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families, enabling controlled evaluation of text legibility and visual consistency under shared scene context. We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. We define a reproducible evaluation protocol using normalized OCR metrics (word accuracy and character error rate) and CLIP image-image similarity with explicit preprocessing. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.
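The evaluation protocol above relies on normalized OCR metrics (word accuracy and character error rate). The sketch below shows one common way to compute them, assuming normalization means lowercasing and collapsing whitespace, and assuming position-wise word matching; the benchmark's exact preprocessing is not given in the abstract, so these choices and all function names are illustrative.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def edit_distance(a, b):
    """Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the normalized reference."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def word_accuracy(reference, hypothesis):
    """Fraction of reference words reproduced exactly at the same position."""
    ref_words = normalize(reference).split()
    hyp_words = normalize(hypothesis).split()
    if not ref_words:
        return 1.0 if not hyp_words else 0.0
    hits = sum(1 for r, h in zip(ref_words, hyp_words) if r == h)
    return hits / len(ref_words)
```

Here `reference` would be the prompt text requested for the masked region and `hypothesis` the OCR reading of the inpainted result.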


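2605.17303 2026-05-19 cs.CV Updated version

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

Chenyi Xu, Yihao Wu, Liqi Yan, Chao Yang, Jianhui Zhang, Fangli Guan, Pan Li

Affiliations: Hangzhou Dianzi University

AI summary: This paper proposes LongDPM, an overlap-aware framework for monocular dynamic reconstruction from long videos. Through chunked processing, chunk registration, and dynamic identity association, it achieves long-range 3D reconstruction and tracking, improving dense tracking accuracy on PointOdyssey, Kubric-F, and Kubric-G as well as camera pose estimation.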

Abstract

Recovering a dynamic 3D scene from a long monocular video requires dense geometry, camera motion, and temporal correspondence to remain consistent in a shared coordinate system. Existing methods face two key challenges: (1) feed-forward reconstruction models provide accurate local predictions but are limited to short clips, and (2) long-range trackers preserve correspondences without producing dense sequence-level reconstruction. This paper presents LongDPM, a novel overlap-aware framework for scalable long-range monocular dynamic reconstruction. First, LongDPM processes long videos in overlapping chunks, keeping inference memory bounded by the chunk length. Second, it connects chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction. Third, it associates dynamic identities across chunk boundaries and fuses matched trajectories to recover coherent long-range 3D motion. Experimental results demonstrate that LongDPM achieves superior long-range reconstruction and tracking performance, reducing dense tracking EPE over V-DPM on PointOdyssey, Kubric-F, and Kubric-G, while obtaining the best TUM-dynamics ATE for camera pose estimation.
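The first step described above, processing a long video in overlapping chunks so that memory is bounded by the chunk length, can be sketched as a simple index generator. The function name, the half-open `(start, end)` convention, and the specific chunk length and overlap values are hypothetical; the abstract does not state the settings LongDPM uses.

```python
def overlapping_chunks(num_frames, chunk_len, overlap):
    """Return half-open (start, end) frame ranges; consecutive ranges share `overlap` frames."""
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    if num_frames <= 0:
        return []
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end >= num_frames:  # the last chunk reaches the end of the video
            break
        start += stride
    return chunks
```

The shared frames between consecutive ranges are what a registration step would use to align each chunk's local coordinate system with the previous one.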


2605.17294 2026-05-19 cs.CV Updated version

EgoIntrospect: A Gaze Dataset and Benchmark for User-Centric Internal State Reasoning (title translated from the Chinese listing)

Zeyu Wang, Chang Liu, Eduardus Tjitrahardja, Yuntao Wang, Borislav Pavlov, Fangfei Gou, Jose Manuel Davila, Dai Shi, Ran Xu, Yue Pan, Jiayi Tan, Shuting Chang, Qi Wang, Jinzhao Li, Jiacheng Hua, Yifei Huang, Jingwei Sun, Yu Zhang, Liuxin Zhang, Guocai Yao, Jia Jia, Yin Li, Qianying Wang, Yuanchun Shi, Miao Liu

Affiliations: Tsinghua University; Tongji University; Renmin University of China; The University of Tokyo; Lenovo Group; Peking University; University of Wisconsin–Madison; Shanghai Qi Zhi Institute

AI summary: This paper presents the EgoIntrospect dataset for studying user-centric internal state reasoning. It uses self-annotation to reveal users' intentions when interacting with AI assistants and evaluates how well multimodal large language models can infer a user's internal state from gaze observations.


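OPTNet: A Point Transformer Network for Post-Disaster 3D Semantic Segmentation (title translated from the Chinese listing)

Nhut Le, Ehsan Karimi, Maryam Rahnemoonfar

Affiliations: Computer Science and Engineering, Lehigh University, Bethlehem PA 18015, US; Civil and Environmental Engineering, Lehigh University, Bethlehem PA 18015, US

AI summary: This paper proposes OPTNet, a network for semantic segmentation of post-disaster 3D point clouds that uses a learnable point-ordering module to dynamically predict an optimal permutation, improving the locality of the attention mechanism.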

Comments Accepted for International Conference on Pattern Recognition (ICPR) 2026

Abstract

Post-disaster damage assessment requires rapid and accurate semantic segmentation of 3D point clouds to identify critical infrastructure such as damaged buildings and roads. Early Point Transformers (e.g., PTv1, PTv2) relied on computationally expensive neighbor searching (k-NN) and Farthest Point Sampling (FPS). To improve efficiency, recent architectures like Point Transformer V3 (PTv3) adopted static serialization methods, such as Hilbert curves or Z-order, to organize unstructured points for window-based attention. However, these fixed orderings are not optimal for capturing the complex geometry of disaster scenes. In this paper, we propose OPTNet (Ordering Point Transformer Network), which introduces a learnable Point Sorter module. OPTNet utilizes a self-supervised ordering loss to dynamically predict an optimal permutation that maximizes the locality of the attention mechanism. We evaluate our method on the 3DAeroRelief dataset, significantly outperforming state-of-the-art baselines.
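The static serialization that the abstract contrasts with OPTNet's learned ordering can be illustrated with a Z-order (Morton) curve: each point is quantized to a voxel, the voxel's coordinate bits are interleaved, and points are sorted by the resulting code. This is a sketch of the fixed-ordering baseline only, not of the learnable Point Sorter; it assumes non-negative coordinates, and the function names and `grid_size` parameter are hypothetical.

```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative voxel indices."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def z_order_serialize(points, grid_size, bits=10):
    """Return point indices sorted by the Morton code of each point's voxel."""
    def key(i):
        x, y, z = points[i]
        return morton_code(int(x // grid_size), int(y // grid_size),
                           int(z // grid_size), bits)
    return sorted(range(len(points)), key=key)
```

Consecutive indices in the returned order tend to be spatially close, which is why window-based attention can operate on contiguous slices of the serialized sequence.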


2605.17179 2026-05-19 cs.CV Updated version

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

Chengyan Wang, Haoyu Chen, Hui Wei, Yueyi Yang, Yunquan Chen, Guoying Zhao

Affiliations: CMVS, University of Oulu; KTH Royal Institute of Technology; ELLIS Institute Finland

AI summary: This paper presents the large-scale iMiGUE-3K dataset and the MG-FMs foundation model for micro-gesture-based emotion understanding, using self-supervised learning to improve emotion recognition performance.

Abstract

Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, there are no existing large-scale datasets supporting the pre-training of the MG foundation model. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, featuring key contributions with a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings from 332 distinct professional tennis players' public press interviews over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild, video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture presentation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.


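2605.17165 2026-05-19 cs.CV cs.LG Updated version

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy

Affiliations: Computer Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport

AI summary: To address the challenges posed by rapidly advancing deepfake techniques and video editing tools, this paper proposes the CAM-VFD framework, which detects video forgeries by modeling cross-modal contradictions. Experiments on two generated-video benchmarks show strong performance and good robustness.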

Abstract

The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions ($p<0.001$, Cohen's $d=0.68$). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at https://github.com/Hoda-Osama/CAM-VFD/tree/main.
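The separation between real and fake attention-discrepancy distributions is reported above as Cohen's $d$. For readers unfamiliar with the statistic, the sketch below computes it with the usual pooled-standard-deviation formula; whether the authors used this exact variant is not stated in the abstract, and the function name is illustrative.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Difference of means divided by the pooled sample standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd
```

By Cohen's conventional benchmarks (0.5 medium, 0.8 large), a value of 0.68 sits between a medium and a large effect.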

2605.17125 2026-05-19 cs.CV cs.LG 版本更新

Principal Component Analysis for Lunar Crater Detection

基于主成分分析的月球陨石坑检测

Travis Driver, John A. Christian

发表机构 * School of Aerospace Engineering, Georgia Institute of Technology（航空航天工程学院，佐治亚理工学院）

AI总结本文提出了一种基于主成分分析的自动陨石坑模板生成方法，用于改进基于图像的陨石坑识别技术，通过在模拟月球图像上展示优于手工挑选模板的检测和定位性能。

2605.17120 2026-05-19 cs.CV 版本更新

Markerless Motion Capture for Biomechanical Whole-Body Kinematic Estimation in Infants

无标记运动捕捉用于婴儿生物力学全身运动学估计

Divya Joshi, J. D. Peiffer, Colleen Peyton, R. James Cotton

发表机构 * Center for Bionic Medicine, Shirley Ryan AbilityLab（仿生医学中心，Shirley Ryan AbilityLab）； Dept. of Physical Therapy and Human Movement Science, Northwestern University（物理治疗与人类运动科学系，西北大学）； Dept. of Biomedical Engineering, Northwestern University（生物医学工程系，西北大学）； Dept. of Pediatrics, Northwestern University（儿科系，西北大学）

AI总结本研究评估了三种先进的姿态估计框架在婴儿运动学重建中的性能，展示了无标记运动捕捉在婴儿生物力学分析中的潜力和局限性。

Comments Accepted to EMBC 2026

AI中文摘要

早期识别婴儿运动障碍依赖于专家对自发运动的视觉评估，这推动了自动化、客观方法的发展。本文系统评估了三种最先进的姿态估计框架（MeTRAbs-ACAE、SAM 3D Body和Sapiens）在8名婴儿13次录制的100个视频上的性能。通过重投影误差、几何一致性以及Procrustes对齐的3D位置误差量化关键点检测精度，并展示了将逆向运动学框架拟合到婴儿数据的可行性证明。虽然Sapiens在重投影误差和几何一致性方面表现最佳（分别为22.8像素和0.82），但SAM 3D Body提供了最全面的3D信息用于运动学重建，其Procrustes对齐的位置误差为19至28毫米。通过案例比较示例，证明了基于SAM 3D Body估计的生物力学模型能够区分与运动发育相关的婴儿典型运动模式，如临床专家所识别的。这些发现突显了3D姿态估计在婴儿生物力学中的潜力和当前限制，并为可扩展的视频基早期运动发育评估奠定了初步基础。

英文摘要

Early identification of motor impairment in infancy relies on expert visual assessment of spontaneous movement, motivating the development of automated, objective alternatives. One promising approach is using computer vision, which benefits from high quality pose estimation from video. In this study, we systematically evaluated three state-of-the-art pose estimation frameworks (MeTRAbs-ACAE, SAM 3D Body, and Sapiens) on 100 videos over 13 sessions of 8 infants recorded with a multi-view markerless motion capture system. We quantified keypoint detection accuracy using reprojection error, geometric consistency, and Procrustes-aligned 3D position error, and demonstrated proof-of-concept for fitting an inverse kinematic framework to infant data. While Sapiens achieved the lowest reprojection error and highest geometric consistency of the methods evaluated (22.8 pixels and 0.82, respectively), SAM 3D Body provided the most comprehensive 3D information for kinematic reconstruction with Procrustes-aligned position errors of 19 to 28 mm. We demonstrate in a case comparison example that biomechanical models fit to SAM 3D Body estimates distinguish representative movement patterns in infants related to motor development, as identified by a clinical expert. Together, these findings highlight both the promise and current limitations of 3D pose estimation for infant biomechanics and establish preliminary groundwork for scalable, video-based assessment of early motor development.

2605.17102 2026-05-19 cs.GR cs.CV 版本更新

VoxScene: Anchor-Conditioned Voxel Diffusion for Indoor Scene Arrangement

VoxScene: 基于锚点的体素扩散用于室内场景布置

Haotian Mao, Yuhan Huang, Jiatao Lin, Yang Zhao, Hui Wang, Yiheng Zhang, Yuwang Wang, Chenliang Zhou, Yan Zhang, Fangcheng Zhong, Xubo Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Hong Kong University of Science and Technology（香港科技大学）； Tsinghua University（清华大学）； University of Cambridge（剑桥大学）； Peking University（北京大学）

AI总结本文提出VoxScene，一种基于锚点的体素扩散框架，用于3D场景合成。通过引入显式的、以物体为中心的体素表示，解决了现有方法在处理密集环境时的物理碰撞和结构纠缠问题，实现了高保真体素网格的生成，提升了场景布置的物理合理性和形状多样性。

AI中文摘要

我们提出了VoxScene，一种新颖的基于锚点的体素扩散框架，专门用于3D场景合成。当前数据驱动的布局生成技术通常依赖于边界代理或隐式表示，这忽略了体素结构。这种几何盲性不可避免地导致严重的物理碰撞和结构纠缠，特别是在密集环境中。为克服这些限制，我们转向显式的、以物体为中心的体素表示。我们的流程依次合成离散的体素占用，条件于先前的锚点和局部上下文。通过利用离散体素的互斥性质，我们的方法消除了空间歧义，即使在高度复杂的环境中也能保证无碰撞的布置。此外，生成的高保真体素网格作为判别性的几何查询，用于后续资产检索。广泛的实验表明，我们的方法具有普遍性，实现了最先进的物理合理性，并在形状多样性方面超越了现有布局规划器。

英文摘要

We present VoxScene, a novel anchor-conditioned voxel diffusion framework tailored for 3D scene synthesis. Current data-driven layout generation techniques typically rely on bounding proxies or implicit representations, which overlook volumetric structures. This geometric blindness inevitably leads to severe physical collisions and structural entanglement, particularly in densely populated environments. To overcome these limitations, we shift the paradigm to an explicit, object-centric voxel representation. Our pipeline sequentially synthesizes discrete volumetric occupancies conditioned on prior anchors and local context. By exploiting the mutually exclusive nature of discrete voxels, our approach eliminates spatial ambiguities and guarantees collision-free arrangements, even in highly complex environments. Furthermore, the synthesized high-fidelity voxel grids serve as discriminative geometric queries for downstream asset retrieval. Extensive experiments demonstrate the universality of our method, achieving state-of-the-art physical plausibility and unlocking shape diversity compared to existing layout planners.

2605.17095 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

随身摄像头视频中警察接触事件的视觉时间线：OpenBWC中用于培训和分析的操作上下文与活动编目

Angela Srbinovska, Christopher Homan, Adrian Martin, Ernest Fokoué

发表机构 * Rochester Institute of Technology（罗切斯特理工学院）； Rochester Police Department（罗切斯特警察局）； Office of Business Intelligence（商业智能办公室）； School of Mathematics and Statistics（数学与统计学院）

AI总结本文提出了一种处理随身摄像头（BWC）视频的方法，将视频切分为时间对齐的固定长度10秒窗口序列，并通过注重隐私的协议进行处理和标注，以提高事件审查和培训流程的效率。

Comments 13 pages, 10 figures, 9 tables

AI中文摘要

执法机构正在积累海量的随身摄像头（BWC）视频。然而，这些视频在业务使用上仍不透明：分析人员和培训人员仍需花费大量时间观看完整视频，以确定关键接触事件的起点，并找出活动转为更激烈肢体动作的时间点。我们提出了一种方法，将BWC视频处理为时间对齐的固定长度10秒窗口序列，并通过注重隐私的协议进行处理和标注。每个窗口标注两个维度的信息：（i）窗口的操作上下文和（ii）窗口内的运动强度水平；对于因黑暗、模糊或遮挡而证据不足的窗口，使用低证据标签。我们从每个窗口中采样帧，用CLIP模型编码并聚合为窗口级表示，据此训练模型在这两个维度上对窗口进行分类。我们还提取每个窗口的稠密光流统计量以刻画运动强度。在测试窗口上，最佳上下文模型的准确率达到78.75%，准确率最高的活动模型达到88.33%。我们还进行了完整性审计，并展示了视觉时间线表示如何支持更快的事件审查，使警员培训流程更加实用。

英文摘要

Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage. However, this footage remains operationally opaque. That is, analysts and trainers still have to invest considerable time watching full-length videos to pinpoint the start of key encounters and identify the points where activity shifts to something more physically intense. We present an approach to process BWC video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled using a privacy-conscious protocol. Each window is labeled with two dimensions of information: (i) the operational context of the window and (ii) the level of motion intensity within the window, with low-evidence labels for windows for which insufficient evidence exists due to darkness, blur, or occlusion. We train models to classify windows along these two axes using frames sampled from each window, encoded with a CLIP model, and aggregated into a window-level representation. We extract dense optical flow statistics for each window to capture motion intensity. On test windows, the best context model achieves 78.75% accuracy, and the best-accuracy activity model achieves 88.33%. We also include integrity audits of the results and show how the visual timeline representations support faster incident review and make the officer training workflow more practical.
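
The windowing step can be sketched in a few lines. This is an illustrative sketch only: the function name `fixed_windows` is hypothetical, and keeping a shorter trailing window is an assumption, since the abstract does not say whether a final partial window is kept, padded, or dropped.

```python
def fixed_windows(duration_s, window_s=10.0):
    """Return (start, end) pairs, in seconds, covering a video in fixed-length windows.

    Assumption: a trailing partial window is kept rather than dropped or padded.
    """
    if duration_s <= 0 or window_s <= 0:
        return []
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start += window_s
    return windows


# A 25-second clip yields two full windows and one 5-second remainder.
timeline = fixed_windows(25.0)
```

Each `(start, end)` pair would then receive one context label and one motion-intensity label, which is what makes the result readable as a timeline.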

2605.17093 2026-05-19 cs.CV cs.CL 版本更新

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

HEED：基于密度加权残差对齐的混合视觉-语言模型蒸馏

Yihao Liang, Niraj K. Jha

发表机构 * Princeton University（普林斯顿大学）

AI总结本文提出HEED方法，通过密度加权残差对齐改进混合视觉-语言模型蒸馏，提升在OCR和文档任务中的性能，同时在不同教师模型和混合架构上实现高效推理。

AI中文摘要

将视觉-语言模型蒸馏为更快的混合架构（如3:1的Mamba-2/注意力混合）已成为提高推理效率的标准做法。聚合基准表明这种做法有效，但掩盖了选择性失败。当我们将Qwen3-VL-8B-Instruct蒸馏为3:1的Mamba-2/注意力混合模型时，学生模型在MMStar、MMBench和MMMU-Pro等视觉推理基准上与教师模型的差距保持在2分以内，但在光学字符识别和文档任务上下降了13分。学生模型仍能理解场景，但丢失了回答问题所需的细粒度文本。我们将大部分失败定位到一类特定的位置。在高分辨率图像中，大多数图像块是天空、墙壁或平滑纹理，只有一小部分包含文本、边缘、物体边界或其他局部细节。在词元级诊断中，密度最高的前10%图像块的残差漂移是密度最低的后10%图像块的3.6倍，其教师掩蔽答案贡献是后者的3.5倍。均匀加权将大量损失项分配给低信息量的背景图像块，而稀疏的、承载答案的图像块没有得到特殊保护。所需的干预很小：我们用密度加权残差对齐取代均匀残差对齐，并以图像块自不相似性作为位置重要性的免训练代理。我们将该方法称为HEED。与常规的端到端蒸馏相比，HEED在OCRBench v2上提升8.7分，在10个基准的平均值上提升5.13分。这一增益在不同的教师模型和混合架构上均成立。经过标准后训练后，学生模型在10个基准的平均值上达到教师级性能，同时具有4.12倍的吞吐量，并在128k上下文下节省68%的内存，且不增加额外参数和推理时开销。

英文摘要

Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6$\times$ larger residual drift than the bottom 10% lowest-density patches and 3.5$\times$ larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with 4.12$\times$ throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.
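
To make the contrast between uniform and density-weighted alignment concrete, here is a toy sketch. It is not the HEED implementation. The density proxy used here (one minus the mean cosine similarity to the other patches) is an assumption standing in for the paper's "patch self-dissimilarity", whose exact definition the abstract does not give, and the names `density_weights` and `weighted_residual_loss` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def density_weights(patches):
    """Hypothetical proxy: 1 - mean cosine similarity to the other patches,
    rescaled so that the weights average to 1."""
    n = len(patches)
    if n < 2:
        return [1.0] * n
    raw = []
    for i, p in enumerate(patches):
        sims = [cosine(p, q) for j, q in enumerate(patches) if j != i]
        raw.append(max(0.0, 1.0 - sum(sims) / len(sims)))
    mean = sum(raw) / n
    if mean <= 0:
        return [1.0] * n  # all patches identical: fall back to uniform weights
    return [r / mean for r in raw]


def weighted_residual_loss(student, teacher, weights):
    """Mean over positions of weight * squared student-teacher residual."""
    per_pos = [sum((s - t) ** 2 for s, t in zip(sv, tv))
               for sv, tv in zip(student, teacher)]
    return sum(w * l for w, l in zip(weights, per_pos)) / len(per_pos)


# Three near-identical "background" patches and one distinctive patch.
patches = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
w = density_weights(patches)

# The student only deviates from the teacher at the distinctive position.
teacher = [[0.0, 0.0]] * 4
student = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
uniform = weighted_residual_loss(student, teacher, [1.0] * 4)
weighted = weighted_residual_loss(student, teacher, w)
```

In this toy case the distinctive patch receives three times the weight of each background patch, so the same residual contributes twice as much to the weighted loss as to the uniform one.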

2605.17087 2026-05-19 cs.CV 版本更新

The Learnability Gap in Medical Latent Diffusion

医学潜在扩散中的可学习差距

Mischa Dombrowski, Felix Nützel, Bernhard Kainz

发表机构 * Department of Computing, Imperial College London（伦敦帝国理工学院计算机系）

AI总结本文研究了医学图像中潜在扩散模型在处理类别不平衡问题时的可学习差距，指出尽管预训练的自动编码器能有效编码判别特征，但其潜在表示的结构性使分类器难以学习，通过开发噪声条件潜在分类器和图像空间蒸馏技术，提高了效率并改善了潜在空间质量。

AI中文摘要

生成数据增强使用潜在扩散模型是解决医学影像类别不平衡问题的有前景策略，但当前方法侧重于感知保真度和领域特定自动编码器微调，而忽视了更根本的瓶颈。我们识别并正式化了可学习差距：大规模预训练自动编码器能够忠实编码医学分类的判别特征，如重建空间中的近无损性能所示，但其潜在表示以难以被分类器学习的方式结构化。在五个自动编码器家族和四个覆盖胸片、皮肤镜、计算机断层扫描和超声的医学基准上，我们证明这种差距无论架构、初始化策略或超参数调整如何，都持续存在，且医学领域微调无法关闭它。为了探测并部分缩小这一差距，我们开发了具有FiLM层的噪声条件潜在分类器和图像空间蒸馏，这些方法在效率和内存方面比图像空间模型分别提高了64倍和120倍，同时作为潜在空间质量的诊断工具。我们的分析提供了一个新的框架来评估自动编码器的潜在空间，并识别其结构而非保真度或领域特定性是关闭真实和合成医学训练数据性能差距的主要障碍。

英文摘要

Generative data augmentation with latent diffusion models is a promising strategy for addressing class imbalance in medical imaging, yet current approaches focus on perceptual fidelity and domain-specific autoencoder fine-tuning while neglecting a more fundamental bottleneck. We identify and formalize the learnability gap: large-scale pretrained autoencoders faithfully encode discriminative features for medical classification, as evidenced by near-lossless performance in reconstruction space, yet their latent representations are structured in ways that are difficult for classifiers to learn from. Across five autoencoder families and four medical benchmarks spanning chest radiography, dermatoscopy, computed tomography, and echocardiography, we show that this gap persists regardless of architecture, initialization strategy, or hyperparameter tuning, and that medical-domain fine-tuning of the autoencoder does not close it. To probe and partially narrow the gap, we develop noise-conditioned latent classifiers with FiLM layers and image-space distillation that offer 64x throughput and 120x memory gains over image-space models while serving as diagnostic tools for latent space quality. Our analysis provides a new framework for evaluating autoencoder latent spaces and identifies their structure, rather than their fidelity or domain specificity, as the primary obstacle to closing the performance gap between real and synthetic medical training data.

2605.17070 2026-05-19 cs.CV 版本更新

EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models

EPIC-Bench: 一种以感知为中心的细粒度具身视觉 grounding 的基准

Haozhe Shan, Xiancong Ren, Han Dong, Haoyuan Shi, Yingji Zhang, Jiayu Hu, Yi Zhang, Yong Dai, Bin Shen, Lizhen Qu, Zenglin Xu, Xiaozhu Ju

发表机构 * X-Humanoid ； Fudan University（复旦大学）； University of Science and Technology of China（中国科学技术大学）； University of Manchester（曼彻斯特大学）； Monash University（蒙纳士大学）； Celonis AI ； University of New South Wales（新南威尔士大学）

AI总结本文提出 EPIC-Bench，一种以感知为中心的细粒度具身视觉 grounding 基准，旨在系统评估 VLMs 在现实世界具身环境中的视觉感知能力。该基准包含 6.6k 个精心标注的元组（图像，文本，掩码），涵盖 23 个细粒度任务，涉及具身交互管道的三个核心阶段：目标定位、导航和操作。评估结果显示，尽管先进推理模型表现出潜力，但当前 VLMs 在复杂视觉-文本对齐方面普遍存在困难，特别是在多目标计数、部分-整体关系理解和 affordance 区域检测方面存在瓶颈。

AI中文摘要

尽管大型视觉-语言模型（VLMs）越来越多地被用作具身代理的感知骨干，但现有基准往往依赖于问答或多选格式。这些协议允许模型利用语言先验，而不是展示真正的视觉 grounding。为了解决这个问题，我们提出了 EPIC-Bench，即具身感知基准，这是一种细粒度 grounding 基准，旨在系统地评估 VLMs 在现实世界具身环境中的视觉感知能力。EPIC-Bench 包含 6.6k 个精心标注的元组（图像，文本，掩码），涵盖 23 个细粒度任务，横跨具身交互管道的三个核心阶段：目标定位、导航和操作。对超过 89 个领先 VLMs 的广泛评估显示，尽管先进推理模型显示出潜力，但当前 VLMs 在复杂视觉-文本对齐方面普遍存在困难。具体而言，模型在多目标计数、部分-整体关系理解以及 affordance 区域检测方面存在关键瓶颈。EPIC-Bench 为推进下一代视觉驱动的具身模型提供了稳健的基础和可操作的见解。

英文摘要

While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit linguistic priors rather than demonstrating genuine visual grounding. To address this, we present EPIC-Bench, Embodied PerceptIon BenChmark, a fine-grained grounding benchmark designed to systematically evaluate the visual perceptual capabilities of VLMs in real-world embodied environments. Comprising 6.6k meticulously annotated tuples (Image, Text, Mask), EPIC-Bench spans 23 fine-grained tasks across three core stages of the embodied interaction pipeline: Target Localization, Navigation, and Manipulation. Extensive evaluations of over 89 leading VLMs reveal that while advanced reasoning models show promise, current VLMs universally struggle with complex visual-text alignment for physical interactions. Specifically, models exhibit critical bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection. EPIC-Bench provides a robust foundation and actionable insights for advancing the next generation of vision-driven embodied models.

2605.17042 2026-05-19 cs.CV 版本更新

Thermal-Only Crowd Counting with Deployment-Time Privacy Protection

仅热成像的人群计数与部署时隐私保护

Yifei Qian, Zhongliang Guo, Chun Tong Lei, Bowen Deng, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound

发表机构 * School of Computer Science, University of Nottingham（诺丁汉大学计算机科学学院）； School of Computer Science, University of St Andrews（圣安德鲁斯大学计算机科学学院）； Department of Data Science, City University of Hong Kong（香港城市大学数据科学系）； Harbin Institute of Technology（哈尔滨工业大学）

AI总结本文提出了一种仅使用热成像数据的人群计数框架，通过消除RGB数据依赖，减少公共监控中隐私暴露风险，并利用深度到RGB扩散模型来缓解热成像的模糊性，提升计数准确性。

AI中文摘要

尽管RGB-热人群计数已显示出潜力，但该范式面临关键限制：RGB数据在公共监控中引发隐私问题，而多模态对齐问题会降低融合性能。我们提出首个专门设计用于隐私意识人群计数的热成像-only框架，在推理时消除RGB依赖，并显著减少公共监控部署中连续RGB捕获带来的隐私暴露。为缓解热成像模糊性，我们利用深度到RGB扩散模型作为跨模态桥梁，提取具有辨别力的特征以增强热表示。关键地，我们证明单步LCM去噪产生最忠实于深度条件信号结构内容的特征，而多步方法则逐步将特征与条件输入解耦并累积误差，从而降低计数准确性。在RGBT-CC和DroneRGBT数据集上的实验表明，我们的方法在性能上与最先进的RGB-热融合方法具有竞争力，且仅需在推理时使用热输入，消除了连续RGB捕获的需求，这在现实世界监控部署中是主要的隐私问题。代码将公开提供。

英文摘要

While RGB-Thermal crowd counting has shown promise, the paradigm faces critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework specifically designed for privacy-conscious crowd counting, eliminating RGB dependency at inference time and substantially reducing the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we leverage depth-to-RGB diffusion models as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we demonstrate that single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal, while multi-step approaches progressively decouple features from the conditioning input and accumulate errors that degrade counting accuracy. Experiments on RGBT-CC and DroneRGBT datasets show our method achieves competitive performance against state-of-the-art RGB-T fusion methods, while requiring only thermal input during inference, eliminating the need for continuous RGB capture that constitutes the primary privacy concern in real-world surveillance deployment. The code will be made publicly available.

2605.17019 2026-05-19 cs.CV 版本更新

基于深度学习和隐式皮肤的临床CT扫描中手部形状统计建模

Gokce Guven, Hasan Fehmi Ates, Deniz Karasahin, Kaan Erdogan

发表机构 * Dept. of Computer Science and Engineering（计算机科学与工程系）； Özyeğin University（奥兹耶尼大学）； Dept. of Artificial Intelligence and Data Engineering（人工智能与数据工程系）； Osteoid Inc.（Osteoid公司）

AI总结本文提出了一种AI辅助的重建流程，利用深度学习和隐式皮肤技术对临床CT扫描中的手部解剖结构进行分割和分析，通过统计形状建模提高生物力学、人机工程学和医疗诊断的应用价值。

AI中文摘要

准确的分割和手部解剖的统计形状建模对医学诊断、人机工程学和生物力学有重要影响。本研究提出了一种AI辅助的重建流程，用于从1,271例肘至手（e2h-CT）计算机断层扫描中分割和分析手部解剖结构。首先使用基于Pix2Pix的条件生成对抗网络去除CT体积中的石膏和背景伪影。清洁后的扫描随后在3D Slicer中处理，提取皮肤和骨掩膜，并将其转换为封闭曲面网格模型。分割的骨网格用于构建骨骼表示，使隐式皮肤能够将所有手模型对齐到标准化的解剖配置。随后，使用Geodesic Based Coherent Point Drift++（GBCPD++）算法对手部皮肤表面进行非刚性配准，以在不同受试者之间建立点对应关系。然后对配准后的模型应用主成分分析（PCA）以量化解剖形状的变异性。Pix2Pix预处理阶段在保留测试集上实现了Dice系数为0.9856和IoU为0.9720。统计建模在90例扫描中进行，其中手指完全可见且解剖上分离。所得的统计形状分布与美国陆军人体测量调查（ANSUR II）有很强的一致性，支持重建模型的解剖有效性。所提出的方法在生物力学建模、人机工程学优化、假肢设计和精准医疗诊断方面具有显著潜力。

英文摘要

Accurate segmentation and statistical shape modeling of hand anatomy have significant implications for medical diagnostics, ergonomics, and biomechanics. This study proposes an AI-assisted reconstruction pipeline for segmenting and analyzing hand anatomy from 1,271 elbow-to-hand (e2h-CT) computed tomography scans. A Pix2Pix-based conditional generative adversarial network is first employed to remove plaster cast and background artifacts from CT volumes. The cleaned scans are then processed in 3D Slicer to extract skin and bone masks, which are converted into closed-surface mesh models. Segmented bone meshes are used to construct skeletal representations, enabling implicit skinning to align all hand models into a standardized anatomical configuration. Subsequently, non-rigid registration is performed on the hand skin surfaces using the Geodesic Based Coherent Point Drift++ (GBCPD++) algorithm to establish point-wise correspondence across subjects. Principal Component Analysis (PCA) is then applied to the registered models to quantify anatomical shape variability. The Pix2Pix preprocessing stage achieved a Dice coefficient of 0.9856 and an IoU of 0.9720 on the held-out test set. Statistical modeling was performed on a subset of 90 scans in which the fingers were fully visible and anatomically separated. The resulting statistical shape distributions demonstrate strong agreement with the U.S. Army Anthropometric Survey (ANSUR II), supporting the anatomical validity of the reconstructed models. The proposed methodology demonstrates significant potential for advancing biomechanical modeling, ergonomic optimization, prosthetic design, and precision medical diagnostics.
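
For readers less familiar with the two overlap metrics reported for the Pix2Pix stage, the following sketch computes them for a pair of binary masks. It uses the standard definitions; the function name `dice_and_iou`, the flat-list mask format, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks given as flat sequences of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention, not a rule).
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


# Predicted mask covers three pixels, ground truth covers two of them.
dice, iou = dice_and_iou([1, 1, 1, 0], [1, 1, 0, 0])
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU). Plugging in the reported IoU of 0.9720 gives about 0.9858, close to the reported Dice of 0.9856; the two need not match exactly when each metric is averaged over scans.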

URL PDF HTML ☆

赞 0 踩 0

2605.16973 2026-05-19 cs.CV cs.LG 版本更新

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

SHED: 风格均质化嵌入对齐用于领域泛化

Kai Gan, Tong Wei

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing 210096, China（东南大学计算机科学与工程学院，南京 210096，中国）； Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China（教育部计算机网络与信息集成重点实验室（东南大学），中国）

AI总结本文提出SHED方法，通过均质化嵌入对齐来解决领域泛化中的信息不对称问题，实验表明其在多个基准测试中取得了最先进的性能。

AI中文摘要

领域泛化旨在通过嵌入分布偏移增强模型对未见领域的鲁棒性。尽管像CLIP这样的大规模视觉-语言模型表现出色，但其直接的图像-文本嵌入对齐却受到固有信息不对称的限制：图像编码了类别语义和领域特定的风格，而文本提示主要传达基本的类别线索。这种不对称性阻碍了在现实场景中对新领域的泛化。为此，我们提出了SHED，一种基于CLIP的新方法，通过对齐风格均质化的嵌入而不是CLIP编码器的原始表示。在训练过程中，SHED从图像嵌入（按源领域计算）和文本嵌入（在多样化的提示模板下平均并去除全局质心）中移除领域特定的风格质心。在推理过程中，考虑到目标领域信息的缺乏，SHED将多样化的文本领域质心投影到视觉空间，并通过成员加权聚合预测。在五个基准测试上的广泛实验表明，SHED在多个基准测试中取得了最先进的性能，显著优于先前方法（例如，在DomainNet上比标准微调高出+4.0%）

英文摘要

Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from the CLIP encoders. During training, SHED removes domain-specific style centroids from both image embeddings, computed per source domain, and text embeddings, which are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0\% on DomainNet vs. standard fine-tuning).
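
The training-time idea of removing a per-domain style centroid can be sketched as a mean subtraction within each source domain. This is a simplified illustration, not the SHED method itself: the text-side averaging over prompt templates, any normalization, and the inference-time projection with membership weighting are all omitted, and `remove_domain_centroids` is a hypothetical name.

```python
def remove_domain_centroids(embeddings, domains):
    """Subtract each domain's mean embedding from the embeddings of that domain.

    Returns the centered embeddings and the per-domain centroids.
    """
    dim = len(embeddings[0])
    sums, counts = {}, {}
    for e, d in zip(embeddings, domains):
        acc = sums.setdefault(d, [0.0] * dim)
        for c in range(dim):
            acc[c] += e[c]
        counts[d] = counts.get(d, 0) + 1
    centroids = {d: [s / counts[d] for s in sums[d]] for d in sums}
    centered = [[x - m for x, m in zip(e, centroids[d])]
                for e, d in zip(embeddings, domains)]
    return centered, centroids


# Two domains whose embeddings differ only by a constant "style" offset.
embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0]]
domains = ["photo", "photo", "sketch", "sketch"]
centered, centroids = remove_domain_centroids(embeddings, domains)
```

After centering, the "photo" and "sketch" samples coincide, which is the toy version of homogenizing style while keeping the within-domain structure that carries class information.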

URL PDF HTML ☆

赞 0 踩 0

2605.16967 2026-05-19 cs.CV 版本更新

Expandable, Compressible, Mineable: Open-World Thermal Image Restoration

可扩展、可压缩、可挖掘：面向开放世界热成像修复的ECMRNet

Pu Li, Huafeng Li, Yafei Zhang, Wen Wang, Neng Dong, Jie Wen

发表机构 * Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Yunnan, China（昆明理工大学信息工程与自动化学院，云南，中国）； School of Mathematics and Statistics, Yunnan University, Yunnan, China（云南大学数学与统计学院，云南，中国）； Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology, Shenzhen, China（深圳视觉对象检测与识别重点实验室，哈尔滨工业大学，深圳，中国）

AI总结本文提出ECMRNet，从持续学习视角解决开放世界热成像修复问题，通过可扩展、可压缩、可挖掘的闭环过程实现持续适应新型退化，同时通过结构熵剪枝和子退化知识挖掘模块提升修复性能。

Comments Accepted by ICML2026

AI中文摘要

在开放世界场景中，热红外（TIR）图像退化持续出现并演变，而现有大多数单一切换修复方法基于封闭集假设，难以持续适应新退化。为此，我们提出ECMRNet，即面向开放世界热成像修复的可扩展、可压缩、可挖掘修复网络。从概念上，ECMRNet将持续退化学习统一为一个“扩展-压缩-挖掘”闭环过程，通过可控进化实现对新退化的持续适应。从结构上，ECMRNet将中间表示分解为组隔离的子空间，并通过冻结历史组和等形扩展新组，实现严格参数隔离和快速适应新退化。为抑制任务积累后的模型增长，我们提出结构熵剪枝，通过二维结构熵最小化识别并移除冗余通道组，实现信息贡献驱动的自适应压缩。此外，我们设计了子退化知识挖掘模块，动态检索并重新组合历史表示中的可转移组件，以提高复合退化下的修复性能。实验结果表明，ECMRNet在多种单退化和复合退化场景中均实现了优越的整体性能，同时使用更少的参数和更低的计算成本。源代码可在https://github.com/Kust-lp/ECMRNet获取。

英文摘要

In open-world settings, thermal infrared (TIR) image degradations continuously emerge and evolve, while most existing all-in-one restoration methods are built on a closed-set assumption and struggle to continually adapt to novel degradations. To address this, we propose ECMRNet, an Expandable, Compressible, and Mineable Restoration Network for open-world TIR restoration from a continual learning perspective. Conceptually, ECMRNet unifies continual degradation learning as an "expand-compress-mine" closed-loop process, enabling sustained adaptation to new degradations with controllable evolution. Structurally, ECMRNet decomposes intermediate representations into group-isolated subspaces, and achieves strict parameter isolation and fast adaptation to new degradations by freezing historical groups and isomorphically expanding new ones. To curb model growth as tasks accumulate, we present Structural Entropy Pruning, which identifies and removes redundant channel groups via two-dimensional structural entropy minimization, achieving information contribution-driven adaptive compression. Moreover, we design a Sub-degradation Knowledge Mining Module that dynamically retrieves and recombines transferable components from historical representations to improve restoration under compound degradations. Experimental results demonstrate that ECMRNet achieves superior overall performance across diverse single and compound degradations while using fewer parameters and lower computational cost. The source code is available at https://github.com/Kust-lp/ECMRNet.

2605.16961 2026-05-19 cs.CV cs.AI 版本更新

Latent Action Control for Reasoning-Guided Unified Image Generation

潜在动作控制用于推理引导的统一图像生成

Fuxiang Zhai, Sixiang Chen, Yingjin Li, Shuaibo Li, Jianyu Lai, Tengjun Huang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结本文提出Latent Action Control (LAC)，通过将推理表示为隐藏的连续动作，使推理过程可操作，从而在统一生成器中实现推理引导的图像生成。LAC通过角色结构化的潜在轨迹进行规划、内部视觉草图、诊断和细化，并将这些动作注入到条件流生成的隐藏流中，从而提升生成质量。

AI中文摘要

统一的多模态模型可以在共享的骨干网络中编码视觉理解和图像生成，但理解并不自动转化为控制：模型可能推断出对象、关系或知识提示，但无法在生成的图像中实例化。我们提出潜在动作控制（LAC），通过将推理表示为隐藏的连续动作，使推理过程可操作。给定提示，LAC会规划角色结构化的潜在轨迹，进行内部视觉草图、诊断和细化，并将这些动作注入到条件流生成的隐藏流中，而无需生成推理标记或中间图像。由于这些动作轨迹是未观察到的，LAC通过先验引导的变分潜在动作对齐从仅训练的语义先验、草图图像特征和监督停止信号中学习这些动作，随后通过Latent-Flow GRPO对齐潜在到图像的生成轨迹与终端视觉反馈。这为从推断的关系、绑定和知识提示到生成过程的控制路径提供了支持。在BAGEL-7B-MoT上实现后，LAC在GenEval、WISE和T2I-CompBench中一致提升了组合性和知识引导的生成，尤其是在空间关系、属性绑定和世界知识敏感提示上表现最佳。消融实验和潜在干预显示，学习的动作轨迹被生成器消耗，表明统一生成在理解不仅被编码，而是在生成过程中被操作时受益。

英文摘要

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.

2605.16951 2026-05-19 cs.CV 版本更新

Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing

Edit-GRPO: 一种用于图像编辑的保持局部性的策略优化框架

Shaodong Xu, Zexian Li, Zhendong Wang, Litong Gong, Tiezheng Ge, Wengang Zhou, Bo Zheng, Houqiang Li

发表机构 * Alibaba Group（阿里巴巴集团）

AI总结本文提出Edit-GRPO框架，通过分离编辑与保留目标，解决图像编辑中保持局部性与全局一致性的问题，提升编辑效果并减少上下文扭曲等常见伪影。

AI中文摘要

图像编辑中的一个根本性挑战在于保持空间局部性：编辑应改进目标内容而不应无意地改变周围区域。然而，大多数基于优化的编辑方法将图像视为整体实体，导致全局策略更新，从而破坏局部性并引入不期望的上下文变化。我们观察到，这一问题源于局部编辑意图与全局应用的优化信号之间的不匹配。受此启发，我们提出Edit-GRPO，一种在优化图像编辑时保持局部性的策略优化框架，该框架明确地将编辑和保留目标分离。通过为编辑和非编辑区域分配区域特定的优化信号，Edit-GRPO使策略更新与编辑任务的空间结构对齐，从而实现局部改进同时保持全局视觉一致性。这种设计有效抑制了诸如上下文扭曲和边界不一致等常见伪影。在各种图像编辑场景中的广泛实验表明，与现有基于优化的方法相比，Edit-GRPO在显著提高局部性保持的同时，保持了强大的编辑性能，验证了所提框架的通用性和有效性。

英文摘要

A fundamental challenge in image editing lies in preserving spatial locality: edits should improve targeted content without inadvertently altering surrounding regions. However, most optimization-based editing approaches treat images as holistic entities, causing global policy updates that undermine locality and introduce undesired context changes. We observe that this issue stems from a mismatch between localized editing intent and globally applied optimization signals. Motivated by this insight, we propose Edit-GRPO, preserving Locality while optimizing image editing, a locality-preserving policy optimization framework that explicitly decouples editing and preservation objectives. By assigning region-specific optimization signals to edit and non-edit areas, Edit-GRPO aligns policy updates with the spatial structure of editing tasks, enabling localized improvements while maintaining global visual coherence. This design effectively suppresses common artifacts such as context distortion and boundary inconsistency. Extensive experiments across diverse image editing scenarios demonstrate that Edit-GRPO significantly improves locality preservation while maintaining strong editing performance compared to existing optimization-based methods, validating the generality and effectiveness of the proposed framework.

2605.16949 2026-05-19 cs.CV 版本更新

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

超越点对点匹配：为加速扩散变换器的结构表示对齐

Shaodong Xu, Zhendong Wang, Litong Gong, Zexian Li, Wengang Zhou, Tiezheng Ge, Houqiang Li

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出sREPA框架，通过显式结构约束来对齐特征图的相对几何关系，以提高生成质量并加速收敛。

AI中文摘要

最近的扩散变换器（DiTs）进展表明，将噪声潜在状态与经过训练的语义特征对齐（如代表性对齐（REPA）所开创的）可以显著加速训练并提高生成保真度。随后的分析（例如iREPA）表明，这些收益主要来自于转移预训练视觉表示中包含的空间结构。然而，大多数现有对齐方法使用点对点匹配目标或依赖于隐式架构调整，这些方法未能显式建模视觉基础模型中固有的空间关系几何。我们主张，元素级监督不足以捕捉视觉表示中的丰富空间拓扑，有效的对齐应以显式结构约束的形式进行。为此，我们提出了sREPA，一种结构代表性对齐框架，以强制特征图的相对几何一致性，而不是仅仅匹配单个特征点。通过鼓励模型内部化预训练特征中的整体空间布局和结构相关性，sREPA在比最先进的对齐策略更快、更稳定的收敛以及改进的样本质量方面取得了成果。我们的代码和模型将被发布。

英文摘要

Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features, as pioneered by Representation Alignment (REPA), can substantially accelerate training and improve generation fidelity. Subsequent analysis (e.g., iREPA) suggests that these gains arise primarily from transferring spatial structure contained in pre-trained vision representations. However, most existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA, a structural REPresentation Alignment framework that enforces consistency in the relational geometry of feature maps, rather than merely matching individual feature points. By encouraging the model to internalize holistic spatial layouts and structural correlations from pre-trained features, sREPA achieves faster and more stable convergence, along with improved sample quality, compared to state-of-the-art alignment strategies. Our code and models will be released.

2605.16937 2026-05-19 cs.CV 版本更新

DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

DEVIS-GRPO：释放GRPO用于动态极视合成

Yi Zuo, Huimin Wu, Lingling Li, Fang Liu, Licheng Jiao, Qing Li

发表机构 * Xidian University（西安电子科技大学）； State Key Laboratory of General Artificial Intelligence, BIGAI（人工智能国家重点实验室，BIGAI）

AI总结本文提出DEVIS-GRPO，一种基于GRPO的框架，用于轨迹控制的视频生成，是首个在线策略梯度方法用于极视视频生成。核心方法是新颖的采样策略ADEVIS，通过逐步积累小视增量实现大视运动，提高了训练效率和采样多样性。

English abstract

Trajectory-controlled video generation has become essential for controllable video generation. While current methods perform well under small-view camera motions, they degrade significantly with large-view motions. Existing solutions for extreme-view synthesis typically require dedicated video pairs, demanding substantial annotation effort. To address these limitations, we propose Dynamic Extreme VIew Synthesis-GRPO (DEVIS-GRPO), a GRPO-based framework for trajectory-controlled video generation, the first online policy gradient method for extreme view video generation. Central to our approach is a novel sampling strategy: Accumulative Dynamic Extreme VIew Synthesis (ADEVIS), which achieves large-view camera motions by progressively accumulating small-view increments. This method delivers two key advantages: 1) enhanced training efficiency, as it eliminates the need to warm-start the policy model by collecting expensive paired large-view videos, and 2) increased sampling diversity, achieved by flexibly varying trajectory configurations. Finally, we designed a multi-level consistency-quality reward function to select high-quality samples for model optimization. Experiments on the Kubric-4D, iPhone, and DL3DV datasets demonstrate our method's superiority. On Kubric-4D, we achieve relative improvements of 21.57% in PSNR and 7.31% in SSIM over the second-best method in non-occlusion areas. On iPhone, LPIPS is reduced by 18.56%.
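The two ideas named in the abstract can be sketched in a few lines: reaching a large view change by accumulating small increments, and scoring a group of samples relative to each other. This is an illustrative sketch only. The function names, the single-angle camera model, and the mean/standard-deviation normalization commonly associated with GRPO are assumptions, not details taken from the paper.

```python
import statistics


def accumulate_view_increments(target_deg, max_step_deg):
    """Split a large camera rotation into small steps that sum to the target."""
    if max_step_deg <= 0:
        raise ValueError("max_step_deg must be positive")
    steps = []
    remaining = target_deg
    while abs(remaining) > 1e-9:
        # Clamp each step to the small-view range the base model handles well.
        step = max(-max_step_deg, min(max_step_deg, remaining))
        steps.append(step)
        remaining -= step
    return steps


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (zero mean, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


steps = accumulate_view_increments(90.0, 20.0)
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Varying `max_step_deg` or the order of the steps gives different trajectories to the same target view, which is one simple way to read the abstract's claim about sampling diversity.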

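2605.16925 2026-05-19 cs.CV version update

P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction

Kota Shimomura, Hidehisa Arai, Tsubasa Takahashi, Takayoshi Yamashita, Hironobu Fujiyoshi

Affiliations: Chubu University; Turing Inc.

AI summary: This paper proposes P2GS, a physical-prior-guided Gaussian Splatting method that addresses photometric inconsistency in autonomous-driving data caused by heterogeneous camera pipelines and dynamic outdoor illumination. It jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions to improve photometric and illumination consistency.

Comments: Accepted to CVPR 2026 (main)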

English abstract

3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.
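The decomposition described above can be read as a simple image-formation model: an LDR pixel is a tone-mapped, exposure-scaled version of a shared HDR radiance. The sketch below is a toy version of that model. The gamma curve, the function names, and the parameter values are assumptions for illustration and are not the tone-mapping functions learned by P2GS.

```python
def tone_map(x, gamma=2.2):
    """Toy tone-mapping curve: clip to [0, 1], then apply gamma compression."""
    return min(max(x, 0.0), 1.0) ** (1.0 / gamma)


def render_ldr(hdr_radiance, exposure, gamma=2.2):
    """LDR pixel = tone_map(exposure * view-invariant HDR radiance)."""
    return tone_map(exposure * hdr_radiance, gamma)


def recover_hdr(ldr, exposure, gamma=2.2):
    """Invert the toy model; only valid for unsaturated pixels."""
    return (ldr ** gamma) / exposure


radiance = 0.2
ldr_cam_a = render_ldr(radiance, exposure=1.0)
ldr_cam_b = render_ldr(radiance, exposure=2.0)
```

The two cameras record different LDR values for the same surface point, yet both invert to the same radiance once the per-view exposure is accounted for. Fitting a single radiance field directly to the LDR values would instead bake that exposure gap into the scene.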

2605.16922 2026-05-19 cs.CV version update

Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation

Youngdong Jang, Gyeongrok Oh, Jong Wook Kim, Hyunju Ryu, Hyung-gun Chi, SeungHyeon Kim, Seungryong Kim, Jonghyun Choi, Sangpil Kim

Affiliations: Korea University; Purdue University; KAIST; Hyundai Motor Company

AI summary: This paper proposes TrackCue, a framework that uses image-based point tracking to obtain dense trajectories and improve dynamic-object representation in LiDAR scene flow estimation. A visually consistent motion compensation strategy and visual motion cue lifting yield more accurate static-dynamic classification and more reliable scene flow learning.

English abstract

LiDAR scene flow estimation is essential for autonomous driving, as it provides 3D motion for each point. Self-supervised approaches use static-dynamic classification to mitigate the imbalance between static and dynamic points, deriving targeted supervision. However, existing methods rely on sparse geometric observations for this classification, making them vulnerable to data sparsity and occlusions. The resulting noisy labels provide incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic object representation in LiDAR scene flow estimation. In particular, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. Furthermore, we present a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.
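The ego-compensation step can be illustrated with a pinhole camera: predict where a static point would land in the image after the ego vehicle moves, then compare that with where the tracker actually found it. This is a simplified sketch. It assumes a translation-only ego motion, made-up intrinsics, and a hypothetical pixel threshold, none of which come from the paper.

```python
import math


def project(point, focal=1000.0, cx=640.0, cy=360.0):
    """Pinhole projection of a 3D point in the camera frame to pixels."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def ego_induced_pixel(point_t0, ego_translation):
    """Pixel where a static point appears after the camera translates."""
    x, y, z = point_t0
    tx, ty, tz = ego_translation
    return project((x - tx, y - ty, z - tz))


def is_dynamic(point_t0, tracked_pixel_t1, ego_translation, threshold_px=3.0):
    """Flag a LiDAR point as dynamic when its tracked image position
    deviates from the ego-induced (rigid) prediction."""
    ex, ey = ego_induced_pixel(point_t0, ego_translation)
    residual = math.hypot(tracked_pixel_t1[0] - ex, tracked_pixel_t1[1] - ey)
    return residual > threshold_px


ego = (0.0, 0.0, 1.0)                      # camera moves 1 m forward
lidar_point = (2.0, 0.0, 20.0)             # point seen at time t0
static_track = project((2.0, 0.0, 19.0))   # tracker output for a static point
moving_track = project((2.5, 0.0, 19.0))   # object also moved 0.5 m sideways
```

A full pipeline would also handle ego rotation, occlusion, and tracker noise. The sketch only shows why the comparison is done in the image plane: apparent motion caused by the ego vehicle cancels out, and what remains is attributed to the object.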

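2605.16918 2026-05-19 cs.CV version update

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

Saeed Firouzi Daghigh, Majid Iranpour Mobarekeh, Mostafa Alavi, Mehdi Bagheri

Affiliations: Department of Computer Engineering and Information Technology, Payam Noor University

AI summary: This paper proposes HighSync, an end-to-end diffusion framework that generates photorealistic talking-face videos aligned with arbitrary input audio. It addresses the tension between image quality and synchronization accuracy and is, according to the authors, the first lip-sync model to run natively at 512*512 resolution, targeting professional production settings such as film and broadcast.

Comments: 12 pages, 7 figures, 5 tables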

English abstract

We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512*512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at: https://github.com/saeed5959/high_sync

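2605.16911 2026-05-19 cs.CV version update

VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction

Xun Chen, Tianchen Deng, Rui Wang, Fangjinhua Wang, Junyi Ma, Hongming Shen, Hesheng Wang, Danwei Wang

Affiliations: Nanyang Technological University; Shanghai Jiao Tong University; ETH Zurich

AI summary: This paper proposes VGGT-Occ, which embeds geometric tokens throughout the pipeline. It introduces Projection-Aware Deformable Attention (PA-DA) to inject geometric information, uses a view-quality semantic gate for cross-view consistency, and adopts a sequential coarse-to-fine decoder with gated fusion to balance efficiency and performance. Experiments support its effectiveness for 3D semantic occupancy prediction.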

English abstract

3D semantic occupancy prediction requires accurate 2D-to-3D feature lifting, yet current methods restrict camera geometry to initial projections. Subsequent operations like offset learning, attention weighting, and cross-camera aggregation remain geometry-agnostic, ignoring essential physical constraints. We propose VGGT-Occ, a framework that embeds geometric tokens throughout the entire pipeline. We introduce Projection-Aware Deformable Attention (PA-DA) to inject geometry into all attention stages. PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations. Features are then integrated through a view-quality semantic gate for cross-view consistency. To optimize both efficiency and performance, we employ a sequential coarse-to-fine decoder with gated fusion, where low-resolution features are refined into higher resolutions, allocating computation by information density while substantially reducing decoder cost. Extensive evaluations demonstrate the effectiveness and accuracy of our approach. On SurroundOcc-nuScenes, VGGT-Occ achieves 33.00% IoU and 21.08% mIoU (T=1), and 33.64% IoU and 21.43% mIoU with T=2 inference, outperforming existing methods, with only ~41M trainable parameters in the occupancy head. Code will be released publicly.
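The "projection Jacobian" mentioned above has a closed form for a pinhole camera, and it gives a first-order map from a small 3D offset to an image-plane displacement. The sketch below shows only that standard geometry. How PA-DA turns the Jacobian into an attention bias is not specified in the abstract, so nothing here should be read as the paper's implementation. The function names are hypothetical.

```python
def project(point, focal):
    """Pinhole projection (principal point at the origin for brevity)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def projection_jacobian(point, focal):
    """2x3 Jacobian of (u, v) = (f*x/z, f*y/z) with respect to (x, y, z)."""
    x, y, z = point
    return [
        [focal / z, 0.0, -focal * x / (z * z)],
        [0.0, focal / z, -focal * y / (z * z)],
    ]


def project_offset(point, offset, focal):
    """First-order image-plane displacement caused by a small 3D offset."""
    jac = projection_jacobian(point, focal)
    return tuple(sum(j * o for j, o in zip(row, offset)) for row in jac)


point = (1.0, 2.0, 10.0)
offset = (0.01, -0.02, 0.05)
approx = project_offset(point, offset, focal=500.0)
```

The Jacobian entries shrink with depth (`focal / z`), so the same 3D offset moves fewer pixels for distant points. That depth dependence is one plausible reason to treat it as a reliability cue.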

2605.16908 2026-05-19 cs.ET cs.CR cs.CV version update

BIDO: A Biometric Identity Online Authentication Framework

Aditya Mithra, Sibi Chakkaravarthy S, Srinivas Kankanala

Affiliations: CyberMACS Kadir Has University; DigitalFortress Private Limited & Indominus Labs Private Limited; Centre of Excellence, Artificial Intelligence & Robotics (AIR), School of Computer Science and Engineering; VIT-AP University, India; Centre of Excellence, Artificial Intelligence & Robotics (AIR), School of Electronics and Communication Engineering

AI summary: This paper proposes BIDO, a framework that derives non-resident WebAuthn credentials on the fly, enabling device-free authentication without storing long-lived biometric templates. It targets AAL2 under NIST SP 800-63B and reports high accuracy and low false-accept rates on several face benchmarks.

English abstract

Security systems demand continuous, cryptographically robust identity verification without requiring subjects to carry physical tokens, smart cards, or dedicated hardware authenticators. This paper presents BIDO (Biometric Identity Online), a device-free authentication standard that achieves Authenticator Assurance Level 2 (AAL2) per NIST SP 800-63B without storing long-lived biometric templates, facial images, or any other form of Personally Identifiable Information (PII). BIDO derives Elliptic Curve Digital Signature Algorithm (ECDSA) key material deterministically from a live biometric measurement salted with a user-defined memorized secret at every authentication event, eliminating persistent private-key storage while enabling verification from any commodity sensor terminal. The generated credentials are non-discoverable (non-resident) Web Authentication (WebAuthn) credentials, fully compatible with all FIDO2-enabled websites and services without modification on the server side. A multi-stage pipeline, comprising capture of 200 valid biometric samples, feature extraction using the Dlib 68-point facial landmark predictor, affine face alignment, frontality gating, Euclidean distance computation from the inter-eye midpoint, floor-division quantization with divisor q = 8, inter-session drift stabilization, and majority-voting SHA-256 hash binding, produces a Verification Seed (Vseed) from which the WebAuthn credential is transiently derived and immediately zeroized after signing. Evaluated against three prominent face benchmarks (VGGFace2, LFW, and MegaFace), BIDO achieves 99.51% verification accuracy on LFW and 92.14% Rank-1 identification accuracy on MegaFace Challenge 1 at 10^6 distractors, with a cryptographic False Accept Rate (FAR) of 0.03% and a False Reject Rate (FRR) of 0.90%.
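Three of the pipeline stages named above (floor-division quantization with q = 8, majority voting, and SHA-256 binding with a memorized secret) can be sketched with the Python standard library. This is a toy illustration, not BIDO's pipeline: the payload layout, the function names, and the example distances are assumptions, and landmark extraction, drift stabilization, and ECDSA key derivation are left out entirely.

```python
import hashlib
from collections import Counter

Q = 8  # divisor stated in the abstract


def quantize(distances, q=Q):
    """Floor-division quantization: nearby measurements share a bucket."""
    return tuple(int(d // q) for d in distances)


def majority_vote(samples):
    """Pick the most common quantized value per feature across samples."""
    return tuple(Counter(column).most_common(1)[0][0]
                 for column in zip(*samples))


def derive_seed(distance_samples, memorized_secret):
    """Hash the voted feature vector together with the user's secret.
    The payload format here is hypothetical."""
    voted = majority_vote([quantize(s) for s in distance_samples])
    payload = (memorized_secret.encode("utf-8") + b"|"
               + ",".join(map(str, voted)).encode("ascii"))
    return hashlib.sha256(payload).hexdigest()


session_a = [[41.2, 63.9, 17.0], [42.5, 64.1, 18.3], [39.9, 65.0, 16.2]]
session_b = [[43.0, 66.0, 19.0], [44.1, 67.5, 20.2], [40.5, 64.5, 17.7]]
seed_a = derive_seed(session_a, "correct horse")
seed_b = derive_seed(session_b, "correct horse")
```

Quantization and voting absorb small measurement noise so that two sessions can hash to the same seed, while any change in the secret changes the seed. Values close to a multiple of q can still flip buckets between sessions, which is presumably the problem the abstract's "inter-session drift stabilization" stage is meant to address.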

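2605.16905 2026-05-19 cs.LG cs.CV version update

AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps

Chia-Ying Hsieh, Hsin-Yuan Fang, Chun-Shu Wei

Affiliations: National Yang Ming Chiao Tung University

AI summary: This paper proposes AIM, an adversarial information masking framework for evaluating both the faithfulness of saliency maps and the reliability of masking operators. By comparing degradation under different masking schemes, it reduces masking-induced bias and reveals modality-dependent differences between signed and unsigned attributions.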

English abstract

Post-hoc saliency methods are widely used to interpret deep neural networks, but their faithfulness is difficult to evaluate reliably. Existing evaluations mask features according to saliency-induced feature ordering and measure performance degradation, but this degradation can be confounded by the masking operator: zero masking may create out-of-distribution artifacts, while interpolation-based masking may preserve residual predictive information. We propose Adversarial Information Masking (AIM), a saliency-guided adversarial feature replacement framework for evaluating both saliency-map faithfulness and masking-operator reliability. AIM replaces selected features with values from an adversarial counterpart of the input and compares degradation under complementary masking orders. We assess reliability using random-attribution bias and stability of explanation-method faithfulness rankings. Experiments on image, audio, and EEG tasks suggest that AIM reduces masking-induced bias compared with zero and interpolation-based masking, while revealing modality-dependent differences between signed and unsigned attributions.
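The evaluation idea, replacing features in saliency order and comparing degradation under complementary orders, can be sketched on a toy linear model. The sketch assumes the replacement values are already given. Producing the adversarial counterpart that AIM draws them from is not shown, and the function names and numbers are hypothetical.

```python
def model(x, weights):
    """Toy linear scorer standing in for a trained network."""
    return sum(w * v for w, v in zip(weights, x))


def masked_output(x, replacement, order, k, weights):
    """Replace the first k features in `order` with values from `replacement`."""
    masked = list(x)
    for idx in order[:k]:
        masked[idx] = replacement[idx]
    return model(masked, weights)


def degradation_gap(x, replacement, saliency, k, weights):
    """Drop when masking most-salient-first minus drop when masking
    least-salient-first. A faithful saliency map should give a positive gap."""
    most_first = sorted(range(len(x)), key=lambda i: -saliency[i])
    least_first = most_first[::-1]
    base = model(x, weights)
    drop_most = base - masked_output(x, replacement, most_first, k, weights)
    drop_least = base - masked_output(x, replacement, least_first, k, weights)
    return drop_most - drop_least


weights = [3.0, 1.0, 0.5, 0.1]
x = [1.0, 1.0, 1.0, 1.0]
replacement = [0.2, 0.1, 0.0, 0.3]          # stand-in for the counterpart input
faithful = [3.0, 1.0, 0.5, 0.1]             # matches the true feature importance
unfaithful = [0.1, 0.5, 1.0, 3.0]           # reversed ranking
gap_good = degradation_gap(x, replacement, faithful, 2, weights)
gap_bad = degradation_gap(x, replacement, unfaithful, 2, weights)
```

Swapping `replacement` for an all-zero vector turns this into zero masking, which makes it easy to compare masking operators on the same saliency map.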

2605.16903 2026-05-19 cs.CV version update

WOW-Seg: A Word-free Open World Segmentation Model

Danyang Li, Tianhao Wu, Bin Li, Zhenyuan Chen, Yang Zhang, Yuxuan Li, Ming-Ming Cheng, Xiang Li

Affiliations: NKIARI, Shenzhen Futian; VCIP, CS, Nankai University; AAIS, Nankai University; Sichuan Agricultural University; Peking University Shenzhen Graduate School

AI summary: This paper proposes WOW-Seg, a model for precise segmentation and semantic understanding of objects in open-world images. It introduces a Mask2Token module and a Cascade Attention Mask to improve performance, builds the Region Recognition Dataset (RR-7K), and reports strong results on LVIS.

Comments: Accepted by ICLR 2026. Code and benchmark dataset are available at https://github.com/AAwCAA/WOW-Seg-Meta

English abstract

Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it represents the most extensive category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at https://github.com/AAwcAA/WOW-Seg-Meta.

2605.16901 2026-05-19 cs.CV version update

CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model

Houji Wen, Jiangyong Yu, Jun Li, Dawei Yang

Affiliations: Nanjing University of Science and Technology; Houmo AI

AI summary: This paper proposes CAR-SAM, a unified quantization framework for the Segment Anything Model. It introduces a MatMul-Aware Compensation mechanism and a Joint Cross-Attention Reconstruction strategy to address attention dissipation and reconstruction oscillation in post-training quantization, enabling quantization down to 4-bit precision.

English abstract

Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder. The resulting degradation primarily stems from challenges unique to SAMs: (1) attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. First, to mitigate attention dissipation, we introduce a MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to preceding linear weights. Second, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L, respectively.
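For readers unfamiliar with what "4-bit precision" means in practice, here is a generic uniform quantizer in plain Python. It illustrates only the baseline operation that post-training quantization builds on. It is not MAC or JCAR, whose details the abstract does not give, and the function name and the asymmetric min/max calibration are assumptions.

```python
def quantize_uniform(values, bits=4):
    """Asymmetric uniform quantization to 2**bits levels.
    Returns the integer codes and the dequantized approximations."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1            # 15 steps for 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    dequantized = [lo + c * scale for c in codes]
    return codes, dequantized


activations = [-1.0, -0.3, 0.0, 0.45, 1.0]
codes, approx = quantize_uniform(activations, bits=4)
max_error = max(abs(a - b) for a, b in zip(activations, approx))
```

With only 16 levels, each value can be off by up to half a step. The abstract's "attention dissipation" describes what happens when errors of this size accumulate in the attention maps of the SAM decoder.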

2605.16899 2026-05-19 cs.CV version update

Xin Niu, Enyi Li, Jinchao Liu, Yan Wang, Margarita Osadchy, Yongchun Fang

Affiliations: Tianjin Key Laboratory of Intelligent Robotics, College of Artificial Intelligence, Nankai University, China; Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, Nankai University, China; Department of Computer Science, Haifa University, Israel; VisionMetric Ltd, Canterbury, Kent, UK

AI summary: This paper proposes a compact encoder-decoder neural module (cmUNet) that learns modality-agnostic representations while retaining identity-related information, through cross-modality transformation and in-modality reconstruction. The authors also propose MarrNet, which connects cmUNet to a standard feature extraction network for cross-modality matching, and validate it on several challenging tasks.

Comments: Published in IEEE Transactions on Image Processing. See full abstract in the PDF file

Journal ref: IEEE Transactions on Image Processing, vol. 33, pp. 655-670, 2024

DOI: 10.1109/TIP.2023.3348656

English abstract

Cross-modality recognition has many important applications in science, law enforcement and entertainment. Popular methods to bridge the modality gap include reducing the distributional differences of representations of different modalities, learning indistinguishable representations, or explicit modality transfer. The first two approaches lose discriminant information while removing modality-specific variations. The third relies heavily on successful modality transfer and can suffer a catastrophic performance drop when explicit modality transfer is difficult or impossible. To tackle this problem, we propose a compact encoder-decoder neural module (cmUNet) to learn modality-agnostic representations while retaining identity-related information. This is achieved through cross-modality transformation and in-modality reconstruction, enhanced by an adversarial/perceptual loss which encourages indistinguishability of representations in the original sample space. For cross-modality matching, we propose MarrNet, in which cmUNet is connected to a standard feature extraction network that takes the modality-agnostic representations as inputs and outputs similarity scores for matching. We validated our method on five challenging tasks, namely Raman-infrared spectrum matching, cross-modality person re-identification and heterogeneous (photo-sketch, visible-near infrared and visible-thermal) face recognition, where MarrNet showed superior performance compared to state-of-the-art methods. Furthermore, we observe that a cross-modality matching method can be biased toward extracting discriminant information from partial or even wrong regions because it cannot handle the modality gap, which subsequently leads to poor generalization. We show that robustness to occlusions can be an indicator of whether a method bridges the modality gap well.

2605.16879 2026-05-19 cs.CV version update

Towards Generalized Image Manipulation Localization via Score-based Model

Yunfei Wang, Bo Du, Zhe Yang, Xin Liu, Zhiyu Lin, Tianxin Xu, Ji-Zhe Zhou

Affiliations: Sichuan University

AI summary: This paper proposes DiffIML, a framework that brings score-based generative modeling to image manipulation localization to address generalization. It uses structural priors to iteratively recover coherent masks, improving robustness, and shows stronger generalization on multiple benchmarks.

Comments: Accepted to ICMR 2026. 9 pages, 4 figures

DOI: 10.1145/3805622.3810759

English abstract

With the rapid evolution of synthetic media, Image Manipulation Localization (IML) has emerged as a critical component in multimedia forensics for ensuring the integrity of digital content. However, generalization remains a core challenge, as existing discriminative methods typically learn a fixed decision boundary that tends to overfit to specific training artifacts and fails to adapt to unseen manipulation types. To address this, we propose DiffIML, a novel framework that introduces score-based generative modeling to IML. Diverging from the direct estimation of hard boundaries, DiffIML approximates the score function, the gradient of the log-likelihood, to capture the intrinsic geometric topology of mask distributions. This paradigm leverages structural priors to iteratively recover coherent masks from noise, thereby circumventing the brittleness associated with discriminative models. Under this formulation, diffusion models serve as an effective numerical solver for the learned score function. To ensure practicality, we address the efficiency and stability bottlenecks of standard diffusion, respectively, by (1) using a Lightweight Mask-Specific VAE for fast latent-space processing together with a decoupled architecture and a lightweight denoising UNet, and (2) applying edge supervision and an error prior to mitigate error accumulation during sampling. Extensive experiments under two distinct protocols on eight non-generative and three generative benchmarks demonstrate that DiffIML consistently outperforms state-of-the-art methods, yielding remarkable generalization improvements on diverse unseen datasets. The code will be publicly available.

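2605.16877 2026-05-19 cs.CV version update

Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

Affiliations: Chiba University; National Institute of Informatics

AI summary: This paper proposes FaithTrace, which measures the directional derivative of the class logit along the direction a textual explanation induces in the classifier's feature space, improving the transparency of image classifiers and the faithfulness of their explanations.

Comments: 11+8 pages, 8 figures, 6 tables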

English abstract

Zero-shot textual explanations aim to make image classifiers more transparent by probing their internal representations, without relying on task-specific supervision or LVLMs. However, existing methods often miss the features that truly drive the prediction, resulting in limited faithfulness to the evidence underlying the model's decision. To address this, we propose FaithTrace. Motivated by the idea that faithful explanations should describe concepts that strongly influence the prediction, FaithTrace directly measures how much the representation induced by the explanation changes the class logit. We introduce an influence score, computed as the directional derivative of the class logit along the text-induced direction in the classifier's feature space, and use it as a proxy for faithfulness. Moreover, we extend this influence score into quantitative evaluation metrics, helping fill the gap in faithfulness evaluation for textual explanations. Experiments show that FaithTrace yields more faithful explanations than baselines, facilitating a more accurate understanding of the model. The code will be publicly released.
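The influence score described above is a directional derivative, which can be estimated numerically for any differentiable head. The sketch below uses a toy linear classifier and central finite differences. The function names, the unit-normalization of the direction, and the linear head are assumptions for illustration, not the FaithTrace implementation.

```python
import math


def class_logit(features, weights, bias=0.0):
    """Toy linear classifier head: logit = w . f + b."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def influence_score(features, direction, weights, eps=1e-4):
    """Central finite-difference estimate of the directional derivative of
    the class logit along the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    plus = [f + eps * u for f, u in zip(features, unit)]
    minus = [f - eps * u for f, u in zip(features, unit)]
    return (class_logit(plus, weights) - class_logit(minus, weights)) / (2 * eps)


weights = [2.0, -1.0, 0.5]
features = [0.3, 0.7, -0.2]
aligned = influence_score(features, [3.0, 0.0, 4.0], weights)
orthogonal = influence_score(features, [1.0, 2.0, 0.0], weights)
```

For a linear head the score reduces to the dot product of the weights with the unit direction, so a text-induced direction orthogonal to the class weights scores zero no matter how plausible the text sounds. That is the intuition behind using influence as a proxy for faithfulness.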

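2605.16873 2026-05-19 cs.CV version update

HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

Xi Liu, Weiwei Sun, Zhou Ren, Chris Broaddus, Siyu Huang, Laurent Guigues

Affiliations: Amazon AWS; Clemson University

AI summary: This paper proposes HAD, a hallucination-aware diffusion prior for 3D reconstruction. It uses the multi-view reasoning of a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data to estimate pixel-wise hallucination score maps for augmented images, then selectively masks unreliable pixels during progressive 3D reconstruction to reduce hallucination artifacts and improve reconstruction quality.

Comments: Accepted by CVPR 2026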

英文摘要

Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content -- artifacts inconsistent with the input views -- into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis. Our project is publicly available at https://xiliu8006.github.io/HAD-Project-website/.


2605.16864 2026-05-19 cs.CV cs.AI 版本更新

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

基于度量的视觉基础模型特征融合用于分割任务

Yachan Guo, JoseLuis Gomez Zurita, Danna Xue, Yi Xiao, AntonioManuel Lopez Pena

发表机构 * Universitat Autònoma de Barcelona（巴塞罗那自治大学）； Computer Vision Center（计算机视觉中心）； Harbin Institute of Technology, Shenzhen（哈尔滨工业大学深圳研究院）

AI总结本文提出了一种基于度量的特征融合方法，通过评估不同视觉基础模型的特征空间，选择并聚合互补特征以提升密集预测任务的性能。

Comments Accepted to the CVPR 2026 Findings Track

详情

AI中文摘要

尽管大规模视觉基础模型（VFMs）在语义理解方面表现优异，但在实例感知的密集预测任务中仍显不足。它们在表示上存在不同的偏倚：例如，可提示的分割模型（如SAM2）专注于细粒度区域边界，而自监督模型（如DINOv3）强调物体层面的结构。这一观察表明，结合不同VFMs的互补特征可以增强下游密集预测任务。然而，简单的多VFMs融合很少带来可靠的增益，且如何利用其互补特征的可解释原则仍待探索。在本文中，我们提出了一种基于度量的方法，通过显式的评估分数选择并聚合不同VFMs的互补特征。具体而言，我们设计了一套无标签的度量标准，在特征空间的两个方面，结构一致性与边缘保真度，来评估VFM编码器的特征。在这些分数的指导下，我们识别出互补性强的边缘强和结构强的编码器对，并通过主辅融合方案进行整合。这种特征融合不需要复杂的架构更改，并且仅在单个阶段进行训练。我们的模型在多个密集预测任务中相比基线模型表现出一致的性能提升，具有更好的物体层面语义和更准确的边界定位。代码可在 https://github.com/gyc-code/metric-guided-fusion 获取。

英文摘要

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.


2605.16861 2026-05-19 cs.CV cs.AI 版本更新

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

前缀自适应块扩散用于高效的文档识别

Mingxu Chai, Ziyu Shen, Chenyu Liu, Kaidi Zhang, Jiazheng Zhang, Dingwei Zhu, Zhiheng Xi, Ruoyu Chen, Jun Long, Jihua Kang, Tao Gui, Qi Zhang

发表机构 * Computation and Artificial Intelligence Innovative College, Fudan University, Shanghai, China（复旦大学计算与人工智能创新学院，上海，中国）； Shanghai Innovation Institute, Shanghai, China（上海创新研究院，上海，中国）； ByteDance, Shanghai, China（字节跳动，上海，中国）

AI总结本文提出前缀自适应块扩散模型（PA-BDM），通过改进块内去噪和缓存机制，提升文档识别的效率和准确性。

Comments 17 pages, 6 figures

详情

AI中文摘要

块扩散模型（BDMs）支持并行生成、灵活长度输出和KV缓存，使其在高效文档解析中具有潜力。然而，现有BDMs将去噪和缓存承诺绑定到固定的块边界：块内去噪时并行性缩小，而生成的token无法缓存直到整个块完成。此外，块内双向去噪与块间自回归冲突，导致信息流不一致，可能挑战结构敏感的识别。我们提出前缀自适应块扩散模型（PA-BDM），用从前缀到后缀的因果去噪替代块内双向去噪，并将块大小视为最大候选范围而非固定承诺单位。PA-BDM使用置信度门控结构损失（CSL）在扩展训练到更长延续之前构建低熵前缀。在推理过程中，逐步前缀承诺（PPC）则动态地将最长可靠的前缀投入KV缓存，并从更新的前缀重置下一个候选范围，每一步都恢复大的并行解码空间。实验表明，3B PA-BDM在多个基准上实现了更高的识别得分，并在2.5B MinerU-Diffusion上将推理吞吐量提高了71.6%。

英文摘要

Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, making them promising for efficient document parsing. However, existing BDMs bind denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, while generated tokens cannot be cached until the whole block is completed. Moreover, intra-block bidirectional denoising conflicts with inter-block autoregression, creating inconsistent information flow that can challenge structure-sensitive recognition. We propose the Prefix-Adaptive Block Diffusion Model (PA-BDM), which replaces intra-block bidirectional denoising with causal denoising from prefix to suffix and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before extending training to longer continuations. During inference, Progressive Prefix Commitment (PPC) then dynamically commits the longest reliable prefix into the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at each step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6% over the 2.5B MinerU-Diffusion.
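The commit-then-reset loop that the abstract attributes to Progressive Prefix Commitment can be sketched roughly as follows. This is only a toy under stated assumptions: the confidence threshold, the rule of always committing at least one token, and the names `longest_reliable_prefix`, `decode_with_prefix_commitment`, and `toy_predict` are all hypothetical, and a plain list stands in for the KV cache.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[List[int], int], Tuple[List[int], List[float]]]


def longest_reliable_prefix(confidences: Sequence[float], threshold: float = 0.9) -> int:
    """Count leading tokens whose confidence is at least `threshold`.

    At least one token is committed when any are proposed (an assumption of
    this sketch) so that decoding always makes progress.
    """
    count = 0
    for conf in confidences:
        if conf < threshold:
            break
        count += 1
    return max(count, 1) if confidences else 0


def decode_with_prefix_commitment(
    predict: Predictor, total_len: int, block_size: int, threshold: float = 0.9
) -> Tuple[List[int], int]:
    committed: List[int] = []  # stands in for tokens already in the KV cache
    steps = 0
    while len(committed) < total_len:
        # The block size is a maximum candidate range, not a fixed commit unit.
        span = min(block_size, total_len - len(committed))
        tokens, confidences = predict(committed, span)
        keep = longest_reliable_prefix(confidences, threshold)
        committed.extend(tokens[:keep])  # commit, then restart from the new prefix
        steps += 1
    return committed, steps


def toy_predict(prefix: List[int], span: int) -> Tuple[List[int], List[float]]:
    """Hypothetical model: proposes consecutive ids with fixed toy confidences."""
    start = len(prefix)
    tokens = [start + i for i in range(span)]
    confidences = [0.99, 0.94, 0.80, 0.97][:span]
    return tokens, confidences


decoded, num_steps = decode_with_prefix_commitment(toy_predict, total_len=7, block_size=4)
```

Because each step restarts from the updated prefix, the candidate range returns to the full block size whenever enough tokens remain, instead of shrinking toward a fixed block boundary.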


2605.16859 2026-05-19 cs.CV cs.AI 版本更新

VGGT-CD: Training-Free Robust Registration for 3D Change Detection

VGGT-CD：无训练的鲁棒三维变化检测注册

Wei Zhang, Songhua Li, Yihang Wu, Qiang Li, Qi Wang

发表机构 * Northwestern Polytechnical University（西北工业大学）

AI总结本文提出VGGT-CD方法，通过解耦跨时间注册与动态变化干扰，实现无训练的鲁棒三维变化检测注册，有效减少轨迹误差并提升注册速度。

Comments 13 pages, 5 figures. Code is available at: https://github.com/WZ-CS/VGGT-CD

详情

AI中文摘要

从多视角图像进行三维变化检测对于城市监控、灾难评估和自动驾驶至关重要。然而，现有方法大多在2D领域操作，其中视角变化被误认为物理变化且深度不可用。虽然视觉几何基础模型如VGGT能够快速从未摆正的图像生成密集点云，但独立每轮重建面临根本性障碍：不可预测的跨轮标度模糊、注册-变化悖论以及普遍存在的边缘飞行噪声。为了解决这些挑战，我们提出了VGGT-CD，一种无训练的流水线，将跨时间注册与动态变化干扰解耦。在粗阶段，稀疏关键帧联合推断建立统一的度量空间并产生初始Sim(3)先验。在细阶段，密集重建通过隔离静态背景对应关系进行净化。闭合形式的质心对齐优化平移同时锁定标度和旋转，使用残差自检数学保证非退化。在World Across Time数据集的11场景基准上评估，VGGT-CD在户外将绝对轨迹误差减少了44%，在室内减少了59%。它以6倍于传统方法的速度完成注册，生成高纯度的3D变化地图，无需任务特定训练。

英文摘要

3D change detection from multi-view images is essential for urban monitoring, disaster assessment, and autonomous driving. However, existing methods predominantly operate in the 2D domain, where viewpoint variations are mistaken for physical changes and depth is unavailable. While visual geometry foundation models like VGGT rapidly produce dense point clouds from unposed images, independent per-epoch reconstruction encounters fundamental obstacles: unpredictable inter-epoch scale ambiguity, a registration-change paradox in which scene changes corrupt alignment, and pervasive edge-flying noise. To address these challenges, we present VGGT-CD, a training-free pipeline decoupling cross-temporal registration from dynamic-change interference. In the Coarse Stage, sparse keyframe joint inference establishes a unified metric space and yields an initial Sim(3) prior. In the Fine Stage, dense reconstructions are purified by isolating static-background correspondences. A closed-form centroid alignment refines the translation while locking scale and rotation, using a residual self-check to mathematically guarantee non-degradation. Evaluated on an 11-scene benchmark from the World Across Time dataset, VGGT-CD reduces Absolute Trajectory Error by 44% outdoors and 59% indoors. It completes registration over 6 times faster, producing high-purity 3D change maps without task-specific training.
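One way to read the closed-form centroid alignment with a residual self-check is sketched below. With scale s and rotation R held fixed, the least-squares translation over correspondences (p_i, q_i) is t = mean(q) - s R mean(p). The function names, the mean-squared residual as the check, and the toy data are assumptions for illustration, not the VGGT-CD code.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def apply_sim3(scale: float, rot: Mat3, trans: Sequence[float], p: Sequence[float]) -> Vec3:
    """Map point p through x = scale * rot @ p + trans."""
    return tuple(
        scale * sum(rot[r][c] * p[c] for c in range(3)) + trans[r] for r in range(3)
    )


def mean_sq_residual(scale, rot, trans, src, dst) -> float:
    """Mean squared distance between transformed src points and dst points."""
    total = 0.0
    for p, q in zip(src, dst):
        x = apply_sim3(scale, rot, trans, p)
        total += sum((a - b) ** 2 for a, b in zip(x, q))
    return total / len(src)


def refine_translation(scale, rot, trans_prior, src, dst) -> Vec3:
    """Refine only the translation; scale and rotation stay locked."""
    n = len(src)
    src_centroid = [sum(p[i] for p in src) / n for i in range(3)]
    dst_centroid = [sum(q[i] for q in dst) / n for i in range(3)]
    moved = apply_sim3(scale, rot, (0.0, 0.0, 0.0), src_centroid)
    candidate = tuple(dst_centroid[i] - moved[i] for i in range(3))
    # Residual self-check: keep the prior unless the candidate is at least as good.
    if mean_sq_residual(scale, rot, candidate, src, dst) <= mean_sq_residual(
        scale, rot, trans_prior, src, dst
    ):
        return candidate
    return tuple(trans_prior)


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
src_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Toy static-background targets generated with scale 2 and translation (1, 2, 3).
dst_points = [(1.0, 2.0, 3.0), (3.0, 2.0, 3.0), (1.0, 4.0, 3.0)]
refined = refine_translation(2.0, IDENTITY, (0.5, 2.0, 3.0), src_points, dst_points)
```

The self-check means the returned translation never has a larger residual on the given correspondences than the prior, which is the sense of non-degradation illustrated here.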


2605.16848 2026-05-19 cs.CV cs.AI cs.CL cs.LG 版本更新

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

基于模式的思考：通过模式诱导突破视觉规划中的感知瓶颈

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

发表机构 * State Key Lab of CAD& CG（CAD与CG国家重点实验室）

AI总结本文提出通过模式诱导的方法，利用模式推理和模式诱导策略，使视觉语言模型在视觉规划任务中实现更高效和准确的感知与推理，解决传统模型在复杂输入下的感知瓶颈问题。

详情

AI中文摘要

从原始视觉输入进行规划仍然对当前的视觉-语言模型（VLMs）构成重大挑战，当输入复杂度超出其一步感知能力时。受最近在图像思考（TWI）中的进展启发，一种合理的解决方案是通过迭代获取和整合局部视觉证据，将感知过程分解为更简单的步骤。然而，尽管当前VLMs在一般TWI能力上训练良好，但其在规划领域中的感知瓶颈仍然存在。为解决这一挑战，我们将TWI视为一种工具，逐步构建并反映一个准确的内部世界模型。我们发现，由此产生的无训练规划策略使VLMs能够解决远超其初始能力的任务，但代价是过多的TWI操作会显著增加计算开销。为进一步提高效率，我们提出模式推理，一种新的TWI策略，使VLMs能够主动识别新任务中的已知视觉模式并直接推断局部世界模型结构。为了获得这些模式，我们提出模式诱导，一种在线归纳学习策略，将视觉模式视为复合且可重用的专家，这些专家是自主从经验中发现和优化的。在FrozenLake、Crafter和CubeBench领域中的实验评估表明，我们的方法在准确性和效率之间实现了良好的平衡。

英文摘要

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.


2605.16834 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

基于有限数据的细粒度多模态对齐的相对表示学习

Shiwon Kim, Yu Rang Park

发表机构 * Yonsei University（延世大学）

AI总结本文提出了一种基于相对表示的学习方法，用于在有限数据条件下实现细粒度多模态对齐，通过学习token级别的跨模态结构来提升零样本分类、跨模态检索和零样本分割任务的性能。

详情

AI中文摘要

多模态预训练展示了强大的泛化性能，但在缺乏配对数据的领域中，这种范式往往难以实施。一种有前景的替代方法是事后多模态对齐，它通过有限数量的配对示例分别对预训练的单模态编码器进行对齐。然而，现有方法主要关注全局表示的对齐，忽略了片段-token关系。这可能阻碍了需要细粒度跨模态匹配的任务的迁移，超越粗粒度样本层面的语义。为了解决这个问题，我们提出了一种事后对齐方法，通过相对表示学习token级别的跨模态结构。具体来说，我们通过图像和文本与每种模态空间中一组可学习锚点的token级相似性来表示它们，这些锚点被训练以诱导一致的跨模态相似性模式，以匹配对。尽管仅学习锚点而没有重大的投影层，我们的方法在零样本分类、跨模态检索和零样本分割任务中均显著优于现有方法。这突显了在有限配对数据下，建模细粒度跨模态结构对于有效事后多模态对齐的重要性。

英文摘要

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.


2605.16832 2026-05-19 cs.CV 版本更新

Coarse Semantic Injection for LLM-Conditioned Structured Indoor Prediction

粗粒度语义注入用于LLM条件的结构室内预测

Shuliang Zhu, Tomiwa Adey, Jinjia Zhou

AI总结本文提出了一种接口保持的语义增强方法，用于LLM条件的结构解码，通过将语义证据与点云表示关联，将其编码为RGBB点接口，以提升结构室内预测的精度，特别是在复杂场景中的门框定位和家具检测。

详情

AI中文摘要

大型语言模型（LLMs）最近被用作结构解码器，用于从3D点云输入中进行室内理解。然而，点云编码器在体素化和稀疏池化后，往往低估了如门和窗等细长结构元素，并可能在拥挤场景中遗漏单个家具实例。我们提出了一种接口保持的语义增强方法，用于LLM条件的结构解码。关键思想是将语义证据与点云表示关联，将其缩减为粗粒度四组代码（家具、墙壁、开口和其他），并将其编码为RGBB点接口：红色表示家具，绿色表示墙壁，蓝色表示开口，黑色表示其他，其中RGBB表示在三个RGB通道中用三种颜色表示四种语义状态，而不是额外的第四通道。该语义颜色代码在原始原始点属性后附加，因此几何和语义共享相同的稀疏标记化路径，同时下游语言模型解码器和输出序列化保持不变。我们进一步引入了一个轻量级的路由语义位移模块，其辅助头仅用于训练时的比率/预算正则化和分析，以在稀疏池化后加强语义线索。整体流程可以使用RGB衍生的语义证据。在这些受控的语义源设置下，报告的指标在Structured3D、SpatialLM数据集和ARKitScenes上均有所提升，尤其是在拥挤场景中的开口定位和单个家具检测。消融实验澄清了语义源、颜色编码、标记融合和位移注入的作用，同时显示颜色/熵效应仍然非平凡。

英文摘要

Large language models (LLMs) have recently been used as structured decoders for indoor understanding from 3D point-token inputs. However, point cloud encoders often under-represent thin structural elements such as doors and windows after voxelization and sparse pooling, and may miss individual furniture instances in cluttered scenes. We propose an interface-preserving semantic augmentation for LLM-conditioned structured decoding. The key idea is to associate semantic evidence with the point-cloud representation, reduce it to a coarse four-group code (furniture, walls, openings, and others), and encode it as an RGBB point interface: red for furniture, green for walls, blue for openings, and black for others, where RGBB denotes four semantic color states represented in three RGB channels rather than an additional fourth channel. This semantic color code is appended to the original raw point attributes before tokenization, so geometry and semantics share the same sparse tokenization path while the downstream language model decoder and output serialization remain unchanged. We further introduce a lightweight routed semantic shift module, with an auxiliary head used only for training-time ratio/budget regularization and analysis, to strengthen semantic cues after sparse pooling. The overall pipeline can use RGB-derived semantic evidence. Under these controlled semantic-source settings, the reported metrics improve across Structured3D, the SpatialLM dataset, and ARKitScenes, especially for opening localization and per-instance furniture detection in cluttered scenes. Ablations clarify the roles of semantic source, color coding, token fusion, and shift injection, while also showing that color/entropy effects remain nontrivial.
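The RGBB point interface can be pictured with a small sketch: four semantic groups are written into three color channels and appended to each point's raw attributes. The color assignment follows the abstract; the [0, 1] value range, the tuple layout, the fallback to "others" for unknown labels, and the names `SEMANTIC_COLORS` and `append_semantic_color` are assumptions for illustration.

```python
from typing import Dict, Sequence, Tuple

# Four semantic states in three RGB channels (no fourth channel is added).
SEMANTIC_COLORS: Dict[str, Tuple[float, float, float]] = {
    "furniture": (1.0, 0.0, 0.0),  # red
    "walls": (0.0, 1.0, 0.0),      # green
    "openings": (0.0, 0.0, 1.0),   # blue
    "others": (0.0, 0.0, 0.0),     # black
}


def append_semantic_color(point: Sequence[float], group: str) -> Tuple[float, ...]:
    """Append the coarse semantic color code to a point's raw attributes.

    Unknown group names fall back to "others" (an assumption of this sketch).
    """
    color = SEMANTIC_COLORS.get(group, SEMANTIC_COLORS["others"])
    return tuple(point) + color


door_point = append_semantic_color((1.0, 2.0, 3.0), "openings")
```

Because the code is appended before tokenization, geometry and semantics would travel through the same sparse tokenization path, as the abstract describes.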


2605.16818 2026-05-19 cs.CV cs.AI 版本更新

DecoRec: 通过物体级扩散进行单视图图像的分解3D场景重建

Yuhan Ping, Yuan Liu, Xiaoxiao Long, Peng Wang, Junhui Hou, Jianyi Zheng, Jia Pan, Xin Li, Cheng Lin

发表机构 * Department of Computer Science, the University of Hong Kong（香港大学计算机科学系）； Department of Computer Science, City University of Hong Kong（香港城市大学计算机科学系）； Faculty of Humanities and Arts, Macau University of Science and Technology（澳门科技大学人文艺术学院）； Department of Computer Science and Engineering, Texas A&M University（德克萨斯A&M大学计算机科学与工程系）； Department of Computer Science and Engineering, Macau University of Science and Technology（澳门科技大学计算机科学与工程系）

AI总结本文提出DecoRec系统，通过物体级扩散方法实现单视图图像的分解3D场景重建，解决了现有方法在场景重建中出现的精度问题，并通过可微渲染和扩散引导细化技术提升重建效果。

详情

AI中文摘要

在本文中，我们介绍了DecoRec，一种新的系统，旨在将单视图2D图像提升为分解的3D场景网格。当前单视图场景重建方法通常依赖于物体检索或粗粒度3D体素或表面的回归，导致无法准确捕捉输入图像的外观和几何结构。缺乏高质量的大规模场景级数据集进一步加剧了从单视图图像直接生成3D场景的难度。为了实现高质量的3D场景生成，DecoRec利用最近的基于扩散的单视图物体重建方法，分别重建单个物体。随后提出一个细化流程，通过可微渲染技术和扩散引导细化技术有效地将这些重建的物体合并，提升外观和几何结构。我们的结果表明，DecoRec在几何和新合成方面实现了高质量的单视图场景重建，为下游应用如房间内部设计提供了显著的便利。

英文摘要

In this paper, we introduce DecoRec, a novel system designed to elevate single-view 2D images to a decomposed 3D scene mesh. Current methods for single-view scene reconstruction typically rely on object retrieval or the regression of coarse 3D voxels or surfaces, leading to inaccuracies in capturing the appearance and geometry of the input image. The lack of high-quality large-scale scene-level datasets further complicates direct 3D scene generation from single-view images. To achieve high-quality 3D scene generation from a single-view image, DecoRec takes advantage of recent diffusion-based single-view object reconstruction methods to reconstruct individual objects separately. Subsequently, a refinement pipeline is proposed to effectively merge these reconstructed objects, enhancing appearance and geometry through a differentiable rendering technique and diffusion-guided refinement. Our results demonstrate that DecoRec facilitates high-quality single-view scene reconstruction in both geometry and novel synthesis, offering significant benefits for downstream applications like room interior design.


2605.16806 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning

跨模态亲和对齐的多模态学习分析用于预测基于游戏的学习中学生协作满意度

Wen-Hsin Tsai, Chia-Ming Lee, Yuk-Ying Tung

发表机构 * Institute of Education, National Cheng Kung University（国立成功大学教育研究所）； Institute of Intelligent System, National Yang Ming Chiao Tung University（阳明交通大学智能系统研究所）； Department of Computer Science, University at Albany, State University of New York（纽约州立大学奥尔巴尼分校计算机科学系）

AI总结本文提出了一种跨模态亲和对齐的多模态学习分析框架，通过建模模态间关系和对比学习来增强学生协作满意度预测的鲁棒性和可解释性。

Comments Accepted by CVPR 2026 CVxEdu Workshop

详情

AI中文摘要

协作式基于游戏的学习环境为小组知识构建提供了丰富的机遇，但自动预测学生协作满意度仍具挑战性。关键障碍是模态退化：在教育部署中，个体模态如眼动在学生群体间表现出不一致的信息量，导致基于隐式注意力的融合产生脆弱的多模态表示。我们提出了亲和对齐多模态学习分析（AAMLA）框架，其核心贡献是跨模态亲和引导的模态对齐（CAMA）模块，该模块通过亲和矩阵显式建模模态间关系，并通过对比学习强制跨模态一致性，从而实现对无信息模态的自适应抑制而不丢弃它们。AAMLA进一步应用模态特定的投影层，将异构特征，包括面部动作单元、头部姿态、眼动和交互痕迹日志，映射到统一的语义空间，然后再进行对齐。在EcoJourneys协作学习环境中的50名中学生实验表明，在标准和模态退化条件下，AAMLA在单模态基线和先前跨注意力方法上均表现出一致的改进，SHAP和t-SNE分析证实CAMA能够产生稳健且可解释的跨模态表示，用于学生协作建模。

英文摘要

Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.


2605.16805 2026-05-19 cs.CV 版本更新

NeuroLiDAR: Adaptive Frame Rate Depth Sensing via Neuromorphic Event-LiDAR Fusion

NeuroLiDAR: 通过神经形态事件-LiDAR融合实现自适应帧率深度感知

Darshana Rathnayake, Dulanga Weerakoon, Meera Radhakrishnan, Archan Misra

发表机构 * Singapore Management University（新加坡管理大学）； Singapore-MIT Alliance for Research and Technology Centre（新加坡-麻省理工联合研究和技术中心）； University of Technology Sydney（悉尼科技大学）

AI总结本文提出NeuroLiDAR，通过融合稀疏LiDAR数据和密集的神经形态事件相机数据，实现了高达约66Hz的自适应帧率深度感知，减少了29%的深度重建误差。

Comments Accepted at ICRA 2026

详情

AI中文摘要

LiDARs被广泛用于3D深度重建，但其性能常受到固有硬件限制的制约，这些限制在范围、空间分辨率和帧率之间产生权衡。许多LiDAR系统通常以低帧率（例如5-10Hz）运行，优先考虑远距离传感而不是对快速场景变化的响应。我们提出了NeuroLiDAR，一种能够实现高达约66Hz有效帧率的自适应深度感知框架，通过融合时间稀疏的LiDAR数据与时间密集的神经形态事件相机数据。NeuroLiDAR集成了两个组件：基于事件的关键帧检测和基于事件的深度外推，以动态调整感知速率以响应场景动态。为了评估我们的方法，我们引入了ELiDAR数据集，涵盖了户外和室内场景，并展示了NeuroLiDAR在RMSE中将深度重建误差减少了约29%，同时实现了27.8-47.3Hz的自适应帧率。我们的代码和数据集可在https://github.com/darshanakgr/neurolidar上获得。

英文摘要

LiDARs are widely used for 3D depth reconstruction, but their performance is often limited by inherent hardware constraints that impose trade-offs between range, spatial resolution, and frame rate. Many LiDAR systems typically operate at low frame rates (e.g., 5-10 Hz), prioritizing long-range sensing over responsiveness to rapid scene changes. We present NeuroLiDAR, an adaptive depth sensing framework that achieves effective frame rates of up to approximately 66 Hz by fusing temporally sparse LiDAR data with temporally dense inputs from neuromorphic event cameras. NeuroLiDAR integrates two components, event-based keyframe detection and event-guided depth extrapolation, to dynamically adjust the sensing rate in response to scene dynamics. To evaluate our approach, we introduce ELiDAR, a dataset spanning outdoor and indoor scenarios, and show that NeuroLiDAR reduces depth reconstruction error by approximately 29% in RMSE while achieving adaptive frame rates between 27.8 and 47.3 Hz. Our code and dataset are available at https://github.com/darshanakgr/neurolidar.


2605.16797 2026-05-19 cs.CV cs.RO 版本更新

为超维计算编码鲁棒的拓扑特征

Arpan Kusari

发表机构 * University of Michigan Transportation Research Institute（密歇根大学交通研究院）； University of Michigan（密歇根大学）

AI总结本文提出了一种基于拓扑特征的超维计算方法，通过提取离散拓扑原始特征并结合RTS不变的形状签名，提高了超维计算在旋转、噪声和遮挡等扰动下的鲁棒性，实验表明其在多个数据集上优于传统方法。

详情

AI中文摘要

超维（HD）计算由于其简单性、快速的原型基推断和与在线更新的兼容性，为边缘学习提供了一个有吸引力的替代方案。然而，标准的基于像素的HD编码器容易受到分布偏移的影响，如旋转、噪声或遮挡，会显著降低准确性。我们从二值化形状中提取离散拓扑原始特征——尤其是孔洞，并将它们与旋转/平移/缩放（RTS）不变的形状签名配对。我们的方法为（i）外轮廓使用空间金字塔变体的Zernike矩构建RTS稳定的描述符，（ii）每个孔洞使用其径向签名的内在傅里叶描述符以及RTS-标准相对几何。每个原始特征通过随机投影和角色绑定映射到双极超向量，并通过排列不变的捆绑聚合变量卡数的孔洞集以形成单个图像超向量。为了避免过度加权任何线索，我们通过在验证集上融合余弦相似度学习Zernike和孔洞通道的非负可靠性权重。在MNIST和EMNIST数据集上进行的实验表明，拓扑引导的HD计算相比传统HD基线显著提高了鲁棒性，保持了多个扰动家族的高精度，并受益于轻量级在线训练。与在干净数据上训练的紧凑CNN相比，我们的方法在清洁精度上具有竞争力，同时对几种像素级扰动具有明显更强的鲁棒性，证明了显式拓扑结构是实现鲁棒HD表示的可行途径。代码在https://github.com/arpan-kusari/Topological-HDC提供。

英文摘要

Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype-based inference, and compatibility with online updates. However, standard pixel-based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives, most notably holes, from binarized shapes and pair them with rotation/translation/scale (RTS)-invariant shape signatures. Our method constructs RTS-stable descriptors for (i) the outer shape using a spatial-pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS-canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable-cardinality hole sets are aggregated by permutation-invariant bundling to form a single image hypervector. To avoid over-weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt-and-pepper, cutout, zoom) show that topology-guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel-level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at https://github.com/arpan-kusari/Topological-HDC.
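The encoding pipeline named in the abstract (randomized projection to a bipolar hypervector, role binding, permutation-invariant bundling of a variable number of holes) can be sketched generically. This uses standard HD operations, not the authors' code: the dimension, the Gaussian projection, the sign threshold, the tie rule, the 4-value descriptors, and every function name are assumptions for illustration.

```python
import random
from typing import List, Sequence

DIM = 1024  # hypervector dimension (illustrative choice)


def random_projection(n_features: int, dim: int, seed: int) -> List[List[float]]:
    """Fixed random Gaussian matrix with `dim` rows and `n_features` columns."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(dim)]


def random_bipolar(dim: int, seed: int) -> List[int]:
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(dim)]


def encode(descriptor: Sequence[float], projection: List[List[float]]) -> List[int]:
    """Project a real-valued descriptor and take signs to get a bipolar vector."""
    return [1 if sum(w * x for w, x in zip(row, descriptor)) >= 0 else -1
            for row in projection]


def bind(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Role binding as an elementwise product."""
    return [x * y for x, y in zip(a, b)]


def bundle(hypervectors: Sequence[Sequence[int]]) -> List[int]:
    """Elementwise majority vote; a sum, so the input order does not matter."""
    sums = [sum(column) for column in zip(*hypervectors)]
    return [1 if s >= 0 else -1 for s in sums]  # ties resolved to +1


def image_hypervector(outer, holes, projection, role_outer, role_hole) -> List[int]:
    parts = [bind(role_outer, encode(outer, projection))]
    parts += [bind(role_hole, encode(h, projection)) for h in holes]
    return bundle(parts)


projection = random_projection(n_features=4, dim=DIM, seed=0)
role_outer = random_bipolar(DIM, seed=1)
role_hole = random_bipolar(DIM, seed=2)

outer_descriptor = [0.8, -0.1, 0.3, 0.5]  # stand-in for an outer-shape signature
hole_a = [0.2, 0.9, -0.4, 0.1]            # stand-ins for per-hole descriptors
hole_b = [-0.7, 0.3, 0.6, -0.2]

hv_ab = image_hypervector(outer_descriptor, [hole_a, hole_b], projection, role_outer, role_hole)
hv_ba = image_hypervector(outer_descriptor, [hole_b, hole_a], projection, role_outer, role_hole)
```

Swapping the order of the holes leaves the image hypervector unchanged, which is the permutation invariance the abstract relies on for variable-cardinality hole sets.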


2605.16779 2026-05-19 cs.CV cs.AI 版本更新

A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis

一种基于无监督聚类分析的超二次曲面拟合整体方法

Mingyang Zhao, Sipu Ruan, Xiaohong Jia

发表机构 * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences（数学科学国家重点实验室，数学与系统科学研究院，中国科学院）； University of Chinese Academy of Sciences（中国科学院大学）； Robotics Institute, School of Mechanical Engineering and Automation, Beihang University（北京航空航天大学机械工程与自动化学院机器人研究所）

AI总结本文提出了一种新的方法，用于在存在噪声和异常值的情况下对点云进行超二次曲面拟合，通过无监督聚类分析重新定义问题，实现了刚性和变形超二次曲面的一体化拟合，同时提供了闭式解析解和收敛性证明。

Comments 20 pages, Code: https://github.com/zikai1/SuperquadricFitting

Journal ref IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2026

详情

AI中文摘要

本文提出了一种新的方法，用于在存在噪声和异常值的情况下对点云进行超二次曲面拟合，该方法在多个领域具有广泛的应用。与以往仅专注于拟合刚性或变形超二次曲面或存在鲁棒性和数值稳定性问题的方法不同，我们的方法从无监督聚类的新视角重新定义问题，使刚性和变形超二次曲面的拟合能够在统一的框架中完成。我们的方法核心是一种受无监督聚类分析启发的稳定优化函数，其中我们将点云数据和潜在参数曲面的样本分别作为聚类成员和质心。然后，具有动态更新质心位置的聚类过程成为优化超二次曲面参数的直接代理，建立了几何拟合与聚类动态之间的原则性联系。我们进一步推导了聚类质心与聚类成员之间的成对计算与正交距离之间的关系，从而有效消除了耗时的曲面采样过程。此外，我们的公式为模糊成员度向量和协方差矩阵提供了闭式解析解，确保了高效迭代优化，并能够更有效地处理几何变形。此外，我们还提供了收敛性分析的理论证明，并证明了聚类启发的拟合方法通过内在增加目标函数的凸性来逃避局部极小值。实现已公开在https://github.com/zikai1/SuperquadricFitting。

英文摘要

This work presents a novel method for fitting superquadrics to point clouds under the contamination of noise and outliers, which has many applications for shape modeling across diverse fields. Unlike prior approaches that either exclusively focus on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further derive the relationship between pairwise computations of clustering centroids and clustering members to orthogonal distances, effectively eliminating the need for the time-consuming surface sampling process. Moreover, our formulation provides closed-form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iteration optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical certificate of convergence analysis and demonstrate that the clustering-inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at https://github.com/zikai1/SuperquadricFitting.


2605.16775 2026-05-19 cs.CV cs.AI cs.LG 版本更新

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

VolTA-3D: 基于3D体积分块对齐的脑MRI自监督学习

Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar

发表机构 * Institute of Health Informatics（健康信息学研究所）； Sheikh Zayed Institute for Pediatric Surgical Innovation（谢赫扎耶德儿童外科创新研究所）； School of Medicine and Health Sciences（医学与健康科学学院）

AI总结本文提出VolTA-3D，一种用于脑MRI自监督学习的3D视觉Transformer框架，通过联合对齐全局类风格标记和局部块标记，增强体积分块表示的可迁移性，从而在多个下游任务中表现出更好的泛化能力和鲁棒性。

Comments Accepted at EMBC 2026

详情

AI中文摘要

自监督学习（SSL）通过利用大规模未标记数据推动了医学图像分析的发展。然而，在脑磁共振成像（MRI）中，大多数3D模型仍局限于分割或分类任务，限制了其在不同数据集、成像协议和下游任务中的泛化能力。这种缺乏可迁移性限制了3D MRI模型的临床应用，尽管存在大量未标记的体数据。我们提出了Volta-3D，一种自监督的3D视觉Transformer框架，旨在学习可迁移的体表示。Volta-3D在学生-教师范式中联合对齐全局类风格标记和局部块标记，并强制细粒度结构重建。这种联合全局-局部对齐解决了脑MRI中有限的语义多样性和细微解剖特征，这对现有SSL方法构成了挑战。我们在多个分布外下游任务上评估了Volta-3D，包括海马体分割和性别及阿尔茨海默病与健康对照的分类。在所有任务中，Volta-3D学习的表示均优于随机初始化的基线，证明了其在域偏移下的改进可迁移性和鲁棒性。因此，在预训练过程中联合强制全局语义一致性和局部结构学习，使模型能够从未标记的脑MRI数据中学习更广泛的概念。总体而言，VolTA-3D支持有效的多任务下游性能，具有任务特定的适应性，是迈向通用化和临床可行的3D模型的一步。

英文摘要

Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabeled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer's disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific pertaining, a step towards generalizable and clinically viable 3D models.


2605.16774 2026-05-19 cs.CV cs.AI 版本更新

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

CANSURF：一种ASV视角的可回收物数据集和基准，用于表面级垃圾的检测与跟踪

Zaid Aljundi, Zahra F. Rahmatullah, Mostafa Elemam, Abdullah Moosa

发表机构 * School of Mathematical and Computer Sciences（数学与计算机科学学院）； Heriot-Watt University Dubai（赫瑞-瓦特大学迪拜分校）； School of Engineering and Physical Sciences（工程与物理科学学院）

AI总结本文提出了一种新的ASV视觉系统和表面可回收物数据集，用于在水面条件下检测和跟踪小型反射性垃圾，如铝罐。数据集包含约7.3k张原始图像，经过十种增强方法扩展至约57k张训练/验证图像，涵盖了多样的光照和水状态。通过基准测试，训练YOLOv11在CANSURF数据集上提升了12倍的性能，展示了数据集的价值。实验表明，YOLOv11+ByteTrack在稳定跟踪和多目标准确性方面表现最佳，而YOLOv11+SAHI在远距离罐子的召回率上有所提升，但精度有所下降。考虑到任务需求，YOLOv11 + SAHI在检测最大数量的罐子方面表现更好。

Comments Published in the 2025 8th International Conference on Signal Processing and Information Security (ICSPIS). Published and available to view on IEEE Xplore

Journal ref Proc. 2025 8th Int. Conf. Signal Processing and Information Security (ICSPIS), 2025, pp. 1-6

详情

DOI: 10.1109/ICSPIS67605.2025.11318414

AI中文摘要

表面级海洋垃圾仍然是自主清洁任务中的实际瓶颈，其中小型、反射性的目标（如铝罐）必须在强光、波浪和部分淹没条件下从远处检测。本文提出了一种ASV视觉系统和一个新的表面可回收物数据集。该数据集包含约7.3k张从视频中提取的原始图像，并通过十种增强类型扩展至约57k张训练/验证图像，涵盖了多样化的光照和水状态。一组针对表面操作定制的检测器和检测-跟踪管道进行了基准测试。在CANSURF上训练YOLOv11的性能比通用数据集提高了12倍，突显了数据集的价值。实验表明，YOLOv11+ByteTrack在稳定跟踪（较少的身份切换）和多目标准确性方面表现最佳，而YOLOv11+SAHI在远距离罐子的召回率上有所提升，但精度在全上下文输入中有所下降。鉴于任务配置，单罐拾取与接近和抓取，YOLOv11 + SAHI在检测最大数量的罐子方面表现更好。没有先前的公开数据集针对从水面视角在水面上检测铝罐；此数据集填补了这一空白，并支持可重复的评估。

English abstract

Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations was benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset's value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile, single-can pickup with approach and grab, YOLOv11+SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.
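
The recall gain attributed to SAHI comes from sliced inference: the detector runs on overlapping tiles, so a far-field can fills more of each input. The sketch below illustrates only the tiling step. The helper name `slice_boxes`, the tile size, and the overlap ratio are assumptions for illustration; they are neither the paper's settings nor the SAHI library's API.

```python
def slice_boxes(width, height, tile=640, overlap=0.2):
    """Return (x0, y0, x1, y1) tile windows covering a width x height image.

    Hypothetical helper for illustration; tile size and overlap ratio are
    assumed values, not taken from the paper.
    """
    step = max(1, int(tile * (1.0 - overlap)))  # stride between tile origins
    boxes = []
    y = 0
    while True:
        y1 = min(y + tile, height)
        y0 = max(0, y1 - tile)  # shift the last row back so it stays full-size
        x = 0
        while True:
            x1 = min(x + tile, width)
            x0 = max(0, x1 - tile)  # same shift for the last column
            boxes.append((x0, y0, x1, y1))
            if x1 >= width:
                break
            x += step
        if y1 >= height:
            break
        y += step
    return boxes


tiles = slice_boxes(1920, 1080)
```

In a full pipeline each tile would go through the detector, the resulting boxes would be shifted by the tile origin `(x0, y0)`, and overlapping detections would be merged (for example with non-maximum suppression). The extra tile-level detections are one plausible source of the lower precision the abstract reports.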

2605.16769 2026-05-19 cs.CV (updated version)

Genflow Ad Studio: A Compound AI Architecture for Brand-Consistent, Self-Correcting Video Generation

Debanshu Das, Lavi Nigam, Sunil Kumar Jang Bahadur, Gopala Dhar

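Affiliations: Google

AI summary: This paper proposes Genflow Ad Studio, a compound AI architecture that uses a Brand DNA extraction module and an adversarial multi-agent quality control loop to improve brand-consistent video generation, raising the compliance rate from 42% to 89%.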

Comments 6 pages, 2 figures, 2 tables. Accepted to the ACM Conference on AI and Agentic Systems (CAIS '26). Includes demo video and code repository links

Journal ref ACM Conference on AI and Agentic Systems (CAIS '26), May 26-29, 2026, San Jose, CA, USA

DOI: 10.1145/3786335.3813213

AI-generated Chinese abstract

近期生成视频模型的进步展示了高水平的视觉保真度，但其在企业环境中的整合受到时间不一致性和严重的品牌不一致性的限制。当前的单体架构难以强制执行严格的品牌约束，经常产生未经批准的视觉资产。我们介绍了Genflow，一种复合AI系统，旨在生成媒体生产中强制执行品牌一致性。我们的架构集成了基于检索的'品牌DNA'提取模块，以参数化生成方式根据已确立的企业身份指南进行生成。此外，我们实现了对抗性多代理质量控制（QC）循环。与单次生成流程不同，此流程采用评估代理，反复批评生成的帧，与提取的参数进行比较，促使生成模型细化输出，直到达成确定性的一致性。通过转向多阶段、自我纠正的流程，Genflow将品牌合规视频生成的产量从42%提高到89%，建立了稳健的框架，用于可扩展的、企业级的生成系统。

English abstract

Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval-based 'Brand DNA' extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi-Agent Quality Control (QC) loop. Instead of a single-pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi-stage, self-correcting pipeline, Genflow improved the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.

2605.16745 2026-05-19 cs.CV (updated version)

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

EVA01: 通过混合变换器实现统一的原生3D理解和生成

Zongyuan Yang, Mingjing Yi, Wanli Ma, Chenzhuo Fan, Bocheng Li, Baolin Liu, Yuke Lou, Yingde Song, Yongping Xiong, Zhengdong Guo, Shimu Wang

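Affiliations: SeeleAI Team

AI summary: This paper proposes the EVA01 framework, which uses a Mixture-of-Transformers architecture to extend the modality boundary of multimodal large language models, enabling native 3D mesh understanding and generation as well as context-aware editing, and improving text-to-3D generation fidelity and multi-turn geometric editing.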

Comments 28 pages, 10 figures, 6 tables. Technical report

AI-generated Chinese abstract

本文解决了将3D网格作为多模态大语言模型（MLLM）的原生模态整合的挑战。基于扩散的大型重建模型将语义理解与几何推理解耦，作为无状态重建器，条件于密集的2D像素先验。最近的MLLM基于方法将3D模态视为外部输出而非多模态序列的原生组件，使渐进式适应而没有系统分析几何流形如何与MLLM特征空间对齐。我们引入EVA01，一个统一的框架，扩展MLLM的模态边界，原生纳入3D网格理解和生成以及上下文感知编辑。基于混合变换器（MoT）架构，EVA01将模型分为预训练的Understanding Expert（E_und）和结构上镜像的Generation Expert（E_gen），通过共享的全局自注意力和硬模态路由耦合。该设计使MLLM主干的语义潜在空间与几何流形对齐，从而在不使用中间2D表示的情况下直接转移多模态先验。结果表明，EVA01在文本到3D生成保真度方面达到最先进的水平，并解锁了具有身份保持的稳健长上下文多轮几何编辑能力，这一能力对无状态重建流程来说是根本无法实现的。我们的发现进一步为将2D基础模型与3D任务整合提供了架构洞察，指导3D原生多模态系统的设计。项目页面：https://www.seeles.ai/research/pages/EVA01

English abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ($E_{\mathrm{und}}$) and a structurally mirrored Generation Expert ($E_{\mathrm{gen}}$), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01

2605.16742 2026-05-19 cs.CV stat.ME (updated version)

Diffeomorphic Cortical Alignment via Direct Warping of Streamline Endpoints

通过直接变形纤维束端点实现的皮层对齐

Yang Xiang, Martin Cole, Zhengwu Zhang

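Affiliations: Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill; Department of Psychiatry, University of Rochester

AI summary: This paper proposes a connectivity-based cortical alignment method that aligns cortical surfaces by operating directly on white-matter fiber-tract endpoints, improving tract-level correspondence and achieving higher connectivity overlap coefficients on major fiber bundles along with stronger robustness.

AI-generated Chinese abstract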

皮层表面注册通常由局部几何描述符（例如沟回深度和曲率）驱动。尽管这种方法实现了几何对应，但忽略了白质解剖结构所施加的远距离连接约束。扩散磁共振成像束追踪提供了这些关键约束；然而，先前的连接性指导流程通常对预计算的连接性矩阵进行对齐，使优化高度敏感于连接性估计及其分辨率。在本文中，我们提出了一种新的基于连接性的皮层对齐方法，通过直接在白质纤维束端点上操作来对齐皮层表面。我们将束端点建模为产品流形Ω×Ω上的点云，其中Ω代表膨胀的皮层半球的球形域。我们的对齐方法通过迭代（i）通过最小化连接性不匹配计算Ω的小变形扭曲，并（ii）根据此扭曲更新端点。该方法依赖于一个几何框架，确保输出扭曲是微分同胚，并具有最终目标，即优化已知纤维束的匹配。在人类连接组计划（HCP）数据上的实验表明，该方法在纤维束层面实现了改进的对应性，实现了主要纤维束上的更高连接性重叠系数，并在Ω的网格分辨率下比最先进的方法如ENCORE和MSMAll表现出更强的鲁棒性。

English abstract

Cortical surface registration is often driven by local geometric descriptors (e.g., sulcal depth and curvature). While this approach achieves geometric correspondence, it neglects the long-range wiring constraints imposed by white-matter anatomy. Diffusion MRI tractography offers these crucial constraints; however, prior connectivity-informed pipelines typically align precomputed connectivity matrices, making the optimization highly sensitive to connectivity estimation and its resolution. In this paper, we introduce a novel connectivity-based surface registration method that aligns cortical surfaces by operating directly on white-matter fiber-tract endpoints. We model tract endpoints as a point cloud on the product manifold $\Omega \times \Omega$, where $\Omega$ represents the spherical domain of the inflated cortical hemispheres. Our alignment method iteratively (i) computes a small diffeomorphic warp for $\Omega$ by minimizing connectivity mismatch, and (ii) updates the endpoints based on this warp. The method relies on a geometric framework that ensures output warps are diffeomorphisms and has a final goal that optimizes the matching of well-known fiber bundles. Experiments on Human Connectome Project (HCP) data demonstrate improved tract-level correspondence, achieving higher connectivity-level overlap coefficients on major fiber bundles and stronger robustness across grid resolutions for $\Omega$ compared to state-of-the-art methods such as ENCORE and MSMAll.

2605.16737 2026-05-19 cs.RO cs.CV (updated version)

DriveSafer: End-to-End Autonomous Driving with Safety Guidance

DriveSafer: 结合安全指导的端到端自动驾驶

Shounak Sural, Raj Rajkumar

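Affiliations: Carnegie Mellon University

AI summary: This paper proposes the DriveSafer framework, which improves the safety of end-to-end autonomous driving by reducing catastrophic planning failures rather than simply raising average planning quality.

AI-generated Chinese abstract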

端到端（E2E）自动驾驶模型近年来在性能上有了显著提升，尤其是在越来越具有挑战性的基准测试中。然而，现代生成式E2E规划器仍然在安全关键场景中存在大量致命性故障。我们发现许多此类故障源于物理约束和安全要求的违反，导致不安全行为。受此发现启发，本文专注于改进生成式端到端驾驶中的安全结果，通过有针对性地减少致命性规划失败，而不是提升平均规划质量。为此，我们提出了DriveSafer，一种面向失败的的安全框架，用于端到端规划器。DriveSafer通过利用训练时的安全约束和推理时的安全指导，明确引导生成式规划器朝向安全行为。与最先进的DiffusionDrive模型相比，在NAVSIM基准测试中，DriveSafer将致命性故障数量（PDMS=0）减少了48%，在可行驶区域合规性故障上减少了超过65%。

English abstract

End-to-End (E2E) autonomous driving models have shown growing capability in recent years, with performance improving on increasingly challenging benchmarks. However, modern generative E2E planners still suffer from a substantial number of catastrophic failures in safety-critical scenarios. We find that many such failures arise from violations of physical constraints and safety requirements, leading to unsafe behavior. Motivated by this finding, in this paper, we focus on improving safety outcomes in generative end-to-end driving with a targeted reduction of catastrophic planning failures, instead of enhancing average planning quality. Towards this end, we propose DriveSafer, a failure-aware safety framework for end-to-end planners. DriveSafer explicitly steers generative planners towards safe behaviors leveraging both training-time safety constraints and inference-time safety guidance. Compared to the state-of-the-art DiffusionDrive model, on the NAVSIM benchmark, DriveSafer reduces the number of catastrophic failures (PDMS=0) by 48%, with over 65% reduction in drivable-area compliance failures.

2605.16732 2026-05-19 cs.CV cs.LG (updated version)

DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

DiRotQ：面向4位扩散变换器的旋转感知量化

Sayeh Sharify, Mahsa Salmani, Hesham Mostafa

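Affiliations: d-Matrix

AI summary: This paper proposes DiRotQ, a W4A4 quantization framework that uses rotation-aware activation quantization to mitigate the quality degradation of Diffusion Transformers at 4-bit precision, and also introduces a VLM-as-a-Judge evaluation protocol and a custom Triton kernel to improve efficiency and quality under compression.

AI-generated Chinese abstract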

扩散变换器（DiTs）在图像生成质量上达到最先进的水平，但在推理过程中带来显著的内存和计算成本。尽管激进的后训练量化（PTQ）到4位精度能带来显著的效率提升，但通常会导致严重的质量下降。现有方法，包括基于平滑的方法、混合精度方案、旋转技术以及低秩残差方法，部分缓解了这一问题，但仍与FP16/BF16性能存在明显差距。在本工作中，我们引入DiRotQ，一种W4A4 PTQ框架，通过旋转感知的激活量化来缓解这种降级。DiRotQ通过主成分分析（PCA）识别出捕捉主导激活方差的低秩子空间，在该子空间中保留系数以较高精度，同时将剩余组件量化为4位。在推理时，通过校准得出的正交变换将激活旋转到PCA基底中，而逆旋转被融合到层权重中，离线。结合基于GPTQ的权重量化，DiRotQ在PixArt-Σ数据集上实现了FID（更低越好）为15.9和PSNR（越高越好）为19.1 dB，优于先前最先进的SVDQuant（FID 18.9，PSNR 17.6）在同一INT W4A4设置下的表现。除了标准指标外，我们引入了VLM-as-a-Judge评估协议，这是该设置下的首次此类评估，提供了更全面的感知质量和提示对齐评估。在系统层面，我们实现了基于Triton的定制内核，以实现高效的端到端推理，将12B FLUX.1-dev模型的内存使用减少了2.1倍，并在24 GB RTX 4090 GPU上实现了2.3倍的加速。

English abstract

Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
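
Fusing the inverse rotation into the weights works because, for any orthogonal matrix R, W x = (W Rᵀ)(R x). The 2D sketch below illustrates that identity together with a mixed-precision split of the rotated coefficients. It is a toy under stated assumptions, not the DiRotQ implementation: the rotation is hand-picked rather than derived from PCA on calibration data, and `quantize_int4` is a simple symmetric fake quantizer with an assumed scale.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def quantize_int4(x, scale):
    """Symmetric 4-bit fake quantization: round to a level in [-8, 7], then rescale."""
    q = max(-8, min(7, round(x / scale)))
    return q * scale


# Assumed stand-in for a calibration-derived PCA basis: a 2D rotation whose
# first axis is aligned with the dominant activation direction (1, 1)/sqrt(2).
c = 1.0 / math.sqrt(2.0)
R = [[c, c], [-c, c]]

W = [[0.5, -1.0], [2.0, 0.25]]      # toy layer weights
W_fused = matmul(W, transpose(R))   # inverse rotation folded into the weights offline

x = [3.0, 3.2]                      # activation close to the dominant direction
z = matvec(R, x)                    # rotate into the PCA-like basis at inference time

# Keep the dominant coefficient at full precision; quantize the residual to 4 bits.
z_mixed = [z[0], quantize_int4(z[1], scale=0.05)]

y_ref = matvec(W, x)                # original layer output
y_rot = matvec(W_fused, z)          # matches y_ref up to floating-point error
y_mixed = matvec(W_fused, z_mixed)  # small error from the quantized residual
```

Because most of the activation energy lands in the first rotated coefficient, the quantized remainder has a small dynamic range, which is the intuition behind keeping the dominant subspace at higher precision.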

2605.16720 2026-05-19 cs.CV cs.LG (updated version)

Compositional Adversarial Training for Robust Visual Watermarking

组合对抗训练用于鲁棒的视觉水印

Anirudh Satheesh, Michael-Andrei Panaitescu-Liess, Andrew Xu, Georgios Milis, Heng Huang, Zikui Cai, Furong Huang

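Affiliations: University of Maryland

AI summary: This paper proposes Compositional Adversarial Training (CAT), a framework that improves the robustness of visual watermarking by formulating a min-max problem over a structured space of compositional transformations; experiments show it outperforms random-augmentation baselines across multiple attack settings.

AI-generated Chinese abstract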

鲁棒水印通常使用随机后处理增强进行训练，但随机采样无法覆盖真实攻击管道的组合空间，难以遇到真正破坏检测的稀有组合。这导致训练不稳定且样本效率低。我们将其水印鲁棒性建模为结构化组合转换空间上的min-max问题。我们提出组合对抗训练（CAT），一种插件框架，学习一个顺序可微的对抗者，观察当前水印图像并在每一步选择攻击家族以最大程度干扰信息恢复。CAT结合了直通Gumbel-Softmax攻击选择与熵正则化，使反向传播可端到端微分并聚合攻击家族的梯度信息，从而实现更快、更平滑的收敛，而不陷入单一攻击模式。我们评估CAT在生成后水印VideoSeal 0.0、VideoSeal 1.0和PixelSeal以及在生成WMAR下的单步和双步攻击套件，以及在分布内和多分布图像和视频基准测试中。CAT在单步攻击设置中将水印容量提高最高63.5%，在组合设置中提高13.0%；在自回归设置中，CAT在困难几何变换上将TPR@FPR=1%平均提高12%。这些结果表明，鲁棒视觉水印受益于对抗适应组合对抗者而非独立随机破坏。

English abstract

Robust watermarking is typically trained with random post-processing augmentation, but random sampling under-covers the combinatorial space of realistic attack pipelines and rarely encounters the rare compositions that actually break detection. This leads to unstable training and poor sample efficiency. We instead formulate watermark robustness as a min-max problem over a structured space of compositional transformations. We propose Compositional Adversarial Training (CAT), a plug-in framework that learns a sequential differentiable adversary that observes the current watermarked image and selects an attack family at each step to maximally disrupt message recovery. CAT combines a straight-through Gumbel-Softmax attack selection with entropy regularization, allowing the backward pass to be end-to-end differentiable and aggregate gradient information across attack families, yielding faster, smoother convergence without collapsing to a single attack mode. We evaluate CAT on post-generation watermarks VideoSeal 0.0, VideoSeal 1.0, and PixelSeal and in-generation WMAR under both single-step and two-step attack suites, on in-distribution and multiple out-of-distribution image and video benchmarks. CAT consistently outperforms random-augmentation baselines trained with the same augmentation budget, with the largest gains on hard composed attacks and OOD evaluations, improving overall watermark capacity by up to $63.5\%$ in the single-step attack setting and $13.0\%$ in the compositional setting. In the autoregressive setting, CAT improves the TPR@FPR$=1\%$ by $12\%$ on average on difficult geometric transformations. These results show that robust visual watermarking benefits from training against adaptive compositional adversaries rather than independent random corruptions.
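
The attack-selection step can be illustrated with the forward pass of a Gumbel-Softmax sample plus the entropy term used for regularization. This is a plain-Python sketch under assumptions: the three logits and their attack-family labels are hypothetical, and no gradients are computed here. In an autograd framework, the straight-through variant would use the one-hot `hard` vector in the forward pass and the gradient of `soft` in the backward pass.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    """Sample relaxed and hard (one-hot) selections over attack families."""
    # Gumbel(0, 1) noise by inverse transform; the clamp avoids log(0).
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    soft = softmax([(l + g) / temperature for l, g in zip(logits, noise)])
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return soft, hard


def entropy(probs):
    """Shannon entropy in nats; larger values mean a more spread-out adversary."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


rng = random.Random(0)
attack_logits = [0.2, 1.5, -0.3]  # hypothetical scores for blur, crop, compression
soft, hard = gumbel_softmax(attack_logits, temperature=0.5, rng=rng)
policy_entropy = entropy(softmax(attack_logits))
```

Adding a bonus proportional to `policy_entropy` to the adversary's objective is one common way to discourage collapse onto a single attack family, which matches the role the abstract gives to entropy regularization.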

2605.16696 2026-05-19 cs.CV (updated version)

AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

Ziyang Mai, Yuyao Zhang, Yu-Wing Tai

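Affiliations: Dartmouth College

AI summary: This paper proposes the AtlasVid framework, which improves the efficiency of ultra-high-resolution long video generation through decoupled modeling, achieving a 60.9x speedup and lower training cost while outperforming native 4K generators.

AI-generated Chinese abstract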

近期基于扩散的视频生成器在视觉保真度和提示可控性方面取得了显著进展，但将其扩展到超高清（UHR）长视频仍极具挑战性。难点尤其体现在长单次生成中，需保持连续场景的全局时间一致性，同时不依赖剪辑过渡或自回归镜头拼接的精细空间细节。本文从解耦建模角度重新审视这一挑战。我们主张现有视频扩散模型已编码了强局部视觉先验，而主要瓶颈在于如何高效扩展全局时空建模以适应更高的分辨率和持续时间。基于此见解，我们提出AtlasVid，一种解耦的全局-局部框架，用于高效UHR长视频生成。AtlasVid首先通过时间缩放RoPE生成低分辨率和低FPS的全局语义代理，从而扩展时间范围而不增加训练token数量。在该代理的引导下，高分辨率细节分支进行联合去噪，采用分层局部性保持注意力。重新排列的时空窗口保持几何局部性，不对称的全局-局部注意力注入对齐的语义指导并保留模型的预训练能力。此设计使模型具备分辨率无关的训练能力：模型仅在720P上训练，使用轻量LoRA适配，即可直接泛化到4K及更长（>10秒）的视频生成。实验表明，AtlasVid显著提升了超高清长视频生成的效率，实现了高质量UHR长视频生成，速度提升60.9倍，训练成本显著降低，甚至优于原生4K视频生成器。

English abstract

Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation, where a continuous scene must preserve global temporal coherence and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlasVid, a decoupled global-local framework for efficient UHR long video generation. AtlasVid first generates a low-resolution and low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality, and asymmetric global-local attention injects aligned semantic guidance while preserving the model's pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (>10s) video synthesis. Experiments show that AtlasVid substantially improves the efficiency of ultra-high-resolution long video generation, achieving high-quality results with a 60.9x speedup, significantly lower training cost, and even better performance than native 4K video generators.

2605.14963 2026-05-19 cs.CV (updated version)

H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

H-OmniStereo：基于方向对齐法线先验的零样本全方位立体匹配

Chenxing Jiang, Zhe Tong, Pusen Gao, Peize Liu, Yang Xu, Chuan Fang, Ping Tan, Shaojie Shen

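Affiliations: Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology

AI summary: This paper proposes the H-OmniStereo framework, which addresses data scarcity and the degradation of perspective priors in omnidirectional stereo matching by building a high-quality synthetic dataset and introducing a heading-aligned normal estimator, achieving higher accuracy and cross-view consistency.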

Comments 8 pages, 9 figures

AI-generated Chinese abstract

在顶底等距矩形图像上的立体匹配为全方位感知提供了有效框架，因为垂直对齐的视差线能够利用大量数据集和单目先验驱动的先进透视立体架构。然而，此类适应的性能严重受限于全方位立体数据集的稀缺性和球面畸变下单目先验的退化。为解决这些挑战，我们提出H-OmniStereo，零样本全方位立体匹配框架。首先，我们构建包含280万对顶底等距矩形立体对的高质量合成数据集以扩大训练规模。其次，我们引入等距矩形单目法线估计器，专门在方向对齐坐标系中运行。除了提供抗畸变和跨视角一致的几何先验以建立可靠的立体匹配对应关系外，该设计还提升了训练效率并适应了训练测试视角范围不匹配。大量实验表明，我们的方法在域外数据集上比现有方法更准确，并成功泛化到实际消费者相机设置中使用单个模型。模型和数据集将在https://github.com/JIANG-CX/H-OmniStereo发布。

English abstract

Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator that operates in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. The model and dataset will be released at https://github.com/JIANG-CX/H-OmniStereo.

2605.14854 2026-05-19 cs.CV cs.AI (updated version)

FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

因子化HMR：视频人体网格恢复的混合框架

Patrick Kwon, Chen Chen

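Affiliations: Institute of Artificial Intelligence; University of Central Florida

AI summary: This paper proposes the FactorizedHMR framework, which handles the recovery of different body parts with a deterministic regression module and a probabilistic flow-matching module, and combines a composite target representation with geometry-aware supervision to improve recovery of ambiguous parts, with gains on occlusion and drift-sensitive metrics.

AI-generated Chinese abstract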

人体网格恢复（HMR）本质上具有歧义性：在遮挡或弱深度线索下，同一图像证据可能由多个3D身体解释。这种歧义性并非均匀分布于全身，躯干姿态和根结构通常相对受约束，而远端关节如手臂和腿部则更不确定。基于此观察，我们提出FactorizedHMR，一种两阶段框架，分别处理这两种情形。一个确定性回归模块首先恢复稳定的躯干-根锚点，一个概率流匹配模块则完成剩余的非躯干关节。为使完成可靠，我们结合复合目标表示与几何感知监督和特征感知分类器自由引导，保留躯干-根锚点的同时提升易产生歧义的关节的单参考恢复。我们还引入了一个合成数据管道，提供在多种视角下的配对图像-相机-运动监督。在相机空间和世界空间基准测试中，FactorizedHMR与强基线竞争，尤其在遮挡密集恢复和漂移敏感世界空间指标上表现最突出。

English abstract

Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

2605.13322 2026-05-19 cs.CV cs.LG (updated version)

KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

KamonBench：一种基于语法规则的数据集，用于评估视觉-语言模型中的组合因子恢复

Richard Sproat, Stefano Peluchetti

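AI summary: KamonBench uses 20,000 synthetic composite crests plus auxiliary component examples to provide a controlled testbed for evaluating sparse compositional recognition and factor recovery in vision-language models, supporting program-code factor metrics and controlled recombination of factor pairs.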

Comments Preprint

2605.11871 2026-05-19 cs.CV (updated version)

$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

$h$-control: 无需训练的相机控制 via 块条件吉布斯细化

Yuzhu Wang, Xi Ye, Duo Su, Yangyang Xu, Jun Zhu

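Affiliations: Department of Computer Science and Technology, Tsinghua University; South China University of Technology

AI summary: This paper proposes $h$-control, which modifies the sampler structure to address the inverse problem of camera control in training-free video generation, improving the balance between trajectory adherence and visual quality and achieving the best results on multiple datasets.

AI-generated Chinese abstract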

无需训练的相机控制对于预训练的流匹配视频生成器是一个部分观察逆向问题：深度扭曲的引导视频为潜变量子集提供噪声证据，采样器必须与预训练先验相协调。现有方法难以平衡轨迹一致性和视觉质量，且启发式引导强度调整缺乏鲁棒性。我们提出$h$-control，通过在采样器中引入结构变化：每个外层硬替换引导步骤均增强内循环块条件伪吉布斯细化，对同一噪声水平下的未观测补集进行处理，保证收敛到部分观察条件数据定律。为加速高维视频潜变量的收敛，我们利用其条件局部性，将未观测补集划分为3D块，每个块由自定义混合指示器跟踪，能自适应冻结收敛块。在RealEstate10K和DAVIS数据集上，$h$-control在所有七种免训练和训练-based竞争者中取得最佳FVD，优于所有免训练基线。

English abstract

Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance trajectory adherence against visual quality, and heuristic guidance-strength tuning lacks robustness. We propose $h$-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, $h$-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.

2605.11208 2026-05-19 cs.CV (updated version)

Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

Hi-GaTA：用于外科视频报告生成的分层门控时间聚合适配器

Kedi Sun, Chaohui Dang, Yue Feng, James Glasbey, Theodoros N. Arvanitis, Le Zhang

Affiliations: School of Engineering, College of Engineering and Physical Sciences, University of Birmingham, Birmingham, UK; School of Computer Science, University of Birmingham, Birmingham, UK; Department of Applied Health Sciences, University of Birmingham, Birmingham, UK

AI summary: This paper proposes the Hi-GaTA framework, which uses temporal aggregation to compress long video sequences into LLM-compatible visual prefix tokens and combines a pretrained surgery-specific video encoder with LoRA fine-tuning to generate high-quality surgical reports.

Comments 11 pages, 2 figures

AI-generated Chinese abstract

自动化、临床级的外科手术评估报告可减少文档负担并提供客观反馈，但面临视频时空表示与语言推理对齐困难及高质量隐私数据稀缺的挑战。为此，我们建立包含214个高质量模拟外科视频及外科医生撰写的评估报告的基准。基于此资源，我们提出包含Hi-GaTA的感知-对齐-推理框架，其中Hi-GaTA是一种新型轻量级时间适配器，通过短到长范围时间聚合高效压缩长视频序列为紧凑的LLM兼容视觉前缀令牌。为实现稳健的视觉感知，我们预训练了Sur40k，一种针对外科专用的ViViT风格视频编码器，在40,000分钟的公开外科视频上进行预训练以捕捉细粒度的时空手术先验。Hi-GaTA采用带有文本条件双交叉注意力的时间金字塔，并通过跨层门控融合和递增深度策略提高多尺度一致性。最后，我们使用LoRA微调LLM主干以在有限监督下实现连贯且风格一致的外科报告生成。实验表明，我们的方法在整体性能上最佳，且在强大的多模态大语言模型（MLLM）基线中表现出一致的优势。消融研究进一步验证了每个提出组件的有效性。

English abstract

Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.

2605.10759 2026-05-19 cs.LG cs.CV (updated version)

MeshReGen: A Unified Framework for 3D Geometry Regeneration

Geon Yeong Park, Roman Shapovalov, Rakesh Ranjan, Jong Chul Ye, Andrea Vedaldi, Thu Nguyen-Phuoc

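Affiliations: KAIST; Meta Reality Labs

AI summary: MeshReGen uses a VecSet-based conditioning mechanism to regenerate 3D objects from 2D images and an initial 3D shape, supporting tasks such as enhancement, reconstruction, and editing, and achieving controllable 3D generation across multiple tasks without additional annotations.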

Comments Project page: https://geonyeong-park.github.io/meshregen/ 32 pages, 18 figures, 6 tables. Includes Appendix

2604.10027 2026-05-19 cs.CV (updated version)

SinkTrack: Attention Sink based Context Anchoring for Large Language Models

SinkTrack: 基于注意力sink的上下文锚定用于大语言模型

Xu Liu, Guikun Chen, Wenguan Wang

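Affiliations: The State Key Lab of Brain-Machine Intelligence

AI summary: SinkTrack treats <BOS> as an information anchor and injects key contextual features into it to mitigate hallucination and context forgetting in large language models; experiments show significant gains on both textual and multimodal tasks.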

Comments ICLR 2026. Code: https://github.com/67L1/SinkTrack

AI-generated Chinese abstract

大语言模型（LLMs）面临幻觉和上下文遗忘问题，先前研究认为注意力漂移是主要原因，即LLMs的注意力转向新生成的token而远离初始输入上下文。为应对此问题，我们利用LLMs的一个相关内在特性：注意力sink——倾向于持续将高注意力分配给序列的第一个token（即<BOS>）。具体而言，我们提出了一种先进的上下文锚定方法SinkTrack，将<BOS>作为信息锚点，并将其表示中注入关键上下文特征（如来自输入图像或指令的特征）。因此，LLM在整个生成过程中始终保持对初始输入上下文的锚定。SinkTrack是无需训练的即插即用方法，且引入了极小的推理开销。实验表明，SinkTrack在文本（例如在SQuAD2.0上使用Llama3.1-8B-Instruct模型时提升21.6%）和多模态（例如在M3CoT上使用Qwen2.5-VL-7B-Instruct模型时提升22.8%）任务中均有效缓解了幻觉和上下文遗忘问题。其在不同架构和规模上的稳定提升凸显了其鲁棒性和泛化能力。我们还从信息传递的角度分析了其底层工作原理。源代码可在https://github.com/67L1/SinkTrack获取。

English abstract

Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs' focus shifts towards newly generated tokens and away from the initial input context. To counteract this, we make use of a related, intrinsic characteristic of LLMs: attention sink -- the tendency to consistently allocate high attention to the very first token (i.e., <BOS>) of a sequence. Concretely, we propose an advanced context anchoring method, SinkTrack, which treats <BOS> as an information anchor and injects key contextual features (such as those derived from the input image or instruction) into its representation. As such, the LLM remains anchored to the initial input context throughout the entire generation process. SinkTrack is training-free, plug-and-play, and introduces negligible inference overhead. Experiments demonstrate that SinkTrack mitigates hallucination and context forgetting across both textual (e.g., +21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multi-modal (e.g., +22.8% on M3CoT with Qwen2.5-VL-7B-Instruct) tasks. Its consistent gains across different architectures and scales underscore its robustness and generalizability. We also analyze its underlying working mechanism from the perspective of information delivery. Our source code is available at https://github.com/67L1/SinkTrack.

2604.08936 2026-05-19 cs.CV (updated version)

M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model

M-IDoL：面向医学基础模型的模态特定与多样化表示学习的信息分解

Yihang Liu, Longzhen Yang, Jiaxiong Yang, Ying Wen, Lianghua He, Heng Tao Shen

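Affiliations: School of Computer Science and Technique; Tongji University; School of Communications and Electronic Engineering; East China Normal University

AI summary: This paper proposes M-IDoL, which uses information decomposition to improve the modality specificity and diversity of medical foundation models by maximizing inter-modality entropy and minimizing intra-modality uncertainty, achieving better generalization than existing models on 21 downstream tasks.

AI-generated Chinese abstract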

医学基础模型（MFMs）旨在从多模态医学图像中学习通用表示，以有效泛化到多样化的临床任务。然而，现有大多数MFMs面临信息模糊问题，将多模态表示融合到单一嵌入空间中，导致模态特异性和多样性下降。本文提出M-IDoL，一种自监督的MFMs，通过两个目标引入信息分解：i）通过将多模态表示分散到可分离的专家混合（MoE）子空间中，最大化跨模态熵以实现跨模态的表示特异性；ii）通过在每个MoE子空间内进行细粒度语义辨别，最小化内模态不确定性以丰富每个模态的表示多样性。通过在115万张医学图像上预训练，M-IDoL在21个下游临床任务中实现了优于20个基础模型的泛化能力，并学习了模态特定和多样化的表示，展示了跨模态特征簇的更清晰分离和每个模态内更细粒度的特征辨别。

English abstract

Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blends multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised MFM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximizing inter-modality entropy by dispersing multimodal representations into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimizing intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature clusters across modalities and finer-grained feature discrimination within each modality.

2603.29167 2026-05-19 cs.CV 版本更新

LightZeroNav: Zero-Shot Vision-Language Navigation in Continuous Environments with Lightweight VLMs

Kun Luo, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou

Affiliations * Foshan Graduate School of Innovation, Northeastern University; Faculty of Robot Science and Engineering, Northeastern University; School of Aeronautic Science and Engineering, Beihang University; QingniaoAI, China

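AI summary: This paper proposes LightZeroNav, which uses lightweight VLMs to address three bottlenecks of zero-shot vision-language navigation in continuous environments. Without task-specific training or graph search, and using only RGB observations with the lightweight Qwen3-VL-8B model, it achieves performance comparable to GPT-4o.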

2603.09286 2026-05-19 cs.CV 版本更新

CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation

CogBlender：迈向文本到图像生成中的连续认知干预

Shengqi Dang, Yi He, Jiaying Lei, Ziqing Qian, Nan Cao

Affiliations * Tongji University; Shanghai Innovation Institute

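AI summary: CogBlender uses a two-stage approach to enable continuous, multi-dimensional intervention on cognitive properties in image generation, effectively controlling psychological properties such as valence and memorability.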

详情

AI中文摘要

除了传达语义信息，图像还具有引发特定心理反应的认知属性，如记忆编码或情感反应。尽管现代文本到图像（T2I）模型能生成语义连贯的内容，但难以控制认知属性（如情感、记忆性）并匹配用户心理意图。为此，我们引入CogBlender算法，通过新颖的两阶段方法实现对认知属性的连续多维干预。首先，构建离散的认知感知重写提示变体，代表不同的极端认知状态。其次，通过在流匹配模型的流场域内插值得到连续控制信号。通过动态混合这些提示预测的流场以实现目标认知评分，CogBlender能够平滑地引导生成轨迹，实现最终图像中期望的认知属性。在四个认知属性（即情感、唤醒度、支配性和记忆性）上的广泛实验表明，CogBlender实现了有效的认知干预。

英文摘要

Beyond conveying semantic information, images also possess cognitive properties that elicit specific psychological responses from viewers, such as memory encoding or emotional reactions. Although modern text-to-image (T2I) models generate semantically coherent content effectively, they struggle to control cognitive properties (e.g., valence, memorability) and often fail to align with the user's psychological intent. To bridge the gap, we introduce CogBlender, an algorithm that enables continuous and multi-dimensional intervention on cognitive properties through a novel two-stage approach. First, we construct discrete cognition-aware rewritten prompts, i.e., variants of the input prompt that represent distinct extreme cognitive states. Second, we translate these discrete prompts into continuous control signals by interpolating within the velocity-field domain of flow-matching models. By dynamically blending the velocity fields predicted from these prompts according to the target cognitive scores, CogBlender smoothly steers the generative trajectory to realize the desired cognitive properties in the final image. Extensive experiments across four cognitive properties (i.e., valence, arousal, dominance, and memorability) demonstrate that CogBlender achieves effective cognitive intervention.
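
The velocity-field blending described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a single cognitive property with two extreme prompts, a target score in [0, 1], linear interpolation weights, and plain Euler integration. The names `blend_velocity`, `sample`, `toy_low`, and `toy_high` are hypothetical, and the constant toy fields stand in for a flow-matching model's predictions under each prompt.

```python
from typing import Callable, List

Vector = List[float]
VelocityField = Callable[[Vector, float], Vector]


def blend_velocity(v_low: Vector, v_high: Vector, score: float) -> Vector:
    """Linearly interpolate two velocity predictions by a target score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return [(1.0 - score) * a + score * b for a, b in zip(v_low, v_high)]


def sample(x0: Vector, field_low: VelocityField, field_high: VelocityField,
           score: float, steps: int = 10) -> Vector:
    """Euler-integrate dx/dt = blended velocity from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = blend_velocity(field_low(x, t), field_high(x, t), score)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_low(x: Vector, t: float) -> Vector:
    """Hypothetical stand-in for the velocity predicted under the 'low' prompt."""
    return [-1.0 for _ in x]


def toy_high(x: Vector, t: float) -> Vector:
    """Hypothetical stand-in for the velocity predicted under the 'high' prompt."""
    return [1.0 for _ in x]
```

With these toy fields, a score of 0.5 cancels the two velocities and leaves the sample where it started, while scores closer to 1 move it further toward the "high" end. A multi-dimensional version would blend more than two fields with weights derived from several target scores.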

2603.04870 2026-05-19 cs.CV 版本更新

Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

基于扩散的sRGB真实噪声生成：通过提示驱动的噪声表示学习

Jaekyun Ko, Dongjin Kim, Soomin Lee, Guanghui Wang, Tae Hyun Kim

Affiliations * Department of Computer Science, Hanyang University; Mobile Experience (MX) Division, Samsung Electronics; Department of Computer Science, Toronto Metropolitan University

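AI summary: This paper proposes the Prompt-Driven Noise Generation framework, which learns prompt features to generate realistic noisy images without relying on camera metadata, improving the generalizability and applicability of noise synthesis.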

Comments CVPR 2026

详情

AI中文摘要

在sRGB图像空间中去噪具有挑战性，由于噪声变化大。尽管端到端方法表现良好，但实际场景中其效果受限于真实噪声-清洁图像对的稀缺性，这些对昂贵且难以收集。为解决这一限制，已开发出几种生成方法，从有限数据中合成逼真的噪声图像。这些方法通常依赖相机元数据进行训练和测试以合成现实噪声。然而，缺乏元数据或设备间不一致限制了其实用性。因此，我们提出了一种新的框架，称为提示驱动噪声生成（PNG）。该模型能够获取高维提示特征，捕捉现实输入噪声的特征，并创建与输入噪声分布一致的多种逼真噪声图像。通过消除对显式相机元数据的依赖，我们的方法显著提高了噪声合成的通用性和应用性。全面的实验表明，我们的模型能够有效生成逼真的噪声图像，并在各种基准数据集上成功应用于去除现实噪声。

英文摘要

Denoising in the sRGB image space is challenging due to large noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.

2603.02667 2026-05-19 cs.CV cs.LG 版本更新

Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

统一对比学习与生成目标以实现视觉理解和文本到图像生成

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Jianpeng Cheng, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra

Affiliations * MIT Computer Science & Artificial Intelligence Laboratory; Meta AI

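AI summary: This paper proposes the DREAM framework, which uses Masking Warmup to resolve the conflict between contrastive learning and text-to-image generation, improving performance on multiple tasks.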

详情

AI中文摘要

将文本-图像对比学习与文本到图像生成统一到一个端到端模型具有挑战性，因为两者需要不同的掩码策略：对比学习需要近完全可见的token，而掩码生成模型需要大量干扰。我们引入DREAM框架，通过Masking Warmup调度，在训练过程中逐步调整掩码分布的中心，使低和高掩码比率同时存在。这种共暴露使一个联合训练的编码器能够服务于两种目标。所得到的稳定优化解锁了语义对齐解码：在推理阶段，经过所有掩码比率训练的文本编码器可以评估部分生成的图像并选择最佳轨迹，仅需解码图像的12.5%，从而提高FID和吞吐量。DREAM在ImageNet线性探测（+1.1%）、5次转移（+4.1%）、ADE20K分割（+1.9%）和NYU深度估计（+6.25%）上优于CLIP，在CC12M FID上优于FLUID（+6.2%）的同时保持CLIP Score。这些收益表明，当正确统一文本-图像对比和生成目标时，它们是协同作用而非竞争。

英文摘要

Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs near-complete visible tokens, while masked generative modeling needs heavy corruption. We introduce DREAM, a unified framework that resolves this conflict through Masking Warmup, a schedule that shifts the center of the masking distribution over training, so low and high masking ratios coexist at every step. This co-exposure lets a single jointly-trained encoder serve both objectives. The resulting stable optimization unlocks Semantically Aligned Decoding at inference: the text encoder, trained against visual embeddings at all masking ratios, can score partially generated images and select the best trajectory with as little as 12.5% of the image decoded, improving both FID and throughput. DREAM outperforms its single-objective baselines, CLIP and FLUID: on ImageNet linear-probing (+1.1%), 5-shot transfer (+4.1%), ADE20K segmentation (+1.9%), and NYU depth estimation (+6.25%) over CLIP, and on CC12M FID (+6.2%) over FLUID while maintaining CLIP Score. Together, these gains show that text-image contrastive and generative objectives, when properly unified, are synergistic rather than competing.
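
As a rough illustration of a schedule that shifts the center of a masking distribution while keeping low and high ratios present at every step, consider the sketch below. The abstract does not give the distribution family, the direction of the shift, or any hyperparameters, so the clipped Gaussian, the linear low-to-high shift, and the values of `start`, `end`, and `std` are all assumptions. The function names are hypothetical.

```python
import random


def masking_center(step: int, warmup_steps: int,
                   start: float = 0.2, end: float = 0.8) -> float:
    """Center of the masking-ratio distribution, moved linearly during warmup."""
    if warmup_steps <= 0:
        return end
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return start + frac * (end - start)


def sample_mask_ratio(step: int, warmup_steps: int,
                      std: float = 0.25, rng=random) -> float:
    """Draw one masking ratio around the current center, clipped to [0, 1]."""
    ratio = rng.gauss(masking_center(step, warmup_steps), std)
    return min(max(ratio, 0.0), 1.0)
```

Because the assumed spread is wide relative to the range of the center, a batch drawn at any step contains both lightly and heavily masked samples, which is the co-exposure property the abstract emphasizes.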

2603.01993 2026-05-19 cs.CV 版本更新

Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

培养可推广的多模态操纵检测的推理能力

Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng

Affiliations * School of Software Engineering, Xi’an Jiaotong University; School of Computer Science and Information Engineering, Hefei University of Technology; CSIRO; Northwestern Polytechnical University; University of Macau

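AI summary: This paper proposes REFORM, a reasoning-driven framework that improves multimodal manipulation detection and its generalization, setting new state-of-the-art results on multiple datasets.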

Comments Accepted to ACL 2026

详情

AI中文摘要

近期生成式AI的发展显著提升了多模态媒体操纵的逼真度，给操纵检测带来了重大挑战。现有操纵检测和定位方法主要集中在结果导向的操纵类型分类，这不仅缺乏可解释性，还容易过拟合表面特征。本文认为，可推广的检测需要纳入显式的推理过程，而非仅分类有限的操纵类型。为此，我们提出REFORM，一个推理驱动的框架，将学习从结果拟合转向过程建模。REFORM采用三阶段课程，首先诱导推理依据，然后对齐推理与最终判断，最后通过强化学习优化逻辑一致性。为支持这一范式，我们引入ROM，一个具有丰富推理标注的大规模数据集。大量实验表明，REFORM在多个数据集上均取得新高准确率，包括在ROM上81.52%的准确率，在DGM4上76.65%的准确率，在MMFakeBench上74.9的F1分数。

英文摘要

Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.

2603.00952 2026-05-19 cs.CV 版本更新

Decoupling Motion and Geometry in 4D Gaussian Splatting

分离运动与几何的4D高斯点散射

Yi Zhang, Yulei Kang, Jiangxin Sun, Beihao Xia, Jisheng Dang, Jian-Fang Hu

Affiliations * Sun Yat-sen University; University of Trento; Huazhong University of Science and Technology; Lanzhou University

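AI summary: This paper proposes the VeGaS framework, which introduces a Galilean shearing matrix and a Geometric Deformation Network to decouple Gaussian motion from geometric attributes, improving the modeling of complex non-linear motion. Experiments show state-of-the-art performance on public datasets.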

详情

AI中文摘要

动态场景的高保真重建是一个重要但具有挑战性的问题。尽管最近的4D高斯点散射（4DGS）展示了建模时间动态的能力，但其将高斯运动和几何属性耦合在单一协方差公式中，限制了对复杂运动的表达能力，常导致视觉伪影。为此，我们提出VeGaS，一种基于速度的新型4D高斯点散射框架，通过引入伽利略剪切矩阵，显式纳入时间变化的速度，灵活建模复杂非线性运动，同时严格隔离高斯运动对几何相关条件高斯协方差的影响。此外，引入几何变形网络，利用时空上下文和速度线索细化高斯形状和方向，增强时间几何建模。在公开数据集上的大量实验表明，VeGaS实现了最先进的性能。

英文摘要

High-fidelity reconstruction of dynamic scenes is an important yet challenging problem. While recent 4D Gaussian Splatting (4DGS) has demonstrated the ability to model temporal dynamics, it couples Gaussian motion and geometric attributes within a single covariance formulation, which limits its expressiveness for complex motions and often leads to visual artifacts. To address this, we propose VeGaS, a novel velocity-based 4D Gaussian Splatting framework that decouples Gaussian motion and geometry. Specifically, we introduce a Galilean shearing matrix that explicitly incorporates time-varying velocity to flexibly model complex non-linear motions, while strictly isolating the effects of Gaussian motion from the geometry-related conditional Gaussian covariance. Furthermore, a Geometric Deformation Network is introduced to refine Gaussian shapes and orientations using spatio-temporal context and velocity cues, enhancing temporal geometric modeling. Extensive experiments on public datasets demonstrate that VeGaS achieves state-of-the-art performance.

2602.23058 2026-05-19 cs.CV cs.RO 版本更新

GeoWorld: Geometric World Models

GeoWorld：几何世界模型

Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley

Affiliations * ANU (Australian National University); MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

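AI summary: GeoWorld uses a Hyperbolic JEPA and Geometric Reinforcement Learning to address the weaknesses of existing energy-based predictive world models in capturing geometric structure and in long-horizon prediction. Experiments show success-rate gains of about 3% in 3-step planning and about 2% in 4-step planning.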

Comments Accepted to CVPR 2026

详情

AI中文摘要

基于能量的预测世界模型通过推理潜在能量景观进行多步视觉规划，但现有方法面临两个挑战：（i）其潜在表示通常在欧几里得空间中学习，忽略了状态间的几何和层次结构；（ii）难以进行长周期预测，导致扩展 rollout 中快速退化。为了解决这些挑战，我们引入GeoWorld，通过超几何JEPA将潜在表示从欧几里得空间映射到双曲流形，以保留几何结构和层次关系。我们进一步引入几何强化学习进行能量优化，实现双曲潜在空间中的稳定多步规划。在CrossTask和COIN上的广泛实验显示，与最先进的V-JEPA 2相比，在3步规划中性能提升约3%，在4步规划中提升约2%。项目网站：https://steve-zeyu-zhang.github.io/GeoWorld。

英文摘要

Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
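
To make the idea of mapping Euclidean latents onto a hyperbolic manifold concrete, here is a minimal sketch using the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model or curvature GeoWorld uses, so the Poincaré ball, the curvature parameter `c`, and the function names are assumptions made for illustration.

```python
import math
from typing import List


def exp_map_origin(v: List[float], c: float = 1.0) -> List[float]:
    """Map a Euclidean (tangent) vector at the origin into the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x: List[float], y: List[float], c: float = 1.0) -> float:
    """Geodesic distance between two points inside the Poincare ball of curvature -c."""
    def sq_norm(u: List[float]) -> float:
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)
```

Under this convention a vector of Euclidean norm r lands at hyperbolic distance 2r from the origin, and distances grow rapidly near the boundary of the ball, which is why hyperbolic space is often used to embed hierarchies.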

2602.19710 2026-05-19 cs.CV cs.LG cs.RO 版本更新

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

面向通用视觉-语言-动作策略的通用姿态预训练

Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu

Affiliations * Tencent Robotics X; Futian Laboratory; The Hong Kong University of Science and Technology; Fudan University; Shanghai Innovation Institute

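AI summary: This paper proposes Pose-VLA, which separates training into pre-training and post-training phases to address feature collapse and low training efficiency in vision-language-action models, extracting universal 3D spatial priors and efficiently aligning them with robot-specific action spaces.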

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project website: https://hetolin.github.io/PoseVLA

Journal ref Robotics: Science and Systems, 2026

详情

AI中文摘要

现有视觉-语言-动作（VLA）模型常因将高层感知与稀疏的、特定身体动作监督结合而出现特征坍塌和低训练效率。由于这些模型通常依赖优化用于视觉问答（VQA）的VLM主干，它们擅长语义识别但常忽视细微的3D状态变化，这些变化决定了不同的动作模式。为解决这些不一致，我们提出了Pose-VLA，一种解耦范式，将VLA训练分为预训练阶段以提取统一摄像机空间中的通用3D空间先验，以及后训练阶段以在机器人特定的动作空间中高效对齐。通过引入离散姿态标记作为通用表示，Pose-VLA无缝整合了来自不同3D数据集的空间接地与机器人演示中的几何级轨迹。我们的框架遵循一个两阶段预训练流程，通过姿态建立基本空间接地，然后通过轨迹监督实现运动对齐。广泛的评估显示，Pose-VLA在RoboTwin 2.0上实现了79.5%的平均成功率，并在LIBERO上表现出竞争力。现实世界实验进一步展示了在使用仅100个演示每任务的情况下，对多样化物体的鲁棒泛化能力，验证了我们预训练范式的效率。

英文摘要

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.

2602.18584 2026-05-19 cs.LG cs.AI cs.CV 版本更新

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

GIST: 通过耦合优化几何进行指令微调的目标数据选择

Guanghui Min, Tianhao Huang, Ke Wan, Chen Chen

Affiliations * Department of Computer Science, University of Virginia, Charlottesville, USA

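AI summary: This paper proposes GIST, which replaces axis-aligned scaling with subspace alignment to handle parameter coupling in parameter-efficient fine-tuning, enabling more efficient targeted data selection.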

Comments ICML 2026; 27 pages, 8 figures, 11 tables

详情

AI中文摘要

目标数据选择已成为高效指令微调中的关键范式，旨在为特定任务识别一小部分有影响力的训练示例。在实践中，影响力通常通过示例对参数更新的影响来衡量。为了使选择可扩展，许多方法利用优化器统计（如Adam状态）作为轴对齐的替代品，隐式地将参数视为坐标独立。我们证明在参数高效微调（PEFT）方法如LoRA中，这一假设在破裂。在这种情况下，诱导的优化几何表现出强跨参数耦合和非平凡的非对角交互，而任务相关的更新方向被限制在低维子空间中。受此不匹配的启发，我们提出GIST（梯度等距子空间转换），一种简单但原则性的替代方法，用稳健的子空间对齐替代轴对齐缩放。GIST通过奇异值分解（SVD）从验证梯度中恢复任务特定的子空间，将训练梯度投影到该耦合子空间，并通过与目标方向的对齐程度评分示例。大量实验表明，在相同的选择预算下，GIST仅使用0.29%的存储和25%的计算时间，与当前最先进的基线匹配或优于。

英文摘要

Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., diagonal precondition), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via singular value decomposition (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments have demonstrated that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.
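
The three steps named in this abstract (SVD of validation gradients, projection of training gradients, alignment scoring) can be sketched with NumPy as follows. This is not the paper's code: the choice of the mean validation gradient as the target direction and of cosine similarity as the alignment measure are assumptions, and `gist_style_scores` and `select_top_k` are hypothetical names.

```python
import numpy as np


def gist_style_scores(val_grads: np.ndarray, train_grads: np.ndarray, rank: int) -> np.ndarray:
    """Score training examples by alignment with a validation-gradient subspace.

    val_grads:   (n_val, d) per-example validation gradients
    train_grads: (n_train, d) per-example training gradients
    """
    # Right singular vectors of the validation gradients span the task subspace.
    _, _, vt = np.linalg.svd(val_grads, full_matrices=False)
    basis = vt[:rank]                              # (rank, d), orthonormal rows
    # Assumed target direction: the mean validation gradient, expressed in the subspace.
    target = basis @ val_grads.mean(axis=0)        # (rank,)
    projected = train_grads @ basis.T              # (n_train, rank)
    # Assumed alignment measure: cosine similarity inside the subspace.
    norms = np.linalg.norm(projected, axis=1) * np.linalg.norm(target)
    return (projected @ target) / (norms + 1e-12)


def select_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring training examples."""
    return np.argsort(-scores)[:k]
```

Only the `rank`-dimensional projections need to be stored per training example, which hints at how a subspace method can cut storage relative to keeping full gradient features.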

2602.12280 2026-05-19 cs.CV 版本更新

Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

惊喜之笔：向量素描中的渐进语义错觉

Huai-Hsun Cheng, Siang-Ling Zhang, Yu-Lun Liu

Affiliations * National Yang Ming Chiao Tung University

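AI summary: This paper introduces the Progressive Semantic Illusions task, in which a single sketch changes its semantics as strokes are added step by step, and proposes a dual-branch Score Distillation Sampling mechanism to handle the dual constraint, improving recognizability and illusion strength.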

Comments SIGGRAPH 2026. Project page: https://stroke-of-surprise.github.io/

详情

AI中文摘要

传统视觉错觉依赖于空间操纵，如多视角一致性。本文引入渐进语义错觉，一种新的向量素描任务，单幅素描通过逐步添加笔触经历剧烈语义变化。我们提出Stroke of Surprise生成框架，优化向量笔触以满足不同绘制阶段的语义解释。核心挑战在于双重约束：初始前缀笔触必须形成连贯对象（如鸭子），同时作为添加delta笔触后第二概念（如羊）的结构基础。为此，我们提出一种序列感知的联合优化框架，由双分支Score Distillation Sampling机制驱动。不同于冻结初始状态的顺序方法，我们的方法动态调整前缀笔触，发现适用于两个目标的

英文摘要

Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/

2602.11553 2026-05-19 cs.CV cs.AI 版本更新

Perception-based Image Denoising via Generative Compression

基于生成压缩的图像去噪

Nam Nguyen, Thinh Nguyen, Bella Bose

Affiliations * School of Electrical and Computer Engineering, Oregon State University, Corvallis, OR 97331, USA

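AI summary: This paper proposes a denoising framework based on generative compression, which improves denoising through entropy-coded latent representations and perceptual measures. Experiments show perceptual improvements while maintaining competitive distortion performance.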

详情

AI中文摘要

图像去噪旨在在去除噪声的同时保持结构细节和感知现实，但受扰动驱动的方法常产生过度平滑的重建，特别是在强噪声和分布偏移下。本文提出一种基于生成压缩的去噪框架，通过从熵编码的潜在表示中重建，强制低复杂度结构，同时通过感知度量如学习感知图像块相似性（LPIPS）损失和Wasserstein距离的生成解码器恢复真实纹理。介绍了两种互补的实例：(i) 基于条件Wasserstein GAN（WGAN）的压缩去噪器，明确控制速率-失真-感知（RDP）权衡；(ii) 基于条件扩散的重建策略，通过压缩潜在进行迭代去噪。进一步建立了在加性高斯噪声下的压缩最大似然去噪器的非渐近保证，包括重建误差和解码误差概率的界限。在合成和真实噪声基准上的实验显示了一致的感知改进，同时保持竞争性的失真性能。

英文摘要

Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.

2602.08167 2026-05-19 cs.RO cs.AI cs.CV cs.LG 版本更新

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

基于互联网规模知识的自监督行动预测具身推理

Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

Affiliations * Stanford; UC Berkeley; NVIDIA

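AI summary: This paper proposes R&B-EnCoRe, which uses self-supervised refinement so that models bootstrap embodied reasoning strategies from internet-scale knowledge, improving manipulation and navigation performance and reducing collision rates.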

Comments Robotics: Science and Systems (RSS) 2026

详情

AI中文摘要

具身链式思维（CoT）推理显著提升了视觉-语言-动作（VLA）模型，但当前方法依赖刚性模板指定推理原语（如场景中的物体、高层计划、结构 affordances）。这些模板可能迫使策略处理无关信息，干扰关键动作预测信号。我们引入R&B-EnCoRe，使模型通过自监督细化从互联网规模知识中自推导具身推理。通过将推理视为重要加权变分推断中的潜在变量，模型可生成并提炼无外部奖励、验证者或人工标注的具身特定策略训练数据集。我们在各种VLA架构中验证R&B-EnCoRe，应用于 manipulation（Franka Panda在仿真中，WidowX在硬件中）、legged导航（双足、轮式、自行车、四足）和自动驾驶具身，参数规模为1B、4B、7B和30B。我们的方法在 manipulation 成功率提升28%，导航评分提高101%，碰撞率减少21%。R&B-EnCoRe使模型提炼出预测成功控制的推理，避免手动标注工程，同时将互联网规模知识接地于物理执行。

英文摘要

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.

2602.06523 2026-05-19 cs.CV cs.HC 版本更新

MicroBi-ConvLSTM: An Ultra-Lightweight Efficient Model for Human Activity Recognition on Resource Constrained Devices

MicroBi-ConvLSTM：一种用于资源受限设备上人类活动识别的超轻量高效模型

Mridankan Mandal

Affiliations * Department of Information Technology; Indian Institute of Information Technology, Allahabad Prayagraj, India

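AI summary: This paper proposes MicroBi-ConvLSTM, an ultra-lightweight architecture built from two-stage convolutional feature extraction and a single bidirectional LSTM layer. While preserving linear complexity, it uses 2.9x fewer parameters than TinierHAR and 11.9x fewer than DeepConvLSTM, and is competitive on multiple benchmarks.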

详情

AI中文摘要

在资源受限的可穿戴设备上进行人类活动识别（HAR）需要在准确性和严格的内存和计算预算之间取得平衡。现有的轻量级架构如TinierHAR（34K参数）和TinyHAR（55K参数）虽然在准确率上表现优异，但考虑到操作系统开销后，超出了微控制器有限SRAM的内存预算。本文提出MicroBi-ConvLSTM，一种超轻量级卷积递归架构，通过双阶段卷积特征提取和4倍时间池化，以及单层双向LSTM，平均达到11.4K参数。这比TinierHAR减少了2.9倍的参数，比DeepConvLSTM减少了11.9倍，同时保持线性O(N)复杂度。在八个多样化的HAR基准测试中，MicroBi-ConvLSTM在超轻量级范围内保持了竞争力：在UCI-HAR上达到93.41%的宏F1，在SKODA装配手势上达到94.46%，在Daphnet步态冻结检测上达到88.98%。系统性消融揭示了任务依赖的组件贡献，其中双向性对事件检测有益，但在周期性运动中提供边际增益。在Raspberry Pi Pico 2和ESP32上的设备部署验证了硬件可行性，无论是INT8量化还是FP32全精度路径。在INT8量化下，MicroBi-ConvLSTM是唯一在两个平台上实现全部8/8数据集覆盖的架构，Pico 2的平均延迟为72.8毫秒，ESP32上的PyTorch一致性为97.9%。在FP32部署下，它在所有成功的配置中实现了100.0%的一致性（8/8 Pico 2，7/8 ESP32），证实了所有INT8保真度下降都是量化伪影，而不是架构限制。

英文摘要

Human Activity Recognition (HAR) on resource-constrained wearables requires models that balance accuracy against strict memory and computational budgets. State-of-the-art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed the memory budgets of microcontrollers with limited SRAM once operating system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional recurrent architecture achieving 11.4K parameters on average through two-stage convolutional feature extraction with 4x temporal pooling, and a single bidirectional LSTM layer. This represents a 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait freeze detection. Systematic ablation reveals task-dependent component contributions, where bidirectionality benefits episodic event detection but provides marginal gains on periodic locomotion. On-device deployment on the Raspberry Pi Pico 2 and ESP32 validates hardware viability under both INT8 quantized and FP32 full-precision paths. Under INT8 quantization, MicroBi-ConvLSTM is the only architecture achieving full 8/8 dataset coverage on both platforms, with 72.8 ms average latency on Pico 2 and 97.9% PyTorch parity on ESP32. Under FP32 deployment, it achieves 100.0% parity on all successful configurations (8/8 Pico 2, 7/8 ESP32), confirming that all INT8 fidelity degradation is a quantization artifact rather than an architectural limitation.

2602.06037 2026-05-19 cs.CV 版本更新

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

基于几何的思考：用于空间推理的主动几何整合

Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, JiaWang Bian, Hang Xu, Xiaodan Liang

Affiliations * Shenzhen campus of Sun Yat-sen University; Yinwang Intelligent Technology Co. Ltd.; Shanghai Jiao Tong University; Shanghai Innovation Institute; Nanyang Technological University

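AI summary: This paper proposes GeoThinker, a framework that improves spatial reasoning through an active perception mechanism. Spatial-Grounded Fusion and Importance Gating integrate geometry and semantics precisely, improving spatial intelligence performance.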

详情

AI中文摘要

近期多模态大语言模型在空间推理中的进展越来越多地利用3D编码器中的几何先验。然而，现有整合策略大多被动：几何作为全局流暴露并以无差别方式融合，常导致语义-几何不一致和冗余信号。我们提出GeoThinker，框架将范式从被动融合转向主动感知。不同于特征混合，GeoThinker使模型能够根据内部推理需求选择性检索几何证据。GeoThinker通过在精心选择的VLM层应用空间接地融合实现此目标，其中语义视觉先验通过帧严格交叉注意力查询并整合任务相关的几何信息，进一步通过重要性门控校准，使每帧注意力偏向任务相关的结构。全面评估结果表明，GeoThinker在空间智能上达到新的状态-of-the-art，达到VSI-Bench峰值72.6分。此外，GeoThinker在复杂下游场景中表现出鲁棒的泛化能力和显著提升的空间感知，包括具身指称和自动驾驶。我们的结果表明，主动整合空间结构的能力对于下一代空间智能至关重要。代码可在https://github.com/Li-Hao-yuan/GeoThinker找到。

英文摘要

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.

2602.04802 2026-05-19 cs.CV 版本更新

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

VISTA-Bench: 视觉-语言模型是否真的能像纯文本一样理解可视化文本？

Qing'an Liu, Juntong Feng, Yuhao Wang, Xinzhe Han, Yujie Cheng, Yue Zhu, Haiwen Diao, Yunzhi Zhuge, Huchuan Lu

Affiliations * Dalian University of Technology; Nanyang Technological University

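AI summary: VISTA-Bench contrasts pure-text and visualized-text questions to expose a modality gap in vision-language models: performance drops substantially on visualized text even when the semantics are identical.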

Comments 32 pages, 16 figures

详情

AI中文摘要

视觉-语言模型（VLMs）在跨模态理解方面取得了显著进展，但现有基准主要关注纯文本查询。本文引入VISTA-Bench，一个涵盖多模态感知、推理和单模态理解的系统性基准，通过受控渲染条件对比纯文本和可视化文本问题，评估模型对可视化文本的理解能力。对超过30个代表性VLMs的评估揭示了显著的模态差距：在纯文本表现优异的模型在同等语义内容以可视化文本呈现时显著退化。这一差距随着感知难度的增加而加剧，凸显了模型对渲染变化的敏感性。VISTA-Bench提供了一个原则性的评估框架，用于诊断这一限制并指导向更统一的语言表征（tokenized文本和像素）的发展。

英文摘要

Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such input requests comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 30 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels.

2602.00470 2026-05-19 cs.CV 版本更新

FG-TreeSeg: Flow-Guided Tree Crown Segmentation without Instance Annotations

Pengyu Chen, Fangzheng Lyu, Sicheng Wang, Cuizhen Wang

Affiliations: Department of Geography, University of South Carolina; Department of Geography, Virginia Polytechnic Institute and State University

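AI summary: This paper proposes FG-TreeSeg, which models tree crowns as star-convex objects in a topological flow field and uses Cellpose-SAM to segment crown instances without annotations; experiments show that it generalizes well across sensor types and canopy densities.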

Comments 5 pages, 8 figures

Journal ref IEEE Geoscience and Remote Sensing Letters, 2026

DOI: 10.1109/LGRS.2026.3693969

Abstract

Individual tree crown segmentation is an important task in remote sensing for forest biomass estimation and ecological monitoring. However, accurate delineation in dense, overlapping canopies remains a bottleneck. While supervised deep learning methods suffer from high annotation costs and limited generalization, emerging foundation models (e.g., Segment Anything Model) often lack domain knowledge, leading to under-segmentation in dense clusters. To bridge this gap, we propose FG-TreeSeg, a training-free framework for tree crown instance segmentation that transfers flow-based delineation from biomedical imaging to remote sensing. By modeling tree crowns as star-convex objects within a topological flow field using Cellpose-SAM, the FG-TreeSeg framework forces the separation of touching tree crown instances based on vector convergence. Experiments on the NEON and BAMFOREST datasets, together with visual inspection, demonstrate that our framework generalizes robustly across diverse sensor types and canopy densities, offering a training-free solution for tree crown instance segmentation and label generation.

2601.21531 2026-05-19 cs.CR cs.AI cs.CV Version update

On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye, Tianwei Zhang, Haibo Hu

Affiliations: The Hong Kong Polytechnic University, Hong Kong; Nanyang Technological University, Singapore; Chongqing University, Chongqing, China; Research Centre for Privacy and Security Technologies in Future Smart Systems, PolyU

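AI summary: This paper studies how visual token compression affects the adversarial robustness of large vision-language models and proposes the CAGE attack, which aligns perturbation optimization with compressed inference to expose robustness vulnerabilities under compression.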

Comments Accepted by ICML 2026

Abstract

Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks cannot fully disclose the robustness vulnerabilities of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
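
The idea of concentrating distortion on tokens that are likely to survive across plausible budgets can be illustrated with a small sketch. It assumes, purely for illustration, that compression keeps the top-k tokens by a rank score and that every budget in a candidate list is equally likely; the function names and numbers are hypothetical and are not taken from the paper.

```python
def survival_probabilities(scores, budgets):
    """Fraction of budgets under which each token survives top-k pruning."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank = {token: position for position, token in enumerate(order)}
    return [sum(1 for k in budgets if rank[i] < k) / len(budgets)
            for i in range(len(scores))]


def expected_disruption(distortions, probabilities):
    """Distortion weighted by each token's chance of surviving compression."""
    return sum(p * d for p, d in zip(probabilities, distortions))


scores = [0.9, 0.1, 0.5, 0.3]   # hypothetical rank scores, one per token
budgets = [1, 2, 3]             # hypothetical token budgets, equally likely
probs = survival_probabilities(scores, budgets)

# Same total distortion, spread evenly vs. placed on likely survivors.
uniform = expected_disruption([1.0, 1.0, 1.0, 1.0], probs)
focused = expected_disruption([2.0, 0.0, 2.0, 0.0], probs)
```

Under these assumptions the focused allocation yields a larger expected disruption than the uniform one, which is the intuition behind optimizing for tokens that survive compression.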

2601.21458 2026-05-19 cs.CV Version update

Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization

Midou Guo, Qilin Yin, Wei Lu, Rui Yang

Affiliations: School of Computer Science and Engineering; Sun Yat-sen University; Alibaba Group

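AI summary: This paper proposes the RT-DeepLoc framework, which identifies deepfakes via reconstruction error: a Masked Autoencoder (MAE) learns the spatiotemporal patterns of authentic data, and an Asymmetric Intra-video Contrastive Loss (AICL) improves localization accuracy. Experiments on large-scale datasets show state-of-the-art weakly supervised temporal forgery localization.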

Abstract

Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization to mitigate severe digital security risks. The prohibitive cost of frame-level annotation makes weakly supervised methods, which rely only on video-level labels, a practical necessity. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for accurate localization without demanding dense human annotations. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries by advanced generative models. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.
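
To make the reconstruction-error cue concrete, the following is a minimal sketch that scores each frame by its mean squared reconstruction error and groups high-error frames into segments. The fixed threshold is a simplifying assumption made here for illustration; the abstract describes using these cues to guide a contrastive loss, not thresholding them directly, and the helper names are hypothetical.

```python
def frame_errors(original, reconstructed):
    """Mean squared error per frame; each frame is a flat list of floats."""
    return [sum((a - b) ** 2 for a, b in zip(frame, recon)) / len(frame)
            for frame, recon in zip(original, reconstructed)]


def segments_above(errors, threshold):
    """Inclusive (start, end) index pairs of runs with error > threshold."""
    segments, start = [], None
    for index, error in enumerate(errors):
        if error > threshold and start is None:
            start = index                        # a suspicious run begins
        elif error <= threshold and start is not None:
            segments.append((start, index - 1))  # the run ended one frame ago
            start = None
    if start is not None:                        # run reaches the last frame
        segments.append((start, len(errors) - 1))
    return segments


errors = [0.01, 0.02, 0.5, 0.6, 0.03, 0.7]   # hypothetical per-frame errors
suspicious = segments_above(errors, threshold=0.1)
```

With these made-up errors, frames 2 to 3 and frame 5 are flagged, which mirrors the localized and intermittent manipulations the abstract targets.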

2601.20306 2026-05-19 cs.CV Version update

TPGDiff: Hierarchical Triple-Prior Guided Diffusion for Image Restoration

Yanjie Tu, Qingsen Yan, Axi Niu, Jiacong Tang

Affiliations: School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Shenzhen Research Institute of Northwestern Polytechnical University, Shenzhen, China

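AI summary: TPGDiff integrates degradation, structural, and semantic priors to provide hierarchical guidance for image restoration, improving reconstruction of severely degraded regions.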

Abstract

All-in-one image restoration aims to address diverse degradation types using a single unified model. Existing methods typically rely on degradation priors to guide restoration, yet often struggle to reconstruct content in severely degraded regions. Although recent works leverage semantic information to facilitate content generation, integrating it into the shallow layers of diffusion models often disrupts spatial structures (e.g., blurring artifacts). To address this issue, we propose a Triple-Prior Guided Diffusion (TPGDiff) network for unified image restoration. TPGDiff incorporates degradation priors throughout the diffusion trajectory, while introducing structural priors into shallow layers and semantic priors into deep layers, enabling hierarchical and complementary prior guidance for image reconstruction. Specifically, we leverage multi-source structural cues as structural priors to capture fine-grained details and guide shallow-layer representations. To complement this design, we further develop a distillation-driven semantic extractor that yields robust semantic priors, ensuring reliable high-level guidance at deep layers even under severe degradations. Furthermore, a degradation extractor is employed to learn degradation-aware priors, enabling stage-adaptive control of the diffusion process across all timesteps. Extensive experiments on both single- and multi-degradation benchmarks demonstrate that TPGDiff achieves superior performance and generalization across diverse restoration scenarios. Our project page is: https://leoyjtu.github.io/tpgdiff-project.

2601.16527 2026-05-19 cs.LG cs.AI cs.CL cs.CV Version update

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang

Affiliations: College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Institute for AI, Tsinghua University; Huzhou University; Institute of Dataspace, Hefei Comprehensive National Science Center; University of Science and Technology of China

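AI summary: This paper proposes SARE, which uses targeted min-max optimization and a Targeted-SAM mechanism to robustly erase hallucinations in multimodal large language models, improving both stability and erasure efficacy.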

Abstract

Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal that remains stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.
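
For readers unfamiliar with sharpness-aware minimization (SAM), the min-max idea can be sketched as a two-step update: first move the weights to an approximate worst-case point within a small radius, then descend using the gradient measured there. The sketch below is plain SAM on a toy quadratic loss, not the paper's Targeted-SAM variant; `sam_step`, `quadratic_grad`, and all hyperparameter values are illustrative assumptions.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One plain SAM update: ascend by rho along the gradient, then descend."""
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(weights)              # flat point: nothing to perturb
    # Inner max: first-order worst-case perturbation of radius rho.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    # Outer min: apply the gradient taken at the perturbed point.
    grad_at_perturbed = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, grad_at_perturbed)]


def quadratic_grad(weights, curvature=(2.0, 1.0)):
    """Gradient of the toy loss 0.5 * sum(c_i * w_i ** 2)."""
    return [c * w for c, w in zip(curvature, weights)]


updated = sam_step([1.0, 0.0], quadratic_grad)
```

Because the descent direction is evaluated at the perturbed weights, updates are biased toward regions where the loss stays low under small weight shifts, which is the flatness property the abstract relies on.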

2601.02353 2026-05-19 cs.CV cs.LG Version update

Meta-Learning Guided Pruning for Few-Shot Plant Pathology on Edge Devices

Mohammed Mudassir Uddin, Shahnawaz Alam, Mohammed Kaif Pasha, Dr Tasneem Bano Rehman, Dr Fahmina Taranum, Afroze Begum

Affiliations: Department of CSE, Muffakham Jah College of Engineering and Technology (MJCET)

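AI summary: This paper proposes DACIS, which combines neural network pruning with few-shot learning for efficient plant disease identification on edge devices; experiments show a 78% reduction in model size while retaining 92.3% of the original accuracy.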

Abstract

Farmers in remote areas need quick and reliable methods for identifying plant diseases, yet they often lack access to laboratories or high-performance computing resources. Deep learning models can detect diseases from leaf images with high accuracy, but these models are typically too large and computationally expensive to run on low-cost edge devices such as the Raspberry Pi. Furthermore, collecting thousands of labeled disease images for training is both expensive and time-consuming. This paper addresses both challenges by combining neural network pruning, which removes unnecessary parts of the model, with few-shot learning, which enables the model to learn from limited examples. It proposes Disease-Aware Channel Importance Scoring (DACIS), a method that identifies which parts of the neural network are most important for distinguishing between different plant diseases, integrated into a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline. Experiments on the PlantVillage and PlantDoc datasets demonstrate that the proposed approach reduces model size by 78% while maintaining 92.3% of the original accuracy, with the compressed model running at 7 frames per second on a Raspberry Pi 4, making real-time field diagnosis practical for smallholder farmers.

2512.23180 2026-05-19 cs.CV Version update

GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang

Affiliations: Shanghai Jiao Tong University; Tsinghua University; MEGVII Technology; Mach Drive

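AI summary: This paper proposes a unified driving world model framework based on a 3D Gaussian scene representation that supports both 3D scene understanding and multi-modal generation. A language-guided sampling strategy and a dual-condition generation model improve generation quality, and experiments on the nuScenes and NuInteract datasets show superior performance.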

Comments Accepted by CVPR 2026

Abstract

Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub: https://github.com/dtc111111/GaussianDWM.

2512.16085 2026-05-19 cond-mat.mtrl-sci cs.CV Version update

Machine Learning Enabled Graph Analysis of Particulate Composites: Application to Solid-state Battery Cathodes

Zebin Li, Shimao Deng, Yijin Liu, Jia-Mian Hu

Affiliations: Department of Materials Science and Engineering, University of Wisconsin-Madison; Walker Department of Mechanical Engineering, University of Texas at Austin

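AI summary: This paper proposes a machine-learning-based framework that converts multimodal X-ray images into topology-aware graphs for analyzing microstructure-property relationships in particulate composites; using solid-state battery cathodes as an example, it confirms the importance of triple phase junctions and ion/electron conduction channels.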

DOI: 10.1021/acsenergylett.5c04258

Abstract

Particulate composites underpin many solid-state chemical and electrochemical systems, where microstructural features such as multiphase boundaries and inter-particle connections strongly influence system performance. Advances in X-ray microscopy enable capturing large-scale, multimodal images of these complex microstructures with an unprecedentedly high throughput. However, harnessing these datasets to discover new physical insights and guide microstructure optimization remains a major challenge. Here, we develop a machine learning (ML) enabled framework that automatically transforms experimental multimodal X-ray images of multiphase particulate composites into scalable, topology-aware graphs for extracting physical insights and establishing local microstructure-property relationships at both the particle and network levels. Using the multiphase particulate cathode of solid-state lithium batteries as an example, our ML-enabled graph analysis corroborates the critical role of triple phase junctions and concurrent ion/electron conduction channels in realizing desirable local electrochemical activity. Our work establishes graph-based microstructure representation as a powerful paradigm for bridging multimodal experimental imaging and functional understanding, and for facilitating microstructure-aware data-driven materials design in a broad range of particulate composites.

2512.07245 2026-05-19 cs.CV Version update

Zero-Shot Textual Explanations via Translating Decision-Critical Features

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

Affiliations: Chiba University; National Institute of Informatics

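AI summary: This paper proposes TEXTER, which isolates decision-critical features to generate more accurate textual explanations and improve model interpretability.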

Comments Accepted to CVPR 2026 Findings

Abstract

Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons, i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER provides more faithful and interpretable explanations than existing methods. The code is available at https://github.com/tttt-0814/TEXTER.

2512.04331 2026-05-19 cs.CV Version update

Open Set Face Forgery Detection via Dual-Level Evidence Collection

Zhongyi Cai, Bryce Gernon, Wentao Bao, Yifan Li, Matthew Wright, Yu Kong

Affiliations: Michigan State University; Rochester Institute of Technology

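AI summary: This paper proposes a dual-level evidential detection method for identifying novel forgery categories, using uncertainty estimation to improve real-world applicability; experiments show that it outperforms existing methods at identifying novel fake categories.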

Comments Accepted at IEEE FG 2026

Abstract

The surge in face forgeries has increasingly undermined confidence in the authenticity of online content. As generation algorithms rapidly evolve, new fake categories will constantly emerge, severely challenging existing face forgery detection methods. Although face forgery detection has recently improved, current techniques remain largely confined to binary Real-vs-Fake classification or the recognition of known fake categories. Moreover, they fail to identify the emergence of entirely new forgery methods. In this work, we study the Open Set Face Forgery Detection (OSFFD) problem, which requires the detection model to identify novel fake categories. To enhance its real-world applicability, we reformulate the OSFFD problem and address it through uncertainty estimation. Specifically, we propose the Dual-Level Evidential face forgery Detection (DLED) approach, which estimates prediction uncertainty by extracting and integrating category-specific evidence on the spatial and frequency levels. Comprehensive experiments across diverse settings demonstrate that our proposed DLED approach achieves state-of-the-art performance. Notably, it surpasses various existing baseline models by a 20% margin on average when identifying forgeries from novel fake categories. Concurrently, our DLED method yields competitive performance on the standard binary Real-versus-Fake face forgery detection task.
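
Evidence-based uncertainty estimation can be illustrated with the standard evidential (Dirichlet) formulation, in which non-negative per-class evidence is turned into class probabilities and a single uncertainty value. This is a generic sketch, not the DLED implementation: fusing the spatial and frequency levels by simple addition is an assumption made here for illustration, and all numbers are made up.

```python
def dirichlet_uncertainty(evidence):
    """Return (probabilities, uncertainty) for non-negative class evidence."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]     # Dirichlet parameters
    strength = sum(alpha)                   # total Dirichlet strength S
    probabilities = [a / strength for a in alpha]
    uncertainty = num_classes / strength    # u = K / S, lies in (0, 1]
    return probabilities, uncertainty


def fuse_evidence(spatial, frequency):
    """Hypothetical fusion rule: add the evidence from the two levels."""
    return [s + f for s, f in zip(spatial, frequency)]


known = fuse_evidence([9.0, 0.5, 0.5], [6.0, 0.0, 0.0])   # strong evidence
novel = fuse_evidence([0.2, 0.1, 0.3], [0.1, 0.2, 0.1])   # little evidence
_, u_known = dirichlet_uncertainty(known)
_, u_novel = dirichlet_uncertainty(novel)
```

A sample that collects little evidence for every known category receives high uncertainty, which is what allows it to be flagged as a possible novel forgery type rather than forced into a known class.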

2512.04329 2026-05-19 cs.CV cs.SE Version update

A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

Waleed Khalid, Dmitry Ignatov, Radu Timofte

Affiliations: Computer Vision Lab, CAIDAS, University of Würzburg, Germany

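AI summary: This paper proposes NN-RAG, which uses retrieval-augmented generation to extract and validate modules from neural network codebases, enabling cross-repository architecture migration and duplicate detection and improving the reproducibility and diversity of neural network architectures.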

Abstract

Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion -- ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework's neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code. Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
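
The three de-duplication levels named in the abstract (exact, lexical, structural) can be sketched with the Python standard library. This is one plausible reading of those levels, not the NN-RAG implementation: exact compares raw text, lexical compares token streams with comments and layout removed, and structural compares ASTs after renaming identifiers. All function names are hypothetical.

```python
import ast
import hashlib
import io
import tokenize


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_key(source):
    """Hash of the raw source text."""
    return _digest(source)


def lexical_key(source):
    """Hash of the token stream, ignoring comments and indentation width."""
    layout = (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE)
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue
        parts.append(tokenize.tok_name[tok.type] if tok.type in layout
                     else tok.string)
    return _digest(" ".join(parts))


class _Rename(ast.NodeTransformer):
    """Replace variable, argument, function and class names with '_'."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_ClassDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node


def structural_key(source):
    """Hash of the AST dump after identifier renaming."""
    return _digest(ast.dump(_Rename().visit(ast.parse(source))))


def duplicate_level(first, second):
    """Return the first level at which two blocks match, or None."""
    levels = (("exact", exact_key), ("lexical", lexical_key),
              ("structural", structural_key))
    for name, key in levels:
        if key(first) == key(second):
            return name
    return None
```

For example, adding a comment to a function leaves it a lexical duplicate, while renaming the function and its argument leaves it a structural duplicate. Attribute names and literal values are not normalized in this sketch, so blocks that differ in those remain distinct.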

2511.19953 2026-05-19 cs.CV Version update

Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting

Wen Zhang, Qin Ren, Wenjing Liu, Haibin Ling, Chenyu You

Affiliations: Stony Brook University; Johns Hopkins University

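AI summary: This paper proposes the SPROUT framework, which builds slide-specific reference prototypes from histological priors and guides feature alignment with a partial optimal transport scheme, allowing SAM to segment nuclei precisely without training.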

Comments ICML 2026; 44 pages, 25 figures, 26 tables; Code at https://github.com/Y-Research-SBU/SPROUT

Abstract

Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.

2511.18801 2026-05-19 cs.CV Version update

PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion

Yichen Yang, Hong Li, Haodong Zhu, Linin Yang, Guojun Lei, Sheng Xu, Baochang Zhang

Affiliations: Beihang University; Communication University of China; Zhejiang University

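AI summary: PartDiffuser is a semi-autoregressive diffusion framework that generates high-fidelity 3D meshes part by part, balancing global structure against local detail.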

Abstract

Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.

2511.14223 2026-05-19 cs.CV Version update

StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng, Xinyu Zhu, Fan Jia, Xiaowei Zhou, Hujun Bao

Affiliations: State Key Laboratory of CAD&CG, Zhejiang University; Ant Group; College of Computer Science, Zhejiang University

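AI summary: This paper proposes StreamingTalker, an autoregressive diffusion model that addresses the latency of long audio inputs and the poor handling of audio beyond the training horizon in audio-driven 3D facial animation, generating high-quality facial motion in real time from a dynamic condition.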

Abstract

This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly on audio sequences that exceed the training horizon, and they suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.

2511.07329 2026-05-19 cs.LG cs.CV Version update

Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis

Yash Mittal, Dmitry Ignatov, Radu Timofte

Affiliations: Computer Vision Lab, CAIDAS, University of Würzburg

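AI summary: This paper proposes the FractalNet framework, which automatically generates and evaluates convolutional neural network architectures from recursive template patterns for efficient, stable architecture exploration; fractal architectures reach a peak accuracy of 80.18% after five training epochs.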

英文摘要

This paper proposes FractalNet, a framework based on fractal design principles that automatically generates and evaluates convolutional neural network (CNN) architectures using recursive template patterns. Rather than relying on computationally expensive Neural Architecture Search (NAS) methods, the framework explores a structured architecture space defined by recursive fractal templates that systematically vary key parameters such as fractal depth, column width, and layer configurations. The framework consists of three core components: a generator that produces candidate architectures via controlled permutations of convolutional, normalization, activation, and dropout layers; a fractal template module that enforces recursive multi-path structural patterns; and a runner module that manages model training, evaluation, and logging. Using this system, over 1,200 distinct CNN architectures were automatically generated and evaluated on the CIFAR-10 image classification benchmark. Training was performed in PyTorch using stochastic gradient descent with Automatic Mixed Precision (AMP) and gradient checkpointing to reduce computational overhead. Experimental results demonstrate that fractal-based architectures exhibit stable training dynamics and achieve competitive performance, with an average validation accuracy of 60-70% and a peak accuracy of 80.18% after only five training epochs. These findings suggest that recursive fractal structures provide an effective means of balancing network depth and width while supporting large-scale automated architecture exploration. The proposed framework offers a resource-efficient and interpretable approach to systematic neural architecture experimentation.
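
To make the idea of a recursive fractal template concrete, here is a small sketch. It assumes the classic fractal expansion rule from the original FractalNet design (a block with C columns is a join of two stacked blocks with C-1 columns and a single convolution); the abstract does not state the exact template used in this work, so the rule, the tuple encoding, and the function names are illustrative assumptions.

```python
from typing import Tuple, Union

Block = Union[str, Tuple]  # "conv" or ("join" | "seq", child, child)


def fractal(columns: int) -> Block:
    """Expand a fractal block with the given number of columns."""
    if columns < 1:
        raise ValueError("columns must be >= 1")
    if columns == 1:
        return "conv"  # base case: a single convolutional layer
    sub = fractal(columns - 1)
    # Deep path: two stacked sub-blocks; shallow path: one convolution.
    return ("join", ("seq", sub, sub), "conv")


def count_convs(block: Block) -> int:
    """Total number of convolutional layers in the block."""
    if block == "conv":
        return 1
    return sum(count_convs(child) for child in block[1:])


def longest_path(block: Block) -> int:
    """Number of convolutions on the deepest input-to-output path."""
    if block == "conv":
        return 1
    tag, *children = block
    depths = [longest_path(child) for child in children]
    # Sequential children add up; joined children run in parallel.
    return sum(depths) if tag == "seq" else max(depths)


if __name__ == "__main__":
    for c in range(1, 5):
        block = fractal(c)
        print(c, count_convs(block), longest_path(block))
```

Under this rule a block with C columns contains 2^C - 1 convolutions and its deepest path has 2^(C-1) of them, which is why a single depth parameter is enough to trade off network depth against width.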

2510.16416 2026-05-19 cs.CV cs.AI 版本更新

Continual Learning for Vision-Language Models: A Survey and Taxonomy Beyond Forgetting

Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian

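AI summary: This survey reviews the challenges of continual learning for vision-language models, proposes a taxonomy of four core paradigms that address cross-modal feature drift and catastrophic forgetting, and highlights compositional zero-shot learning and agentic ecosystems as future directions.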

AI中文摘要

视觉语言模型（VLMs）和近期多模态大语言模型（MLLMs）通过前所未有的跨模态对齐和零样本泛化革新了人工智能。然而，使它们能够从非平稳数据中持续学习仍是一个重大挑战，因为它们的跨模态对齐和泛化能力特别容易受到灾难性遗忘的影响。不同于传统单模态持续学习（CL），VLMs面临独特的挑战，如跨模态特征漂移、由于共享架构导致的参数干扰以及零样本能力侵蚀。此外，生成式MLLMs表现出一种独特的“对齐税”，其中灾难性遗忘不仅表现为事实性遗忘，还表现为深度链式思维（CoT）推理的系统性崩溃。本文首次全面、诊断性地回顾了预测VLMs和生成式MLLMs的持续学习。我们系统地分解了上述失败模式，并提出了一个以挑战为导向的分类，包括四个核心范式：（1）多模态重播策略解决显式和隐式记忆漂移；（2）跨模态正则化强制拓扑和几何对齐；（3）参数高效适应利用动态路由和子空间投影；以及新兴的（4）模型融合与解耦范式。我们批判性地分析了评估协议的演变，强调了向双轨基准（领域 vs. 能力 CL）和微诊断 CoT 评估的转变。最后，我们绘制了未来研究的路线图，强调组合式零样本学习、具身AI与传感器融合以及自主智能体生态系统。所有资源均可在：https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models 上找到。

英文摘要

Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. Furthermore, generative MLLMs exhibit a unique "alignment tax," where catastrophic forgetting manifests not merely as factual amnesia, but as a systemic collapse of deep Chain-of-Thought (CoT) reasoning. This survey presents the first comprehensive, diagnostic review bridging continual learning for both predictive VLMs and generative MLLMs. We systematically deconstruct the aforementioned failure modes and propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms. We critically analyze the evolution of evaluation protocols, highlighting the essential shift toward dual-track benchmarks (Domain vs. Ability CL) and micro-diagnostic CoT evaluations. Finally, we chart a roadmap for future research, emphasizing compositional zero-shot learning, embodied AI with sensor fusion, and autonomous agentic ecosystems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.

2507.12969 2026-05-19 cs.LG cs.CV 版本更新

WaveletInception Networks for on-board Vibration-Based Infrastructure Health Monitoring

小波 inception 网络用于车载振动基基础设施健康监测

Reza Riahi Samani, Alfredo Nunez, Bart De Schutter

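Affiliations: Delft Center for Systems and Control (DCSC), Delft University of Technology; Section of Railway Engineering, Department of Engineering Structures, Delft University of Technology

AI summary: This paper proposes the WaveletInception-BiGRU network, which extracts spectral features with a learnable wavelet packet transform, learns multi-scale features with Inception-residual modules, and integrates temporal dependencies with BiGRU modules, enabling analysis of vibration signals without explicit preprocessing and improving the accuracy and automation of on-board infrastructure health monitoring.

Comments Under review for the Journal of Engineering Application of Artificial Intelligence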

DOI: 10.1016/j.engappai.2026.113976

AI中文摘要

本文提出了一种深度学习框架，用于分析车载振动响应信号以进行基础设施健康监测。所提出的WaveletInception-BiGRU网络采用可学习的小波包变换（LWPT）进行早期频谱特征提取，随后通过一维Inception-残差网络（1D Inception-ResNet）模块进行多尺度、高级特征学习。双向门控循环单元（BiGRU）模块则整合时间依赖性，并纳入操作条件，如测量速度。该方法使能够有效分析在不同速度下记录的振动信号，无需显式信号预处理。序列估计头进一步利用双向时间信息，产生准确的基础设施健康局部评估。最终，该框架生成高分辨率的空间映射健康配置文件。针对轨道刚度回归和过渡区分类的案例研究显示，所提出的框架显著优于现有方法，证明了其在准确、局部化和自动化车载基础设施健康监测中的潜力。

英文摘要

This paper presents a deep learning framework for analyzing on-board vibration response signals in infrastructure health monitoring. The proposed WaveletInception-BiGRU network uses a Learnable Wavelet Packet Transform (LWPT) for early spectral feature extraction, followed by one-dimensional Inception-Residual Network (1D Inception-ResNet) modules for multi-scale, high-level feature learning. Bidirectional Gated Recurrent Unit (BiGRU) modules then integrate temporal dependencies and incorporate operational conditions, such as the measurement speed. This approach enables effective analysis of vibration signals recorded at varying speeds, eliminating the need for explicit signal preprocessing. The sequential estimation head further leverages bidirectional temporal information to produce an accurate, localized assessment of infrastructure health. Ultimately, the framework generates high-resolution health profiles spatially mapped to the physical layout of the infrastructure. Case studies involving track stiffness regression and transition zone classification using real-world measurements demonstrate that the proposed framework significantly outperforms state-of-the-art methods, underscoring its potential for accurate, localized, and automated on-board infrastructure health monitoring.

2506.11925 2026-05-19 cs.AR cs.AI cs.CV cs.LG 版本更新

Real-World Deployment of a Lane Change Prediction Architecture Based on Knowledge Graph Embeddings and Bayesian Inference

基于知识图谱嵌入和贝叶斯推断的车道变换预测架构的现实世界部署

M. Manzour, Catherine M. Elias, Omar M. Shehata, R. Izquierdo, M. A. Sotelo

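Affiliations: Department of Computer Engineering, University of Alcalá; Department of Computer Science, German University in Cairo; Department of Mechatronics, German University in Cairo

AI summary: This paper presents a lane-change prediction system based on knowledge graph embeddings and Bayesian inference, validated on real hardware to bridge algorithmic advances and on-road deployment; it anticipates the target vehicle's lane change three to four seconds in advance, supporting safe maneuvers.

Journal ref 2025 IEEE International Conference on Vehicular Electronics and Safety (ICVES)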

DOI: 10.1109/ICVES65691.2025.11376512

AI中文摘要

近年来，车道变换预测研究取得显著进展，但大多数研究局限于仿真或数据集结果，未能实现算法与道路部署的结合。本文通过现实硬件展示了基于知识图谱嵌入（KGEs）和贝叶斯推断的车道变换预测系统。该系统包含感知模块和预测模块：感知模块感知环境，提取数值特征并转换为语言类别，与预测模块通信；预测模块执行KGE和贝叶斯推断模型，预测目标车辆的行驶动作并转换为纵向制动动作。现实硬件实验验证表明，该预测系统能提前3-4秒预测目标车辆的车道变换，为自动驾驶车辆提供充足反应时间，确保车道变换安全。

英文摘要

Research on lane change prediction has gained considerable momentum in the last couple of years. However, most research is confined to simulation or results obtained from datasets, leaving a gap between algorithmic advances and on-road deployment. This work closes that gap by demonstrating, on real hardware, a lane-change prediction system based on Knowledge Graph Embeddings (KGEs) and Bayesian inference. Moreover, the ego vehicle employs a longitudinal braking action to ensure the safety of both itself and the surrounding vehicles. Our architecture consists of two modules: (i) a perception module that senses the environment, derives numerical input features, converts them into linguistic categories, and communicates them to the prediction module; and (ii) a pretrained prediction module that executes a KGE and Bayesian inference model to anticipate the target vehicle's maneuver and transforms the prediction into a longitudinal braking action. Real-world hardware experimental validation demonstrates that our prediction system anticipates the target vehicle's lane change three to four seconds in advance, providing the ego vehicle sufficient time to react and allowing the target vehicle to make the lane change safely.

2506.05442 2026-05-19 cs.CV cs.AI 版本更新

Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

结构化标注加速面向端到端自动驾驶的视觉-语言模型

Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang

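Affiliations: Shanghai Jiao Tong University; KargoBot

AI summary: This paper introduces NuScenes-S, a dataset with structured annotations, and FastDrive, a compact model, to improve the efficiency and accuracy of decision-making in autonomous driving; experiments show strong performance on the structured dataset and an inference speedup of more than 10x.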

AI中文摘要

视觉-语言模型（VLMs）因其类人推理能力成为端到端自动驾驶的有前景方法。然而，现有VLMs与现实应用之间仍存在显著差距。主要限制是现有松散格式的语言描述数据集不适用于机器，可能引入冗余。此外，VLMs的高计算成本和大规模阻碍了推理速度和现实部署。为弥合这一差距，本文引入了结构化且简洁的基准数据集NuScenes-S，该数据集源自NuScenes数据集并包含适用于机器的结构化表示。此外，我们提出了FastDrive，一个参数仅为0.9B的紧凑型VLM基线。与现有参数超过7B且未结构化的VLMs（如LLaVA-1.5）相比，FastDrive能够理解和生成结构化且简洁的描述，以高效率生成机器友好的驾驶决策。大量实验表明，FastDrive在结构化数据集上实现了竞争性的性能，决策任务的精度提高了约20%，同时在推理速度上超越大规模参数基线超过10倍。此外，消融研究进一步聚焦于场景注释（如天气、时间）对决策任务的影响，证明了其在自动驾驶决策任务中的重要性。

英文摘要

Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, a troublesome gap remains between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, the high computational cost and massive scale of VLMs hinder inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on the structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while running over 10x faster at inference than the massive-parameter baseline. Additionally, ablation studies examine the impact of scene annotations (e.g., weather, time of day) and demonstrate their importance for decision-making tasks in autonomous driving.

2505.20914 2026-05-19 cs.CV 版本更新

Geometry-Editable and Appearance-Preserving Object Composition

可编辑且保留外观的对象合成

Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen

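Affiliations: South China University of Technology; Guangdong University of Technology; Sun Yat-sen University

AI summary: This paper proposes the DGAD model, which uses semantic embeddings and a cross-attention mechanism to achieve geometry editability and appearance preservation, improving the precision and fidelity of object composition.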

AI中文摘要

通用对象合成（GOC）旨在无缝整合目标对象到背景场景中，同时保持其细粒度外观细节。最近的方法通过语义嵌入和扩散模型实现几何可编辑生成，但高紧凑嵌入仅编码高层语义线索，不可避免地丢弃细粒度外观细节。我们引入一种解耦的几何可编辑且外观保留扩散（DGAD）模型，首先利用语义嵌入隐式捕捉所需几何变换，然后通过交叉注意力检索机制对齐细粒度外观特征与几何编辑表示，从而在对象合成中实现精确几何编辑和忠实外观保留。具体而言，DGAD基于CLIP/DINO衍生和参考网络提取语义嵌入和外观保留表示，然后在解耦方式下无缝整合到编码和解码管道中。首先将语义嵌入整合到具有强大空间推理能力的预训练扩散模型中，隐式捕捉对象几何，从而实现灵活的对象操作和确保有效的可编辑性。然后设计一种密集交叉注意力机制，利用隐式学习的对象几何检索并空间对齐外观特征，确保忠实的外观一致性。在公共基准上的广泛实验验证了所提DGAD框架的有效性。

英文摘要

General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.

2505.17674 2026-05-19 cs.CV 版本更新

SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding

SVL：基于脉冲的视觉-语言预训练用于高效的3D开放世界理解

Xuerui Qiu, Peixi Wu, Yaozhi Wen, Shaowei Gu, Yuqi Pan, Xinhao Luo, Bo XU, Guoqi Li

Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences; Zhongguancun Academy; University of Science and Technology of China; Peking University

AI summary: SVL is a spike-based vision-language pretraining framework that uses Multi-scale Triple Alignment and Re-parameterizable Vision-Language Integration to improve SNN performance in 3D open-world understanding, enabling efficient zero-shot 3D classification and multimodal question answering.

Comments ICML 2026 Spotlight

AI中文摘要

脉冲神经网络（SNNs）提供了一种高效的提取3D时空特征的方法。然而，现有的SNNs在性能上仍显著落后于人工神经网络（ANNs），主要是由于预训练策略不足。这些限制表现为泛化能力有限、任务特定性和缺乏多模态理解，特别是在多模态问答和零样本3D分类等挑战性任务中。为克服这些挑战，我们提出了基于脉冲的视觉-语言（SVL）预训练框架，使SNNs能够实现开放世界3D理解，同时保持脉冲驱动的效率。SVL引入了两个关键组件：（i）多尺度三元组对齐（MTA）用于在3D、图像和文本模态之间进行无标签三元组对比学习；（ii）可重参数化的视觉-语言整合（Rep-VLI）以实现轻量级推理，而无需依赖大型文本编码器。广泛的实验表明，SVL在零样本3D分类中实现了85.4%的top-1准确率，超过了先进的ANN模型，并在下游任务中持续优于先前的SNNs，包括3D分类（+6.1%）、DVS动作识别（+2.1%）、3D检测（+1.1%）和3D分割（+2.1%），具有显著的效率。此外，SVL使SNNs能够执行开放世界3D问答任务，有时甚至优于ANNs。据我们所知，SVL代表了首个可扩展、通用和硬件友好的3D开放世界理解范式，有效弥合了SNNs和ANNs在复杂开放世界理解任务中的差距。代码可用https://github.com/bollossom/SVL。

英文摘要

Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available at https://github.com/bollossom/SVL.

2412.11149 2026-05-19 cs.CV 版本更新

A Comprehensive Survey of Action Quality Assessment: Method and Benchmark

动作质量评估的全面综述：方法与基准

Kanglei Zhou, Ruizhi Cai, Liyuan Wang, Hubert P. H. Shum, Xiaohui Liang

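AI summary: This survey reviews recent advances in action quality assessment, proposes a modality-driven hierarchical taxonomy, establishes a unified benchmark, analyzes methodological evolution and research trends, and discusses current challenges and future directions.

Comments Published in Pattern Recognition. Project page and benchmark resources are available online

Journal ref Pattern Recognition, 2026, Article 113933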

DOI: 10.1016/j.patcog.2026.113933

AI中文摘要

动作质量评估（AQA）旨在自动评估人类动作的执行质量，并已广泛应用于体育分析、技能评估和医疗领域。然而，AQA研究往往是在异质数据集和评估设置下开发的，使得方法间的系统比较困难。为了解决这些挑战，我们呈现了AQA近期进展的全面综述。特别地，我们提出了一种模态驱动的分层分类体系，将现有方法分为基于视频、基于骨架和多模态方法，并分析代表性模型的方法学演变。我们进一步通过整合多样化数据集和标准化评估协议，建立了代表性视频基AQA方法的统一基准，使在准确性和计算效率方面实现一致的比较。最后，我们分析了新兴研究趋势，识别了当前AQA研究中的关键挑战，并概述了从短期的方法学进步到长期由新兴AI范式带来的机遇的未来方向。该项目的网页可在https://ZhouKanglei.github.io/AQA-Survey上找到。

英文摘要

Action Quality Assessment (AQA) aims to automatically evaluate how well human actions are performed and has been widely applied in sports analysis, skill assessment, and healthcare. However, AQA studies are often developed under heterogeneous datasets and evaluation settings, making systematic comparison across methods difficult. To address these challenges, we present a comprehensive survey of recent advances in AQA. In particular, we propose a modality-driven hierarchical taxonomy that organizes existing methods into video-based, skeleton-based, and multi-modal approaches, and analyze the methodological evolution of representative models. We further establish a unified benchmark for representative video-based AQA methods by integrating diverse datasets and standardized evaluation protocols, enabling consistent comparison in terms of both accuracy and computational efficiency. Finally, we analyze emerging research trends, identify key challenges in current AQA research, and outline future directions ranging from near-term methodological advances to longer-term opportunities enabled by emerging AI paradigms. The project web page can be found at https://ZhouKanglei.github.io/AQA-Survey.

2412.00666 2026-05-19 cs.CV 版本更新

Explaining Object Detectors via Collective Contribution of Pixels

通过像素的集体贡献解释目标检测器

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

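Affiliations: Chiba University; National Institute of Informatics

AI summary: This paper proposes a game-theoretic method based on Shapley values and interactions that captures both individual and collective pixel contributions, yielding more accurate explanations of object detectors.

Comments Accepted to CVPR 2026 (Highlight); code is available at: https://github.com/tttt-0814/VX-CODE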

AI中文摘要

视觉解释对于增强目标检测器的可靠性至关重要。目标检测器通过评估多个视觉特征的集体信息来识别和定位实例。在生成解释时，忽视这些集体影响可能导致遗漏组成线索或捕捉虚假相关性。然而，现有方法通常仅关注单个像素贡献，忽视了多个像素的集体贡献。为了解决这一限制，我们提出了一种基于Shapley值和交互的游戏理论方法，以显式捕捉个体和集体像素贡献。我们的方法为边界框定位和类别确定提供解释，突出对检测至关重要的区域。广泛实验表明，所提出的方法在识别重要区域的准确性上优于最先进的方法。代码可在https://github.com/tttt-0814/VX-CODE获取。

英文摘要

Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code is available at https://github.com/tttt-0814/VX-CODE
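
The difference between individual and collective contributions can be illustrated with exact Shapley values and the pairwise Shapley interaction index on a toy game. This is a hedged sketch, not the method from the paper: the three "pixels" `a`, `b`, `c` and the scoring function `toy_detector_score` are hypothetical, and exhaustive enumeration over subsets is only feasible for a handful of players.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Iterator, List

ValueFn = Callable[[FrozenSet[str]], float]


def subsets(items: List[str]) -> Iterator[FrozenSet[str]]:
    """Enumerate every subset of items, including the empty set."""
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def shapley_values(players: List[str], v: ValueFn) -> Dict[str, float]:
    """Exact Shapley value of each player (individual contribution)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for s in subsets(others):
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (v(s | {i}) - v(s))
        phi[i] = total
    return phi


def shapley_interaction(players: List[str], v: ValueFn, i: str, j: str) -> float:
    """Pairwise Shapley interaction index (joint effect beyond the parts)."""
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 2) / factorial(n - 1)
        delta = v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s)
        total += weight * delta
    return total


def toy_detector_score(region: FrozenSet[str]) -> float:
    """Hypothetical score: a and b only help together, c helps alone."""
    score = 0.0
    if {"a", "b"} <= region:
        score += 1.0
    if "c" in region:
        score += 0.5
    return score


if __name__ == "__main__":
    pixels = ["a", "b", "c"]
    print(shapley_values(pixels, toy_detector_score))
    print(shapley_interaction(pixels, toy_detector_score, "a", "b"))
    print(shapley_interaction(pixels, toy_detector_score, "a", "c"))
```

In this toy game all three pixels receive the same Shapley value, so individual attributions alone cannot tell that `a` and `b` matter only as a pair; the interaction index is positive for (`a`, `b`) and zero for (`a`, `c`), which is the kind of collective signal the abstract refers to.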

2411.17917 2026-05-19 cs.CV cs.RO 版本更新

DECODE: Domain-aware Continual Domain Expansion for Motion Prediction

DECODE：面向领域的持续领域扩展用于运动预测

Boqi Li, Haojie Zhu, Henry X. Liu

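Affiliations: Department of Civil and Environmental Engineering, University of Michigan

AI summary: DECODE is a continual learning framework that starts from a pre-trained model and incrementally adds domain-specialized models, combining a hypernetwork with a normalizing flow mechanism for efficient model selection and uncertainty estimation, which lowers the forgetting rate and improves prediction accuracy.

Comments This work has been published in IEEE TPAMI Early Access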

DOI: 10.1109/TPAMI.2026.3683469

AI中文摘要

运动预测对于自动驾驶车辆在复杂环境中有效导航和准确预测其他交通参与者行为至关重要。随着自动驾驶不断发展，整合新多样驾驶场景的需求促使频繁重新训练模型。为此，我们引入DECODE，一种新的持续学习框架，从预训练的通用模型开始，逐步发展专用领域模型。不同于现有持续学习方法试图开发一个能跨多样场景泛化的统一模型，DECODE独特地平衡了专用性与泛化性，动态调整以满足实时需求。所提框架利用超网络生成模型参数，显著降低存储需求，并结合归一化流机制基于似然估计进行实时模型选择。此外，DECODE利用深度贝叶斯不确定性估计技术合并最相关专用和通用模型的输出。这种整合确保在熟悉条件下最优性能，同时在不熟悉场景中保持鲁棒性。广泛评估证实了框架的有效性，实现显著低的遗忘率0.044和平均minADE 0.584米，显著超越传统学习策略，并在广泛驾驶条件下表现出适应性。

英文摘要

Motion prediction is critical for autonomous vehicles to effectively navigate complex environments and accurately anticipate the behaviors of other traffic participants. As autonomous driving continues to evolve, the need to assimilate new and varied driving scenarios necessitates frequent model updates through retraining. To address these demands, we introduce DECODE, a novel continual learning framework that begins with a pre-trained generalized model and incrementally develops specialized models for distinct domains. Unlike existing continual learning approaches that attempt to develop a unified model capable of generalizing across diverse scenarios, DECODE uniquely balances specialization with generalization, dynamically adjusting to real-time demands. The proposed framework leverages a hypernetwork to generate model parameters, significantly reducing storage requirements, and incorporates a normalizing flow mechanism for real-time model selection based on likelihood estimation. Furthermore, DECODE merges outputs from the most relevant specialized and generalized models using deep Bayesian uncertainty estimation techniques. This integration ensures optimal performance in familiar conditions while maintaining robustness in unfamiliar scenarios. Extensive evaluations confirm the effectiveness of the framework, achieving a notably low forgetting rate of 0.044 and an average minADE of 0.584 m, significantly surpassing traditional learning strategies and demonstrating adaptability across a wide range of driving conditions.
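
For readers unfamiliar with the minADE metric reported above, the following sketch shows how it is commonly computed: the average displacement error (ADE) is the mean Euclidean distance between predicted and ground-truth positions over all timesteps, and minADE takes the best of the K predicted trajectories. The trajectories and function names are made-up examples, and the paper's exact evaluation protocol (horizon, number of modes) is not stated in the abstract.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between matching timesteps."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def min_ade(preds: Sequence[Sequence[Point]], gt: Sequence[Point]) -> float:
    """Best ADE among the K candidate trajectories."""
    return min(ade(pred, gt) for pred in preds)


if __name__ == "__main__":
    ground_truth: List[Point] = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    candidates = [
        [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constantly 1 m off
        [(0.0, 0.0), (1.0, 0.0), (2.0, 0.3)],  # only the last step is off
    ]
    print(min_ade(candidates, ground_truth))
```

Because only the closest candidate counts, minADE rewards a multi-modal predictor for covering the true future with at least one of its hypotheses.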

2407.15199 2026-05-19 cs.CV cs.CY 版本更新

Multiple Object Detection and Tracking in Panoramic Videos for Cycling Safety Analysis

全景视频中多目标检测与跟踪用于骑行安全分析

Jingwei Guo, Yitai Cheng, Meihui Wang, Ilya Ilyankou, Natchapon Jongwiriyanurak, Xiaowei Gao, Nicola Christie, James Haworth

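Affiliations: Department of Civil, Environmental, and Geomatic Engineering; University College London; SpaceTimeLab; Department of Earth Science and Engineering; Imperial College London; Centre for Transport Studies

AI summary: This paper proposes a three-step framework for multi-object detection and tracking in panoramic videos: it improves detection accuracy by segmenting and projecting the panorama into sub-images, modifies tracking models to handle boundary continuity and object category information, and validates the approach on a real-world vehicle overtaking detection task.

Journal ref IET Intelligent Transport Systems, Volume 20, no. 1 (2026): e70228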

DOI: 10.1049/itr2.70228

AI中文摘要

骑行者面临不成比例的受伤风险，但传统碰撞记录过于稀疏，无法在细粒度空间和时间尺度上识别风险因素。最近，自然主义研究利用视频数据捕捉复杂的行为和基础设施风险因素。全景视频是一种有前景的格式，可以记录骑行者周围的360度视图。然而，其使用受到失真、大量小对象和边界连续性的限制，现有计算机视觉模型无法处理。本研究提出一个新颖的三步框架：(1)通过分割和投影原始360度图像为子图像来提升全景影像中的目标检测精度；(2)修改多目标跟踪模型以整合边界连续性和目标类别信息；(3)通过实际应用中的车辆超车检测任务进行验证。该方法使用由骑行者记录的伦敦道路全景视频进行评估。实验结果表明，方法在不同图像分辨率下均优于基线，实现了更高的平均精度。此外，改进的跟踪方法在识别切换次数上减少了10.0%，识别精度提高了2.7%。超车检测任务的F分数达到0.82，展示了所提方法在实际骑行安全场景中的实用性。

英文摘要

Cyclists face a disproportionate risk of injury, yet conventional crash records are too sparse to identify risk factors at fine spatial and temporal scales. Recently, naturalistic studies have used video data to capture the complex behavioural and infrastructural risk factors. A promising format is panoramic video, which can record 360° views around a rider. However, its use is limited by distortions, large numbers of small objects, and boundary continuity, which cannot be handled using existing computer vision models. This research proposes a novel three-step framework: (1) enhancing object detection accuracy on panoramic imagery by segmenting and projecting the original 360° images into sub-images; (2) modifying multi-object tracking models to incorporate boundary continuity and object category information; and (3) validating through a real-world application of vehicle overtaking detection. The methodology is evaluated using panoramic videos recorded by cyclists on London's roadways under diverse conditions. Experimental results demonstrate improvements over baselines, achieving higher average precision across varying image resolutions. Moreover, the enhanced tracking approach yields a 10.0% decrease in identification switches and a 2.7% improvement in identification precision. The overtaking detection task achieves a high F-score of 0.82, illustrating the practical effectiveness of the proposed method in real-world cycling safety scenarios.
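
Boundary continuity matters because the left and right edges of a 360° panorama show the same direction, so an object crossing the seam looks like two unrelated boxes to a standard tracker. One way to account for this when associating boxes is an IoU that treats the horizontal axis as circular. The sketch below is an assumption-laden illustration, not the modification used in the paper: the `(x, y, w, h)` box format, the function names, and the image width are hypothetical, and boxes are assumed to be no wider than the panorama.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_left, y_top, width, height)


def wrapped_overlap_x(x1: float, w1: float, x2: float, w2: float, width: float) -> float:
    """Horizontal overlap of two intervals on a circle of the given width."""
    d = (x2 - x1) % width  # start of box 2 relative to box 1, in [0, width)
    direct = max(0.0, min(w1, d + w2) - d)        # box 2 starts inside box 1
    wrapped = max(0.0, min(w1, d + w2 - width))   # box 2 wraps past the seam
    return direct + wrapped


def overlap_y(y1: float, h1: float, y2: float, h2: float) -> float:
    """Ordinary vertical overlap (the vertical axis does not wrap)."""
    return max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))


def wrapped_iou(a: Box, b: Box, width: float) -> float:
    """IoU of two boxes on a horizontally wrapping panorama."""
    inter = wrapped_overlap_x(a[0], a[2], b[0], b[2], width) * overlap_y(a[1], a[3], b[1], b[3])
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    panorama_width = 100.0
    crossing_seam = (95.0, 0.0, 10.0, 10.0)  # covers x in [95, 100) and [0, 5)
    near_left_edge = (0.0, 0.0, 10.0, 10.0)
    print(wrapped_iou(crossing_seam, near_left_edge, panorama_width))
```

A plain IoU on these two boxes would be zero, since their pixel coordinates lie at opposite ends of the image; the wrapped version recovers the overlap so a tracker can keep one identity across the seam.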

2406.09333 2026-05-19 cs.CV 版本更新

Learning Spatial-Preserving Hierarchical Representations for Digital Pathology

学习空间保持的层次表示用于数字病理学

Weiyi Wu, Xingjian Diao, Chunhui Zhang, Chongyang Gao, Xinwen Xu, Siting Li, Jiang Gui

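Affiliations: Dartmouth College; Massachusetts General Hospital; Northwestern University

AI summary: This paper proposes SPAN, a framework that preserves spatial relationships while allocating computation to informative regions, improving hierarchical representations of digital pathology images; two variants validate its effectiveness on multiple datasets.

Journal ref CVPR 2026 (Findings Track)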

AI中文摘要

全滑片图像（WSI）由于其十亿像素分辨率和信息区域稀疏分布，带来了根本性的计算挑战。现有方法常独立处理图像块或以扭曲空间上下文的方式重塑它们，从而掩盖了WSI固有的层次金字塔表示。我们引入稀疏金字塔注意力网络（SPAN），一种层次框架，能够在保留空间关系的同时将计算分配给信息区域。SPAN直接从单尺度输入构建多尺度表示，使WSI数据的精确层次建模成为可能。我们通过两种变体：SPAN-MIL用于滑片分类，SPAN-UNet用于分割，展示了SPAN的通用性。在多个公开数据集上的全面评估表明，SPAN有效捕捉了层次结构和上下文关系。我们的结果提供了明确证据，表明架构归纳偏置和层次表示增强了滑片级和块级性能。通过解决WSI分析中的关键计算挑战，SPAN为计算病理学提供了有效的框架，并展示了大规模医学图像分析的重要设计原则。

英文摘要

Whole slide images (WSIs) pose fundamental computational challenges due to their gigapixel resolution and the sparse distribution of informative regions. Existing approaches often treat image patches independently or reshape them in ways that distort spatial context, thereby obscuring the hierarchical pyramid representations intrinsic to WSIs. We introduce Sparse Pyramid Attention Networks (SPAN), a hierarchical framework that preserves spatial relationships while allocating computation to informative regions. SPAN constructs multi-scale representations directly from single-scale inputs, enabling precise hierarchical modeling of WSI data. We demonstrate SPAN's versatility through two variants: SPAN-MIL for slide classification and SPAN-UNet for segmentation. Comprehensive evaluations across multiple public datasets show that SPAN effectively captures hierarchical structure and contextual relationships. Our results provide clear evidence that architectural inductive biases and hierarchical representations enhance both slide-level and patch-level performance. By addressing key computational challenges in WSI analysis, SPAN provides an effective framework for computational pathology and demonstrates important design principles for large-scale medical image analysis.

2312.03798 2026-05-19 cs.CV 版本更新

Single Image Reflection Removal with Patch Reflectance Prior

单图像反射去除与补丁反射率先验

Dongshen Han, Heechan Yoon, Hyukmin Kwon, Hyun-Cheol Kim, Hyon-Gon Choo, Seungkyu Lee, Chaoning Zhang

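Affiliations: Kyunghee University; Electronics and Telecommunications Research Institute

AI summary: This paper proposes a single image reflection removal (SIRR) method based on a patch reflectance prior: a reflection prior extraction network learns a non-uniform reflection prior, and a Transformer U-Net architecture performs the removal efficiently; experiments show state-of-the-art accuracy on real-world SIRR benchmarks.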

2310.20389 2026-05-19 eess.IV cs.CV 版本更新

High-Resolution Reference Image Assisted Volumetric Super-Resolution of Cardiac Diffusion Weighted Imaging

高分辨率参考图像辅助的心脏扩散加权成像体积分辨率提升

Yinzhe Wu, Jiahao Huang, Fanwen Wang, Pedro Ferreira, Andrew Scott, Sonia Nielles-Vallespin, Guang Yang

Affiliations: Department of Bioengineering, Imperial College London; Cardiovascular Magnetic Resonance Unit, Royal Brompton Hospital; National Heart and Lung Institute, Imperial College London

AI summary: This paper proposes a deep-learning-based volumetric super-resolution framework that takes a high-resolution b0 DWI as an additional input to improve the image quality of cardiac diffusion weighted imaging, and demonstrates generalization to unseen b-values.

Comments Accepted by SPIE Medical Imaging 2024

DOI: 10.1117/12.3006008

AI中文摘要

扩散张量心脏磁共振（DT-CMR）是唯一用于非侵入性检查人体心脏微结构的活体方法。当前DT-CMR研究旨在提高对心脏微结构与健康心脏宏观功能关系以及微结构功能障碍与疾病关系的理解。为了获得最终DT-CMR指标，需要获取至少6个方向的扩散加权成像（DWI）。然而，由于DWI信噪比较低，标准体素尺寸在微结构尺度上相当大。在本研究中，我们探索了基于深度学习的方法在提高图像质量方面的潜力（在所有维度上提升4倍）。本研究提出了一种新的框架，通过将高分辨率b0 DWI作为额外模型输入，实现体积分辨率提升。我们证明了额外输入能够提供更高的超分辨图像质量。此外，模型还能超分辨未见过的b值的DWI，证明了该框架在心脏DWI超分辨率中的泛化能力。最后，我们建议在训练和推理中将高分辨率参考图像作为低分辨率图像的额外输入，以指导所有参数成像中的超分辨框架，尤其是在可用参考图像的情况下。

英文摘要

Diffusion Tensor Cardiac Magnetic Resonance (DT-CMR) is the only in vivo method to non-invasively examine the microstructure of the human heart. Current research in DT-CMR aims to improve the understanding of how the cardiac microstructure relates to the macroscopic function of the healthy heart as well as how microstructural dysfunction contributes to disease. To obtain the final DT-CMR metrics, diffusion weighted images (DWIs) must be acquired in at least 6 directions. However, due to the low signal-to-noise ratio of DWI, the standard voxel size is large relative to the scale of the microstructure. In this study, we explored the potential of deep-learning-based methods to improve image quality volumetrically (×4 in all dimensions). We proposed a novel framework that enables volumetric super-resolution with a high-resolution b0 DWI as an additional model input. We demonstrated that the additional input yields higher super-resolved image quality. Furthermore, the model is also able to super-resolve DWIs of unseen b-values, demonstrating the framework's generalizability for cardiac DWI super-resolution. In conclusion, we recommend providing a high-resolution reference image as an additional input alongside the low-resolution image, for both training and inference, to guide super-resolution frameworks for parametric imaging wherever a reference image is available.

2305.07152 2026-05-19 cs.CV 版本更新

Intuitive Surgical SurgToolLoc and SurgVU Challenges Results: 2022-2025

直观外科SurgToolLoc和SurgVU挑战结果：2022-2025

Aneeq Zia, Max Berniker, Rogerio Garcia Nespolo, Xiaorui Zhang, Conor Perreault, Kiran Bhattacharyya, Xi Liu, Ziheng Wang, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Bo Liu, David Austin, Yiheng Wang, Michal Futrega, Jean-Francois Puget, Zhenqiang Li, Yoichi Sato, Ryo Fujii, Ryo Hachiuma, Mana Masuda, Hideo Saito, An Wang, Mengya Xu, Mobarakol Islam, Long Bai, Winnie Pang, Hongliang Ren, Chinedu Nwoye, Luca Sestini, Nicolas Padoy, Maximilian Nielsen, Samuel Schüttler, Thilo Sentker, Hümeyra Husseini, Ivo Baltruschat, Rüdiger Schmitz, René Werner, Aleksandr Matsun, Mugariya Farooq, Numan Saaed, Jose Renato Restom Viera, Mohammad Yaqub, Neil Getty, Fangfang Xia, Zixuan Zhao, Xiaotian Duan, Xing Yao, Ange Lou, Hao Yang, Jintong Han, Jack Noble, Jie Ying Wu, Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, Herag Arabian, Ning Ding, Knut Moeller, Weiliang Chen, Quan He, Muhammad Bilal, Taofeek Akinosho, Adnan Qayyum, Massimo Caputo, Hunaid Vohra, Michael Loizou, Anuoluwapo Ajayi, Ilhem Berrou, Faatihah Niyi-Odumosu, Charlie Budd, Oluwatosin Alabi, Tom Vercauteren, Ruoxi Zhao, Ayberk Acar, John Han, Jumanh Atoum, Yinhong Qin, Surong Hua, Lu Ping, Wenming Wu, Rongfeng Wei, Jinlin Wu, You Pang, Zhen Chen, Tim Jaspers, Amine Yamlahi, Piotr Kalinowski, Dominik Michael, Tim Rädsch, Marco Hübner, Danail Stoyanov, Stefanie Speidel, Lena Maier-Hein, Jie Tian, Ruxin Zhang, Khang Hoang Nguyen, Anh Quoc Nguyen, Tam Minh Nguyen, Khoi Dinh Tran, Minh Nguyen Dang Nhat, Trinh Thi Doan Pham, Linh Van Nguyen, Chunyang Jiang, Dewei Yang, Haitao Li, Yannick Prudent, Thibaut Boissin, Mahmood Alam, Shazad Ashraf, Andrew D. Beggs, Lukman Akanbi, Manuel D. Delgado, Narain Gupta, Amir M. Hajiyavand, Iqbal Qasim, Hafiz A. Alaka, Junaid Qadir, Shu Yang, Yihui Wang, Hao Chen, Shin Paul, Yosuke Yamagishi, Zhang Dong, Hongyun Li, Hongyu Gu, Xiaoliu Ding, Xiaoyao Liu, Xingyu Zhao, Mariana Ribeiro, Tiago Jesus, André Ferreira, Guilherme Barbosa, João Carvalho, Leonardo Barroso, Nuno Gomes, Rafael Peixoto, Rodrigo Ralha, Victor Alves, Stephanie, Nattapat Ittikosil, Achita Chitrapan, Quan Huu Cap, Jiayuan Huang, Shreyas C Dhake, Sergi Kavtaradze, Mobarak I Hoque, Ka Young Kim, Su Yong Yun, Young Tae Kim, Hyeon Bae Kim, Seong Tae Kim, Zuxing Deng, Ling Li, Jieyu Zheng, Xiaojian Li, Anthony Jarc

Affiliations: Intuitive Surgical, Inc.; Muroran Institute of Technology; Niigata University of Health and Welfare Fujita Health University; NVIDIA, Inc.; University of Tokyo; Keio University; Shun Hing Institute of Advanced Engineering; NUS NUSRI SZ; University of Strasbourg IHU Strasbourg; University Medical Center Hamburg-Eppendorf

AI summary: This paper summarizes the results of the 2022-2025 challenges on surgical tool localization and surgical visual understanding in robot-assisted surgery, and discusses the approaches and contributions to the associated machine learning problems.

2605.16628 2026-05-19 cs.CV Version update

SCARED-C: Corrected Camera Poses for Endoscopic Depth Estimation

John J. Han, Adam Schmidt, Max Allan, Jie Ying Wu, Omid Mohareri

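Affiliations: Vanderbilt University; Intuitive Surgical, Inc.

AI summary: SCARED-C corrects the camera poses, increasing the number of reliable RGB-D pairs from 35 to 17,135, and uses COLMAP together with a scale-recovery step to improve the accuracy and reliability of endoscopic depth estimation.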

2605.16603 2026-05-19 cs.CV Version update

Controlla: Learning Controllability via Graph-Constrained Latent Geometry

Jamuna S. Murthy, Amin Karimi Monsefi, Rajiv Ramnath

Affiliations: Ramaiah Institute of Technology; The Ohio State University

AI summary: Controlla learns controllability through graph-constrained latent geometry, using graph-constrained optimal transport to align the semantic attribute and identity factors of multimodal inputs, which improves controllability, identity preservation, and cross-modal alignment.

Abstract

Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.


2605.16582 2026-05-19 cs.CV Version update

ArtMesh: Part-Aware Articulated Mesh Fields with Motion-Consistent Dynamics

Sylvia Yuan, Dan Wang, Ravi Ramamoorthi, Xinrui Cui

Affiliations: University of California San Diego; University of North Texas

AI summary: ArtMesh builds on a mesh-based differentiable rendering backbone to reconstruct articulated objects as connected triangle meshes from multi-view images, and it outperforms prior 3DGS-based methods on a new benchmark of 100 articulated objects.

Abstract

We present ArtMesh, a mesh-native method for reconstructing articulated objects explicitly as connected triangle meshes with per-part rigid motion from multi-view images in start and end states. Existing 3D Gaussian Splatting pipelines for articulated reconstruction inherit the unstructured point-based geometry of their splatting base, which provides no surface topology for reasoning about part boundaries or enforcing motion consistency along the object's connectivity. ArtMesh instead builds on a mesh-based differentiable rendering backbone, enabling part-aware dynamics to act directly on the structured topology. To make the topology compatible with articulation, we introduce part-aware restricted Delaunay remeshing, producing connected submeshes whose triangles do not cross semantic part boundaries. The dynamic mesh field then optimizes articulation using bidirectional Vertex-wise Motion Consistency on transported mesh vertices and Pixel-wise Motion Consistency on rendered RGB-D observations. We introduce Articulate-100, a new benchmark of 100 articulated objects spanning 16 PartNet-Mobility categories. On this benchmark, ArtMesh outperforms prior 3DGS-based pipelines in joint parameter estimation and part-level geometric reconstruction, with the largest gains on objects with many movable parts.


2605.16572 2026-05-19 cs.CV Version update

TriALS: Triphasic-Aided Liver Lesion Segmentation Benchmark in Non-Contrast CT

Marawan Elbatel, Mohamed Ghonim, Jiaji Mao, Zhuosheng Lin, Katharina Eckstein, Andrés Martínez Mora, Jonathan Deissler, Maximilian Rokuss, Constantin Ulrich, Zdravko Marinov, Wenhui Deng, Baoxun Li, Huijun Hu, Jun Shen, Mohanad Ghonim, Khadiga Omar Nassar, Mariam Elbakry, Menna Dyab, Amr Muhammad Abdo Salem, Nouran Elghitany, Noha Elghitany, Yi Qin, Xuanqi Huang, Haonan Wang, Shao-Woo Yen, Ahmed Elghamry Saba, Salma Ahmad, Xinyan Fang, Jiahao Zhang, Xiaodi Wang, Xinghua Ma, Gongning Luo, Jessica C. Delmoral, João Manuel R. S. Tavares, Ankan Deria, Adinath Dukre, Yutong Xie, Imran Razzak, Dongwook Kim, Matthew Choi, Hanxiao Zhang, Minghui Zhang, Xin You, Abdul Qayyum, Steven A. Niederer, Moona Mazher, Rachika E. Hamadache, Ricardo Montoya-del-Angel, Robert Martí, Xavier Lladó, Toufiq Musah, Livingstone Eli Ayivor, Enrique Almar-Munoz, Agnes Mayr, Kaouther Mouheb, Esther E. Bron, Stefan Klein, Ahmed Abouelhoda, Amira Adel, Susan Adil Ali, Rainer Stiefelhagen, Klaus H. Maier-Hein, Fabian Isensee, Aya Yassin, Xiaomeng Li

Affiliations: Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology; AI Center of Excellence, Ain Shams University; Department of Radiology, Ain Shams University; Department of Radiology, Guangdong Provincial Key Laboratory of Malignant Tumor Epigenetics and Gene Regulation, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University; Nanfang Hospital, Southern Medical University; Division of Medical Image Computing, German Cancer Research Center (DKFZ), Heidelberg, Germany; Medical Faculty Heidelberg, Heidelberg University; Faculty of Mathematics and Computer Science, Heidelberg University; Karlsruhe Institute of Technology

AI summary: This paper presents the TriALS challenge, which evaluates automated liver lesion segmentation algorithms on a multi-centre dataset of 150 cases. The best method reaches human-level performance on venous-phase CT but drops markedly on non-contrast CT, and performance is most strongly predicted by training data scale and pre-training strategy.

Comments TriALS challenge paper across MICCAI 2024 and 2025; data and code at https://github.com/xmed-lab/TriALS

Abstract

Automated segmentation of liver lesions on non-contrast computed tomography (NCCT) is clinically important but fundamentally challenging, particularly in low-resource settings across Africa and Asia where contrast agents are frequently unavailable. Progress has been limited by the absence of annotated NCCT benchmarks. Here we describe the TriALS challenge for automated liver lesion segmentation under contrast-limited conditions, supported by a multi-centre dataset of 150 cases with four-phase CT acquisitions (600 volumes) from Egyptian and Chinese institutions. Algorithms were evaluated on 70 cases from three institutions, including an independent external cohort. The top-performing method achieved a mean venous-phase Dice of 0.754, consistent with human-level performance, yet dropped to 0.57 on NCCT. On external validation, the leading method outperformed off-the-shelf models by up to 28% in Dice on NCCT. Algorithm performance was most strongly predicted by training data scale and pre-training strategy. A cross-year comparison exposed a persistent perceptual barrier on NCCT that scaling pre-training alone cannot overcome. Data, annotations, and code are available at https://github.com/xmed-lab/TriALS.
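The Dice scores quoted in this abstract measure the overlap between a predicted lesion mask and the reference annotation, computed as twice the intersection divided by the sum of the two mask sizes. A minimal sketch follows, assuming binary masks flattened to 0/1 sequences; the function name `dice` and the convention for two empty masks are illustrative choices and are not taken from the TriALS code.

```python
def dice(pred, truth):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as lesion in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total lesion voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks empty: treated here as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total


# One of two predicted voxels overlaps the single reference voxel: 2*1 / (2+1).
example = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

Under this definition a score of 0.754 versus 0.57 means that, relative to the combined mask size, the overlap on non-contrast CT is roughly a quarter smaller than on the venous phase.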


2605.16550 2026-05-19 cs.CV cs.LG Version update

Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition

Luiz G F Carreira, Breno A Mariano, Victor H C de Melo, David Menotti, William Robson Schwartz

Affiliations: Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil; Department of Informatics, Federal University of Paraná, Curitiba, Brazil

AI summary: This paper proposes a transformer-based aggregation network for video periocular recognition, using feature embedding and aggregation modules to improve robustness. On the COX Face dataset it outperforms naive aggregation schemes, reaching 99.8% TPR@1e-1 and 96.6% Rank-5.

Abstract

Video periocular recognition is the task of recognizing an individual's identity based on the region around an individual's eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, consistently outperforming naive aggregation schemes. In the best scenario, the approach achieves $99.8\%$ of TPR@$1e^{-1}$ and $96.6\%$ of Rank-5.


2605.16519 2026-05-19 cs.CV eess.SP Version update

DepthPolyp: Pseudo-Depth Guided Lightweight Segmentation for Real-Time Colonoscopy

Zhuoyu Wu, Wenhui Ou, Lexi Zhang, Pei-Sze Tan, Dongjun Wu, Junhe Zhao, Wenqi Fang, Raphaël C. -W. Phan

Affiliations: CyPhi AI Lab, Monash University, Malaysia Campus, Malaysia; Department of Electronic & Computer Engineering, Hong Kong University of Science & Technology, Hong Kong, P.R. China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, P.R. China; Harbin Institute of Technology, Harbin, P.R. China

AI summary: This paper proposes DepthPolyp, a lightweight segmentation framework based on pseudo-depth-guided multi-task learning and efficient feature modulation, which achieves strong cross-dataset generalization and is suited to real-time colonoscopy.

Comments This paper has been accepted to the International Conference on Pattern Recognition (ICPR 2026)

Abstract

Accurate polyp segmentation in colonoscopy is essential for early colorectal cancer detection, yet real-world clinical environments pose persistent challenges such as motion blur, specular reflections, and illumination instability. Most existing methods are optimized on clean benchmark images and suffer noticeable performance degradation when deployed in authentic surgical scenarios. We propose DepthPolyp, a lightweight and robust segmentation framework based on pseudo-depth-guided multi-task learning and efficient feature modulation. The architecture combines hierarchical Ghost factorization for compact feature generation, Interleaved Shuffle Fusion for low-cost cross-scale interaction, and Dynamic Group Gating for adaptive group-wise feature weighting. Extensive experiments demonstrate that DepthPolyp achieves strong cross-dataset generalization when trained on degraded data and evaluated on both clean and noisy target domains, consistently outperforming lightweight baselines and remaining competitive with substantially larger models. In real surgical video evaluation on PolypGen, DepthPolyp achieves better segmentation performance than models up to $20\times$ larger while preserving real-time inference speed. With only 3.57M parameters and 0.86 GMACs, the proposed method runs at over 180 FPS on mobile devices, making it well suited for real-time deployment in resource-constrained clinical environments. Code and pretrained weights are available at: https://github.com/ReaganWu/DepthPolyp/


2605.16515 2026-05-19 cs.CV cs.LG Version update

SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

Amin Karimi Monsefi, Abolfazl Meyarian, Mridul Khurana, Shuheng Wang, Pouyan Navard, Cheng Zhang, Anuj Karpatne, Wei-Lun Chao, Rajiv Ramnath

Affiliations: The Ohio State University; Path Robotics, USA; Virginia Tech; Boston University

AI summary: SeamCam frames camouflage evaluation as a visual localization problem and proposes a metric that quantifies how well an animal is camouflaged. A human study validates the metric, and the paper demonstrates its use in training diffusion models.

Abstract

Animals are described as effectively camouflaged when they blend seamlessly with their surroundings, yet no standardized quantitative measure of this seamlessness exists. We address this gap by framing camouflage evaluation as a visual localization problem: a well-camouflaged animal is one that remains difficult to detect even when its category is known. We introduce SeamCam (Seamless Camouflage), a metric that quantifies how detectable an animal is from the available visual evidence. Given an image and a target species, SeamCam generates category-conditioned detection proposals, extracts segmentation masks, and identifies the subset whose collective union yields the highest IoU with the ground-truth mask. The SeamCam score is one minus this maximum recoverable localization signal, where a higher score indicates stronger camouflage (i.e., lower detectability). In a human two-alternative forced-choice study with 94 participants and 2,390 comparisons, SeamCam achieves 78.82% agreement with human camouflage difficulty judgments, outperforming the state of the art by about 25%. We then demonstrate SeamCam's utility as a preference signal for Direct Preference Optimization (DPO) to fine-tune a diffusion-based inpainting model for camouflage generation. This offers an affordable training approach with an objective explicitly suited for camouflage generation, unlike typical diffusion models. To support rigorous benchmarking, we further introduce CamFG-1.5k, a curated dataset of 1,521 high-resolution images in which animals are fully visible prior to camouflage generation, enabling unbiased evaluation by controlling for occlusion artifacts present in existing datasets. https://7amin.github.io/SeamCam/
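The scoring rule described in this abstract (one minus the best IoU achievable by the union of any subset of proposal masks) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: masks are represented as sets of pixel coordinates, the subset search is exhaustive (exponential in the number of proposals, so only practical for a handful of masks), and the names `iou` and `seamcam_score` are hypothetical.

```python
from itertools import combinations


def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of pixel coordinates."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


def seamcam_score(proposal_masks, gt_mask):
    """One minus the highest IoU reached by the union of any non-empty subset."""
    best = 0.0
    for size in range(1, len(proposal_masks) + 1):
        for subset in combinations(proposal_masks, size):
            merged = set().union(*subset)  # collective union of the chosen masks
            best = max(best, iou(merged, gt_mask))
    return 1.0 - best


gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
proposals = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}, {(5, 5), (5, 6)}]
score = seamcam_score(proposals, gt)  # first two proposals together cover gt exactly
```

With no proposals at all, or only proposals that miss the animal, the sketch returns 1.0, which matches the abstract's reading that a higher score means lower detectability.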


2605.16481 2026-05-19 cs.CV cs.AI Version update

Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval

Aiden Yiliu Li, Nels Numan, Anthony Steed

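Affiliations: University College London

AI summary: This paper proposes the Visual Agentic Memory framework, which enables long video understanding through online indexing, hierarchical memory, and agentic retrieval; experiments report strong results on the OVO-Bench and MM-Lifelong datasets.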

Multi-Hop Relational Contrastive Learning: Spatial Contrastive Pre-training Beyond Pairwise Relations

Sheikh Tanvir Ahmed, Md. Tanvir Raihan

Affiliations: Department of Computer Science and Engineering; United International University

AI summary: This paper proposes a multi-hop relational contrastive learning framework that captures implicit spatial dependencies along k-hop paths in scene graphs, improving spatial awareness and yielding better retrieval and downstream-task performance on a GQA subset.

Abstract

Understanding how objects relate to each other in space is fundamental to scene understanding, yet most contrastive pre-training approaches only model pairwise relationships, leaving richer compositional and multi-hop interactions largely unexplored. We introduce Multi-Hop Relational Contrastive Learning (MRCL), a framework that extends spatial contrastive learning to graph-structured scene representations. By tracing k-hop paths through scene graphs built from detected objects, MRCL captures implicit spatial dependencies that go well beyond what direct object pairs can express. We define a multi-level contrastive objective spanning nodes, edges, and multi-hop paths, encouraging embeddings that remain stable across object semantics while staying responsive to spatial layout. On a GQA subset, MRCL produces spatially-aware representations that improve content-based graph retrieval (NDCG@5 = 0.748) and consistently benefit downstream tasks, including spatial relationship recognition and graph-based question answering. Together, these results suggest that multi-hop relational supervision offers substantially richer structural guidance than pairwise-only methods, leading to visual representations that are more robust, compositional, and geometry-aware.
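To make the notion of "tracing k-hop paths through scene graphs" concrete, the sketch below enumerates all simple directed paths with exactly k edges. It assumes a scene graph stored as a dictionary that maps each object to its outgoing (relation, neighbour) pairs; the representation, the function name `k_hop_paths`, and the toy scene are illustrative assumptions, since the abstract does not specify how MRCL stores or samples paths.

```python
def k_hop_paths(graph, k):
    """Enumerate simple directed paths with exactly k edges.

    graph maps each node to a list of (relation, neighbour) pairs.
    Each result is (nodes, relations) with len(nodes) == k + 1.
    """
    paths = []

    def extend(nodes, relations):
        if len(relations) == k:
            paths.append((tuple(nodes), tuple(relations)))
            return
        for relation, neighbour in graph.get(nodes[-1], []):
            if neighbour not in nodes:  # keep paths simple (no revisited objects)
                extend(nodes + [neighbour], relations + [relation])

    for start in graph:
        extend([start], [])
    return paths


# Toy scene: the cup is on the table, and the table is left of the chair.
scene = {
    "cup": [("on", "table")],
    "table": [("left of", "chair")],
    "chair": [],
}
two_hop = k_hop_paths(scene, 2)
```

In the toy scene the only 2-hop path links the cup to the chair through the table, a dependency that neither direct pair ("cup on table", "table left of chair") expresses on its own.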


2605.16444 2026-05-19 cs.CV cs.AI Version update

Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images

Liangrui Pan, Jiadi Luo, Yuxuan Xiao, Chenchen Nie, Xiaoshuai Wu, Songqing Fan, Ling Chu, Manqiu Li, Rongfang He, Zhenyu Zhao, Ruixing Wang, Shulin Liu, Yiyi Liang, Xiang Wang, Qingchun Liang, Shaoliang Peng

Affiliations: College of Computer Science and Electronic Engineering, Hunan University; Department of Pathology, The Second Xiangya Hospital, Central South University; Hunan Clinical Medical Research Center for Cancer Pathogenic Genes Testing and Diagnosis; Department of Thoracic Surgery, The Second Xiangya Hospital, Central South University; Department of Pathology, Hunan Cancer Hospital, The Affiliated Cancer Hospital of Xiangya School of Medicine, Central South University; Department of Pathology, The Third Xiangya Hospital, Central South University; Department of Pathology, First People's Hospital of Pingjiang County; Department of Pathology, the First Affiliated Hospital, Hengyang Medical School, University of South China; Department of Radiology, The Second Xiangya Hospital of Central South University; Department of Radiology, Xiangya Hospital, Central South University; Oncology Department and State Key Laboratory of Systems Medicine for Cancer of Shanghai Cancer Institute, Renji Hospital, School of Medicine, Shanghai Jiaotong University

AI summary: This paper proposes DAEM, which uses multi-scale feature learning and a dual-branch architecture to improve STAS detection, achieves high AUCs on both frozen and paraffin sections, and uses tumor microenvironment features for semi-automatic STAS localization.

Comments Accepted by Nature Communications

Abstract

Accurate intraoperative and postoperative diagnosis of spread through air spaces (STAS) is essential for guiding surgical decisions and postoperative management in lung cancer. However, histopathological assessment is labor-intensive and is prone to missed or incorrect diagnoses. We propose a Diffusion Attention Expert Model (DAEM) to detect STAS in frozen sections (FSs) and paraffin sections (PSs). Its diffusion attention expert module leverages full attention aggregation to learn multi-scale features from histopathological images, while a dual-branch architecture strengthens multi-scale feature representation. On an internal dataset, DAEM achieves AUCs of 0.8946 for FSs and 0.9112 for PSs. Validation on external multi-center datasets from eight institutions demonstrates strong generalizability and interpretability. Using tumor microenvironment (TME) features in PSs, we further enable semi-automatic measurement of STAS location and its distance from the primary tumor. Several quantitative TME metrics are identified as potential biomarkers for STAS, including micropapillary-type STAS. Overall, DAEM offers a clinically actionable framework for STAS assessment by enabling accurate and interpretable detection on FSs and PSs, supporting postoperative risk stratification through quantitative TME-based analysis.


2605.16440 2026-05-19 cs.CV cs.AI Version update

Semantic Smoothing via Novel View Synthesis for Robust SAR Image Classification

Daniel Brignac, Fengwei Tian, Banafsheh Latibari, Abhijit Mahalanobis, Ravi Tandon

Affiliations: The University of Arizona

AI summary: This paper proposes semantic smoothing, which uses structured randomized transformations generated by a novel view synthesis model to improve the robustness of SAR image classification under adversarial attack while also increasing clean classification accuracy.

Abstract

Deep neural networks are vulnerable to adversarial perturbations, limiting deployment in safety-critical applications such as synthetic aperture radar (SAR) automatic target recognition (ATR). Randomized smoothing improves robustness by averaging predictions over noisy inputs, but isotropic noise often fails to preserve the semantic structure of SAR imagery. We propose semantic smoothing, a defense that replaces noise-based perturbations with structured randomized transformations generated by a novel view synthesis model. For SAR, we condition on acquisition geometry to synthesize multiple plausible radar views. Predictions across generated randomized views are aggregated to form a robust classifier. Experiments show that semantic smoothing improves robustness against standard attacks, such as FGSM and PGD, and SAR-specific attacks, such as OTSA and SMGAA, while also increasing clean classification accuracy. These results demonstrate that randomized smoothing via semantically preserving geometric transformations is a promising alternative to isotropic noise for adversarial defense in structured sensing domains.
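The aggregation step in this abstract (classify several randomized views of one input, then combine the predictions) can be sketched as a majority vote. The abstract does not say which aggregation rule is used, so majority voting is an assumption here, and `make_view` is a hypothetical stand-in for the novel view synthesis model; `classify` is any base classifier that returns a label.

```python
import random
from collections import Counter


def smoothed_predict(classify, make_view, x, n_views=16, seed=0):
    """Majority vote over predictions on n_views randomized views of x.

    Returns the winning label and the fraction of views that voted for it.
    """
    rng = random.Random(seed)  # seeded so the sampled views are reproducible
    votes = Counter(classify(make_view(x, rng)) for _ in range(n_views))
    label, count = votes.most_common(1)[0]
    return label, count / n_views


# Toy stand-ins: a "view" jitters a scalar input, the classifier checks its sign.
def jitter_view(x, rng):
    return x + rng.uniform(-1.0, 1.0)


def sign_classifier(v):
    return "pos" if v > 0 else "neg"


label, agreement = smoothed_predict(sign_classifier, jitter_view, 5.0, n_views=8)
```

The vote fraction gives a rough sense of how stable the prediction is across views; a real system would replace the scalar jitter with views synthesized under different acquisition geometries.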


2605.16439 2026-05-19 cs.CV cs.AI Version update

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

Yingbing Huang, Tharun Adithya Srikrishnan, Steven K. Reinhardt, Deming Chen

Affiliations: University of Illinois Urbana-Champaign; AMD

AI summary: This paper proposes KVCapsule, a KV cache compression framework for vision-language models that saves memory through lightweight compression and reconstruction components, raising throughput and reducing memory use while preserving accuracy.

Abstract

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify key computational bottlenecks: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.


2605.16431 2026-05-19 cs.CV Version update

CT-DegradBench: A Physics-Informed Benchmark for CT Degradation Detection and Severity Estimation

Yousra Nabila Taifour, Marouane Tliba, Zuheng Ming, Marie Luong, Nour Aburaed, Aladine Chetouani, Gorkem Durak, Alessandro Bruno, Faouzi Alaya Cheikh, Habib Zaidi, Ulas Bagci, Azeddine Beghdadi

Affiliations: Université Sorbonne Paris Nord; University of Dubai; Northwestern University; IULM University; Norwegian University of Science and Technology; University of Geneva

AI summary: This paper presents CT-DegradBench, a benchmark for evaluating CT degradation detection and severity estimation, and proposes SeSpeCT, a framework that combines semantic priors with frequency-domain cues and builds a training-free semantic quality axis in a multimodal embedding space to jointly predict degradation type and severity.

Comments Accepted in CVPR 2026 VISION Workshop (DEXTER track)

Abstract

Computed tomography (CT) images are frequently degraded by acquisition artifacts, including noise, blur, streaking, aliasing, and metal artifacts. Yet CT enhancement is still largely evaluated using image quality metrics with limited perceptual and clinical validity, while existing datasets remain focused on isolated restoration tasks, hindering unified benchmarking across diverse degradation types. We present CT-DegradBench, a dataset and benchmark for CT degradation detection and severity estimation under controlled single- and mixed-artifact settings. CT-DegradBench enables systematic evaluation across multiple degradation families and severity levels within a common experimental framework. We further propose SeSpeCT (Semantic-Spectral CT degradation estimation), a framework that combines semantic priors from medical vision-language models with complementary frequency-domain cues for artifact analysis. SeSpeCT constructs a training-free semantic quality axis in the multimodal embedding space using radiology-informed text prompts, without task-specific fine-tuning, and combines it with spectral features that capture degradation-specific frequency patterns. The resulting representation enables joint prediction of artifact type and severity. Experimental results show that SeSpeCT consistently outperforms the evaluated baselines under both single- and mixed-degradation settings. The framework is available at https://github.com/yousranb/CT-DEGRADBENCH.


2605.16427 2026-05-19 cs.CV cs.AI Version update

EAGT: Echocardiography Augmentation for Generalisability and Transferability

Soroush Elyasi, Sara Adibzadeh, Nasim Dadashi Serej, Julie Wall, Massoud Zolgharni

Affiliations: THRIVE Centre, University of West London; University of West London; School of Computing and Engineering, University of West London

AI summary: This paper studies how 29 data augmentation techniques and their combinations affect the generalisability and transferability of left ventricular segmentation, finding that geometric transformations outperform intensity-based augmentations and that the best combinations improve model robustness.

Abstract

Deep learning models for echocardiography segmentation often struggle to generalise across institutions, scanners, and patient populations, where collecting large, consistently annotated datasets is infeasible. Data augmentation is widely used to improve the robustness of deep learning models; however, its role in enhancing cross-dataset generalisability in echocardiography remains insufficiently understood. This study presents a large-scale multi-dataset evaluation of 29 data augmentation techniques and their pairwise combinations for 2D left ventricular segmentation using a U-Net trained on Unity, CAMUS, and EchoNet Dynamic datasets. Each augmentation was explored under several hyperparameter settings and assessed through repeated runs using Dice and IoU in both in-domain and cross-dataset scenarios, with statistical significance quantified via independent t-tests. Results show that anatomically plausible geometric transformations, particularly affine, shift-scale-rotate, perspective, and random horizontal flip, substantially improve cross-dataset performance, whereas aggressive intensity- or artefact-based augmentations often degrade generalisability. Pairwise augmentation combinations outperform individual augmentations and show that moderate flip-centric combinations, especially random horizontal flip with affine, yield consistent gains across most transfer scenarios. These findings provide empirically grounded guidance for designing augmentation policies that enhance the robustness and transferability of echocardiography segmentation models.

2605.16423 2026-05-19 cs.CV Version update

Neural Visual Decoding via Cognition-Guided Adaptive Blurring and Information-Constrained Alignment

Fan Yin, Chuhang Zheng, Peiliang Gong, Donghai Guan, Qi Zhu

Affiliations: Department of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; Department of Electrical and Information Engineering, Tianjin University

AI summary: This paper proposes the CAIA framework, which uses cognition-guided adaptive blurring and information-constrained alignment to map neural signals to visual semantics more accurately, improving Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval.

Abstract

EEG-based visual decoding aims to establish a mapping between neural signals and visual semantics. However, it remains constrained by the dual challenges of severe information granularity mismatch and the low signal-to-noise ratio (SNR) of EEG signals. Existing approaches typically treat static visual features, ignoring the dynamic selectivity of human vision and the frequency specificity of neural oscillations. To bridge this gap, we propose CAIA, a Cognitive-guided Adaptive blurring with Information-Constrained Alignment framework for Neural-Visual decoding. On the visual side, it simulates selective attention to adaptively reduce redundancy. Meanwhile, on the EEG side, it leverages neural oscillation priors and the information bottleneck mechanism to enhance SNR. Specifically, we devise a cognitive-dynamics-based adaptive blurring mechanism that dynamically integrates center-biased and saliency-guided visual cues via cross-modal attention. Furthermore, we introduce a distribution-aware boundary calibration loss to robustly rectify alignment bias caused by outlier samples. Moreover, a cognitively-guided information-screening method is proposed to select task-relevant EEG oscillations. Extensive experiments demonstrate that CAIA improves both subject-dependent and subject-independent average Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval, significantly outperforming prior methods. Our work validates that optimizing visual information density to match neural granularity offers a more interpretable and robust pathway for neural decoding.

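2605.16416 2026-05-19 cs.CV cs.AI Version update

CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

Tengda Guo, Jie Leng, Hanlei Li, Yaoyuan Liang, Qingyue Zhang, Dian Yang, Mingyu Zhang, Yuhua Fu, Shao-Lun Huang

Affiliations: Tsinghua University; Peking University; Zhejiang University of Technology

AI summary: CAVE uses a structured process-reward mechanism to improve reasoning over fragmented visual evidence, introducing three complementary signals that refine individual reasoning steps and make the model more reliable and robust.

Comments 24 pages, 6 figures. Preprint

AI-generated abstract (translated from Chinese)

Vision-language models (VLMs) perform well at general multimodal reasoning but struggle to integrate non-local visual information in support of semantically underspecified visual reasoning. This paper proposes CAVE, a GRPO-based structured process-reward method that scores the contribution of intermediate steps with three signals (belief update, evidence acquisition, and adaptive focus control), guiding the model to refine its reasoning actions and learn more reliable visual reasoning strategies. The authors also build TRACER-Bench, which covers four non-local, semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments show that CAVE substantially improves performance on tasks that require integrating fragmented visual evidence, across public benchmarks and the newly introduced TRACER-Bench, while remaining competitive on general multimodal evaluations. Further analysis shows that CAVE strengthens visual reasoning and is more robust under long-range and deep cross-region dependencies.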

Visual Search Patterns in 3D Pancreatic Imaging: An Eye-Tracking Study

Anna Anikina, Leila Khaertdinova, Trine Balschmidt, Michael B Andersen, Christoph F Müller, Erik GS Brandt, Henrik S Thomsen, Claudia Mello-Thoms, Bulat Ibragimov

Affiliations: Department of Computer Science, University of Copenhagen; Department of Radiology, Herlev Hospital; Department of Radiology, University of Iowa

AI summary: This study uses eye tracking to analyse radiologists' visual search behaviour in 3D pancreatic CT imaging, revealing their gaze patterns in space and time and offering a new perspective on diagnostic strategies.

Comments Accepted at SPIE - Medical Imaging Conference 2026

Journal ref Proc. SPIE 13928, Medical Imaging 2026: Image Perception, Observer Performance, and Technology Assessment, 1392814 (2026)

DOI: 10.1117/12.3086082

Abstract

Eye tracking has emerged as a powerful tool for examining visual perception and search strategies in various domains, including medicine. While it is relatively straightforward to apply in 2D settings, its use in 3D medical imaging remains challenging and not yet well explored. This gap is particularly relevant for radiology, where volumetric images such as computed tomography (CT) scans are routinely read by medical experts. Radiologists typically interpret these images by navigating through hundreds of 2D slices, most often viewed in the axial projection. A taxonomy of eye movement data during navigation through a CT volume could be valuable to understand how radiologists approach diagnostic tasks. As an example of the derived taxonomy, we asked two radiologists to search abdominal CTs of the pancreas. We collect eye tracking data and align eye gaze movements with slice navigation to visualize the representation of the pancreas through volume and analyze clinicians' gaze behavior in both space and time.

2605.16406 2026-05-19 cs.CV Version update

When Vision Speaks for Sound

Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu, Wendi Li, Yanan Xie, Muhao Chen, Peng Qi

Affiliations: University of California, Davis

AI summary: This paper finds that the audio understanding of MLLMs on video relies on visual cues rather than the actual audio stream, introduces the Thud framework to study the problem through three audio-editing interventions, and proposes a two-stage alignment recipe that improves model performance.

Comments 24 pages, 10 figures

Abstract

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
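The three counterfactual audio edits named in the abstract (Shift, Mute, Swap) can be pictured with a small sketch. The functions below are hypothetical illustrations that operate on a mono waveform stored as a list of samples; the actual editing procedure used by Thud is not described in this listing and may differ.

```python
def shift(audio, offset):
    """Delay the track by `offset` samples, padding with silence (probes temporal sync)."""
    if offset <= 0:
        return list(audio)
    return ([0.0] * offset + list(audio))[: len(audio)]


def mute(audio):
    """Replace the track with silence of the same length (probes sound existence)."""
    return [0.0] * len(audio)


def swap(audio, other_audio):
    """Use another clip's audio, trimmed or zero-padded to this clip's length
    (probes audio-visual consistency)."""
    other = list(other_audio)[: len(audio)]
    return other + [0.0] * (len(audio) - len(other))


clip = [0.1, 0.2, 0.3, 0.4]
print(shift(clip, 2))
print(mute(clip))
print(swap(clip, [0.9, 0.8]))
```

A model that truly listens should change its answer after each edit, while a model relying on visual cues alone would answer as if nothing happened.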

2605.16402 2026-05-19 cs.CV Version update

WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

Haoren Zhao, Tianyi Chen, Zhen Wang

Affiliations: School of Cyberspace, Hangzhou Dianzi University, Hangzhou, China; Microsoft

AI summary: This paper presents the WinDeskGround benchmark, which parametrically generates complex desktop scenes to evaluate GUI grounding robustness. Experiments show that top models perform well in simple environments but lose accuracy under partial occlusion.

Abstract

Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results demonstrate that while top-tier agents excel in simplified settings, their accuracy declines under partial occlusion. WinDeskGround provides a valuable benchmark to facilitate the assessment and advancement of GUI agent robustness in realistic environments. The code is available at https://github.com/ZZZhr-1/WinDeskGround.

2605.16401 2026-05-19 cs.CV cs.LG Version update

CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification

Turkoglu Mikael, Bary Tim, Thielens Vincent, Dausort Manon, Macq Benoît

Affiliations: ICTEAM, UCLouvain, Belgium; SAFiR Lab, Univ. of Sherbrooke, Canada; Univ. of Mons, Belgium

AI summary: CADS dynamically routes samples to optimise resource allocation, improving the efficiency and accuracy of image classification and reducing computational cost by as much as 12x.

Comments 6 pages, 2 figures, 1 table, Accepted at ICIP 2026

2605.16399 2026-05-19 cs.CV cs.LG Version update

Stable and Near-Reversible Diffusion ODE Solvers for Image Editing

Barbora Barancikova, Daniil Shmelev, Cristopher Salvi

Affiliations: Department of Computing, Imperial College London, London, United Kingdom; Department of Mathematics, Imperial College London, London, United Kingdom

AI summary: This paper proposes near-reversible Runge-Kutta methods to improve the stability and accuracy of image editing, balancing reversibility against numerical stability while retaining the background-preservation advantage.

Abstract

The inversion of diffusion models plays a central role in image editing. Algebraically reversible ODE solvers provide an appealing approach to diffusion inversion for text-guided image editing, by eliminating the inversion error inherent in DDIM-based editing pipelines. However, empirical results indicate that reversibility alone is insufficient. As edits require larger semantic or visual changes, reversible diffusion solvers often exhibit instabilities and suffer sharp drops in output quality. In this paper, we show that the trade-off between exact reversibility and numerical stability manifests empirically as a trade-off between background preservation and prompt alignment in image editing. We then investigate the use of near-reversible Runge-Kutta methods as a more stable alternative to exactly reversible diffusion schemes. When combined with a vector-field smoothing strategy, the resulting approach improves edit fidelity, remains stable under large edits, and largely retains the background-preservation benefits of reversible solvers.

2605.16397 2026-05-19 cs.CV cs.AI Version update

Trajectory-Aware Adaptive Inference in Object Detection Models

Grigorios Papanikolaou, Ioannis Kontopoulos, Giannis Spiliopoulos, Dimitris Zissis, Konstantinos Tserpes

Affiliations: Department of Electrical and Computer Engineering, National Technical University of Athens, Greece; Department of Product and Systems Design Engineering, University of the Aegean, Syros, Greece

AI summary: This paper uses GPS trajectory data to optimise inference in object detection models, introducing an early-exit mechanism that reduces computational cost and improves the efficiency of real-time perception.

Comments Accepted to the MuseKDE workshop of the IEEE MDM 2026 conference

Abstract

The increasing integration of sensors in autonomous maritime navigation has led to large-scale multimodal datasets, raising challenges in achieving efficient real-time perception. In such systems, object detection and trajectory perception of nearby vessels are tightly coupled, particularly in dynamic environments such as maritime navigation. However, the efficiency of object detection models during inference remains an often-overlooked aspect. To this end, we build upon an existing object detection framework by incorporating GPS trajectory data into the inference process to enable input-adaptive computation. Specifically, we introduce an early-exit mechanism in a YOLOv8-based detector that incorporates motion cues - such as inter-vessel distances. Frames of vessels that are separated by short distances, converging with high speed, are processed using the full model, while only a subset of the network's architecture is activated otherwise. The difficulty degree (or scene complexity) of a frame or set of frames per second is evaluated by leveraging inter-object distance and the rate at which the distance between them decreases. Experimental results demonstrate that this strategy maintains satisfactory detection performance while significantly reducing inference time and computational cost, thus enabling a flexible trade-off between accuracy and efficiency compared to full-model inference.
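The gating rule described in the abstract (full model when vessels are close and converging quickly, early exit otherwise) can be sketched as follows. The thresholds `near_m` and `closing_mps`, the function names, and the use of the haversine formula for GPS distance are all assumptions made for illustration; the paper's actual criteria are not given in this listing.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS fixes given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def use_full_model(dist_prev_m, dist_now_m, dt_s, near_m=500.0, closing_mps=5.0):
    """Return True when the frame should run the full detector.

    The closing rate is positive when the two vessels are converging.
    Both thresholds are hypothetical placeholders.
    """
    closing_rate = (dist_prev_m - dist_now_m) / dt_s
    return dist_now_m <= near_m and closing_rate >= closing_mps


# Two vessels 400 m apart ten seconds ago and 300 m apart now: close and converging.
print(use_full_model(400.0, 300.0, 10.0))    # full model
# Far apart and closing slowly: an early exit would suffice.
print(use_full_model(5000.0, 4990.0, 10.0))
```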

2605.16396 2026-05-19 cs.CV cs.LG Version update

Beyond MMSE: Enhancing PnP Restoration with ProxiMAP

Kenta Vert, Giacomo Meanti, Scott Pesme, Michael Arbel, Julien Mairal

Affiliations: Univ. Grenoble Alpes; Inria; CNRS; Grenoble INP; LJK; MaLGa Centre; DIBRIS; Università di Genova; MMS; Italian Institute of Technology

AI summary: This paper proposes ProxiMAP, which adjusts the noise schedule to keep the denoiser in-distribution, yielding more stable image reconstruction for tasks such as deblurring, inpainting, super-resolution, and phase retrieval.

Abstract

Plug-and-Play (PnP) methods have become standard tools for solving imaging inverse problems by replacing the intractable maximum a posteriori (MAP) denoiser with the MMSE one. While this mismatch has been widely treated as unavoidable, recent works have sought to close this gap by targeting the MAP with diffusion-model scores. We show this is problematic in practice: learned scores do not match the true ones, so MAP-targeting iterations converge to cartoon-like images rather than realistic ones, and better results are obtained by stopping short of convergence. We turn this observation into a design principle and introduce ProxiMAP, an iterative MAP approximation whose noise schedule keeps the iterate's residual noise matched to the denoiser's training noise. This keeps the denoiser in-distribution where its score is reliable, and yields implicit early stopping that avoids the failure mode above. ProxiMAP is a modular drop-in replacement for MMSE denoisers in standard PnP algorithms and consistently sharpens reconstructions across deblurring, inpainting, super-resolution, and phase retrieval. Building on the same principle, we propose a hybrid variant that applies ProxiMAP only in the late iterations of PnP, where the denoiser is most reliable -- matching or exceeding the full-replacement variant at a fraction of the cost.

2605.16393 2026-05-19 cs.CV cs.AI Version update

Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

Joel Valdivia Ortega, Tingying Peng, Marion Jasnin

Affiliations: Helmholtz Pioneer Campus, Helmholtz Munich, Neuherberg, Germany; School of Computation, Information and Technology, TUM, Garching, Germany; Department of Chemistry, TUM, Garching, Germany

AI summary: This paper proposes ViTC-UNet, which conditions a UNet on pre-trained ViT representations through learnable tokens and a two-way attention decoder to improve the accuracy and adaptability of biomedical semantic segmentation.

Abstract

Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.

2605.16392 2026-05-19 q-bio.QM cs.CV cs.LG Version update

Bridging the Modality Bottleneck in Pathology MIL through Virtual Molecular Staining

Yucheng Xing, Pei Liu, Jingying Ma, Ruping Hong, Jiangdong Qiu, Tianyu Liu, Kai He, Ling Huang, Mengling Feng

Affiliations: National University of Singapore; Hunan University; Peking Union Medical College Hospital (PUMCH); Imperial College London

AI summary: This paper proposes MIST, which uses virtual molecular staining to improve the projection layer in pathology MIL. It improves 240 of 256 configurations with an average gain of 3.5%, including gains of 5.2%, 3.3%, and 2.6% on survival prediction, tissue subtyping, and biomarker prediction, respectively.

Abstract

Multiple instance learning (MIL) is the dominant framework for whole-slide image analysis in computational pathology, typically combining a frozen patch encoder, a projection layer, and a slide-level aggregator. While encoders and aggregators have been extensively studied, the projection layer remains a largely morphology-only bottleneck. This limits endpoints such as biomarker status and survival, which are governed by a molecular state that is not fully captured by H&E morphology. We introduce Molecularly Informed Staining Transform (MIST), a plug-in replacement for the MIL projection layer that uses paired spatial transcriptomics only during training to construct virtual molecular stains. MIST clusters gene expression profiles into cross-modal prototypes, anchors them in the frozen foundation model feature space, and uses them to reorganize H&E patch features along molecularly guided axes. It requires no transcriptomics at inference and can be inserted before standard MIL aggregators. We evaluate MIST across 23 downstream tasks and 8 MIL aggregators. MIST improves 240 of 256 configurations over the standard projection layer, with an average gain of +3.5%, observed consistently across endpoint types: +5.2% on survival prediction, +3.3% on tissue subtyping, and +2.6% on biomarker prediction. Ablations confirm that gene-derived prototypes are the primary source of the gains, while spatial, biological, and pathological analyses show that cross-modal prototype affinities capture spatially coherent molecular programs from H&E alone.

2605.16390 2026-05-19 cs.CV cs.LG stat.ML Version update

Auditing Multimodal Large Language Model Raters: Central Tendency Bias in Clinical Ordinal Scoring

Jiaqing Zhang, Sandeep Elluri, Bhanu Cherukuvada, Yonah Joffe, Jessica Sena, Miguel Contreras, Scott Siegel, Subhash Nerella, Catherine Price, Parisa Rashidi

Affiliations: Department of Electrical & Computer Engineering; Department of Computer and Information Science and Engineering; Department of Clinical and Health Psychology; Department of Biomedical Engineering

AI summary: This paper studies central tendency bias in multimodal large language models used for clinical ordinal scoring. Benchmarking shows that three frontier LLM families systematically compress Clock Drawing Test scores toward the middle of the scale, which affects clinical decisions.

Abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
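The abstract reports MAE and within-1 accuracy, then uses a per-score analysis to expose endpoint compression. A minimal sketch of those three quantities follows; the scores are toy values invented for illustration (not data from the paper), and the function name is hypothetical.

```python
from collections import defaultdict


def rater_metrics(true_scores, pred_scores):
    """Return (MAE, within-1 accuracy, mean signed error per true score)."""
    errors = [p - t for t, p in zip(true_scores, pred_scores)]
    mae = sum(abs(e) for e in errors) / len(errors)
    within1 = sum(1 for e in errors if abs(e) <= 1) / len(errors)
    by_score = defaultdict(list)
    for t, e in zip(true_scores, errors):
        by_score[t].append(e)
    # Positive bias at the low end and negative bias at the high end
    # is the signature of compression toward the middle of the scale.
    bias = {s: sum(v) / len(v) for s, v in sorted(by_score.items())}
    return mae, within1, bias


# Toy example on a 0-5 scale: extremes are pulled inward by one point.
true = [0, 0, 2, 3, 5, 5]
pred = [1, 1, 2, 3, 4, 4]
mae, within1, bias = rater_metrics(true, pred)
print(mae, within1, bias)
```

In this toy case within-1 accuracy is perfect even though every extreme score is wrong, which is why aggregate tolerance-based agreement can hide the effect.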

2605.16384 2026-05-19 cs.CV cs.AI Version update

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

Xiusheng Huang, Xin Jiang, Jun Zhao, Kang Liu, Yequan Wang

Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence

AI summary: This paper proposes the TaTok framework, which introduces global tokens and a dynamic token filtering algorithm to address the information insufficiency and redundancy of existing methods, improving image tokenization quality and inference speed.

Comments 21 pages, 8 figures

Abstract

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.

2605.16383 2026-05-19 cs.CV cs.AI stat.ML Version update

A neurosymbolic Approach with Epistemic Deep Learning for Hierarchical Image Classification

Ezel Kilicdere, Shireen Kudukkil Manchingal, Fabio Cuzzolin

Affiliations: Institute for AI, Data Analysis and Systems (AIDAS) School of Engineering, Computing and Mathematics, Oxford Brookes University, UK

AI summary: This paper proposes a unified neurosymbolic and epistemic modelling framework that combines Swin Transformers, focal set reasoning, and differentiable fuzzy logic to improve the accuracy and logical consistency of hierarchical image classification.

Comments 36 pages

Abstract

Deep neural networks achieve high accuracy on image classification tasks. Yet they often produce overconfident predictions that fail to express epistemic uncertainty, and frequently violate logical or structural constraints present in the data. These limitations are particularly pronounced in hierarchical classification, where predictions across fine and coarse levels must remain coherent. We propose, for the first time, a unified neurosymbolic and epistemic modelling framework that augments Swin Transformers with focal set reasoning and differentiable fuzzy logic. Rather than treating labels as isolated categories, our method induces data-driven focal sets within the learnt embedding space, which helps capture epistemic uncertainty over multiple plausible fine-grained classes. These focal sets form the basis of a belief-theoretic layer that uses fuzzy membership functions and t-norm conjunctions to encourage consistency between fine- and coarse-grained predictions. A learnable loss further balances calibration, mass regularisation, and logical consistency, allowing the model to adaptively trade off symbolic structure with data-driven evidence. In experiments on hierarchical image classification, our framework maintains accuracy on par with transformer baselines while providing more calibrated and interpretable predictions, reducing overconfidence and enforcing high logical consistency across hierarchical outputs. Our experimental results show that combining focal set reasoning with fuzzy logic provides a practical step toward deep learning models that are both accurate and epistemically aware.
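To illustrate how a t-norm conjunction can score agreement between fine- and coarse-grained predictions, here is a minimal sketch using the product t-norm. The class names, the hierarchy, the probabilities, and the exact form of the consistency score are assumptions for illustration only; the paper's belief-theoretic layer over focal sets is more elaborate and is not reproduced here.

```python
def product_t_norm(a, b):
    """Product t-norm: a fuzzy conjunction of two truth degrees in [0, 1]."""
    return a * b


def hierarchy_consistency(fine_probs, coarse_probs, parent_of):
    """Sum over fine classes of T(p_fine, p_parent).

    Reaches 1.0 only when the coarse head puts full mass on the parents
    that the fine head believes in.
    """
    return sum(
        product_t_norm(p, coarse_probs[parent_of[c]])
        for c, p in fine_probs.items()
    )


# Hypothetical two-level hierarchy and toy predictions.
parent_of = {"cat": "animal", "dog": "animal", "car": "vehicle"}
fine = {"cat": 0.7, "dog": 0.2, "car": 0.1}
coarse = {"animal": 0.9, "vehicle": 0.1}

score = hierarchy_consistency(fine, coarse, parent_of)
consistency_loss = 1.0 - score  # could be added to a training objective
print(score, consistency_loss)
```

Because the product t-norm is differentiable, a term like `consistency_loss` can be minimised by gradient descent alongside the usual classification loss.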

2605.16381 2026-05-19 cs.CV cs.AI Version update

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

Ao Li, Zihan Xiao, Zihao Yue, Boshen Xu, Linli Yao, Jiaze Li, Pei Fu, Jianzhong Ju, Jian Luan, Qin Jin

Affiliations: AIM3 Lab, Renmin University of China; MiLM Plus, Xiaomi Inc.

AI summary: StreamPro introduces the CB-Stream loss and GRPO to improve proactive decision-making in streaming video, achieving strong results on StreamPro-Bench and outperforming the previous best.

Abstract

Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a "see-then-answer" paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model's ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model's ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.
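The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses in its group rather than against a learned value function. The sketch below shows that normalisation step, preceded by a hypothetical way of mixing turn-level and trajectory-level rewards. The mixing weight, the function names, and the use of the population standard deviation are assumptions; the paper's multi-grained reward design is not specified in this listing.

```python
import statistics


def combined_reward(turn_rewards, trajectory_reward, turn_weight=0.5):
    """Mix the mean turn-level reward with one trajectory-level reward.

    The equal weighting is a placeholder, not the paper's setting.
    """
    turn_part = statistics.fmean(turn_rewards) if turn_rewards else 0.0
    return turn_weight * turn_part + (1.0 - turn_weight) * trajectory_reward


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of rollouts for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for the same streaming trajectory (toy rewards).
group = [
    combined_reward([1.0, 1.0], 1.0),
    combined_reward([0.0, 1.0], 0.0),
    combined_reward([0.0, 0.0], 0.0),
    combined_reward([1.0, 0.0], 1.0),
]
advantages = group_relative_advantages(group)
print(group)
print(advantages)
```

Rollouts that beat the group average receive a positive advantage and are reinforced; a group with identical rewards yields zero advantages and therefore no learning signal.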

2605.16376 2026-05-19 eess.IV cs.CV cs.DC cs.LG cs.MM Version update

Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG

Marco Graziano

Affiliations: Graziano Labs Corp.

AI summary: Kelvin v1.0 applies content-adaptive pixel adjustments ahead of H.264 encoding and achieves better BD-VMAF than baseline libx264 on both the UVG and MCL-JCV datasets, while addressing the engineering challenge of H.264 being non-differentiable.

Abstract

Kelvin is a lightweight learned pre-encoder that sits in front of an unmodified libx264 encoder. It applies content-adaptive pixel adjustments, bounded at +/-1/255 per channel, so that the encoder allocates bits where they matter most perceptually, while emitting a standard H.264 bitstream compatible with every existing decoder, player, and CDN. On the seven-sequence 1080p UVG benchmark, Kelvin v1.0 achieves a mean BD-VMAF of -27.62% (7 of 7 wins) and BD-VMAF-NEG of -5.18% (6 of 7 wins) relative to baseline libx264 at preset medium. On the 30-sequence MCL-JCV public set (28 unseen by training), the same checkpoint wins on 28 of 30 clips by BD-VMAF; with the two diagnosable failures removed the mean is -27.70% BD-VMAF and -5.37% BD-VMAF-NEG, consistent with UVG to within one percentage point. A central engineering challenge is the non-differentiability of H.264: we describe a hybrid codec proxy that combines a calibrated differentiable rate estimator (Spearman rho = 0.986 vs. real libx264 bits-per-pixel) with a U-Net distortion proxy trained on real encoder outputs. We publish full per-sequence rate-distortion data, a named failure-mode taxonomy on MCL-JCV (rate-floor violation, distribution shift, metric saturation), a five-baseline sanity panel (hqdn3d, unsharp, -tune psnr, -tune ssim, x265 medium), and honest positioning: x265 medium beats Kelvin on every metric on the same corpus. Kelvin is therefore designed for workloads where remaining on H.264 is a constraint rather than a choice.
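The abstract states that Kelvin's pixel adjustments are bounded at +/-1/255 per channel. A minimal sketch of enforcing such a bound is shown below, assuming channel values normalised to [0, 1]; the function name and the source of the proposed adjustments are hypothetical, and the learned network that would produce them is not modelled.

```python
BOUND = 1.0 / 255.0  # one 8-bit code value in normalised units


def apply_bounded_adjustment(pixels, deltas, bound=BOUND):
    """Add a per-channel adjustment, clipped to [-bound, +bound], to each value.

    `pixels` and `deltas` are flat sequences of floats; the result is also
    clipped to the valid [0, 1] range.
    """
    out = []
    for x, d in zip(pixels, deltas):
        d = max(-bound, min(bound, d))
        out.append(max(0.0, min(1.0, x + d)))
    return out


pixels = [0.5, 1.0, 0.0, 0.25]
proposed = [0.1, 0.1, -1.0, 0.001]  # e.g. raw outputs of a learned model
adjusted = apply_bounded_adjustment(pixels, proposed)
print(adjusted)
```

The clip guarantees that no channel moves by more than one 8-bit step, however large the proposed change is.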

2605.16373 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT

跨源监督在双模态PET-CT骨感染分割中的应用

Zonglin Yang, Xiaolei Diao, Jishizhan Chen, Xiaozhuang Man, Wei Kong, Gen Wen, Pengfei Cheng, Daqian Shi

发表机构 * Shanghai Maritime University（上海海事大学）； University College London（伦敦大学学院）； Shanghai Sixth People’s Hospital（上海市第六人民医院）； Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine（上海交通大学医学院附属第六人民医院）； Queen Mary University of London（伦敦玛丽女王大学）

AI总结本文提出一种双模态端到端分割框架，通过早融合多模态表示整合PET代谢信号和CT骨窗解剖信息，解决标注不一致下的骨感染分割问题，采用患者级3D体积评估和交叉验证提高性能。

详情

AI中文摘要

骨感染的早期准确诊断和病灶定位对临床治疗至关重要。PET-CT结合了CT的解剖信息和PET的代谢信息，是诊断骨感染的重要成像模态。然而，由于病灶边界不清晰，且不同专家或自动化系统生成的标注不一致，准确的病灶分割仍具挑战性。本文研究了标注不一致条件下的骨感染多模态分割。我们开发了一个双模态端到端分割框架，通过早期融合的多模态表示整合PET代谢信号和CT骨窗解剖信息。为了缓解小数据集中切片间相关性导致的性能虚高，本研究弃用传统的二维评估方法，采用严格的患者级3D体积评估和交叉验证。此外，我们没有强行达成单一共识，而是提出了一种解耦的双源学习框架，其中并行模型分别在由高灵敏度和高特异性临床意图驱动的独立专家标注上进行训练。实验结果客观报告了患者级的性能波动（均值+标准差和均值-标准差），证明了多模态PET-CT融合的有效性。交叉评估矩阵定量揭示了模型如何成功内化不同专家的诊断理念，为骨感染分割中的临床AI部署提供了一种稳健且保持多样性的范式。

英文摘要

Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation. To mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.

2605.16372 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SwordBench: Evaluating Orthogonality of Steering Image Representations

SwordBench：评估转向图像表示的正交性

Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek

发表机构 * Centre for Credible AI（可信人工智能中心）； Warsaw University of Technology（华沙理工大学）； University of Warsaw（华沙大学）

AI总结本文提出SwordBench，用于评估视觉模型在多个骨干网络和概念移除任务中的图像表示转向，引入了交叉概念鲁棒性和附带损害（collateral damage）等新评估指标，发现线性SVM虽然在可分性和正交性上表现更优，但无法实现零附带损害，且往往落后于稀疏自编码器。

详情

AI中文摘要

在推理时对模型表示进行转向或干预以校正预测，对AI可解释性和安全性至关重要，但现有评估协议局限于存在歧义的语言建模任务。为填补这一空白，我们引入SwordBench，一个用于评估视觉模型在多个骨干网络和概念移除任务中图像表示转向的基准。除了统一的基准测试套件外，我们还提出了新的评估概念，揭示了概念激活向量之间的正交化对实用转向的二阶效应。具体而言，交叉概念鲁棒性衡量在针对其他概念做过正交化的输入上概念检测性能的稳定性，而附带损害（collateral damage）量化对于不含该偏置的输入，转向是否会无意中影响模型在下游任务上的性能。我们发现，尽管线性支持向量机在可分性和正交性上表现更优，但它无法实现零附带损害，且往往落后于稀疏自编码器。在更简单的设置中，标准基线和基于优化的方法都无法实现完美的转向。源代码将很快在GitHub上发布。

英文摘要

Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
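
Concept removal by orthogonalization, as discussed above, can be sketched in a few lines. This is a generic linear projection under the assumption that a concept is represented by a single direction vector; the names `remove_concept`, `cav_a`, and `cav_b` are hypothetical and not taken from the SwordBench code.

```python
import numpy as np


def remove_concept(reps, cav):
    """Project each row of `reps` onto the hyperplane orthogonal to `cav`.

    reps: (N, D) array of representations
    cav:  (D,) concept activation vector (assumed non-zero)
    """
    unit = cav / np.linalg.norm(cav)
    return reps - np.outer(reps @ unit, unit)


rng = np.random.default_rng(0)
reps = rng.normal(size=(5, 8))
cav_a = rng.normal(size=8)   # concept being removed
cav_b = rng.normal(size=8)   # a different, non-orthogonal concept

steered = remove_concept(reps, cav_a)
residual_a = np.abs(steered @ cav_a).max()            # ~0 after removal
shift_b = np.abs((steered - reps) @ cav_b).max()      # side effect on concept B
```

Because `cav_a` and `cav_b` are generally not orthogonal, removing concept A also moves the representations along concept B (`shift_b` is non-zero). That kind of side effect is what the cross-concept robustness and collateral damage notions are meant to quantify.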

2605.16371 2026-05-19 cs.CV cs.AI 版本更新

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

GeoSym127K：可扩展的符号验证合成用于多模态几何推理

Jinhao Jing, Zheng Ma, Jinwei Liang, Qiannian Zhao, Shawn Chen, Jing Yang, Por Lip Yee, Prayag Tiwari, Jingjing Bai, Benyou Wang, Lewei Lu, Zhan Su

发表机构 * School of Information Technology, Halmstad University（哈姆斯塔德大学信息科技学院）

AI总结本文提出GeoSym引擎，通过类型条件语法和解析式SymGT求解器生成精确的符号真值，构建了包含51K高分辨率图像、127K问题和55K经答案验证的CoT问答对的GeoSym127K数据集，并展示了其在几何推理任务中的性能提升。

详情

AI中文摘要

大型多模态模型（LMMs）在几何推理中常因视觉幻觉和缺乏数学上精确的Chain-of-Thought（CoT）数据而遇到困难。为此，我们提出了GeoSym引擎，一种自动且可扩展的神经符号框架。通过利用类型条件语法和解析式SymGT求解器，它能够推导出精确的符号真值，并与稳健的渲染管线无缝集成，生成高精度的几何图示。使用该引擎，我们构建了GeoSym127K，一个按难度分层的数据集，包含51K高分辨率图像、127K带有符号真值的问题和55K经答案验证的CoT问答对。我们还引入了GeoSym-Bench，一个由专家整理的包含511个复杂样本的评测集，用于严格评估。通过广泛的监督微调（SFT），我们证明GeoSym在依赖图示和多步骤的几何任务上带来了集中的改进。我们的Qwen3-VL-8B模型在MathVerse Vision-Only子集上获得了+22.21%的绝对提升，并在WeMath上达到61.52%（提升6.19%），缓解了长程逻辑碎片化问题，并优于Doubao-1.8等先进的闭源模型。此外，通过GRPO应用Reinforcement Learning with Verifiable Rewards（RLVR）的实验表明，与零样本RL相比，从结构化SFT检查点初始化能显著提高性能上限。这一结果由确定性的精确匹配信号驱动，展示了我们可验证推理合成方法的稳健扩展潜力。数据集和代码可在https://huggingface.co/datasets/Tomie0506/GeoSym127K和https://github.com/Tomie56/GeoSym127K获得。

英文摘要

Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.
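
To make the "deterministic exact-match signals" with GRPO more concrete, here is a minimal sketch of the standard group-relative advantage computed from binary exact-match rewards. The string normalization, the `eps` value, and the function names are assumptions for illustration; the paper's actual reward and training code may differ.

```python
import numpy as np


def exact_match_reward(prediction, ground_truth):
    """Binary reward: 1.0 if the answer matches the symbolic ground truth."""
    return 1.0 if prediction.strip() == ground_truth.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Four sampled answers to one question whose verified answer is "12".
answers = ["12", "15", "12 ", "7"]
rewards = [exact_match_reward(a, "12") for a in answers]
advantages = group_relative_advantages(rewards)
```

Correct samples receive positive advantages and incorrect ones negative. If every sample in a group gets the same reward, all advantages are zero and that group contributes no gradient signal.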

2605.16366 2026-05-19 cs.CV cs.AI 版本更新

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

Fre-Res: 频率-残差视频令牌压缩用于高效的视频多模态大语言模型

Yigui Feng, Qinglin Wang, Yang Liu, Jie Liu

发表机构 * College of Computer Science, National University of Defense Technology（国防科技大学计算机学院）； Shien-Ming Wu School of Intelligent Engineering, South China University of Technology（华南理工大学吴贤铭智能工程学院）

AI总结 Fre-Res通过分离空间和时间信息，实现视频令牌压缩，在保持细节精度的同时提升效率，适用于短时事件和长视频推理。

Comments 24 pages, 5 figures

详情

AI中文摘要

视频多模态大语言模型面临空间保真度与时间覆盖度之间的矛盾：保留细粒度视觉细节需要大量空间令牌，而捕捉短暂事件需要密集的时间采样。我们提出Fre-Res，一种预算自适应的双轨视频令牌压缩框架，分别处理这两种证据形式。Fre-Res保留稀疏的高保真空间锚点，并通过紧凑的残差频域令牌表示密集的时间演变。具体而言，它对视觉潜在空间中的帧间残差轨迹应用时间1D-DCT，在其中观察到强低频集中。为对齐频域动态与原生视觉嵌入，Fre-Res引入了空间引导吸收器，将时间残差信息注入与空间锚点对应的令牌中。在细粒度短视频和长视频推理基准上，Fre-Res实现了有利的准确率-效率权衡，匹配或接近全令牌性能，同时显著减少视觉令牌长度。广泛消融实验进一步表明，时间频域残差保留因果转换线索，而空间锚点对细粒度物体和布局推理至关重要。

英文摘要

Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy-efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
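
The temporal 1D-DCT over inter-frame residuals can be sketched as follows. This is a toy illustration on a synthetic, slowly varying feature trajectory, assuming an orthonormal DCT-II along the time axis; the helper names and the choice of keeping four coefficients are hypothetical and do not reproduce the Fre-Res pipeline or its Spatial-Guided Absorber.

```python
import numpy as np


def dct2_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; rows are frequencies."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m


def residual_dct(latents):
    """latents: (T, D) features of one spatial location over T frames."""
    residuals = np.diff(latents, axis=0)        # (T-1, D) inter-frame residuals
    basis = dct2_matrix(residuals.shape[0])
    return basis @ residuals, basis             # DCT along the time axis


def reconstruct(first_frame, coeffs, basis, keep):
    """Rebuild the trajectory from the anchor frame and `keep` low frequencies."""
    residuals = basis[:keep].T @ coeffs[:keep]
    return np.vstack([first_frame, first_frame + np.cumsum(residuals, axis=0)])


t = np.linspace(0.0, 1.0, 17)
latents = np.stack([np.sin(np.pi * t + p) for p in (0.0, 0.7, 1.3)], axis=1)
coeffs, basis = residual_dct(latents)
low_energy_fraction = float((coeffs[:4] ** 2).sum() / (coeffs ** 2).sum())
exact = reconstruct(latents[0], coeffs, basis, keep=coeffs.shape[0])
approx = reconstruct(latents[0], coeffs, basis, keep=4)
```

For smooth trajectories most of the residual energy sits in the first few coefficients, so storing one anchor frame plus a handful of low-frequency terms approximates the full sequence with far fewer values.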

2605.16359 2026-05-19 cs.CV cs.AI 版本更新

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

多模态语言模型需要多少视觉标记？通过F^3A进行视觉标记剪枝的扩展

YiJie Huang, Yiqun Zhang, Zhuoyue Jia, Xiaocui Yang, Junzhao Huang, Zihan Wang, Shi Feng, Daling Wang, Yifei Zhang, Yongkang Liu

发表机构 * School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China（东北大学计算机科学与工程学院，沈阳 110819，中国）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； School of Computer and Communication Engineering, Northeastern University, Qinhuangdao 066004, China（东北大学计算机与通信工程学院，秦皇岛 066004，中国）

AI总结本文提出F^3A方法，通过任务条件证据搜索优化视觉标记分配，在不训练模型的情况下实现高效的视觉标记剪枝，保留原始多模态提示和解码流程。

详情

AI中文摘要

视觉语言模型通过将越来越长的视觉标记序列输入语言骨干网络来提升感知能力，但由此产生的推理成本提出了一个基本的扩展问题：随着多模态模型的增长，实际上需要多少视觉标记，以及在固定的视觉标记预算下应如何分配？现有的免训练剪枝方法通常用一次性代理指标来回答这个问题，例如解码器注意力、视觉相似性或条件多样性。我们认为，将视觉标记剪枝视为任务条件下的证据搜索更为合适，尤其是在激进压缩和跨模型规模的情况下。我们提出F^3A，一种免训练的视觉标记剪枝路由器，在语言模型处理图像标记之前运行。F^3A构建轻量级的问题条件线索，通过冻结的稀疏感知头将它们与视觉网格标记匹配，并通过粗粒度证据定位、局部细化、保持覆盖的竞争以及对覆盖不足区域的恢复来分配固定的视觉标记预算。它不需要模型训练，不需要额外的LLM前向传播，并保留原始的多模态提示和解码流程。

英文摘要

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training, no extra LLM forward pass and preserves the original multimodal prompting and decoding pipeline.

2605.16357 2026-05-19 eess.SP cs.AI cs.CV 版本更新

A₃B₂：一种自适应非对称适配器，用于缓解视觉-语言图像分类中的分支偏差

Yiyun Zhou, Zhonghua Jiang, Wenkang Han, Kunxi Li, Mingjing Xu, Chang Yao, Jingyuan Chen

发表机构 * Zhejiang University（浙江大学）； Swansea University（斯旺西大学）

AI总结本文提出A₃B₂适配器，通过引入不确定性感知适配器阻尼机制，缓解少样本学习中的分支偏差问题，实验表明其在多个数据集上优于现有基线方法。

Comments Accepted by IJCAI 2026

详情

AI中文摘要

高效的迁移学习方法为大规模视觉-语言模型（例如CLIP）提供了强大的少样本迁移能力，但现有适配方法遵循固定微调范式，隐含假设图像和文本分支的重要性是均匀的，这一假设在图像分类中未被系统研究。通过深入分析，我们揭示了视觉-语言图像分类中的分支偏差问题：在分布外设置下，适配图像编码器并不总能提高性能。受此启发，我们提出了A₃B₂，一种自适应非对称适配器，用于缓解少样本学习中的分支偏差。A₃B₂引入了不确定性感知适配器阻尼（UAAD），在预测不确定性较高时自动抑制图像分支适配，实现软且数据驱动的控制，无需手动干预。在架构上，A₃B₂采用了一种轻量级非对称设计，受混合专家启发，结合负载平衡正则化。在三个少样本图像分类任务上，对11个数据集的广泛实验表明，A₃B₂在多个数据集上一致优于11个竞争的提示和适配基线方法。

英文摘要

Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A₃B₂, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A₃B₂ introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A₃B₂ adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A₃B₂ consistently outperforms 11 competitive prompt- and adapter-based baselines.

2605.11555 2026-05-19 cs.CV 版本更新

ScribbleDose: Scribble-Guided Dose Prediction in Radiotherapy

ScribbleDose：放射治疗中的涂鸦引导剂量预测

Zhenxi Zhang, Yitao Zhuang, Yao Pu, Peixin Yu, Zirong Li, Yan Xia, Hui Li, Bin Li, Fuchen Zheng, Ge Ren

发表机构 * Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR（香港理工大学医疗科技及资讯学系）； The Hong Kong Polytechnic University Shenzhen Research Institute, The Hong Kong Polytechnic University, China（香港理工大学深圳研究院）； Department of Orthodontics and Orofacial Orthopedics, Friedrich-Alexander-University Erlangen-Nuremberg, Germany（德国埃尔朗根-纽伦堡弗里德里希-亚历山大大学口腔正畸与颌面矫形科）； Institute of Scientific Instrumentation, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China（中国科学院深圳先进技术研究院科学仪器研究所）； Department of Computer and Information Science, University of Macau, Macau SAR（澳门大学电脑及资讯科学系）

AI总结本文提出一种基于稀疏涂鸦的剂量预测框架，通过生成密集解剖学掩码和结构引导剂量生成模块，提高剂量预测精度并降低标注成本。

Comments Preprint of the submitted version before peer review. The final Version of Record will be available in the MICCAI 2026 proceedings published by Springer

详情

从像素到地点：一个系统性基准，用于评估大语言模型中的图像地理定位能力

Lingyao Li, Runlong Yu, Qikai Hu, Bowei Li, Min Deng, Yang Zhou, Xiaowei Jia

发表机构 * University of South Florida, Tampa, USA； University of Alabama, Tuscaloosa, USA； University of Michigan, Ann Arbor, USA； Texas Tech University, Lubbock, USA； Texas A&M University, College Station, USA； University of Pittsburgh, Pittsburgh, USA

AI总结本文提出IMAGEO-Bench基准，系统评估大语言模型在图像地理定位中的准确性、距离误差、地理偏见和推理过程，揭示闭源模型在高资源区域表现优于欠代表区域。

详情

AI中文摘要

图像地理定位，即识别图像中描绘的地理位置，对危机响应、数字取证和基于位置的智能应用至关重要。尽管近期大语言模型（LLMs）的进步为视觉推理提供了新机会，但其在图像地理定位能力方面仍缺乏系统评估。本文介绍IMAGEO-Bench基准，系统评估准确性、距离误差、地理偏见和推理过程。该基准包含三个多样化的数据集，涵盖全球街道场景、美国兴趣点（POIs）和一个未见图像的私有集合。通过在10种最先进的LLMs（包括开源和闭源模型）上的实验，我们揭示了明显的性能差异，闭源模型通常表现更优。重要的是，我们发现地理偏见，因为LLMs在高资源区域（如北美、西欧和加州）表现更好，而在欠代表区域表现下降。回归诊断显示，成功的地理定位主要依赖于识别城市环境、户外环境、街道级影像和可识别地标。总体而言，IMAGEO-Bench为LLMs的空间推理能力提供了严格视角，并为构建地理定位感知的AI系统提供了启示。

英文摘要

Image geolocalization, the task of identifying the geographic location depicted in an image, is important for applications in crisis response, digital forensics, and location-based intelligence. While recent advances in large language models (LLMs) offer new opportunities for visual reasoning, their ability to perform image geolocalization remains underexplored. In this study, we introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process. Our benchmark includes three diverse datasets covering global street scenes, points of interest (POIs) in the United States, and a private collection of unseen images. Through experiments on 10 state-of-the-art LLMs, including both open- and closed-source models, we reveal clear performance disparities, with closed-source models generally showing stronger reasoning. Importantly, we uncover geospatial biases as LLMs tend to perform better in high-resource regions (e.g., North America, Western Europe, and California) while exhibiting degraded performance in underrepresented areas. Regression diagnostics demonstrate that successful geolocalization is primarily dependent on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks. Overall, IMAGEO-Bench provides a rigorous lens into the spatial reasoning capabilities of LLMs and offers implications for building geolocation-aware AI systems.
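
The distance-error metric mentioned above is commonly computed as great-circle distance between predicted and true coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; the function names and the 25 km threshold are illustrative assumptions, not the benchmark's actual evaluation code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (an assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def accuracy_within(errors_km, threshold_km):
    """Fraction of predictions whose distance error is within the threshold."""
    return sum(e <= threshold_km for e in errors_km) / len(errors_km)


# Hypothetical (predicted, true) coordinate pairs.
pairs = [((48.8566, 2.3522), (48.8584, 2.2945)),    # same city
         ((40.7128, -74.0060), (34.0522, -118.2437))]  # different coasts
errors = [haversine_km(*pred, *true) for pred, true in pairs]
city_level = accuracy_within(errors, threshold_km=25.0)
```

Reporting accuracy at several distance thresholds alongside the raw error makes it easier to compare models that are precise in some regions and far off in others.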

2505.17352 2026-05-19 cs.CV 版本更新

Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey

通过强化学习和奖励建模对扩散模型的对齐与安全性：综述

Preeti Lamba, Kiran Ravish, Ankita Kushwaha, Pawan Kumar

发表机构 * International Institute of Information Technology（国际信息科技学院）

AI总结本文综述了通过强化学习、奖励建模等方法对文本到图像扩散模型进行对齐和安全性的最新进展，探讨了反馈来源、奖励信号形式、优化机制等五个维度，并提出了多目标对齐、反馈高效偏好学习等开放性挑战。

Comments 3 figures, 1 table

详情

AI中文摘要

扩散模型已成为图像和多模态生成的核心范式，但其部署引发了关于对齐、安全性、偏好满足和滥用鲁棒性的持续疑问。本文综述了通过强化学习、奖励建模、偏好优化和安全特定微调对齐文本到图像扩散模型的最新进展。我们沿五个轴组织文献：反馈来源、奖励或偏好信号形式、优化机制、分布偏移和奖励过度优化的处理、以及安全作为显式约束而非一般偏好的程度。本文涵盖了人类反馈强化学习、KL正则化策略优化、直接偏好优化、二元效用优化、可微奖励微调、替代奖励学习、区域感知微调以及安全导向的DPO变体。为使综述易于理解，我们包含了扩散采样、奖励建模和偏好优化的教程解释，并简要连接了图像扩散对齐与新兴文本和掩码语言扩散模型。我们还比较了代表性方法在反馈要求、计算成本、可扩展性、对奖励黑客的易感性和安全性关键部署的适用性。最后，我们将文献综合为一组开放性挑战：多目标对齐、反馈高效偏好学习、对抗鲁棒安全对齐、在变化规范下的持续对齐和可解释奖励建模。本文的目标是为新兴的扩散模型对齐领域提供一个连贯的技术图谱，并识别在对齐生成模型可靠部署前必须解决的方法学缺口。

英文摘要

Diffusion models have become a central paradigm for image and multimodal generation, yet their deployment raises persistent questions about alignment, safety, preference satisfaction, and robustness to misuse. This survey reviews recent progress on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning. We organize the literature along five axes: the source of feedback, the form of the reward or preference signal, the optimization mechanism, the treatment of distribution shift and reward overoptimization, and the extent to which safety is addressed as an explicit constraint rather than a generic preference. The review covers reinforcement learning from human feedback, KL-regularized policy optimization, direct preference optimization, binary utility optimization, differentiable reward fine-tuning, surrogate reward learning, region-aware fine-tuning, and safety-oriented DPO variants. To make the survey accessible, we include tutorial explanations of diffusion sampling, reward modeling, and preference optimization, and briefly connect image diffusion alignment to emerging text and masked language diffusion models. We also compare representative methods in terms of feedback requirements, computational cost, scalability, susceptibility to reward hacking, and suitability for safety-critical deployment. Finally, we synthesize the literature into a set of open challenges: multi-objective alignment, feedback-efficient preference learning, adversarially robust safety alignment, continual alignment under changing norms, and interpretable reward modeling. The goal of this survey is to provide a coherent technical map of the emerging area of diffusion model alignment and to identify the methodological gaps that must be addressed before aligned generative models can be reliably deployed.
