arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Multimodal and Vision-Language Models (10 papers)

2606.13929 2026-06-15 cs.CV cs.LG New submission

Self-Evolving Visual Questioner

Yijun Liang, Hengguang Zhou, Ming Li, Lichen Li, Cho-Jui Hsieh, Tianyi Zhou

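Affiliations: University of Maryland, College Park; University of California, Los Angeles; Peking University; Arena; MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

AI summary: Proposes a self-evolving framework in which a vision-language model acts as both question proposer and filter, generating harder, more informative, and more visual-centric questions without external supervision while preserving exploration diversity to avoid training collapse. It substantially improves the quality of autonomous question generation and expands its difficulty boundary.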

Comments 21 pages, including references and appendix. Project Page is available at https://joliang17.github.io/SelfEvolvingVQG/

Abstract

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners' performance is bottlenecked by the availability of high-quality training data or the cost of curating them. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and substantially expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.

2606.14703 2026-06-15 cs.CV cs.CL cs.LG New submission

Gaze Heads: How VLMs Look at What They Describe

Rohit Gandikota, David Bau

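Affiliations: Northeastern University

AI summary: Identifies a set of "gaze heads" in the language backbone of vision-language models whose attention tracks the image region currently being described. Intervening on these heads steers what the model describes, with 83.1% accuracy.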

Abstract

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/
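The abstract says gaze heads are found "with a simple correlation score from a few forward passes" but does not give the formula. As a rough illustration only, the sketch below assumes a Pearson correlation between the attention mass a head places on each comic panel and an indicator of which panel is being described at that step. The data layout (`attn_mass`, `described_panel`) and the head names are hypothetical, not taken from the paper or its code.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def gaze_scores(attn_mass, described_panel):
    """Score each head by how well its attention follows the described panel.

    attn_mass[head][t][p] is the attention mass the head puts on panel p at
    generation step t (hypothetical layout); described_panel[t] is the index
    of the panel being described at step t.
    """
    scores = {}
    for head, per_step in attn_mass.items():
        xs, ys = [], []
        for t, masses in enumerate(per_step):
            for p, mass in enumerate(masses):
                xs.append(mass)
                ys.append(1.0 if described_panel[t] == p else 0.0)
        scores[head] = pearson(xs, ys)
    return scores

def top_k_heads(scores, k):
    """Return the k highest-scoring heads, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With scores like these, the top-ranked heads would be the candidates for the attention-mask intervention the abstract describes; the intervention itself is not sketched here.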

2606.13707 2026-06-15 cs.AI cs.CL cs.CV Cross-list

Orchestra-o1: Omnimodal Agent Orchestration

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

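Affiliations: CUHK (The Chinese University of Hong Kong); LIGHTSPEED; PKU (Peking University); THU (Tsinghua University); Tongji University

AI summary: Proposes Orchestra-o1, an omnimodal agent orchestration framework whose unified orchestration mechanism enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. It surpasses the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. The paper also introduces DA-GRPO, a reinforcement learning method used to train Orchestra-o1-8B, which achieves state-of-the-art performance among open-source omnimodal agents.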

Abstract

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

2606.14106 2026-06-15 cs.MA cs.CV Cross-list

Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents

Seoyoung Choi, Minseok Ko, Hyunseok Lee, Kunwoong Kim, Woomin Song, Chanseok Jeon, Jinwoo Shin

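Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

AI summary: Proposes Action-Grounded Visual Memory (AGMem), which stores image crops of the local GUI regions tied to successful actions instead of full screenshots. This reduces action-level errors in GUI agents and improves task success rate on OSWorld by 33.3% over full-image memory.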

Comments 9 pages, 5 figures, ICML 2026 WORKSHOP

Abstract

Graphical User Interface (GUI) agents are increasingly used to automate complex computer tasks across applications, websites, and operating systems. To improve their reliability, recent work has introduced experiential memory, where agents retrieve prior trajectories to guide decision-making in similar states. More recent approaches further extend this idea to visual memory by storing and retrieving screenshots from past interactions, providing agents with richer contextual information than text-only memories. However, the effect of visual memory in GUI agents remains insufficiently understood: it is unclear which failures visual memory mitigates, or which failures it exacerbates. To systematically analyze the effect of visual memory, we introduce a taxonomy of four GUI agent failures (i.e., cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error) that map to distinct stages of the perception-reasoning-action pipeline. We find that prepending full-image memory has a divergent effect on the failure distribution: it reduces state-level failures but worsens action-level ones, and increases hidden operation blindness and grounding error. Motivated by this finding, we propose Action-Grounded Visual Memory (AGMem), an action-grounded memory framework for GUI agents. The core idea of AGMem is to store image crops that capture the local GUI region closely related to a successful action or a recovery, rather than storing full screenshots. Experiments on OSWorld show that AGMem improves task success rates by 33.3% over full-image memory. These results demonstrate that AGMem is an effective representation for visual memory in GUI agents.
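The core idea above, storing a crop of the local GUI region around an action rather than the full screenshot, can be illustrated with a small geometry helper. This is only a sketch under assumptions: the abstract does not say how AGMem chooses crop size or shape, so the square window, the `half_size` default, and the function name are all hypothetical.

```python
def action_crop_box(x, y, screen_w, screen_h, half_size=64):
    """Return (left, top, right, bottom) for a square crop around an action point.

    The crop is centered on (x, y) where possible and shifted inward so that it
    never leaves the screen. half_size is an illustrative default, not a value
    from the paper.
    """
    size = min(2 * half_size, screen_w, screen_h)
    left = min(max(x - size // 2, 0), screen_w - size)
    top = min(max(y - size // 2, 0), screen_h - size)
    return (left, top, left + size, top + size)
```

A memory entry would then pair the cropped pixels with the action that succeeded there, so that retrieval returns a small, action-relevant image instead of a whole screen.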

2606.14172 2026-06-15 cs.LG cs.CV Cross-list

Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

Sirui Zhang, Xu Wang, Zhengyu Wu, Xunkai Li, Hongchao Qin

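Affiliations: University of Science and Technology of China

AI summary: Proposes the CoMAG framework, which handles graph-centric and modality-centric tasks in a unified way through task-adaptive reliable context learning and modality-preserving hop-token alignment. It improves structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.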

Abstract

Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative representations, and modality-centric tasks requiring fine-grained cross-modal correspondence. However, existing MAG methods often rely on fixed graph contexts or uniformly fused representations, causing task-agnostic propagation and over-compressed fusion that hinder diverse task requirements and modality-specific evidence preservation. To address this, we propose CoMAG, a unified MAG backbone that learns task-adaptive reliable contexts and modality-preserving alignment within them. CoMAG first conducts Reliable Context Learning by estimating edge reliability from multimodal semantic consistency, complementing raw topology with semantic neighbors, and selecting context components through a task-aware gate. It then performs Modality-preserving Hop-token Alignment by maintaining modality-specific multi-hop trajectories, matching modality-hop tokens across modalities, and decoupling shared and private representations. Thus, CoMAG produces graph and modality representations from one forward pass while retaining modality-specific cues. We further analyze stable propagation, over-smoothing mitigation, and modality-collapse control. Experiments on nine OpenMAG datasets compare CoMAG with feature-only, graph-only, multimodal, and unified MAG baselines across graph-level prediction, modality matching, and graph-conditioned generation. Results show that CoMAG achieves the best reported performance, demonstrating that task-adaptive reliable contexts and modality-preserving alignment improve structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.
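The abstract mentions "estimating edge reliability from multimodal semantic consistency" without giving the estimator. One plausible reading, offered here purely as an assumption, is to compare the endpoint embeddings of an edge modality by modality and average the agreement. The node layout (a dict from modality name to embedding) and the rescaling to [0, 1] are illustrative choices, not CoMAG's actual definition.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

def edge_reliability(node_u, node_v):
    """Mean per-modality cosine similarity of two nodes, rescaled to [0, 1].

    node_u and node_v map modality names (e.g. "text", "image") to embeddings.
    Only modalities present on both endpoints are compared.
    """
    shared = sorted(set(node_u) & set(node_v))
    if not shared:
        return 0.0
    sims = [cosine(node_u[m], node_v[m]) for m in shared]
    return (sum(sims) / len(sims) + 1.0) / 2.0
```

Under this reading, an edge whose endpoints agree in both text and image space would score near 1 and could be kept or up-weighted during propagation, while a low score would flag a likely noisy edge.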

2508.03736 2026-06-15 cs.CV cs.AI Updated version

Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

Rafayel Mkrtchyan, Armen Manukyan, Hrant Khachatrian, Theofanis P. Raptis

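Affiliations: Yerevan State University; Consiglio Nazionale delle Ricerche (National Research Council of Italy)

AI summary: Proposes a DINOv2-based deep learning framework that fuses open-source maps with RF data, using a vision transformer to process both modalities jointly. It reaches 65.3% macro IoU on the synthetic dataset and 64.9% on real-world data, well above methods that rely on a single data source.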

Comments Work supported by funding under the bilateral agreement between CNR (Italy) and HESC MESCS RA (Armenia) as part of the DeepRF project for the 2025-2026 biennium, and by the HESC MESCS RA grant No. 22rl-052 (DISTAL)

Journal ref Pervasive and Mobile Computing, Article 102261, 2026

Abstract

In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining (possibly erroneous) maps from open-source platforms with pervasive radio frequency (RF) data collected from multiple wireless user equipments and base stations. Unlike prior methods, our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For evaluation purposes, we employ a synthetic dataset co-produced by Huawei. To address the challenges associated with real-world data imperfections, we introduce controlled noise to its RF data so as to simulate real-world conditions. Additionally, we develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics: the Jaccard index (intersection over union, IoU), the Hausdorff distance, and the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed, which yields 42.2%. The comparative evaluation highlights the limitations of relying solely on RF data or on spatial data, as well as the effectiveness of AI in fusing data to enhance smart city mapping accuracy. We further validate our method on real-world data from the Oslo region, complementing the synthetic evaluation with a real deployment setting, where our best fusion model reaches 64.9% macro IoU. We additionally outline a strategy for deploying the model over larger areas by tiling the region with overlapping windows.
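Two ideas in this abstract lend themselves to a short worked example: the Jaccard index (IoU) used as the headline metric, and tiling a large region with overlapping windows. The sketch below is illustrative only. It treats masks as nested lists of 0/1, averages IoU over samples as one possible reading of "macro" (the paper may average over classes or maps instead), and picks window positions with a simple stride rule; none of these details are confirmed by the abstract.

```python
def jaccard_index(pred, truth):
    """IoU of two binary masks given as equal-shaped nested lists of 0/1.

    Two empty masks are scored 1.0 by convention (an assumption of this sketch).
    """
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return 1.0 if union == 0 else inter / union

def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) mask pairs."""
    return sum(jaccard_index(p, t) for p, t in pairs) / len(pairs)

def window_starts(length, window, stride):
    """Start offsets of windows covering [0, length) along one axis.

    Choosing stride < window makes neighbouring windows overlap; a final
    window is added if needed so the far edge is always covered.
    """
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts
```

Applying `window_starts` to both axes gives the grid of tiles; how predictions in the overlapping strips are merged (averaging, voting, or something else) is left open here because the abstract does not specify it.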

2511.05017 2026-06-15 cs.CV cs.CL Updated version

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Sarvesh Baskar, Vijay Kamarshi, Andrea Fanelli, Furong Huang

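Affiliations: University of Maryland; Dolby Laboratories; Capital One

AI summary: Targets hallucinations that arise when large vision-language models over-rely on textual priors and neglect visual cues. Proposes a simple, effective visual feature incorporation method that learns visually-informed textual embeddings to balance the attention distribution, significantly reducing hallucinations and improving multimodal reasoning.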

Comments Accepted at The 64th Annual Meeting of the Association for Computational Linguistics

Abstract

Hallucinations in Large Vision-Language Models (LVLMs) remain a persistent challenge, often stemming from inadequate integration of visual information during multimodal reasoning. A key cause is the model's over-reliance on textual priors and underutilization of visual cues, leading to outputs that are linguistically fluent but visually inaccurate. For example, given an image of an empty kitchen countertop, an LVLM might hallucinate a "bowl of fruit" or "cup of coffee", relying on language associations rather than visual evidence. Most LVLMs incorporate visual features by appending them to the input stream of a pre-trained LLM and training on large-scale vision-language datasets. Our systematic analysis reveals that this strategy often leads to over-dependence on textual information due to the inherent bias of LLMs towards language-dominant representations. This imbalance skews attention towards the text over visual content, weakening the model's ability to ground outputs in visual inputs. To address this, we propose a simple yet effective visual feature incorporation method that encourages the model to learn visually-informed textual embeddings distinct from those of the base LLM and promotes a more balanced attention distribution. Experimental results across multiple hallucination benchmarks demonstrate that our method significantly reduces hallucinations and fosters more balanced multimodal reasoning. Notably, our approach achieves substantial gains, including +9.33% on MMVP-MLLM, +2.99% on POPE-AOKVQA, up to +3.4% on Merlin, and +3% on the hard-data split of HallusionBench.

2603.05230 2026-06-15 cs.CV cs.RO Updated version

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

Serkan Ergun, Tobias Mitterer, Hubert Zangl

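Affiliations: Institute of Smart Systems Technologies; University of Klagenfurt; AAU SAL USE Laboratory; Silicon Austria Labs

AI summary: Proposes a digital twin driven robotic sorting system that combines grasp prediction, multimodal perception, and vision-language models for textile classification and foreign object detection. The Qwen model family reaches up to 87.9% accuracy.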

Comments 10 pages, single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria)

Abstract

The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.

2605.31604 2026-06-15 cs.CV Updated version

Representation Forcing for Bottleneck-Free Unified Multimodal Models

Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu

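Affiliations: University of Hong Kong; ByteDance Seed; The Chinese University of Hong Kong; Nanjing University; Tsinghua University

AI summary: Proposes Representation Forcing (RF), in which the decoder autoregressively predicts visual representations as intermediate tokens that then guide pixel diffusion within the same backbone. This removes the dependence of unified multimodal models on a pretrained VAE and enables bottleneck-free, end-to-end models.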

Comments Project page: https://yuqingwang1029.github.io/RepresentationForcing

Abstract

Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.

2504.20734 2026-06-15 cs.CL cs.AI cs.CV cs.IR cs.LG Updated version

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang

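Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

AI summary: Proposes UniversalRAG, a retrieval-augmented generation framework that handles multiple modalities and granularities. A dynamic routing mechanism and multi-granularity organization improve cross-modal knowledge retrieval, and experiments show its advantage on benchmarks spanning multiple modalities.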

Comments ACL 2026. Project page: https://universalrag.github.io

Abstract

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, an any-to-any RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 10 benchmarks of multiple modalities, showing its superiority over various modality-specific and unified baselines.
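The modality-aware routing described above first picks a modality-specific corpus and only then retrieves within it, instead of searching one aggregated embedding space. The toy sketch below shows that two-stage control flow and nothing more. The keyword table, the token-overlap ranking, and all names are hypothetical stand-ins; the abstract does not describe UniversalRAG's actual router or retrievers.

```python
# Hypothetical keyword table; a real router would be learned or prompted.
ROUTING_KEYWORDS = {
    "image": {"look", "picture", "photo", "color"},
    "video": {"clip", "scene", "video", "happens"},
}

def route(query):
    """Pick the corpus to search; fall back to text when no keyword matches."""
    tokens = set(query.lower().split())
    for modality, keywords in ROUTING_KEYWORDS.items():
        if tokens & keywords:
            return modality
    return "text"

def retrieve(query, corpora, top_k=1):
    """Route the query, then rank only the chosen corpus by token overlap."""
    modality = route(query)
    tokens = set(query.lower().split())
    ranked = sorted(
        corpora[modality],
        key=lambda doc: len(tokens & set(doc.lower().split())),
        reverse=True,
    )
    return modality, ranked[:top_k]
```

Because retrieval never compares items across modalities, the modality gap the abstract describes (retrieval favoring items from the query's own modality) does not arise at ranking time; the cost is that a routing mistake cannot be recovered downstream. Granularity levels could be handled the same way by routing to keys such as a paragraph-level or clip-level corpus, which is again an assumption of this sketch.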

2. Embodied Intelligence, Robotics, and Autonomous Driving (11 papers)

2606.14010 2026-06-15 cs.CV cs.LG cs.RO New submission

RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

Xiangyu Huang, Zhenlin Hua, Han Zhou, Shounak Sural, Ragunathan Rajkumar

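Affiliations: Carnegie Mellon University

AI summary: Proposes RT-VLA, which compresses the capabilities of the SimLingo model into a lightweight student through multi-level supervised distillation. It cuts inference time by 44.8x in vision-only mode and 7.9x in vision+language mode while keeping competitive performance, enabling real-time, explainable VLA-based autonomous driving.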

Abstract

Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.
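The abstract names "multi-level supervised distillation" but does not list the levels or the loss. For readers unfamiliar with the general technique, the sketch below shows a generic two-level version: a temperature-softened KL term on output logits plus a mean-squared error on intermediate features. This is standard knowledge distillation, not RT-VLA's loss; the weighting `alpha`, the temperature, and the choice of levels are assumptions.

```python
from math import exp, log

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def feature_loss(student_feat, teacher_feat):
    """Mean squared error between same-sized feature vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)

def two_level_loss(student_logits, teacher_logits, student_feat, teacher_feat, alpha=0.5):
    """Weighted sum of the output-level and feature-level terms."""
    return alpha * kd_loss(student_logits, teacher_logits) + (1 - alpha) * feature_loss(student_feat, teacher_feat)
```

In practice the student's features usually need a projection layer to match the teacher's width before the feature term can be computed; that step is omitted here.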

2606.14048 2026-06-15 cs.CV cs.RO New submission

WAM4D: Fast 4D World Action Model via Spatial Register Tokens

Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang

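Affiliations: Peking University; The Hong Kong University of Science and Technology; Beijing Innovation Center of Humanoid Robotics

AI summary: Proposes WAM4D, which uses lightweight spatial register tokens to transfer pretrained geometric priors into a causal video-action transformer for efficient 4D world action modeling. It improves spatial consistency on RoboTwin 2.0 and real-world manipulation tasks while keeping inference fast.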

Comments 15 pages, 7 figures, 9 tables

Abstract

World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.

2606.13700 2026-06-15 eess.SP cs.CV Cross-list

C-MambaPose: A Physics-Informed Complex Mamba Framework for Cross-Environment WiFi Human Pose Estimation

Phuc Nguyen H

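Affiliations: VinUniversity

AI summary: Proposes C-MambaPose, a physics-informed complex-valued Mamba-GraFormer hybrid framework that combines a phase-preserving representation with a Spatiotemporal Complex Mamba encoder using a dynamic selective receptive field, for cross-environment WiFi-based 3D human pose estimation. It sets a new state of the art on the cross-environment split of the MM-Fi dataset with 3.78M parameters.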

Abstract

Human pose estimation (HPE) utilizing wireless WiFi signals has emerged as a promising technology owing to its device-free nature, privacy preservation, and robustness against occlusion and poor lighting. However, existing methods often overlook the physical complex phase information of WiFi signals and fail to generalize across diverse environments due to severe domain shifts. In this paper, we present C-MambaPose, a physics-informed complex-valued Mamba-GraFormer hybrid framework for robust cross-environment WiFi-based 3D HPE. Our framework first sanitizes raw WiFi Channel State Information (CSI) phase errors and constructs a phase-preserving complex-valued representation. We then employ a Spatiotemporal Complex Mamba encoder with a dynamic selective receptive field to capture fine-grained phase dynamics. A cross-attention joint-query mapper maps the unstructured sequence tokens to human joints, which are decoded by a Graph Convolutional Network (GCN) to predict anatomically coherent 3D coordinates. Extensive evaluations on the MM-Fi dataset show that C-MambaPose achieves competitive or superior performance to state-of-the-art baselines across all settings, setting a new state of the art specifically on the challenging cross-environment split, requiring only 3.78M parameters, an 83.1% reduction compared to GraphPose-Fi and an 85.7% reduction compared to MetaFi++, while maintaining a comparable size to DT-Pose (which is only 18% smaller) but achieving significantly superior performance without requiring any pretraining. Our code is publicly available at https://github.com/phucngvinuni/cmampose.git.
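The first stage described above, sanitizing CSI phase and building a phase-preserving complex-valued representation, can be illustrated with a small example. The abstract does not say which sanitization is used; the sketch below assumes linear detrending across subcarriers, a common CSI phase calibration, and assumes the input phases are already unwrapped. Function names are hypothetical and unrelated to the released code.

```python
import cmath

def detrend_phase(phases):
    """Remove the least-squares linear trend across subcarrier indices.

    phases: unwrapped phase (radians) per subcarrier. A linear trend is the
    typical signature of timing and frequency offsets, so the residual keeps
    the environment-dependent part of the phase.
    """
    n = len(phases)
    idx_mean = (n - 1) / 2
    phase_mean = sum(phases) / n
    var = sum((i - idx_mean) ** 2 for i in range(n))
    if var == 0:
        slope = 0.0
    else:
        slope = sum((i - idx_mean) * (p - phase_mean) for i, p in enumerate(phases)) / var
    return [p - (slope * (i - idx_mean) + phase_mean) for i, p in enumerate(phases)]

def to_complex(amplitudes, phases):
    """Combine amplitude and phase into complex CSI values a * exp(j * phi)."""
    return [cmath.rect(a, p) for a, p in zip(amplitudes, phases)]
```

Keeping the result as complex numbers, rather than discarding phase or stacking amplitude and phase as unrelated real channels, is what lets a complex-valued encoder operate on magnitude and phase jointly.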

2606.13840 2026-06-15 cs.RO cs.CV 交叉投稿

Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models

多智能体具身自动驾驶:从V2X信息交换到共享世界模型

Senkang Hu, Zhengru Fang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang

发表机构 * Lingnan University, Hong Kong(岭南大学(香港))

AI总结 本文综述了从单车智能向多智能体具身系统转变的自动驾驶技术,通过共享世界模型实现感知共享、意图推断和协同规划,并指出了在仿真评估、实时安全保证等方面的研究空白。

详情
AI中文摘要

自动驾驶正从孤立的车辆智能转向多智能体具身系统,这些系统共享感知、推断意图并在不确定性下协调行动。本综述通过共享世界模型(SWMs)的视角审视这一转变:SWMs是跨车辆、基础设施和其他交通参与者维护的预测性跨智能体表征。我们回顾了超过380篇文献,涵盖车联万物(V2X)通信、协同感知、智能体间认知、协同规划、端到端协同驾驶以及用于闭环验证的仿真和数据引擎。核心问题是交换的观测如何成为对齐的状态、意图感知的交互和协调的下游行动。在所调查的文献中,评估仍然集中在仿真、精心设计的基准测试和离线协议上。基于基础模型的协调也缺乏在开放交通中经过验证的实时安全保证。这些空白为多智能体具身自动驾驶(MAEAD)提出了关键研究重点:可验证的共享状态维护、鲁棒的意图和计划对齐,以及在通信、延迟和部署约束下的安全协调行动。

英文摘要

Autonomous driving is shifting from isolated vehicle intelligence toward multi-agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross-agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle-to-everything (V2X) communication, collaborative perception, inter-agent cognition, cooperative planning, end-to-end cooperative driving, and simulation and data engines for closed-loop validation. The organizing question is how exchanged observations become aligned state, intent-aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation-model-based coordination also lacks verified real-time safety guarantees in open traffic. These gaps motivate key research priorities for multi-agent embodied autonomous driving (MAEAD): verifiable shared-state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.

2606.13886 2026-06-15 cs.RO cs.CV cs.LG 交叉投稿

PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation

PhysVLA:面向物理基础的VLA用于具身机器人操作

Namai Chandra, Shriram Damodaran, Lin Wang

发表机构 * IIT Madras(印度理工学院马德拉斯分校) Nanyang Technological University(南洋理工大学)

AI总结 提出PhysVLA,一种即插即用的推理时框架,通过阶段感知有限状态机和选择性欧拉-拉格朗日门,在不重新训练的情况下为任何冻结的VLA骨干注入物理约束,提升成功率、稳定性和轨迹效率。

Comments 9 pages, 5 figures, supplementary material included

详情
AI中文摘要

视觉-语言-动作(VLA)模型擅长将视觉输入和自然语言指令直接映射到机器人控制策略。然而,由于它们主要通过拟合行为演示数据进行训练,并未显式强制执行刚体动力学或接触约束等基本物理原理。这暴露了一个关键的物理差距:在单步或分块VLA之上应用的标准时间平滑以新增失败为代价换取轨迹质量,而这些失败是短期记忆无法解决的。为弥补这一差距,我们提出PhysVLA(Physics-VLA),一种即插即用的推理时框架,可包装任何冻结的VLA骨干,无需重新训练、微调或访问权重,每个控制步骤的开销小于1毫秒。PhysVLA拦截预测的控制动作,仅读取模拟器或系统状态,并应用双层校正:(i)一个阶段感知的有限状态机,用于组织离散任务段(接近、抓取、运输和放置),以及(ii)一个选择性欧拉-拉格朗日门,仅在动力学预言器检测到运动动力学(kinodynamic)不一致时激活。在LIBERO-Spatial上使用7自由度Franka Panda对OpenVLA、OpenVLA-OFT、Force-VLA和Generalist-VLA进行评估,该框架实现了高达17%的绝对成功率提升和高达19%的稳定性提升,且没有任何单项任务出现退化,在所有四个骨干上轨迹效率提升高达15%,并在Robosuite Lift跨模拟器扫描中显示出高达10倍的轨迹急动度鲁棒性提升。我们还在真实的Agilex Piper机械臂上通过拾取和放置任务进一步验证了该框架,确认PhysVLA无需重新训练即可迁移到物理硬件,成功率提升高达50%,将物理感知确立为一种可组合、骨干无关的运行时模块。

英文摘要

Vision-Language-Action (VLA) models excel at mapping visual inputs and natural language instructions directly to robotic control policies. However, because they are trained primarily to fit behavioural demonstration data, they do not explicitly enforce fundamental physical principles such as rigid-body dynamics or contact constraints. This exposes a critical physics gap: standard temporal smoothing applied on top of single-step or chunked VLAs trades trajectory quality for added failures that short-term memory cannot resolve. To bridge this gap, we introduce PhysVLA (Physics-VLA), a plug-and-play, inference-time framework designed to wrap any frozen VLA backbone without retraining, fine-tuning, or weight access, with less than 1 ms of overhead per control step. PhysVLA intercepts the predicted control action, captures only the simulator or system state, and applies a dual-layered correction: (i) a phase-aware finite-state machine that structures discrete task segments (approach, grasp, transport, and place), and (ii) a selective Euler-Lagrange gate that activates only when a dynamics oracle detects kinodynamic inconsistency. Evaluated across OpenVLA, OpenVLA-OFT, Force-VLA, and Generalist-VLA on LIBERO-Spatial with a 7-DoF Franka Panda, the framework delivers absolute success rate increases of up to 17% and stability increases of up to 19% with no per-task regressions, improves trajectory efficiency by up to 15% across all four backbones, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. We further validate the framework on a real Agilex Piper arm with a pick-and-place task, confirming that PhysVLA transfers to physical hardware without retraining, with success-rate improvements of up to 50%, establishing physical awareness as a composable, backbone-agnostic runtime module.
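The phase-aware finite-state machine mentioned in the abstract can be pictured with a minimal sketch. The four phase names come from the abstract; everything else here (the `next_phase` function, the distance and gripper signals, the `tol` threshold) is a hypothetical illustration, not the authors' transition logic.

```python
from enum import Enum


class Phase(Enum):
    # The four task segments named in the abstract.
    APPROACH = "approach"
    GRASP = "grasp"
    TRANSPORT = "transport"
    PLACE = "place"


def next_phase(phase, dist_to_object, gripper_closed, dist_to_goal, tol=0.02):
    """Advance the task phase from simple state signals (all hypothetical)."""
    if phase is Phase.APPROACH and dist_to_object < tol:
        return Phase.GRASP
    if phase is Phase.GRASP and gripper_closed:
        return Phase.TRANSPORT
    if phase is Phase.TRANSPORT and dist_to_goal < tol:
        return Phase.PLACE
    return phase  # no transition condition met; PLACE is terminal in this sketch


# Replay a short, made-up sequence of (dist_to_object, gripper_closed, dist_to_goal).
observations = [
    (0.30, False, 0.50),
    (0.01, False, 0.50),
    (0.00, True, 0.50),
    (0.00, True, 0.01),
]
phase = Phase.APPROACH
trace = []
for dist_obj, closed, dist_goal in observations:
    phase = next_phase(phase, dist_obj, closed, dist_goal)
    trace.append(phase)
```

A wrapper of this kind only reads state and never touches model weights, which is consistent with the abstract's claim that the framework needs no weight access.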

2604.16522 2026-06-15 cs.CV 版本更新

Efficient Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation

高效在线3D多相机多目标跟踪与姿态估计

Linh Van Ma, Tran Thien Dat Nguyen, Juhua Hu, Wei Cheng, Moongu Jeon

发表机构 * School of Electrical Engineering and Computer Science(电子工程与计算机科学学院) School of Electrical Engineering, Computing and Mathematical Sciences(电子工程、计算与数学科学学院) School of Engineering and Technology(工程与技术学院)

AI总结 提出一种基于多单目相机的快速在线3D多目标跟踪与姿态估计方法,仅需2D检测,无需3D训练数据,通过贝叶斯最优滤波实现高效准确跟踪,并支持相机断连场景。

详情
AI中文摘要

本文提出了一种快速在线方法,用于联合执行使用多个单目相机的3D多目标跟踪和姿态估计。我们的算法仅需要2D边界框和姿态检测,消除了对昂贵的3D训练数据或计算成本高昂的深度学习模型的需求。我们的解决方案是贝叶斯最优多目标跟踪滤波器的高效实现,在保持准确性的同时提高了计算效率。我们证明了我们的算法在仅使用公开可用的预训练2D检测模型的情况下,比最先进的方法显著更快,且不牺牲准确性。我们还展示了我们的算法在多个相机在运行过程中间歇性断开或重新连接的场景中的鲁棒性能。

英文摘要

This paper proposes a fast and online method for jointly performing 3D multi-object tracking and pose estimation using multiple monocular cameras. Our algorithm requires only 2D bounding box and pose detections, eliminating the need for costly 3D training data or computationally expensive deep learning models. Our solution is an efficient implementation of a Bayes-optimal multi-object tracking filter, enhancing computational efficiency while maintaining accuracy. We demonstrate that our algorithm is significantly faster than state-of-the-art methods without compromising accuracy, using only publicly available pre-trained 2D detection models. We also illustrate the robust performance of our algorithm in scenarios where multiple cameras are intermittently disconnected or reconnected during operation.

2606.05774 2026-06-15 cs.CV 版本更新

LiAuto-GeoX: Efficient Grounded Driving Transformer

LiAuto-GeoX: 高效接地驾驶Transformer

Jiawei Lian, Haoyi Sun, Yang Wu, Lifu Mu, Siyuan Wang, Le Hui, Ning Mao, Tao Wei, Pan Zhou, Kun Zhan, Jian Yang

发表机构 * Nanjing University of Science and Technology(南京理工大学) Li Auto Inc.(理想汽车) Northwestern Polytechnical University(西北工业大学) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学院)

AI总结 提出LiAuto-GeoX,通过稀疏激光雷达先验和几何保持蒸馏框架,实现高效、实时的自车中心密集3D重建,并显著提升下游自动驾驶任务性能。

详情
AI中文摘要

密集3D重建在空间理解方面展现出巨大潜力,但其作为自动驾驶实时车载表示的可行性仍是一个开放挑战。现有大规模视觉几何模型通常需要大量计算资源,且缺乏动态驾驶环境所需的远距离几何保真度、环视一致性和实时效率。为弥补这一差距,我们提出LiAuto-GeoX,一种为可部署的自车中心3D场景理解设计的高效接地驾驶Transformer。我们的方法首先从大规模环视数据中学习高容量驾驶几何模型,利用稀疏激光雷达先验在远处、模糊或结构稀疏区域提供稳健的几何接地。然后,通过一种新颖的几何保持蒸馏框架,将这一能力实例化为高度紧凑的1.55亿参数车载模型。该框架采用掩码引导的深度感知蒸馏,通过强调几何信息丰富的区域来保留细粒度度量结构,以及相对姿态关系蒸馏,通过姿态诱导的几何关系强制跨视图空间一致性。大量评估表明,LiAuto-GeoX在KITTI上以220 FPS运行,同时保持高保真密集重建,实现实时部署。学习到的几何结构无缝迁移到下游自主任务,在轨迹预测中达到90.6 PDMS,在占用预测中达到24.63 mIoU,在未来帧预测中达到47.67 IoU。这些结果表明,高效的密集3D重建可以超越其作为感知目标的传统角色,作为下一代自动驾驶的可扩展基础几何表示。

英文摘要

Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, utilizing sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then instantiate this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation to retain fine-grained metric structures by emphasizing geometrically informative regions, and relative-pose relational distillation to enforce cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations reveal that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers seamlessly to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. These all demonstrate that efficient dense 3D reconstruction can transcend its traditional role as a perception target to serve as a scalable, foundational geometric representation for next-generation autonomous driving.

2503.14331 2026-06-15 cs.RO cs.CV cs.SY eess.SY 版本更新

ADAPT: An Autonomous Forklift for Construction Site Operation

ADAPT:一种用于建筑工地作业的自主叉车

Johannes Huemer, Markus Murschitz, Matthias Schörghuber, Lukas Reisinger, Thomas Kadiofsky, Christoph Weidinger, Mario Niedermeyer, Benedikt Widy, Marcel Zeilinger, Csaba Beleznai, Tobias Glück, Andreas Kugi, Patrik Zips

发表机构 * Center for Vision, Automation and Control(视觉、自动化与控制中心) AIT Austrian Institute of Technology GmbH(奥地利技术研究所) Automation and Control Institute(自动化与控制研究所) Technische Universität Wien(维也纳技术大学)

AI总结 提出ADAPT自主叉车,结合AI感知与经典方法,在非结构化建筑工地实现近人类水平的物流操作,提升安全与效率。

详情
AI中文摘要

高效的物料物流在控制建筑行业的成本和进度中起着关键作用。然而,人工物料搬运仍然容易出现效率低下、延误和安全风险。自主叉车提供了一种有前景的解决方案,以简化现场物流,减少对人类操作员的依赖并缓解劳动力短缺。本文介绍了ADAPT(自主动态全地形托盘运输车)的开发与评估,这是一种专为建筑环境设计的全自主越野叉车。与结构化的仓库环境不同,建筑工地面临重大挑战,包括动态障碍物、非结构化地形和多变的天气条件。为应对这些挑战,我们的系统将AI驱动的感知技术与传统的决策、规划和控制方法相结合,实现了在复杂环境中的可靠操作。我们通过广泛的真实世界测试验证了该系统,并在各种天气条件下将其连续性能与经验丰富的人类操作员进行了比较。我们的研究结果表明,自主户外叉车可以达到接近人类水平的性能,为更安全、更高效的建筑物流提供了一条可行路径。

英文摘要

Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of ADAPT (Autonomous Dynamic All-terrain Pallet Transporter), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its continuous performance against an experienced human operator across various weather conditions. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.

2512.21201 2026-06-15 cs.RO cs.AI cs.CV 版本更新

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

薛定谔的导航者:为零样本目标导航设想未来轨迹集合

Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学) Shanghai University of International Business and Economics(上海对外经贸大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出一种信念感知框架,在推理时通过轨迹条件化的3D世界模型设想多个未来场景,结合自适应遮挡物感知采样和未来感知价值图,提升零样本目标导航在遮挡严重环境中的隐蔽目标发现和风险感知路径选择。

详情
AI中文摘要

零样本目标导航(ZSON)要求机器人在未见环境中找到目标物体,无需任务特定的微调或预建地图,这是通用服务机器人的关键能力。然而,在模拟中表现良好的方法在杂乱的真实世界场景中往往会退化,这些场景存在严重遮挡和潜在危险,大面积的未观察区域使得单场景推理脆弱且不安全。我们提出薛定谔的导航者,一个信念感知框架,在推理时对多个轨迹条件化的设想3D未来进行推理。给定候选路径,轨迹条件化的3D世界模型预测假设的观察结果,并保持多个合理场景实现的叠加,而不是承诺于单一地图。自适应遮挡物感知采样器将想象引导至不确定性关键区域,而未来感知价值图(FAVM)聚合设想的未来,以实现鲁棒、主动的动作选择。在模拟和物理Go2四足机器人上的实验表明,薛定谔的导航者优于强ZSON基线,在遮挡严重的导航场景中提高了隐蔽目标发现和风险感知路径点选择。这些结果突显了设想3D未来作为在不确定真实世界环境中进行零样本导航的可扩展和通用策略。

英文摘要

Zero-shot object navigation (ZSON) requires robots to find target objects in unseen environments without task-specific fine-tuning or pre-built maps, a key capability for general-purpose service robots. Yet methods that perform well in simulation often degrade in cluttered real-world scenes with severe occlusion and latent hazards, where large unseen regions make single-scene inference brittle and unsafe. We propose Schrödinger's Navigator, a belief-aware framework that reasons at inference time over multiple trajectory-conditioned imagined 3D futures. Given candidate paths, a trajectory-conditioned 3D world model predicts hypothetical observations and maintains a superposition of plausible scene realizations rather than committing to one map. An adaptive occluder-aware sampler directs imagination to uncertainty-critical regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures for robust, proactive action selection. Experiments in simulation and on a physical Go2 quadruped show that Schrödinger's Navigator outperforms strong ZSON baselines, improving hidden-target discovery and risk-aware waypoint selection in occlusion-heavy navigation scenarios. These results highlight imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.

2606.12728 2026-06-15 cs.RO cs.CV cs.LG 版本更新

EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

EquiDexFlow: 基于接触的SE(3)-等变灵巧抓取生成流

Clinton Enwerem, John S. Baras, Calin Belta

发表机构 * Institute for Systems Research, University of Maryland, College Park(马里兰大学帕克分校系统研究所)

AI总结 提出EquiDexFlow,一种SE(3)-等变流匹配模型,联合预测腕部姿态、关节角度、指尖接触、表面法线和接触力,通过将接触投影到物体表面并将力约束在库仑摩擦锥内,确保物理稳定抓取,在16自由度Allegro手上实现零摩擦违规和最佳综合分数。

Comments 22 pages, 11 figures, 11 tables. Project page with videos, code, and checkpoints: https://equidexflow.github.io

详情
AI中文摘要

大多数学习型灵巧抓取生成器将接触力降级为下游验证步骤,因此运动学上可行的姿态仍可能违反稳定物理抓取的条件。我们通过EquiDexFlow解决这一问题,这是一种SE(3)-等变流匹配模型,从物体点云联合预测腕部姿态、关节角度、指尖接触、表面法线和接触力。我们的架构通过构造将接触投影到物体表面并将力约束在库仑摩擦锥内,因此无需损失惩罚即可满足放置和摩擦合规性。我们证明了端到端SE(3)等变性,并在200次旋转上经验验证,腕部残差低于$0.04^\circ$且关节偏差严格为零。该模型在81个物体的8,100个力闭合抓取上训练,适用于16自由度Allegro手,在所有消融变体中实现了零摩擦违规、最佳综合分数和最低力旋量(wrench)残差。我们通过每指逆运动学将解码的指尖接触重定向到16自由度LEAP手,我们的硬件可行优化使每个关节位于其执行器包络内侧至少5%处,同时保持力旋量平衡。在物理机器人上,重定向后的EquiDexFlow解码抓取在所有六个测试物体上完成了开环拾取和保持试验,每个非对称物体在标准姿态和$120^\circ$共旋转下均成功。视频、代码和检查点可在 https://equidexflow.github.io 获取。

英文摘要

Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
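To make the friction-cone constraint concrete: a contact force satisfies Coulomb friction when its tangential magnitude is at most `mu` times its normal component. The sketch below shows the standard Euclidean projection onto that cone. It is a generic illustration only; the abstract does not say which projection the model uses, and the function name, the normal-direction convention, and the value of `mu` are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_to_friction_cone(force, normal, mu):
    """Euclidean projection of `force` onto the cone ||f_t|| <= mu * f_n.

    `normal` is a unit vector along which a valid pushing force acts
    (this sign convention is an assumption); `mu` > 0 is the friction coefficient.
    """
    f_n = dot(force, normal)                            # normal component (scalar)
    f_t = [f - f_n * n for f, n in zip(force, normal)]  # tangential component
    t_norm = math.sqrt(dot(f_t, f_t))

    if t_norm <= mu * f_n:       # already inside the cone: leave unchanged
        return list(force)
    if mu * t_norm <= -f_n:      # inside the polar cone: closest point is the apex
        return [0.0, 0.0, 0.0]

    # Otherwise the closest point lies on the cone boundary.
    new_n = (mu * t_norm + f_n) / (1.0 + mu * mu)
    scale = mu * new_n / t_norm
    return [new_n * n + scale * t for n, t in zip(normal, f_t)]


normal = [0.0, 0.0, 1.0]
projected = project_to_friction_cone([1.0, 0.0, 1.0], normal, mu=0.5)
```

With `mu = 0.5`, the force `[1, 0, 1]` slips (tangential 1 exceeds 0.5 times normal 1); the projection lands on the cone boundary, where the tangential magnitude equals `mu` times the normal component.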

2606.12910 2026-06-15 cs.RO cs.AI cs.CV cs.SY eess.SY 版本更新

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

边界框作为目标:通过神经符号规划实现语言条件抓取

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GRASP框架,利用预训练VLM将自然语言查询转化为神经符号目标状态,通过边界框检测实现零样本桌面操作,无需任务特定训练。

Comments Project website: https://allisonandreyev.github.io/grasp.github.io/

详情
AI中文摘要

为了将机器人有效集成到家庭或工业环境中,机器必须实时适应自然语言提示。尽管视觉-语言模型(VLM)已在机器人任务与运动规划(TAMP)中实现零样本泛化,但当前最先进的方法通常计算量“沉重”或需要在数千个演示上进行大量训练。我们提出GRASP(基础推理与符号规划)框架,作为向开放词汇桌面操作迈进的一步。我们的方法利用预训练VLM将自然语言查询转化为神经符号目标状态,通过边界框检测管道在物理世界中接地。与依赖固定颜色列表或硬编码坐标的方法不同,GRASP使机器人能够解释诸如“顶层架子”之类的抽象空间概念,并在无需额外微调的情况下执行任务。我们在三个难度级别的90次真实机器人试验中实现了73.3%的总体成功率,无需任务特定训练。

英文摘要

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

3. 图像识别、检索与分类 5 篇

2606.14555 2026-06-15 cs.CV cs.AI 新提交

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

重新思考全局平均池化:你的分类器实际上是一个多实例学习器

Aray Karjauv

发表机构 * Aray Karjauv(阿瑞·卡贾乌)

AI总结 本文揭示标准图像分类器中的全局平均池化结构天然具有多实例学习解释,使得单标签训练的分类器能学习多目标场景,并提出后验诊断方法提取空间类别证据。

详情
AI中文摘要

现代图像分类器广泛采用全局平均池化(GAP)后接线性分类头。这种线性结构确保图像级logits等于将分类头逐点应用于GAP之前的特征网格所获得的logits的平均值。因此,标准分类器可能固有地保留空间类别证据,即使在图像级预测错误时这些证据仍可恢复。这种结构自然暗示了多实例学习(MIL)解释,其中图像被视为空间实例的包。在此框架下,我们证明使用每张图像单个标签训练的标准分类器仍然可以在多目标场景中学习预期的分类任务。我们进一步利用这一特性将图像级logits分解为预测网格,提供一种事后诊断方法来提取GAP原本掩盖的空间类别证据。我们的系统评估表明,现成模型始终能在前景区域内恢复真实类别。MIL解释进一步表明,常见的分类器失败反映了均值聚合的已知局限性。

英文摘要

Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.
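The linearity claim is easy to check numerically. The sketch below uses random numbers in place of a real backbone; the shapes and helper names are arbitrary choices for illustration. It compares the usual pool-then-classify path against classifying every grid cell and averaging the resulting logits.

```python
import random


def linear_head(feature, weights, bias):
    """Apply a linear classification head to one C-dimensional feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


random.seed(0)
C, K, H, W = 8, 3, 4, 4  # channels, classes, grid height, grid width (arbitrary)
grid = [[random.gauss(0, 1) for _ in range(C)] for _ in range(H * W)]
weights = [[random.gauss(0, 1) for _ in range(C)] for _ in range(K)]
bias = [random.gauss(0, 1) for _ in range(K)]

# Standard path: global average pooling first, then the linear head.
image_logits = linear_head(mean_vectors(grid), weights, bias)

# Pointwise path: apply the head to every grid cell, then average the logits.
prediction_grid = [linear_head(cell, weights, bias) for cell in grid]
mean_of_grid = mean_vectors(prediction_grid)

# The two paths agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(image_logits, mean_of_grid))
```

`prediction_grid` is the per-location decomposition the abstract refers to: each cell carries its own class logits, and the image-level logits are just their mean.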

2606.14686 2026-06-15 cs.CV cs.AI 新提交

CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

CottonLeafVision:一种可解释且鲁棒的棉花叶部病害分类深度学习框架

Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan, Sudeepta Mandal

发表机构 * Dept. of CSE(计算机科学与工程系) East West University(东西大学) Dhaka, Bangladesh(达卡,孟加拉国)

AI总结 提出CottonLeafVision框架,使用DenseNet201在棉花叶部病害数据集上达到98%分类准确率,并集成Grad-CAM、遮挡敏感性和对抗训练增强可解释性与鲁棒性。

Comments This paper contains 11 figures and 4 tables. It was presented at the 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

详情
AI中文摘要

全球范围内,棉花是一种高度经济价值的作物,因为纺织工业严重依赖它。因此,精确识别和检测棉花叶部病害对经济稳定至关重要。“CottonLeafVision”的开发目标是准确分类和检测棉花叶部病害。为此,我们在公开的棉花叶部病害图像数据集上评估了多个预训练的深度卷积神经网络,包括DenseNet201、InceptionV3和VGG19。该图像数据集包含七个类别,六个病害类别和一个健康类别,是在反映现实挑战的各种田间条件下收集的。在这些预训练模型中,使用DenseNet201,我们实现了98%的最高分类准确率。为了增强模型的可靠性和可解释性,我们实施了不同的技术和方法,如梯度加权类激活映射(Grad-CAM)、遮挡敏感性分析和对抗训练,以提高模型的抗噪声能力。最后,我们开发了一个原型,以便在现实农业中利用模型的能力。本文展示了深度学习模型在现实棉花病害管理情况下分类病害的能力。

英文摘要

Globally, cotton is a highly economically beneficial crop, as the textile industry heavily depends on it. So, the precise identification and detection of cotton leaf disease is crucial for economic stability. The development goal of "CottonLeafVision" is to accurately classify and detect cotton leaf disease. With this goal, we have evaluated multiple pretrained Deep Convolutional Neural Networks, including DenseNet201, InceptionV3, and VGG19 on a publicly available cotton leaf disease image dataset. This image dataset includes seven classes, six disease classes, and one healthy class, collected under various field conditions reflecting real-world challenges. Among these pretrained models, with DenseNet201, we have achieved the highest classification accuracy of 98%. To enhance the model reliability and interpretability, we have implemented different techniques and methods such as Gradient-weighted Class Activation Mapping (Grad-CAM), occlusion sensitivity analysis and adversarial training to increase the noise resistance of the model. Finally, we have developed a prototype in order to utilize the model's capabilities on real life agriculture. This paper shows the deep learning model's capabilities to classify the disease in real-life cotton disease management situations.

2603.09377 2026-06-15 cs.CV 版本更新

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

SinGeo: 解锁单一模型在鲁棒跨视角地理定位中的潜力

Yang Chen, Xieyuanli Chen, Junxiang Li, Jie Tang, Tao Wu

发表机构 * College of Intelligence Science and Technology, National University of Defense Technology(智能科学与技术学院,国防科技大学)

AI总结 提出SinGeo框架,通过双判别学习架构和课程学习策略,使单一模型在未知视场角和方向下实现鲁棒跨视角地理定位,在四个基准数据集上达到最先进性能。

Comments v2

详情
AI中文摘要

尽管近期取得了进展,鲁棒的跨视角地理定位(CVGL)仍然具有挑战性。现有方法仍依赖于视场角(FoV)特定的训练范式,其中模型在固定FoV下优化,但在未见过的FoV和未知方向上测试时性能崩溃。这种局限性需要部署多个模型来覆盖各种变化。尽管有研究通过简单随机化FoV探索了动态FoV训练,但它们未能实现跨不同条件的鲁棒性——隐含地假设所有FoV难度相同。为解决这一差距,我们提出了SinGeo,一个简单而强大的框架,使单一模型能够实现鲁棒的跨视角地理定位,无需额外模块或显式变换。SinGeo采用双判别学习架构,增强地面和卫星分支内的视角内判别性,并且是第一个引入课程学习策略以实现鲁棒CVGL的方法。在四个基准数据集上的广泛评估表明,SinGeo在多种条件下取得了最先进(SOTA)结果,并且显著优于专门为极端FoV训练的方法。除了卓越的性能,SinGeo还展示了跨架构的可迁移性。此外,我们提出了一种一致性评估方法,以定量评估模型在不同视角下的稳定性,为理解和推进未来CVGL研究的鲁棒性提供了可解释的视角。代码将在接收后公开。

英文摘要

Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions -- implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Codes will be available upon acceptance.

2605.08270 2026-06-15 cs.CV cs.AI 版本更新

SAFformer: Improving Spiking Transformer via Active Predictive Filtering

SAFformer:通过主动预测滤波改进脉冲Transformer

Zequan Xie, Weiming Zeng, Yunhua Chen, Sichang Ling, Tongyang Chen, Jinsheng Xiao

发表机构 * School of Computer Science and Technology, Guangdong University of Technology(广东工业大学计算机科学与技术学院) Faculty of Science, Hong Kong Baptist University(香港浸会大学理学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院)

AI总结 提出基于主动预测滤波的脉冲Transformer架构SAFformer,通过抑制可预测信号并聚焦显著视觉特征,在CIFAR和ImageNet-1K上实现新最优性能,平衡精度与能耗。

Comments IJCAI 2026(International Joint Conference on Artificial Intelligence)

详情
AI中文摘要

脉冲神经网络(SNNs)在生物合理性和能效方面具有显著优势,使其成为构建低功耗Transformer的有前景的候选方案。然而,现有的脉冲Transformer主要遵循被动反应范式,难以聚焦于任务相关信息,并且在处理冗余视觉数据时会产生大量计算开销。为了克服这一基础但尚未充分探索的局限性,我们提出了SAFformer,一种基于主动预测滤波范式的新型脉冲Transformer架构。受大脑预测编码机制的启发,SAFformer主动抑制可预测信号并聚焦于显著视觉特征。大量实验表明,SAFformer在CIFAR-10/100和CIFAR10-DVS上建立了新的最先进性能。值得注意的是,在ImageNet-1K上,它仅用26.58M参数和5.88 mJ的能耗就达到了80.44%的Top-1准确率,展现了精度与效率之间的卓越平衡。

英文摘要

Spiking Neural Networks (SNNs) offer notable advantages in biological plausibility and energy efficiency, making them promising candidates for building low-power Transformers. However, existing Spiking Transformers largely adhere to a passive reactive paradigm, which struggles to focus on task-relevant information and incurs substantial computational overhead when processing redundant visual data. To overcome this fundamental yet underexplored limitation, we propose SAFformer, a novel Spiking Transformer architecture based on an active predictive filtering paradigm. Inspired by the brain's predictive coding mechanism, SAFformer actively suppresses predictable signals and focuses on salient visual features. Extensive experiments show that SAFformer establishes new state-of-the-art performance on CIFAR-10/100 and CIFAR10-DVS. Remarkably, on ImageNet-1K, it achieves 80.44% Top-1 accuracy with only 26.58M parameters and an energy consumption of 5.88 mJ, demonstrating an exceptional balance between accuracy and efficiency.

2605.09420 2026-06-15 cs.CV cs.AI cs.MM 版本更新

Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery

关系检索:利用已知-新颖相互作用进行通用类别发现

Yulin Xu, Chunqi Guo, Yuanzhen Shuai, Jianyuan Ni

发表机构 * University of California, Irvine(加州大学尔湾分校) Sichuan Agricultural University(四川农业大学) University College London(伦敦大学学院) Juniata College(朱尼亚塔学院)

AI总结 本文通过关系检索视角解决通用类别发现问题,提出关系模式一致性方法,通过双向知识转移增强已知类别和新类别发现,实验表明在通用和细粒度基准上均取得最佳性能。

Comments Accepted by ICMR 2026 (Oral)

详情
AI中文摘要

在本研究中,我们从关系检索的视角解决通用类别发现(GCD)问题,通过双向知识转移显式耦合标记和未标记数据。现有方法将这两类数据分开处理,错过了有价值的交互机会;我们提出关系模式一致性(RPC),使两者相互增强。RPC使用一对多(One-vs-All)分类器进行软ID/OOD分解,然后引入两种机制:(i)为保持已知类别,我们转移语义行为对齐;(ii)为发现新类别,我们利用同一类别的样本与已知类别原型保持不变关系这一洞察,将不可靠的伪标签转化为定义明确的关系模式匹配。这种双向设计使标记数据指导未标记数据的学习,同时通过样本的集体关系签名发现新类别。广泛的实验表明,RPC在通用和细粒度基准上均取得最先进的性能。

英文摘要

In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.

4. 目标检测、分割与定位 7 篇

2606.13723 2026-06-15 cs.CV cs.AI 新提交

Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

形态感知样本分配:克服IoU不敏感性用于表面缺陷检测

Pengfei Liu, Yuhan Guo

发表机构 * School of Management, Harbin Institute of Technology(管理学院,哈尔滨工业大学) College of Computing and Data Science, Nanyang Technological University(计算与数据科学学院,南洋理工大学)

AI总结 针对IoU在缺陷检测中不敏感的问题,提出基于面积、形状和长宽比的形态相似性度量来优化正样本分配,理论分析表明该方法能重塑匹配函数响应分布,在NEUDET和GC10-DET数据集上基于YOLOv9框架取得一致性能提升,且零额外推理开销。

详情
AI中文摘要

交并比(IoU)作为评估候选框与真实标注空间对齐的关键指标,直接决定了正样本集的质量和视觉检测模型的训练效果。通过理论建模和分析,我们揭示了IoU响应曲线上的一个非敏感区域,在该区域内,尽管样本的几何重叠程度不同,但IoU得分几乎相同。为克服这一局限,我们引入一组形态相似性度量,涵盖面积、形状和长宽比,以优化正样本分配过程,从而确保更具区分性和可靠性的匹配。通过基于均值的多维相似性聚合,推导出一个补充匹配分数,补偿IoU在表示结构对应性方面的固有缺陷。理论上,融入形态相似性重塑了匹配函数的响应分布,产生有效的方向梯度和多边形等响应轮廓,将高响应区域紧密限制在每个真实实例周围,显著提高了正样本选择的精度。基于YOLOv9框架的实验在NEUDET和GC10-DET数据集上均取得一致性能提升。值得注意的是,所提方法完全即插即用,且零额外推理开销,从而确保了工业视觉检测的部署效率。

英文摘要

Intersection-over-Union (IoU), as a pivotal metric for evaluating the spatial alignment between candidate proposals and ground-truth annotations, directly determines the quality of positive sample sets and the training efficacy of visual detection models. Through theoretical modeling and analysis, we uncover a non-sensitive region on the IoU response curve, within which samples yield nearly identical IoU scores despite distinct geometric overlaps. To overcome this limitation, we introduce a set of morphological similarity metrics covering area, shape, and aspect ratio, to refine the positive sample assignment process, thereby ensuring more discriminative and reliable matching. A supplementary matching score is derived via mean-based aggregation of these multidimensional similarities, compensating for the intrinsic limitation of IoU in representing structural correspondence. Theoretically, incorporating morphological similarity reshapes the response distribution of the matching function, yielding both effective directional gradients and polygon-like iso-response contours, which tightly confine high-response regions around each ground-truth instance and substantially enhance the precision of positive sample selection. Experiments based on the YOLOv9 framework demonstrate consistent performance gains on both NEUDET and GC10-DET datasets. Notably, the proposed approach is fully plug-and-play and incurs zero additional inference overhead, thereby ensuring deployment efficiency for industrial visual inspection.
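A small worked example shows how two candidates with very different geometry can receive the same IoU, and how a morphology term separates them. The similarity definitions below (min/max ratios of area and of aspect ratio, averaged) are illustrative assumptions; the abstract names area, shape, and aspect ratio and a mean-based aggregation but gives no formulas, and this sketch omits a shape term.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(overlap)
    return inter / (area(a) + area(b) - inter)


def ratio_similarity(p, q):
    """1.0 when p == q, smaller as the two positive values diverge."""
    return min(p, q) / max(p, q)


def aspect_ratio(box):
    return (box[2] - box[0]) / (box[3] - box[1])


def morphology_score(cand, gt):
    """Mean of area and aspect-ratio similarity (hypothetical definitions)."""
    area_sim = ratio_similarity(area(cand), area(gt))
    aspect_sim = ratio_similarity(aspect_ratio(cand), aspect_ratio(gt))
    return (area_sim + aspect_sim) / 2


gt = (0, 0, 4, 4)
elongated = (0, 0, 12, 4)  # contains the ground truth but is three times as wide
shifted = (2, 0, 6, 4)     # same size and shape as the ground truth, offset by 2

iou_elongated, iou_shifted = iou(elongated, gt), iou(shifted, gt)
morph_elongated = morphology_score(elongated, gt)
morph_shifted = morphology_score(shifted, gt)
```

Both candidates score an IoU of 1/3 against the ground truth, yet the shifted box matches it exactly in area and aspect ratio (score 1.0) while the elongated box does not (score 1/3). An assignment rule that looks only at IoU cannot tell them apart.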

2606.14005 2026-06-15 cs.CV 新提交

Context-Guided Semantic Alignment for Feature Fusion Networks

上下文引导的特征融合网络语义对齐

Hyungseop Lee, Jiho Lee, Woochul Kang

发表机构 * Department of Embedded Systems Engineering, Incheon National University(仁川国立大学嵌入式系统工程系)

AI总结 提出轻量级语义对齐模块FINE,通过跨层级注意力机制利用高层上下文指导低层特征融合,并引入对齐感知令牌采样降低计算复杂度,提升目标检测精度。

Comments 26 pages, 12 figures, 8 tables

详情
AI中文摘要

特征融合网络是现代目标检测器的基础组件,通过聚合多尺度特征来检测不同大小的物体。然而,直接融合来自不同金字塔层次的特征往往因其异构表示而导致语义不一致。本文提出特征交互网络(FINE),一种轻量级语义对齐模块,在融合前通过跨层级注意力机制利用高层上下文指导来细化低层特征。为弥合结构差距并确保计算效率,我们引入对齐感知令牌采样,对齐跨尺度的对应空间区域,将注意力复杂度降低一个数量级。生成的注意力权重产生一个空间-通道调制图,通过残差逐元素调制进行上采样并应用于低层特征。该机制确保网络选择性地增强语义相关像素,同时保留密集预测任务所需的亚像素定位精度。FINE普遍适用于各种检测器,并在不牺牲效率的情况下持续提升检测精度。

英文摘要

Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations. In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce an Alignment-Aware Token Sampling that aligns corresponding spatial regions across scales, reducing the attention complexity by an order of magnitude. The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.

2606.14129 2026-06-15 cs.CV 新提交

BoRAD: Bootstrap your Own Representations for Multi-class Anomaly Detection

BoRAD: 自举表示实现多类异常检测

Duy Hoang Khuong, Tri Nguyen Minh, Ngu Huynh Cong Viet

发表机构 * Department of Artificial Intelligence, FPT University(FPT大学人工智能系) Department of IT, FPT University(FPT大学信息技术系) Department of Computing Fundamental, FPT University(FPT大学计算基础系)

AI总结 提出BoRAD框架,通过原型正则化解决多类异常检测中重建模型的捷径和误重建问题,无需标签即可实现单模型多类检测。

AI中文摘要

基于重建的异常检测在工业检测中具有吸引力,但将其从类别特定训练扩展到一劳永逸的设置具有挑战性。单个模型必须重建多样的正常外观,同时不复制异常细节,这暴露了两个耦合的失败模式:相同捷径,即异常通过重建路径;以及误重建,即正常类别相互混淆。我们提出BoRAD,一个无标签训练框架,将其视为表示容量分配问题。BoRAD使用共享的可学习原型库施加两个互补正则化器:空间原型对齐约束局部原型内变异以抑制异常复制,而原型相对全局对齐保留原型间结构并提高对异常角度偏差的敏感性。原型库和预测头仅在训练期间使用;推理保持标准的师生特征差异过程,无需类别标签、负样本对、内存检索或原型查找。BoRAD实现了具有竞争力的一劳永逸异常检测性能,包括MVTec AD上86.2% mAD、VisA上80.7% mAD和Real-IAD上73.1% mAD。诊断分析进一步显示异常泄漏减少、正常类别可分性提高以及异常-正常分数分离更强。

英文摘要

Reconstruction-based anomaly detection is attractive for industrial inspection, but scaling it from category-specific training to a one-for-all setting is challenging. A single model must reconstruct diverse normal appearances without copying abnormal details, which exposes two coupled failure modes: identical shortcut, where anomalies pass through the reconstruction path, and mis-reconstruction, where normal categories are confused with one another. We propose BoRAD, a label-free training framework that treats this as a representation-capacity allocation problem. BoRAD uses a shared learnable prototype bank to impose two complementary regularizers: spatial prototype alignment contracts local within-prototype variation to suppress anomaly copying, while prototype-relative global alignment preserves between-prototype structure and improves sensitivity to abnormal angular deviations. The prototype bank and prediction heads are used only during training; inference remains a standard teacher-student feature discrepancy pass, with no class labels, negative pairs, memory retrieval, or prototype lookup. BoRAD achieves competitive one-for-all anomaly detection performance, including 86.2% mAD on MVTec AD, 80.7% mAD on VisA and 73.1% mAD on Real-IAD. Diagnostic analyses further show reduced anomaly leakage, improved normal-category separability, and stronger anomaly-normal score separation.
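The inference step mentioned in the abstract above, a teacher-student feature discrepancy pass, can be sketched roughly as follows. The abstract does not state the discrepancy measure; cosine distance is used here as an assumption because it responds to angular deviations, and the max-pooled image score and all function names are likewise illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 0.0


def anomaly_map(teacher_feats, student_feats):
    """Per-location discrepancy between two H x W grids of feature vectors."""
    return [
        [cosine_distance(t, s) for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(teacher_feats, student_feats)
    ]


def image_score(amap):
    """Image-level score: maximum over locations (one common choice, assumed here)."""
    return max(max(row) for row in amap)


# Toy 1 x 2 feature grids with 2-D features (hypothetical values).
teacher = [[[1.0, 0.0], [0.0, 1.0]]]
student = [[[1.0, 0.0], [1.0, 0.0]]]  # second location deviates from the teacher
amap = anomaly_map(teacher, student)
print(amap, image_score(amap))
```

No prototype bank appears in this sketch, consistent with the abstract's statement that prototypes and prediction heads are used only during training.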

2606.14307 2026-06-15 cs.CV 新提交

Pano3D: Unified 3D Reconstruction and Panoptic Segmentation

Pano3D:统一的3D重建与全景分割

Victor Barberteguy, Ahmet Iscen, Mathilde Caron, Alireza Fathi, Gül Varol, Cordelia Schmid

发表机构 * Google DeepMind(谷歌深度思维) LIGM, École des Ponts, IP Paris, Univ Gustave Eiffel, CNRS(LIGM, 巴黎高科桥梁学院, 巴黎理工学院, 古斯塔夫·埃菲尔大学, 法国国家科学研究中心)

AI总结 提出统一框架,在3D前馈重建网络中集成全景分割,通过联合训练几何与语义损失实现互惠提升,在多个数据集上达到最优性能。

Comments Project page: https://victorbbt.github.io/Pano3D/

AI中文摘要

最近,3D前馈重建神经网络的进展在无需任何相机参数的图像密集重建中取得了显著成功。然而,为这些模型配备鲁棒的语义理解仍然是一个开放问题。本文提出了一种在统一框架中执行3D重建和3D全景分割的方法。我们基于现有的3D重建模型,并为其增加了一个基于集合的掩码解码器。该方法通过几何损失和语义损失进行联合训练,实验表明两者相互促进。更准确地说,特征从几何信息初始化,然后微调以同时捕获几何和语义。我们通过将框架成功应用于在线和全对全注意力重建骨干网络,证明了方法的通用性。我们的方法在ScanNet、ScanNet200和ScanNet++数据集上的3D全景分割中达到了最先进的性能。消融研究表明,这种统一模型的联合训练使3D前馈重建神经网络具备了全景分割能力,并带来了互惠的改进。

英文摘要

Recent advances in 3D feedforward reconstruction neural networks have achieved remarkable success in dense reconstruction from images without any camera parameters. Yet, equipping these models with robust semantic understanding remains an open problem. Here we introduce an approach that performs 3D reconstruction and 3D panoptic segmentation in a unified framework. We build on existing 3D reconstruction models and augment them with a set-based mask decoder. The approach is jointly trained with a geometric and semantic loss, which are shown to be mutually beneficial. More precisely, the features are initialized from the geometric information and then finetuned to capture jointly geometry and semantics. We demonstrate the generality of our approach by successfully applying our framework both to online and all-to-all attention reconstruction backbones. Our method achieves state-of-the-art performance in 3D panoptic segmentation across ScanNet, ScanNet200, and ScanNet++ datasets. Ablation studies show that such joint training of a unified model equips 3D feedforward reconstruction neural networks with panoptic segmentation and yields mutually beneficial improvements.

2606.14475 2026-06-15 cs.CV 新提交

Value-order Decomposition for Generalist Anomaly Detection

值序分解用于通用异常检测

Miaoyun Zhao, Jing Chen, Miaoni Zhao, Qiang Zhang

发表机构 * Dalian University of Technology(大连理工大学) Xi’an Chang’an Vanke City Primary School(西安长安万科城小学) Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education(社会计算与认知智能教育部重点实验室(大连理工大学))

AI总结 提出值序分解(VOD)方法,通过解耦和抑制类别、缺陷类型和域特定信息,实现跨域异常检测的强泛化。

AI中文摘要

工业异常检测受限于数据量少,使得跨域泛化尤其具有挑战性。通用异常检测(GAD)旨在在源域上训练一个统一模型,能够有效检测未见目标域中的异常。在初始语义特征空间中,异常与物体类别或缺陷类型之间的强纠缠阻碍了跨域的有效泛化。最近的工作通过将特征投影到残差空间来解决这个问题;然而,这些方法主要增加了正常特征的跨域重叠,而异常特征仍然与物体类别、缺陷类型和数据域相关,导致对齐和泛化效果差。为了解决这一限制,我们提出了值序分解(VOD),一种简单而有效的技术,它弥合了物体类别、缺陷类型(包括真实和合成缺陷)和数据域之间的三种泛化差距。VOD解耦并抑制了物体类别、缺陷类型和域特定信息,促进了正常和异常样本内部的对齐,同时保持了它们的可分离性,从而实现了跨三个差距的鲁棒泛化。利用同一物体内真实和合成缺陷之间的强对齐,我们仅使用正常和合成异常参考进行异常检测,并有效泛化到未见过的真实缺陷类型。在多样化的工业和医学基准上的实验表明,我们的方法使用简单的剪切粘贴异常模拟策略,实现了跨三个差距的强泛化。

英文摘要

Industrial anomaly detection suffers from limited data, making cross-domain generalization particularly challenging. Generalist Anomaly Detection (GAD) aims to train a unified model on a source domain that can effectively detect anomalies in unseen target domains. In the initial semantic feature space, strong entanglement between anomalies and object categories or defect types hinders effective generalization across domains. Recent works address this issue by projecting features into a residual space; however, such methods primarily increase cross-domain overlap for normal features, while anomalous features remain specific to object categories, defect types and data domains, leading to poor alignment and generalization. To address this limitation, we propose Value-order Decomposition (VOD), a simple yet effective technique that bridges three types of generalization gaps across object categories, defect types (including real and synthetic defects), and data domains. VOD disentangles and suppresses object-category-, defect-type-, and domain-specific information, promoting alignment within normal and abnormal samples while preserving their separability, thereby enabling robust generalization across the three gaps. Leveraging the strong alignment between real and synthetic defects within the same object, we perform anomaly detection using only normal and synthetic-abnormal reference, and effectively generalize to unseen real defect types. Experiments on diverse industrial and medical benchmarks demonstrate that our method, using a simple cut-and-paste anomaly simulation strategy, achieves strong generalization across the three gaps.

2606.14701 2026-06-15 cs.CV 新提交

RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

RATS!补丁通过寄存器对话:寄存器注意力Transformer中的涌现部件

Timing Yang, Predrag Neskovic, Jansen Seheult, Wenchao Han, Anand Bhattad, Alan Yuille, Feng Wang

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Office of Naval Research, Arlington, VA(海军研究办公室,阿灵顿,弗吉尼亚州) Department of Laboratory Medicine and Pathology, Mayo Clinic, MN, USA(梅奥诊所检验医学与病理学系,明尼苏达州,美国)

AI总结 提出RATS模型,通过将分类令牌分解为可学习的寄存器令牌,在L→N→N→L瓶颈中路由补丁信息,无需辅助损失或部件标注,每个寄存器自发专化为类似物体部件的原语义区域,在五个分割基准上平均mIoU提升12。

AI中文摘要

当人类看到一只鸟时,他们识别出的远不止是“鸟”——他们看到头部、翅膀和爪子,这是一个可重复使用部件的结构化组合,这些部件可以在他们见过的每一只鸟中被识别出来。我们询问一个自监督视觉模型能否自行发现相同的组合结构。为此,我们提出了RATS(寄存器注意力Transformer),它将分类令牌分解为N个可学习的寄存器令牌,通过三步压缩-通信-广播注意力机制,在L→N→N→L瓶颈中路由补丁信息。这N个寄存器被分配到H个注意力头上,因此分配给不同头的寄存器之间不相互作用。在没有辅助损失或部件标注的情况下,每个寄存器自发地专化为一个原语义区域,其涌现结构类似于物体部件。RATS在五个分割基准上平均超过所有基线+12 mIoU,在ADE20K(+1.11 mIoU)和COCO(+0.2 AP^m)上持续提升。其寄存器字典进一步展示了跨相关类别的部件级一致性和语义接近性。我们的结果表明,RATS可能为结构化和可解释的视觉表示学习提供有用的架构先验。

英文摘要

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L->N->N->L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
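The L->N->N->L bottleneck described in the abstract above can be sketched as three attention calls: registers read from patches (compress), registers attend to each other (communicate), and patches read back from registers (broadcast). This is only a single-head illustration with identity projections; the learned query/key/value projections and the partitioning of registers across H heads are omitted, and the helper names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out


def register_attention(patches, registers):
    compressed = attend(registers, patches, patches)            # L -> N
    communicated = attend(compressed, compressed, compressed)   # N -> N
    return attend(patches, communicated, communicated)          # N -> L


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # L = 4 patch tokens
registers = [[1.0, 0.0], [0.0, 1.0]]                        # N = 2 register tokens
out = register_attention(patches, registers)
print(out)
```

In this sketch patches never attend to each other directly, so every patch-to-patch interaction has to pass through the N register tokens.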

2605.25651 2026-06-15 cs.CV 版本更新

Hierarchical Consistency Learning for Test-time Adaptation in Camouflage Perception

用于伪装感知测试时适应的层次一致性学习

Mingfeng Zha, Tianyu Li, Guoqing Wang, Yunqiang Pei, Chaofan Qiao, Jiening Zhang, Yang Yang, Heng Tao Shen

发表机构 * Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China(未来媒体中心和电子科技大学计算机科学与工程学院) School of Computer Science and Technology, Tongji University(计算机科学与技术学院,同济大学) Peng Cheng Laboratory(鹏城实验室)

AI总结 提出层次一致性学习(HCL)框架,通过测试时适应动态调整表示,结合层次表示重构、任务亲和引导和原型一致性校准,解决伪装目标检测中的域刚性和注释依赖问题。

Comments Accepted by IEEE TIP

AI中文摘要

伪装目标检测(COD)旨在通过物理属性定位与背景感知差异最小的目标。现有方法受限于静态的“训练-冻结”范式,存在域刚性和注释依赖,限制了其对场景变化和未见伪装模式的适应性。为克服这些问题,我们提出层次一致性学习(HCL)框架,该框架集成了测试时适应以实现动态表示重校准。具体而言,我们设计了层次表示重构(HRR),通过协同空间重构与双流频域分解来缓解特征纠缠,增强对表观均匀化的鲁棒性。像素和频谱推理提供了结构和上下文先验。我们进一步引入任务亲和引导(TAG),通过通道级亲和力在分支间传播知识,对齐局部判别线索并缓解语义漂移。为确保语义不变性,我们制定了原型一致性校准(PCC),将区域特征聚合为紧凑原型并建立原型-特征相似度。这施加了隐式和层次化的约束,弥合了任务和表示之间的差距。在四个伪装和四个水下目标基准上,在三种退化设置下的广泛实验表明,我们的方法始终优于最先进的方法,突显了其在分布偏移下的鲁棒性和泛化能力。

英文摘要

Camouflaged object detection (COD) aims to localize targets that exhibit minimal perceptual differences from backgrounds through physical attributes. Existing methods, constrained by the static train-then-freeze paradigm, suffer from domain rigidity and annotation dependency, limiting their adaptability to scene variations and unseen camouflage patterns. To overcome these, we propose the hierarchical consistency learning (HCL) framework, which integrates test-time adaptation for dynamic representation recalibration. Specifically, we design the hierarchical representation reconstruction (HRR) to alleviate feature entanglement by synergizing spatial reconstruction with dual-stream frequency-domain decomposition, enhancing robustness against appearance homogenization. The pixel and spectrum inference provide structural and contextual priors. We further introduce task affinity guidance (TAG) to propagate knowledge across branches via channel-wise affinity, aligning local discriminative cues and mitigating semantic drift. To ensure semantic invariance, we formulate the prototype consistency calibration (PCC), which aggregates region features into compact prototypes and establishes prototype-feature similarity. This imposes implicit and hierarchical constraints that bridge task and representation gaps. Extensive experiments across four camouflaged and four underwater object benchmarks, under three degradation settings, demonstrate that our method consistently outperforms state-of-the-art approaches, highlighting its robustness and generalization under distribution shifts.

5. 视频理解与时序视觉 6 篇

2606.13714 2026-06-15 cs.CV 新提交

TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

TSA: 时间槽激活用于持久目标中心视频表示

Duc Nguyen, Sieu Tran, Hao Vo, Khoa Vo, Duy Minh Ho Nguyen, Nghi D. Q. Bui, Anh Nguyen, Long Mai, Ngan Le

发表机构 * University of Arkansas, USA(阿肯色大学) Max Planck Research School for Intelligent Systems(马克斯·普朗克智能系统研究所) Google Research, Google(谷歌研究院) University of Liverpool, UK(利物浦大学) Adobe Research(Adobe研究院)

AI总结 提出时间槽激活(TSA)机制,通过学习每槽每帧激活分数实现持久槽的生命周期建模,解决无条件传播导致的状态漂移和重建干扰问题,在多个基准上提升目标分解和时间身份保持。

AI中文摘要

无监督视频目标中心学习旨在将动态场景分解为时间上持久的实体表示。现有的循环视频槽注意力方法在帧间传播一组固定的槽,但通常假设无条件槽传播:每个槽在每一帧都被更新和解码,无论其对应目标是否可见。我们表明,这种设计违反了持久槽的基本生命周期要求:当目标缺失或完全遮挡时,其槽应保留先前状态,并避免解释无关的可见内容。相反,无条件传播导致两种失败路径:更新引起的状态漂移(当前帧证据覆盖缺失目标的表示)和解码器引起的重建干扰(非活跃槽通过解码器注意力保持与重建的耦合)。我们提出时间槽激活(TSA),一种无需可见性监督即可学习每槽每帧激活分数 $\alpha_{k,t} \in (0, 1)$ 的机制。TSA 使用该激活作为共享潜在控制变量进行槽生命周期建模。当槽不活跃时,TSA 通过激活门控更新将其状态锚定到前一槽,并通过在 softmax 归一化前对注意力 logits 施加激活依赖的加性偏置来抑制其解码器参与。这共同减少了状态漂移和重建驱动的干扰。为了在部分遮挡和逐渐重现下改进决策,TSA 进一步将激活预测条件于时间上下文编码器生成的每槽时间记忆。我们在 MOVi-C/E、YT-VIS 和 OVIS 基准上使用标准指标和基于跟踪的指标(FG-ARI、mBO、IDF1、HOTA)评估 TSA。TSA 持续改进了目标分解和时间身份保持,在长且严重遮挡的视频上取得了大幅提升。

英文摘要

Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object's representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous slot via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference. To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
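A rough sketch of the two uses of the activation score described in the abstract above. The abstract names "activation-gated updating" and an "activation-dependent additive bias on attention logits before softmax" but gives no formulas, so the convex interpolation and the `log(alpha)` bias below are assumptions chosen for illustration, as are the function names.

```python
import math


def gated_update(prev_slot, cand_slot, alpha):
    """Blend the candidate update with the previous state; alpha near 0 keeps the old state."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev_slot, cand_slot)]


def decoder_weights(logits, alphas, eps=1e-6):
    """Softmax over slots after adding log(alpha) to each slot's logit (assumed bias form)."""
    biased = [l + math.log(a + eps) for l, a in zip(logits, alphas)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


prev, cand = [1.0, 2.0], [5.0, 6.0]
print(gated_update(prev, cand, 0.0))   # inactive slot: state is anchored to prev
print(gated_update(prev, cand, 1.0))   # fully active slot: state follows the candidate

# Two slots compete for one pixel with equal logits; the second slot is nearly inactive.
weights = decoder_weights([0.0, 0.0], [0.99, 1e-4])
print(weights)
```

With equal logits, the nearly inactive slot receives almost no decoder attention in this sketch, which is the qualitative behavior the abstract describes for inactive slots.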

2606.13861 2026-06-15 cs.CV 新提交

Temporal Backtracking Search for Test-time Generative Video Reasoning

时间回溯搜索:测试时生成式视频推理

Sejoon Jun, Zheng Ding, Huangyuan Su, Weirui Ye, Yilun Du

发表机构 * Northeastern University(东北大学) Independent Researcher(独立研究者) Harvard & Kempner(哈佛大学与肯普纳研究所) MIT(麻省理工学院)

AI总结 提出时间回溯搜索(TBS),通过将搜索空间转移到时间轴,在扩散过程中定位失败点并回溯重启,显著提升视频模型在测试时的推理能力。

AI中文摘要

虽然测试时扩展已彻底改变了大型语言模型的推理能力,但生成式视频推理仍受限于单次生成范式。我们证明,在去噪步骤上进行搜索无法挽救逻辑有缺陷的生成结果,因为空间轨迹在扩散过程的早期就已确定。根级最佳N(BoN)采样同样低效:推理错误在时间轴上早期聚集,而重新采样盲目丢弃已验证的先前进展。为了解锁视频模型的有效测试时扩展,我们引入了时间回溯搜索(TBS),它将搜索空间转移到时间轴。TBS通过三个核心机制将视频生成转化为迭代的生成-验证-重启循环:(1)可变K条件化,从任意干净的起始前缀恢复生成;(2)时间过程验证,定位失败并提取有效的重启锚点;(3)基于前缀的搜索,将计算重新分配给扩展正确轨迹,而不是根重采样。在算法、导航和机器人领域,TBS帕累托优于同等预算的BoN。在严格的分布外设置中,单次生成崩溃(BoN为0.7%),TBS达到22.7%,每个解决的片段都来自重启的分支。最终,TBS揭示了视频模型的局部推理能力远超单次生成所显示的水平,提供了一个可扩展的测试时框架来释放这种能力。

英文摘要

While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process. Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis. TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.
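The generate-verify-restart loop from the abstract above can be sketched as follows. The generator and verifier here are toy stand-ins (a hypothetical `toy_generate` that gets only two more frames right per attempt, and a `first_failure` check against a known target); the real system uses a video diffusion model with variable-K conditioning and a temporal process verifier, neither of which is modeled here.

```python
def temporal_backtracking_search(generate, first_failure, horizon, max_restarts):
    """Keep the verified prefix of each rollout and regenerate only what follows it."""
    prefix = []
    for _ in range(max_restarts + 1):
        rollout = generate(prefix, horizon)   # resume from a clean prefix
        fail_at = first_failure(rollout)      # index of first bad frame, or None
        if fail_at is None:
            return rollout
        prefix = rollout[:fail_at]            # restart anchor: verified progress so far
    return None


TARGET = [0, 1, 2, 3, 4, 5]  # hypothetical "correct" frame sequence


def toy_generate(prefix, horizon):
    """Toy generator: extends the prefix correctly by two frames, then emits bad frames."""
    frames = prefix + TARGET[len(prefix):len(prefix) + 2]
    return frames + [-1] * (horizon - len(frames))


def first_failure(rollout):
    for i, frame in enumerate(rollout):
        if frame != TARGET[i]:
            return i
    return None


solved = temporal_backtracking_search(toy_generate, first_failure, horizon=6, max_restarts=2)
unsolved = temporal_backtracking_search(toy_generate, first_failure, horizon=6, max_restarts=1)
print(solved, unsolved)
```

Root-level resampling with this toy generator would never succeed, since every fresh rollout fails at frame 2; reusing verified prefixes is what lets the search reach the full horizon.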

2606.14094 2026-06-15 cs.CV cs.AI 新提交

FEMOT: Multi-Object Tracking using Frame and Event Cameras

FEMOT: 使用帧和事件摄像机的多目标跟踪

Shiao Wang, Xiao Wang, Chao Wang, Yitao Li, Menghao Liu, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang

发表机构 * School of Computer Science and Technology, Anhui University(安徽大学计算机科学与技术学院) Peng Cheng Laboratory(鹏城实验室) National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理全国重点实验室) School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University(北京大学深圳研究生院电子与计算机工程学院) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳))

AI总结 提出FEMOT大规模RGB-事件多目标跟踪数据集和FEMOTR多模态跟踪框架,通过频域融合解耦特征,有效利用互补信息实现鲁棒跟踪。

AI中文摘要

传统的RGB摄像机因其捕获丰富外观和语义信息的能力而被广泛用于多目标跟踪。然而,在复杂的现实挑战下,如运动模糊、低照度和过度曝光,其性能通常会下降。受生物启发的事件摄像机提供高时间分辨率和高动态范围,在极端场景下提供互补线索。尽管如此,由于缺乏大规模且标注良好的数据集,RGB-事件多目标跟踪仍未被充分探索。为解决这一问题,我们提出了FEMOT,一个大规模RGB-事件多目标跟踪数据集,涵盖多样化的现实场景和14个具有挑战性的属性。凭借RGB和事件数据以及高质量标注,FEMOT为系统评估RGB-事件多目标跟踪方法提供了可靠平台。基于FEMOT,我们重新训练并评估了超过十个强跟踪器,从而为未来研究建立了全面的基准。此外,我们提出了FEMOTR,一种多模态跟踪框架,该框架解耦RGB和事件特征并在频域中融合它们,从而有效利用其互补特性实现鲁棒的目标定位和身份关联。在FEMOT和DSEC-MOT数据集上的大量实验证明了所提方法的有效性。源代码和基准数据集已在 https://github.com/Event-AHU/FEMOT 上发布。

英文摘要

Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the lack of large-scale and well-annotated datasets. To address this issue, we propose FEMOT, a large-scale RGB-event multi-object tracking dataset that covers diverse real-world scenarios and 14 challenging attributes. With both RGB and event data as well as high-quality annotations, FEMOT provides a reliable platform for systematically evaluating RGB-event multi-object tracking methods. Based on FEMOT, we retrain and evaluate over ten strong trackers, thereby establishing a comprehensive benchmark for future research. Furthermore, we propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain, thereby effectively exploiting their complementary characteristics for robust object localization and identity association. Extensive experiments on FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method. The source code and benchmark dataset have been released on https://github.com/Event-AHU/FEMOT.

2606.14380 2026-06-15 cs.CV 新提交

FLaRA: Predicting Future Latent Representations for Accident Anticipation

FLaRA: 预测未来潜在表示用于事故预警

Lorenzo Caselli, Tomaso Trinci, Tommaso Bianconcini, Simone Magistri, Leonardo Taccari, Francesco Sambo, Andrew D. Bagdanov

发表机构 * Department of Information Engineering, University of Florence(佛罗伦萨大学信息工程系) Verizon Connect

AI总结 提出FLaRA架构,通过预测未来潜在表示实现事故预警,在Nexar等数据集上达到最优性能。

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

AI中文摘要

从行车记录仪视频中预测交通事故是智能交通系统中的一个关键挑战。现有方法通常直接将视觉上下文映射到碰撞概率,而没有显式建模驾驶场景的未来演化。在本文中,我们提出了FLaRA(预测未来潜在表示用于事故预警),一种新颖的预测架构,通过预测未来潜在表示来转变这一范式。基于视频联合嵌入预测架构(V-JEPA2),我们的模型将预测器网络条件于观察到的上下文帧,以预测场景即将到来的潜在特征。然后,分类器对这些预测的未来表示进行操作,而不仅仅是过去的观察。为了确保这些预测基于现实的未来动态,我们引入了一个联合训练目标,同时优化辅助的特征级重建损失和交叉熵分类损失。在Nexar数据集上的广泛评估,以及在DAD、DADA-2000和DoTA基准上的跨域验证,表明我们的方法在保持现实早期预警能力的同时实现了最先进的性能。

英文摘要

Anticipating traffic accidents from dashcam videos is a critical challenge in intelligent transportation systems. Existing methods typically map visual context directly to a collision probability without explicitly modeling the future evolution of the driving scene. In this paper we propose FLaRA (Predicting Future Latent Representations for Accident Anticipation), a novel predictive architecture that shifts this paradigm by forecasting future latent representations for accident anticipation. Building upon the Video Joint-Embedding Predictive Architecture (V-JEPA2), our model conditions a predictor network on observed context frames to predict the forthcoming latent features of the scene. A classifier then operates on these predicted future representations rather than only on past observations. To ensure these forecasts remain grounded in realistic future dynamics, we introduce a joint training objective that simultaneously optimizes an auxiliary feature-level reconstruction loss and a cross-entropy classification loss. Extensive evaluations on the Nexar dataset, alongside cross-domain validations on the DAD, DADA-2000, and DoTA benchmarks, demonstrate that our approach achieves state-of-the-art performance while maintaining realistic early warning capabilities.

2606.14631 2026-06-15 cs.CV 新提交

SED: Lightweight Saliency prediction for Event-based data via Distillation

SED: 基于蒸馏的轻量级事件数据显著性预测

Romaric Mazna, Jean Martinet, Michele Magno

发表机构 * i3S/CNRS, Université Côte d’Azur(法国蔚蓝海岸大学i3S/CNRS实验室) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出轻量级网络SED,通过知识蒸馏和深度时空块(DSTconv)实现事件数据显著性预测,模型大小减少562倍,参数减少554倍,性能匹配或超越教师模型。

AI中文摘要

基于事件的显著性预测最近受到关注,因为将事件相机与显著性估计结合可以作为上游阶段,自然提高边缘端下游事件感知的效率。然而,当前的方法要么是神经形态的,在基于事件的显著性基准上表现不佳,要么由于依赖Transformer或3D卷积而对资源受限的边缘应用来说过于沉重。受高效卷积模块的启发,并旨在利用事件数据中的时间信息,我们提出了SED,一种通过知识蒸馏训练的轻量级网络,构建于逐深度时空块(DSTconv)之上——这是3D深度可分离卷积的分解。相对于其教师模型,我们的模型将模型大小从180 MB减少到0.32 MB(562倍),参数数量从45M减少到81k(554倍),同时在N-DHF1K和N-UCF Sports数据集上匹配或超越其性能。此外,它在训练分布之外具有很强的泛化能力,从合成事件数据迁移到真实事件数据,而从头训练的模型则失败。

英文摘要

Event-based saliency prediction has gained attention recently, as combining event cameras with saliency estimation can act as an upstream stage that naturally improves the efficiency of downstream event-based perception at the edge. However, current approaches are either neuromorphic, underperforming on event-based saliency benchmarks, or too heavy for resource-constrained edge applications due to their reliance on transformers or 3D convolutions. Drawing inspiration from efficient convolutional modules and aiming to exploit the temporal information in event data, we propose SED, a lightweight network trained through knowledge distillation and built on a Depthwise Spatio-Temporal Block (DSTconv) -- a factorization of the 3D depthwise separable convolution. Relative to its teacher, our model reduces the model size from 180 MB to 0.32 MB (562x) and the parameter count from 45M to 81k (554x), while matching or outperforming it on the N-DHF1K and N-UCF Sports datasets. Moreover, it generalizes strongly beyond its training distribution, transferring from synthetic to real event data where a model trained from scratch fails.
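To make the parameter savings of such factorizations concrete, the sketch below counts weights (biases ignored) for a standard 3D convolution, a 3D depthwise separable convolution, and one possible factorization of the depthwise part into a spatial kernel followed by a temporal kernel. The abstract does not specify how DSTconv is factorized, so the third formula is an assumption, and the channel and kernel sizes are hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Standard 3D convolution: every output channel sees every input channel."""
    return c_in * c_out * kt * kh * kw


def dw_separable3d_params(c_in, c_out, kt, kh, kw):
    """Depthwise 3D kernel per input channel, then a 1x1x1 pointwise convolution."""
    return c_in * kt * kh * kw + c_in * c_out


def factorized_dw_params(c_in, c_out, kt, kh, kw):
    """Assumed factorization: depthwise spatial (1 x kh x kw), depthwise temporal
    (kt x 1 x 1), then a 1x1x1 pointwise convolution."""
    return c_in * kh * kw + c_in * kt + c_in * c_out


args = (64, 64, 3, 3, 3)  # hypothetical layer: 64 -> 64 channels, 3x3x3 kernel
print(conv3d_params(*args), dw_separable3d_params(*args), factorized_dw_params(*args))

size_ratio = 180 / 0.32  # model sizes in MB from the abstract; about 562.5
print(size_ratio)
```

For this hypothetical layer the counts are 110592, 5824, and 4864 weights respectively. The size ratio computed from the abstract's rounded figures is consistent with the reported 562x.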

2604.15173 2026-06-15 cs.CV 版本更新

Boundary-Centric Clip-Budgeted Active Learning for Temporal Action Segmentation

面向时间动作分割的边界中心剪辑预算主动学习

Halil Ismail Helvaci, Sen-ching Samson Cheung

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出B-ACT框架,通过边界中心监督策略,在有限标注预算下优先标注动作边界帧,显著提升时间动作分割的标签效率,在多个数据集上超越现有方法。

AI中文摘要

未修剪视频中的时间动作分割(TAS)需要密集的时间监督。然而,大部分标注成本花费在识别动作转换上,这些地方分割错误集中,且微小的时间偏移会不成比例地降低片段级指标。我们引入B-ACT,一种剪辑预算主动学习框架,明确地将监督分配到这些易出错的边界区域。B-ACT在分层两阶段循环中运行:(i) 使用预测不确定性对未标记视频进行排序和查询,(ii) 在每个选定视频中,从当前模型预测中检测候选转换,并通过新颖的边界分数选择前K个边界。边界分数融合邻域不确定性、类别模糊性和时间预测动态,以揭示每帧的潜在重要性。重要的是,我们的标注协议仅请求边界帧的标签,同时仍然在边界中心剪辑上训练,以通过模型的感受野利用时间上下文。在GTEA、50Salads和Breakfast上的大量实验表明,边界中心监督在稀疏预算下实现了强大的标签效率,并持续优于代表性的TAS主动学习基线和先前的最先进方法。在性能对边界放置高度敏感的数据集上(通过编辑和基于重叠的F1指标衡量),增益最大。

英文摘要

Temporal action segmentation (TAS) in untrimmed videos requires dense temporal supervision. However, most of the annotation cost is spent identifying action transitions where segmentation errors concentrate and small temporal shifts can disproportionately degrade segment-level metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these error-prone boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score. The boundary score fuses neighborhood uncertainty, class ambiguity, and temporal prediction dynamics to reveal the underlying importance of each frame. Importantly, our annotation protocol requests labels only at the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets. Gains are largest on datasets where performance is highly sensitive to boundary placement, as measured by edit and overlap-based F1 metrics.
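The second stage of the loop in the abstract above (detect candidate transitions, score them, keep the top-K) can be sketched as follows. The abstract says the boundary score fuses neighborhood uncertainty, class ambiguity, and temporal prediction dynamics but does not give the formula; the entropy, margin, and total-variation terms and their unweighted sum are illustrative assumptions, as are all function names.

```python
import math


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def candidate_boundaries(probs):
    """Frames where the predicted class differs from the previous frame."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]


def boundary_score(probs, t, radius=1):
    window = probs[max(0, t - radius): t + radius + 1]
    uncertainty = sum(entropy(p) for p in window) / len(window)  # neighborhood uncertainty
    top = sorted(probs[t], reverse=True)
    ambiguity = 1.0 - (top[0] - top[1])                          # small margin = ambiguous
    dynamics = 0.5 * sum(abs(a - b) for a, b in zip(probs[t], probs[t - 1]))
    return uncertainty + ambiguity + dynamics                    # assumed equal weighting


def select_boundaries(probs, k):
    cands = candidate_boundaries(probs)
    return sorted(cands, key=lambda t: boundary_score(probs, t), reverse=True)[:k]


# Per-frame class probabilities for a toy 8-frame, 2-class video (hypothetical values).
probs = [[0.9, 0.1], [0.9, 0.1], [0.55, 0.45], [0.45, 0.55],
         [0.1, 0.9], [0.1, 0.9], [0.9, 0.1], [0.9, 0.1]]
print(candidate_boundaries(probs), select_boundaries(probs, 1))
```

With these toy predictions the gradual, low-confidence transition at frame 3 outranks the sharp, confident one at frame 6, so a budget of one label goes to frame 3.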

6. 生成式视觉与世界模型 18 篇

2606.13809 2026-06-15 cs.CV 新提交

Compressing Image Style Training into a Single Model Forward

将图像风格训练压缩为单次模型前向传播

Zhongjie Duan, Yingda Chen

发表机构 * ModelScope Team, Alibaba Group(阿里巴巴集团 ModelScope 团队)

AI总结 提出i2L框架,通过单次前向传播预测LoRA权重,实现高效风格迁移,避免逐风格优化,在风格保真度、提示对齐和感知质量上超越现有基线。

Comments 11 pages, 9 figures

AI中文摘要

基于扩散的风格迁移必须在推理效率与风格化保真度之间取得平衡。基于适配器的方法效率高,但将风格作为外部条件注入,可能削弱参考图像的特定外观或将参考语义复制到生成图像中。基于优化的个性化方法(如LoRA)能更有效地内化风格,但每个新风格都需要独立的训练过程。我们提出i2L(图像到LoRA),一种将风格LoRA训练摊销为单次前向传播的框架。给定一张或多张参考图像,i2L预测文本到图像模型的LoRA权重,无需逐风格优化即可立即实例化风格。该架构结合了图像编码器、可学习的LoRA查询以及生成适配矩阵的压缩解码头。在语义多样的风格对上训练,鼓励预测器保留外观线索同时抑制参考内容复制。在Z-Image、FLUX.2和Hidream-O1上的实验表明,i2L在风格保真度、提示对齐和感知质量上优于现有基线。由于i2L生成显式的LoRA权重,它还支持非对称无分类器引导、多参考风格融合以及与可控生成模块的组合。

英文摘要

Diffusion-based style transfer must balance inference efficiency with stylization fidelity. Adapter-based methods are efficient, but they inject style as an external condition and can either weaken reference-specific appearance or copy reference semantics into the generated image. Optimization-based personalization methods such as LoRA internalize style more effectively, but require a separate training process for every new style. We introduce i2L (image-to-LoRA), a framework that amortizes style LoRA training into a single forward pass. Given one or more reference images, i2L predicts LoRA weights for a text-to-image model, enabling immediate style instantiation without per-style optimization. The architecture combines an image encoder, learnable LoRA queries, and compressed decoding heads that generate adapted matrices. Training on semantically diverse style pairs encourages the predictor to preserve appearance cues while suppressing reference-content copying. Experiments on Z-Image, FLUX.2, and Hidream-O1 show that i2L improves style fidelity, prompt alignment, and perceptual quality over existing baselines. Because i2L produces explicit LoRA weights, it also supports asymmetric classifier-free guidance, multi-reference style fusion, and composition with controllable-generation modules.
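Because the method in the abstract above outputs explicit LoRA weights, applying a predicted style reduces to the usual low-rank update W' = W + s·BA. The sketch below shows that update on plain nested lists, plus one simple way multiple predicted adapters might be fused (a weighted sum of their updates). The predictor network itself is not modeled, and the fusion rule and function names are assumptions rather than the paper's procedure.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def apply_lora(w, lora_b, lora_a, scale=1.0):
    """Return W + scale * (B @ A), where B is (out x r) and A is (r x in)."""
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def fuse_styles(w, loras, weights):
    """Assumed multi-reference fusion: add each adapter's update with its own weight."""
    out = w
    for (lora_b, lora_a), wt in zip(loras, weights):
        out = apply_lora(out, lora_b, lora_a, scale=wt)
    return out


w = [[1.0, 0.0], [0.0, 1.0]]   # toy base weight matrix
b = [[1.0], [2.0]]             # rank-1 "predicted" B (hypothetical values)
a = [[3.0, 4.0]]               # rank-1 "predicted" A (hypothetical values)
styled = apply_lora(w, b, a)
fused = fuse_styles(w, [(b, a), (b, a)], [0.5, 0.5])
print(styled, fused)
```

Fusing the same adapter twice with weights 0.5 and 0.5 reproduces the single-adapter result, which is a quick sanity check on the update rule.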

2606.13872 2026-06-15 cs.CV 新提交

Avatar V: Scaling Video-Reference Avatar Video Generation

Avatar V:缩放视频参考的化身视频生成

Benjamin Liang, Ce Chen, Desmond Lin, Ivan Somov, Jiajun Zhao, Jiewei Yuan, Jingfeng Zhang, Junhao Huang, Nik Nolte, Pedram Haqiqi, Penghan Wang, Rong Yan, Rui Zhang, Sam Prokopchuk, Sivan Wang, Viktor Goriachko, Yi Ren, Yuanming Li, Yutao Chen, Zhenhui Ye, Zhibin Hong, Zilong Nie, Zujin Guo

发表机构 * HeyGen Research(HeyGen 研究院)

AI总结 提出Avatar V框架,通过视频参考条件化身份建模,利用稀疏参考注意力机制实现长视频线性复杂度条件化,结合运动表示流和身份感知超分辨率精炼器,生成无限时长1080p视频,在身份保持、唇形同步和质量上超越现有方法。

Comments 31 pages, 15 figures. All contributors are listed in alphabetical order by first name

AI中文摘要

生成不仅视觉上相似而且行为上可识别的化身视频,忠实再现其说话节奏、手势倾向和表情动态,仍然是一个开放挑战。现有方法主要依赖于静态单张图像,这提供的身份信息不足且无法捕捉动态运动特征,而标准的像素级目标函数对决定化身保真度的感知关键面部区域关注不够。我们提出了Avatar V,一个生产级框架,通过视频参考条件化身份建模来解决这些限制。该模型不将身份压缩为固定大小的嵌入,而是直接条件化于参考视频的完整令牌序列,通过参考上下文上的注意力学习再现静态身份属性(面部几何、皮肤纹理)和动态行为模式(说话节奏、微表情)。我们引入了稀疏参考注意力,一种非对称机制,实现对任意长参考的线性复杂度条件化;一个运动表示流,实现闭环说话风格迁移;以及一个身份感知超分辨率精炼器,继承完整的参考条件化。这些由数据引擎支持,该引擎从5000万原始视频中策划了1亿以上的训练片段,以及一个五阶段训练流程,包括流匹配预训练、个性微调、两阶段蒸馏(>10倍加速)和RLHF对齐,部署在数千个GPU上。Avatar V生成无限时长的1080p视频,在我们的跨场景基准测试中实现了最先进的身份保持、唇形同步和生成质量,在自动指标和人工评估中持续优于包括Seedance 2.0、Kling O3 Pro、Veo 3.1和OmniHuman 1.5在内的领先系统。

英文摘要

Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (>10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.

2606.13898 2026-06-15 cs.CV cs.AI 新提交

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

HiLo-Token: 输入自适应的高低频令牌压缩用于高效图像编辑

Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi

发表机构 * Adobe ART AI Lab(Adobe ART AI实验室) Adobe Research(Adobe研究院)

AI总结 针对扩散变换器(DiT)在图像编辑中延迟高的问题,提出输入自适应的令牌压缩框架HiLo-Token,根据空间频率分配令牌预算,在保持生成质量的同时实现高达3.13倍加速。

Comments 14 pages, 10 figures, Patent filed

详情
AI中文摘要

创意图像编辑工具,如Photoshop的移除或生成填充按钮,是日常客户使用的核心,并占Photoshop和Lightroom流量的主要部分。然而,当前的生成式AI模型面临显著的延迟挑战,当从基于卷积的U-Net过渡到扩散变换器(DiT)时,这一问题变得更加突出。在我们对数百个代表性图像编辑样本(涵盖广泛的掩码比例)的评估中,即使将DiT模块从50个时间步蒸馏到8个时间步,它单独就占总模型延迟的平均73%。为了应对这一挑战,我们提出了$\textbf{HiLo-Token}$,一个输入自适应的令牌压缩框架,该框架将更多令牌预算分配给高频、丰富上下文的区域,同时将更少令牌分配给低频区域。具体来说,对于用户掩码指定的编辑区域,我们保留膨胀掩码内的所有令牌,以保持强局部性和上下文相关性。在编辑区域之外,我们引入了一种简单而有效的基于空间频率的高频令牌选择策略,以捕获重要的局部细节,同时使用来自16倍下采样图像的令牌来表示低频分量,并保留模糊但全局的结构。在生产级评估数据上的大量实验验证了所提方法的有效性,在A100-80GB上,对于小、中、大掩码比例类别(平均比例分别为6.38%、15.92%和35.36%),图像编辑任务分别实现了3.13倍、2.59倍和1.67倍的DiT加速,且生成质量无任何退化。

英文摘要

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose $\textbf{HiLo-Token}$, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
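The token-allocation idea in the abstract above (keep every token inside a dilated edit mask, then spend a limited budget on the highest-frequency tokens outside it) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `dilate` and `select_tokens`, the per-token `hf_score` input, the square dilation window, and the fixed `budget` are all hypothetical and are not taken from the paper. The low-frequency tokens from the 16x downsampled image are not modeled here.

```python
def dilate(mask, radius=1):
    """Binary dilation of a 2D 0/1 grid with a square window (assumed shape)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                for di in range(-radius, radius + 1):
                    for dj in range(-radius, radius + 1):
                        y, x = i + di, j + dj
                        if 0 <= y < h and 0 <= x < w:
                            out[y][x] = 1
    return out


def select_tokens(hf_score, edit_mask, budget, radius=1):
    """Return sorted (row, col) positions of tokens kept at full resolution.

    hf_score  : hypothetical per-token high-frequency score
                (for example, local gradient energy of the patch)
    edit_mask : 1 where the user mask covers the token, else 0
    budget    : number of extra high-frequency tokens kept outside the mask
    """
    dilated = dilate(edit_mask, radius)
    h, w = len(hf_score), len(hf_score[0])
    inside = [(i, j) for i in range(h) for j in range(w) if dilated[i][j]]
    outside = [(i, j) for i in range(h) for j in range(w) if not dilated[i][j]]
    # Highest score first; ties are broken by position so the result is deterministic.
    outside.sort(key=lambda p: (-hf_score[p[0]][p[1]], p))
    return sorted(inside + outside[:budget])
```

With a small mask most tokens fall outside the dilated region, so the kept set stays small. This is consistent with the abstract reporting its largest speedup for the small-mask category.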

2606.13964 2026-06-15 cs.CV 新提交

CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

CaricHarmony:身份保持的漫画合成的对比扩散路径

Dongyu Wang, Dar-Yen Chen, Yi-Zhe Song

发表机构 * SketchX, CVSSP, University of Surrey(萨里大学CVSSP实验室SketchX组)

AI总结 提出CaricHarmony,一种无需训练的方法,通过并行无污染扩散路径解决身份与形状条件信号污染问题,实现平衡的漫画合成,在保持身份一致性的同时达到最优形状保真度。

详情
AI中文摘要

基于草图的漫画合成存在一个根本性失败模式:当身份和形状条件在扩散模型中结合时,它们会产生破坏性干扰,导致不可避免地向平淡肖像或无法识别的扭曲崩溃。我们将根本原因确定为\emph{条件信号污染}——去噪轨迹中竞争的概率分布使得平衡生成变得不可能。我们提出了CaricHarmony,这是第一种通过并行无污染扩散路径明确解决这种污染的无训练方法。在推理过程中,我们维护三条路径:$\mathcal{P}^{\mathrm{i}}$(纯身份)、$\mathcal{P}^{\mathrm{s}}$(纯形状)和$\mathcal{P}^{\mathrm{i+s}}$(和谐输出)。作用于交叉注意力特征的新型能量函数提供梯度引导,将$\mathcal{P}^{\mathrm{i+s}}$导向最优平衡:$\mathcal{E}_{\mathrm{shape}}$通过布局和语义对齐确保草图保真度,而$\mathcal{E}_{\mathrm{id}}$采用对极端扭曲鲁棒的令牌级对应匹配。与需要每身份70秒微调的DemoCaricature或受限于贝塞尔曲线的CaricatureBooth不同,CaricHarmony接受任何草图格式并在16秒内生成。实验展示了最先进的性能:在可比较的身份一致性分数下,形状CLIP分数为0.8615(对比0.8450),总体用户偏好分数为7.81(对比6.06)。我们的方法从根本上将身份-形状冲突重新概念化为扩散模型的条件信号污染,从而在保持识别的同时实现前所未有的创造性控制。

英文摘要

Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognizable distortions. We identify the root cause as \emph{condition signal contamination} -- competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\mathcal{P}^{\mathrm{i}}$ (pure identity), $\mathcal{P}^{\mathrm{s}}$ (pure shape), and $\mathcal{P}^{\mathrm{i+s}}$ (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers $\mathcal{P}^{\mathrm{i+s}}$ toward optimal balance: $\mathcal{E}_{\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\mathcal{E}_{\mathrm{id}}$ employs token-level correspondence matching robust to extreme distortions. Unlike DemoCaricature requiring 70 seconds per-identity fine-tuning or CaricatureBooth constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: 0.8615 shape CLIP score (vs. 0.8450) under comparable identity consistency score, with 7.81 overall user preference score (vs. 6.06). Our method fundamentally reconceptualizes the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.

2606.13971 2026-06-15 cs.CV 新提交

Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation

Prompt2Effect: 通过LoRA生成实现免训练图像到视频模型特化

Xiaomeng Yang, Yanyu Li, Gordon Guocheng Qian, Ivan Skorokhodov, Viacheslav Ivanov, Avalon Vinella, Xuan Zhang, Yanzhi Wang, Sergey Tulyakov, Anil Kag

发表机构 * Northeastern University(东北大学) Snap Inc.(Snap公司)

AI总结 提出Prompt2Effect,一种权重驱动超网络,通过单次前向传播直接合成效果特定的LoRA权重,无需训练,在保持视频质量的同时将计算成本从56 GPU小时降至3.3秒。

详情
AI中文摘要

将图像到视频(I2V)扩散模型个性化以具有特定视觉效果的需求日益增长,用于高端视频生成。当前实践需要为每个效果训练单独的LoRA模块,这带来了大量的数据整理和迭代优化成本,阻碍了交互式控制。我们提出Prompt2Effect,一种权重驱动的超网络,通过单次前向传播直接合成效果特定的LoRA权重,从而分摊每个效果的训练成本。与先前仅从语义回归适配器权重的超网络不同,Prompt2Effect显式地以冻结的基础模型权重为条件,将权重预测建立在每层的结构几何上。此外,我们不是预测原始LoRA矩阵,而是引入一种SVD规范化的参数化方法,解决了分解歧义并稳定了大规模权重合成。这些设计原则共同实现了高维I2V扩散模型的准确且可扩展的LoRA预测。大量实验表明,与传统的LoRA微调相比,Prompt2Effect实现了相当或更优的视频质量和效果对齐,同时将计算成本从56 GPU训练小时降至3.3秒的超网络推理。当用作后续微调的初始化时,我们预测的权重进一步提高了最终性能,并将优化速度提升了约10倍。

英文摘要

Personalizing Image-to-Video (I2V) diffusion models with specific visual effects is increasingly demanded for high-end video generation. Current practice requires training a separate Low-Rank Adaptation (LoRA) module for each effect, incurring substantial data curation and iterative optimization costs that hinder interactive control. We present Prompt2Effect, a weight-driven hypernetwork that amortizes per-effect training by directly synthesizing effect-specific LoRA weights in a single forward pass. Unlike prior hypernetworks that regress adapter weights purely from semantics, Prompt2Effect is explicitly conditioned on the frozen base model weights, grounding weight prediction in the structural geometry of each layer. Furthermore, instead of predicting raw LoRA matrices, we introduce an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together, these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models. Extensive experiments demonstrate that Prompt2Effect achieves on-par or superior video quality and effect alignment compared to conventional LoRA fine-tuning, while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, our predicted weights further improve final performance and accelerate optimization by approximately 10x.
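The factorization ambiguity mentioned above comes from the fact that a LoRA update ΔW = BA is unchanged when B is replaced by BM and A by M⁻¹A for any invertible M, so many different weight pairs describe the same update. The sketch below illustrates this for the rank-1 case only, where the SVD has a closed form. It is not the paper's parameterization: the function names and the sign convention are assumptions made for illustration.

```python
import math


def outer(b, a):
    """Rank-1 update dW = b a^T as a nested list."""
    return [[bi * aj for aj in a] for bi in b]


def canonicalize_rank1(b, a):
    """Map a rank-1 LoRA pair (b, a) to (u, sigma, v) with unit-norm u and v.

    This is the closed-form SVD of b a^T. The remaining sign freedom is
    removed by making the first nonzero entry of u positive (an assumed
    convention, chosen here only to make the output unique).
    """
    nb = math.sqrt(sum(x * x for x in b))
    na = math.sqrt(sum(x * x for x in a))
    if nb == 0.0 or na == 0.0:
        raise ValueError("a zero update has no canonical direction")
    u = [x / nb for x in b]
    v = [x / na for x in a]
    first = next(x for x in u if x != 0.0)
    if first < 0:
        u = [-x for x in u]
        v = [-x for x in v]
    return u, nb * na, v
```

Two pairs that differ only by a rescaling, such as `([3, 4], [1, 2, 2])` and `([-6, -8], [-0.5, -1, -1])`, give the same product and therefore the same canonical triple. A regression target defined on the canonical form is single-valued, which is one plausible reason such a parameterization would stabilize weight prediction.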

2606.14035 2026-06-15 cs.CV 新提交

Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

面向360度室内全景编辑的基于重聚焦交叉注意力的免调优扩散模型

Dinh-Khoi Vo, Nhut-Thanh Le-Hinh, Viet-Tham Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * arXiv

AI总结 提出FocusDiff框架,通过重聚焦交叉注意力实现免调优的精确区域编辑,并扩展到360度室内全景编辑,在局部编辑基准LIMB上优于现有零样本方法。

Comments ICCCI 2026. Project page: https://vdkhoi20.github.io/FocusDiff

详情
AI中文摘要

零样本文本引导扩散显著推进了图像编辑,但其实际可用性仍受三个持续挑战的制约:需要精细提示工程的提示脆弱性、无意影响非目标区域的溢出编辑、以及由于训练数据中有限细粒度监督导致的小或杂乱对象上的失败。我们提出FocusDiff(目标感知重聚焦用于免调优扩散编辑),一个基于重聚焦交叉注意力的免调优框架,用于精确且区域特定的图像操作。给定通过自动分割或手动选择获得的目标区域,FocusDiff对非编辑区域应用选择性模糊,以引导注意力朝向掩码区域,同时准确地将对象的身份、结构和外观传递到编辑输出。集成的上下文保留模块进一步确保背景保真度和全局一致性,使得从简单文本提示在一次传递中实现精确编辑成为可能。我们还将FocusDiff扩展到360度室内全景编辑,并在虚拟现实环境中展示其有效性。在我们包含30个多对象图像和100个标注示例(包括具有挑战性的小对象案例)的局部编辑基准LIMB上的广泛实验表明,FocusDiff在文本-图像对齐和背景保留方面优于现有零样本编辑器,实现了卓越的精度、逼真度和可用性。项目页面:https://vdkhoi20.github.io/FocusDiff

英文摘要

Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data. We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object's identity, structure, and appearance to the edited output. Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments. Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at https://vdkhoi20.github.io/FocusDiff.

2606.14042 2026-06-15 cs.CV 新提交

Rethinking One-Step Image Editing through ChordEdit: Reproduction, Simplification, and New Insights

通过ChordEdit重新思考一步图像编辑:复现、简化与新见解

Minghan Li, Jeremy Moebel, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab(哈佛人工智能与机器人实验室)

AI总结 本文通过复现、消融和简化ChordEdit,揭示其机制:和弦窗口作为时间步偏移,和弦传输执行低频语义编辑,近端对齐补充高频细节,从而将编辑分解为粗低频传输和细高频对齐两个阶段,为自适应编辑提供新路径。

Comments 9 pages

详情
AI中文摘要

一步图像编辑对于使文本引导的编辑快速、实用且易于部署至关重要,但其底层机制尚未完全理解。我们通过复现、消融和简化重新审视ChordEdit。我们的分析表明:a) 和弦窗口δ主要起到从t到t-δ的有效时间步偏移作用;b) 和弦传输作用于高噪声图像,主要执行低频语义编辑;c) 近端对齐作用于低噪声图像,并通过添加高频目标细节进行补充。基于此观点,ChordEdit自然地将编辑分解为粗低频传输阶段和细高频对齐阶段。这些发现为基于提示的动态时间步选择以实现自适应图像编辑提供了路径。所有代码和结果见:https://github.com/Harvard-AI-and-Robotics-Lab/ChordEdit-Reproduction

英文摘要

One-step image editing is important for making text-guided editing fast, practical, and easy to deploy, but its underlying mechanism is still not fully understood. We revisit ChordEdit through reproduction, ablation, and simplification. Our analysis shows that a) the chord window $\delta$ largely acts as an effective timestep shift from $t$ to $t - \delta$; b) chord transport acts on high-noise images and mainly performs low-frequency semantic editing; and c) proximal alignment acts on low-noise images and complements it by adding high-frequency target details. In this view, ChordEdit naturally decomposes editing into a coarse low-frequency transport stage and a fine high-frequency alignment stage. These findings suggest a path toward prompt-conditioned dynamic timestep selection for adaptive image editing. All code and results can be found at https://github.com/Harvard-AI-and-Robotics-Lab/ChordEdit-Reproduction.

2606.14125 2026-06-15 cs.CV cs.AI 新提交

Conditioning Matters: Stabilizing Inversion and Attention in Diffusion Image Editing

条件至关重要:稳定扩散图像编辑中的反演与注意力

Zheyuan Zhan, Hongchen Li, Can Wang, Yinfei Ma, Mingzhen Huang, Ruoshi Bai, Jiawei Chen, Siwei Lyu, Defang Chen

发表机构 * State Key Laboratory of Blockchain and Data Security, Zhejiang University(浙江大学区块链与数据安全全国重点实验室) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security(杭州高新技术产业开发区(滨江)区块链与数据安全研究院) College of Computer Science, Zhejiang University(浙江大学计算机科学与技术学院) University at Buffalo, State University of New York(纽约州立大学布法罗分校)

AI总结 本文提出SimEdit框架,通过优化文本条件精度和令牌级跨分支注意力控制,提升扩散模型反演稳定性和编辑保真度,在PIE-Bench上显著优于先前方法。

Comments Accepted to ECML PKDD 2026 Research Track

详情
AI中文摘要

基于反演的图像编辑提供了灵活且无需训练的控制,但仍面临反演精度以及编辑保真度与背景保留之间的权衡问题。尽管最近的方法改进了反演公式或注意力交互,但文本条件在塑造扩散动态和编辑行为中的作用仍未得到充分探索。我们从经验和理论上证明,文本条件的精度通过调节扩散速度场的几何形状来影响反演稳定性,同时也会影响编辑过程中跨分支注意力的一致性。这些效应直接影响背景保留和语义保真度。基于这一分析,我们提出了SimEdit,一个条件感知框架,包含两个互补组件:(a) 条件细化,构建具有改进语义精度和结构对齐的条件信号,以促进稳定反演和一致的注意力操作;(b) 令牌级跨分支注意力控制,将编辑相关和结构保留组件分离,并在注意力操作期间对其进行非对称调节。在PIE-Bench上的大量实验表明,SimEdit在反演重建质量和编辑性能上均持续优于先前的注意力操作方法。我们的代码可在以下网址获取:https://github.com/zju-pi/SimEdit

英文摘要

Inversion-based image editing offers flexible and training-free control but still struggles with inversion accuracy and the trade-off between editing fidelity and background preservation. While recent methods improve inversion formulations or attention interactions, the role of textual conditioning in shaping diffusion dynamics and editing behavior remains underexplored. We show both empirically and theoretically that the precision of textual conditioning influences inversion stability by modulating the geometry of the diffusion velocity field, while also affecting the consistency of cross-branch attention during editing. These effects directly impact background preservation and semantic fidelity. Building on this analysis, we propose SimEdit, a conditioning-aware framework with two complementary components: (a) conditioning refinement, which constructs conditioning signals with improved semantic precision and structural alignment to facilitate stable inversion and consistent attention manipulation, and (b) token-wise cross-branch attention control, which separates edit-relevant and structure-preserving components and modulates them asymmetrically during attention manipulation. Extensive experiments on PIE-Bench demonstrate that SimEdit consistently improves both inversion reconstruction quality and editing performance over previous attention-manipulation approaches. Our code is available at https://github.com/zju-pi/SimEdit.

2606.14162 2026-06-15 cs.CV 新提交

VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling

VideoWeave: 通过联合几何-视频建模解锁视频生成中的几何一致性

Xunzhi Xiang, Zixuan Duan, Yabo Chen, Zhengxuan Wei, Guiyu Zhang, Zixiao Gu, Zhe Gao, Haibin Huang, Chi Zhang, Qi Fan, Xuelong Li

发表机构 * Nanjing University(南京大学) Institute of Artificial Intelligence, China Telecom (TeleAI)(中国电信人工智能研究院(TeleAI)) Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出VideoWeave,一种在潜在空间后训练框架,利用隐式几何模型特征约束生成分布,缓解几何重建误差,提升视频几何一致性。

详情
AI中文摘要

大规模视频扩散模型通常无法随时间保持3D结构,导致几何漂移和视角变化下的不合理运动。现有方法通常通过使用显式几何重建(如深度图、点云或重建的3D结构)来定义条件、监督或奖励信号,从而强制几何一致性,这使得生成器对上游几何管道的误差敏感。我们提出VideoWeave,一种潜在空间后训练框架,利用隐式几何模型特征约束生成分布,提供更灵活、非刚性的引导形式,减轻几何模型重建误差的影响。具体来说,VideoWeave将这些特征适配为几何潜在变量,并在共享去噪空间中与视频潜在变量联合建模,使得几何在训练过程中塑造生成分布。为支持这一过程,我们构建了GeoVid-80K,一个包含8万视频的配对外观和几何表示数据集。在文本到视频和图像到视频生成上的实验表明,VideoWeave在保持强视觉质量的同时改善了几何连贯性。VideoWeave项目页面:https://videoweave.github.io/

英文摘要

Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. Existing methods usually enforce geometric consistency by using explicit geometry reconstructions, such as depth maps, point clouds, or reconstructed 3D structures, to define conditions, supervision, or reward signals, making the generator sensitive to errors from upstream geometry pipelines. We propose VideoWeave, a latent-space post-training framework that uses implicit geometry-model features to constrain the generative distribution, providing a more flexible and non-rigid form of guidance that mitigates the impact of reconstruction errors from geometry models. Specifically, VideoWeave adapts these features into geometry latents and jointly models them with video latents in a shared denoising space, allowing geometry to shape the generative distribution during training. To support this process, we build GeoVid-80K, an 80K-video dataset with paired appearance and geometry representations. Experiments on text-to-video and image-to-video generation show that VideoWeave improves geometric coherence while preserving strong visual quality. VideoWeave project page at https://videoweave.github.io/

2606.14297 2026-06-15 cs.CV cs.AI 新提交

Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

Pix2Pix-Hybrid: 结构引导的多通道条件与弱属性监督的朝觐人群图像条件合成

Amirah F. Alshammari, Bander A. Alzahrani, Nahed A. Alowidi

发表机构 * King Abdulaziz University(阿卜杜勒阿齐兹国王大学) Jouf University(焦夫大学)

AI总结 提出Pix2Pix-Hybrid条件GAN,通过多通道结构线索和上下文属性条件合成朝觐人群图像,用于数据增强,在减少人工标注的同时提升合成质量,并验证了合成数据对人群计数模型的改进效果。

详情
AI中文摘要

开发准确的朝觐场景人群计数模型仍然具有挑战性,因为领域特定的标注图像稀缺,且大型集会期间的数据收集引发隐私问题。为解决这些限制,本文提出Pix2Pix-Hybrid (P2P-H),一种用于结构引导的朝觐人群图像合成和数据增强的混合条件GAN。P2P-H基于Pix2Pix,采用U-Net生成器,以八个输入通道为条件,这些通道联合编码结构线索(边缘和灰度)和上下文属性(人群密度和一天中的时间)。为了捕捉密集场景中的详细纹理,该框架集成了两个在不同分辨率下运行的多尺度PatchGAN判别器。训练过程结合了对抗、感知和特征匹配目标,并采用自适应数据增强和稳定化策略。该模型在从60个公开视频源收集的993个真实朝觐帧上训练,条件属性自动推导以减少人工标注工作量。利用该框架,我们构建了CrowdH,一个包含10,000张高分辨率朝觐人群图像的合成数据集。实验结果表明,与Pix2Pix和StyleGAN2-ADA基线相比,P2P-H提高了结构保持的条件合成质量,并显示出对其他人群数据集的良好迁移性。为了评估下游实用性,我们进一步构建了CrowdH-Mix-469,一个包含384张真实朝觐图像和85张精选合成图像的标注混合真实-合成数据集,并在仅真实和真实加合成训练下评估了五个计数模型。精选的合成数据在所有五个模型上均降低了MAE,其中CSRNet的提升最为显著。

英文摘要

Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images, and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.

2606.14317 2026-06-15 cs.CV 新提交

CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation

CausalMotion: 基于关键帧与轨迹引导的结构化物理推理实现免训练视频生成

Sihan Zhuang, Xinyuan Chen, Tianfan Xue, Yaohui Wang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) ShanghaiTech University(上海科技大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出CausalMotion框架,利用视觉语言模型将文本提示分解为因果一致的关键帧和物体运动轨迹,作为软约束引导预训练视频扩散模型,无需额外训练即可提升物理合理性和时间连贯性。

Comments Project Page: https://zhuangsh0713.github.io/CausalMotion/

详情
AI中文摘要

近年来基于扩散的视频生成在视觉质量和短期时间连贯性上取得了显著进展。然而,现有方法仍难以生成具有物理一致性和因果合理动态的视频,尤其是在涉及长程交互的场景中。这一局限性源于视频扩散模型主要隐式学习物理一致性,而视觉语言模型可以直接建模物理规律。基于这一思想,本文提出\textbf{CausalMotion},一种免训练框架,通过结构化中间表示将显式物理推理注入视频生成。我们的关键思想是将推理与生成解耦,利用视觉语言模型将文本提示分解为一系列因果一致的关键帧和以物体为中心的运动轨迹。然后将这些表示对齐并整合为软约束,在推理过程中引导预训练的视频扩散模型。这种设计无需额外训练或监督即可显式建模物体动态和因果转换。大量实验表明,我们的方法在动态密集场景中持续提升物理合理性和时间连贯性,同时保持高感知视频质量。

英文摘要

Recent advances in diffusion-based video generation have significantly improved visual quality and short-term temporal coherence. However, existing methods still struggle to produce videos with physically consistent and causally plausible dynamics, especially in scenarios involving long-horizon interactions. This limitation arises from the fact that video diffusion models primarily learn physical consistency implicitly, while vision-language models can directly model physical laws. Based on this idea, in this work, we propose \textbf{CausalMotion}, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. Our key idea is to decouple reasoning from generation by leveraging a vision-language model to decompose a text prompt into a sequence of causally consistent keyframes and object-centric motion trajectories. These representations are then aligned and integrated as soft constraints to guide a pretrained video diffusion model during inference. This design enables explicit modeling of object dynamics and causal transitions without requiring additional training or supervision. Extensive experiments show that our method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios, while maintaining high perceptual video quality.

2606.14667 2026-06-15 cs.CV 新提交

Memento: Reconstruct to Remember for Consistent Long Video Generation

Memento: 通过重建来记忆以实现一致的长视频生成

Xuan Wei, Longbin Ji, Guan Wang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Qingqi Hong

发表机构 * Xiamen University(厦门大学) ERNIE Team, Baidu Inc.(百度公司ERNIE团队)

AI总结 提出Memento框架,通过主体重建引导和双查询记忆机制,解决长视频生成中主体一致性丢失问题,实现跨镜头连贯生成。

Comments Project page: https://ernie-research.github.io/Memento/

详情
AI中文摘要

长视频生成需要重复出现的主题在各种镜头、视角、运动和场景转换中保持一致。现有的时间分解方法通过逐镜头生成视频来提高可扩展性。然而,它们主要专注于优化合理的下一镜头延续,而没有验证历史记忆是否保留了身份关键的主体证据。因此,随着生成的进行,重复出现的主题可能会被稀释、覆盖或遗忘。在本文中,我们提出了Memento,一个主体重建引导的框架,将主体保留视为一个显式的身份基础问题,其前提是:一个忠实保存主体的记忆库应该能够仅从记忆中重建该主体。具体来说,Memento联合训练自回归的下一镜头生成和基于记忆的主体重建,利用历史记忆和全局故事描述恢复目标外观。为了从短程线索中分离出长程主体证据,Memento引入了一种双查询记忆机制,其中一个查询检索与身份相关的记忆,另一个选择短上下文关键帧以实现连贯的延续。此外,一个主体感知的电影数据管道通过一致、无代词的主体描述提供精确的重建监督。实验表明,Memento在长期主体一致性、跨镜头连贯性和视觉质量方面达到了最先进的性能。

英文摘要

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

2606.14700 2026-06-15 cs.CV 新提交

RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

RepFusion:利用多模态先验在表示空间中进行去噪

Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra, Saining Xie

发表机构 * Meta AI New York University(纽约大学)

AI总结 提出RepFusion方法,利用多模态大语言模型作为噪声表示编码器,为扩散变换器提供条件信号,在相似推理预算下优于新初始化去噪器基线。

Comments Project Page: https://xichenpan.com/repfusion

详情
AI中文摘要

大型语言模型(LLMs)广泛用于文本到图像(T2I)系统,但它们通常仅限于文本编码,而去噪由新训练的生成骨干网络处理。表示自编码器(RAEs)的出现将生成目标转向语义结构化的视觉表示,创建了一个与预训练LLM先验更兼容的潜在空间。受多模态LLM(MLLMs)的启发,其中MLP投影器足以将干净的视觉表示与预训练LLM对齐,我们将MLLM本身重新用作噪声表示编码器,将此机制从干净输入扩展到噪声输入。我们提出了RepFusion,它使用所得的MLLM输出作为扩散变换器的条件信号。在相似推理预算下的受控比较中,RepFusion优于将可比容量分配给新初始化去噪器的基线。这些结果表明,MLLMs为去噪视觉表示提供了强大的先验,并且通过以演化的噪声表示为条件,测试时的计算可以有效地用于现代T2I系统中重复的MLLM条件化。

英文摘要

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.

2606.14049 2026-06-15 cs.SD cs.CV 交叉投稿

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

FoleyGenEx: 统一视频到音频生成,具备多模态控制、时间对齐与语义精度

Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng, Chen Zhang, Yong Qin

发表机构 * Academy for Advanced Interdisciplinary Studies, Nankai University(南开大学前沿交叉学科研究院) Kling Team, Kuaishou Technology(快手科技Kling团队)

AI总结 提出FoleyGenEx统一框架,通过条件注入、多模态动态掩码和副词数据增强,实现视频到音频生成中多模态控制、帧级时间对齐与细粒度语义的同步合成。

Comments Accepted by INTERSPEECH 2026

详情
Journal ref
INTERSPEECH 2026
AI中文摘要

我们提出FoleyGenEx,一个统一的视频到音频(VTA)框架,集成了多模态控制、帧级时间对齐和细粒度语义,能够为多种任务生成同步且多功能的音频合成。现有的VTA方法要么具有多模态控制但时间对齐较弱,要么对齐能力强但缺乏参考音频条件和语义精度。FoleyGenEx通过三项核心创新填补了这一空白:用于音频控制VTA和Foley扩展的条件注入机制、保持训练同步的多模态动态掩码策略,以及利用信号处理和大语言模型增强文本监督的副词数据增强算法,提供细微语义。在AudioCaps、VGGSound和Greatest Hits上的实验表明,与现有方法相比,它具有竞争力的可控VTA性能。演示样本:https://foleygenex.github.io/FoleyGenEx

英文摘要

We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.

2512.04981 2026-06-15 cs.CV cs.LG 版本更新

Aligned but Stereotypical? How System Prompts Shape Demographic Bias in LLM-Based Text-to-Image Models

对齐但刻板?系统提示如何塑造基于LLM的文本到图像模型中的人口统计偏见

NaHyeon Park, Na Min An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim

发表机构 * KAIST(韩国科学技术院) HKUST (GZ)(香港科技大学(广州))

AI总结 研究LLM增强的文本到图像系统在提示扩展中引入隐性人口统计偏见的问题,提出无训练的去偏框架FairPro,通过自适应生成公平性指令减少人口统计差异。

Comments Project page: https://fairpro-t2i.github.io

详情
AI中文摘要

文本到图像(T2I)系统越来越依赖基于大语言模型(LLM)的文本条件来解释和扩展用户提示。虽然这提高了提示理解和文本-图像对齐,但我们发现,即使未指定人口统计属性,它也可能引入隐性的人口统计假设。为了系统地研究这种行为在不同提示模糊性和复杂性水平下的表现,我们构建了一个涵盖多种提示设置的综合基准。对八个最新T2I模型的评估表明,基于LLM的系统始终比非LLM基线表现出更强的人口统计偏差。我们进一步分析了系统提示,这是基于LLM的T2I系统特有的组件,用于指导提示解释和扩展。我们的分析表明,这些指令强烈影响文本嵌入,进而导致有偏的图像生成。受这些发现启发,我们提出了FairPro,一个无训练的去偏框架,它在保持用户意图的同时自适应地生成公平性感知指令。实验表明,FairPro在保持提示忠实度的同时显著减少了人口统计差异。

英文摘要

Text-to-image (T2I) systems increasingly rely on Large Language Model (LLM)-based text conditioning to interpret and expand user prompts. While this improves prompt understanding and text-image alignment, we find that it can also introduce implicit demographic assumptions, even when demographic attributes are unspecified. To systematically investigate this behavior across varying levels of prompt ambiguity and complexity, we construct a comprehensive benchmark covering diverse prompt settings. Evaluations on eight recent T2I models show that LLM-based systems consistently exhibit stronger demographic skew than non-LLM-based baselines. We further analyze system prompts, a component unique to LLM-based T2I systems that guides prompt interpretation and expansion. Our analyses show that these instructions strongly influence text embeddings, which subsequently leads to biased image generations. Motivated by these findings, we propose FairPro, a training-free debiasing framework that adaptively generates fairness-aware instructions while preserving user intent. Experiments demonstrate that FairPro substantially reduces demographic disparities while maintaining prompt fidelity.

2601.19115 2026-06-15 cs.CV 版本更新

FBSDiff++: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation

FBSDiff++: 改进的扩散特征频带替换用于高效且高度可控的文本驱动图像到图像翻译

Xiang Gao, Yunpeng Jia

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出FBSDiff++框架,通过动态频带替换扩散特征,实现无需训练的文本驱动图像到图像翻译,支持外观、布局和轮廓引导,并大幅提升推理速度(8.9倍),支持任意分辨率输入和局部编辑。

详情
AI中文摘要

随着大规模文本到图像(T2I)扩散模型在开放域图像创建方面取得显著进展,越来越多的关注集中在将其自然扩展到文本驱动图像到图像(I2I)翻译领域,其中源图像除了文本提示提供的文本引导外,还作为生成图像的视觉引导。我们提出了FBSDiff,一种新颖的框架,从全新的频域角度将现成的T2I扩散模型适配到I2I范式。通过扩散特征的动态频带替换,FBSDiff以即插即用的方式(无需模型训练、微调或在线优化)实现了多样且高度可控的文本驱动I2I,通过分别替换潜在扩散特征的低频带、中频带和高频带,实现外观引导、布局引导和轮廓引导的I2I翻译。此外,FBSDiff通过简单调整替换频带的带宽,灵活地实现了对I2I相关强度的连续控制。为了进一步提升图像翻译的效率、灵活性和功能性,我们提出了FBSDiff++,它在FBSDiff的基础上主要在三个方面进行了改进:(1)通过改进的模型架构大幅加速推理速度(推理速度提升8.9倍);(2)改进频带替换模块,允许输入任意分辨率和宽高比的源图像;(3)扩展模型功能,仅通过对核心方法进行细微调整即可实现局部图像操作和特定风格内容创建。大量的定性和定量实验验证了FBSDiff++在I2I翻译的视觉质量、效率、多样性和可控性方面相对于相关先进方法的优越性。

英文摘要

With large-scale text-to-image (T2I) diffusion models achieving significant advancements in open-domain image creation, increasing attention has been focused on their natural extension to text-driven image-to-image (I2I) translation, where a source image acts as visual guidance for the generated image in addition to the textual guidance provided by the text prompt. We propose FBSDiff, a novel framework that adapts an off-the-shelf T2I diffusion model to the I2I paradigm from a fresh frequency-domain perspective. Through dynamic frequency band substitution of diffusion features, FBSDiff realizes versatile and highly controllable text-driven I2I in a plug-and-play manner (without the need for model training, fine-tuning, or online optimization), allowing appearance-guided, layout-guided, and contour-guided I2I translation by progressively substituting the low-frequency, mid-frequency, and high-frequency bands of latent diffusion features, respectively. In addition, FBSDiff flexibly enables continuous control over I2I correlation intensity simply by tuning the bandwidth of the substituted frequency band. To further promote image translation efficiency, flexibility, and functionality, we propose FBSDiff++, which improves upon FBSDiff mainly in three aspects: (1) accelerating inference by a large margin (8.9$\times$ speedup) with a refined model architecture; (2) improving the Frequency Band Substitution module to accept source images of arbitrary resolution and aspect ratio; (3) extending model functionality to enable localized image manipulation and style-specific content creation with only subtle adjustments to the core method. Extensive qualitative and quantitative experiments verify the superiority of FBSDiff++ in the visual quality, efficiency, versatility, and controllability of I2I translation compared to related advanced approaches.
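The core operation described above, replacing one frequency band of a feature with the same band taken from a reference, can be illustrated on a 1D signal. The sketch below assumes an orthonormal DCT as the frequency transform; the abstract does not name the transform, and the real method operates on latent diffusion features rather than a toy list. The function names and the half-open band convention `[lo, hi)` are hypothetical.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of dct above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def substitute_band(target, reference, lo, hi):
    """Replace DCT coefficients lo..hi-1 of target with those of reference."""
    ct, cr = dct(target), dct(reference)
    ct[lo:hi] = cr[lo:hi]
    return idct(ct)
```

Substituting only coefficient 0 transfers the reference's mean level while keeping the target's variation, which is a 1D analogue of low-band (appearance-style) guidance. Widening `hi - lo` moves the output closer to the reference, mirroring the bandwidth control the abstract describes.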

2602.01801 2026-06-15 cs.CV cs.AI 版本更新

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

快速自回归视频扩散与世界模型:基于时间缓存压缩与稀疏注意力

Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari

发表机构 * Hebrew University of Jerusalem(耶路撒冷希伯来大学) Google Research(谷歌研究)

AI总结 提出FAST-AR框架,通过TempCache压缩KV缓存、AnnCA加速交叉注意力、AnnSA稀疏化自注意力,实现自回归视频扩散模型5-10倍加速,同时保持视觉质量并稳定GPU内存使用。

Comments Accepted to ICML 2026. Project Page: https://dvirsamuel.github.io/fast-auto-regressive-video/

详情
AI中文摘要

自回归视频扩散模型支持流式生成,为长序列合成、视频世界模型和交互式神经游戏引擎打开了大门。然而,其核心注意力层在推理时成为主要瓶颈:随着生成过程推进,KV缓存增长,导致延迟增加和GPU内存飙升,进而限制可用的时间上下文并损害长程一致性。在本工作中,我们研究了自回归视频扩散中的冗余性,并识别出三个持续存在的来源:跨帧的近似重复缓存键、缓慢演化的(主要是语义的)查询/键使得许多注意力计算冗余,以及长提示上的交叉注意力中每帧只有少量标记相关。基于这些观察,我们提出了一个统一的、无需训练的注意力框架(FAST-AR),用于快速自回归扩散,包含三个组件:TempCache通过时间对应压缩KV缓存以限制缓存增长;AnnCA通过使用快速近似最近邻(ANN)匹配选择帧相关的提示标记来加速交叉注意力;AnnSA通过将每个查询限制为语义匹配的键(也使用轻量级ANN)来稀疏化自注意力。这些模块共同减少了注意力、计算和内存,并且与现有的自回归扩散骨干网络和世界模型兼容。实验表明,在保持几乎相同的视觉质量的同时,实现了高达5-10倍的端到端加速,并且关键的是,在长序列生成中维持稳定的吞吐量和几乎恒定的峰值GPU内存使用,而先前的方法会逐渐变慢并遭受内存使用增加的问题。

英文摘要

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework (FAST-AR) for FAST-AutoRegressive diffusion, consisting of three components: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5x-10x end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
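
The idea of restricting each query to a small set of matched keys, as described for AnnSA, can be sketched as follows. This is a rough illustration, not the authors' implementation: an exact top-k selection stands in for the approximate nearest neighbor search, and the function name `topk_sparse_attention` and its parameters are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score.

    Assumption: exact top-k replaces the ANN lookup; a real system would
    avoid scoring every key in the first place.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - top) for i in keep]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) / total
            for d in range(dim)]
```

With `k` equal to the number of keys this reduces to ordinary softmax attention, and with `k = 1` it returns the value of the best-matching key, so `k` controls the trade-off between cost and fidelity in this toy version.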

2606.13679 2026-06-15 cs.CV 版本更新

InterleaveThinker: Reinforcing Agentic Interleaved Generation

InterleaveThinker: 强化智能体交错生成

Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

发表机构 * CUHK MMLab(香港中文大学多媒体实验室) Meituan(美团) CUHK IMIXR(香港中文大学IMIXR实验室)

AI总结 提出首个多智能体管线InterleaveThinker,通过规划器和评论家智能体使现有图像生成器具备交错生成能力,并利用GRPO强化单步指令修正,显著提升生成性能。

Comments Project Page: https://zhengdian1.github.io/InterleaveThinker-proj/ Code: https://github.com/zhengdian1/InterleaveThinker

详情
AI中文摘要

最近的图像生成器在单图像生成和编辑中展示了令人印象深刻的逼真度和指令遵循能力。然而,受限于其架构,它们无法实现交错生成(文本-图像序列),这在视觉叙事、指导和具身操作中具有关键应用。即使是最近的开源统一多模态模型(UMMs)在这方面也表现出有限的性能。在本文中,我们介绍了InterleaveThinker,这是第一个旨在赋予任何现有图像生成器交错生成能力的多智能体管线。具体来说,我们使用规划器智能体来组织图像-文本输入序列,指示图像生成器在每个步骤所需的执行。随后,我们引入评论家智能体来评估生成器的输出,识别偏离计划指令的样本,并优化指令以进行重新生成。为了实现这一管线,我们构建了Interleave-Planner-SFT-80k和Interleave-Critic-SFT-112k以进行格式冷启动。然后,我们开发了Interleave-Critic-RL-13k,使用GRPO在生成轨迹内强化逐步指令修正能力。由于单个交错生成轨迹可能涉及超过25次生成器调用,优化整个轨迹在计算上不可行。因此,我们提出了准确率奖励和逐步奖励,使得单步强化学习能够有效引导整个生成轨迹。结果表明,InterleaveThinker在各种图像生成器上提升了性能。在交错生成基准上,它实现了与Nano Banana和GPT-5相当的性能。令人惊讶的是,它还在基于推理的基准上显著增强了基础模型;例如,在4步FLUX.2-klein上,我们在WISE和RISE上观察到了显著的增益。

英文摘要

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

7. 3D视觉、点云与空间智能 9 篇

2606.14168 2026-06-15 cs.CV 新提交

MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

MUSE: 基于记忆的增量需求满足的智能体3D场景创作

Ruijie Xu, Xinnan Zhu, Jiayu Ying, Daoguo Dong, Yuzhou Ji, Xin Tan

发表机构 * East China Normal University(华东师范大学) Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出MUSE多智能体框架,通过增量需求满足实现可控3D场景构建与编辑,在AuthorBench上显著提升目标达成率和场景保持率。

详情
AI中文摘要

文本驱动的3D场景生成是数字内容创作、具身AI仿真和交互设计中的一项有前景的技术,然而实际工作流程通常需要在保留非目标内容的同时对现有场景进行细化、扩展或修正。现有方法可以生成逼真且结构合理的场景,但它们通常缺乏具有需求级状态跟踪的可编辑性,因此部分级故障常常导致全场景重新生成或人工干预。为应对这一挑战,我们将可控3D场景创作形式化为增量需求满足,统一了构建和编辑。在本文中,我们提出了MUSE,一个基于记忆的多智能体框架,其中架构师将指令编译为结构化需求,雕刻师执行局部场景操作,检查员验证每一步并更新工作记忆、场景记忆和技能记忆。为了评估需求级可控性和保留感知编辑,我们引入了AuthorBench,提供145个受限构建案例和1584个保留感知编辑池,并配有外部结构化检查。在全构建案例上,MUSE将全目标成功率从37.9提升至80.7,表面约束满足率从35.0提升至92.6,优于最强基线。在分层240案例编辑测试集上,MUSE实现了49.6的全目标成功率、99.9的保留率和仅0.6的非预期更改率。除了自动指标外,对比较局部编辑基线的人工评估支持与用户意图更强的对齐,下游导航代理测试表明空间稳定性更强。结合验证我们记忆设计的消融实验,这些结果确立了MUSE作为可控3D场景创作的有效框架。

英文摘要

Text-driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non-target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editability with requirement-level state tracking, so part-level failures often lead to full-scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. In this paper, we present MUSE, a memory-grounded multi-agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement-level controllability and preservation-aware editing, we introduce AuthorBench, offering 145 constrained construction cases and a 1,584-case preservation-aware editing pool paired with external structured checks. On full construction cases, MUSE improves All-Goal success from 37.9 to 80.7 and surface-constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240-case editing test split, MUSE achieves 49.6 All-Goal success, 99.9 preservation rate, and only 0.6 unintended change rate. Beyond automated metrics, human evaluations on compared local-editing baselines support stronger alignment with user intent, and downstream navigation-proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.

2606.14292 2026-06-15 cs.CV 新提交

A Robust Point Cloud Analysis Framework Inspired By Primary Visual Cortex

一种受初级视觉皮层启发的鲁棒点云分析框架

Jisheng Dang, Dengyue Pan, Delin Deng, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

AI总结 受初级视觉皮层启发,提出DC-CCNN++框架,通过仿生神经网络替代MLP,结合NRMR模块和CPVT策略,在点云分类和分割中提升鲁棒性,性能媲美SOTA。

Comments 12 pages, 2 figures, 7 tables

详情
AI中文摘要

尽管点云分析取得了显著进展,但降低能耗和提高鲁棒性仍然研究不足,这主要归因于卷积神经网络(CNN)的固有局限性。为解决这一问题,我们从初级视觉皮层中汲取灵感,提出了一种树突连接连续耦合神经网络(DC-CCNN),这是一种用于点云分析的新型类脑神经网络(BINN)架构。通过结合离散和连续编码,我们的设计用更高效、更鲁棒的BINN取代了传统的多层感知机(MLP)。在此框架基础上,我们进一步提出了扩展模型DC-CCNN++,以提高在复杂损坏条件下的鲁棒性。具体来说,我们引入了一个神经启发的鲁棒调制与读出模块(NRMR),通过全局上下文增益调制和双码证据整合来增强特征稳定性和决策鲁棒性。我们还设计了一种皮层启发的渐进变异性训练(CPVT)策略,该策略在训练过程中逐步将模型暴露于结构化的环境变异性,同时保持稳定的干净样本锚点。实验结果表明,DC-CCNN++在点云分析上提升了类脑网络的性能,同时保持了与最先进方法相当的性能。与原始DC-CCNN相比,它在分类和部分分割上均取得了更强的结果,并且对稀疏性、遮挡、高斯噪声、椒盐噪声和空间变换表现出增强的鲁棒性。凭借其高效性、鲁棒性和生物学基础的设计,DC-CCNN++为点云分析提供了一种有前景的传统深度学习方法替代方案。代码可在 https://anonymous.4open.science/r/DC-CCNNpp-44E3 获取。

英文摘要

Despite significant advancements in point cloud analysis, reducing energy consumption and improving robustness remain understudied, largely due to the inherent limitations of Convolutional Neural Networks (CNNs). To address this issue, we draw inspiration from the primary visual cortex and propose a Dendritic-Connected Continuous-Coupled Neural Network (DC-CCNN), a novel Brain-Inspired Neural Network (BINN) architecture for point cloud analysis. By combining discrete and continuous encoding, our design replaces traditional Multilayer Perceptrons (MLPs) with more efficient and robust BINNs. Building upon this framework, we further propose an extended model, DC-CCNN++, to improve robustness under complex corruption conditions. Specifically, we introduce a Neuro-Inspired Robust Modulation-and-Readout Module (NRMR) to enhance feature stability and decision robustness through global-context gain modulation and dual-code evidence integration. We also design a Cortically Inspired Progressive Variability Training (CPVT) strategy, which progressively exposes the model to structured environmental variability while preserving stable clean-sample anchors during training. Experimental results show that DC-CCNN++ improves the performance of brain-inspired networks on point cloud analysis while maintaining performance comparable to state-of-the-art methods. Compared with the original DC-CCNN, it achieves stronger results on both classification and part segmentation, and exhibits enhanced robustness against sparsity, occlusion, Gaussian noise, salt-and-pepper noise, and spatial transformations. With its efficiency, robustness, and biologically grounded design, DC-CCNN++ provides a promising alternative to traditional deep learning methods for point cloud analysis. Code is available at https://anonymous.4open.science/r/DC-CCNNpp-44E3.

2606.14355 2026-06-15 cs.CV eess.SP 新提交

Point Cloud Upsampling through Patch-based Frequency Superposition

基于补丁频率叠加的点云上采样

Marina Ritthaler, Azhar Hussian, Vasileios Belagiannis, André Kaup

发表机构 * Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡大学)

AI总结 提出一种基于补丁频率叠加的优化方法PUtPFS,通过选择点子集并叠加空间频率估计表面,在稀疏区域放置新点实现均匀上采样,无需训练数据,在点对面距离上超越现有方法。

详情
Journal ref
European Conference on Signal Processing (EUSIPCO) 2026
AI中文摘要

近年来,神经网络已成为大多数点云上采样方法中的主导模型。尽管这些方法取得了良好的效果,但它们存在一些缺点,例如缺乏可解释性和数据依赖性。此外,它们必须在与测试数据相似的数据集上进行训练才能表现良好。为了避免这些缺点,我们提出了基于补丁频率叠加的点云上采样(PUtPFS),这是一种基于优化的方法,通过选择点子集并通过叠加空间频率来估计该子集的表面。然后,在该表面上放置新点。通过连续选择点云中最稀疏区域中的点,可以实现均匀上采样。使用这种方法,我们在通常考虑的点对面距离上超越了当前最佳的上采样结果。此外,我们在基于优化的方法中实现了最佳的Chamfer距离和Hausdorff距离。作为额外优势,我们的方法不需要任何训练数据,并且具有数学可解释性。

英文摘要

In recent years, neural networks have become the dominant models in most point cloud upsampling methods. Although these approaches are achieving good results, they do have drawbacks, such as a lack of interpretability and data dependency. Moreover, they have to be trained on a dataset that is similar to the test data in order to perform well. To avoid these disadvantages, we propose Point Cloud Upsampling through Patch-based Frequency Superposition (PUtPFS), an optimization-based approach that selects subsets of points and estimates the surface of this set through superpositioning spatial frequencies. Then, new points are placed on this surface. By successively selecting points in the least dense regions of the point cloud, a uniform upsampling can be reached. With this method, we surpass the current best upsampling results in the commonly considered point-to-surface distance. Furthermore, we achieve the best Chamfer and Hausdorff distance among the optimization-based approaches. As an additional advantage, our method does not need any training data and is mathematically interpretable.
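
The step of repeatedly picking points in the least dense region can be sketched with a simple density proxy. The abstract does not state how PUtPFS measures density, so the k-nearest-neighbour distance used here, and the name `sparsest_point`, are assumptions for illustration only.

```python
import math


def sparsest_point(points, k=2):
    """Return the index of the point whose k-th nearest neighbour is farthest away.

    Assumption: a large k-NN distance is used as a stand-in for low local density.
    """
    def knn_distance(i):
        dists = sorted(math.dist(points[i], q)
                       for j, q in enumerate(points) if j != i)
        return dists[k - 1]

    return max(range(len(points)), key=knn_distance)
```

In an upsampling loop one would select this point, fit a local surface to its neighbourhood, add a new point there, and repeat, so that sparse areas are filled first and the result tends toward uniform coverage.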

2606.14389 2026-06-15 cs.CV 新提交

MooMIns -- Monocular 3D Reconstruction and Object Pose Estimation from Multiple Instances

MooMIns -- 多实例单目3D重建与物体姿态估计

Robert Langendörfer, Markus Hillemann, Markus Ulrich

发表机构 * Institute of Photogrammetry and Remote Sensing, Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院摄影测量与遥感研究所)

AI总结 提出MooMIns方法,利用单张图像中同一物体的多个实例提供隐式多视图几何,通过反向高斯泼溅实现3D重建和6D姿态估计,避免深度先验导致的幻觉。

详情
AI中文摘要

从单张单目图像同时进行3D重建和6D物体姿态估计是一个固有的病态问题。然而,在工业场景中,物体的多个实例通常随机堆叠在料箱中,隐式地在单张图像内提供了同一物体的多个视角。我们证明,这种隐式多视图几何可以被利用来同时重建物体的3D形状并估计每个可见物体实例的6D姿态。我们提出了MooMIns,一种基于高斯泼溅的新方法,它反转了原始高斯泼溅公式:不是从多个相机渲染单个场景,而是从单个相机渲染多个物体实例。我们的方法使用SAM3实例分割掩码和改进的运动恢复结构(SfM)流水线进行初始化。与学习的单目深度估计不同,我们基于图像证据进行真正的几何重建,避免了训练数据先验导致的幻觉。我们在合成和真实料箱抓取场景中评估MooMIns,并展示了对于未见过的物体的准确重建以及单个实例的可靠姿态估计。

英文摘要

Simultaneous 3D reconstruction and 6D object pose estimation from a single monocular image is an inherently ill-posed problem. In industrial settings, however, multiple instances of an object are often randomly arranged in bins, implicitly providing several views of the same object within a single image. We show that this implicit multi-view geometry can be exploited to simultaneously reconstruct the object in 3D and estimate the 6D pose of each visible object instance. We present MooMIns, a new Gaussian-splatting-based approach that inverts the original Gaussian splatting formulation: instead of rendering a single scene from multiple cameras, we render multiple object instances from a single camera. Our method is initialized with SAM3 instance segmentation masks and a modified Structure from Motion (SfM) pipeline. In contrast to learned monocular depth estimation, we perform true geometry-based reconstruction from image evidence, avoiding hallucinations caused by training data priors. We evaluate MooMIns on synthetic and real bin-picking scenarios, and demonstrate accurate reconstruction of previously unseen objects as well as reliable pose estimation of individual instances.

2606.14699 2026-06-15 cs.CV cs.GR cs.RO 新提交

Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Instruct-Particulate: 基于运动学控制的可扩展前馈式3D物体关节化

Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学) Nanyang Technological University(南洋理工大学)

AI总结 提出Instruct-Particulate模型,通过运动学规范(部件描述、连接性、关节类型等)指导3D网格的关节分割和运动参数预测,利用异构数据集(15万+物体)训练,实现跨类别和AI生成网格的泛化。

Comments Project page: https://instruct-particulate.github.io/

详情
AI中文摘要

重建关节式3D物体对于动画、游戏和机器人模拟至关重要。最近的神经网络可以估计3D物体的关节结构,但其泛化能力仍然受到该任务标注数据稀缺的限制。为了解决这一差距,我们引入了Instruct-Particulate,一个模型,它接受一个3D网格以及一个目标运动学规范,包括部件描述、连接性、关节类型和可选的点提示,并预测相应的运动学部件分割和关节运动参数。运动学规范消除了任务的歧义,并允许模型针对不同粒度的标注,从而使得使用更丰富的异构训练数据成为可能。在测试时,运动学规范可以从大规模视觉-语言模型中自动获得,因此该模型可以应用于任何输入网格。为了大规模训练我们的模型,我们构建了一个包含超过15万个关节式3D物体的异构数据集,通过使用视觉-语言模型对部分其他3D模型(整体或已分解为部件)进行运动学标注,扩展了现有的公开数据集。实验表明,我们的模型在跨类别和AI生成网格上泛化更好,通过图像到3D模型实现了从真实世界图像重建关节式资产。

英文摘要

Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

2503.19947 2026-06-15 cs.CV cs.AI 版本更新

Vanishing Depth: Training Generalized Depth Adapters with Sinusoidal Depth Preprocessing for Pretrained RGB Encoders

消失深度:基于正弦深度预处理的预训练RGB编码器通用深度适配器训练

Paul Koch, Jörg Krüger

发表机构 * Fraunhofer IPK(弗劳恩霍夫研究所) TU-Berlin(技术大学柏林)

AI总结 提出自监督训练方法,为预训练RGB编码器添加深度适配器,结合正弦深度编码实现通用鲁棒的深度特征提取,在分割、姿态估计和深度补全等下游任务中提升基线性能,SUN-RGBD分割达56.05 mIoU。

Comments Accepted to IntelliSys 2026

详情
AI中文摘要

通用度量深度理解对于精确的视觉引导机器人技术至关重要,而当前最先进的视觉编码器不支持这一点。为解决此问题,我们提出一种自监督训练方法,为预训练RGB编码器扩展一个深度适配器,将度量深度纳入并对齐到组合潜在空间中,同时不干扰预训练的RGB特征提取。结合我们的正弦深度编码,深度适配器实现了通用且鲁棒的深度密度和分布不变特征提取。我们的深度适配器在分割、姿态估计和深度补全等一系列相关RGBD下游任务中,提升了一组通用RGB基线的性能,而无需微调。最重要的是,我们在SUN-RGBD分割中达到了56.05 mIoU,同时在实验中优于最先进的深度感知和多模态编码器。当没有深度信息时,可以使用空地图激活深度适配器,利用单像素深度线索或单目深度估计,将深度感知特征提取纳入后续下游任务。

英文摘要

Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision-encoders do not support. To address this, we propose a self-supervised training approach that extends pretrained RGB encoders with a depth adapter to incorporate and align metric depth into a combined latent space without interfering with the pretrained RGB feature extraction. In combination with our sinusoidal depth encoding, the depth adapter enables generalized and robust depth density and distribution invariant feature extraction. Our depth adapters improve a wide set of generalized RGB baselines across a spectrum of relevant RGBD downstream tasks in segmentation, pose estimation, and depth completion -- without the necessity of finetuning. Most importantly, we achieve 56.05 mIoU in the SUN-RGBD segmentation, while outperforming SOTA depth-aware and multi-modal encoders in our experiments. When no depth is present, one can activate our depth adapter with an empty map, use single pixel depth clues, or monocular depth estimation to include the depth aware feature extraction into subsequent downstream tasks.
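
To make the phrase "sinusoidal depth encoding" concrete, the sketch below maps a metric depth value to a small vector of sine and cosine features. The abstract gives no formula, so everything here is an assumption: the positional-encoding style frequencies, the normalisation by `max_depth_m`, and the function name are hypothetical and may differ from what the paper actually does.

```python
import math


def sinusoidal_depth_encoding(depth_m, num_bands=4, max_depth_m=10.0):
    """Encode one depth value (metres) as 2 * num_bands bounded features.

    Assumption: frequencies double per band, in the style of Transformer
    positional encodings; the paper's encoding is not specified in the abstract.
    """
    d = depth_m / max_depth_m  # normalised depth
    features = []
    for band in range(num_bands):
        freq = (2 ** band) * math.pi
        features.append(math.sin(freq * d))
        features.append(math.cos(freq * d))
    return features
```

Every feature stays in [-1, 1] regardless of the raw depth range, which is one plausible reason such an encoding could be less sensitive to the depth distribution than feeding raw metric values.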

2603.04976 2026-06-15 cs.CV cs.AI 版本更新

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

3D-RFT:基于视频的3D场景理解的强化微调

Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出3D-RFT框架,将可验证奖励的强化学习(RLVR)扩展到视频3D感知与推理,通过直接优化评估指标(如3D IoU和F1分数)提升性能,4B模型超越8B模型。

Comments Accepted at ICML 2026. Project page: https://3d-rft.github.io/

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为增强大型语言模型(LLMs)推理能力的变革性范式,但其在3D场景理解中的潜力尚未充分挖掘。现有方法主要依赖监督微调(SFT),其中token级交叉熵损失作为优化的间接代理,导致训练目标与任务性能之间的错位。为弥合这一差距,我们提出了基于视频的3D场景理解的强化微调(3D-RFT),这是首个将RLVR扩展到视频3D感知与推理的框架。3D-RFT通过直接优化模型以匹配评估指标来转变范式。3D-RFT首先通过SFT激活3D感知的多模态大语言模型(MLLMs),然后使用组相对策略优化(GRPO)结合严格可验证的奖励函数进行强化微调。我们根据3D IoU和F1-Score等指标设计任务特定的奖励函数,以提供更有效的信号来指导模型训练。大量实验表明,3D-RFT-4B在各种基于视频的3D场景理解任务上达到了最先进的性能。值得注意的是,3D-RFT-4B在3D视频检测、3D视觉定位和空间推理基准上显著优于更大的模型(例如VG LLM-8B)。我们进一步揭示了3D-RFT的良好特性,如鲁棒有效性,以及对训练策略和数据影响的宝贵见解。我们希望3D-RFT能够作为未来3D场景理解发展的稳健且有前景的范式。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
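
A metric-derived reward and the group-relative normalisation used by GRPO can be sketched in a few lines. This is a simplified illustration under stated assumptions: boxes are axis-aligned `(xmin, ymin, zmin, xmax, ymax, zmax)` tuples (the paper's boxes may be oriented), the IoU itself is used directly as the reward, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for d in range(3):
        lo = max(a[d], b[d])
        hi = min(a[d + 3], b[d + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardise each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, one would sample several predicted boxes for the same prompt, score each with `iou_3d` against the ground-truth box, and pass the resulting list to `group_relative_advantages`, so that predictions better than the group average receive positive advantages without a learned value model.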

2605.21472 2026-06-15 cs.CV 版本更新

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

Stream3D: 基于证据记忆的序列多视角3D生成

Kaichen Zhou, Zeyang Bai, Xinhai Chang, Mengyu Wang, Paul Liang, Fangneng Zhan

发表机构 * World Mind Lab, HKUST(世界心智实验室,香港科技大学) Media Lab and EECS, MIT(媒体实验室和电子工程与计算机科学系,麻省理工学院) Kempner Institute, Harvard University(凯普纳研究所,哈佛大学)

AI总结 提出Stream3D,一种无需训练的流式机制,通过维护紧凑的证据记忆缓存关键历史帧,将冻结的视角条件3D生成器转换为流式生成器,解决单目视频流中3D生成的时间不一致问题。

Comments Multi-view 3D Generation, Streaming 3D Generation

详情
AI中文摘要

视角条件3D生成器(如SAM 3D、TRELLIS和Hunyuan3D)能够从单视角生成高质量物体重建,但真实世界的视觉观测通常以长单目流的形式出现。将这些生成器独立应用于每个流式帧会导致生成结果严重的时间不一致。为解决此问题,我们提出Stream3D,这是第一种无需训练的流式机制,通过恒定跨块记忆将冻结的视角条件3D生成器转换为流式生成器。Stream3D通过维护一个紧凑的证据记忆来实现这一点,该记忆基于提出的证据评分机制选择性缓存最具信息量的历史帧。随着流式处理进行,记忆动态更新以保留固定数量的信息帧,防止内存占用随序列长度线性增长。这还防止了长序列上的性能退化,并保持底层生成器完全不变,无需重新训练、架构修改或辅助损失。在真实和合成流式基准上的评估表明,Stream3D在光度指标和几何指标上均优于潜在传输基线,包括KV缓存重用和基于流的特征编辑。更多详情请见:https://stream-3d.github.io/stream3d.github.io/。

英文摘要

View-conditioned 3D generators such as SAM 3D, TRELLIS, and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: https://stream-3d.github.io/stream3d.github.io/.
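
The fixed-size memory that keeps only the most informative frames can be sketched with a min-heap. The evidence score itself is not defined in the abstract, so it is treated here as an externally supplied number; the class name `EvidentialMemory` and its methods are hypothetical.

```python
import heapq


class EvidentialMemory:
    """Keep the `capacity` highest-scoring frames seen so far in a stream."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, frame_id); the root is the weakest entry

    def update(self, frame_id, score):
        """Offer a new frame; it evicts the weakest cached frame if it scores higher."""
        item = (score, frame_id)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            heapq.heapreplace(self._heap, item)

    def frames(self):
        """Return cached frame ids in ascending order."""
        return sorted(frame_id for _, frame_id in self._heap)
```

Because the heap never holds more than `capacity` entries, memory use stays constant however long the stream runs, which is the property the abstract emphasises.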

2606.05102 2026-06-15 cs.CV 版本更新

ZipSplat: Fewer Gaussians, Better Splats

ZipSplat: 更少的高斯,更好的泼溅

Alexander Veicht, Sunghwan Hong, Dániel Baráth, Marc Pollefeys

发表机构 * ETH Zürich(苏黎世联邦理工学院) Microsoft(微软)

AI总结 提出 ZipSplat,一种基于令牌的前馈模型,通过聚类压缩视觉令牌并解码为高斯组,在无需重训练的情况下实现质量-效率权衡,以约6倍更少的高斯数在DL3DV和RealEstate10K上达到新最优。

详情
AI中文摘要

前馈式3D高斯泼溅方法能够在单次前向传递中从有姿态或无姿态图像重建场景,但当前方法为每个输入像素预测一个高斯,将表示预算与相机分辨率而非场景复杂度绑定。因此,一面平坦的墙壁和一块纹理丰富的物体会产生同样多的高斯,尽管几何需求截然不同。我们提出ZipSplat,一种基于令牌的前馈模型,将高斯放置与像素网格解耦。多视图骨干网络提取密集的视觉令牌,k-means聚类将其压缩为紧凑的场景令牌集。交叉注意力和自注意力精炼这些令牌,轻量级MLP将每个令牌解码为一组具有无约束3D位置的高斯。由于聚类在推理时应用,单个训练模型无需重训练即可覆盖质量-效率曲线。ZipSplat无需真实姿态或内参,但在DL3DV和RealEstate10K上以比像素对齐方法少约6倍的高斯数达到新最优,分别超过最佳无姿态基线2.1dB和1.2dB PSNR。它进一步零样本泛化到Mip-NeRF360和ScanNet++,优于所有可比基线。我们的项目页面位于https://veichta.com/zipsplat。

英文摘要

Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with ${\sim}6{\times}$ fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1dB and 1.2dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at https://veichta.com/zipsplat.
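
The token-compression step can be illustrated with plain k-means. This is a minimal sketch, assuming Lloyd's algorithm with a deterministic first-k initialisation; the abstract does not describe the initialisation, iteration count, or distance used in ZipSplat, and the function name is hypothetical.

```python
def kmeans(tokens, k, iters=10):
    """Cluster token vectors into k centroids with Lloyd's algorithm.

    Assumption: the first k tokens seed the centroids so the result is deterministic.
    Returns (centroids, assignment), where assignment[i] is the cluster of tokens[i].
    """
    centers = [list(t) for t in tokens[:k]]
    assign = [0] * len(tokens)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, tok in enumerate(tokens):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(tok, centers[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [tokens[i] for i in range(len(tokens)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign
```

Since `k` is an argument at call time rather than a trained quantity, the same model could be run with a small `k` for a compact scene or a large `k` for higher fidelity, which matches the abstract's point that clustering at inference spans the quality-efficiency curve.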

8. 医学影像与生物视觉 11 篇

2606.14072 2026-06-15 cs.CV cs.CL 新提交

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

扩散细化分割与视觉-语言解释用于儿童脑肿瘤MRI

Wentao Ke, Jianche Liu

发表机构 * Department of Mechanical Engineering, Stanford University(斯坦福大学机械工程系) School of Medicine, Stanford University(斯坦福大学医学院)

AI总结 提出两阶段框架,先用Swin-UNETR粗分割,再用条件扩散模型细化边界,最后结合多模态语言模型生成结构化报告,提升儿童脑肿瘤分割精度和可解释性。

详情
AI中文摘要

由于标注数据有限、成像表型异质性、肿瘤边界弥散以及肿瘤子区域类别不平衡,准确的儿童脑肿瘤分割仍然具有挑战性。在此,我们提出一个两阶段深度学习框架,用于改进多模态儿童脑MRI分割和临床解释。首先,我们在BraTS-PEDs MRI扫描上评估3D Res U-Net和Swin-UNETR基线模型,使用四种配准模态预测肿瘤核心、全肿瘤和增强肿瘤区域。其次,我们引入基于扩散的细化模型,以粗Swin-UNETR预测为条件,包括3D DDPM细化器和MedSegDiff。条件化显著提高了扩散稳定性和性能,特别是对于增强肿瘤边界分割。条件化MedSegDiff实现了最强的边界一致性,HD95最低。最后,预测的肿瘤体积和代表性分割叠加图与多模态语言模型集成,生成结构化的放射学风格报告。综合来看,我们的结果表明,从粗到细的扩散分割可以改善儿童肿瘤边界描绘,并支持端到端可解释的AI辅助神经肿瘤学工作流程。

英文摘要

Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.
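
The HD95 boundary metric mentioned above can be sketched for two sets of surface points. Conventions vary between toolkits: this version pools both directed nearest-neighbour distance lists before taking the 95th percentile, whereas some implementations take the maximum of the two directed percentiles. That choice, the linear-interpolation percentile, and the function names are assumptions.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted samples."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def hd95(points_a, points_b):
    """95th-percentile Hausdorff distance between two boundary point sets."""
    def directed(src, dst):
        # Distance from each source point to its nearest point in the other set.
        return [min(math.dist(p, q) for q in dst) for p in src]

    return percentile(directed(points_a, points_b) + directed(points_b, points_a), 95)
```

Using the 95th percentile instead of the maximum makes the score less sensitive to a handful of outlier voxels, which is why a lower HD95 is read as better boundary agreement.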

2606.14194 2026-06-15 cs.CV cs.LG 新提交

Hybrid Classical-Quantum (HCQ) Alzheimer's Classification via Supervised $β$-VAE and Quantum Kernels

混合经典-量子(HCQ)阿尔茨海默病分类:基于监督β-VAE与量子核

Tia Tiwari, Vamshi Krishna Kancharla, Neelam Sinha

发表机构 * Centre for Brain Research, Indian Institute of Science(印度科学研究所脑研究中心) Vision and AI Lab (VAL), Indian Institute of Science(印度科学研究所视觉与人工智能实验室)

AI总结 提出两阶段混合经典-量子流水线,通过监督3D β-VAE压缩MRI为64维潜码,经PLS选择6个成分编码为6量子比特态,利用量子核SVM实现AD分类,在ADNI-1上达72.1%准确率与0.799 AUC。

详情
AI中文摘要

本文提出了一种两阶段混合经典-量子(HCQ)流水线,用于从3D T1加权结构MRI体素中进行二元阿尔茨海默病(AD)分类,其中经典和量子组件设计为互补而非独立运行。一个监督的3D β-变分自编码器(VAE)在体素级重建、KL散度和焦点分类损失下进行端到端训练,将每个3D MRI体积(从152 x 184 x 152重采样为96 x 96 x 96)压缩为64维潜码。偏最小二乘(PLS)回归选择潜码中最佳区分阿尔茨海默病(AD)与认知正常(CN)受试者的六个分量,并将其重新缩放为旋转角度,通过ZZ量子特征映射编码到六量子比特寄存器上,得到相应的量子态。预计算核支持向量机(SVM)的输入是一个N x N Gram矩阵(N = 308),通过计算每对量子态之间的重叠得到。本工作的新颖之处在于量子核直接作用于由监督自编码器端到端学习的疾病感知特征,而非预提取的输入。在308名ADNI-1受试者(包括137名AD和171名CN)上,基线模型达到67.2%的准确率和0.759的AUC,而稳定性增强变体达到72.1%的准确率和0.799的AUC,且交叉验证方差减半。3D Grad-CAM进一步帮助验证了模型对与阿尔茨海默病相关脑区的关注。HCQ流水线可作为跨生物医学成像领域的诊断分类通用框架,这些领域对经典方法存在类似挑战。

英文摘要

This paper presents a two-stage Hybrid Classical-Quantum (HCQ) pipeline for binary Alzheimer's disease (AD) classification from 3D T1-weighted structural MRI volumes, where the classical and quantum components are designed to complement each other rather than operate independently. A supervised 3D $β$-variational autoencoder (VAE) is trained end-to-end under voxel-wise reconstruction, KL-divergence, and focal classification losses that compress each 3D MRI volume (resized from 152 x 184 x 152 to 96 x 96 x 96) into a 64-dimensional latent code. Partial Least Squares (PLS) regression selects the six components in the latent code that best separate Alzheimer's Disease (AD) from cognitively normal (CN) subjects and rescales them into rotation angles, which are encoded onto a six-qubit register using the ZZ quantum feature map to give us the respective quantum states. The input to a precomputed-kernel Support Vector Machine (SVM) is an N x N Gram matrix (N = 308), created by calculating the overlap between every pair of quantum states. The novelty of this work lies in the fact that the quantum kernel operates directly on disease-aware features that are learned end-to-end by a supervised autoencoder, rather than on pre-extracted inputs. On 308 ADNI-1 subjects, consisting of 137 AD and 171 CN subjects, the baseline achieved 67.2% accuracy and 0.759 AUC, while the stability-enhanced variant reached 72.1% accuracy and 0.799 AUC with cross-fold variance halved. 3D Grad-CAM further helped validate our model's focus on brain regions linked to Alzheimer's. The HCQ pipeline could serve as a general-purpose framework for diagnostic classification across biomedical imaging domains that present similar challenges for classical approaches.
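
The quantum-kernel Gram matrix can be simulated classically for a handful of qubits. The sketch below is a simplified stand-in, not the paper's circuit: it assumes a single repetition of a ZZ-style feature map applied to the uniform superposition, with single-qubit phases `x_i` and pairwise phases `(pi - x_i)(pi - x_j)` over all qubit pairs. Library implementations may use more repetitions, other entanglement patterns, or different scaling, and the function names are hypothetical.

```python
import cmath
import math
from itertools import product


def zz_state(x):
    """Statevector of a one-repetition ZZ-style feature map on len(x) qubits."""
    n = len(x)
    amp = 1.0 / math.sqrt(2 ** n)  # Hadamards on |0...0> give a uniform superposition
    state = []
    for bits in product((0, 1), repeat=n):
        s = [1 - 2 * b for b in bits]  # Z eigenvalue (+1 or -1) of each qubit
        phase = sum(x[i] * s[i] for i in range(n))
        phase += sum((math.pi - x[i]) * (math.pi - x[j]) * s[i] * s[j]
                     for i in range(n) for j in range(i + 1, n))
        state.append(amp * cmath.exp(1j * phase))
    return state


def quantum_kernel(x, y):
    """Squared overlap |<psi(x)|psi(y)>|^2 between two encoded states."""
    overlap = sum(a.conjugate() * b for a, b in zip(zz_state(x), zz_state(y)))
    return abs(overlap) ** 2


def gram_matrix(samples):
    """N x N kernel matrix suitable for a precomputed-kernel SVM."""
    return [[quantum_kernel(a, b) for b in samples] for a in samples]
```

For six rotation angles per subject each state has 64 amplitudes, so the full 308 x 308 matrix described in the abstract is cheap to simulate this way; the resulting matrix is symmetric with ones on the diagonal.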

2606.14251 2026-06-15 cs.CV 新提交

HiST: A Hierarchical Sparse Transformer for Cross-Modal Spatial Transcriptomics Modeling

HiST:用于跨模态空间转录组建模的分层稀疏Transformer

Weiyi Wu, Xinwen Xu, Xingjian Diao, Siting Li, Zhi Wei, Alma Andersson, Jiang Gui

发表机构 * New Jersey Institute of Technology(新泽西理工学院) Stevens Institute of Technology(史蒂文斯理工学院) Karolinska Institutet(卡罗林斯卡学院) Dartmouth College(达特茅斯学院)

AI总结 提出HiST,一种分层稀疏Transformer,通过稀疏窗口注意力和分辨率变换算子实现高效的多尺度空间转录组推断,显著降低计算开销并提升预测性能。

详情
Journal ref
ICML 2026
AI中文摘要

空间转录组学(ST)将基因表达与组织形态联系起来,但成本高且通量低,因此需要从常规组织学推断表达的替代方法。全切片H&E到ST推断将千兆像素图像与稀疏、不规则位置上的基因测量配对,使得多尺度建模在不产生密集网格开销或二次令牌混合的情况下具有挑战性。我们提出HiST,一种分层稀疏Transformer,将测量位置视为晶格索引的稀疏场,并直接在活跃组织足迹上构建二元编码器-解码器。HiST结合了稀疏窗口注意力用于局部几何对应,以及分辨率变换算子用于快速多尺度上下文整合。对于固定窗口大小,主要运行时间和内存随观测位置数量而非密集切片面积扩展。为缓解切片特定的采集变异,HiST通过一个瓶颈全局条件路径添加了切片校准令牌,该令牌总结切片级上下文并调节局部表示。在涵盖不同组织和采集源的多器官基准测试中,HiST在降低运行时间和峰值内存的同时,提升了相对于近期基线的预测性能。

英文摘要

Spatial transcriptomics (ST) links gene expression with tissue morphology but remains expensive and low-throughput, motivating surrogates that infer expression from routine histology. Whole-slide H&E-to-ST inference pairs a gigapixel image with gene measurements at a sparse, irregular set of locations, making multiscale modeling challenging without incurring dense-grid overhead or quadratic token mixing. We propose HiST, a hierarchical sparse transformer that treats measured locations as a lattice-indexed sparse field and builds a dyadic encoder-decoder directly on the active tissue footprint. HiST combines sparse window attention for local geometric correspondence with resolution-changing operators for rapid multiscale context integration. For a fixed window size, the dominant runtime and memory scale with the number of observed locations rather than the dense slide area. To mitigate slide-specific acquisition variation, HiST adds a bottlenecked global conditioning pathway via a slide calibration token that summarizes slide-level context and conditions local representations. On a multi-organ benchmark spanning diverse tissues and acquisition sources, HiST improves predictive performance over recent baselines while reducing runtime and peak memory.
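The scaling claim (cost grows with the number of observed locations, not the dense slide area) can be made concrete with a toy count of attention pairs. The sketch below is illustrative only: the helper names `window_groups`, `attention_pairs`, and `coarsen` are hypothetical, the tiling is a plain non-overlapping grid, and the dyadic step simply halves lattice coordinates, which may differ from HiST's actual operators.

```python
from collections import defaultdict


def window_groups(coords, window):
    """Bucket occupied lattice sites into non-overlapping window x window tiles."""
    groups = defaultdict(list)
    for r, c in coords:
        groups[(r // window, c // window)].append((r, c))
    return dict(groups)


def attention_pairs(coords, window):
    """Query-key pairs when attention is restricted to sites sharing a tile.

    Each tile holds at most window**2 sites, so the total is bounded by
    len(coords) * window**2, independent of the dense grid size.
    """
    return sum(len(g) ** 2 for g in window_groups(coords, window).values())


def coarsen(coords):
    """Dyadic downsampling: a parent site is active if any child site is active."""
    return sorted({(r // 2, c // 2) for r, c in coords})
```

For example, four spots scattered over a 10 x 10 lattice with `window=4` need 6 pairs under this tiling, versus 16 for full attention over the spots and 10,000 for full attention over the dense grid.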

2606.14534 2026-06-15 cs.CV 新提交

A Lightweight Fiducial-Based Pipeline for 3D Hyperspectral Mapping of ex-vivo Lumpectomy Specimens

一种轻量级基于基准的离体肿块切除标本三维高光谱映射流水线

Anna Bicchi, Alberto Rota, Leonardo Passoni, Nicola Ancellotti, Andrea Peroni, Lorenzo Vinco, Dario Polli, Elena De Momi

发表机构 * Politecnico di Milano(米兰理工大学) Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano(米兰理工大学电子、信息与生物工程系) Department of Physics, Politecnico di Milano(米兰理工大学物理系)

AI总结 提出一种全自动、免标定流水线,利用RGB图像和单次HSI采集生成离体肿块切除标本的三维高光谱点云,通过ArUco标记实现亚毫米级配准,支持术中切缘评估。

详情
AI中文摘要

高光谱成像(HSI)是保乳手术(BCS)中用于术中评估切缘的一种有前景的模态,但其临床转化需要将固有的二维光谱信息与切除组织的三维形状对齐,以便精确定位可疑区域进行靶向随访。我们提出了一种全自动、免标定的流水线,该流水线从一组消费级相机RGB图像和单次自上而下的HSI采集生成离体肿块切除标本的三维高光谱点云。三维几何结构通过深度学习运动恢复结构(Structure-from-Motion)骨干网络重建,并通过自定义光束法平差(bundle adjustment)在度量参考框架中稳定,该平差对放置在标本周围的四个ArUco标记的角点强制执行一致性。然后,HSI立方体在不恢复HSI相机位姿的情况下配准到重建结果:两种模态中可见的标记定义了16个角点对应关系,驱动平面单应性(planar homography),并通过在正交渲染的深度图上查找恢复三维坐标。在两个离体肿块切除标本上评估,该流水线实现了中位三维配准误差低于1毫米,二维重投影误差低于0.02毫米,在加速硬件上每个标本的总处理时间低于4分钟。这些结果支持将HSI引导的空间定位集成到保乳手术的术中切缘评估工作流程中的可行性。

英文摘要

Hyperspectral Imaging (HSI) is a promising modality for intraoperative assessment of resection margins in Breast-Conserving Surgery (BCS), but its clinical translation requires aligning the inherently 2D spectral information onto the 3D shape of the excised tissue so that suspicious regions can be precisely localized for targeted follow-up. We present a fully automated, calibration-free pipeline that produces a 3D hyperspectral point cloud of an ex-vivo lumpectomy specimen from a set of consumer-camera RGB images and a single top-down HSI acquisition. The 3D geometry is reconstructed with a deep-learning Structure-from-Motion backbone, stabilized in a metric reference frame by a custom bundle adjustment that enforces consistency on the corners of four ArUco markers placed around the specimen. The HSI cube is then registered to the reconstruction without recovering the HSI camera pose: the markers, visible in both modalities, define 16 corner correspondences that drive a planar homography, and 3D coordinates are recovered by lookup on an orthographically rendered depth map. Evaluated on two ex-vivo lumpectomy specimens, the pipeline achieves a median 3D registration error below 1 mm and a 2D reprojection error below 0.02 mm, with a total per-specimen processing time under 4 minutes on accelerated hardware. These results support the feasibility of integrating HSI-guided spatial localization into intraoperative margin assessment workflows for breast-conserving surgery.
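The homography step (16 marker-corner correspondences driving a planar mapping) can be sketched as a linear least-squares fit. This is a generic textbook formulation under assumptions, not the paper's code: it fixes the bottom-right homography entry to 1, solves the normal equations with plain Gaussian elimination, and expects coordinates that have been scaled to a modest range for numerical stability. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_homography(src, dst):
    """Least-squares 3x3 homography mapping src (x, y) points to dst (u, v) points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    # Normal equations (A^T A) h = A^T b for the 8 unknown entries.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(8)]
    h = solve_linear(ata, atb) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(h, point):
    """Map a 2D point through the homography, including the projective division."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v
```

Once an HSI pixel is mapped into the orthographic view this way, its 3D position would follow from reading the rendered depth map at that location, as the abstract describes.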

2606.14568 2026-06-15 eess.IV cs.CV 交叉投稿

Trimodal Glioma Representation Alignment via Volumetric Contrastive Learning

基于体积对比学习的三模态胶质瘤表示对齐

Denise Marini, Eleonora Grassucci, Danilo Comminiello

发表机构 * arXiv

AI总结 提出GLORIA框架,通过Gramian对比损失对齐组织病理、基因表达和MRI三模态特征,用于胶质瘤分级和生存预测,在132例患者数据上优于双模态基线。

详情
AI中文摘要

胶质瘤分级和生存预测需要整合在不同空间和生物学尺度上收集的异质性信息。组织病理学描述组织形态,mRNA表达捕捉分子活动,磁共振成像提供肿瘤范围和放射学异质性的非侵入性视图。现有的胶质瘤预后模型通常只结合其中两个来源,而其对齐目标大多保持成对。本文介绍了GLORIA,一种用于胶质瘤组学-放射学-组织病理学对齐的新型三模态框架。GLORIA通过模态特定编码器处理全切片图像区域、基因表达谱和3D MRI体积,将它们投影到共享潜在空间,并使用Gramian对比损失对齐,该损失测量三个模态嵌入张成的体积。对齐的表示通过跨模态门控模块融合,并联合优化用于三级胶质瘤分级和总生存期预测。我们在匹配的TCGA-GBM/LGG和BraTS21队列上评估GLORIA,该队列包含132名具有所有三种模态的患者。在共享的三模态测试集上,GLORIA在所有考虑的指标上均优于双模态WSI-mRNA基线。

英文摘要

Glioma grading and survival prediction require the integration of heterogeneous information collected at different spatial and biological scales. Histopathology describes tissue morphology, mRNA expression captures molecular activity, and magnetic resonance imaging provides a non-invasive view of tumor extent and radiological heterogeneity. Existing glioma prognosis models often combine only two of these sources, while their alignment objectives remain mostly pairwise. This paper introduces GLORIA, a novel trimodal framework for GLioma Omics - Radiology - hIstopathology Alignment. GLORIA processes whole-slide image regions, gene-expression profiles, and 3D MRI volumes through modality-specific encoders, projects them into a shared latent space, and aligns them with a Gramian contrastive loss that measures the volume spanned by the three modality embeddings. The aligned representations are fused through a cross-modal gating module and optimized jointly for three-class glioma grading and overall survival prediction. We evaluate GLORIA on a matched TCGA-GBM/LGG and BraTS21 cohort, comprising 132 patients with all three modalities. On the shared trimodal test set, GLORIA improves over the bimodal WSI-mRNA baseline in all the metrics considered.
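The quantity behind a Gramian alignment objective, the volume spanned by three modality embeddings, can be computed from the determinant of their Gram matrix. The sketch below shows only that volume for unit-normalized vectors; how GLORIA turns it into a contrastive loss (temperature, negatives, batching) is not specified in the abstract and is not reproduced here. The name `gram_volume` is hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def gram_volume(e1, e2, e3):
    """Volume of the parallelotope spanned by three unit-normalized embeddings.

    Equals sqrt(det(G)) with G[i][j] = <e_i, e_j>. It is 0 when the embeddings
    are linearly dependent (for example, all aligned) and 1 when they are
    mutually orthogonal.
    """
    vecs = [normalize(e) for e in (e1, e2, e3)]
    g = [[sum(a * b for a, b in zip(u, v)) for v in vecs] for u in vecs]
    det = (
        g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
        - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
        + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0])
    )
    return math.sqrt(max(det, 0.0))  # clamp tiny negative round-off
```

Under this reading, a matched patient's histology, mRNA, and MRI embeddings would be pushed toward a small volume, while mismatched triplets would be pushed toward a larger one.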

2511.12193 2026-06-15 cs.CV 版本更新

MMRINet: Efficient Mamba-Based Segmentation with Dual-Path Refinement for Low-Resource MRI Analysis

MMRINet: 基于Mamba的高效双路径细化分割网络用于低资源MRI分析

Abdelrahman Elsayed, Ahmed Jaheen, Mohammad Yaqub

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) New York University(纽约大学)

AI总结 提出MMRINet,一种基于Mamba的轻量级分割网络,通过双路径特征细化和渐进特征聚合,在低资源MRI分析中以2.5M参数实现高效分割,在SSA数据集上优于UNETR等基线。

Comments Accepted at The Medical Image Understanding and Analysis Conference (MIUA 2026)

详情
AI中文摘要

在多参数MRI中自动分割脑肿瘤在资源受限的临床环境中仍然是一个关键但未得到充分解决的挑战,因为需要高端GPU的深度3D网络不可行。这在撒哈拉以南非洲(SSA)尤为突出,低场扫描仪、异质患者群体和严重的数据稀缺加剧了应用标准深度学习管道的难度。我们提出了MMRINet,一种专为这些约束条件设计的轻量级分割架构。其核心是用线性复杂度的Mamba状态空间模型替代二次复杂度的自注意力,从而在不增加基于Transformer方法的计算开销的情况下实现高效的长程体积上下文建模。我们结合了两个轻量级细化组件:双路径特征细化(DPFR),提取互补的细节和上下文表示以改善有限数据下的特征多样性;以及渐进特征聚合(PFA),分层融合多尺度解码器输出以获得更清晰的分割边界。在包含来自尼日利亚临床站点的3D MRI扫描的BraTS-Lighthouse SSA 2025挑战数据集上评估,MMRINet仅用约2.5M参数就实现了平均Dice分数0.752和平均HD95 12.23 mm,优于所有评估的基线,包括UNETR、Swin-UNETR、SegMamba和SegResNet3D。这些结果表明,可以在大幅减少计算的情况下实现强大的验证集分割性能,为低资源临床环境中AI辅助神经肿瘤学提供了实用的一步。我们的GitHub仓库可在此访问:BioMedIA-MBZUAI/MMRINet。

英文摘要

Automated brain tumor segmentation in multi-parametric MRI remains a critical yet underserved challenge in resource-constrained clinical settings, where deep 3D networks requiring high-end GPUs are not viable. This is particularly acute across sub-Saharan Africa (SSA), where low-field scanners, heterogeneous patient demographics, and severe data scarcity compound the difficulty of applying standard deep learning pipelines. We present MMRINet, a lightweight segmentation architecture purpose-built for these constraints. At its core, MMRINet replaces quadratic-complexity self-attention with linear-complexity Mamba state-space models, enabling efficient long-range volumetric context modeling without the computational overhead of Transformer-based approaches. We combine two lightweight refinement components: Dual-Path Feature Refinement (DPFR), which extracts complementary detail and contextual representations to improve feature diversity under limited data, and Progressive Feature Aggregation (PFA), which hierarchically fuses multi-scale decoder outputs for sharper segmentation boundaries. Evaluated on the BraTS-Lighthouse SSA 2025 challenge dataset, comprising 3D MRI scans from Nigerian clinical sites, MMRINet achieves an average Dice score of 0.752 and an average HD95 of 12.23 mm with only ~2.5M parameters, outperforming all evaluated baselines, including UNETR, Swin-UNETR, SegMamba, and SegResNet3D. These results indicate that strong validation-set segmentation performance can be achieved with substantially reduced computation, offering a practical step toward AI-assisted neuro-oncology in low-resource clinical environments. Our GitHub repository can be accessed here: BioMedIA-MBZUAI/MMRINet.
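For readers unfamiliar with the headline metric, the Dice score for one binary mask can be written in a few lines. This is the standard definition on flattened masks, given as a sketch; the convention of returning 1.0 when both masks are empty is an assumption, and challenge evaluation code may treat that case differently.

```python
def dice_score(pred, target):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * inter / total
```

A reported average such as 0.752 would then be the mean of this quantity over tumor sub-regions and cases.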

2511.14897 2026-06-15 cs.CV cs.LG 版本更新

HULFSynth : An INR based Super-Resolution and Ultra Low-Field MRI Synthesis via Contrast factor estimation

HULFSynth: 基于隐式神经表示的超分辨率和超低场MRI合成,通过对比因子估计

Pranav Indrakanti, Luca Trautmann, Ivor Simpson

发表机构 * LILI Lab, University of Sussex, Brighton, UK(利利实验室,苏塞克斯大学,布莱顿,英国)

AI总结 提出无监督单图像双向MRI合成器,基于物理模型估计组织类型信噪比实现高低场转换,并利用隐式神经表示网络实现超分辨率,在合成和真实数据上验证了对比度提升。

Comments Medical Image Understanding and Analysis, MIUA 2026

详情
AI中文摘要

我们提出了一种无监督的单图像双向磁共振图像(MRI)合成器,它可以从高场(HF)幅度图像合成类似超低场(ULF)的图像,反之亦然。与现有的MRI合成模型不同,我们的方法受驱动HF和ULF MRI之间对比度变化的物理原理启发。我们的前向模型通过基于目标对比度值估计组织类型信噪比(SNR)值来模拟HF到ULF的变换。对于超分辨率任务,我们使用隐式神经表示(INR)网络,通过同时预测组织类型分割和图像强度来合成HF图像,而无需观察到的HF数据。所提出的方法使用从标准3T T1加权图像生成的合成ULF样数据进行定性评估,并使用配对的3T-64mT T1加权图像进行验证实验。在合成ULF样图像中,白质-灰质对比度提高了52%,在64mT图像中提高了37%。敏感性实验证明了我们的前向模型对目标对比度、噪声和初始种子的变化的鲁棒性。

英文摘要

We present an unsupervised single-image bidirectional Magnetic Resonance Image (MRI) synthesizer that synthesizes an Ultra-Low Field (ULF)-like image from a High-Field (HF) magnitude image and vice versa. Unlike existing MRI synthesis models, our approach is inspired by the physics that drives contrast changes between HF and ULF MRIs. Our forward model simulates an HF-to-ULF transformation by estimating the tissue-type Signal-to-Noise Ratio (SNR) values based on target contrast values. For the Super-Resolution task, we used an Implicit Neural Representation (INR) network to synthesize an HF image by simultaneously predicting tissue-type segmentations and image intensity without observed HF data. The proposed method is evaluated using synthetic ULF-like data generated from standard 3T T1-weighted images for qualitative assessments and paired 3T-64mT T1-weighted images for validation experiments. WM-GM contrast improved by 52% in synthetic ULF-like images and 37% in 64mT images. Sensitivity experiments demonstrated the robustness of our forward model to variations in target contrast, noise and initial seeding.

2512.03883 2026-06-15 cs.CV 版本更新

Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy

双交叉注意力孪生Transformer用于观察等待内镜中直肠肿瘤再生长评估

Jorge Tapias Gomez, Despoina Kanata, Aneesh Rangnekar, Christina Lee, Julio Garcia-Aguilar, Joshua Jesse Smith, Harini Veeraraghavan

发表机构 * Total Neoadjuvant Treatment (TNT)(总新辅助治疗) Watch-and-Wait (WW) surveillance(观察等待(WW)监视) Swin Transformer(Swin变压器) Dual Cross-Attention (SSDCA)(双交叉注意力(SSDCA))

AI总结 提出双交叉注意力孪生Swin Transformer(SSDCA),结合再分期和随访内镜图像区分临床完全缓解与局部再生长,在62名患者上实现81.76%平衡准确率。

Comments Accepted to ISBI 2026 conference (6 pages, 5 figures, 1 table)

详情
AI中文摘要

越来越多的证据支持对接受全新辅助治疗后再分期显示临床完全缓解(cCR)的直肠癌患者进行观察等待(WW)监测。然而,在WW期间,从随访内镜图像中早期准确检测局部再生长(LR)对于管理护理和预防远处转移至关重要。因此,我们开发了一种具有双交叉注意力的孪生Swin Transformer(SSDCA),用于结合再分期和随访的纵向内镜图像,区分cCR和LR。SSDCA利用预训练的Swin Transformer提取领域无关特征,增强对成像变化的鲁棒性。实现双交叉注意力以强调来自配对扫描的特征,无需任何空间对齐即可预测响应。SSDCA以及基于Swin的基线模型使用135名患者的图像对进行训练,并在62名患者的保留图像对上进行评估。SSDCA产生了最佳平衡准确率(81.76% ± 0.04)、灵敏度(90.07% ± 0.08)和特异性(72.86% ± 0.05)。鲁棒性分析显示,无论是否存在血液、粪便、毛细血管扩张和图像质量差等伪影,性能均稳定。提取特征的UMAP聚类显示,SSDCA具有最大的簇间分离(1.45 ± 0.18)和最小的簇内分散(1.07 ± 0.19),证实了判别性表示学习。代码和权重可在以下网址获取:https://github.com/Jotanator/SSDCA

英文摘要

Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, accurate methods for early detection of local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin Transformers to extract domain-agnostic features and enhance robustness to imaging variations. Dual cross-attention is implemented to emphasize features from the paired scans without requiring any spatial alignment to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% ± 0.04), sensitivity (90.07% ± 0.08), and specificity (72.86% ± 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 ± 0.18) and minimal intra-cluster dispersion (1.07 ± 0.19) with SSDCA, confirming discriminative representation learning. Code and weights available at: https://github.com/Jotanator/SSDCA

2512.10966 2026-06-15 cs.LG cs.AI cs.CV eess.IV 版本更新

Interpretable Alzheimer's Diagnosis via Multimodal Fusion of Regional Brain Experts

可解释的阿尔茨海默病诊断:基于区域脑专家的多模态融合

Farica Zhuang, Shu Yang, Dinara Aliyeva, Zixuan Wen, Duy Duong-Tran, Christos Davatzikos, Tianlong Chen, Song Wang, Li Shen

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出MREF-AD多模态区域专家融合模型,采用混合专家框架将各模态脑区域视为独立专家,通过门控网络学习个性化融合权重,实现可解释的AD诊断。

Comments Published at IEEE ICHI 2026

详情
AI中文摘要

准确早期诊断阿尔茨海默病(AD)对有效干预至关重要,需要整合多模态神经影像数据的互补信息。然而,传统融合方法通常依赖特征的简单拼接,无法自适应平衡淀粉样蛋白PET和MRI等生物标志物在不同脑区的贡献。本文提出MREF-AD,一种用于AD诊断的多模态区域专家融合模型。它是一个混合专家(MoE)框架,将每个模态内的介观脑区域建模为独立专家,并采用门控网络学习个体特定的融合权重。利用阿尔茨海默病神经影像学倡议(ADNI)的表格神经影像和人口统计学信息,MREF-AD在强经典和深度学习基线上取得了有竞争力的性能,同时提供了可解释的、模态和区域层面的洞察,揭示了结构和分子影像如何共同促进AD诊断。源代码见:https://github.com/PennShenLab/mref-ad

英文摘要

Accurate and early diagnosis of Alzheimer's disease (AD) is critical for effective intervention and requires integrating complementary information from multimodal neuroimaging data. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models mesoscopic brain regions within each modality as independent experts and employs a gating network to learn subject-specific fusion weights. Utilizing tabular neuroimaging and demographic information from the Alzheimer's Disease Neuroimaging Initiative (ADNI), MREF-AD achieves competitive performance over strong classic and deep baselines while providing interpretable, modality- and region-level insight into how structural and molecular imaging jointly contribute to AD diagnosis. The source code is available at https://github.com/PennShenLab/mref-ad.
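The gating idea (regional experts whose outputs are combined with subject-specific weights) reduces to a softmax-weighted average. The sketch below is a generic Mixture-of-Experts fusion under assumptions: each expert is represented only by its output probability, the gate scores are given rather than learned, and the names `softmax` and `fuse_experts` are illustrative rather than taken from the MREF-AD code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def fuse_experts(expert_probs, gate_scores):
    """Combine per-expert AD probabilities with subject-specific gate weights.

    Returns the fused probability and the weights, which can be read as
    per-region, per-modality contributions for that subject.
    """
    weights = softmax(gate_scores)
    fused = sum(w * p for w, p in zip(weights, expert_probs))
    return fused, weights
```

Because the weights sum to 1 and are produced per subject, inspecting them is what gives the modality- and region-level interpretability the abstract refers to.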

2512.15947 2026-06-15 eess.IV cs.CV 版本更新

MCR-VQGAN: A Scalable and Cost-Effective Tau PET Synthesis Approach for Alzheimer's Disease Imaging

MCR-VQGAN:一种用于阿尔茨海默病成像的可扩展且经济高效的Tau PET合成方法

Jin Young Kim, Jeremy Hudson, Jeongchul Kim, Qing Lyu, Christopher T. Whitlow

发表机构 * Department of Biomedical Engineering, Wake Forest University School of Medicine(生物医学工程系,威克森林大学医学院) Department of Radiology, Wake Forest School of Medicine(放射学系,威克森林医学院) Department of Radiology and Biomedical Imaging, Yale School of Medicine(放射学与生物医学成像系,耶鲁医学院)

AI总结 提出MCR-VQGAN模型,通过多尺度卷积、ResNet块和CBAM模块改进VQGAN,从T1加权MRI合成高保真tau PET图像,在ADNI数据集上优于cGAN等方法,且合成图像保留诊断特征,区域SUVR等效分析显示与真实图像高度一致。

Comments Accepted for publication in IEEE Access. 14 pages, 5 figures, 8 tables

详情
AI中文摘要

Tau正电子发射断层扫描(PET)是阿尔茨海默病(AD)的关键诊断方式,但其广泛临床采用受到辐射暴露、可用性有限、高临床工作量和巨大财务成本的阻碍。为解决这些限制,我们提出了多尺度CBAM残差向量量化生成对抗网络(MCR-VQGAN),从结构T1加权MRI合成高保真tau PET图像。MCR-VQGAN通过三项增强改进了标准VQGAN架构:多尺度卷积、ResNet块和卷积块注意力模块(CBAM),这些共同改善了对局部和全局特征的捕获。使用来自ADNI数据库的222对T1加权MRI和tau PET扫描,我们训练并比较了MCR-VQGAN与cGAN、WGAN-GP、CycleGAN和基线VQGAN。MCR-VQGAN在所有指标上均实现了优越的图像合成性能(MSE = 0.0056 +/- 0.0061,PSNR = 30.65 +/- 4.47 dB,SSIM = 0.9263 +/- 0.0469)。在真实tau PET上训练的基于CNN的AD分类器在真实(63.64%)和合成(65.91%)图像上达到了相当的准确率,表明诊断相关特征得以保留。跨Braak定义ROI的区域SUVR等效分析进一步表明真实与合成tau PET之间高度一致(Pearson r = 0.78-0.88;ICC = 0.71-0.84),其中Braak V/VI区域一致性最强(ICC = 0.838)。这些结果共同表明,MCR-VQGAN为传统tau PET成像提供了一种有前景且可扩展的替代方案,可能改善AD研究和临床工作流程中tau生物标志物的可及性。

英文摘要

Tau positron emission tomography (PET) is a critical diagnostic modality for Alzheimer's disease (AD), but its widespread clinical adoption is hindered by radiation exposure, limited availability, high clinical workload, and substantial financial costs. To address these limitations, we propose the Multi-scale CBAM Residual Vector Quantized Generative Adversarial Network (MCR-VQGAN) to synthesize high-fidelity tau PET images from structural T1-weighted MRI. MCR-VQGAN advances the standard VQGAN architecture through three enhancements: multi-scale convolutions, ResNet blocks, and Convolutional Block Attention Modules (CBAM), which collectively improve the capture of local and global features. Using 222 paired T1-weighted MRI and tau PET scans from the ADNI database, we trained and compared MCR-VQGAN against cGAN, WGAN-GP, CycleGAN, and baseline VQGAN. MCR-VQGAN achieved superior image synthesis performance across all metrics (MSE = 0.0056 +/- 0.0061, PSNR = 30.65 +/- 4.47 dB, SSIM = 0.9263 +/- 0.0469). A CNN-based AD classifier trained on real tau PET achieved comparable accuracy on real (63.64%) and synthetic (65.91%) images, indicating that diagnostically relevant features are preserved. Regional SUVR-equivalent analysis across Braak-defined ROIs further indicated strong agreement between real and synthetic tau PET (Pearson r = 0.78-0.88; ICC = 0.71-0.84), with the strongest agreement in Braak V/VI (ICC = 0.838). Together, these results suggest that MCR-VQGAN offers a promising and scalable surrogate for conventional tau PET imaging, potentially improving the accessibility of tau biomarkers for AD research and clinical workflows.

2605.24609 2026-06-15 physics.med-ph cs.AI cs.CV 版本更新

Catching magnetic resonance imaging outliers in artificial intelligence-supported radiotherapy workflows: unsupervised detection and localization of image anomalies using deep learning

捕捉人工智能辅助放疗工作流中的磁共振成像离群值:使用深度学习无监督检测和定位图像异常

Mustafa Kadhim, Viktor Rogowski, Emilia Persson, Camila Gonzalez, André Haraldsson, Sofie Ceberg, Mikael Nilsson, Malin Kügele, Sven Bäck, Christian Jamtheim Gustafsson

发表机构 * Physics and Imaging in Radiation Oncology (phiRO)(物理与放射治疗成像(phiRO))

AI总结 提出一种两阶段无监督异常检测框架,通过离散令牌压缩和令牌惊奇度评分,在盆腔和脑部MRI上实现高精度异常检测与定位,支持放疗工作流自动化质量控制。

Comments This paper has been submitted to Physics and Imaging in Radiation Oncology (phiRO)

详情
AI中文摘要

人工智能越来越多地集成到放射治疗工作流程中,然而此类流程仍然容易受到分布外图像数据的影响,这些数据可能在临床任务中引入意外行为。基于深度学习的盆腔磁共振成像(MRI)异常检测在很大程度上仍未探索,对其全自动化可行性的透明评估有限。我们开发并评估了一个完全自动化的、无监督的盆腔和脑部MRI异常检测框架。一个两阶段框架在来自公共数据集的参考图像上训练:盆腔MRI使用LUND-PROBE,脑部MRI使用IXI、fastMRI和fastMRI+。在第一阶段,MRI切片被压缩成离散令牌;在第二阶段,对正常令牌的分布进行建模。通过结合感知图像差异和基于负对数似然的令牌惊奇度评分来估计异常证据。在具有合成全局异常和真实临床异常的盆腔MRI上,以及具有临床注释的fastMRI+异常的脑部MRI上,评估了自动检测。评估了敏感性、特异性、受试者工作特征曲线下面积(AUC)以及在保留的正常病例中的假阳性行为。该框架在隐藏评估队列中实现了稳健的检测,盆腔和脑部MRI的AUC分别为0.97(95% CI, 0.95-0.98)和0.81(95% CI, 0.74-0.87)。热图分析显示检测到的异常与真实位置之间具有很强的空间一致性,支持定位准确性和可解释性。这些结果支持无监督异常检测作为放射治疗工作流程中自动化MRI质量控制层的潜力,并透明地可视化可能危及下游基于AI任务的图像区域。

英文摘要

Artificial intelligence is increasingly integrated into radiotherapy workflows, yet such pipelines remain vulnerable to out-of-distribution image data that may introduce unexpected behavior in clinical tasks. Deep learning-based anomaly detection for pelvic magnetic resonance imaging (MRI) remains largely unexplored, and transparent evaluation of its feasibility for full automation is limited. We developed and evaluated a fully automated, unsupervised anomaly-detection framework for pelvic and brain MRI. A two-stage framework was trained on reference images from public datasets: LUND-PROBE for pelvic MRI, and IXI, fastMRI, and fastMRI+ for brain MRI. In the first stage, MRI slices were compressed into discrete tokens; in the second, the distribution of normal tokens was modeled. Anomaly evidence was estimated by combining perceptual image differences with token-surprisal scores based on negative log-likelihood. Automated detection was evaluated on pelvic MRI with synthetic global and real clinical anomalies, and on brain MRI with clinically annotated fastMRI+ abnormalities. Sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and false-positive behavior in held-out normal cases were assessed. The framework achieved robust detection across hidden evaluation cohorts, with AUCs of 0.97 (95% CI, 0.95-0.98) and 0.81 (95% CI, 0.74-0.87) for pelvic and brain MRI, respectively. Heatmap analysis showed strong spatial agreement between detected anomalies and ground-truth locations, supporting localization accuracy and interpretability. These results support the potential of unsupervised anomaly detection as an automated MRI quality-control layer for radiotherapy workflows, with transparent visualization of image regions likely to compromise downstream AI-based tasks.
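The token-surprisal component can be sketched directly from its definition as negative log-likelihood. The combination rule below is an assumption made for illustration: the abstract says perceptual differences and surprisal scores are combined but does not give the formula, so the weighted sum and the `alpha` parameter are hypothetical.

```python
import math


def token_surprisal(token_probs):
    """Per-token surprisal -log p, given the prior model's probability of each token."""
    return [-math.log(p) for p in token_probs]


def anomaly_score(token_probs, perceptual_diff, alpha=0.5):
    """Blend a perceptual image difference with the mean token surprisal.

    Assumption: a simple convex combination weighted by alpha.
    """
    surprisal = token_surprisal(token_probs)
    mean_surprisal = sum(surprisal) / len(surprisal)
    return alpha * perceptual_diff + (1.0 - alpha) * mean_surprisal
```

Tokens that the normal-anatomy prior finds unlikely receive high surprisal, and keeping the per-token values on the token grid is one way a localization heatmap could be produced.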

9. 文档图像、OCR与图表理解 1 篇

2605.21182 2026-06-15 cs.CL cs.AI cs.CV 版本更新

Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

Manga109-v2026: 重新审视Manga109标注以适应现代漫画理解

Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa

发表机构 * University of Tokyo(东京大学)

AI总结 本文重新审视Manga109的对话文本标注,识别出五类标注问题,包括转录错误、缺失文本区域、对话与拟声词重叠以及未分割的对话气泡,并通过结合OCR基于的问题检测和人工修订构建Manga109-v2026,修订了约29,000个对话标注,使Manga109更好地适应现代OCR和多模态漫画理解系统,同时保留漫画特有的表达结构。

Comments Accepted to the Culture x AI Workshop at ICML 2026. Project page: https://manga109.github.io/manga109-project-website/en/

详情
AI中文摘要

漫画是一种具有文化特色的多模态媒介,是日本流行文化中最具影响力的形态之一。随着AI系统越来越多地针对漫画理解、OCR和翻译进行研究,Manga109已成为漫画相关AI研究的基础数据集。然而,当前的Manga109数据集包含转录错误和粗略的标注,这与现代OCR和多模态漫画理解任务不匹配。在本工作中,我们重新审视Manga109的对话文本标注,识别出五类标注问题,包括转录错误、缺失文本区域、对话与拟声词重叠以及未分割的对话气泡。为了解决这些问题,我们结合基于OCR的问题检测和人工修订,构建了Manga109-v2026,修订了大约29,000个对话标注。我们的修订使Manga109更好地适应现代OCR和多模态漫画理解系统,同时保留了漫画特有的表达结构。

英文摘要

Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains inaccurate transcriptions and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including inaccurate transcriptions, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.

10. 低层视觉、计算成像与图像增强 8 篇

2606.14071 2026-06-15 cs.CV 新提交

ShearFuse-UNet: Hadamard, DCT, and Shearlet Transform Fusion for Next-Day Wildfire Spread Prediction

ShearFuse-UNet: Hadamard、DCT和Shearlet变换融合用于次日野火蔓延预测

Ene Meco, Yingyi Luo, Emadeldeen Hamdan, Adam Watts, Ahmet Enis Cetin

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) US Forest Service, Pacific Wildland Fire Science Laboratory(美国林务局太平洋野火科学实验室)

AI总结 提出ShearFuse-UNet,一种轻量级深度学习模型,通过融合WHT、DCT和Shearlet变换分支,在U-Net编码器中实现多模态卫星数据的次日野火蔓延预测,以267k参数达到0.596 F1分数,优于ResNet18 U-Net。

详情
AI中文摘要

我们提出了ShearFuse-UNet,一种轻量级且计算高效的深度学习模型,用于从多模态卫星数据预测次日野火蔓延。该模型在U-Net骨干网络的每个编码器块内集成了三个互补的变换域分支:二维快速沃尔什-阿达玛变换(WHT)分支、二维离散余弦变换(DCT)分支和锥自适应数字Shearlet残差分支。WHT和DCT分支通过可学习的频谱缩放和固定的软阈值建立正交潜在空间,而Shearlet分支提供各向异性的多方向特征分解,显式编码火线特有的细长边缘结构。一个学习的SpectralFusion门自适应地组合WHT和DCT响应,并将Shearlet重构作为残差添加。这种三分支设计与Transformer自注意力有松散的结构类比:WHT和DCT分支提供自适应融合的互补频谱表示,而Shearlet分支通过残差路径贡献方向内容。与自注意力不同,所提出的设计依赖于固定的数学变换而非学习的投影算子,减少了参数数量和计算成本。在WildfireSpreadTS数据集上评估,ShearFuse-UNet仅用267k参数就达到了0.596的F1分数,优于基于ResNet18的U-Net(14M参数,F1=0.589),展示了非常有利的精度-效率权衡。在Google Next-Day Wildfire Spread数据集上的结果进一步验证了这些发现。

英文摘要

We propose ShearFuse-UNet, a lightweight and computationally efficient deep learning model for next-day wildfire spread prediction from multi-modal satellite data. The model integrates three complementary transform-domain branches inside each encoder block of a U-Net backbone: a 2D Fast Walsh-Hadamard Transform (WHT) branch, a 2D Discrete Cosine Transform (DCT) branch, and a cone-adapted digital Shearlet residual branch. The WHT and DCT branches establish orthogonal latent spaces with learnable spectral scaling and fixed soft-thresholding, while the Shearlet branch provides anisotropic, multi-directional feature decomposition that explicitly encodes the elongated edge structures characteristic of fire fronts. A learned SpectralFusion gate adaptively combines the WHT and DCT responses, and the Shearlet reconstruction is added as a residual. This three-branch design bears a loose structural analogy to transformer self-attention: the WHT and DCT branches provide complementary spectral representations that are adaptively fused, while the Shearlet branch contributes directional content through a residual pathway. Unlike self-attention, the proposed design relies on fixed mathematical transforms rather than learned projection operators, reducing parameter count and computational cost. Evaluated on the WildfireSpreadTS dataset, ShearFuse-UNet achieves an F1 score of 0.596 with only 267k parameters, outperforming a ResNet18-based U-Net (14M parameters, F1 = 0.589) and demonstrating a highly favorable accuracy-efficiency trade-off. Results on the Google Next-Day Wildfire Spread dataset further validate these findings across a different benchmark.
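The WHT branch (fixed orthogonal transform, learnable spectral scaling, soft-thresholding, inverse transform) can be illustrated in one dimension. This is a simplified sketch, not the paper's layer: it is 1D rather than 2D, the scales are passed in as a plain list instead of being trained, and `wht_branch` is a hypothetical name.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of two."""
    x = list(values)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    return x


def soft_threshold(x, t):
    """Shrink each coefficient toward zero by t, zeroing those with magnitude below t."""
    return [max(abs(v) - t, 0.0) * (1 if v >= 0 else -1) for v in x]


def wht_branch(signal, scales, threshold):
    """Transform, scale per coefficient, soft-threshold, then invert the transform."""
    n = len(signal)
    spectrum = fwht(signal)
    scaled = [s * c for s, c in zip(scales, spectrum)]
    kept = soft_threshold(scaled, threshold)
    return [v / n for v in fwht(kept)]  # the WHT is its own inverse up to 1/n
```

The transform itself has no trainable weights, which is consistent with the abstract's point that fixed transforms keep the parameter count low; only the per-coefficient scales would be learned.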

2606.14619 2026-06-15 cs.CV 新提交

StereoGeo: an end-to-end stereo camera calibration method

StereoGeo:一种端到端立体相机标定方法

Imane Meddour, Andréa Macario Barros, Cédric Gouy-Pailler

发表机构 * IMB - Institut Mines-Télécom, Paris, France(IMB - 巴黎理工大学)

AI总结 提出端到端网络StereoGeo,通过集成深度特征提取与可微优化器,同时估计双目相机的焦距、重力方向和相对外参,在真实基准上实现竞争性的内参标定和准确的外参估计。

Comments 5 pages, 1 figure, accepted at the 34th European Signal Processing Conference (EUSIPCO 2026)

详情
AI中文摘要

在这项工作中,我们提出了StereoGeo,一种基于端到端网络的立体相机标定方法。我们的方法估计左右相机的焦距和重力方向,以及它们之间的相对外参变换。现有方法通常依赖于结构化环境中的标定图案,或者仅处理单个相机配置,局限于内参或外参估计,并依赖于多视图设置。StereoGeo扩展了GeoCalib算法,将深度神经网络特征提取与可微优化器相结合。在真实世界基准上的大量实验表明,StereoGeo在内参标定方面取得了具有竞争力的性能,并提供了准确的立体外参估计,优于现有仅限于单目设置的方法。本工作中使用的数据集部分公开于 https://github.com/meddourimane/StereoGeo-dataset

英文摘要

In this work, we propose StereoGeo, an end-to-end network-based approach for stereo camera calibration. Our method estimates the focal lengths and gravity directions of the left and right cameras, as well as the relative extrinsic transformation relating them. Existing methods often rely on calibration patterns in structured environments or address only a single camera configuration, being limited to either intrinsic or extrinsic estimation and depending on multi-view setups. StereoGeo extends the GeoCalib algorithm, integrating deep neural network feature extraction with a differentiable optimizer. Extensive experiments on real-world benchmarks demonstrate that StereoGeo achieves competitive performance for intrinsic calibration and provides accurate stereo extrinsic estimation, outperforming existing methods that are limited to monocular settings. The dataset used in this work is partially publicly available at https://github.com/meddourimane/StereoGeo-dataset.

2606.14638 2026-06-15 cs.CV astro-ph.EP 新提交

Improving Lunar Topography with Deep Learning Schrödinger Bridges

利用深度学习薛定谔桥改进月球地形

Matthew Repasky, Erwan Mazarico, Michael K. Barker, Stefano Bertone, Terence J. Sabaka, Yao Xie

发表机构 * H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology(佐治亚理工学院H. Milton Stewart工业与系统工程系) NASA Goddard Space Flight Center(美国国家航空航天局戈达德太空飞行中心) Center for Research and Exploration in Space Science and Technology (CRESST II), University of Maryland, College Park(马里兰大学帕克分校空间科学与技术研究与探索中心(CRESST II)) National Institute for Astrophysics (INAF), Astrophysical Observatory of Turin(意大利国家天体物理研究所(INAF)都灵天体物理天文台)

AI总结 提出基于扩散薛定谔桥的生成模型,结合光学影像约束,实现月球地形超分辨率重建,并提供像素级不确定性估计。

详情
Journal ref
The Planetary Science Journal 7.6 (2026): 139
AI中文摘要

提高行星地形模型的分辨率可以更好地理解表面过程和地貌;然而,现有的解析超分辨率方法成本高昂且难以大规模应用。生成模型提供了学习数据中复杂关系的工具,并且由于硬件加速器和并行化,可以大规模应用。我们提出了一种基于扩散的薛定谔桥(SB)生成建模方法,用于月球地形超分辨率,连接低分辨率地形分布与高分辨率地形分布,并结合物理约束的光学影像。我们的方法受到现有形状重建方法的启发,这些方法通过使用目标分辨率的光学图像来改进先验的低分辨率地形。我们在一个新颖的渲染月球地形数据集上训练SB,模拟来自月球勘测轨道器窄角相机的光学影像。结果是一种灵活的地形超分辨率方法,可以在重建中提供像素级的不确定性。

英文摘要

Increasing the resolution of planetary topography models can enable a better understanding of surface processes and geomorphology; however, existing analytical super-resolution methods are expensive and difficult to apply at large scales. Generative models provide the tools to learn complex relationships within data and can be applied at scale due to hardware accelerators and parallelization. We present a diffusion-based Schrödinger Bridge (SB) generative modeling approach for lunar topography super-resolution, connecting the distribution of low-resolution topography to that of high-resolution topography, incorporating physically-constraining optical imagery. Our approach is inspired by existing Shape-from-Shading methods, which improve a priori low-resolution topography by using optical images at the target resolution. We train SBs on a novel dataset of rendered lunar topography, emulating optical imagery from the Lunar Reconnaissance Orbiter Narrow Angle Camera. The result is a flexible approach for topography super-resolution which can provide pixel-level uncertainties in the reconstruction.

2606.13957 2026-06-15 eess.IV cs.CV cs.MM 交叉投稿

High-Fidelity Video Compression based on Invertible Neural Transform and Implicit Conditioning

基于可逆神经变换和隐式条件的高保真视频压缩

Siyue Teng, Ho Man Kwan, Yuxuan Jiang, Fan Zhang, David Bull

发表机构 * Visual Information Lab, University of Bristol, UK(布里斯托大学视觉信息实验室)

AI总结 提出InnVC,一种基于可逆神经网络和隐式条件场的视频编解码器,通过保持可逆主变换路径并解耦相关内容和细节,在高质量区域实现显著性能提升。

详情
AI中文摘要

基于学习的视频压缩最近在率失真性能上已与传统视频编解码器相媲美。然而,大多数现有方法依赖于不可逆的分析-合成变换,重建质量受到量化和变换近似误差的双重影响。在高质量点,量化误差较小,变换引起的失真占主导地位,这一限制尤为突出。为此,我们提出InnVC,一种基于可逆神经网络的视频编解码器,用于宽范围和高保真压缩。核心思想是在量化前保留可逆的主变换路径,同时通过紧凑的隐式条件场注入内容自适应上下文。这将强相关的视频内容与难以建模的细节解耦,使不同组件专门负责互补的重建任务,从而实现更高效的压缩。为进一步提高可压缩性,我们引入了一种调度掩码策略,逐步将信息内容集中到更少的潜在通道中,以实现更有效的熵编码。在UVG和MCL-JCV基准上的实验表明,InnVC在广泛的质量范围内实现了强大的压缩性能,在高质量区域尤为有效,相对于x265在UVG上PSNR的BD率降低21.66%,MS-SSIM降低46.06%。据我们所知,InnVC是首个在单一架构尺度内覆盖从低比特率到高保真操作点的神经视频编解码器,PSNR跨度超过20 dB。

英文摘要

Learning-based video compression has recently achieved competitive rate-distortion performance compared to conventional video codecs. However, most existing methods rely on non-invertible analysis-synthesis transforms, with reconstruction quality subject to both quantization and transform approximation errors. This limitation becomes particularly restrictive at higher quality points, where quantization errors are small and transform-induced distortion dominates. To address this, we propose InnVC, an Invertible neural network based Video Codec for wide-range and high-fidelity compression. The core idea is to preserve an invertible main transform path prior to quantization, while injecting content-adaptive context through a compact implicit conditioning field. This decouples strongly correlated video content from harder-to-model fine details, allowing different components to specialize in complementary reconstruction tasks for more efficient compression. To further improve compressibility, we introduce a scheduled masking strategy that progressively concentrates informative content into fewer latent channels for more effective entropy coding. Experiments on the UVG and MCL-JCV benchmarks show that InnVC achieves strong compression performance over a broad quality range, being particularly effective in the high-quality regime, yielding BD-rate reductions of 21.66% in PSNR and 46.06% in MS-SSIM relative to x265 on UVG. To the best of our knowledge, InnVC is the first neural video codec that covers operating points from low bitrate to high fidelity within a single architecture scale, spanning more than 20 dB in PSNR.
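Why an invertible transform removes the transform-approximation error can be seen with an additive coupling layer, a standard invertible building block. The abstract does not describe InnVC's actual layers, so this is a stand-in under assumptions: `toy_shift` is a hypothetical placeholder for a learned sub-network, and the function names are illustrative.

```python
def coupling_forward(x1, x2, shift_fn):
    """Additive coupling: pass x1 through unchanged, shift x2 by a function of x1."""
    return list(x1), [b + s for b, s in zip(x2, shift_fn(x1))]


def coupling_inverse(y1, y2, shift_fn):
    """Exact inverse: recompute the same shift from y1 and subtract it."""
    return list(y1), [b - s for b, s in zip(y2, shift_fn(y1))]


def toy_shift(half):
    """Placeholder for a learned network; it does not need to be invertible itself."""
    return [3 * v * v + 1 for v in half]
```

Since the inverse recomputes the identical shift, the round trip is lossless before quantization, so any remaining distortion would come from quantization alone, which matches the motivation stated for the high-quality regime.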

2606.14248 2026-06-15 eess.IV cs.CV 交叉投稿

Spectrum Aware Illumination Estimation Using Multispectral Image

利用多光谱图像的光谱感知光照估计

Hyejin Oh, Woo-Shik Kim, Sangyoon Lee, YungKyung Park, Je-Won Kang

发表机构 * Department of Electronic and Electrical Engineering, Ewha W. University(梨花女子大学电子与电气工程系) Telechips Samsung Advanced Institute of Technology(三星先进技术研究所) Department of Design, Ewha W. University(梨花女子大学设计系)

AI总结 提出一种结合光谱注意力机制和光照先验的深度学习框架,通过时空光谱特征提取块和跨传感器域变换,实现高精度光照谱估计,并在真实多光谱数据集上验证了优越性。

Comments Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). DOI: 10.1109/TCSVT.2026.3701975

详情
AI中文摘要

多光谱成像通过捕获更多光谱波段扩展了传统的RGB成像,从而改进了光照谱估计。然而,现有方法往往未能充分利用光谱信息,导致在不同光照条件和不同传感器域下性能欠佳。因此,我们提出了一种具有时空光谱特征提取块的深度学习框架,该框架结合了光谱注意力机制以增强光谱相关性并保留与光照相关的空间特征。通过引入光照先验,我们的方法优先考虑在多光谱图像中提供更有意义信息的特定通道。我们还提出了跨不同多光谱传感器空间的光谱域变换。结果表明,在高维传感器空间中学习到的光照谱可以有效地变换到各种低维相机传感器空间,而无需任何额外训练。为了便于评估,我们引入了一个真实世界的多光谱数据集,其中包含在不同光照条件下捕获的高维真实光照谱。通过大量实验,我们证明了我们的方法相比现有模型实现了更高的准确性,从而为现实世界的光照谱估计提供了实用解决方案。代码和数据集可在 https://github.com/hyejin5/Spectrum-Aware-Illumination-Estimation-Using-Multispectral-Image 获取。

英文摘要

Multispectral (MS) imaging extends beyond conventional RGB imaging by capturing more spectral bands, thereby improving illuminant spectrum estimation (ISE). However, existing methods often fail to fully exploit spectral information, resulting in suboptimal performance under diverse lighting conditions and across different sensor domains. Hence, we propose a deep learning framework with a spatio-spectral feature extraction block, which incorporates spectral attention mechanisms to enhance spectral correlation and preserve illuminant-relevant spatial features. Through the inclusion of an illuminant prior (IP), our approach prioritizes specific channels that provide more meaningful information in an MS image. We also propose a spectral-domain transform across different MS sensor spaces. The results demonstrate that illuminant spectra learned in high-dimensional sensor spaces can be effectively transformed to various lower-dimensional camera sensor spaces without any additional training. To facilitate evaluation, we introduce a real-world MS dataset containing high-dimensional ground-truth illumination spectra captured under diverse lighting conditions. Through extensive experiments, we demonstrate that our method achieves superior accuracy compared to existing models, thus providing a practical solution for real-world ISE. The code and dataset are available at https://github.com/hyejin5/Spectrum-Aware-Illumination-Estimation-Using-Multispectral-Image.

2601.21179 2026-06-15 cs.CV 版本更新

Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process

通过全局几何感知扩散过程增强水下光场图像

Yuji Lin, Qian Zhao, Zongsheng Yue, Junhui Hou, Deyu Meng

发表机构 * School of Mathematics and Statistics, Xi’an Jiaotong University(西安交通大学数学与统计学学院) School of Mathematics and Statistics and the Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi’an Jiaotong University(西安交通大学数学与统计学学院和教育部智能网络与网络安全重点实验室) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) Macao Institute of Systems Engineering, Macau University of Science and Technology(澳门科技大学澳门系统工程研究院)

AI总结 提出基于扩散的GeoDiff-LF框架,利用空间-角度结构增强水下4D光场成像,通过改进U-Net、几何引导损失和优化采样策略,有效缓解颜色失真,在视觉保真度和定量性能上超越现有方法。

Comments 14 pages, 9 figures

详情
AI中文摘要

本文研究了通过4D光场(LF)成像获取高质量水下图像的挑战性问题。为此,我们提出了GeoDiff-LF,一种基于SD-Turbo的新型扩散框架,通过利用其空间-角度结构来增强水下4D LF成像。GeoDiff-LF包含三个关键改进:(1)改进的U-Net架构,带有卷积和注意力适配器以建模几何线索;(2)使用张量分解和渐进加权的几何引导损失函数以正则化全局结构;(3)优化的采样策略与噪声预测以提高效率。通过整合扩散先验和LF几何,GeoDiff-LF有效缓解了水下场景中的颜色失真。大量实验表明,我们的框架在视觉保真度和定量性能上均优于现有方法,推动了水下成像增强的最新进展。代码将在 https://github.com/linlos1234/GeoDiff-LF 公开。

英文摘要

This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.

2604.26740 2026-06-15 cs.CV cs.GR 版本更新

Rendering-Aware Sparse Sampling for BRDF Acquisition

面向BRDF采集的渲染感知稀疏采样

W. Cao, D. Jönsson, Z. Huang, J. Unger

发表机构 * Media and Information Technology, Department of Science and Technology, Linköping University(林雪平大学科学与技术系媒体与信息技术部)

AI总结 提出一种渲染感知的稀疏采样方法,通过可微渲染器优化采样方向,以最少BRDF测量实现高质量材质外观重建。

详情
AI中文摘要

精确的BRDF采集对于真实感渲染至关重要,但密集的测角光度计测量既缓慢又昂贵。我们研究如何选择一小部分BRDF测量,这些测量在学习的BRDF先验下对重建材质外观最具信息量。现有的稀疏采集方法通常优化所有材质的BRDF空间重建样本,而自适应测量的感知重要性最终取决于其对每个渲染外观的影响。因此,我们将稀疏自适应采集表述为一个渲染感知的优化问题。我们的方法结合了用于稀疏坐标-值观测的集合编码器、基于预训练超网络/PCA的BRDF重建器以及可微渲染器。在采样器训练期间,重建器保持固定,来自渲染图像损失的梯度优化测量位置。这将采集设计与先验拟合分离,并鼓励采样器选择在学习材质分布下信息量大的方向。为了使比较受控,我们在匹配的样本数量、训练/测试分割、渲染场景、对象掩码、图像映射和指标下评估均匀基线、元学习方法、HyperBRDF方法和我们学习的采样器。我们的核心主张是:当最终渲染外观是目标时,渲染感知采样改进了极其稀疏的BRDF采集。BRDF空间和组合损失仅作为消融实验报告,同时包括联合优化和仅图像潜在拟合以处理未见过的材质。

英文摘要

Accurate BRDF acquisition is essential for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small set of BRDF measurements that is most informative for reconstructing material appearance under a learned BRDF prior. Existing sparse-acquisition methods often optimize samples for BRDF-space reconstruction for all materials, while the perceptual importance of an adaptive measurement ultimately depends on its effect on each rendered appearance. We therefore formulate sparse adaptive acquisition as a rendering-aware optimization problem. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based/PCA-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor remains fixed, and gradients from a rendered-image loss optimize the measurement locations. This separates acquisition design from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. To make the comparison controlled, we evaluate the uniform baseline, meta-learning method, HyperBRDF method, and our learned sampler under matched sample numbers, train/test split, rendering scene, object mask, image mapping, and metrics. Our central claim: rendering-aware sampling improves extremely sparse BRDF acquisition when final rendered appearance is the target. BRDF-space and combined losses are reported only as ablations, together with joint refinement and image-only latent fitting for unseen materials.

2605.28477 2026-06-15 cs.CV 版本更新

SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

SA4Depth: 自监督单目深度估计中一致的姿态-深度尺度对齐

Changxuan Li, Nadine Berner, Nassir Navab, Federico Tombari, Stefano Gasperini

发表机构 * Technical University of Munich(慕尼黑工业大学) BMW Group(宝马集团) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML)) Google(谷歌) VisualAIs Labs GmbH(VisualAIs实验室 GmbH)

AI总结 提出SA4Depth方法,通过可微的视觉特征重投影和姿态细化,对齐自监督深度估计中深度网络和姿态网络估计的场景尺度,提升深度预测精度且不增加推理时间。

Comments Accepted by IEEE RA-L 2026

详情
AI中文摘要

从单目序列进行自监督深度估计依赖于深度网络和姿态网络的联合学习。尽管已有大量研究致力于改进深度网络,但对姿态的努力仍然有限。在此背景下,即使深度估计达到尺度级别,我们强调了姿态网络和深度网络估计的场景尺度之间对齐的重要性。然后,我们引入了SA4Depth,一种改善这种对齐并提升深度预测的方法,同时保持推理时间不变。我们提出的方法利用训练期间估计的深度,跨连续帧重投影可学习的视觉特征,并通过减少特征对齐残差来细化姿态估计。通过我们的方法,由独立的深度网络和姿态网络估计的场景尺度得以对齐,并且不同序列之间的预测尺度一致性得到改善。我们的可微细化无缝集成到现有的自监督流程中,并显著改善了它们的深度估计。我们在KITTI、Cityscapes和NYUv2上进行了广泛的室外和室内实验,证明了这一点。此外,KITTI Odometry上的结果证实了我们姿态细化的有效性。我们的代码可在https://github.com/Runningchauncey/SA4Depth获取。

英文摘要

Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth and a pose network. Despite abundant research done to improve the depth network, efforts on the pose remain limited. In this context, even when depth is estimated up to scale, we highlight the importance of the alignment between the scene scales estimated by the pose and depth nets. Then, we introduce SA4Depth, an approach to improve this alignment and boost the depth predictions while keeping the inference time unchanged. Our proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refine the pose estimates by reducing feature alignment residuals. With our method, the estimated scene scales by the separate depth and pose networks are aligned, and the prediction scale consistency is improved across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth .
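Why the depth and pose scales need to agree can be seen from the standard pinhole reprojection used in self-supervised pipelines. The sketch below is a generic illustration, not SA4Depth's feature-based refinement; the intrinsics, pose, and pixel values are hypothetical. Scaling depth and translation by the same factor leaves the reprojected pixel unchanged, while scaling only the depth shifts it and therefore produces an alignment residual.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    # Pixel plus depth -> 3D point in the source camera frame.
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

def transform(point, R, t):
    # Apply the relative camera pose (rotation R, translation t).
    return [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]

def project(point, fx, fy, cx, cy):
    # 3D point in the target camera frame -> pixel coordinates.
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def reproject(u, v, depth, R, t, intr):
    return project(transform(backproject(u, v, depth, *intr), R, t), *intr)

intr = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
t = [0.1, 0.0, 0.0]                  # hypothetical sideways translation

base = reproject(400.0, 300.0, 5.0, R, t, intr)
# Depth and translation scaled together: same reprojected pixel.
both_scaled = reproject(400.0, 300.0, 10.0, R, [2.0 * c for c in t], intr)
# Depth scaled alone: the pixel moves, i.e. mismatched scales leave a residual.
depth_only = reproject(400.0, 300.0, 10.0, R, t, intr)
```

The comparison between `both_scaled` and `depth_only` is the scale ambiguity in miniature: photometric or feature residuals vanish only when the two networks share one scene scale.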

11. 鲁棒性、安全、隐私与可信视觉 8 篇

2606.13870 2026-06-15 cs.CV cs.AI cs.LG 新提交

Mirage Probes: How Vision Models Fake Visual Understanding

幻象探针:视觉模型如何伪造视觉理解

Daniel Ben-Levi, Judah Goldfeder, Weiliang Zhao, Raz Lapid, Amit LeVi, Allen G. Roush, Ravid Shwartz-Ziv, Hod Lipson

发表机构 * Columbia University(哥伦比亚大学) Intuit Technion(以色列理工学院) Thoughtworks New York University(纽约大学)

AI总结 提出幻象探针框架,通过对比探针揭示视觉语言模型在无图像时也能回答问题的两种幻象行为:文本偏见和虚假图像,并证明后者需要表征级干预。

详情
AI中文摘要

视觉语言模型(VLM)即使在没有提供图像的情况下,也能自信且通常正确地回答基于图像的问题。这种幻象行为会虚增基准分数,而不反映视觉基础。先前的工作将其视为单一故障模式。我们认为这是两种。使用幻象探针(Mirage Probes),一种对比探针框架,将释义的问题变体与同一图像上的匹配幻象和非幻象标签配对,我们展示了在两个开源VLM中,幻象行为可以从残差流、MLP、后注意力和注意力头位置的内部激活中线性解码。我们证明朴素贝叶斯文本基线无法恢复此信号,排除了表面词汇混淆。跨基准可分离性模式,连同一种新颖的先验利用指数(PHI),衡量模型仅从文本中回答的程度,揭示了两种不同的机制:文本偏见,其中模型从语言先验中回答而不涉及视觉表征;以及虚假图像,其中模型在潜在空间中构建虚假视觉内容并像有基础一样回答。这种区别有直接的缓解后果:文本分布清理可以解决第一种机制,但无法触及第二种,因为虚假图像幻象存在于模型的视觉表征中而非文本中。忠实的视觉基础将需要在表征层面进行干预。

英文摘要

Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model's visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.

2606.14230 2026-06-15 cs.CV cs.CL 新提交

A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

面向不同生成器的可泛化深度伪造检测的多域特征融合框架

Amna Amjid, Sana Qadir, Mehwish Fatima, Raja Khurram Shahzad

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan(巴基斯坦国立科技大学电气工程与计算机科学学院) Department of Communication, Quality Management and Information Systems, Mid Sweden University, Ostersund Campus, Sweden(瑞典中部大学通信、质量管理和信息系统系,厄斯特松德校区) Department of Computer Science, Electrical and Space Engineering, Lulea University of Technology, Luleå, Sweden(瑞典吕勒奥理工大学计算机科学、电气与空间工程系)

AI总结 提出SGFF-Net,融合空间、梯度和DWT频率表示,在双残差学习架构中实现跨生成器和跨范式的深度伪造检测,准确率提升至79.80%。

详情
AI中文摘要

深度伪造是人工生成的图像、音频或视频,威胁隐私、安全和信息完整性。检测此类内容对于打击虚假信息至关重要,因为最新模型能生成高度逼真的内容。虽然基于空间或频率的方法在基于生成对抗网络(GANs)的深度伪造上取得了良好的检测率,但它们往往难以处理最近的扩散模型生成的图像。特别是,现有方法很少利用互补的多域表示或系统地评估跨生成器的鲁棒性。为了解决这些挑战,我们提出了一种多域深度伪造检测框架SGFF-Net(空间-梯度-频率融合网络),该网络在双残差学习架构中集成了空间、梯度和基于DWT(离散小波变换)的频率表示。实验结果表明,SGFF-Net在数据集内评估中达到了98.95%的准确率,并在跨模型(70.46%)和跨范式(69.94%)设置中提升了性能。结合多源训练和数据增强进一步增强了鲁棒性,在跨模型评估中准确率从70.46%提升到79.80%,在跨范式评估中从69%提升到78%,在真实世界数据上从61.50%提升到75.80%。与单域检测器不同,SGFF-Net在空间、梯度和小波频率域中学习互补的取证线索,从而在跨生成器和跨范式评估中具有更强的鲁棒性。结果进一步表明,将多域表示与数据多样性和增强相结合,显著提高了泛化能力,为开发更可靠的深度伪造检测系统提供了实用见解。

英文摘要

Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly realistic content. While spatial- or frequency-based approaches achieve good detection rates on deepfakes generated by Generative Adversarial Networks (GANs), they often struggle with images generated by recent diffusion models. In particular, existing approaches rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness. To address these challenges, we propose a multi-domain deepfake detection framework called SGFF-Net (Spatial-Gradient-Frequency Fusion Network) that integrates spatial, gradient, and DWT (Discrete Wavelet Transform)-based frequency representations within a dual residual learning architecture. Experimental results show that the SGFF-Net achieves 98.95% accuracy in intra-dataset evaluation and improves performance in both cross-model (70.46%) and cross-paradigm (69.94%) settings. Incorporating multi-source training and data augmentation further enhances robustness, increasing accuracy from 70.46% to 79.80% in cross-model evaluation, from 69% to 78% in cross-paradigm evaluation, and from 61.50% to 75.80% on real-world data. Unlike single-domain detectors, the SGFF-Net learns complementary forensic cues across spatial, gradient, and wavelet-frequency domains, resulting in greater robustness under cross-generator and cross-paradigm evaluation. The results further show that combining multi-domain representations with data diversity and augmentation substantially improves generalization, providing practical insights for developing more reliable deepfake detection systems.
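As an illustration of the DWT-based frequency representation mentioned above, the following sketch computes a one-level 2D Haar transform in plain Python. The choice of the Haar wavelet, the orthonormal scaling, and the sub-band names are assumptions made for the example (naming conventions for the detail bands vary); the abstract does not state which wavelet SGFF-Net uses.

```python
def haar_dwt2(img):
    # One-level 2D Haar transform on an image with even height and width.
    # Returns four sub-bands, each half the size of the input.
    LL, LH, HL, HH = [], [], [], []
    for i in range(0, len(img), 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll.append((a + b + c + d) / 2)  # local average (low frequency)
            lh.append((a - b + c - d) / 2)  # left-right differences
            hl.append((a + b - c - d) / 2)  # top-bottom differences
            hh.append((a - b - c + d) / 2)  # diagonal differences
        LL.append(ll)
        LH.append(lh)
        HL.append(hl)
        HH.append(hh)
    return LL, LH, HL, HH

def energy(band):
    return sum(v * v for row in band for v in row)

# Toy 4x4 image with a vertical edge after the first column.
img = [[0, 8, 8, 8] for _ in range(4)]
LL, LH, HL, HH = haar_dwt2(img)
total_in = energy(img)
total_out = energy(LL) + energy(LH) + energy(HL) + energy(HH)
```

With this scaling the transform is orthonormal, so `total_in` equals `total_out`; the edge shows up only in the detail band that measures left-right differences, which is the kind of localized high-frequency cue a frequency branch can pick up.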

2606.14351 2026-06-15 cs.CV 新提交

ForceForget: Reinforcement Concept Removal for Enhancing Safety in Text-to-Image Models

ForceForget: 通过强化概念移除增强文本到图像模型的安全性

Dong Han, Yong Li

发表机构 * Dong Han(董汉) Yong Li(李勇)

AI总结 针对文本到图像模型生成不安全内容的问题,提出基于强化学习优化概念擦除奖励的方法,通过安全适配器调节文本嵌入,在消除不安全内容的同时保持模型对安全语义的生成能力。

Comments Accepted to ICML 2026

详情
AI中文摘要

随着生成式AI的发展,文本到图像(T2I)模型能够生成各种内容。然而,T2I模型仍可能生成不安全内容。为缓解此问题,研究者提出了多种概念擦除方法。但现有方法倾向于过度擦除不安全概念,并抑制有害提示中的良性概念,从而对模型效用产生负面影响。本文中,我们专注于在消除不安全内容的同时,通过强化学习优化概念擦除奖励(CER)来保持模型对安全语义解释的能力。为避免过度内容擦除,我们引入安全适配器(Safe Adapter)来投影部分文本嵌入,以在交叉注意力层中实现高效的概念调节。在不同数据集上进行的大量实验表明,与现有最先进(SOTA)的概念擦除方法相比,所提方法在减轻不安全内容生成的同时,能保持良性图像的高保真度。在鲁棒性方面,我们的方法在对抗红队工具时优于其他方法。此外,我们展示了所提方法在新兴的图像到图像(I2I)场景中比其他方法更有效。最后,我们将方法扩展到擦除一般概念,如艺术风格和物体。免责声明:本文包含可能对某些读者造成冒犯的性露骨内容讨论。本工作中使用的所有图像均为合成图像或来自公共数据集。

英文摘要

With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, T2I models can still generate unsafe content. To alleviate this issue, various concept erasing methods have been proposed. However, existing methods tend to excessively erase unsafe concepts and suppress benign concepts contained in harmful prompts, which can negatively affect model utility. In this paper, we focus on eliminating unsafe content while maintaining model capability in safe semantic meaning interpretation by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid excessive content erasure, we introduce the Safe Adapter to project partial text embedding for efficient concept regulation in cross-attention layers. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of benign images compared with existing state-of-the-art (SOTA) concept erasing methods. In terms of robustness, our method outperforms counterparts against red-teaming tools. Moreover, we show that the proposed approach is more effective than others in emerging image-to-image (I2I) scenarios. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to certain readers. All images used in this work are synthesized or from public datasets.

2606.14504 2026-06-15 cs.CV 新提交

Scratched Lenses, Shifted Depth: Passive Camera-Side Optical Attacks

划痕镜头,深度偏移:被动式相机侧光学攻击

Qinlin He, Zeming Zhuang, Yongji Wu, Lan Zhang, Xiaoyong Yuan

AI总结 提出一种被动式镜头划痕攻击SLASH,通过光学伪影扭曲深度线索,在单目深度估计和3D目标检测中实现高达32%的相对深度偏移。

详情
AI中文摘要

视觉系统上的物理对抗攻击通常通过场景操纵进行研究,例如对抗性补丁或投影,其中攻击者控制相机观察的内容。使用贴纸或辅助光学的相机侧攻击也已被探索,但它们将攻击视为来自设计模式的图像空间扰动。这忽略了物理缺陷如何与场景相关的光照和光学相互作用。我们识别出一种威胁:被动的镜头侧损伤,它持久存在但具有触发条件,产生在特定视觉条件下偏置几何推理的光学伪影。我们通过划痕诱导的镜头对抗性条纹劫持(SLASH)实例化这种威胁,这是一种由相机镜头或保护罩上的小划痕引起的物理世界攻击。划痕与明亮光源和镜面反射相互作用,产生扭曲深度线索的结构化条纹伪影。由于扰动在光路中固定但由场景触发,它既持久又具有选择性。我们在光学空间中制定攻击,将划痕模式建模为触发条件光学通道,并优化一个固定配置以适应不同的观看条件。我们在数字和真实世界环境中评估SLASH对单目深度估计和单目3D目标检测的效果。在固定划痕约束下,单目深度估计的方向性深度偏移达到高达32%的相对误差,对单目3D目标检测具有一致的影响。物理实验证实了向真实相机记录的迁移,诱导的深度偏移高于模型的自然预测基线。这些发现揭示了一个攻击面,其中看似无害的硬件缺陷充当潜在的、场景触发的对抗机制,挑战了关于物理鲁棒性的假设,并激励了安全视觉系统的防御措施。

英文摘要

Physical adversarial attacks on vision systems are typically studied through scene manipulation, such as adversarial patches or projections, where the adversary controls what the camera observes. Camera-side attacks using stickers or auxiliary optics have also been explored, but they treat attacks as image-space perturbations from designed patterns. This misses how physical imperfections interact with scene-dependent lighting and optics. We identify a threat: passive lens-side damage that is persistent yet trigger-conditioned, producing optical artifacts that bias geometric inference under particular visual conditions. We instantiate this threat through Scratch-induced Lens Adversarial Streak Hijacking (SLASH), a physical-world attack caused by small scratches on a camera lens or protective cover. Scratches interact with bright light sources and specular reflections to create structured streak artifacts that distort depth cues. Since the perturbation is fixed in the optical path but triggered by the scene, it is both persistent and selective. We formulate the attack in optical space, model the scratch pattern as a trigger-conditioned optical channel, and optimize one fixed configuration across diverse viewing conditions. We evaluate SLASH on monocular depth estimation and monocular 3D object detection in digital and real-world settings. Under the fixed-scratch constraint, directional depth shifts reach up to 32% relative error for monocular depth estimation, with consistent effects on monocular 3D object detection. Physical experiments confirm transfer to real camera recordings, inducing depth shifts above the model's natural prediction baseline. These findings reveal an attack surface where benign-looking hardware imperfections act as latent, scene-triggered adversarial mechanisms, challenging assumptions about physical robustness and motivating defenses for secure vision systems.

2606.14658 2026-06-15 cs.CV cs.AI 新提交

Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications

给AI带来头痛:针对计算机视觉应用的声学对抗攻击

Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究利用低频声波(<20 kHz)引起相机物理振动,导致AI视觉模型(如YOLO11)误分类、漏检或产生幻觉,并分析了影响攻击效果的因素。

Comments 9 pages, 7 figures, SPIE Defense + Security

详情
Journal ref
Proc. SPIE 14046, Assurance and Security for AI-enabled Systems 2026, 1404609 (10 Jun 2026)
AI中文摘要

人工智能(AI)越来越多地被用于自动化各种现实世界的计算机视觉(CV)应用,如自动驾驶车辆控制、面部识别和安全摄像头。最近的研究表明,声学振动可以引起相机真实的物理运动,干扰其内部稳定机制。由于这种运动超出了稳定系统设计处理的条件,系统会在帧中引入伪影,导致基于AI的CV模型误分类、错过目标或产生幻觉对象。先前的工作使用超声波频率(>20 kHz)进行短距离攻击,由于高频的衰减,这些攻击仅限于短距离。在这项工作中,我们研究了使用可听范围内较低频率(<20 kHz)的声学攻击,并进一步扩展了我们的分析,包括各种图像和物体特征如何受到攻击的影响。具体来说,我们进行了物理实验,通过用各种频率共振商用相机,证明了我们的攻击对现成目标检测模型(YOLO11)的可行性。基于我们的结果,我们提供了关于使AI CV系统更容易受到这些攻击的几个因素的见解,这可能有助于未来缓解策略的开发。

英文摘要

Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz), which limits the attacks to short distances because high frequencies attenuate quickly. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.

2406.09250 2026-06-15 cs.CV cs.AI cs.LG 版本更新

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

MirrorCheck: 视觉-语言模型的高效对抗防御

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) NVIDIA École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) Michigan State University(密歇根州立大学)

AI总结 提出MirrorCheck框架,利用文本到图像模型和随机化策略检测并防御针对视觉-语言模型的自适应对抗攻击。

详情
AI中文摘要

视觉-语言模型(VLM)越来越容易受到复杂的对抗性攻击,包括专门设计用于绕过现有防御的自适应策略。为了解决这一漏洞,我们提出了MirrorCheck,一个鲁棒且与模型无关的检测框架,在单模态和多模态设置中均能有效运行。MirrorCheck利用文本到图像(T2I)模型从目标模型生成的标题中重建视觉内容,并通过比较原始图像和合成图像之间的特征空间嵌入来评估语义一致性。为了增强对自适应攻击的鲁棒性,MirrorCheck引入了一种随机防御策略,从多样化的模型库中随机选择T2I生成器和图像编码器。此外,我们采用了一种新颖的一次性(OTU)扰动,应用于所选编码器嵌入,并通过缩放因子调节,这降低了自适应攻击的有效性。跨多种威胁场景的大量实验表明,MirrorCheck始终优于基线方法,即使在强自适应对抗条件下也能保持其实用性。

英文摘要

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.
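The consistency check described above can be sketched as follows. Everything here is a stand-in: images are represented directly as small feature vectors, the two entries in `ENCODERS` are hypothetical toy functions rather than real image encoders, the regenerated image would in practice come from a T2I model applied to the target model's caption, and the way the one-time-use noise is applied (the same fresh noise vector added to both embeddings) is an assumption, since the abstract does not give the exact form.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical "model zoo" of image encoders (toy functions for illustration).
ENCODERS = {
    "encoder_a": lambda img: list(img),
    "encoder_b": lambda img: [2.0 * v for v in reversed(img)],
}

def check(image, regenerated, threshold=0.7, scale=0.01, rng=None):
    # Returns True when the input is flagged as likely adversarial.
    rng = rng or random.Random()
    encoder = ENCODERS[rng.choice(sorted(ENCODERS))]  # random encoder per query
    noise = [rng.gauss(0.0, scale) for _ in image]     # fresh noise per query
    a = [x + n for x, n in zip(encoder(image), noise)]
    b = [y + n for y, n in zip(encoder(regenerated), noise)]
    return cosine(a, b) < threshold

rng = random.Random(0)
image = [1.0, 0.2, 0.1, 0.0]
regen_clean = [0.9, 0.3, 0.1, 0.1]  # regenerated from a faithful caption
regen_adv = [0.0, 0.1, 0.2, 1.0]    # regenerated from a hijacked caption
flagged_clean = check(image, regen_clean, rng=rng)
flagged_adv = check(image, regen_adv, rng=rng)
```

Because the encoder and the noise change on every query, an attacker optimizing against one fixed embedding pipeline cannot rely on it being the one used at test time; the threshold and noise scale here are arbitrary illustrative values.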

2604.00887 2026-06-15 cs.CV cs.CR 版本更新

Towards Physically Realizable Adversarial Attenuation Patch against SAR Object Detection

面向SAR目标检测的物理可实现对抗衰减补丁

Yiming Zhang, Weibo Qin, Feng Wang

发表机构 * Key Laboratory for Information Science of Electromagnetic Waves (MoE) School of Information Science and Technology, Fudan University(电磁波信息科学重点实验室(MoE)复旦大学信息科学与技术学院)

AI总结 提出对抗衰减补丁(AAP)方法,通过能量约束优化和衰减部署框架平衡攻击有效性与隐蔽性,并基于信号级电子干扰机制实现物理可行性。

Comments 5 pages, 4 figures. Source code is available at https://github.com/boremycin/SAAP. Accepted and published in IEEE CAIT 2026. DOI: 10.1109/CAIT70489.2026.11553874

详情
Journal ref
Proc. 2026 China Aerospace Information Technology Conference (CAIT), Tongxiang, China, May 2026
AI中文摘要

深度神经网络在SAR目标检测任务中表现出色,但仍易受对抗攻击影响。现有的SAR特定攻击方法能有效欺骗检测器,但往往引入明显扰动,且主要局限于数字域,忽略了攻击SAR系统的物理实现约束。本文提出一种新颖的对抗衰减补丁(AAP)方法,采用能量约束优化策略结合基于衰减的部署框架,在攻击有效性和隐蔽性之间实现无缝平衡。更重要的是,AAP通过对齐信号级电子干扰机制,展现出强大的物理实现潜力。实验结果表明,AAP在保持高隐蔽性的同时有效降低检测性能,并在不同模型间表现出良好的可迁移性。本研究为SAR目标检测系统的对抗攻击提供了物理基础视角,并促进了更隐蔽且实际可部署的攻击策略设计。源代码已在 https://github.com/boremycin/SAAP 公开。

英文摘要

Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting physical implementation constraints for attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at https://github.com/boremycin/SAAP.

2605.26702 2026-06-15 cs.CV cs.AI cs.CR cs.LG 版本更新

Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling

通过三阶SO(3)表示耦合的旋转不变球面水印

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Antonios Argyriou, Wu Liu, Weiping Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对全景图像在任意3D旋转下水印鲁棒性不足的问题,提出利用三阶SO(3)表示耦合构造旋转不变的球面双谱,将水印嵌入高阶球谐系数并从不变标量中提取,实现理论保证的旋转不变性和高视觉保真度。

Comments ICML 2026

详情
AI中文摘要

全景图像的可靠水印面临任意3D旋转的根本挑战。由于全景图定义在球面上,它们在$SO(3)$作用下自然变换,使得传统的平面表示和基于增强的鲁棒策略变得不充分且缺乏理论保证。为了解决这个问题,我们将全景图表示为球面信号,并利用$SO(3)$表示理论推导出可证明的旋转不变描述符。虽然球谐系数在旋转下等变变换,但自然的旋转不变构造通常限于零阶统计量,这消除了方向信息并严重限制了嵌入容量。在这项工作中,我们通过张量积耦合高阶$SO(3)$不可约表示并投影到平凡表示,引入了一种有原则的三阶不变构造。这产生了球面不变双谱,它在保持严格旋转不变性的同时保留了相位信息。利用这一特性,我们将水印嵌入到高阶球谐系数中,并从不变双谱标量中恢复它们,从而在任意3D旋转下实现可靠的提取。我们提供了其$SO(3)$不变性的理论证明,并通过实验证明其对连续旋转具有近乎完美的鲁棒性,同时保持高视觉保真度。

英文摘要

Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, they naturally transform under the action of $SO(3)$, rendering conventional planar representations and augmentation-based robustness strategies inadequate and devoid of theoretical guarantees. To address this, we formulate panoramas as spherical signals and leverage $SO(3)$ representation theory to derive provably rotation-invariant descriptors. While spherical harmonic coefficients transform equivariantly under rotations, the natural invariant constructions are typically limited to zeroth-order statistics, which eliminate directional information and severely constrain embedding capacity. In this work, we introduce a principled third-order invariant construction by coupling higher-order $SO(3)$ irreducible representations via tensor products and projecting onto the trivial representation. This yields a spherical invariant bispectrum that preserves phase information while remaining strictly rotation-invariant. Leveraging this property, we embed watermarks into higher-order spherical harmonic coefficients and recover them from invariant bispectral scalars, enabling reliable extraction under arbitrary 3D rotations. We provide a theoretical proof of the $SO(3)$ invariance of this construction and experimentally demonstrate its near-perfect robustness to continuous rotations while maintaining high visual fidelity.
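For orientation, the textbook form of the spherical bispectrum is sketched below. The notation is an assumption introduced here ($a_{\ell m}$ for the spherical harmonic coefficients of the signal, $C^{\ell m}_{\ell_1 m_1\,\ell_2 m_2}$ for Clebsch-Gordan coefficients), and the paper's exact normalization and index conventions may differ.

$$
B_{\ell_1 \ell_2 \ell} \;=\; \sum_{m_1=-\ell_1}^{\ell_1} \; \sum_{m_2=-\ell_2}^{\ell_2} C^{\ell,\,m_1+m_2}_{\ell_1 m_1\,\ell_2 m_2} \; a_{\ell_1 m_1}\, a_{\ell_2 m_2}\, \overline{a_{\ell,\,m_1+m_2}}, \qquad |\ell_1-\ell_2| \le \ell \le \ell_1+\ell_2 ,
$$

where terms with $|m_1+m_2| > \ell$ are taken to be zero. Under a rotation $R \in SO(3)$ each coefficient vector transforms as $a_\ell \mapsto D^{\ell}(R)\,a_\ell$ with the Wigner D-matrix $D^{\ell}(R)$. The Clebsch-Gordan coefficients project the product $a_{\ell_1} \otimes a_{\ell_2}$ onto a component that also transforms by $D^{\ell}(R)$, and because $D^{\ell}(R)$ is unitary, the inner product of that component with $a_\ell$ is unchanged. Each $B_{\ell_1 \ell_2 \ell}$ is therefore a rotation-invariant scalar that is cubic in the coefficients, which is the sense in which the construction is third-order.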

12. 数据集、基准、评测与训练方法 18 篇

2606.13896 2026-06-15 cs.CV cs.AI 新提交

How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

自监督遥感视觉模型如何迁移到下游任务?

Julia Romero, Qin Lv, Morteza Karimzadeh

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 研究六种代表性自监督地理空间基础模型(GeoFMs)在下游任务中的迁移表现,发现模型排名随任务和适应设置变化,中间层特征比最终层更相关,且解码器设计等适应设置影响与模型选择相当。

详情
AI中文摘要

自监督地理空间基础模型(GeoFMs)从遥感数据中学习可迁移表示,但其下游行为难以表征。我们研究了涵盖联合嵌入、重建和多模态预训练家族的六种代表性GeoFMs,并在不同标签可用性和下游流水线下评估了分类、回归和分割基准的迁移性能。我们发现模型排名随任务和适应设置而变化。逐层探针显示,在大多数情况下,与任务相关的信息在中间Transformer块中比在最终层嵌入中更容易获取,并且GeoFMs表现出不同的深度分布特征。在PASTIS和Sen1Floods11上的分割案例研究中,解码器设计和微调等下游适应设置可能与GeoFM的选择同样重要,且标准密集预测头可能与GeoFM在深度上组织信息的方式不一致。最后,案例研究中的CKA分析表明,微调不会均匀地重写GeoFMs的深度,最强的变化集中在ViT块中MLP的第一个线性层。这些结果有助于解释为什么GeoFM排名在不同基准之间发生变化,并激励更具表示意识的评估和适应策略。

英文摘要

Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.
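The CKA analysis mentioned above compares representations of the same inputs before and after fine-tuning. A minimal pure-Python sketch of linear CKA is given below; the abstract does not say which CKA variant the authors use, so the linear form is an assumption, and the toy feature matrices are hypothetical.

```python
import math

def center(X):
    # Subtract the per-feature mean (rows are samples, columns are features).
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_norm_sq(X, Y):
    # Squared Frobenius norm of X^T Y.
    return sum(
        sum(a * b for a, b in zip(xc, yc)) ** 2
        for xc in zip(*X)
        for yc in zip(*Y)
    )

def linear_cka(X, Y):
    X, Y = center(X), center(Y)
    return cross_norm_sq(X, Y) / math.sqrt(
        cross_norm_sq(X, X) * cross_norm_sq(Y, Y)
    )

# Hypothetical activations for four samples.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Same representation, rotated by 90 degrees and scaled by 2.
Y = [[-2.0 * b, 2.0 * a] for a, b in X]
# A genuinely different representation of the same samples.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

same = linear_cka(X, Y)
different = linear_cka(X, Z)
```

Linear CKA is invariant to rotations and uniform scaling of the feature space, so `same` is 1 up to rounding while `different` is clearly lower. Applied per layer to a pretrained and a fine-tuned model, a low value marks the layers that fine-tuning changed most.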

2606.13910 2026-06-15 cs.CV 新提交

PMOF: A Dataset and Benchmark for Passenger Monitoring Using Overhead Fisheye Cameras

PMOF:使用顶置鱼眼摄像头的乘客监控数据集与基准

Stella Katharina Wermuth, Qazi Arbab Ahmed, Klaus Neumann, Thorsten Jungeblut

发表机构 * Bielefeld University(比勒费尔德大学) Fraunhofer IOSB-INA(弗劳恩霍夫IOSB-INA研究所)

AI总结 提出首个在运动车辆内采集的顶视鱼眼图像数据集PMOF,含19k+标注帧,支持检测、跟踪和行为识别,通过跨域微调实现高精度,缩小静态与动态环境域差距。

Comments 6 pages, 7 figures. Accepted to the 22nd IEEE International Conference on Advanced Visual and Signal-Based Systems (AVSS 2026)

详情
AI中文摘要

自主无工作人员公共交通需要可靠的车上乘客监控。然而,运动车辆内的感知面临空间受限、光照变化、运动引起的背景变化、遮挡和视角有限等挑战。为缓解这些空间约束,天花板安装的鱼眼摄像头从单一视角提供全场景覆盖。但现有公开顶置鱼眼数据集在静态环境中记录,未捕捉车辆运动引入的域偏移。为填补这一空白,我们提出PMOF(使用顶置鱼眼摄像头的乘客监控),这是首个在运动车辆内采集的顶视鱼眼图像公开数据集,包含超过19k张手动标注帧。PMOF提供旋转边界框、跟踪标识和行为标签,支持目标检测、跟踪和行为识别。我们使用YOLO26m-obb模型在多种数据集配置下对PMOF进行基准测试,这些配置将PMOF与现有顶置鱼眼数据集结合。结合自定义旋转感知增强的跨域微调在PMOF上达到94.8% AP50,在来自不同领域的未见顶置鱼眼数据集上达到96.5% AP50。我们的结果突出了静态与动态环境之间的域差距,并表明引入PMOF可提高检测性能,并将泛化能力从乘客监控推广到更广泛的基于鱼眼的人员检测任务。数据集和代码见 https://swermuth.github.io/pmof/ 。

英文摘要

Autonomous staff-free public transport requires reliable in-vehicle passenger monitoring. However, perception inside moving vehicles is challenged by confined spaces, variable illumination, motion-induced background variation, occlusion, and limited viewpoints. To mitigate these spatial constraints, ceiling-mounted fisheye cameras provide full-scene coverage from a single viewpoint. Yet existing public overhead fisheye datasets are recorded in static environments and do not capture the domain shift introduced by vehicle motion. To fill this gap, we introduce PMOF, Passenger Monitoring using Overhead Fisheye cameras, the first public dataset of top-view fisheye imagery captured inside a moving vehicle, comprising over 19k manually annotated frames. PMOF provides rotated bounding boxes, tracking identifiers, and action labels, supporting object detection, tracking, and action recognition. We benchmark PMOF using YOLO26m-obb models fine-tuned under multiple dataset configurations that combine PMOF with existing overhead fisheye datasets. Cross-domain fine-tuning with custom rotation-aware augmentation achieves 94.8% AP50 on PMOF and 96.5% AP50 on an unseen overhead fisheye dataset from a different domain. Our results highlight the domain gap between static and moving environments and show that incorporating PMOF improves detection performance and advances generalization beyond passenger monitoring to broader fisheye-based person detection tasks. The dataset and code are available at https://swermuth.github.io/pmof/.

2606.13911 2026-06-15 cs.CV 新提交

Overhead Wildlife Locator (OWL): Benchmarking Weakly Supervised Learning for Aerial Wildlife Surveys

Overhead Wildlife Locator (OWL): 用于航空野生动物调查的弱监督学习基准测试

Isai Daniel Chacón, Zhongqi Miao, Bruno Demuro, Caleb Robinson, Rahul Dodhia, Lasha Otarashvili, Jason Holmberg, Kirk Larsen, Howard Frederick, Nathan J. Pamperin, Pablo Arbeláez, Juan M. Lavista Ferres

发表机构 * Microsoft AI for Good Lab(微软人工智能公益实验室) Center for Research and Formation in Artificial Intelligence (Cinfonia), Universidad de los Andes(安第斯大学人工智能研究与培训中心) Conservation X Labs(保护X实验室) Kirk Larsen Consulting(Kirk Larsen咨询公司) Tanzania Wildlife Research Institute(坦桑尼亚野生动物研究所) Alaska Department of Fish and Game(阿拉斯加渔猎局)

AI总结 提出弱监督密度估计框架OWL,含三种变体,在五个公开航空数据集上超越现有方法,并在阿拉斯加驯鹿普查中验证了操作可行性。

Comments 16 pages, 4 figures, 3 tables

AI中文摘要

自动化航空野生动物调查越来越依赖深度学习,但标准目标检测器需要边界框标注,据报道其标注速度比点级标签慢七倍,成本高三倍。为解决这一瓶颈,我们引入了Overhead Wildlife Locator (OWL),一个弱监督密度估计框架,包含三种变体:OWL-C,一种用于高通量筛选的全卷积模型;OWL-T,一种用于异质、杂乱场景的Swin增强混合模型;以及OWL-D,基于冻结的DINOv3 ViT-H+/16编码器和DPT风格融合解码器构建。我们在五个公开航空数据集上对所有三种变体与POLO、YOLOv11n和YOLOv11l进行了基准测试,数据集范围从稀疏的固定翼稀树草原调查到密集的无人机围场图像,并在其原始Delplanque分割上与已发表的HerdNet基线进行了比较。OWL-D在Delplanque上创下了新的最先进水平(0.934 AP对比HerdNet的0.840),并在五个数据集中有四个取得了最高AP。性能具有场景依赖性:在极端密度的SheepCounter无人机数据集上,混合模型OWL-T领先(0.978 AP),卷积变体达到最低计数误差,而基于基础模型的OWL-D性能下降,表明哪种变体适合哪种调查类型。我们进一步在阿拉斯加渔猎局2022年中央北极驯鹿普查中验证了实际应用的就绪程度:在跨驯鹿群和跨时间迁移下,基于2017年Porcupine驯鹿群分割微调的OWL-C在保留的补丁测试集上达到了F1=0.965,在发布的测试补丁上聚合的带符号计数误差为+3.1%。我们在 https://github.com/microsoft/MegaDetector-Overhead 上发布了OWL代码、模型权重以及带注释的Porcupine驯鹿群2017年(PCH)和中央北极驯鹿群2022年(CAH)补丁,这是首个用于大规模驯鹿航空调查的开放补丁级数据集。

英文摘要

Automated aerial wildlife surveys increasingly rely on deep learning, yet standard object detectors require bounding-box annotations, reported to be up to seven times slower and three times more expensive to produce than point-level labels. To address this bottleneck, we introduce the Overhead Wildlife Locator (OWL), a weakly supervised density-estimation framework with three variants: OWL-C, a fully convolutional model for high-throughput screening; OWL-T, a Swin-augmented hybrid for heterogeneous, cluttered scenes; and OWL-D, built on a frozen DINOv3 ViT-H+/16 encoder with a DPT-style fusion decoder. We benchmark all three against POLO, YOLOv11n, and YOLOv11l across five public aerial datasets, from sparse fixed-wing savanna surveys to dense UAV paddock imagery, and against the published HerdNet baseline on its native Delplanque split. OWL-D sets a new state of the art on Delplanque (0.934 AP vs. HerdNet's 0.840) and records the highest AP on four of the five datasets. Performance is regime-dependent: on the extreme-density SheepCounter UAV dataset the hybrid OWL-T leads (0.978 AP) and the convolutional variants attain the lowest counting error, whereas the foundation-based OWL-D degrades, indicating which variant suits which survey type. We further validate operational readiness on the Alaska Department of Fish and Game's 2022 Central Arctic Caribou census: under cross-herd and cross-temporal transfer, OWL-C fine-tuned on the 2017 Porcupine Caribou Herd split attains F1 = 0.965 on a held-out patch test set, with a signed count error of +3.1% aggregated across the released test patches. We release the OWL code, model weights, and the annotated Porcupine Caribou Herd 2017 (PCH) and Central Arctic Herd 2022 (CAH) patches, the first open patch-level datasets for large-scale caribou aerial surveys, at https://github.com/microsoft/MegaDetector-Overhead.
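
The signed count error quoted above can be read as the relative difference between predicted and true totals, aggregated over patches. The following is a minimal sketch under that assumption; the abstract does not spell out the formula, the function name `signed_count_error` is hypothetical, and the per-patch counts are made-up illustrative values rather than data from the paper.

```python
def signed_count_error(predicted, actual):
    """Relative signed error of aggregated counts (assumed definition).

    Positive values mean the model over-counts in aggregate,
    negative values mean it under-counts.
    """
    total_true = sum(actual)
    if total_true == 0:
        raise ValueError("true total must be non-zero")
    return (sum(predicted) - total_true) / total_true


# Hypothetical per-patch animal counts, for illustration only.
pred_counts = [12, 30, 7, 54]
true_counts = [11, 29, 8, 52]

err = signed_count_error(pred_counts, true_counts)
print(f"{err:+.1%}")
```

Because over- and under-counts on individual patches cancel when aggregated, a small signed error does not by itself imply accurate per-patch counts, which is presumably why the abstract reports it alongside the patch-level F1.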

2606.14025 2026-06-15 cs.CV 新提交

GarmentSketch: Large-scale Sketch-to-Fashion Benchmark

GarmentSketch:大规模草图到时尚基准

Duong-Duy-Khang Bui, Minh-Tan Pham, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * Kangbdd.github.io

AI总结 为解决时尚草图到图像合成缺乏大规模配对数据的问题,构建了包含26249对草图-文本描述的GarmentSketch数据集,并基于多模态大模型与人工精炼生成描述,评估了现有生成模型的性能。

Comments ICCCI 2026. Project page: https://khangbdd.github.io/garmentsketch

AI中文摘要

时尚草图是设计工作流程的基石,允许在物理原型制作之前快速可视化创意概念。然而,基于草图的时尚图像合成进展因缺乏大规模、高质量配对资源而受阻。为弥补这一差距,我们提出了GarmentSketch,一个新颖的数据集,包含21个服装类别的26,249张时尚草图,每张草图都配有详细的文本描述。描述是通过一个多阶段流水线生成的,该流水线集成了多个多模态大语言模型(MLLM)与人在回路中的精炼,确保了语义准确性和描述丰富性。我们在最先进的生成模型上对GarmentSketch进行了基准测试,为草图引导的文本到图像生成提供了基线性能。我们的实验揭示了现有方法的潜力和当前局限性。通过提供全面且注释丰富的资源,GarmentSketch为推进草图理解、细粒度时尚图像生成以及设计中的创意人机协作奠定了基础。该数据集将在以下网址提供:https://khangbdd.github.io/garmentsketch。

英文摘要

Fashion sketching is a cornerstone of design workflows, allowing rapid visualization of creative concepts prior to physical prototyping. Yet, progress in sketch-based fashion image synthesis has been hindered by the absence of large-scale, high-quality paired resources. To bridge this gap, we present GarmentSketch, a novel dataset comprising 26,249 fashion sketches across 21 garment categories, each paired with detailed textual descriptions. Captions were produced through a multi-stage pipeline that integrates multiple multimodal large language models (MLLMs) with human-in-the-loop refinement, ensuring both semantic accuracy and descriptive richness. We benchmark GarmentSketch on state-of-the-art generative models, providing baseline performance for sketch-guided text-to-image generation. Our experiments reveal both the promise and the current limitations of existing methods. By offering a comprehensive and richly annotated resource, GarmentSketch establishes a foundation for advancing sketch understanding, fine-grained fashion image generation, and creative human-AI collaboration in design. The dataset will be available at: https://khangbdd.github.io/garmentsketch.

2606.14153 2026-06-15 cs.CV cs.RO 新提交

Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic

编码器胜者无法可靠跨VLA骨干网络规模迁移:一种冻结骨干嫁接诊断方法

Qingping Zeng, Fei She

发表机构 * Tsinghua University(清华大学)

AI总结 提出冻结骨干嫁接诊断方法,发现小规模VLA上最优的视觉编码器在大规模骨干上并非最优,编码器选择依赖于骨干网络规模。

Comments 23 pages, 5 figures, 8 tables

AI中文摘要

视觉-语言-动作(VLA)策略通常从其上游VLM发布中继承视觉编码器,但目前尚不清楚在小规模VLA上验证的编码器选择是否能迁移到更大的骨干网络上。我们引入了一种冻结骨干嫁接诊断方法:将已发布VLA的视觉塔替换为候选编码器,采用固定协议(自适应平均池化、LayerNorm和单个可训练的线性投影器),同时冻结语言模型和动作专家。在四个编码器、两个LIBERO套件、两个骨干网络(SmolVLA-450M和$\pi_{0.5}$-3.3B)以及每个单元两到三个随机种子(共40次主要嫁接运行,加上原生、LoRA、池化以及零/打乱图像对照,全部通过离线动作MSE评分)的条件下,小骨干网络的胜者无法可靠地选出大骨干网络的顶级编码器:SigLIP在SmolVLA上两个套件中均表现最佳,而在$\pi_{0.5}$上,DINOv2-small在空间套件中领先,物体套件则是对种子敏感的接近平局带;四个骨干-套件比较中的三个(以及12个种子级单元中的11个)支持依赖于骨干网络的排名。嫁接包装本身并非中性,在两个骨干网络上符号相反(在SmolVLA原生视觉塔上MSE增加45-56%,在$\pi_{0.5}$上降低50-52%),因此所有结论都依赖于固定的嫁接协议。我们将冻结嫁接定位为一种廉价的靶向骨干诊断方法,在承诺大规模使用编码器之前运行,而非闭环部署声明。

英文摘要

Vision-language-action (VLA) policies typically inherit their vision encoder from upstream VLM releases, but it is unclear whether an encoder choice validated on a small VLA transfers to a larger backbone. We introduce a frozen-backbone grafting diagnostic: the vision tower of a released VLA is replaced by a candidate encoder under a fixed protocol (adaptive average pooling, LayerNorm, and a single trainable linear projector), with the language model and action expert frozen. Across four encoders, two LIBERO suites, two backbones (SmolVLA-450M and $\pi_{0.5}$-3.3B), and two-to-three seeds per cell (40 main grafting runs plus native, LoRA, pooling, and zero-/shuffled-image controls, all scored by offline action MSE), the small-backbone winner does not reliably select the large-backbone top tier: SigLIP is best on SmolVLA across both suites, while on $\pi_{0.5}$ DINOv2-small leads the spatial suite and the object suite is a seed-sensitive near-tie band; three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent rankings. The grafting wrapper is itself non-neutral with opposite sign across backbones (+45-56% MSE on the SmolVLA native tower, -50-52% on $\pi_{0.5}$), so all conclusions are conditional on the fixed grafting protocol. We position frozen grafting as a cheap target-backbone diagnostic to run before committing to an encoder at scale, not as a closed-loop deployment claim.

2606.14277 2026-06-15 cs.CV 新提交

One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs

一层的垃圾是另一层的宝藏:LVLMs中自适应逐层视觉标记选择

Yongru Chen, Kai Zhang, Zeliang Zong, Yuchen Lu, Wenming Tan, Ye Ren, Jilin Hu

发表机构 * Hikvision Research Institute(海康威视研究院) Peking University(北京大学) East China Normal University(华东师范大学)

AI总结 提出自适应逐层视觉标记选择(ALVTS),通过轻量级选择器为不同层路由重要标记,实现高效压缩,在89%压缩率下保留96.7%精度。

Comments Accepted by CVPR 2026 (highlight)

AI中文摘要

大型视觉语言模型(LVLMs)在多样化的多模态任务中取得了显著成功,但其实际部署仍受限于长视觉标记带来的计算负担。虽然视觉标记剪枝已成为一种有前景的解决方案,但现有方法存在一个根本性局限:一旦标记在特定层被剪枝,它们将无法被所有后续层访问,导致过早的信息丢失,从而损害模型性能。通过实证研究,我们观察到不同层表现出不同的视觉区域关注点,表明各层存在不同的最优标记子集。受此启发,我们提出自适应逐层视觉标记选择(ALVTS),这是一种突破传统静态标记剪枝范式的新框架。ALVTS包含一个轻量级标记选择器,用于识别重要标记并将其路由到后续处理,同时允许较不重要的标记跳过该层,从而最小化计算冗余。这两类标记在输入后续层之前无缝重新整合,促进整个模型的自适应压缩。基于我们提出的重要性一致性约束低秩近似,所提出的标记选择模块紧密模拟了完整注意力机制,有效捕捉其关键模式,而无需模型重新训练。在LLaVA-1.5、LLaVA-NeXT和Qwen2.5-VL上的大量实验验证了我们方法的有效性。在89%的标记压缩率下,ALVTS保留了原始模型96.7%的准确率,实现了LVLM推理中优越的效率-准确率权衡。

英文摘要

Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
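
The routing idea described above (important tokens go through the layer, the rest skip it, and both streams are merged before the next layer) can be sketched in a few lines. This is an illustrative toy, not the ALVTS implementation: the function `route_tokens`, the scalar "tokens", and the given scores are all hypothetical, and the paper's selector is a low-rank approximation of attention rather than a precomputed score list.

```python
def route_tokens(tokens, scores, keep_ratio, layer_fn):
    """Apply layer_fn only to the top-scoring tokens; the others skip the layer.

    Both streams are merged back in the original order, so a token that is
    skipped here remains available to every later layer.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token indices by descending score (ties broken by position).
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    selected = set(ranked[:n_keep])
    return [layer_fn(t) if i in selected else t for i, t in enumerate(tokens)]


tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.3, 0.8]  # hypothetical selector outputs for one layer
out = route_tokens(tokens, scores, keep_ratio=0.5, layer_fn=lambda t: t * 10)
print(out)
```

The contrast with static pruning is that nothing is deleted: a different score vector at the next layer can select tokens that were skipped at this one.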

2606.14299 2026-06-15 cs.CV cs.LG 新提交

What Drives Test-Time Adaptation for CLIP? A Controlled Empirical Study from an Update Perspective

什么驱动了CLIP的测试时适应?从更新视角进行的受控实证研究

Jiazhen Huang, Xiao Chen, Zhiming Liu, Yaru Sun, Jingyan Jiang, Zhi Wang

发表机构 * Tsinghua University(清华大学) Shenzhen Technology University(深圳技术大学)

AI总结 本文通过受控实证研究,从更新视角分析了CLIP测试时适应方法的驱动因素,揭示了适应增益主要来自测试时证据和可靠代理,而非繁重优化,并指出无单一范式普遍最优。

AI中文摘要

视觉语言模型(如CLIP)已成为开放词汇识别的标准骨干,但其零样本预测在部署时仍易受分布偏移影响。测试时适应(TTA)最近被扩展到CLIP作为轻量级解决方案,导致TTA4CLIP方法迅速增长。然而,该领域的实证进展在很大程度上超过了我们对真正驱动适应因素、其增益来源以及哪些偏移下保持可靠的理解。本文从追求最先进准确率中退一步,对TTA4CLIP进行了系统性的受控研究。我们首先根据测试时更新的内容,将现有方法组织为三个统一范式。然后,我们引入TTABC,一个开源的CLIP TTA基准,它标准化了评估协议并集成了20多种代表性方法。我们的受控实证分析集中在三个关键领域。首先,我们确定了基于参数方法的驱动因素,揭示适应增益主要由测试时证据和可靠代理驱动,而非繁重优化。其次,我们探索了超越繁重参数调整的证据利用,表明通过跨样本或当前样本证据以及轻量级原型更新可以实现竞争性和高效的性能。最后,我们证明TTA没有银弹:没有单一的适应范式普遍最优,首选范式取决于偏移的性质。我们希望我们的基准和研究能提供对当前TTA4CLIP格局的更清晰理解,并为进一步研究奠定基础。

英文摘要

Vision-Language Models (VLMs) such as CLIP have become a standard backbone for open-vocabulary recognition, yet their zero-shot predictions remain vulnerable to distribution shifts encountered at deployment. Test-Time Adaptation (TTA) has recently been extended to CLIP as a lightweight solution, leading to a rapidly growing body of TTA4CLIP methods. However, empirical progress in this area has largely outpaced our understanding of what truly drives adaptation, where their gains originate, and under which shifts they remain reliable. In this paper, we take a step back from the pursuit of state-of-the-art accuracy and conduct a systematic controlled study of TTA4CLIP. We first organize existing methods into three unified paradigms according to what is updated at test time. We then introduce TTABC, an open-source TTA Benchmark for CLIP, which standardizes evaluation protocols and integrates more than 20 representative methods. Our controlled empirical analysis focuses on three key areas. First, we determine the driving factors in parameter-based methods, revealing that adaptation gains are primarily driven by test-time evidence and reliable proxies rather than heavy optimization. Second, we explore evidence utilization beyond heavy parameter tuning, showing that competitive and efficient performance can be achieved through cross- or current-sample evidence and lightweight prototype updates. Finally, we demonstrate that there is no silver bullet for TTA: no single adaptation paradigm is universally optimal, and the preferred paradigm depends on the nature of shift. We hope our benchmark and study provide a clearer understanding of the current TTA4CLIP landscape and establish a foundation for further research.

2606.14562 2026-06-15 cs.CV cs.LG 新提交

NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests

NEST3D:织布鸟树巢的高分辨率多模态数据集

Constanza A. Molina Catricheo, Simon Boeder, Ting-Jia Guo, Giacomo May, Clément Berthelot, Devis Tuia, Friedrich Fedor Reinhard, Fabio Remondino, Benjamin Risse

发表机构 * Institute for Geoinformatics (ifgi), University of Münster(明斯特大学地理信息学研究所) École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院) Max Planck Institute of Animal Behavior(马克斯·普朗克动物行为研究所) University of Konstanz(康斯坦茨大学) Kuzikus Research Station(库兹库斯研究站) Fondazione Bruno Kessler (FBK)(布鲁诺·凯斯勒基金会)

AI总结 针对织布鸟巢缺乏精细3D结构数据的问题,提出包含104棵巢树、1.4TB多模态无人机数据集,并基准测试语义分割方法,PT-v3达86.35% mIoU。

Comments 14 pages, 4 figures. Dataset available at https://huggingface.co/NEST3D

AI中文摘要

织布鸟巢作为复杂的生态结构,提供体温调节微栖息地并维持多种物种;然而,先前研究使用的数据集缺乏精细的3D结构细节。由于巢穴的不规则几何形状以及与复杂宿主植被的整合,生成可用且准确的3D织布鸟巢数据具有挑战性。我们通过一个开放获取的1.4TB多模态无人机数据集(包含104棵巢树,共27,945张RGB图像、111,780张多光谱图像、约7.81亿个3D点以及专家标注的语义分割标签)弥合了这一差距。我们使用KPConv、RandLA-Net和Point Transformer V3对语义分割进行基准测试,其中PT-v3在测试集上达到了86.35%的mIoU。虽然结果展示了基于Transformer和逐点方法的强大性能,但也凸显了架构相关的挑战,特别是对于基于卷积的方法(如KPConv)。通过独特地结合光谱、空间和结构信息,所提出的数据集推动了3D重建、分割和分类算法的发展,实现了从巢穴体积估计到物种保护等生态应用,并作为一个要求严格的基准,揭示了在极端类别不平衡下与架构相关的性能差异。

英文摘要

Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine-grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open-access, 1.4 TB multimodal drone dataset of 104 nest-bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA-Net, and Point Transformer V3, with PT-v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer-based and point-wise methods, they also highlight architecture-dependent challenges, particularly for convolution-based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture-dependent performance under extreme class imbalance.
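
Since the benchmark is scored by mIoU and stresses extreme class imbalance, it may help to recall how the metric is computed. The sketch below uses the standard per-class intersection-over-union averaged over classes; the function name `mean_iou` and the tiny label lists are illustrative assumptions, not taken from the dataset or its evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both prediction and target
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Hypothetical per-point class labels, for illustration only.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(miou)
```

Each class contributes equally to the average regardless of how many points it covers, so a rare class that is segmented poorly lowers mIoU as much as a dominant one would.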

2606.14578 2026-06-15 cs.CV 新提交

A Qualitative Review of GenAI-Based Methods for Data Generation and Augmentation in Industrial Computer Vision Applications

基于GenAI的工业计算机视觉应用中数据生成与增强方法的定性综述

Paul Koch, Paul Hofmann, Ferdinand Waßelewsky, Adem Karakurt, Andre Sérs, Jörg Krüger

发表机构 * Fraunhofer IPK(弗劳恩霍夫研究所) Hamburg University of Applied Sciences (HAW Hamburg)(汉堡应用技术大学) Technical University Berlin (Tu-Berlin)(柏林技术大学)

AI总结 本文综述了基于GenAI的数据生成与增强方法,旨在解决工业计算机视觉应用中数据获取的“先有鸡还是先有蛋”困境,并评估其在分类用例中的适应性。

Comments Accepted to Computing Conference 2026

AI中文摘要

AI驱动的计算机视觉应用需要强大的数据库来确保可预测的行为和性能。这种可预测的行为对于工业应用获得用户信任尤为重要。然而,在工业应用中,这样的数据库并不容易获得,其获取也并非易事。主动学习方法可以在项目部署中逐步增加数据,从而迭代地扩充数据库,进而提高应用的可预测性。不幸的是,我们观察到这往往会导致用户对应用失去信任,而一旦失去信任就很难恢复。这就导致了“先有鸡还是先有蛋”的困境,即数据库和应用都无法得到发展。在这项工作中,我们回顾了最先进的方法和途径,以进一步推动初始主动数据扩充阶段后的数据库建设。这里,我们重点关注基于GenAI的数据生成和增强方法的最新进展,并评估它们在工业计算机视觉分类用例中的适应性。尽管我们观察到自动数据扩充的潜力,但我们也看到在源(训练环境)和目标(工业用例)之间存在领域不匹配——涉及自然语言和对象特征中定义的上下文。

英文摘要

AI-driven computer vision applications require a robust database to ensure predictable behavior and performance. Such predictability is especially important for industrial applications, where it helps gain users' trust. However, such a database is not readily available in industrial settings, and acquiring one is not trivial. Active learning methods can be applied to ramp up data within a project deployment, iteratively growing the database and thus the application's predictability. Unfortunately, we observe that this often leads to a loss of user trust in the application, which is difficult to regain once lost. The result is a "chicken-and-egg" dilemma in which neither the database nor the application gets developed. In this work, we review state-of-the-art methods and approaches to further boost the database after the initial active data ramp-up phase. We focus on recent advances in GenAI-based data generation and augmentation methods and review their adaptability to an industrial computer vision classification use case. Although we observe potential for automatic data ramp-up, we also see a domain mismatch between the source (training environment) and the target (industrial use case) regarding the context defined in natural language and the object characteristics.

2606.14586 2026-06-15 cs.CV 新提交

S$^2$COPE: Self-Supervised Concept Discovery via Preference Learning

S$^2$COPE: 通过偏好学习进行自监督概念发现

Shilong Xiang, Zirui Zhang, Chengzhi Mao

发表机构 * Rutgers University(罗格斯大学)

AI总结 提出S$^2$COPE框架,利用视觉大语言模型在自监督偏好优化循环中自主发现结构化概念,无需任何标签,在多个领域提升下游分类准确率。

AI中文摘要

当前的表示学习范式存在一个根本性的折衷:自监督方法可扩展到大规模数据集但产生不透明的特征,而可解释模型则因需要密集的人工标注而受限。我们提出了通过偏好学习进行自监督概念发现(S$^2$COPE),这是一个无需标签的框架,解决了这一困境。S$^2$COPE不将视觉大语言模型(VLLMs)视为静态特征提取器,而是将其作为自监督偏好优化循环中的主动参与者。通过直接从原始图像中自主假设、验证和强化候选视觉属性,我们的框架无需任何标签即可发现新颖的结构化概念。在自然、医学和物理领域的大量实验表明,S$^2$COPE成功提取了标准VLLMs通常无法生成的领域特定概念。通过将概念发现直接摊销到VLLM骨干网络中(通过我们的自监督偏好目标,而非依赖静态生成和分离过滤),我们在未见数据上的下游top-1分类准确率实现了高达24个百分点的绝对提升。我们的工作表明,可解释性可以通过模型与偶然视觉结构的自主交互而出现,无需任何人类监督。

英文摘要

Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation. We introduce Self-Supervised Concept discOvery via Preference lEarning (S$^2$COPE), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, S$^2$COPE leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label. Extensive experiments across natural, medical, and physics domains demonstrate that S$^2$COPE successfully extracts domain-specific concepts that standard VLLMs often fail to generate. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective (rather than relying on static generation and disjoint filtering), we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data. Our work suggests that interpretability can emerge through a model's autonomous interaction with incidental visual structures, without any human supervision.

2606.14657 2026-06-15 cs.CV 新提交

HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities

HPSv3++:跨扩散模型能力全谱系扩展奖励模型

Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He, Haoran Li, Shijia Ge, Siming Fu

发表机构 * Tsinghua University(清华大学) JD Explore Academy(京东探索研究院) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 提出HPSv3++奖励模型框架,通过双维度偏好数据集HPDv3++和两阶段训练(正交梯度投影+无监督引导),提升对各类T2I模型及RL迭代的偏好预测能力,在多个基准上达到最优。

AI中文摘要

奖励模型引导文本到图像(T2I)系统输出符合人类偏好的结果。然而,典型的奖励模型(如HPSv3)是在早期T2I模型的预标注数据上训练的,没有考虑因模型能力演进和强化学习(RL)迭代而产生的质量判别偏移,限制了其更广泛的适用性。在这项工作中,我们提出了HPSv3++,一个奖励模型框架,将HPSv3模型提升到适应不同T2I模型能力及其RL迭代变化的全能力-迭代谱系。具体来说,我们首先引入了HPDv3++,一个212K双维度偏好数据集,使用近期高能力(Qwen-Image)模型并辅以人工监督,对文本保真度和美学质量进行标注。然后我们提出了一个两阶段训练框架。第一阶段采用数据感知的正交梯度投影,从HPDv3++中融入多样化的美学感知,同时保留HPSv3中原始有效的人类偏好知识。第二阶段进一步利用来自不同能力水平和RL迭代的T2I模型的无标注数据,并引入一个联合能力-迭代条件的信号给奖励模型,以及一个标准差驱动的无监督引导机制,从而在能力-迭代谱系上强化奖励模型。HPSv3++实现了最先进的偏好预测,在HPDv3上比HPSv3高出9.8%,在GenAI-Bench上高出5.5%,同时在我们提出的HPDv3++上达到79.1%/88.1%。当用于T2I RL训练时,它持续提升了多种T2I模型的GenEval分数,展示了其广泛的能力。代码可在 https://github.com/PlantPotatoOnMoon/HPSv3-PlusPlus 获取。

英文摘要

Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for shifts in quality discrimination that arise from evolving model capabilities and reinforcement learning (RL) iterations, which limits their broader applicability. In this work, we propose HPSv3++, a reward model framework that extends the HPSv3 model to varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard-deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and by 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-ranging capabilities. The code is available at https://github.com/PlantPotatoOnMoon/HPSv3-PlusPlus.

2606.14684 2026-06-15 cs.CV cs.LG 新提交

HumP-KD: A Hybrid Uncertainty-Aware Multi-Stage Progressive Knowledge Distillation Framework for Efficient Fire Classification

HumP-KD: 一种混合不确定性感知的多阶段渐进式知识蒸馏框架用于高效火灾分类

Mohammed Arif Mainuddin, Najifa Tabassum, Omar Ibne Shahid, Riasat Khan

AI总结 提出HumP-KD框架,通过层次化渐进式知识蒸馏和多阶段蒸馏,将两个冻结的异构Transformer教师(Swin-Tiny和ViT-Base)及其集成知识蒸馏到轻量级MobileViT-S学生模型中,在火灾分类任务上显著提升性能,同时保持低参数量和实时推理速度。

AI中文摘要

实时火灾分类系统需要模型同时具备准确性、计算效率以及可在资源受限硬件上部署的能力。本文提出HumP-KD,一种混合不确定性感知的多阶段渐进式知识蒸馏框架,用于高效火灾分类。使用了两个数据集:FlameVision(8600张图像)和Dataset-II(31309张图像)。在标准预处理、在线增强、高斯噪声和运动模糊鲁棒性条件下,应用了多种CNN和Transformer基线模型。所提出的HumP-KD模型通过三个紧密集成的组件,将两个冻结的异构Transformer教师(Swin-Tiny和ViT-Base)及其Meta-MLP集成的知识蒸馏到轻量级MobileViT-S学生中。层次化渐进式知识蒸馏采用层次化特征构建器,生成融合的空间注意力掩码,以选择性地引导蒸馏到判别性区域。多阶段知识蒸馏在训练过程中逐步激活三个蒸馏阶段。在Dataset-II上,HumP-KD在10次独立试验中平均F1分数达到$0.9876 \pm 0.0063$,显著优于未使用蒸馏训练的MobileViT-S基线($0.9537 \pm 0.0351$),独立t检验($p = 0.0195$)和Wilcoxon符号秩检验($W = 1$,$p = 0.0039$)均证实了统计显著性。所提出的方法还展示了跨数据集的强泛化能力和在退化视觉条件下的鲁棒性。学生模型仅保留4.94M参数和19.01 MB模型大小,相比Swin-Tiny参数减少$5.7\times$,相比ViT-Base减少$17.5\times$,同时达到37.72 CPU FPS,适合实时部署。

英文摘要

Real-time fire classification systems require models that are simultaneously accurate, computationally efficient, and deployable on resource-constrained hardware. This work proposes HumP-KD, a Hybrid Uncertainty-aware Multi-stage Progressive Knowledge Distillation framework for efficient fire classification. Two datasets, FlameVision and Dataset-II, containing 8,600 and 31,309 images respectively, are used. Various CNN and transformer baselines are evaluated under standard preprocessing, online augmentation, and Gaussian-noise and motion-blur robustness conditions. The proposed HumP-KD model distills knowledge from two frozen heterogeneous transformer teachers, Swin-Tiny and ViT-Base, along with their Meta-MLP ensemble, into a lightweight MobileViT-S student via three tightly integrated components. Hierarchical Progressive Knowledge Distillation employs a Hierarchical Feature Builder that generates a fused spatial attention mask to selectively guide distillation toward discriminative regions. Multi-Stage Knowledge Distillation progressively activates three distillation stages across training. On Dataset-II, HumP-KD achieves a mean F1 score of $0.9876 \pm 0.0063$ across 10 independent trials, significantly outperforming the MobileViT-S baseline trained without distillation ($0.9537 \pm 0.0351$), with statistical significance confirmed by both an independent t-test ($p = 0.0195$) and a Wilcoxon signed-rank test ($W = 1$, $p = 0.0039$). The proposed method also demonstrates strong generalization across datasets and robustness under degraded visual conditions. The student model retains only 4.94M parameters and a 19.01 MB model size, representing a $5.7\times$ parameter reduction over Swin-Tiny and a $17.5\times$ reduction over ViT-Base, while achieving 37.72 CPU FPS, making it suitable for real-time deployment.
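
The reported Wilcoxon result ($W = 1$, $p = 0.0039$ over 10 trials) can be sanity-checked with the exact null distribution of the signed-rank statistic. The sketch below assumes 10 paired differences with no ties or zeros, a two-sided test, and $W$ taken as the smaller of the two rank sums; none of these details are stated in the abstract, and the function name `wilcoxon_exact_p` is hypothetical.

```python
def wilcoxon_exact_p(w, n):
    """Two-sided exact p-value of the signed-rank statistic w for n untied pairs.

    Under the null hypothesis each of the 2**n sign assignments is equally
    likely, so we count assignments whose positive-rank sum is at most w.
    """
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of ranks {1..n} that sum to s
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    tail = sum(counts[: w + 1])
    return min(1.0, 2 * tail / 2 ** n)


p = wilcoxon_exact_p(1, 10)
print(round(p, 4))
```

Under these assumptions only two sign assignments give a rank sum of at most 1 (no positive ranks, or rank 1 alone), so the two-sided p-value is 4/1024, which agrees with the reported value after rounding.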

2606.14697 2026-06-15 cs.CV cs.AI cs.CL 新提交

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

ClinHallu: 用于诊断医学多模态大语言模型推理中阶段式幻觉的基准

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) DAMO Academy, Alibaba Group(阿里巴巴达摩院) Hupan Lab(湖畔实验室) Zhejiang University(浙江大学)

AI总结 提出ClinHallu基准,包含7031个实例,每个实例带有结构化推理轨迹(视觉识别、知识回忆、推理整合),通过阶段替换干预和轨迹监督微调,实现细粒度幻觉诊断与缓解。

Comments Code and datasets: https://github.com/alibaba-damo-academy/ClinHallu

AI中文摘要

构建可信的医学多模态大语言模型(MLLM)对于可靠的临床决策支持至关重要。现有的医学幻觉基准主要关注数据收集,但往往忽略了推理过程中幻觉的起源。我们发现幻觉来源因样本而异:错误可能源于视觉误识别、不正确的医学知识回忆或有缺陷的推理整合。为了实现源级别的幻觉诊断,我们引入了ClinHallu,一个用于医学MLLM推理中阶段式幻觉诊断的基准。ClinHallu包含7031个经过验证的实例,每个实例都附有分解为视觉识别、知识回忆和推理整合的结构化推理轨迹。我们还使用阶段替换干预来测量纠正特定阶段如何影响最终答案。除了评估,我们表明轨迹监督微调减少了阶段式幻觉。ClinHallu为诊断和缓解医学MLLM中的推理失败提供了一个细粒度的幻觉测试平台。该基准可从 https://github.com/alibaba-damo-academy/ClinHallu 公开获取。

英文摘要

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.

2606.13894 2026-06-15 cs.LG cs.AI cs.CL cs.CV 交叉投稿

Gefen: Optimized Stochastic Optimizer

Gefen: 优化随机优化器

Nadav Benedek, Tomer Koren, Ohad Fried

发表机构 * Reichman University(赖希曼大学) Tel Aviv University(特拉维夫大学) Google Research(谷歌研究院)

AI总结 提出Gefen优化器,通过共享二阶矩估计和量化一阶矩,将AdamW内存占用减少约8倍,同时保持相同性能,支持更大批量和吞吐量。

AI中文摘要

AdamW是现代深度学习的默认优化器,但其一阶和二阶矩状态会额外占用约两倍参数大小的训练内存。我们提出Gefen,一种内存高效的优化器,它自动在参数块之间共享二阶矩估计,并使用学习到的码本量化一阶矩,从而将AdamW的内存占用减少约8倍,同时保持相同性能,相当于每十亿参数减少6.5 GiB。该方法受理论结果启发,该结果表明大的混合Hessian项将平方梯度的比率约束为接近1,表明Hessian对齐的参数是共享二阶矩统计量的自然候选。由于大规模计算Hessian不切实际,Gefen从初始平方梯度推断块结构,除了AdamW默认超参数外,不需要任何架构特定的元数据或超参数。Gefen学习基于精确直方图的动态规划量化码本,并重用相同的块进行一阶矩缩放。在多种实验中,Gefen在比较的类似AdamW的方法中实现了最低的峰值优化器内存,同时保持AdamW级别的性能。在FSDP和DDP训练中,减少的内存占用支持更大的微批次,并显著提高相对于AdamW的吞吐量,提供了一种实用的即插即用替代方案,具有更低的内存使用,可以增加吞吐量并支持训练更大的模型或使用更大的批量大小。我们提供了完整的Python实现,包括融合CUDA内核,网址为 https://github.com/ndvbd/Gefen。

英文摘要

AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW's memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels, at https://github.com/ndvbd/Gefen.
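
The two headline numbers (an ~8x reduction and 6.5 GiB saved per billion parameters) are consistent with each other if AdamW's two moment buffers are stored in 32-bit floats. The sketch below works through that arithmetic; the fp32 assumption and the helper name `adamw_state_gib` are ours, not statements from the abstract.

```python
GIB = 2 ** 30  # bytes per gibibyte


def adamw_state_gib(num_params, bytes_per_value=4):
    """Memory of AdamW's first- and second-moment buffers, in GiB.

    Assumes one value per parameter in each buffer (fp32 by default).
    """
    return 2 * num_params * bytes_per_value / GIB


baseline = adamw_state_gib(1_000_000_000)  # two fp32 buffers for 1B parameters
reduced = baseline / 8                     # the ~8x reduction quoted above
saved = baseline - reduced

print(f"{baseline:.2f} GiB -> {reduced:.2f} GiB, saving {saved:.2f} GiB")
```

With these assumptions the optimizer state shrinks from about 7.45 GiB to about 0.93 GiB per billion parameters, a saving of roughly 6.5 GiB, in line with the figure in the abstract.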

2508.18693 2026-06-15 cs.CV 版本更新

Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency

特征空间平面搜索器:一种通用领域自适应框架,兼顾可解释性与计算效率

Zhitong Cheng, Yiran Jiang, Yulong Ge, Yufeng Li, Zhongheng Qin, Rongzhi Lin, Jianwei Ma

发表机构 * School of Mathematics and Institute for Artificial Intelligence, Harbin Institute of Technology, China(数学学院和人工智能研究院,哈尔滨工业大学,中国) School of Earth and Space Sciences, Institute for Artificial Intelligence, Peking University, China(地球和空间科学学院,人工智能研究院,北京大学,中国)

AI总结 提出特征空间平面搜索器(FPS),通过冻结特征编码器并利用特征空间几何模式优化决策边界,实现高效、可解释的领域自适应,在多个基准上达到竞争性能。

AI中文摘要

领域偏移,即从标记源域到未标记目标域时模型性能下降,是部署深度学习系统的一个持续挑战。当前的无监督领域自适应(UDA)方法主要依赖于微调特征提取器,这种方法存在效率低、可解释性差以及对现代架构扩展性不足的问题。我们的分析表明,在大规模数据上预训练的模型在其特征空间中表现出域不变的几何模式,以类内聚类和类间分离为特征,从而保留了可迁移的判别结构。这些发现表明,领域偏移主要表现为边界不对齐而非特征退化。与微调整个预训练模型(这有引入不可预测特征失真的风险)不同,我们提出特征空间平面搜索器(FPS):一种新颖的领域自适应框架,通过利用这些几何模式优化决策边界,同时保持特征编码器冻结。这种简化的方法能够对自适应进行解释性分析,同时通过离线特征提取大幅降低内存和计算成本,允许在单个计算周期内进行全数据集优化。在公共基准上的评估表明,FPS达到了与最先进方法竞争或更优的性能。FPS能够高效地扩展到多模态大模型,并在包括蛋白质结构预测、遥感分类和地震检测在内的多个领域展现出通用性。我们预计FPS将为迁移学习,特别是领域自适应任务,提供一种简单、有效且可推广的范式。

英文摘要

Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors - an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pretrained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation. Unlike fine-tuning entire pre-trained models - which risks introducing unpredictable feature distortions - we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretative analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single computation cycle. Evaluations on public benchmarks demonstrate that FPS achieves competitive or superior performance to state-of-the-art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.

2512.00336 2026-06-15 cs.CV 版本更新

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

MVAD:多模态AI生成视频-音频检测基准数据集

Mengxue Hu, Yunfeng Diao, Changtao Miao, Tairui Ge, Taize Ge, Zhiqing Guo, Jianshu Li, Zhe Li, Zhongjie Ba, Joey Tianyi Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有数据集缺乏多模态真实性的问题,提出MVAD数据集,包含三种伪造模式、高质量样本及多样化的视觉风格和内容类别,用于检测AI生成的视频-音频内容。

Comments 10 pages,2 figures

详情
AI中文摘要

多模态AI生成视频-音频内容的快速发展引发了对信息安全和内容真实性的重大担忧。现有的合成视频数据集主要关注视觉模态,而少数包含音频的数据集也大多局限于面部深度伪造,这一局限性未能覆盖通用多模态AI生成内容日益扩展的领域,并严重阻碍了可信检测系统的发展。为弥补这一关键差距,我们引入了多模态视频-音频数据集(MVAD),这是第一个专门设计用于检测AI生成多模态视频-音频内容的综合数据集。我们的数据集具有三个关键特征:(1)真正的多模态性,样本根据三种真实的视频-音频伪造模式生成;(2)通过多种最先进的生成模型实现的高感知质量;(3)涵盖现实和动漫视觉风格、四种内容类别(人类、动物、物体和场景)以及四种视频-音频多模态数据类型的全面多样性。我们的数据集将在以下网址提供:https://github.com/HuMengXue0104/MVAD。

英文摘要

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes--a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.

2512.05025 2026-06-15 cs.CV 版本更新

RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

RAMEN: 面向地球观测的分辨率可调多模态编码器

Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry

发表机构 * Institut National des Sciences de l'Univers (INSU), France(法国国家宇宙科学研究所) CNRS, France(法国国家科学研究中心)

AI总结 提出RAMEN,一种传感器无关的分辨率可调多模态编码器,通过将分辨率作为可控参数,在统一潜空间中实现多模态地球观测数据的连贯分析,并在PANGAEA基准上优于现有模型。

详情
Journal ref
CVPR 2026
AI中文摘要

地球观测(EO)数据涵盖广泛的空间、光谱和时间分辨率,从高分辨率光学图像到低分辨率多光谱产品或雷达时间序列。虽然最近的基础模型改进了多模态集成以学习有意义的表示,但它们通常期望固定的输入分辨率或基于传感器特定的编码器,限制了跨异构EO模态的泛化。为克服这些限制,我们引入了RAMEN,一种分辨率可调的多模态编码器,以完全传感器无关的方式学习跨EO数据的共享视觉表示。RAMEN将模态、空间和时间分辨率视为关键输入数据特征,从而在统一潜空间内实现跨模态的连贯分析。其主要方法贡献是将空间分辨率定义为可控输出参数,使用户在推理时能够直接控制所需的细节水平,并允许在空间精度和计算成本之间进行显式权衡。我们训练了一个统一的Transformer编码器,用于重构来自不同来源的掩蔽多模态EO数据,确保跨传感器和分辨率的泛化。预训练后,RAMEN有效地迁移到已知和未见过的传感器配置,并在社区标准的PANGAEA基准上优于更大的最先进模型,该基准包含多种多传感器和多分辨率下游任务。我们的代码和预训练模型可在以下网址获取:https://github.com/nicolashoudre/RAMEN。

英文摘要

Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and the spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder to reconstruct masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, which contains various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.

2602.00593 2026-06-15 cs.CV cs.LG 版本更新

Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Pix2Fact: 当视觉不够时——基于网络验证的细粒度VQA基准测试

Yifan Jiang, Cong Zhang, Bofei Zhang, Qiaofeng Zheng, Yifan Yang, Bingzhang Wang, Yew-Soon Ong

发表机构 * GADE Union (Global AI Data Experts Union)(GADE联盟(全球人工智能数据专家联盟)) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) New York University(纽约大学) Cambridge University(剑桥大学) The University of Hong Kong(香港大学)

AI总结 本文提出Pix2Fact基准测试,通过高分辨率真实场景中的网络验证,评估细粒度视觉问答中的专家级视觉感知和知识搜索能力,发现现有模型在复杂任务中存在显著不足。

详情
AI中文摘要

尽管在通用任务上取得了进展,视觉-语言模型(VLMs)仍然难以应对同时需要细粒度视觉定位和外部知识的挑战,而现有基准测试孤立地评估这些能力,忽视了二者的协同。为填补这一空白,我们引入Pix2Fact,一个视觉问答基准测试,旨在评估专家级视觉感知和知识搜索能力。Pix2Fact包含1000张高分辨率(4K+)图像,覆盖八个场景。其问题和答案由来自全球顶尖大学、不同学科的拥有博士学位的标注者精心设计。每个问题都需要细致的视觉定位和外部知识的整合。我们评估了十种最先进的VLMs,包括Gemini-3.1-Pro和GPT-5.4等专有模型,发现Pix2Fact对模型提出了严峻挑战:最先进的模型(Gemini-3.1-Pro)即使可以使用视觉真值(ground truth)和搜索工具,也仅达到51.7%的平均准确率。我们的分析将低准确率归因于三个因素:即使提供视觉真值仍频繁出现的视觉定位错误、对搜索的浅层利用,以及VLM无法检索长尾、非结构化的局部信息。这一显著差距暴露了当前模型在帮助人类处理需要高强度视觉理解的现实场景时的局限性。我们相信Pix2Fact将成为推动下一代语言-视觉智能体的关键基准测试,这些智能体能够无缝整合细粒度感知与稳健的知识搜索。

英文摘要

Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities in isolation. To fill this void, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are meticulously crafted by PhD-holding annotators from top global universities across diverse disciplines. Each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model (Gemini-3.1-Pro) achieves only 51.7% average accuracy, even with access to visual ground truth and search tools. Our analysis attributes this low accuracy to three factors: frequent visual grounding errors even with visual ground truth, shallow use of search, and VLMs' inability to retrieve long-tail, unstructured local information. This striking gap exposes the limitations of current models in assisting humans with real-world scenarios that demand intensive visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the next generation of language-vision agents that seamlessly integrate fine-grained perception with robust knowledge search.

13. 其他/综合视觉 12 篇

2606.13736 2026-06-15 cs.CV 新提交

Connections Between Pairs of Filters Improve the Accuracy of Convolutional Neural Networks

滤波器对之间的连接提高卷积神经网络的准确性

Kathleen Anderson, Philipp Grüning, Erhardt Barth

发表机构 * GitHub

AI总结 本文提出在卷积神经网络中引入可学习的滤波器对连接函数,替代传统点式非线性激活,通过在不同层自适应调整连接方式提升网络性能。

Comments IJCNN 2023

详情
AI中文摘要

尽管研究人员不断为CNN寻找新的改进网络结构,但大多数新发明的架构仍然依赖于堆叠卷积块并用点式激活函数分隔的传统模式。然而,纯粹基于点式非线性的网络存在缺陷。一种替代方案是在网络的两个滤波器之间引入成对连接。典型的连接函数使用乘法或最小值操作来实现逻辑AND连接。在本文中,我们进一步证明CNN可以从更通用的连接中受益,这些连接包含可学习的参数。通过这样的参数,网络能够在不同的网络层实现不同的连接,并更好地使连接函数适应手头的任务。

英文摘要

While researchers continue to find new and improved network structures for CNNs, most of the newly invented architectures still rely on the traditional pattern of stacking convolutional blocks and separating them with pointwise activation functions. However, there are drawbacks to a network purely building on pointwise nonlinearities. One alternative is to introduce a pairwise connection between two filters of a network. Typical connection functions use multiplications or the minimum operation to realize logical AND connections. In this paper, we go one step further by demonstrating that CNNs can benefit from more general connections, which include parameters that are learned. With such parameters, the network is able to implement different connections in different network layers and better adapt the connection function to the task at hand.
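
As a rough illustration of the pairwise connections described in the abstract above, the sketch below combines two filter responses with a product or a minimum, so that the output is active only when both inputs are. The function names and the ReLU rectification are assumptions made for this example. The learnable connection functions proposed in the paper are not reproduced here.

```python
def relu(x: float) -> float:
    """Rectify a single filter response."""
    return x if x > 0.0 else 0.0


def and_product(a: float, b: float) -> float:
    """AND-like connection: product of two rectified filter responses."""
    return relu(a) * relu(b)


def and_min(a: float, b: float) -> float:
    """AND-like connection: minimum of two rectified filter responses."""
    return min(relu(a), relu(b))
```

Both functions return zero whenever either filter response is non-positive, which is what makes them behave like a soft logical AND.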

2606.13839 2026-06-15 cs.CV cs.AI eess.IV 新提交

Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

解释RhythmFormer:远程光电容积描记术周期性稀疏注意力的系统XAI分析

Louis Chen, Torbjörn E. M. Nordling

发表机构 * Department of Mechanical Engineering, National Cheng Kung University(国立成功大学机械工程学系)

AI总结 针对rPPG Transformer可解释性缺乏定量评估的问题,将四种归因方法适配到RhythmFormer,并引入皮肤覆盖度指标和SaCo忠实度系数,量化稀疏注意力中的多跳泄漏效应;在UBFC-rPPG上,Beyond Intuition方法在所评估方法中取得最高的皮肤覆盖度和忠实度。

Comments 26 pages, 8 figures

详情
AI中文摘要

远程光电容积描记术(rPPG)Transformer在基准测试中实现了低心率误差,但其决策仍然不透明,随着rPPG向临床心率估计发展,这一问题日益受到关注。现有的rPPG XAI主要依赖定性热图检查,缺乏定量忠实度指标或基于生理学的验证,在视觉合理性和可审计证据之间存在差距。我们致力于弥合这一差距。首先,我们将四种归因方法(原始注意力、rollout、flow、Beyond Intuition)适配到RhythmFormer的双层路由注意力(带有top-$k$选择)上。其次,我们引入了一个皮肤覆盖度指标,量化归因质量落在皮肤区域的比例。第三,我们将SaCo忠实度系数从其原始分类设置适配到rPPG回归,使用原始和扰动预测rPPG波形之间的MAE作为扰动影响。应用这些工具,我们量化了稀疏top-$k$路由下的多跳泄漏效应:注意力rollout和flow几乎完全恢复了各个精炼注意力层明确设置为零的连接。Beyond Intuition通过其值投影加权rollout和梯度支持掩码缓解了这一问题,在UBFC-rPPG上获得了评估方法中最高的中位精炼皮肤覆盖度(0.83对比vanilla rollout的0.57)和忠实度(F=0.92)。仍需要在不同数据集和模型变体上进行验证。对低SaCo异常值的案例研究进一步表明,一旦替换了伪影区域,所有四种方法都一致恢复,表明在这个示例案例中,各归因方法族的SaCo行为一致。总之,这些指标将rPPG XAI推向关于空间对齐和扰动忠实度的可审计数值证据,即可信的rPPG XAI。

英文摘要

Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque, a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer's bi-level routing attention with top-$k$ selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-$k$ routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage ($0.83$ vs. $0.57$ for vanilla rollout) and faithfulness ($F=0.92$) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is still needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e., trustworthy rPPG XAI.
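
To make the multi-hop leakage effect and the skin coverage idea more concrete, here is a minimal sketch. It assumes vanilla attention rollout with equal-weight residual mixing and a toy 3-token attention matrix. The names `rollout`, `skin_coverage`, and `SPARSE` are hypothetical, and the paper's exact metric definitions (for example, how "refined" coverage is computed) may differ.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add_residual(attn):
    """Mix attention with the identity (0.5 each), as in vanilla rollout."""
    n = len(attn)
    return [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def rollout(layers):
    """Multiply residual-mixed attention matrices from first to last layer."""
    n = len(layers[0])
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        result = matmul(add_residual(attn), result)
    return result


def skin_coverage(attribution, skin_mask):
    """Fraction of total attribution mass that falls on skin pixels."""
    total = sum(v for row in attribution for v in row)
    on_skin = sum(v for row, mrow in zip(attribution, skin_mask)
                  for v, m in zip(row, mrow) if m)
    return on_skin / total if total > 0 else 0.0


# Toy sparse attention: token 0 never attends to token 2 within a layer.
SPARSE = [[0.5, 0.5, 0.0],
          [0.0, 0.5, 0.5],
          [0.5, 0.0, 0.5]]
ROLLED = rollout([SPARSE, SPARSE])
```

Although `SPARSE[0][2]` is zero in every layer, `ROLLED[0][2]` is positive because token 0 reaches token 2 through token 1. This is the kind of restored connection the abstract describes.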

2606.14006 2026-06-15 cs.CV cs.ET 新提交

HARBOR: Heading Analysis and Reconstruction from Behavioral Observation and Radar

HARBOR:基于行为观测与雷达的航向分析与重建

Joao P. A. Dantas, Paulo F. Silva Filho, Jelton A. Cunha, Gabriel Dietzsch

发表机构 * Institute for Advanced Studies (IEAv)(高级研究所(IEAv))

AI总结 提出HARBOR管道,仅用单张SAR图像在无辅助数据时预测船只运动,通过骨架几何和局部强度估计航向,离线校准AIS参数生成概率热图。

详情
AI中文摘要

海上态势感知通常依赖自动识别系统(AIS)传输来跟踪船只运动。然而,在作战或冲突场景中,由于信号丢失、故意关闭或有意欺骗,这些数据可能不可用。在此条件下,合成孔径雷达(SAR)图像成为广域海上监测的关键传感替代方案,尽管仅提供静态场景快照。本文介绍HARBOR(基于行为观测与雷达的航向分析与重建),一个完整的管道,用于将单张SAR图像转换为预测运动信息,而无需在推理时使用任何辅助数据源。该方法首先进行SAR图像预处理以增强和分割船只候选区域,然后通过骨架几何和局部强度模式进行自动检测、基于尺寸的分类和航向估计。AIS数据仅在离线校准阶段用于推导依赖船只类型的运动参数,随后应用于生成候选未来船只位置的概率热图。使用真实COSMO-SkyMed SAR图像进行的案例研究展示了该管道在巴西南部海上场景中的应用,显示了其在数据拒绝环境中提取运动趋势并生成船只位置概率投影的能力。

英文摘要

Maritime situational awareness often relies on Automatic Identification System (AIS) transmissions to track vessel movements. However, in operational or conflict scenarios, these data may be unavailable due to signal loss, deliberate deactivation, or intentional spoofing. In such conditions, synthetic aperture radar (SAR) imagery becomes a critical sensing alternative for wide-area maritime monitoring, despite providing only static scene snapshots. This work introduces HARBOR (Heading Analysis and Reconstruction from Behavioral Observation and Radar), a complete pipeline for transforming a single SAR image into predictive motion information without requiring any auxiliary data source at inference time. The method begins with SAR image preprocessing to enhance and segment vessel candidates, followed by automatic detection, size-based classification, and heading estimation using skeleton geometry and local intensity patterns. AIS data are used exclusively during an offline calibration phase to derive vessel-type-dependent motion parameters, which are then applied to generate probabilistic heatmaps of candidate future vessel positions. A case study using real COSMO-SkyMed SAR imagery demonstrates the pipeline on a maritime scene in southern Brazil, showing its ability to extract motion tendencies and generate probabilistic projections of vessel positions in data-denied environments.

2606.14024 2026-06-15 cs.CV 新提交

ViT-Up: Faithful Feature Upsampling for Vision Transformers

ViT-Up:面向视觉Transformer的忠实特征上采样

Krispin Wandel, Jingchuan Wang, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出ViT-Up,一种隐式特征上采样框架,通过从中间ViT隐藏状态构建逐层查询,在任意连续坐标预测特征,避免图像引导带来的特征泄露和模糊,在密集预测和语义对应任务上超越现有方法。

Comments Code is available at: https://github.com/krispinwandel/vit-up

详情
AI中文摘要

视觉Transformer(ViT)已成为视觉表示学习的主导架构,提供异常强大且广泛可重用的骨干特征。然而,由于全局自注意力的二次复杂度,ViT通常在小块令牌网格上运行,这给语义分割和深度估计等密集预测任务带来了持续瓶颈。这推动了任务无关特征上采样器的发展。尽管最近的最先进方法能产生视觉锐利的密集表示,但它们依赖浅层图像编码器进行引导上采样,可能引入特征泄露、碎片化和模糊。我们提出ViT-Up,一种隐式特征上采样框架,用从中间ViT隐藏状态构建的逐层查询替代外部图像引导。这使得在任意连续图像坐标上预测特征成为可能,同时保持与骨干特征空间的对齐。实验表明,ViT-Up在密集预测和语义对应任务上持续优于最先进的图像引导上采样器。在DINOv3-S+上,ViT-Up在Cityscapes上相比先前方法提升高达+2.07 mIoU,在SPair-71k上提升+4.17 PCK@0.10。使用更大的DINOv3-B骨干时,这些增益增加到+3.36 mIoU和+8.09 PCK@0.10,表明ViT-Up随骨干容量增加而扩展良好。

英文摘要

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.

2606.14556 2026-06-15 cs.CV 新提交

Visual Quality Score Assessment of Large White Goods in Remanufacture with Multi-View Deformable-DETR

基于多视角可变形DETR的再制造大型白色家电视觉质量评分评估

Paul Koch, Vivek Chavan

发表机构 * Fraunhofer-Institut für Produktionsanlagen und Konstruktionstechnik (IPK)(弗劳恩霍夫生产设备和结构技术研究所)

AI总结 针对再制造中大型白色家电视觉质量评估依赖人工且难以处理小缺陷的问题,提出基于多视角可变形DETR的自动评分框架,通过自监督预训练和微调减少标注需求,实现精确评估与可解释性。

Comments Accepted to GCSM 2026

详情
AI中文摘要

再制造大型白色家电对于循环经济至关重要,但视觉质量评估仍然是培训和定价的手动瓶颈。传统的检测方法需要大量标注,并且难以处理高分辨率多视角数据中的小缺陷。我们提出了一个基于可变形DETR的多视角框架,用于自动质量评分,该框架跨冗余视图聚合信息以提取细粒度特征。为了在有限标签下增强鲁棒性,我们采用自监督预训练,随后在专家标注的分数上进行监督微调。此外,在冻结特征图上进行线性投影,以识别感兴趣区域来解释模型决策。在工业多视角数据集上评估,我们的方法提供了精确的质量评估,同时减少了对人工标注和每个部件定制的依赖,为再制造生产线实现了可扩展且透明的检测。

英文摘要

Remanufacturing large white goods is essential for a circular economy, yet visual quality assessment remains a manual bottleneck for training and pricing. Conventional detection methods require extensive annotation and struggle with small defects in high-resolution multi-view data. We present a multi-view framework based on Deformable-DETR for automated quality scoring that aggregates information across redundant views to extract fine-grained features. To enhance robustness with limited labels, we employ self-supervised pretraining followed by supervised fine-tuning on expert-annotated scores. Additionally, a linear projection over frozen feature maps identifies regions of interest to explain model decisions. Evaluated on an industrial multi-view dataset, our approach delivers precise quality assessments while reducing reliance on manual annotation and per-part customization, enabling scalable and transparent inspection for remanufacturing lines.

2502.00869 2026-06-15 cs.CV 版本更新

A Unified Theory of Sinusoidal Activation Families for Implicit Neural Representations

隐式神经表示的正弦激活函数族统一理论

Alireza Morsali, MohammadJavad Vaez, Mohammadhossein Soltani, Amirhossein Kazerouni, Babak Taati, Morteza Mohammad-Noori

发表机构 * McGill University(麦吉尔大学) University of Melbourne(墨尔本大学) ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems (MACSYS)(细胞系统数学分析卓越中心(MACSYS)) University of Toronto(多伦多大学) Vector Institute(向量研究所) University Health Network(大学健康网络) University of Tehran(德黑兰大学)

AI总结 提出STAF框架,通过可学习振幅、频率和相位的傅里叶式激活函数,理论分析其表达能力、NTK谱变化及初始化,实验证明在图像、音频、形状、逆问题和NeRF任务中优于或匹敌现有方法。

Comments Published in TMLR

详情
AI中文摘要

隐式神经表示(INR)使用紧凑神经网络建模连续信号,已成为视觉、图形和信号处理中的标准工具。一个核心挑战是在没有繁重的手工编码或脆弱的训练启发式方法的情况下准确捕捉精细细节。在文献中,周期激活函数已成为一种引人注目的补救措施:从使用固定全局频率的单一正弦波的SIREN,到采用多个正弦波并在某些情况下使用可训练频率和相位的更近期架构。我们研究了这一正弦激活函数族,并为INR中的可训练正弦激活函数开发了一个有原则的理论和实践框架。具体来说,我们通过正弦可训练激活函数(STAF)实例化该框架,这是一种傅里叶式激活函数,其振幅、频率和相位都是可学习的。我们的分析(i)建立了一个Kronecker等价构造,将可训练正弦激活函数与标准正弦网络联系起来,并量化了表达能力增长;(ii)刻画了在可训练正弦参数化下神经正切核(NTK)谱的变化;(iii)提供了一种初始化方法,无需渐近中心极限定理(CLT)论证即可产生标准正态后激活。在实验上,对于图像、音频、形状、逆问题(超分辨率、去噪)和NeRF,STAF在评估的INR任务中,在PSNR/SSIM等面向失真的重建指标上具有竞争力且通常更强,并且在逐层共享下具有有利的参数效率。虽然周期激活函数可以缓解谱偏差的实际表现,但我们的结果表明它们并未消除谱偏差;相反,可训练正弦函数可以改善评估设置中观察到的容量-优化权衡。

英文摘要

Implicit Neural Representations (INRs) model continuous signals with compact neural networks and have become a standard tool in vision, graphics, and signal processing. A central challenge is accurately capturing fine detail without heavy hand-crafted encodings or brittle training heuristics. Across the literature, periodic activations have emerged as a compelling remedy: from SIREN, which uses a single sinusoid with a fixed global frequency, to more recent architectures employing multiple sinusoids and, in some cases, trainable frequencies and phases. We study this family of sinusoidal activations and develop a principled theoretical and practical framework for trainable sinusoidal activations in INRs. Concretely, we instantiate this framework with Sinusoidal Trainable Activation Functions (STAF), a Fourier-like activation whose amplitudes, frequencies, and phases are learned. Our analysis (i) establishes a Kronecker-equivalence construction that expresses trainable sinusoidal activations with standard sine networks and quantifies expressive growth, (ii) characterizes how the Neural Tangent Kernel (NTK) spectrum changes under trainable sinusoidal parameterization, and (iii) provides an initialization that yields standard normal post-activations without asymptotic central limit theorem (CLT) arguments. Empirically, on images, audio, shapes, inverse problems (super-resolution, denoising) and NeRF, STAF is competitive and often stronger on distortion-oriented reconstruction metrics such as PSNR/SSIM across the evaluated INR tasks, with favorable parameter efficiency under layer-wise sharing. While periodic activations can alleviate practical manifestations of spectral bias, our results indicate they do not eliminate it; instead, trainable sinusoids can improve the observed capacity-optimization trade-off in the evaluated settings.
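
The abstract describes STAF as a Fourier-like activation with learnable amplitudes, frequencies, and phases. A minimal sketch of such an activation is shown below, assuming it takes the form of a finite sum of sinusoids. The function name `staf` and the plain-list parameters are hypothetical, and the paper's initialization scheme and layer-wise sharing are not reproduced.

```python
import math


def staf(x: float, amplitudes, frequencies, phases) -> float:
    """Sum of sinusoids a_i * sin(w_i * x + phi_i) applied to a pre-activation x.

    In a trained network the three parameter lists would be learned;
    here they are plain Python lists for illustration.
    """
    return sum(a * math.sin(w * x + p)
               for a, w, p in zip(amplitudes, frequencies, phases))
```

With a single term of amplitude 1, frequency 1, and phase 0, the function reduces to a plain sine, the kind of single-sinusoid activation the abstract associates with SIREN.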

2512.14366 2026-06-15 cs.CV 版本更新

Optimizing Rank for High-Fidelity Implicit Neural Representations

优化秩以实现高保真隐式神经表示

Julian McGinnis, Florian A. Hölzl, Suprosanna Shit, Florentin Bieder, Paul Friedrich, Mark Mühlau, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文通过训练中稳定秩的调控,证明简单MLP也能实现高保真隐式神经表示,使用Muon优化器可显著提升性能,在多种任务上PSNR提升高达9 dB。

详情
AI中文摘要

基于普通多层感知器(MLP)的隐式神经表示(INR)被广泛认为无法表示高频内容。这促使研究转向架构干预,如坐标嵌入或专用激活函数,以表示高频信号。在本文中,我们挑战了普通MLP的低频偏差是学习高频内容的内在架构限制这一观点,而是认为这是训练过程中稳定秩退化的症状。我们通过实验证明,在训练过程中调控网络的秩可以显著提高学习信号的保真度,甚至使简单的MLP架构也具有表现力。大量实验表明,使用像Muon这样具有高秩、近正交更新的优化器,能够持续增强INR架构,甚至超越简单的ReLU MLP。这些显著改进适用于多种领域,包括自然图像、医学图像和新视角合成,在相同架构下PSNR提升高达9 dB。代码可在(https://rank-inrs.github.io)获取。

英文摘要

Implicit Neural Representations (INRs) based on vanilla Multi-Layer Perceptrons (MLPs) are widely believed to be incapable of representing high-frequency content. This has directed research efforts towards architectural interventions, such as coordinate embeddings or specialized activation functions, to represent high-frequency signals. In this paper, we challenge the notion that the low-frequency bias of vanilla MLPs is an intrinsic architectural limitation on learning high-frequency content, and argue instead that it is a symptom of stable rank degradation during training. We empirically demonstrate that regulating the network's rank during training substantially improves the fidelity of the learned signal, rendering even simple MLP architectures expressive. Extensive experiments show that using optimizers like Muon, with high-rank, near-orthogonal updates, consistently enhances INR architectures even beyond simple ReLU MLPs. These substantial improvements hold across a diverse range of domains, including natural and medical images and novel view synthesis, with up to +9 dB PSNR over the same architecture. Code is available at https://rank-inrs.github.io.
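
For readers unfamiliar with the term, the sketch below computes the stable rank of a weight matrix using the standard definition, the squared Frobenius norm divided by the squared spectral norm. It assumes this is the quantity the abstract refers to. The helper name `stable_rank` and the use of power iteration are choices made for this example, not the authors' code.

```python
import math


def stable_rank(w, iters: int = 200) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2 of a matrix given as a list of rows.

    The largest squared singular value is estimated by power iteration on
    W^T W, which assumes the all-ones start vector is not orthogonal to
    the top right singular vector.
    """
    rows, cols = len(w), len(w[0])
    frob_sq = sum(v * v for row in w for v in row)
    vec = [1.0 / math.sqrt(cols)] * cols  # unit-norm start vector
    sigma_sq = 0.0
    for _ in range(iters):
        u = [sum(w[i][j] * vec[j] for j in range(cols)) for i in range(rows)]
        z = [sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(c * c for c in z))
        if norm == 0.0:
            return 0.0  # zero matrix (or degenerate start vector)
        sigma_sq = norm  # ||W^T W v|| converges to sigma_max^2 for unit v
        vec = [c / norm for c in z]
    return frob_sq / sigma_sq
```

An identity matrix has stable rank equal to its dimension, while a rank-one matrix has stable rank 1, so lower values indicate that a few directions dominate the weights.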

2604.14193 2026-06-15 cs.CV eess.IV q-bio.NC 版本更新

QualiaNet: An Experience-Before-Inference Network

QualiaNet:一种先体验后推理的网络

Paul Linton

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出QualiaNet,模拟人类立体视觉的两阶段架构:先通过视差图模拟体验,再用CNN从视差梯度估计距离,验证了从视差梯度恢复距离的可行性。

详情
Journal ref
Extended abstract presented at the 9th Conference on Cognitive Computational Neuroscience, New York, NY, USA, 2026
AI中文摘要

人类3D视觉涉及两个不同阶段:体验模块,其中相对于注视点提取立体深度;推理模块,其中解释这种体验以估计3D场景属性。矛盾的是,尽管立体视觉不提供绝对距离信息,但它仍然影响我们对距离的推断。我们提出推理模块利用自然场景统计:近景产生鲜明的视差梯度,而远景相对平坦。QualiaNet在计算上实现了这种两阶段架构:模拟人类立体体验的视差图被传递给训练用于估计距离的CNN。该网络可以仅从视差梯度恢复距离,验证了这种方法。

英文摘要

Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although stereo vision does not provide us with absolute distance information, it nonetheless affects our inferences about distance. We propose the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.

2601.18707 2026-06-15 cs.LG cs.AI cs.CV cs.NE 版本更新

SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model

SMART: 基于Transformer代理模型的原始几何形状可扩展无网格气动模拟

Jan Hagnberger, Mathias Niepert

AI总结 提出SMART,一种无需模拟网格、仅使用几何点云预测任意查询位置物理量的神经代理模型,通过交叉层交互联合更新几何特征和物理场,性能媲美甚至超越依赖网格的方法。

Comments Accepted for publication at the 43rd International Conference on Machine Learning (ICML) 2026, Seoul, South Korea

详情
AI中文摘要

基于机器学习的代理模型已成为复杂几何体(如车身)物理模拟中数值求解器的高效替代方案。许多现有模型将模拟网格作为额外输入,从而减少预测误差。然而,为新几何体生成模拟网格计算成本高昂。相比之下,不依赖模拟网格的无网格方法通常误差更高。基于这些考虑,我们引入了SMART,一种神经代理模型,它仅使用几何体的点云表示,无需访问模拟网格,即可预测任意查询位置的物理量。几何体和模拟参数被编码到一个共享的潜在空间中,该空间捕捉物理场的结构和参数特征。然后,一个物理解码器关注编码器的中间潜在表示,将空间查询映射到物理量。通过这种跨层交互,模型联合更新潜在几何特征和演变的物理场。大量实验表明,SMART与依赖模拟网格作为输入的现有方法相比具有竞争力,并且通常表现更优,展示了其在工业级模拟中的能力。

英文摘要

Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.

2603.12400 2026-06-15 math.CO cs.CV 版本更新

Generation of Maximal Snake Polyominoes Using a Deep Neural Network

使用深度神经网络生成最大蛇形多联骨牌

Benjamin Gauthier, Alain Goupil, Fadel Toure

发表机构 * Université du Québec à Trois-Rivières(魁北克大学三河分校)

AI总结 提出结构化像素空间扩散模型,从数据驱动学习生成最大蛇形多联骨牌,无需显式编码约束,能泛化到更大矩形并接近当前计算极限。

Comments In Proceedings GASCom 2026, arXiv:2606.09910

详情
Journal ref
EPTCS 445, 2026, pp. 104-113
AI中文摘要

最大蛇形多联骨牌在大矩形中难以数值研究,因为计算它们需要对特定矩形大小的所有蛇形进行完全枚举,这相当于暴力算法。这阻碍了对更大矩形中最大蛇形的研究。此外,大多数可枚举的蛇形位于小矩形中,掩盖了大尺度模式。在本文中,我们研究了深度神经网络在基于数据驱动的训练中生成最大蛇形多联骨牌的贡献,其中最大性和邻接约束不是显式编码的,而是学习的。为此,我们实验了一种去噪扩散模型,我们称之为结构化像素空间扩散(SPS Diffusion)。我们发现SPS Diffusion从小矩形泛化到大矩形,生成有效的蛇形直至28x28方格,并在接近当前计算极限的方格上产生最大蛇形候选。然而,该模型容易出错,例如分支、循环或多个蛇形组件。总体而言,扩散模型是有前景的,表明深度神经网络可以理解复杂的组合对象,这对其研究是有用的。

英文摘要

Maximal snake polyominoes are difficult to study numerically in large rectangles, as computing them requires the complete enumeration of all snakes for a specific rectangle size, which corresponds to a brute force algorithm. This hinders the study of maximal snakes in larger rectangles. Moreover, most enumerable snakes lie in small rectangles, obscuring large-scale patterns. In this paper, we investigate the contribution of a deep neural network to the generation of maximal snake polyominoes through data-driven training, where the maximality and adjacency constraints are not encoded explicitly, but learned. To this end, we experiment with a denoising diffusion model, which we refer to as Structured Pixel Space Diffusion (SPS Diffusion). We find that SPS Diffusion generalizes from small rectangles to larger ones, generating valid snakes up to 28x28 squares and producing maximal snake candidates on squares close to the current computational limit. The model is, however, prone to errors such as branching, cycles, or multiple snake components. Overall, the diffusion model is promising and suggests that complex combinatorial objects can be understood by deep neural networks, which is useful in their investigation.
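
The error types named in the abstract (branching, cycles, multiple components) can be checked mechanically. The sketch below is one possible validity test, assuming a snake is a connected set of grid cells in which every cell has at most two edge-adjacent neighbors and exactly two cells have one neighbor. The function name `is_snake` is hypothetical, single cells are rejected by assumption, and maximality is not checked.

```python
from collections import deque


def is_snake(cells) -> bool:
    """Return True if the (x, y) cells form a single non-branching, acyclic path."""
    cells = set(cells)
    if len(cells) < 2:
        return False  # assumption: a lone cell is not counted as a snake

    def neighbors(cell):
        x, y = cell
        candidates = ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        return [c for c in candidates if c in cells]

    degrees = {c: len(neighbors(c)) for c in cells}
    if any(d > 2 for d in degrees.values()):
        return False  # branching
    if sum(1 for d in degrees.values() if d == 1) != 2:
        return False  # a cycle or several separate pieces

    # Breadth-first search to confirm there is only one component.
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in neighbors(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(cells)
```

A check of this kind could be used to filter generated samples before testing whether any of them is a maximal-length candidate.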

2210.00379 2026-06-15 cs.CV 版本更新

NeRF: Neural Radiance Field in 3D Vision: A Comprehensive Review (Updated Post-Gaussian Splatting)

NeRF:3D视觉中的神经辐射场:全面综述(Gaussian Splatting出现后更新版)

Kyle Gao, Yina Gao, Hongjie He, Dening Lu, Linlin Xu, Jonathan Li

发表机构 * Faculty of Engineering, University of Toronto(多伦多大学工程学院) Department of Geomatics Engineering, University of Calgary(卡尔加里大学测绘工程系)

AI总结 本文综述了NeRF在过去五年的研究,涵盖了在Gaussian Splatting出现前和后的发展,总结了NeRF在新颖视角合成和3D隐式和混合表示神经场学习中的应用和贡献。

Comments Updated Post-Gaussian Splatting

详情
AI中文摘要

2020年3月,神经辐射场(NeRF)革新了计算机视觉,使隐式、基于神经网络的场景表示和新颖视角合成成为可能。NeRF模型在机器人、城市测绘、自动驾驶导航、虚拟现实/增强现实等领域找到了广泛的应用。2023年8月,Gaussian Splatting作为一种直接竞争对手被提出,获得了巨大的势头,并在新颖视角合成领域超越了基于NeRF的研究,成为主导框架。本文综述了过去五年(2020-2025)的NeRF相关论文。这些论文包括Gaussian Splatting出现前的时期,当时NeRF在新颖视角合成和3D隐式和混合表示神经场学习中占据主导地位。我们还包含Gaussian Splatting出现后的作品,其中NeRF和隐式/混合神经场找到了更多小众应用。我们的综述分为Gaussian Splatting出现前的架构和应用分类,以及NeRF、神经场和隐式/混合神经表示方法的活跃研究领域分类。我们介绍了NeRF的理论及其通过可微体积渲染进行训练的介绍。我们还提供了经典NeRF、隐式和混合神经表示以及神经场模型的性能和速度的基准比较,并概述了关键数据集。

英文摘要

In March 2020, Neural Radiance Field (NeRF) revolutionized Computer Vision, allowing for implicit, neural network-based scene representation and novel view synthesis. NeRF models have found diverse applications in robotics, urban mapping, autonomous navigation, virtual reality/augmented reality, and more. In August 2023, Gaussian Splatting, a direct competitor to the NeRF-based framework, was proposed, gaining tremendous momentum and overtaking NeRF-based research in terms of interest as the dominant framework for novel view synthesis. We present a comprehensive survey of NeRF papers from the past five years (2020-2025). These include papers from the pre-Gaussian Splatting era, where NeRF dominated the field for novel view synthesis and 3D implicit and hybrid representation neural field learning. We also include works from the post-Gaussian Splatting era where NeRF and implicit/hybrid neural fields found more niche applications. Our survey is organized into architecture and application-based taxonomies in the pre-Gaussian Splatting era, as well as a categorization of active research areas for NeRF, neural field, and implicit/hybrid neural representation methods. We provide an introduction to the theory of NeRF and its training via differentiable volume rendering. We also present a benchmark comparison of the performance and speed of classical NeRF, implicit and hybrid neural representation, and neural field models, and an overview of key datasets.

2508.18967 2026-06-15 cs.RO cs.CV 版本更新

Enhanced UAV Path Planning Using the Tangent Intersection Guidance (TIG) Algorithm

利用切线交点引导算法(TIG)增强的无人机路径规划

Hichem Cheriet, Khellat Kihel Badra, Chouraqui Samira

AI总结 本文提出TIG算法,通过椭圆切线交点方法生成可行路径,结合启发式规则和二次贝塞尔曲线平滑技术,在静态和动态环境中实现高效安全的无人机路径规划。

Comments Accepted for publication in JAMRIS Journal

详情
Journal ref
Journal of Automation, Mobile Robotics and Intelligent Systems, 20(2), 30-52 (2026)
AI中文摘要

高效且安全的无人机(UAV)导航对于各种应用至关重要,包括战斗支援、包裹递送和搜索救援。本文介绍了切线交点引导(TIG)算法,一种用于静态和动态环境中无人机路径规划的先进方法。该算法使用椭圆切线交点方法生成可行路径。它为每个威胁生成两条子路径,根据启发式规则选择最佳路线,并迭代优化路径,直到到达目标。考虑到无人机的运动学和动力学约束,采用基于二次贝塞尔曲线的改进平滑技术生成平滑且高效的路径。实验结果表明,在静态环境中,与A*、PRM、RRT*、切线图和静态APPATT算法相比,TIG算法能够以更短的时间(最低0.01秒)生成最短路径,且转向角度更少。此外,在完全未知和部分已知环境中,TIG展示了高效的实时路径规划能力,用于避障,优于APF和动态APPATT算法。

英文摘要

Efficient and safe navigation of Unmanned Aerial Vehicles (UAVs) is critical for various applications, including combat support, package delivery, and search and rescue operations. This paper introduces the Tangent Intersection Guidance (TIG) algorithm, an advanced approach for UAV path planning in both static and dynamic environments. The algorithm uses the elliptic tangent intersection method to generate feasible paths. It generates two sub-paths for each threat, selects the optimal route based on a heuristic rule, and iteratively refines the path until the target is reached. Considering the UAV kinematic and dynamic constraints, a modified smoothing technique based on quadratic Bézier curves is adopted to generate a smooth and efficient route. Experimental results show that, in static environments, the TIG algorithm generates the shortest path in less time (from 0.01 seconds) and with fewer turning angles than the A*, PRM, RRT*, Tangent Graph, and Static APPATT algorithms. Furthermore, in completely unknown and partially known environments, TIG demonstrates efficient real-time path planning capabilities for collision avoidance, outperforming the APF and Dynamic APPATT algorithms.
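
The smoothing step mentioned in the abstract builds on quadratic Bézier curves. The sketch below evaluates the standard curve B(t) = (1-t)^2 P0 + 2(1-t)t P1 + t^2 P2 and samples it to round off a corner of a waypoint path. The function names and the sampling scheme are assumptions for illustration, and the paper's modified technique is not reproduced.

```python
def quadratic_bezier(p0, p1, p2, t: float):
    """Point on the quadratic Bezier curve with control points p0, p1, p2 at t in [0, 1]."""
    s = 1.0 - t
    return tuple(s * s * a + 2.0 * s * t * b + t * t * c
                 for a, b, c in zip(p0, p1, p2))


def smooth_corner(p0, p1, p2, samples: int = 5):
    """Replace the sharp turn at p1 with `samples` points on the curve (samples >= 2)."""
    return [quadratic_bezier(p0, p1, p2, i / (samples - 1))
            for i in range(samples)]
```

The curve starts at `p0`, ends at `p2`, and is pulled toward the corner waypoint `p1` without passing through it, which is why it removes the abrupt heading change.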