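arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.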

2606.13929 2026-06-15 cs.CV cs.LG new submission

Self-Evolving Visual Questioner

Yijun Liang, Hengguang Zhou, Ming Li, Lichen Li, Cho-Jui Hsieh, Tianyi Zhou

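Affiliations: University of Maryland, College Park; University of California, Los Angeles; Peking University; Arena; MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

AI summary: Proposes a self-evolving framework in which a vision-language model acts as both proposer and filter, generating harder, more informative, and more visual-centric questions without external supervision while preserving exploration diversity to avoid training collapse, substantially improving the quality and difficulty boundary of autonomous question generation.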

Comments 21 pages, including references and appendix. Project Page is available at https://joliang17.github.io/SelfEvolvingVQG/

Abstract

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners' performance is bottlenecked by the availability of high-quality training data or the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.

2606.14703 2026-06-15 cs.CV cs.CL cs.LG new submission

Gaze Heads: How VLMs Look at What They Describe

Rohit Gandikota, David Bau

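Affiliations: Northeastern University

AI summary: Finds a set of "gaze heads" in the language backbone of vision-language models whose attention tracks the image region currently being described; intervening on these heads steers what the model describes with 83.1% accuracy.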

Abstract

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/
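The abstract says gaze heads are found "with a simple correlation score" but does not define it. One plausible reading, sketched below in Python purely as an assumption, is to correlate each head's attention mass on a target panel with a 0/1 indicator of whether that panel is being described at each generation step, then keep the highest-scoring heads. The function names, head labels, and numbers are all hypothetical and not taken from the paper.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def rank_heads(attn_on_target, is_target_described, top_k):
    """Rank heads by how well their attention to a panel tracks its description.

    attn_on_target[h][t]: attention mass head h places on the panel at step t.
    is_target_described[t]: 1 if the panel is being described at step t, else 0.
    """
    scores = {h: pearson(series, is_target_described)
              for h, series in attn_on_target.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up attention traces for three hypothetical heads over four steps.
traces = {
    "L3H1": [0.9, 0.8, 0.1, 0.2],      # follows the described panel
    "L5H7": [0.25, 0.25, 0.25, 0.25],  # constant, uninformative
    "L9H2": [0.1, 0.2, 0.9, 0.7],      # looks elsewhere
}
described = [1, 1, 0, 0]
top = rank_heads(traces, described, top_k=1)
```

Under this reading, the intervention described in the abstract would then be applied only to the heads returned by the ranking step.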

2606.13707 2026-06-15 cs.AI cs.CL cs.CV cross-list

Orchestra-o1: Omnimodal Agent Orchestration

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

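Affiliations: CUHK (The Chinese University of Hong Kong); LIGHTSPEED; PKU (Peking University); THU (Tsinghua University); Tongji University

AI summary: Proposes Orchestra-o1, an omnimodal agent orchestration framework whose unified orchestration mechanism enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark; also introduces DA-GRPO, a reinforcement learning method used to train Orchestra-o1-8B, which achieves state-of-the-art performance among open-source omnimodal agents.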

Abstract

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.
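The abstract does not spell out how DA-GRPO works. As background only, the standard group relative policy optimization step it builds on scores each sampled rollout against its own group by standardizing rewards with the group mean and standard deviation. The Python sketch below shows that standard step; the function name, the `eps` term, the use of the population standard deviation, and the example rewards are illustrative assumptions, and none of the decision-alignment logic is represented.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled rollouts.

    Plain GRPO-style normalization, not DA-GRPO itself.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four rollouts of the same task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.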

2606.14106 2026-06-15 cs.MA cs.CV cross-list

Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents

Seoyoung Choi, Minseok Ko, Hyunseok Lee, Kunwoong Kim, Woomin Song, Chanseok Jeon, Jinwoo Shin

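Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

AI summary: Proposes Action-Grounded Visual Memory (AGMem), which stores image crops of the local GUI regions tied to successful actions instead of full screenshots, reducing action-level errors in GUI agents and improving task success rate on OSWorld by 33.3% over full-image memory.

Comments 9 pages, 5 figures, ICML 2026 Workshop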

Abstract

Graphical User Interface (GUI) agents are increasingly used to automate complex computer tasks across applications, websites, and operating systems. To improve their reliability, recent work has introduced experiential memory, where agents retrieve prior trajectories to guide decision-making in similar states. More recent approaches further extend this idea to visual memory by storing and retrieving screenshots from past interactions, providing agents with richer contextual information than text-only memories. However, the effect of visual memory in GUI agents remains insufficiently understood: it is unclear which failures visual memory mitigates, or which failures it exacerbates. To systematically analyze the effect of visual memory, we introduce a taxonomy of four GUI agent failures (i.e., cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error) that map to distinct stages of the perception-reasoning-action pipeline. We find that prepending full-image memory has a divergent effect on the failure distribution: it reduces state-level failures but worsens action-level ones, and increases hidden operation blindness and grounding error. Motivated by this finding, we propose Action-Grounded Visual Memory (AGMem), an action-grounded memory framework for GUI agents. The core idea of AGMem is to store image crops that capture the local GUI region closely related to a successful action or a recovery, rather than storing full screenshots. Experiments on OSWorld show that AGMem improves task success rates by 33.3% over full-image memory. These results demonstrate that AGMem is an effective representation for visual memory in GUI agents.

2606.14172 2026-06-15 cs.LG cs.CV cross-list

Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

Sirui Zhang, Xu Wang, Zhengyu Wu, Xunkai Li, Hongchao Qin

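Affiliations: University of Science and Technology of China

AI summary: Proposes CoMAG, a framework that handles graph-centric and modality-centric tasks in a unified way through task-adaptive reliable context learning and modality-preserving hop-token alignment, improving structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.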

Abstract

Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative representations, and modality-centric tasks requiring fine-grained cross-modal correspondence. However, existing MAG methods often rely on fixed graph contexts or uniformly fused representations, causing task-agnostic propagation and over-compressed fusion that hinder diverse task requirements and modality-specific evidence preservation. To address this, we propose CoMAG, a unified MAG backbone that learns task-adaptive reliable contexts and modality-preserving alignment within them. CoMAG first conducts Reliable Context Learning by estimating edge reliability from multimodal semantic consistency, complementing raw topology with semantic neighbors, and selecting context components through a task-aware gate. It then performs Modality-preserving Hop-token Alignment by maintaining modality-specific multi-hop trajectories, matching modality-hop tokens across modalities, and decoupling shared and private representations. Thus, CoMAG produces graph and modality representations from one forward pass while retaining modality-specific cues. We further analyze stable propagation, over-smoothing mitigation, and modality-collapse control. Experiments on nine OpenMAG datasets compare CoMAG with feature-only, graph-only, multimodal, and unified MAG baselines across graph-level prediction, modality matching, and graph-conditioned generation. Results show that CoMAG achieves the best reported performance, demonstrating that task-adaptive reliable contexts and modality-preserving alignment improve structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.

2508.03736 2026-06-15 cs.CV cs.AI updated version

Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

Rafayel Mkrtchyan, Armen Manukyan, Hrant Khachatrian, Theofanis P. Raptis

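Affiliations: Yerevan State University; Consiglio Nazionale delle Ricerche (National Research Council of Italy)

AI summary: Proposes a DINOv2-based deep learning framework that fuses open-source maps with RF data, using a vision transformer to process both modalities jointly; it reaches 65.3% and 64.9% macro IoU on synthetic and real-world datasets respectively, clearly outperforming single-source methods.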

Comments Work supported by funding under the bilateral agreement between CNR (Italy) and HESC MESCS RA (Armenia) as part of the DeepRF project for the 2025-2026 biennium, and by the HESC MESCS RA grant No. 22rl-052 (DISTAL)

DOI: 10.1016/j.pmcj.2026.102261
Journal ref: Pervasive and Mobile Computing, Article 102261, 2026

Abstract

In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining (possibly erroneous) maps from open-source platforms with pervasive radio frequency (RF) data collected from multiple wireless user equipment and base stations. Unlike prior methods, our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For evaluation purposes, we employ a synthetic dataset co-produced by Huawei. To address the challenges associated with real-world data imperfections, we introduce controlled noise to its RF data so as to simulate real-world conditions. Additionally, we develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics: the Jaccard index (intersection over union, IoU), the Hausdorff distance, and the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed, which yields 42.2%. The comparative evaluation highlights the limitations of relying solely on RF data or on spatial data, as well as the effectiveness of AI in fusing data to enhance smart city mapping accuracy. We further validate our method on real-world data from the Oslo region, complementing the synthetic evaluation with a real deployment setting, where our best fusion model reaches 64.9% macro IoU. We additionally outline a strategy for deploying the model over larger areas by tiling the region with overlapping windows.
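For readers unfamiliar with the headline metric, the Jaccard index of two binary building masks is the size of their intersection divided by the size of their union. The Python sketch below computes it and then takes an unweighted mean over maps; treating "macro" as an average over maps (rather than over classes) is an assumption, as are the function names and the toy masks.

```python
def iou(pred, truth):
    """Jaccard index between two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; scoring that case as 1.0 is an assumption.
    return 1.0 if union == 0 else inter / union

def macro_iou(preds, truths):
    """Unweighted mean of the per-map IoU scores."""
    scores = [iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)

# Toy 2x2 masks: the first pair overlaps on 1 of 2 occupied cells, the second is identical.
pred_a, truth_a = [[1, 1], [0, 0]], [[1, 0], [0, 0]]
pred_b, truth_b = [[0, 1], [0, 1]], [[0, 1], [0, 1]]
score = macro_iou([pred_a, pred_b], [truth_a, truth_b])
```

Because every map contributes equally to the mean, a small map with a poor prediction pulls the score down as much as a large one.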

2511.05017 2026-06-15 cs.CV cs.CL updated version

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Sarvesh Baskar, Vijay Kamarshi, Andrea Fanelli, Furong Huang

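Affiliations: University of Maryland; Dolby Laboratories; Capital One

AI summary: Targets hallucinations that arise when large vision-language models over-rely on textual priors and neglect visual cues; proposes a simple, effective visual feature incorporation method that learns visually-informed textual embeddings to balance the attention distribution, significantly reducing hallucinations and improving multimodal reasoning.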

Comments Accepted at The 64th Annual Meeting of the Association for Computational Linguistics

Abstract

Hallucinations in Large Vision-Language Models (LVLMs) remain a persistent challenge, often stemming from inadequate integration of visual information during multimodal reasoning. A key cause is the model's over-reliance on textual priors and underutilization of visual cues, leading to outputs that are linguistically fluent but visually inaccurate. For example, given an image of an empty kitchen countertop, an LVLM might hallucinate a "bowl of fruit" or "cup of coffee", relying on language associations rather than visual evidence. Most LVLMs incorporate visual features by appending them to the input stream of a pre-trained LLM and training on large-scale vision-language datasets. Our systematic analysis reveals that this strategy often leads to over-dependence on textual information due to the inherent bias of LLMs towards language-dominant representations. This imbalance skews attention towards the text over visual content, weakening the model's ability to ground outputs in visual inputs. To address this, we propose a simple yet effective visual feature incorporation method that encourages the model to learn visually-informed textual embeddings distinct from those of the base LLM and promotes a more balanced attention distribution. Experimental results across multiple hallucination benchmarks demonstrate that our method significantly reduces hallucinations and fosters more balanced multimodal reasoning. Notably, our approach achieves substantial gains, including +9.33% on MMVP-MLLM, +2.99% on POPE-AOKVQA, up to +3.4% on Merlin, and +3% on the hard-data split of HallusionBench.

2603.05230 2026-06-15 cs.CV cs.RO updated version

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

Serkan Ergun, Tobias Mitterer, Hubert Zangl

Affiliations: Institute of Smart Systems Technologies; University of Klagenfurt; AAU SAL USE Laboratory; Silicon Austria Labs

AI summary: Presents a digital-twin-driven robotic sorting system that combines grasp prediction, multimodal perception, and vision-language models for textile classification and foreign object detection; Qwen models reach up to 87.9% accuracy.

Comments 10 pages, single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria)

Abstract

The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.

2605.31604 2026-06-15 cs.CV updated version

Representation Forcing for Bottleneck-Free Unified Multimodal Models

Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu

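Affiliations: University of Hong Kong; ByteDance Seed; The Chinese University of Hong Kong; Nanjing University; Tsinghua University

AI summary: Proposes Representation Forcing (RF), which has the decoder autoregressively predict visual representations as intermediate tokens that then guide pixel diffusion within the same backbone, removing the dependence of unified multimodal models on a pretrained VAE and enabling bottleneck-free end-to-end models.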

Comments Project page: https://yuqingwang1029.github.io/RepresentationForcing

Abstract

Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.

2504.20734 2026-06-15 cs.CL cs.AI cs.CV cs.IR cs.LG updated version

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang

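Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

AI summary: Proposes UniversalRAG, a retrieval-augmented generation framework that handles multiple modalities and granularities, using a dynamic routing mechanism and multi-granularity organization to make cross-modal knowledge retrieval more effective; experiments show it outperforms baselines on benchmarks spanning multiple modalities.

Comments ACL 2026. Project page: https://universalrag.github.io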

Abstract

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, an any-to-any RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 10 benchmarks of multiple modalities, showing its superiority over various modality-specific and unified baselines.
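The routing idea can be pictured as a two-stage lookup: first pick one modality-specific corpus, then retrieve only inside it, so items from different modalities never compete in a single shared embedding space. The Python sketch below is a toy illustration under loud assumptions: the keyword router stands in for whatever router the paper actually uses, term overlap stands in for a real retriever, and all names and corpus contents are hypothetical.

```python
def keyword_router(query):
    """Toy stand-in for a router: choose a modality from surface keywords."""
    q = query.lower()
    if "video" in q or "clip" in q:
        return "video"
    if "image" in q or "photo" in q:
        return "image"
    return "text"

def route_and_retrieve(query, router, corpora, top_k=2):
    """Select one modality-specific corpus, then rank only its items."""
    modality = router(query)
    q_terms = set(query.lower().split())
    # Term overlap is a placeholder for a modality-specific retriever.
    ranked = sorted(
        corpora[modality],
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return modality, ranked[:top_k]

corpora = {
    "text": ["The bridge opened in 1937.", "Cats sleep most of the day."],
    "image": ["caption: red bridge at sunset", "caption: a cat on a sofa"],
    "video": ["transcript: walking across the bridge"],
}
modality, hits = route_and_retrieve("photo of a red bridge", keyword_router, corpora, top_k=1)
```

The granularity levels mentioned in the abstract could be added to this picture by letting the router return a (modality, granularity) pair and keying the corpora on both.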

2606.14010 2026-06-15 cs.CV cs.LG cs.RO new submission

RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

Xiangyu Huang, Zhenlin Hua, Han Zhou, Shounak Sural, Ragunathan Rajkumar

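Affiliations: Carnegie Mellon University

AI summary: Proposes RT-VLA, which compresses the capabilities of the SimLingo model into a lightweight student via multi-level supervised distillation, cutting inference time by 44.8x (vision-only mode) and 7.9x (vision+language mode) while keeping competitive performance, for real-time, explainable VLA autonomous driving.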

Abstract

Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.
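The abstract does not detail the "multi-level supervised distillation" objective. As general background, the classic form of knowledge distillation trains the student to match the teacher's temperature-softened output distribution. The Python sketch below implements only that classic logit-level loss; it is not RT-VLA's objective, and the function names, temperature value, and example logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits over three discrete choices.
teacher = [2.0, 0.5, -1.0]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss([0.0, 0.0, 0.0], teacher)
```

A multi-level scheme would presumably add further terms of this kind at several points in the network, but which points and with what weights is not stated in the abstract.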

2606.14048 2026-06-15 cs.CV cs.RO new submission

WAM4D: Fast 4D World Action Model via Spatial Register Tokens

Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang

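Affiliations: Peking University; The Hong Kong University of Science and Technology; Beijing Innovation Center of Humanoid Robotics

AI summary: Proposes WAM4D, which uses lightweight spatial register tokens to transfer pretrained geometric priors into a causal video-action transformer for efficient 4D world action modeling, improving spatial consistency on RoboTwin 2.0 and real-world manipulation tasks while keeping inference fast.

Comments 15 pages, 7 figures, 9 tables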

Abstract

World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.

2606.13700 2026-06-15 eess.SP cs.CV cross-list

C-MambaPose: A Physics-Informed Complex Mamba Framework for Cross-Environment WiFi Human Pose Estimation

Phuc Nguyen H

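Affiliations: VinUniversity

AI summary: Proposes C-MambaPose, a physics-informed complex-valued Mamba-GraFormer hybrid framework that uses a phase-preserving representation and a Spatiotemporal Complex Mamba encoder with a dynamic selective receptive field for cross-environment WiFi 3D human pose estimation, reaching state of the art on the cross-environment split of the MM-Fi dataset with 3.78 M parameters.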

Abstract

Human pose estimation (HPE) utilizing wireless WiFi signals has emerged as a promising technology owing to its device-free nature, privacy preservation, and robustness against occlusion and poor lighting. However, existing methods often overlook the physical complex phase information of WiFi signals and fail to generalize across diverse environments due to severe domain shifts. In this paper, we present C-MambaPose, a physics-informed complex-valued Mamba-GraFormer hybrid framework for robust cross-environment WiFi-based 3D HPE. Our framework first sanitizes raw WiFi Channel State Information (CSI) phase errors and constructs a phase-preserving complex-valued representation. We then employ a Spatiotemporal Complex Mamba encoder with a dynamic selective receptive field to capture fine-grained phase dynamics. A cross-attention joint-query mapper maps the unstructured sequence tokens to human joints, which are decoded by a Graph Convolutional Network (GCN) to predict anatomically coherent 3D coordinates. Extensive evaluations on the MM-Fi dataset show that C-MambaPose achieves competitive or superior performance to state-of-the-art baselines across all settings, setting a new state of the art specifically on the challenging cross-environment split, requiring only 3.78 M parameters, an 83.1% reduction compared to GraphPose-Fi and an 85.7% reduction compared to MetaFi++, while maintaining a comparable size to DT-Pose (which is only 18% smaller) but achieving significantly superior performance without requiring any pretraining. Our code is publicly available at https://github.com/phucngvinuni/cmampose.git.
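The reduction percentages can be turned into rough baseline sizes: if a model with P parameters is r% smaller than a baseline, the baseline has about P / (1 - r/100) parameters. The Python sketch below applies that to the figures quoted in the abstract. The resulting baseline sizes are back-of-envelope estimates from rounded percentages, not numbers reported in the abstract, and the function and variable names are hypothetical.

```python
def implied_baseline_params(model_params_m, reduction_pct):
    """Baseline size (millions) implied by 'model is reduction_pct% smaller'."""
    return model_params_m / (1.0 - reduction_pct / 100.0)

C_MAMBAPOSE_M = 3.78  # parameter count in millions, from the abstract

# Estimates only: derived from the rounded 83.1% and 85.7% figures.
graphpose_fi_m = implied_baseline_params(C_MAMBAPOSE_M, 83.1)  # roughly 22.4 M
metafi_pp_m = implied_baseline_params(C_MAMBAPOSE_M, 85.7)     # roughly 26.4 M
```

On these estimates the two larger baselines are each around six to seven times the size of C-MambaPose.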


2606.13840 2026-06-15 cs.RO cs.CV cross-list

Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models

多智能体具身自动驾驶：从V2X信息交换到共享世界模型

Senkang Hu, Zhengru Fang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang

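Affiliations: Lingnan University, Hong Kong（岭南大学（香港））

AI summary: This survey traces autonomous driving's shift from single-vehicle intelligence to multi-agent embodied systems, in which shared world models support shared perception, intent inference, and cooperative planning, and it identifies research gaps such as simulation-bound evaluation and missing real-time safety guarantees.

AI Chinese abstract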

自动驾驶正从孤立的车辆智能转向多智能体具身系统，这些系统共享感知、推断意图并在不确定性下协调行动。本综述通过共享世界模型（SWMs）的视角审视这一转变：SWMs是跨车辆、基础设施和其他交通参与者维护的预测性跨智能体表征。我们回顾了超过380篇文献，涵盖车联万物（V2X）通信、协同感知、智能体间认知、协同规划、端到端协同驾驶以及用于闭环验证的仿真和数据引擎。核心问题是交换的观测如何成为对齐的状态、意图感知的交互和协调的下游行动。在所调查的文献中，评估仍然集中在仿真、精心设计的基准测试和离线协议上。基于基础模型的协调也缺乏在开放交通中经过验证的实时安全保证。这些空白为多智能体具身自动驾驶（MAEAD）提出了关键研究重点：可验证的共享状态维护、鲁棒的意图和计划对齐，以及在通信、延迟和部署约束下的安全协调行动。

English abstract

Autonomous driving is shifting from isolated vehicle intelligence toward multi-agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross-agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle-to-everything (V2X) communication, collaborative perception, inter-agent cognition, cooperative planning, end-to-end cooperative driving, and simulation and data engines for closed-loop validation. The organizing question is how exchanged observations become aligned state, intent-aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation-model-based coordination also lacks verified real-time safety guarantees in open traffic. These gaps motivate key research priorities for multi-agent embodied autonomous driving (MAEAD): verifiable shared-state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.


2606.13886 2026-06-15 cs.RO cs.CV cs.LG cross-list

PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation

PhysVLA：面向物理基础的VLA用于具身机器人操作

Namai Chandra, Shriram Damodaran, Lin Wang

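Affiliations: IIT Madras（印度理工学院马德拉斯分校）; Nanyang Technological University（南洋理工大学）

AI summary: Proposes PhysVLA, a plug-and-play inference-time framework that injects physical constraints into any frozen VLA backbone without retraining, using a phase finite-state machine and a selective Euler-Lagrange gate, improving success rate, stability, and trajectory efficiency.

Comments 9 pages, 5 figures, supplementary material included

AI Chinese abstract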

EquiDexFlow: 基于接触的SE(3)-等变灵巧抓取生成流

Clinton Enwerem, John S. Baras, Calin Belta

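Affiliations: Institute for Systems Research, University of Maryland, College Park（马里兰大学帕克分校系统研究所）

AI summary: Proposes EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces; by projecting contacts onto the object surface and constraining forces to the Coulomb friction cone, it yields physically stable grasps, with zero friction violations and the best composite score on the 16-DoF Allegro Hand.

Comments 22 pages, 11 figures, 11 tables. Project page with videos, code, and checkpoints: https://equidexflow.github.io

AI Chinese abstract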

大多数学习型灵巧抓取生成器将接触力降级为下游验证步骤，因此运动学上合理的姿态仍可能违反稳定物理抓取的条件。我们通过EquiDexFlow解决这一问题，这是一种SE(3)-等变流匹配模型，从物体点云联合预测腕部姿态、关节角度、指尖接触、表面法线和接触力。我们的架构通过构造将接触投影到物体表面并将力约束在库仑摩擦锥内，因此无需损失惩罚即可满足放置和摩擦合规性。我们证明了端到端SE(3)等变性，并在200次旋转上经验验证，腕部残差低于$0.04^\circ$且关节偏差严格为零。该模型在81个物体的8,100个力闭合抓取上训练，适用于16自由度Allegro手，在所有消融变体中实现了零摩擦违规、最佳综合分数和最低力旋量（wrench）残差。我们通过每指逆运动学将解码的指尖接触重定向到16自由度LEAP手，我们的硬件可行优化将每个关节置于其执行器包络内侧至少5%处，同时保持力旋量平衡。在物理机器人上，重定向后的EquiDexFlow解码抓取在所有六个测试物体上完成了开环拾取和保持试验，每个非对称物体在标准姿态和$120^\circ$共旋转下均成功。视频、代码和检查点可在 https://equidexflow.github.io 获取。

English abstract

Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
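
To make "forces into the Coulomb friction cone by construction" concrete, the following is a minimal sketch of one standard way to do it: a Euclidean projection of a contact force onto the cone ‖f_t‖ ≤ μ·f_n. It is not the EquiDexFlow layer itself, which the abstract does not specify; the function name `project_to_friction_cone` and the friction coefficient `mu=0.5` are assumptions chosen for illustration.

```python
import numpy as np

def project_to_friction_cone(force, normal, mu=0.5):
    """Euclidean projection of a 3D contact force onto the Coulomb cone
    {f : ||f_t|| <= mu * f_n}, where f_n is the component along `normal`.
    Illustrative only; `mu` is an assumed friction coefficient."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    f = np.asarray(force, dtype=float)
    f_n = float(f @ n)                   # signed normal component
    f_t = f - f_n * n                    # tangential component
    t = float(np.linalg.norm(f_t))
    if t <= mu * f_n:                    # already inside the cone
        return f.copy()
    if mu * t <= -f_n:                   # inside the polar cone: projects to zero
        return np.zeros(3)
    a = (mu * t + f_n) / (1.0 + mu**2)   # normal magnitude on the cone boundary
    return a * n + (mu * a) * (f_t / t)  # tangential magnitude is exactly mu * a

# A force with too much tangential load is pulled back onto the cone boundary.
projected = project_to_friction_cone([1.0, 0.0, 1.0], [0.0, 0.0, 1.0], mu=0.5)
```

Because the output always satisfies the cone constraint, a network that ends in such a projection cannot emit a friction-violating force, which is the sense in which compliance holds without a loss penalty.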


2606.12910 2026-06-15 cs.RO cs.AI cs.CV cs.SY eess.SY replacement

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

边界框作为目标：通过神经符号规划实现语言条件抓取

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

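Affiliations: University of California, Berkeley（加州大学伯克利分校）

AI summary: Proposes the GRASP framework, which uses a pretrained VLM to translate natural-language queries into neuro-symbolic goal states grounded through bounding-box detection, enabling zero-shot tabletop manipulation without task-specific training.

Comments Project website: https://allisonandreyev.github.io/grasp.github.io/

AI Chinese abstract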

为了将机器人有效集成到家庭或工业环境中，机器必须实时适应自然语言提示。尽管视觉-语言模型（VLM）已在机器人任务与运动规划（TAMP）中实现零样本泛化，但当前最先进的方法通常计算量“沉重”或需要在数千个演示上进行大量训练。我们提出GRASP（基础推理与符号规划）框架，作为向开放词汇桌面操作迈进的一步。我们的方法利用预训练VLM将自然语言查询转化为神经符号目标状态，通过边界框检测管道在物理世界中接地。与依赖固定颜色列表或硬编码坐标的方法不同，GRASP使机器人能够解释诸如“顶层架子”之类的抽象空间概念，并在无需额外微调的情况下执行任务。我们在三个难度级别的90次真实机器人试验中实现了73.3%的总体成功率，无需任务特定训练。

English abstract

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.


2606.14555 2026-06-15 cs.CV cs.AI new submission

关系检索：利用已知-新颖相互作用进行通用类别发现

Yulin Xu, Chunqi Guo, Yuanzhen Shuai, Jianyuan Ni

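Affiliations: University of California, Irvine（加州大学尔湾分校）; Sichuan Agricultural University（四川农业大学）; University College London（伦敦大学学院）; Juniata College（朱尼亚塔学院）

AI summary: Tackles generalized category discovery from a relational-retrieval perspective, proposing Relational Pattern Consistency, which improves both known-class recognition and novel-category discovery through bidirectional knowledge transfer; experiments report state-of-the-art results on generic and fine-grained benchmarks.

Comments Accepted by ICMR 2026 (Oral)

DOI: 10.1145/3805622.3810732

AI Chinese abstract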

在本研究中，我们通过关系检索视角解决通用类别发现（GCD）问题，通过双向知识转移显式耦合标记和未标记数据。现有方法将这两类数据分开处理，错过了有价值的交互机会，而我们提出关系模式一致性（RPC），使两者相互增强。RPC使用一对多（One-vs-All）分类器进行软ID/OOD分解，然后引入两种机制：（i）针对已知类别的保持，我们迁移语义行为对齐；（ii）针对类别发现，我们利用这样一个洞察：来自同一类别的样本与已知类别原型保持不变的关系，从而将不可靠的伪标签转化为定义明确的关系模式匹配。这种双向设计使标记数据指导未标记学习，同时通过它们的集体关系签名发现新类别。广泛的实验表明，RPC在通用和细粒度基准上均取得最先进的性能。

English abstract

In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.


2606.13723 2026-06-15 cs.CV cs.AI new submission

Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

形态感知样本分配：克服IoU不敏感性用于表面缺陷检测

Pengfei Liu, Yuhan Guo

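Affiliations: School of Management, Harbin Institute of Technology（管理学院，哈尔滨工业大学）; College of Computing and Data Science, Nanyang Technological University（计算与数据科学学院，南洋理工大学）

AI summary: To address IoU's insensitivity in defect detection, proposes morphological similarity metrics based on area, shape, and aspect ratio to refine positive sample assignment; theoretical analysis shows the method reshapes the response distribution of the matching function, and it yields consistent gains with the YOLOv9 framework on NEUDET and GC10-DET at zero additional inference overhead.

AI Chinese abstract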

交并比（IoU）作为评估候选框与真实标注空间对齐的关键指标，直接决定了正样本集的质量和视觉检测模型的训练效果。通过理论建模和分析，我们揭示了IoU响应曲线上的一个非敏感区域，在该区域内，尽管样本的几何重叠程度不同，但IoU得分几乎相同。为克服这一局限，我们引入一组形态相似性度量，涵盖面积、形状和长宽比，以优化正样本分配过程，从而确保更具区分性和可靠性的匹配。通过基于均值的多维相似性聚合，推导出一个补充匹配分数，补偿IoU在表示结构对应性方面的固有缺陷。理论上，融入形态相似性重塑了匹配函数的响应分布，产生有效的方向梯度和多边形等响应轮廓，将高响应区域紧密限制在每个真实实例周围，显著提高了正样本选择的精度。基于YOLOv9框架的实验在NEUDET和GC10-DET数据集上均取得一致性能提升。值得注意的是，所提方法完全即插即用，且零额外推理开销，从而确保了工业视觉检测的部署效率。

English abstract

Intersection-over-Union (IoU), as a pivotal metric for evaluating the spatial alignment between candidate proposals and ground-truth annotations, directly determines the quality of positive sample sets and the training efficacy of visual detection models. Through theoretical modeling and analysis, we uncover a non-sensitive region on the IoU response curve, within which samples yield nearly identical IoU scores despite distinct geometric overlaps. To overcome this limitation, we introduce a set of morphological similarity metrics covering area, shape, and aspect ratio, to refine the positive sample assignment process, thereby ensuring more discriminative and reliable matching. A supplementary matching score is derived via mean-based aggregation of these multidimensional similarities, compensating for the intrinsic limitation of IoU in representing structural correspondence. Theoretically, incorporating morphological similarity reshapes the response distribution of the matching function, yielding both effective directional gradients and polygon-like iso-response contours, which tightly confine high-response regions around each ground-truth instance and substantially enhance the precision of positive sample selection. Experiments based on the YOLOv9 framework demonstrate consistent performance gains on both NEUDET and GC10-DET datasets. Notably, the proposed approach is fully plug-and-play and incurs zero additional inference overhead, thereby ensuring deployment efficiency for industrial visual inspection.
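
The IoU insensitivity described above can be seen with two candidate boxes that overlap the ground truth very differently yet score the same IoU. The sketch below is only an illustration: the abstract does not give the paper's formulas, so the min/max ratio similarities for area, shape, and aspect ratio, and the names `ratio` and `morphology_score`, are assumptions; only the mean-based aggregation follows the text.

```python
def area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Standard Intersection-over-Union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def ratio(p, q):
    """Symmetric similarity in (0, 1]; equals 1.0 when the positive values match."""
    return min(p, q) / max(p, q)

def morphology_score(a, b):
    """Mean of assumed area, shape (width/height), and aspect-ratio similarities."""
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    area_sim = ratio(wa * ha, wb * hb)
    shape_sim = 0.5 * (ratio(wa, wb) + ratio(ha, hb))
    aspect_sim = ratio(wa / ha, wb / hb)
    return (area_sim + shape_sim + aspect_sim) / 3.0

gt = (0, 0, 6, 6)
shifted = (2, 0, 8, 6)    # same size as gt, translated
squashed = (0, 0, 6, 3)   # half the height of gt

# Both candidates have IoU 0.5 with gt, but their morphology scores differ.
scores = {name: (iou(gt, box), morphology_score(gt, box))
          for name, box in [("shifted", shifted), ("squashed", squashed)]}
```

How such a supplementary score is combined with IoU during assignment is not stated in the abstract, so the sketch deliberately stops at computing the two quantities side by side.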


2606.14005 2026-06-15 cs.CV new submission

Context-Guided Semantic Alignment for Feature Fusion Networks

上下文引导的特征融合网络语义对齐

Hyungseop Lee, Jiho Lee, Woochul Kang

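Affiliations: Department of Embedded Systems Engineering, Incheon National University（仁川国立大学嵌入式系统工程系）

AI summary: Proposes FINE, a lightweight semantic alignment module that uses cross-level attention so that high-level context guides low-level feature fusion, and introduces alignment-aware token sampling to cut computational complexity, improving object detection accuracy.

Comments 26 pages, 12 figures, 8 tables

AI Chinese abstract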

特征融合网络是现代目标检测器的基础组件，通过聚合多尺度特征来检测不同大小的物体。然而，直接融合来自不同金字塔层次的特征往往因其异构表示而导致语义不一致。本文提出特征交互网络（FINE），一种轻量级语义对齐模块，在融合前通过跨层级注意力机制利用高层上下文指导来细化低层特征。为弥合结构差距并确保计算效率，我们引入对齐感知令牌采样，对齐跨尺度的对应空间区域，将注意力复杂度降低一个数量级。生成的注意力权重产生一个空间-通道调制图，通过残差逐元素调制进行上采样并应用于低层特征。该机制确保网络选择性地增强语义相关像素，同时保留密集预测任务所需的亚像素定位精度。FINE普遍适用于各种检测器，并在不牺牲效率的情况下持续提升检测精度。

English abstract

Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations. In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce an Alignment-Aware Token Sampling that aligns corresponding spatial regions across scales, reducing the attention complexity by an order of magnitude. The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.


2606.14129 2026-06-15 cs.CV new submission

BoRAD: Bootstrap your Own Representations for Multi-class Anomaly Detection

BoRAD: 自举表示实现多类异常检测

Duy Hoang Khuong, Tri Nguyen Minh, Ngu Huynh Cong Viet

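Affiliations: Department of Artificial Intelligence, FPT University（FPT大学人工智能系）; Department of IT, FPT University（FPT大学信息技术系）; Department of Computing Fundamental, FPT University（FPT大学计算基础系）

AI summary: Proposes the BoRAD framework, which uses prototype regularization to address the shortcut and mis-reconstruction problems of reconstruction models in multi-class anomaly detection, enabling single-model multi-class detection without labels.

AI Chinese abstract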

基于重建的异常检测在工业检测中具有吸引力，但将其从类别特定训练扩展到单模型覆盖所有类别（one-for-all）的设置具有挑战性。单个模型必须重建多样的正常外观，同时不复制异常细节，这暴露了两个耦合的失败模式：恒等捷径，即异常通过重建路径；以及误重建，即正常类别相互混淆。我们提出BoRAD，一个无标签训练框架，将其视为表示容量分配问题。BoRAD使用共享的可学习原型库施加两个互补正则化器：空间原型对齐收缩局部的原型内变异以抑制异常复制，而原型相对全局对齐保留原型间结构并提高对异常角度偏差的敏感性。原型库和预测头仅在训练期间使用；推理保持标准的师生特征差异过程，无需类别标签、负样本对、内存检索或原型查找。BoRAD实现了具有竞争力的one-for-all异常检测性能，包括MVTec AD上86.2% mAD、VisA上80.7% mAD和Real-IAD上73.1% mAD。诊断分析进一步显示异常泄漏减少、正常类别可分性提高以及异常-正常分数分离更强。

English abstract

Reconstruction-based anomaly detection is attractive for industrial inspection, but scaling it from category-specific training to a one-for-all setting is challenging. A single model must reconstruct diverse normal appearances without copying abnormal details, which exposes two coupled failure modes: identical shortcut, where anomalies pass through the reconstruction path, and mis-reconstruction, where normal categories are confused with one another. We propose BoRAD, a label-free training framework that treats this as a representation-capacity allocation problem. BoRAD uses a shared learnable prototype bank to impose two complementary regularizers: spatial prototype alignment contracts local within-prototype variation to suppress anomaly copying, while prototype-relative global alignment preserves between-prototype structure and improves sensitivity to abnormal angular deviations. The prototype bank and prediction heads are used only during training; inference remains a standard teacher-student feature discrepancy pass, with no class labels, negative pairs, memory retrieval, or prototype lookup. BoRAD achieves competitive one-for-all anomaly detection performance, including 86.2% mAD on MVTec AD, 80.7% mAD on VisA and 73.1% mAD on Real-IAD. Diagnostic analyses further show reduced anomaly leakage, improved normal-category separability, and stronger anomaly-normal score separation.


2606.14307 2026-06-15 cs.CV new submission

Pano3D: Unified 3D Reconstruction and Panoptic Segmentation

Pano3D：统一的3D重建与全景分割

Victor Barberteguy, Ahmet Iscen, Mathilde Caron, Alireza Fathi, Gül Varol, Cordelia Schmid

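Affiliations: Google DeepMind（谷歌深度思维）; LIGM, École des Ponts, IP Paris, Univ Gustave Eiffel, CNRS（LIGM, 巴黎高科桥梁学院, 巴黎理工学院, 古斯塔夫·埃菲尔大学, 法国国家科学研究中心）

AI summary: Proposes a unified framework that integrates panoptic segmentation into feedforward 3D reconstruction networks; joint training with geometric and semantic losses yields mutual gains, with state-of-the-art results on several datasets.

Comments Project page: https://victorbbt.github.io/Pano3D/

AI Chinese abstract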

最近，3D前馈重建神经网络的进展在无需任何相机参数的图像密集重建中取得了显著成功。然而，为这些模型配备鲁棒的语义理解仍然是一个开放问题。本文提出了一种在统一框架中执行3D重建和3D全景分割的方法。我们基于现有的3D重建模型，并为其增加了一个基于集合的掩码解码器。该方法通过几何损失和语义损失进行联合训练，实验表明两者相互促进。更准确地说，特征从几何信息初始化，然后微调以同时捕获几何和语义。我们通过将框架成功应用于在线和全对全注意力重建骨干网络，证明了方法的通用性。我们的方法在ScanNet、ScanNet200和ScanNet++数据集上的3D全景分割中达到了最先进的性能。消融研究表明，这种统一模型的联合训练使3D前馈重建神经网络具备了全景分割能力，并带来了互惠的改进。

English abstract

Recent advances in 3D feedforward reconstruction neural networks have achieved remarkable success in dense reconstruction from images without any camera parameters. Yet, equipping these models with robust semantic understanding remains an open problem. Here we introduce an approach that performs 3D reconstruction and 3D panoptic segmentation in a unified framework. We build on existing 3D reconstruction models and augment them with a set-based mask decoder. The approach is jointly trained with a geometric and semantic loss, which are shown to be mutually beneficial. More precisely, the features are initialized from the geometric information and then finetuned to capture jointly geometry and semantics. We demonstrate the generality of our approach by successfully applying our framework both to online and all-to-all attention reconstruction backbones. Our method achieves state-of-the-art performance in 3D panoptic segmentation across ScanNet, ScanNet200, and ScanNet++ datasets. Ablation studies show that such joint training of a unified model equips 3D feedforward reconstruction neural networks with panoptic segmentation and yields mutually beneficial improvements.


2606.14475 2026-06-15 cs.CV new submission

Value-order Decomposition for Generalist Anomaly Detection

值序分解用于通用异常检测

Miaoyun Zhao, Jing Chen, Miaoni Zhao, Qiang Zhang

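Affiliations: Dalian University of Technology（大连理工大学）; Xi’an Chang’an Vanke City Primary School（西安长安万科城小学）; Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education（社会计算与认知智能教育部重点实验室（大连理工大学））

AI summary: Proposes Value-order Decomposition (VOD), which disentangles and suppresses category-, defect-type-, and domain-specific information to achieve strong generalization in cross-domain anomaly detection.

AI Chinese abstract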

工业异常检测受限于数据量少，使得跨域泛化尤其具有挑战性。通用异常检测（GAD）旨在在源域上训练一个统一模型，能够有效检测未见目标域中的异常。在初始语义特征空间中，异常与物体类别或缺陷类型之间的强纠缠阻碍了跨域的有效泛化。最近的工作通过将特征投影到残差空间来解决这个问题；然而，这些方法主要增加了正常特征的跨域重叠，而异常特征仍然与物体类别、缺陷类型和数据域相关，导致对齐和泛化效果差。为了解决这一限制，我们提出了值序分解（VOD），一种简单而有效的技术，它弥合了物体类别、缺陷类型（包括真实和合成缺陷）和数据域之间的三种泛化差距。VOD解耦并抑制了物体类别、缺陷类型和域特定信息，促进了正常和异常样本内部的对齐，同时保持了它们的可分离性，从而实现了跨三个差距的鲁棒泛化。利用同一物体内真实和合成缺陷之间的强对齐，我们仅使用正常和合成异常参考进行异常检测，并有效泛化到未见过的真实缺陷类型。在多样化的工业和医学基准上的实验表明，我们的方法使用简单的剪切粘贴异常模拟策略，实现了跨三个差距的强泛化。

English abstract

Industrial anomaly detection suffers from limited data, making cross-domain generalization particularly challenging. Generalist Anomaly Detection (GAD) aims to train a unified model on a source domain that can effectively detect anomalies in unseen target domains. In the initial semantic feature space, strong entanglement between anomalies and object categories or defect types hinders effective generalization across domains. Recent works address this issue by projecting features into a residual space; however, such methods primarily increase cross-domain overlap for normal features, while anomalous features remain specific to object categories, defect types and data domains, leading to poor alignment and generalization. To address this limitation, we propose Value-order Decomposition (VOD), a simple yet effective technique that bridges three types of generalization gaps across object categories, defect types (including real and synthetic defects), and data domains. VOD disentangles and suppresses object-category-, defect-type-, and domain-specific information, promoting alignment within normal and abnormal samples while preserving their separability, thereby enabling robust generalization across the three gaps. Leveraging the strong alignment between real and synthetic defects within the same object, we perform anomaly detection using only normal and synthetic-abnormal reference, and effectively generalize to unseen real defect types. Experiments on diverse industrial and medical benchmarks demonstrate that our method, using a simple cut-and-paste anomaly simulation strategy, achieves strong generalization across the three gaps.


2606.14701 2026-06-15 cs.CV new submission

RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

RATS！补丁通过寄存器对话：寄存器注意力Transformer中的涌现部件

Timing Yang, Predrag Neskovic, Jansen Seheult, Wenchao Han, Anand Bhattad, Alan Yuille, Feng Wang

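Affiliations: Johns Hopkins University（约翰霍普金斯大学）; Office of Naval Research, Arlington, VA（海军研究办公室，阿灵顿，弗吉尼亚州）; Department of Laboratory Medicine and Pathology, Mayo Clinic, MN, USA（梅奥诊所检验医学与病理学系，明尼苏达州，美国）

AI summary: Proposes RATS, which decomposes the classification token into learnable register tokens that route patch information through an L→N→N→L bottleneck; without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region resembling an object part, with an average gain of 12 mIoU across five segmentation benchmarks.

AI Chinese abstract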

当人类看到一只鸟时，他们识别出的远不止是“鸟”——他们看到头部、翅膀和爪子，这是一个可重复使用部件的结构化组合，这些部件可以在他们见过的每一只鸟中被识别出来。我们询问一个自监督视觉模型能否自行发现相同的组合结构。为此，我们提出了RATS（寄存器注意力Transformer），它将分类令牌分解为N个可学习的寄存器令牌，通过三步压缩-通信-广播注意力机制，在L→N→N→L瓶颈中路由补丁信息。这N个寄存器被分配到H个注意力头上，因此分配给不同头的寄存器之间不相互作用。在没有辅助损失或部件标注的情况下，每个寄存器自发地专化为一个原语义区域，其涌现结构类似于物体部件。RATS在五个分割基准上平均超过所有基线+12 mIoU，在ADE20K（+1.11 mIoU）和COCO（+0.2 AP^m）上持续提升。其寄存器字典进一步展示了跨相关类别的部件级一致性和语义接近性。我们的结果表明，RATS可能为结构化和可解释的视觉表示学习提供有用的架构先验。

English abstract

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L->N->N->L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
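
The L->N->N->L bottleneck can be sketched as three chained attention steps. This is a simplified single-head illustration under stated assumptions, not the RATS implementation: learned query/key/value projections are omitted, keys double as values, the partition of registers across H heads is left out, and the names `attend` and `register_bottleneck` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Dot-product attention where keys and values are the same tensor
    (a simplification: no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return weights @ keys_values, weights

def register_bottleneck(patches, registers):
    """Compress L patches into N registers, let registers communicate,
    then broadcast register content back to the L patches."""
    compressed, w_compress = attend(registers, patches)       # L -> N
    communicated, _ = attend(compressed, compressed)          # N -> N
    broadcast, w_broadcast = attend(patches, communicated)    # N -> L
    return broadcast, w_compress, w_broadcast

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 32))    # L = 196 patch tokens, d = 32
registers = rng.normal(size=(8, 32))    # N = 8 register tokens
out, w_compress, w_broadcast = register_bottleneck(patches, registers)
```

Each attention map here has at most L·N entries rather than L·L, and each row of `w_broadcast` shows how strongly a patch reads from each register, which is the kind of map one would inspect for part-like specialization.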


2605.25651 2026-06-15 cs.CV replacement

Hierarchical Consistency Learning for Test-time Adaptation in Camouflage Perception

用于伪装感知测试时适应的层次一致性学习

Mingfeng Zha, Tianyu Li, Guoqing Wang, Yunqiang Pei, Chaofan Qiao, Jiening Zhang, Yang Yang, Heng Tao Shen

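Affiliations: Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China（未来媒体中心和电子科技大学计算机科学与工程学院）; School of Computer Science and Technology, Tongji University（计算机科学与技术学院，同济大学）; Peng Cheng Laboratory（鹏城实验室）

AI summary: Proposes the hierarchical consistency learning (HCL) framework, which adapts representations dynamically at test time and combines hierarchical representation reconstruction, task affinity guidance, and prototype consistency calibration to address domain rigidity and annotation dependency in camouflaged object detection.

Comments Accepted by IEEE TIP

AI Chinese abstract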

伪装目标检测（COD）旨在通过物理属性定位与背景感知差异最小的目标。现有方法受限于静态的“训练-冻结”范式，存在域刚性和注释依赖，限制了其对场景变化和未见伪装模式的适应性。为克服这些问题，我们提出层次一致性学习（HCL）框架，该框架集成了测试时适应以实现动态表示重校准。具体而言，我们设计了层次表示重构（HRR），通过协同空间重构与双流频域分解来缓解特征纠缠，增强对表观均匀化的鲁棒性。像素和频谱推理提供了结构和上下文先验。我们进一步引入任务亲和引导（TAG），通过通道级亲和力在分支间传播知识，对齐局部判别线索并缓解语义漂移。为确保语义不变性，我们制定了原型一致性校准（PCC），将区域特征聚合为紧凑原型并建立原型-特征相似度。这施加了隐式和层次化的约束，弥合了任务和表示之间的差距。在四个伪装和四个水下目标基准上，在三种退化设置下的广泛实验表明，我们的方法始终优于最先进的方法，突显了其在分布偏移下的鲁棒性和泛化能力。

English abstract

Camouflaged object detection (COD) aims to localize targets that exhibit minimal perceptual differences from backgrounds through physical attributes. Existing methods, constrained by the static train-then-freeze paradigm, suffer from domain rigidity and annotation dependency, limiting their adaptability to scene variations and unseen camouflage patterns. To overcome these, we propose the hierarchical consistency learning (HCL) framework, which integrates test-time adaptation for dynamic representation recalibration. Specifically, we design the hierarchical representation reconstruction (HRR) to alleviate feature entanglement by synergizing spatial reconstruction with dual-stream frequency-domain decomposition, enhancing robustness against appearance homogenization. The pixel and spectrum inference provide structural and contextual priors. We further introduce task affinity guidance (TAG) to propagate knowledge across branches via channel-wise affinity, aligning local discriminative cues and mitigating semantic drift. To ensure semantic invariance, we formulate the prototype consistency calibration (PCC), which aggregates region features into compact prototypes and establishes prototype-feature similarity. This imposes implicit and hierarchical constraints that bridge task and representation gaps. Extensive experiments across four camouflaged and four underwater object benchmarks, under three degradation settings, demonstrate that our method consistently outperforms state-of-the-art approaches, highlighting its robustness and generalization under distribution shifts.


2606.13714 2026-06-15 cs.CV new submission

TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

TSA: 时间槽激活用于持久目标中心视频表示

Duc Nguyen, Sieu Tran, Hao Vo, Khoa Vo, Duy Minh Ho Nguyen, Nghi D. Q. Bui, Anh Nguyen, Long Mai, Ngan Le

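Affiliations: University of Arkansas, USA（阿肯色大学）; Max Planck Research School for Intelligent Systems（马克斯·普朗克智能系统研究学院）; Google Research, Google（谷歌研究院）; University of Liverpool, UK（利物浦大学）; Adobe Research（Adobe研究院）

AI summary: Proposes Temporal Slot Activation (TSA), which models the lifecycle of persistent slots by learning a per-slot, per-frame activation score, addressing the state drift and reconstruction interference caused by unconditional propagation and improving object decomposition and temporal identity preservation on multiple benchmarks.

AI Chinese abstract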

无监督视频目标中心学习旨在将动态场景分解为时间上持久的实体表示。现有的循环视频槽注意力方法在帧间传播一组固定的槽，但通常假设无条件槽传播：每个槽在每一帧都被更新和解码，无论其对应目标是否可见。我们表明，这种设计违反了持久槽的基本生命周期要求：当目标缺失或完全遮挡时，其槽应保留先前状态，并避免解释无关的可见内容。相反，无条件传播导致两种失败路径：更新引起的状态漂移（当前帧证据覆盖缺失目标的表示）和解码器引起的重建干扰（非活跃槽通过解码器注意力保持与重建的耦合）。我们提出时间槽激活（TSA），一种无需可见性监督即可学习每槽每帧激活分数 $\alpha_{k,t} \in (0, 1)$ 的机制。TSA 使用该激活作为共享潜在控制变量进行槽生命周期建模。当槽不活跃时，TSA 通过激活门控更新将其状态锚定到前一槽，并通过在 softmax 归一化前对注意力 logits 施加激活依赖的加性偏置来抑制其解码器参与。这共同减少了状态漂移和重建驱动的干扰。为了在部分遮挡和逐渐重现下改进决策，TSA 进一步将激活预测条件于时间上下文编码器生成的每槽时间记忆。我们在 MOVi-C/E、YT-VIS 和 OVIS 基准上使用标准指标和基于跟踪的指标（FG-ARI、mBO、IDF1、HOTA）评估 TSA。TSA 持续改进了目标分解和时间身份保持，在长且严重遮挡的视频上取得了大幅提升。

English abstract

Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object's representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous slot via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference. To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
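
The two uses of the activation score can be sketched as follows. This is a minimal illustration, assuming a convex-combination gate for the state update and a log-activation bias on the decoder logits; the abstract only says the update is "activation-gated" and the bias is "activation-dependent" and additive, so both exact forms, and the names `gated_slot_update` and `decoder_slot_weights`, are assumptions.

```python
import numpy as np

def gated_slot_update(prev_slots, cand_slots, alpha):
    """Blend each slot's candidate update with its previous state.
    prev_slots, cand_slots: (K, D); alpha: (K,) with values in (0, 1).
    A slot with alpha near 0 stays anchored to its previous state."""
    a = alpha[:, None]
    return a * cand_slots + (1.0 - a) * prev_slots

def decoder_slot_weights(logits, alpha, eps=1e-8):
    """Softmax over slots after adding an assumed bias log(alpha) to the logits.
    logits: (K, P) slot-by-position scores; returns weights of shape (K, P)."""
    biased = logits + np.log(alpha + eps)[:, None]
    biased = biased - biased.max(axis=0, keepdims=True)   # numerical stability
    w = np.exp(biased)
    return w / w.sum(axis=0, keepdims=True)

prev = np.zeros((2, 3))               # two slots, previous state
cand = np.ones((2, 3))                # candidate states from the current frame
alpha = np.array([0.999, 0.001])      # slot 0 active, slot 1 inactive
new_slots = gated_slot_update(prev, cand, alpha)
weights = decoder_slot_weights(np.zeros((2, 4)), alpha)
```

With these choices the inactive slot keeps almost all of its previous state and receives almost no decoder weight, which corresponds to the two failure pathways (state drift and reconstruction interference) that the abstract says TSA is meant to close.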


2606.13861 2026-06-15 cs.CV new submission

Temporal Backtracking Search for Test-time Generative Video Reasoning

时间回溯搜索：测试时生成式视频推理

Sejoon Jun, Zheng Ding, Huangyuan Su, Weirui Ye, Yilun Du

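Affiliations: Northeastern University（东北大学）; Independent Researcher（独立研究者）; Harvard & Kempner（哈佛大学与肯普纳研究所）; MIT（麻省理工学院）

AI summary: Proposes Temporal Backtracking Search (TBS), which moves the search space to the temporal axis, localizes failure points in the diffusion process, and backtracks to restart from them, markedly improving video models' test-time reasoning.

AI Chinese abstract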

虽然测试时扩展已彻底改变了大型语言模型的推理能力，但生成式视频推理仍受限于单次生成范式。我们证明，在去噪步骤上进行搜索无法挽救逻辑有缺陷的生成结果，因为空间轨迹在扩散过程的早期就已确定。根级最佳N（BoN）采样同样低效：推理错误在时间轴上早期聚集，而重新采样盲目丢弃已验证的先前进展。为了解锁视频模型的有效测试时扩展，我们引入了时间回溯搜索（TBS），它将搜索空间转移到时间轴。TBS通过三个核心机制将视频生成转化为迭代的生成-验证-重启循环：（1）可变K条件化，从任意干净的起始前缀恢复生成；（2）时间过程验证，定位失败并提取有效的重启锚点；（3）基于前缀的搜索，将计算重新分配给扩展正确轨迹，而不是根重采样。在算法、导航和机器人领域，TBS帕累托优于同等预算的BoN。在严格的分布外设置中，单次生成崩溃（BoN为0.7%），TBS达到22.7%，每个解决的片段都来自重启的分支。最终，TBS揭示了视频模型的局部推理能力远超单次生成所显示的水平，提供了一个可扩展的测试时框架来释放这种能力。

English abstract

While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process. Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis. TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.
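
The generate-verify-restart loop can be sketched on a toy problem. Everything below is a stand-in under stated assumptions: a "video" is a list of integers that should count upward, `make_noisy_counter` is a hypothetical generator that resumes from a clean prefix (playing the role of variable-K conditioning), and `first_error` is a hypothetical verifier that returns the index of the first bad frame. The real method operates on a video diffusion model and a temporal process verifier.

```python
import random

def temporal_backtracking_search(generate, verify, budget):
    """Iterate generate -> verify -> restart from the last verified prefix.
    Returns a rollout that passes verification, or None if the budget runs out."""
    prefix = []
    for _ in range(budget):
        rollout = generate(prefix)
        first_bad = verify(rollout)
        if first_bad is None:
            return rollout
        prefix = rollout[:first_bad]   # keep verified frames, discard the rest
    return None

def make_noisy_counter(length, error_rate, seed):
    """Toy generator: extends a prefix to `length` frames, corrupting each
    new frame with probability `error_rate`."""
    rng = random.Random(seed)
    def generate(prefix):
        frames = list(prefix)
        while len(frames) < length:
            correct = len(frames)
            frames.append(correct if rng.random() > error_rate else -1)
        return frames
    return generate

def first_error(frames):
    """Toy verifier: index of the first frame that breaks the counting rule."""
    for i, value in enumerate(frames):
        if value != i:
            return i
    return None

result = temporal_backtracking_search(
    make_noisy_counter(length=8, error_rate=0.3, seed=0), first_error, budget=20)
```

Root-level Best-of-N would correspond to calling `generate([])` repeatedly and discarding each failed rollout entirely, whereas this loop spends each new call only on the unverified suffix.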


2606.14094 2026-06-15 cs.CV cs.AI new submission

FEMOT: Multi-Object Tracking using Frame and Event Cameras

FEMOT: 使用帧和事件摄像机的多目标跟踪

Shiao Wang, Xiao Wang, Chao Wang, Yitao Li, Menghao Liu, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang

Affiliations: School of Computer Science and Technology, Anhui University（安徽大学计算机科学与技术学院）; Peng Cheng Laboratory（鹏城实验室）; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University（北京大学计算机学院多媒体信息处理全国重点实验室）; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University（北京大学深圳研究生院电子与计算机工程学院）; Harbin Institute of Technology, Shenzhen（哈尔滨工业大学（深圳））

AI summary: Proposes FEMOT, a large-scale RGB-event multi-object tracking dataset, and FEMOTR, a multimodal tracking framework that decouples features and fuses them in the frequency domain, exploiting complementary information for robust tracking.

AI Chinese abstract

传统的RGB摄像机因其捕获丰富外观和语义信息的能力而被广泛用于多目标跟踪。然而，在复杂的现实挑战下，如运动模糊、低照度和过度曝光，其性能通常会下降。受生物启发的事件摄像机提供高时间分辨率和高动态范围，在极端场景下提供互补线索。尽管如此，由于缺乏大规模且标注良好的数据集，RGB-事件多目标跟踪仍未被充分探索。为解决这一问题，我们提出了FEMOT，一个大规模RGB-事件多目标跟踪数据集，涵盖多样化的现实场景和14个具有挑战性的属性。凭借RGB和事件数据以及高质量标注，FEMOT为系统评估RGB-事件多目标跟踪方法提供了可靠平台。基于FEMOT，我们重新训练并评估了超过十个强跟踪器，从而为未来研究建立了全面的基准。此外，我们提出了FEMOTR，一种多模态跟踪框架，该框架解耦RGB和事件特征并在频域中融合它们，从而有效利用其互补特性实现鲁棒的目标定位和身份关联。在FEMOT和DSEC-MOT数据集上的大量实验证明了所提方法的有效性。源代码和基准数据集已在 https://github.com/Event-AHU/FEMOT 上发布。

English abstract

Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the lack of large-scale and well-annotated datasets. To address this issue, we propose FEMOT, a large-scale RGB-event multi-object tracking dataset that covers diverse real-world scenarios and 14 challenging attributes. With both RGB and event data as well as high-quality annotations, FEMOT provides a reliable platform for systematically evaluating RGB-event multi-object tracking methods. Based on FEMOT, we retrain and evaluate over ten strong trackers, thereby establishing a comprehensive benchmark for future research. Furthermore, we propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain, thereby effectively exploiting their complementary characteristics for robust object localization and identity association. Extensive experiments on FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method. The source code and benchmark dataset have been released on https://github.com/Event-AHU/FEMOT.


2606.14380 2026-06-15 cs.CV New submission

FLaRA: Predicting Future Latent Representations for Accident Anticipation

Lorenzo Caselli, Tomaso Trinci, Tommaso Bianconcini, Simone Magistri, Leonardo Taccari, Francesco Sambo, Andrew D. Bagdanov

Affiliations: Department of Information Engineering, University of Florence; Verizon Connect

AI summary: Proposes FLaRA, an architecture that anticipates accidents by predicting future latent representations, reaching state-of-the-art performance on Nexar and other datasets.

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

Abstract

Anticipating traffic accidents from dashcam videos is a critical challenge in intelligent transportation systems. Existing methods typically map visual context directly to a collision probability without explicitly modeling the future evolution of the driving scene. In this paper we propose FLaRA (Predicting Future Latent Representations for Accident Anticipation), a novel predictive architecture that shifts this paradigm by forecasting future latent representations for accident anticipation. Building upon the Video Joint-Embedding Predictive Architecture (V-JEPA2), our model conditions a predictor network on observed context frames to predict the forthcoming latent features of the scene. A classifier then operates on these predicted future representations rather than only on past observations. To ensure these forecasts remain grounded in realistic future dynamics, we introduce a joint training objective that simultaneously optimizes an auxiliary feature-level reconstruction loss and a cross-entropy classification loss. Extensive evaluations on the Nexar dataset, alongside cross-domain validations on the DAD, DADA-2000, and DoTA benchmarks, demonstrate that our approach achieves state-of-the-art performance while maintaining realistic early warning capabilities.
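
A minimal sketch of the joint objective described above, assuming the feature-level reconstruction term is a mean squared error and that the two terms are combined with a hypothetical weight `lam`; the abstract specifies neither. The function name and array shapes are illustrative and are not the authors' implementation.

```python
import numpy as np

def joint_loss(pred_feats, target_feats, logits, labels, lam=1.0):
    """Return (total, cross_entropy, reconstruction) for one batch.

    pred_feats / target_feats: predicted and actual future latents, same shape.
    logits: (batch, num_classes) classifier outputs; labels: (batch,) int class ids.
    lam: hypothetical weight on the reconstruction term (an assumption).
    """
    # Feature-level reconstruction term: mean squared error over all latent entries.
    recon = np.mean((pred_feats - target_feats) ** 2)
    # Numerically stable log-softmax for the classification term.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(labels)), labels])
    return ce + lam * recon, ce, recon
```

Keeping the reconstruction term separate from the classification term makes it easy to check whether the predicted latents stay close to the real future features while the classifier trains.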


2606.14631 2026-06-15 cs.CV New submission

SED: Lightweight Saliency prediction for Event-based data via Distillation

Romaric Mazna, Jean Martinet, Michele Magno

Affiliations: i3S/CNRS, Université Côte d’Azur; ETH Zürich

AI summary: Proposes SED, a lightweight network for saliency prediction on event data that combines knowledge distillation with a Depthwise Spatio-Temporal Block (DSTconv); it cuts model size by 562x and parameter count by 554x while matching or outperforming its teacher.

Abstract

Event-based saliency prediction has gained attention recently, as combining event cameras with saliency estimation can act as an upstream stage that naturally improves the efficiency of downstream event-based perception at the edge. However, current approaches are either neuromorphic, underperforming on event-based saliency benchmarks, or too heavy for resource-constrained edge applications due to their reliance on transformers or 3D convolutions. Drawing inspiration from efficient convolutional modules and aiming to exploit the temporal information in event data, we propose SED, a lightweight network trained through knowledge distillation and built on a Depthwise Spatio-Temporal Block (DSTconv) -- a factorization of the 3D depthwise separable convolution. Relative to its teacher, our model reduces the model size from 180 MB to 0.32 MB (562x) and the parameter count from 45M to 81k (554x), while matching or outperforming it on the N-DHF1K and N-UCF Sports datasets. Moreover, it generalizes strongly beyond its training distribution, transferring from synthetic to real event data where a model trained from scratch fails.
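
To make the benefit of factorizing a 3D depthwise separable convolution concrete, the sketch below counts weights for three layer types. It assumes the factorization splits the depthwise k_t x k_h x k_w kernel into a spatial 1 x k_h x k_w kernel followed by a temporal k_t x 1 x 1 kernel. That is one common choice and may differ from the actual DSTconv design. Biases are ignored and all function names are hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weights in a standard 3D convolution (no bias)."""
    return c_in * c_out * kt * kh * kw

def depthwise_separable_params(c_in, c_out, kt, kh, kw):
    """One kt x kh x kw filter per input channel, then a 1x1x1 pointwise mix."""
    return c_in * kt * kh * kw + c_in * c_out

def factorized_depthwise_params(c_in, c_out, kt, kh, kw):
    """Assumed factorization: per-channel spatial (1 x kh x kw) filter,
    per-channel temporal (kt x 1 x 1) filter, then a 1x1x1 pointwise mix."""
    return c_in * kh * kw + c_in * kt + c_in * c_out

if __name__ == "__main__":
    args = (64, 64, 3, 3, 3)
    for fn in (conv3d_params, depthwise_separable_params, factorized_depthwise_params):
        print(fn.__name__, fn(*args))
```

For 64 input and output channels with 3 x 3 x 3 kernels, the pointwise mix dominates both separable variants, so the extra saving from factorizing the depthwise part is modest compared with the saving from separability itself.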


2604.15173 2026-06-15 cs.CV Updated version

CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

Dongyu Wang, Dar-Yen Chen, Yi-Zhe Song

Affiliations: SketchX, CVSSP, University of Surrey

AI summary: Proposes CaricHarmony, a training-free method that resolves contamination between identity and shape conditioning signals through parallel uncontaminated diffusion paths, yielding balanced caricature synthesis with state-of-the-art shape fidelity at comparable identity consistency.

Abstract

Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognizable distortions. We identify the root cause as condition signal contamination -- competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\mathcal{P}^{\mathrm{i}}$ (pure identity), $\mathcal{P}^{\mathrm{s}}$ (pure shape), and $\mathcal{P}^{\mathrm{i+s}}$ (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers $\mathcal{P}^{\mathrm{i+s}}$ toward optimal balance: $\mathcal{E}_{\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\mathcal{E}_{\mathrm{id}}$ employs token-level correspondence matching that is robust to extreme distortions. Unlike DemoCaricature, which requires 70 seconds of per-identity fine-tuning, or CaricatureBooth, which is constrained to Bézier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: a shape CLIP score of 0.8615 (vs. 0.8450) at a comparable identity consistency score, and an overall user preference score of 7.81 (vs. 6.06). Our method fundamentally reconceptualizes the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.


2606.13971 2026-06-15 cs.CV New submission

Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation

Xiaomeng Yang, Yanyu Li, Gordon Guocheng Qian, Ivan Skorokhodov, Viacheslav Ivanov, Avalon Vinella, Xuan Zhang, Yanzhi Wang, Sergey Tulyakov, Anil Kag

Affiliations: Northeastern University; Snap Inc.

AI summary: Proposes Prompt2Effect, a weight-driven hypernetwork that synthesizes effect-specific LoRA weights in a single forward pass with no per-effect training, preserving video quality while cutting compute from 56 GPU hours to 3.3 seconds.

Abstract

Personalizing Image-to-Video (I2V) diffusion models with specific visual effects is increasingly demanded for high-end video generation. Current practice requires training a separate Low-Rank Adaptation (LoRA) module for each effect, incurring substantial data curation and iterative optimization costs that hinder interactive control. We present Prompt2Effect, a weight-driven hypernetwork that amortizes per-effect training by directly synthesizing effect-specific LoRA weights in a single forward pass. Unlike prior hypernetworks that regress adapter weights purely from semantics, Prompt2Effect is explicitly conditioned on the frozen base model weights, grounding weight prediction in the structural geometry of each layer. Furthermore, instead of predicting raw LoRA matrices, we introduce an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together, these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models. Extensive experiments demonstrate that Prompt2Effect achieves on-par or superior video quality and effect alignment compared to conventional LoRA fine-tuning, while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, our predicted weights further improve final performance and accelerate optimization by approximately 10x.
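
The factorization ambiguity mentioned above can be illustrated directly: a LoRA update ΔW = B A is unchanged when B is replaced by B M and A by M⁻¹ A for any invertible M, so raw factors are not a unique regression target. The sketch below shows one way to pick a canonical representative with a truncated SVD. The square-root split of the singular values and the sign convention are assumptions made for illustration, not the parameterization used in the paper, and `canonicalize_lora` is a hypothetical name.

```python
import numpy as np

def canonicalize_lora(B, A):
    """Map any rank-r factorization (B, A) of the same update B @ A
    to one canonical pair of factors (assumes distinct singular values)."""
    r = B.shape[1]
    U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
    U, S, Vt = U[:, :r], S[:r], Vt[:r]
    # Resolve the remaining sign ambiguity: make the largest-magnitude
    # entry of each left singular vector positive.
    idx = np.argmax(np.abs(U), axis=0)
    signs = np.sign(U[idx, np.arange(r)])
    U = U * signs
    Vt = Vt * signs[:, None]
    # Split the singular values evenly between the two factors.
    root = np.sqrt(S)
    return U * root, root[:, None] * Vt
```

Two different factorizations of the same update then map to the same pair of matrices, which gives a predictor a single well-defined target per layer.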


2606.14035 2026-06-15 cs.CV New submission

Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

Dinh-Khoi Vo, Nhut-Thanh Le-Hinh, Viet-Tham Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Affiliations: arXiv

AI summary: Proposes FocusDiff, a tuning-free framework that uses refocusing cross-attention for precise region-level edits and extends to 360-degree indoor panorama editing; it outperforms existing zero-shot methods on the localized editing benchmark LIMB.

Comments ICCCI 2026. Project page: https://vdkhoi20.github.io/FocusDiff

Abstract

Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data. We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object's identity, structure, and appearance to the edited output. Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments. Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at https://vdkhoi20.github.io/FocusDiff.


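2606.14042 2026-06-15 cs.CV New submission

Rethinking One-Step Image Editing through ChordEdit: Reproduction, Simplification, and New Insights

Minghan Li, Jeremy Moebel, Mengyu Wang

Affiliations: Harvard AI and Robotics Lab

AI summary: By reproducing, ablating, and simplifying ChordEdit, the paper explains its mechanism: the chord window acts as a timestep offset, chord transport performs low-frequency semantic editing, and proximal alignment restores high-frequency detail. Editing thus decomposes into a coarse low-frequency transport stage and a fine high-frequency alignment stage, which points toward adaptive editing.

Comments 9 pages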

2606.14125 2026-06-15 cs.CV cs.AI New submission

Conditioning Matters: Stabilizing Inversion and Attention in Diffusion Image Editing

Zheyuan Zhan, Hongchen Li, Can Wang, Yinfei Ma, Mingzhen Huang, Ruoshi Bai, Jiawei Chen, Siwei Lyu, Defang Chen

Affiliations: State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; College of Computer Science, Zhejiang University; University at Buffalo, State University of New York

AI summary: Proposes SimEdit, a framework that improves inversion stability and editing fidelity in diffusion models by refining the precision of textual conditioning and applying token-wise cross-branch attention control; it outperforms prior attention-manipulation methods on PIE-Bench.

Comments Accepted to ECML PKDD 2026 Research Track

Abstract

Inversion-based image editing offers flexible and training-free control but still struggles with inversion accuracy and the trade-off between editing fidelity and background preservation. While recent methods improve inversion formulations or attention interactions, the role of textual conditioning in shaping diffusion dynamics and editing behavior remains underexplored. We show both empirically and theoretically that the precision of textual conditioning influences inversion stability by modulating the geometry of the diffusion velocity field, while also affecting the consistency of cross-branch attention during editing. These effects directly impact background preservation and semantic fidelity. Building on this analysis, we propose SimEdit, a conditioning-aware framework with two complementary components: (a) conditioning refinement, which constructs conditioning signals with improved semantic precision and structural alignment to facilitate stable inversion and consistent attention manipulation, and (b) token-wise cross-branch attention control, which separates edit-relevant and structure-preserving components and modulates them asymmetrically during attention manipulation. Extensive experiments on PIE-Bench demonstrate that SimEdit consistently improves both inversion reconstruction quality and editing performance over previous attention-manipulation approaches. Our code is available at https://github.com/zju-pi/SimEdit.


2606.14162 2026-06-15 cs.CV New submission

VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling

Xunzhi Xiang, Zixuan Duan, Yabo Chen, Zhengxuan Wei, Guiyu Zhang, Zixiao Gu, Zhe Gao, Haibin Huang, Chi Zhang, Qi Fan, Xuelong Li

Affiliations: Nanjing University; Institute of Artificial Intelligence, China Telecom (TeleAI); Chinese University of Hong Kong, Shenzhen

AI summary: Proposes VideoWeave, a latent-space post-training framework that constrains the generative distribution with implicit geometry-model features, reducing sensitivity to geometry reconstruction errors and improving geometric consistency in generated video.

Abstract

Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. Existing methods usually enforce geometric consistency by using explicit geometry reconstructions, such as depth maps, point clouds, or reconstructed 3D structures, to define conditions, supervision, or reward signals, making the generator sensitive to errors from upstream geometry pipelines. We propose VideoWeave, a latent-space post-training framework that uses implicit geometry-model features to constrain the generative distribution, providing a more flexible and non-rigid form of guidance that mitigates the impact of reconstruction errors from geometry models. Specifically, VideoWeave adapts these features into geometry latents and jointly models them with video latents in a shared denoising space, allowing geometry to shape the generative distribution during training. To support this process, we build GeoVid-80K, an 80K-video dataset with paired appearance and geometry representations. Experiments on text-to-video and image-to-video generation show that VideoWeave improves geometric coherence while preserving strong visual quality. VideoWeave project page at https://videoweave.github.io/


2606.14297 2026-06-15 cs.CV cs.AI New submission

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng, Chen Zhang, Yong Qin

Affiliations: Academy for Advanced Interdisciplinary Studies, Nankai University; Kling Team, Kuaishou Technology

AI summary: Proposes FoleyGenEx, a unified video-to-audio framework that uses conditional injection, multi-modal dynamic masking, and adverb-based data augmentation to combine multi-modal control, frame-level temporal alignment, and fine-grained semantics in synchronized audio synthesis.

Comments Accepted by INTERSPEECH 2026

Journal ref: INTERSPEECH 2026

Abstract

We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.


2512.04981 2026-06-15 cs.CV cs.LG Updated version

Aligned but Stereotypical? How System Prompts Shape Demographic Bias in LLM-Based Text-to-Image Models

NaHyeon Park, Na Min An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim

Affiliations: KAIST; HKUST (GZ)

AI summary: Studies how LLM-augmented text-to-image systems introduce implicit demographic bias during prompt expansion, and proposes FairPro, a training-free debiasing framework that adaptively generates fairness-aware instructions to reduce demographic disparities.

Comments Project page: https://fairpro-t2i.github.io

Abstract

Text-to-image (T2I) systems increasingly rely on Large Language Model (LLM)-based text conditioning to interpret and expand user prompts. While this improves prompt understanding and text-image alignment, we find that it can also introduce implicit demographic assumptions, even when demographic attributes are unspecified. To systematically investigate this behavior across varying levels of prompt ambiguity and complexity, we construct a comprehensive benchmark covering diverse prompt settings. Evaluations on eight recent T2I models show that LLM-based systems consistently exhibit stronger demographic skew than non-LLM-based baselines. We further analyze system prompts, a component unique to LLM-based T2I systems that guides prompt interpretation and expansion. Our analyses show that these instructions strongly influence text embeddings, which subsequently leads to biased image generations. Motivated by these findings, we propose FairPro, a training-free debiasing framework that adaptively generates fairness-aware instructions while preserving user intent. Experiments demonstrate that FairPro substantially reduces demographic disparities while maintaining prompt fidelity.


2601.19115 2026-06-15 cs.CV Updated version

MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

Ruijie Xu, Xinnan Zhu, Jiayu Ying, Daoguo Dong, Yuzhou Ji, Xin Tan

Affiliations: East China Normal University; Fudan University; Shanghai Jiao Tong University

AI summary: Proposes MUSE, a multi-agent framework that treats controllable 3D scene construction and editing as incremental requirement satisfaction, substantially raising goal success and scene preservation rates on AuthorBench.

Abstract

Text-driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non-target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editability with requirement-level state tracking, so part-level failures often lead to full-scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. In this paper, we present MUSE, a memory-grounded multi-agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement-level controllability and preservation-aware editing, we introduce AuthorBench, offering 145 constrained construction cases and a 1,584-case preservation-aware editing pool paired with external structured checks. On full construction cases, MUSE improves All-Goal success from 37.9 to 80.7 and surface-constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240-case editing test split, MUSE achieves 49.6 All-Goal success, 99.9 preservation rate, and only 0.6 unintended change rate. Beyond automated metrics, human evaluations on compared local-editing baselines support stronger alignment with user intent, and downstream navigation-proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.


2606.14292 2026-06-15 cs.CV New submission

A Robust Point Cloud Analysis Framework Inspired By Primary Visual Cortex

Jisheng Dang, Dengyue Pan, Delin Deng, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

Affiliations: School of Information Science and Engineering, Lanzhou University; School of Medical Technology, Beijing Institute of Technology; Cloud and AI BU, Huawei; School of Computing, National University of Singapore

AI summary: Inspired by the primary visual cortex, proposes DC-CCNN++, a framework that replaces MLPs with brain-inspired neural networks and adds the NRMR module and the CPVT training strategy, improving robustness in point cloud classification and segmentation with performance comparable to state-of-the-art methods.

Comments 12 pages, 2 figures, 7 tables

Abstract

Despite significant advancements in point cloud analysis, reducing energy consumption and improving robustness remain understudied, largely due to the inherent limitations of Convolutional Neural Networks (CNNs). To address this issue, we draw inspiration from the primary visual cortex and propose a Dendritic-Connected Continuous-Coupled Neural Network (DC-CCNN), a novel Brain-Inspired Neural Network (BINN) architecture for point cloud analysis. By combining discrete and continuous encoding, our design replaces traditional Multilayer Perceptrons (MLPs) with more efficient and robust BINNs. Building upon this framework, we further propose an extended model, DC-CCNN++, to improve robustness under complex corruption conditions. Specifically, we introduce a Neuro-Inspired Robust Modulation-and-Readout Module (NRMR) to enhance feature stability and decision robustness through global-context gain modulation and dual-code evidence integration. We also design a Cortically Inspired Progressive Variability Training (CPVT) strategy, which progressively exposes the model to structured environmental variability while preserving stable clean-sample anchors during training. Experimental results show that DC-CCNN++ improves the performance of brain-inspired networks on point cloud analysis while maintaining performance comparable to state-of-the-art methods. Compared with the original DC-CCNN, it achieves stronger results on both classification and part segmentation, and exhibits enhanced robustness against sparsity, occlusion, Gaussian noise, salt-and-pepper noise, and spatial transformations. With its efficiency, robustness, and biologically grounded design, DC-CCNN++ provides a promising alternative to traditional deep learning methods for point cloud analysis. Code is available at https://anonymous.4open.science/r/DC-CCNNpp-44E3.


2606.14355 2026-06-15 cs.CV eess.SP New submission

Point Cloud Upsampling through Patch-based Frequency Superposition

Marina Ritthaler, Azhar Hussian, Vasileios Belagiannis, André Kaup

Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg

AI summary: Proposes PUtPFS, an optimization-based method that selects point subsets, estimates their surface by superposing spatial frequencies, and places new points in the sparsest regions for uniform upsampling; it needs no training data and surpasses existing methods in point-to-surface distance.

Journal ref: European Conference on Signal Processing (EUSIPCO) 2026

Abstract

In recent years, neural networks have become the dominant models in most point cloud upsampling methods. Although these approaches are achieving good results, they do have drawbacks, such as a lack of interpretability and data dependency. Moreover, they have to be trained on a dataset that is similar to the test data in order to perform well. To avoid these disadvantages, we propose Point Cloud Upsampling through Patch-based Frequency Superposition (PUtPFS), an optimization-based approach that selects subsets of points and estimates the surface of this set through superpositioning spatial frequencies. Then, new points are placed on this surface. By successively selecting points in the least dense regions of the point cloud, a uniform upsampling can be reached. With this method, we surpass the current best upsampling results in the commonly considered point-to-surface distance. Furthermore, we achieve the best Chamfer and Hausdorff distance among the optimization-based approaches. As an additional advantage, our method does not need any training data and is mathematically interpretable.
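
The abstract reports Chamfer and Hausdorff distances between upsampled and reference point clouds. The sketch below shows one common definition of each, using brute-force pairwise distances in NumPy. Conventions vary between papers (squared versus unsquared distances, sum versus average of the two directions), so these formulas are an assumption and may not match the exact variants used in the evaluation.

```python
import numpy as np

def pairwise_dists(P, Q):
    """Euclidean distances between every point in P (n, 3) and Q (m, 3)."""
    return np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)

def chamfer_distance(P, Q):
    """Mean nearest-neighbour distance from P to Q plus the same from Q to P."""
    D = pairwise_dists(P, Q)
    return D.min(axis=1).mean() + D.min(axis=0).mean()

def hausdorff_distance(P, Q):
    """Largest nearest-neighbour distance in either direction."""
    D = pairwise_dists(P, Q)
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```

Chamfer distance averages over all points and so reflects overall fit, while Hausdorff distance is set by the single worst outlier. The brute-force distance matrix uses memory proportional to n times m, so large clouds would need a spatial index instead.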


2606.14389 2026-06-15 cs.CV New submission

ZipSplat: Fewer Gaussians, Better Splatting

Alexander Veicht, Sunghwan Hong, Dániel Baráth, Marc Pollefeys

Affiliations: ETH Zürich; Microsoft

AI summary: Proposes ZipSplat, a token-based feed-forward model that compresses visual tokens by clustering and decodes them into groups of Gaussians, allowing a quality-efficiency trade-off without retraining; it sets a new state of the art on DL3DV and RealEstate10K with about 6x fewer Gaussians.

Abstract

Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with ${\sim}6{\times}$ fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1dB and 1.2dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at https://veichta.com/zipsplat.
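
The token compression step described above can be sketched with plain Lloyd-style k-means, where the number of clusters `k` is the knob that trades quality for efficiency at inference. This is a generic illustration under assumed token shapes, with random initialization and a fixed iteration count. The function name is hypothetical and the details of the authors' clustering are not given in the abstract.

```python
import numpy as np

def kmeans_tokens(tokens, k, iters=20, seed=0):
    """Compress (n, d) dense tokens into (k, d) scene tokens.

    Returns the cluster centers and the final cluster index of each token.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct input tokens.
    centers = tokens[rng.choice(len(tokens), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each token to its nearest center.
        dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Recompute assignments so they match the returned centers.
    dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
    return centers, dists.argmin(axis=1)
```

Because `k` is chosen when the function is called rather than fixed during training, the same set of dense tokens can be reduced to a small or a large scene-token set depending on the available budget.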


2606.14072 2026-06-15 cs.CV cs.CL New submission

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

Wentao Ke, Jianche Liu

Affiliations: Department of Mechanical Engineering, Stanford University; School of Medicine, Stanford University

AI summary: Proposes a two-stage framework that produces a coarse segmentation with Swin-UNETR, refines boundaries with a conditional diffusion model, and then uses a multimodal language model to generate structured reports, improving the accuracy and interpretability of pediatric brain tumor segmentation.

Abstract

Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.

2606.14194 2026-06-15 cs.CV cs.LG 新提交

Hybrid Classical-Quantum (HCQ) Alzheimer's Classification via Supervised β-VAE and Quantum Kernels

混合经典-量子（HCQ）阿尔茨海默病分类：基于监督β-VAE与量子核

Tia Tiwari, Vamshi Krishna Kancharla, Neelam Sinha

发表机构 * Centre for Brain Research, Indian Institute of Science（印度科学研究所脑研究中心）； Vision and AI Lab (VAL), Indian Institute of Science（印度科学研究所视觉与人工智能实验室）

AI总结提出两阶段混合经典-量子流水线，通过监督3D β-VAE压缩MRI为64维潜码，经PLS选择6个成分编码为6量子比特态，利用量子核SVM实现AD分类，在ADNI-1上达72.1%准确率与0.799 AUC。

AI中文摘要

本文提出了一种两阶段混合经典-量子（HCQ）流水线，用于从3D T1加权结构MRI体数据中进行二元阿尔茨海默病（AD）分类，其中经典和量子组件设计为互补而非独立运行。一个监督的3D β-变分自编码器（VAE）在体素级重建、KL散度和焦点分类损失下进行端到端训练，将每个3D MRI体数据（从152 x 184 x 152重采样为96 x 96 x 96）压缩为64维潜码。偏最小二乘（PLS）回归选择潜码中最佳区分AD与认知正常（CN）受试者的六个分量，并将其重新缩放为旋转角度，通过ZZ量子特征映射编码到六量子比特寄存器上，得到相应的量子态。预计算核支持向量机（SVM）的输入是一个N x N Gram矩阵（N = 308），通过计算每对量子态之间的重叠得到。本工作的新颖之处在于量子核直接作用于由监督自编码器端到端学习的疾病感知特征，而非预提取的输入。在308名ADNI-1受试者（包括137名AD和171名CN）上，基线模型达到67.2%的准确率和0.759的AUC，而稳定性增强变体达到72.1%的准确率和0.799的AUC，且交叉验证方差减半。3D Grad-CAM进一步帮助验证了模型对与阿尔茨海默病相关脑区的关注。HCQ流水线可作为跨生物医学成像领域的诊断分类通用框架，这些领域对经典方法存在类似挑战。

英文摘要

This paper presents a two-stage Hybrid Classical-Quantum (HCQ) pipeline for binary Alzheimer's disease (AD) classification from 3D T1-weighted structural MRI volumes, where the classical and quantum components are designed to complement each other rather than operate independently. A supervised 3D β-variational autoencoder (VAE) is trained end-to-end under voxel-wise reconstruction, KL-divergence, and focal classification losses to compress each 3D MRI volume (resized from 152 x 184 x 152 to 96 x 96 x 96) into a 64-dimensional latent code. Partial Least Squares (PLS) regression selects the six components in the latent code that best separate AD from cognitively normal (CN) subjects and rescales them into rotation angles, which are encoded onto a six-qubit register using the ZZ quantum feature map to produce the corresponding quantum states. The input to a precomputed-kernel Support Vector Machine (SVM) is an N x N Gram matrix (N = 308), created by calculating the overlap between every pair of quantum states. The novelty of this work lies in the fact that the quantum kernel operates directly on disease-aware features that are learned end-to-end by a supervised autoencoder, rather than on pre-extracted inputs. On 308 ADNI-1 subjects, consisting of 137 AD and 171 CN subjects, the baseline achieved 67.2% accuracy and 0.759 AUC, while the stability-enhanced variant reached 72.1% accuracy and 0.799 AUC with cross-fold variance halved. 3D Grad-CAM further helped validate our model's focus on brain regions linked to Alzheimer's. The HCQ pipeline could serve as a general-purpose framework for diagnostic classification across biomedical imaging domains that present similar challenges for classical approaches.
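The quantum-kernel step (six angles, a ZZ-style feature map, and a Gram matrix of pairwise state overlaps) can be illustrated with a small classical statevector simulation. This is a sketch under stated assumptions: it uses a single feature-map layer with phases `x_j` and `(pi - x_j)(pi - x_k)`, which is one common convention and not necessarily the circuit, repetition count, or scaling used in the paper. The names `zz_feature_state`, `kernel`, and `gram_matrix` are hypothetical.

```python
import cmath
import math
from itertools import product


def zz_feature_state(x):
    """Statevector of a one-layer ZZ-style feature map applied to |0...0>.

    Hadamards give a uniform superposition; a diagonal phase then encodes
    single-feature terms x_j and pairwise terms (pi - x_j)(pi - x_k).
    (Assumed convention, for illustration only.)
    """
    n = len(x)
    norm = 1.0 / math.sqrt(2 ** n)
    state = []
    for bits in product((0, 1), repeat=n):
        z = [1 - 2 * b for b in bits]  # Z eigenvalue: +1 for bit 0, -1 for bit 1
        phase = sum(x[j] * z[j] for j in range(n))
        phase += sum((math.pi - x[j]) * (math.pi - x[k]) * z[j] * z[k]
                     for j in range(n) for k in range(j + 1, n))
        state.append(norm * cmath.exp(1j * phase))
    return state


def kernel(x, y):
    """Fidelity kernel |<psi(x)|psi(y)>|^2 between two encoded samples."""
    sx, sy = zz_feature_state(x), zz_feature_state(y)
    overlap = sum(a.conjugate() * b for a, b in zip(sx, sy))
    return abs(overlap) ** 2


def gram_matrix(samples):
    """N x N matrix of pairwise kernel values."""
    return [[kernel(a, b) for b in samples] for a in samples]
```

A matrix built this way has ones on the diagonal and is symmetric, so it can be handed to any SVM implementation that accepts a precomputed kernel.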

2606.14251 2026-06-15 cs.CV 新提交

HiST: A Hierarchical Sparse Transformer for Cross-Modal Spatial Transcriptomics Modeling

HiST：用于跨模态空间转录组建模的分层稀疏Transformer

Weiyi Wu, Xinwen Xu, Xingjian Diao, Siting Li, Zhi Wei, Alma Andersson, Jiang Gui

发表机构 * New Jersey Institute of Technology（新泽西理工学院）； Stevens Institute of Technology（史蒂文斯理工学院）； Karolinska Institutet（卡罗林斯卡学院）； Dartmouth College（达特茅斯学院）

AI总结提出HiST，一种分层稀疏Transformer，通过稀疏窗口注意力和分辨率变换算子实现高效的多尺度空间转录组推断，显著降低计算开销并提升预测性能。

Journal ref: ICML 2026

AI中文摘要

空间转录组学（ST）将基因表达与组织形态联系起来，但成本高且通量低，因此需要从常规组织学推断表达的替代方法。全切片H&E到ST推断将千兆像素图像与稀疏、不规则位置上的基因测量配对，使得多尺度建模在不产生密集网格开销或二次令牌混合的情况下具有挑战性。我们提出HiST，一种分层稀疏Transformer，将测量位置视为晶格索引的稀疏场，并直接在活跃组织足迹上构建二进（dyadic）编码器-解码器。HiST结合了稀疏窗口注意力用于局部几何对应，以及分辨率变换算子用于快速多尺度上下文整合。对于固定窗口大小，主要运行时间和内存随观测位置数量而非密集切片面积扩展。为缓解切片特定的采集变异，HiST通过一个瓶颈全局条件路径添加了“切片校准令牌”，该令牌总结切片级上下文并调节局部表示。在涵盖不同组织和采集源的多器官基准测试中，HiST在降低运行时间和峰值内存的同时，提升了相对于近期基线的预测性能。

英文摘要

Spatial transcriptomics (ST) links gene expression with tissue morphology but remains expensive and low-throughput, motivating surrogates that infer expression from routine histology. Whole-slide H&E-to-ST inference pairs a gigapixel image with gene measurements at a sparse, irregular set of locations, making multiscale modeling challenging without incurring dense-grid overhead or quadratic token mixing. We propose HiST, a hierarchical sparse transformer that treats measured locations as a lattice-indexed sparse field and builds a dyadic encoder-decoder directly on the active tissue footprint. HiST combines sparse window attention for local geometric correspondence with resolution-changing operators for rapid multiscale context integration. For a fixed window size, the dominant runtime and memory scale with the number of observed locations rather than the dense slide area. To mitigate slide-specific acquisition variation, HiST adds a bottlenecked global conditioning pathway via a slide calibration token that summarizes slide-level context and conditions local representations. On a multi-organ benchmark spanning diverse tissues and acquisition sources, HiST improves predictive performance over recent baselines while reducing runtime and peak memory.
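The scaling claim (cost follows the number of observed locations, not the dense slide area) can be made concrete with a small sketch. It groups occupied lattice sites into fixed-size windows and applies one dyadic coarsening step. This illustrates the general idea only; the helper names and the feature averaging are assumptions, not HiST's actual operators.

```python
from collections import defaultdict


def window_partition(coords, window):
    """Group occupied lattice sites by window; empty windows never appear."""
    groups = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        groups[(x // window, y // window)].append(idx)
    return dict(groups)


def attention_pair_count(groups):
    """Number of query-key pairs if attention is restricted to each window."""
    return sum(len(members) ** 2 for members in groups.values())


def coarsen(coords, feats):
    """Dyadic downsampling: halve coordinates, average features of merged sites."""
    buckets = defaultdict(list)
    for (x, y), f in zip(coords, feats):
        buckets[(x // 2, y // 2)].append(f)
    new_coords = sorted(buckets)
    new_feats = [sum(buckets[c]) / len(buckets[c]) for c in new_coords]
    return new_coords, new_feats
```

Because a window of side `w` holds at most `w * w` sites, the pair count is bounded by `N * w * w` for `N` observed sites, regardless of how large the empty background is.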

2606.14534 2026-06-15 cs.CV 新提交

A Lightweight Fiducial-Based Pipeline for 3D Hyperspectral Mapping of ex-vivo Lumpectomy Specimens

一种轻量级基于基准的离体肿块切除标本三维高光谱映射流水线

Anna Bicchi, Alberto Rota, Leonardo Passoni, Nicola Ancellotti, Andrea Peroni, Lorenzo Vinco, Dario Polli, Elena De Momi

发表机构 * Politecnico di Milano（米兰理工大学）； Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano（米兰理工大学电子、信息与生物工程系）； Department of Physics, Politecnico di Milano（米兰理工大学物理系）

AI总结提出一种全自动、免标定流水线，利用RGB图像和单次HSI采集生成离体肿块切除标本的三维高光谱点云，通过ArUco标记实现亚毫米级配准，支持术中切缘评估。

AI中文摘要

高光谱成像（HSI）是保乳手术（BCS）中用于术中评估切缘的一种有前景的模态，但其临床转化需要将固有的二维光谱信息与切除组织的三维形状对齐，以便精确定位可疑区域进行靶向随访。我们提出了一种全自动、免标定的流水线，该流水线从一组消费级相机RGB图像和单次自上而下的HSI采集生成离体肿块切除标本的三维高光谱点云。三维几何结构通过深度学习运动恢复结构（Structure-from-Motion）骨干网络重建，并通过自定义光束法平差（bundle adjustment）在度量参考框架中稳定，该平差对放置在标本周围的四个ArUco标记的角点强制执行一致性。然后，HSI立方体在不恢复HSI相机位姿的情况下配准到重建结果：两种模态中可见的标记定义了16个角点对应关系，驱动平面单应性（planar homography），并通过在正交渲染的深度图上查找恢复三维坐标。在两个离体肿块切除标本上评估，该流水线实现了中位三维配准误差低于1毫米，二维重投影误差低于0.02毫米，在加速硬件上每个标本的总处理时间低于4分钟。这些结果支持将HSI引导的空间定位集成到保乳手术的术中切缘评估工作流程中的可行性。

英文摘要

Hyperspectral Imaging (HSI) is a promising modality for intraoperative assessment of resection margins in Breast-Conserving Surgery (BCS), but its clinical translation requires aligning the inherently 2D spectral information onto the 3D shape of the excised tissue so that suspicious regions can be precisely localized for targeted follow-up. We present a fully automated, calibration-free pipeline that produces a 3D hyperspectral point cloud of an ex-vivo lumpectomy specimen from a set of consumer-camera RGB images and a single top-down HSI acquisition. The 3D geometry is reconstructed with a deep-learning Structure-from-Motion backbone, stabilized in a metric reference frame by a custom bundle adjustment that enforces consistency on the corners of four ArUco markers placed around the specimen. The HSI cube is then registered to the reconstruction without recovering the HSI camera pose: the markers, visible in both modalities, define 16 corner correspondences that drive a planar homography, and 3D coordinates are recovered by lookup on an orthographically rendered depth map. Evaluated on two ex-vivo lumpectomy specimens, the pipeline achieves a median 3D registration error below 1 mm and a 2D reprojection error below 0.02 mm, with a total per-specimen processing time under 4 minutes on accelerated hardware. These results support the feasibility of integrating HSI-guided spatial localization into intraoperative margin assessment workflows for breast-conserving surgery.
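The registration step rests on a planar homography estimated from 16 marker-corner correspondences. The sketch below fits such a homography by linear least squares in pure Python. It assumes the bottom-right entry of the matrix can be fixed to 1 and that coordinates are scaled to order one for numerical stability; the paper's own solver and normalization are not specified in the abstract, and all function names here are illustrative.

```python
def solve_linear(m, rhs):
    """Solve m @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_homography(src, dst):
    """Least-squares homography (h33 fixed to 1) from >= 4 correspondences."""
    rows, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    # Normal equations: (A^T A) h = A^T b with 8 unknowns.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * bi for r, bi in zip(rows, b)) for i in range(8)]
    h = solve_linear(ata, atb)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(H, pt):
    """Map a 2D point through H with the projective division."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

With four markers of four corners each, the system has 32 equations for 8 unknowns, so the fit is overdetermined and averages out corner-detection noise. Once a pixel is mapped into the rendered view, its 3D coordinate would come from the depth-map lookup described in the abstract.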

2606.14568 2026-06-15 eess.IV cs.CV 交叉投稿

ShearFuse-UNet: Hadamard、DCT和Shearlet变换融合用于次日野火蔓延预测

Ene Meco, Yingyi Luo, Emadeldeen Hamdan, Adam Watts, Ahmet Enis Cetin

发表机构 * University of Illinois Chicago（伊利诺伊大学芝加哥分校）； US Forest Service, Pacific Wildland Fire Science Laboratory（美国林务局太平洋野火科学实验室）

AI总结提出ShearFuse-UNet，一种轻量级深度学习模型，通过融合WHT、DCT和Shearlet变换分支，在U-Net编码器中实现多模态卫星数据的次日野火蔓延预测，以267k参数达到0.596 F1分数，优于ResNet18 U-Net。

AI中文摘要

我们提出了ShearFuse-UNet，一种轻量级且计算高效的深度学习模型，用于从多模态卫星数据预测次日野火蔓延。该模型在U-Net骨干网络的每个编码器块内集成了三个互补的变换域分支：二维快速沃尔什-阿达玛变换（WHT）分支、二维离散余弦变换（DCT）分支和锥自适应数字Shearlet残差分支。WHT和DCT分支通过可学习的频谱缩放和固定的软阈值建立正交潜在空间，而Shearlet分支提供各向异性的多方向特征分解，显式编码火线特有的细长边缘结构。一个学习的SpectralFusion门自适应地组合WHT和DCT响应，并将Shearlet重构作为残差添加。这种三分支设计与Transformer自注意力有松散的结构类比：WHT和DCT分支提供自适应融合的互补频谱表示，而Shearlet分支通过残差路径贡献方向内容。与自注意力不同，所提出的设计依赖于固定的数学变换而非学习的投影算子，减少了参数数量和计算成本。在WildfireSpreadTS数据集上评估，ShearFuse-UNet仅用267k参数就达到了0.596的F1分数，优于基于ResNet18的U-Net（14M参数，F1=0.589），展示了非常有利的精度-效率权衡。在Google Next-Day Wildfire Spread数据集上的结果进一步验证了这些发现。

英文摘要

We propose ShearFuse-UNet, a lightweight and computationally efficient deep learning model for next-day wildfire spread prediction from multi-modal satellite data. The model integrates three complementary transform-domain branches inside each encoder block of a U-Net backbone: a 2D Fast Walsh-Hadamard Transform (WHT) branch, a 2D Discrete Cosine Transform (DCT) branch, and a cone-adapted digital Shearlet residual branch. The WHT and DCT branches establish orthogonal latent spaces with learnable spectral scaling and fixed soft-thresholding, while the Shearlet branch provides anisotropic, multi-directional feature decomposition that explicitly encodes the elongated edge structures characteristic of fire fronts. A learned SpectralFusion gate adaptively combines the WHT and DCT responses, and the Shearlet reconstruction is added as a residual. This three-branch design bears a loose structural analogy to transformer self-attention: the WHT and DCT branches provide complementary spectral representations that are adaptively fused, while the Shearlet branch contributes directional content through a residual pathway. Unlike self-attention, the proposed design relies on fixed mathematical transforms rather than learned projection operators, reducing parameter count and computational cost. Evaluated on the WildfireSpreadTS dataset, ShearFuse-UNet achieves an F1 score of 0.596 with only 267k parameters, outperforming a ResNet18-based U-Net (14M parameters, F1 = 0.589) and demonstrating a highly favorable accuracy-efficiency trade-off. Results on the Google Next-Day Wildfire Spread dataset further validate these findings across a different benchmark.
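The WHT branch described above (a fixed transform, learnable spectral scaling, soft-thresholding, inverse transform) can be sketched in one dimension. The paper uses a 2D transform inside a U-Net encoder; the 1D version below, with its hypothetical `wht_branch` helper and plain-list scales, is only meant to show the mechanics.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                u, v = a[i], a[i + h]
                a[i], a[i + h] = u + v, u - v  # butterfly: sums and differences only
        h *= 2
    return a


def soft_threshold(c, t):
    """Shrink a coefficient toward zero by t; small coefficients become 0."""
    if c > t:
        return c - t
    if c < -t:
        return c + t
    return 0.0


def wht_branch(signal, scales, threshold):
    """Transform, scale each coefficient, soft-threshold, then transform back."""
    n = len(signal)
    coeffs = fwht(signal)
    shrunk = [soft_threshold(s * c, threshold) for s, c in zip(scales, coeffs)]
    return [v / n for v in fwht(shrunk)]  # H @ H = n * I, so dividing by n inverts
```

The transform itself has no parameters and needs only additions and subtractions, which is consistent with the abstract's point that fixed transforms keep the parameter count low; in this sketch only `scales` would be learned.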

2606.14619 2026-06-15 cs.CV 新提交

StereoGeo: an end-to-end stereo camera calibration method

StereoGeo：一种端到端立体相机标定方法

Imane Meddour, Andréa Macario Barros, Cédric Gouy-Pailler

发表机构 * IMB - Institut Mines-Télécom, Paris, France（IMB - 法国巴黎矿业电信学院）

AI总结提出端到端网络StereoGeo，通过集成深度特征提取与可微优化器，同时估计双目相机的焦距、重力方向和相对外参，在真实基准上实现竞争性的内参标定和准确的外参估计。

Comments 5 pages, 1 figure, accepted at the 34th European Signal Processing Conference (EUSIPCO 2026)

2606.14638 2026-06-15 cs.CV astro-ph.EP 新提交

Improving Lunar Topography with Deep Learning Schrödinger Bridges

利用深度学习薛定谔桥改进月球地形

Matthew Repasky, Erwan Mazarico, Michael K. Barker, Stefano Bertone, Terence J. Sabaka, Yao Xie

发表机构 * H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology（佐治亚理工学院H. Milton Stewart工业与系统工程系）； NASA Goddard Space Flight Center（美国国家航空航天局戈达德太空飞行中心）； Center for Research and Exploration in Space Science and Technology (CRESST II), University of Maryland, College Park（马里兰大学帕克分校空间科学与技术研究与探索中心（CRESST II））； National Institute for Astrophysics (INAF), Astrophysical Observatory of Turin（意大利国家天体物理研究所（INAF）都灵天体物理天文台）

AI总结提出基于扩散薛定谔桥的生成模型，结合光学影像约束，实现月球地形超分辨率重建，并提供像素级不确定性估计。

DOI: 10.3847/PSJ/ae6244
Journal ref: The Planetary Science Journal 7.6 (2026): 139

AI中文摘要

提高行星地形模型的分辨率可以更好地理解表面过程和地貌；然而，现有的解析超分辨率方法成本高昂且难以大规模应用。生成模型提供了学习数据中复杂关系的工具，并且由于硬件加速器和并行化，可以大规模应用。我们提出了一种基于扩散的薛定谔桥（SB）生成建模方法，用于月球地形超分辨率，连接低分辨率地形分布与高分辨率地形分布，并结合物理约束的光学影像。我们的方法受到现有明暗恢复形状（Shape-from-Shading）方法的启发，这些方法通过使用目标分辨率的光学图像来改进先验的低分辨率地形。我们在一个新颖的渲染月球地形数据集上训练SB，模拟来自月球勘测轨道器窄角相机的光学影像。结果是一种灵活的地形超分辨率方法，可以在重建中提供像素级的不确定性。

英文摘要

Increasing the resolution of planetary topography models can enable a better understanding of surface processes and geomorphology; however, existing analytical super-resolution methods are expensive and difficult to apply at large scales. Generative models provide the tools to learn complex relationships within data and can be applied at scale due to hardware accelerators and parallelization. We present a diffusion-based Schrödinger Bridge (SB) generative modeling approach for lunar topography super-resolution, connecting the distribution of low-resolution topography to that of high-resolution topography, incorporating physically-constraining optical imagery. Our approach is inspired by existing Shape-from-Shading methods, which improve a priori low-resolution topography by using optical images at the target resolution. We train SBs on a novel dataset of rendered lunar topography, emulating optical imagery from the Lunar Reconnaissance Orbiter Narrow Angle Camera. The result is a flexible approach for topography super-resolution which can provide pixel-level uncertainties in the reconstruction.

2606.13957 2026-06-15 eess.IV cs.CV cs.MM 交叉投稿

High-Fidelity Video Compression based on Invertible Neural Transform and Implicit Conditioning

基于可逆神经变换和隐式条件的高保真视频压缩

Siyue Teng, Ho Man Kwan, Yuxuan Jiang, Fan Zhang, David Bull

发表机构 * Visual Information Lab, University of Bristol, UK（布里斯托大学视觉信息实验室）

AI总结提出InnVC，一种基于可逆神经网络和隐式条件场的视频编解码器，通过保持可逆主变换路径并解耦相关内容和细节，在高质量区域实现显著性能提升。

AI中文摘要

基于学习的视频压缩最近在率失真性能上已与传统视频编解码器相媲美。然而，大多数现有方法依赖于不可逆的分析-合成变换，重建质量受到量化和变换近似误差的双重影响。在高质量点，量化误差较小，变换引起的失真占主导地位，这一限制尤为突出。为此，我们提出InnVC，一种基于可逆神经网络的视频编解码器，用于宽范围和高保真压缩。核心思想是在量化前保留可逆的主变换路径，同时通过紧凑的隐式条件场注入内容自适应上下文。这将强相关的视频内容与难以建模的细节解耦，使不同组件专门负责互补的重建任务，从而实现更高效的压缩。为进一步提高可压缩性，我们引入了一种调度掩码策略，逐步将信息内容集中到更少的潜在通道中，以实现更有效的熵编码。在UVG和MCL-JCV基准上的实验表明，InnVC在广泛的质量范围内实现了强大的压缩性能，在高质量区域尤为有效，相对于x265在UVG上PSNR的BD率降低21.66%，MS-SSIM降低46.06%。据我们所知，InnVC是首个在单一架构尺度内覆盖从低比特率到高保真操作点的神经视频编解码器，PSNR跨度超过20 dB。

英文摘要

Learning-based video compression has recently achieved competitive rate-distortion performance compared to conventional video codecs. However, most existing methods rely on non-invertible analysis-synthesis transforms, with reconstruction quality subject to both quantization and transform approximation errors. This limitation becomes particularly restrictive at higher quality points, where quantization errors are small and transform-induced distortion dominates. To address this, we propose InnVC, an Invertible neural network based Video Codec for wide-range and high-fidelity compression. The core idea is to preserve an invertible main transform path prior to quantization, while injecting content-adaptive context through a compact implicit conditioning field. This decouples strongly correlated video content from harder-to-model fine details, allowing different components to specialize in complementary reconstruction tasks for more efficient compression. To further improve compressibility, we introduce a scheduled masking strategy that progressively concentrates informative content into fewer latent channels for more effective entropy coding. Experiments on the UVG and MCL-JCV benchmarks show that InnVC achieves strong compression performance over a broad quality range, being particularly effective in the high-quality regime, yielding BD-rate reductions of 21.66% in PSNR and 46.06% in MS-SSIM relative to x265 on UVG. To the best of our knowledge, InnVC is the first neural video codec that covers operating points from low bitrate to high fidelity within a single architecture scale, spanning more than 20 dB in PSNR.
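The motivation for an invertible transform path is that the transform itself then adds no reconstruction error, leaving quantization as the only lossy step. A standard building block with this property is the additive coupling layer, sketched below. It is a generic illustration of invertibility, not InnVC's architecture, and `shift_fn` stands in for an arbitrary learned network.

```python
def coupling_forward(x, shift_fn):
    """Additive coupling: keep the first half, shift the second half by f(first half)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [b + s for b, s in zip(x2, shift_fn(x1))]


def coupling_inverse(y, shift_fn):
    """Exact inverse: the first half is unchanged, so the same shift can be subtracted."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    return y1 + [b - s for b, s in zip(y2, shift_fn(y1))]
```

The shift function never has to be inverted, so it can be arbitrarily complex. The example assumes an even-length input and a `shift_fn` that returns one value per element of the second half; with floating-point inputs the round trip is exact only up to rounding.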

2606.14248 2026-06-15 eess.IV cs.CV 交叉投稿

Spectrum Aware Illumination Estimation Using Multispectral Image

利用多光谱图像的光谱感知光照估计

Hyejin Oh, Woo-Shik Kim, Sangyoon Lee, YungKyung Park, Je-Won Kang

发表机构 * Department of Electronic and Electrical Engineering, Ewha Womans University（梨花女子大学电子与电气工程系）； Telechips ； Samsung Advanced Institute of Technology（三星先进技术研究院）； Department of Design, Ewha Womans University（梨花女子大学设计系）

AI总结提出一种结合光谱注意力机制和光照先验的深度学习框架，通过空间-光谱特征提取块和跨传感器域变换，实现高精度光照谱估计，并在真实多光谱数据集上验证了优越性。

Comments Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). DOI: 10.1109/TCSVT.2026.3701975

DOI: 10.1109/TCSVT.2026.3701975

AI中文摘要

多光谱成像通过捕获更多光谱波段扩展了传统的RGB成像，从而改进了光照谱估计。然而，现有方法往往未能充分利用光谱信息，导致在不同光照条件和不同传感器域下性能欠佳。因此，我们提出了一种具有空间-光谱特征提取块的深度学习框架，该框架结合了光谱注意力机制以增强光谱相关性并保留与光照相关的空间特征。通过引入光照先验，我们的方法优先考虑在多光谱图像中提供更有意义信息的特定通道。我们还提出了跨不同多光谱传感器空间的光谱域变换。结果表明，在高维传感器空间中学习到的光照谱可以有效地变换到各种低维相机传感器空间，而无需任何额外训练。为了便于评估，我们引入了一个真实世界的多光谱数据集，其中包含在不同光照条件下捕获的高维真实光照谱。通过大量实验，我们证明了我们的方法相比现有模型实现了更高的准确性，从而为现实世界的光照谱估计提供了实用解决方案。代码和数据集可在以下网址获取：https://github.com/hyejin5/Spectrum-Aware-Illumination-Estimation-Using-Multispectral-Image。

英文摘要

Multispectral (MS) imaging extends beyond conventional RGB imaging by capturing more spectral bands, thereby improving illuminant spectrum estimation (ISE). However, existing methods often fail to fully exploit spectral information, resulting in suboptimal performance under diverse lighting conditions and across different sensor domains. Hence, we propose a deep learning framework with a spatio-spectral feature extraction block, which incorporates spectral attention mechanisms to enhance spectral correlation and preserve illuminant-relevant spatial features. Through the inclusion of an illuminant prior (IP), our approach prioritizes specific channels that provide more meaningful information in an MS image. We also propose a spectral-domain transform across different MS sensor spaces. The results demonstrate that illuminant spectra learned in high-dimensional sensor spaces can be effectively transformed to various lower-dimensional camera sensor spaces without any additional training. To facilitate evaluation, we introduce a real-world MS dataset containing high-dimensional ground-truth illumination spectra captured under diverse lighting conditions. Through extensive experiments, we demonstrate that our method achieves superior accuracy compared to existing models, thus providing a practical solution for real-world ISE. The code and dataset are available at https://github.com/hyejin5/Spectrum-Aware-Illumination-Estimation-Using-Multispectral-Image.

2601.21179 2026-06-15 cs.CV 版本更新

Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process

通过全局几何感知扩散过程增强水下光场图像

Yuji Lin, Qian Zhao, Zongsheng Yue, Junhui Hou, Deyu Meng

发表机构 * School of Mathematics and Statistics, Xi’an Jiaotong University（西安交通大学数学与统计学院）； School of Mathematics and Statistics and the Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi’an Jiaotong University（西安交通大学数学与统计学院和智能网络与网络安全教育部重点实验室）； Department of Computer Science, City University of Hong Kong（香港城市大学计算机科学系）； Macao Institute of Systems Engineering, Macau University of Science and Technology（澳门科技大学澳门系统工程研究所）

AI总结提出基于扩散的GeoDiff-LF框架，利用空间-角度结构增强水下4D光场成像，通过改进U-Net、几何引导损失和优化采样策略，有效缓解颜色失真，在视觉保真度和定量性能上超越现有方法。

Comments 14 pages, 9 figures

AI中文摘要

本文研究了通过4D光场（LF）成像获取高质量水下图像的挑战性问题。为此，我们提出了GeoDiff-LF，一种基于SD-Turbo的新型扩散框架，通过利用其空间-角度结构来增强水下4D LF成像。GeoDiff-LF包含三个关键改进：（1）改进的U-Net架构，带有卷积和注意力适配器以建模几何线索；（2）使用张量分解和渐进加权的几何引导损失函数以正则化全局结构；（3）优化的采样策略与噪声预测以提高效率。通过整合扩散先验和LF几何，GeoDiff-LF有效缓解了水下场景中的颜色失真。大量实验表明，我们的框架在视觉保真度和定量性能上均优于现有方法，推动了水下成像增强的最新进展。代码将在 https://github.com/linlos1234/GeoDiff-LF 公开。

英文摘要

This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.

2604.26740 2026-06-15 cs.CV cs.GR 版本更新

Rendering-Aware Sparse Sampling for BRDF Acquisition

面向BRDF采集的渲染感知稀疏采样

W. Cao, D. Jönsson, Z. Huang, J. Unger

发表机构 * Media and Information Technology, Department of Science and Technology, Linköping University（林雪平大学科学与技术系媒体与信息技术部）

AI总结提出一种渲染感知的稀疏采样方法，通过可微渲染器优化采样方向，以最少BRDF测量实现高质量材质外观重建。

AI中文摘要

精确的BRDF采集对于真实感渲染至关重要，但密集的测角光度计测量既缓慢又昂贵。我们研究如何选择一小部分BRDF测量，这些测量在学习的BRDF先验下对重建材质外观最具信息量。现有的稀疏采集方法通常优化所有材质的BRDF空间重建样本，而自适应测量的感知重要性最终取决于其对每个渲染外观的影响。因此，我们将稀疏自适应采集表述为一个渲染感知的优化问题。我们的方法结合了用于稀疏坐标-值观测的集合编码器、基于预训练超网络/PCA的BRDF重建器以及可微渲染器。在采样器训练期间，重建器保持固定，来自渲染图像损失的梯度优化测量位置。这将采集设计与先验拟合分离，并鼓励采样器选择在学习材质分布下信息量大的方向。为了使比较受控，我们在匹配的样本数量、训练/测试分割、渲染场景、对象掩码、图像映射和指标下评估均匀基线、元学习方法、HyperBRDF方法和我们学习的采样器。我们的核心主张是：当最终渲染外观是目标时，渲染感知采样改进了极其稀疏的BRDF采集。BRDF空间和组合损失仅作为消融实验报告，同时包括联合优化和仅图像潜在拟合以处理未见过的材质。

英文摘要

Accurate BRDF acquisition is essential for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small set of BRDF measurements that is most informative for reconstructing material appearance under a learned BRDF prior. Existing sparse-acquisition methods often optimize samples for BRDF-space reconstruction for all materials, while the perceptual importance of an adaptive measurement ultimately depends on its effect on each rendered appearance. We therefore formulate sparse adaptive acquisition as a rendering-aware optimization problem. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based/PCA-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor remains fixed, and gradients from a rendered-image loss optimize the measurement locations. This separates acquisition design from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. To make the comparison controlled, we evaluate the uniform baseline, meta-learning method, HyperBRDF method, and our learned sampler under matched sample numbers, train/test split, rendering scene, object mask, image mapping, and metrics. Our central claim is that rendering-aware sampling improves extremely sparse BRDF acquisition when final rendered appearance is the target. BRDF-space and combined losses are reported only as ablations, together with joint refinement and image-only latent fitting for unseen materials.

2605.28477 2026-06-15 cs.CV 版本更新

SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

SA4Depth: 自监督单目深度估计中一致的姿态-深度尺度对齐

Changxuan Li, Nadine Berner, Nassir Navab, Federico Tombari, Stefano Gasperini

发表机构 * Technical University of Munich（慕尼黑工业大学）； BMW Group（宝马集团）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心（MCML））； Google（谷歌）； VisualAIs Labs GmbH（VisualAIs实验室 GmbH）

AI总结提出SA4Depth方法，通过可微的视觉特征重投影和姿态细化，对齐自监督深度估计中深度网络和姿态网络估计的场景尺度，提升深度预测精度且不增加推理时间。

Comments Accepted by IEEE RA-L 2026

AI中文摘要

从单目序列进行自监督深度估计依赖于深度网络和姿态网络的联合学习。尽管已有大量研究致力于改进深度网络，但针对姿态的工作仍然有限。在此背景下，即使深度只能估计到相差一个尺度因子，我们仍强调姿态网络和深度网络所估计的场景尺度之间对齐的重要性。然后，我们引入了SA4Depth，一种改善这种对齐并提升深度预测的方法，同时保持推理时间不变。我们提出的方法利用训练期间估计的深度，跨连续帧重投影可学习的视觉特征，并通过减少特征对齐残差来细化姿态估计。通过我们的方法，由独立的深度网络和姿态网络估计的场景尺度得以对齐，并且不同序列之间的预测尺度一致性得到改善。我们的可微细化无缝集成到现有的自监督流程中，并显著改善了它们的深度估计。我们在KITTI、Cityscapes和NYUv2上进行了广泛的室外和室内实验，证明了这一点。此外，KITTI Odometry上的结果证实了我们姿态细化的有效性。我们的代码可在https://github.com/Runningchauncey/SA4Depth获取。

英文摘要

Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth and a pose network. Despite abundant research done to improve the depth network, efforts on the pose remain limited. In this context, even when depth is estimated up to scale, we highlight the importance of the alignment between the scene scales estimated by the pose and depth nets. Then, we introduce SA4Depth, an approach to improve this alignment and boost the depth predictions while keeping the inference time unchanged. Our proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refine the pose estimates by reducing feature alignment residuals. With our method, the estimated scene scales by the separate depth and pose networks are aligned, and the prediction scale consistency is improved across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth.
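Why the depth and pose scales must agree can be seen from the reprojection used in self-supervised training: a pixel is back-projected with its depth, moved by the relative pose, and projected into the next frame. The sketch below assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`; the function name and argument layout are illustrative and not taken from the SA4Depth code.

```python
def reproject(pixel, depth, K, R, t):
    """Project a pixel from frame A into frame B given depth and relative pose.

    K = (fx, fy, cx, cy); R is a 3x3 rotation (nested lists); t is a translation.
    """
    fx, fy, cx, cy = K
    u, v = pixel
    # Back-project the pixel to a 3D point in frame A.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Move the point into frame B.
    Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective projection back to pixel coordinates.
    return (fx * Y[0] / Y[2] + cx, fy * Y[1] / Y[2] + cy)
```

Scaling the depth and the translation by the same factor leaves the reprojected pixel unchanged, whereas scaling only one of them moves it. A photometric or feature loss therefore only stays consistent when the two networks agree on the scene scale.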

2606.13870 2026-06-15 cs.CV cs.AI cs.LG 新提交

Mirage Probes: How Vision Models Fake Visual Understanding

幻象探针：视觉模型如何伪造视觉理解

Daniel Ben-Levi, Judah Goldfeder, Weiliang Zhao, Raz Lapid, Amit LeVi, Allen G. Roush, Ravid Shwartz-Ziv, Hod Lipson

发表机构 * Columbia University（哥伦比亚大学）； Intuit ； Technion（以色列理工学院）； Thoughtworks ； New York University（纽约大学）

AI总结提出幻象探针框架，通过对比探针揭示视觉语言模型在无图像时也能回答问题的两种幻象行为：文本偏见和虚假图像，并证明后者需要表征级干预。

AI中文摘要

视觉语言模型（VLM）即使在没有提供图像的情况下，也能自信且通常正确地回答基于图像的问题。这种幻象行为会虚增基准分数，而不反映视觉基础。先前的工作将其视为单一故障模式。我们认为这是两种。使用幻象探针（Mirage Probes），一种对比探针框架，将释义的问题变体与同一图像上的匹配幻象和非幻象标签配对，我们展示了在两个开源VLM中，幻象行为可以从残差流、MLP、后注意力和注意力头位置的内部激活中线性解码。我们证明朴素贝叶斯文本基线无法恢复此信号，排除了表面词汇混淆。跨基准可分离性模式，连同一种新颖的先验利用指数（PHI），衡量模型仅从文本中回答的程度，揭示了两种不同的机制：文本偏见，其中模型从语言先验中回答而不涉及视觉表征；以及虚假图像，其中模型在潜在空间中构建虚假视觉内容并像有基础一样回答。这种区别有直接的缓解后果：文本分布清理可以解决第一种机制，但无法触及第二种，因为虚假图像幻象存在于模型的视觉表征中而非文本中。忠实的视觉基础将需要在表征层面进行干预。

英文摘要

Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model's visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.

2606.14230 2026-06-15 cs.CV cs.CL 新提交

A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

面向不同生成器的可泛化深度伪造检测的多域特征融合框架

Amna Amjid, Sana Qadir, Mehwish Fatima, Raja Khurram Shahzad

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan（巴基斯坦国立科技大学电气工程与计算机科学学院）； Department of Communication, Quality Management and Information Systems, Mid Sweden University, Ostersund Campus, Sweden（瑞典中部大学通信、质量管理和信息系统系，厄斯特松德校区）； Department of Computer Science, Electrical and Space Engineering, Lulea University of Technology, Luleå, Sweden（瑞典吕勒奥理工大学计算机科学、电气与空间工程系）

AI总结提出SGFF-Net，融合空间、梯度和DWT频率表示，在双残差学习架构中实现跨生成器和跨范式的深度伪造检测，准确率提升至79.80%。

AI中文摘要

深度伪造是人工生成的图像、音频或视频，威胁隐私、安全和信息完整性。检测此类内容对于打击虚假信息至关重要，因为最新模型能生成高度逼真的内容。虽然基于空间或频率的方法在基于生成对抗网络（GANs）的深度伪造上取得了良好的检测率，但它们往往难以处理最近的扩散模型生成的图像。特别是，现有方法很少利用互补的多域表示或系统地评估跨生成器的鲁棒性。为了解决这些挑战，我们提出了一种多域深度伪造检测框架SGFF-Net（空间-梯度-频率融合网络），该网络在双残差学习架构中集成了空间、梯度和基于DWT（离散小波变换）的频率表示。实验结果表明，SGFF-Net在数据集内评估中达到了98.95%的准确率，并在跨模型（70.46%）和跨范式（69.94%）设置中提升了性能。结合多源训练和数据增强进一步增强了鲁棒性，在跨模型评估中准确率从70.46%提升到79.80%，在跨范式评估中从69%提升到78%，在真实世界数据上从61.50%提升到75.80%。与单域检测器不同，SGFF-Net在空间、梯度和小波频率域中学习互补的取证线索，从而在跨生成器和跨范式评估中具有更强的鲁棒性。结果进一步表明，将多域表示与数据多样性和增强相结合，显著提高了泛化能力，为开发更可靠的深度伪造检测系统提供了实用见解。

英文摘要

Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly realistic content. While spatial- or frequency-based approaches achieve good detection rates on deepfakes generated by Generative Adversarial Networks (GANs), they often struggle with images generated by recent diffusion models. In particular, existing approaches rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness. To address these challenges, we propose a multi-domain deepfake detection framework called SGFF-Net (Spatial-Gradient-Frequency Fusion Network) that integrates spatial, gradient, and DWT (Discrete Wavelet Transform)-based frequency representations within a dual residual learning architecture. Experimental results show that SGFF-Net achieves 98.95% accuracy in intra-dataset evaluation and improves performance in both cross-model (70.46%) and cross-paradigm (69.94%) settings. Incorporating multi-source training and data augmentation further enhances robustness, increasing accuracy from 70.46% to 79.80% in cross-model evaluation, from 69% to 78% in cross-paradigm evaluation, and from 61.50% to 75.80% on real-world data. Unlike single-domain detectors, SGFF-Net learns complementary forensic cues across spatial, gradient, and wavelet-frequency domains, resulting in greater robustness under cross-generator and cross-paradigm evaluation. The results further show that combining multi-domain representations with data diversity and augmentation substantially improves generalization, providing practical insights for developing more reliable deepfake detection systems.
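The frequency representation is DWT-based. As a rough illustration of what one wavelet level extracts from an image, the sketch below computes a single-level 2D Haar transform. The abstract does not say which wavelet or how many levels SGFF-Net uses, so the Haar choice and the sub-band names here are assumptions.

```python
def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform of an even-sized image.

    Returns four half-size sub-bands: the local average (approx) and three
    detail bands capturing differences across rows, columns, and diagonals.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx, row_detail, col_detail, diag_detail = [], [], [], []
    for r in range(0, h, 2):
        bands = ([], [], [], [])
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            bands[0].append((a + b + d + e) / 2)  # low-pass in both directions
            bands[1].append((a + b - d - e) / 2)  # top row minus bottom row
            bands[2].append((a - b + d - e) / 2)  # left column minus right column
            bands[3].append((a - b - d + e) / 2)  # diagonal difference
        approx.append(bands[0])
        row_detail.append(bands[1])
        col_detail.append(bands[2])
        diag_detail.append(bands[3])
    return approx, row_detail, col_detail, diag_detail
```

The detail bands isolate high-frequency content, which is where generator-specific artifacts are often sought. The transform is orthonormal, so the total energy of the four bands equals that of the input.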

2606.14351 2026-06-15 cs.CV 新提交

ForceForget: Reinforcement Concept Removal for Enhancing Safety in Text-to-Image Models

ForceForget: 通过强化概念移除增强文本到图像模型的安全性

Dong Han, Yong Li

AI总结针对文本到图像模型生成不安全内容的问题，提出基于强化学习优化概念擦除奖励的方法，通过安全适配器调节文本嵌入，在消除不安全内容的同时保持模型对安全语义的生成能力。

Comments Accepted to ICML 2026

AI中文摘要

随着生成式AI的发展，文本到图像（T2I）模型能够生成各种内容。然而，T2I模型仍可能生成不安全内容。为缓解此问题，研究者提出了多种概念擦除方法。但现有方法倾向于过度擦除不安全概念，并抑制有害提示中的良性概念，从而对模型效用产生负面影响。本文中，我们专注于在消除不安全内容的同时，通过强化学习优化概念擦除奖励（CER）来保持模型对安全语义解释的能力。为避免过度内容擦除，我们引入安全适配器（Safe Adapter）来投影部分文本嵌入，以在交叉注意力层中实现高效的概念调节。在不同数据集上进行的大量实验表明，与现有最先进（SOTA）的概念擦除方法相比，所提方法在减轻不安全内容生成的同时，能保持良性图像的高保真度。在鲁棒性方面，我们的方法在对抗红队工具时优于其他方法。此外，我们展示了所提方法在新兴的图像到图像（I2I）场景中比其他方法更有效。最后，我们将方法扩展到擦除一般概念，如艺术风格和物体。免责声明：本文包含可能对某些读者造成冒犯的性露骨内容讨论。本工作中使用的所有图像均为合成图像或来自公共数据集。

英文摘要

With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, T2I models can still generate unsafe content. To alleviate this issue, various concept-erasing methods have been proposed. Existing methods, however, tend to over-erase unsafe concepts and suppress benign concepts contained in harmful prompts, which can negatively affect model utility. In this paper, we focus on eliminating unsafe content while maintaining the model's ability to interpret safe semantics by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid excessive content erasure, we introduce the Safe Adapter, which projects part of the text embedding for efficient concept regulation in cross-attention layers. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of benign images compared with existing state-of-the-art (SOTA) concept-erasing methods. In terms of robustness, our method outperforms its counterparts against red-teaming tools. Moreover, we show that the proposed approach is more effective than others in emerging image-to-image (I2I) scenarios. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to certain readers. All images used in this work are synthesized or from public datasets.

2606.14504 2026-06-15 cs.CV 新提交

Scratched Lenses, Shifted Depth: Passive Camera-Side Optical Attacks

划痕镜头，深度偏移：被动式相机侧光学攻击

Qinlin He, Zeming Zhuang, Yongji Wu, Lan Zhang, Xiaoyong Yuan

AI总结提出一种被动式镜头划痕攻击SLASH，通过光学伪影扭曲深度线索，在单目深度估计和3D目标检测中实现高达32%的相对深度偏移。

AI中文摘要

视觉系统上的物理对抗攻击通常通过场景操纵进行研究，例如对抗性补丁或投影，其中攻击者控制相机观察的内容。使用贴纸或辅助光学的相机侧攻击也已被探索，但它们将攻击视为来自设计模式的图像空间扰动。这忽略了物理缺陷如何与场景相关的光照和光学相互作用。我们识别出一种威胁：被动的镜头侧损伤，它持久存在但具有触发条件，产生在特定视觉条件下偏置几何推理的光学伪影。我们通过划痕诱导的镜头对抗性条纹劫持（SLASH）实例化这种威胁，这是一种由相机镜头或保护罩上的小划痕引起的物理世界攻击。划痕与明亮光源和镜面反射相互作用，产生扭曲深度线索的结构化条纹伪影。由于扰动在光路中固定但由场景触发，它既持久又具有选择性。我们在光学空间中制定攻击，将划痕模式建模为触发条件光学通道，并优化一个固定配置以适应不同的观看条件。我们在数字和真实世界环境中评估SLASH对单目深度估计和单目3D目标检测的效果。在固定划痕约束下，单目深度估计的方向性深度偏移达到高达32%的相对误差，对单目3D目标检测具有一致的影响。物理实验证实了向真实相机记录的迁移，诱导的深度偏移高于模型的自然预测基线。这些发现揭示了一个攻击面，其中看似无害的硬件缺陷充当潜在的、场景触发的对抗机制，挑战了关于物理鲁棒性的假设，并激励了安全视觉系统的防御措施。

英文摘要

Physical adversarial attacks on vision systems are typically studied through scene manipulation, such as adversarial patches or projections, where the adversary controls what the camera observes. Camera-side attacks using stickers or auxiliary optics have also been explored, but they treat attacks as image-space perturbations from designed patterns. This misses how physical imperfections interact with scene-dependent lighting and optics. We identify a threat: passive lens-side damage that is persistent yet trigger-conditioned, producing optical artifacts that bias geometric inference under particular visual conditions. We instantiate this threat through Scratch-induced Lens Adversarial Streak Hijacking (SLASH), a physical-world attack caused by small scratches on a camera lens or protective cover. Scratches interact with bright light sources and specular reflections to create structured streak artifacts that distort depth cues. Since the perturbation is fixed in the optical path but triggered by the scene, it is both persistent and selective. We formulate the attack in optical space, model the scratch pattern as a trigger-conditioned optical channel, and optimize one fixed configuration across diverse viewing conditions. We evaluate SLASH on monocular depth estimation and monocular 3D object detection in digital and real-world settings. Under the fixed-scratch constraint, directional depth shifts reach up to 32% relative error for monocular depth estimation, with consistent effects on monocular 3D object detection. Physical experiments confirm transfer to real camera recordings, inducing depth shifts above the model's natural prediction baseline. These findings reveal an attack surface where benign-looking hardware imperfections act as latent, scene-triggered adversarial mechanisms, challenging assumptions about physical robustness and motivating defenses for secure vision systems.

2606.14658 2026-06-15 cs.CV cs.AI 新提交

Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications

给AI带来头痛：针对计算机视觉应用的声学对抗攻击

Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结研究利用低频声波（<20 kHz）引起相机物理振动，导致AI视觉模型（如YOLO11）误分类、漏检或产生幻觉，并分析了影响攻击效果的因素。

Comments 9 pages, 7 figures, SPIE Defense + Security

DOI: 10.1117/12.3093699
Journal ref: Proc. SPIE 14046, Assurance and Security for AI-enabled Systems 2026, 1404609 (10 Jun 2026)

AI中文摘要

人工智能（AI）越来越多地被用于自动化各种现实世界的计算机视觉（CV）应用，如自动驾驶车辆控制、面部识别和安全摄像头。最近的研究表明，声学振动可以引起相机真实的物理运动，干扰其内部稳定机制。由于这种运动超出了稳定系统设计处理的条件，系统会在帧中引入伪影，导致基于AI的CV模型误分类、错过目标或产生幻觉对象。先前的工作使用超声波频率（>20 kHz）进行短距离攻击，由于高频的衰减，这些攻击仅限于短距离。在这项工作中，我们研究了使用可听范围内较低频率（<20 kHz）的声学攻击，并进一步扩展了我们的分析，包括各种图像和物体特征如何受到攻击的影响。具体来说，我们进行了物理实验，通过用各种频率共振商用相机，证明了我们的攻击对现成目标检测模型（YOLO11）的可行性。基于我们的结果，我们提供了关于使AI CV系统更容易受到这些攻击的几个因素的见解，这可能有助于未来缓解策略的开发。

英文摘要

Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz) to perform short-range attacks, which limits them to short distances due to the attenuation exhibited by high frequencies. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.

2406.09250 2026-06-15 cs.CV cs.AI cs.LG 版本更新

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

MirrorCheck: 视觉-语言模型的高效对抗防御

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； NVIDIA ； École Polytechnique Fédérale de Lausanne（洛桑联邦理工学院）； Michigan State University（密歇根州立大学）

AI总结提出MirrorCheck框架，利用文本到图像模型和随机化策略检测并防御针对视觉-语言模型的自适应对抗攻击。

AI中文摘要

视觉-语言模型（VLM）越来越容易受到复杂的对抗性攻击，包括专门设计用于绕过现有防御的自适应策略。为了解决这一漏洞，我们提出了MirrorCheck，一个鲁棒且与模型无关的检测框架，在单模态和多模态设置中均能有效运行。MirrorCheck利用文本到图像（T2I）模型从目标模型生成的标题中重建视觉内容，并通过比较原始图像和合成图像之间的特征空间嵌入来评估语义一致性。为了增强对自适应攻击的鲁棒性，MirrorCheck引入了一种随机防御策略，从多样化的模型库中随机选择T2I生成器和图像编码器。此外，我们采用了一种新颖的一次性（OTU）扰动，应用于所选编码器嵌入，并通过缩放因子调节，这降低了自适应攻击的有效性。跨多种威胁场景的大量实验表明，MirrorCheck始终优于基线方法，即使在强自适应对抗条件下也能保持其实用性。

英文摘要

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.
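
The detection loop described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the additive Gaussian form of the One-Time-Use perturbation, the threshold of 0.7, and the toy "models" (plain vectors and lambdas) are all hypothetical stand-ins.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def otu_perturb(embedding, scale, rng):
    """Hypothetical one-time-use perturbation: fresh Gaussian noise times `scale`."""
    return [x + scale * rng.gauss(0.0, 1.0) for x in embedding]


def mirror_check(image, captioner, t2i_zoo, encoder_zoo,
                 threshold=0.7, otu_scale=0.0, rng=None):
    """Return (is_flagged, similarity) for one input image.

    The generator and encoder are drawn at random from their zoos, so an
    adaptive attacker cannot know in advance which pair will be used.
    """
    rng = rng or random.Random()
    generate = rng.choice(t2i_zoo)      # random text-to-image model
    encode = rng.choice(encoder_zoo)    # random image encoder
    regenerated = generate(captioner(image))
    original_emb = otu_perturb(encode(image), otu_scale, rng)
    regenerated_emb = otu_perturb(encode(regenerated), otu_scale, rng)
    similarity = cosine_similarity(original_emb, regenerated_emb)
    return similarity < threshold, similarity


# Toy stand-ins: an "image" is a feature vector and a caption is a label.
CAT, DOG = [1.0, 0.0, 0.1], [0.0, 1.0, 0.1]
t2i_zoo = [lambda caption: {"cat": CAT, "dog": DOG}[caption]]
encoder_zoo = [lambda image: list(image)]
clean_captioner = lambda image: "cat"    # caption agrees with the cat image
fooled_captioner = lambda image: "dog"   # an attack made the model say "dog"

flag_clean, _ = mirror_check(CAT, clean_captioner, t2i_zoo, encoder_zoo)
flag_attacked, _ = mirror_check(CAT, fooled_captioner, t2i_zoo, encoder_zoo)
print(flag_clean, flag_attacked)
```

With the toy inputs, the clean image regenerates to itself and passes, while the attacked caption regenerates a different image and is flagged. In a real setting the threshold and the perturbation scale would have to be calibrated on held-out data.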

2604.00887 2026-06-15 cs.CV cs.CR 版本更新

Towards Physically Realizable Adversarial Attenuation Patch against SAR Object Detection

面向SAR目标检测的物理可实现对抗衰减补丁

Yiming Zhang, Weibo Qin, Feng Wang

发表机构 * Key Laboratory for Information Science of Electromagnetic Waves (MoE) School of Information Science and Technology, Fudan University（电磁波信息科学重点实验室（MoE）复旦大学信息科学与技术学院）

AI总结提出对抗衰减补丁（AAP）方法，通过能量约束优化和衰减部署框架平衡攻击有效性与隐蔽性，并基于信号级电子干扰机制实现物理可行性。

Comments 5 pages, 4 figures. Source code is available at https://github.com/boremycin/SAAP. Accepted and published in IEEE CAIT 2026. DOI: 10.1109/CAIT70489.2026.11553874

DOI: 10.1109/CAIT70489.2026.11553874
Journal ref: Proc. 2026 China Aerospace Information Technology Conference (CAIT), Tongxiang, China, May 2026

AI中文摘要

深度神经网络在SAR目标检测任务中表现出色，但仍易受对抗攻击影响。现有的SAR特定攻击方法能有效欺骗检测器，但往往引入明显扰动，且主要局限于数字域，忽略了攻击SAR系统的物理实现约束。本文提出一种新颖的对抗衰减补丁（AAP）方法，采用能量约束优化策略结合基于衰减的部署框架，在攻击有效性和隐蔽性之间实现无缝平衡。更重要的是，AAP通过对齐信号级电子干扰机制，展现出强大的物理实现潜力。实验结果表明，AAP在保持高隐蔽性的同时有效降低检测性能，并在不同模型间表现出良好的可迁移性。本研究为SAR目标检测系统的对抗攻击提供了物理基础视角，并促进了更隐蔽且实际可部署的攻击策略设计。源代码已在 https://github.com/boremycin/SAAP 公开。

英文摘要

Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting the physical implementation constraints of attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at https://github.com/boremycin/SAAP.

2605.26702 2026-06-15 cs.CV cs.AI cs.CR cs.LG 版本更新

Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling

通过三阶SO(3)表示耦合的旋转不变球面水印

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Antonios Argyriou, Wu Liu, Weiping Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对全景图像在任意3D旋转下水印鲁棒性不足的问题，提出利用三阶SO(3)表示耦合构造旋转不变的球面双谱，将水印嵌入高阶球谐系数并从不变标量中提取，实现理论保证的旋转不变性和高视觉保真度。

Comments ICML 2026

AI中文摘要

全景图像的可靠水印面临任意3D旋转的根本挑战。由于全景图定义在球面上，它们在$SO(3)$作用下自然变换，使得传统的平面表示和基于增强的鲁棒策略变得不充分且缺乏理论保证。为了解决这个问题，我们将全景图表示为球面信号，并利用$SO(3)$表示理论推导出可证明的旋转不变描述符。虽然球谐系数在旋转下等变变换，但自然的旋转不变构造通常限于零阶统计量，这消除了方向信息并严重限制了嵌入容量。在这项工作中，我们通过张量积耦合高阶$SO(3)$不可约表示并投影到平凡表示，引入了一种有原则的三阶不变构造。这产生了球面不变双谱，它在保持严格旋转不变性的同时保留了相位信息。利用这一特性，我们将水印嵌入到高阶球谐系数中，并从不变双谱标量中恢复它们，从而在任意3D旋转下实现可靠的提取。我们提供了其$SO(3)$不变性的理论证明，并通过实验证明其对连续旋转具有近乎完美的鲁棒性，同时保持高视觉保真度。

英文摘要

Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, they naturally transform under the action of $SO(3)$, rendering conventional planar representations and augmentation-based robustness strategies inadequate and devoid of theoretical guarantees. To address this, we formulate panoramas as spherical signals and leverage $SO(3)$ representation theory to derive provably rotation-invariant descriptors. While spherical harmonic coefficients transform equivariantly under rotations, the natural invariant constructions are typically limited to zeroth-order statistics, which eliminate directional information and severely constrain embedding capacity. In this work, we introduce a principled third-order invariant construction by coupling higher-order $SO(3)$ irreducible representations via tensor products and projecting onto the trivial representation. This yields a spherical invariant bispectrum that preserves phase information while remaining strictly rotation-invariant. Leveraging this property, we embed watermarks into higher-order spherical harmonic coefficients and recover them from invariant bispectral scalars, enabling reliable extraction under arbitrary 3D rotations. We provide a theoretical proof of the construction's $SO(3)$ invariance and demonstrate experimentally its near-perfect robustness to continuous rotations while maintaining high visual fidelity.
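
The claim that a third-order invariant keeps phase information that lower-order invariants discard can be illustrated on a much simpler group. The sketch below is an analogy only, not the paper's spherical construction: it uses the classical bispectrum of a 1D periodic signal, where circular shifts (rotations of the circle) play the role of $SO(3)$ rotations. The function names and the test signal are chosen here for illustration.

```python
import cmath


def dft(signal):
    """Plain discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(signal):
    """Second-order invariant: squared magnitudes, phase is discarded."""
    return [abs(c) ** 2 for c in dft(signal)]


def bispectrum(signal):
    """Third-order invariant B[k1][k2] = F(k1) * F(k2) * conj(F(k1 + k2))."""
    f = dft(signal)
    n = len(f)
    return [
        [f[k1] * f[k2] * f[(k1 + k2) % n].conjugate() for k2 in range(n)]
        for k1 in range(n)
    ]


def rotate(signal, shift):
    """Circular shift, the circle analogue of rotating a spherical signal."""
    return signal[shift:] + signal[:shift]


def max_abs_diff(a, b):
    """Largest entrywise difference between two nested lists of numbers."""
    if isinstance(a, list):
        return max(max_abs_diff(x, y) for x, y in zip(a, b))
    return abs(a - b)


signal = [3.0, 1.0, 0.0, 2.0, 0.0, 0.0]
mirrored = signal[::-1]  # a reflection, which is not a rotation of `signal`

# Both descriptors are unchanged by a rotation of the signal.
rotation_invariant = max_abs_diff(bispectrum(signal), bispectrum(rotate(signal, 2))) < 1e-9
# The power spectrum cannot tell the signal from its mirror image...
power_confused = max_abs_diff(power_spectrum(signal), power_spectrum(mirrored)) < 1e-9
# ...but the bispectrum can, because it retains relative phase.
bispectrum_separates = max_abs_diff(bispectrum(signal), bispectrum(mirrored)) > 1.0
print(rotation_invariant, power_confused, bispectrum_separates)
```

The phase factors picked up by the three Fourier coefficients under a shift cancel in the triple product, which is the same cancellation that projecting a tensor product onto the trivial representation achieves in the spherical setting.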

2606.13896 2026-06-15 cs.CV cs.AI 新提交

什么驱动了CLIP的测试时适应？从更新视角进行的受控实证研究

Jiazhen Huang, Xiao Chen, Zhiming Liu, Yaru Sun, Jingyan Jiang, Zhi Wang

发表机构 * Tsinghua University（清华大学）； Shenzhen Technology University（深圳技术大学）

AI总结本文通过受控实证研究，从更新视角分析了CLIP测试时适应方法的驱动因素，揭示了适应增益主要来自测试时证据和可靠代理，而非繁重优化，并指出无单一范式普遍最优。

AI中文摘要

视觉语言模型（如CLIP）已成为开放词汇识别的标准骨干，但其零样本预测在部署时仍易受分布偏移影响。测试时适应（TTA）最近被扩展到CLIP作为轻量级解决方案，导致TTA4CLIP方法迅速增长。然而，该领域的实证进展在很大程度上超过了我们对真正驱动适应因素、其增益来源以及哪些偏移下保持可靠的理解。本文从追求最先进准确率中退一步，对TTA4CLIP进行了系统性的受控研究。我们首先根据测试时更新的内容，将现有方法组织为三个统一范式。然后，我们引入TTABC，一个开源的CLIP TTA基准，它标准化了评估协议并集成了20多种代表性方法。我们的受控实证分析集中在三个关键领域。首先，我们确定了基于参数方法的驱动因素，揭示适应增益主要由测试时证据和可靠代理驱动，而非繁重优化。其次，我们探索了超越繁重参数调整的证据利用，表明通过跨样本或当前样本证据以及轻量级原型更新可以实现竞争性和高效的性能。最后，我们证明TTA没有银弹：没有单一的适应范式普遍最优，首选范式取决于偏移的性质。我们希望我们的基准和研究能提供对当前TTA4CLIP格局的更清晰理解，并为进一步研究奠定基础。

英文摘要

Vision-Language Models (VLMs) such as CLIP have become a standard backbone for open-vocabulary recognition, yet their zero-shot predictions remain vulnerable to distribution shifts encountered at deployment. Test-Time Adaptation (TTA) has recently been extended to CLIP as a lightweight solution, leading to a rapidly growing body of TTA4CLIP methods. However, empirical progress in this area has largely outpaced our understanding of what truly drives adaptation, where the gains originate, and under which shifts these methods remain reliable. In this paper, we take a step back from the pursuit of state-of-the-art accuracy and conduct a systematic controlled study of TTA4CLIP. We first organize existing methods into three unified paradigms according to what is updated at test time. We then introduce TTABC, an open-source TTA Benchmark for CLIP, which standardizes evaluation protocols and integrates more than 20 representative methods. Our controlled empirical analysis focuses on three key areas. First, we determine the driving factors in parameter-based methods, revealing that adaptation gains are primarily driven by test-time evidence and reliable proxies rather than heavy optimization. Second, we explore evidence utilization beyond heavy parameter tuning, showing that competitive and efficient performance can be achieved through cross- or current-sample evidence and lightweight prototype updates. Finally, we demonstrate that there is no silver bullet for TTA: no single adaptation paradigm is universally optimal, and the preferred paradigm depends on the nature of the shift. We hope our benchmark and study provide a clearer understanding of the current TTA4CLIP landscape and establish a foundation for further research.

2606.14562 2026-06-15 cs.CV cs.LG 新提交

NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests

NEST3D：织布鸟树巢的高分辨率多模态数据集

Constanza A. Molina Catricheo, Simon Boeder, Ting-Jia Guo, Giacomo May, Clément Berthelot, Devis Tuia, Friedrich Fedor Reinhard, Fabio Remondino, Benjamin Risse

发表机构 * Institute for Geoinformatics (ifgi), University of Münster（明斯特大学地理信息学研究所）； École Polytechnique Fédérale de Lausanne (EPFL)（洛桑联邦理工学院）； Max Planck Institute of Animal Behavior（马克斯·普朗克动物行为研究所）； University of Konstanz（康斯坦茨大学）； Kuzikus Research Station（库兹库斯研究站）； Fondazione Bruno Kessler (FBK)（布鲁诺·凯斯勒基金会）

AI总结针对织布鸟巢缺乏精细3D结构数据的问题，提出包含104棵巢树、1.4TB多模态无人机数据集，并基准测试语义分割方法，PT-v3达86.35% mIoU。

Comments 14 pages, 4 figures. Dataset available at https://huggingface.co/NEST3D

AI中文摘要

织布鸟巢作为复杂的生态结构，提供体温调节微栖息地并维持多种物种；然而，先前研究使用的数据集缺乏精细的3D结构细节。由于巢穴的不规则几何形状以及与复杂宿主植被的整合，生成可用且准确的3D织布鸟巢数据具有挑战性。我们通过一个开放获取的1.4TB多模态无人机数据集（包含104棵巢树，共27,945张RGB图像、111,780张多光谱图像、约7.81亿个3D点以及专家标注的语义分割标签）弥合了这一差距。我们使用KPConv、RandLA-Net和Point Transformer V3对语义分割进行基准测试，其中PT-v3在测试集上达到了86.35%的mIoU。虽然结果展示了基于Transformer和逐点方法的强大性能，但也凸显了架构相关的挑战，特别是对于基于卷积的方法（如KPConv）。通过独特地结合光谱、空间和结构信息，所提出的数据集推动了3D重建、分割和分类算法的发展，实现了从巢穴体积估计到物种保护等生态应用，并作为一个要求严格的基准，揭示了在极端类别不平衡下与架构相关的性能差异。

英文摘要

Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine-grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open-access, 1.4 TB multimodal drone dataset of 104 nest-bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA-Net, and Point Transformer V3, with PT-v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer-based and point-wise methods, they also highlight architecture-dependent challenges, particularly for convolution-based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture-dependent performance under extreme class imbalance.

2606.14578 2026-06-15 cs.CV 新提交

A Qualitative Review of GenAI-Based Methods for Data Generation and Augmentation in Industrial Computer Vision Applications

基于GenAI的工业计算机视觉应用中数据生成与增强方法的定性综述

Paul Koch, Paul Hofmann, Ferdinand Waßelewsky, Adem Karakurt, Andre Sérs, Jörg Krüger

发表机构 * Fraunhofer IPK（弗劳恩霍夫研究所）； Hamburg University of Applied Sciences (HAW Hamburg)（汉堡应用技术大学）； Technical University Berlin (Tu-Berlin)（柏林技术大学）

AI总结本文综述了基于GenAI的数据生成与增强方法，旨在解决工业计算机视觉应用中数据获取的“先有鸡还是先有蛋”困境，并评估其在分类用例中的适应性。

Comments Accepted to Computing Conference 2026

AI中文摘要

AI驱动的计算机视觉应用需要强大的数据库来确保可预测的行为和性能。这种可预测的行为对于工业应用获得用户信任尤为重要。然而，在工业应用中，这样的数据库并不容易获得，其获取也并非易事。主动学习方法可以在项目部署中逐步增加数据，从而迭代地扩充数据库，进而提高应用的可预测性。不幸的是，我们观察到这往往会导致用户对应用失去信任，而一旦失去信任就很难恢复。这就导致了“先有鸡还是先有蛋”的困境，即数据库和应用都无法得到发展。在这项工作中，我们回顾了最先进的方法和途径，以进一步推动初始主动数据扩充阶段后的数据库建设。这里，我们重点关注基于GenAI的数据生成和增强方法的最新进展，并评估它们在工业计算机视觉分类用例中的适应性。尽管我们观察到自动数据扩充的潜力，但我们也看到在源（训练环境）和目标（工业用例）之间存在领域不匹配——涉及自然语言和对象特征中定义的上下文。

英文摘要

AI-driven computer vision applications require a robust database to ensure predictable behavior and performance. Such predictable behavior is especially important for industrial applications to gain the trust of users. However, such a database is not readily available in industrial applications, and its acquisition is not trivial either. Active learning methods can be applied to ramp up data within a project deployment to iteratively grow the database, and thus the application's predictability. Unfortunately, we observe that this often leads to a loss of user trust in the application, which is difficult to regain once lost. This leads to a "chicken-and-egg" dilemma in which neither the database nor the application is developed. In this work, we review state-of-the-art methods and approaches to further boost the database after the initial active data ramp-up phase. Here, we focus on recent advancements in GenAI-based data generation and augmentation methods and review their adaptability to an industrial computer vision classification use case. Although we observe a potential for automatic data ramp-up, we also see a domain mismatch between the source (training environment) and the target (industrial use case) regarding the context defined in natural language and the object characteristics.

2606.14586 2026-06-15 cs.CV 新提交

S$^2$COPE: Self-Supervised Concept Discovery via Preference Learning

S$^2$COPE: 通过偏好学习进行自监督概念发现

Shilong Xiang, Zirui Zhang, Chengzhi Mao

发表机构 * Rutgers University（罗格斯大学）

AI总结提出S$^2$COPE框架，利用视觉大语言模型在自监督偏好优化循环中自主发现结构化概念，无需任何标签，在多个领域提升下游分类准确率。

AI中文摘要

当前的表示学习范式存在一个根本性的折衷：自监督方法可扩展到大规模数据集但产生不透明的特征，而可解释模型则因需要密集的人工标注而受限。我们提出了通过偏好学习进行自监督概念发现（S$^2$COPE），这是一个无需标签的框架，解决了这一困境。S$^2$COPE不将视觉大语言模型（VLLMs）视为静态特征提取器，而是将其作为自监督偏好优化循环中的主动参与者。通过直接从原始图像中自主假设、验证和强化候选视觉属性，我们的框架无需任何标签即可发现新颖的结构化概念。在自然、医学和物理领域的大量实验表明，S$^2$COPE成功提取了标准VLLMs通常无法生成的领域特定概念。通过将概念发现直接摊销到VLLM骨干网络中（通过我们的自监督偏好目标，而非依赖静态生成和分离过滤），我们在未见数据上的下游top-1分类准确率实现了高达24个百分点的绝对提升。我们的工作表明，可解释性可以通过模型与偶然视觉结构的自主交互而出现，无需任何人类监督。

英文摘要

Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation. We introduce Self-Supervised Concept discOvery via Preference lEarning (S$^2$COPE), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, S$^2$COPE leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label. Extensive experiments across natural, medical, and physics domains demonstrate that S$^2$COPE successfully extracts domain-specific concepts that standard VLLMs often fail to generate. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective, rather than relying on static generation and disjoint filtering, we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data. Our work suggests that interpretability can emerge through a model's autonomous interaction with incidental visual structures, without any human supervision.

2606.14657 2026-06-15 cs.CV 新提交

HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities

HPSv3++：跨扩散模型能力全谱系扩展奖励模型

Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He, Haoran Li, Shijia Ge, Siming Fu

发表机构 * Tsinghua University（清华大学）； JD Explore Academy（京东探索研究院）； Peking University（北京大学）； Zhejiang University（浙江大学）

AI总结提出HPSv3++奖励模型框架，通过双维度偏好数据集HPDv3++和两阶段训练（正交梯度投影+无监督引导），提升对各类T2I模型及RL迭代的偏好预测能力，在多个基准上达到最优。

AI中文摘要

奖励模型引导文本到图像（T2I）系统输出符合人类偏好的结果。然而，典型的奖励模型（如HPSv3）是在早期T2I模型的预标注数据上训练的，没有考虑因模型能力演进和强化学习（RL）迭代而产生的质量判别偏移，限制了其更广泛的适用性。在这项工作中，我们提出了HPSv3++，一个奖励模型框架，将HPSv3模型提升到适应不同T2I模型能力及其RL迭代变化的全能力-迭代谱系。具体来说，我们首先引入了HPDv3++，一个212K双维度偏好数据集，使用近期高能力（Qwen-Image）模型并辅以人工监督，对文本保真度和美学质量进行标注。然后我们提出了一个两阶段训练框架。第一阶段采用数据感知的正交梯度投影，从HPDv3++中融入多样化的美学感知，同时保留HPSv3中原始有效的人类偏好知识。第二阶段进一步利用来自不同能力水平和RL迭代的T2I模型的无标注数据，并引入一个联合能力-迭代条件的信号给奖励模型，以及一个标准差驱动的无监督引导机制，从而在能力-迭代谱系上强化奖励模型。HPSv3++实现了最先进的偏好预测，在HPDv3上比HPSv3高出9.8%，在GenAI-Bench上高出5.5%，同时在我们提出的HPDv3++上达到79.1%/88.1%。当用于T2I RL训练时，它持续提升了多种T2I模型的GenEval分数，展示了其广泛的能力。代码可在 https://github.com/PlantPotatoOnMoon/HPSv3-PlusPlus 获取。

英文摘要

Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for quality discriminative shifts arising from evolving model capabilities and reinforcement learning (RL) iterations, limiting their broader applicability. In this work, we propose HPSv3++, a reward model framework that elevates the HPSv3 model for varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and by 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-ranging capabilities. The code is available at https://github.com/PlantPotatoOnMoon/HPSv3-PlusPlus.

2606.14684 2026-06-15 cs.CV cs.LG 新提交

HumP-KD: A Hybrid Uncertainty-Aware Multi-Stage Progressive Knowledge Distillation Framework for Efficient Fire Classification

HumP-KD: 一种混合不确定性感知的多阶段渐进式知识蒸馏框架用于高效火灾分类

Mohammed Arif Mainuddin, Najifa Tabassum, Omar Ibne Shahid, Riasat Khan

AI总结提出HumP-KD框架，通过层次化渐进式知识蒸馏和多阶段蒸馏，将两个冻结的异构Transformer教师（Swin-Tiny和ViT-Base）及其集成知识蒸馏到轻量级MobileViT-S学生模型中，在火灾分类任务上显著提升性能，同时保持低参数量和实时推理速度。

AI中文摘要

实时火灾分类系统需要模型同时具备准确性、计算效率以及可在资源受限硬件上部署的能力。本文提出HumP-KD，一种混合不确定性感知的多阶段渐进式知识蒸馏框架，用于高效火灾分类。使用了两个数据集：FlameVision（8600张图像）和Dataset-II（31309张图像）。在标准预处理、在线增强、高斯噪声和运动模糊鲁棒性条件下，应用了多种CNN和Transformer基线模型。所提出的HumP-KD模型通过三个紧密集成的组件，将两个冻结的异构Transformer教师（Swin-Tiny和ViT-Base）及其Meta-MLP集成的知识蒸馏到轻量级MobileViT-S学生中。层次化渐进式知识蒸馏采用层次化特征构建器，生成融合的空间注意力掩码，以选择性地引导蒸馏到判别性区域。多阶段知识蒸馏在训练过程中逐步激活三个蒸馏阶段。在Dataset-II上，HumP-KD在10次独立试验中平均F1分数达到$0.9876 \pm 0.0063$，显著优于未使用蒸馏训练的MobileViT-S基线（$0.9537 \pm 0.0351$），独立t检验（$p = 0.0195$）和Wilcoxon符号秩检验（$W = 1$，$p = 0.0039$）均证实了统计显著性。所提出的方法还展示了跨数据集的强泛化能力和在退化视觉条件下的鲁棒性。学生模型仅保留4.94M参数和19.01Mb模型大小，相比Swin-Tiny参数减少$5.7\times$，相比ViT-Base减少$17.5\times$，同时达到37.72 CPU FPS，适合实时部署。

英文摘要

Real-time fire classification systems require models that are simultaneously accurate, computationally efficient, and deployable on resource-constrained hardware. This work proposes HumP-KD, a Hybrid Uncertainty-aware Multi-stage Progressive Knowledge Distillation framework for efficient fire classification. Two datasets, FlameVision and Dataset-II, containing 8,600 and 31,309 images, are used. Various CNN and transformer baselines are applied under standard preprocessing, online augmentation, Gaussian noise, and motion blur robustness conditions. The proposed HumP-KD model distills knowledge from two frozen heterogeneous transformer teachers, Swin-Tiny and ViT-Base, along with their Meta-MLP ensemble, into a lightweight MobileViT-S student via three tightly integrated components. Hierarchical Progressive Knowledge Distillation employs a Hierarchical Feature Builder that generates a fused spatial attention mask to selectively guide distillation toward discriminative regions. Multi-Stage Knowledge Distillation progressively activates three distillation stages across training. On Dataset-II, HumP-KD achieves a mean F1 score of $0.9876 \pm 0.0063$ across 10 independent trials, significantly outperforming the MobileViT-S baseline trained without distillation ($0.9537 \pm 0.0351$), with statistical significance confirmed by both an independent t-test ($p = 0.0195$) and a Wilcoxon signed-rank test ($W = 1$, $p = 0.0039$). The proposed method also demonstrates strong generalization across datasets and robustness under degraded visual conditions. The student model retains only 4.94M parameters and 19.01Mb model size, representing a $5.7\times$ parameter reduction over Swin-Tiny and a $17.5\times$ reduction over ViT-Base, while achieving 37.72 CPU FPS, making it suitable for real-time deployment.
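
The reduction factors quoted in this abstract can be sanity-checked with simple arithmetic. The sketch below only multiplies the numbers given in the abstract; the helper name is hypothetical, and the teacher sizes it prints are implied values, not figures reported by the authors.

```python
STUDENT_PARAMS_M = 4.94  # MobileViT-S student size in millions, from the abstract


def implied_teacher_params_m(student_params_m, reduction_factor):
    """Teacher size (in millions) implied by a stated parameter reduction factor."""
    return student_params_m * reduction_factor


swin_tiny_m = implied_teacher_params_m(STUDENT_PARAMS_M, 5.7)
vit_base_m = implied_teacher_params_m(STUDENT_PARAMS_M, 17.5)

print(f"Implied Swin-Tiny size: about {round(swin_tiny_m)}M parameters")
print(f"Implied ViT-Base size:  about {round(vit_base_m)}M parameters")
```

The implied sizes, about 28M and 86M parameters, agree with the sizes commonly cited for Swin-Tiny and ViT-Base, so the two reduction factors are consistent with the stated student size.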

2606.14697 2026-06-15 cs.CV cs.AI cs.CL 新提交

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

ClinHallu: 用于诊断医学多模态大语言模型推理中阶段式幻觉的基准

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； DAMO Academy, Alibaba Group（阿里巴巴达摩院）； Hupan Lab（湖畔实验室）； Zhejiang University（浙江大学）

AI总结提出ClinHallu基准，包含7031个实例，每个实例带有结构化推理轨迹（视觉识别、知识回忆、推理整合），通过阶段替换干预和轨迹监督微调，实现细粒度幻觉诊断与缓解。

Comments Code and datasets: https://github.com/alibaba-damo-academy/ClinHallu

AI中文摘要

构建可信的医学多模态大语言模型（MLLM）对于可靠的临床决策支持至关重要。现有的医学幻觉基准主要关注数据收集，但往往忽略了推理过程中幻觉的起源。我们发现幻觉来源因样本而异：错误可能源于视觉误识别、不正确的医学知识回忆或有缺陷的推理整合。为了实现源级别的幻觉诊断，我们引入了ClinHallu，一个用于医学MLLM推理中阶段式幻觉诊断的基准。ClinHallu包含7031个经过验证的实例，每个实例都附有分解为视觉识别、知识回忆和推理整合的结构化推理轨迹。我们还使用阶段替换干预来测量纠正特定阶段如何影响最终答案。除了评估，我们表明轨迹监督微调减少了阶段式幻觉。ClinHallu为诊断和缓解医学MLLM中的推理失败提供了一个细粒度的幻觉测试平台。该基准可从 https://github.com/alibaba-damo-academy/ClinHallu 公开获取。

英文摘要

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.

2606.13894 2026-06-15 cs.LG cs.AI cs.CL cs.CV 交叉投稿

Gefen: Optimized Stochastic Optimizer

Gefen: 优化随机优化器

Nadav Benedek, Tomer Koren, Ohad Fried

发表机构 * Reichman University（赖希曼大学）； Tel Aviv University（特拉维夫大学）； Google Research（谷歌研究院）

AI总结提出Gefen优化器，通过共享二阶矩估计和量化一阶矩，将AdamW内存占用减少约8倍，同时保持相同性能，支持更大批量和吞吐量。

AI中文摘要

AdamW是现代深度学习的默认优化器，但其一阶和二阶矩状态会额外占用约两倍参数大小的训练内存。我们提出Gefen，一种内存高效的优化器，它自动在参数块之间共享二阶矩估计，并使用学习到的码本量化一阶矩，从而将AdamW的内存占用减少约8倍，同时保持相同性能，相当于每十亿参数减少6.5 GiB。该方法受理论结果启发，该结果表明大的混合Hessian项将平方梯度的比率约束为接近1，表明Hessian对齐的参数是共享二阶矩统计量的自然候选。由于大规模计算Hessian不切实际，Gefen从初始平方梯度推断块结构，除了AdamW默认超参数外，不需要任何架构特定的元数据或超参数。Gefen学习基于精确直方图的动态规划量化码本，并重用相同的块进行一阶矩缩放。在多种实验中，Gefen在比较的类似AdamW的方法中实现了最低的峰值优化器内存，同时保持AdamW级别的性能。在FSDP和DDP训练中，减少的内存占用支持更大的微批次，并显著提高相对于AdamW的吞吐量，提供了一种实用的即插即用替代方案，具有更低的内存使用，可以增加吞吐量并支持训练更大的模型或使用更大的批量大小。我们提供了完整的Python实现，包括融合CUDA内核，网址为 https://github.com/ndvbd/Gefen。

英文摘要

AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW's memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels at https://github.com/ndvbd/Gefen
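
The "6.5 GiB per billion parameters" figure can be reproduced with a short calculation. This is a back-of-the-envelope sketch, assuming both AdamW moment buffers are stored in fp32 (4 bytes per value), which the abstract does not state explicitly; the function names are illustrative.

```python
BYTES_PER_FP32 = 4   # assumption: optimizer states are kept in fp32
GIB = 2 ** 30


def adamw_state_gib(num_params, bytes_per_value=BYTES_PER_FP32):
    """Memory of AdamW's two moment buffers (first and second moment), in GiB."""
    return 2 * num_params * bytes_per_value / GIB


def saved_gib(num_params, reduction_factor=8.0):
    """Memory saved if the optimizer state shrinks by `reduction_factor`."""
    full = adamw_state_gib(num_params)
    return full - full / reduction_factor


params = 1_000_000_000
print(f"AdamW state:  {adamw_state_gib(params):.2f} GiB")  # 7.45 GiB
print(f"Saved at ~8x: {saved_gib(params):.2f} GiB")        # 6.52 GiB
```

Under that assumption, two fp32 buffers for one billion parameters take about 7.45 GiB, and shrinking them by a factor of 8 frees about 6.52 GiB, which matches the reduction quoted in the abstract.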

2508.18693 2026-06-15 cs.CV Updated version

Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency

Zhitong Cheng, Yiran Jiang, Yulong Ge, Yufeng Li, Zhongheng Qin, Rongzhi Lin, Jianwei Ma

Affiliations: School of Mathematics and Institute for Artificial Intelligence, Harbin Institute of Technology, China; School of Earth and Space Sciences, Institute for Artificial Intelligence, Peking University, China

AI summary: Proposes the Feature-space Planes Searcher (FPS), which keeps the feature encoder frozen and optimizes decision boundaries using geometric patterns in the feature space, giving efficient and interpretable domain adaptation with competitive performance on multiple benchmarks.

Abstract

Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors, an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pretrained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation. Rather than fine-tuning entire pretrained models, which risks introducing unpredictable feature distortions, we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretive analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single computation cycle. Evaluations on public benchmarks demonstrate that FPS achieves competitive or superior performance to state-of-the-art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.


2512.00336 2026-06-15 cs.CV Updated version

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

Mengxue Hu, Yunfeng Diao, Changtao Miao, Tairui Ge, Taize Ge, Zhiqing Guo, Jianshu Li, Zhe Li, Zhongjie Ba, Joey Tianyi Zhou

Affiliations: University of Science and Technology of China

AI summary: To address the lack of genuinely multimodal data in existing datasets, proposes the MVAD dataset, which covers three forgery patterns, high-quality samples, and diverse visual styles and content categories, for detecting AI-generated video-audio content.

Comments 10 pages, 2 figures

Abstract

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes--a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.


2512.05025 2026-06-15 cs.CV Updated version

RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry

Affiliations: Institut National des Sciences de l'Univers (INSU), France; CNRS, France

AI summary: Proposes RAMEN, a sensor-agnostic, resolution-adjustable multimodal encoder that treats resolution as a controllable parameter, enabling coherent analysis of multimodal Earth observation data in a unified latent space and outperforming existing models on the PANGAEA benchmark.

Journal ref: CVPR 2026

Abstract

Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and the spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder to reconstruct masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, which contains various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.


2602.00593 2026-06-15 cs.CV cs.LG Updated version

Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Yifan Jiang, Cong Zhang, Bofei Zhang, Qiaofeng Zheng, Yifan Yang, Bingzhang Wang, Yew-Soon Ong

Affiliations: GADE Union (Global AI Data Experts Union); Shanghai Jiao Tong University; Nanyang Technological University; New York University; Cambridge University; The University of Hong Kong

AI summary: Proposes the Pix2Fact benchmark, which uses web verification on high-resolution real-world scenes to evaluate expert-level visual perception and knowledge search in fine-grained visual question answering, and finds that existing models fall well short on these complex tasks.

Abstract

Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities in isolation. To fill this void, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are meticulously crafted by PhD-holding annotators from top global universities across diverse disciplines. Each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model (Gemini-3.1-Pro) achieves only 51.7% average accuracy, even with access to visual ground truth and search tools. Our analysis attributes this low accuracy to three factors: frequent visual grounding errors even with visual ground truth, shallow search utilization, and VLMs' inability to retrieve long-tail, unstructured local information. This striking gap exposes the limitations of current models in assisting humans with real-world scenarios that demand overwhelming visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the next generation of language-vision agents that seamlessly integrate fine-grained perception with robust knowledge search.


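2606.13736 2026-06-15 cs.CV New submission

Connections Between Pairs of Filters Improve the Accuracy of Convolutional Neural Networks

Kathleen Anderson, Philipp Grüning, Erhardt Barth

Affiliations: GitHub

AI summary: Proposes learnable connection functions between pairs of filters in convolutional neural networks, replacing conventional pointwise nonlinear activations, and improves network performance by adapting the form of the connection at different layers.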

Comments IJCNN 2023

2606.13839 2026-06-15 cs.CV cs.AI eess.IV New submission

Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

Louis Chen, Torbjörn E. M. Nordling

Affiliations: Department of Mechanical Engineering, National Cheng Kung University

AI summary: To address the lack of quantitative evaluation in rPPG Transformer interpretability, adapts four attribution methods and introduces a skin coverage metric and a faithfulness coefficient, quantifying the multi-hop leakage effect in sparse attention; Beyond Intuition performs best on UBFC-rPPG.

Comments 26 pages, 8 figures

Abstract

Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque--a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer's bi-level routing attention with top-$k$ selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-$k$ routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage ($0.83$ vs. $0.57$ for vanilla rollout) and faithfulness ($F=0.92$) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e., trustworthy rPPG XAI.
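The abstract describes the skin coverage metric only as the share of attribution mass that falls on skin regions. A minimal sketch of that idea follows, assuming a non-negative attribution map and a binary skin mask of the same shape. The function name and the handling of an all-zero map are illustrative choices, not the authors' implementation.

```python
def skin_coverage(attribution, skin_mask):
    """Fraction of total attribution mass located on skin pixels.

    attribution: 2D list of non-negative floats.
    skin_mask:   2D list of 0/1 flags with the same shape (1 = skin).
    """
    total = 0.0
    on_skin = 0.0
    for attr_row, mask_row in zip(attribution, skin_mask):
        for value, is_skin in zip(attr_row, mask_row):
            total += value
            if is_skin:
                on_skin += value
    # An all-zero map carries no attribution; report 0.0 instead of dividing by zero.
    return on_skin / total if total > 0 else 0.0


attribution = [[0.1, 0.4],
               [0.3, 0.2]]
skin_mask = [[0, 1],
             [1, 0]]
print(skin_coverage(attribution, skin_mask))  # about 0.7 of the mass is on skin
```

A value near 1 means the attribution is concentrated on skin, which is where the pulse signal is physically expected.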


2606.14006 2026-06-15 cs.CV cs.ET New submission

Optimizing Rank for High-Fidelity Implicit Neural Representations

Julian McGinnis, Florian A. Hölzl, Suprosanna Shit, Florentin Bieder, Paul Friedrich, Mark Mühlau, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler

Affiliations: University of Cambridge

AI summary: Shows that regulating the stable rank during training lets even simple MLPs achieve high-fidelity implicit neural representations; using the Muon optimizer substantially improves performance, with PSNR gains of up to 9 dB across a range of tasks.

Abstract

Implicit Neural Representations (INRs) based on vanilla Multi-Layer Perceptrons (MLPs) are widely believed to be incapable of representing high-frequency content. This has directed research efforts towards architectural interventions, such as coordinate embeddings or specialized activation functions, to represent high-frequency signals. In this paper, we challenge the notion that the low-frequency bias of vanilla MLPs is an intrinsic architectural limitation on learning high-frequency content, and argue instead that it is a symptom of stable rank degradation during training. We empirically demonstrate that regulating the network's rank during training substantially improves the fidelity of the learned signal, rendering even simple MLP architectures expressive. Extensive experiments show that using optimizers like Muon, with high-rank, near-orthogonal updates, consistently enhances INR architectures even beyond simple ReLU MLPs. These substantial improvements hold across a diverse range of domains, including natural and medical images and novel view synthesis, with up to +9 dB PSNR over the same architecture. Code is available at https://rank-inrs.github.io.
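For readers unfamiliar with the term, the stable rank of a weight matrix $W$ is commonly defined as $\|W\|_F^2 / \|W\|_2^2$. The sketch below computes it in plain Python, estimating the squared spectral norm by power iteration. It only illustrates the quantity itself; how the paper measures or regulates it is not specified in the abstract, and the function name and iteration count are assumptions.

```python
import math


def stable_rank(matrix, iters=200):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a dense matrix given as a list of rows.

    The squared spectral norm is estimated by power iteration on W^T W.
    """
    rows, cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0.0:
        return 0.0
    # Deterministic, non-uniform start vector, normalized to unit length.
    v = [1.0 + 0.1 * j for j in range(cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    sigma_sq = 0.0
    for _ in range(iters):
        wv = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]  # W v
        u = [sum(matrix[i][j] * wv[i] for i in range(rows)) for j in range(cols)]  # W^T W v
        sigma_sq = math.sqrt(sum(x * x for x in u))  # tends to the top eigenvalue of W^T W
        if sigma_sq == 0.0:
            raise ValueError("start vector lies in the null space of W")
        v = [x / sigma_sq for x in u]
    return frob_sq / sigma_sq


full_rank = [[1.0, 0.0], [0.0, 1.0]]   # two equal singular values
collapsed = [[1.0, 1.0], [1.0, 1.0]]   # a single nonzero singular value
print(stable_rank(full_rank), stable_rank(collapsed))
```

The identity matrix scores close to 2 and the rank-one matrix close to 1, which is the kind of collapse the abstract refers to as stable rank degradation.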


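2604.14193 2026-06-15 cs.CV eess.IV q-bio.NC Updated version

QualiaNet: An Experience-Before-Inference Network

Paul Linton

Affiliations: Columbia University

AI summary: Proposes QualiaNet, a two-stage architecture modeled on human stereo vision: a disparity map first simulates experience, then a CNN estimates distance from disparity gradients, verifying that distance can be recovered from disparity gradients.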

2601.18707 2026-06-15 cs.LG cs.AI cs.CV cs.NE Updated version

SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model

Jan Hagnberger, Mathias Niepert

Affiliations: Jan Hagnberger; Mathias Niepert

AI summary: Proposes SMART, a neural surrogate model that needs no simulation mesh and predicts physical quantities at arbitrary query locations from a point cloud of the geometry alone; cross-layer interaction jointly updates geometric features and the physical field, matching or exceeding mesh-dependent methods.

Comments Accepted for publication at the 43rd International Conference on Machine Learning (ICML) 2026, Seoul, South Korea

Abstract

Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.


2603.12400 2026-06-15 math.CO cs.CV Updated version

Generation of Maximal Snake Polyominoes Using a Deep Neural Network

Benjamin Gauthier, Alain Goupil, Fadel Toure

Affiliations: Université du Québec à Trois-Rivières

AI summary: Proposes a structured pixel-space diffusion model that learns to generate maximal snake polyominoes from data without explicitly encoded constraints, generalizing to larger rectangles and approaching the current computational limit.

Comments In Proceedings GASCom 2026, arXiv:2606.09910

DOI: 10.4204/EPTCS.445.13
Journal ref: EPTCS 445, 2026, pp. 104-113

Abstract

Maximal snake polyominoes are difficult to study numerically in large rectangles, as computing them requires the complete enumeration of all snakes for a specific rectangle size, which corresponds to a brute force algorithm. This hinders the study of maximal snakes in larger rectangles. Moreover, most enumerable snakes lie in small rectangles, obscuring large-scale patterns. In this paper, we investigate the contribution of a deep neural network to the generation of maximal snake polyominoes from data-driven training, where the maximality and adjacency constraints are not encoded explicitly, but learned. To this end, we experiment with a denoising diffusion model, which we refer to as Structured Pixel Space Diffusion (SPS Diffusion). We find that SPS Diffusion generalizes from small rectangles to larger ones, generating valid snakes up to 28x28 squares and producing maximal snake candidates on squares close to the current computational limit. The model is, however, prone to errors such as branching, cycles, or multiple snake components. Overall, the diffusion model is promising and suggests that complex combinatorial objects can be understood by deep neural networks, which is useful in their investigation.
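The three error types named in the abstract (branching, cycles, multiple components) can be detected with a simple graph check on the generated cells. The sketch below assumes a snake is a set of grid cells whose 4-neighbour adjacency graph is a simple path: it is connected, no cell has more than two neighbours, and for more than one cell exactly two cells have a single neighbour. The function name and return labels are illustrative, and the check does not test maximality.

```python
def classify_snake(cells):
    """Classify a collection of (row, col) cells as a valid snake or name the defect."""
    cells = set(cells)
    if not cells:
        return "empty"

    def neighbors(cell):
        r, c = cell
        candidates = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
        return [n for n in candidates if n in cells]

    # Depth-first search from an arbitrary cell to test connectivity.
    start = next(iter(cells))
    seen = {start}
    stack = [start]
    while stack:
        for n in neighbors(stack.pop()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    if seen != cells:
        return "multiple components"

    degrees = [len(neighbors(cell)) for cell in cells]
    if max(degrees) > 2:
        return "branching"
    # A connected shape with all degrees <= 2 is a path (two endpoints) or a cycle (none).
    if len(cells) > 1 and degrees.count(1) != 2:
        return "cycle"
    return "valid snake"


print(classify_snake([(0, 0), (0, 1), (1, 1), (1, 2)]))  # staircase-shaped path
print(classify_snake([(0, 0), (0, 1), (1, 0), (1, 1)]))  # 2x2 block closes a loop
```

A check of this kind could be used to filter model samples before testing them for maximality, which remains the expensive step.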


2210.00379 2026-06-15 cs.CV Updated version

NeRF: Neural Radiance Field in 3D Vision: A Comprehensive Review (Updated Post-Gaussian Splatting)

Kyle Gao, Yina Gao, Hongjie He, Dening Lu, Linlin Xu, Jonathan Li

Affiliations: Faculty of Engineering, University of Toronto; Department of Geomatics Engineering, University of Calgary

AI summary: Surveys NeRF research from the past five years, covering developments before and after the emergence of Gaussian Splatting, and summarizes NeRF's applications and contributions in novel view synthesis and in learning 3D implicit and hybrid neural field representations.

Comments Updated Post-Gaussian Splatting

DOI: 10.26599/CVM.2026.9450535

Abstract

In March 2020, Neural Radiance Field (NeRF) revolutionized Computer Vision, allowing for implicit, neural network-based scene representation and novel view synthesis. NeRF models have found diverse applications in robotics, urban mapping, autonomous navigation, virtual reality/augmented reality, and more. In August 2023, Gaussian Splatting, a direct competitor to the NeRF-based framework, was proposed, gaining tremendous momentum and overtaking NeRF-based research in terms of interest as the dominant framework for novel view synthesis. We present a comprehensive survey of NeRF papers from the past five years (2020-2025). These include papers from the pre-Gaussian Splatting era, where NeRF dominated the field for novel view synthesis and 3D implicit and hybrid representation neural field learning. We also include works from the post-Gaussian Splatting era where NeRF and implicit/hybrid neural fields found more niche applications. Our survey is organized into architecture and application-based taxonomies in the pre-Gaussian Splatting era, as well as a categorization of active research areas for NeRF, neural field, and implicit/hybrid neural representation methods. We provide an introduction to the theory of NeRF and its training via differentiable volume rendering. We also present a benchmark comparison of the performance and speed of classical NeRF, implicit and hybrid neural representation, and neural field models, and an overview of key datasets.
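As a pointer to what "training via differentiable volume rendering" involves, the sketch below composites samples along a single ray with the quadrature rule commonly used for NeRF: each sample contributes its color weighted by $T_i (1 - e^{-\sigma_i \delta_i})$, where $T_i$ is the transmittance accumulated before it. This is plain Python for illustration only; a real implementation would run in an autodiff framework, and the function name and inputs are assumptions.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray.

    densities: volume density sigma_i at each sample.
    colors:    RGB triple at each sample.
    deltas:    distance delta_i between adjacent samples.
    Returns (pixel RGB, remaining transmittance).
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Empty space followed by an effectively opaque green sample.
pixel, remaining = render_ray(
    densities=[0.0, 1e9],
    colors=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    deltas=[0.5, 0.5],
)
print(pixel, remaining)
```

Because every step is a smooth function of the densities and colors, gradients of a pixel loss can flow back to the network that predicts them.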


2508.18967 2026-06-15 cs.RO cs.CV Updated version

Enhanced UAV Path Planning Using the Tangent Intersection Guidance (TIG) Algorithm

Hichem Cheriet, Khellat Kihel Badra, Chouraqui Samira

AI summary: Proposes the TIG algorithm, which generates feasible paths using an elliptic tangent intersection method and combines a heuristic rule with quadratic Bézier curve smoothing for efficient, safe UAV path planning in static and dynamic environments.

Comments Accepted for publication in JAMRIS Journal

DOI: 10.14313/jamris-2026-018
Journal ref: Journal of Automation, Mobile Robotics and Intelligent Systems, 20(2), 30-52 (2026)

Abstract

Efficient and safe navigation of Unmanned Aerial Vehicles (UAVs) is critical for various applications, including combat support, package delivery, and search and rescue operations. This paper introduces the Tangent Intersection Guidance (TIG) algorithm, an advanced approach for UAV path planning in both static and dynamic environments. The algorithm uses the elliptic tangent intersection method to generate feasible paths. It generates two sub-paths for each threat, selects the optimal route based on a heuristic rule, and iteratively refines the path until the target is reached. Considering the UAV kinematic and dynamic constraints, a modified smoothing technique based on quadratic Bézier curves is adopted to generate a smooth and efficient route. Experimental results show that the TIG algorithm can generate the shortest path in less time, starting from 0.01 seconds, with fewer turning angles compared to A*, PRM, RRT*, Tangent Graph, and Static APPATT algorithms in static environments. Furthermore, in completely unknown and partially known environments, TIG demonstrates efficient real-time path planning capabilities for collision avoidance, outperforming APF and Dynamic APPATT algorithms.
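The abstract does not describe how the smoothing is modified, so the sketch below shows only the plain quadratic Bézier curve that such a step builds on: $B(t) = (1-t)^2 P_0 + 2(1-t)t\,P_1 + t^2 P_2$, with the corner waypoint used as the control point $P_1$. The function name and the choice of waypoints are illustrative assumptions.

```python
def quadratic_bezier(p0, p1, p2, num_points=5):
    """Sample a 2D quadratic Bezier curve at evenly spaced t in [0, 1].

    p0 and p2 are the curve endpoints; p1 is the control point (for path
    smoothing, typically the sharp corner being rounded off).
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    points = []
    for i in range(num_points):
        t = i / (num_points - 1)
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2  # Bernstein weights, sum to 1
        x = a * p0[0] + b * p1[0] + c * p2[0]
        y = a * p0[1] + b * p1[1] + c * p2[1]
        points.append((x, y))
    return points


# Round off a 90-degree turn at (1, 0) between (0, 0) and (1, 1).
for point in quadratic_bezier((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), num_points=3):
    print(point)
```

The curve starts and ends on the original path and passes inside the corner, which replaces one sharp turn with a gradual change of heading.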


1. Multimodal and Vision-Language Models (10 papers)

Self-Evolving Visual Questioner

Gaze Heads: How VLMs Look at What They Describe

Orchestra-o1: Omnimodal Agent Orchestration

Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents

Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

Representation Forcing for Bottleneck-Free Unified Multimodal Models

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

2. Embodied AI, Robotics, and Autonomous Driving (11 papers)

RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

WAM4D: Fast 4D World Action Model via Spatial Register Tokens

C-MambaPose: A Physics-Informed Complex Mamba Framework for Cross-Environment WiFi Human Pose Estimation

Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models

PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation

Efficient Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation

LiAuto-GeoX: Efficient Grounded Driving Transformer

ADAPT: An Autonomous Forklift for Construction Site Operation

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

3. Image Recognition, Retrieval, and Classification (5 papers)

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

SAFformer: Improving Spiking Transformer via Active Predictive Filtering

Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery

4. Object Detection, Segmentation, and Localization (7 papers)

Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

Context-Guided Semantic Alignment for Feature Fusion Networks

BoRAD: Bootstrap your Own Representations for Multi-class Anomaly Detection

Pano3D: Unified 3D Reconstruction and Panoptic Segmentation

Value-order Decomposition for Generalist Anomaly Detection

RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

Hierarchical Consistency Learning for Test-time Adaptation in Camouflage Perception

5. Video Understanding and Temporal Vision (6 papers)

TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

Temporal Backtracking Search for Test-time Generative Video Reasoning

FEMOT: Multi-Object Tracking using Frame and Event Cameras

FLaRA: Predicting Future Latent Representations for Accident Anticipation

SED: Lightweight Saliency Prediction for Event-based Data via Distillation

Boundary-Centric Clip-Budgeted Active Learning for Temporal Action Segmentation

6. Generative Vision and World Models (18 papers)

Compressing Image Style Training into a Single Model Forward

Avatar V: Scaling Video-Reference Avatar Video Generation

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation

Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

Rethinking One-Step Image Editing through ChordEdit: Reproduction, Simplification, and New Insights

Conditioning Matters: Stabilizing Inversion and Attention in Diffusion Image Editing

VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling

Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation

Memento: Reconstruct to Remember for Consistent Long Video Generation

RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

Aligned but Stereotypical? How System Prompts Shape Demographic Bias in LLM-Based Text-to-Image Models

FBSDiff++: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

InterleaveThinker: Reinforcing Agentic Interleaved Generation

7. 3D Vision, Point Clouds, and Spatial Intelligence (9 papers)

MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

A Robust Point Cloud Analysis Framework Inspired By Primary Visual Cortex

Point Cloud Upsampling through Patch-based Frequency Superposition

MooMIns -- Monocular 3D Reconstruction and Object Pose Estimation from Multiple Instances

Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Vanishing Depth: Training Generalized Depth Adapters with Sinusoidal Depth Preprocessing for Pretrained RGB Encoders

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

ZipSplat: Fewer Gaussians, Better Splats

8. Medical Imaging and Biological Vision (11 papers)

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

Hybrid Classical-Quantum (HCQ) Alzheimer's Classification via Supervised $\beta$-VAE and Quantum Kernels

HiST: A Hierarchical Sparse Transformer for Cross-Modal Spatial Transcriptomics Modeling

A Lightweight Fiducial-Based Pipeline for 3D Hyperspectral Mapping of ex-vivo Lumpectomy Specimens

Trimodal Glioma Representation Alignment via Volumetric Contrastive Learning

MMRINet: Efficient Mamba-Based Segmentation with Dual-Path Refinement for Low-Resource MRI Analysis