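arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.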

2606.11385 2026-06-11 cs.CV New submission

DeceptionX: Explainable Deception Detection with Multimodal Large Language Models

Jiayu Zhang, Shuo Ye, Jiajian Huang, Yawen Cui, Taorui Wang, Wei Xia, Zeheng Wang, Haowen Tang, Hui Ma, Zitong Yu

Affiliations: Great Bay University; Hong Kong Polytechnic University

AI summary: Proposes DeceptionX, a framework that recasts deception detection from black-box classification as an interpretable Observe-Think-Summarize reasoning process. Built on the DeceptChain dataset and a three-stage training pipeline, it outperforms existing methods on standard benchmarks while providing expert-level, interpretable reasoning paths.

Abstract

Deception detection is a critical and highly challenging task within affective computing and behavioral analysis. Existing deep learning methods typically treat this task as a straightforward classification problem; however, this black-box approach lacks interpretability and fails to capture the complex logical deduction processes utilized by human experts when identifying lies. While Multimodal Large Language Models (MLLMs) have shown potential, applying them effectively requires a bridge between low-level audiovisual cues and high-level logical reasoning. In this paper, we propose DeceptionX, a novel MLLM framework that shifts the paradigm of deception detection from black-box classification to an interpretable Observe-Think-Summarize reasoning process. To address the scarcity of high-quality reasoning data, we first constructed DeceptChain, a high-quality dataset developed through a human-in-the-loop process. This dataset synthesizes fine-grained visual and auditory evidence (such as micro-expressions and vocal tremors) into structured chain-of-thought reasoning data. Furthermore, we propose a three-stage training pipeline and a Discrepancy-Aware Redundancy Elimination (DARE) strategy for DeceptionX to further enhance the model's generalization capabilities. Extensive experiments demonstrate that DeceptionX not only outperforms existing MLLM baselines and state-of-the-art methods on standard real-world benchmarks but also provides transparent, expert-level reasoning paths, bridging the critical gap between accuracy and interpretability in multimodal deception detection.

2606.11576 2026-06-11 cs.CV cs.AI New submission

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Ahmadreza Jeddi, Minh Ngoc Le, Amirhossein Kazerouni, Hakki Can Karaimer, Hue Nguyen, Iqbal Mohomed, Michael Brudno, Alex Levinshtein, Konstantinos G. Derpanis, Babak Taati, Radek Grzeszczuk

Affiliations: AI Center-Toronto, Samsung Electronics; University of Toronto; Vector Institute; York University

AI summary: Proposes AVIS, a lightweight policy that jointly adapts visual context scaling and reasoning scaling, using training-free Key Diversity Visual (KDV) pruning and adaptive self-consistency to improve the accuracy-compute trade-off across a range of benchmarks.

Comments: Project page: this https URL

Abstract

Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy-compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.
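The adaptive self-consistency idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only states that a learned difficulty predictor selects the number of reasoning rollouts, so the linear difficulty-to-budget rule, the budget bounds, and the names `choose_num_rollouts` and `self_consistency` are assumptions made for the example.

```python
from collections import Counter


def choose_num_rollouts(difficulty, min_rollouts=1, max_rollouts=8):
    """Map a predicted difficulty in [0, 1] to a rollout budget.

    Hypothetical linear rule: easy queries get few rollouts, hard ones get more.
    """
    difficulty = min(max(difficulty, 0.0), 1.0)  # clamp out-of-range predictions
    return min_rollouts + round(difficulty * (max_rollouts - min_rollouts))


def self_consistency(answers):
    """Return the most frequent answer among the sampled rollouts."""
    if not answers:
        raise ValueError("need at least one rollout answer")
    return Counter(answers).most_common(1)[0][0]


# Usage: a stand-in difficulty score and stand-in rollout answers.
budget = choose_num_rollouts(difficulty=1.0)
answers = ["B", "A", "B", "B", "C", "B", "A", "B"][:budget]
print(budget, self_consistency(answers))
```

Under shared-prefill inference as described in the abstract, every rollout in the budget would reuse one prefilling pass, so the budget mainly scales decoding cost.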

2606.11602 2026-06-11 cs.CV New submission

On Aligning Hierarchical Standardized Embedding for Audio-visual Generalized Zero-shot Learning

Zihan Zhang, Jie Hong, Siyuan Fan, Yanghao Zhou, Pengfei Fang

Affiliations: Southeast University; The University of Hong Kong; Beijing Institute of Technology; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education; School of Computer Science and Engineering, Southeast University

AI summary: Proposes AHSE, which uses Z-score standardization and a hierarchical alignment strategy (at the semantic, class, and batch levels) to address distributional and structural differences between the audio-visual and textual modalities, achieving competitive performance on three benchmark datasets.

Abstract

Audio-visual Generalized Zero-shot Learning (AV-GZSL) is a challenging task that aims to classify both seen and unseen objects or scenes by integrating data from audio and visual modalities. Recent studies primarily focus on fusing or aligning audio and visual features to generate more informative audio-visual embeddings. Also, aligning the audio-visual and textual features of most existing methods relies solely on the optimization objectives. However, those methods neglect the inherent distributional and structural differences between audio-visual and textual modalities. To address this limitation, we propose a method termed Aligning Hierarchical Standardized Embedding (AHSE), which enables hierarchical alignment of standardized audio-visual and textual embeddings within a shared embedding space. Specifically, we first apply Z-score standardization to the fused audio-visual and textual embeddings to reduce distributional mismatches. We then introduce a hierarchical alignment strategy that minimizes discrepancies at the semantic, class, and batch levels, thereby constructing a more robust and well-structured embedding space. This strategy not only preserves semantic and inter-class relationships but also maintains spatial consistency within each batch. Extensive experiments on three benchmark datasets: VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL, demonstrate that AHSE achieves competitive performance in zero-shot learning.
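The Z-score standardization step mentioned in this abstract can be illustrated with a small sketch. It assumes standardization is applied per embedding dimension across a batch, which the abstract does not specify; the function name `zscore_standardize` and the toy data are hypothetical.

```python
import math


def zscore_standardize(embeddings, eps=1e-8):
    """Standardize each dimension to zero mean and unit variance over the batch."""
    n = len(embeddings)
    dim = len(embeddings[0])
    means = [sum(row[d] for row in embeddings) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((row[d] - means[d]) ** 2 for row in embeddings) / n)
        for d in range(dim)
    ]
    return [
        [(row[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for row in embeddings
    ]


# Two toy "modalities" whose embeddings live on very different scales.
audio_visual = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
text = [[0.1, 1.0], [0.3, 3.0], [0.5, 5.0]]

av_std = zscore_standardize(audio_visual)
text_std = zscore_standardize(text)
print(av_std[0], text_std[0])
```

After standardization the two toy batches land on nearly identical values, which is the kind of distribution mismatch reduction the abstract describes before any alignment loss is applied.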

2606.11683 2026-06-11 cs.CV cs.AI New submission

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao

AI summary: Proposes ReRe, a training-free framework that synthesizes complementary novel-view videos so that an MLLM first reasons and then verifies its answer, substantially improving spatial reasoning.

Comments: ICML 2026

Abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: this https URL

2606.11719 2026-06-11 cs.CV cs.AI New submission

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He

Affiliations: Peking University; Ant International; The University of Hong Kong

AI summary: Proposes Ouroboros-Spatial, a self-evolving framework in which a proposer and a solver interact in a closed loop to generate training samples matched to the model's current ability. It substantially improves Qwen3-VL on six spatial reasoning benchmarks while using an order of magnitude fewer training examples.

Abstract

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.
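One way to picture the confidence-based difficulty signal in this abstract is the sketch below. It is an assumption-laden illustration rather than the paper's procedure: the thresholds `low` and `high`, the three bucket names, and the helpers `split_by_difficulty` and `proposer_feedback` are all hypothetical, and the abstract does not say how the signal is encoded for the proposer.

```python
def split_by_difficulty(samples, low=0.2, high=0.9):
    """Bucket (question_id, solver_confidence) pairs by a confidence-based signal.

    The thresholds are hypothetical: low confidence is read as "too hard",
    high confidence as "trivial", and the middle band as worth training on.
    """
    buckets = {"too_hard": [], "informative": [], "trivial": []}
    for question_id, confidence in samples:
        if confidence < low:
            buckets["too_hard"].append(question_id)
        elif confidence > high:
            buckets["trivial"].append(question_id)
        else:
            buckets["informative"].append(question_id)
    return buckets


def proposer_feedback(buckets):
    """Summarize the buckets as fractions that could be fed back to a proposer."""
    total = sum(len(ids) for ids in buckets.values())
    if total == 0:
        return {}
    return {name: len(ids) / total for name, ids in buckets.items()}


samples = [("q1", 0.05), ("q2", 0.55), ("q3", 0.97), ("q4", 0.70)]
buckets = split_by_difficulty(samples)
print(buckets)
print(proposer_feedback(buckets))
```

In a closed loop of the kind the abstract describes, a summary like this would steer the next round of question generation away from the trivial and too-hard bands.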

2606.11745 2026-06-11 cs.CV cs.AI New submission

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

Haoping Yu, Yuanxi Li, Jing Ma

AI summary: Proposes BridgeVLM, which induces a causal graph from multi-image inputs, converts it into Causal Tokens, and injects them into the LLM decoder for causal message passing, substantially improving multi-image causal reasoning.

Abstract

Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision-language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench ($F_1$: 33.4% → 75.1%).

2606.11792 2026-06-11 cs.CV cs.AI cs.CL New submission

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

Affiliations: Zhejiang University; Sun Yat-sen University; East China Normal University

AI summary: Proposes MultiToP, a framework whose lightweight Visual Token Patcher dynamically replaces unreliable visual tokens. Combined with information-guided rank calibration and sparsity regularization, it reduces hallucinations in video large multimodal models without modifying the original model, substantially improving F1 scores and question-answering accuracy.

Comments: Preprint

Abstract

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

2606.11853 2026-06-11 cs.CV cs.AI New submission

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

Zhirui Chen, Ziwei Chen, Ling Shao

AI summary: Proposes TASM, a framework that combines task-vector guided compression, semantics-aware token merging, and a hierarchical memory structure to address the semantic disruption and static memories caused by memory compression in in-context learning for multimodal large language models.

Comments: Accepted to ICML 2026

Abstract

Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key-value (KV) caches in long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, which introduces bias, disrupts semantic structure, particularly for visual representations, and yields static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs task-vector guided compression to replace sample-specific signals with a task-level direction that captures shared relevance across demonstrations. To preserve the underlying manifold, it applies semantics-aware token merging via bipartite graph matching, aggregating tokens without destructive pruning. Finally, TASM structures memory into a hierarchy comprising a compact Core Memory and a Latent Bank, facilitating query-adaptive dynamic retrieval. Evaluations confirm TASM maintains high performance under heavy compression, effectively balancing efficiency with adaptability.

2606.12069 2026-06-11 cs.CV New submission

Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

Hong Li, Yankang Dong, Yue Xu, Yihan Tang, Mingzhu Li, Jiamin Qiu, Qihang Yao, Xing Zhu, Yujun Shen, Nan Xue, Yong-Lu Li

Affiliations: Shanghai Jiao Tong University; Ant Group; Shanghai Innovation Institute

AI summary: Proposes Tac-DINO, which builds a large-scale tactile dataset and a vision-tactile holographic matching benchmark, and uses patch alignment to learn local-to-global vision-tactile representations, outperforming methods without alignment.

2606.12106 2026-06-11 cs.CV cs.AI New submission

MSUE: Multi-Modal Soccer Understanding Expert

Litao Li, Yibo Yu, Yufeng Hu, Zhuo Yang, Jiali Wen, Yixin Chen, Yixi Zhou

Affiliations: South China University of Technology; Johns Hopkins University; Peking University; University of Electronic Science and Technology of China

AI summary: Proposes MSUE, a multi-expert question-answering architecture that combines a VLM-based data synthesis pipeline with an LLM that dynamically dispatches text, image, and video experts. It reaches 0.95 accuracy in the SoccerNet VQA challenge, placing third.

2606.12195 2026-06-11 cs.CV New submission

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang

Affiliations: Shanghai Innovation Institute; Shanghai AI Laboratory; Nanjing University

AI summary: Proposes InternVideo3, a framework that strengthens long-video understanding and iterative interaction through Multimodal Contextual Reasoning (MCR) and M^2LA, an efficient KV-cache compression method, achieving strong performance on multiple benchmarks.

Abstract

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.

2606.12412 2026-06-11 cs.CV cs.AI New submission

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu

Affiliations: National Yang Ming Chiao Tung University; National Taiwan University

AI summary: Observing that visual-token importance in vision-language models changes with decoder depth, proposes Reroute, a training-free method that replaces irreversible removal with recoverable routing, improving grounding under aggressive token reduction while maintaining general VQA performance.

Comments: Code: this https URL

Abstract

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: this https URL
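The contrast between rank-and-remove and recoverable routing described in this abstract can be shown with a toy sketch. It is not the released code: token importance is given as made-up per-stage score lists, the `keep` budget is arbitrary, and the function names `rank_and_remove` and `recoverable_routing` are hypothetical.

```python
def rank_and_remove(stage_scores, keep):
    """Baseline: tokens dropped at one stage are never considered again."""
    pool = list(range(len(stage_scores[0])))
    selected_per_stage = []
    for scores in stage_scores:
        # The pool shrinks permanently to the top-`keep` tokens of this stage.
        pool = sorted(pool, key=lambda t: scores[t], reverse=True)[:keep]
        selected_per_stage.append(sorted(pool))
    return selected_per_stage


def recoverable_routing(stage_scores, keep):
    """Deferred tokens bypass a stage but stay in the candidate pool."""
    pool = list(range(len(stage_scores[0])))
    selected_per_stage = []
    for scores in stage_scores:
        selected = sorted(pool, key=lambda t: scores[t], reverse=True)[:keep]
        selected_per_stage.append(sorted(selected))
        # `pool` is left unchanged, so deferred tokens can be re-selected later.
    return selected_per_stage


# Token 2 looks unimportant at stage 0 but becomes the top token at stage 1.
stage_scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.7, 0.9, 0.2],
]
removed = rank_and_remove(stage_scores, keep=2)
routed = recoverable_routing(stage_scores, keep=2)
print(removed)
print(routed)
```

Both variants process the same number of tokens per stage, which mirrors the abstract's point that the compute and KV-cache budget class of the underlying pruning method is preserved.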

2606.11614 2026-06-11 cs.LG cs.AI cs.CV Cross-list

Information-Theoretic Decomposition for Multimodal Interaction Learning

Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu

AI summary: Proposes DMIL, an information-theoretic decomposition method for multimodal interactions that uses a variational decomposition architecture and a fine-tuning strategy to learn sample-specific redundant, unique, and synergistic interactions, improving multimodal learning performance.

Comments: Accepted to CVPR 2026

Abstract

Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at this https URL.

2509.11548 2026-06-11 cs.CV Version update

How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Min Yu, Tongxiao Ruan, Manni Duan

AI summary: Noting that VLMs show strong implicit spatial understanding in GUI grounding but are weak at outputting explicit coordinates, proposes three zero-shot auxiliary reasoning methods (such as a marked grid) that add spatial cues to the input image, substantially improving grounding and approaching the strongest fine-tuned methods on several benchmarks.

Abstract

Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance measured by Pointing Game, they underperform when tasked with outputting explicit coordinates. To address this discrepancy and bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids and labeled intersections as part of the input image, these methods enable VLMs to better articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. Experimental results show substantial gains from auxiliary reasoning. Mark-Grid Scaffold boosts Gemini-3.1-Pro from 11.72% under direct inference to 95.20% on ScreenSpot-v2, achieves state-of-the-art performance on ScreenSpot, and approaches the strongest fine-tuned methods on ScreenSpot-v2 and UI-I2E-Bench. Our code is available at this https URL.
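The labeled-intersection cue mentioned in this abstract can be pictured with the sketch below, which only covers the bookkeeping side: generating labels for grid intersections and mapping a label back to pixel coordinates. It is a hypothetical illustration, not the paper's scaffold: the `r{row}c{col}` label scheme, the grid spacing, the optional pixel offsets, and the helper names are assumptions, and drawing the grid onto the screenshot is omitted.

```python
def grid_intersections(width, height, step):
    """Return {label: (x, y)} for grid intersections spaced `step` pixels apart."""
    labels = {}
    for row, y in enumerate(range(0, height + 1, step)):
        for col, x in enumerate(range(0, width + 1, step)):
            labels[f"r{row}c{col}"] = (x, y)
    return labels


def resolve_point(labels, label, dx=0, dy=0):
    """Convert a model answer (nearest label plus pixel offsets) into coordinates."""
    x, y = labels[label]
    return (x + dx, y + dy)


# A 200x100 screenshot with intersections every 100 pixels.
labels = grid_intersections(width=200, height=100, step=100)
print(sorted(labels))
print(resolve_point(labels, "r1c2", dx=-10, dy=-5))
```

The design intent is that the model names a visible landmark instead of emitting raw coordinates, and the conversion to pixels happens outside the model.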

2511.08195 2026-06-11 cs.CV Version update

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, Jie Tang

AI summary: Reformulates UI screenshot-to-code generation as an interactive visual optimization problem and uses RVPO, a preference-based reinforcement learning method, to optimize relative visual rankings, reaching state-of-the-art results on UI drafting, polishing, and editing.

Comments: 27 pages

Abstract

UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI development that is inherently iterative and feedback-driven. We reformulate UI-to-code as an interactive visual optimization problem, where code generation is embedded in a closed-loop process of execution, visual inspection, and iterative refinement driven by rendered visual feedback. To address the non-differentiability of visual objectives and the noise of absolute visual evaluators, we propose Relative Visual Policy Optimization (RVPO), a preference-based reinforcement learning method that optimizes relative visual rankings among rendered candidates under execution feedback. We instantiate this paradigm in UI2Code^N, an open-source 9B model trained via continual pre-training, supervised fine-tuning, and reinforcement learning. Experiments demonstrate state-of-the-art performance on UI drafting, UI polishing, and UI editing benchmarks, even outperforming larger models, with performance consistently improving through iterative visual optimization. Our code and models are available at this https URL.

2511.16672 2026-06-11 cs.CV Version update

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

AI summary: Proposes EvoLMM, a framework that instantiates two cooperative agents, a Proposer and a Solver, from a single backbone model and uses a continuous self-rewarding process to improve LMM reasoning without supervision, with gains of about 3% on benchmarks such as ChartQA.

Comments: 9 pages, 6 figures

Abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at this https URL.

2602.08735 2026-06-11 cs.CV 版本更新

From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

从对应到动作：多模态大语言模型中类人多图像空间推理

Masanari Oi, Koki Maeda, Ryuto Koike, Daisuke Oba, Nakamasa Inoue, Naoaki Okazaki

AI总结提出HATCH框架，通过补丁级空间对齐和动作-答案推理两个目标，提升多模态大模型在多图像空间推理中的性能，在三个基准上超越同规模基线。

Comments: ICML 2026

AI中文摘要

尽管多模态大语言模型（MLLMs）在单图像空间推理方面取得了实质性进展，但多图像空间推理（需要整合来自多个视角的信息）仍然具有挑战性。认知研究表明，人类通过两种机制处理此类任务：跨视图对应（识别不同视图中对应于相同物理位置的区域）和逐步视角变换（顺序组合相对视角变化）。然而，现有研究仅部分且通常隐式地整合这些机制，没有对两者进行显式监督。我们提出了用于跨视图对应和视角变化的类人感知训练（HATCH），这是一个具有两个互补目标的训练框架：（1）补丁级空间对齐，鼓励补丁表示在空间对应区域跨视图对齐；（2）动作-答案推理，要求模型在预测最终答案之前生成显式的视角转换动作。在三个基准上的实验表明，HATCH以明显优势持续优于同规模基线，并与更大的模型相比取得了有竞争力的结果，同时保持了单图像推理能力。

英文摘要

While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.

2603.00461 2026-06-11 cs.CV 版本更新

ReMoT: Reinforcement Learning with Motion Contrast Triplets

ReMoT: 基于运动对比三元组的强化学习

Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong

AI总结提出ReMoT统一训练范式，通过规则生成大规模运动对比数据集和组相对策略优化，解决VLM时空一致性问题，在时空推理任务上提升25.1%。

Comments: CVPR 2026 Highlight

AI中文摘要

我们提出ReMoT，一种统一的训练范式，系统地解决VLM在时空一致性方面的基本缺陷——这是导航、机器人和自动驾驶中的关键失败点。ReMoT整合了两个核心组件：(1) 一个基于规则的自动框架，生成ReMoT-16K，这是一个大规模（16.5K三元组）的运动对比数据集，源自视频元注释，超越了昂贵的手动或基于模型的生成。(2) 组相对策略优化，我们经验验证了它在学习这种对比推理时产生最优性能和数据效率，远远超过标准的监督微调。我们还构建了第一个细粒度运动对比三元组基准，用于衡量VLM对细微运动属性（例如相反方向）的辨别能力。由此产生的模型在我们的新基准和多个标准VLM基准上实现了最先进的性能，最终在时空推理任务上实现了惊人的25.1%的性能飞跃。

英文摘要

We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
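Group Relative Policy Optimization scores each sampled response against the other responses drawn for the same prompt. The sketch below shows the standard group-normalized advantage, A_i = (r_i - mean(r)) / (std(r) + eps). The 0/1 rule-based reward for picking the correct clip in a motion-contrast triplet is a hypothetical stand-in, not the reward used in the paper.

```python
from statistics import mean, pstdev


def triplet_reward(predicted_index, correct_index):
    """Hypothetical rule-based reward: 1.0 if the model picks the correct clip."""
    return 1.0 if predicted_index == correct_index else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers for one triplet; the correct clip is index 1.
    rewards = [triplet_reward(p, 1) for p in (1, 0, 2, 1)]
    print(group_relative_advantages(rewards))
```

When every sample in a group receives the same reward, all advantages are zero and that group contributes no gradient.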

2605.20795 2026-06-11 cs.CV 版本更新

What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

什么语义能经受住连接器的考验？视频编辑中VLM到DiT对齐的诊断

Hangyu Lin, Chao Wen, Chengming Xu, Jianxiong Gao, Jiangning Zhang, Xiaobin Hu, Yanwei Fu

AI总结本研究探讨了视频生成模型中VLM与DiT对齐过程中的语义瓶颈问题，通过提出TRACE-Edit数据集和诊断协议，发现连接器模块会导致细粒度结构语义的严重退化，挑战了原有假设。

AI中文摘要

基于流匹配的视频生成模型日益依赖前置的视觉-语言模型（VLMs）来处理复杂的、基于指令的视频编辑任务。该范式下普遍的假设是连接模块能够无缝地将VLM的丰富多模态推理与DiT的原始文本嵌入空间对齐。然而，我们假设这种对齐实际上是一个严重的语义瓶颈，会退化细粒度的结构变量。验证这一假设具有挑战性，因为端到端的评估将对齐失败与生成错误混为一谈，而自然数据集缺乏解耦的标注。为了严格研究这一问题，我们提出了一种基于视频组成的受控数据处理流程，生成TRACE-Edit数据集，该数据集专注于基于关系的编辑。利用此数据集，我们提出了一种全面的诊断协议，分析现有视频编辑模型中元查询和连接器两个重要设计。对四个代表性模型案例的系统评估表明，在对齐过程中细粒度结构语义会受到严重退化。我们的发现推翻了无损语义传输的假设，将VLM到DiT的对齐识别为一个主要瓶颈，并为未来的多模态对齐架构提供了新的诊断基础。

英文摘要

Flow matching based video generative models have been increasingly relying on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption underlying this paradigm is that a connector module can seamlessly align the VLM's rich multi-modal reasoning with the original text embedding space of DiTs. However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables. Verifying this is challenging, as end-to-end evaluations conflate alignment failures with generation errors, and natural datasets lack disentangled annotations. To rigorously investigate this, we propose a controlled data processing pipeline based on video composition that results in TRACE-Edit, a diagnostic dataset focusing on relation-based editing. Leveraging this dataset, we propose a comprehensive diagnostic protocol to analyze two important designs of meta-query and connector in the existing video editing models. Systematic evaluation of four representative model cases reveals that fine-grained structural semantics can be severely degraded during alignment. Our findings overturn the assumption of lossless semantic transfer, identifying the VLM-to-DiT alignment as a major bottleneck and providing a new diagnostic foundation for future multi-modal alignment architectures.

2605.29588 2026-06-11 cs.CV cs.AI q-bio.NC 版本更新

Brain-IT-VQA: From Brain Signals to Answers

Brain-IT-VQA: 从脑信号到答案

Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman, Michal Irani

AI总结提出 Brain-IT-VQA 框架，基于 fMRI 脑信号解码语言令牌并结合语言模型进行视觉问答，在 NSD-VQA 新基准上显著优于先前方法，并用于分析脑区对视觉信息的贡献。

AI中文摘要

从观看图像时记录的 fMRI 信号解码视觉内容，特别是回答关于所看图像的问题，是一个长期挑战。尽管近年来在基于 fMRI 的视觉问答（VQA）方面取得了显著进展，但性能仍然有限。此外，尽管最近的模型能够做出越来越准确的预测，但它们很少被用作理解大脑中视觉表征结构的工具。我们提出了 Brain-IT-VQA，一个基于 fMRI 的视觉问答框架。基于脑交互变换器（Brain-IT），我们的方法从脑活动中解码语言令牌，并将其与语言模型集成以回答视觉问题。我们的模型显著优于先前的基于 fMRI 的标题生成和 VQA 方法。我们进一步引入了 NSD-VQA，一个新的基于 fMRI 的视觉问答数据集和基准。与现有的图像-fMRI VQA 数据集通常每张图像只提供少数宽泛且弱控制的问题不同，NSD-VQA 在 20 个受控问题类别中平均每张图像提供 20 个问答对，这些类别解耦了多个层次的视觉理解。这使得在有限的 fMRI 测试数据下能够进行更可靠和可解释的评估。Brain-IT-VQA 和 NSD-VQA 共同提供了一个强大的预测框架和研究脑表征的工具。利用这个基准，我们量化了哪些形式的视觉和语义信息可以从对自然图像的 fMRI 响应中可靠解码。我们进一步分析了不同脑区在不同问题类型上的贡献。

英文摘要

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.

2606.04351 2026-06-11 cs.CV cs.CL 版本更新

Frames2LoRA: Parametric Video Internalization for Vision-Language Models

Frames2LoRA: 视觉-语言模型的参数化视频内化

Manan Suri, Sarvesh Baskar, Dinesh Manocha

AI总结提出Frames2LoRA方法，通过感知器超网络从视频编码中直接生成LoRA适配器，实现零视觉令牌的视频查询，在保持性能的同时大幅降低计算成本。

Comments: this https URL

AI中文摘要

在视觉-语言模型中处理视频成本高昂：每帧占用数百个令牌，推理成本随每帧和每次重复查询而增加。我们引入Frames2LoRA，一种参数化视频内化方法。感知器超网络逐层读取冻结VLM编码视频时产生的中间表示，并在单次前向传播中生成低秩适配（LoRA）适配器。与需要迭代梯度更新的标准LoRA微调不同，Frames2LoRA直接从视频预测这些权重。在SmolVLM2 500M和2.2B上针对视频摘要和描述进行训练后，Frames2LoRA使得相同的冻结VLM能够仅通过适配器回答查询，在查询时上下文中零视觉令牌。Frames2LoRA在两种模型规模的所有五个描述基准测试中，以及在八个视频问答基准测试-规模配对中的七个上，统计上非劣效且等同于直接视频上下文推理。尽管仅在12帧384px上训练，它在高达1024帧和1024px时仍保持稳定，而直接视频上下文推理通常会退化。在这组扫描实验中，它将回答时的视觉令牌负载减少高达1500倍，查询TTFT减少6-80倍，同时保持视频忠实输出。我们还发现，为非重叠视频段独立生成的适配器可以在秩空间中组合，这为分块长视频内化提供了一条路径。

英文摘要

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Frames2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Frames2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Frames2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Frames2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
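The closing remark about adapters composing in rank space can be illustrated with a small identity. A LoRA update is a product B·A, so stacking the B factors side by side and the A factors on top of each other gives a higher-rank adapter whose update equals the sum of the individual updates. The abstract does not state that concatenation is the composition rule used, so treat it as an assumption. The helper names below are hypothetical.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def matadd(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def compose_in_rank_space(adapters):
    """Concatenate (B, A) pairs along the rank dimension.

    Each B is d_out x r_i and each A is r_i x d_in. The result has rank
    r_1 + r_2 + ... and its product equals the sum of the B_i A_i products.
    """
    d_out = len(adapters[0][0])
    b_cat = [sum((b[i] for b, _ in adapters), []) for i in range(d_out)]
    a_cat = [row for _, a in adapters for row in a]
    return b_cat, a_cat


if __name__ == "__main__":
    seg1 = ([[1.0], [2.0]], [[1.0, 0.0, 1.0]])  # rank-1 adapter for segment 1
    seg2 = ([[0.0], [1.0]], [[2.0, 2.0, 2.0]])  # rank-1 adapter for segment 2
    b_cat, a_cat = compose_in_rank_space([seg1, seg2])
    print(matmul(b_cat, a_cat))
    print(matadd(matmul(*seg1), matmul(*seg2)))
```

Both print statements show the same matrix, so the composed adapter applies both segment updates without retraining.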

2606.10401 2026-06-11 cs.CV 版本更新

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

CoCoSI: 面向空间智能的协作认知地图构建

Yiming Zhang, Ruoxuan Cao, Zhihang Zhong

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Cornell University（康奈尔大学）

AI总结提出一种即插即用的多智能体框架，通过协作构建结构化认知地图作为空间记忆，无需修改架构或额外训练即可增强预训练多模态大模型的空间理解能力。

AI中文摘要

空间智能是多模态大语言模型（MLLMs）的一个关键前沿，使其能够从视觉体验中推理物理世界。受人类空间认知启发，最近的方法从多帧视觉输入构建基于网格的认知地图，以随时间维持连贯的空间表示。然而，有限的上下文长度仍然挑战空间理解，而现有方法如长上下文建模和外部记忆通常需要架构更改、记忆模块或微调，限制了其对现成预训练MLLMs的适用性。这促使我们提出一种轻量级、模型无关的方法，以在原生上下文窗口之外保留空间信息。为此，我们提出一个即插即用的多智能体框架，协作构建认知地图作为结构化空间记忆，无需架构修改或额外训练即可增强任意预训练MLLMs的空间理解。我们的框架具有局部-全局智能体协调、原子提交的认知地图构建以及跨智能体验证的特点。大量实验表明，我们的方法在空间理解任务上取得了优越性能，同时完全无需训练。代码将发布。

英文摘要

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.

2606.11221 2026-06-11 cs.CV 新提交

LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment

LAST: 通过Gromov-Wasserstein对齐连接视觉-语言与动作流形

Huaihai Lyu, Chaofan Chen, Yuheng Ji, Xiansheng Chen, Pengwei Wang, Shanghang Zhang, Changsheng Xu

AI总结提出LAST方法，通过李代数线性化和局部度量离散化，对齐视觉-语言语义几何与动作流形，解决异构空间不兼容问题，提升VLA模型收敛性和泛化性。

AI中文摘要

我们从Gromov-Wasserstein视角研究视觉-语言-动作（VLA）学习，目标是使动作表征的关系几何与VL嵌入的语义几何兼容。然而，由于领域间的数学异质性，这种对齐并非易事：视觉-语言的语义空间在拓扑上是线性和各向同性的，而机器人动作的物理流形是非欧几里得和各向异性的。它们不兼容的度量结构使得直接回归不适定。为了解决这种不兼容性，我们引入了LAST（李代数动作空间分词器），它通过两阶段变换重建动作空间以建立与VL模态的局部度量兼容性：（1）全局拓扑线性化：通过李代数映射线性化动作流形，将轨迹转换为固定长度、物理可加的表示。（2）局部度量离散化：将表示分层离散化为模式和白化残差，生成近似各向同性的局部图表，这些图表在统计上与语义度量对齐。通过在全局和局部层面解决结构不匹配问题，LAST使VLA模型具有更优的收敛性和泛化性。

英文摘要

We take a Gromov-Wasserstein perspective on Vision-Language-Action (VLA) learning, where the goal is to make the relational geometry of action representations compatible with the semantic geometry of VL embeddings. However, this alignment is non-trivial due to the mathematical heterogeneity between the domains: the semantic space of vision-language is topologically linear and isotropic, whereas the physical manifold of robotic action is non-Euclidean and anisotropic. Their disjoint metric structures render direct regression ill-posed. To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer), which reconstructs the action space to establish local metric compatibility with the VL modality via a two-stage transformation: (1) Global Topological Linearization: linearizing the action manifold via Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation. (2) Local Metric Discretization: hierarchically discretizing the representation into schemas and whitened residuals, yielding approximately isotropic local charts that are statistically aligned with the semantic metric. By resolving the structural mismatch at both global and local levels, LAST enables VLA models with superior convergence and generalizability.

2606.11507 2026-06-11 cs.CV 新提交

SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

SceneMiner: 保持身份的多任务微调用于统一BEV场景挖掘

Abdalmalek Aburaddaha, Venkatraman Narayanan, Keval Thaker, Samir A. Rawashdeh

发表机构 * University of Michigan-Dearborn（密歇根大学迪尔伯恩分校）

AI总结提出SceneMiner，一种统一的仅相机鸟瞰图管道，通过冻结视觉语言骨干网络在单次前向传播中发出互补的挖掘信号，并发现跨任务干扰问题，通过零初始化新子模块和冻结共享流参数的身份保持多任务微调解决。

AI中文摘要

从驾驶日志中挖掘困难、安全关键的场景受到缺乏难度标签的瓶颈，且没有单一的代理（碰撞风险、轨迹歧义或语义稀有性）足以单独找到这些场景。我们提出SceneMiner，一种统一的、仅相机的鸟瞰图管道，从冻结的视觉语言骨干网络在单次前向传播中发出互补的挖掘信号，无需激光雷达或雷达：用于文本提示场景搜索的检索嵌入、多标签场景标签分布以及连续的基于物理的风险评分（运动预测是副产品，而非贡献）。构建这样的多头模型暴露了我们的核心发现，即我们称之为跨任务干扰的失败模式：添加或升级一个头会改变共享激活流并降低权重冻结的兄弟头，因此仅冻结参数是不够的。我们的贡献，即保持身份的多任务微调，通过零初始化每个新子模块并冻结每个馈入共享流的参数来消除这种干扰。挖掘头因此保持比特一致，同时仅训练约102k参数。标签头通过将每个场景池化为32个视觉令牌，在20个场景标签上达到mAP 0.4614（micro-F1 0.5557），嵌入头支持文本提示检索，经定性验证。代码可在以下网址获取：this https URL

英文摘要

Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: this https URL
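The zero-initialization idea can be shown with a toy residual branch. If the last projection of a newly added sub-module starts at zero, the branch adds nothing to the shared stream, so any frozen head reading that stream sees the same activations as before. This is a minimal sketch under that assumption. The class name, layer sizes, and the choice to zero only the output projection are illustrative and not taken from the paper.

```python
import math
import random


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ZeroInitResidualBranch:
    """Hypothetical new sub-module that writes into a shared activation stream."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Input projection gets an ordinary random initialization.
        self.w_in = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        # Output projection starts at zero, so the branch contributes nothing.
        self.w_out = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, stream):
        hidden = [math.tanh(v) for v in linear(self.w_in, stream)]
        delta = linear(self.w_out, hidden)
        return [s + d for s, d in zip(stream, delta)]


if __name__ == "__main__":
    stream = [0.3, -1.2, 0.7, 2.0]
    branch = ZeroInitResidualBranch(dim=4, hidden=8)
    print(branch(stream) == stream)
```

In this sketch the branch only stays an identity map while `w_out` is zero. Once it is trained the stream changes, which is consistent with the abstract's additional requirement to freeze every parameter that feeds the shared stream.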

2606.11573 2026-06-11 cs.CV 新提交

Understanding Cross-Sensor Feature Variations for Generalizable 3D Perception

理解跨传感器特征变化以实现可泛化的3D感知

Xin Qiu, Wenjie Liu, Fuyuan Ai, YuChen Tan, Zhiwei Xu, Chunyi Song

发表机构 * Zhejiang University（浙江大学）

AI总结针对雷达-相机BEV感知跨数据集性能下降问题，提出频域场景变化建模框架，通过合成多样源域视图并正则化融合表示，提升3D检测器鲁棒性，无需目标域样本。

AI中文摘要

雷达-相机BEV感知在跨数据集评估时常常性能下降，因为驾驶场景、传感器配置和环境条件的变化会改变输入观测和内部融合表示。本文从源域变化建模的角度研究这一问题，旨在提高基于BEV的3D检测器的鲁棒性，而无需依赖目标域样本。我们引入一个框架，在频域中表征视觉场景变化，并利用这些变化合成多样的源域视图。通过比较生成的融合BEV表示，该框架进一步捕捉图像级变化如何影响多模态BEV特征。然后利用这些变化模式对检测器进行正则化，鼓励学习到的融合空间在潜在场景变化下保持稳定。所提出的方法仅在训练期间应用，推理流程保持不变。在View-of-Delft和TJ4DRadSet之间的跨数据集雷达-相机3D检测实验表明，该方法在多个BEV融合骨干网络上均有一致的改进，并且当少量目标域数据可用时，增益仍然有效。

英文摘要

Radar-camera BEV perception often suffers from degraded performance when evaluated across datasets, as changes in driving scenes, sensor configurations, and environmental conditions can alter both the input observations and the internal fused representations. This work studies this issue from the perspective of source-domain variation modeling, aiming to improve the robustness of BEV-based 3D detectors without relying on target-domain samples. We introduce a framework that characterizes visual scene variations in the frequency domain and uses them to synthesize diverse source-domain views. By comparing the resulting fused BEV representations, the framework further captures how image-level variations influence multi-modal BEV features. These variation patterns are then used to regularize the detector, encouraging the learned fusion space to remain stable under latent scene changes. The proposed method is applied only during training and leaves the inference pipeline unchanged. Experiments on cross-dataset radar-camera 3D detection between View-of-Delft and TJ4DRadSet demonstrate consistent improvements over multiple BEV fusion backbones, and the gains remain effective when a small amount of target-domain data is available.

2606.11687 2026-06-11 cs.CV cs.LG cs.RO 新提交

DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace

DroneShield-AI：一种用于受争议空域中实时自主无人机威胁检测、行为意图分类和群体智能的多模态传感器融合框架

Marius Bayizere

AI总结提出DroneShield-AI框架，集成RF信号分类、声学检测、YOLOv8视觉检测等六层处理，通过行为意图分类引擎（BICE）实现六类威胁分类并提前30秒预警，以及图神经网络群体智能模块（GNN-SIM）分析多无人机编队，在低成本硬件上达到96.1%检测精度和142ms延迟。

Comments: 23 pages, 6 figures, 11 tables. Code available at this https URL

AI中文摘要

无人机（UAV）威胁已成为21世纪定义性的安全挑战。本文提出DroneShield-AI，一个统一的开放框架，集成了六个处理层：RF信号分类、声学电机特征检测、基于YOLOv8的视觉检测、证据加权传感器融合、行为意图分类引擎（BICE）和图神经网络群体智能模块（GNN-SIM）。BICE首次引入了针对无人机飞行模式的系统性六类威胁分类法，能够提前30秒发出预测性操作员警报。GNN-SIM是首个用于对抗性多无人机编队分析的开放框架，采用图注意力网络。在三个公开的真实世界数据集上评估，融合流水线在约500-780美元总系统成本的商用CPU级硬件上实现了96.1%的检测准确率、3.2%的误报率、AUC-ROC：0.981以及142ms的端到端延迟。所有代码、模型权重和仿真数据集在提交时公开发布。

英文摘要

Unmanned Aerial Vehicle (UAV) threats have emerged as a defining security challenge of the 21st century. This paper presents DroneShield-AI, a unified open framework integrating six processing layers: RF signal classification, acoustic motor-signature detection, YOLOv8-based visual detection, evidence-weighted sensor fusion, a Behavioral Intent Classification Engine (BICE), and a Graph Neural Network Swarm Intelligence Module (GNN-SIM). BICE introduces the first systematic six-class threat taxonomy for drone flight patterns, enabling predictive operator alerts with a 30-second advance-warning horizon. GNN-SIM is the first open framework for adversarial multi-drone formation analysis using Graph Attention Networks. Evaluated on three publicly available real-world datasets, the fused pipeline achieves 96.1% detection accuracy, 3.2% false alarm rate, AUC-ROC: 0.981, and 142ms end-to-end latency on commodity CPU-class hardware at approximately $500-$780 USD total system cost. All code, model weights, and simulation datasets are publicly released at submission.

2606.11989 2026-06-11 cs.CV 新提交

From Nominal Intensity to Equivalent Rainfall: A Path-Based Credibility Evaluation Framework for Simulated Rainfall in Autonomous-Driving Perception Tests

从名义强度到等效降雨：自动驾驶感知测试中模拟降雨的基于路径的可信度评估框架

Tian Xia, Xin Zhao, Shaolingfeng Ye, Junyi Chen

发表机构 * College of Automotive and Energy Engineering, Tongji University（同济大学汽车与能源工程学院）； Tsinghua University（清华大学）

AI总结提出基于路径的可信度评估方法，通过路径等效降雨强度、不确定性带和雨滴分布真实度评分，结合激光雷达点云计数和平均反射率进行感知一致性校正，实现模拟降雨与真实降雨的对齐及测试结果映射。

Comments: 17 pages, preprint

AI中文摘要

可信的模拟降雨条件对于识别自动驾驶感知系统边界和支持面向SOTIF的风险评估至关重要。然而，封闭场地测试通常仅用名义降雨强度或单点测量来描述，这使得模拟降雨场难以与真实降雨对齐，并将测试结果映射到真实场景。本文提出了一种基于路径的自动驾驶感知测试中模拟降雨的可信度评估方法。以真实降雨的雨滴尺寸和速度联合分布为参考，每条候选路径由路径等效降雨强度、不确定性带和路径平均雨滴分布真实度（RRD）评分表示。进一步利用激光雷达目标点云计数和平均反射率进行感知一致性校正，量化每条模拟降雨路径对真实降雨感知效果的代理能力。实验使用了约10,000个真实降雨雨滴谱样本、728个RainSense感知样本以及2.4 m x 7.2 m模拟降雨区域内的45个空间采样点。结果表明，在相同名义条件下空间非均匀性仍然存在，证实了基于路径评估的必要性。该方法识别出路径IV和路径VI为优选候选路径，结果分别为11.54 +/- 0.31 mm/h、RRD = 0.43和8.28 +/- 0.34 mm/h、RRD = 0.46。这些路径在降雨强度稳定性、雨滴谱真实性和感知一致性方面表现出更均衡的性能。所提方法支持降雨条件下自动驾驶感知测试的路径选择、条件描述和可信解释。

英文摘要

Credible simulated-rainfall conditions are essential for identifying perception-system boundaries and supporting SOTIF-oriented risk assessment in automated driving. However, closed-field tests are often described only by nominal rainfall intensity or single-point measurements, making it difficult to align simulated rain fields with real rainfall and map test results to real-world scenarios. This paper proposes a path-based credibility evaluation method for simulated rainfall in autonomous-driving perception tests. Using the drop size and velocity joint distribution of real rainfall as the reference, each candidate path is represented by path-equivalent rainfall intensity, an uncertainty band, and a path-averaged Realism of Raindrop Distribution (RRD) score. Lidar target point-cloud count and mean reflectivity are further used for perception-consistency correction, quantifying the proxy capability of each simulated-rainfall path for real-rainfall perception effects. Experiments are conducted using about 10,000 real-rainfall raindrop-spectrum samples, 728 RainSense perception samples, and 45 spatial sampling points in a 2.4 m x 7.2 m simulated-rainfall area. Results show that spatial non-uniformity remains under the same nominal condition, confirming the need for path-based evaluation. The method identifies Path IV and Path VI as preferable candidates, with results of 11.54 +/- 0.31 mm/h, RRD = 0.43, and 8.28 +/- 0.34 mm/h, RRD = 0.46, respectively. These paths show more balanced performance in rainfall-intensity stability, raindrop-spectrum realism, and perception consistency. The proposed method supports path selection, condition description, and credible interpretation of autonomous-driving perception tests under rainfall.

2606.12217 2026-06-11 cs.CV cs.AI cs.RO 新提交

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

使远见可操作：在世界动作模型中重新利用表示对齐

Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu

发表机构 * The University of Hong Kong（香港大学）； XPENG Robotics（小鹏机器人）

AI总结针对世界动作模型中视觉预测与动作提取不匹配的问题，提出AGRA方法，通过对齐视频扩散特征与语义表示，提升动作解码器对任务相关区域的关注，从而改善操作任务的性能与泛化能力。

AI中文摘要

世界动作模型（WAM）通过使用视频生成模型在生成控制动作之前建模未来场景演变，为机器人操作提供了一条有前景的途径。然而，我们的实证观察揭示了一个现象：生成合理的视觉未来并不总能保证提取出准确的动作。为了诊断这一失败，我们进行了动作头注意力分析和因果干预。我们发现动作解码器未能聚焦于任务相关的交互区域，并且对任务无关区域的扰动保持敏感。这揭示了一种表示不匹配：为视觉重建优化的隐藏状态并未以适用于低级动作控制的形式组织。在本文中，我们提出了AGRA，一种动作接地表示对齐目标，通过将中间视频扩散特征与来自基础视觉编码器的空间连贯语义表示对齐，来正则化世界-动作接口。我们在真实世界的操作任务上评估了AGRA。实验表明，AGRA使世界模型表示更加动作接地：通过将动作解码器聚焦于正确的交互区域，它提高了物体定位精度和功能理解，并使策略对任务无关区域的扰动更加鲁棒。因此，AGRA在分布内性能和分布外泛化方面均持续优于基线世界动作模型。

英文摘要

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

2606.12396 2026-06-11 cs.CV cs.RO 新提交

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

VLGA：用于自动驾驶的视觉-语言-几何-动作模型

Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman

发表机构 * Uber AV Labs（Uber自动驾驶实验室）； University of Virginia（弗吉尼亚大学）

AI总结提出VLGA模型，通过引入几何作为第四模态，利用逐像素点图回归损失监督，实现密集3D世界重建，在nuScenes和Bench2Drive上达到SOTA。

Comments: Project page: this https URL

AI中文摘要

视觉-语言-动作（VLA）模型能够描述场景并用语言进行推理，但仍难以将其动作锚定在周围的密集3D世界中。现有方法要么从冻结的3D基础模型中注入特征，而没有确保策略使用这些特征的目标，要么通过稀疏的框和地图损失来约束几何，这些损失不提供密集的空间信号。我们引入了VLGA，这是第一个被监督以重建其驾驶通过的密集3D世界的视觉-语言-动作模型。VLGA通过一个专门的专家模块，由针对LiDAR的逐像素点图回归损失监督，将几何作为第四模态与视觉、语言和动作一起引入。在具有挑战性的nuScenes和Bench2Drive数据集上分别进行开环和闭环评估的大量实验表明，VLGA优于对应的VLA方法。特别是在开环nuScenes上，VLGA在没有自车状态的情况下，在VLA方法中取得了新的最先进结果，具有最低的L2误差（平均0.50米）和3秒碰撞率（0.18%）。在闭环Bench2Drive上，VLGA取得了79.08的最先进驾驶得分，比最强的先前VLA高出0.71，同时具有相当的效率和舒适性。

英文摘要

Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments conducted on challenging nuScenes and Bench2Drive datasets for open-loop and closed-loop evaluations, respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
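A per-pixel pointmap regression loss compares a predicted 3D point at each pixel with the LiDAR point projected to that pixel. The abstract does not give the norm or say how pixels without a LiDAR return are handled. The sketch below assumes an L1 distance and a validity mask that skips such pixels, and the function name is hypothetical.

```python
def pointmap_l1_loss(pred, target, valid):
    """Mean L1 distance between predicted and LiDAR 3D points.

    pred and target are H x W grids of (x, y, z) tuples. valid is an
    H x W grid of booleans marking pixels that have a LiDAR return.
    Assumption: L1 norm, averaged over valid pixels only.
    """
    total, count = 0.0, 0
    for p_row, t_row, v_row in zip(pred, target, valid):
        for p, t, v in zip(p_row, t_row, v_row):
            if v:
                total += sum(abs(a - b) for a, b in zip(p, t))
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    pred = [[(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]]
    lidar = [[(0.0, 0.0, 2.0), (9.0, 9.0, 9.0)]]
    mask = [[True, False]]  # the second pixel has no LiDAR return
    print(pointmap_l1_loss(pred, lidar, mask))
```

The mask matters because projected LiDAR is sparse in the image plane, and unmasked pixels would otherwise be penalized against placeholder values.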

2606.12105 2026-06-11 cs.RO cs.CV cs.LG 交叉投稿

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

DAM-VLA: 解耦异步多模态视觉语言动作模型

Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov

发表机构 * Intuitive Robots Lab, Karlsruhe Institute of Technology (KIT)（直觉机器人实验室，卡尔斯鲁厄理工学院）； NVIDIA（英伟达）； Robotics Institute of Germany（德国机器人研究所）

AI总结针对VLA模型同步时钟与物理交互中不同模态频率不匹配的问题，提出DAM-VLA，通过解耦各模态时间处理、维护传感器速率更新的潜在缓冲区，并利用门控交叉注意力整合高频模态，在7个真实操作任务中平均成功率提升至95.2%。

Comments: 17 pages, 8 figures

AI中文摘要

视觉-语言-动作（VLA）模型继承了视觉-语言预训练中的共享同步时钟，以单一速率处理每个输入。这与物理交互不一致，在物理交互中，高频模态以数百赫兹变化，视觉演化较慢，而语言在整个回合中保持不变。同步VLA会过采样慢速模态，欠采样快速模态，并将动作生成限制在最低有效频率。我们假设解耦每个模态的时间处理，让每个模态以其自身传感器速率更新和保留信息，可以产生更强的表示和更鲁棒的控制。我们提出DAM-VLA，它维护每个模态的潜在缓冲区，以传感器速率刷新并由动作头连续读取，通过门控交叉注意力整合新的高频模态，同时保持预训练主干不变。在七个接触丰富的真实世界操作任务中，DAM-VLA将最强同步基线的平均成功率提高了一倍以上（95.2% vs. 40.95%），同时维持平滑、反应式的100 Hz控制。项目网站：this https URL

英文摘要

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: this https URL
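The decoupling described above can be pictured as one buffer per modality, each refreshed on its own clock, with the action loop reading whatever is newest. The sketch below simulates this on a 1 ms tick. The modality names, the 500 Hz and 10 Hz sensor rates, and the string placeholders for latents are all hypothetical. The gated cross-attention that fuses the buffers is not modeled.

```python
class LatentBuffer:
    """Holds the most recent latent for one modality."""

    def __init__(self, period_ms):
        self.period_ms = period_ms  # sensor period on the simulated clock
        self.latent = None
        self.stamp_ms = None

    def maybe_refresh(self, now_ms, new_latent):
        # Refresh only on this modality's own sensor ticks.
        if now_ms % self.period_ms == 0:
            self.latent = new_latent
            self.stamp_ms = now_ms


def run(duration_ms, periods_ms, action_period_ms=10):
    """Simulate asynchronous refreshes and a 100 Hz action loop (10 ms period)."""
    buffers = {name: LatentBuffer(p) for name, p in periods_ms.items()}
    reads = []
    for now in range(duration_ms):
        for name, buf in buffers.items():
            buf.maybe_refresh(now, f"{name}@{now}")
        if now % action_period_ms == 0:
            # The action head reads the latest latent of every modality.
            reads.append((now, {n: b.stamp_ms for n, b in buffers.items()}))
    return reads


if __name__ == "__main__":
    # Hypothetical rates: force at 500 Hz (2 ms), vision at 10 Hz (100 ms).
    for now, stamps in run(250, {"force": 2, "vision": 100})[12:15]:
        print(now, stamps)
```

At 130 ms the action step sees a force latent stamped 130 and a vision latent stamped 100, so control runs at 100 Hz without waiting for the next camera frame.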

2606.12142 2026-06-11 cs.RO cs.CV 交叉投稿

AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents

AerialClaw：一个用于LLM驱动的自主空中智能体的开源框架

Ke Li, Jianfei Yang, Luyao Zhang, Guo Yu, Chengwei Yan, Yuan Ding, Di Wang, Nan Luo, Gang Liu, Xiao Gao, Quan Wang

发表机构 * Xidian University（西安电子科技大学）； Xi'an University of Architecture and Technology（西安建筑科技大学）

AI总结提出AerialClaw开源框架，采用模块化脑-技能-运行时架构，使基于LLM的智能体能够理解自然语言任务、调用空中技能、闭环决策，提升无人机系统的灵活性、可复现性和可扩展性。

AI中文摘要

无人机（UAV）越来越多地用于检查、搜索救援、环境监测和应急响应。然而，大多数无人机应用仍然依赖于预定义的命令序列或特定任务的管道，开发者手动连接感知、规划、飞行控制、仿真、日志记录和安全模块。这限制了自主空中系统的灵活性、可复现性和可扩展性。本文提出了AerialClaw，一个开源软件框架，使无人机能够作为决策型空中智能体运行，而不仅仅是遵循命令的平台。给定自然语言任务，AerialClaw允许基于LLM的智能体理解任务、维护上下文、调用可执行的空中技能、观察感知和运行时反馈，并在闭环中迭代更新其决策。该框架采用模块化的脑-技能-运行时架构，结合了用于原子无人机操作的硬技能、基于Markdown的可重用任务策略软技能、文档驱动的智能体状态和能力边界、记忆驱动的反思、面向安全的运行时验证以及平台无关的执行适配器。AerialClaw支持轻量级模拟执行、PX4 SITL与Gazebo以及基于AirSim的仿真，同时提供Web控制台、可插拔模型后端、示例任务、仿真资产和分阶段部署脚本。通过结合标准化的空中技能、文档驱动的智能体状态、记忆和闭环LLM决策，AerialClaw提供了一个可复现且可扩展的开源框架，用于构建能够解释任务、做出决策、执行技能并根据反馈调整行为的无人机系统。

英文摘要

Unmanned aerial vehicles (UAVs) are increasingly used in inspection, search and rescue, environmental monitoring, and emergency response. However, most UAV applications still rely on pre-defined command sequences or task-specific pipelines, where developers manually connect perception, planning, flight control, simulation, logging, and safety modules. This limits the flexibility, reproducibility, and extensibility of autonomous aerial systems. This paper presents AerialClaw, an open-source software framework that enables UAVs to operate as decision-making aerial agents rather than merely command-following platforms. Given a natural-language mission, AerialClaw allows an LLM-based agent to understand the task, maintain context, invoke executable aerial skills, observe perception and runtime feedback, and iteratively update its decisions in a closed loop. The framework adopts a modular brain-skill-runtime architecture, combining hard skills for atomic UAV operations, Markdown-based soft skills for reusable task strategies, document-driven agent state and capability boundaries, memory-driven reflection, safety-oriented runtime validation, and platform-agnostic execution adapters. AerialClaw supports lightweight mock execution, PX4 SITL with Gazebo, and AirSim-based simulation, together with a web console, pluggable model backends, example missions, simulation assets, and staged deployment scripts. By combining standardized aerial skills, document-driven agent state, memory, and closed-loop LLM decision-making, AerialClaw provides a reproducible and extensible open-source framework for building UAV systems that can interpret missions, make decisions, execute skills, and adapt their behavior from feedback.

2606.12236 2026-06-11 cs.RO cs.CV 交叉投稿

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DrivingAgent: 自动驾驶系统的设计与调度智能体

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

发表机构 * Wangxuan Institute of Computer Technology, Peking University（北京大学王选计算机技术研究所）； University of California, Merced（加州大学默塞德分校）

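AI summary: Proposes DrivingAgent, a framework that addresses the challenges of integrating new models into autonomous driving systems and meeting real-time constraints through automated module development (design phase) and real-time scheduling by a lightweight LLM trained with reinforcement learning (scheduling phase), achieving a better speed-accuracy trade-off on nuScenes and Bench2Drive.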

AI中文摘要

许多自动驾驶系统越来越多地整合基础模型以提高泛化能力并处理长尾场景。然而，这一趋势带来了两个关键挑战：（i）设计和集成新模型的手动且劳动密集型过程，以及（ii）缺乏智能、动态的调度机制以满足严格的实时约束。虽然基于大语言模型（LLM）的智能体为自动化提供了有前景的途径，但现有框架并不适合自动驾驶。具体来说，它们未能区分系统设计和实时调度的根本不同需求，将模块视为不透明的黑盒，并且并非为持续运行而设计。为了解决这些局限性，我们提出了DrivingAgent，这是一个针对自动驾驶系统设计和调度双重挑战的新型智能体框架。在设计阶段，DrivingAgent通过解释系统架构、生成代码以及通过超网络训练验证模块来自动化模块开发。在调度阶段，它采用一个通过强化学习训练的轻量级LLM来实时动态编排系统模块，并由一个集成长期存储与带时间戳短期上下文的结构化记忆支持。实验结果表明，DrivingAgent在nuScenes和Bench2Drive基准测试上实现了更优的速度-精度权衡。

英文摘要

Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed-accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

2606.12374 2026-06-11 cs.RO cs.CV 交叉投稿

Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

语义感知的潜水员活动识别框架用于有效的水下多人类-机器人协作

Sadman Sakib Enan, Junaed Sattar

发表机构 * University of Minnesota（明尼苏达大学）

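AI summary: Proposes DAR-Net, a framework that combines Transformer-based temporal reasoning with pixel-level scene supervision and uses multi-loss training to align global activity recognition with local human-robot interaction semantics, addressing diver activity recognition in low-visibility underwater environments; also releases UDA, the first underwater diver activity dataset.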

AI中文摘要

有效的人机多体协作对于在具有挑战性和高风险的水下环境中扩展人类主导的操作至关重要。为了使自主水下航行器（AUV）成为真正的队友，它们必须能够理解周围环境并识别潜水员的活动，以提供帮助并确保安全。为此，我们引入了DAR-Net，一种新颖的基于Transformer的框架，用于分析复杂的水下场景并对潜水员活动进行分类。我们的贡献在于一种语义引导的学习公式，它将基于Transformer的时间推理与像素级场景监督相结合。这种多损失训练策略明确地将全局活动识别与局部人机交互语义对齐，这在低可见度水下条件下尤为关键。为了解决该领域数据稀缺的重大挑战，我们首次提出了水下潜水员活动（UDA）数据集，这是一个基础资源，包含超过2600张带有像素级掩码的注释图像。通过在受控环境中进行严格的实验评估，我们证明DAR-Net在识别六种不同潜水员活动方面达到了有希望的准确性，优于现有最先进的模型。虽然该数据集提供了关键的基线，但我们的工作作为开创性的一步，为未来研究奠定了基础，并促进了更智能、协作的水下机器人系统的发展。

英文摘要

Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.

2606.12402 2026-06-11 cs.RO cs.AI cs.CV 交叉投稿

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

DIRECT: 在具身规划器中何时何地分配测试时计算？

Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone

发表机构 * Stanford University（斯坦福大学）； University of Waterloo（滑铁卢大学）； NVIDIA（英伟达）

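AI summary: Proposes DIRECT, a routing framework that allocates compute per prompt based on multimodal scene context to improve the success-cost Pareto frontier; experiments show that different scaling axes yield different capability gains, and on a physical robot the router matches or exceeds a stronger model at lower latency.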

AI中文摘要

视觉语言模型（VLM）越来越多地被部署为具身智能体的高层规划器，一种新兴策略是扩展测试时计算以提高能力。然而，我们观察到这样做会增加延迟、令牌使用和FLOPs，同时在下游任务中产生不均匀且往往递减的收益，限制了具身智能体的部署范围。我们认为，选择何时何地花费测试时计算是将前沿性能带入现实世界的关键。我们引入了DIRECT，一个路由框架，利用多模态场景上下文按提示分配计算资源，在固定模型选择上改进了成功-成本帕累托前沿。在三种主要的缩放轴（即思维链深度、模型大小和记忆历史）上，我们在VLABench和RoboMME上的实验表明，测试时计算并非均匀的杠杆：不同的轴产生性质不同的能力增益。我们在DROID设置中的物理Franka机械臂上验证了这些见解，涵盖了零样本操作和长程链式任务，我们的路由器以高达65%的平均延迟降低匹配或超过了更强模型的成功率。最终，我们的结果表明，天真地扩展测试时计算是浪费的，而DIRECT能够以极低的成本在机器人系统中提供前沿级别的具身规划。项目页面可在此http URL找到。

英文摘要

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success-cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at this http URL.
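
The "success-cost Pareto frontier" mentioned in the abstract can be illustrated with a small sketch. This is not the DIRECT router itself; it only shows how a frontier is computed from (latency, success) pairs. The configuration names and all numbers below are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

# (name, average latency in seconds, success rate); all values are hypothetical.
Config = Tuple[str, float, float]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates.

    A configuration dominates another if it is no slower and no less
    successful, and strictly better on at least one of the two criteria.
    """
    frontier = []
    for name, cost, success in configs:
        dominated = any(
            (c2 <= cost and s2 >= success) and (c2 < cost or s2 > success)
            for _, c2, s2 in configs
        )
        if not dominated:
            frontier.append((name, cost, success))
    # Sort by latency so the frontier reads from cheapest to most expensive.
    return sorted(frontier, key=lambda c: c[1])


configs = [
    ("small", 1.0, 0.50),
    ("small+cot", 2.0, 0.55),
    ("large", 4.0, 0.70),
    ("large+cot", 8.0, 0.70),
    ("router", 2.5, 0.70),
]
frontier = pareto_frontier(configs)
print([name for name, _, _ in frontier])
```

In this toy setting, a policy that reaches the large model's success rate at lower latency pushes the fixed large-model configurations off the frontier, which is the kind of improvement the abstract describes.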

2604.24119 2026-06-11 cs.CV 版本更新

TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations

TopoHR: 面向驾驶场景中循环拓扑推理的层次化中心线表示与点到实例关系

Yifeng Bai, Zhirong Chen, Bo Song, Erkang Cheng, Haibin Ling

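AI summary: Proposes TopoHR, a framework that achieves cyclic interaction between centerline detection and topology reasoning through a hierarchical centerline representation and point-to-instance and instance-to-instance relations within a unified architecture, with significant gains on OpenLane-V2.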

Comments: Accepted at CVPR 2026 (camera ready version)

AI中文摘要

拓扑推理对于自动驾驶至关重要。当前方法主要关注用于中心线检测的实例级学习，随后是依赖于简化MLP层的拓扑推理顺序模块。此外，它们常常忽略拓扑推理中\textit{点到实例}（P2I）关系的重要性。为了解决这些局限性，我们提出了TopoHR（拓扑层次化表示），一种新颖的端到端框架，建立了中心线检测与拓扑推理之间的循环交互，使它们能够相互迭代增强。具体来说，我们引入了一种层次化中心线表示，包括点查询、实例查询和语义表示。这些多级特征在层次化中心线解码器中无缝集成和融合。此外，我们设计了一个层次化拓扑推理模块，在统一架构中捕获细粒度的P2I关系和全局的实例到实例（I2I）连接。通过这些新颖的组件，TopoHR确保了准确且鲁棒的拓扑推理。在OpenLane-V2基准上，TopoHR刷新了最先进性能，取得了显著改进。值得注意的是，与先前最佳结果相比，TopoHR在$\text{subset_A}$上实现了+3.8的$\mathrm{DET}_{\text{l}}$、+5.4的$\mathrm{TOP}_{\text{ll}}$，在$\text{subset_B}$上实现了+11.0的$\mathrm{DET}_{\text{l}}$、+7.9的$\mathrm{TOP}_{\text{ll}}$，验证了所提出组件的有效性。代码将在https://this URL公开分享。

英文摘要

Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, they often neglect the importance of point-to-instance (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR sets a new state of the art with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in $\mathrm{DET}_{l}$ and +5.4 in $\mathrm{TOP}_{ll}$ on subset_A, and +11.0 in $\mathrm{DET}_{l}$ and +7.9 in $\mathrm{TOP}_{ll}$ on subset_B, validating the effectiveness of the proposed components. The code will be shared publicly at this https URL.

2606.08744 2026-06-11 cs.CV 版本更新

MB-Loc: Multi-planar Bird's-eye-view Localization in outdoor LiDAR scenes

MB-Loc：室外LiDAR场景中的多平面鸟瞰图定位

Ayaan Choudhury, Preet Savalia, Anirudh Pydah, Avinash Sharma

发表机构 * Indian Institute of Technology Jodhpur（印度理工学院焦特布尔分校）

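AI summary: Proposes MB-Loc, a framework that projects LiDAR scans into a 2.5D multi-planar bird's-eye-view representation and combines a KL-regularized latent bottleneck with 3D spatial augmentation for lightweight, viewpoint-robust scene coordinate regression localization, reaching real-time inference on the NCLT dataset and outperforming existing methods.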

AI中文摘要

全局LiDAR定位是自主导航系统的基本任务。最近的方法通过预测密集的3D世界坐标进行场景坐标回归（SCR），相比绝对位姿回归（APR）方法实现了更高的精度。然而，SCR方法引入了两个主要瓶颈：处理原始3D几何结构导致的严重计算低效，以及在不同传感器视角下性能显著下降。为了解决这些限制，我们提出了MB-Loc，一个轻量级且视角鲁棒的SCR框架。我们不依赖沉重的3D卷积，而是将输入的LiDAR扫描投影为2.5D多平面鸟瞰图（BEV）表示。通过沿Z轴切片点云并将有符号深度映射到离散的2D平面，MB-Loc保留了关键的3D几何结构，同时利用了标准2D CNN的计算可处理性。为了处理室外LiDAR固有的稀疏性，我们引入了一个KL正则化的隐瓶颈，该瓶颈在不注入随机噪声的情况下显式建模空间不确定性。最后，为了确保旋转鲁棒性，我们在平面投影之前应用3D空间增强，迫使网络隐式学习视角不变的特征。我们在公开的NCLT数据集上进行了大量实验，证明了我们提出的方法优于当前最先进的方法。以实时推理速度运行，MB-Loc在计算效率上显著优于传统的3D-SCR架构。

英文摘要

Global LiDAR localization is a fundamental task for autonomous navigation systems. Recent methods perform Scene Coordinate Regression (SCR) and achieve superior accuracy over Absolute Pose Regression (APR) solutions by predicting dense 3D world coordinates. However, SCR approaches introduce two major bottlenecks: severe computational inefficiency from processing raw 3D geometries and significant performance degradation under varying sensor viewpoints. To address these limitations, we present MB-Loc, a lightweight and viewpoint-robust SCR framework. Instead of relying on heavy 3D convolutions, we project the input LiDAR scan into a 2.5D Multi-planar Bird's-Eye View (BEV) representation. By slicing the point-cloud along the Z-axis and mapping signed depths into discrete 2D planes, MB-Loc retains essential 3D geometric structures while exploiting the computational tractability of standard 2D CNNs. To handle the inherent sparsity of outdoor LiDAR, we introduce a KL-regularized latent bottleneck that explicitly models spatial uncertainty without injecting stochastic noise. Finally, to ensure rotation robustness, we apply 3D spatial augmentations prior to planar projection, forcing the network to implicitly learn viewpoint-invariant features. We perform extensive experiments on the publicly available NCLT dataset and demonstrate that our proposed method outperforms the current state-of-the-art. Operating at real-time inference speeds, MB-Loc significantly outperforms traditional 3D-SCR architectures in computational efficiency.
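
The slicing step described in the abstract (cutting the point cloud along the Z-axis and writing signed depths into discrete 2D planes) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not define "signed depth" precisely, so here it is assumed to be the offset of a point from the mid-height of its slice, and each cell keeps the offset with the largest magnitude. The function name, grid size, and slice edges are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def multiplanar_bev(
    points: Sequence[Point],
    xy_min: float,
    xy_max: float,
    z_edges: Sequence[float],
    grid: int,
) -> List[List[List[float]]]:
    """Project points into len(z_edges) - 1 BEV planes of size grid x grid.

    Assumption: a cell stores the signed offset (z minus slice mid-height)
    of the point with the largest absolute offset that falls into it.
    Empty cells stay at 0.0.
    """
    n_planes = len(z_edges) - 1
    planes = [[[0.0] * grid for _ in range(grid)] for _ in range(n_planes)]
    cell = (xy_max - xy_min) / grid
    for x, y, z in points:
        # Drop points outside the square region covered by the grid.
        if not (xy_min <= x < xy_max and xy_min <= y < xy_max):
            continue
        col = min(grid - 1, int((x - xy_min) / cell))
        row = min(grid - 1, int((y - xy_min) / cell))
        for k in range(n_planes):
            if z_edges[k] <= z < z_edges[k + 1]:
                depth = z - 0.5 * (z_edges[k] + z_edges[k + 1])
                if abs(depth) >= abs(planes[k][row][col]):
                    planes[k][row][col] = depth
                break
    return planes


points = [(0.5, 0.5, 0.2), (1.5, 0.5, 1.7), (5.0, 0.0, 0.0)]
planes = multiplanar_bev(points, xy_min=0.0, xy_max=2.0, z_edges=[0.0, 1.0, 2.0], grid=2)
```

The resulting stack of planes can be treated as a multi-channel image, which is what allows a standard 2D CNN to consume it in place of 3D convolutions.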

2511.13207 2026-06-11 cs.RO cs.CV 版本更新

PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

PIGEON: 通过兴趣点选择的VLM驱动物体导航

Cheng Peng, Zhenzhe Zhang, Xiaobao Wei, Yanhao Zhang, Heng Wang, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Shanghang Zhang, Jing Liu

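AI summary: Proposes PIGEON, a framework that formulates object navigation as a sparse decision problem grounded in raw observations, using Points of Interest (PoIs) as visual decision units from which a VLM selects key points, achieving state-of-the-art zero-shot performance and transferring to active embodied question answering.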

AI中文摘要

在未见过的室内环境中进行物体导航要求智能体在部分可观测条件下执行语义搜索。视觉-语言模型（VLM）为此任务提供了强大的语义-空间先验，但如何将其与机器人导航接口仍然具有挑战性：密集的VLM推理成本高昂，而将环境抽象为符号记忆通常将高层推理与支持它的原始视觉证据分离。我们提出PIGEON（基于兴趣点引导的物体导航探索），一种VLM驱动的框架，将物体导航建模为基于原始观测的稀疏决策问题。PIGEON引入兴趣点（PoI）作为稀疏视觉决策单元，将几何可执行的路点与原始自我中心观测耦合。PIGEON不是将VLM用作密集控制器或限制其进行前沿排序，而是使VLM能够选择任务关键的PoI，包括探索前沿、疑似目标物体、可穿越楼梯和楼层级摘要，而低级规划器在它们之间执行连续运动。这种PoI接口进一步使高层导航决策可验证，使我们能够开发一个RLVR流水线，无需手动思维链注释即可改进局部VLM。在Habitat ObjectNav基准上的大量实验表明，PIGEON实现了零样本最先进性能，与基础模型能力一致扩展，并且仅通过提示修改即可迁移到主动具身问答。在物理机器人上的实际部署进一步证明了其鲁棒性和效率。

英文摘要

Object navigation in unseen indoor environments requires agents to perform semantic search under partial observability. Vision-language models (VLMs) provide strong semantic-spatial priors for this task, but how to interface them with robot navigation remains challenging: dense VLM inference is expensive, while abstracting environments into symbolic memories often separates high-level reasoning from the raw visual evidence that supports it. We propose PIGEON (Point of Interest Guided Exploration for Object Navigation), a VLM-driven framework that formulates object navigation as a raw-observation-grounded sparse decision problem. PIGEON introduces Points of Interest (PoIs) as sparse visual decision units that couple geometrically executable waypoints with raw egocentric observations. Rather than using VLMs as dense controllers or restricting them to frontier ranking, PIGEON enables VLMs to select among task-critical PoIs, including exploration frontiers, suspected target objects, traversable stairs, and floor-level summaries, while low-level planners execute continuous motion between them. This PoI interface further makes high-level navigation decisions verifiable, allowing us to develop an RLVR pipeline that improves local VLMs without manual Chain-of-Thought annotations. Extensive experiments on Habitat ObjectNav benchmarks show that PIGEON achieves state-of-the-art zero-shot performance, scales consistently with foundation model capacity, and transfers to Active Embodied Question Answering with only prompt modifications. Real-world deployments on physical robots further demonstrate its robustness and efficiency.

2606.06904 2026-06-11 cs.RO cs.CV 版本更新

ActionMap: Robot Policy Learning via Voxel Action Heatmap

ActionMap: 基于体素动作热图的机器人策略学习

Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou

发表机构 * National University of Singapore ； NVIDIA

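AI summary: Proposes ActionMap, an action decoder that models the action space as a voxel heatmap and replaces the single-point predictor in existing VLA models, improving performance and data efficiency in LIBERO simulation and real-world Franka manipulation.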

AI中文摘要

视觉-语言-动作（VLA）模型在骨干网络、训练方法和数据规模方面快速发展，但将骨干网络隐藏状态转换为连续控制信号的动作解码器几乎没有变化，在大多数现有VLA中仍然是单点预测器。无论是通过自回归词元箱、L1回归还是流匹配去噪实现，所得解码器都将动作空间视为无结构的，在训练期间未利用相邻动作的几何邻近性。为了改进这一点，我们引入了ActionMap，一种体素热图动作头，可以插入现有VLA中替换其原生动作解码器。对于每个新动作，该头预测动作空间上的体素热图，其中每个体素直接存储对应动作的概率。在LIBERO仿真和真实Franka操作中，我们的热图头在匹配训练步数下超越了两种架构不同的骨干网络（例如，在LIBERO四套件平均上比OpenVLA-OFT的L1回归头高出8.2%），在两种骨干网络上以相当或更快的速度收敛，并且在低训练数据下保持显著更高的数据效率。跨骨干网络的一致性表明，动作表示是VLA性能的一个真正杠杆，与进一步的骨干网络或方法缩放不同。项目页面：此 https URL。

英文摘要

Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: this https URL.
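
To make the idea of a heatmap over a discretized action space concrete, here is a minimal sketch for a 3-D action. It is not the ActionMap head: the Gaussian-smoothed target and the argmax decoding are assumptions chosen to illustrate how neighboring voxels can share probability mass, and the bounds, bin count, and function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Voxel = Tuple[int, int, int]


def action_to_voxel(action: Sequence[float], low: Sequence[float],
                    high: Sequence[float], bins: int) -> Voxel:
    """Map a continuous 3-D action to the index of the voxel containing it."""
    idx = []
    for a, lo, hi in zip(action, low, high):
        t = (a - lo) / (hi - lo)
        idx.append(min(bins - 1, max(0, int(t * bins))))
    return tuple(idx)


def voxel_center(idx: Voxel, low: Sequence[float],
                 high: Sequence[float], bins: int) -> Tuple[float, ...]:
    """Return the continuous action at the center of a voxel."""
    return tuple(lo + (i + 0.5) * (hi - lo) / bins for i, lo, hi in zip(idx, low, high))


def soft_target(center: Voxel, bins: int, sigma: float = 1.0) -> Dict[Voxel, float]:
    """Assumed training target: a Gaussian bump around the true voxel,
    normalized so the heatmap sums to 1."""
    heat = {}
    for i in range(bins):
        for j in range(bins):
            for k in range(bins):
                d2 = (i - center[0]) ** 2 + (j - center[1]) ** 2 + (k - center[2]) ** 2
                heat[(i, j, k)] = math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(heat.values())
    return {voxel: value / total for voxel, value in heat.items()}


def decode(heat: Dict[Voxel, float], low: Sequence[float],
           high: Sequence[float], bins: int) -> Tuple[float, ...]:
    """Assumed decoding rule: take the center of the most probable voxel."""
    best = max(heat, key=heat.get)
    return voxel_center(best, low, high, bins)


low, high, bins = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0), 8
idx = action_to_voxel((0.3, -0.6, 0.9), low, high, bins)
heat = soft_target(idx, bins)
decoded = decode(heat, low, high, bins)
```

Unlike a single-point regression target, the smoothed heatmap assigns non-zero probability to voxels near the demonstrated action, which is one way to exploit the geometric proximity of neighboring actions that the abstract refers to.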

2606.11626 2026-06-11 cs.CV 新提交

Adapting Vision-Language Models from Iconic to Inclusive for Multi-Label Recognition Without Labels

将视觉-语言模型从标志性适应到包容性：用于无标签的多标签识别

Cheng Chen, Jingyu Zhou, Yifan Zhao, Jia Li

发表机构 * State Key Laboratory of Virtual Reality Technology and Systems, SCSE & QRI, Beihang University（虚拟现实技术与系统国家重点实验室，北京航空航天大学计算机学院与青岛研究院）

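AI summary: Proposes an unsupervised framework that adapts VLMs in two stages, "cutting" and "sewing", for label-free multi-label image recognition, outperforming existing unsupervised methods on four datasets.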

AI中文摘要

理解多标签图像仍然是计算机视觉中的一项挑战性任务。随着视觉-语言多模态学习的快速发展，视觉-语言模型（VLM）能够在没有标注数据的情况下实现零样本识别。然而，由于其内在设计，这些模型通常优先考虑最标志性的物体，而忽略其他上下文正例。这种内在偏差与多标签学习的性质相冲突，从而限制了它们的适用性。在这项工作中，我们提出了一个无监督框架，将VLM从标志性识别适应到包容性理解，实现无标签的多标签图像识别。我们的方法包括两个关键阶段：“切割”和“缝合”：在切割阶段，我们提出了多采样响应估计器，以防止模型仅关注单个物体。在第二个缝合阶段，引入了多目标混合适应，以调整标签使其更符合多标签分布，同时仅在一个epoch内保留原始模型的内在特性。大量实验表明，我们的框架在四个公共数据集上显著优于现有的无监督方法，甚至超过了几种有代表性的弱监督基线。这些结果证明了将预训练VLM适应于更全面的视觉理解而无需人工标注的潜力。我们的代码在此https URL公开。

英文摘要

Understanding multi-label images remains a challenging task in computer vision. With the rapid progress of vision-language multimodal learning, vision-language models (VLMs) enable zero-shot recognition without labeled data. However, due to their intrinsic design, these models often prioritize the most iconic object and omit other contextual positives. This intrinsic bias conflicts with the nature of multi-label learning, thereby limiting their applicability. In this work, we propose an unsupervised framework that adapts VLMs from iconic recognition toward inclusive understanding, enabling label-free multi-label image recognition. Our approach consists of two key stages, "cutting" and "sewing". In the cutting stage, we present the multi-sampling response estimator to prevent the model from concentrating only on one single object. In the sewing stage, the multi-object blend adaptation is introduced to adjust the labels to better conform to the multi-label distribution while preserving the intrinsic characteristics of the original model within only one epoch. Extensive experiments show that our framework significantly outperforms existing unsupervised approaches on four public datasets, even surpassing several representative weakly supervised baselines. These results demonstrate the potential of adapting pre-trained VLMs for more comprehensive visual understanding without manual annotations. Our code is publicly available at this https URL.

2606.11661 2026-06-11 cs.CV cs.LG 新提交

Learning Instance-Adaptive Low-Rank Orthogonal Subspaces for Clothes-Changing Person Re-Identification

学习实例自适应低秩正交子空间用于换衣行人重识别

Dong-Woo Kim, Tae-Kyun Kim

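AI summary: Proposes Ortho-ReID, which explicitly models a low-rank clothing subspace from VLM text descriptions and extracts clothing-invariant features via geometric constraints, achieving the best performance on multiple benchmark datasets.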

Comments: Accepted to the ICML 2026 Workshop on CoLoRAI

AI中文摘要

换衣行人重识别（CC-ReID）旨在识别尽管因服装变化导致外观剧烈变化的个体。现有方法依赖对抗学习来解耦服装特征，我们提出Ortho-ReID，该方法从VLM文本描述中显式建模低秩服装子空间，并通过直接几何约束提取服装不变表示。一个关键组件是基于Transformer的基生成器（Basis Maker），它通过与图像块的交叉注意力，将共享的低维服装先验细化为实例自适应低秩子空间，从而在变化的可见性条件下也能实现鲁棒的服装特征提取。该实例自适应子空间通过与服装文本嵌入对齐进行监督，而身份特征则通过可学习的投影头提取，并在几何上约束与其严格正交。大量实验表明，在PRCC（top-1提升5.9%）、Celeb-reID-light（提升3.5%）和LaST（提升5.3%）上达到了最先进性能，在LTCC上也取得了有竞争力的结果。

英文摘要

Clothes-changing person re-identification (CC-ReID) aims to recognize individuals despite drastic appearance changes caused by clothing variation. While existing methods rely on adversarial learning to disentangle clothing features, we propose Ortho-ReID, which explicitly models a low-rank clothing subspace from VLM text descriptions and extracts clothing-invariant representations via direct geometric constraints. A critical component is our transformer-based Basis Maker, which refines a shared, low-dimensional clothing prior into an instance-adaptive low-rank subspace through cross-attention with image patches, enabling robust clothing feature extraction even under varying visibility conditions. This instance-adaptive subspace is supervised via alignment with clothing text embeddings, while identity features are extracted via a learnable projection head and geometrically constrained to be strictly orthogonal to it. Extensive experiments demonstrate state-of-the-art performance on PRCC (+5.9% top-1), Celeb-reID-light (+3.5%), and LaST (+5.3%), with competitive results on LTCC.
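
The geometric idea of keeping identity features orthogonal to a low-rank clothing subspace can be illustrated with a plain projection. This is a sketch under assumptions, not the Ortho-ReID method: the paper describes a learnable projection head with an orthogonality constraint, whereas the code below applies a hard projection onto the orthogonal complement of a given basis. The vectors and function names are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(basis: Sequence[Sequence[float]]) -> List[List[float]]:
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    ortho: List[List[float]] = []
    for v in basis:
        w = list(v)
        for q in ortho:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:  # skip vectors already inside the span
            ortho.append([wi / norm for wi in w])
    return ortho


def remove_subspace(feature: Sequence[float],
                    basis: Sequence[Sequence[float]]) -> List[float]:
    """Project a feature onto the orthogonal complement of span(basis)."""
    out = list(feature)
    for q in orthonormalize(basis):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# A rank-2 "clothing" subspace inside a 4-D feature space (toy values).
clothing_basis = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
feature = [3.0, 2.0, 5.0, -1.0]
identity = remove_subspace(feature, clothing_basis)
```

After the projection, the remaining vector has zero inner product with every basis vector, which is the strict orthogonality property the abstract describes for identity features.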

2606.11689 2026-06-11 cs.CV 新提交

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

RankVR: 低秩结构感知与价值重新校准用于鲁棒组合图像检索

Jiale Huang, Zixu Li, Zhiheng Fu, Zhiwei Chen, Qinlei Huang, Yupeng Hu

发表机构 * Shandong University（山东大学）

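AI summary: To address noisy triplet correspondence in composed image retrieval, proposes RankVR, whose Global Structure Consistency Perception module uses the effective rank of the correlation matrix to decouple clean samples and whose Adaptive Semantic Value Calibration module dynamically quantifies triplet value, significantly outperforming existing methods on FashionIQ and CIRR.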

Comments: Accepted by ICMR 2026

AI中文摘要

组合图像检索（CIR）构成了一种关键范式，要求模型对参考图像和修改文本进行联合推理。然而，大规模数据集中普遍存在的噪声三元组对应（NTC）严重限制了模型性能。现有的去噪方法要么针对二元不匹配，要么依赖基于标量的逐点估计，忽略了样本群体中丰富的全局结构相关性和训练过程中的动态价值变化，从而产生次优结果。本文识别了两个关键未解决的挑战：语义相关性的全局结构不一致性和难样本判别不确定性。为了解决这些问题，我们提出了RankVR，一个通过全局结构一致性和动态价值感知构建鲁棒CIR模型的框架。具体来说，我们引入了全局结构一致性感知（GSCP）模块，该模块利用相关矩阵的有效秩将干净样本从结构噪声中解耦。通过测量秩差异，GSCP识别出破坏宏观语义对称性的样本。此外，我们开发了自适应语义价值校准（ASVC）模块，以区分高价值的难干净样本。通过整合训练潜力和可靠性，它动态量化每个三元组的语义价值，确保有效利用难样本，同时抑制以逻辑冲突为特征的噪声。在FashionIQ和CIRR基准数据集上的大量实验表明，RankVR显著优于现有最先进方法，验证了其在噪声环境中的卓越鲁棒性。

英文摘要

Composed Image Retrieval (CIR) constitutes a pivotal paradigm requiring models to perform joint reasoning on reference images and modification texts. However, the prevalence of Noisy Triplet Correspondence (NTC) in large-scale datasets severely constrains model performance. Existing denoising methods either target binary mismatches or rely on scalar-based point-wise estimation, neglecting rich global structural correlations among sample populations and dynamic value variations during training, thereby yielding suboptimal results. This paper identifies two critical unresolved challenges: Global Structural Inconsistency of Semantic Correlations and Hard Sample Discrimination Uncertainty. To address these, we propose RankVR, a framework designed to construct a robust CIR model via global structure consistency and dynamic value perception. Specifically, we introduce the Global Structure Consistency Perception (GSCP) module, which utilizes the Effective Rank of the Correlation Matrix to decouple clean samples from structural noise. By measuring rank difference, GSCP identifies samples disrupting macroscopic semantic symmetry. Furthermore, we develop the Adaptive Semantic Value Calibration (ASVC) module to distinguish high-value hard clean samples. By integrating training potential and reliability, it dynamically quantifies the semantic value of each triplet, ensuring effective utilization of hard samples while suppressing noise characterized by logical conflicts. Extensive experiments on the FashionIQ and CIRR benchmark datasets demonstrate that RankVR significantly outperforms existing state-of-the-art methods, validating its superior robustness in noisy environments.
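
For readers unfamiliar with the term, a common definition of effective rank is the exponential of the Shannon entropy of the normalized singular values. The sketch below implements that definition; whether RankVR uses exactly this form is an assumption, since the abstract does not state it. The function takes singular values as input (computing them from a correlation matrix is left to a linear algebra library).

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """exp(H(p)) with p_i = s_i / sum(s), H the Shannon entropy.

    Equals the number of singular values when they are all equal and
    approaches 1 when a single singular value dominates.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


flat = effective_rank([1.0, 1.0, 1.0, 1.0])      # evenly spread spectrum
peaked = effective_rank([1.0, 0.0, 0.0, 0.0])    # one dominant direction
skewed = effective_rank([10.0, 1.0, 1.0, 1.0])   # in between
```

Because the measure is continuous, small changes in the spectrum produce small changes in the value, which makes differences in effective rank usable as a signal, unlike the integer-valued matrix rank.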

2606.11884 2026-06-11 cs.CV cs.CR 新提交

Image Quality Assessment of Identity Cards Using Measures from Open Face Image Quality

使用开放人脸图像质量度量对身份证进行图像质量评估

Gregor Grote, Juan E. Tapia, Christian Rathgeb

发表机构 * da/sec - Biometrics and Internet Security Research Group, Hochschule Darmstadt（达姆施塔特应用科学大学生物识别与互联网安全研究组）

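AI summary: Applies the capture-related quality measures of the OFIQ standard to identity card images, proposes a preprocessing pipeline, and analyzes how these measures correlate with the performance of three presentation attack detection (PAD) algorithms, showing that quality assessment based on certain OFIQ measures can significantly improve PAD performance.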

2606.11966 2026-06-11 cs.CV 新提交

Feature extraction for plant growth estimation

用于植物生长估计的特征提取

Simbarashe Aldrin Ngorima, Albert Helberg, Marelie H. Davel

发表机构 * Faculty of Engineering, North-West University（西北大学工程学院）； Centre for Artificial Intelligence Research（人工智能研究中心）； National Institute for Theoretical and Computational Sciences（国家理论与计算科学研究所）

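AI summary: To meet the need for real-time plant growth stage estimation in precision agriculture, proposes two feature extraction methods (Gabor filters with morphological operations, and pre-trained CNNs with transfer learning) and tests them on a public dataset; the CNN approach outperforms hand-crafted features in both speed and accuracy, and the best system (VGG-19 features with an RBF SVM) reaches 98.4% accuracy at 0.08 seconds per image.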

Comments: 13 pages

AI中文摘要

精准农业需要实时估计植物生长阶段。当植物生长阶段已知时，可以减少栽培中资源（如养分和水）的浪费，因为只需供应所需的资源。然而，不同生长阶段的植物具有相似的形态特征，这可能使自主生长阶段估计变得困难。本文提出了两种用于生长阶段估计的特征提取方法：一种使用Gabor滤波器组和形态学操作，另一种使用预训练卷积神经网络（CNN）和迁移学习。我们在公开的植物生长阶段数据集（“bccr-segset”）上测试了这些方法，该数据集包含两种在室内条件下生长和捕获的物种：油菜和小萝卜。使用支持向量机和提升树作为分类器，比较了两种提出的特征提取方法。我们发现两种方法都适用于实时应用，并且CNN特征在速度和准确性方面均优于手工特征。最佳系统（VGG-19特征，使用径向基函数支持向量机分类）对两个物种均获得了98.4%的准确率，处理一张图像仅需0.08秒。

英文摘要

Precision agriculture requires the estimation of plant growth stages in real-time. When the plant growth stage is known, the wastage of resources in cultivation, such as nutrients and water, is reduced as only the required resources need to be supplied. Plants at different growth stages, however, have similar morphological features, which can make autonomous growth stage estimation difficult. This paper presents two feature extraction methods for growth stage estimation: one that uses a bank of Gabor filters and morphological operations, and the other that uses pre-trained convolutional neural networks (CNNs) and transfer learning. We test these methods on a publicly available plant growth stage dataset ("bccr-segset") for two species, canola and radish, grown and captured under indoor conditions. The two proposed feature extraction methods are compared, using support vector machines and boosted trees as classifiers. We find that both methods are suitable for real-time applications, and that CNN features outperform the hand-crafted features, both with regard to speed and accuracy. The best system (VGG-19 features, classified with a radial basis function support vector machine) obtained an accuracy of 98.4% for both species, processing an image in 0.08 seconds.
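
A "bank of Gabor filters" is a set of kernels that share a shape but differ in orientation (and often wavelength). The sketch below builds such a bank from the standard real-valued Gabor formula, a Gaussian envelope multiplied by a cosine carrier. The kernel size, sigma, wavelength, aspect ratio, and number of orientations are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def gabor_kernel(size: int, sigma: float, theta: float, wavelength: float,
                 gamma: float = 0.5, psi: float = 0.0) -> List[List[float]]:
    """Real Gabor kernel of shape size x size (size should be odd).

    theta is the orientation in radians, gamma the spatial aspect ratio,
    and psi the phase offset of the cosine carrier.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# Four orientations: 0, 45, 90, and 135 degrees.
bank = [gabor_kernel(7, sigma=2.0, theta=k * math.pi / 4, wavelength=4.0)
        for k in range(4)]
```

Convolving an image with each kernel and summarizing the responses (for example, by their mean and variance) yields a texture feature vector that a classifier such as an SVM can consume.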

2606.12023 2026-06-11 cs.CV 新提交

ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

ViT-FREE：通过早期退出和合成自适应实现高效人脸识别

Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros

发表机构 * Fraunhofer Institute for Computer Graphics Research IGD, Germany（德国弗劳恩霍夫计算机图形学研究所IGD）； Department of Computer Science, TU Darmstadt, Germany（德国达姆施塔特工业大学计算机科学系）

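AI summary: Proposes ViT-FREE, a framework that uses early exiting in pre-trained ViTs to perform face verification from intermediate layers without modifying or retraining the backbone, enabling efficient inference; further proposes ViT-FREE_FT, a lightweight fine-tuning strategy that adapts only the projection layers using synthetic data to improve shallow-exit performance.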

Comments: Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)

AI中文摘要

视觉Transformer（ViT）在计算机视觉中获得了显著关注，并显示出在人脸识别（FR）方面的强大潜力。然而，其高计算成本使得在资源受限设备上部署具有挑战性，这促使需要平衡效率和准确性的方法。在这项工作中，我们研究了预训练ViT中的早期退出作为一种简单且无需训练的高效FR推理策略。利用Transformer编码器块之间统一的特征维度，我们引入了ViT-FREE，一个多退出框架，可以直接从中间表示进行人脸验证，而无需修改或重新训练骨干模型，从而降低推理成本。实验表明，补丁嵌入和注意力图在深度上逐渐演化，相邻ViT块之间具有高度相似性，并且与最终表示的对齐程度逐渐增加。这表明特征逐步细化和注意力收敛，表明中间层已经提供了适合早期退出的稳定且具有判别性的表示。通过在多个FR基准上的广泛实验，我们系统地分析了不同退出深度的准确性-效率权衡。结果表明，较晚的退出实现了非常有利的平衡，在第10层退出在IJB-C等基准上实现了高达20%的加速，同时验证性能仅下降1.5。此外，我们提出了ViT-FREE_FT，一种轻量级的退出特定微调策略，仅使用小型合成数据集适配投影层，同时保持Transformer骨干冻结。这种方法提高了浅层退出的性能，同时保留了效率优势，并且对较深退出几乎没有影响。

英文摘要

Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating the need for methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, thereby reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5 drop in verification performance on benchmarks such as IJB-C. We also propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.

2606.12036 2026-06-11 cs.CV 新提交

Vision Transformers for Face Recognition Need More Registers

人脸识别的视觉Transformer需要更多寄存器

Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros

发表机构 * Fraunhofer Institute for Computer Graphics Research IGD（弗劳恩霍夫计算机图形研究所）； Department of Computer Science, TU Darmstadt（达姆施塔特工业大学计算机科学系）

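AI summary: To address artifacts in the attention maps of ViTs for face recognition, introduces register tokens to improve interpretability; the ViT-8R model achieves the best performance on IJB-B and IJB-C.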

Comments: Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)

AI中文摘要

近期，用于人脸识别（FR）的视觉Transformer（ViT）的进展已超越了标准的CLS令牌范式。在该范式中，一个特殊的分类令牌（CLS）被前置到补丁嵌入中，并用作输入的下游任务表示。另一种方法，即拼接补丁嵌入（CPE），则通过将所有补丁令牌拼接成一个单一向量来利用它们，然后将其投影为紧凑的人脸表示。与基于CLS的方法相比，CPE已被证明能提高识别性能，但我们对注意力图的定性分析显示存在限制其可解释性的伪影。为解决此问题，我们引入了寄存器令牌，这些可学习令牌被拼接到初始补丁嵌入中，并通过ViT编码器块联合处理。与基线ViT相比，该机制已被证明能产生更结构化和可解释的注意力图。我们通过实验证明，这些伪影在各种ViT骨干网络（包括小型和大型模型）中一致出现，而引入寄存器令牌能有效缓解它们。添加四个或八个寄存器显著增强了可解释性，其中八个寄存器提供了最高的验证准确率和最平滑的注意力结构。我们最终的模型ViT-8R，对应一个基于CPE的ViT-B架构并增加了八个寄存器令牌，在大规模IJB-B和IJB-C基准测试中，在基于ViT的FR模型中达到了最先进的性能。此外，与基线模型相比，ViT-8R产生了明显更清晰的注意力图，这为模型的注意力行为提供了更深入的见解（此 https URL ）。

英文摘要

Recent advances in Vision Transformers (ViTs) for face recognition (FR) have moved beyond the standard CLS-token paradigm. In this paradigm, a special classification token (CLS) is prepended to the patch embeddings and used as a representation of the input for downstream tasks. An alternative approach, Concatenated Patch Embeddings (CPE), instead leverages all patch tokens by concatenating them into a single vector, which is then projected into a compact face representation. CPE has been shown to improve recognition performance compared with CLS-based approaches, but our qualitative analysis of attention maps showed the presence of artifacts that limit their interpretability. To address this issue, we incorporate register tokens: learnable tokens that are concatenated to the initial patch embeddings and processed jointly through the ViT encoder blocks. This mechanism has been shown to produce more structured and interpretable attention maps compared to baseline ViT. We empirically demonstrate that these artifacts consistently appear across various ViT backbones, including small and large models, and that introducing register tokens effectively mitigates them. Adding four or eight registers significantly enhances interpretability, with eight registers providing the highest verification accuracies and smoothest attention structures. Our resulting model, ViT-8R, a CPE-based ViT-B architecture augmented with eight register tokens, achieves state-of-the-art performance among ViT-based FR models on the large-scale IJB-B and IJB-C benchmarks. ViT-8R also produces substantially clearer attention maps than the baseline model, which offer deeper insight into the model's attention behavior (this https URL).

2606.12051 2026-06-11 cs.CV New submission

MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID

Xulin Li, Yan Lu, Bin Liu, Qinhong Yang, Qi Chu, Tao Gong, Nenghai Yu

Affiliations: University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; The Chinese University of Hong Kong

AI summary: Proposes the Multi-Frequency Expert Network (MFEN), which adaptively combines frequency bands through multi-frequency modulation and a mixture-of-experts design, together with Random Frequency Augmentation and Frequency Auxiliary Optimization, to address the modality gap between visible and infrared images.

Comments: CVPR Highlight

Abstract

Visible-infrared person re-identification (VI-ReID) is challenging due to the large modality discrepancy between visible and infrared images. We contend that this discrepancy is largely related to differing lighting conditions, including differences in light wavelength and light source type. Recently, frequency-based VI-ReID approaches have achieved notable success because frequency information can better extract identity-relevant contours and details while excluding irrelevant lighting and color. However, existing methods either do not distinguish different frequency bands or focus on only one band, which is insufficient under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different bands through a mixture-of-experts design. We further introduce Random Frequency Augmentation (RFA) and Frequency Auxiliary Optimization (FAO) to better train MFEN. The three modules are complementary and jointly capture critical frequency-domain details for robust representation learning. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.
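The idea of splitting a signal into frequency bands and recombining them with a gate can be sketched as follows. This is a minimal illustration on a 1-D signal using a hand-written DFT, not MFEN itself: the band edges, the softmax gate, the one-expert-per-band layout, and all function names are assumptions, and in a real network the gate scores would be predicted from the input rather than fixed.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft(x: Sequence[float]) -> List[complex]:
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spec: Sequence[complex]) -> List[float]:
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def split_bands(x: Sequence[float], edges: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Return one band-limited copy of x per (low, high) frequency-index range."""
    n = len(x)
    spec = dft(x)
    bands = []
    for low, high in edges:
        # min(k, n - k) folds negative frequencies onto positive ones.
        masked = [spec[k] if low <= min(k, n - k) < high else 0j for k in range(n)]
        bands.append(idft_real(masked))
    return bands


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(bands: Sequence[Sequence[float]], gate_scores: Sequence[float]) -> List[float]:
    """Weight each band (one hypothetical expert per band) by a softmax gate and sum."""
    weights = softmax(gate_scores)
    return [sum(w * band[t] for w, band in zip(weights, bands))
            for t in range(len(bands[0]))]


signal = [math.sin(2 * math.pi * t / 8) + 0.5 * math.sin(2 * math.pi * 3 * t / 8)
          for t in range(8)]
bands = split_bands(signal, edges=[(0, 2), (2, 5)])   # low band, high band
fused = mix_experts(bands, gate_scores=[0.0, 0.0])    # equal gate: average of the bands
```

Because the two index ranges cover every frequency exactly once, the bands sum back to the original signal, and changing the gate scores shifts the emphasis between coarse and fine structure.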

2606.12074 2026-06-11 cs.CV cs.AI eess.IV New submission

Non-frontal face recognition using GANs and memristor-based classifiers

Semih Vazgecen, Cristian Sestito, Spyros Stathopoulos, Themis Prodromakis

Affiliations: Centre for Electronics Frontiers, Institute for Integrated Micro and Nano Systems, School of Engineering, The University of Edinburgh

AI summary: Combines lightweight GAN-based pose frontalisation with memristor-based neuromorphic recognition to handle non-frontal face recognition, reaching up to 96% identification accuracy on the evaluated datasets.

Comments: 12 pages, 4 figures, 1 Supplementary (22 pages, 16 figures, 6 tables, 4 supplementary notes)

Abstract

Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.

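2606.12258 2026-06-11 cs.CV New submission

Bridging Day and Night: Unsupervised Cross-Domain Re-Identification with Synergistic Prompt and Prototype Learning

Jiyang Xu, Rui Liu, Hang Dai

Affiliations: School of Computer Science, Wuhan University

AI summary: Proposes an unsupervised day-night re-identification framework that combines prompt learning with prototype-based representation learning and uses two-stage training to associate identities across domains without annotations, with performance comparable to fully supervised methods.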

Abstract

Cross-domain day-night re-identification (ReID) is fundamentally challenged by the substantial visual appearance discrepancies between daytime and nighttime scenes. Existing fully supervised methods rely heavily on labor-intensive annotations, which are costly and exhibit limited generalization across domains. In this work, we investigate unsupervised day-night ReID and propose a novel framework that synergistically combines prompt learning and prototype-based representation learning to associate identities across domains without requiring manual labels. Our approach follows a progressive two-stage training strategy. In the first stage, we exploit the vision-language model to generate instance-specific textual prompts in an annotation-free manner. We employ an instance-level alignment mechanism to embed visual features and textual prompts into a unified semantic space, aligning unlabeled day/night images with learnable prompts via instance-aware dynamic-bias adaptation. In the second stage, we construct domain-specific prototype memory banks and introduce two complementary modules: i) an intra-domain identity association module to enhance feature discriminability within each domain, and ii) a cross-domain prototype matching module to reliably identify positive and negative prototype pairs, thereby establishing robust identity correspondences across day and night. Extensive experiments on public benchmarks validate the effectiveness of our method. Under the unsupervised setting, our framework attains Rank-1 accuracy comparable to state-of-the-art fully supervised methods.

2606.12294 2026-06-11 cs.CV eess.IV New submission

Bridging the Modality Gap in Forensic Image Retrieval

Ricardo González-Gazapo, Annette Morales-González, Yoanna Martínez-Díaz, Heydi Méndez-Vázquez, Milton García-Borroto

Affiliations: Advanced Technologies Application Center (CENATAV); Centro de Sistemas Complejos, Facultad de Física, Universidad de La Habana

AI summary: Proposes a unified retrieval framework that uses a multimodal large language model to generate textual descriptions and fuses visual and textual features, improving retrieval precision and robustness on forensic tasks such as tattoo and face-sketch retrieval.

Comments: 23 pages, 5 figures, paper submitted to Elsevier journal

Abstract

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.
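The score-level fusion mentioned in the abstract can be illustrated with a small sketch. The weighted sum with a single weight `alpha`, the cosine similarity, the dictionary layout, and the toy embeddings are all assumptions made for illustration; the abstract states only that text-based and image-based similarity scores are combined.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_ranking(
    query_img: Sequence[float],
    query_txt: Sequence[float],
    gallery: Dict[str, Dict[str, Sequence[float]]],
    alpha: float = 0.5,
) -> List[str]:
    """Rank gallery ids by alpha * image similarity + (1 - alpha) * text similarity."""
    scores = {
        item_id: alpha * cosine(query_img, emb["img"])
        + (1 - alpha) * cosine(query_txt, emb["txt"])
        for item_id, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


gallery = {
    "tattoo_a": {"img": [0.6, 0.8], "txt": [0.0, 1.0]},
    "tattoo_b": {"img": [1.0, 0.0], "txt": [1.0, 0.0]},
}
# A noisy sketch query: the image embedding leans toward the wrong item,
# while the generated description matches the right one.
query_img, query_txt = [0.8, 0.6], [1.0, 0.0]

image_only = fused_ranking(query_img, query_txt, gallery, alpha=1.0)
fused = fused_ranking(query_img, query_txt, gallery, alpha=0.5)
```

In this toy case the image-only ranking puts `tattoo_a` first, while the fused ranking promotes `tattoo_b`, which mirrors the claim that fusion helps most when the visual query is limited or noisy.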

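2305.06145 2026-06-11 cs.CV Updated version

Causal Clothes-Invariant Feature Learning for Cloth-Changing Person Re-ID

Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Yating Liu, Qi Chu, Mang Ye, Wanli Ouyang, Nenghai Yu

AI summary: To address feature failure caused by clothing changes in cloth-changing person re-identification, proposes Causal Clothes-Invariant Learning (CCIL), which blocks the clothing shortcut through causal intervention to learn clothes-invariant features, reaching Rank-1 accuracies of 66.4% on PRCC and 59.2% on DeepChange.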

Abstract

In cloth-changing person re-identification (CC-ReID), it is critical to learn clothes-invariant features, which provide discriminative ID features that remain robust against clothing changes. However, a spurious correlation currently prevents existing ReID methods from effectively extracting these clothes-invariant features. This spurious correlation arises from clothing ownership: clothing is rarely shared across different identities, so models tend to memorize clothing cues for identity recognition, and this strategy generalizes poorly to unseen clothing. In this paper, we propose Causal Clothes-Invariant Learning (CCIL), which explicitly shifts CC-ReID from likelihood learning P(Y|X) to causal intervention learning P(Y|do(X)) to block the clothing shortcut. CCIL realizes this intervention through three modules: a Confounder Dictionary, an Intervention Module, and Disentangle Regularization. The causality-based modeling makes the entire model naturally clothes-invariant, effectively preventing the capture of spurious correlations in feature learning. Extensive experiments validate the effectiveness of CCIL. On the PRCC and DeepChange datasets, CCIL achieves Rank-1 accuracies of 66.4% and 59.2%, outperforming state-of-the-art methods by 1.4 and 4.1 percentage points, respectively.
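For readers unfamiliar with do-notation, the standard backdoor adjustment from causal inference shows how the interventional distribution differs from the observational one. Treating the clothing confounder as a discrete variable C that satisfies the backdoor criterion is an assumption made here for illustration; the abstract does not give CCIL's exact formulation.

```latex
% Observational likelihood: each confounder value c is weighted by P(c | X),
% so clothing that co-occurs with an identity leaks into the prediction.
P(Y \mid X) = \sum_{c} P(Y \mid X, c)\, P(c \mid X)

% Backdoor adjustment: intervening on X removes the dependence of C on X,
% so each confounder value is weighted by its prior P(c) instead.
P(Y \mid do(X)) = \sum_{c} P(Y \mid X, c)\, P(c)
```

Under this reading, a confounder dictionary would plausibly supply the set of values c over which the sum is taken, although that mapping is an interpretation rather than something the abstract states.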

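2412.09023 2026-06-11 cs.CV Updated version

STEAM: Squeeze and Transform Enhanced Attention Module

Rishabh Sabharwal, Ram Samarth B B, Parikshit Singh Rathore, Punit Rathore

AI summary: Proposes STEAM, a constant-parameter attention module based on multi-head graph transformers that models channel and spatial attention jointly, improving CNN performance with almost no increase in computation (GFLOPs).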

Abstract

Channel and spatial attention mechanisms introduced in earlier work enhance the representational capabilities of deep convolutional neural networks (CNNs) but often increase parameter and computational costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms the leading modules, ECA and GCT, in terms of accuracy while achieving a threefold reduction in GFLOPs. The code will be made available upon acceptance.

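2509.14860 2026-06-11 cs.CV cs.AI cs.CL cs.MA Updated version

MARIC: Multi-Agent Reasoning for Image Classification

Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee

AI summary: Proposes MARIC, a multi-agent framework that decomposes image classification into a collaborative reasoning process in which an Outliner Agent, Aspect Agents, and a Reasoning Agent analyze the image from multiple perspectives and synthesize the results, significantly outperforming baselines on four benchmark datasets.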

Comments: 11 pages, preprint

Abstract

Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine-tuning to achieve competitive performance. While recent vision-language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single-pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi Agent based Reasoning for Image Classification (MARIC), a multi-agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine-grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through an integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on four diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.

2606.11231 2026-06-11 cs.CV New submission

CFCamo: A Counterfactual Detect-or-Abstain Framework for Camouflaged Object Detection

Suhang Li, Osamu Yoshie, Yuya Ieiri

Affiliations: Graduate School of Information, Production and Systems, Waseda University

AI summary: Proposes the CFCamo framework, which uses counterfactual paired training and policy optimization so that a COD model outputs a detection when a target is present and abstains when none is, addressing the over-detection bias caused by positive-only training.

Comments: 10 pages, 7 figures, 5 tables. Code and data: this https URL

Abstract

Vision-language reinforcement learning has recently shown strong target-present localization for camouflaged object detection (COD). Yet localization is only one side of the decision: when the agent faces an ordinary image with no camouflaged target, will it still claim that a camouflaged object exists? Standard COD training and evaluation data are positive-only, so agents optimized under this setting can acquire an over-detect bias, a task-specific form of object hallucination that standard COD evaluation leaves unmeasured. To quantify this target-absent behavior, we construct Counterfactual COD (CF-COD), a paired benchmark that removes the camouflaged target from each held-out COD evaluation image while preserving a plausible background. CF-COD evaluates whether a model detects the target on the original image and abstains on the target-absent counterfactual, summarized by Pair Accuracy (PA). We further introduce CFCamo, a paired counterfactual framework for COD with abstention. For training, CFCamo optimizes a Qwen3-VL-4B-Instruct agent with Counterfactual Sequence Policy Optimization (CSPO), which samples paired original-counterfactual rollouts and uses a Counterfactual Paired Reward (CPR) to couple original-image detection with counterfactual abstention. On CAMO-test, CFCamo improves S_alpha by +3.7 pp over the prior RL-based COD baseline; across CF-COD, it reaches 80.0-90.8% PA. Ablations show that removing counterfactual coupling reduces PA to 1.4-5.2% despite strong target-present COD scores, showing that target-present evaluation alone does not characterize detect-or-abstain behavior. Overall, these results indicate that CFCamo improves COD agents by coupling target-present detection with target-absent abstention, rather than merely strengthening target-present localization. Code and data are available at this https URL.
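The Pair Accuracy (PA) metric described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the benchmark's official scorer: reducing each outcome to a boolean is an assumption (the abstract does not say how a detection on the original image is judged correct), and the function name and toy data are hypothetical.

```python
from typing import Iterable, Tuple


def pair_accuracy(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of pairs where the model detects on the original image
    and abstains on its target-absent counterfactual.

    Each pair is (detected_on_original, abstained_on_counterfactual).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(1 for detected, abstained in pairs if detected and abstained)
    return correct / len(pairs)


# An over-detecting model: always claims a target, never abstains.
over_detector = [(True, False)] * 4
# A better-calibrated model: correct on three of four pairs.
calibrated = [(True, True), (True, True), (True, True), (True, False)]

pa_over = pair_accuracy(over_detector)
pa_calibrated = pair_accuracy(calibrated)
```

The over-detecting model scores perfectly on target-present images yet gets a PA of zero, which is the failure mode the ablation in the abstract points to.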

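2606.11285 2026-06-11 cs.CV New submission

EventRadar: Long-Range Visual UAV Discovery through Spatiotemporal Event Sensing

Zhiting Zhou, Xingchen Liu, Xinglin Yu, Jiashen Chen, Haoyang Wang, Jingao Xu, Yunhao Liu, Xinlei Chen

AI summary: To tackle long-range detection of small UAV targets, proposes EventRadar, which uses an event camera to capture propeller-induced temporal periodicity and combines Scene-Anchored Geometry Evidence (SAGE) with the Comb-guided Harmonic-Group Learned Iterative Shrinkage and Thresholding Algorithm (CHG) to achieve high-accuracy detection at 700-1500 m.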

Abstract

Unauthorized unmanned aerial vehicle (UAV) activity around airports, public venues, and other sensitive sites has made protected-airspace monitoring increasingly important. A practical sensing system must search a wide angular region, find small long-range targets, and return both bearing support and UAV-specific evidence before a restricted perimeter is breached. Existing UAV detection paths often rely on spatially organized evidence, such as body extent, silhouette, or track continuity. At long range, however, these cues become difficult to preserve and verify as the target footprint weakens and its image-plane support shrinks. EventRadar follows a complementary cue: propeller-induced temporal periodicity, which recent event-camera sensing studies have shown can reveal UAV-specific motion after appearance becomes weak. We extend this cue to kilometer-scale active sensing with an event-camera prototype. Scene-Anchored Geometry Evidence (SAGE) fuses scanning events with IMU pose to maintain a bearing-indexed scene memory, separating transient candidate support from persistent background clutter. Comb-guided Harmonic-Group Learned Iterative Shrinkage and Thresholding Algorithm (CHG) then treats each candidate as a weak high-rate timing signal and recovers phase-insensitive harmonic evidence with fixed compute. Compared with related event-camera baselines on 700-1500 m UAV event recordings, EventRadar achieves 0.990 mAP$_{.3}$ and 0.949 F1$_{.3}$, reduces FN$_{.3}$ to 0.009, and shows real-time feasibility in prototype profiling.
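The notion of phase-insensitive harmonic evidence can be illustrated with a simple, non-learned comb score over event timestamps. This is not the CHG algorithm from the abstract, which is a learned iterative shrinkage method; the scoring rule, the number of harmonics, the function name, and the synthetic 100 Hz event train are all assumptions chosen to show the underlying signal-processing idea.

```python
import cmath
import math
from typing import Sequence


def comb_score(timestamps: Sequence[float], f0: float, harmonics: int = 3) -> float:
    """Mean normalized spectral magnitude of an event train at f0, 2*f0, ..., harmonics*f0.

    Taking the magnitude discards phase, so the score does not depend on
    where in the rotation cycle the recording starts.
    """
    n = len(timestamps)
    total = 0.0
    for h in range(1, harmonics + 1):
        coeff = sum(cmath.exp(-2j * math.pi * h * f0 * t) for t in timestamps)
        total += abs(coeff) / n
    return total / harmonics


# Synthetic event train: one event every 10 ms (100 Hz) for one second,
# with an arbitrary 3 ms phase offset.
events = [0.003 + k / 100.0 for k in range(100)]

on_target = comb_score(events, f0=100.0)    # close to 1
off_target = comb_score(events, f0=130.0)   # close to 0
```

A candidate fundamental that matches the event periodicity scores near one regardless of the phase offset, while a mismatched candidate scores near zero.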

2606.11546 2026-06-11 cs.CV New submission

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

Hao Zhang, Qinran Lin, Linqi Song, Yong Li

Affiliations: Chongqing University; City University of Hong Kong

AI summary: Proposes VL-DINO, in which a QPSC module constructs high-quality positive samples to strengthen vision-language alignment, a VSE module distills CLIP visual knowledge, and an ORSA module aligns region features with text embeddings, reaching 36.3/38.1 AP in zero-shot detection on LVIS.

Abstract

Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.

2606.11572 2026-06-11 cs.CV New submission

FreqKD: Frequency-Decoupled Cross-Modal Knowledge Distillation for Infrared Object Detection

Keval Thaker, Venkatraman Narayanan, Abdalmalek Aburaddaha, Samir A. Rawashdeh

Affiliations: University of Michigan-Dearborn

AI summary: To address the modality gap between RGB and infrared images, proposes FreqKD, a frequency-decoupled distillation framework that applies a strict MSE loss to low-frequency components and a relaxed log-MSE loss to high-frequency components, improving on the DINOv2 baseline by 2.4 mAP50 on KAIST.

Abstract

Transfer learning from large-scale RGB foundation models to infrared (IR) imagery through knowledge distillation (KD) remains challenging due to fundamental differences in image formation physics. We investigate the spectral structure of the RGB-IR modality gap and observe that feature divergence is not uniform across spatial frequencies: low-frequency components (shape, layout) show greater cross-modal alignment than high-frequency components (texture, fine edges), which reflect modality-specific characteristics. Based on this analysis, we propose FreqKD, a frequency-decoupled distillation framework that applies asymmetric supervision adapted to each band's cross-modal consistency. The method employs strict mean squared error (MSE) on the low-frequency band to preserve shared structural information and a relaxed log-MSE loss (weighted at 0.1) on the high-frequency band to provide edge guidance while tolerating texture differences. Spectral divergence analysis on 500 paired samples shows that high-frequency divergence exceeds low-frequency divergence by a factor of 2.4x on average across all analysed transformer layers. On KAIST multispectral pedestrian detection, FreqKD achieves 64.1 mAP50, improving 2.4 points over the DINOv2 baseline. The learned representation transfers across datasets (FLIR ADAS, +2.1 mAP50), tasks (MFNet segmentation, +1.85 mean intersection-over-union), and architectures (ResNet-50, +1.0 mAP50). Code is available at: this https URL
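The asymmetric supervision described above can be sketched on 1-D feature vectors. Several details are assumptions made for illustration because the abstract does not specify them: the low band is extracted with a circular 3-tap box filter, the high band is the residual, and "log-MSE" is taken to mean log(1 + MSE). Only the 0.1 weight on the high-frequency term comes from the abstract.

```python
import math
from typing import List, Sequence


def low_pass(x: Sequence[float]) -> List[float]:
    """Circular 3-tap box filter used as a stand-in low-frequency extractor."""
    n = len(x)
    return [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def freq_decoupled_loss(student: Sequence[float], teacher: Sequence[float],
                        high_weight: float = 0.1) -> float:
    low_s, low_t = low_pass(student), low_pass(teacher)
    high_s = [s - l for s, l in zip(student, low_s)]
    high_t = [t - l for t, l in zip(teacher, low_t)]
    strict = mse(low_s, low_t)                   # strict MSE on the low band
    relaxed = math.log1p(mse(high_s, high_t))    # assumed form of the log-MSE term
    return strict + high_weight * relaxed


teacher = [1.0, 2.0, 3.0, 2.0]
shifted = [t + 2.0 for t in teacher]   # differs only in the constant, lowest-frequency term

loss_same = freq_decoupled_loss(teacher, teacher)
loss_shifted = freq_decoupled_loss(shifted, teacher)
```

A constant offset lands entirely in the low band and is penalized at full strength, whereas high-band differences pass through the logarithm and the 0.1 weight, which is one way to tolerate texture mismatch between modalities.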

2606.11779 2026-06-11 cs.CV New submission

Battery detection of XRay images using transfer learning

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

Affiliations: Ruhr West University of Applied Sciences

AI summary: Uses transfer learning with a YOLOv5m model to detect batteries in X-ray images and classify three types of lithium-ion batteries, achieving 94% detection accuracy with a 22 ms inference time.

2606.11837 2026-06-11 cs.CV cs.AI New submission

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

Liwen Yi, Xianlin Zhang, Yue Zhang, Yue Ming, Xueming Li

Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes LASA, which aggregates Vision Transformer attention maps across layers to perform open-vocabulary semantic segmentation of scene sketches under weak supervision, markedly improving segmentation accuracy and spatial coherence.

Abstract

Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by +3.43, +8.01, and +15.74 over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.

2606.12066 2026-06-11 cs.CV New submission

Performance Analysis of YOLOv11 and YOLOv8 for Mixed Traffic Object Detection under Adverse Weather Conditions in Developing Countries

Quoc Thuan Nguyen, Ha Anh Vu, Ngo Dang Thanh Ngan, Minh Phuc Hoang Ngoc

AI summary: For mixed-traffic scenes under adverse weather in developing countries, evaluates YOLOv11n against YOLOv8n on a fused dataset; YOLOv11n improves precision by 3.2% while requiring 22% fewer FLOPs, giving a better balance of accuracy and efficiency.

Abstract

In modern vehicular systems, robust performance under harsh conditions has become a critical problem for autonomous driving. Our study delivers a comprehensive evaluation of the newest iteration of the YOLO series, the YOLOv11 Nano architecture, benchmarked against the widely adopted YOLOv8 Nano baseline on a custom fused dataset that combines the Indian Driving Dataset (IDD) [1] and the Berkeley Deep Drive Dataset (BDD100K) [2]. We analyze the trade-offs among detection accuracy, inference speed, and computational efficiency in high-entropy scenarios involving dense mixed traffic, rain, and low-light conditions. Specifically, YOLOv11n achieves a mean Average Precision (mAP@50) of 46.6%, with a notable 3.2% improvement in Precision over the baseline, effectively reducing false positives in cluttered scenes. Furthermore, YOLOv11n exhibits enhanced energy efficiency, requiring 22% fewer FLOPs (6.3G vs. 8.1G) while maintaining a real-time inference speed of 70.9 FPS on a Tesla T4 GPU, offering an optimal trade-off for safety-critical edge deployment.
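The efficiency figures in the abstract can be checked with a few lines of arithmetic. The three input numbers are taken from the abstract; the variable names are illustrative, and the per-frame latency is a derived value that assumes frames are processed one at a time.

```python
flops_v8n = 8.1    # GFLOPs for YOLOv8n (from the abstract)
flops_v11n = 6.3   # GFLOPs for YOLOv11n (from the abstract)
fps_v11n = 70.9    # frames per second on a Tesla T4 (from the abstract)

# Relative FLOPs saving of YOLOv11n with respect to YOLOv8n.
flops_reduction_pct = (flops_v8n - flops_v11n) / flops_v8n * 100

# Time budget per frame implied by the reported throughput.
latency_ms = 1000.0 / fps_v11n

print(f"FLOPs reduction: {flops_reduction_pct:.1f}%")
print(f"Per-frame latency: {latency_ms:.1f} ms")
```

The computed reduction rounds to the 22% quoted in the abstract, and the reported throughput corresponds to roughly 14 ms per frame.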

2606.12371 2026-06-11 cs.CV New submission

A Turbo-Inference Strategy for Object Detection and Instance Segmentation

Zhen Zhao, Gang Zhang, Xiaolin Hu, Liang Tang

Affiliations: School of Technology, Beijing Forestry University; Department of Computer Science and Technology, Tsinghua University; Beijing National Research Center for Information Science and Technology, Tsinghua University; Chinese Institute for Brain Research (CIBR)

AI summary: Proposes a turbo-inference strategy that iteratively exploits the complementary information between detection and segmentation; a turbo-detection head and a turbo-segmentation head form a closed loop that improves the accuracy of both without retraining.

Comments: Preprint version of an article published in Computer Vision and Image Understanding

Abstract

Object detection and instance segmentation tasks are closely related. Existing top-down instance segmentation methods usually follow a detect-then-segment paradigm, where an initial detector is used to recognize and localize objects with bounding boxes, followed by the segmentation of an instance mask within each bounding box. In such methods, the detection accuracy directly influences the subsequent segmentation performance. However, previous research has seldom explored the impact of the instance segmentation task on object detection. In this paper, we present a turbo-inference strategy for top-down methods that iteratively leverages the complementary information between the detection and segmentation tasks. Specifically, we design two modules, a turbo-detection head and a turbo-segmentation head, which facilitate communication between the tasks. The two modules form a closed loop that interlaces the detection and segmentation results without retraining the model. Comprehensive experiments on the COCO, iFLYTEK, and Cityscapes datasets demonstrate that our method substantially enhances both detection and segmentation accuracies with a certain increase in computational cost. The proposed method represents a tradeoff between prediction accuracy and inference speed. Code is available at this https URL.
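The closed loop between detection and segmentation can be sketched as a generic alternating procedure. This is a schematic under stated assumptions, not the paper's turbo heads: `segment` and `refine_box` are hypothetical callables, the fixed iteration count is an assumption, and the toy stand-ins below only tighten a box around a known square object.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Mask = List[Tuple[int, int]]              # foreground pixel coordinates (toy representation)


def turbo_inference(
    boxes: List[Box],
    segment: Callable[[Box], Mask],
    refine_box: Callable[[Box, Mask], Box],
    iterations: int = 2,
) -> Tuple[List[Box], List[Mask]]:
    """Alternate segmentation and box refinement for a fixed number of rounds."""
    masks = [segment(b) for b in boxes]
    for _ in range(iterations):
        boxes = [refine_box(b, m) for b, m in zip(boxes, masks)]  # masks inform boxes
        masks = [segment(b) for b in boxes]                        # boxes inform masks
    return boxes, masks


# Toy stand-ins: the object occupies pixels 2..5 in x and y;
# the mask is the part of the object that falls inside the box.
def toy_segment(box: Box) -> Mask:
    x1, y1, x2, y2 = box
    return [(x, y) for x in range(2, 6) for y in range(2, 6)
            if x1 <= x <= x2 and y1 <= y <= y2]


def toy_refine(box: Box, mask: Mask) -> Box:
    if not mask:
        return box
    xs = [p[0] for p in mask]
    ys = [p[1] for p in mask]
    return (float(min(xs)), float(min(ys)), float(max(xs)), float(max(ys)))


final_boxes, final_masks = turbo_inference([(0.0, 0.0, 9.0, 9.0)], toy_segment, toy_refine)
```

Each extra round costs one more pass through both steps, which matches the accuracy-versus-speed tradeoff noted in the abstract.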

2504.12556 2026-06-11 cs.CV 版本更新

Contour Field based Elliptical Shape Prior for the Segment Anything Model

基于轮廓场的椭圆形状先验用于Segment Anything模型

Xinyu Zhao, Faqiang Wang, Li Cui, Yuping Duan, Jun Liu

AI总结针对SAM难以高效生成椭圆形状分割结果的问题，提出一种参数化椭圆轮廓场约束方法，通过变分法和对偶算法将椭圆先验与图像特征融合，提升特定任务分割精度。

AI中文摘要

椭圆形状先验信息在提高医学和自然图像特定任务的分割精度方面起着至关重要的作用。现有的基于深度学习的分割方法，包括Segment Anything模型（SAM），通常难以高效地生成具有椭圆形状的分割结果。本文提出了一种新方法，利用变分法将椭圆形状先验集成到基于深度学习的SAM图像分割技术中。该方法建立了一个参数化的椭圆轮廓场，约束分割结果与预定义的椭圆轮廓对齐。利用对偶算法，该模型将图像特征与椭圆先验和空间正则化先验无缝集成，从而大大提高了分割精度。通过将SAM分解为四个数学子问题，我们集成变分椭圆先验设计了一种新的SAM网络结构，确保SAM的分割输出由椭圆区域组成。在特定图像数据集上的实验结果表明，该方法优于原始SAM。

英文摘要

The elliptical shape prior information plays a vital role in improving the accuracy of image segmentation for specific tasks in medical and natural images. Existing deep learning-based segmentation methods, including the Segment Anything Model (SAM), often struggle to produce segmentation results with elliptical shapes efficiently. This paper proposes a new approach to integrate the prior of elliptical shapes into the deep learning-based SAM image segmentation techniques using variational methods. The proposed method establishes a parameterized elliptical contour field, which constrains the segmentation results to align with predefined elliptical contours. Utilizing the dual algorithm, the model seamlessly integrates image features with elliptical priors and spatial regularization priors, thereby greatly enhancing segmentation accuracy. By decomposing SAM into four mathematical sub-problems, we integrate the variational ellipse prior to design a new SAM network structure, ensuring that the segmentation output of SAM consists of elliptical regions. Experimental results on some specific image datasets demonstrate an improvement over the original SAM.

2508.09459 2026-06-11 cs.CV cs.AI 版本更新

RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

RelayFormer: 一种用于可扩展图像和视频篡改定位的统一局部-全局注意力框架

Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia

AI总结提出RelayFormer统一框架，通过全局局部中继（GLR）令牌和中继注意力机制，适应不同分辨率并统一处理图像与视频，在篡改定位任务中实现高效且性能优越。

AI中文摘要

视觉篡改定位（VML）旨在识别图像和视频中被篡改的区域，随着高级编辑工具的兴起，这一任务变得日益具有挑战性。现有方法面临两个核心问题。首先是分辨率多样性：调整大小或填充可能会扭曲微妙的取证线索，并引入不必要的计算成本。其次是将面向图像的空间模型扩展到视频的时空输入的困难，这通常导致为两种数据类型维护单独的架构。为了解决这些挑战，我们提出了RelayFormer，一个统一框架，能够适应不同分辨率并自然处理静态和时序视觉数据。RelayFormer将输入划分为固定大小的子图像，并引入全局局部中继（GLR）令牌，通过基于中继的注意力机制传播结构化上下文。这种设计使得全局线索（如语义或时间一致性）的高效交换成为可能，同时保留细粒度的篡改伪影。与依赖统一调整大小或稀疏注意力的先前方法不同，RelayFormer以最小的开销扩展到可变分辨率和视频序列。跨多个基准的实验表明，其具有优越的性能和强大的效率，结合了无需插值或过多填充的分辨率适应性、图像和视频的统一处理，以及准确性和计算成本之间的有利平衡。代码可在此 https URL 获取。

英文摘要

Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two central issues. The first is resolution diversity. Resizing or padding can distort subtle forensic cues and introduce unnecessary computational cost. The second is the difficulty of extending spatial models for images to spatio-temporal inputs in videos, which often results in maintaining separate architectures for the two data types. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and naturally handles both static and temporal visual data. RelayFormer partitions inputs into fixed-size sub-images and introduces Global Local Relay (GLR) tokens that propagate structured context through a relay-based attention mechanism. This design enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior approaches that depend on uniform resizing or sparse attention, RelayFormer scales to variable resolutions and video sequences with minimal overhead. Experiments across diverse benchmarks demonstrate superior performance and strong efficiency, combining resolution adaptivity without interpolation or excessive padding, unified processing for images and videos, and a favorable balance between accuracy and computational cost. Code is available at this https URL.

2512.16415 2026-06-11 cs.CV 版本更新

CountZES: Counting via Zero-Shot Exemplar Selection

CountZES: 通过零样本示例选择进行计数

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

AI总结针对零样本计数中示例质量差导致计数不准的问题，提出CountZES方法，通过检测锚定、密度引导和特征共识三阶段协同选择多样化示例，提升计数准确性。

AI中文摘要

在零样本（ZS）设置下，复杂场景中的目标计数尤其具有挑战性，其中仅使用类别名称对未见类别的实例进行计数。现有的ZS计数方法通常依赖现成的开放词汇检测器（OVD）从文本推断示例，但在密集场景中，这些方法会受到语义噪声、外观变异和多实例提议的影响。或者，采用随机图像块采样，但无法准确描绘目标实例。由于计数对示例质量敏感，此类选择策略通常产生代表性差的示例，导致计数估计不准确。为解决这些问题，我们提出CountZES，一种通过零样本示例选择进行目标计数的纯推理方法。CountZES通过三个协同阶段发现多样化的示例：检测锚定示例（DAE）、密度引导示例（DGE）和特征共识示例（FCE）。DAE细化OVD检测以分离出精确的单实例示例。DGE引入密度驱动的自监督范式，识别统计一致且语义紧凑的示例，而FCE通过特征空间聚类增强视觉一致性。这些阶段共同产生互补的示例集，平衡了文本基础、计数一致性和特征代表性。在多个数据集上的实验表明，CountZES在零样本计数方法中表现出优越性能，同时有效跨领域泛化。

英文摘要

Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and multi-instance proposals. Alternatively, random image-patch sampling is employed, which fails to accurately delineate object instances. Since counting is sensitive to exemplar quality, such selection strategies often yield poorly representative exemplars, leading to inaccurate count estimation. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate that CountZES achieves superior performance among ZS counting methods while generalizing effectively across domains.

2604.13326 2026-06-11 cs.CV 版本更新

Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

正确区域，错误标签：相关性偏移下分割中的语义标签翻转

Akshit Achara, Yovin Yahathugoda, Nick Byrne, Michela Antonelli, Esther Puyol Anton, Alexander Hammers, Andrew P. King

AI总结研究语义分割中因非因果特征与标签的虚假相关性导致的标签翻转问题，提出翻转诊断指标和基于熵的无标签翻转风险评分。

Comments: Author name correction in this version

AI中文摘要

机器学习模型的鲁棒性可能因输入数据中非因果特征与目标标签之间的虚假相关性而受损。测试此类相关性的常见方法是在标签与某些非因果线索强烈关联的数据上训练模型，然后在关联不再成立的示例上进行评估。这一思想在分类任务中已得到充分验证，但对于语义分割，具体的失败模式尚不明确。我们表明，模型可能实现合理的重叠，但分配了错误的语义标签，将一个合理的前景类交换为另一个，即使对象边界大致正确。我们聚焦于这种语义标签翻转行为，并通过一个简单的诊断指标（Flip）进行量化，该指标统计真实前景像素被分配错误前景身份但仍被预测为前景的频率。在训练过程中类别与场景相关的设置下，增加相关性会持续扩大常见与罕见测试条件之间的差距，并增加反事实组内这些对象内部的标签交换。总体而言，我们的结果通过将前景错误分解为正确像素、翻转身份像素和遗漏至背景像素，激励在分布偏移下超越重叠来评估分割鲁棒性。我们还提出了一种基于熵、无需真实标签的“翻转风险”评分，该评分从前景身份不确定性计算得出，并表明它可以在推理时标记易翻转的案例。代码可在此 https URL 获取。

英文摘要

The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based, ground truth label-free 'flip-risk' score, which is computed from foreground identity uncertainty, and show that it can flag flip-prone cases at inference time. Code is available at this https URL.
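
The decomposition of foreground errors described in this abstract can be sketched in a few lines. This is a minimal illustration and not the authors' released code: the names `foreground_breakdown` and `flip_risk`, the flat label lists, the background id `0`, and the normalised-entropy form of the risk score are all assumptions made for the example.

```python
import math


def foreground_breakdown(gt, pred, background=0):
    """Split ground-truth foreground pixels into correct / flip / missed fractions.

    gt and pred are flat sequences of integer class ids (hypothetical layout).
    """
    correct = flipped = missed = 0
    for g, p in zip(gt, pred):
        if g == background:
            continue  # only ground-truth foreground pixels are counted
        if p == g:
            correct += 1   # right region, right label
        elif p == background:
            missed += 1    # foreground lost to background
        else:
            flipped += 1   # still foreground, but wrong identity
    total = correct + flipped + missed
    if total == 0:
        return {"correct": 0.0, "flip": 0.0, "missed": 0.0}
    return {
        "correct": correct / total,
        "flip": flipped / total,
        "missed": missed / total,
    }


def flip_risk(fg_probs):
    """Assumed label-free score: normalised entropy over foreground classes."""
    total = sum(fg_probs)
    if total <= 0 or len(fg_probs) < 2:
        return 0.0
    probs = [p / total for p in fg_probs]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))
```

An overlap score computed on foreground versus background would treat the flipped pixels as hits, which is why the three fractions are reported separately here.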

2605.13674 2026-06-11 cs.CV cs.AI 版本更新

Weakly Supervised Segmentation as Semantic-Based Regularization

弱监督分割作为基于语义的正则化

Stefano Colamonaco, Andrei-Bogdan Florea, Jaron Maene

AI总结本文提出通过神经符号方法整合模糊逻辑与深度分割模型，利用弱标注和领域先验知识提升伪标签质量，从而取得常常优于密集监督基线的分割精度。

AI中文摘要

弱监督语义分割（WSSS）通过部分或粗略标注（如边界框、涂鸦或图像级标签）训练密集像素级分割模型。尽管近期工作利用Segment Anything Model（SAM）等基础模型生成伪标签，但这些方法通常依赖启发式的提示选择，且整合先验知识或异质标签的途径有限。本文从神经符号视角弥补这一不足：将可微模糊逻辑与深度分割模型相结合。弱标注和领域特定先验被统一为连续逻辑约束，用于在弱监督下微调SAM。优化后的基础模型随后生成改进的伪标签，我们据此训练一个无提示的第二阶段分割模型。在Pascal VOC 2012和REFUGE2视盘/视杯分割数据集上的实验表明，逻辑引导的微调产生了更高质量的伪标签，从而取得了最先进的分割精度，且常常超越密集监督基线。

英文摘要

Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image-level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints that fine-tune SAM under weak supervision. The refined foundation model then produces improved pseudo-labels, from which we train a second-stage prompt-free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic-guided fine-tuning yields higher-quality pseudo-labels, leading to state-of-the-art segmentation accuracy that often exceeds densely supervised baselines.

2605.20436 2026-06-11 cs.CV 版本更新

Lighting-aware Unified Model for Instance Segmentation

考虑光照的实例分割统一模型

Qisai Liu, Alloy Das, Zhanhong Jiang, Joshua R. Waite, Aditya Balu, Adarsh Krishnamurthy, Soumik Sarkar

AI总结本文提出了一种考虑光照的实例分割统一模型，通过开发Lighting Convolutional-Attention模块，在不微调重型主干网络的情况下提升分割鲁棒性，实验结果表明该方法能有效解决光照变化带来的领域差距问题。

AI中文摘要

像Segment Anything Model（SAM）这样的基础模型展示了令人印象深刻的零样本泛化能力，但在多样化的现实世界光照下经常退化，特别是在实例分割中。在本工作中，我们通过开发Lighting Convolutional-Attention来解决这一限制，这是一种无需微调重型主干网络即可增强分割鲁棒性的适配模块。该模块采用双分支架构同时处理RGB特征和对比图，使模型基于物理动机对结构性变化而非光照伪影敏感。我们通过成对训练策略优化该模块，引入一个有针对性的损失项，明确惩罚干净图像与其对应光照变体之间的差异。为了评估和支持这一架构，我们在多个现有基准上进行了全面的实证研究，并提出了一个基于Unity的新合成数据集，专门用于准确复现复杂的现实世界光照条件。广泛的实验结果表明，我们的方法成功地弥合了领域差距，实现了优越的光照鲁棒分割。

英文摘要

Foundation models like the Segment Anything Model (SAM) demonstrate impressive zero-shot generalization but frequently degrade under diverse real-world illumination, particularly for instance segmentation. In this work, we address this limitation by developing Lighting Convolutional-Attention, an adapter module that enhances segmentation robustness without fine-tuning the heavy backbone. The module employs a dual-branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize the module through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity-based synthetic dataset specifically designed to accurately replicate complex real-world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting-robust segmentation.

2606.10775 2026-06-11 cs.CV 版本更新

Spatially Selective Self-Training for Unsupervised Building Change Detection

空间选择性自训练用于无监督建筑变化检测

Wafaa I. M. Hussin, Zhi Lu, Anas M. I. Mohammed, Xiang Zhou, Ratiba A. H. Abubaker, Zhenming Peng

发表机构 * School of Information and Communication Engineering, University of Electronic Science and Technology of China（电子科技大学信息与通信工程学院）； Chengdu Yaguang Electronic Co., Ltd.（成都亚光电子股份有限公司）； Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China（电子科技大学智能协同计算实验室）； School of Civil Engineering, University of Khartoum（喀土穆大学土木工程学院）； National Energy Research Center, Ministry of Higher Education and Scientific Research（高等教育与科学研究部国家能源研究中心）

AI总结提出SST-CD框架，利用空间选择性自训练和局部一致性准则，从无标签双时相遥感图像中学习建筑变化检测器，在三个数据集上超越现有无监督方法。

Comments: Under Review

AI中文摘要

无监督建筑变化检测旨在从未标记的双时相遥感图像中学习建筑变化掩膜。现有的无标签方法通常遵循差异到掩膜范式，直接使用时相差异、冻结的基础模型响应、基于提示的输出或后处理结果作为最终变化图。尽管这些策略提供了无标注线索，但它们并未学习任务特定的建筑变化检测器，并且仍然容易受到通用时相差异与建筑定义的结构变化之间的差距的影响。在实践中，这种差异通常是嘈杂且与任务无关的，因为外观变化、配准误差和非建筑修改可能产生强烈但误导性的响应。为了解决这个问题，我们提出了SST-CD，一种空间选择性自训练框架，将完全无标签的建筑变化检测重新表述为在嘈杂伪监督下的端到端检测器学习。SST-CD使用时相差异作为候选伪标签，并仅在空间可靠像素上训练检测器，其可靠性通过局部一致性准则估计，该准则从监督中过滤不一致区域。为了进一步稳定嘈杂的自训练，一个轻量级特征适配器重新校准双时相特征，而基于原型的解码器产生紧凑的变化和无变化表示。在LEVIR-CD、WHU-CD和DSIFN-CD上的实验表明，SST-CD分别达到了83.08%、91.69%和86.60%的F1分数，优于现有的无监督和无标签基线。代码将公开提供。

英文摘要

Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision. SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines.
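
The idea of training only on spatially reliable pixels can be illustrated with a small sketch. The abstract does not define the local consistency criterion, so the neighbourhood-agreement rule below, the names `reliable_mask` and `masked_bce`, and the defaults `radius=1` and `min_agreement=0.75` are hypothetical choices for the example, not the SST-CD implementation.

```python
import math


def reliable_mask(pseudo, radius=1, min_agreement=0.75):
    """Mark a pixel reliable when enough neighbours share its pseudo label.

    pseudo is a 2D list of 0/1 candidate labels derived from temporal differences.
    """
    h, w = len(pseudo), len(pseudo[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            agree = count = 0
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    if (ny, nx) == (y, x):
                        continue  # skip the centre pixel itself
                    count += 1
                    agree += pseudo[ny][nx] == pseudo[y][x]
            mask[y][x] = count > 0 and agree / count >= min_agreement
    return mask


def masked_bce(probs, pseudo, mask, eps=1e-7):
    """Binary cross-entropy averaged over reliable pixels only."""
    total, n = 0.0, 0
    for prob_row, label_row, mask_row in zip(probs, pseudo, mask):
        for p, y, keep in zip(prob_row, label_row, mask_row):
            if not keep:
                continue  # unreliable pixels contribute no supervision
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            n += 1
    return total / n if n else 0.0
```

Under this rule an isolated changed pixel surrounded by unchanged ones is dropped from the loss, which is one simple way a noisy temporal difference could be kept out of the supervision signal.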

2606.11450 2026-06-11 cs.CV 新提交

Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition

探索自适应掩码重建用于自监督基于骨架的动作识别

Shengkai Sun, Zhiyong Cheng, Zefan Zhang, Jianfeng Dong, Zhihui Li, Meng Wang

发表机构 * Hefei University of Technology（合肥工业大学）； Jilin University（吉林大学）； Zhejiang Gongshang University（浙江工商大学）； University of Science and Technology of China（中国科学技术大学）

AI总结提出自适应掩码重建（AMR）框架，通过解耦编码器-解码器并引入自适应引导模块，加速预训练并提升下游动作识别精度，在多个数据集上超越现有方法。

Comments: Accepted by CVPR2026. The code is available at this https URL

AI中文摘要

最近，掩码骨架重建模型已成为强大的动作表示学习器，推动了自监督基于骨架的动作识别的重大进展。然而，现有的最先进方法必须预测极其大量的时空块，显著延长了训练时间。此外，通过在重建过程中平等对待所有时空区域，这些模型被分散了注意力，无法学习动作语义背后的关键运动模式。为了解决这些挑战，我们提出了自适应掩码重建（AMR），一个更快更强的预训练框架。我们首先将解码器与编码器解耦，使得能够灵活预测更大的时空块，并大幅降低重建复杂度。鉴于更大的块包含更复杂的信息，这难以预测并因此降低性能，我们相应地引入了一个自适应引导模块。该模块识别高运动信息量的区域，引导模型关注每个块中最具判别力的部分，并减轻重建难度。在NTU RGB+D 60、NTU RGB+D 120和PKU-MMD数据集上的实验表明，AMR不仅显著加速了预训练，还提高了下游识别精度，超越了当前最先进的方法。

英文摘要

Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.

2606.11645 2026-06-11 cs.CV 新提交

Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition

运动增强外观：用于微手势在线识别的RGB-骨架门控残差融合

Jialin Liu, Xinwen He, Pengyu Liu, Jiale Shi, Huaijuan Zang, Yanbin Hao

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China（合肥工业大学计算机与信息工程学院）

AI总结提出DyFADet+双流RGB-骨架框架，通过门控残差模块自适应融合骨架运动与RGB特征，实现微手势在线识别，在SMG数据集上F1达40.88，排名第二。

Comments: 13 pages, 2 figures

AI中文摘要

微手势分析因能从细微身体动作推断自发情绪而受到越来越多的关注。微手势在线识别，即在未修剪视频中定位和分类每个手势实例，是第四届EI-MiGA-IJCAI挑战赛的核心任务。与典型的时序动作检测相比，MGR强调动作的定位和分类，要求模型输出每个微手势的开始时间、结束时间和类别。此外，由于微手势高度自发，仅依赖单一模态难以捕捉完整准确的多模态线索。在这项工作中，我们提出DyFADet+，它将DyFADet扩展为双流RGB-骨架框架。在我们的模型中，两种模态都被投影到共享的多尺度时序嵌入中，并通过门控残差模块融合，该模块自适应地将骨架运动注入RGB表示，而不是使用简单的拼接。最后，这些融合特征由动态TAD头解码，用于在线分类和边界回归。在SMG数据集上，我们的方法取得了40.88的F1分数，在微手势在线识别赛道中排名第二。

英文摘要

Micro-gesture analysis attracts increasing attention for inferring spontaneous emotion from subtle body movements. Micro-gesture online recognition, which localizes and classifies each gesture instance in untrimmed videos, is a core task in the 4th EI-MiGA-IJCAI Challenge. Compared with typical temporal action detection, MGR emphasizes the localization and classification of actions, requiring the model to output the start time, end time, and category of each micro-gesture. Moreover, since micro-gestures are highly spontaneous, relying solely on a single modality makes it difficult to capture the complete and accurate multi-modal cues. In this work, we propose DyFADet+, which extends DyFADet into a dual-stream RGB-skeleton framework. In our model, both modalities are projected into shared multi-scale temporal embeddings and fused through a gated residual module, which adaptively injects skeleton motion into the RGB representation rather than using naive concatenation. Finally, these fused features are decoded by a Dynamic TAD head for online classification and boundary regression. On the SMG dataset, our method achieves an F1 score of 40.88, ranking 2nd in the Micro-gesture Online Recognition track.
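
A gated residual fusion of the kind described here can be written down compactly. The sketch below is an assumption-laden illustration rather than the DyFADet+ module: the per-channel gate, the parameter names `gate_w_rgb`, `gate_w_skel`, and `gate_bias`, and the plain-list feature vectors are invented for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_fusion(rgb, skel, gate_w_rgb, gate_w_skel, gate_bias):
    """Inject skeleton features into RGB features through a learned gate.

    All arguments are equal-length lists for one time step (hypothetical layout).
    fused[c] = rgb[c] + gate[c] * skel[c], with gate[c] in (0, 1).
    """
    fused = []
    for r, s, wr, ws, b in zip(rgb, skel, gate_w_rgb, gate_w_skel, gate_bias):
        gate = sigmoid(wr * r + ws * s + b)  # how much motion to let through
        fused.append(r + gate * s)           # residual: RGB path is kept intact
    return fused
```

Because the RGB feature passes through unchanged and the skeleton term is only added on top, a gate near zero falls back to the RGB-only representation, which concatenation cannot express.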

2606.11913 2026-06-11 cs.CV 新提交

From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations

从内容到知识：基于神经知识表示的闪电般快速长视频理解

Yuchen Guan, Xiao Li, Zongyu Guo, Xiaoyi Zhang, Xiulian Peng, Chun Yuan, Yan Lu

AI总结提出将长视频编码为神经知识表示（NKR），通过智能体知识蒸馏（AKD）自动合成描述和问答对，将视频知识嵌入VLM骨干网络的少量权重中，实现轻量级、可复用的视频理解，推理时无需重新加载视频，大幅降低延迟。

AI中文摘要

我们提出了一种新的长视频理解范式，将长视频视为神经知识表示（NKR）。NKR既不将视频内容表示为标记流，也不表示为预组织的数据库，而是作为附加到VLM骨干网络的一小部分网络权重。通过一种新颖的智能体知识蒸馏（AKD）过程优化NKR权重，以封装视频的语义内容，其中智能体自动合成密集描述和问答对，将视频知识蒸馏到NKR中。虽然AKD作为一次性的全面编码阶段，但生成的NKR将视频转换为可移植、可重用的资产。在推理时，轻量级NKR被挂载到冻结的视觉语言模型（VLM）上，实现直接的、基于查询的理解，无需重新加载或重新编码原始视频。这种方法将视频长度与推理成本解耦，为多轮视频理解提供了高摊销效率。在LVBench基准上的实验表明，我们的方法在实现与最先进方法相当的性能的同时，将端到端延迟降低了两个数量级以上，为交互式长视频理解开辟了新的可能性。

英文摘要

We propose a new paradigm for long video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represents video contents neither as a stream of tokens nor pre-organized databases, but as an individual small portion of network weights attached to the VLM backbone. The NKR weights are optimized to encapsulate the video's semantic content via a novel Agentic Knowledge Distillation (AKD) process, where an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video's knowledge into the NKR. While AKD serves as a comprehensive, one-time encoding phase, the resulting NKR transforms the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen Vision-Language Model (VLM), enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples video length from inference cost, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.

2606.12033 2026-06-11 cs.CV 新提交

SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

SpikeTAD：用于端到端时序动作检测的脉冲神经网络

Min Yang, Mi Zhou, Limin Wang

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University（南京大学计算机软件新技术国家重点实验室）

AI总结提出首个基于脉冲神经网络的端到端时序动作检测架构SpikeTAD，在保持极低功耗的同时，在THUMOS14和ActivityNet-1.3上分别达到67.2%和37.42%的平均mAP。

Comments: Accepted by Pattern Recognition

AI中文摘要

视频理解是计算机视觉的关键部分，具有众多应用场景。随着移动设备的日益普及，越来越多的努力试图在其上部署视频理解模型。然而，现有的视频理解模型由于体积大且功耗高而难以部署。脉冲神经网络（SNNs）相比人工神经网络（ANNs）显示出生物合理性和低功耗优势，尤其是在被视为未来移动设备关键组件的神经形态芯片上。然而，过长的转换时间步长和严重的性能退化问题限制了它们的应用。为了解决上述问题，我们探索了SNNs在时序动作检测（TAD）上的应用，这是视频理解中的重要任务，并提出了首个基于SNN的端到端TAD架构，称为SpikeTAD。在保持极低功耗的同时，SpikeTAD在THUMOS14上实现了67.2%的平均mAP，在ActivityNet-1.3上实现了37.42%的平均mAP，证明了低功耗TAD模型的可行性。我们的代码可在以下网址获取：此 https URL。

英文摘要

Video understanding is a crucial part of computer vision, with numerous application scenarios. With the increasing popularity of mobile devices, an increasing number of efforts are trying to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation problems limit their application. To solve the problems above, we explore the application of SNNs on temporal action detection (TAD), which is an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at this https URL.

2606.12047 2026-06-11 cs.CV cs.AI stat.ML 新提交

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

元数据感知的多提示推理用于零样本事故理解

Tarandeep Singh, Soumyanetra Pal, Soham Biswas, Nishanth Chandran

发表机构 * Netradyne

AI总结提出三阶段流水线，通过视觉-语言相似性、元数据驱动的多提示推理和开放词汇检测，实现零样本事故视频的时序定位、语义分类和空间定位，显著提升性能。

Comments: Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 15

AI中文摘要

在本文中，我们通过识别冲击事件发生的时间、类型以及帧中的位置，使用自然语言解决监控视频中事故的零样本理解问题。我们提出一个三阶段流水线，将事故理解分解为何时、何物和何地。第一阶段利用视觉-语言相似性提取冲击周围的短时间窗口。第二阶段，我们执行元数据驱动的多提示推理，包含五个互补视角（基线、运动、几何、对比和决胜），并通过熵门控成对裁决器解决分歧。最后，我们基于预测的事故类型和场景布局查询开放词汇检测器以定位冲击，并使用分数加权质心聚合关键帧上的检测结果。我们的流水线在零样本ACCIDENT @ CVPR基准测试上，相对于帧中心基线，调和平均分数有显著提升。我们表明，将零样本视频理解分解为时序定位、语义分类和空间定位，比直接提示更能实现视觉-语言模型的可靠推理。

英文摘要

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact with an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enables more reliable reasoning with vision-language models than direct prompting alone.
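
Two pieces of this pipeline lend themselves to a short sketch: the entropy gate over the prompt views and the score-weighted centroid over keyframe detections. The code below is illustrative only; the function names, the `(x1, y1, x2, y2, score)` tuple layout, and the 0.9-bit threshold are assumptions, and the pairwise adjudicator itself is not modelled.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (bits) of the labels predicted by the prompt views."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def needs_adjudication(votes, threshold=0.9):
    """Assumed gate: escalate to an adjudicator only when the views disagree enough."""
    return vote_entropy(votes) > threshold


def score_weighted_centroid(detections):
    """Aggregate box centres across keyframes, weighting each by detector score.

    detections: list of (x1, y1, x2, y2, score) tuples (hypothetical layout).
    Returns (cx, cy), or None when there is no positive-score detection.
    """
    total = sum(max(d[4], 0.0) for d in detections)
    if total <= 0:
        return None
    cx = sum(max(d[4], 0.0) * (d[0] + d[2]) / 2.0 for d in detections) / total
    cy = sum(max(d[4], 0.0) * (d[1] + d[3]) / 2.0 for d in detections) / total
    return (cx, cy)
```

Weighting by score lets a confident detection in one keyframe dominate several weak ones, while still returning a single point per video.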

2606.12125 2026-06-11 cs.CV 新提交

Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding

Q-Fold: 查询感知的焦点-上下文时空折叠用于长视频理解

Biao Tang, Xu Chen, Shuxiang Gou, Jingyi Yuan, Yuhan Zhang, Chenqiang Gao

发表机构 * Shenzhen Campus of Sun Yat-sen University（中山大学深圳校区）； Harbin Institute of Technology, Shenzhen（哈尔滨工业大学（深圳））； Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China（电子科技大学深圳高等研究院）

AI总结提出Q-Fold，一种无需训练的长视频输入构建框架，通过查询引导将相关片段保留为高保真焦点帧，不相关片段折叠为上下文布局，在固定预算下提升多模态大模型的长视频理解性能。

Comments: 10 pages, 5 figures, 8 tables. Code will be made publicly available

AI中文摘要

长视频理解对多模态大语言模型仍然具有挑战性，因为时间上延长的视频通常包含数千帧，因此穷举处理成本高昂。现有方法通常在有限的视觉预算下从长视频构建紧凑的视觉输入。然而，大多数方法仍然遵循以帧为中心的范式，并对保留的内容应用相似的表示，无论其重要性如何。这使得难以同时保留高保真视觉证据和广泛的时间覆盖。为了解决这个问题，我们提出了Q-Fold，一种无需训练的长视频理解输入构建框架。Q-Fold不将孤立帧作为基本建模单元，而是对连续的时间段进行操作，并在查询引导下构建异构的焦点-上下文表示。查询相关的片段被保留为高保真的焦点帧，而不太相关的片段被折叠成保持时间顺序的上下文布局。通过这种方式，Q-Fold保留了关键的视觉证据和广泛的时间覆盖，同时更好地保持了短片段内的局部时间连续性。在四个长视频基准测试和多个视频多模态大模型上的实验表明，Q-Fold在不增加输入预算的情况下持续提升性能。值得注意的是，它在一个超长视频基准测试上取得了高达9.1个百分点的提升。代码将公开提供。

英文摘要

Long-video understanding remains challenging for multimodal large language models, because temporally extended videos often contain thousands of frames and are therefore expensive to process exhaustively. Existing methods usually construct compact visual inputs from long videos under a limited visual budget. However, most of them still follow a frame-centric paradigm and apply similar representations to retained content regardless of its importance. This makes it difficult to preserve both high-fidelity visual evidence and broad temporal coverage. To address this issue, we propose Q-Fold, a training-free input construction framework for long-video understanding. Instead of treating isolated frames as the basic modeling unit, Q-Fold operates on contiguous temporal segments and constructs a heterogeneous Focus--Context representation under query guidance. Query-relevant segments are preserved as high-fidelity Focus Frames, while less relevant segments are folded into chronology-preserving contextual layouts. In this way, Q-Fold preserves critical visual evidence and broad temporal coverage, while better maintaining local temporal continuity within short segments. Experiments on four long-video benchmarks with multiple Video-MLLMs show that Q-Fold consistently improves performance without increasing the input budget. Notably, it achieves gains of up to 9.1 percentage points on an ultra-long video benchmark. Code will be made publicly available.
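
The split into Focus and Context segments can be pictured with a toy planner. The abstract does not specify how relevance is scored or how folding is laid out, so the top-k selection, the names `plan_segments` and `fold_runs`, and the grouping of consecutive context segments into one layout are assumptions for illustration.

```python
def plan_segments(relevance, focus_budget):
    """Label each temporal segment as 'focus' or 'context', in chronological order.

    relevance: query-relevance score per contiguous segment (assumed to be given).
    focus_budget: number of segments that may be kept at high fidelity.
    """
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    focus = set(ranked[:focus_budget])
    return [(i, "focus" if i in focus else "context") for i in range(len(relevance))]


def fold_runs(plan):
    """Merge consecutive context segments into a single folded layout entry."""
    folded = []
    for index, kind in plan:
        if kind == "context" and folded and folded[-1][0] == "context":
            folded[-1][1].append(index)  # extend the current folded layout
        else:
            folded.append((kind, [index]))
    return folded
```

For example, `fold_runs(plan_segments([0.1, 0.9, 0.2, 0.3, 0.8], 2))` keeps segments 1 and 4 as focus entries and folds segments 2 and 3 into one context entry, so the output stays in temporal order while spending the budget on the relevant parts.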

2606.12215 2026-06-11 cs.CV cs.IR cs.LG 新提交

MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching

MLT-Dedup：通过多级表示和时空匹配的高效大规模在线视频去重

David Yuchen Wang, Haoying Li, Hailun Xu, Wei Chee Yew, Zirui Zhu, Sanjay Saha, Hao Hei, Kanchan Sarkar, Kun Xu

发表机构 * TikTok Singapore（TikTok新加坡）； School of Computing, National University of Singapore（新加坡国立大学计算机学院）； TikTok San Jose（TikTok圣何塞）

AI总结提出MLT-Dedup框架，采用多级视频编码器提取细粒度帧级和稀疏片段级嵌入，结合差分特征增强相似性模块进行时空匹配，在90%精度下降低在线重复率91%，索引容量提升5倍。

Comments: Accepted by KDD-2026 ADS track

AI中文摘要

在线平台上用户生成视频内容的爆炸性增长伴随着大量近似重复视频的出现——这些视频相同或高度相似，但存在部分编辑差异。这些重复视频降低了用户体验，增加了存储和带宽成本，使得大规模视频去重成为一项关键任务。现有的视频去重框架在有限的索引预算下检索足够高质量候选视频方面面临根本性挑战，同时在效率和精度之间存在权衡。为了解决这些问题，我们提出了MLT-Dedup，一种基于多级表示和时空匹配的高效大规模在线视频去重框架。我们的方法采用多级视频编码器（ML-VE）提取细粒度的帧级嵌入和稀疏的片段级嵌入：稀疏嵌入支持高效的候选检索，而细粒度嵌入则用于精确的成对匹配。在匹配过程中，我们引入了DiF-SiM，一种差分特征增强相似性模块，能够定位重复的时间片段并提供可靠的相似性证据，以支持基于策略的去重决策。在真实大规模平台上的大量实验表明，MLT-Dedup在90%精度下将在线重复率降低了91%。此外，我们的稀疏检索设计使索引容量提升了5倍，从而在实际部署中实现了更广泛的候选覆盖。

英文摘要

The explosive growth of user-generated video content on online platforms is accompanied by the emergence of numerous near-duplicate videos--videos that are identical or highly similar but differ by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks face a fundamental challenge in retrieving sufficient high-quality candidates under a limited index budget, as well as trade-offs between efficiency and precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework with Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level and sparse clip-level embeddings: sparse embeddings support efficient candidate retrieval, while fine-grained embeddings are loaded for precise pairwise matching. During matching, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module capable of locating duplicated temporal segments and providing reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces online repetition rates by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.
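
The coarse-to-fine flow (sparse clip-level embeddings for retrieval, frame-level embeddings for pairwise matching) can be sketched as follows. This is a toy stand-in under stated assumptions: cosine similarity, the dictionary index layout, and the names `retrieve_candidates` and `frame_match_ratio` are hypothetical, and the simple thresholded ratio is not the DiF-SiM module.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve_candidates(query_clip, index, top_k=2):
    """Stage 1: rank indexed videos by clip-level similarity and keep the top k.

    index maps video id -> {"clip": vector, "frames": [vector, ...]} (assumed layout).
    """
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query_clip, item[1]["clip"]),
        reverse=True,
    )
    return [video_id for video_id, _ in ranked[:top_k]]


def frame_match_ratio(query_frames, cand_frames, threshold=0.9):
    """Stage 2: fraction of query frames with a near-identical candidate frame."""
    if not query_frames or not cand_frames:
        return 0.0
    hits = sum(
        1
        for q in query_frames
        if max(cosine(q, c) for c in cand_frames) >= threshold
    )
    return hits / len(query_frames)
```

Only the videos returned by the first stage need their frame-level embeddings loaded, which is where the index budget is saved in a design of this shape.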

2606.12300 2026-06-11 cs.CV cs.AI 新提交

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

小时级视频中的自然语言时间定位是一个搜索问题：基准与实证分解

Sukmin Seo, Geewook Kim

发表机构 * NAVER Cloud AI ； KAIST AI（韩国科学技术院人工智能系）

AI总结针对小时级视频的自然语言时间定位，提出搜索是主要瓶颈而非识别，发布首个开放小时级定位基准ExtremeWhenBench，并通过检索-定位混合方法显著提升性能。

Comments: 10 pages, 6 figures, Code and benchmark: this https URL

AI中文摘要

时间定位——根据自然语言查询返回视频中的区间$[t_s, t_e]$——是长视频的语言接口，但此前仅在短视频上研究；小时级自然语言定位的动态仍未充分探索。我们认为，在小时级尺度上，限制因素是搜索而非识别：视频-LLM的瓶颈不在于定位附近的事件，而在于根据自然语言查询搜索长视频的相关区域。为验证这一点，我们发布了ExtremeWhenBench，首个开放的小时级定位基准（194个视频上的2273个查询，平均时长75.7分钟，最长9小时），具有开放式查询分布。所有开放视频-LLM均表现不佳，而帧级检索基线优于它们；失败分类将85%的失败归因于搜索；检索-定位混合方法比单一视频-LLM提升了6.7倍——类似于开放域QA中的检索-读取模式。

英文摘要

Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but--given a natural-language query--by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM--mirroring retrieve-then-read in open-domain QA.
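A retrieve-then-ground pipeline of the kind the abstract describes can be sketched in two steps: a search step that picks a coarse window from per-frame relevance scores, and a grounding step that tightens it to an interval $[t_s, t_e]$. This is a minimal sketch under assumptions: the per-frame scores, the fixed window length, the threshold rule, and the temporal-IoU check are illustrative choices, not the paper's actual method or metric.

```python
def best_window(scores, win):
    """Search step: start index of the length-`win` window with the highest total score."""
    total = sum(scores[:win])
    best_total, best_start = total, 0
    for start in range(1, len(scores) - win + 1):
        total += scores[start + win - 1] - scores[start - 1]
        if total > best_total:
            best_total, best_start = total, start
    return best_start

def ground_in_window(scores, start, win, thresh, sec_per_frame):
    """Grounding step: span of above-threshold frames inside the window, in seconds."""
    hits = [i for i in range(start, start + win) if scores[i] >= thresh]
    if not hits:
        return None
    return (hits[0] * sec_per_frame, (hits[-1] + 1) * sec_per_frame)

def temporal_iou(pred, gt):
    """Intersection over union of two (t_s, t_e) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if inter > 0 and union > 0 else 0.0
```

The point of the split is that the search step touches every frame cheaply, while the more expensive grounding step only sees the shortlisted window.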


2409.18478 2026-06-11 cs.CV 版本更新

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

Temporal2Seq: 一个面向时序视频理解任务的统一框架

Min Yang, Zichen Zhang, Qian Dang, Limin Wang

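AI summary: Proposes Temporal2Seq, a unified framework that represents the outputs of temporal video understanding tasks as sequences of discrete tokens and trains one generalist model in a single architecture; it obtains reasonable results on TAD, TAS, and GEBD and outperforms single-task training.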


Comments: Accepted by CVIU

AI中文摘要

随着视频理解的发展，出现了大量用于片段级时序视频分析的任务，包括时序动作检测（TAD）、时序动作分割（TAS）和通用事件边界检测（GEBD）。尽管针对特定任务的视频理解模型在每个任务上都表现出色，但仍然缺乏一个能够同时处理多个任务的统一框架，这是下一代人工智能的一个有前景的方向。为此，在本文中，我们提出了一个单一的统一框架，称为Temporal2Seq，将这些时序视频理解任务的输出表述为离散token序列。通过这种统一的token表示，Temporal2Seq可以在单一架构内训练一个通用模型，用于不同的视频理解任务。在没有多任务学习（MTL）基准的情况下，我们通过借用TAD、TAS和GEBD任务的数据集，编制了一个全面的联合训练数据集。我们在三个任务的相应测试集上评估了我们的Temporal2Seq通用模型，结果表明Temporal2Seq能够在各种任务上产生合理的结果，并且在该框架上相比单任务训练具有优势。我们还研究了通用模型在不同任务的新数据集上的泛化性能，其表现优于特定模型。

英文摘要

With the development of video understanding, tasks for clip-level temporal video analysis have proliferated, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance on each task, a unified framework capable of addressing multiple tasks simultaneously is still lacking, even though it is a promising direction for the next generation of AI. To this end, we propose a single unified framework, termed Temporal2Seq, that formulates the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of the three tasks, demonstrating that Temporal2Seq produces reasonable results on various tasks and outperforms single-task training within this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, where it outperforms the task-specific models.
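To make the "sequence of discrete tokens" idea concrete, here is a minimal sketch of how temporal segments might be serialized. The vocabulary layout (time bins first, class tokens after), the bin count, and the function names are assumptions for illustration; the abstract does not give the actual tokenization.

```python
NUM_BINS = 100  # assumed number of quantized time bins

def encode_segments(segments, duration, num_bins=NUM_BINS):
    """Turn (start_sec, end_sec, class_id) segments into a flat token list.
    Time tokens occupy [0, num_bins); class tokens start at num_bins."""
    tokens = []
    for start, end, cls in segments:
        s = min(num_bins - 1, int(start * num_bins / duration))
        e = min(num_bins - 1, int(end * num_bins / duration))
        tokens += [s, e, num_bins + cls]
    return tokens

def decode_tokens(tokens, duration, num_bins=NUM_BINS):
    """Inverse mapping, up to quantization error of one bin."""
    segments = []
    for i in range(0, len(tokens) - 2, 3):
        s, e, c = tokens[i:i + 3]
        segments.append((s * duration / num_bins, e * duration / num_bins, c - num_bins))
    return segments
```

With a shared vocabulary like this, detection-style outputs (segments with classes) and boundary-style outputs (timestamps only) could in principle be emitted by one sequence decoder, which is the unification the abstract argues for.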


2506.21855 2026-06-11 cs.CV 版本更新

Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Periodic-MAE：用于rPPG估计的周期性视频掩码自编码器

Jiho Choi, Sang Jun Lee

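AI summary: Proposes Periodic-MAE, a self-supervised framework that uses periodicity-aware masking and physiological bandlimit constraints to learn generalizable spatio-temporal representations from unlabeled facial videos, improving remote photoplethysmography (rPPG) estimation.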


AI中文摘要

在本文中，我们提出Periodic-MAE，一种自监督框架，用于从无标签面部视频中学习周期性生理信号的通用时空表示。该方法利用掩码自编码器（MAE），通过重建掩码视频令牌学习高维面部表示，而不依赖远程光电容积描记法（rPPG）特定监督。为了明确地将表示学习与rPPG特征对齐，我们引入了一种基于视频重采样的周期性感知帧掩码策略，使编码器能够学习捕获与脉搏信号估计相关的准周期性时间模式的表示。此外，生理频带约束被集成到MAE预训练框架中，利用脉搏信号在频域的稀疏性，引导学习到的表示朝向生理上有意义的模式。预训练后，学习到的表示被迁移到下游rPPG估计任务，其中编码器作为通用特征提取器，从面部视频中恢复脉搏相关信号。我们在四个基准数据集（包括PURE、UBFC-rPPG、MMPD和V4V）上进行了广泛实验。此外，我们在无约束光照条件和受试者运动下收集的真实世界rPPG数据集上评估了所提方法。实验结果表明，Periodic-MAE持续改善了rPPG估计性能，特别是在具有挑战性的跨数据集和真实世界评估场景中。我们的代码可在以下网址获取：此 https URL。

英文摘要

In this paper, we propose Periodic-MAE, a self-supervised framework for learning generalizable spatio-temporal representations of periodic physiological signals from unlabeled facial videos. The proposed method leverages a masked autoencoder (MAE), which learns high-dimensional facial representations by reconstructing masked video tokens without relying on remote photoplethysmography (rPPG) specific supervision. To explicitly align representation learning with the characteristics of rPPG, we introduce a periodicity-aware frame masking strategy based on video resampling, enabling the encoder to learn representations that capture quasi-periodic temporal patterns relevant to pulse signal estimation. In addition, physiological bandlimit constraints are integrated into the MAE pre-training framework, exploiting the sparsity of pulse signals in the frequency domain to guide the learned representations toward physiologically meaningful patterns. After pre-training, the learned representations are transferred to downstream rPPG estimation, where the encoder serves as a generic feature extractor for recovering pulse-related signals from facial videos. We conduct extensive experiments on four benchmark datasets, including PURE, UBFC-rPPG, MMPD, and V4V. Moreover, we evaluate the proposed approach on a real-world rPPG dataset collected under unconstrained lighting conditions and subject motion. Experimental results demonstrate that Periodic-MAE consistently improves rPPG estimation performance, particularly in challenging cross-dataset and real-world evaluation settings. Our code is available at this https URL.
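One way to picture a physiological bandlimit constraint is as a penalty on spectral power that falls outside a plausible heart-rate band. The sketch below computes the fraction of a signal's power inside such a band with a naive DFT. The band edges (0.66 to 3.0 Hz, roughly 40 to 180 beats per minute) and the function name are assumptions for illustration; the abstract does not state the band or the loss the authors use.

```python
import cmath
import math

def band_power_ratio(signal, fs, lo=0.66, hi=3.0):
    """Fraction of (mean-removed) spectral power inside [lo, hi] Hz.
    `fs` is the sampling rate in Hz; uses a naive O(n^2) DFT for clarity."""
    n = len(signal)
    mean = sum(signal) / n
    x = [v - mean for v in signal]
    in_band = 0.0
    total = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(coeff) ** 2
        total += power
        if lo <= k * fs / n <= hi:
            in_band += power
    return in_band / total if total > 0 else 0.0
```

A training penalty could then be written as `1 - band_power_ratio(...)`, which is small for pulse-like signals and large for out-of-band content such as fast flicker.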


2512.11393 2026-06-11 cs.CV 版本更新

The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

N体问题：从单人物体中心视频进行并行执行

Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen

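AI summary: Introduces the N-Body Problem, predicting from a single-person egocentric video how N people could carry out the same tasks in parallel; a structured prompting strategy guides a vision-language model to reason about the 3D environment, object usage, and temporal dependencies, markedly raising action coverage and reducing conflicts on EPIC-Kitchens and HD-EPIC.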


Comments: project webpage: this https URL

AI中文摘要

人类可以直观地并行化复杂活动，但模型能否通过观察一个人来预测这一点？给定一个物体中心视频，我们引入N体问题：预测N个人如何假设性地执行同一组任务。目标是最大化加速，但将视频片段天真地分配给个人往往违反现实世界约束，导致物理上不可能的场景，例如两个人使用同一物体或占据同一空间。为了量化这一点，我们形式化了N体问题，并提出了一套度量标准来评估性能（加速、任务覆盖）和可行性（空间碰撞、物体冲突和因果约束）。作为概念验证，我们引入了一种结构化提示策略，引导视觉语言模型（VLM）推理3D环境、物体使用和时间依赖，从而产生可行的并行执行。在来自EPIC-Kitchens和HD-EPIC的100个视频上，对于N=2，我们的结构化提示相比Gemini 2.5 Pro的基线提示，动作覆盖率提高了45%，同时碰撞率、物体冲突和因果冲突分别降低了51%、52%和55%。

英文摘要

Humans can intuitively parallelise complex activities, but can a model predict this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: predicting how N individuals can hypothetically perform the same set of tasks. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To quantify this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). As a proof of concept, we introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies, producing a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, for $N = 2$, our structured prompt improves action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously reducing collision rates, object conflicts, and causal conflicts by 51%, 52%, and 55%, respectively.
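The performance and feasibility metrics named above can be pictured with a small sketch. The definitions used here (speed-up as original duration over parallel makespan, an object conflict as two people holding the same object in overlapping time spans) are plausible readings made for illustration, not the paper's formal definitions, and the schedule format is hypothetical.

```python
def speed_up(original_duration, schedule):
    """Original single-person duration divided by the parallel makespan.
    `schedule` maps person -> list of (start, end, objects) segments."""
    makespan = max(end for segs in schedule.values() for _, end, _ in segs)
    return original_duration / makespan

def object_conflicts(schedule):
    """Count pairs of time-overlapping segments, from different people,
    that use at least one common object."""
    people = list(schedule)
    conflicts = 0
    for i, p in enumerate(people):
        for q in people[i + 1:]:
            for s1, e1, objs1 in schedule[p]:
                for s2, e2, objs2 in schedule[q]:
                    overlap = s1 < e2 and s2 < e1
                    if overlap and objs1 & objs2:
                        conflicts += 1
    return conflicts
```

A naive split of the video can score a high speed-up while still being infeasible, which is why the two kinds of metric are reported together.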


2603.20190 2026-06-11 cs.CV 版本更新

CoVR-R: Reason-Aware Composed Video Retrieval

CoVR-R: 推理感知的组合视频检索

Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan

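AI summary: Proposes a reasoning-first, zero-shot approach that uses large multimodal models to infer the causal and temporal after-effects of an edit, and builds the CoVR-Reason benchmark for evaluation; it clearly outperforms strong baselines on implicit-effect subsets.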


Comments: 9 Pages, 3 Figures

AI中文摘要

组合视频检索（CoVR）旨在根据参考视频和文本修改找到目标视频。先前的工作假设修改文本完全指定了视觉变化，忽略了编辑产生的后效和隐含后果（例如，运动、状态转换、视角或持续时间线索）。我们认为成功的CoVR需要对这些后效进行推理。我们提出了一种推理优先的零样本方法，利用大型多模态模型（i）推断编辑所隐含的因果和时序后果，以及（ii）将得到的推理查询与候选视频对齐，无需任务特定的微调。为了评估CoVR中的推理能力，我们还提出了CoVR-Reason基准，该基准将每个（参考、编辑、目标）三元组与结构化的内部推理轨迹和具有挑战性的干扰项配对，这些干扰项需要预测后效而不是关键词匹配。实验表明，我们的零样本方法在召回率@K上优于强检索基线，并且在隐式效应子集上尤其出色。我们的自动和人工分析证实了检索结果中更高的步骤一致性和效果真实性。我们的发现表明，将推理纳入通用多模态模型可以通过明确考虑因果和时序后效来实现有效的CoVR。这减少了对任务特定监督的依赖，提高了对具有挑战性的隐式效应案例的泛化能力，并增强了检索结果的可解释性。这些结果指向了一个可扩展且原则性的可解释视频搜索框架。模型、代码和基准可在该网址获取。

英文摘要

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on Recall@K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at this https URL.
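The recall-at-K metric mentioned in the abstract has a standard definition: the fraction of queries whose target appears among the top K retrieved items. A minimal sketch, with hypothetical video identifiers:

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target is within the top-k ranked results.
    `ranked_lists[i]` is the ranking for query i; `targets[i]` is its ground truth."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)
```

Distractors that share keywords with the edit text but lack the implied after-effect would push the true target down the ranking, which is what lowers this score for keyword-matching baselines.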


2606.11289 2026-06-11 cs.CV 新提交

i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

i1: 一种简单且完全开放的强文本到图像模型配方

Boya Zeng, Tianze Luo, Shu Pu, Jucheng Shen, Taiming Lu, Gabriel Sarch, Zhuang Liu

Affiliations: Princeton University

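AI summary: Through 300+ controlled experiments, the paper systematically studies design choices for text-to-image diffusion models and presents i1, a 3B-parameter model trained only on publicly available datasets that outperforms the best existing fully open model by 29.5 percentage points on average across five benchmarks.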


Comments: Project page at this https URL

AI中文摘要

扩散模型持续推动文本到图像生成的进展。然而，将最近的进展归因于特定的建模和数据选择是困难的：最先进的开放权重模型提供的消融研究有限，并且不公开其训练数据和完整的训练细节。研究社区需要完全开放（权重、数据和代码）的模型作为进一步研究的基础；然而，现有的完全开放模型在性能上仍显著落后于领先模型。在本项目中，我们通过300多次控制实验（总计超过70万TPU v6e小时）系统研究了文本到图像扩散训练和推理中的建模与数据设计选择。我们的实验突出了几个经验发现（例如，等权重是混合策划数据集的强默认设置）和简单的设计决策（例如，更大的文本编码器适配器以最小的参数增加提升性能），用于训练强模型。在这些见解的指导下，我们训练了i1，一个仅使用公开可用数据集的3B参数文本到图像扩散模型。i1在五个代表性基准（GenEval、DPG、PRISM、CVTG-2K和LongText）上与领先模型竞争，并且平均超越现有最佳完全开放模型29.5个百分点。我们提供i1检查点、训练和推理代码以及数据处理流程。总之，我们的发现和i1配方为未来文本到图像扩散模型的开放研究建立了实践基础。我们的代码可从此https URL获取。

英文摘要

Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at this https URL.


2606.11670 2026-06-11 cs.CV cs.AI 新提交

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

ARGUS: 堆叠多视角身份马赛克注入用于主体保持的视频生成

Zijie Meng, Jiwen Liu, Yufei Liu, Chengzhuo Tong, Xiaoqiang Liu, Yuanxing Zhang, Yulong Xu, Pengfei Wan

Affiliations: Peking University; Kuaishou Technology; Xiamen University

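AI summary: Proposes the ARGUS framework, which uses Stacked Multi-View Identity Mosaic Injection (SMII) to represent identity as a compact dynamic distribution and combines it with an MLLM Identity Director, no-cross-pair counterfactual training, and other modules to reach state-of-the-art subject-preserving video generation.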


Comments: 13 pages, 3 figures

AI中文摘要

仅靠正面人脸相似度无法解决主体保持的视频生成问题：生成的人物必须在运动、大视角变化、表情变化、遮挡、尺度变化以及文本、首帧和身份参考之间的冲突中保持可识别。我们认为核心瓶颈在于点参考范式，该范式将身份坍缩为与姿态、配饰、光照、背景和相机统计纠缠的单一静态观测。我们提出了Argus，一个基于Wan的框架，核心是堆叠多视角身份马赛克注入（SMII）。SMII将MLLM选择的图像/视频身份证据转换为3*3堆叠马赛克，使马赛克与当前扩散时间同步，并将其作为负时间只读内存注入Wan的原生令牌空间。这使身份从外部清洁适配器或单个参考图像转变为紧凑的动态分布。围绕SMII，MLLM身份导演选择信息丰富的身份时刻并解决条件冲突，而无交叉对反事实训练、时间身份退火和自适应自相似性指导在没有配对主体-视频监督的情况下提高了鲁棒性。我们进一步发布了HardID-Celeb，一个公众人物身份压力基准，并引入YawScore和OccScore来探测大偏航和首帧遮挡鲁棒性。Argus在OpenS2V-Eval Human-Domain上达到了SOTA结果，总分为64.38，FaceSim为71.86，NexusScore为51.62，NaturalScore为79.14。在HardID-Celeb上，Argus获得了76.80的FaceSim，并在YawScore和OccScore上分别比最强基线提高了12.60和15.10分，证明了动态身份记忆和大规模反事实自监督对于主体保持视频生成非常有效。

英文摘要

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a $3\times3$ stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.


2606.11751 2026-06-11 cs.CV cs.AI 新提交

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

AnchorEdit: 通过因果记忆在多轮图像编辑中保持时间一致性

Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

Affiliations: University of Science and Technology of China; JD Explore Academy

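AI summary: Proposes AnchorEdit, the first autoregressive diffusion framework for multi-turn editing, which uses a causal memory mechanism and a self-rollout strategy to counter identity drift and error accumulation, maintaining high fidelity over 10+ interaction rounds.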


Comments: Code: this https URL

AI中文摘要

多轮图像编辑对于迭代设计至关重要，但当前模型在连续步骤中常面临身份漂移和误差累积。现有研究利用视频先验保持一致性，但其依赖的双向注意力与交互式编辑的因果、顺序性质根本不符。本文提出AnchorEdit，首个专为高分辨率、长期多轮编辑设计的自回归（AR）扩散框架。AnchorEdit通过三阶段训练课程弥合视频先验与因果推理之间的差距：保持身份的单轮预训练、使用新颖的自展开策略进行因果AR强制微调以缓解暴露偏差，以及用于高效4步生成的一致性蒸馏。在推理过程中，我们引入记忆机制来锚定初始主体身份，并确保在扩展编辑轨迹上的稳定外推。为评估性能，我们提供了一个新的高分辨率多轮编辑基准，旨在压力测试长期稳定性。大量实验表明，AnchorEdit达到了最先进的结果，即使在10轮以上的交互中也能保持卓越的主体保真度和指令遵循能力。

英文摘要

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.


2606.11838 2026-06-11 cs.CV 新提交

Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding

基于时空场景图基础的计划与验证视频奖励推理

Hyomin Kim, Junghye Kim, Joanie Hayoun Chung, Yoonjin Oh, Kyungjae Lee, Sungbin Lim, Sungwoong Kim

Affiliations: Korea University

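AI summary: Proposes SG-PVR, a video reward model that uses plan-and-verify reasoning and spatio-temporal scene graphs to systematically verify every condition in the prompt, achieving fine-grained semantic alignment and improving compositional alignment in text-to-video generation.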


AI中文摘要

文本到视频（T2V）生成的奖励模型指导后训练，但常在细粒度语义对齐上失败。我们将其归因于现有基于推理的奖励模型的两个结构弱点：它们没有系统地验证提示中描述的每个条件，并且支持每个判断的视觉证据在其自由形式推理中仍然是隐式的。我们提出SG-PVR，一种视频奖励模型，通过基于时空场景图的计划-验证推理来解决这些限制。验证计划将提示分解为原子声明，确保检查每个要求。时空场景图编码实体、属性和时间基础关系，从视频中提取并作为持久的结构化视觉参考贯穿推理过程。每个声明都针对视频和场景图进行验证，将判断锚定在明确的视觉证据上。SG-PVR在语义对齐（包括细粒度时间语义）上取得了强劲性能。作为测试时重排序器，它进一步增强了T2V生成中的组合对齐。

英文摘要

Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally-grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.
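The plan-and-verify idea (decompose the prompt into atomic claims, then check each against an explicit scene graph) can be pictured with a toy data structure. This is only a sketch under strong assumptions: SG-PVR is a learned reward model, whereas here the claims, the edge format `(subject, relation, object, t_start, t_end)`, and the exact-match verification rule are all hypothetical simplifications.

```python
def verify_plan(claims, scene_graph):
    """For each atomic claim (subject, relation, object), collect the scene-graph
    edges that support it, so every judgment carries explicit evidence."""
    report = {}
    for claim in claims:
        report[claim] = [edge for edge in scene_graph if edge[:3] == claim]
    return report

def alignment_score(report):
    """Fraction of claims with at least one supporting edge."""
    return sum(1 for evidence in report.values() if evidence) / len(report)
```

Because every claim in the plan gets an entry in the report, a missed requirement shows up as an empty evidence list rather than being silently skipped, which is the structural weakness the abstract targets.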


2606.11969 2026-06-11 cs.CV 新提交

SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

SpecLoR: 面向运动连贯文本到视频生成的频谱前瞻矫正

Xu Zhang, Yu Lu, Ruijie Quan, Zhaozheng Chen, Bohan Wang, Yi Yang

Affiliations: ReLER, College of Artificial Intelligence, Zhejiang University; Huawei Central Research Institute

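AI summary: Proposes SpecLoR, a plug-and-play inference method that reduces spatiotemporal inconsistencies in text-to-video generation through lookahead prediction and frequency-domain rectification; on Wan2.2 it markedly improves motion coherence with only 4 additional NFEs.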


AI中文摘要

流匹配通过潜在ODE采样实现了鲁棒的文本到视频生成。然而，速度逼近和数值离散误差不可避免地累积，导致采样轨迹漂移。因此，生成的视频常常遭受严重的时空不一致性。尽管如此，直接矫正这些漂移的噪声潜在变量具有挑战性：(i) 时间步相关的噪声掩盖了可靠的结构线索；(ii) 空间干预可能破坏复杂的局部几何结构，同时带来高昂的计算成本。为了解决这个问题，我们提出了频谱前瞻矫正（SpecLoR），一种即插即用的推理方法，通过前瞻预测绕过噪声，并通过将矫正转移到频域来规避时空纠缠，在频域中自然视频的通用统计先验易于获取。首先，在早期采样阶段，SpecLoR前瞻估计干净潜在变量 $z_{t,0}$ 并计算其3D时空频谱。接着，SpecLoR矫正幅度谱以匹配先验，保持相位不变。最后，将矫正后的状态重新加噪以恢复ODE积分。在Wan2.2上的实验表明，SpecLoR在多个基准上显著减少了物理伪影并增强了运动连贯性，且计算开销极小（仅增加4次NFE）。

英文摘要

Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, velocity approximation and numerical discretization errors inevitably accumulate, causing sampling trajectories to drift. Consequently, generated videos often suffer from severe spatiotemporal inconsistencies. Nevertheless, directly correcting these drifted, noisy latents is challenging: (i) timestep-dependent noise obscures reliable structural cues; (ii) spatial interventions risk disrupting intricate local geometry while incurring heavy computational costs. To address this, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses noise via lookahead prediction, and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent $z_{t,0}$ and computes its 3D spatiotemporal spectrum. Next, SpecLoR rectifies the amplitude spectrum to match the prior, leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs).
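The rectification step (adjust the amplitude spectrum toward a prior, leave the phase intact) can be illustrated on a 1D signal. This is a minimal sketch under assumptions: SpecLoR operates on a 3D spatiotemporal spectrum of the estimated clean latent, while the code below uses a naive 1D DFT, a hypothetical `prior_amp` target, and a linear blend controlled by a hypothetical `strength` parameter.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def rectify_amplitude(x, prior_amp, strength=1.0):
    """Blend each frequency's amplitude toward `prior_amp` while keeping its phase.
    `prior_amp` should be symmetric (prior_amp[k] == prior_amp[n - k]) so the
    result stays real-valued."""
    rectified = []
    for coeff, target in zip(dft(x), prior_amp):
        amp, phase = abs(coeff), cmath.phase(coeff)
        new_amp = (1.0 - strength) * amp + strength * target
        rectified.append(cmath.rect(new_amp, phase))
    return [value.real for value in idft(rectified)]
```

Keeping the phase fixed preserves where structures sit in the signal, while the amplitude change only redistributes energy across frequencies.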


2606.12012 2026-06-11 cs.CV 新提交

FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control

FitVTON: 通过身体-服装尺寸控制实现合身感知的虚拟试穿

Yiqun Ning, Ao Shen, Chenhang He, Lei Zhang

Affiliations: Department of Computing, The Hong Kong Polytechnic University; Nuvatech

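AI summary: To address the neglect of physical fit in existing virtual try-on, proposes FitVTON, which encodes garment-body size through structured text prompts, adds auxiliary heads that predict garment and exposed-body masks, and includes a texture rectification stage; evaluation on the real-world FittingEffect3K dataset shows better sizing accuracy and shape preservation.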


AI中文摘要

尽管基于扩散的虚拟试穿已经实现了令人印象深刻的视觉真实性，但大多数方法将任务视为2D修复，优先考虑纹理保持而非物理合理性。因此，它们通常生成看似合理的图像，但未能反映不同体型下真实的服装合身性。我们提出了FitVTON，一种在野外不同身体上的合身感知虚拟试穿模型。FitVTON通过结构化文本提示编码服装-身体尺寸，并从参数化服装模型的模拟试穿三元组中学习。为了改善服装轮廓的合身效果，我们引入了两个辅助头来预测服装和暴露身体的掩膜。我们进一步引入了一个纹理校正阶段，以改善模拟数据的真实外观。为了评估合身保真度，我们策划了一个真实世界数据集FittingEffect3K，并结合了基于VLM的评分协议。主观和定量实验表明，FitVTON展示了真实的合身保真度，在尺寸准确性和形状保持方面显著优于最先进的方法，同时保持了有竞争力的图像质量。项目页面：此https URL。

英文摘要

While diffusion-based virtual try-on has achieved impressive visual realism, most methods treat the task as 2D inpainting, prioritizing texture preservation over physical plausibility. Consequently, they often produce plausible-looking images that fail to reflect authentic garment fit across diverse body shapes. We present FitVTON, a fit-aware virtual try-on model for diverse bodies in the wild. FitVTON encodes garment-body size through structured text prompts and learns from simulated try-on triplets from a parameterized garment model. To improve the fit of garment silhouettes, we introduce two auxiliary heads that predict masks for the garment and the exposed body. We further introduce a texture rectification stage to improve the realism of appearance learned from simulated data. To evaluate fitting fidelity, we curate a real-world dataset, FittingEffect3K, together with a VLM-based scoring protocol. Both subjective and quantitative experiments show that FitVTON achieves authentic fitting fidelity, with significantly better sizing accuracy and shape preservation than state-of-the-art methods, while maintaining competitive image quality. Project Page: this https URL.


2606.12072 2026-06-11 cs.CV 新提交

World Model Self-Distillation: Training World Models to Solve General Tasks

世界模型自蒸馏：训练世界模型以解决通用任务

Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro

Affiliations: Department of Computer Science

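AI summary: Proposes a framework combining self-distillation with reinforcement learning to elicit task-solving ability from pretrained video generators without paired task videos; the resulting model surpasses the original model on benchmarks.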


AI中文摘要

预训练视频生成器是有前景的视觉世界模型，展现出涌现的任务解决能力；然而，它们对详细文本描述的依赖限制了其在规划和决策中的直接使用。现有方法要么将这种推理外包给语言或视觉-语言模型，要么依赖带有配对任务执行视频的监督微调，后者收集成本高且难以扩展。我们提出一个可扩展的框架，通过结合自蒸馏与强化学习来激发此类模型的任务解决能力。给定一张无标注场景图像，视觉-语言模型生成候选任务和详细的逐步解决方案。该解决方案条件化一个预训练视频扩散模型（演示者）；我们将其行为蒸馏到一个仅以图像和简短任务提示为条件的执行者中。这将执行知识从字幕引导生成转移到指令条件任务解决，无需精心策划的任务视频监督。我们进一步通过来自VLM反馈的强化学习改进执行者，利用判断采样视频是否满足任务与生成解决方案之间的不对称性。在我们提出的WorldTasks-Benchmark和DreamGen机器人基准上的实验表明，在我们基于VLM的评估协议下，执行者超越了演示者，并具有竞争力地迁移到机器人任务。

英文摘要

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.


2606.12153 2026-06-11 cs.CV cs.GR 新提交

TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation

TopoCap: 学习拓扑无关的运动先验用于单目视频到动画

Cheng-Feng Pu, Jia-Peng Zhang, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu

Affiliations: Zhili College, Tsinghua University; BNRist, Department of Computer Science and Technology, Tsinghua University; VAST

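AI summary: Proposes TopoCap, the first unified framework that extracts motion from monocular video and retargets it to characters with arbitrary, unseen skeletal topologies without test-time optimization, using a Graph CVAE to learn a universal motion manifold together with conditional flow matching.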


AI中文摘要

生成式3D资产的爆炸式增长创造了巨大的动画需求，然而当前的动作捕捉方法仍然脆弱，局限于特定物种的模板（例如SMPL）或需要劳动密集型的手动绑定。我们引入了TopoCap，这是第一个统一的框架，能够从单目视频中提取运动并将其重定向到具有任意、未见过的骨骼拓扑的角色，即从双足到六足和无生命物体，无需测试时优化。我们的关键洞察是，虽然骨骼结构是组合且离散的，但运动背后的物理占据了一个连续的、低维的流形。我们通过一个两阶段生成流水线实现了这一洞察。首先，我们使用图CVAE学习一个通用运动流形，该流形将异构的运动链压缩成共享的、固定长度的潜在代码。通过明确地以目标骨架的结构嵌入为条件对解码器进行条件化，我们将运动动力学与骨骼拓扑解耦。其次，我们将视频到动画视为一个条件流匹配问题，从视觉特征预测这些拓扑无关的代码。为了学习这种广义先验，我们引入了Mobjaverse，这是一个从Objaverse-XL整理的大规模数据集。它包含超过5000个独特的骨骼拓扑和200万帧，其结构多样性比现有数据集高出两个数量级。大量实验表明，\MethodMotion在人类和四足基准测试中优于专业模型，同时实现了对长尾3D生物的零样本重定向。数据集在此https URL公开。

英文摘要

The explosion of generative 3D assets has created a massive demand for animation, yet current motion capture methods remain brittle, restricted to species-specific templates (e.g., SMPL) or requiring labor-intensive manual rigging. We introduce TopoCap, the first unified framework capable of extracting motion from monocular video and retargeting it onto characters with arbitrary, unseen skeletal topologies, i.e., from bipeds to hexapods and inanimate objects, without test-time optimization. Our key insight is that while skeletal structures are combinatorial and discrete, the underlying physics of motion occupy a continuous, low-dimensional manifold. We materialize this insight via a two-stage generative pipeline. First, we learn a Universal Motion Manifold using a Graph CVAE that compresses heterogeneous kinematic chains into a shared, fixed-length latent code. By explicitly conditioning the decoder on a structural embedding of the target rig, we disentangle motion dynamics from skeletal topology. Second, we treat video-to-animation as a conditional flow matching problem, predicting these topology-agnostic codes from visual features. To learn this generalized prior, we introduce Mobjaverse, a massive-scale dataset curated from Objaverse-XL. Comprising over 5,000 unique skeletal topologies and 2 million frames, it exceeds the structural diversity of existing datasets by two orders of magnitude. Extensive experiments demonstrate that TopoCap outperforms specialist models on human and quadruped benchmarks while enabling zero-shot retargeting for the long tail of 3D creatures. The dataset is publicly available at this https URL.


2606.12213 2026-06-11 cs.CV 新提交

SHERPA: Seam-aware Harmonized ERP Adaptation for Open-Domain 360$^\circ$ Panorama Generation

SHERPA: 面向开放域360°全景生成的无缝感知协调ERP适配

Jungwoon Kang, Jaehun Kim, Yiwon Yu, Hyungyum Jang, Sanghoon Lee, Jongyoo Kim

Affiliations: Yonsei University

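AI summary: Proposes the SHERPA framework, which uses frequency-selective Circular RoPE, circular latent encoding/decoding, image-side FFN adapters, and a dual-path training scheme to adapt planar diffusion models to 360° panoramas in a lightweight way, supporting both photorealistic and stylized panorama generation.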


Comments: 29 pages, 23 figures, 5 tables. Preprint version

AI中文摘要

全景图像越来越多地用于世界生成、游戏和仿真中，用户不仅需要逼真的场景，还需要风格化和非逼真的环境。大规模文本到图像扩散和流模型为此目标提供了广泛的风格和语义先验，但平面图像训练使它们与等距柱状投影（ERP）表示的360°全景的环绕拓扑和极地区域不对齐。我们提出了SHERPA，一个轻量级适配框架，结合了频率选择性圆形RoPE、圆形潜编码/解码、图像侧FFN适配器和双路径训练方案。圆形RoPE仅将接缝敏感的高频水平RoPE带替换为整数周期谐波，同时保留预训练的低频频谱。配对全景路径监督几何，而未配对风格路径使用自监督偏航一致性进行无目标风格化提示。结果，SHERPA在逼真全景域和开放域风格化提示下生成360°全景。

英文摘要

Panoramic imagery is increasingly used in world-generation, games, and simulation, where users may need not only photorealistic scenes but also stylized and non-photorealistic environments. Large-scale text-to-image diffusion and flow models provide broad style and semantic priors for this goal, but planar image training misaligns them with the wrap-around topology and polar regions of $360^\circ$ panoramas represented in equirectangular projection (ERP). We present SHERPA, a lightweight adaptation framework that combines frequency-selective Circular RoPE, Circular Latent Encoding/Decoding, image-side FFN adapters, and a Dual-Path Training Scheme. Circular RoPE replaces only the seam-sensitive high-frequency horizontal RoPE band with integer-periodic harmonics while preserving the pretrained lower-frequency spectrum. The Paired Panorama Path supervises geometry, while the Unpaired Style Path uses self-supervised yaw consistency for target-free stylized prompts. As a result, SHERPA generates $360^\circ$ panoramas across both photorealistic panorama domains and open-domain stylized prompts.
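Why integer-periodic harmonics help at the ERP seam can be shown with a small calculation. In rotary position embeddings, a token at horizontal position $x$ is rotated by angle $x \cdot \omega$. If $\omega = 2\pi k / W$ for an integer $k$ and panorama width $W$, positions $0$ and $W$ (the same physical column after wrap-around) receive identical rotations. The sketch below checks this; the function names and the example width are assumptions, and the selection of which band to replace is not modeled.

```python
import math

def circular_freq(k, width):
    """Integer-periodic frequency: k full rotations across the panorama width."""
    return 2.0 * math.pi * k / width

def seam_gap(freq, width):
    """Angular mismatch (in radians, within [0, pi]) between column 0 and
    column `width`, which coincide physically on a 360-degree panorama."""
    d = (width * freq) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)
```

A generic, non-integer frequency leaves a nonzero gap at the seam, while any `circular_freq(k, width)` closes it, which matches the abstract's motivation for replacing only the seam-sensitive band.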


2510.22335 2026-06-11 cs.CV cs.AI 版本更新

Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

超越扩散：层级到层级自回归用于fMRI到图像重建

Xu Zhang, Ruijie Quan, Wenguan Wang, Yi Yang

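AI summary: Proposes the MindHier framework, which uses a hierarchical fMRI encoder, hierarchy-to-hierarchy alignment, and a scale-aware coarse-to-fine guidance strategy for coarse-to-fine fMRI-to-image reconstruction, outperforming diffusion-based methods.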


Comments: ICLR 2026

AI中文摘要

从fMRI信号重建视觉刺激是连接机器学习和神经科学的核心挑战。最近的扩散方法通常将fMRI活动映射到单个神经嵌入，并将其作为静态指导贯穿整个生成过程。然而，这种固定指导压缩了层级神经信息，并且与图像重建的阶段依赖性需求不一致。为此，我们提出MindHier，一种基于尺度自回归建模的从粗到细的fMRI到图像重建框架。MindHier引入三个组件：层级fMRI编码器提取多级神经嵌入，层级到层级对齐方案强制与CLIP特征的逐层对应，以及尺度感知的粗到细神经引导策略将这些嵌入注入到匹配尺度的自回归中。这些设计使MindHier成为扩散方法的一种高效且认知对齐的替代方案，通过实现层级重建过程，先合成全局语义再细化局部细节，类似于人类视觉感知。在NSD数据集上的大量实验表明，MindHier在语义保真度、推理速度（4.67倍）和结果确定性方面均优于基于扩散的基线方法。

英文摘要

Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single neural embedding, using it as static guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67$\times$ faster inference, and more deterministic results than the diffusion-based baselines.

2512.14096 2026-06-11 cs.CV 版本更新

RSTR: Reducing SpatioTemporal Redundancy in Diffusion Transformers

RSTR: 减少扩散Transformer中的时空冗余

Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun

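AI summary: Proposes RSTR, which jointly reduces spatiotemporal redundancy in diffusion transformers through evolutionary search and adaptive rank allocation, saving 50%-70% of compute while maintaining or improving generation quality.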

Comments: International Conference on Machine Learning (ICML)

AI中文摘要

扩散Transformer（DiTs）在图像生成中取得了显著成功，但其部署受到高计算成本的阻碍。我们识别出两种冗余来源。首先，时间冗余：无分类器引导（CFG）在每个时间步应用昂贵的双重前向传播，然而引导仅在特定步骤重要，且关键步骤的可变尺度可以补偿跳过其他步骤。其次，空间冗余：在可变引导下，不同Transformer块表现出异质性敏感性，但跨所有块的统一校准浪费计算且未能满足其不同需求。我们提出RSTR，这是首个联合减少扩散Transformer中时空冗余的框架。第一阶段通过进化搜索解决时间冗余，发现具有可变尺度的稀疏引导调度。第二阶段通过自适应秩分配解决空间冗余，根据敏感性将校准能力分配给Transformer区域。在DiT-XL/2、PixArt-$\alpha$、FLUX和最先进的Qwen-Image上的实验表明，在保持或提升质量的同时实现了50%-70%的计算节省。在DiT-XL/2上，RSTR实现了57%的节省和15%的FID改进；在Qwen-Image上，实现了3.43倍加速且质量保持不变。

英文摘要

Diffusion Transformers (DiTs) have achieved remarkable success in image generation, yet their deployment is hindered by high computational costs. We identify two sources of redundancy. First, temporal redundancy: Classifier-Free Guidance (CFG) applies costly dual forward passes at every timestep, yet guidance matters only at specific steps, and variable scales at critical steps can compensate for skipping others. Second, spatial redundancy: under variable guidance, different transformer blocks exhibit heterogeneous sensitivity, yet uniform calibration across all blocks wastes computation while failing to address their varying requirements. We present RSTR, the first framework to jointly reduce spatiotemporal redundancy in diffusion transformers. Stage-1 addresses temporal redundancy through evolutionary search, discovering sparse guidance schedules with variable scales. Stage-2 addresses spatial redundancy through adaptive rank allocation, assigning calibration capacities to transformer regions based on their sensitivity. Experiments on DiT-XL/2, PixArt-α, FLUX, and state-of-the-art Qwen-Image demonstrate 50%-70% compute savings while maintaining or improving quality. On DiT-XL/2, RSTR achieves 57% savings with 15% FID improvement; on Qwen-Image, 3.43× speedup with preserved quality.
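
As a rough illustration of the temporal-redundancy argument, the sketch below counts forward passes under classifier-free guidance at every step versus a sparse guidance schedule. The step count, the guided steps, and the scale values are made up for illustration; they are not schedules found by RSTR's evolutionary search.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def forward_passes(schedule):
    """Count model evaluations for a per-step guidance schedule.

    schedule[t] is the guidance scale at step t, or None to skip guidance.
    A guided step needs two passes (conditional and unconditional);
    an unguided step needs only the conditional pass.
    """
    return sum(2 if scale is not None else 1 for scale in schedule)


num_steps = 50
dense = [4.0] * num_steps        # guidance at every step
sparse = [None] * num_steps      # hypothetical sparse schedule
for step, scale in {5: 6.0, 15: 7.5, 30: 5.0}.items():
    sparse[step] = scale         # guidance only at a few steps, with varied scales

dense_cost = forward_passes(dense)
sparse_cost = forward_passes(sparse)
savings = 1 - sparse_cost / dense_cost
```

With these toy numbers the dense schedule costs 100 passes and the sparse one 53, which is where savings of this kind come from; whether quality is preserved depends on which steps are kept.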

2605.02849 2026-06-11 cs.CV 版本更新

Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

通过条件控制扩散实现超低比特率视频压缩的主动采样

Amirhosein Javadi, Shirin Saeedi Bidokhti, Tara Javidi

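AI summary: Proposes ActDiff-VC, which uses a conditional diffusion model and active sampling (adaptive keyframe selection and budget-aware sparse trajectory selection) to achieve video compression with high perceptual quality at ultra-low bitrates.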

Comments: 21 pages, 11 figures, 3 tables

AI中文摘要

扩散模型为超低比特率下的感知重建提供了强大的生成先验，但有效的视频压缩需要使用高度紧凑的条件信号来控制生成过程。在这项工作中，我们提出了ActDiff-VC，一种基于扩散的超低比特率视频压缩框架。我们的方法将视频划分为可变长度的片段，仅在需要时传输关键帧，并使用一组紧凑的跟踪点轨迹总结时间动态。基于这些稀疏信号，条件扩散解码器合成剩余帧，从而在严格的码率约束下实现感知上逼真的重建。为了支持这一设计，我们引入了两种机制：内容自适应关键帧选择和预算感知稀疏轨迹选择，它们共同为生成重建提供了紧凑而有效的条件。在UVG和MCL-JCV基准上的实验表明，在匹配NIQE时，ActDiff-VC实现了高达64.6%的码率降低，在可比码率下，KID改善高达64.6%，FID改善高达37.7%，并且在超低比特率下，相对于学习和基于扩散的基线，提供了有利的感知率失真权衡。

英文摘要

Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE, improves KID by up to 64.6% and FID by up to 37.7% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate-distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.
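
The idea of transmitting keyframes "only when needed" can be sketched with a greedy rule: start a new segment when the current frame drifts too far from the last keyframe. The abstract does not state the paper's actual selection criterion, so the distance measure, the threshold, and the function names below are illustrative assumptions only.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two frames given as flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(frame_a, frame_b)) / len(frame_a)


def select_keyframes(frames, threshold):
    """Greedy content-adaptive selection (hypothetical criterion).

    Frame 0 is always a keyframe. A later frame becomes a keyframe, and
    starts a new variable-length segment, only when it differs from the
    most recent keyframe by more than `threshold`.
    """
    keyframes = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes


# Toy two-pixel "video": a static scene, a cut, then a cut back.
frames = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [1.0, 1.0], [1.0, 0.9], [0.0, 0.0]]
keyframe_ids = select_keyframes(frames, threshold=0.5)
```

On this toy input the rule picks frames 0, 3, and 5, giving segments of length 3, 2, and 1; static content yields long segments and few transmitted keyframes.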

2605.30437 2026-06-11 cs.CV 版本更新

Mitigating Content Shift and Hallucination in GenAI Image Editing via Structural Refinement

通过结构细化减轻生成式AI图像编辑中的内容偏移和幻觉

Luxi Zhao, Michael S. Brown

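AI summary: Proposes a post-processing framework that establishes coarse spatial and photometric correspondences and fuses the input image with its GenAI-enhanced version, retaining perceptual enhancements while suppressing hallucinated content, to address structure preservation in black-box GenAI image editing.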

AI中文摘要

生成式AI（GenAI）图像编辑器（如Nano Banana）在修图任务中产生视觉上令人满意的结果，使非专家能够仅通过文本提示编辑图像。然而，这些模型的生成性质常常引入空间错位、纹理失真和内容幻觉，这些都对需要像素级保真度的下游工作流程有害。我们为黑盒GenAI图像修图确定了一个称为“结构保持GenAI融合”的问题设置：在保持对原始输入图像的结构忠实性的同时，保留GenAI输出的感知增强。为了解决这个问题，我们提出了一种后处理框架，该框架首先建立粗空间和光度对应关系，然后执行融合阶段，将期望的增强转移同时抑制幻觉内容，从而将输入图像与其GenAI增强版本融合。在此设置中缺乏直接先前工作的情况下，我们针对来自真实感风格迁移和图像融合的代表性方法评估我们的框架。我们的实验表明，我们的方法在保持像素级结构一致性和输入分辨率的同时，更好地保留了美学质量。

英文摘要

Generative AI (GenAI) image editors, such as Nano Banana, produce visually compelling results for retouching tasks, enabling non-experts to edit images through text prompts alone. However, the generative nature of these models often introduces spatial misalignment, texture distortion, and content hallucination, all of which are detrimental to downstream workflows that require pixel-level fidelity. We identify a problem setting we call "structure-preserving GenAI fusion" for black-box GenAI image retouching: retain the perceptual enhancements of a GenAI output while enforcing structural faithfulness to the original input image. To address this problem, we propose a post-processing framework that fuses an input image with its GenAI-enhanced counterpart by first establishing coarse spatial and photometric correspondences, then performing a fusion stage that transfers desired enhancements while suppressing hallucinated content. In the absence of direct prior work in this setting, we evaluate our framework against representative methods from photorealistic style transfer and image fusion. Our experiments demonstrate that our method better preserves aesthetic quality while maintaining pixel-level structural consistency and the input resolution.

2606.10135 2026-06-11 cs.CV cs.AI 版本更新

BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

BiWM：利用双向自回归推进开源交互式视频世界模型

Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma

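AI summary: Proposes BiWM, which turns a pretrained video backbone into an interactive world model under a bidirectional autoregressive paradigm with only two training stages (fine-tuning plus distribution matching distillation), supports multiple model scales and long-horizon generation, and outperforms existing causal pipelines.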

Comments: After the paper was posted, we discovered that several visualization results were produced using wrong configuration settings during runtime. This error affects the reliability of the presented visual comparisons. Additionally, further optimization of the design is needed. We therefore request to withdraw this version and will submit a corrected and improved version later

AI中文摘要

将双向视频扩散模型过渡到自回归范式提高了视频世界模型的交互性，但现有的因果流水线需要多个阶段（控制微调、自回归训练、因果初始化、少步蒸馏），并且由于误差累积，质量仍落后于双向模型。最近的世界模型如Yume-1.5和Matrix-Game-3.0采用双向自回归方法，通过自我纠正误差传播获得保真度和稳定的长程展开，但开源框架（如minWM）仅支持因果模型。我们提出BiWM，这是首个在双向自回归范式下用于交互式视频世界模型的全栈框架，联合优化生成质量和推理速度。从预训练视频骨干开始，BiWM通过微调注入相机控制，然后运行几步分布匹配蒸馏（DMD）阶段，将骨干转化为动作/相机可控的世界模型：仅需两个训练阶段（而非minWM的四个），在8xH200 GPU上几百步内收敛。单一方案覆盖Wan2.1-1.3B、Wan2.2-5B、HunyuanVideo-1.5-8B和LTX-2.3-22B，并支持现有双向模型的二次微调。BiWM实现了minWM失去可控性的真实相机控制，集成了可插拔历史压缩（FramePack风格和PackForcing风格）用于长程展开，并提供可选的NVFP4 4位训练/推理流水线。为对抗DMD的模式寻求退化，我们添加了GAN和覆盖前向KL目标，以保留场景动态。我们开源BiWM，用于资源受限的研究和高保真环境模拟。

英文摘要

Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.

2606.10804 2026-06-11 cs.CV 版本更新

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

SCAIL-2：通过端到端上下文条件统一受控角色动画

Wenhao Yan, Fengjia Guo, Zhuoyi Yang, Jie Tang

发表机构 * Z.ai ； Tsinghua University（清华大学）

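AI summary: Proposes SCAIL-2, which unifies controlled character animation through end-to-end in-context conditioning, using the driving video directly instead of intermediate representations. It synthesizes the MotionPair-60K dataset, uses in-context mask conditioning and mode-specific RoPE for unification, and applies Bias-Aware DPO to reduce errors, substantially outperforming existing methods.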

AI中文摘要

受控角色动画需要将运动从驱动序列转移到参考角色。先前的工作严重依赖中间表示，包括用于表示运动的姿态骨架或用于表示环境的掩码背景，这不可避免地导致信息损失。为了解决这个问题，我们提出了SCAIL-2，一个绕过这些中间表示并实现端到端角色动画的框架。通过将驱动视频直接连接到序列，模型可以从输入视频中获得所有所需的视觉信息。为了解决缺乏端到端数据的问题，我们通过解耦条件统一角色动画的子任务，然后策划一个流程来合成MotionPair-60K，一个包含角色动画异构任务的端到端运动转移数据集。为了实现统一，我们利用上下文掩码条件和模式特定的RoPE作为文本指令和原始视觉信息之外的软引导。为了解决详细区域的合成差异，我们提出了Bias-Aware DPO来构建偏好项目以减轻误差。大量实验表明，我们的方法在各种角色动画任务中显著优于现有的最先进方法。合成数据的一个大子集以及模型权重将在我们的项目页面发布：this https URL。

英文摘要

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works heavily rely on intermediate representations, including pose skeletons to represent motion or masked background to represent environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then curate a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous tasks of character animation. To achieve the unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. A large subset of synthetic data as well as model weights will be released at our project page: this https URL.

2603.12261 2026-06-11 cs.LG cs.AI cs.CV 版本更新

The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

潜在颜色子空间：高维混沌中的涌现秩序

Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata

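AI summary: Reveals an HSL structure in how color is represented in the latent space of the FLUX.1 variational autoencoder, and proposes a training-free, closed-form latent-space manipulation method for predicting and explicitly controlling the colors of generated images.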

2606.11314 2026-06-11 cs.CV cs.GR 新提交

TRON: Tracing Rays to Orchestrate a Neural Renderer for 3D Gaussian Reconstructions

TRON：追踪光线以编排用于3D高斯重建的神经渲染器

Or Perel, Hassan Abu Alhaija, Zian Wang, Jacob Munkberg, Matan Atzmon, Sanja Fidler, Masha Shugrina

发表机构 * NVIDIA（英伟达）； University of Toronto（多伦多大学）； Vector Institute（向量研究所）

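AI summary: Proposes TRON, which combines 3D Gaussian ray tracing with neural rendering for realistic, controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing, bridging physically based and neural rendering through intrinsic decomposition priors and ray-traced radiometric guidance.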

Comments: Project page: this https URL

AI中文摘要

我们介绍了TRON，一种渲染框架，它将3D高斯光线追踪与神经渲染相结合，使得在新型光照、动态物体运动、物体插入和材质编辑下，对真实世界3D场景进行逼真且可控的渲染成为可能。先前仅依赖高斯表示的物理渲染（PBR）的方法，由于重建几何、材质估计和光传输估计的不完善，难以实现逼真的重光照。同时，神经渲染方法通常缺乏显式场景表示，限制了它们支持细粒度交互编辑的能力。TRON桥接了这两种范式。我们使用来自学习逆渲染模型的内在分解先验来正则化高斯场的材质属性，并重新利用光线追踪器提供辐射度量指导而非最终像素。通过将此输出视为结构化的3D支架，我们赋予轻量级神经渲染器能力，以弥合着色模型约束估计与逼真输出之间的领域差距。我们的关键见解是，显式3D知识与稳健材质先验的结合提供了速度和可控性，而神经渲染则实现了逼真图像的合成。为了支持真实世界场景，我们采用多阶段策略训练神经渲染器，包括大规模预训练和在从3D重建中构建的210万渲染合成及真实世界帧的新数据集上进行针对性微调。TRON在逼真度上优于基于高斯的重光照方法，在可编辑性和速度上优于先前的神经渲染器。据我们所知，TRON是首个能够在捕获的3D环境中实现实用交互式应用的方法，在动态几何、光照和材质条件下提供逼真的外观。

英文摘要

We introduce TRON, a rendering framework that combines 3D Gaussian ray tracing with neural rendering to enable realistic and controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing. Prior approaches that rely solely on physically based rendering (PBR) of Gaussian representations struggle to achieve realistic relighting due to imperfections in reconstructed geometry, material estimates, and light transport estimation. At the same time, neural rendering methods often lack an explicit scene representation, limiting their ability to support interactive editing with fine-grained manipulation. TRON bridges these two paradigms. We use intrinsic decomposition priors from a learned inverse rendering model to regularize the material properties of a Gaussian field, and repurpose a ray tracer to provide radiometric guidance rather than final pixels. By treating this output as a structured 3D scaffold, we empower a lightweight neural renderer to bridge the domain gap between shading-model constrained estimates and photorealistic output. Our key insight is that the combination of explicit 3D knowledge with robust material priors provides speed and controllability, while neural rendering enables the synthesis of photorealistic images. To support real-world scenarios, we train our neural renderer with a multi-stage strategy consisting of large-scale pretraining and targeted fine-tuning on a newly constructed dataset of 2.1M rendered synthetic and real-world frames from 3D reconstructions. TRON outperforms Gaussian-based relighting methods in realism, and prior neural renderers in editability and speed. To the best of our knowledge, TRON is the first method to enable practical interactive applications in captured 3D environments, offering realistic appearance under dynamic geometric, lighting and material conditions.

2606.11326 2026-06-11 cs.CV 新提交

DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax

DarkVGGT: 利用热几何在黑暗中透视，无需日光代价

Minseong Kweon, Wenyuan Zhao, Nuo Chen, Lulin Liu, Huiwen Han, Zihao Zhu, Srinivas Shakkottai, Chao Tian, Zhiwen Fan

发表机构 * University of Minnesota（明尼苏达大学）； Texas A&M University（德克萨斯农工大学）； Stanford University（斯坦福大学）

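AI summary: Proposes DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. It introduces thermal factorization and geometry-shared routing modules and maintains accuracy under degraded RGB conditions.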

Comments: Project Page: this https URL

AI中文摘要

最近的前馈3D重建方法在从图像流高效端到端场景几何估计中展现出强大性能和灵活性。然而，它们对可见光外观的依赖使其在黑暗和低可见度环境中脆弱，此时RGB线索严重退化，几何证据变得模糊。为应对这一挑战，我们提出DarkVGGT，一种RGB-T前馈几何框架，使用物理感知热建模实现低光照场景下的鲁棒3D估计。DarkVGGT引入两个互补模块。首先，物理启发的热分解提取发射主导、几何一致的热线索，同时隔离可能引入几何模糊的稀疏反射残差。其次，几何共享热路由从热特定模式中分离模态不变的几何结构，选择性地将可靠性感知的结构引导注入RGB流。这些组件共同使得在退化RGB条件下实现准确的热信息几何估计，同时在光照良好环境中基本保持性能。在低可见度RGB-T基准上的实验表明，与现有前馈几何基线相比，在深度和相机姿态估计上均有一致改进。

英文摘要

Recent feed-forward 3D reconstruction methods have demonstrated strong performance and flexibility in efficient end-to-end scene geometry estimation from image streams. However, their reliance on visible-light appearance makes them vulnerable in dark and low-visibility environments, where RGB cues are severely degraded and geometric evidence becomes ambiguous. To address this challenge, we propose DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. DarkVGGT introduces two complementary modules. First, physics-inspired thermal factorization extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals that may introduce geometric ambiguity. Second, geometry-shared thermal routing isolates modality-invariant geometric structures from thermal-specific patterns, selectively injecting reliability-aware structural guidance into the RGB stream. Together, these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments. Experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines.

2606.11390 2026-06-11 cs.CV cs.DC cs.GR cs.LG 新提交

A Scalable PyTorch Abstraction for Multi-GPU Gaussian Splatting

一种可扩展的多GPU高斯泼溅PyTorch抽象

Matthew Cong, Francis Williams, Jonathan Swartz, Mark Harris, Sanja Fidler, Ken Museth

发表机构 * NVIDIA（英伟达）； University of Toronto（多伦多大学）； Vector Institute（向量研究所）

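AI summary: Proposes a multi-GPU Gaussian splatting approach that distributes parameters at the operator level via CUDA unified memory and NVLink, enabling large-scale scene reconstruction with more than 1 billion Gaussian splats.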

Comments: 14 pages, 6 tables, 2 figures, and 1 listing. Includes supplementary material

AI中文摘要

高斯泼溅方法在真实世界的神经重建中越来越受欢迎。然而，由于计算和内存限制，它们在规模和分辨率上常常受限。我们提出了一种多GPU高斯泼溅方法，将重建扩展到更高的分辨率和更大的场景，同时抽象掉了通常与模型分布相关的代码复杂性。为实现这一目标，我们提出一个PyTorch后端，通过CUDA统一内存和NVLink在GPU之间分布高斯参数和泼溅算子。由于分布发生在算子级别，模型代码不需要显式的跨设备通信。更广泛地说，该后端将多个GPU暴露为一个聚合的PyTorch设备，并支持其他PyTorch算子。我们展示了包含超过10亿个高斯泼溅的城市规模重建，具有街道级细节，数量是当前最先进方法的25倍以上。

英文摘要

Gaussian splatting methods have become increasingly popular for neural reconstruction of the real world. However, they are often limited in scale and resolution due to compute and memory constraints. We present a multi-GPU Gaussian splatting approach that scales reconstruction to higher resolutions and larger scenes while abstracting away the code complexity typically associated with distributing a model. To accomplish this, we propose a PyTorch backend that distributes the Gaussian parameters and splatting operators across GPUs via CUDA unified memory and NVLink. Because distribution occurs at the operator level, the model code requires no explicit cross-device communication. More broadly, the backend exposes multiple GPUs as an aggregate PyTorch device and supports other PyTorch operators. We demonstrate city-scale reconstructions with street-level detail consisting of over 1 billion Gaussian splats, more than 25 times as many as the current state of the art.
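
A back-of-the-envelope calculation shows why a scene of this size does not fit on one GPU. The per-Gaussian layout below (position, scale, rotation quaternion, opacity, and degree-3 spherical-harmonic color in float32) is the common 3D Gaussian splatting parameterization and is assumed here; the abstract does not say which layout the authors use, and the 80 GB per-GPU capacity is likewise a hypothetical figure.

```python
import math

# Assumed per-Gaussian layout: position (3) + scale (3) + rotation quaternion (4)
# + opacity (1) + degree-3 spherical-harmonic colour (16 coefficients x 3 channels).
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 48
BYTES_PER_FLOAT = 4  # float32


def parameter_bytes(num_gaussians):
    """Raw parameter storage only; gradients and optimizer state are excluded."""
    return num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT


total_gb = parameter_bytes(1_000_000_000) / 1e9
gpu_memory_gb = 80  # hypothetical per-GPU capacity
min_gpus = math.ceil(total_gb / gpu_memory_gb)
```

Under these assumptions the parameters alone take 236 GB, so at least three such GPUs are needed before counting gradients, optimizer state, or activations.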

2606.11446 2026-06-11 cs.CV cs.GR 新提交

3D-CBM: A Framework for Concept-Based Interpretability in Generative 3D Modeling

3D-CBM：生成式3D建模中基于概念可解释性的框架

Ahmad Al-Kabbany

发表机构 * Yubree Labs ； Multimedia Interaction and Communication Lab, Arab Academy for Science and Technology（阿拉伯科学技术学院多媒体交互与通信实验室）

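AI summary: Proposes incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures, mapping inputs to a multi-tiered taxonomy of interpretable primitives and functional attributes for semantically steerable 3D generation. Experiments show high concept prediction accuracy and support for interactive error correction.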

AI中文摘要

本研究引入了一个将概念瓶颈模型（CBM）融入3D生成架构的框架，以解决深度几何学习中固有的“语义鸿沟”。随着深度模型成为3D内容创建的核心，可解释性从边缘特性转变为医疗和制造等安全关键领域中信任和问责的基本要求。CBM通过约束潜在表示与人类定义的概念对齐，提供了一种内在的可解释性解决方案，但其在非结构化3D数据上的应用仍基本未被探索。我们设计、实现并验证了一个正式的3D-CBM架构，将原始几何输入（包括点云和网格）映射到可解释基元和功能属性的多层级分类中。该框架进一步确定了专门用于基于概念监督的战略性数据集，如PartNet和ShapeNet。来自3D部件操作概念验证实验的结果证明了该框架的有效性，实现了88.8%的概念预测准确率和0.0115的Chamfer距离。关键的是，该模型支持精确的测试时干预，允许交互式纠正结构错误。这项工作为语义可操控的3D生成奠定了基础，并邀请进一步探索协作式人在回路设计系统。

英文摘要

This research introduces a framework for incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures to address the inherent 'semantic gap' in deep geometric learning. As deep models become central to 3D content creation, explainability shifts from a peripheral feature to a fundamental requirement for trust and accountability in safety-critical domains such as healthcare and manufacturing. CBMs provide an intrinsic interpretability solution by constraining latent representations to align with human-defined concepts, yet their application to unstructured 3D data remains largely unexplored. We design, implement, and validate a formal 3D-CBM architecture that maps raw geometric inputs, including point clouds and meshes, into a multi-tiered taxonomy of interpretable primitives and functional attributes. The framework further identifies strategic datasets, such as PartNet and ShapeNet, specialized for concept-based supervision. Experimental results from a 3D part-manipulation proof-of-concept experiment demonstrate the framework's efficacy, achieving a concept prediction accuracy of 88.8% and a Chamfer Distance of 0.0115. Critically, the model enables precise test-time intervention, allowing for the interactive correction of structural errors. This work establishes a foundation for semantically steerable 3D generation and invites further exploration into collaborative human-in-the-loop design systems.
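
For readers unfamiliar with the Chamfer Distance reported above, here is a minimal sketch of one common definition: the mean nearest-neighbor distance from each set to the other, summed over both directions. Conventions vary (squared versus unsquared distances, sum versus mean), and the abstract does not say which variant produced the 0.0115 figure, so the values below are not comparable to it.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two 3D point sets.

    Uses unsquared Euclidean distances averaged per direction; other
    papers use squared distances or different normalisation.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy example: a unit square and the same square shifted 0.1 along z.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
shifted = [(x, y, z + 0.1) for x, y, z in square]

cd_same = chamfer_distance(square, square)
cd_shifted = chamfer_distance(square, shifted)
```

Identical sets give zero, and the shifted copy gives roughly 0.2 (0.1 in each direction). This brute-force version is quadratic in the number of points; practical implementations use a spatial index.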

2606.11466 2026-06-11 cs.CV 新提交

PT-WNO: Point Transformer with Wavelet Neural Operator for 3D Point Cloud Semantic Segmentation

PT-WNO: 结合小波神经算子的点Transformer用于3D点云语义分割

Nhut Le, Maryam Rahnemoonfar

发表机构 * Lehigh University（里海大学）

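AI summary: To address insufficient global context in point cloud semantic segmentation, proposes PT-WNO, which adds a learnable Wavelet Neural Operator branch alongside the skip connections to capture multi-scale global spectral context, with mIoU gains on S3DIS and DALES and competitive results on ScanNet v2.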

AI中文摘要

点云语义分割需要同时捕捉细粒度局部几何和广阔全局场景结构的架构。基于Transformer的网络通过聚焦于详细的局部特征聚合表现出强大性能；然而，全局上下文主要通过编码器-解码器阶段之间的跳跃连接传递，我们认为这对于完整的场景理解是不够的。我们假设，用可学习的全局特征提取模块增强跳跃连接，使网络在深入局部细节之前获取场景级知识，从而产生更丰富且更具上下文基础的表示。为此，我们提出了点Transformer与小波神经算子（PT-WNO），它在点云Transformer骨干的跳跃连接旁集成了一个共享的小波神经算子（WNO）分支。在每个编码器-解码器过渡处，点特征被投影到密集的3D体素网格上，WNO通过可学习的小波分解和重建捕获多尺度全局频谱上下文。这些全局特征通过轻量级适配器融合回网络，补充而非替代现有的跳跃连接。在四个大规模3D点云基准上的实验证明了PT-WNO的有效性。在S3DIS（Area 5）上，PT-WNO达到71.59% mIoU，比Point Transformer v3（PTv3）基线高出+1.03个百分点。在DALES上达到81.05% mIoU（比基线高+1.47）。在ScanNet v2上，PT-WNO获得76.19% mIoU，与基线（76.36%）保持竞争力。

英文摘要

Point cloud semantic segmentation requires architectures that capture both fine-grained local geometry and broad global scene structure. Transformer-based networks have demonstrated strong performance by focusing on detailed local feature aggregation; however, global context is conveyed primarily through skip connections across encoder-decoder stages, which we argue is insufficient for full scene understanding. We hypothesize that augmenting skip connections with a learnable global feature extraction module allows the network to acquire scene-level knowledge before descending into local detail, leading to richer and more contextually grounded representations. To this end, we propose Point Transformer with Wavelet Neural Operator (PT-WNO), which integrates a shared Wavelet Neural Operator (WNO) branch alongside the skip connections of a point cloud transformer backbone. At each encoder-decoder transition, point features are projected onto a dense 3D volumetric grid where the WNO captures multi-scale global spectral context through learnable wavelet decomposition and reconstruction. These global features are fused back into the network via lightweight adapters, complementing rather than replacing the existing skip connections. Experiments on four large-scale 3D point cloud benchmarks demonstrate the effectiveness of PT-WNO. On S3DIS (Area 5), PT-WNO achieves 71.59% mIoU, outperforming the Point Transformer v3 (PTv3) baseline by +1.03 points. On DALES it achieves 81.05% mIoU (+1.47 over the baseline). On ScanNet v2, PT-WNO obtains 76.19% mIoU, remaining competitive with the baseline (76.36%).
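
The "wavelet decomposition and reconstruction" step can be illustrated with a single level of the fixed 1D Haar transform. This is only a sketch of the underlying operation: the WNO in the paper works on a 3D volumetric grid and, per the abstract, its decomposition is learnable, neither of which is modeled here.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_decompose(signal):
    """One level of the orthonormal Haar transform on an even-length signal.

    Returns (approximation, detail): scaled pairwise sums capture the
    coarse, low-frequency content; scaled pairwise differences capture
    the fine, high-frequency content.
    """
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Invert haar_decompose, recovering the original samples."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return signal


signal = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_decompose(signal)
restored = haar_reconstruct(approx, detail)
```

A neural operator of this kind would typically apply learned weights to the coefficients between the two steps; the flat region of the signal (5.0, 5.0) produces a zero detail coefficient, which is what makes the representation compact.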

2606.11578 2026-06-11 cs.CV 新提交

Contactless 3D Human Body Measurement Using Depth Cameras for Smart Health Monitoring

基于深度相机的非接触式3D人体测量用于智能健康监测

Martha Asare, Xuan Wang, Juan Lopez Alvarenga, Lois Akosua Serwaa, Jinghao Yang

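AI summary: Proposes a contactless body measurement framework based on a depth camera and 3D point clouds, which estimates key metrics such as height, arm span, volume, and surface area through spatial filtering, landmark selection, and voxel- and mesh-based analysis.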

Comments: 6 pages, 4 figures. Depth camera-based framework for contactless anthropometric measurement and geometric analysis using 3D point clouds

AI中文摘要

非接触式人体测量技术对于智能健康监测、数字健康应用和远程患者评估日益重要。传统的人体测量通常需要物理接触和训练有素的人员，这可能限制其在远程医疗环境中的可扩展性。在本研究中，我们介绍了一种基于深度相机的框架，利用3D点云数据估计人体测量值。使用Orbbec Astra 2深度相机捕获参与者的RGB图像、深度图和3D点云。利用基于Python的工具（包括Open3D、NumPy和OpenCV）处理捕获的点云，将人体从背景中分割出来。计算关键的人体测量值，如身高和臂展。通过3D点云上的空间滤波和地标选择组合获得测量值，然后利用相机内参将计算出的测量值投影到对应的RGB图像上。除了线性测量外，还使用基于体素的占用分析和基于网格的表面重建方法估计了近似身体体积和可见表面积。单次深度捕获的实验结果表明，无需物理接触即可从深度相机数据中获得准确的人体测量值和几何估计。本研究为未来将深度感知与智能健康监测和生成式AI模型相结合的实时系统奠定了基础，用于智能医疗应用。

英文摘要

Contactless body measurement technologies are becoming increasingly significant for smart health monitoring, digital health applications, and remote patient assessment. Traditional anthropometric measurements typically necessitate physical contact and trained personnel, which may constrain scalability in remote healthcare settings. In this study, we introduce a depth camera-based framework for estimating human body measurements utilizing 3D point cloud data. An Orbbec Astra 2 depth camera was employed to capture RGB images, depth maps, and 3D point clouds of participants. The captured point cloud was processed using Python-based tools, including Open3D, NumPy, and OpenCV, to segment the human body from the background. Key anthropometric measurements, such as height and arm span, were computed. The measurements were obtained through a combination of spatial filtering and landmark selection on the 3D point cloud, followed by the projection of the computed measurements onto the corresponding RGB image using camera intrinsic parameters. In addition to linear measurements, the approximate body volume and visible surface area were estimated using voxel-based occupancy analysis and mesh-based surface reconstruction methods. The experimental results from a single depth capture demonstrated that accurate body measurements and geometric estimates could be obtained from depth camera data without physical contact. This study provides a foundation for future real-time systems that integrate depth sensing with intelligent health monitoring and generative AI models for smart healthcare applications.
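
The linear and volumetric estimates described above can be sketched on a toy point cloud. The paper uses Open3D, NumPy, and OpenCV with landmark selection; the version below uses plain Python and simple axis extents instead, and the axis convention, point coordinates, and voxel size are all assumptions made for illustration.

```python
def extent(points, axis):
    """Span of the point cloud along one axis (max minus min)."""
    values = [p[axis] for p in points]
    return max(values) - min(values)


def voxel_volume(points, voxel_size):
    """Approximate volume as (number of occupied voxels) x (voxel volume).

    With a single depth capture only the visible surface is sampled, so
    this measures occupied surface voxels rather than a filled solid.
    """
    occupied = {tuple(int(c // voxel_size) for c in p) for p in points}
    return len(occupied) * voxel_size ** 3


# Toy segmented "body" in metres. Assumed axes: x = left-right, y = up, z = depth.
body = [
    (0.0, 0.0, 2.0),     # feet
    (0.0, 1.7, 2.0),     # top of head
    (-0.85, 1.4, 2.0),   # left fingertip, arms outstretched
    (0.85, 1.4, 2.0),    # right fingertip
]

height_m = extent(body, axis=1)
arm_span_m = extent(body, axis=0)
volume_m3 = voxel_volume(body, voxel_size=0.1)
```

Axis extents are only valid when the subject stands upright and faces the camera with arms level; landmark-based measurement, as in the paper, relaxes that requirement.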

2606.11619 2026-06-11 cs.CV 新提交

Precision-Aware Illumination-Disentangled Vision Transformer for Spacecraft 6D Pose Estimation

精度感知光照解耦视觉Transformer用于航天器6D姿态估计

Zongwu Xie, Yifan Yang, Yonglong Zhang, Guanghu Xie, Yang Liu, Shuo Zhang

发表机构 * School of Mechatronics Engineering, Harbin Institute of Technology（哈尔滨工业大学机电工程学院）

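AI summary: Proposes PAID-ViT, which achieves robust spacecraft 6D pose estimation under illumination variation and reflective interference through illumination disentanglement, reliability-aware token aggregation, and mask supervision.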

Comments: 11 pages, 7 figures

AI中文摘要

视觉传感器为航天器近距离操作提供了轻量级解决方案，但在光照变化、镜面反射、阴影、弱纹理和背景干扰下，单目航天器6D姿态估计仍然困难。这些因素使局部视觉证据在空间上不可靠，并可能破坏姿态回归的稳定性。本文提出了一种精度感知光照解耦视觉Transformer（PAID-ViT），用于鲁棒的航天器姿态估计。该模型将姿态相关的结构令牌与光照敏感的外观令牌分离，在姿态聚合前估计补丁可靠性，并使用前景掩码监督以保留轮廓线索。一个无参数的几何恢复模块将归一化裁剪坐标、对数深度和连续6D旋转表示转换为相机坐标系下的旋转和平移。在SPEED+ V2（本研究使用的SPEED+验证/光箱/太阳灯评估配置）上的实验表明，PAID-ViT减少了平移误差，并在具有挑战性的太阳灯域中提高了鲁棒性，而消融研究支持了光照解耦、可靠性感知令牌聚合、掩码监督和训练侧正则化的互补作用。

英文摘要

Vision sensors provide a lightweight solution for spacecraft proximity operations, but monocular spacecraft 6D pose estimation remains difficult under illumination variation, specular reflection, shadowing, weak texture, and background interference. These factors make local visual evidence spatially unreliable and can destabilize pose regression. This article proposes a Precision-Aware Illumination-Disentangled Vision Transformer (PAID-ViT) for robust spacecraft pose estimation. The proposed model separates pose-relevant structure tokens from illumination-sensitive appearance tokens, estimates patch reliability before pose aggregation, and uses foreground mask supervision to preserve silhouette cues. A parameter-free geometric recovery module converts normalized crop coordinates, log-depth, and a continuous 6D rotation representation into camera-frame rotation and translation. Experiments on SPEED+ V2, the SPEED+ validation/lightbox/sunlamp evaluation configuration used in this study, suggest that PAID-ViT reduces translation error and improves robustness in the challenging sunlamp domain, while ablation studies support the complementary roles of illumination disentanglement, reliability-aware token aggregation, mask supervision, and training-side regularization.
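
A continuous 6D rotation representation is usually decoded into a rotation matrix by Gram-Schmidt orthogonalization of two predicted 3-vectors. The sketch below shows that standard construction; it is assumed, not confirmed by the abstract, that PAID-ViT's recovery module decodes rotation this way, and the function names are illustrative.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(r6):
    """Map six unconstrained numbers to a 3x3 rotation matrix.

    The first three numbers give the first column after normalisation;
    the last three are made orthogonal to it (Gram-Schmidt) to give the
    second column; the third column is their cross product.
    """
    a1, a2 = r6[:3], r6[3:]
    b1 = normalize(a1)
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


# Unnormalised, non-orthogonal inputs still decode to a valid rotation.
rotation = rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
```

Because the decoding involves no learned weights, it fits the abstract's description of a parameter-free recovery step, and every network output maps to a valid rotation without the discontinuities of Euler angles or quaternions.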

2606.11782 2026-06-11 cs.CV 新提交

Seeing What Matters: Perceptual Wrapper with Common Randomness for 3D Gaussian Splatting

看见重要之处：基于公共随机性的感知包装器用于3D高斯泼溅

He-Bi Yang, Jing-Zhong Chen, Yen-Kuan Ho, Sang NguyenQuang, Fan-Yi Hsu, Yun-Yu Lee, Jui-Chiu Chiang, Wen-Hsiao Peng

发表机构 * National Yang Ming Chiao Tung University（国立阳明交通大学）； National Chung Cheng University（国立中正大学）

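AI summary: To address the difficulty 3D Gaussian splatting has in synthesizing high-frequency textures in memory-constrained and rate-distortion-optimized pipelines, proposes a 2D perceptual wrapper that uses pseudo-random Gaussian noise and Wasserstein Distortion supervision to enhance rendered outputs in a content- and view-dependent manner, improving perceptual quality at sharply reduced file or model sizes.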

Comments: 18 pages, 9 figures

AI中文摘要

虽然3D高斯泼溅（3DGS）实现了令人印象深刻的实时渲染，但它经常难以合成高频纹理，这一限制在内存受限和率失真优化（RDO）管道中尤为严重。为了解决这个问题，我们提出了一种通用的2D感知包装器，它以内容和视角相关的方式增强现有3DGS表示的渲染输出。我们的方法利用一个以伪随机高斯噪声为条件的轻量级合成网络来合成感知上合理的纹理。在Wasserstein失真的监督下，该网络学习匹配局部特征统计，而不是严格强制逐像素重建保真度，从而有效缓解标准框架中固有的模糊性。我们展示了我们的即插即用方法在普通、内存受限和RDO 3DGS方法中的广泛适用性。全面的主观和客观实验证实，我们的方法显著优于现有基线，在急剧减小文件或模型尺寸的同时，实现了卓越的感知质量。

英文摘要

While 3D Gaussian Splatting (3DGS) achieves impressive real-time rendering, it frequently struggles to synthesize high-frequency textures, a limitation heavily exacerbated in memory-constrained and rate-distortion-optimized (RDO) pipelines. To address this, we propose a versatile 2D perceptual wrapper that enhances the rendered outputs of existing 3DGS representations in a content- and view-dependent manner. Our method leverages a lightweight synthesis network conditioned on pseudo-random Gaussian noise to synthesize perceptually plausible textures. Supervised by Wasserstein Distortion, the network learns to match local feature statistics rather than strictly enforcing pixel-wise reconstruction fidelity, effectively mitigating the blurriness inherent in standard frameworks. We demonstrate the broad applicability of our plug-and-play approach across vanilla, memory-constrained, and RDO 3DGS methods. Comprehensive subjective and objective experiments confirm that our method significantly improves over existing baselines, yielding superior perceptual quality at sharply reduced file or model sizes.
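
The "common randomness" in the title can be read as follows: because the noise is pseudo-random, any two parties that share a seed regenerate exactly the same noise field, so the noise itself never needs to be stored or transmitted. The sketch below shows only that property; how the paper derives or shares its seeds is not stated in the abstract, and the function name and seed values are hypothetical.

```python
import random


def shared_noise(seed, height, width):
    """Deterministic standard-normal noise field for a given seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


# Two independent calls with the same seed reproduce the same field;
# a different seed gives a different one.
noise_a = shared_noise(seed=1234, height=2, width=3)
noise_b = shared_noise(seed=1234, height=2, width=3)
noise_other = shared_noise(seed=4321, height=2, width=3)
```

A synthesis network conditioned on such a field can add texture detail that is reproducible across runs, at the storage cost of a single integer.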

2606.11805 2026-06-11 cs.CV cs.AI 新提交

TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization

TextHOI-3D: 基于离散多视图生成与联合网格优化的文本到三维手物交互

Zixiong Hao, Zhencun Jiang

发表机构 * Technical University of Munich（慕尼黑工业大学）； Tongji University（同济大学）； Shanghai Research Institute for Intelligent Autonomous Systems（上海自主智能无人系统科学中心）

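AI summary: Proposes TextHOI-3D, which connects text-conditioned generation and geometric recovery through a discrete multi-view representation to generate 3D hand-object meshes from text, substantially reducing object Chamfer distance and penetration volume.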

Comments: 11 pages, 8 figures, 3 tables

AI中文摘要

文本条件的三维生成在图像和孤立物体方面进展迅速，但生成手物网格仍然具有挑战性：输出必须保持语言语义、跨视图一致性、物体几何、关节手部形状以及物理上合理的接触。我们提出TextHOI-3D，一个分阶段框架，使用生成的多视图观测作为文本条件视觉生成与几何感知手物恢复之间的显式接口。TextHOI-3D为固定相机的手物观测学习紧凑的VQ令牌空间，通过CLIP条件的视觉自回归模型从文本预测多视图视觉令牌，并通过先验初始化、多视图联合优化和抗穿透细化恢复统一的手物网格。该设计将语义生成与几何恢复分离，同时通过离散多视图表示保持两个阶段的连接。在HO3D衍生评估中，与单视图对应相比，多视图设置将物体倒角距离从17.26毫米降低到4.92毫米，穿透体积从5.3721立方厘米降低到0.2193立方厘米，同时改善了手部误差和表面F分数。这些结果支持多视图视觉令牌作为文本驱动三维手物网格创建的有效中间表示。

英文摘要

Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object CD from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm^3 to 0.2193 cm^3 compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.
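
The "compact VQ token space" rests on vector quantization: each continuous feature is replaced by the index of its nearest codebook entry. The sketch below shows that lookup with a toy two-dimensional codebook; the codebook values and feature vectors are invented, and the paper's encoder, codebook size, and training losses are not covered.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return the index (token) of the codebook entry nearest to `vector`."""
    return min(range(len(codebook)), key=lambda k: squared_distance(vector, codebook[k]))


# Toy codebook with three entries and three encoder features to tokenize.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.7)]

tokens = [quantize(f, codebook) for f in features]
decoded = [codebook[t] for t in tokens]
```

Once images are expressed as integer tokens like these, an autoregressive model can predict them from text the same way a language model predicts words, which is the role the CLIP-conditioned model plays in the abstract.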


2606.11880 2026-06-11 cs.CV New submission

SG2Loc: Sequential Visual Localization on 3D Scene Graphs

Nicole Damblon, Olga Vysotska, Federico Tombari, Marc Pollefeys, Daniel Barath

Affiliations: ETH Zurich; Google; TU Munich; Microsoft

AI summary: Proposes a lightweight sequential visual localization method that represents the environment as a compact 3D scene graph and localizes with a particle filter and semantic matching, substantially reducing storage requirements.

Comments: The code will be available at this https URL

Abstract

Visual localization in complex indoor environments remains a critical challenge for robotics and AR applications. Sequential localization, where pose estimates are refined over time, is important for autonomous agents. However, traditional methods often require storing extensive image databases or point clouds, leading to significant overhead. This paper introduces a novel, lightweight approach to sequential visual localization using 3D scene graphs. Our method represents the environment with a compact scene graph, where nodes represent objects (with coarse meshes) and edges encode spatial relationships. For each image in the localization phase, we extract per-patch semantic features that predict object identities. Localization is performed within a particle filter framework. Each particle, representing a camera pose, projects the coarse object meshes from the scene graph into the image, assigning object identities to patches based on visibility. The similarity between the per-patch features of the input image and the object features from the scene graph determines the weight of a particle. Subsequent images are incorporated sequentially, refining the pose estimate. By leveraging a compact scene graph and efficient semantic matching, our method significantly reduces storage while maintaining performance on real-world datasets. The code will be available at this https URL.
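The particle-filter step described in the abstract can be sketched generically as follows. This is not the SG2Loc implementation: the exponential similarity-to-weight mapping, the `temperature` parameter, and systematic resampling are assumptions chosen for illustration, and the similarity scores stand in for the patch-versus-object feature comparison.

```python
import math
import random


def update_weights(weights, similarities, temperature=0.1):
    """Reweight particles by a similarity score and renormalise.

    Assumption: weight is proportional to exp(similarity / temperature).
    """
    scaled = [w * math.exp(s / temperature) for w, s in zip(weights, similarities)]
    total = sum(scaled)
    return [v / total for v in scaled]


def systematic_resample(particles, weights, rng):
    """Draw len(particles) samples with evenly spaced pointers."""
    n = len(particles)
    start = rng.random() / n
    resampled, cumulative, i = [], weights[0], 0
    for k in range(n):
        pointer = start + k / n
        # Advance to the particle whose cumulative weight covers the pointer.
        while pointer > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        resampled.append(particles[i])
    return resampled


poses = ["pose_a", "pose_b", "pose_c", "pose_d"]   # placeholder camera poses
weights = update_weights([0.25] * 4, [0.1, 0.9, 0.2, 0.3])
poses = systematic_resample(poses, weights, random.Random(0))
```

Repeating the update for each new image is what concentrates the particle set around poses that stay consistent with the sequence.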


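2606.11894 2026-06-11 cs.CV New submission

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki

Affiliations: The University of Tokyo

AI summary: Proposes Wild3R, a feed-forward 3D Gaussian Splatting method for unconstrained sparse photo collections. Using the newly introduced WildCity dataset, which covers diverse lighting conditions and transient objects, the model learns appearance consistency across viewpoints and removes transient content; it outperforms existing feed-forward methods and is competitive with per-scene optimization-based methods.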

Abstract

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.


2606.12099 2026-06-11 cs.CV New submission

ISAP-3D: Identity-Slot Aligned Part-Aware 3D Generation

Junlin Hao, Haoshuai Fu, Xibin Song, Wei Li, Ruigang Yang, Xinggong Zhang, Jinchuan Zhang

Affiliations: Peking University; Tencent; Huawei; University of Science and Technology of China

AI summary: To address the structural ambiguity caused by identity-layout entanglement in part-aware 3D generation, proposes ISAP-3D, an identity-slot aligned framework that anchors each part with semantic identity tokens and performs one-to-one layout prediction, yielding stable and controllable part-level 3D generation.

Abstract

Part-aware 3D generation aims to synthesize structured objects with semantically meaningful components, yet often suffers from structural ambiguity due to identity-layout entanglement. Existing methods either infer part identity and spatial layout implicitly, which can lead to unstable part allocation (e.g., slot swapping or part merging), or rely on strong layout conditions that are difficult to obtain in practice. We attribute this ambiguity to identity-slot permutation freedom: without explicit identity-slot alignment, the correspondence between semantic parts and generation slots is not identifiable during training, allowing multiple slot assignments to fit the same supervision and leading to inconsistent decomposition. Based on this insight, we argue that stable part-aware generation requires identity-aligned one-to-one slot modelling. We therefore propose an identity-slot aligned framework, ISAP-3D, which anchors each part with semantic identity tokens and performs identity-conditioned one-to-one layout prediction, followed by layout-conditioned geometry synthesis. Structured local-global conditioning maintains identity alignment across semantic, spatial, and geometric stages. We also construct a part-level dataset with a unified semantic protocol to enable learnable and consistent identity-slot alignment. Extensive experiments demonstrate improved structural stability, controllability, and robustness over state-of-the-art part-aware generation baselines.


2606.12189 2026-06-11 cs.CV New submission

DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds

Weirong Chen, Keisuke Tateno, Hidenobu Matsuki, Michael Niemeyer, Daniel Cremers, Federico Tombari

Affiliations: Technical University of Munich; Google; Imperial College London; University of Bonn

AI summary: Proposes DynaTok, which uses a Transformer-based spatiotemporal encoder and a flow-matching decoder to reconstruct complete, temporally consistent 4D point clouds from partial point cloud sequences, without correspondences or images.

Comments: ICML 2026. Project page: this https URL

Abstract

We address 4D reconstruction from partial point cloud sequences, where depth-sensor observations are incomplete, unordered, and lack explicit temporal correspondences. This geometry-only setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely relied on image-based methods, existing point-based approaches typically focus on single objects, assume relatively complete inputs, or require explicit correspondences. To address these limitations, we propose DynaTok, a point-based framework for correspondence-free 4D reconstruction from partial point cloud sequences without images. DynaTok encodes frames into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry and motion through residual tokens in a unified model. A flow-matching decoder then reconstructs complete, temporally consistent 4D point-cloud sequences conditioned on the latent tokens. Experiments on object- and scene-level benchmarks demonstrate improved reconstruction quality and temporal coherence from partial point cloud observations. Project page: this https URL.
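For readers unfamiliar with flow matching, the sketch below shows the commonly used straight-line formulation: a training pair consists of an interpolated point and a constant target velocity, and sampling integrates a velocity field from noise to data. This is an assumption about the general technique, not DynaTok's decoder; the conditioning on latent tokens and the neural network itself are omitted, and both function names are hypothetical.

```python
def flow_matching_pair(data_point, noise_point, t):
    """Return (x_t, target_velocity) on the straight path from noise to data.

    Assumption: linear interpolation x_t = (1 - t) * noise + t * data,
    whose time derivative is the constant vector data - noise.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise_point, data_point)]
    velocity = [d - n for n, d in zip(noise_point, data_point)]
    return x_t, velocity


def euler_integrate(velocity_fn, start, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(start)
    for k in range(steps):
        t = k / steps
        v = velocity_fn(x, t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x


data = [1.0, 2.0, 3.0]    # one 3D point of a complete shape (illustrative)
noise = [0.0, 0.0, 1.0]   # its paired noise sample
x_t, target = flow_matching_pair(data, noise, t=0.25)

# With the exact (oracle) velocity, integration recovers the data point.
recovered = euler_integrate(lambda x, t: target, noise, steps=4)
```

In a trained model the oracle lambda would be replaced by a network that predicts the velocity from `x_t`, `t`, and the conditioning signal.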


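2606.12368 2026-06-11 cs.CV New submission

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang

AI summary: Proposes DepthMaster, a unified framework that decomposes panoramas into overlapping perspective patches and introduces a Correspondence Consistency Loss and virtual projection cameras as geometric priors, addressing the geometric discrepancy between perspective and panoramic depth estimation and the scarcity of panoramic data; it achieves state-of-the-art zero-shot performance on 13 datasets.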

Abstract

While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.


2606.11529 2026-06-11 cs.GR cs.CV cs.PF Cross-list

XPR: An Extensible Cross-Platform Point-Based Differentiable Renderer

Steve Rhyner, Sankeerth Durvasula, Aleksandr Kovalev, Hansel Jia, Adrian Zhao, Mrutunjayya Mrutunjayya, Nilesh Ahuja, Selvakumar Panneer, Christina Giannoula, Nandita Vijaykumar

AI summary: Proposes XPR, a framework whose high-level programming interface and modular rendering pipeline let new methods such as 3DGS be implemented in little code and run across platforms via the XLA compiler.

Abstract

Point-based differentiable rendering underpins modern 3D reconstruction, novel-view synthesis, and learning-based graphics pipelines, but developing new rendering methods often requires extensive low-level implementation, hardware-specific kernels, and manually written backward passes. This limits rapid prototyping, reproducibility, exploration, and deployment, especially across diverse hardware platforms. This paper presents XPR, an extensible cross-platform framework for point-based differentiable rendering. XPR introduces a high-level programming interface that separates method-specific logic from the shared rendering pipeline, allowing users to implement new methods in a few lines of code. Its pipeline decomposes rendering into modular, statically shaped parallel operations that can be lowered by a cross-platform compiler to GPUs, TPUs, CPUs, and other ML accelerators. We demonstrate implementations of 3DGS, 3DGUT, and LinPrim in only a few hundred lines of Python code, each of which can be compiled to a range of hardware platforms with the XLA compiler. These results show that XPR enables fast experimentation and portable execution for emerging point-based differentiable rendering systems.


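2601.03326 2026-06-11 cs.CV cs.LG Updated version

Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation

Jarek Duda

AI summary: Proposes extending PCA to higher-order tensors (such as order-3 central moments) or to a polynomial times a Gaussian to obtain more accurate rotation-invariant shape descriptors, with applications to molecular shape description, object recognition, and shape similarity metrics.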

Comments: 5 pages, 4 figures

Abstract

PCA can provide rotation-invariant features: it describes a shape by its covariance matrix $p_{ab}=E[(x_a-E[x_a])(x_b-E[x_b])]$, which approximates the shape by an ellipsoid and yields rotation invariants such as the traces of its powers. However, real shapes are usually much more complicated, so an extension is proposed to, e.g., order-3 or higher tensors describing central moments, $p_{abc}=E[(x_a-E[x_a])(x_b-E[x_b])(x_c-E[x_c])]$, or to a polynomial times a Gaussian, allowing decodable shape descriptors of arbitrarily high accuracy together with their analogous rotation invariants. Practical applications could include rotation-invariant features that capture shape modulo rotation, e.g., for molecular shape descriptors; object recognition up to rotation in 2D images or 3D scans, and possibly 3D scene understanding; or a shape similarity metric that allows inexpensive comparison of objects modulo rotation, avoiding costly optimization over rotations.
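A small numerical check makes the invariance claim concrete. The sketch below computes the order-2 and order-3 central moment tensors of a 2D point set and three scalars: the traces of the first two powers of the covariance matrix, and the full contraction $\sum_{abc} p_{abc}^2$. The first two follow the abstract; the third is one simple order-3 invariant chosen here for illustration and is not necessarily among those used in the paper. All function names are hypothetical.

```python
import math
from itertools import product


def central_moments(points, order):
    """Central moment tensor of the given order, as a dict keyed by index tuples."""
    n, d = len(points), len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(d)]
    centered = [[p[a] - mean[a] for a in range(d)] for p in points]
    return {
        idx: sum(math.prod(c[a] for a in idx) for c in centered) / n
        for idx in product(range(d), repeat=order)
    }


def invariants(points):
    """Return (tr C, tr C^2, sum of squared order-3 moments)."""
    d = len(points[0])
    p2 = central_moments(points, 2)
    p3 = central_moments(points, 3)
    trace_c = sum(p2[a, a] for a in range(d))
    trace_c2 = sum(p2[a, b] * p2[b, a] for a in range(d) for b in range(d))
    norm_p3 = sum(v * v for v in p3.values())
    return trace_c, trace_c2, norm_p3


def rotate2d(points, theta):
    """Rotate 2D points counter-clockwise by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


shape = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (5.0, 3.0), (1.0, 4.0)]
before = invariants(shape)
after = invariants(rotate2d(shape, 0.7))
print(all(math.isclose(u, v, rel_tol=1e-9) for u, v in zip(before, after)))
```

The contraction is invariant because each tensor index transforms with the same orthogonal matrix, so summing the product of a tensor with itself over all indices cancels the rotation.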


2606.11606 2026-06-11 cs.CV New submission

Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

Raajitha Muthyala, Zhenan Yin, Alekhya Jilla, Frank Li, Theo Dapamede, Bardia Khosravi, Mohammadreza Chavoshi, Judy Gichoya, Saptarshi Purkayastha

Affiliations: Department of Biomedical Engineering and Informatics, Indiana University; Department of Radiology and Imaging Sciences, Emory University

AI summary: Systematically quantifies where five frozen vision-transformer foundation models retain or lose small-scale, low-contrast signal in chest radiography, finding that the global-aggregation step silently suppresses small-scale signal, which can nonetheless be recovered from patch tokens.

Abstract

Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC >= 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
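The difference between patch-mean and bounding-box-restricted patch-local pooling can be illustrated with a toy example. The sketch below assumes patch tokens stored in row-major order on a regular grid and a pixel-space box `(x0, y0, x1, y1)`; the helper names and the overlap rule (any intersection between a patch cell and the box) are assumptions, not the paper's exact procedure.

```python
def mean_tokens(tokens):
    """Element-wise mean over a non-empty list of equal-length token vectors."""
    dim = len(tokens[0])
    return [sum(t[k] for t in tokens) / len(tokens) for k in range(dim)]


def patch_local(patch_tokens, grid_w, patch_size, box):
    """Mean over only those patch tokens whose cell intersects the box."""
    x0, y0, x1, y1 = box
    selected = []
    for i, token in enumerate(patch_tokens):
        row, col = divmod(i, grid_w)          # row-major patch layout
        px0, py0 = col * patch_size, row * patch_size
        overlaps_x = px0 < x1 and px0 + patch_size > x0
        overlaps_y = py0 < y1 and py0 + patch_size > y0
        if overlaps_x and overlaps_y:
            selected.append(token)
    if not selected:
        raise ValueError("bounding box does not overlap any patch")
    return mean_tokens(selected)


# 2x2 grid of 16-pixel patches; only the bottom-right patch carries signal.
tokens = [[0.0], [0.0], [0.0], [8.0]]
print(mean_tokens(tokens))                           # global patch-mean
print(patch_local(tokens, 2, 16, (20, 20, 30, 30)))  # box-restricted pooling
```

In this toy case the global mean dilutes the single informative token by a factor of four, while the box-restricted mean keeps it intact, which mirrors the dilution effect the abstract attributes to global aggregation.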


2606.11740 2026-06-11 cs.CV cs.CL New submission

UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

Mengzhuo Chen, Yan Shu, Chi Liu, Hongming Piao, Xidong Wang, Derek Li, Bryan Dai

Affiliations: IQuest Research

AI summary: Proposes UniReason-Med, which transfers reasoning ability from 2D medical images to 3D medical VQA through a shared grounded reasoning interface, combining supervised fine-tuning and reinforcement learning to substantially improve 3D reasoning.

Abstract

We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at this https URL.


2606.11846 2026-06-11 cs.CV New submission

SheafStain: Sheaf-Theoretic Schrödinger Bridge for Spatially and Biologically Coherent Virtual Staining

Hyeongyeol Lim, Hongjun Yoon, Eunjin Jang, Daeky Jeong, Won June Cho, Hwamin Lee

Affiliations: Department of Medical Informatics, College of Medicine, Korea University; DEEPNOID Inc.

AI summary: To address the spatial discontinuity and context contamination caused by patch-wise inference in virtual staining, proposes SheafStain, which reinterprets vision-foundation-model features as sheaf-like sections within a Schrödinger Bridge framework for spatially and biologically coherent virtual staining, outperforming six existing methods on HER2 and other markers.

Comments: 32 pages

Abstract

Current virtual staining approaches offer the potential for time- and cost-efficient biomarker quantification in cancer diagnostics and prognostics. However, patch-wise inference for gigapixel whole slide images (WSIs) fails to maintain spatial continuity, yielding artifacts that cause catastrophic mismatches with ground-truth images. Although pathology Vision Foundation Models (VFMs) offer rich representations, their self-attention causes varying global contexts to produce inconsistent embeddings for the same physical region. We formalize and validate this "context contamination" as a sheaf-theoretic problem where these embeddings form a presheaf that violates the gluing axiom. To address this, we propose SheafStain, a new approach that reinterprets VFM features as sheaf-like sections for spatially and biologically coherent virtual staining. Specifically, SheafStain integrates class and patch tokens into a Schrödinger Bridge framework as sheaf-like sections. While the class token anchors biological consistency, patch tokens form a per-position spatial map. A backbone co-pretrained on Hematoxylin & Eosin (H&E) and Immunohistochemistry (IHC) yields non-degenerate cross-stain stalks, so a single VFM feature space supervises both input conditioning and output stain alignment. Departing from prior work that evaluates on isolated $256 \times 256$ patches and either random-crops or resizes the $1024 \times 1024$ ground truth, we translate at $256 \times 256$ and evaluate on the stitched $1024 \times 1024$ outputs across HER2, ER, PR, and Ki-67. SheafStain demonstrates promising results against six prior methods while mitigating patch-boundary stitching artifacts. Code will soon be released.


2606.12126 2026-06-11 cs.CV New submission

AGE-MIL: Anchor-Guided Evidence Learning for Patient-Level Prediction

Jiawei Niu, Jian Chen, Di Zhang, Junbo Lu, Zhangcheng Liao, Xuhao Liu, Honglin Zhong, Mireia Crispin-Ortuzar, Chen Li, Zeyu Gao, Yi Cai

Affiliations: School of Computer Science and Technology, Xi’an Jiaotong University; Department of Oncology, University of Cambridge; Xiangya School of Medicine, Central South University

AI summary: Proposes AGE-MIL, which builds a patient-level anchor to integrate evidence across multiple whole-slide images and models risk as an evidence accumulation process, enabling stable optimization under weak supervision and outperforming eight existing methods on six tasks.

Comments: 11 pages, 2 figures, MICCAI early accepted

Abstract

Existing computational pathology methods predominantly operate within whole-slide image (WSI)-level multiple instance learning (MIL) paradigms, while patient-level modeling remains underexplored. In routine pathological practice, however, pathologists derive diagnostic and prognostic conclusions by integrating evidence across multiple WSIs rather than relying on any single slide. This discrepancy creates a fundamental misalignment when patient-level supervision is directly imposed on conventional MIL frameworks, often leading to unstable optimization and degraded predictive reliability. To address this issue, we propose Anchor-Guided Evidence MIL (AGE-MIL), a weakly supervised framework for patient-level prediction. AGE-MIL constructs a patient-level anchor from slide representations to capture global pathological context and guide the retrieval and integration of diagnostically relevant local patches, enabling robust patient-level modeling. Patient-level risk is further modeled as an evidence accumulation process, promoting stable optimization under weak supervision. AGE-MIL is evaluated on six clinically relevant patient-level prediction tasks from two independent cohorts. Experimental results show that the proposed framework consistently outperforms eight state-of-the-art MIL methods. Code is available at this https URL.


2606.12140 2026-06-11 cs.CV New submission

Time-Conditioned and Multi-Time Survival Prediction from 2D PET/CT Projections in Lung Cancer

Ashish Chauhan, Sambit Tarai, Elin Lundström, Johan Öfverstedt, Håkan Ahlström, Joel Kullberg

Affiliations: Radiology, Department of Surgical Sciences, Uppsala University; National Academic Infrastructure for Supercomputing (NAISS), Linköping University; Antaros Medical; SciLifeLab, Uppsala University

AI summary: Proposes two approaches, Attention-guided Time-Conditioned Survival (ATCS) and Multi-Time Survival (MTS), that predict survival of non-small cell lung cancer patients from 2D PET/CT projections; ATCS performs better at earlier time points and MTS at later ones.

Comments: Under review at MIUA 2026

Abstract

Accurate prediction of overall survival (OS) from positron emission tomography/computed tomography (PET/CT) can support personalized treatment and follow-up strategies in oncology. However, the impact of temporal modeling on imaging-based survival prediction remains insufficiently explored. We investigate how different temporal formulations influence survival prediction by developing two complementary approaches: Attention-guided Time-Conditioned Survival (ATCS) and Multi-Time Survival (MTS). We retrospectively analyzed pre-treatment PET/CT images from 848 patients with non-small cell lung cancer (NSCLC), including 556 for model development and 292 for held-out testing. A previously proposed Time-Conditioned Survival (TCS) model was used as a baseline. Models were trained using 5-fold cross-validation and evaluated on the test set using time-dependent area under the curve (AUC) at 6-month intervals from 0.5 to 5 years. Both ATCS and MTS outperformed the baseline TCS model, achieving mean AUCs of 0.794 and 0.793, respectively, compared to 0.767. ATCS performed better at earlier time points (0.5-3 years), whereas MTS performed better at later intervals (3.5-5 years). Combining tumor-specific and tissue-wise PET/CT features improved performance over either input alone. Finer temporal discretization improved short-term prediction, while coarser intervals provided more stable long-term estimates. These findings demonstrate that temporal modeling and input design influence PET/CT-based survival prediction. The proposed approaches enable time-specific survival estimation from pre-treatment imaging and may support improved risk stratification and clinical decision-making.
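To make the evaluation metric concrete, here is a simplified sketch of a time-dependent AUC at a single horizon, using the cumulative/dynamic definition: cases are patients with an observed event by the horizon, controls are patients still under observation after it. It is an assumption that the paper uses this definition, and the sketch deliberately omits the inverse-probability-of-censoring weights that standard estimators apply; patients censored before the horizon are simply excluded. The function name is hypothetical.

```python
def cumulative_dynamic_auc(times, events, risks, horizon):
    """Probability that a case has a higher predicted risk than a control.

    times   : follow-up time per patient (years)
    events  : 1 if death was observed, 0 if censored
    risks   : predicted risk score per patient (higher = worse prognosis)
    horizon : evaluation time point (years)
    """
    cases = [r for t, e, r in zip(times, events, risks) if e and t <= horizon]
    controls = [r for t, r in zip(times, risks) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")
    # Count concordant pairs; ties in predicted risk count as half.
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


times = [0.4, 1.2, 2.0, 4.5, 6.0]
events = [1, 1, 0, 1, 0]
risks = [0.9, 0.3, 0.5, 0.4, 0.1]
print(cumulative_dynamic_auc(times, events, risks, horizon=3.0))
```

Evaluating this at each 6-month horizon from 0.5 to 5 years and averaging would give a mean AUC comparable in spirit to the figures reported above, subject to the simplifications noted.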


2606.12169 2026-06-11 cs.CV cs.AI cs.CL cs.LG New submission

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi

Affiliations: York University; Vector Institute; University of British Columbia; University of Toronto; Unity Health Toronto / St. Michael’s Hospital; University Health Network; Arc Institute; Queen's University

AI summary: Introduces OpenMedReason, a large-scale open medical reasoning corpus of about 450K image-question-answer instances whose reasoning traces come mainly from biomedical scientific articles, together with the OpenMedReason-Bench benchmark for fine-grained evaluation; it improves model performance under both supervised fine-tuning and reinforcement-based alignment.

Comments: 42 pages, 9 figures, 24 tables. Dataset and code: this https URL

Abstract

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.


2606.12286 2026-06-11 cs.CV New submission

CellNet -- Localizing Cells using Sparse and Noisy Point Annotations

Benjamin Eckhardt, Dmytro Fishman, Stuart Fawke, Andrew Curtis, Bo Fussing, Constantin Pape

Affiliations: University of Göttingen; Wellcome Sanger Institute; University of Tartu

AI summary: Proposes CellNet, a regression-based deep learning algorithm that detects and counts cells in phase-contrast microscopy images from sparse point annotations, reducing annotation effort and outperforming zero-shot methods in low-data regimes.

Comments: Conference poster at Biology at Scale: From Variants to Cellular Programs and Functions

Abstract

Counting living cells is an important step in many biological research workflows. Our collaborators at the Wellcome Sanger Institute study vital genes in humans via large-scale saturation genome editing screening, which requires counting cells a great many times. Computer-vision-based automation is crucial for high throughput and resource efficiency. In this work, we develop a regression-based deep learning computer vision algorithm to detect and count cells in phase-contrast microscopy images. To reduce annotation effort, which in practice often becomes a bottleneck, we focus on counting cells using only sparse point annotations, which are fast and easy to acquire. By comparison with state-of-the-art zero-shot methods, we show that regression-based counting is a promising alternative in low-data regimes. By developing methods to automatically count living cells in microscopy images, we contribute to valuable research on the human genome. The code is available at this https URL.

2606.12319 2026-06-11 cs.CV New submission

Anatomically Conditioned Recurrent Refinement for Topology-Aware Circle of Willis Segmentation

解剖条件循环细化用于拓扑感知的Willis环分割

Juraj Perić, Marija Habijan, Dario Mužević, Irena Galić, Danilo Babin, Aleksandra Pižurica

Affiliations: Faculty of Electrical Engineering, Computer Science and Information Technology, Osijek, Croatia; Clinical Medical Center Osijek, Osijek, Croatia; Ghent University, Dept. of Telecommunications and Information Processing, imec-TELIN-IPI, Ghent, Belgium; Ghent University, Dept. of Telecommunications and Information Processing, TELIN-GAIM, Ghent, Belgium

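AI summary: Proposes AC2RUNet, which combines a static and dynamic two-stream architecture with curriculum learning to substantially reduce Hausdorff distance and Betti number error on the TopCoW dataset and improve topological connectivity.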

Comments: 9 pages, 4 figures, 1 table. Accepted at EUSIPCO 2026

AI中文摘要

由于复杂的拓扑结构和易碎细小的血管结构，从磁共振血管造影（MRA）中分割Willis环（CoW）具有挑战性。标准卷积神经网络（CNN）通常无法捕捉这些拓扑约束，导致“血管断裂”伪影。为了解决这个问题，我们提出了解剖条件循环细化U-Net（AC2RUNet）。我们的架构将分割解耦为两个流：提取不变解剖特征的静态流和随时间迭代细化拓扑错误的轻量级动态流。我们进一步引入了一种动态课程学习策略，从高召回率的几何监督过渡到拓扑感知约束。在TopCoW数据集上验证，AC2RUNet显著降低了Hausdorff距离（4.72 mm vs 9.17 mm）和Betti数误差（0.19 vs 0.40），在保持相当体积Dice的同时改善了nnU-Net基线的拓扑连通性。

英文摘要

Segmenting the Circle of Willis (CoW) from Magnetic Resonance Angiography (MRA) is challenging due to complex topology and thin vascular structures that are prone to fragmentation. Standard Convolutional Neural Networks (CNNs) often fail to capture these topological constraints, resulting in "broken vessel" artifacts. To address this, we propose the Anatomically Conditioned Recurrent Refinement U-Net (AC2RUNet). Our architecture decouples segmentation into two streams: a Static Stream that extracts invariant anatomical features and a lightweight Dynamic Stream that iteratively refines topological errors over time. We further introduce a dynamic curriculum learning strategy that transitions from high-recall geometric supervision to topology-aware constraints. Validated on the TopCoW dataset, AC2RUNet substantially reduces Hausdorff Distance (4.72 mm vs 9.17 mm) and Betti number errors (0.19 vs 0.40), improving topological connectivity over the nnU-Net baseline while maintaining comparable volumetric Dice.
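
For readers unfamiliar with the metric, the symmetric Hausdorff distance between two point sets is the largest distance from any point in one set to its nearest neighbour in the other. The following is a minimal brute-force sketch, assuming each surface is given as a list of 3D coordinates in millimetres. The abstract does not state which variant the paper reports (for example a percentile version), so this shows the textbook definition only.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy example: one stray point 3 mm away from the reference dominates the score.
prediction = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
distance_mm = hausdorff(prediction, reference)
```

Because a single outlier sets the value, the metric is sensitive to isolated fragments, which is one reason it is often reported alongside overlap scores such as Dice.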

2606.12346 2026-06-11 cs.CV cs.AI cs.LG New submission

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Atlas H&E-TME：基于AI的可扩展组织分析，达到专家病理学家级别的准确性

Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

Affiliations: Aignostics, Germany; Institute of Pathology, Charité – Universitätsmedizin Berlin, Germany; Berlin Institute of Health, Charité – Universitätsmedizin Berlin, Germany; Massachusetts General Hospital, Department of Pathology, Harvard Medical School, Boston, MA, US; Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, US; Machine Learning Group, Technische Universität Berlin, Germany; BIFOLD – Berlin Institute for the Foundations of Learning and Data, Germany; Department of Artificial Intelligence, Korea University, Republic of Korea; Max-Planck Institute for Informatics, Germany; German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Berlin & Munich Partner Sites, Germany; Institute of Pathology, Ludwig-Maximilians-Universität München, Germany; Bavarian Cancer Research Center (BZKF), Germany

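AI summary: Proposes the Atlas H&E-TME system, which uses pathology foundation models to predict tissue quality, tissue regions, and cell types. Validated with an IHC-informed consensus and a benchmark of more than 200,000 annotations, it matches or exceeds pathologist-level performance across multiple cancer types.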

AI中文摘要

苏木精和伊红（H&E）染色是组织病理学的基石，然而对H&E全切片图像（WSI）进行可扩展的定量分析仍然是计算病理学中的核心挑战。我们提出了Atlas H&E-TME，这是一个基于Atlas病理基础模型家族的AI系统，可预测多种癌症类型的组织质量、组织区域和细胞类型标签，在细胞级分辨率下每张切片产生超过4,500个定量读数。验证此类系统的关键挑战在于克服H&E-only金标准固有的形态模糊性，以及依赖免疫组织化学（IHC）等模态的更可靠参考的可扩展性有限。我们通过一个双重验证框架解决了这一问题，该框架将生物学深度的基础与技术及形态学的广度相结合。在深度方面，我们提出了一种IHC引导的多病理学家共识协议，该协议显著提高了相较于传统H&E-only注释的评分者间一致性。这产生了一个分子学基础的参考，我们据此比较Atlas H&E-TME和仅使用H&E的病理学家。在广度方面，我们在超过20万个高置信度H&E-only病理学家注释上对Atlas H&E-TME进行了基准测试，这些注释涵盖1,500多个病例，跨越八种癌症类型及其最常见的转移部位，亚型覆盖每种癌症类型>90%的临床病例，来自25个以上来源和8种以上扫描仪型号。与IHC引导的共识相比，Atlas H&E-TME达到或超过了病理学家仅使用H&E的性能，并在这一广泛的形态学和技术范围内一致且稳健地泛化。通过这种方式，Atlas H&E-TME将H&E切片——病理学中最普遍的数据——转化为一个可扩展的、定量的肿瘤及其微环境窗口，为转化和临床研究中下一代基于组织的生物标志物奠定了基础。

英文摘要

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

2606.12407 2026-06-11 cs.CV New submission

How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

看似无关紧要的设计选择如何决定病理学中LLM的性能

Kian R. Weihrauch, Thomas A. Buckley, William Lotter, Arjun K. Manrai

Affiliations: Massachusetts Institute of Technology; Harvard Medical School; Dana-Farber Cancer Institute

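AI summary: A systematic factorial analysis shows that adjusting input configurations such as patch size and magnification substantially improves the performance of general-purpose large language models on pathology slide tasks, narrowing the gap to specialized models.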

AI中文摘要

通用大语言模型（LLM）在评估全切片图像（WSI）上的专用病理模型时，常被用作基线。由于WSI超出当代模型上下文限制，LLM基线通常使用独立处理的小尺寸高放大倍数补丁，通过多数投票进行，而缺乏对补丁大小、补丁数量和放大倍数等看似无关紧要的设计选择的系统评估。通用LLM一直表现不如专用系统，这强化了领域特定训练或架构适应对于涉及WSI的病理任务必要的观点。在这里，我们对四个输入设计因素：推理模式、补丁大小、放大倍数和补丁数量进行了系统因子分析。我们证明，先前研究通过选择非优化的输入配置夸大了专用模型与通用LLM之间的差距。在MultiPathQA基准上，切换到单一平衡配置（低放大倍数下的大补丁，联合处理）将GPT-5在癌症类型分类（TCGA）上从15.1%提升至39.5%，在器官分类（GTEx）上从38.1%提升至62.9%。每任务优化进一步带来增益，分别达到43.9%（TCGA）和71.6%（GTEx）。相同的配置推广到另外两个模型以及完全保留的CPTAC队列，在无需任何任务特定调整的情况下，将Gemini 3 Flash提升了23.4个百分点。

英文摘要

General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
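
The conventional baseline described above classifies each patch independently and then aggregates by majority vote. A minimal sketch of that aggregation step follows; `majority_vote` and the example labels are illustrative and not taken from the paper's code.

```python
from collections import Counter

def majority_vote(patch_predictions):
    """Return the most frequent label among independent per-patch predictions."""
    if not patch_predictions:
        raise ValueError("need at least one patch prediction")
    # most_common(1) returns [(label, count)]; on ties the label seen first wins.
    return Counter(patch_predictions).most_common(1)[0][0]

# Each string stands for the model's answer on one patch of the same slide.
slide_label = majority_vote(["lung", "breast", "lung", "lung", "kidney"])
```

In this scheme no single model call sees more than one patch, which is the limitation that the jointly processed, lower-magnification configuration in the abstract is meant to remove.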

2606.11287 2026-06-11 eess.IV cs.CV Cross-list

Intelligent Skin Cancer Detection Using a Multispectral Metasurface and a Hybrid

基于多光谱超表面和混合深度学习的智能皮肤癌检测

Afsane Saee Arezoomand

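AI summary: Proposes combining multispectral metasurface imaging with a hybrid CNN-ViT deep learning architecture for high-accuracy skin cancer detection, reaching 98% accuracy, 95% sensitivity, and 99% specificity.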

Comments: 8 pages

AI中文摘要

皮肤癌是全球最常见的恶性肿瘤之一，早期检测对于提高患者生存率和降低治疗成本至关重要。传统的皮肤镜和视觉成像技术主要局限于可见光谱，通常无法捕捉与早期恶性肿瘤相关的细微光谱特征。本研究提出了一种创新框架，将多光谱超表面成像与基于卷积神经网络和视觉Transformer的混合深度学习架构相结合。设计的超表面能够非侵入性地获取对组织变化高度敏感的丰富光谱信息，而混合CNN-ViT模型同时提取局部和全局特征，以稳健地对皮肤病变进行分类。基于模拟的评估表明，所提方法实现了约98%的准确率、95%的灵敏度和99%的特异性，优于传统的基于RGB和单一架构的方法。使用注意力图进行的定性分析显示，模型关注临床相关的病变区域，提高了可解释性。总体而言，结果表明，将基于超表面的多光谱成像与混合深度学习相结合，可以引入新一代皮肤病学诊断工具，并为便携、快速且高精度的临床系统铺平道路。

英文摘要

Skin cancer is among the most prevalent malignancies worldwide, and its early detection is essential for improving patient survival and reducing treatment costs. Conventional dermoscopic and visual imaging techniques are primarily limited to the visible spectrum and often fail to capture subtle spectral signatures associated with early-stage malignancies. This study proposes an innovative framework that integrates a multispectral metasurface for imaging with a hybrid deep learning architecture based on Convolutional Neural Networks and Vision Transformers. The designed metasurface enables noninvasive acquisition of rich spectral information highly sensitive to tissue alterations, while the hybrid CNN-ViT model simultaneously extracts local and global features to robustly classify skin lesions. Simulation-based evaluations demonstrate that the proposed method achieves approximately 98% accuracy, 95% sensitivity, and 99% specificity, surpassing conventional RGB-based and single-architecture approaches. Qualitative analyses using attention maps reveal that the model focuses on clinically relevant lesion regions, improving interpretability. Overall, the results indicate that combining metasurface-based multispectral imaging with hybrid deep learning can introduce a new generation of diagnostic tools in dermatology and pave the way for portable, fast, and highly accurate clinical systems.

2604.10242 2026-06-11 cs.CV Updated version

MedVeriSeg: Teaching LISA-Like Medical Segmentation Models to Verify Query Validity Without Extra Training

MedVeriSeg: 教授LISA-like医学分割模型验证查询的有效性而无需额外训练

Qinyue Tong, Xiaozhen Wang, Ziqian Lu, Jun Liu, Yunlong Yu, Zheming Lu

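AI summary: Proposes MedVeriSeg, a training-free query verification framework that enables LISA-like medical segmentation models to reject false segmentation queries. A Similarity Response Quality Scoring Module and a Lightweight Routed Multi-Agent Verification Module improve verification robustness, and the MedVeriSeg-Bench benchmark is constructed. The framework effectively reduces hallucinated segmentation.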

Comments: 13 pages, 9 figures

AI中文摘要

尽管文本提示基于医学图像分割取得进展，现有LISA-like MLLM方法通常生成掩码，无论查询中指定的目标是否存在，导致幻觉分割。本文提出MedVeriSeg，一种无需训练的查询验证框架，使LISA-like医学分割模型能够拒绝虚假分割查询。MedVeriSeg首先通过相似性响应质量评分模块量化[SEG]标记与图像特征之间的响应质量。为进一步提高鲁棒性，它采用轻量级路由多代理验证模块，将定量得分证据与定性代理证据融合，以全面验证查询的有效性。为支持系统评估，我们构建了MedVeriSeg-Bench，一个用于医学图像分割查询验证的基准。实验结果表明，MedVeriSeg有效识别虚假分割查询，减少幻觉分割，同时保持对有效查询的高接受率，从而在很大程度上保留LISA-like医学分割模型的分割实用性。

英文摘要

Despite recent progress in text-prompt-based medical image segmentation, existing LISA-like MLLM-based methods typically generate masks regardless of whether the target specified in the query is present, leading to hallucinated segmentation. In this work, we propose MedVeriSeg, a training-free query verification framework that enables LISA-like medical segmentation models to reject false segmentation queries. MedVeriSeg first quantifies the response quality between the [SEG] token and image features through a Similarity Response Quality Scoring Module. To further improve robustness, it employs a Lightweight Routed Multi-Agent Verification Module, which fuses quantitative score evidence with qualitative agent evidence to comprehensively verify the validity of the query. To support systematic evaluation, we construct MedVeriSeg-Bench, a benchmark designed for query verification in medical image segmentation. Experimental results demonstrate that MedVeriSeg effectively identifies false segmentation queries and reduces hallucinated segmentation, while maintaining a high acceptance rate for valid queries, thereby largely preserving the segmentation utility of LISA-like medical segmentation models.

2606.11107 2026-06-11 eess.IV cs.CV cs.LG Updated version

Multimodal Brain Tumour Classification Using Feature Fusion

使用特征融合的多模态脑肿瘤分类

Wajih ul Islam, Muhammad Yaqoob, Javed Ali Khan, Volker Steuber

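AI summary: Proposes a two-branch multimodal network that fuses MRI images with 91 radiomic features for brain tumor classification, with gated fusion reaching 96.13% accuracy.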

AI中文摘要

临床医生通过综合患者症状、病史以及来自MRI和CT扫描等模态的定量成像数据，形成统一的临床判断来诊断脑肿瘤。然而，大多数深度学习模型仅依赖MRI/CT图像，未能复制临床医生的多模态推理。我们探索了一种双分支多模态网络，将原始MRI扫描与91个提取的放射组学特征（强度、纹理、形状和边界描述符）相结合，将脑肿瘤分类为胶质瘤、脑膜瘤、垂体瘤和无肿瘤。预训练的CNN骨干网络编码图像流，而专用的MLP编码放射组学特征流。通过拼接、门控或双向跨模态注意力策略融合两个流。在平衡的7200张图像数据集上的九次实验运行中，所有多模态配置均优于单模态基线，其中门控融合实现了最佳准确率96.13%。

英文摘要

Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT images alone, failing to replicate the clinicians' multimodal reasoning. We explore a two-branch multimodal network combining raw MRI scans with 91 extracted radiomic features (intensity, texture, shape, and boundary descriptors) to classify brain tumors into glioma, meningioma, pituitary, and no-tumor. A pre-trained CNN backbone encodes the image stream, whereas a dedicated MLP encodes the radiomic stream. The two streams are fused via concatenation, gating, or bidirectional cross-modal attention. Across nine experimental runs on a balanced 7,200-image dataset, all multimodal configurations outperform unimodal baselines, with gated fusion achieving the best accuracy of 96.13%.
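
As a rough illustration of what gated fusion of two feature streams can look like, the sketch below blends an image embedding and a radiomic embedding with a per-dimension sigmoid gate. The abstract does not specify the gate's design, so the equal-length vectors, the externally supplied `gate_logits`, and the function names are all assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(img_feat, rad_feat, gate_logits):
    """Blend two equal-length feature vectors with a per-dimension gate.

    gate -> 1 favours the image stream, gate -> 0 favours the radiomic stream.
    In a trained network the logits would come from a learned layer applied to
    both streams; here they are passed in directly to keep the sketch small.
    """
    if not (len(img_feat) == len(rad_feat) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    gates = [sigmoid(z) for z in gate_logits]
    return [g * a + (1.0 - g) * b for g, a, b in zip(gates, img_feat, rad_feat)]

# With zero logits the gate is 0.5 everywhere, so the result is the plain average.
fused = gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0])
```

Unlike plain concatenation, a gate lets the model down-weight whichever stream is less informative for a given feature dimension.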

2606.11320 2026-06-11 cs.CV New submission

Semantic Segmentation of Node and Edge Diagrams for Assistive Technology

面向辅助技术的节点和边图语义分割

Michael Cormier, Yichun Zhao, Laura Paul, Cameron Swift, Duc Tri Dang, Miguel Nacenta

Affiliations: Natural Sciences and Engineering Research Council of Canada

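AI summary: Proposes a compact deep learning model for semantic segmentation of node-and-edge diagrams, reaching over 93% pixel accuracy on a synthetic dataset to support non-visual access.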

2606.11477 2026-06-11 cs.CV cs.AI New submission

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

迈向全自动考试评分：基于基础模型的笔迹答案公平性识别

Hartwig Grabowski

Affiliations: Institute for Machine Learning and Analytics (IMLA), Offenburg University

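AI summary: Proposes using vision-language foundation models (VLMs) to recognize handwritten answers, reaching 98.4% accuracy on 61 exams (3,141 answer positions). A lightweight prompt lowers the false-negative rate to 0.58%, enabling fair, fully automated grading.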

Comments: 11 pages, 2 figures, 3 tables

AI中文摘要

手工批改手写试卷既耗时又容易出错，尤其是对于大规模班级，而全数字化考试往往迫使教学局限于封闭式问题格式。一个实用的折中方案是保留纸质、问题导向的任务，但将评估相关的答案以单个大写字母记录在机器可读的表格中。开放的问题是，这种读取能否足够准确，并且最重要的是，足够公平以实现无监督评分。早期的自动化方法仅达到约88%–91%的识别率——太低——并且在最关键的案例上失败：答案写在单元格外、被划掉或草书书写。我们展示了通用视觉-语言基础模型（VLM），它解释页面而非匹配像素模板，弥补了这一差距。在一个包含61份匿名考试（3141个答案位置）的基准测试中，最佳模型达到了98.4%的准确率，远高于之前的基线。关键的是，我们以公平性为中心进行评估：我们区分假阴性（正确答案被标记为错误，对学生不利）和假阳性，并且一个提供参考答案作为上下文的轻量提示将假阴性率降至0.58%。在示例性评分方案下，61份考试中只有3份会被评得更差，所有这些都通过学生自我审查步骤被发现。因此，大规模的全自动、公平性感知考试评分是合理的；我们发布匿名基准以支持可重复性。

英文摘要

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%--91% recognition -- too low -- and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
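
The fairness distinction drawn above can be made concrete with a small counting routine. This is a sketch under stated assumptions: answers are single capital letters, `written` is what the student actually wrote, `recognized` is what the reader returned, and `solution` is the answer key. The function name is hypothetical, and the abstract does not say which denominator the paper uses for its false-negative rate.

```python
def grading_errors(written, recognized, solution):
    """Count grading errors caused purely by misreading the handwriting."""
    false_neg = false_pos = 0
    for w, r, s in zip(written, recognized, solution):
        truly_correct = (w == s)
        graded_correct = (r == s)
        if truly_correct and not graded_correct:
            false_neg += 1  # student loses a point they earned
        elif graded_correct and not truly_correct:
            false_pos += 1  # student gains a point they did not earn
    return false_neg, false_pos

written = ["A", "B", "C", "D"]
recognized = ["A", "D", "C", "B"]
solution = ["A", "B", "C", "B"]
fn_count, fp_count = grading_errors(written, recognized, solution)
```

Both error types lower recognition accuracy equally, but only false negatives disadvantage the student, which is why they are tracked separately.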

2606.11710 2026-06-11 cs.CV New submission

ERN-Net: Evolving Reason Node-Net for Document Binarization

ERN-Net: 用于文档二值化的演化推理节点网络

Hsin-Jui Pan, Sheng-Wei Chan, Jen-Shiung Chiang

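AI summary: Proposes ERN-Net, which enhances degradation-sensitive regions through evolving reason nodes and multi-scale reasoning and, combined with a ConvNeXt-Tiny backbone and DIBCO pretraining, achieves efficient document binarization with little data and low memory.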

2606.11977 2026-06-11 cs.CV New submission

ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction

ParseFixer: 一种通过选择性多模态校正的文档解析智能体框架

LeKai Yu, Hao Liu, Kun Wang, Zhiran Li, Ruping Cao, Fan Liu, Yupeng Hu

Affiliations: Shandong University; Southeast University

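AI summary: Proposes the ParseFixer framework, which combines Full-Page Backbone Parsing with Agentic Selective Correction and repairs high-value parsing errors through a verify-and-rollback mechanism, placing third in the document parsing track of the DataMFM Challenge.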

详情

AI中文摘要

在本报告中，我们介绍了DataMFM挑战赛赛道1：文档解析的第三名解决方案。该赛道要求模型从文档页面图像中恢复结构化的Markdown文档，同时保留文本内容和文档结构。为了解决准确内容恢复和忠实结构重建的互补需求，我们提出了ParseFixer，一个用于骨干解析和选择性校正的智能体框架。ParseFixer包含两个关键模块：全页骨干解析（FBP）和智能体选择性校正（ASC）。FBP使用MinerU2.5 Pro生成稳定的初始Markdown输出，而ASC通过验证-回滚校正过程检测并修复高价值的解析失败。通过在开源骨干解析之后放置选择性多模态校正，ParseFixer在不重写可靠骨干预测的情况下，改善关键文档元素的恢复。在测试集上，我们的最终系统取得了61.78的总分，在赛道1中排名第三，证明了其在准确文档解析方面的有效性。我们的代码将发布在：this https URL。

英文摘要

In this report, we present our third-place solution for the DataMFM Challenge Track 1: Document Parsing. This track requires models to recover structured Markdown documents from document page images while preserving textual content and document structure. To address the complementary requirements of accurate content recovery and faithful structure reconstruction, we propose ParseFixer, an agentic framework for backbone parsing and selective correction. ParseFixer consists of two key modules: Full-Page Backbone Parsing (FBP) and Agentic Selective Correction (ASC). FBP produces stable initial Markdown outputs with MinerU2.5 Pro, while ASC detects high-value parsing failures and repairs them through a verify-and-rollback correction process. By placing selective multimodal correction after open-source backbone parsing, ParseFixer improves the recovery of key document elements without rewriting reliable backbone predictions. On the test set, our final system achieves an overall score of 61.78 and ranks third in Track 1, demonstrating its effectiveness for accurate document parsing. Our code will be released at: this https URL.
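
The verify-and-rollback idea can be sketched as a loop that proposes a fix only for suspicious blocks and keeps it only if a check passes. This is a generic illustration, not the authors' implementation: `selective_correction` and its three callbacks are hypothetical, and the toy rules below stand in for the multimodal model calls a real system would make.

```python
def selective_correction(blocks, needs_fix, propose_fix, verify):
    """Correct only flagged blocks and roll back any fix that fails verification."""
    corrected = []
    for block in blocks:
        if not needs_fix(block):
            corrected.append(block)  # reliable backbone output is left untouched
            continue
        candidate = propose_fix(block)
        # Keep the candidate only if verification passes; otherwise roll back.
        corrected.append(candidate if verify(block, candidate) else block)
    return corrected

# Toy usage: repair Markdown table rows that lost their closing pipe.
blocks = ["# Title", "| a | b", "plain text"]
fixed = selective_correction(
    blocks,
    needs_fix=lambda b: b.startswith("|") and not b.endswith("|"),
    propose_fix=lambda b: b + " |",
    verify=lambda old, new: new.startswith(old) and new.endswith("|"),
)
```

The rollback branch is what keeps a correction step from degrading output that was already right.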

2606.11601 2026-06-11 cs.CV New submission

Spatially Coupled Phase-to-Depth Calibration for Fringe Projection Profilometry

条纹投影轮廓术中的空间耦合相位-深度标定

Sehoon Tak, Jae-Sang Hyun

Affiliations: Department of Mechanical Engineering, Yonsei University

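AI summary: Proposes a spatially coupled phase-to-depth transformation in which all pixels share one mapping built from global phase scalars and affine spatial terms, replacing independent per-pixel calibration, improving spatial consistency, and reducing surface artifacts.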

AI中文摘要

在条纹投影轮廓术（FPP）中，深度通常通过在每个相机像素处独立拟合相位-深度关系来恢复。尽管这种逐像素标定实现了较高的局部精度，但相邻像素即使观测同一光滑表面，也可能获得显著不同的标定函数，导致空间不一致的几何结构和结构化表面伪影。我们提出一种空间耦合的相位-深度变换，其中所有像素共享一个单一的低维映射——全局相位标量与在未畸变参考相机网格上的仿射空间项相结合——而非独立的逐像素拟合，可选地通过一个有界、空间平滑的校正场进行增强。我们进一步引入一种原生网格配对方案，直接在参考相机网格上构建相位-深度标定对：当深度监督来自校正后的主动立体管线时，在立体3D空间中拟合平面，并沿原生射线采样回相机网格，因此相位图从未被校正。在具有高分辨率扫描仪真实数据集的牙齿目标上，所提出的模型达到了与主动立体参考相当的点到表面RMSE（约12微米聚合），同时在空间一致性上显著优于逐像素多项式和有理标定，并将运行时映射减少为每个像素的少量逐元素操作，参数存储可忽略不计。

英文摘要

In fringe projection profilometry (FPP), depth is commonly recovered by fitting a phase-to-depth relation independently at each camera pixel. Although such pixel-wise calibration achieves high local accuracy, neighboring pixels can acquire markedly different calibration functions even when they observe the same smooth surface, producing spatially inconsistent geometry and structured surface artifacts. We propose a spatially coupled phase-depth transformation in which all pixels share a single low-dimensional mapping (global phase scalars combined with affine spatial terms on the undistorted reference-camera grid) rather than independent per-pixel fits, optionally augmented by a bounded, spatially smooth correction field. We further introduce a native-grid pairing scheme that constructs phase-depth calibration pairs directly on the reference-camera grid: when depth supervision comes from a rectified active-stereo pipeline, planes are fitted in stereo 3D and sampled back onto the camera grid along native rays, so the phase maps are never rectified. On a dental target with high-resolution scanner ground truth, the proposed model attains point-to-surface RMSE comparable to an active-stereo reference (about 12 µm aggregate) while substantially improving spatial coherence over pixel-wise polynomial and rational calibration, and reduces the runtime mapping to a few element-wise operations per pixel with negligible parameter storage.

2606.11841 2026-06-11 cs.CV New submission

Scene-Adaptive Nonlinear Tone Curves for Pseudo Ground-Truth Generation in Low-Light 3D Gaussian Splatting

面向低光照3D高斯泼溅的场景自适应非线性色调曲线伪地面真值生成

Mingzhe Lyu, Jinqiang Cui, Hong Zhang

Affiliations: Southern University of Science and Technology; Pengcheng Laboratory

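AI summary: For pseudo ground-truth generation in low-light 3D reconstruction, proposes a scene-adaptive nonlinear tone-curve framework in which two curves (ASE and AP3) replace the linear gain, improving PSNR by up to 4.34 dB across three benchmarks.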

AI中文摘要

低光照新视角合成具有挑战性，因为暗光多视图图像包含噪声、弱结构细节和压缩的动态范围。最近的3D高斯泼溅（3DGS）方法通过生成伪地面真值（pseudo-GT）图像作为监督目标来解决这些挑战，当没有配对正常光照参考时。现有的伪GT方法对所有像素应用均匀线性增益，这会裁剪亮区，同时暗区增强不足，限制了重建质量。我们观察到，在2D低光照增强中早已建立的非线性色调映射，尚未在3D重建的伪GT生成中得到探索。因此，我们提出了一种场景自适应非线性色调曲线框架，用非线性替代方案替换线性伪GT。该框架引入了基于百分位数的归一化以实现场景无关的曲线应用、场景自适应偏移用于自动黑电平调整，以及两条互补曲线：自适应SoftExp（ASE），一种有界指数曲线，和自适应Poly3（AP3），一种数据驱动的三次多项式。该模块仅改变伪GT计算，而保持3DGS骨干不变。在覆盖21个场景的三个基准上的实验表明，两条曲线均一致优于线性基线，在LOM上PSNR提升高达+4.34 dB，在RealX3D上提升+3.25 dB。尽管数学形式不同，两条曲线实现了相似的性能，表明改进是曲线无关的。代码见 https://this https URL。

英文摘要

Low-light novel view synthesis is challenging because dark multi-view images contain noise, weak structural detail, and compressed dynamic range. Recent 3D Gaussian Splatting (3DGS) methods address these challenges by generating pseudo ground-truth (pseudo-GT) images as supervision targets when paired normal-light references are unavailable. Existing pseudo-GT methods apply a uniform linear gain to all pixels, which clips bright regions while providing insufficient enhancement in dark regions, limiting reconstruction quality. We observe that nonlinear tone mappings, long established in 2D low-light enhancement, have not been explored for pseudo-GT generation in 3D reconstruction. Accordingly, we propose a scene-adaptive nonlinear tone-curve framework that replaces linear pseudo-GT with nonlinear alternatives. The framework introduces percentile-based normalisation for scene-agnostic curve application, a scene-adaptive offset for automatic black-level adjustment, and two complementary curves: Adaptive SoftExp (ASE), a bounded exponential curve, and Adaptive Poly3 (AP3), a data-driven cubic polynomial. The module changes only the pseudo-GT computation and leaves the 3DGS backbone unchanged. Experiments on three benchmarks covering 21 scenes show that both curves consistently outperform the linear baseline with PSNR improvements up to +4.34 dB on LOM and +3.25 dB on RealX3D. Both curves achieve similar performance despite their different mathematical forms, suggesting the improvement is curve-agnostic. Code is available at this https URL
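
To make the contrast with a uniform linear gain concrete, the sketch below pairs percentile-based normalisation with a generic bounded exponential curve. It is an illustration only: the abstract does not give the ASE or AP3 formulas, so `bounded_exp_curve`, its steepness `k`, and the nearest-rank `percentile` helper are assumptions, not the paper's definitions.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in [0, 100]."""
    ordered = sorted(values)
    rank = max(0, math.ceil(q / 100 * len(ordered)) - 1)
    return ordered[rank]

def normalise(pixels, q=99.0):
    """Scale by the q-th percentile and clip to [0, 1]."""
    scale = percentile(pixels, q) or 1.0  # avoid dividing by zero on black images
    return [min(max(p / scale, 0.0), 1.0) for p in pixels]

def bounded_exp_curve(x, k=4.0):
    """Map x in [0, 1] onto [0, 1], lifting dark values more than bright ones."""
    return (1.0 - math.exp(-k * x)) / (1.0 - math.exp(-k))

dark_pixels = [0.0, 0.01, 0.02, 0.04]
enhanced = [bounded_exp_curve(x) for x in normalise(dark_pixels, q=100)]
```

Because the curve saturates at 1, bright pixels cannot be clipped the way they are by a large linear gain, while dark pixels still receive a strong lift.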

2606.12303 2026-06-11 cs.CV New submission

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

从二维网格到一维标记：重塑多模态图像融合的共享表示

Yuchen Xian, Yunqiu Xu, Yang He, Yi Yang

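AI summary: Proposes a compact 1D token interface based on a frozen pretrained image tokenizer. Selective Token Editing (STE) sparsely updates critical tokens to steer global appearance consistency while leaving the fusion backbone unchanged, achieving the best balance of global coherence and local fidelity.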

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

多模态图像融合旨在将来自不同模态的互补信息整合到融合图像中，该图像在保持全局一致外观的同时保留丰富的局部细节。现有方法在二维特征网格上构建共享表示，这些表示擅长建模局部结构，但对图像级全局外观因素的利用有限。为平衡这些目标，我们引入了一种基于冻结预训练图像标记器的紧凑一维标记接口，用于建模非局部外观/基因素。我们的设计不是将标记器用作重建骨干，而是将一维标记空间用作全局载体，同时保留用于局部结构恢复的二维空间路径。具体来说，我们引入了选择性标记编辑（STE），它稀疏地更新/替换一小部分关键标记，提供了一种轻量级机制来引导全局外观一致性，同时保持融合骨干网络不变并避免额外损失。在四个常用基准上的实验表明，我们的方法实现了最佳整体性能，在全局连贯性和局部保真度方面均具有一致的多指标改进。项目页面：此 https URL

英文摘要

Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: this https URL

2606.12378 2026-06-11 cs.CV cs.AI New submission

Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

面向机器人生理感知的鲁棒光照相机心率估计

Zhi Wei Xu, Torbjörn E. M. Nordling

Affiliations: National Cheng Kung University

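AI summary: Proposes an end-to-end spatio-temporal Transformer framework that combines PRNet 3D face alignment, illumination augmentation, residual temporal standardization, and hybrid temporal-frequency supervision, reaching a heart-rate MAE of 0.79 bpm and a correlation of 0.982 on a dataset with varied illumination, a 93.6% error reduction relative to PhysFormer.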

Comments: 8 pages, 4 figures

AI中文摘要

生理感知对于在日常生活环境中与人类交互的服务型、社交型和辅助型机器人至关重要。远程光电容积描记法（rPPG）能够从RGB相机中实现非接触式心率（HR）估计，使其成为机器人视觉系统的一种有前景的感知模态。然而，光照变化仍然是鲁棒部署的主要障碍。本文提出了一种端到端的时空Transformer框架，用于在具有不同光照条件的新数据集上进行远程心率估计。我们的估计器集成了基于PRNet的三维人脸对齐、片段级光照增强、残差时序标准化模块以及受控的混合时频监督。训练目标结合了Soft-Shifted Pearson波形损失和频谱Kullback-Leibler散度损失，其中调优权重（$\mathbf{\beta}$）控制频域心率指导的贡献。在覆盖三个光照级别的静态全混合协议上的实验表明，$\mathbf{\beta}=5$在测试的beta设置中提供了最强结果，实现了最佳运行心率平均绝对误差（MAE）为0.79 bpm，心率相关系数为0.982。与在我们的数据集上评估的PhysFormer基线相比，我们的估计器将心率MAE降低了93.6%，同时将心率相关系数从0.088提高到0.982，使其在光照变化时可用。

英文摘要

Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, a Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\beta$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\beta=5$ provides the strongest result among the tested $\beta$ settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6% while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.
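
The shape of a hybrid objective like the one described above can be sketched as a waveform term plus a weighted spectral term. The code is a simplification under stated assumptions: it uses a plain Pearson loss in place of the Soft-Shifted variant (which the abstract does not define), treats both spectra as already normalised probability vectors, and picks the KL direction (target relative to prediction) arbitrarily. All function names are hypothetical.

```python
import math

def pearson_loss(pred, target):
    """1 - Pearson correlation; assumes neither signal is constant."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    norm_t = math.sqrt(sum((t - mean_t) ** 2 for t in target))
    return 1.0 - cov / (norm_p * norm_t)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def hybrid_loss(pred_wave, true_wave, pred_spec, true_spec, beta=5.0):
    """Waveform term plus beta times the frequency-domain term."""
    return pearson_loss(pred_wave, true_wave) + beta * kl_divergence(true_spec, pred_spec)

wave = [0.0, 1.0, 0.0, -1.0]
spec = [0.1, 0.7, 0.2]
perfect = hybrid_loss(wave, wave, spec, spec)
```

Raising `beta` makes errors in the predicted heart-rate spectrum cost more relative to errors in waveform shape, which is the trade-off the tuned weight controls.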

2606.09347 2026-06-11 cs.CV Updated version

IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal

IB-HFN: 信息瓶颈驱动的SAR-光学融合网络用于高保真云去除

Haojun Guo, Fan Feng, Ziquan Wang, Yongsheng Zhang, Ying Yu

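AI summary: Proposes the IB-HFN network, which uses a dual-stream backbone, a Spatial Information Bottleneck Fusion module, and a joint optimization strategy to suppress SAR speckle noise and preserve optical detail for high-fidelity cloud removal.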

AI中文摘要

合成孔径雷达（SAR）辅助的光学云去除旨在利用互补的SAR观测恢复光学遥感图像中被云遮挡的地表信息。现有的多模态融合方法通常依赖于直接的空间拼接和像素级监督，这会将SAR散斑噪声传播到光学重建中，并导致结果过度平滑。为了解决这些局限性，我们提出了一种信息瓶颈驱动的高保真网络（IB-HFN），用于SAR辅助的光学云去除。IB-HFN采用双流骨干网络，在深度语义融合前保留模态特定表示，从而减轻过早的跨模态污染。在融合阶段，我们引入了一个空间信息瓶颈融合模块，通过通道级变分信息瓶颈压缩SAR特征以抑制非结构化散斑噪声。同时，一个局部-全局门控机制预测晴空区域，并通过Dirac初始化的跳跃连接传递可靠的光学细节，将噪声抑制与纹理保留解耦。我们进一步开发了一种联合优化策略，将特征级瓶颈正则化与图像级约束（包括重建精度、结构一致性、光谱保真度和对比度锐度）相结合。动态权重调度平衡这些目标以稳定训练并减少雾状伪影。在SEN12MS-CR数据集上具有挑战性的时空分割下的实验表明，IB-HFN在结构保留和光谱保真度方面优于现有方法。

英文摘要

Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by exploiting complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervision, which can propagate SAR speckle noise into optical reconstruction and lead to over-smoothed results. To address these limitations, we propose an Information Bottleneck-driven High-Fidelity Network (IB-HFN) for SAR-assisted optical cloud removal. IB-HFN employs a dual-stream backbone to preserve modality-specific representations before deep semantic fusion, thereby mitigating premature cross-modal contamination. At the fusion stage, we introduce a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise. In parallel, a local-global gating mechanism predicts clear-sky regions and routes reliable optical details through a Dirac-initialized skip connection, decoupling noise suppression from texture preservation. We further develop a joint optimization strategy that integrates feature-level bottleneck regularization with image-level constraints on reconstruction accuracy, structural consistency, spectral fidelity, and contrastive sharpness. A dynamic weighting schedule balances these objectives to stabilize training and reduce hazy artifacts. Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits demonstrate that IB-HFN achieves superior structural preservation and spectral fidelity over existing methods.
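
In the standard variational information bottleneck, compression is enforced by a KL penalty between the encoder's Gaussian posterior and a unit Gaussian prior. The sketch below computes that closed-form term for one feature vector. Whether IB-HFN uses exactly this prior and parameterisation is an assumption, since the abstract names only a channel-wise variational information bottleneck.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over channels."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )

# A channel that already matches the prior costs nothing to keep compressed.
no_penalty = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting the mean by one standard deviation costs 0.5 nats.
shifted = kl_to_standard_normal([1.0], [0.0])
```

Channels that carry little task-relevant signal are pushed toward the prior by this term, which is the mechanism usually credited with discarding noise-like input.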

2605.17557 2026-06-11 cs.GR cs.CV Updated version

Real-Time Neural Hair Denoising

实时神经头发去噪

Chenghao Wu, Yuefan Shen, Tao Huang, Kai Yan, Zahra Montazeri, Kui Wu

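AI summary: This paper proposes a lightweight real-time method for reconstructing strand-based hair G-buffers from severely undersampled rasterized input. The method first applies neural spatial reconstruction and temporal accumulation to recover hair coverage (the fractional hair visibility within a pixel) and tangent vectors, then uses a tangent-guided reconstruction step to complete the position information, which is subsequently used for physically based deferred hair shading. Evaluated across a variety of hairstyles and both static and dynamic scenes, the method achieves better hair reconstruction quality than existing hair-specific denoising techniques and general-purpose industrial neural reconstruction solutions such as DLSS and FSR.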

2606.11505 2026-06-11 cs.CV cs.AI cs.CR New submission

On the Study of Biometric Spoofing Detection using Deep Learning

基于深度学习的生物特征欺骗检测研究

Kumar Kartikey, Nikos Komninos

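AI summary: Evaluates MobileNetV2, DenseNet-121, Inception-v3, and STD models for spoofing detection in facial recognition systems. MobileNetV2 performs best at 92% accuracy and is suited to practical use.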

AI中文摘要

生物特征系统越来越多地部署在安全应用中；然而，它们仍然容易受到欺骗攻击，攻击者利用伪造的生物特征数据获取未经授权的访问。本研究评估了最先进的机器学习模型MobileNetV2、DenseNet-121、Inception-v3和欺骗痕迹解缠（STD）在面部识别系统中检测欺骗攻击的有效性。使用CelebA-Spoof数据集，研究通过准确率、精确率、召回率和F1分数等指标评估模型有效性。在MSU-MFSD数据集上进行跨数据集验证以评估泛化能力。结果表明MobileNetV2是最有效的模型，在平衡计算效率的同时达到92%的准确率，使其适用于实际应用。Inception-v3表现出中等鲁棒性，而DenseNet-121和STD在泛化方面存在困难。研究结果强调了在领域自适应和混合架构方面取得进展以增强生物特征安全系统的必要性。

英文摘要

Biometric systems are increasingly deployed in security applications; however, they remain vulnerable to spoofing attacks, in which attackers exploit counterfeit biometric data to gain unauthorized access. This research evaluates the effectiveness of state-of-the-art machine learning models, MobileNetV2, DenseNet-121, Inception-v3, and Spoof Trace Disentanglement (STD) in detecting spoofing attacks within facial recognition systems. Using the CelebA-Spoof dataset, the study evaluates model effectiveness using metrics such as accuracy, precision, recall, and F1 Score. Cross-dataset validation is carried out on the MSU-MFSD dataset to assess generalizability. The results show MobileNetV2 as the most efficient model, achieving 92% accuracy while balancing computational effectiveness, making it appropriate for real-life applications. Inception-v3 shows moderate robustness, while DenseNet-121 and STD struggle with generalization. The findings highlight the need for advances in domain adaptation and hybrid architectures to enhance biometric security systems.
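The four metrics named above all follow from the confusion-matrix counts of a binary live/spoof classifier. A minimal sketch, assuming "spoof" is the positive class; the counts in the usage line are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 8 spoofs caught, 2 false alarms, 1 missed spoof, 9 live faces passed.
print(classification_metrics(tp=8, fp=2, fn=1, tn=9))
```

Reporting all four matters for spoofing detection because accuracy alone hides the difference between missed attacks (low recall) and rejected genuine users (low precision).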

2606.11615 2026-06-11 cs.CV cs.CR cs.LG new submission

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Omid Ahmadieh, Nima Karimian

Affiliations: University of South Florida, Bellini College of Artificial Intelligence, Cybersecurity and Computing

AI summary: Proposes the Adv-TGD framework, which uses Stable Diffusion with LoRA fine-tuning to generate realistic adversarial faces, achieving identity impersonation attacks with a high success rate (average ASR of 85.90%) while preserving visual quality.

Abstract

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity-attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a single-step denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by +6.25 points, diffusion-based makeup method DiffAIM by +3 points, and noise-based P3-Mask by +16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 27.15 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

2606.11889 2026-06-11 cs.CV cs.AI cs.RO new submission

Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

Everett Richards

AI summary: Studies how embedding drift relates to changes in a task-aligned hazard score for vision-language models in autonomous driving hazard detection; finds that different corruption types lead to different failure modes and recommends that benchmarks include task-aligned stability measures.

Comments: 8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

Abstract

Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
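The two quantities compared above can be sketched with plain cosine similarities. The abstract does not give the exact hazard score, so the margin used here (similarity to a hazard prompt minus similarity to a safe prompt) and all function names are assumptions; the vectors are toy stand-ins for CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hazard_score(image_emb, hazard_text_emb, safe_text_emb):
    """Assumed margin: similarity to a hazard prompt minus similarity to a safe prompt."""
    return cosine(image_emb, hazard_text_emb) - cosine(image_emb, safe_text_emb)

def drifts(clean_emb, corrupted_emb, hazard_text_emb, safe_text_emb):
    """Return (embedding drift, margin drift) for one clean/corrupted image pair."""
    embedding_drift = 1.0 - cosine(clean_emb, corrupted_emb)
    margin_drift = (hazard_score(corrupted_emb, hazard_text_emb, safe_text_emb)
                    - hazard_score(clean_emb, hazard_text_emb, safe_text_emb))
    return embedding_drift, margin_drift

# Toy 2-D example: the corruption rotates the image embedding toward the safe prompt.
print(drifts([1.0, 0.0], [0.6, 0.8], [1.0, 0.0], [0.0, 1.0]))
```

In this toy case the margin changes sign, which corresponds to a suppressed hazard detection (a false negative). Comparing the two returned values per corruption family is the kind of coupling analysis the abstract describes.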

2606.12263 2026-06-11 cs.CV new submission

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

Chunlin Qiu, Ang Li, Tianxiao Huang, Ruilin Gan, Yunjie Ge, Shenyi Zhang, Huayi Duan, Lingchen Zhao, Chao Shen, Qian Wang

Affiliations: School of Cyber Science and Engineering, Wuhan University; School of Computer Science, Wuhan University; Institute for Math&AI, Wuhan University; The Hong Kong University of Science and Technology (Guangzhou); School of Cyber Science and Engineering, Xi’an Jiaotong University

AI summary: To counter the use of latent diffusion models for unauthorized mimicry, proposes the VOID defense framework, which manipulates the model's intrinsic stochasticity by amplifying latent encoding errors and counteracting target guidance signals. The resulting semantic corruption blocks unauthorized mimicry, while perturbations are confined to human-imperceptible regions.

Comments: To appear in the 35th USENIX Security Symposium (USENIX Security 2026)

Abstract

While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.
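The reported 223% figure can be checked as a relative increase, assuming that 113 is the average FID under the strongest prior defense and 365 is the average FID under VOID (higher FID meaning the mimicry output is further from the protected images). The helper name is hypothetical.

```python
def percent_increase(baseline, value):
    """Relative increase of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline

print(round(percent_increase(113, 365)))  # -> 223
```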

2606.11200 2026-06-11 cs.CL cs.CV cross-list

Detecting AI-Generated Content on Social Media with Multi-modal Language Models

Chenyang Yang, Shen Yan, Yibo Yang, Litao Hu, Yuchen Liu, Yuan Zeng, Hanchao Yu, Yinan Zhu, Sumedha Singla, Brian Vanover, Huijun Qian, Zihao Wang, Fujun Liu, Aashu Singh, Jianyu Wang, Xuewen Zhang

Affiliations: Carnegie Mellon University; Meta

AI summary: To address poor generalization, reliance on single modalities, and lack of interpretability in AI-generated content detection, proposes a compact vision-language model trained on multi-modal data for detection and explanation. It reaches state-of-the-art detection on public benchmarks and shows robust detection and explanation on internal social media datasets.

Abstract

Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present our pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.

2506.03933 2026-06-11 cs.CV cs.AI version update

Diffusion-based Cumulative Adversarial Purification for Vision Language Models

Jia Fu, Yongtao Wu, Yihang Chen, Kunyu Peng, Xiao Zhang, Volkan Cevher, Sepideh Pashami, Anders Holst

AI summary: Proposes DiffCAP, a diffusion-based adversarial purification strategy. It proves theoretically that adversarial effects decay monotonically as diffusion proceeds, and purifies adaptively using noise injection with a VLM embedding similarity threshold, substantially improving defense and speeding up denoising.

Comments: Accepted to Transactions on Machine Learning Research (TMLR 2026)

Abstract

Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We theoretically establish a provable recovery region in the forward diffusion process and meanwhile quantify the convergence rate of semantic variation with respect to VLMs. These findings manifest that adversarial effects monotonically fade as diffusion unfolds. Guided by this principle, DiffCAP leverages noise injection with a similarity threshold of VLM embeddings as an adaptive criterion, before reverse diffusion restores a clean and reliable representation for VLM inference. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with theorems and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments. The source code is available at this https URL.

2510.08073 2026-06-11 cs.CV cs.LG version update

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, Mingkui Tan

AI summary: Proposes a physics-driven paradigm for AI-generated video detection based on probability flow conservation. A Normalized Spatiotemporal Gradient (NSG) statistic captures physical anomalies, NSG is estimated with pre-trained diffusion models, and Maximum Mean Discrepancy (MMD) is used for detection, improving Recall and F1-Score by 16.00% and 10.75%, respectively.

Comments: Accepted at NeurIPS 2025 spotlight

Abstract

AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose the first physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at this https URL.
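The detection metric above is a Maximum Mean Discrepancy between two sets of features. The sketch below is the standard biased estimate of squared MMD with an RBF kernel; the abstract does not state the kernel, bandwidth, or NSG feature shape, so those choices and the toy one-dimensional features are assumptions.

```python
import math

def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# Toy features: a test set close to the real reference set scores lower than a shifted one.
real = [[0.0], [0.1], [0.2]]
near = [[0.05], [0.15], [0.25]]
far = [[3.0], [3.1], [3.2]]
print(mmd_squared(real, near), mmd_squared(real, far))
```

A detector built on this would flag a test video when its MMD to the real reference set exceeds a threshold chosen on held-out data.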

2604.06961 2026-06-11 cs.CV version update

Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

Pablo Parte, Roberto Valle, José M. Buenaposada, Luis Baumela

AI summary: Systematically audits age, gender, and race bias in facial landmark detection, using a controlled statistical methodology to separate demographic effects from confounding visual factors. Confounders such as head pose and face resolution have the larger effect, but a statistically significant age bias remains.

Abstract

Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender, and race biases. To this end, we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Our analysis demonstrates that visual confounders, particularly head pose and face resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, performance disparities across gender and race vanish. However, we identify a statistically significant age-related bias, with higher localization errors for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.

2605.16651 2026-06-11 cs.CV cs.LG version update

Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

Narges Babadi, Hadis Karimipour

AI summary: Investigates whether explanation heatmaps in vision-language models faithfully reflect model reasoning under adversarial conditions. The proposed X-Shift attack exposes a disconnect between explanations and predictive behavior, demonstrating the fragility of explanation mechanisms.

Comments: Accepted at the ICML 2026 Workshop on Trustworthy AI for Good (AI4GOOD), Seoul, South Korea

Abstract

Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.

2605.31219 2026-06-11 cs.CV cs.CR cs.LG version update

Latent Geometric Chords for Query-Efficient Decision-Based Adversarial Attacks

Ei Hmue Khine, Yao Li, Jiebao Sun, Shengzhu Shi, Zhichang Guo, Boying Wu

AI summary: Proposes Latent Geometric Chords (LGC), which navigates decision boundaries via a curvature-aware geometric search in a compressed semantic manifold, and introduces a Residual-based Adversarial Generation (RAG) mechanism to achieve query-efficient decision-based black-box adversarial attacks with high visual fidelity.

Comments: Added a conceptual diagram for the LGC architecture, 14 pages, 10 figures, 7 tables. Submitted to IEEE Transactions on Information Forensics and Security. The source code is available at https://github.com/eihmuekhine/Latent-Geometric-Chords

Abstract

While decision-based black-box adversarial attacks present a severe security threat, current methodologies suffer from fundamental limitations. Pixel-wise attacks frequently introduce unnatural, high-frequency visual artifacts, while latent-space frameworks are confined by the limited search space of low-dimensional manifolds and inherent reconstruction flaws. To resolve these limitations, we propose Latent Geometric Chords (LGC) for Query-Efficient Decision-Based Adversarial Attacks alongside a variant, LGC-H. At its core, LGC navigates decision boundaries by executing a curvature-aware geometric search within a compressed semantic manifold. To guarantee high visual fidelity and circumvent dimensionality bottlenecks, we introduce a Residual-based Adversarial Generation (RAG) mechanism. RAG isolates semantic perturbations as geometric chords and superimposes them directly onto the original source image. RAG substantially resolves baseline reconstruction flaws and effectively doubles the permissible search space dimensions. Experimental results demonstrate that LGC achieves robust cross-dataset transferability and substantially outperforms state-of-the-art baselines. Notably, our method, LGC, minimizes perturbation magnitudes while achieving state-of-the-art visual fidelity--with a Structural Similarity Index Measure (SSIM) exceeding 0.99 and a Learned Perceptual Image Patch Similarity (LPIPS) below 0.01 at 5000 queries--and sustaining high attack success rates under stringent perceptual constraints, successfully compromising adversarially trained robust models. The source code is available at: this https URL.

2606.10198 2026-06-11 cs.LG cs.AI cs.CV version update

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

Nina I. Shamsi

AI summary: For hallucination detection in large language and vision-language models when calibration labels are scarce, proposes a density ridge method based on kernel density estimation. The response manifold is built from a six-dimensional kinematic feature map of hidden-state generation trajectories, and generations are scored by the Euclidean distance to the nearest ridge vertex, improving AUROC by 5-20 points under a label-scarce protocol.

Abstract

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden-state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score outperforms these baselines on AUROC by 5-20 points, while showing tempered degradation under calibration-label scarcity.
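The scoring rule above (negated Euclidean distance to the nearest ridge vertex, followed by abstention at low confidence) can be sketched as follows. The ridge vertices are assumed to be precomputed from the kernel density estimate; how the ridge is extracted is not shown, and the threshold and function names are hypothetical.

```python
import math

def ridge_score(point, ridge_vertices):
    """Negated Euclidean distance to the nearest ridge vertex (higher = more typical)."""
    return -min(math.dist(point, v) for v in ridge_vertices)

def predict_or_abstain(point, ridge_vertices, threshold):
    """Selective prediction: answer only when the score clears the threshold."""
    return "answer" if ridge_score(point, ridge_vertices) >= threshold else "abstain"

# Toy six-dimensional ridge with two vertices.
ridge = [(0.0,) * 6, (1.0,) * 6]
print(predict_or_abstain((0.1, 0.0, 0.0, 0.0, 0.0, 0.0), ridge, threshold=-0.5))
print(predict_or_abstain((0.0, 0.0, 0.0, 0.0, 0.0, 2.0), ridge, threshold=-0.5))
```

In a label-scarce setting, the small calibration set would be used only to pick the threshold, since the ridge itself is built without labels.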

2606.11233 2026-06-11 cs.CV new submission

OSCS-SupCon: Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning for Robust Feature Disentanglement

Bin Wang, Fadi Dornaika

Affiliations: University of the Basque Country; IKERBASQUE

AI summary: To address negative-sample dilution and feature-space entanglement in supervised contrastive learning, proposes the OSCS-SupCon framework, which uses a sigmoid contrastive loss and orthogonality constraints to improve feature discriminability and generalization.

Abstract

Supervised Contrastive Learning (SupCon) has achieved strong performance by explicitly modeling pairwise relationships among samples. However, existing SupCon-based methods suffer from two key limitations: negative-sample dilution induced by the standard InfoNCE loss, and feature-space entanglement caused by the lack of explicit constraints separating category-relevant (common) and category-irrelevant (style) features. These limitations reduce feature discriminability and generalization ability. To address these issues, we propose OSCS-SupCon (Orthogonal Sigmoid-based Common and Style Supervised Contrastive Learning), a unified framework that combines a sigmoid-based pairwise contrastive objective with explicit orthogonality constraints. Specifically, we introduce a sigmoid-based contrastive loss with two learnable parameters, temperature and bias, which adaptively modulate pairwise decision boundaries and alleviate negative-sample dilution. Furthermore, we enforce orthogonality between common and style feature subspaces via a linear projection with ReLU nonlinearity, thereby reducing feature overlap and improving disentanglement of style-irrelevant representations. Extensive experiments on six benchmark datasets demonstrate that OSCS-SupCon consistently outperforms state-of-the-art supervised contrastive learning methods across multiple backbone architectures. In particular, on the fine-grained CUB200-2011 dataset with a ResNet-18 backbone, the proposed method achieves a 3.4% improvement in classification accuracy over CS-SupCon, highlighting its robustness and generalization capability. Ablation studies further confirm the effectiveness of each component.
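A sigmoid-based pairwise objective with a temperature and a bias can be sketched as below. The abstract does not give the formula, so this follows the common form -log sigmoid(z(ts + b)), with z = +1 for same-class pairs and z = -1 otherwise, as an assumption. In the paper the two parameters are learned; here they are fixed numbers. The orthogonality penalty is a simplified per-sample stand-in for the paper's subspace constraint.

```python
import math

def sigmoid_pair_loss(similarities, same_class, temperature, bias):
    """Mean of -log sigmoid(z * (t * s + b)) over pairs; z = +1 if same class, else -1."""
    total = 0.0
    for s, same in zip(similarities, same_class):
        z = 1.0 if same else -1.0
        x = -z * (temperature * s + bias)
        # Numerically stable softplus(x) = log(1 + exp(x)) = -log sigmoid(-x).
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(similarities)

def orthogonality_penalty(common, style):
    """Squared dot product; zero when the two feature vectors are orthogonal."""
    return sum(c * s for c, s in zip(common, style)) ** 2

# Well-separated pairs give a small loss; confused pairs give a large one.
print(sigmoid_pair_loss([0.9, -0.8], [True, False], temperature=10.0, bias=0.0))
print(sigmoid_pair_loss([-0.8, 0.9], [True, False], temperature=10.0, bias=0.0))
```

Because each pair contributes its own independent term, adding more negatives does not shrink the weight of any single pair the way a softmax normalizer does, which is the usual argument for why a sigmoid objective eases negative-sample dilution.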

2606.11363 2026-06-11 cs.CV new submission

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

Hao Lu, Yongxin Guo, Onur Koyun, Zhengjie Zhu, Abbas Alili, Metin N. Gurcan

Affiliations: Wake Forest University School of Medicine; Advocate Health

AI summary: Proposes the NSVQ training strategy, which mitigates codebook collapse in large-codebook VQ through a non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing, improving reconstruction quality on ImageNet-1k while maintaining 100% codebook utilization.

Abstract

Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization. On ImageNet-1k at 128×128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.
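Codebook utilization and a simple form of codebook replacement can be illustrated as follows. This is a toy sketch and not the NSVQ procedure: the replacement policy (reseeding unused codes at the worst-quantized latents) and all names are assumptions chosen for illustration.

```python
import math

def nearest_code(z, codebook):
    """Index of the code vector closest to latent z."""
    return min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))

def utilization(latents, codebook):
    """Fraction of codes that receive at least one assignment."""
    used = {nearest_code(z, codebook) for z in latents}
    return len(used) / len(codebook)

def replace_dead_codes(latents, codebook):
    """Hypothetical policy: move each unused code onto one of the worst-quantized latents."""
    used = {nearest_code(z, codebook) for z in latents}
    dead = [k for k in range(len(codebook)) if k not in used]
    worst_first = sorted(
        latents,
        key=lambda z: math.dist(z, codebook[nearest_code(z, codebook)]),
        reverse=True,
    )
    new_codebook = [list(c) for c in codebook]
    for k, z in zip(dead, worst_first):
        new_codebook[k] = list(z)
    return new_codebook

# One code has drifted far from the latents and receives no assignments.
latents = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
codebook = [[0.0, 0.0], [5.0, 5.0]]
print(utilization(latents, codebook))
print(utilization(latents, replace_dead_codes(latents, codebook)))
```

The stranded code at (5, 5) mirrors the lag described above: once the latents move away from a code, it stops receiving assignments and therefore stops being updated unless something reseeds it.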

2606.11381 2026-06-11 cs.CV new submission

From Simulation to Real-World: An In-Field 6D Pose Dataset and Baseline for Robotic Strawberry Harvesting

Woojung Son (1), Won Suk Lee (1), Zijing Huang (1), Daeun Choi (1), Catia Silva (2), Yu She (3), Yan Gu (4) ((1) Department of Agricultural and Biological Engineering, University of Florida, (2) Department of Electrical and Computer Engineering, University of Florida, (3) Edwardson School of Industrial Engineering, Purdue University, (4) School of Mechanical Engineering, Purdue University)

Affiliations: Department of Agricultural and Biological Engineering, University of Florida; Department of Electrical and Computer Engineering, University of Florida; Edwardson School of Industrial Engineering, Purdue University; School of Mechanical Engineering, Purdue University

AI summary: To address the sim-to-real gap in 6D pose estimation for robotic strawberry harvesting, presents what the authors describe as the first in-field strawberry 6D pose ground truth dataset (12,040 images), together with a synthetic dataset with scene-level realism rendered in NVIDIA Isaac Sim, and quantifies the gap through baseline experiments.

Comments: 7 pages, 6 figures, 1 table

Abstract

Robotic strawberry harvesting requires precise 6D pose estimation; however, collecting 6D pose ground truth in real agricultural fields is inherently challenging. Existing 6D pose estimation methods have therefore relied solely on synthetic data that lacks scene-level realism, leaving their performance under real agricultural field conditions unquantified. In this work, we present, to the best of our knowledge, the first real-world 6D pose ground truth dataset of strawberries collected in actual agricultural fields (12,040 images). We also introduce a synthetic dataset rendered in NVIDIA Isaac Sim, featuring scene-level realism and domain randomization. Nevertheless, our experiments reveal that a significant sim-to-real gap persists, underscoring the necessity of real agricultural field data for reliable evaluation. We further quantify the sim-to-real gap through baseline 6D pose estimation results across backbone encoders, serving as a reference for future work. The real-world dataset will be made available upon acceptance.

2606.11563 2026-06-11 cs.CV cs.RO new submission

Cross-Modal Benchmarking for Robotic Perception in Natural Environments

David Hall, Joshua Knights, Mark Cox, Peyman Moghadam

AI总结针对自然环境中机器人感知的挑战，提出WildCross跨模态基准，用于大规模自然场景下的地点识别和度量深度估计，并扩展了度量深度估计实验。

Comments: Accepted to the IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

AI中文摘要

自然环境对机器人感知系统提出了复杂挑战。当前模型，特别是视觉基础模型，主要在有结构的城市环境中训练，导致其在野外机器人任务的感知中存在弱点。我们利用最近发布的WildCross基准展示了当前模型的局限性，这是一个用于大规模自然环境中地点识别和度量深度估计的新型跨模态基准。WildCross包含超过476K个顺序RGB帧，带有半稠密深度和表面法线标注，每个帧都与准确的6DoF姿态和同步的稠密激光雷达子地图对齐。在这项工作中，我们提供了对最近WildCross基准结果的扩展分析，特别强调扩展的度量深度估计实验。本工作的代码仓库和数据集可在https://csiro-robotics.github.io/WildCross获取。

英文摘要

Natural environments present a complex challenge to robotic perception systems. Current models, particularly vision foundation models, are largely trained on structured, urban environments, which leads to weaknesses in their perception for field robotics tasks. We showcase the limitations of current models using our recently released WildCross benchmark, a new cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF pose and synchronized dense lidar submaps. In this work, we provide an expanded analysis of the recent WildCross benchmark results, with particular emphasis on expanded metric depth estimation experiments. The code repository and dataset for this work are available at https://csiro-robotics.github.io/WildCross.

2606.11568 2026-06-11 cs.CV 新提交

4DP-QA: Scalable QA for 4D Perception in Vision Language Models

4DP-QA：面向视觉语言模型中4D感知的可扩展问答

Seokju Cho, Abhishek Badki, Hang Su, Jindong Jiang, Ziyao Zeng, Seungryong Kim, Sifei Liu, Orazio Gallo

发表机构 * NVIDIA（英伟达）； Yale University（耶鲁大学）； KAIST AI（韩国科学技术院人工智能学院）

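AI summary: To address the difficulty vision language models have with dynamic scenes, the paper proposes a QA generation pipeline focused on motion-related scene understanding. It disentangles object and camera motion through True-Motion Tracking, produces the large-scale dataset 4DP-QA and the benchmark 4DP-QA-Bench, and shows that training existing models on the data improves performance on an external benchmark.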

Comments: Project page: this https URL

AI中文摘要

尽管近期取得了进展，视觉语言模型（VLM）仍然难以理解世界的动态。我们注意到，对4D场景进行推理的能力本身具有挑战性，且因两个因素而进一步复杂化。首先，VLM只能通过运动在2D图像上的投影间接观察运动。其次，现有数据集未能解耦物体和相机运动。为应对这些挑战，我们提出一个关注运动相关场景理解的问答生成流水线。我们特别关注相机运动与物体运动之间的纠缠，既以传统方式进行追踪，也在一种新颖的固定参考系（称为真运动追踪）中进行追踪，后者提供了对运动的直观描述。通过该流水线，我们生成了一个包含40万样本的大规模训练数据集4DP-QA（4D感知问答）和一个包含2200样本的基准数据集4DP-QA-Bench。在我们的数据集上训练现有模型在外部基准上取得了性能提升，验证了我们方法的有效性。

英文摘要

Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We pay particular attention to the entanglement of camera and object motion by casting tracking both in the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.
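
As a rough illustration of why a fixed reference system can separate object motion from camera motion, the sketch below re-expresses a tracked 3D point in a world frame using per-frame camera poses. The pose convention (a camera-to-world rotation plus the camera position), the function names, and the toy numbers are all assumptions made for this example; the abstract does not describe how True-Motion Tracking is implemented.

```python
def camera_to_world(point_cam, rotation, translation):
    """Map a 3D point from camera coordinates to a fixed world frame.

    rotation is a 3x3 camera-to-world matrix (list of rows) and
    translation is the camera position in the world frame.
    """
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def displacement(track):
    """Return the displacement between the first and last point of a track."""
    first, last = track[0], track[-1]
    return tuple(b - a for a, b in zip(first, last))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical scene: a static object seen by a camera sliding along +x.
camera_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
# In camera coordinates the object appears to drift along -x.
track_cam = [(2.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 0.0, 5.0)]

track_world = [
    camera_to_world(p, IDENTITY, t) for p, t in zip(track_cam, camera_positions)
]

apparent_motion = displacement(track_cam)   # mixes camera and object motion
true_motion = displacement(track_world)     # object motion only
```

In this toy case the camera-frame track suggests the object moved, while the world-frame track shows it stayed put, which is the kind of distinction a motion-related question would need to get right.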

2606.11682 2026-06-11 cs.CV cs.LG 新提交

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

面向表格-图像多模态学习的参数高效适配器微调

Jiaqi Luo

发表机构 * School of Mathematical Sciences, Soochow University（苏州大学数学科学学院）

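AI summary: Proposes the TI-Adapter framework, which freezes the tabular encoder and adds an adapter after it, and adapts the image branch with embedding-level and bottleneck-level adapters. On 20 datasets it matches or exceeds full fine-tuning with far fewer trainable parameters.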

AI中文摘要

表格-图像多模态学习旨在通过联合使用结构化表格属性和视觉数据来提高预测建模能力。尽管预训练编码器提供了强大的模态特定表示，但全微调可能计算成本高昂，而保持编码器冻结可能限制任务特定适应。我们提出了表格-图像适配器（TI-Adapter），一种基于模态特定适配器的微调框架，用于高效的多模态适应。TI-Adapter冻结预训练的表格编码器，并在提取的表格嵌入后学习一个适配器，同时通过嵌入级和瓶颈级适配器来适应图像分支，而不是全微调。在20个表格-图像数据集上的实验表明，TI-Adapter在使用显著更少的可训练参数的情况下，达到了与全微调相当或更好的预测性能。消融研究进一步证明了适配器放置对于平衡性能和实际效率的重要性。

英文摘要

Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.
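
To make the parameter-efficiency argument concrete, here is a minimal sketch of a residual bottleneck adapter applied to a frozen embedding. It is a generic adapter design, not the TI-Adapter architecture itself: the dimensions, the ReLU nonlinearity, the zero-initialised up-projection, and the omission of biases are assumptions chosen for illustration.

```python
import random


def adapter_param_count(dim, bottleneck):
    """Weights of a down-projection plus an up-projection (no biases)."""
    return 2 * dim * bottleneck


def dense_param_count(dim):
    """Weights of a full dim x dim layer, for comparison."""
    return dim * dim


def make_adapter(dim, bottleneck, seed=0):
    """Create adapter weights; the up-projection starts at zero."""
    rng = random.Random(seed)
    down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
    up = [[0.0] * bottleneck for _ in range(dim)]
    return down, up


def adapter_forward(embedding, down, up):
    """Residual bottleneck adapter: x + up(relu(down(x)))."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, embedding))) for row in down]
    delta = [sum(w * h for w, h in zip(row, hidden)) for row in up]
    return [x + d for x, d in zip(embedding, delta)]


# Hypothetical sizes: a 768-dimensional embedding with a 16-unit bottleneck.
trainable = adapter_param_count(768, 16)
full_layer = dense_param_count(768)

frozen_embedding = [0.5, -1.0, 2.0, 0.25]
down, up = make_adapter(dim=4, bottleneck=2)
adapted = adapter_forward(frozen_embedding, down, up)
```

Because the up-projection starts at zero, the adapter initially passes the frozen embedding through unchanged, so training begins from the pretrained representation and only the small adapter matrices need gradients.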

2606.11702 2026-06-11 cs.CV cs.AI cs.CL 新提交

MedCTA: A Benchmark for Clinical Tool Agents

MedCTA: 临床工具智能体基准

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

发表机构 * King Abdullah University of Science and Technology (KAUST)（阿卜杜拉国王科技大学）； Massachusetts Institute of Technology (MIT)（麻省理工学院）

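AI summary: Proposes the MedCTA benchmark, which uses realistic multimodal clinical inputs such as radiology images, pathology slides, and reports to evaluate how well medical AI agents plan and execute tool retrieval, evidence acquisition, and integration.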

Comments: Project Page: this https URL Code: this https URL Data: this https URL

AI中文摘要

为了做出临床合理的决策，医疗AI智能体需要超越简单的识别，具备工具检索、证据获取和集成能力。现有基准主要评估孤立的感知或单轮问答，因此对规划、工具调用和部署可靠性的失败可见性有限。我们提出了MedCTA，一个用于评估医疗工具智能体的基准，基于临床验证的、步骤隐含的任务，这些任务基于真实的多模态临床输入，包括放射影像、病理切片和报告。MedCTA包含107个真实临床任务，具有临床医生验证的、在5个部署工具上的可执行轨迹，并支持对工具选择、参数有效性、执行稳定性、轨迹保真度和结果质量的过程感知评估。我们对18个开源和闭源多模态模型进行了基准测试，发现即使是最先进的系统在多步骤临床工具使用中仍然脆弱：自主部署主要由协议失败、过早停止和错误工具调用主导，而黄金标准工具路由带来了巨大但仍不完整的改进。这些结果表明，强大的骨干感知能力并不能转化为临床环境中可靠的智能体行为。MedCTA为审计、诊断和推进可信赖的医疗AI智能体提供了一个严格的测试平台。数据集和评估套件可在该https URL获取。

英文摘要

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at this https URL

2606.11739 2026-06-11 cs.CV cs.AI 新提交

Multi-View In-Cabin Monitoring System for Public Transport Vehicles

公共交通车辆的多视角座舱内监控系统

Evgeny Gorelik, Kenny Dean Karrow, Fikret Sivrikaya, Sahin Albayrak, Christian Baumann

发表机构 * Technische Universität Berlin（柏林工业大学）； German Research Center for Artificial Intelligence (DFKI)（德国人工智能研究中心）

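AI summary: Presents a multi-view in-cabin monitoring dataset containing synchronized RGB-D images and LiDAR data, with 3D human pose and bounding-box annotations, to support the evaluation of multi-view 3D detection models.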

2606.11783 2026-06-11 cs.CV 新提交

A Comprehensive Ecosystem for Open-Domain Customized Video Generation

开放域定制视频生成的综合生态系统

Jingxu Zhang, Yuqian Hong, Daneul Kim, Kai Qiu, Qi Dai, Jianmin Bao, Yifan Yang, Xiaoyan Sun, Chong Luo

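AI summary: Introduces the million-scale dataset PexelsCustom-1M and the parameter-efficient framework CustoMDiT, which achieves customized video generation with only 8% additional learnable parameters, and builds the OpenCustom benchmark with more than 1,000 categories. The authors plan to open-source the whole ecosystem.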

Comments: 5 pages, 3 figures, 4 tables. Accepted by ICASSP 2026

AI中文摘要

近期视频生成的进展展示了令人印象深刻的视觉合成能力。然而，开放域定制视频生成仍然受到缺乏大规模、带标注的数据集来捕捉多样化的身份特定属性的限制。为了解决这个问题，我们引入了PexelsCustom-1M，这是第一个公开可用的百万级身份保持视频生成数据集，包含跨越8000多个类别的一百万个精心策划的<身份，文本，视频>三元组。利用这一点，我们提出了CustoMDiT，一个参数高效的框架，将预训练的多模态扩散Transformer适配为定制视频生成器，仅增加8%的可学习参数。我们的方法超越了先前的最先进技术。然而，像DreamBooth这样的基准只覆盖了100个类别，对于现实应用来说是不够的。为了克服这一点，我们构建了OpenCustom，一个新的包含1000多个类别的基准，通过ImageNet和MS-COCO的跨数据集知识融合创建。大量实验证实了我们的数据集和模型的优势。我们将开源整个生态系统——包括数据集、流水线、基准和实现——以支持进一步的研究。

英文摘要

Recent progress in video generation has shown impressive visual synthesis capabilities. However, open-domain customized video generation remains limited by the lack of large-scale, annotated datasets capturing diverse identity-specific attributes. To address this, we introduce PexelsCustom-1M, the first publicly available million-scale dataset for identity-preserving video generation, containing one million curated <identity, text, video> triplets across 8,000+ categories. Leveraging this, we propose CustoMDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer into a customized video generator with only 8% additional learnable parameters. Our method surpasses prior state-of-the-art. However, benchmarks such as DreamBooth cover only 100 classes, which is insufficient for real-world applications. To overcome this, we construct OpenCustom, a new benchmark with 1,000+ categories, created via cross-dataset knowledge fusion from ImageNet and MS-COCO. Extensive experiments confirm the advantages of both our dataset and model. We will open-source the entire ecosystem--including dataset, pipeline, benchmark, and implementations--to support further research.

2606.11925 2026-06-11 cs.CV cs.LG 新提交

Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

通过LLM引导的视频拼接进行手语翻译的语料增强

Zsolt Robotka, Ádám Rák, Jalal Al-Afandi, András Horváth, György Cserey

发表机构 * Peter Pazmany Catholic University, Faculty of Information Technology and Bionics（彼得·帕兹马尼天主教大学信息科技与仿生学院）； DeepSign Technologies Ltd.（DeepSign科技有限公司）

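AI summary: Proposes a corpus augmentation method for sign language translation that needs no additional annotation or generative video model. It extracts per-gloss clips via CTC forced alignment, generates sentences with an LLM, and stitches clips into videos, gaining +2.92 BLEU-4 over the GFSLT-VLP baseline. The authors also find that synthetic data harms vision-language pretraining even though it improves downstream translation.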

AI中文摘要

手语翻译（SLT）将手语视频转换为口语文本，对于改善无障碍交流以及促进手语与非手语社区之间的沟通具有重要前景。虽然大规模弱对齐数据集实现了规模化预训练，且无词汇表方法减少了对专家标注的依赖，但用于微调的高质量平行手语视频-文本对仍然稀缺，限制了长尾词汇和未见结构的泛化。我们提出一种语料增强方法，无需额外人工标注、外部手语视频语料库或生成式视频模型，仅依赖现有的词汇表标注训练语料和用于句子生成的LLM：通过CTC强制对齐从训练视频中提取每个手语词汇的片段，由语料锚定的LLM生成新的词汇-句子对，通过随机句子采样和片段分配组装合成序列。得到的合成RGB视频-文本对在下游训练阶段与架构无关，可直接被基于RGB的SLT模型使用，或通过从视频提取输入的流水线转换为姿态或特征表示。Sincan等人在严格相同条件下重新评估了五种近期无词汇表方法；在GFSLT-VLP基线上验证的最大增益仅为0.98 BLEU-4。我们的增强方法在同一框架内应用，无需改变架构或训练协议，实现了+2.92 BLEU-4。我们进一步发现，合成数据虽然改善了视觉-语言预训练的目标函数，却损害了预训练本身；并且在基于L2的准则下，为视觉平滑而优化片段过渡会适得其反；我们提出，突兀的边界可能作为一种隐式正则化形式。代码可在 this https URL 获取。

英文摘要

Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at this https URL.
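
The final assembly step described above (random sentence sampling followed by clip assignment) can be sketched in a few lines. Everything concrete here is hypothetical: the gloss names, the clip identifiers, the generated sentences, and the helper `assemble_synthetic_sample` are placeholders, and the CTC forced-alignment and LLM generation stages are assumed to have already produced the clip bank and the gloss-sentence pairs.

```python
import random


def assemble_synthetic_sample(gloss_sequence, sentence, clip_bank, rng):
    """Pick one stored clip per gloss and pair the stitched clips with a sentence."""
    clips = [rng.choice(clip_bank[gloss]) for gloss in gloss_sequence]
    return {"clips": clips, "text": sentence}


# Hypothetical clip bank: gloss -> clip identifiers cut from training videos.
clip_bank = {
    "WEATHER": ["vid03:12-31", "vid17:40-58"],
    "TOMORROW": ["vid03:32-50"],
    "RAIN": ["vid08:05-22", "vid21:77-95", "vid30:10-29"],
}

# Hypothetical LLM output: gloss sequences paired with spoken-language text.
generated_pairs = [
    (["TOMORROW", "RAIN"], "It will rain tomorrow."),
    (["WEATHER", "TOMORROW", "RAIN"], "The weather tomorrow will be rainy."),
]

rng = random.Random(0)
glosses, sentence = rng.choice(generated_pairs)  # random sentence sampling
sample = assemble_synthetic_sample(glosses, sentence, clip_bank, rng)  # clip assignment
```

The clips are concatenated in gloss order without any smoothing at the joins, which matches the abstract's observation that optimising transitions for visual smoothness did not help.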

2606.12171 2026-06-11 cs.CV cs.LG 新提交

Beyond Dark Knowledge: Mixup-Based Distillation for Reliable Predictions

超越暗知识：基于混合的蒸馏实现可靠预测

José Medina, Paul Honeine, Abdelaziz Bensrhair, Amnir Hadachi

发表机构 * ITS Lab, Institute of Computer Science, University of Tartu（塔尔图大学计算机科学学院ITS实验室）； LITIS, Université de Rouen（鲁昂大学LITIS实验室）； LITIS, INSA de Rouen（鲁昂国立应用科学学院LITIS实验室）

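AI summary: Studies the effect of teacher-student mismatch when knowledge distillation is combined with mixup applied only during student training. The student independently acquires linear structure and improves in both accuracy and calibration, and the authors frame mixup distillation as a richer knowledge-transfer channel.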

AI中文摘要

知识蒸馏（KD）和混合（mixup）已被证明能有效诱导类别边界的平滑性：KD捕捉概率分布中的固有类别关系，而混合通过输入的凸组合强制执行这些关系。然而，它们的相互作用仍未被充分理解，特别是当混合仅在学生训练期间应用时。在这种情况下，教师被查询来自其训练期间从未见过的邻域分布的输入，这是一种受控的不匹配，其对知识转移的影响尚未被表征。我们表明，这种不匹配导致教师的监督信号被分布混淆而非类间结构主导。尽管如此，学生并非仅仅模仿教师：它独立地在邻域区域获得更大的线性度，这是教师缺乏的结构特性，并超越了暗知识转移。与基线相比，带有混合的KD持续提高学生准确率，并将过度自信降低一个数量级，在CIFAR和ImageNet上使用不同容量的教师均如此。关键的是，校准独立于准确率转移从教师传播到学生，温度缩放控制着可测量的准确率-校准权衡，在邻域训练下这种权衡更加明显。这些结果将混合蒸馏重新定义为不是标准KD的退化版本，而是一个更丰富的传递通道，同时塑造判别性能、不确定性估计和表示几何。

英文摘要

Knowledge Distillation (KD) and mixup have proven effective at inducing smoothness in class boundaries: KD captures inherent class relationships in probability distributions, and mixup enforces them through convex combinations of inputs. Their interaction, however, remains poorly understood, particularly when mixup is applied only during student training. In this setting, the teacher is queried on inputs drawn from a vicinal distribution it never saw during training, a controlled mismatch whose effect on knowledge transfer has not been characterised. We show that this mismatch causes the teacher's supervisory signal to be dominated by distributional confusion rather than inter-class structure. Despite this, the student does not merely imitate the teacher: it independently acquires greater linearity in the vicinal region, a structural property that the teacher lacks, and goes beyond dark-knowledge transfer. KD with mixup consistently improves student accuracy and reduces overconfidence by an order of magnitude relative to the baseline, across CIFAR and ImageNet with varying-capacity teachers. Crucially, calibration propagates from teacher to student independently of accuracy transfer, and temperature scaling governs a measurable accuracy-calibration trade-off that becomes more pronounced under vicinal training. These results reframe mixup distillation not as a degraded version of standard KD, but as a richer transfer channel that simultaneously shapes discriminative performance, uncertainty estimation, and representational geometry.
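
The two ingredients named in the abstract, convex combinations of inputs and temperature-softened teacher targets, can be written down compactly. The sketch below is a generic formulation under stated assumptions: the logits are made-up numbers standing in for teacher and student outputs on a mixed input, and the loss omits the temperature-squared factor and the hard-label term that many KD recipes add.

```python
import math


def mixup(x_a, x_b, lam):
    """Convex combination of two inputs with mixing weight lam in [0, 1]."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; larger temperatures give softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# The student is trained on a mixed input the teacher never saw in training.
mixed_input = mixup([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], lam=0.7)

# Hypothetical logits for that mixed input.
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.5, 1.2, 0.3]
loss = kd_loss(teacher_logits, student_logits, temperature=4.0)
```

Raising the temperature flattens both distributions, which is the knob the abstract ties to the accuracy-calibration trade-off.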

2606.12226 2026-06-11 cs.CV eess.IV 新提交

An Electric Potential-Augmented Benchmark Dataset for Physics-Guided Image Reconstruction of Electrical Capacitance Tomography

一种电势增强的基准数据集，用于电容层析成像的物理引导图像重建

Xinqi Zhang, Qiming Ma, Lihui Peng

发表机构 * Department of Automation, Tsinghua University（清华大学自动化系）

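AI summary: Because data-driven methods for electrical capacitance tomography (ECT) ignore the electric potential field, the paper proposes a benchmark dataset that includes potential maps, generates 20,000 samples with a COMSOL-MATLAB pipeline, and shows that the maps improve modeling accuracy and robustness.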

AI中文摘要

虽然深度学习显著推进了电容层析成像（ECT）的图像重建，但大多数数据驱动方法直接映射电容和介电常数分布，将传感器视为黑箱。这忽略了电势场——控制非线性和病态“软场”效应的基本物理联系。为解决此问题，我们提出一个电势增强的ECT基准数据集，旨在将ECT背后的潜在物理显式集成到学习过程中。通过COMSOL-MATLAB管道为八电极传感器生成示例，数据集包含20,000个随机样本，涵盖四种典型流型。关键的是，除了传统的电容向量和以图像形式描绘的介电常数分布外，每个样本还保留了八个激励方向的全场电势图。除了数据发布，我们还提供了ECT正问题和逆问题的说明性评估协议。通过在分布内（IID）和分布外（OOD）场景下的全面测试，我们系统地展示了包含电势图如何增强建模精度和鲁棒性。从根本上说，潜在场信息的显式包含显著降低了将物理定律集成到ECT建模中的障碍，从而为未来ECT图像重建的物理引导机器学习建立了标准化基础。

英文摘要

While deep learning has significantly advanced image reconstruction of Electrical Capacitance Tomography (ECT), most data-driven methods map directly between capacitance and permittivity distribution, treating the sensor as a black box. This overlooks the electric potential field, the fundamental physical link governing the nonlinear and ill-posed "soft-field" effect. To address this, we propose an electric potential-augmented ECT benchmark dataset designed to explicitly integrate latent physics behind ECT into the learning process. Generated via a COMSOL-MATLAB pipeline for an eight-electrode sensor as an example, the dataset comprises 20,000 randomized samples across four typical flow patterns. Crucially, alongside the conventional capacitance vectors and permittivity distributions depicted as images, each sample preserves eight excitation-wise full-field potential maps. Beyond data release, we provide illustrative evaluation protocols for both forward and inverse problems of ECT. Through comprehensive testing on both in-distribution (IID) and out-of-distribution (OOD) scenarios, we systematically demonstrate how the inclusion of electric potential maps enhances modeling accuracy and robustness. Fundamentally, the explicit inclusion of latent field information significantly lowers the barrier to integrating physical laws into ECT modeling, thereby establishing a standardized foundation for future physics-guided machine learning of ECT image reconstruction.

2606.12278 2026-06-11 cs.CV cs.LG 新提交

Finding Sparse Subnetworks in One Training Cycle via Progressive Magnitude-Based Pruning

通过渐进式幅度剪枝在一个训练周期内找到稀疏子网络

Romana Qureshi, Hafida Benhidour, Said Kerrache, Nahlah Aljeraisy

发表机构 * King Abdullah University of Science and Technology（阿卜杜拉国王科技大学）； University of Jeddah（吉达大学）； King Fahd University of Petroleum and Minerals（法赫德国王石油矿产大学）； King Saud University（沙特国王大学）

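AI summary: Evaluates progressive magnitude-based pruning, which increases sparsity linearly within a single training cycle and updates masks from weight magnitudes. The method is compared on CIFAR-10 and MNIST with baselines including LTH, SNIP, and GraSP, and the reported CIFAR-10 accuracies exceed those baselines.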

AI中文摘要

神经网络剪枝通过移除不太重要的参数来减小模型大小，同时旨在保持预测性能。尽管彩票假说（LTH）表明，当从合适的初始化训练时，稀疏子网络可以匹配密集网络，但其迭代剪枝过程需要多个完整的训练周期。本工作评估了渐进式幅度剪枝作为一种单周期替代方案。该方法在训练期间使用线性调度逐渐增加稀疏度，并基于活跃权重幅度更新剪枝掩码。我们在CIFAR-10和MNIST上，针对ResNet、VGG风格和LeNet架构进行了系统实验，将所提方法与代表性的迭代和基于初始化的剪枝基线（包括LTH、SNIP和GraSP）进行比较。在CIFAR-10上，该方法在ResNet-18上以72.9%稀疏度达到95.12%的准确率，而LTH报告为90.5%。在极端稀疏度下，它在VGG类架构上以97%稀疏度达到93.13%的准确率，而SNIP约为92.0%；在VGG-19上以97.97%稀疏度达到93.44%的准确率，而GraSP在98%稀疏度下为92.19%。在ResNet-18上的稀疏度-准确率分析进一步表明，在70-85%稀疏度范围内，准确率保持在密集基线的0.1个百分点以内。这些结果表明，在所评估的设置下，渐进式幅度剪枝为神经网络稀疏化提供了一种有效的单周期方法。

英文摘要

Neural network pruning reduces model size by removing less important parameters while aiming to preserve predictive performance. Although the Lottery Ticket Hypothesis (LTH) shows that sparse subnetworks can match dense networks when trained from suitable initializations, its iterative pruning procedure requires multiple complete training cycles. This work evaluates progressive magnitude-based pruning as a single-cycle alternative. The method gradually increases sparsity during training using a linear schedule and updates pruning masks based on active weight magnitudes. We conduct systematic experiments on CIFAR-10 and MNIST across ResNet, VGG-style, and LeNet architectures, comparing the proposed method with representative iterative and initialization-based pruning baselines, including LTH, SNIP, and GraSP. On CIFAR-10, the method achieves 95.12% accuracy on ResNet-18 at 72.9% sparsity, compared with 90.5% reported for LTH. At extreme sparsity, it achieves 93.13% accuracy on a VGG-like architecture at 97% sparsity, compared with approximately 92.0% for SNIP, and 93.44% accuracy on VGG-19 at 97.97% sparsity, compared with 92.19% for GraSP at 98% sparsity. A sparsity-accuracy analysis on ResNet-18 further shows that accuracy remains within 0.1 percentage points of the dense baseline across 70-85% sparsity. These results indicate that progressive magnitude-based pruning provides an effective single-cycle approach for neural network sparsification under the evaluated settings.
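
The schedule-plus-mask mechanism described above can be sketched on a flat list of weights. This is an illustration under assumptions, not the authors' implementation: the update frequency, the rounding of the pruning count, the toy weights, and the function names are all invented for the example, and the weights are held fixed where a real run would train them between mask updates.

```python
def linear_sparsity(step, total_steps, final_sparsity):
    """Target sparsity at a step, rising linearly from 0 to final_sparsity."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * progress


def magnitude_mask(weights, mask, sparsity):
    """Return a new 0/1 mask that prunes the smallest-magnitude active weights.

    Weights that are already pruned stay pruned; the mask is tightened until
    the number of zeros reaches int(sparsity * len(weights)).
    """
    n_prune_total = int(sparsity * len(weights))
    n_new = max(n_prune_total - mask.count(0), 0)
    active = [i for i, m in enumerate(mask) if m == 1]
    active.sort(key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in active[:n_new]:
        new_mask[i] = 0
    return new_mask


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08]
mask = [1] * len(weights)
total_steps = 4
history = []
for step in range(1, total_steps + 1):
    target = linear_sparsity(step, total_steps, final_sparsity=0.75)
    mask = magnitude_mask(weights, mask, target)
    history.append(mask.count(0))
    # A real training loop would update the active weights here.
```

Only active weights are ranked at each update, so the mask can only tighten over the single training cycle, and the final sparsity is reached at the last scheduled step.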

2606.12295 2026-06-11 cs.CV cs.CL cs.IR 新提交

Findings of the MAGMaR 2026 Shared Task

MAGMaR 2026 共享任务结果

Alexander Martin, Dengjia Zhang, Joel Brogan, Francis Ferraro, Jeremy Gwinnup, Reno Kriz, Teng Long, Kenton Murray, Andrew Yates, Xiang Xiang

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； OpenAI ； University of Maryland, Baltimore County（马里兰大学巴尔的摩县分校）； Air Force Research Laboratory（空军研究实验室）； Human Language Technology Center of Excellence, Johns Hopkins University（约翰霍普金斯大学人类语言技术卓越中心）； University of Amsterdam（阿姆斯特丹大学）； Huazhong University of Science and Technology（华中科技大学）

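AI summary: Reports the findings of the MAGMaR 2026 shared task, which covers video retrieval and generation grounded in retrieved videos; all submitted systems outperformed last year's baseline.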

2506.17137 2026-06-11 cs.CV 版本更新

Towards Conditional Feature Alignment for Cross-Domain Counting

面向跨域计数的条件特征对齐

Zhuonan Liang, Dongnan Liu, Jianan Fan, Yaxuan Song, Qiang Qu, Runnan Chen, Yu Yao, Peng Fu, Weidong Cai

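AI summary: Proposes the Conditional Feature Alignment (CFA) framework, which handles density-distribution shift in cross-domain counting through label-induced conditional alignment rather than global domain invariance, and reports competitive or improved performance on unsupervised domain adaptation and source-domain generalisation tasks.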

Comments: 12 pages, 6 figures, 4 tables

AI中文摘要

目标计数模型在跨域部署时性能往往会下降，因为密度组成在不同域之间变化，并且其本身与任务相关。标准的特征对齐方法倾向于通过鼓励全局域不变性来抑制这种变化，但当源域和目标域包含不同比例的背景、稀疏前景和密集前景时，这可能是有害的。我们提出条件特征对齐（CFA），一种跨域计数框架，它在标签诱导的条件下对齐表示，而不是在整个边缘特征分布上对齐。给定密度标注或伪密度预测，CFA构建前景/背景或密度级别的条件，并仅对齐属于匹配条件的特征。我们通过条件散度视角形式化这一思想，表明条件对齐消除了条件内的差异，同时保留了条件边缘的密度偏移。对于无监督域适应，CFA从标注中估计源域条件，从分离的伪密度图中估计目标域条件，然后执行条件级对抗对齐，并加入全图一致性正则化。对于源域泛化，我们通过MPCount实例化相同原则，在生成的源域视图之间强制执行条件级记忆一致性。在人群和细胞计数基准上的实验表明，在多种UDA和DG设置下，性能具有竞争力或得到提升。例如，在JHU-CROWD++ FH→SN上，CFA-DG将MAE/RMSE从MPCount的216.3/421.4降低到90.5/169.9，表明条件级对齐在大的天气和密度引起的偏移下特别有效。这些结果表明，条件级对齐是域自适应计数的一个有前景的设计原则。

英文摘要

Object counting models often degrade under cross-domain deployment because density composition varies across domains and is itself task-relevant. Standard feature alignment methods tend to suppress such variation by encouraging global domain invariance, which can be harmful when source and target domains contain different proportions of background, sparse foreground, and dense foreground. We propose Conditional Feature Alignment (CFA), a cross-domain counting framework that aligns representations within label-induced conditions rather than across full marginal feature distributions. Given density annotations or pseudo-density predictions, CFA constructs foreground/background or density-level conditions and aligns only features belonging to matching conditions. We formalise this idea through a conditional divergence perspective, showing that conditional alignment removes within-condition discrepancy while preserving condition-marginal density shift. For unsupervised domain adaptation, CFA estimates source conditions from annotations and target conditions from detached pseudo-density maps, then performs condition-wise adversarial alignment with full-image consistency regularisation. For source-domain generalisation, we instantiate the same principle with MPCount by enforcing condition-wise memory-consistency between generated source-domain views. Experiments on crowd and cell counting benchmarks show competitive or improved performance across diverse UDA and DG settings. For example, on JHU-CROWD++ FH$\rightarrow$SN, CFA-DG reduces MAE/RMSE from MPCount's 216.3/421.4 to 90.5/169.9, indicating that condition-wise alignment is especially effective under large weather- and density-induced shifts. These results suggest that condition-wise alignment is a promising design principle for domain-adaptive counting.
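
The difference between aligning within conditions and aligning the full marginal can be seen in a toy calculation. The sketch below replaces the paper's adversarial alignment with a simple difference of per-condition feature means, purely as a stand-in; the scalar features, the density threshold, and all function names are assumptions for illustration.

```python
def density_condition(density, dense_threshold=1.0):
    """Assign a cell to background, sparse foreground, or dense foreground."""
    if density <= 0.0:
        return "background"
    if density < dense_threshold:
        return "sparse"
    return "dense"


def condition_means(features, densities):
    """Average the scalar feature of each cell within its density condition."""
    sums, counts = {}, {}
    for feature, density in zip(features, densities):
        cond = density_condition(density)
        sums[cond] = sums.get(cond, 0.0) + feature
        counts[cond] = counts.get(cond, 0) + 1
    return {cond: sums[cond] / counts[cond] for cond in sums}


def conditional_gap(src_feats, src_density, tgt_feats, tgt_density):
    """Mean absolute gap between source and target means over shared conditions."""
    src = condition_means(src_feats, src_density)
    tgt = condition_means(tgt_feats, tgt_density)
    shared = sorted(set(src) & set(tgt))
    if not shared:
        return 0.0
    return sum(abs(src[c] - tgt[c]) for c in shared) / len(shared)


def mean(values):
    return sum(values) / len(values)


# Source is mostly background; target is mostly dense foreground.
src_feats, src_density = [0.0, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 3.0]
tgt_feats, tgt_density = [0.0, 2.0, 2.0, 2.0], [0.0, 2.0, 4.0, 5.0]

marginal_gap = abs(mean(src_feats) - mean(tgt_feats))
within_condition_gap = conditional_gap(src_feats, src_density, tgt_feats, tgt_density)
```

Here the marginal feature means differ only because the two domains mix the conditions in different proportions, while the per-condition means already agree. A loss that forced the marginals together would have to distort features to hide a density shift that is itself informative for counting.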

2510.06596 2026-06-11 cs.CV cs.AI cs.IT cs.LG 版本更新

SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

SDQM：用于目标检测数据集评估的合成数据质量指标

Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin

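AI summary: Proposes the SDQM metric, which evaluates synthetic data quality without training a model to convergence and correlates strongly with the mAP of YOLO11, whereas earlier metrics show only moderate or weak correlation.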

Comments: Accepted and Published at SPIE: Journal of Electronic Imaging, Vol. 35, Issue 3

AI中文摘要

机器学习模型的性能在很大程度上依赖于训练数据。大规模、良好标注数据集的稀缺给构建鲁棒模型带来了重大挑战。为了解决这一问题，通过模拟和生成模型产生的合成数据已成为一种有前景的解决方案，它增强了数据集的多样性，并提高了模型的性能、可靠性和韧性。然而，评估这些生成数据的质量需要一个有效的指标。我们引入了合成数据集质量指标（SDQM），用于评估目标检测任务的数据质量，而无需模型训练收敛。该指标能够更高效地生成和选择合成数据集，解决了资源受限的目标检测任务中的一个关键挑战。在我们的实验中，SDQM与领先的目标检测模型YOLO11的平均精度均值（mAP）得分表现出强相关性，而先前的指标仅表现出中等或弱相关性。此外，它提供了改进数据集质量的可操作见解，最大限度地减少了昂贵的迭代训练需求。这一可扩展且高效的指标为评估合成数据设立了新标准。SDQM的代码可从此https URL获取。

英文摘要

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at this https URL

2601.22725 2026-06-11 cs.CV cs.AI 版本更新

OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

OpenVTON-Bench：用于可控虚拟试穿评估的大规模高分辨率基准

Jin Li, Tao Chen, Kai Wen, Siqi Yin, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu

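AI summary: Presents OpenVTON-Bench, about 100K high-resolution image pairs built with DINOv3-based clustering and Gemini captioning, together with a multi-modal evaluation protocol that scores try-on quality along five dimensions and agrees closely with human judgments.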

Comments: Under review for the NeurIPS 2026 Datasets and Benchmarks Track

AI中文摘要

近期扩散模型的进展显著提升了虚拟试穿（VTON）系统的视觉保真度，但可靠的评估仍是一个持续的瓶颈。传统指标难以量化细粒度的纹理细节和语义一致性，而现有数据集在规模和多样性上无法满足商业标准。我们提出了OpenVTON-Bench，一个大规模基准，包含约10万对高分辨率图像（最高$1536 \times 1536$）。该数据集使用基于DINOv3的层次聚类进行语义平衡采样，并借助Gemini驱动的密集描述，确保在20个细粒度服装类别上均匀分布。为支持可靠评估，我们提出了一种多模态协议，沿五个可解释维度衡量VTON质量：背景一致性、身份保真度、纹理保真度、形状合理性和整体真实感。该协议将基于VLM的语义推理与基于SAM3分割和形态学腐蚀的新型多尺度表示度量相结合，能够分离边界对齐误差与内部纹理伪影。实验结果表明，该协议与人类判断高度一致（Kendall's $\tau$为0.833，而SSIM为0.611），为VTON评估建立了稳健的基准。

英文摘要

Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
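
For readers unfamiliar with the agreement statistic quoted above, the sketch below computes Kendall's tau between a metric's scores and human ratings. It implements the tau-a variant with no tie correction, which is an assumption since the abstract does not say which variant was used, and the scores and ratings are made-up numbers rather than data from the paper.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical example: four try-on results scored by a metric and by people.
metric_scores = [0.91, 0.40, 0.75, 0.62]
human_ratings = [5, 1, 3, 4]

tau = kendall_tau(metric_scores, human_ratings)
```

A value near 1 means the metric ranks outputs in almost the same order as human raters; the statistic depends only on rankings, so it is unaffected by the scale of either score.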

2606.08415 2026-06-11 cs.CV cs.AI 版本更新

CoVEBench: Can Video Editing Models Handle Complex Instructions?

CoVEBench: 视频编辑模型能处理复杂指令吗？

Jiangtao Wu, Jiaming Wang, Yiwen He, Yuanxing Zhang, Shihao Li, Dunyuan Liu, Xuedong Zhao, Jialu Chen, Zekun Moore Wang, Jiaheng Liu

发表机构 * Nanjing University（南京大学）； Kuaishou Technology（快手科技）

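AI summary: Introduces the CoVEBench benchmark, with 416 source videos and 626 multi-point editing instructions. Instruction compliance and fidelity are judged by an MLLM, and the results show that current models often omit edits or violate preservation constraints in compositional editing.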

Comments: 34 pages, 11 figures, 9 tables

AI中文摘要

虽然近期基于文本引导的视频编辑模型在基础任务（如风格迁移、物体插入）上表现出色，但现实用户请求具有高度组合性。单个提示通常要求多个耦合编辑，例如同时修改主体、动作和相机视角，同时严格保留无关的时空内容。现有基准受限于孤立编辑和粗粒度全局指标，无法诊断模型如何处理此类复杂工作流。为弥补这一空白，我们引入CoVEBench，一个组合视频编辑基准，包含416个精心策划的源视频、626条多点编辑指令和9,990个细粒度检查项。CoVEBench覆盖多样化的编辑维度，通过MLLM评判的指令遵循度和视频保真度，以及视频质量的自动指标来评估模型。大量实验表明，组合编辑仍然是一个深层次的挑战：当前模型在处理多个操作同时进行时，经常遗漏编辑、违反保留约束或引入伪影。CoVEBench为推进视频编辑向现实用户工作流发展提供了一个具有挑战性的诊断测试平台。

英文摘要

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.

2606.11152 2026-06-11 cs.CV 版本更新

P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

P3D-Bench：用于参数化3D生成与结构推理的多模态大语言模型基准

Yikang Yang, Zhanpeng Hu, Youtian Lin, Mengqi Zhou, Jingxi Xu, Feihu Zhang, Jiaheng Liu, Yao Yao

发表机构 * Nanjing University（南京大学）； Envision

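AI summary: Introduces the P3D-Bench benchmark, which uses parametric 3D programs to evaluate multimodal large language models on geometric precision, semantic alignment, and assembly consistency across three task families: Text-to-3D, Image-to-3D, and Assembly-3D.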

Comments: Project page: this https URL

AI中文摘要

多模态大语言模型能够编写代码生成复杂程序，并利用程序进行3D建模，这为基于其先验知识、世界知识和推理能力的3D生成开辟了新途径。然而，现有基准很少通过代码评估3D建模。这种建模不仅需要可运行代码：从文本或视觉规范出发，模型必须生成几何精确、语义对齐且装配一致的参数化3D程序。我们引入P3D-Bench，一个用于参数化3D生成的基准。与3D网格不同，参数化3D程序暴露了显式尺寸、构造操作和零件关系，揭示了模型是否恢复设计结构而不仅仅是外观。在统一协议下，P3D-Bench涵盖三个任务族（文本到3D、图像到3D和装配3D），并对每个输出进行可执行性、几何保真度、拓扑、文本约束、多视图语义对齐和零件级结构的评分。我们在400个文本案例、400个图像案例和203个带注释的装配体上评估了前沿多模态大语言模型和纯文本大语言模型，并以领域特定模型作为参考点。我们的广泛评估得出三个发现。首先，装配是最困难的设置，模型仍然无法将多个零件组合成连贯结构。其次，模型通常能恢复目标对象的整体形状和语义身份，但无法再现输入指定的精确参数化几何。第三，零件级建模在装配上仍然薄弱，模型既不能恢复每个零件的几何形状，也不能恢复正确的零件数量。这些结果使P3D-Bench成为评估参数化3D生成中精确参数化几何和零件级结构的基准。

英文摘要

Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: from a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design's structure, not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output for executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. Our extensive evaluation yields three findings. First, assemblies are the hardest setting, where models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the right number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.

2405.06995 2026-06-11 cs.SD cs.CV cs.MM eess.AS 版本更新

Benchmarking Cross-Domain Audio-Visual Deception Detection

跨域音视频欺骗检测基准测试

Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot

AI总结提出首个跨域音视频欺骗检测基准，评估不同场景下的泛化能力，并设计MM-IDGM算法和Attention-Mixer融合方法提升性能。

Comments: 17 pages

AI中文摘要

自动欺骗检测对于帮助人类准确评估真实性和识别欺骗行为至关重要。传统的接触式技术，如测谎仪，依赖生理信号来确定个体陈述的真实性。然而，自动欺骗检测的最新进展表明，从音频和视频模态中提取的多模态特征在公开数据集上可能优于人类观察者。尽管有这些积极发现，现有音视频欺骗检测方法在不同场景下的泛化能力仍基本未被探索。为弥补这一空白，我们提出了首个跨域音视频欺骗检测基准，使我们能够评估这些方法在现实场景中的泛化能力。我们使用了广泛采用的音频和视觉特征以及不同的架构进行基准测试，比较了单到单和多到单域泛化性能。为了进一步考察使用多个源域数据进行训练的影响，我们研究了三种域采样策略，包括域同步、域交替和逐域采样，用于多到单域泛化评估。我们还提出了一种通过最大化模态编码器之间的梯度内积来增强泛化性能的算法，称为“MM-IDGM”。此外，我们提出了Attention-Mixer融合方法来提高性能，并相信这一新的跨域基准将促进未来音视频欺骗检测的研究。

英文摘要

Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize for use in real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further examine the impact of training on data from multiple source domains, we investigate three domain sampling strategies (domain-simultaneous, domain-alternating, and domain-by-domain) for multi-to-single domain generalization evaluation. We also propose an algorithm, named MM-IDGM, that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.

2507.23534 2026-06-11 cs.LG cs.CV 版本更新

Continual Learning with Support Boundary Experience Blending

支持边界经验混合的持续学习

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

AI总结提出经验混合框架，通过差分隐私启发的噪声生成支持边界数据，联合训练样本和边界数据以正则化决策边界，在多个数据集上提升持续学习准确率。

AI中文摘要

持续学习旨在减轻模型在顺序任务训练时的灾难性遗忘。常见方法经验回放存储过去的样本，但仅稀疏地近似数据分布，导致决策边界脆弱且过于简化。我们通过引入支持边界数据来解决这一限制：向潜在特征注入受差分隐私启发的噪声，生成边界邻近表示，从而隐式正则化决策边界。基于此，我们提出经验混合框架，通过双模型聚合策略联合训练样本和支持边界数据。经验混合有两个组成部分：(1) 潜在空间噪声注入以生成支持边界数据，(2) 联合利用样本和支持边界数据的端到端训练。与标准经验回放不同，支持边界数据丰富了决策边界附近的特征空间，从而实现更稳定和鲁棒的持续学习。在CIFAR-10、CIFAR-100、Tiny ImageNet和ImageNet1K上的大量实验分别取得了10%、6%、13%和2%的一致准确率提升。

英文摘要

Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained on sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data (SBD), generated by injecting differential-privacy-inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to generate support boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K demonstrate consistent accuracy improvements of 10%, 6%, 13%, and 2%, respectively.
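
The latent-space noise injection described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the noise distribution, so Laplace noise (the classic differential-privacy mechanism) is assumed here, and the names `laplace_noise`, `make_support_boundary_data`, `scale`, and `copies` are hypothetical, not taken from the paper.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with location 0 and that scale.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def make_support_boundary_data(latents, labels, scale, copies, seed=0):
    """Perturb stored exemplar latents to obtain noisy neighbours.

    latents: list of feature vectors (lists of floats) for stored exemplars
    labels:  one class label per latent vector
    scale:   assumed Laplace noise scale (hypothetical hyperparameter)
    copies:  number of noisy copies generated per exemplar
    Returns a list of (noisy_latent, label) pairs.
    """
    rng = random.Random(seed)
    out = []
    for z, y in zip(latents, labels):
        for _ in range(copies):
            noisy = [v + laplace_noise(scale, rng) for v in z]
            out.append((noisy, y))
    return out


exemplar_latents = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7]]
exemplar_labels = [0, 1]
sbd = make_support_boundary_data(exemplar_latents, exemplar_labels,
                                 scale=0.1, copies=3)
print(len(sbd))  # 2 exemplars x 3 copies = 6 perturbed samples
```

In a replay setup, the perturbed pairs would presumably be mixed with the original exemplars in each training batch; how the paper's dual-model aggregation uses them is not described in the abstract and is not modelled here.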

2512.11982 2026-06-11 astro-ph.IM cs.AI cs.CV cs.LG 版本更新

Semantic search for 100M+ galaxy images using AI-generated captions

基于AI生成描述的1亿+星系图像语义搜索

Nolan Koblischke, Liam Parker, Francois Lanusse, Jo Bovy, Irina Espejo, Shirley Ho

AI总结提出利用视觉语言模型生成星系图像描述，并对比对齐预训练天文学基础模型，构建可搜索嵌入，实现大规模星系图像的语义搜索，在稀有现象发现上取得最先进性能。

Comments: ApJ, in press

AI中文摘要

通过缓慢的手动标注活动寻找科学上有趣的现象严重限制了我们对望远镜产生的数十亿星系图像的探索能力。在这项工作中，我们开发了一个流水线，从完全未标记的图像数据创建语义搜索引擎。我们的方法利用视觉语言模型（VLM）为星系图像生成描述，然后将预训练的天文学基础模型与这些嵌入的描述进行对比对齐，以产生大规模可搜索的嵌入。我们发现当前的VLM提供的描述信息足够丰富，可以训练一个语义搜索模型，该模型优于直接图像相似性搜索。我们的模型AION-Search在寻找稀有现象方面实现了最先进的零样本性能，尽管训练是在随机选择的图像上进行的，没有针对稀有情况进行刻意策划。此外，我们引入了一种基于VLM的重排序方法，该方法在top-100结果中对我们最具挑战性的目标的召回率几乎翻倍。AION-Search首次实现了对超过1亿张星系图像的灵活语义搜索，使以前不可行的搜索也能带来新发现，包括识别出36个新的河外恒星流候选体。更广泛地说，我们的工作提供了一种方法，使大型、未标记的科学图像档案变得可语义搜索，扩展了从地球观测到显微镜等领域的数据探索能力。代码、数据和应用程序可通过此 https URL 公开获取。

英文摘要

Finding scientifically interesting phenomena through slow manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search for over 100 million galaxy images, enabling discovery from previously infeasible searches, including the identification of 36 new extragalactic stellar stream candidates. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at this https URL
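
The retrieve-then-re-rank flow described in this abstract can be illustrated with a small sketch. It assumes that text queries and images already live in a shared embedding space, that cosine similarity is the retrieval score, and that the re-ranker is an arbitrary scoring callable standing in for a VLM. All names (`cosine`, `search`, `rerank`, the toy embeddings) are hypothetical and not part of AION-Search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_emb, image_embs, k):
    """Return the names of the k images most similar to the query."""
    scored = [(cosine(query_emb, emb), name) for name, emb in image_embs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


def rerank(candidates, score_fn):
    """Re-order retrieved candidates by a second, more expensive scorer."""
    return sorted(candidates, key=score_fn, reverse=True)


# Toy 2-D embeddings; real embeddings would come from the aligned model.
image_embs = {
    "spiral": [1.0, 0.0],
    "merger": [0.0, 1.0],
    "ring": [0.7, 0.7],
}
top2 = search([1.0, 0.1], image_embs, k=2)
print(top2)  # ['spiral', 'ring']

# Stand-in for a VLM judging how well each candidate matches the query.
vlm_scores = {"spiral": 0.2, "ring": 0.9}
print(rerank(top2, vlm_scores.get))  # ['ring', 'spiral']
```

At the scale mentioned in the abstract, an exhaustive scan like this would normally be replaced by an approximate nearest-neighbour index; the sketch only shows the scoring logic.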

2601.04203 2026-06-11 cs.CL cs.CV cs.LG cs.SE 版本更新

FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

FronTalk: 以多模态反馈进行对话式代码生成的前端开发基准测试

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

AI总结提出FronTalk基准，通过多轮对话和多模态反馈（文本与视觉指令）评估前端代码生成，发现模型存在遗忘和视觉反馈理解困难，提出AceCoder方法有效减少遗忘并提升性能。

AI中文摘要

我们提出了FronTalk，一个前端代码生成基准，开创性地研究了一种独特的交互动态：具有多模态反馈的对话式代码生成。在前端开发中，草图、模型和带注释的截图等视觉工件对于传达设计意图至关重要，但它们在多轮代码生成中的作用仍未得到充分探索。为解决这一差距，我们聚焦于前端开发任务，整理了FronTalk，这是一个包含100个多轮对话的数据集，这些对话源自新闻、金融和艺术等不同领域的真实网站。每一轮都包含一个文本指令和一个等效的视觉指令，每个指令代表相同的用户意图。为全面评估模型性能，我们提出了一种新颖的基于智能体的评估框架，利用网络智能体模拟用户并探索网站，从而衡量功能正确性和用户体验。对20个模型的评估揭示了文献中系统性地未充分探索的两个关键挑战：（1）显著的遗忘问题，即模型覆盖先前实现的功能，导致任务失败；（2）解释视觉反馈的持续挑战，尤其是对于开源视觉语言模型（VLM）。我们提出了一个强大的基线来解决遗忘问题，即AceCoder，一种使用自主网络智能体批评每个过去指令实现的方法。这种方法将遗忘几乎减少到零，并将性能提升高达9.3%（从56.0%到65.3%）。总体而言，我们旨在为前端开发和多轮多模态代码生成的通用交互动态的未来研究提供坚实基础。代码和数据已在此https URL发布。

英文摘要

We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework that uses a web agent to simulate users and explore the website, thereby measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are systematically under-explored in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL

2602.02465 2026-06-11 cs.AI cs.CV cs.LG 版本更新

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

MentisOculi: 揭示心智图像推理的局限性

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

AI总结提出MentisOculi基准，通过多步推理问题测试前沿模型利用视觉表示辅助推理的能力，发现视觉策略普遍无法提升性能，且统一多模态模型存在生成错误累积和无法利用真实可视化的问题。

Comments: 9 pages, 8 figures, Accepted at ICML 2026

AI中文摘要

前沿模型正从仅摄入视觉信息的多模态大语言模型（MLLMs）过渡到能够原生交错生成的统一多模态模型（UMMs）。这一转变激发了将中间可视化作为推理辅助的兴趣，类似于人类的心智图像。这一想法的核心是能够以目标导向的方式形成、维护和操作视觉表示。为了评估和探究这一能力，我们开发了MentisOculi，这是一个程序化的、分层的多步推理问题套件，适用于视觉解决方案，旨在挑战前沿模型。评估从潜在令牌到显式生成图像的视觉策略，我们发现它们通常无法提升性能。对UMMs的分析特别揭示了一个关键限制：虽然它们拥有解决任务的文本推理能力，并且有时能生成正确的视觉内容，但它们遭受复合生成错误，并且无法利用甚至真实的可视化。我们的发现表明，尽管视觉思维具有内在吸引力，但尚未有益于模型推理。MentisOculi为分析和弥合不同模型家族之间的这一差距建立了必要的基础。

英文摘要

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

2606.11269 2026-06-11 cs.CV cs.HC 新提交

Traits Run Deeper: Trait-Specific Asymmetric Fusion for Personality Assessment

特质更深：面向人格评估的特质特异性非对称融合

Jia Li, Qian Chen, Wei Wang, Xinyu Li, Zhenzhen Hu, Dongsheng Shao, Richang Hong, Meng Wang

发表机构 * Hefei University of Technology（合肥工业大学）； Intelligent Interconnected Systems Laboratory of Anhui Province（安徽省智能互联系统实验室）； Jianghuai Advanced Technology Center（江淮前沿技术中心）； Anhui Provincial Industry Innovation Center of Humanoid Robots（安徽省人形机器人产业创新中心）； Anhui Provincial Key Laboratory of Humanoid Robots（安徽省人形机器人重点实验室）

AI总结提出Traits Run Deeper框架，通过多模态基础表示、特质特异性非对称融合和分布校准回归模块，解决人格评估中模态偏好差异和标签偏差问题，在AVI Challenge 2026上MSE降低约25%。

AI中文摘要

人格评估旨在从语言、声音和面部线索等动态行为中推断稳定的人格特质。由于不同的人格维度通过不同的行为视角展现，建模特质特异性证据具有挑战性。然而，现有大多数方法对所有维度采用统一的多模态融合策略，假设模态贡献相同。这忽略了特质特异性的模态偏好，并引入了跨模态干扰。为解决这一问题，我们提出了一种新颖的人格评估框架，称为Traits Run Deeper，由三个组件组成。具体而言，多模态基础表示（MFR）模块构建面向人格的多模态输入，并利用心理学启发的语义模板作为锚点，使基础模型能够捕获特质相关信息。基于MFR，特质特异性模态融合（TSMF）模块作为一种非对称融合机制，允许每个维度从模态特定建模到互补融合中，选择性地利用不同的模态路径。因此，TSMF捕获了异质的模态偏好，同时减少了跨模态污染。此外，分布校准人格回归（DCPR）模块通过目标分布校准减轻了标签不平衡和中心趋势偏差，提高了鲁棒性和稳定性。在AVI Challenge 2026验证集上的实验结果表明了所提出框架的有效性，与基线相比，均方误差（MSE）降低了约25%。在官方测试集上观察到一致的改进，我们的方法取得了最佳性能，并在人格评估赛道中排名第一。源代码将在此https URL提供。

英文摘要

Personality assessment aims to infer stable personality traits from dynamic behaviors across language, voice, and facial cues. Since different personality dimensions are revealed through distinct behavioral perspectives, modeling trait-specific evidence is challenging. However, most existing approaches adopt a uniform multimodal fusion strategy across all dimensions, assuming identical modality contributions. This overlooks trait-specific modality preferences and introduces cross-modal interference. To address this issue, we propose a novel personality assessment framework called Traits Run Deeper, which consists of three components. Specifically, the Multimodal Foundation Representation (MFR) module constructs personality-oriented multimodal inputs and leverages psychology-informed semantic templates as anchors, enabling foundation models to capture trait-relevant information. Building upon MFR, the Trait-Specific Modality Fusion (TSMF) module acts as an asymmetric fusion mechanism, allowing each dimension to selectively exploit different modality pathways from modality-specific modeling to complementary fusion. Thus, TSMF captures heterogeneous modality preferences while reducing cross-modal contamination. Furthermore, the Distribution-Calibrated Personality Regression (DCPR) module mitigates label imbalance and central tendency bias through target distribution calibration, improving robustness and stability. Experimental results on the AVI Challenge 2026 validation set demonstrate the effectiveness of the proposed framework, reducing mean squared error (MSE) by approximately 25% compared with the baseline. Consistent improvements are observed on the official test set, where our method achieves the best performance and ranks first in the Personality Assessment Track. The source code will be made available at this https URL.

2606.12218 2026-06-11 cs.CV cs.AI 新提交

Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model

为食物-水关系调整Prithvi-EO用于休耕地检测：地理空间基础模型的ViT-Adapter颈部与参数高效骨干微调

Sk Muhammad Asif, Orhun Aydin

发表机构 * Earth, Atmospheric and Geospatial Science, Saint Louis University（圣路易斯大学地球、大气与地理空间科学系）

AI总结针对休耕地检测中多尺度特征需求与基础模型单尺度ViT骨干不匹配的问题，提出结合LoRA和混合PEFT的两种参数高效微调方案与三种颈部设计，其中Lite ViT-Adapter配合单阶段检测头在mAP@50上达到0.9479，优于无适配器方法25.70%。

Comments: 10 pages, 6 figures. Preprint. Submitted to ACM SIGSPATIAL 2026

AI中文摘要

理解休耕地的空间分布对于优化食物-水关系至关重要，因为休耕在作物轮作和水资源保护中发挥着作用。休耕是美国农业部作物数据层中的一个低精度类别。地理空间基础模型Prithvi-EO在计算机视觉任务中展现出强大的迁移能力。然而，其视觉Transformer骨干在单一空间尺度上生成特征，不适合目标检测头所需的多尺度特征。现有方法通过缩放单步长令牌来合成多尺度金字塔，牺牲了空间异质性，而全骨干微调对于地理空间基础模型来说计算成本过高。我们评估了一个结合两种参数高效微调方案的休耕地检测流程：低秩适应和混合PEFT，以及三种颈部设计：伪多尺度、Lite ViT-Adapter和Full ViT-Adapter。我们最佳配置，即带有单阶段检测头的Lite ViT-Adapter，在DIoU损失下实现了0.9479的mAP@50，表明中心感知定位对于不规则休耕地检测的有效性。在LoRA下，无ViT-Adapter的单阶段检测比无适配器的基于锚点的方法提高了6.42%，而最佳配置比基线无适配器的基于锚点的方法提高了25.70%。这些结果表明，轻量级空间先验融合和选择性骨干解冻使Prithvi-EO能够更有效地捕捉局部休耕模式，优于依赖重塑单步长ViT令牌的方法。

英文摘要

Understanding the spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing's role in crop rotation and water conservation. Fallow is a low-accuracy class in the USDA Cropland Data Layer (CDL). The geospatial foundation model (GFM) Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer (ViT) backbone produces features at a single spatial scale, which are ill-suited to the multi-scale features required by object detection heads. Existing approaches synthesise multi-scale pyramids by rescaling single-stride tokens, sacrificing spatial heterogeneity, and full backbone fine-tuning is computationally prohibitive for GFMs. We evaluate a fallow detection pipeline combining two parameter-efficient fine-tuning (PEFT) schemes, Low-Rank Adaptation (LoRA) and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage head, achieves an mAP@50 of 0.9479 with the DIoU loss, suggesting the effectiveness of center-aware localization for irregular fallow field detection. ViT-Adapter-free one-stage detection under LoRA improves on the adapter-free anchor-based approach by 6.42%, and the best configuration improves on the baseline adapter-free anchor-based approach by 25.70%. These results demonstrate that lightweight spatial prior fusion and selective backbone unfreezing enable Prithvi-EO to capture local fallow patterns more effectively, outperforming approaches that rely on reshaped single-stride ViT tokens.
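
The "center-aware localization" credited to the loss above can be made concrete with a short sketch. It assumes the loss in question is the standard Distance-IoU (DIoU) loss, which adds to `1 - IoU` a penalty equal to the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. The function name and the `(x1, y1, x2, y2)` box format are choices made for this illustration, not details from the paper.

```python
def diou_loss(box_a, box_b):
    """Distance-IoU loss for two axis-aligned boxes (x1, y1, x2, y2).

    loss = 1 - IoU + d^2 / c^2, where d is the distance between the box
    centers and c is the diagonal of the smallest box enclosing both.
    Both boxes are assumed to have positive width and height.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)

    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest enclosing box.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2

    return 1.0 - iou + d2 / c2


print(diou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes give 0.0
print(diou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes give more than 1
```

Unlike plain IoU loss, the center-distance term still provides a gradient signal when the predicted and target boxes do not overlap, which is one plausible reason it suits irregularly shaped fields.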

2606.12248 2026-06-11 cs.CV 新提交

Damage-TriageFormer: A Foundation-Model Framework for Typology-Based Building Damage Assessment from Mono-Temporal Imagery

Damage-TriageFormer：基于类型学的单时相影像建筑损伤评估的基础模型框架

Yiming Xiao, Yu-Hsuan Ho, Sanjay Thasma, Junwei Ma, Ali Mostafavi

发表机构 * Texas A&M University（德克萨斯A&M大学）； Resilitix Intelligence LLC ； Institute for a Disaster Resilient Texas（德克萨斯灾害韧性研究所）

AI总结提出Damage-TriageFormer，一种基于单张灾后影像的建筑损伤类型学评估模型，通过扩展DINOv3 ViT-L骨干网络和两阶段门控损伤头，在三个灾害数据集上实现了宏观F1约0.62，无需灾前影像即可支持应急响应。

AI中文摘要

决策相关的建筑损伤评估对于灾后资源优先分配和恢复至关重要，但大多数自动化方法要么将损伤扁平化为单一严重程度等级（无损伤、轻微、严重、摧毁），要么需要成对的灾前和灾后影像，而这对于突发灾害通常不可用。本文提出了Damage-TriageFormer，一种基于单张灾后影像、足迹条件化的模型，它生成损伤类型学而非严重程度等级。我们的贡献包括：（1）DamageTriage-Bench，一个基于NOAA应急响应影像（涵盖2018年迈克尔飓风、2024年海伦飓风和2025年洛杉矶野火复合灾害）构建的新基准，包含五个类型学类别，区分屋顶损伤和结构损伤，并在每个类别内区分部分和全部范围；（2）Damage-TriageFormer，它扩展了DINOv3 ViT-L骨干网络，结合简单特征金字塔进行更高分辨率的实例池化、两阶段门控损伤头以及辅助严重程度回归目标。我们的模型在验证集上达到宏观F1为0.624，在保留的分层测试集上为0.619，在运营分类最需要的地方表现最强，无损伤建筑和完全结构倒塌的每类F1分别为0.91和0.84。尽管罕见的完全屋顶损伤类别由于样本有限和固有的模糊标签边界仍然困难，但我们的结果表明，单张灾后影像可以支持可操作的建筑损伤分类，无需灾前参考即可实现有针对性的应急响应和资源分配。

英文摘要

Decision-relevant building damage assessment is critical for prioritizing resources and recovery after a disaster, yet most automated methods either flatten damage into a single severity scale (no damage, minor, major, destroyed) or require paired pre- and post-event imagery that is often unavailable for emerging hazards. This paper presents Damage-TriageFormer, a single-image, post-event, footprint-conditioned model that produces a damage typology rather than a severity scale. We contribute: (1) DamageTriage-Bench, a new benchmark built from NOAA Emergency Response Imagery across Hurricane Michael (2018), Hurricane Helene (2024), and the 2025 Los Angeles wildfire complex, with five typology classes that distinguish roof damage from structural damage and, within each, partial from total extent; and (2) Damage-TriageFormer, which extends a DINOv3 ViT-L backbone with a Simple Feature Pyramid for higher-resolution instance pooling, a two-stage gated damage head, and an auxiliary severity-regression objective. Our model achieves macro F1 of 0.624 on validation and 0.619 on a held-out stratified test set, performing strongest where operational triage needs it most, with per-class F1 of 0.91 and 0.84 on undamaged buildings and total structural collapse, respectively. While the rare Total Roof Damage class remains difficult due to its limited examples and an inherently ambiguous label boundary, our results show that single-image post-event imagery can support actionable building damage typing, enabling targeted emergency response and resource allocation without a pre-event reference.

2606.12316 2026-06-11 cs.CV 新提交

Slots, Transitions, Loops: Learning Composable World Models for ARC

槽、转换、循环：学习可组合的ARC世界模型

Gege Gao, Bernhard Schölkopf, Andreas Geiger

发表机构 * University of Tübingen（图宾根大学）； ETH Zürich（苏黎世联邦理工学院）

AI总结提出Loop-OWM架构，通过颜色原型槽、演示条件任务摘要和循环转换模型，学习ARC任务中的视觉符号规则，在ARC-1和ARC-2上超越基线。

2606.12340 2026-06-11 cs.CV 新提交

Echoes of the Prior: A Computational Phenomenology of Forgetting

先验的回响：遗忘的计算现象学

Gege Gao, Bernhard Schölkopf, Andreas Geiger

发表机构 * Eberhard Karl University of Tübingen（蒂宾根大学）； Max Planck Institute for Intelligent Systems（马克斯·普朗克智能系统研究所）； Tübingen AI Center（蒂宾根人工智能中心）

AI总结通过在前馈3D重建模型中诱导突触衰减，可视化遗忘的主观现象学，将神经网络作为认知代理探索神经形态美学。

2606.11236 2026-06-11 cs.NE cs.CV cs.LG 交叉投稿

A2SG: Adaptive and Asymmetric Surrogate Gradients for Training Deep Spiking Neural Networks

A2SG：用于训练深度脉冲神经网络的适应性和非对称替代梯度

Yechan Kang, Yongjin Kweon, Mingyeong Seo, Sohee Park, Yeonguk Jeon, Jongkil Park, Hyun Jae Jang, Jaewook Kim, YeonJoo Jeong, Suyoun Lee, Seongsik Park

AI总结提出适应性和非对称替代梯度（A2SG）框架，通过自适应窗口调整梯度方向一致性、非对称梯度反映神经元动态，降低梯度变化并促进收敛到平坦最小值，在多种SNN模型和任务上提升精度与能效。

Comments: Accepted at ICML 2026

AI中文摘要

由于替代梯度导致的尖锐损失景观和时间不一致性，训练深度脉冲神经网络（SNN）仍然具有挑战性。为了解决这些问题，我们提出了一个统一框架：适应性和非对称替代梯度A2SG。适应性梯度调整一个有效窗口以实现时空适应，减少空间梯度变化并保持梯度随时间的方向一致性。非对称梯度通过为具有更高膜电位的神经元分配更大的梯度来反映神经元动态，并且我们证明它们比对称替代梯度产生更低的方差。我们的分析进一步建立了局部梯度变化与损失景观曲率之间的直接联系，为A2SG如何促进收敛到更平坦的最小值并改善泛化提供了原理性解释。我们在多种模型上进行了广泛实验，包括基于CNN和基于Transformer的SNN，涉及各种任务，如使用静态和神经形态数据集的图像分类以及分割。结果表明，A2SG持续提高了准确性和能效，使其成为训练深度SNN的通用且可靠的解决方案。我们的代码可在以下网址获取：此 https URL。

英文摘要

Training deep spiking neural networks (SNNs) remains challenging due to sharp loss landscapes and temporal inconsistency caused by surrogate gradients. To address these challenges, we propose a unified framework: adaptive and asymmetric surrogate gradients (A2SG). The adaptive gradients adjust an effective window for spatio-temporal adaptation, reducing spatial gradient variation and maintaining directional consistency of gradients over time. The asymmetric gradients reflect neuronal dynamics by assigning larger gradients to neurons with higher membrane potentials, and we prove that they yield lower variation than symmetric surrogates. Our analysis further establishes a direct connection between local gradient variation and the curvature of the loss landscape, providing a principled explanation for how A2SG promotes convergence to flatter minima and improves generalization. We conduct extensive experiments on diverse models, including CNN-based and Transformer-based SNNs, across various tasks such as image classification using both static and neuromorphic datasets, as well as segmentation. The results demonstrate that A2SG consistently improves accuracy and energy efficiency, establishing it as a general and reliable solution for training deep SNNs. Our code is available at this https URL.

2606.11930 2026-06-11 cs.HC cs.AI cs.CV 交叉投稿

Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews

冻结多模态嵌入用于异步视频面试中的个性与认知能力评估

Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

AI总结针对异步视频面试中标注数据有限的高维多模态学习问题，提出使用冻结多模态编码器（CLIP、Whisper、RoBERTa等）结合低容量下游模型，在个性预测任务上实现MSE降低19.1%，并发现认知能力预测中存在数据集捷径。

Comments: 9 pages, 1 figure, 4 tables

AI中文摘要

从异步视频面试（AVI）中预测心理特质是一个具有挑战性的多模态学习问题，因为标注数据集有限，而每个回答包含高维的视觉、声学和语言信号。本文介绍了我们针对ACM多媒体AVI挑战2026的解决方案，该挑战评估两个任务：Track 1从与个性相关的面试回答中预测自我报告的HEXACO个性特质，Track 2从结构化AVI回答中对认知能力水平进行分类。我们将该问题视为小样本表示学习任务。我们不微调大型预训练模型，而是使用冻结的多模态编码器，包括用于视觉特征的CLIP、用于声学特征和转录的Whisper，以及用于文本表示的RoBERTa、E5和DeBERTaV3，随后使用低容量下游模型。对于Track 1，我们的特质特定回归和晚期融合系统实现了平均验证MSE为0.2696，优于官方基线0.3334。消融结果显示，从全局模型（0.3189）到逐特质建模（0.2871）再到逐特质晚期融合（0.2696）的三步改进，相对于官方基线MSE相对降低了19.1%。对于Track 2，一个紧凑的主题属性基线达到了0.5781的准确率，而我们的多模态集成达到了0.5313，两者均高于官方基线0.4062。我们将这一结果解释为验证分割中可能存在主题属性捷径的证据，而非从AVI内容中进行的稳健认知推理。总体而言，我们的发现表明，基于AVI的心理评估受益于特质特定的多模态建模，但认知能力预测需要仔细控制数据集捷径。

英文摘要

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging multimodal learning problem because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
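
The relative MSE reduction quoted in this abstract follows directly from the reported numbers. The short check below uses only figures stated in the abstract; the helper name `relative_reduction` is made up for this illustration.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


OFFICIAL_BASELINE_MSE = 0.3334

# Validation MSE after each ablation step, as reported in the abstract.
ablation_steps = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}

for name, mse in ablation_steps.items():
    pct = 100 * relative_reduction(OFFICIAL_BASELINE_MSE, mse)
    print(f"{name}: {pct:.1f}% lower MSE than the official baseline")
# global model: 4.3%, per-trait modeling: 13.9%, per-trait late fusion: 19.1%
```

Most of the gain therefore comes from moving to per-trait models (about 9.6 points of the 19.1%), with late fusion contributing roughly another 5 points.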

2502.14894 2026-06-11 cs.CV cs.AI cs.CY cs.LG 版本更新

FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping

聚焦污染：基于水文信息与噪声感知的地理空间PFAS测绘学习

Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly

AI总结提出FOCUS框架，结合稀疏PFAS观测与水文连通性等环境先验，通过噪声感知损失实现鲁棒训练，在PFAS污染测绘中优于传统方法。

Comments: Best Paper Award at ICLR 2026 Machine Learning for Remote Sensing Workshop

AI中文摘要

全氟和多氟烷基物质（PFAS）是持久性环境污染物，对公共健康有显著影响，但由于现场采样的高成本和后勤挑战，大规模监测仍然严重受限。样本的缺乏导致难以用物理模型模拟其扩散，并且对PFAS在地表水中传输的科学理解有限。然而，描述土地覆盖、水文和工业活动的丰富地理空间和卫星衍生数据广泛可用。我们提出了FOCUS，一个用于PFAS污染测绘的地理空间深度学习框架，该框架将稀疏的PFAS观测与大规模环境背景（包括来自水文连通性、土地覆盖、污染源邻近性和采样距离的先验）相结合。这些先验被整合到一个原则性的、噪声感知的损失函数中，从而在稀疏标签下产生稳健的训练目标。通过广泛的消融实验、鲁棒性分析和实际验证，FOCUS始终优于包括稀疏分割、克里金法和污染物传输模拟在内的基线方法，同时在大区域上保持了空间一致性和可扩展性。我们的结果展示了AI如何通过提供筛查级风险图来支持环境科学，这些风险图可优先安排后续采样，并在缺乏完整物理模型的情况下帮助将潜在污染源与地表水污染模式联系起来。

英文摘要

Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants with significant public health impacts, yet large-scale monitoring remains severely limited due to the high cost and logistical challenges of field sampling. The lack of samples leads to difficulty simulating their spread with physical models and limited scientific understanding of PFAS transport in surface waters. Yet, rich geospatial and satellite-derived data describing land cover, hydrology, and industrial activity are widely available. We introduce FOCUS, a geospatial deep learning framework for PFAS contamination mapping that integrates sparse PFAS observations with large-scale environmental context, including priors derived from hydrological connectivity, land cover, source proximity, and sampling distance. These priors are integrated into a principled, noise-aware loss, yielding a robust training objective under sparse labels. Across extensive ablations, robustness analyses, and real-world validation, FOCUS consistently outperforms baselines including sparse segmentation, Kriging, and pollutant transport simulations, while preserving spatial coherence and scalability over large regions. Our results demonstrate how AI can support environmental science by providing screening-level risk maps that prioritize follow-up sampling and help connect potential sources to surface-water contamination patterns in the absence of complete physical models.

2602.03282 2026-06-11 cs.CV cs.AI 版本更新

Global Geometry Is Not Enough for Vision Representations

全局几何不足以用于视觉表示

Jiwan Chung, Seon Joo Kim

AI总结本文通过实验发现全局嵌入几何与组合绑定能力几乎无关，而输入-输出雅可比矩阵衡量的功能敏感性可靠地追踪该能力，并分析指出这是由于现有损失函数显式约束嵌入几何但未约束局部输入-输出映射所致。

AI中文摘要

表示学习中的一个常见假设是，全局分布良好的嵌入支持鲁棒且可泛化的表示。这一关注点塑造了训练目标和评估协议，隐含地将全局几何视为表示能力的代理。虽然全局几何有效地编码了哪些元素存在，但它通常对元素如何组合不敏感。我们通过测试几何度量预测跨多种视觉编码器的组合绑定的能力来研究这一局限性。我们发现，基于标准几何的统计量与组合绑定几乎无相关性。相比之下，由输入-输出雅可比矩阵衡量的功能敏感性可靠地追踪这一能力。我们进一步提供了分析性解释，表明这种差异源于目标设计，因为现有损失显式约束嵌入几何，但未约束局部输入-输出映射。这些结果表明，全局嵌入几何仅捕捉了表示能力的部分视图，并将功能敏感性确立为建模复合结构的关键补充轴。

英文摘要

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across a diverse suite of vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input-output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input-output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.
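
To make "functional sensitivity, as measured by the input-output Jacobian" more tangible, here is a minimal sketch that estimates a Jacobian by central finite differences and summarizes it with the Frobenius norm. The abstract does not say which Jacobian statistic the authors use, so the Frobenius norm, the finite-difference estimator, and the toy `encoder` are all assumptions made for illustration.

```python
import math


def jacobian_fd(f, x, eps=1e-6):
    """Central-difference estimate of J[i][j] = d f_i / d x_j at x."""
    n_out = len(f(x))
    n_in = len(x)
    jac = [[0.0] * n_in for _ in range(n_out)]
    for j in range(n_in):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def encoder(x):
    """Toy stand-in for a vision encoder: a fixed linear map R^2 -> R^2."""
    return [2.0 * x[0] + 1.0 * x[1], 3.0 * x[1]]


J = jacobian_fd(encoder, [0.5, -1.0])
print(frobenius_norm(J))  # close to sqrt(2^2 + 1^2 + 3^2) = sqrt(14)
```

For real encoders the Jacobian would be obtained with automatic differentiation rather than finite differences; the point here is only that the quantity describes how outputs respond to local input changes, which embedding-space statistics alone do not capture.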

2510.17816 2026-06-11 eess.SP cs.CV 版本更新

Cross-Domain Multi-Person Human Activity Recognition via Near-Field Wi-Fi Sensing

基于近场Wi-Fi感知的跨域多人人体活动识别

Xin Li, Jingzhi Hu, Yinghui He, Hongbo Wang, Jin Gan, Jun Luo

AI总结针对Wi-Fi多人活动识别中跨域适应难题，提出WiAnchor框架，通过预训练扩大类间特征间隔、微调阶段引入锚点匹配机制过滤个体干扰，实现缺失类别下的高效跨域识别，准确率超90%。

AI中文摘要

基于Wi-Fi的人体活动识别（HAR）提供了极大的便利，并已成为一个蓬勃发展的研究领域，然而Wi-Fi固有的粗空间分辨率严重阻碍了其区分多个目标的能力。通过利用近场主导效应，为每个目标通过其个人Wi-Fi设备建立专用传感链路，为原生流量下的多人HAR提供了一种有前景的解决方案。然而，由于近场信号的目标特定特性和不规则模式，HAR神经网络模型需要微调（FT）以实现跨域适应，这在某些类别不可用时变得特别具有挑战性。在本文中，我们提出WiAnchor，一种新颖的训练框架，用于在活动类别不完整的情况下实现高效的跨域适应。该框架通过三个步骤处理嵌入不规则时间信息的Wi-Fi信号：在预训练期间，我们扩大类间特征间隔以增强活动的可分离性；在微调阶段，我们创新性地引入一种锚点匹配机制用于跨域适应，根据不完整的活动类别过滤目标特定干扰，而不是试图从中提取完整特征；最后，基于输入样本与锚点的特征级相似性进一步改进识别。我们构建了一个全面的数据集来彻底评估WiAnchor，在缺失活动类别的情况下实现了超过90%的跨域准确率。

英文摘要

Wi-Fi-based human activity recognition (HAR) provides substantial convenience and has emerged as a thriving research field, yet the coarse spatial resolution inherent to Wi-Fi significantly hinders its ability to distinguish multiple subjects. By exploiting the near-field domination effect, establishing a dedicated sensing link for each subject through their personal Wi-Fi device offers a promising solution for multi-person HAR under native traffic. However, due to the subject-specific characteristics and irregular patterns of near-field signals, HAR neural network models require fine-tuning (FT) for cross-domain adaptation, which becomes particularly challenging when certain activity categories are unavailable. In this paper, we propose WiAnchor, a novel training framework for efficient cross-domain adaptation in the presence of incomplete activity categories. This framework processes Wi-Fi signals embedded with irregular time information in three steps: during pre-training, we enlarge inter-class feature margins to enhance the separability of activities; in the FT stage, we introduce an anchor matching mechanism for cross-domain adaptation, filtering subject-specific interference informed by incomplete activity categories, rather than attempting to extract complete features from them; finally, the recognition of input samples is further improved based on their feature-level similarity with anchors. We construct a comprehensive dataset to thoroughly evaluate WiAnchor, achieving over 90% cross-domain accuracy with absent activity categories.

2605.29292 2026-06-11 cs.CV

Turbulence-Robust Dynamic Object Segmentation with Multi-Signal Priors and SAM2 Refinement

Bolian Peng, Ying Tang, Xu Liu, Long Sun, Xiaoqiang Lu

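AI summary: Proposes a training-free multi-signal segmentation pipeline that combines RAFT motion estimation, DINOv2 semantic priors, ViBe background modeling, and SAM2 mask refinement to address dynamic object segmentation under atmospheric turbulence.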

Journal ref: Proceedings of the CVPR 2026 Workshops, UG2+ Challenge, 2026

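AI abstract (translated from Chinese)

This technical report presents our solution for the CVPR 2026 UG2+ Challenge Track 3: Dynamic Object Segmentation in Turbulence (DOST). We design a training-free multi-signal segmentation pipeline that combines pretrained motion estimation, self-supervised semantic priors, background anomaly modeling, manually calibrated proposal fusion, and SAM2-based mask refinement. The method uses RAFT to obtain dense motion responses, DINOv2 to obtain semantic objectness priors, ViBe for training-free background modeling, and pretrained SAM2 for box-prompted mask refinement. Our system runs entirely in inference mode rather than optimizing an end-to-end segmentation network. This design suits the DOST setting, where severe atmospheric turbulence produces pseudo-motion, blur, and intermittent target visibility, making any single motion cue unreliable. The final submitted masks were evaluated by the official leaderboard, which reports 0.425041 mIoU and 0.457206 mDice. Because no task-specific model training or fine-tuning was performed, stronger learned temporal association, adaptive proposal selection, or task-specific adaptation may further improve the system.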

English abstract

This technical report presents our solution for the CVPR 2026 UG2+ Challenge Track 3: Dynamic Object Segmentation in Turbulence (DOST). We design a training-free multi-signal segmentation pipeline that combines pretrained motion estimation, self-supervised semantic priors, background anomaly modeling, manually calibrated proposal fusion, and SAM2-based mask refinement. The method uses RAFT for dense motion responses, DINOv2 for semantic objectness priors, ViBe for training-free background modeling, and pretrained SAM2 for box-prompt mask refinement. Instead of optimizing an end-to-end segmentation network, our system operates entirely in inference mode. This design is suitable for the DOST setting, where severe atmospheric turbulence produces pseudo-motion, blur, and intermittent target visibility, making a single motion cue unreliable. The final submitted masks are evaluated by the official leaderboard, which reports 0.425041 mIoU and 0.457206 mDice. Since no task-specific model training or fine-tuning is performed, stronger learned temporal association, adaptive proposal selection, or task-specific adaptation may further improve the system.
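The two leaderboard metrics quoted above, mIoU and mDice, are averages of per-mask overlap scores. The sketch below shows the standard per-mask definitions for binary masks; the function name and the flat-list mask representation are illustrative assumptions, and the challenge's exact averaging protocol is not stated in the abstract.

```python
def iou_and_dice(pred, target):
    """Compute IoU and Dice for two binary masks given as flat 0/1 sequences.

    IoU  = |P ∩ T| / |P ∪ T|
    Dice = 2 |P ∩ T| / (|P| + |T|)
    Two empty masks are scored as a perfect match (an assumed convention).
    """
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred_area + target_area) if (pred_area + target_area) else 1.0
    return iou, dice

# Toy 4-pixel masks: one pixel overlaps, three pixels are in the union.
iou, dice = iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
print(iou, dice)
```

For a single mask the two scores are related by Dice = 2·IoU / (1 + IoU), but this identity does not generally hold between the averaged values.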

2605.17773 2026-06-11 cs.CV

PlantPose: Universal Plant Skeleton Estimation via Tree-constrained Graph Generation

Xinpeng Liu, Hiroaki Santo, Yosuke Toda, Fumio Okura

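AI summary: This paper proposes PlantPose, a universal plant skeleton estimator based on tree-constrained graph generation. It combines learning-based graph generation with traditional graph algorithms to improve generalization, and achieves robust and accurate plant skeleton estimation across multiple domains.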

DOI: 10.1007/s11263-026-02882-4
Comments: International Journal of Computer Vision, 2026

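AI abstract (translated from Chinese)

Accurately estimating plant skeletal structures (e.g., branching structures) from images is essential for smart agriculture and plant science. Unlike human skeletons, which have a fixed topology, plant skeleton estimation poses a unique challenge: estimating arbitrary tree graphs from images. To address this problem, we introduce PlantPose, a universal plant skeleton estimator based on tree-constrained graph generation. PlantPose combines learning-based graph generation with traditional graph algorithms to enforce tree constraints within the training loop. To improve the model's generalization, we curate a large and diverse dataset containing real-world and synthetic plant images as well as simplified representations (e.g., sketches and abstract drawings). This dataset enables the generalized model to adapt to diverse input styles and categories of plant images while preserving topological consistency. Our approach achieves robust and accurate plant skeleton estimation across multiple domains, including previously unseen out-of-domain scenarios. Further analyses highlight the method's strengths and limitations in handling complex, heterogeneous data distributions. All implementations and datasets are available at https://github.com/huntorochi/PlantPose/.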

English abstract

Accurate estimation of plant skeletal structures (e.g., branching structures) from images is essential for smart agriculture and plant science. Unlike human skeletons with fixed topology, plant skeleton estimation presents a unique challenge, i.e., estimating arbitrary tree graphs from images. To address this problem, we introduce PlantPose, a universal plant skeleton estimator via tree-constrained graph generation. PlantPose combines learning-based graph generation with traditional graph algorithms to enforce tree constraints during the training loop. To enhance the model's generalization capability, we curate a large and diverse dataset comprising real-world and synthetic plant images, along with simplified representations (e.g., sketches and abstract drawings). This dataset enables the generalized model to adapt to diverse input styles and categories of plant images while preserving topological consistency. Our approach demonstrates robust and accurate plant skeleton estimation across multiple domains, including previously unseen out-of-domain scenarios. Further analyses highlight the method's strengths and limitations in handling complex, heterogeneous data distributions. All implementations and datasets are available at https://github.com/huntorochi/PlantPose/.
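One classical way to turn scored candidate edges into a valid tree is a maximum spanning tree. The sketch below uses Kruskal's algorithm with a union-find structure as a generic illustration of a "traditional graph algorithm" that enforces a tree constraint. It is an assumption made for illustration: the abstract does not say which algorithm PlantPose uses, and the edge scores here are made up.

```python
def max_spanning_tree(num_nodes, scored_edges):
    """Kruskal's algorithm: keep the highest-scoring edges that form no cycle.

    `scored_edges` is a list of (score, u, v) tuples, where the score stands
    in for a hypothetical model confidence that nodes u and v are connected.
    Returns the selected edges as (u, v) pairs.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Find the component root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for score, u, v in sorted(scored_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge cannot create a cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

# Four keypoints with illustrative edge scores; edge (0, 2) would close a cycle.
edges = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 0, 2), (0.6, 2, 3)]
tree = max_spanning_tree(4, edges)
print(tree)
```

A connected graph with n nodes always yields exactly n - 1 edges here, which is the defining property of a tree.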

2604.01383 2026-06-11 cs.CV cs.AI

GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization

Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10087-10095, June 2026
Comments: 9 pages, 5 figures, accepted to the CVPR 2026 Workshop on Computer Vision in Sports (CVSports); code: https://github.com/AhsanZaidi12/GRAZE

English abstract

American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within $\pm$ 10 frames on 77.5% of all clips and within $\pm$ 20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
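The tolerance-based figures reported above (localization within ±10 or ±20 frames, measured over all clips) can be computed as in the following sketch. The function name and the toy frame indices are illustrative assumptions. Clips with no valid output are represented as `None` and counted as misses, which matches the abstract's "of all clips" wording.

```python
def within_tolerance_rate(predicted, truth, tolerance):
    """Fraction of all clips whose predicted frame is within `tolerance` frames.

    `predicted` may contain None for clips where no valid output was produced;
    those clips stay in the denominator and count as misses.
    """
    hits = sum(
        1
        for p, t in zip(predicted, truth)
        if p is not None and abs(p - t) <= tolerance
    )
    return hits / len(truth)

# Toy frame indices for four clips (illustrative values only).
predicted = [100, None, 205, 330]
truth = [105, 50, 220, 331]
rate_10 = within_tolerance_rate(predicted, truth, 10)
rate_20 = within_tolerance_rate(predicted, truth, 20)
print(rate_10, rate_20)
```

Widening the tolerance can only keep or raise the rate, which is consistent with the ±20 figure in the abstract being higher than the ±10 figure.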

2511.20216 2026-06-11 cs.AI cs.CE cs.CV cs.LG cs.RO

CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Samwoo Seong, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Youngjae Yu, Yunsung Lee

2512.20464 2026-06-11 physics.optics cs.CV cs.NE physics.app-ph

Snapshot 3D image projection using a diffractive decoder

Cagatay Isil, Alexander Chen, Yuhang Li, F. Onuralp Ardic, Shiqi Chen, Che-Yung Shen, Aydogan Ozcan

DOI: 10.1038/s41377-026-02378-3
Journal ref: Light: Science & Applications (2026)
Comments: 22 Pages, 8 Figures

English abstract

3D image display is essential for next-generation volumetric imaging; however, dense depth multiplexing for 3D image projection remains challenging because diffraction-induced cross-talk rapidly increases as the axial image planes get closer. Here, we introduce a 3D display system comprising a digital encoder and a diffractive optical decoder, which simultaneously projects different images onto multiple target axial planes with high axial resolution. By leveraging multi-layer diffractive wavefront decoding and deep learning-based end-to-end optimization, the system achieves high-fidelity depth-resolved 3D image projection in a snapshot, enabling axial plane separations on the order of a wavelength. The digital encoder leverages a Fourier encoder network to capture multi-scale spatial and frequency-domain features from input images, integrates axial position encoding, and generates a unified phase representation that simultaneously encodes all images to be axially projected in a single snapshot through a jointly-optimized diffractive decoder. We characterized the impact of diffractive decoder depth, output diffraction efficiency, spatial light modulator resolution, and axial encoding density, revealing trade-offs that govern axial separation and 3D image projection quality. We further demonstrated the capability to display volumetric images containing 28 axial slices, as well as the ability to dynamically reconfigure the axial locations of the image planes, performed on demand. Finally, we experimentally validated the presented approach, demonstrating close agreement between the measured results and the target images. These results establish the diffractive 3D display system as a compact and scalable framework for depth-resolved snapshot 3D image projection, with potential applications in holographic displays, AR/VR interfaces, and volumetric optical computing.

2512.20011 2026-06-11 cs.CV

PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Andrews Danyo, Eugene Denteh, Armstrong Aboah

2411.10077 2026-06-11 cs.CV

Hierarchical Mutual Distillation for Multi-View Fusion: Learning from All Possible View Combinations

Jiwoong Yang, Haejun Chung, Ikbeom Jang

2412.12944 2026-06-11 math.OC cs.CV

Online optimisation for dynamic electrical impedance tomography

Neil Dizon, Jyrki Jauhiainen, Tuomo Valkonen

1. Multimodal and Vision-Language Models (22 papers)

DeceptionX: Explainable Deception Detection with Multimodal Large Language Models

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

On Aligning Hierarchical Standardized Embedding for Audio-visual Generalized Zero-shot Learning

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

MSUE: Multi-Modal Soccer Understanding Expert

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Information-Theoretic Decomposition for Multimodal Interaction Learning

How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

ReMoT: Reinforcement Learning with Motion Contrast Triplets

What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

Brain-IT-VQA: From Brain Signals to Answers

Frames2LoRA: Parametric Video Internalization for Vision-Language Models

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

2. Embodied AI, Robotics, and Autonomous Driving (16 papers)

LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment

SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

Understanding Cross-Sensor Feature Variations for Generalizable 3D Perception

DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace

From Nominal Intensity to Equivalent Rainfall: A Path-Based Credibility Evaluation Framework for Simulated Rainfall in Autonomous-Driving Perception Tests

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations

MB-Loc: Multi-planar Bird's-eye-view Localization in outdoor LiDAR scenes

PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

ActionMap: Robot Policy Learning via Voxel Action Heatmap

3. Image Recognition, Retrieval, and Classification (14 papers)

Adapting Vision-Language Models from Iconic to Inclusive for Multi-Label Recognition Without Labels

Learning Instance-Adaptive Low-Rank Orthogonal Subspaces for Clothes-Changing Person Re-Identification

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

Image Quality Assessment of Identity Cards Using Measures from Open Face Image Quality

Feature extraction for plant growth estimation

ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

Vision Transformers for Face Recognition Need More Registers

MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID

Non-frontal face recognition using GANs and memristor-based classifiers

Bridging Day and Night: Unsupervised Cross-Domain Re-Identification with Synergistic Prompt and Prototype Learning

Bridging the Modality Gap in Forensic Image Retrieval

Causal Clothes-Invariant Feature Learning for Cloth-Changing Person Re-ID

STEAM: Squeeze and Transform Enhanced Attention Module

MARIC: Multi-Agent Reasoning for Image Classification

4. Object Detection, Segmentation, and Localization (15 papers)

CFCamo: A Counterfactual Detect-or-Abstain Framework for Camouflaged Object Detection

EventRadar: Long-Range Visual UAV Discovery through Spatiotemporal Event Sensing

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

FreqKD: Frequency-Decoupled Cross-Modal Knowledge Distillation for Infrared Object Detection

Battery detection of XRay images using transfer learning

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

Performance Analysis of YOLOv11 and YOLOv8 for Mixed Traffic Object Detection under Adverse Weather Conditions in Developing Countries

A Turbo-Inference Strategy for Object Detection and Instance Segmentation

Contour Field based Elliptical Shape Prior for the Segment Anything Model

RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

CountZES: Counting via Zero-Shot Exemplar Selection

Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

Weakly Supervised Segmentation as Semantic-Based Regularization

Lighting-aware Unified Model for Instance Segmentation

Spatially Selective Self-Training for Unsupervised Building Change Detection

5. Video Understanding and Temporal Vision (12 papers)

Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition

Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition

From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations

SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding

MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition