arXivDaily arXiv Daily Academic Digest (updated Monday through Friday)
2605.18754 2026-05-19 cs.CV version update

Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate

Soumava Paul, Prakhar Kaushik, Alan Yuille

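Affiliations CCVL, Johns Hopkins University

AI summary This paper studies how reliable multiview 3D consistency evaluation is when 3D foundation models hallucinate. It proposes a controlled robustness benchmark and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components, and it introduces COLMAP-based metrics that agree more closely with human judgments.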

Comments Project Page at https://mvp18.github.io/3d-consistency-metrics/

English abstract

Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in NVS and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce \benchmark, a controlled robustness benchmark for multiview 3D consistency, and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to $3\times$ more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to $4\times$ higher correlation with human judgments than MEt3R.

2605.18749 2026-05-19 cs.SD cs.CV version update

WavFlow: Audio Generation in Waveform Space

Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu, Yuren Cong, Xiaohui Zhang, Fanny Yang, Belinda Zeng

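Affiliations Meta AI; Northeastern University

AI summary This paper proposes WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. Waveform patchify and amplitude lifting enable stable optimization, and an automated data pipeline curates high-quality video-text-audio triplets. Results on video-to-audio and text-to-audio benchmarks are competitive, indicating that intermediate compression is not required for high-quality synthesis.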

Comments Code: https://github.com/facebookresearch/WavFlow

English abstract

Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
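
Illustrative sketch (not the authors' code): the abstract says audio is reshaped into 2D token grids through waveform patchify, but it gives no patch size or grid layout. The Python below shows one plausible reading, in which consecutive samples form a token and tokens are laid out row by row. The function names, the zero-padding rule, and the `patch_size` and `tokens_per_row` parameters are all assumptions. Amplitude lifting is left out because the abstract does not define it.

```python
def patchify(waveform, patch_size, tokens_per_row):
    """Reshape a 1D waveform into a 2D grid of tokens.

    Each token holds `patch_size` consecutive samples. Tokens are laid out
    row by row, `tokens_per_row` per row. The tail is zero-padded so the
    grid is rectangular (an assumed padding rule, not taken from the paper).
    """
    if patch_size <= 0 or tokens_per_row <= 0:
        raise ValueError("patch_size and tokens_per_row must be positive")
    tokens = []
    for start in range(0, len(waveform), patch_size):
        patch = list(waveform[start:start + patch_size])
        patch += [0.0] * (patch_size - len(patch))  # pad a short final patch
        tokens.append(patch)
    while len(tokens) % tokens_per_row != 0:
        tokens.append([0.0] * patch_size)  # pad with empty tokens to fill the last row
    return [tokens[i:i + tokens_per_row]
            for i in range(0, len(tokens), tokens_per_row)]


def unpatchify(grid, length):
    """Flatten the token grid back to a waveform and drop the padding."""
    flat = [sample for row in grid for token in row for sample in token]
    return flat[:length]
```

Calling `patchify(samples, patch_size=4, tokens_per_row=2)` on a 10-sample list yields a 2 x 2 grid of 4-sample tokens, and `unpatchify(grid, 10)` recovers the original samples.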

2605.18748 2026-05-19 cs.CV version update

Aurora: Unified Video Editing with a Tool-Using Agent

Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, Jiebo Luo

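Affiliations MIT-IBM; NVIDIA

AI summary This paper proposes Aurora, a framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer to resolve textual and visual underspecification in video editing requests, making editing more flexible and accurate.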

Comments Code: https://github.com/yeates/Aurora

English abstract

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: https://yeates.github.io/Aurora-Page

2605.18735 2026-05-19 cs.CV cs.GR cs.LG version update

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

Miguel Farinha, Ronald Clark

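Affiliations Department of Computer Science, University of Oxford

AI summary PIXLRelight bridges physically based rendering and learned image synthesis through a shared intrinsic conditioning, giving controllable single-image relighting. The conditioning is obtained from real photographs at training time and from PBR renders at inference time, which preserves image detail while producing high-quality relighting.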

Comments Project page: https://mlfarinha.github.io/pixl-relight/. Under review

English abstract

We present PIXLRelight, a feed-forward approach for physically controllable single-image relighting. Existing methods either provide limited lighting control (e.g. through text or environment maps), accumulate errors when chaining inverse and forward rendering, or require costly per-image optimization. Our key idea is to bridge physically based rendering (PBR) and learned image synthesis through a shared intrinsic conditioning that can be obtained from either real photographs or PBR renders. At training time, paired multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals, which condition the model. At inference time, the same conditioning is computed from a path-traced render of a coarse 3D reconstruction of the input under user-specified PBR lights. A transformer-based neural renderer then applies the target illumination to the source photograph, preserving fine image detail through a per-pixel affine modulation. PIXLRelight enables arbitrary PBR-style lighting control, achieves state-of-the-art relighting quality, and runs in under a tenth of a second per image. Code and models are available at https://mlfarinha.github.io/pixl-relight/.
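
A rough sketch of the two per-pixel operations the abstract names, written in plain Python on single-channel pixel lists. It assumes the common intrinsic-image model (`image = albedo * shading + residual`), which the abstract does not state explicitly. It also takes the scale and shift maps as given inputs, whereas in PIXLRelight a neural renderer would presumably predict them. All function names are hypothetical.

```python
def _check_same_length(*maps):
    """Raise if the per-pixel maps do not all cover the same number of pixels."""
    if len({len(m) for m in maps}) > 1:
        raise ValueError("all per-pixel maps must have the same length")


def compose_intrinsics(albedo, shading, residual):
    """Recombine intrinsic layers per pixel: albedo * diffuse shading + residual.

    This is the usual intrinsic-image model, assumed here for illustration.
    """
    _check_same_length(albedo, shading, residual)
    return [a * s + r for a, s, r in zip(albedo, shading, residual)]


def affine_modulate(source, scale, shift):
    """Per-pixel affine modulation: out = scale * source + shift.

    Because each output pixel is an affine function of the matching source
    pixel, fine detail in the source photograph carries through to the output.
    """
    _check_same_length(source, scale, shift)
    return [g * x + b for x, g, b in zip(source, scale, shift)]
```

A scale map of all ones and a shift map of all zeros returns the source unchanged, which is one way to see why this form keeps image detail intact.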

2605.18734 2026-05-19 cs.CV version update

EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen, Shaofang Quan, Chengzhi Wu, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen

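Affiliations Karlsruhe Institute of Technology; ETH Zurich; University of Oxford; Hunan University

AI summary This paper introduces EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos, together with E$^2$-Select, a training-free frame selection method. Experiments show that ego and exo views provide complementary memory cues, while existing models still perform poorly on the benchmark.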

Comments The source code and dataset can be found at https://github.com/RuipingL/EgoExoMem

English abstract

Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains $2.6K$ high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E$^2$-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only $55.3\%$. E$^2$-Select achieves state-of-the-art performance of $58.2\%$ over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.
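
A minimal sketch of what relevance-based budget allocation across two views could look like, assuming each frame already has a non-negative relevance score for the question. The proportional split rule and both function names are assumptions, not the paper's definitions. The per-view k-DPP sampler is replaced here by a plain top-k pick, which ignores the diversity term a k-DPP would add.

```python
def allocate_budget(ego_scores, exo_scores, total_frames):
    """Split a frame budget between views in proportion to total relevance.

    Assumed rule: each view's share equals its share of summed relevance.
    Falls back to an even split when every score is zero.
    """
    ego_mass, exo_mass = sum(ego_scores), sum(exo_scores)
    total_mass = ego_mass + exo_mass
    if total_mass == 0:
        ego_k = total_frames // 2
    else:
        ego_k = round(total_frames * ego_mass / total_mass)
    ego_k = min(ego_k, len(ego_scores))            # cannot pick more frames than exist
    exo_k = min(total_frames - ego_k, len(exo_scores))
    return ego_k, exo_k


def top_k_frames(scores, k):
    """Stand-in for per-view k-DPP sampling: keep the k most relevant frames.

    Returns frame indices in temporal order.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

With ego scores `[4.0, 1.0, 3.0, 0.0]`, exo scores `[1.0, 2.0, 0.0, 1.0]`, and a budget of 3 frames, the ego view holds two thirds of the relevance mass and so receives two of the three frames.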

2605.18733 2026-05-19 cs.CV version update

Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

Jinzhuo Liu, Jiangning Zhang, Wencan Jiang, Yabiao Wang, Dingkang Liang, Zhucun Xue, Ran Yi, Yong Liu

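Affiliations Zhejiang University; Tencent Youtu Lab; Huazhong University of Science and Technology; Shanghai Jiao Tong University

AI summary This paper proposes IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities for consistent generation. It also introduces NarraStream-Bench, on which IAMFlow achieves the best overall performance for narrative streaming video generation.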

Comments Project page: https://eddie0521.github.io/projects/iamflow/ Code: https://github.com/Eddie0521/IAMFlow

English abstract

Autoregressive video generation has improved rapidly in visual fidelity and interactivity, but it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames using predefined strategies or retrieve keyframes based on coarse implicit attention signals, both of which fail to handle evolving prompts with shifting entity references, leading to identity drift, character duplication, and attribute loss. To address this, we propose IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. Specifically, an LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. To keep the proposed framework computationally practical, we design a systematic inference acceleration pipeline, including asynchronous visual verification, adaptive prompt transition, and model quantization, which achieves faster generation than existing baselines. Furthermore, we introduce NarraStream-Bench, a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates both traditional metrics and multimodal large language model-based assessments. Extensive experiments show that IAMFlow, despite being training-free, achieves the best overall performance on NarraStream-Bench, outperforming the strongest baseline by 2.56 points, while achieving a 1.39$\times$ speedup over the most efficient baseline in the 60-second multi-prompt setting.

2605.18729 2026-05-19 cs.RO cs.CV version update

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

Nga Teng Chan, Yi Zhang, Yechi Liu, Renwen Cui, Fanhu Zeng, Zeyuan Ding, Xiancong Ren, Zhang Zhang, Qifeng Chen, Jian Liu, Yong Dai, Xiaozhu Ju

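Affiliations The Hong Kong University of Science and Technology; X-Humanoid; Institute of Automation, CAS; Beijing University of Aeronautics and Astronautics

AI summary This paper proposes Robo-Cortex, a framework that uses dual-grain cognitive memory and autonomous knowledge induction so that robots can induce navigation heuristics and refine cognitive strategies on their own, improving navigation and exploration in complex environments.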

English abstract

The ability to navigate and interact with complex environments is central to real-world embodied agents, yet navigation in unseen environments remains challenging due to "experiential amnesia," where existing trajectory-driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo-Cortex, a self-evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection-adaptation loop. By abstracting success patterns and failure pitfalls into natural-language heuristics, Robo-Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual-Grain Cognitive Memory system, comprising a Short-term Reflective Memory (SRM) for real-time local progress analysis, and a Long-term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision-making, we introduce a multimodal Imagine-then-Verify loop, where a world model simulates potential outcomes and a VLM-based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo-Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real-world robotic experiments further support the effectiveness of Robo-Cortex in physical settings.
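
For readers unfamiliar with the metric behind the reported gains: the abstract does not expand SPL, but in embodied navigation it normally stands for Success weighted by Path Length. The sketch below computes that standard definition, assuming each episode records a success flag, the shortest-path length, and the length of the path the agent actually took. The function name, the episode tuple layout, and the handling of zero-length episodes are choices made for this example.

```python
def spl(episodes):
    """Success weighted by Path Length, averaged over episodes.

    Each episode is (success, shortest_len, path_len). A successful episode
    contributes shortest_len / max(path_len, shortest_len); a failed episode
    contributes 0.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, path_len in episodes:
        if not success:
            continue
        denom = max(path_len, shortest_len)
        # Assumption: a successful zero-length episode counts as fully efficient.
        total += 1.0 if denom == 0 else shortest_len / denom
    return total / len(episodes)
```

An agent that succeeds but walks twice the shortest distance earns 0.5 for that episode, so SPL rewards both reaching the goal and getting there efficiently.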

2605.18719 2026-05-19 cs.CV version update

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar

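Affiliations Mohamed bin Zayed University of Artificial Intelligence; Michigan State University

AI summary This paper proposes an online reinforcement learning framework that post-trains diffusion models with Group Relative Policy Optimization (GRPO) on negative and positive text prompts, addressing data scarcity and model degradation. A steering reward mechanism improves safety, and experiments show less inappropriate content together with better generation quality.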

Comments Page 28, Image 20, Table 6

English abstract

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe text paired with safe-image ground truth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
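
As background on the optimizer named in the abstract, the sketch below shows the group-relative advantage step that gives GRPO its name: rewards for several samples generated from the same prompt are normalized against that group's own mean and standard deviation, so no learned value function is needed. This is the generic GRPO normalization, not code from the paper. The steering reward itself is not reproduced, the reward values a caller passes in are placeholders, and the use of the population standard deviation is an assumption (implementations differ on this).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of samples.

    advantage_i = (reward_i - group mean) / (group std + eps)
    Samples that beat their group's average get a positive advantage.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]
```

When every sample in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.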

2605.18714 2026-05-19 cs.CV cs.AI version update

Semantic Generative Tuning for Unified Multimodal Models

Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li

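Affiliations Shanghai Jiao Tong University; Tencent ARCLab

AI summary This paper proposes Semantic Generative Tuning (SGT), which uses high-level semantic tasks as generative proxies to align perception and generation in unified multimodal models, improving both multimodal understanding and generation quality.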

Comments 14 pages, 13 figures

English abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.

2605.18700 2026-05-19 cs.CV version update

A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition

Edwin Arkel Rios, Augusto Christian Surya, Oswin Gosal, Fernando Mikael, Mary Madeline Nicole, Kisoon Jang, Bo-Cheng Lai, Min-Chun Hu

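Affiliations National Yang Ming Chiao Tung University, Taiwan; National Tsing Hua University, Taiwan

AI summary Across more than 2000 experiments, this paper examines the accuracy-versus-cost trade-offs of different training and evaluation settings, extends Counterfactual Attention Learning, and proposes an efficient evaluation variant that reduces inference cost.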

Comments Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 4 figures

English abstract

Prior work on fine-grained image recognition (FGIR) has established the importance of the backbone selection, but has neglected the accuracy-vs-cost trade-offs under different training and evaluation settings. In this work we conduct a large-scale study with over 2000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference costs by forfeiting the forward pass on discriminative crops that is normally used by CAL and similar FGIR methods. Our results show that data-aware augmentations during training only can enable a model to achieve excellent accuracy even without crops, significantly reducing inference costs. To support future research we share our code and checkpoints at: https://github.com/arkel23/FGIR-Backbones

2605.18680 2026-05-19 cs.CV version update

CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation

Rajeev Goel, Jason Ding, Phani Harish Wajjala, Pavan Turaga, Tejaswi Gowda, Krishna C. Garikipati

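Affiliations Arizona State University; Roblox Corporation

AI summary This paper proposes CMAG, a concept-scaffolded retrieval and verified composition framework for marketplace avatar generation. It addresses retrieval failures caused by ambiguous text, noisy metadata, and inconsistent components, improving the topological consistency and compositional correctness of generated avatars.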

Comments Accepted to CVPR 2026 Workshop (GRAIL-V)

English abstract

Metaverse platforms rely on creator-driven marketplaces where avatars are assembled from discrete, taxonomy-labeled 3D assets (e.g., tops, bottoms, shoes, accessories) under strict category and topology constraints. While users increasingly expect free-form text control, text-only retrieval is brittle: natural language is ambiguous with respect to platform taxonomies, metadata is often noisy or informal, and independently retrieved components can be stylistically inconsistent or geometrically incompatible. We propose CMAG, a concept-scaffolded retrieval and verified composition framework for marketplace avatar generation. Given a prompt, CMAG first synthesizes an intermediate 3D concept scaffold that disambiguates intent beyond text by providing global spatial and stylistic context. In parallel, a view-aware part discovery module extracts localized visual evidence via prompt decomposition and text-grounded segmentation. A prompt-conditioned taxonomy router enforces category coverage and resolves semantic-to-taxonomic mismatch, after which a hybrid category-wise retriever combines part-based fusion with a concept-residual fallback using feature suppression. Finally, an agentic vision-language model filters and re-ranks candidates across categories and drives an iterative verification loop to assemble prompt-faithful, topologically consistent avatars from catalog assets. We evaluate CMAG on diverse compositional prompts and demonstrate improved retrieval robustness and compositional correctness compared to strong baselines, highlighting the importance of 3D concept scaffolding under prompt ambiguity.

2605.18667 2026-05-19 cs.CV cs.LG version update

Better Together: Evaluating the Complementarity of Earth Embedding Models

Thijs L van der Plas, Jacob JW Bakermans, Vishal Nedungadi, Gabrielė Tijūnaitytė, Marc Rußwurm, Ioannis N Athanasiadis

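Affiliations Wageningen University; University College London; University of Bonn

AI summary This paper studies the complementarity of Earth embedding models, proposes fusing embeddings to improve performance, and evaluates four models across different tasks. Complementarity turns out to depend on both task and location.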

English abstract

Earth embedding models transform Earth observation data into embeddings uniquely tied to locations on the Earth's surface. These models are typically evaluated in isolation, comparing the downstream task performance across different Earth embeddings. However, spatially aligned embeddings can naturally be fused, providing richer information per location, a capability that isolated evaluations fail to capture. We therefore propose assessing Earth embeddings by their complementarity: the performance gain of fused embeddings over the best single-model baseline. To operationalise this, we introduce an embedding complementarity index applicable to any embedding and task, and evaluate four Earth embedding models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) in isolation, in all pairs, and jointly across six downstream tasks. Fused embeddings outperform the best single model in four out of six tasks, confirming that single-embedding evaluations often underestimate Earth embedding capabilities. Complementarity proves both task- and location-dependent. Further, for a land cover regression task, we find that complementarity is partially determined by the spatial scale of land cover classes. Complementarity reframes Earth embeddings: the greatest future gains may come not from any single Earth embedding model, but from combinations that are better together.
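
A small sketch of the idea, assuming fusion means concatenating the per-location vectors of spatially aligned embeddings and that complementarity is measured as the raw score difference between the fused model and the best single model. The abstract defines complementarity as that performance gain but does not give the exact form of its index, so the unnormalized difference below is an assumption, as are the function names and the made-up scores in the usage note.

```python
def fuse(*embeddings):
    """Concatenate aligned embeddings location by location.

    Each argument is a list of per-location vectors; all must cover the
    same locations in the same order. Concatenation is one simple fusion
    choice, assumed here for illustration.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    n_locations = len(embeddings[0])
    if any(len(e) != n_locations for e in embeddings):
        raise ValueError("embeddings must be spatially aligned")
    return [sum((list(e[i]) for e in embeddings), [])
            for i in range(n_locations)]


def complementarity_gain(single_scores, fused_score):
    """Gain of the fused embedding over the best single-model baseline.

    Positive values mean the models carry complementary information
    for this task.
    """
    return fused_score - max(single_scores.values())
```

For example, with hypothetical downstream scores of 0.70 and 0.75 for two single models and 0.80 for their fusion, the gain is about 0.05, while a fused score below 0.75 would give a negative gain.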

2605.18652 2026-05-19 cs.CV version update

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo

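Affiliations University of Rochester; MIT-IBM Watson AI Lab; University of Wisconsin-Madison

AI summary This paper proposes MementoGUI, a learned agentic multimodal memory control framework that helps long-horizon GUI agents maintain task state, using modular memory control and a scalable data pipeline to improve memory retrieval and decision-making.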

Comments Preprint, 15 pages, 4 figures, 5 tables

English abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

2605.18645 2026-05-19 cs.CV version update

Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video

Arslan Artykov, Tom Ravaud, Nicolás Violante-Grezzi, Vincent Lepetit

Affiliations * LIGM, CNRS, Univ Gustave Eiffel, ENPC, Institut Polytechnique de Paris

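AI summary This paper proposes a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives avoid the pitfalls of unstable point tracking, and a new mechanism organizes them into coherent parts constrained by revolute and prismatic joints, recovering complex kinematics from a single casually captured video.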

详情
AI中文摘要

从单目视频中检索机械对象的3D运动学是计算机视觉中的基本挑战。现有方法依赖于复杂的视频设置或长期点跟踪、宽基线匹配等线索,但在严重遮挡、快速相机自运动或弱局部特征下经常表现脆弱。基于学习的方法在泛化到训练类别之外时也面临困难。我们提出了一种类别无关的优化框架,将机械对象理解视为原始体拟合问题。几何原始体作为代理表示,避免了不稳定点跟踪的陷阱;一种新的机制将它们组织成受旋转和滑动关节约束的连贯部分。我们的公式同时优化部分分割和关节参数,从单个随意拍摄的视频中恢复复杂的运动学。一种可见性意识的程序处理现实数据中固有的部分观察和遮挡。我们还提出了AiP-synth和AiP-real基准,具有显著的相机运动和严重的遮挡,并在现有方法上取得了更好的表现。项目页面:https://aartykov.github.io/Articulation-in-Prime/

英文摘要

Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods. Project page: https://aartykov.github.io/Articulation-in-Prime/

2605.18641 2026-05-19 cs.CV 版本更新

Leveraging Latent Visual Reasoning in Silence

利用沉默中的潜在视觉推理

Dongyao Zhu, Zhen Wang, Xi Xiao, Han Jiang, Saeed Vahidian, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su, Raju Vatsavai, Jianyang Gu

Affiliations * North Carolina State University; UC San Diego; University of Alabama at Birmingham; Johns Hopkins University; Duke University; Boston University; The Ohio State University

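AI summary This paper examines whether latent tokens need to persist at inference and finds that removing them or replacing them with random noise has little effect on performance. It proposes an attention-based reward that encourages latent tokens to interact with later text tokens, improving performance on visual perception and visual reasoning tasks.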

详情
AI中文摘要

潜在视觉推理通过在文本生成前插入连续潜在令牌,更直接地参与多模态推理。然而,这些潜在令牌在推理中的必要性仍存疑。我们发现,在空间推理基准上,用随机噪声替代或完全移除潜在令牌对性能影响很小。强化学习进一步在训练后减少了潜在生成行为。这些观察引发了一个核心问题:潜在视觉推理是否仍然有意义?我们认为其价值应由潜在令牌如何引导学习来衡量,而非是否在推理时保留。我们的分析表明,潜在推理在不同问题类型中效果不均,但任务级路由应用潜在生成是脆弱的。受这些发现启发,我们提出了一种基于注意力的奖励,鼓励生成的潜在令牌在强化学习中与后续文本令牌交互。该奖励在潜在模式激活时促进潜在利用,同时保持使用纯文本推理的灵活性。实验表明,我们的方法在感知和视觉推理基准上提升了性能,即使在训练后潜在令牌很少生成。我们的结果表明,在推理时没有显式表达的情况下,潜在视觉推理可以塑造更好的视觉基础和更准确的文本推理。我们的代码和训练模型可在GitHub和Hugging Face上公开获取。

英文摘要

Latent visual reasoning incorporates visual evidence more directly into multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains unclear. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by these findings, we propose an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL. This reward promotes latent utilization when the latent mode is activated while preserving the flexibility to use pure-text reasoning. Experiments show that our method improves performance across perception and visual reasoning benchmarks, even when latent tokens are rarely generated after post-training. Our results highlight that, without explicit expression at inference, latent visual reasoning can shape better visual grounding and more accurate textual reasoning in silence. Our code and trained models are publicly available on GitHub (https://github.com/ddydyd32/silent-lvr/tree/master) and Hugging Face (https://huggingface.co/collections/cornuHGF/silent-lvr).

2605.18636 2026-05-19 cs.CV 版本更新

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

SPIKE:一种适应性双控制器框架,用于成本效益高的长周期游戏智能体

Wencan Jiang, Jiangning Zhang, Jianbiao Mei, Jinzhuo Liu, Yu Yang, Xiaobin Hu, Zhucun Xue, Yong Liu, Dacheng Tao

Affiliations * Zhejiang University; National University of Singapore; Nanyang Technological University

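AI summary This paper proposes SPIKE, a dual-controller framework for cost-efficient long-horizon game agents. An event-trigger mechanism and a hierarchical memory structure keep the agent goal-directed and raise task completion; experiments on StarDojo show markedly higher success rates with lower resource consumption.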

Comments https://wencanjiang.github.io/projects/SPIKE/

详情
AI中文摘要

长周期多模态智能体在开放世界游戏中必须在紧绷的令牌和延迟预算下保持目标导向,通过许多低层交互。现有方法往往在昂贵的每步推理和易漂移、重复失败和恢复不佳的反应执行之间权衡。我们的核心思想是重用战略推理在局部稳定的段落中,并在事件边界重新调用。我们提出了SPIKE,一种适应性双控制器框架,用于成本高效的长周期游戏控制。其战略控制器执行低频全局规划、故障分析和恢复,而其反应控制器在严格的令牌预算下处理快速的本地执行。事件触发器监控视觉变化、任务进度、重复动作和失败信号,以决定何时控制应保持反应或升级到战略推理。分层记忆将短期经验重用在状态-动作记忆银行(SA-MB)中,与结构化证据在状态动作知识图(SA-KG)中分离,使每个控制器能够检索所需的上下文。这种设计在多个反应步骤中重用战略提案,支持计划过时时的本地覆盖,并保留昂贵的推理用于需要额外思考的时刻。在StarDojo的Lite-100分割上,SPIKE比最强的Lite-100基线提高了5.0个百分点(38.5%相对),比最强的预算基线提高了9.3点(75.6%相对)。它还减少了54.9%的令牌消耗和40.8%的延迟。消融实验表明,事件触发、反应覆盖和异构记忆各自对成功和恢复有贡献,支持选择性推理而非每一步推理。

英文摘要

Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.
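The event-trigger idea in the abstract above (stay reactive by default, escalate to strategic reasoning on notable events) can be illustrated with a minimal sketch. This is not SPIKE's implementation: the `StepSignals` fields, the `should_escalate` function, and both thresholds are hypothetical names and values chosen only to mirror the four monitored signals the abstract lists.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    """Hypothetical per-step signals; field names are assumptions."""
    visual_change: float    # fraction of the screen that changed, in [0, 1]
    progress_delta: float   # change in estimated task progress since last step
    repeated_actions: int   # how many times the last action was repeated
    failed: bool            # whether the last action reported a failure


def should_escalate(s: StepSignals,
                    change_thresh: float = 0.5,
                    stall_steps: int = 3) -> bool:
    """Return True when control should move from reactive to strategic.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    if s.failed:
        return True                       # explicit failure signal
    if s.visual_change >= change_thresh:
        return True                       # likely scene or event boundary
    if s.repeated_actions >= stall_steps:
        return True                       # agent appears stuck in a loop
    if s.progress_delta < 0:
        return True                       # task progress regressed
    return False                          # locally stable: stay reactive


if __name__ == "__main__":
    calm = StepSignals(visual_change=0.1, progress_delta=0.2,
                       repeated_actions=0, failed=False)
    stuck = StepSignals(visual_change=0.0, progress_delta=0.0,
                        repeated_actions=3, failed=False)
    print(should_escalate(calm), should_escalate(stuck))
```

Under these assumptions the expensive controller runs only when one of the four checks fires, which is the selective-reasoning pattern the abstract argues for.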

2605.18621 2026-05-19 cs.CV cs.AI 版本更新

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

CrossView Suite: 利用数据集、模型和基准 harnessing MLLMs 的跨视图空间智能

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

Affiliations * Zhejiang University

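AI summary This work proposes CrossView Suite, whose three components (CrossViewSet, CrossViewBench, and CrossViewer) address data scarcity, insufficient evaluation, and missing alignment mechanisms in cross-view reasoning, improving multi-view spatial understanding.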

详情
AI中文摘要

空间智能要求多模态大语言模型(MLLMs)超越单一视图感知,对物体、可见性、几何和交互在多个视角下保持一致推理。然而,跨视图推理的进步受限于三个主要缺口:大规模高质量标注训练数据的稀缺性、缺乏系统性评估的基准以及缺乏显式对齐机制以建立物体层面的一致性。为了解决这些缺口,我们全面开发了CrossView Suite的三个协调组件:CrossViewSet、CrossViewBench和CrossViewer。首先,我们引入一个多代理数据引擎,精心编纂了一个大规模、高质量的跨视图指令数据集,称为CrossViewSet,涵盖17种细粒度任务类型,包含1.6M个样本。其次,我们精心创建了一个场景不重叠的CrossViewBench,以全面评估MLLM的跨视图空间理解能力,评估其在各种方面的表现。最后,我们提出了CrossViewer,一个渐进的三阶段框架,用于MLLMs的跨视图空间推理,遵循感知->对齐->推理的范式。我们的方法配备了一个自适应的空间区域标记器,以捕捉细粒度的物体表示,然后显式对齐多视图对象,并因此融合对齐的特征,以提升MLLMs的跨视图推理能力。广泛的实验和分析表明,大规模训练数据、系统性评估和显式的跨视图对齐都是推动MLLMs从单视角感知向现实世界空间智能发展的关键因素。项目页面可在https://github.com/Thinkirin/Crossview-Suite上找到。

英文摘要

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite as three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method uses an adaptive spatial region tokenizer to capture fine-grained object representations, explicitly aligns objects across views, and fuses the aligned features to strengthen the cross-view reasoning capacity of MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.

2605.18617 2026-05-19 cs.RO cs.AI cs.CV 版本更新

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

ManiSoft: 向视觉-语言操控的柔软连续机器人迈进

Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu

Affiliations * Beihang University; National University of Singapore; Hangzhou Innovation Institute, Beihang University

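AI summary This paper proposes the ManiSoft benchmark for studying vision-language manipulation with soft continuum robots. A tailored simulator combines realistic soft-body dynamics with contact-rich interactions, four tasks highlight different aspects of deformable control, an automated pipeline generates 6,300 diverse scenes with expert trajectories, and three representative policy models are evaluated.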

Comments Accepted in ICML 2026

详情
AI中文摘要

大多数现有的视觉-语言操控研究针对刚性机械臂,其固定形态限制了在杂乱或狭窄空间中的适应性。柔软机械臂由于其可变形性提供了一个有吸引力的替代方案,但面临不可靠的本体感觉和分布式的低层驱动挑战。为了研究这些挑战,我们介绍了ManiSoft,一个用于柔软机械臂的视觉-语言操控基准。ManiSoft特征一个定制的模拟器,通过弹性力约束将真实柔软体动力学与丰富的接触交互相结合。在此基础上,ManiSoft定义了四个任务,每个任务突出显示变形控制的不同方面,从基本末端执行器协调到障碍物回避。为了支持策略训练和评估,ManiSoft包括一个自动化流程,生成6,300个多样场景及其对应的专家轨迹。为了大规模生成高质量轨迹,我们首先使用高层规划器将每个任务分解为一系列路径点,然后使用低层强化学习策略生成扭矩命令以跟踪路径点。基准测试三种代表性策略模型显示在清洁场景中相对有希望的结果,但在随机化情况下性能显著下降。可视化分析表明,失败主要源于本体感觉状态的视觉估计不准确和变形性在适应性障碍回避中的利用有限。我们预计ManiSoft将作为有价值的测试平台,在视觉-语言操控的背景下弥合刚性和柔软机械臂之间的差距。代码和数据集已发布在https://buaa-colalab.github.io/ManiSoft。

英文摘要

Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.

2605.18610 2026-05-19 cs.CV cs.AI cs.LG 版本更新

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

CATA: 通过冲突厌恶任务算术实现持续机器去学习

Shen Lin, Junhao Dong, Rongjie Chen, Xiaoyu Zhang, Li Xu, Xiaofeng Chen

Affiliations * Fujian Normal University; Nanyang Technological University; Xidian University

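AI summary This paper is the first to study continual unlearning for vision-language models and proposes CATA, which uses conflict-averse task arithmetic to address the challenges of unlearning effectiveness, model fidelity, and persistence.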

详情
AI中文摘要

视觉语言模型(VLMs)在对齐视觉和文本表示方面表现出色,能够支持多种多模态应用。然而,其大规模训练数据不可避免地引发了隐私、版权和不良内容的担忧,这使得机器去学习变得必要。尽管现有研究主要关注单次去学习,但实际VLM部署往往涉及随时间推移的连续删除请求,从而产生持续机器去学习。在本文中,我们首次研究了VLMs的持续去学习,并识别出该设置中的三个关键挑战:去除目标知识的有效性、保留模型效用的保真度以及在连续更新下防止知识重新出现的持续性。为了解决这些挑战,我们提出了CATA,一种冲突厌恶任务算术方法,将每个遗忘请求表示为一个去学习任务向量。通过维护历史任务向量并执行符号感知的冲突厌恶聚合,CATA抑制可能削弱先前遗忘效果的冲突更新组件。在单次和持续设置下的大量实验表明,CATA在遗忘有效性、模型保真度和遗忘持续性方面均优于基线方法。

英文摘要

Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.
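To make "sign-aware conflict-averse aggregation" of task vectors concrete, here is a minimal sketch. It is not CATA's algorithm, which the abstract does not specify: it assumes a simple rule in which a coordinate of the new unlearning vector is dropped whenever its sign opposes the summed historical vectors, so a later request cannot cancel an earlier forgetting update. The function name and the rule itself are illustrative assumptions.

```python
def conflict_averse_aggregate(history, new_vec):
    """Merge a new task vector into the sum of historical task vectors.

    history: list of earlier task vectors (lists of floats, equal length).
    new_vec: task vector for the current forget request.
    Assumed rule (not from the paper): where the new component's sign
    opposes the accumulated history, keep the history value unchanged.
    """
    dim = len(new_vec)
    # Accumulated effect of all previous unlearning requests per coordinate.
    past = [sum(vec[i] for vec in history) for i in range(dim)]
    merged = []
    for p, n in zip(past, new_vec):
        conflicts = p * n < 0             # strictly opposite signs
        merged.append(p if conflicts else p + n)
    return merged


if __name__ == "__main__":
    history = [[-1.0, 0.5, 0.0]]
    new_vec = [0.5, 0.5, -2.0]
    # First coordinate conflicts with history and is suppressed.
    print(conflict_averse_aggregate(history, new_vec))
```

A plain sum would give `-0.5` in the first coordinate, partially undoing the earlier update; the sign check is one simple way to prevent that kind of weakening.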

2605.18608 2026-05-19 cs.CV 版本更新

Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging

跨越迁移:通过动态风格桥接实现向前促进的持续测试时间适应

Zhilin Zhu, Yabin Wang, Zhiheng Ma, Yaguang Song, Yaowei Wang, Xiaopeng Hong

Affiliations * Harbin Institute of Technology; Pengcheng Laboratory; Shenzhen University of Advanced Technology; Guangdong Provincial Key Laboratory of Computility Microelectronics

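AI summary This paper proposes a forward-facilitation approach to continual test-time adaptation. Its Dynamic Style Bridging mechanism builds a compact knowledge base before deployment and dynamically injects incoming data styles at test time to provide reliable supervisory signals, enabling stable adaptation under continual shifts.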

Comments Accepted by CVPR 2026

详情
AI中文摘要

持续测试时间适应(CTTA)旨在使感知系统能够处理部署后遇到的动态分布偏移。现有方法主要采用后向对齐范式,这种范式将输入数据与源域衍生的监督代理进行刚性对齐,因此在面对不可靠的监督和不断变化的分布偏移时表现不佳。为克服这些限制,我们引入了一种新的向前促进范式,通过一种称为动态风格桥接的方法。在部署前,我们构建了一个生成类示例的紧凑知识库。在测试时间,为了减轻固有的生成偏移并使这些代理适应输入数据,我们提出了一个多级桥接机制。该机制在输入、统计和表示层动态地将代理与输入数据风格注入,同时保留代理的原始语义。这些高保真的代理随后被用来提供可靠且按需的监督信号,从而在持续偏移下实现稳定的适应。在标准CTTA基准上的广泛实验表明,我们的方法在最近的最先进方法上实现了一致且显著的改进。代码可在https://github.com/z1358/DAS上获得。

英文摘要

Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle dynamic distribution shifts encountered after deployment. Existing methods predominantly follow a backward-alignment paradigm, which rigidly aligns incoming data with supervisory surrogates derived from the source domain. Consequently, they struggle with unreliable supervision and evolving distribution shifts. To overcome these limitations, we introduce a novel forward-facilitation paradigm through a method termed Dynamic Style Bridging. Prior to deployment, we construct a compact knowledge base of generated class exemplars. During test time, to mitigate inherent generative bias and adapt these proxies to incoming data, we propose a multi-level bridging mechanism. This mechanism dynamically injects the proxies with incoming data styles at the input, statistical, and representation levels, while preserving the original semantics of the proxies. These high-fidelity proxies are then used to provide reliable, on-demand supervisory signals, enabling stable adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate that our method achieves consistent and substantial improvements over recent state-of-the-art approaches. Code is available at https://github.com/z1358/DAS.

2605.18603 2026-05-19 cs.CV 版本更新

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

Starve to Perceive: 通过受限视觉带宽驯服VLMs中的懒惰感知

Yuhuan Wu, Cong Wei, Fangzhen Lin, Wenhu Chen, Haozhe Wang

Affiliations * Hong Kong University of Science and Technology; University of Waterloo

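AI summary This paper proposes a training method that constrains visual bandwidth to force vision-language models to perceive actively, improving their performance in high-resolution visual environments.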

详情
AI中文摘要

视觉语言模型(VLMs)作为处于环境中的智能体,在高分辨率视觉环境中需要主动感知——即通过缩放、裁剪和平移等操作动态决定观察方向的能力。然而,当前的训练范式产生的是模仿这些操作表面形式而没有功能性依赖的模型,我们称之为懒惰感知。我们将其归因于一个根本的学习不对称性:当粗略的全局视图结合语言先验足以达到中等准确度时,模型没有动力学习更复杂的多步骤视觉搜索。如果模型可以不主动观察就成功,它将永远学不会主动观察。这促使我们提出Starve to Perceive,一种限制视觉带宽的训练范式——限制每个观察到紧贴令牌预算,使得单个视角不足以完成任务,使主动感知成为唯一可行的策略。尽管不需要辅助损失、奖励塑造或架构变化——作为标准后训练流程的最小、即插即用修改——在感知饥饿下训练的模型在多种基准上实现了显著的提升,平均相对改进达5%。

英文摘要

Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes -- serving as a minimal, plug-in modification to standard post-training pipelines -- models trained under perceptual starvation achieve substantial gains of 5% average relative improvement across diverse benchmarks.

2605.18601 2026-05-19 cs.CV 版本更新

Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Incantation: 自然语言作为多实体视频世界模型的动作接口

Shangwen Zhu, Qianyu Peng, Zhao Pu, Zhilei Shu, Xiangrui Ke, Zhaohu Xing, Zizhao Tong, Zeqing Wang, Xinyu Cui, Huangji Wang, Jian Zhao, Yeying Jin, Fan Cheng, Ruili Feng

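Affiliations * SJTU (Shanghai Jiao Tong University); NVIDIA Research; USTC (University of Science and Technology of China); UCAS (University of Chinese Academy of Sciences); NUS (National University of Singapore); UWaterloo (University of Waterloo); HKUST (Hong Kong University of Science and Technology); HKU (The University of Hong Kong); ZGCA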

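AI summary This study proposes natural language as the action interface for multi-entity video world models, addressing the lack of fine-grained multi-entity control and of cross-entity, cross-world generalization in conventional interfaces; natural-language conditioning provides greater expressiveness.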

详情
AI中文摘要

现代交互式视频世界模型已实现了令人印象深刻的视觉保真度,但缺乏细粒度的多实体控制和跨实体、跨世界的泛化能力。我们追溯这一差距到动作接口:标准控制协议(例如动画ID、设备输入、场景级标题)在设计时将动作语义绑定到特定实体或引擎。我们提出自然语言作为接口,以解锁任何先前接口都无法实现的表达能力,并展示了Incantation,第一个具有每潜在帧(0.25秒)自然语言条件化的交互式视频世界模型,支持同时多实体控制和概念级跨实体转移,超越任何固定的渲染管道。我们配对了一个预训练的双向视频主干与帧本地文本交叉注意力,并通过ODE初始化的Self-Forcing蒸馏与RoPE解耦的滑动KV缓存实现实时长时间跨度流媒体。我们在跨实体转移(89% vs. 43%)和out-of-vocabulary提示(90% vs. 0%)上超越了Action-Index基线,并且我们的两步学生在480p下以19.7 FPS稳定运行,FVD在2小时滚动中保持稳定。我们进一步将相同的架构和训练配方应用于《国王之剑》,仅更改每个实体的动作词汇槽。我们已发布Incantation数据集的预览子集,包含手动收集的《艾尔登法环》玩家-Boss战斗片段,带有结构化的动作导向元数据。更大规模的《艾尔登法环》和KOF数据将在完整项目中发布。

英文摘要

Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.

2605.18599 2026-05-19 cs.CV 版本更新

Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

通过语义-空间解耦解决前馈新视角合成变换器中的表示歧义

Yihang Wu, Yihang Sun, Shaofeng Zhang, Zuxuan Wu, Junchi Yan, Xiaosong Jia, Yu-gang Jiang

Affiliations * Institute of Trustworthy Embodied Artificial Intelligence (TEAI); Shanghai Key Laboratory of Multimodal Embodied AI; Sch. of Artificial Intelligence & Sch. of Computer Science, Shanghai Jiao Tong University; University of Science and Technology of China

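AI summary This paper addresses representation ambiguity in feedforward novel view synthesis transformers through semantic-spatial decoupling. Separate semantic and spatial tokens keep both kinds of information explicit, while shared attention routing preserves cross-branch interaction; optional categorized supervision and bidirectional modulation further improve that interaction, yielding consistent gains in decoder-only and encoder-decoder feedforward NVS models.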

Comments 24 pages, 11 figures, 4 tables. Project page: https://hangzay.github.io/ssd_lvsm/

详情
AI中文摘要

基于变换器的模型已推动前馈新视角合成(NVS)的发展。当前架构如GS-LRM和LVSM将语义信息(例如RGB)和空间信息(例如Plücker射线)混合到共享特征空间中。由于Plücker射线自然携带格状空间结构,这些设计会使空间偏差干扰外观表示并降低渲染保真度。为此,我们提出将前馈NVS变换器的表示解耦为单独的语义和空间令牌。解耦设计在各自的分支中保持语义和空间信息的显式性,同时通过共享注意力路由保持跨分支交互。基于此设计,我们引入可选分类监督和双向调制:前者提供分支特定的训练信号,后者提高两个分支之间的交互。值得注意的是,基础解耦设计由于其架构设计几乎不增加推理延迟。所提出的设计实现了持续的改进,证明了其在解码器-only和编码器-解码器前馈NVS模型中的有效性。

英文摘要

Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design introduces virtually zero additional inference latency due to its architectural design. The proposed designs achieve consistent improvements, demonstrating effectiveness across decoder-only and encoder-decoder feedforward NVS models.
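The abstract above refers to Plücker rays as the spatial input. As background, a ray with origin `o` and unit direction `d` is commonly encoded as the 6-vector `(d, o × d)`. The sketch below computes that encoding for one ray; it follows the standard definition rather than any code from GS-LRM, LVSM, or this paper, and the function names are illustrative.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(origin, direction):
    """Return the 6-D Plücker encoding (d, m) with m = origin x d.

    The direction is normalized first, so the encoding does not depend
    on the length of the input direction vector.
    """
    norm = sum(c * c for c in direction) ** 0.5
    d = tuple(c / norm for c in direction)
    m = cross(origin, d)        # moment vector, always perpendicular to d
    return d + m


if __name__ == "__main__":
    ray = plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, 2.0))
    print(ray)
    # The moment is the same for any point chosen along the same ray.
    print(plucker_ray((1.0, 0.0, 5.0), (0.0, 0.0, 1.0))[3:] == ray[3:])
```

Because neighboring pixels have smoothly varying directions and moments, a per-pixel map of these 6-vectors carries the regular spatial structure that the abstract describes as lattice-like.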

2605.18577 2026-05-19 cs.CV 版本更新

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

OmniPro: 一个全面的多模态流视频理解基准测试

Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li

Affiliations * Renmin University of China; WeChat Vision, Tencent Inc.

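AI summary This paper proposes the OmniPro benchmark for jointly evaluating omni-modal perception, proactive responding, and diverse video understanding tasks. Its 2,700 human-verified samples span 9 sub-tasks and 3 cognitive levels, and the evaluation reveals the key role of audio in streaming video understanding and limited model robustness over long horizons.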

Comments Project page: https://ruixiangzhao.github.io/OmniPro

详情
AI中文摘要

多模态流视频理解,即从连续音频视频流中自主决定何时说话和说什么,是多模态大语言模型的一种新兴能力。现有基准测试在三个方面存在不足:它们主要依赖视觉信号,采用轮询或固定时间戳协议而不是真正的主动评估,并且只涵盖有限的任务范围,从而无法可靠地评估和区分多模态流模型。我们提出了OmniPro,这是第一个联合评估多模态感知、主动响应和多样化视频理解任务的基准测试。它包含2700个经人类验证的样本,涵盖9个子任务和3个认知层次,覆盖6种基本视频理解能力。值得注意的是,84%的样本需要音频信号(语音或非语音),并且每个样本都带有模态隔离标签,以实现细粒度的多模态分析。我们进一步引入了一种双模式评估协议:探测模式通过在每个真实触发前后查询模型来评估内容理解,而在线模式通过要求模型在流输入中自主决定何时响应来评估全面的主动能力。评估11个代表性模型后发现三个关键发现:(1)音频提供了一致的收益,但不同模型的利用情况差异很大;(2)性能随时间显著下降,表明长期鲁棒性有限;(3)非语音音频感知仍然是最弱的维度。

英文摘要

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.

2605.18553 2026-05-19 cs.CV cs.AI 版本更新

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

StableHand: 世界空间双臂运动估计中的质量感知流匹配

Huajian Zeng, Chaohua Yao, Yuantai Zhang, Jiaqi Yang, Rolandos Alexandros Potamias, Xingxing Zuo

Affiliations * Mohamed bin Zayed University of Artificial Intelligence; University of Illinois at Urbana-Champaign; Imperial College London

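AI summary This paper proposes StableHand, a quality-aware flow-matching framework for recovering the world-space 4D motion of two hands from egocentric video. It decomposes the quality of observations from a hand pose estimator into four channels and uses a learned quality network to predict the quality signals, making motion estimation more robust.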

Comments Project Page: https://huajian-zeng.github.io/projects/stablehand/

详情
AI中文摘要

从第一人称视频中恢复世界空间中两个交互手的4D运动是监督机器人策略学习的基本能力,其中手腕轨迹跟踪末端执行器,手指运动规格化抓取姿态。在此设置中存在两个主要挑战:由于头部运动,手经常长时间离开摄像机视野,且持续的手-物体相互作用导致一个或两个手的严重遮挡。现有方法统一地基于噪声手运动观测,而不考虑其每帧的可靠性,导致性能显著下降。我们的关键见解是,准确的世界空间手运动估计与每帧手部观测的质量紧密相关。为此,我们将从现成的手部姿态估计器中提取的手部运动观测的质量分解为四个通道:双臂的手腕全局平移和手指运动。我们提出StableHand,一种质量感知的流匹配框架,其条件于这些四个通道的质量信号,这些信号由学习的质量网络预测。我们通过每通道的前向调度、质量调整的速度目标、AdaLN调制的DiT去噪器以及质量感知ODE初始化,自然地将质量信号整合到流匹配过程中。这种统一的生成过程在保持高质量观测的同时,利用学习的双臂运动先验重构不可靠的观测。在HOT3D和ARCTIC两个具有长缺失手跨度和持续手-物体遮挡的第一人称基准上,实验表明,StableHand在所有报告的指标上均达到最先进的性能,与最强基线相比,将W-MPJPE减少20-25%,在严重遮挡的ARCTIC序列上最大收益最明显。

英文摘要

Recovering world space 4D motion of two interacting hands from egocentric video is a fundamental capability for supervising robot policy learning, where wrist trajectories track the end-effector and finger articulations specify the grasp pose. Two major challenges arise in this setting: hands frequently leave the camera view for extended periods due to head motion, and persistent hand-object interactions cause severe occlusions of one or both hands. Existing methods uniformly condition on noisy hand motion observations without accounting for their per-frame reliability, leading to substantial performance degradation. Our key insight is that accurate world space hand motion estimation is tightly coupled with the quality of per-frame hand observations. To this end, we decompose the quality of hand motion observations extracted from an off-the-shelf hand pose estimator into four channels: wrist global translation and finger articulations for both hands. We propose StableHand, a quality-aware flow-matching framework conditioned on these four-channel quality signals, which are predicted by a learned quality network. We naturally incorporate the quality signals into the flow-matching process through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization. This unified generative process preserves high-quality observations while reconstructing unreliable ones using a learned bimanual motion prior. Experiments on HOT3D and ARCTIC, two egocentric benchmarks featuring long missing-hand spans and persistent hand-object occlusions, show that StableHand achieves state-of-the-art performance across all reported metrics, reducing W-MPJPE by 20-25% compared to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.

2605.18541 2026-05-19 cs.CV 版本更新

LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift

LESSViT: 在光谱配置偏移下鲁棒的高光谱表示学习

Haozhe Si, Yuxuan Wan, Yuqing Wang, Minh Do, Han Zhao

Affiliations * Department of Electrical and Computer Engineering; Siebel School of Computing and Data Science

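AI summary This paper proposes LESSViT, a flexible architecture for cross-spectral generalization. Its low-rank efficient spatial-spectral ViT models hyperspectral imagery across different sensors, improving robustness and efficiency.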

详情
AI中文摘要

对不同传感器的高光谱图像(HSI)建模面临波长覆盖、波段采样和通道维度的变化带来的基本挑战。因此,基于固定光谱配置训练的模型往往无法泛化到其他传感器。现有的Vision Transformer(ViT)方法要么依赖于隐式光谱建模和固定通道假设,要么采用显式的空间-光谱注意力机制,但计算成本过高,导致效率与表达能力之间存在根本性的权衡。在本文中,我们引入了低秩高效空间-光谱ViT(LESSViT),一种用于跨光谱泛化的灵活架构。LESSViT基于LESS注意力,一种结构化的低秩因子分解,通过可分离的空间和光谱组件建模联合空间-光谱交互,将全空间-光谱注意力的复杂度从O(N²C²)降低到O(rNC),其中N是空间标记的数量,C是光谱通道的数量,r是低秩近似等级。我们进一步结合通道无关的补丁嵌入和波长感知的位置编码,以支持灵活的光谱输入。为了实现高效且稳健的预训练,我们引入了高光谱掩码自编码器(HyperMAE),具有解耦的空间-光谱掩码和分层通道采样。我们在跨光谱泛化设置下评估LESSViT,该设置模拟了跨传感器变化。在SpectralEarth基准测试中,实验表明LESSViT在光谱偏移下提高了鲁棒性,同时在分布内保持竞争力,显式且高效的空间-光谱建模对于可扩展和可泛化的高光谱表示学习至关重要。

英文摘要

Modeling hyperspectral imagery (HSI) across different sensors presents a fundamental challenge due to variations in wavelength coverage, band sampling, and channel dimensionality. As a result, models trained under a fixed spectral configuration often fail to generalize to other sensors. Existing Vision Transformer (ViT) approaches either rely on implicit spectral modeling with fixed channel assumptions or adopt explicit spatial-spectral attention with prohibitive computational cost, leading to a fundamental trade-off between efficiency and expressiveness. In this work, we introduce Low-rank Efficient Spatial-Spectral ViT (LESSViT), a sensor-flexible architecture for cross-spectral generalization. LESSViT is built on LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components, reducing the complexity of full spatial-spectral attention from $O(N^2 C^2)$ to $O(rNC)$, where $N$ is the number of spatial tokens, $C$ is the number of spectral channels, and $r$ is the rank of the low-rank approximation. We further incorporate channel-agnostic patch embedding and wavelength-aware positional encoding to support flexible spectral inputs. To enable efficient and robust pretraining, we introduce a hyperspectral masked autoencoder (HyperMAE) with decoupled spatial-spectral masking and hierarchical channel sampling. We evaluate LESSViT under a cross-spectral generalization setting that simulates cross-sensor variability. Experiments on the SpectralEarth benchmark demonstrate that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution, suggesting that explicit and efficient spatial-spectral modeling is essential for scalable and generalizable hyperspectral representation learning.
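The size of the complexity gap stated in the abstract can be checked with simple arithmetic. The sketch below evaluates the two leading terms, $(NC)^2$ and $rNC$, ignoring the constant factors that big-O notation hides. The example sizes (196 spatial tokens, 200 channels, rank 8) are assumptions picked for illustration, not settings reported for LESSViT.

```python
def full_attention_cost(n_tokens: int, n_channels: int) -> int:
    """Leading term of full spatial-spectral attention: (N * C) ** 2."""
    return (n_tokens * n_channels) ** 2


def less_attention_cost(n_tokens: int, n_channels: int, rank: int) -> int:
    """Leading term of the low-rank factorized form: r * N * C."""
    return rank * n_tokens * n_channels


if __name__ == "__main__":
    n, c, r = 196, 200, 8   # hypothetical sizes, for illustration only
    full = full_attention_cost(n, c)
    less = less_attention_cost(n, c, r)
    print(full, less, full // less)
```

The ratio of the two terms is $NC / r$, so the advantage grows linearly with both the number of spatial tokens and the number of spectral channels, which matters for sensors with hundreds of bands.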

2605.18522 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

超越形态学:量化颜色特征在癌症分类中的诊断能力

Farnaz Kheiri, Shahryar Rahnamayan, Masoud Makrehchi

发表机构 * Dept. of Electrical, Computer and Software Engineering(电气、计算机与软件工程系) Ontario Tech University(安大略技术大学) Dept. of Engineering(工程系) Brock University(布鲁克大学)

AI总结 本文研究了颜色特征在癌症分类中的诊断能力,通过排除形态学信息,评估了全局颜色特征的判别力,发现颜色特征在二分类任务中可达到高达89%的准确率,表明颜色分布包含非随机的诊断信号。

详情
AI中文摘要

在组织病理学中,人类专家主要依靠颜色增强对比度来解读组织形态,而机器视觉模型则将颜色视为原始统计信息。这一区别提出了一个根本性问题:像素强度本身,独立于结构和形态学线索,能支持多少癌症分类?为了解决这个问题,我们系统评估了全局颜色特征的独立判别力,同时刻意排除所有形态学信息。具体而言,我们提取了统计颜色矩,并对RGB和HSV颜色直方图进行离散化处理,然后在十个不同的实验设置中使用经典机器学习分类器评估其性能。我们的结果表明,在二元诊断任务(例如良性与恶性)中,仅颜色特征即可实现强劲的性能,分类准确率可达到89%。这种性能很可能归因于与恶性相关的全局色度变化。重要的是,这些简单的颜色基表示在很大程度上优于随机基线,表明原始颜色分布编码了非随机且具有诊断意义的信号用于癌症检测。因此,本研究表明,简单的、计算高效的色彩特征可以作为一种有效的预筛选工具。通过识别具有强色度指示恶性特征的样本,这些轻量模型可以作为第一道筛选系统,减少对复杂深度学习架构的计算负担。

英文摘要

In histopathology, human experts primarily rely on color as a means of enhancing contrast to interpret tissue morphology, whereas machine vision models process color as raw statistical information. This distinction raises a fundamental question: to what extent can pixel intensity alone, independent of structural and morphological cues, support cancer classification? To address this question, we systematically evaluated the standalone discriminative power of global color features while deliberately excluding all morphological information. Specifically, we extracted statistical color moments and discretized RGB and HSV color histograms, and assessed their performance across ten diverse experimental settings using classical machine learning classifiers. Our results demonstrate that color features alone can achieve strong performance in binary diagnostic tasks (e.g., benign versus malignant), with classification accuracies reaching up to 89%. This performance is likely attributable to global chromatic shifts associated with malignancy. Importantly, these simple color-based representations consistently outperformed random baselines by a substantial margin, indicating that raw color distributions encode a non-random and diagnostically relevant signal for cancer detection. Consequently, this study suggests that simple, computationally efficient color features can serve as an effective pre-screening tool. By identifying samples with strong chromatic indicators of malignancy, these lightweight models could function as a first-pass triage system, reducing the computational burden on complex deep learning architectures.
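A minimal sketch of the kind of global color descriptor the abstract describes: per-channel statistical moments plus a discretized histogram, with no spatial or morphological information. The bin count, the choice of three moments, and the tiny pixel list are assumptions for illustration, not the authors' settings.

```python
import math


def color_moments(pixels):
    """Mean, standard deviation, and skewness per channel for (r, g, b) tuples."""
    n = len(pixels)
    moments = []
    for ch in range(3):
        values = [p[ch] for p in pixels]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        # Signed cube root keeps the skewness in the same units as the channel.
        skew = math.copysign(abs(third) ** (1 / 3), third)
        moments.append((mean, math.sqrt(var), skew))
    return moments


def channel_histogram(pixels, channel, bins=8, max_value=256):
    """Normalized histogram of one channel using equal-width bins."""
    counts = [0] * bins
    width = max_value / bins
    for p in pixels:
        counts[min(int(p[channel] / width), bins - 1)] += 1
    return [c / len(pixels) for c in counts]


# Toy "image": four pixels, order is irrelevant because the features are global.
pixels = [(200, 50, 150), (220, 60, 160), (180, 40, 140), (200, 50, 150)]
features = [m for triple in color_moments(pixels) for m in triple]
features += channel_histogram(pixels, channel=0)
print(len(features))  # 9 moments + 8 histogram bins
```

Because shuffling the pixels leaves every feature unchanged, a classifier trained on such vectors cannot use tissue structure, which is the property the study relies on.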

2605.18491 2026-05-19 cs.CV 版本更新

Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

对SSL预训练在相同和不同模态分割任务中转移性的基准测试

Jue Jiang, Harini Veeraraghavan

发表机构 * Department of Medical Physics, Memorial Sloan Kettering Cancer Center(医学物理系,纪念斯隆凯特勒癌症中心)

AI总结 本文通过九种SSL方法在相同和不同模态的分割任务中进行基准测试,评估了预训练模型的迁移能力和效率,发现自蒸馏masked image transformer在分割精度、收敛速度和少量样本到大量样本的性能差距方面表现最佳。

Comments Paper submitted to Medical Physics for review

详情
AI中文摘要

方法:九种覆盖四种预训练任务家族的SSL方法使用相同的10,412个3D CT扫描(1.89 M个2D轴向切片)从头开始预训练,这些扫描涵盖不同的疾病部位。每个方法的预训练Swin Transformer编码器被整合到SwinUNETR风格的分割网络中(Swin编码器与3D CNN解码器和跳跃连接),并在九个复杂度各异的公开分割任务上进行微调,包括大型腹部器官、头颈结构以及CT和MRI中的肿瘤。性能通过Dice相似系数(DSC)评估。微调收敛速度、跨模态(CT到MRI)的迁移性以及少量样本和大量样本微调之间的特征重用模式进一步通过中心化核对齐分析。结果:自蒸馏masked image transformer(SMIT),结合masked image modeling(MIM)和局部和全局自蒸馏,在九个任务中实现了最高的分割精度、最快的微调收敛速度和最小的少量样本到大量样本性能差距,表明最强的数据效率。SMIT还显示了在少量样本和大量样本微调之间最一致的特征重用模式。基于MIM的SimMIM和自蒸馏方法(DINO、iBOT)优于依赖图像级全局表示的对比学习和旋转预测。SSL方法之间的差异在少量样本设置中最大,随着标记微调数据集大小的增加而缩小,表明在有限标注预算下SSL预训练的选择最为关键。

英文摘要

Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89 M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine tuning were further analyzed using centered kernel alignment. Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets.
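For readers unfamiliar with the evaluation metric, here is a small sketch of the Dice similarity coefficient on binary masks. Masks are flattened to 0/1 sequences for simplicity, and the empty-mask convention is an assumption; the paper's evaluation code may handle that case differently.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_coefficient(pred, target), 4))
```

Here two of the three predicted voxels overlap the three reference voxels, so DSC = 2·2 / (3 + 3).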

2605.18467 2026-05-19 cs.CV 版本更新

InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

InstructAV2AV:基于指令的音频视频联合编辑

Haojie Zheng, Yixin Yang, Siqi Yang, Shuchen Weng, Boxin Shi

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Peking University(北京大学)

AI总结 本文提出InstructAV2AV,首个端到端的指令引导音频视频联合编辑框架,通过构建大规模音频视频编辑数据集InsAVE-80K和改进的生成模型,实现了更高质量的音频视频联合编辑。

详情
AI中文摘要

最近的扩散基方法在视频内容操控方面取得了显著进展。然而,它们通常忽视伴随的音频,导致音频与编辑结果脱节。在本文中,我们提出了InstructAV2AV,首个端到端的指令引导音频视频联合编辑框架。我们首先开发了一个可扩展的数据合成管道,并构建了InsAVE-80K,首个大规模音频视频编辑数据集,包含高质量的源到目标配对。借助这一数据基础,我们适配了一个音频视频生成骨干网络,以利用其强大的先验知识。我们将音频视频输入与噪声潜在代码结合,以锚定源上下文,提出源指令门控注意力以提高指令遵循和内容保持,并引入两阶段训练策略以有效转移这些预训练的先验知识。广泛的实验表明,InstructAV2AV在两个评估集上,跨11个指标覆盖三个方面,均优于现有最先进方法,凸显了其在可控内容创作中的潜力。项目页面:https://hjzheng.net/projects/InstructAV2AV/.

英文摘要

Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive experiments demonstrate that InstructAV2AV outperforms state-of-the-art methods across 11 metrics spanning three aspects on two evaluation sets, highlighting its potential for controllable content creation. Project page: https://hjzheng.net/projects/InstructAV2AV/.

2605.18466 2026-05-19 cs.CV 版本更新

Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI

基于语音引导的多模态学习用于实时MRI中的声道分割

Daiqi Liu, Lukas Mulzer, Md Hasan, Nyvenn de Castro, Fangxu Xing, Xingjian Kang, Chengze Ye, Siyuan Mei, Yipeng Sun, Tomás Arias-Vergara, Jana Hutter, Jonghye Woo, Andreas Maier, Paula Andrea Pérez-Toro

发表机构 * Harvard Medical School / Massachusetts General Hospital(哈佛医学院/麻省总医院) Institute for Information Processing, Leibniz University Hannover(汉诺威莱布尼茨信息处理研究所) GITA Lab, Facultad de Ingeniería. Universidad de Antioquia UdeA(安提奥基亚大学工程学院GITA实验室)

AI总结 本文提出了一种三阶段框架,利用声学和音系学监督进行训练,仅需实时MRI图像进行推理,通过将音系学表示转换为空间边界框先验进行发音器官定位,通过双级跨模态对比预训练对视觉和音频编码器对齐,并通过跨注意力解码器融合学习的表示,有效将多模态知识转移到单模态推理管道中,实验表明该方法在75-Speaker Annot-16和USC-TIMIT数据集上优于现有单模态和多模态方法。

Comments under review

详情
AI中文摘要

在实时MRI(rtMRI)中对发音器官进行分割是一个具有低对比度、快速运动和有限空间分辨率的动态图像分割难题。然而,尽管rtMRI采集可能提供同步的声学信号,现有方法却丢弃了这一信息,而能结合音频的少数多模态方法在音频不可用时无法部署。我们提出了一种三阶段框架,在训练过程中利用声学和音系学监督,而在推理时仅需rtMRI图像:音系学表示被转换为空间边界框先验以用于发音器官定位,视觉和音频编码器通过双级跨模态对比预训练对齐,学习的表示通过跨注意力解码器融合,有效将多模态知识转移到单模态推理管道中。在75-Speaker Annot-16和USC-TIMIT数据集上的评估表明,我们的方法优于现有单模态和多模态方法,证明了多模态监督对精确且可临床部署的声道分割提供了可转移的益处。

英文摘要

Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. However, while rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a single-modality inference pipeline. Evaluated on 75-Speaker Annot-16 and USC-TIMIT datasets, our method outperforms existing unimodal and multimodal methods, demonstrating that multimodal supervision provides transferable benefits for precise and clinically deployable vocal tract segmentation.

2605.18451 2026-05-19 cs.CV cs.GR 版本更新

Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Code-as-Room: 通过代理代码合成从俯视图图像生成3D房间

Yixuan Yang, Zhen Luo, Wanshui Gan, Jinkun Hao, Junru Lu, Jinghao Yan, Zhaoyang Lyu, Xudong Xu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Innovation Institute(上海创新研究院) Southern University of Science and Technology(南方科技大学) University of Warwick(沃里克大学)

AI总结 本文提出Code-as-Room框架,通过结构化执行 harness 生成3D房间,利用Blender代码表示房间,并引入专门的代码基3D房间合成基准进行评估。

详情
AI中文摘要

设计逼真且功能性的3D室内房间对于广泛的应用至关重要,包括室内设计、虚拟现实、游戏和具身AI。尽管最近基于多模态大语言模型(MLLM)的方法在从文本描述或参考图像生成3D房间方面展现出巨大潜力,但基于文本的方法难以精确捕捉空间信息,而现有的图像条件化代理在从俯视图生成整体房间时面临不稳定性和无限循环的问题。为了解决这些限制,我们提出了Code-as-Room,一种基于MLLM的代理框架,配备了结构化执行harness,用Blender代码表示3D房间。给定一个俯视房间图像,该框架解析参考图像以提取场景元素及其空间关系,并在有原则的多阶段管道中合成用于几何、材料和照明的可执行Blender代码。在整个过程中维护一个跨阶段的记忆模块,以缓解现有基于代理框架固有的上下文遗忘问题。我们进一步引入了一个专门的代码基3D房间合成基准,涵盖各种评估协议。基于我们的基准,对现有基于代理的方法进行了全面比较,以验证我们提出的执行harness的有效性。

英文摘要

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms with Blender codes. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.

2605.18436 2026-05-19 cs.CV 版本更新

A Dataset for the Recognition of Historical and Handwritten Music Scores in Western Notation

一个用于识别西方记谱法历史和手写乐谱的数据集

Pau Torras, Jiří Mayer, Carles Badal, Martina Dvořáková, Markéta Herzanová Vlková, Gerard Asbert, Vojtěch Dvořák, Samuel Šomorjai, Jan Hajič, Alicia Fornés

发表机构 * Computer Vision Center(计算机视觉中心) Department of Computer Science(计算机科学系) Department of Arts and Musicology(艺术与音乐学系) Institute of Formal and Applied Linguistics(形式与应用语言学研究所) Moravian Library(摩拉维亚图书馆)

AI总结 本文提出了一个包含1,309页历史乐谱的数据集,用于训练光学音乐识别系统,该数据集提供了MusicXML转录和符号注释,是目前最大的手写音乐数据集,适用于训练和评估端到端和基于目标检测的OMR系统。

Comments Under review at Scientific Data

详情
AI中文摘要

大量的音乐遗产已由记忆机构(图书馆、博物馆和档案馆)数字化。然而,尽管深度学习的进步,光学音乐识别(OMR)领域在使音乐可机读方面仍然面临困难,主要是因为缺乏可用于真实条件训练的数据集。MusiCorpus数据集旨在通过提供1,309页的历史乐谱(主要是手写乐谱)以及MusicXML转录和符号注释来解决这一问题。它是目前最大的手写音乐数据集,也是首个包含来自记忆机构的真实且具有代表性的音乐文档集合的数据集,适用于训练和评估端到端和基于目标检测的OMR系统,并比较其性能。

英文摘要

A large amount of musical heritage has been digitised by memory institutions: libraries, museums, and archives. Nevertheless, the field of Optical Music Recognition (OMR) has struggled with making this music machine-readable, despite advances in deep learning, mostly because no datasets for training systems in realistic conditions were available. The MusiCorpus dataset aims to remedy this situation by providing 1,309 pages of historical sheet music, primarily handwritten, with MusicXML transcriptions and symbol annotations. It is the largest dataset of handwritten music to date and the first dataset containing a realistic and representative sample of musical document collections from memory institutions, suitable for training and evaluating both end-to-end and object detection-based OMR systems and comparing their performance.

2605.18434 2026-05-19 cs.IR cs.CV 版本更新

TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

TIGER-FG: 基于文本引导的隐式细粒度 grounding 用于电商检索

Xinyu Sun, Huangyu Dai, Lingtao Mao, Zexin Zheng, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

发表机构 * Kuaishou Technology(快手科技)

AI总结 本文提出TIGER-FG框架,通过文本引导生成目标聚焦的物品表示,解决电商检索中模态和粒度不匹配的问题,并在两个基准测试中提升了检索精度。

详情
AI中文摘要

电商图像搜索通常以裁剪图像作为查询,而每个候选项由完整物品图像和结构化文本表示。这种图像到多模态检索的设置存在两个不对称性:模态不匹配——视觉查询必须匹配图像-文本项,以及粒度不匹配——裁剪查询必须与包含背景上下文和可能干扰项的完整图像进行比较。基于检测的流水线通过显式定位来处理粒度不匹配,但会带来额外成本和误差传播,而CLIP风格的编码器避免检测,但容易受到背景或无关项的影响。为了解决这些限制,我们提出了TIGER-FG,一种用于图像到多模态电商检索的文本引导隐式细粒度grounding框架。TIGER-FG利用物品文本作为语义引导,在无需目标检测的情况下生成聚焦目标的物品表示用于检索。我们进一步引入了双蒸馏目标,以保持目标区域的空间一致性和查询-项相似性结构,从而产生更稳定和判别性的多模态表示。此外,我们构建了ECom-RF-IMMR,一个包含1000万对训练集和两个评估基准的现实基准套件,涵盖标准和杂乱的物品布局。TIGER-FG在两个评估基准上分别将Recall@1比最强基线提高了6.1和34.4个百分点,仅使用85.7M查询侧参数和256维嵌入。在公共电商基准上的结果进一步证明了其在嘈杂和一对多检索场景中的泛化能力。代码和数据将被发布。

英文摘要

E-commerce image search often takes a cropped image as the query, while each candidate is represented by full item images and structured text. This image-to-multimodal retrieval setting presents two asymmetries: a modality disparity -- a visual query must match image--text items, and a granularity disparity -- a cropped query must be compared with full images containing background context and possible distractors. Detection-based pipelines handle the granularity disparity through explicit localization but incur extra cost and error propagation, whereas CLIP-style encoders avoid detection, but are vulnerable to backgrounds or irrelevant items. To address these limitations, we propose TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. TIGER-FG uses item text as semantic guidance to produce target-focused item representations without object detection for retrieval. We further introduce dual distillation objectives that preserve target-region spatial consistency and query--item similarity structure, yielding more stable and discriminative multimodal representations. In addition, we construct ECom-RF-IMMR, a realistic benchmark suite with a 10M-pair training set and two evaluation benchmarks covering standard and cluttered item layouts. TIGER-FG improves Recall@1 over the strongest baseline by 6.1 and 34.4 percentage points on the two evaluation benchmarks, respectively, with only 85.7M query-side parameters and 256-dim embeddings. Results on public e-commerce benchmarks further demonstrate its generalization to noisy and one-to-many retrieval scenarios. Code and data will be released.
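The headline numbers are differences in Recall@1, so a brief sketch of how that metric is typically computed from a query-by-candidate similarity matrix may help. The scores and ground-truth indices below are made up for illustration, and the single-positive-per-query setup is an assumption.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose ground-truth item is among the top-k candidates.

    similarity[i][j] is the score of candidate j for query i;
    ground_truth[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, gt in zip(similarity, ground_truth):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)


# Toy example: 3 cropped-image queries scored against 3 candidate items.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
ground_truth = [0, 1, 1]
print(recall_at_k(similarity, ground_truth, k=1))
print(recall_at_k(similarity, ground_truth, k=2))
```

A gain reported in percentage points is an absolute difference between two such fractions expressed as percentages, not a relative improvement.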

2605.18419 2026-05-19 cs.CV cs.AI 版本更新

Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology

几何感知的不确定性核心集用于病理学中鲁棒的视觉上下文学习

Franciskus Xaverius Erick, Johanna Paula Müller, Bernhard Kainz

发表机构 * FAU Erlangen-Nürnberg, Erlangen, DE(埃尔兰根-纽伦堡大学) Department of Computing, Imperial College London, London, UK(伦敦帝国理工学院计算机系)

AI总结 本文提出GAUC,一种无需训练的核心集选择方法,直接在预训练的多模态嵌入空间中操作,通过优化三个目标提升视觉上下文学习的鲁棒性、准确性和校准性。

详情
AI中文摘要

视觉-语言模型(VLMs)能够将视觉感知与开放性临床推理结合,使其在计算病理学中具有吸引力。然而,对稀缺的专家标注病理数据进行数十亿参数的微调是不可行的,而上下文学习(ICL)在没有参数更新的情况下将VLM条件于演示图像-文本对,但容易受到所选示例和查询措辞的影响,导致诊断不可靠。现有选择策略依赖于查询依赖的最近邻检索,忽略了全局数据结构,需要昂贵的参数更新,或忽视了VLMs的联合视觉-文本嵌入几何。我们提出GAUC,一种无需训练的核心集选择方法,直接在预训练的多模态嵌入空间中操作。GAUC联合优化三个目标:(1)最大均值差异项,强制核心集与完整数据集之间的分布一致性;(2)有效互信息差异正则化器,通过利用VLMs的联合视觉-文本对齐来限制在提示改写下的性能下降;(3)预测方差惩罚,抑制过于自信且不稳定的输出。在CRC-100K和MHIST多个开源VLM架构上,GAUC在准确率、校准性和提示鲁棒性上均优于最近的ICL选择方法和数据集蒸馏基线,且无需单次梯度更新。

英文摘要

Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM's joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.
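The first GAUC objective is a Maximum Mean Discrepancy term between the coreset and the full dataset. The sketch below computes a biased squared-MMD estimate with an RBF kernel on toy 2D embeddings; the kernel choice, bandwidth, and data are assumptions, since the abstract does not specify them.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of embeddings."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy embeddings: the full set covers four corners of a square.
full = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
spread_coreset = [(0.0, 0.0), (1.0, 1.0)]      # covers opposite corners
collapsed_coreset = [(0.0, 0.0), (0.0, 0.0)]   # repeats a single point

print(mmd_squared(full, spread_coreset))
print(mmd_squared(full, collapsed_coreset))
```

The spread coreset yields the smaller discrepancy, which is the behaviour a distributional-fidelity term is meant to reward.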

2605.18408 2026-05-19 cs.CV 版本更新

Historical Knowledge Graphs for Global Maritime Estimated Time of Arrival

全球海运估计到达时间的历史知识图谱

Neofytos Dimitriou

发表机构 * Maritime Digitalization Centre(海运数字化中心) Cyprus Marine and Maritime Institute(塞浦路斯海洋与海运研究院)

AI总结 本文提出利用AIS数据构建全球海运历史知识图谱的方法,通过高斯混合模型预处理提取轨迹,利用速度分布构建图谱,实现高效的航行时间预测,为港口运营和减排提供支持。

详情
AI中文摘要

准确的船舶预计到达时间预报对港口运营和脱碳至关重要,但缺乏成本高昂的上下文数据,全球范围内的航行时间预测仍极具挑战性。本文提出一种方法,仅使用自动识别系统(AIS)数据构建历史海运知识图谱。首先,通过基于高斯混合模型的预处理流程从噪声AIS数据中提取分段轨迹。然后通过迭代处理轨迹,按船舶类型、航行时间和方向存储速度分布,生成包含5,433个geohash-3节点和12,334条边的全球图谱。该图谱可通过分层、优先级系统查询任意两个位置之间的航行时间预测,该系统利用历史统计数据并有原则的回退机制。在时间上保留的测试集上,中位RMSE为22.75分钟(分段级)和30.90分钟(轨迹级),其中69.1%的轨迹在实际到达时间的20%以内。在第二个外部测试集上,中位RMSE为27.36分钟(分段级)和37.46分钟(轨迹级),其中62.1%的轨迹在20%以内。这些结果证实了我们方法的潜力,能够实现全球航行时间预测,并为及时到达规划和减排提供坚实基础。

英文摘要

Accurate vessel estimated-time-of-arrival forecasts are critical for port operations and decarbonization, yet global-scale travel-time prediction remains difficult without costly contextual data. Herein, I present a methodology for constructing a historical maritime knowledge graph using only Automatic Identification System (AIS) data. First, segmented trajectories are extracted from noisy AIS data using a Gaussian-mixture-model-based preprocessing pipeline. The graph is then constructed by iteratively processing the trajectories and storing speed distributions stratified by vessel type, time of travel, and direction of travel; the resulting global graph comprises 5,433 geohash-3 nodes and 12,334 edges. The graph can be queried to retrieve travel-time predictions between any two locations via a hierarchical, priority-based system that uses historical statistics with principled fallback. On a temporally held-out test set, median RMSE is 22.75 min (segment-level) and 30.90 min (trajectory-level), with 69.1% of trajectories within 20% of actual arrival time. On a second external test set, median RMSE is 27.36 min (segment-level) and 37.46 min (trajectory-level), with 62.1% of trajectories within 20%. These results corroborate the promise of our method, enabling global travel-time prediction and providing a strong foundation for just-in-time arrival planning and emissions reduction.
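The "hierarchical, priority-based system with principled fallback" can be pictured as a lookup that tries the most specific historical statistic first and relaxes the key when no data exist. The sketch below is purely illustrative: the key layout, the fallback order, the cell identifiers, and the speeds are all hypothetical, and the time-of-travel stratum mentioned in the abstract is omitted for brevity.

```python
def lookup_speed(stats, edge, vessel_type, direction):
    """Return (speed_knots, level) from the most specific key that has data."""
    candidates = [
        ((edge, vessel_type, direction), "type+direction"),
        ((edge, vessel_type, None), "type"),
        ((edge, None, None), "edge"),
    ]
    for key, level in candidates:
        if key in stats:
            return stats[key], level
    raise KeyError(f"no historical statistics for edge {edge!r}")


def travel_time_hours(distance_nm, speed_knots):
    """Travel time for a segment of distance_nm nautical miles."""
    return distance_nm / speed_knots


# Hypothetical statistics for one graph edge between two made-up cell ids.
edge = ("cellA", "cellB")
stats = {
    (edge, "cargo", "eastbound"): 12.0,  # median speed in knots
    (edge, None, None): 10.0,            # edge-level fallback
}

speed, level = lookup_speed(stats, edge, "tanker", "eastbound")
print(level, travel_time_hours(120.0, speed))
```

A tanker has no dedicated entry on this edge, so the query falls back to the edge-level statistic rather than failing.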

2605.18398 2026-05-19 cs.CG cs.CV 版本更新

Generalize cross-ratios in n-dimensional Plane-Based Geometric Algebra

在n维平面基础几何代数中推广交叉比

Enzo Harquin, Stephane Breuils, Pascal Monasse, Venceslas Biri, Vincent Nozick

AI总结 本文研究了n维平面基础几何代数中投影交叉比的完整理论,建立了各类几何对象的显式交叉比公式,并证明其恢复了相应的经典不变量,同时识别了标准的成对测量算子。

详情
AI中文摘要

我们发展了n维平面基础几何代数(PGA,R(n,0,1))中射影交叉比的完整理论,涵盖了所有等级的几何对象:有限点和理想点、超平面以及中间扁面。对于每种对象类型和配置,我们建立了显式的交叉比公式,证明其恢复了适当的经典不变量,并确定了标准的成对测量算子。系统性的对偶分析进一步揭示了所有八种配置在Hodge对偶下组织成四对对偶配置,并且所有测量算子仅取决于几何配置而非对象等级,要么归结为交换子,要么归结为交换子对偶。在每种情况下,公式恢复了适当的经典不变量:平行配置的有符号距离比和割线配置的正弦交叉比。这些结果确立了交叉比作为PGA中与等级无关的射影不变量,并为从给定不变量直接定义n维单应变换提供了构造性基础。

英文摘要

We develop a complete theory of projective cross-ratios in n-dimensional Plane-Based Geometric Algebra (PGA), R(n,0,1), covering geometric objects of every grade: finite and ideal points, hyperplanes, and intermediate flats. For each object type and configuration, we establish an explicit cross-ratio formula, prove that it recovers the appropriate classical invariant, and identify the canonical pairwise measurement operator. A systematic duality analysis further reveals that all eight configurations organize into four dual pairs under the Hodge dual, and that all measurement operators reduce to either the commutator or the commutator dual, depending solely on the geometric configuration rather than on object grade. In each case the formula recovers the appropriate classical invariant: signed distance ratios for parallel configurations and sine cross-ratios for secant ones. These results establish the cross-ratio as a grade-agnostic projective invariant within PGA, and provide a constructive foundation for defining n-dimensional homographies directly from prescribed invariants.
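As background for the classical invariant the paper generalizes, the sketch below checks numerically that the cross-ratio of four collinear points is preserved by a projective map of the line. It uses plain 1D coordinates and exact rational arithmetic rather than PGA, and the ordering convention (a, b; c, d) is one of several in use; both are choices made here for illustration.

```python
from fractions import Fraction


def cross_ratio(a, b, c, d):
    """Classical cross-ratio (a, b; c, d) of four distinct collinear points."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))


def mobius(x, p, q, r, s):
    """Projective map of the line: x -> (p*x + q) / (r*x + s), with p*s - q*r != 0."""
    return (p * x + q) / (r * x + s)


points = [Fraction(0), Fraction(1), Fraction(3), Fraction(7)]
mapped = [mobius(x, 2, 1, 1, 3) for x in points]

print(cross_ratio(*points))
print(cross_ratio(*mapped))
```

Both calls return the same rational number even though the individual distances between the points change under the map.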

2605.18390 2026-05-19 cs.CV 版本更新

Vision Foundation Models as Generalist Tokenizers for Image Generation

视觉基础模型作为图像生成的通用标记器

Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

发表机构 * University of Hong Kong(香港大学) StepFun

AI总结 本文提出了一种基于冻结视觉基础模型(VFM)的通用图像标记器VFMTok,通过区域自适应量化框架和语义重建目标,提升了图像生成的质量和效率,同时在离散和连续潜在空间中实现了高保真度的类别条件合成。

Comments 4 figures and 14 tables

详情
AI中文摘要

在本文中,我们探索了一个尚未被充分研究的方向:直接在冻结的视觉基础模型(VFM)之上构建通用图像标记器。为了构建此标记器,我们利用冻结的VFM作为编码器,并引入两个关键创新:(1)区域自适应量化框架,用于消除标准2D网格特征中的空间冗余;(2)语义重建目标,使解码输出与VFM的表示对齐,以保持语义保真度。基于这些设计,我们提出了VFMTok,一种能够无缝在离散和连续潜在空间中运行的通用视觉标记器。VFMTok在合成质量上取得了显著提升,同时大幅提高了标记效率。对于离散自回归(AR)生成,它将模型收敛速度提高3倍,并在ImageNet类别条件合成上实现了最先进的gFID值1.36。同样,对于连续空间生成,将VFMTok与去噪模型结合,可获得极佳的gFID值1.25。此外,由于潜在空间本身捕捉了丰富的空间语义,VFMTok能够在两种生成范式中无需无分类器引导(w/o CFG)即可实现高保真度的类别条件合成,显著加快了推理速度。除了这些显著的实证结果外,我们还系统地研究了我们方法的底层机制。我们发现,在VFM预训练过程中使用的特定自监督学习目标决定了其作为标记器的有效性。具体来说,一个联合优化全局对比学习和潜在掩码图像建模的VFM提供了最佳的图像标记表示。这些见解为未来图像标记器的设计奠定了坚实的基础,并提供了有价值的指导。

英文摘要

In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID of 1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of 1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.

2605.18365 2026-05-19 cs.CV 版本更新

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

GeoFlow: 在视频生成中强制隐式几何一致性

Jan Ackermann, Shengqu Cai, Boyang Deng, Zhengfei Kuang, Songyou Peng, Gordon Wetzstein

发表机构 * Stanford University(斯坦福大学) Google DeepMind(谷歌DeepMind)

AI总结 本文提出GeoFlow,一种通过强化学习微调来增强视频生成中几何一致性的方法,通过引入几何一致性奖励,有效减少时间上的几何伪影,同时保持感知质量。

Comments Project Page: https://geometryflow.github.io/

详情
AI中文摘要

生成几何上一致的视频仍然是一个开放性挑战:基于网络级数据训练的文本到视频扩散模型仅隐式处理几何,导致在相机运动下出现物体形变、纹理漂移和非刚性背景。现有解决方案要么作为副产品改进一致性,要么仅适用于静态场景或完全重新对齐模型的潜在空间。我们引入了一个几何一致性奖励,直接衡量生成视频中的运动是否与一致的场景兼容。我们的关键见解是,在物理一致的视频中,背景运动应能由刚性相机诱导的流解释,而独立移动的物体应沿运动轨迹保持外观身份。我们使用光流、深度-姿态预测和基于特征的对应关系来分离刚性和动态区域并评估它们各自的一致性。将此奖励与强化学习微调结合,将几何一致性从一种涌现属性转化为视频生成器的显式优化目标。该方法对模型不敏感,适用于包含相机和物体运动的多样化动态场景。实验显示,在强基线模型上显著减少了时间上的几何伪影,同时保持感知质量。代码和模型权重已发布。

英文摘要

Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth--pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.

2605.18349 2026-05-19 cs.CV cs.AI 版本更新

Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

通过参数自由注意力机制优化CSRNet以实现公共交通中的人群计数

Aida Rostamza, Enrico Del Re, Joshua Cherian Varughese, Cristina Olaverri-Monreal

发表机构 * Johannes Kepler University Linz(林茨约翰内斯·开普勒大学) Department Intelligent Transport Systems(智能交通系统部门)

AI总结 本文研究了参数自由注意力机制在密集场景中的人群计数和密度图估计中的有效性,提出了一种结合PFCA和SA的新型注意力机制PFCASA,并在ShanghaiTech数据集上验证了其在公共交通视频流中的性能。

详情
AI中文摘要

占用估计和人群计数是设计智能高效公共交通车辆的关键任务。鉴于公共交通载客量可能从稀疏到拥挤变化,传统的占用估计模型必须适应这一目的。注意力机制在增强深度神经网络在拥挤场景中的人群计数能力方面表现出显著优势,尤其是在存在遮挡、复杂背景和透视畸变的情况下。然而,传统方法通常作为卷积层中的参数化子网络实现,不可避免地增加了模型大小和计算成本,限制了在资源受限的边缘设备上的部署。本文研究了最先进的参数自由注意力机制在高度拥挤场景中的人群计数和密度图估计中的有效性。我们评估了通道级(PFCA)、空间级(SA)和三维级(SimAM)模块,并将其性能与参数化注意力模块进行比较,后者限制引入不超过1%的额外参数。此外,我们提出了一种新的注意力机制组合,结合PFCA和SA(PFCASA)以分析公共交通系统内的视频流。使用CSRNet作为骨干网络,在ShanghaiTech数据集上的实验表明,参数自由注意力机制在不引入额外模型参数的情况下实现了可比或更优的准确性。详细的性能分析进一步揭示,PFCASA在少于40人的场景中优于其他注意力模块,而PFCA在人群密度增加时表现出更大的有效性,凸显了其在智能公共交通模式中的应用潜力。

英文摘要

Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.
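To show what "parameter-free" means in practice, the sketch below applies a SimAM-style gate to a single feature-map channel. It follows the formulation commonly found in public SimAM implementations (an assumption, since the abstract gives no formula): each activation is weighted by a sigmoid of its inverse energy, computed only from the channel's own mean and variance. The regularizer value and the toy channel are illustrative.

```python
import math


def simam_channel(channel, lam=1e-4):
    """Apply SimAM-style attention to one channel given as a list of rows.

    No learnable weights are involved; lam is a fixed regularizer.
    """
    values = [v for row in channel for v in row]
    n = len(values) - 1
    mean = sum(values) / len(values)
    sq_dev = [(v - mean) ** 2 for v in values]
    variance = sum(sq_dev) / n
    gated = []
    for v, d in zip(values, sq_dev):
        inv_energy = d / (4.0 * (variance + lam)) + 0.5
        gate = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid
        gated.append(v * gate)
    width = len(channel[0])
    return [gated[r * width:(r + 1) * width] for r in range(len(channel))]


# Toy 2x2 channel with one activation that stands out from its neighbours.
channel = [[1.0, 1.0], [1.0, 5.0]]
out = simam_channel(channel)
print(out)
```

The distinctive activation receives a larger gate than the other three, so it is suppressed less; nothing in the function would need to be stored in a model checkpoint.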

2605.18346 2026-05-19 cs.CV cs.AI 版本更新

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

聚焦强制:面向内容的每帧KV选择用于高效的自回归视频扩散

Peiliang Cai, Evelyn Zhang, Jiacheng Liu, Hao Lin, Ruiqi Zhang, Weile Mo, Yue Ma, Shikang Zheng, Jiehang Huang, Dongrui Liu, Linfeng Zhang

发表机构 * SJTU(上海交通大学) SDU(山东大学) HUST(华中科技大学) UTokyo(东京大学) HKUST(香港科技大学) SCUT(华南理工大学) Shanghai AI Lab(上海人工智能实验室)

AI总结 本文提出了一种无需训练的KV选择方法,通过结合注意力分数和历史帧的多样性分数,保留最相关和有区别的历史帧,从而在不牺牲质量的情况下提高自回归视频扩散的效率。

详情
AI中文摘要

近期在自回归视频扩散领域的进展使得序列和流式视频生成成为可能。然而,长视界生成需要越来越大的KV缓存,这使得在不牺牲质量的情况下实现高效的压缩具有挑战性。现有方法大多基于注意力分数选择历史帧,但它们的上下文决策仍然粗略。当同一块中生成多个帧时,这些方法通常对整个块应用共享的历史选择,仅通过注意力对历史帧评分,并将头预算均匀或通过注意力模式启发式分配,而不是显式估计头重要性。我们发现同一生成块中的帧可能依赖于不同的历史帧,同一历史帧在与当前帧的相对时间距离变化时可能获得不同的注意力分数,且屏蔽不同头会引发不均等的生成退化。受这些发现的启发,我们提出了Focused Forcing,一种无需训练的KV选择方法,该方法在生成帧和头维度上聚焦缓存历史。对于每个生成帧,Focused Forcing通过结合注意力分数和历史帧的多样性分数保留最相关和有区别的历史帧,同时将较大的预算分配给估计重要性更高的头。在多个自回归生成范式中,Focused Forcing在不训练的情况下实现了高达1.48倍的端到端加速,同时提高了视觉质量和文本对齐。

英文摘要

Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose Focused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to $1.48\times$ end-to-end acceleration without training, while improving visual quality and text alignment. Our code will be released on GitHub.
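One way to picture "combining attention scores with diversity scores" is a greedy selection that rewards frames with high attention but penalizes frames that resemble those already kept. The sketch below is a generic relevance-plus-diversity heuristic written for illustration; the scoring rule, the cosine-similarity redundancy measure, and the weight are assumptions and should not be read as the paper's actual method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_history(attention, features, budget, diversity_weight=0.5):
    """Greedily keep `budget` historical frames for one generated frame."""
    selected = []
    remaining = list(range(len(attention)))
    while remaining and len(selected) < budget:
        def score(i):
            # Redundancy: similarity to the closest frame already kept.
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected), default=0.0
            )
            return attention[i] + diversity_weight * (1.0 - redundancy)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Frames 0 and 1 are near-duplicates with high attention; frame 2 is distinct.
attention = [0.9, 0.85, 0.3]
features = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(select_history(attention, features, budget=2, diversity_weight=1.0))
print(select_history(attention, features, budget=2, diversity_weight=0.0))
```

With the diversity term active the duplicate frame is skipped in favour of the distinct one; with the weight set to zero the selection degenerates to pure top-k attention.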

2605.18334 2026-05-19 cs.CV cs.GR 版本更新

3D Skew Gaussian Splatting with Any Camera Trajectory Visualization Engine

具有任意相机轨迹可视化引擎的3D斜高斯散射

Beizhen Zhao, Yifan Zhou, Gaochao Song, Ziran Yin, Hao Wang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Zhejiang University(浙江大学) The University of Hong Kong(香港大学)

AI总结 本文提出3D斜高斯散射(3DSGS),通过引入斜高斯分布来提升3D高斯散射的结构保真度和紧凑性,以解决对称高斯分布在捕捉形状和颜色不连续性方面的不足,从而提高可视化效果和空间数据探索的准确性。

Comments 16 pages

详情
AI中文摘要

尽管3D高斯散射(3DGS)已经革新了实时逼真视角合成,但其对称高斯分布的根本依赖引入了视觉伪影,阻碍了准确的空间数据探索。具体而言,对称内核难以捕捉形状和颜色不连续性,导致模糊和基本元素冗余,这在视觉分析中会误导人类感知。为了解决这些可视化障碍,我们引入了3D斜高斯散射(3DSGS),一种新的框架,显著增强了显式场景表示的结构保真度和紧凑性。我们的关键见解在于将标准基本元素扩展为一般斜高斯对应物。这种通用基本元素继承了标准高斯的高效光栅化特性,同时获得了内在的非对称建模能力。我们将其与增强的不透明度表示相结合,以更好地处理复杂的透明度,同时结合一种深度感知的密集化策略,智能管理基本元素的分配。此外,为了使这些进步能够应用于实际的视觉分析,我们重新推导了CUDA光栅化管线,使其普遍支持对称和斜高斯,将其整合到一个解耦的自由相机交互可视化引擎中。广泛的实验表明,3DSGS在复杂细节区域实现了更优的渲染质量和结构紧凑性,同时保持了实时帧率,以支持流畅的交互探索。补充推导和视觉结果可在https://3d-skew-gs.github.io/上获得。

英文摘要

While 3D Gaussian Splatting (3DGS) has revolutionized real-time photorealistic view synthesis, its fundamental reliance on symmetric Gaussian distributions introduces visual artifacts that hinder accurate spatial data exploration. Specifically, symmetric kernels struggle to capture shape and color discontinuities, causing blurriness and primitive redundancy that mislead human perception during visual analysis. To address these visualization barriers, we introduce 3D Skew Gaussian Splatting (3DSGS), a novel framework that significantly enhances the structural fidelity and compactness of explicit scene representations. Our key insight lies in extending the standard primitive to a general Skew Gaussian counterpart. This generalized primitive inherits the highly efficient rasterization properties of standard Gaussians while gaining intrinsic asymmetric modeling capabilities. We couple this with an enhanced opacity representation to better handle complex transparency, alongside a depth-aware densification strategy that intelligently manages primitive allocation. Furthermore, to make these advancements actionable for real-world visual analytics, we re-derive the CUDA rasterization pipeline to universally support both symmetric and skew Gaussians, integrating it into a decoupled, free-camera interactive visualization engine. Extensive experiments demonstrate that 3DSGS achieves superior rendering quality and structural compactness, particularly in regions with intricate details, while maintaining the real-time frame rates necessary for fluid interactive exploration. Supplementary derivations and visual results are available at https://3d-skew-gs.github.io/.
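
For readers unfamiliar with skewed Gaussians, the textbook univariate skew-normal density gives a feel for how a single extra parameter introduces asymmetry. This is shown only as background: the abstract does not state the exact 3D parameterization used by 3DSGS, and the symbols $\xi$ (location), $\omega$ (scale), and $\alpha$ (skewness) are notation chosen for this note.

```latex
% Univariate skew-normal density; setting \alpha = 0 recovers the symmetric Gaussian.
f(x) = \frac{2}{\omega}\,
       \phi\!\left(\frac{x-\xi}{\omega}\right)
       \Phi\!\left(\alpha\,\frac{x-\xi}{\omega}\right),
\qquad
\phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^{2}/2},
\qquad
\Phi(z) = \int_{-\infty}^{z}\phi(t)\,dt .
```

Because the symmetric case is contained at $\alpha = 0$, a rasterizer written for the skewed form can in principle handle both kinds of primitive, which is consistent with the abstract's claim of supporting symmetric and skew Gaussians in one pipeline.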

2605.18328 2026-05-19 cs.CV 版本更新

CineMatte: Background Matting for Virtual Production and Beyond

CineMatte:面向虚拟制作及其他场景的背景抠图

Yuanjian He, Chen Zhang, Fasheng Chen, Jiangbo Cao

发表机构 * Online Video Business Unit, Tencent PCG Shenzhen, China(腾讯PCG深圳在线视频事业部)

AI总结 本文提出CineMatte,一种用于虚拟制作及其他场景的鲁棒背景抠图框架。该方法采用交叉注意力条件设计,通过共享权重的冻结DINOv3 Vision Transformer分别编码输入帧和捕获的背景,并利用交叉注意力模块预测前景,从而保留预训练语义并提高对背景偏移的鲁棒性。此外,还引入了CineMatte-4K数据集,包含4K HDR图像和视频;据作者所知,其图像子集是首个非合成的虚拟制作抠图数据集。

详情
AI中文摘要

LED虚拟制作(VP)利用大型LED屏幕墙(LED volume)实时渲染背景,使镜头内视觉效果成为可能,但使拍摄后的修改变得费力。我们通过CineMatte,一种用于VP及其他场景的鲁棒背景抠图框架来解决这一问题。CineMatte采用交叉注意力条件设计。不同于将背景与输入拼接,CineMatte采用一个权重共享的孪生式冻结DINOv3 Vision Transformer,分别对输入帧和捕获的背景进行编码。交叉注意力模块比较两个流以预测前景,保留预训练语义并提高对背景偏移的鲁棒性。先前基于ViT的抠图模型使用并行卷积“细节分支”来恢复细节,这在真实样本中可能由于与主干的语义不对齐而导致边界伪影。我们改用预训练的图像引导特征上采样器,这在很大程度上缓解了该问题。我们还引入了CineMatte-4K,一个在专业LED VP舞台上拍摄的4K HDR图像-视频数据集。据我们所知,其图像子集是首个VP抠图数据集,且为非合成数据,通过绿幕插入获得;视频子集包含带有跟踪轨迹的相机运动,以便后续能够以正确的视差渲染任意背景。在CineMatte-4K和公共基准(VideoMatte240K、YouTubeMatte)上,CineMatte不仅在VP中表现出色,而且能稳健地泛化到真实世界的素材。

英文摘要

LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, it uses a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional "detail branch" to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We replace this branch with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.

2605.18303 2026-05-19 cs.LG cs.AI cs.CV cs.RO 版本更新

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

PH-Dreamer: 通过端口-哈密顿生成动力学构建一个物理驱动的世界模型

Xueyu Luan, Chenwei Shi

AI总结 本文提出了一种基于端口-哈密顿框架的物理驱动世界模型PH-Dreamer,通过三个协同机制改进了基于递归状态空间架构的世界模型,实现了更紧凑且物理结构化的表示,同时提高了内部模拟器的保真度,并减少了潜在相空间体积、能量消耗和平均加速度平方。

Comments 12 pages, 3 figures

详情
AI中文摘要

基于递归状态空间架构构建的世界模型能够实现高效的潜在想象,但仍然缺乏物理结构,导致动力学违反守恒和耗散原理。我们引入了一个统一的端口-哈密顿框架,通过三种协同机制来解决这一问题。首先,我们将隐含的物理先验嵌入到递归转换中,通过将投影的潜在演变建模为受流动和耗散控制的能量路由,使投影的PH相空间偏向于更紧凑且物理结构化的表示。其次,我们开发了一个具有运动学意识的能量世界模型,该模型从本体感觉观察估计哈密顿量和功率平衡,提供了一个明确的物理信号用于热力学推理。第三,利用这些能量梯度,我们建立了基于能量的Actor-Critic,利用拉格朗日乘数来正则化策略优化,使其朝着更低的能量和更平滑的控制方向发展。在视觉控制基准测试中,该范式不仅实现了更优的渐近回报,还通过在想象奖励和真实奖励之间建立更紧密且方差更低的对齐关系,提高了内部模拟器的保真度,同时将潜在相空间体积减少了4.18-8.41%,能量消耗降低了高达7.80%,平均加速度平方降低了高达9.38%。

英文摘要

World models built on recurrent state-space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port-Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action-controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics-aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy-guided Actor-Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower-variance alignment between imagined and real rewards, all while reducing latent phase-space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.
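
As background for the terms "flow", "dissipation", and "power balance" used in this abstract, the standard input-state-output port-Hamiltonian form is reproduced below. This is the generic textbook formulation, not necessarily the exact latent parameterization used in PH-Dreamer; the symbols $J$, $R$, $G$, $u$, and $y$ follow common convention and are assumptions of this note.

```latex
\begin{align}
\dot{x} &= \bigl(J(x) - R(x)\bigr)\,\nabla H(x) + G(x)\,u,
&
y &= G(x)^{\top}\nabla H(x), \\
J(x) &= -J(x)^{\top},
&
R(x) &= R(x)^{\top} \succeq 0, \\
\frac{dH}{dt} &= -\nabla H(x)^{\top} R(x)\,\nabla H(x) + y^{\top}u
\;\le\; y^{\top}u .
\end{align}
```

The skew-symmetric $J$ routes energy without creating or destroying it, the positive semidefinite $R$ can only dissipate it, and the last line is the power balance: stored energy grows no faster than the power $y^{\top}u$ supplied through the port (here, plausibly the action).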

2605.18288 2026-05-19 cs.CV 版本更新

Collision-Resistant Single-Pass Method for Unsupervised Fine-Grained Image Hashing

抗碰撞的单次传递方法用于无监督细粒度图像哈希

Anh-Kiet Duong, Petra Gomez-Krämer, Jean-Michel Carozza

发表机构 * GitHub

AI总结 本文提出了一种抗碰撞的单次传递自监督语义哈希方法(CS3H),通过单次传递归一化汉明距离损失直接优化汉明空间相似性,生成良好的二进制表示,同时引入了对碰撞敏感的注意力模块以强调稀有且判别性的局部模式,从而减少哈希碰撞并提高细粒度判别能力。

Comments 17 pages, accepted to ICIP 2026

详情
AI中文摘要

无监督细粒度图像哈希旨在学习紧凑的二进制代码,以在高度相似实例之间保留细微的视觉差异,而无需人工标注。然而,大多数现有方法忽视了碰撞抵抗性,导致略微语义不同的样本具有相同的哈希代码。在本文中,我们提出了一种抗碰撞的单次传递自监督语义哈希(CS3H)框架,该框架通过单次传递归一化汉明距离损失直接优化汉明空间相似性,以生成良好的二进制表示。我们进一步引入了对碰撞敏感的注意力模块,以强调稀有且判别性的局部模式,从而减少哈希碰撞并提高细粒度判别能力。在多个基准测试中,实验表明CS3H在检索准确性上始终优于最先进的方法,同时在最小计算开销的情况下实现了优越的碰撞抵抗性。

英文摘要

Unsupervised fine-grained image hashing aims to learn compact binary codes that preserve subtle visual differences among highly similar instances without manual annotations. However, most existing methods neglect collision resistance, leading to identical hash codes for slightly semantically different samples. In this paper, we propose Collision-Resistant Single-Pass Self-Supervised Semantic Hashing (CS3H), a collision-resistant framework that directly optimizes Hamming-space similarity via a single-pass normalized Hamming distance loss to produce well-separated binary representations. We further introduce a collision-sensitive attention module to emphasize rare and discriminative local patterns, reducing hash collisions and improving fine-grained discrimination. Experiments on multiple benchmarks show that CS3H consistently outperforms state-of-the-art methods in retrieval accuracy while achieving superior collision resistance with minimal computational overhead.
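
To make the quantities in this abstract concrete, the snippet below computes a normalized Hamming distance between binary codes and a simple collision measure. It is only a sketch of the underlying definitions: the abstract does not give the form of the CS3H loss, and `collision_rate` is a hypothetical helper introduced here for illustration.

```python
from collections import Counter

def normalized_hamming(a, b):
    """Fraction of bit positions where two equal-length binary codes differ (0.0 to 1.0)."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def collision_rate(codes):
    """Fraction of samples whose hash code is identical to at least one other sample's code.

    Hypothetical metric for illustration; the paper may define collisions differently.
    """
    counts = Counter(tuple(c) for c in codes)
    return sum(n for n in counts.values() if n > 1) / len(codes)
```

A collision in this sense corresponds to a normalized Hamming distance of exactly zero between two distinct samples, which is the failure case the abstract says most existing methods neglect.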

2605.18287 2026-05-19 cs.CV cs.RO 版本更新

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

StableVLA: 向无额外数据的鲁棒视觉-语言-动作模型迈进

Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng, Kaiwei Sun, Qiyang Min, Qibin Hou, Yansong Tang, Jianan Wang, Daquan Zhou

发表机构 * Peking University(北京大学) Tsinghua University(清华大学) Nanjing University(南京大学) Nankai University(南开大学)

AI总结 本文研究了在未见真实世界视觉扰动下视觉-语言-动作(VLA)模型的鲁棒性问题,提出了一种基于信息理论的轻量级适配模块IB-Adapter,有效提升模型性能,同时保持高效和效果。

Comments Accepted by ICML 2026. Code: https://github.com/DAGroup-PKU/HumanNet. Project website: https://dagroup-pku.github.io/StableVLA/

详情
AI中文摘要

在训练数据中无法涵盖所有可能的扰动,这引发了关于在遇到未见真实世界视觉扰动时,视觉-语言-动作(VLA)模型鲁棒性的问题。在本文中,我们基于最近最先进的VLA模型进行了系统研究,并揭示了当引入训练数据中没有的视觉扰动时,性能显著下降。为缓解这一问题,我们提出了一种基于信息理论的轻量级适配模块,称为信息瓶颈适配器(IB-Adapter),该模块能够选择性地从视觉输入中过滤潜在噪声。无需任何额外数据或增强策略,IB-Adapter在基线模型上平均提升了30%,同时添加少于10M参数,显示出显著的效率和效果。此外,即使使用14倍更小的主干(0.5B参数)且未在Open X-Embodiment数据集上预训练,我们的模型StableVLA也实现了与7B规模最先进的VLA相媲美的鲁棒性。在参数开销极小(<10M)的情况下,我们的方法在长周期任务上保持了准确性,并在合成和物理视觉扰动下超越了OpenPi。

英文摘要

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.

2605.18263 2026-05-19 cs.CV 版本更新

RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting

RT-Splatting:基于高斯点散布的联合反射-透射建模

Ji Shi, Xianghua Ying, Bowei Xing, Ruohao Guo, Wenzhen Yue

发表机构 * State Key Laboratory of General Artificial Intelligence(通用人工智能全国重点实验室) School of Intelligence Science and Technology(智能科学与技术学院)

AI总结 本文提出RT-Splatting方法,通过将高斯点的几何占用与光学不透明度分离,实现对半透明表面复杂反射和清晰透射的联合建模,从而在实时渲染中获得高质量的反射和透射效果。

Comments CVPR 2026 Highlight, Project Page: https://sjj118.github.io/RT-Splatting/

详情
AI中文摘要

3D高斯点散布(3DGS)能够实现实时新视角合成,并具有高质量的视觉效果。然而,现有方法在处理半透明镜面表面时存在困难,这些表面同时表现出复杂的反射和清晰的透射,常常产生模糊的反射或过度遮挡的透射。为了解决这个问题,我们提出了RT-Splatting框架,该框架将每个高斯点的几何占用与其光学不透明度解耦。这种分解产生了一个统一的表面-体积场景表示,仅使用单组高斯基元。我们的混合渲染器将这种表示既解释为表面以捕获高频反射,又解释为体积以保持清晰的透射。为了减轻联合优化反射和透射时的歧义性,我们引入了Specular-Aware Gradient Gating,该方法抑制来自高镜面区域的误导性梯度进入透射分支,从而有效减少干扰性的漂浮物伪影(floaters)。在具有挑战性的半透明场景上的实验表明,RT-Splatting实现了最先进的性能,能够实时渲染高保真的反射和清晰的透射。此外,我们的分解自然地支持灵活的场景编辑。项目页面可在https://sjj118.github.io/RT-Splatting上找到。

英文摘要

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing. The project page is available at https://sjj118.github.io/RT-Splatting.

2605.18257 2026-05-19 cs.CV cs.AI cs.CL 版本更新

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

CodeBind: 基于统一组合码本的多模态对齐解耦表示学习

Zeyu Chen, Jie Li, Kai Han

发表机构 * Visual AI Lab, The University of Hong Kong(视觉人工智能实验室,香港大学)

AI总结 CodeBind通过统一的组合代码本设计优化多模态表示空间,解决了传统方法在跨模态信息差异和数据稀缺导致的对齐空间不足问题,实现了多模态分类和检索任务中的最佳性能。

Comments ACL 2026 Findings; Project page: https://visual-ai.github.io/codebind

详情
AI中文摘要

多模态表示对齐对于大语言模型和机器人至关重要。传统方法常受到跨模态信息差异和数据稀缺的限制,导致对齐空间不优,忽略了模态特有的特征。我们提出了CodeBind,一种通过模态共享-特定代码本设计优化多模态表示空间的框架。通过逐步对齐目标和连接模态,CodeBind避免了需要完全配对数据的需要。不同于传统硬对齐,CodeBind将特征分解为共享组件以实现语义一致性,以及特定组件以捕捉模态特有的细节。这种设计利用了组合向量量化方案,其中共享代码本弥合模态差距,而模态特定代码本通过防止主导模态压制其他模态来缓解表示偏差。在九种模态(文本、图像、视频、音频、深度、热成像、触觉、3D点云、EEG)上验证,CodeBind在多模态分类和检索任务中实现了最先进的性能。

英文摘要

Multimodal representation alignment is pivotal for large language models and robotics. Traditional methods are often hindered by cross-modal information discrepancies and data scarcity, leading to suboptimal alignment spaces that overlook modality-unique features. We propose CodeBind, a framework that optimizes multimodal representation spaces through a modality-shared-specific codebook design. By incrementally aligning target and bridging modalities, CodeBind bypasses the need for fully paired data. Unlike traditional hard alignment, CodeBind decomposes features into shared components for semantic consistency and specific components for modality-unique details. This design utilizes a compositional vector quantization scheme, where a shared codebook bridges modality gaps and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.

2605.18252 2026-05-19 cs.CV 版本更新

GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance

GaussianZoom: 一种结合几何和语义引导的渐进式缩放生成3D高斯点云

Jiale Shi, Jiarui Hu, Zesong Yang, Kaixuan Luan, Hujun Bao, Zhaopeng Cui

发表机构 * State Key Lab of CAD & CG(计算机辅助设计与图形学国家重点实验室)

AI总结 本文提出GaussianZoom,一种结合几何一致场景建模和多尺度语义推理的生成式缩放3D重建系统,通过迭代渐进框架实现从低分辨率输入生成高保真的极端缩放渲染。

Comments 10 pages, 7 figures

详情
AI中文摘要

我们介绍了GaussianZoom,一种具有迭代渐进框架的生成式缩放3D重建系统,该框架结合几何一致的场景建模和多尺度语义推理,以实现从低分辨率输入生成高保真的极端缩放渲染。为此,我们开发了一种新的多视角一致超分辨率模块,结合基于深度的特征扭曲和VLM驱动的细节合成,确保准确的多视角对应关系,同时在观察分辨率之外丰富细粒度外观。为了支持大范围的缩放,我们进一步引入了一种可扩展的连续细节层次结构,该结构动态调节高斯可见性,以实现平滑且无混叠的跨尺度渲染。在Mip-NeRF360和Tanks&Temples上的实验表明,GaussianZoom在极端缩放下实现了优越的感知质量、多视角一致性和鲁棒性,为生成式缩放3D场景重建建立了强有力的基准。

英文摘要

We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.

2605.18238 2026-05-19 cs.CV 版本更新

Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning

非碰撞生物识别身份用于数字实体:几何、容量与百万级虚拟身份提供

Yuyang Ji, Yixuan Shen, Anil Jain, Xiaoming Liu, Feng Liu

发表机构 * Department of Computer Science, Drexel University(德雷塞尔大学计算机科学系) Department of Computer Science and Engineering, Michigan State University(密歇根州立大学计算机科学与工程系) Department of Computer Science, University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校计算机科学系)

AI总结 本研究提出Biometric Identity Provisioning(BIP)框架,解决在真实人类身份库中提供非碰撞虚拟身份的问题,通过几何方法在真实面部流形中分配未被占用的间隙,生成高保真面部图像,并展示1000万非碰撞虚拟身份嵌入。

Comments 25 pages, 11 figures

详情
AI中文摘要

数字实体如AI代理和人形机器人日益与真实人类共同操作,但其身份基础设施仍基于凭证而非生物识别身份。我们引入Biometric Identity Provisioning(BIP),一种新的问题和解决方案框架,旨在:给定一个真实人类身份的注册画廊,提供虚拟身份,这些身份与每个注册身份不碰撞,保持足够的类间分离性,并能作为高保真面部图像实现。关键的几何洞察是真实面部身份占据嵌入超球面的低维子空间,留下的残余子空间无法供虚拟身份使用。因此,虚拟身份必须在真实面部流形本身中分配未被占用的间隙。BIP因此是一个受限的填充问题:可用的间隙远超任何可预见的注册规模,并且即使后续注册了新的真实身份,已提供的身份仍保持不碰撞。基于此几何,我们的排斥式分配不受任何固定提供数量的限制;我们展示了针对360,000个真实身份画廊的1000万非碰撞虚拟身份嵌入。将这些嵌入转化为面部图像需要一个在真实面部图像训练分布外运行的生成器;我们引入GapGen,一种具有间隙意识的生成器,通过渐进扩展合成到非碰撞区域的课程进行训练,验证了100万张逼真虚拟面部图像。我们进一步构建了v-LFW,一个LFW面部数据集的虚拟对应物,包含虚拟面部验证、跨现实匹配、真实与虚拟检测以及统一识别和检测的协议。

英文摘要

Digital entities such as AI agents and humanoid robots increasingly operate alongside real humans, yet their identity infrastructure is based on credentials rather than embodied biometric identity. We introduce Biometric Identity Provisioning (BIP), a new problem and solution framework that addresses: given an enrollment gallery of real human identities, provision virtual identities that are non-colliding with every enrolled identity, maintain sufficient inter-class separability, and are realizable as high-fidelity face images. The key geometric insight is that real face identities occupy a low-dimensional subspace of the embedding hypersphere, leaving no residual subspace for virtual identities. Hence, virtual identities must instead be allocated as unclaimed gaps within the real face manifold itself. BIP is therefore a constrained packing problem: available gaps vastly exceed any foreseeable enrollment scale, and provisioned identities remain non-colliding even as new real identities are subsequently enrolled. Grounded in this geometry, our repulsion-based allocation is not bounded by any fixed provisioning count; we demonstrate 10M non-colliding virtual identity embeddings against a gallery of 360K real identities. Realizing these embeddings as face images requires a generator that operates outside the training distribution of real face images; we introduce GapGen, a gap-aware generator trained with a curriculum that progressively extends synthesis into non-colliding regions, validated at 1M photorealistic virtual face images. We further construct v-LFW, a virtual counterpart to the LFW face dataset, with protocols for virtual face verification, cross-reality matching, real-vs-virtual detection, and unified recognition and detection.
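
The non-collision requirement stated in this abstract can be illustrated with a simple acceptance test on embeddings. This is a hedged sketch only: the use of cosine similarity, the `threshold` value, and the function name `is_non_colliding` are assumptions for illustration, and the paper's repulsion-based allocation is not reproduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def is_non_colliding(candidate, gallery, provisioned, threshold=0.3):
    """Accept a candidate virtual identity only if it stays below `threshold`
    similarity to every enrolled real identity (gallery) AND to every virtual
    identity issued so far (provisioned). The threshold is a hypothetical value.
    """
    others = list(gallery) + list(provisioned)
    return all(cosine(candidate, e) < threshold for e in others)
```

Checking against both sets mirrors the two conditions in the abstract: no collision with enrolled identities, and sufficient separability among the provisioned identities themselves.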

2605.18233 2026-05-19 cs.CV 版本更新

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

增强免训练无限帧生成以实现一致的长视频

X. Feng, J. Zhu, M. Wu, C. Chen, F. Mao, H. Guo, J. Wu, X. Chu, K. Huang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学人工智能学院,北京,中国) The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China(复杂系统认知与决策智能重点实验室,中国科学院自动化研究所,北京,中国) AMAP, Alibaba Group, Beijing, China(AMAP,阿里巴巴集团,北京,中国)

AI总结 本文提出MIGA方法,通过两阶段对齐机制和双一致性增强机制,解决训练与推理不匹配和长时一致性维持的问题,从而提升长视频生成效果。

Comments Accepted by ICML 2026

详情
AI中文摘要

Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.

英文摘要

Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.

2605.18221 2026-05-19 cs.SD cs.CL cs.CV cs.LG physics.med-ph 版本更新

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

SIREM: 语音引导的MRI重建与学习采样

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter, Jonghye Woo, Moritz Zaiss, Andreas Maier, Paula A. Perez-Toro

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡弗里德里希-亚历山大大学模式识别实验室) Institute of Radiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根大学医院放射学研究所) Institut für Informationsverarbeitung, Leibniz Universität Hannover(汉诺威莱布尼茨大学信息处理研究所) Department of Radiology, Harvard Medical School and Massachusetts General Hospital(哈佛医学院放射科和麻省总医院)

AI总结 本文提出了一种语音引导的MRI重建框架SIREM,将同步语音作为跨模态先验,利用声道构型与语音声学之间的相关性预测部分图像内容,从而在远高于迭代方法的吞吐量下保持解剖上合理的声道结构。

详情
AI中文摘要

语音产生过程的实时磁共振成像(rtMRI)能够非侵入性地可视化动态声道运动,对语音科学和临床评估具有价值。然而,rtMRI本质上受到空间分辨率、时间分辨率和采集速度之间的权衡限制,常常导致k空间欠采样和重建质量下降。我们提出SIREM,一种利用同步语音作为跨模态先验的语音引导MRI重建框架。核心思想是语音期间的声道构型与所产生的声学信号相关,使部分图像内容可以从音频中预测。SIREM通过空间加权图将每帧建模为音频驱动分量和MRI驱动分量的融合。音频分支从语音预测与发音器官相关的结构,而MRI分支从测得的k空间数据重建互补内容。我们进一步引入了螺旋臂上的可学习软加权轮廓,从而能够以可微分的方式研究k空间螺旋臂的使用如何与语音引导融合相互作用。这产生了一个统一的多模态公式,结合了音频驱动预测、MRI重建和采样适应。我们在USC语音rtMRI基准上评估了SIREM,并与标准基线(包括网格化(gridding)、基于小波的压缩感知和总变分)进行比较。SIREM引入了一种语音引导的重建范式,其吞吐量远高于迭代方法,同时保持解剖上合理的声道结构。这些结果为多模态语音引导的rtMRI重建建立了初步基准,并突显了同步语音作为快速重建辅助先验的潜力。源代码可在https://github.com/mdhasanai/SIREM获取。

英文摘要

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM
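
The fusion described in this abstract can be written compactly as a per-pixel blend. The convex-combination form below is an assumption made for illustration (the abstract only says the two components are fused "through a spatial weighting map"), and the symbols $M$, $a_t$, and $m_t$ are notation introduced for this note.

```latex
% a_t : audio-driven component for frame t
% m_t : MRI-driven component reconstructed from measured k-space data
% M   : spatial weighting map with entries in [0, 1]
\hat{x}_t = M \odot a_t + (1 - M) \odot m_t,
\qquad M \in [0,1]^{H \times W}.
```

Under this reading, regions where $M$ is close to 1 are those the model treats as predictable from speech (for example, articulator-related structure), while the remaining content is taken from the k-space measurements.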

2605.18209 2026-05-19 cs.CV cs.AI 版本更新

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

SPATIOROUTE: 动态提示路由用于零样本空间推理

Pawat Chunhachatrachai, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

发表机构 * National Taiwan University(国立台湾大学) Delta Robotics Innovation Center(Delta机器人创新中心)

AI总结 本文提出SpatioRoute,一种动态提示生成方法,通过语义定制的提示模板路由问题,无需额外训练或3D传感器输入,在零样本设置下提升空间推理性能,同时发现Chain-of-Thought提示在空间视频理解中效果不佳。

Comments 10 pages, 2 figures, 2nd Workshop on 3D-LLM/VLA, CVPR 2026

详情
AI中文摘要

针对第一人称视角(egocentric)视频的空间问答是一项具有挑战性的任务,需要视觉-语言模型(VLMs)对3D物体位置、场景可供性(affordance)和方向关系进行推理,特别是在没有任务特定微调的零样本设置中。我们引入SpatioRoute,一种动态提示生成方法,将每个输入问题路由到语义定制的提示模板,无需任何额外训练、微调或3D传感器输入。SpatioRoute有两种互补模式:SpatioRoute-R,一种基于规则的路由器,将问题类型(如What、Is、How、Can、Which)确定性地映射到专门的提示模板;以及SpatioRoute-L,一种基于LLM的方法,仅根据问题和情境上下文生成任务特定的提示,路由时不使用视频输入。我们在SQA3D基准上针对不同模型家族的VLMs评估了SpatioRoute。相比固定提示基线,SpatioRoute取得了一致的总体准确率提升,最高达5%,在无需3D点云输入的零样本纯视频空间VQA上达到了新的最先进水平。此外,我们发现通过Think it Twice架构实现的Chain-of-Thought(CoT)提示在此设置下会持续降低Qwen系列模型的性能,证实了在空间视频理解中,问题感知路由比统一的推理指令更有效。

英文摘要

Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template -- without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning multiple model families. SpatioRoute achieves consistent overall accuracy gains of up to 5% over fixed-prompt baselines, establishing a new state of the art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
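
The rule-based mode (SpatioRoute-R) amounts to a deterministic lookup from the question's leading word to a prompt template, which can be sketched as below. The template wording, the `DEFAULT_TEMPLATE` fallback, and the function name `route_prompt` are all hypothetical; the abstract names the question types but does not give the actual templates.

```python
# Hypothetical templates keyed by the question's first word.
# The wording is invented for illustration and is not taken from the paper.
PROMPT_TEMPLATES = {
    "what": "Situation: {situation}\nIdentify the object or attribute asked about.\nQuestion: {question}",
    "is": "Situation: {situation}\nAnswer yes or no after checking the spatial relation.\nQuestion: {question}",
    "how": "Situation: {situation}\nCount or estimate the quantity asked about.\nQuestion: {question}",
    "can": "Situation: {situation}\nJudge whether the action is feasible from this position.\nQuestion: {question}",
    "which": "Situation: {situation}\nAnswer with a direction or the chosen object.\nQuestion: {question}",
}

# Fallback for question types without a dedicated rule (an assumption of this sketch).
DEFAULT_TEMPLATE = "Situation: {situation}\nQuestion: {question}"

def route_prompt(question: str, situation: str) -> str:
    """Pick a template from the question's first word and fill it in.

    No video is consulted here, matching the abstract's description of routing
    as a text-only step performed before the VLM sees the frames.
    """
    words = question.strip().split()
    key = words[0].lower().strip("?.,:;!") if words else ""
    template = PROMPT_TEMPLATES.get(key, DEFAULT_TEMPLATE)
    return template.format(situation=situation, question=question)
```

For example, `route_prompt("Is the chair behind me?", "I am facing the desk.")` selects the yes/no template, while a "Where ..." question falls through to the default. The resulting string would then be passed to the VLM together with the video.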

2605.18197 2026-05-19 cs.RO cs.AI cs.CV 版本更新

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

仅RGB的主动3D场景图生成用于室内移动机器人

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

发表机构 * Mobile Robotics Group (MRG)(移动机器人小组) Visual and Multimodal Applied Learning Lab (VANDAL)(视觉与多模态应用学习实验室)

AI总结 本文提出了一种仅使用RGB输入的主动3D场景图生成方法,通过统一感知与规划的结构化表示,解决了传统方法对专用传感器的依赖问题,并在Replica数据集上验证了其有效性。

详情
AI中文摘要

当前3D场景图生成方法依赖于专用深度传感器,如LiDAR或RGB-D相机,限制了部署到专用机器人平台,并排除了仅使用RGB相机的场景,如固定外部基础设施。现有流程通常基于被动收集的观测轨迹,而不是基于部分构建的场景表示选择视角,因此无法有效利用图中编码的语义和空间信息。本文提出了一种完全视觉框架,用于从仅RGB输入中主动、逐步构建3D场景图,解决了这两个限制。所提出的方法围绕共享的结构化表示统一感知和规划,该表示捕捉了物体语义、3D几何、关系上下文以及多视角信息。由于该框架是硬件无关的,并且仅依赖RGB观测,因此可以将机载机器人相机和固定外部相机的输入整合到同一表示中。在Replica数据集上的实验表明,仅RGB的流程在F1分数上与使用真实深度的基线相当。在ReplicaCAD上的主动探索实验进一步表明,语义驱动的视角选择在相同探索预算下能够检测到比基于几何前沿的基线多超过两倍的物体。最后,外部相机设置表明,互补的RGB视角可以有效启动场景图并提高上下文理解,而无需额外的探索成本。

英文摘要

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

2605.18194 2026-05-19 cs.AI cs.CV 版本更新

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

超越笛卡尔错觉:在感知瓶颈下测试两阶段多模态心智理论

Yajing Zhou, Xiangyu Kong

发表机构 * College of Computer Science, Beijing Information Science and Technology University(北京信息科技大学计算机学院)

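AI summary: This paper studies two-stage spatial inference in multimodal large language models under perceptual bottlenecks. It proposes an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought to address spatial symmetry and viewpoint ambiguity, improving multimodal Theory of Mind performance.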

Comments 17 pages, 3 figures

AI中文摘要

尽管多模态大语言模型(MLLMs)在一般推理方面表现出色,但其具身空间智能仍受“笛卡尔错觉”限制,即依赖基于文本的概率分布,缺乏有依据的3D拓扑理解。这种局限性在多智能体环境中尤为明显,这些环境不仅需要场景感知,还需要二阶心智理论(Theory of Mind,ToM)。具体而言,智能体A必须能够推断智能体B对环境的信念,而这严格受B的物理朝向和感官限制约束。在本文中,我们通过一个新颖的视听任务来探测MLLMs的两阶段空间推理极限:要求A预测B对A相对位置的估计。为此,我们提出了认知感官瓶颈(Epistemic Sensory Bottleneck)模块,摒弃了僵化的、基于规则的坐标变换。取而代之的是,我们引入了基于锚点的具身空间分解思维链(CoT)。该方法引导MLLM进行“几何到语义”的投影,迫使它首先建立B的局部坐标系,然后根据A是否位于B的视锥内动态加权视觉和听觉模态。广泛的评估表明,尽管当前MLLMs在空间对称性和视野外歧义方面存在根本性困难(建立了42%准确率的严格零样本基线),我们的感官受限推理链仍稳健地优于纯自我中心和纯他者中心(allocentric)基线。通过系统地评测这些感知瓶颈,我们的工作揭示了当前MLLM空间推理的极限,并为具身AI中具有认知意识、模态感知的推理建立了基础范式。

英文摘要

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a "geometric-to-semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.
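The abstract's visibility condition (whether A falls within B's visual frustum) can be illustrated with a small geometric check. The following is only a sketch under stated assumptions: a 2D top-down world, a hypothetical 90-degree horizontal field of view, and made-up modality weights. The paper applies this logic through chain-of-thought prompting, not through code like this; `in_view` and `modality_weights` are illustrative names.

```python
import math

def in_view(observer_xy, heading_deg, target_xy, fov_deg=90.0):
    """Return True if the target lies inside the observer's horizontal field of view.

    Assumption: 2D top-down coordinates, heading measured counter-clockwise
    from the +x axis, and a hypothetical default FOV of 90 degrees.
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    bearing = math.degrees(math.atan2(dy, dx))
    # Signed angular offset between bearing and heading, wrapped to [-180, 180).
    offset = (bearing - heading_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= fov_deg / 2.0

def modality_weights(visible, w_visual_in_view=0.8):
    """Hypothetical weighting: mostly vision when A is visible, audio only otherwise."""
    w_visual = w_visual_in_view if visible else 0.0
    return {"visual": w_visual, "audio": 1.0 - w_visual}

# Agent B stands at the origin facing +x; Agent A is directly behind B.
visible = in_view((0.0, 0.0), 0.0, (-1.0, 0.0))
weights = modality_weights(visible)
```

Here A is outside B's view, so the sketch falls back entirely on the auditory cue, which is the out-of-view case the abstract identifies as ambiguous.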

2605.18193 2026-05-19 cs.CV cs.GR 版本更新

Best Segmentation Buddies for Image-Shape Correspondence

图像-形状对应关系的最佳分割伙伴

Itai Lang, Dongwei Lyu, Dale Decatur, Rana Hanocka

发表机构 * University of Chicago(芝加哥大学)

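AI summary: This paper studies segmentation-to-segmentation correspondence between in-the-wild images and untextured 3D shapes. It links image pixels to 3D shape vertices by computing feature similarity over deep visual features, enabling cross-modal correspondence discovery.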

Comments CVPR 2026. Project page: https://threedle.github.io/bsb/

AI中文摘要

寻找对应关系是计算机视觉和图形学中的基本且广泛研究的问题。在本工作中,我们探讨了在真实图像和无纹理3D形状之间估计分割到分割对应关系的任务。该任务极具挑战性,因为存在显著的外观、几何和视角差异。我们的方法通过将图像分割中的像素链接到相应3D形状语义部分的顶点,跨越了跨模态的差距。为此,我们首先将深度视觉特征从2D视觉模型蒸馏到3D形状表面,使得能够计算图像像素与形状顶点之间的特征相似性。然后,我们识别出最佳分割伙伴,即那些最相似的图像像素位于图像分割区域内的顶点,从而可靠地发现语义对应形状部分的顶点。最后,我们利用从2D图像分割模型蒸馏的3D特征,直接在3D中对形状进行分割,从而引导对应关系的发现。我们展示了我们的方法在广泛图像-形状配对中的通用性和鲁棒性,展示了准确且具有语义意义的对应关系。我们的项目页面位于https://threedle.github.io/bsb/。

英文摘要

Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored task of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. Then, we identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D image segmentation model to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our project page is at https://threedle.github.io/bsb/.
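The Best Segmentation Buddies criterion described above (vertices whose most similar image pixel lies within the image segmentation region) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes features have already been distilled onto the vertices, it uses cosine similarity as the feature similarity (the abstract does not name the measure), and the function name is hypothetical.

```python
import numpy as np

def best_segmentation_buddies(vertex_feats, pixel_feats, pixel_in_segment):
    """Return indices of vertices whose most similar pixel lies inside the segment.

    vertex_feats:     (V, D) features on the shape vertices
    pixel_feats:      (P, D) features of the image pixels
    pixel_in_segment: (P,) boolean mask of the image segmentation region
    """
    # Normalize so that the dot product is cosine similarity (an assumption).
    v = vertex_feats / np.linalg.norm(vertex_feats, axis=1, keepdims=True)
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    sim = v @ p.T                        # (V, P) similarity matrix
    nearest_pixel = sim.argmax(axis=1)   # most similar pixel for each vertex
    return np.flatnonzero(pixel_in_segment[nearest_pixel])

# Toy data: three vertices, two pixels, only pixel 0 is inside the segment.
vertex_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
pixel_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
pixel_in_segment = np.array([True, False])
buddies = best_segmentation_buddies(vertex_feats, pixel_feats, pixel_in_segment)
```

In the toy data, vertices 0 and 2 point toward pixel 0 and are kept, while vertex 1 matches a pixel outside the segment and is dropped.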

2605.18192 2026-05-19 cs.CV 版本更新

View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification

视感知语义对齐用于空-地人员重识别

Quan Zhang, Zeqiang Cai, Peiming Zhao, Jingze Wu, Cailun Wu, Hongbo Chen, Jianhuang Lai

发表机构 * Sun Yat-sen University, China(中山大学,中国) Pazhou Lab (HuangPu), Guangdong, China(琶洲实验室(黄埔),广东,中国) Guangdong Province Key Laboratory of Information Security Technology, Guangzhou, China(广东省信息安全技术重点实验室,广州,中国) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China(教育部机器智能与高级计算重点实验室,中国)

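AI summary: This paper proposes ViSA, a view-aware semantic alignment framework that addresses viewpoint discrepancy in aerial-ground person re-identification. An Expert-driven Token Generation Module and a Dual-branch Local Fusion Module provide cross-view semantic consistency, and experiments report a 10.06% mAP improvement on the CARGO benchmark.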

Comments CVPR 2026 POSTER

AI中文摘要

空-地人员重识别(AGPReID)由于无人机和固定相机之间视角差异显著而极具挑战性。现有方法通常采用视角不变的范式,通过对齐不同视角下的共享特征来实现鲁棒性。然而,视角不变性本质上强制了部分级对齐,忽略了视角特定的线索和判别性身份信息。为此,本文提出ViSA(视感知语义对齐)框架,该框架实现了跨视角语义一致性,包含专家驱动令牌生成模块(ETGM)和双分支局部融合模块(DLFM)。技术上,前者构建了一组视感知专家,生成适应性语义查询以感知视角特定的模式,后者利用图推理提取并对齐响应不同专家的局部区域。在三个AGPReID基准测试集(包括AG-ReID.v2、CARGO和LAGPeR)上的广泛实验表明,ViSA在挑战性的CARGO跨视角协议上实现了mAP提升10.06%的显著改进。代码可在https://github.com/Cat-Zero/ViSA获取。

英文摘要

Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, view invariance inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency through an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts. Extensive experiments on three AGPReID benchmarks, AG-ReID.v2, CARGO, and LAGPeR, demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code is available at https://github.com/Cat-Zero/ViSA.

2605.18190 2026-05-19 cs.LG cs.CV 版本更新

Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network

双速率扩散:通过交错重-轻网络加速扩散模型

Grigory Bartosh, David Ruhe, Emiel Hoogeboom, Jonathan Heek, Thomas Mensink, Tim Salimans

发表机构 * Google DeepMind, Amsterdam(谷歌DeepMind,阿姆斯特丹) University of Amsterdam(阿姆斯特丹大学)

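AI summary: This paper proposes Dual-Rate Diffusion, which accelerates diffusion model inference by interleaving a high-capacity context encoder with a lightweight denoising model while preserving sample quality, balancing performance and computational cost on ImageNet benchmarks.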

AI中文摘要

扩散模型在生成性能上达到最先进的水平,但在推理过程中由于重复评估重的神经网络而面临高昂的计算成本。在本文中,我们提出了双速率扩散,一种通过交错执行高容量的上下文编码器和轻量高效的去噪模型来加速采样的方法。上下文编码器被稀疏评估以提取高维特征,这些特征在每一步都被轻量去噪模型有效重用,以高效地细化样本。这种方法显著加速了推理过程,而不会牺牲样本质量。在ImageNet基准上,双速率扩散在性能上与标准基线相匹配,同时将计算成本降低了2-4倍。此外,我们证明了我们的方法与蒸馏技术,如动量匹配蒸馏,兼容,从而在少步生成中进一步提高效率。

英文摘要

Diffusion models achieve state-of-the-art generative performance but suffer from high computational costs during inference due to the repeated evaluation of a heavy neural network. In this work, we propose Dual-Rate Diffusion, a method to accelerate sampling by interleaving the execution of a heavy, high-capacity context encoder and a light, efficient denoising model. The context encoder is evaluated sparsely to extract high-dimensional features, which are reused by the light denoising model at every step to refine the sample. This approach significantly accelerates inference without compromising sample quality. On ImageNet benchmarks, Dual-Rate Diffusion matches the performance of standard baselines while reducing computational cost by a factor of 2 to 4. Furthermore, we demonstrate that our method is compatible with distillation techniques, such as Moment Matching Distillation, enabling further efficiency gains in few-step generation.
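The interleaving idea can be sketched as a sampling loop in which the heavy encoder runs only occasionally and its output is cached for the light model. This is a rough sketch under assumptions: the abstract says only that the encoder is "evaluated sparsely", so the fixed `heavy_every` interval is a guess, and the two callables below are toy stand-ins rather than real networks.

```python
def dual_rate_sample(x, num_steps, heavy_every, heavy_encoder, light_denoiser):
    """Run a sampling loop that refreshes the cached context every `heavy_every` steps.

    Assumption: a fixed refresh interval; the paper's actual schedule is not
    given in the abstract.
    """
    calls = {"heavy": 0, "light": 0}
    context = None
    for step in range(num_steps):
        if step % heavy_every == 0:
            context = heavy_encoder(x, step)   # expensive, evaluated sparsely
            calls["heavy"] += 1
        x = light_denoiser(x, step, context)   # cheap, evaluated at every step
        calls["light"] += 1
    return x, calls

# Toy stand-ins (hypothetical): scalars instead of images and features.
def toy_heavy_encoder(x, step):
    return 0.5 * x

def toy_light_denoiser(x, step, context):
    return 0.9 * x + 0.1 * context

sample, calls = dual_rate_sample(1.0, num_steps=12, heavy_every=4,
                                 heavy_encoder=toy_heavy_encoder,
                                 light_denoiser=toy_light_denoiser)
```

With 12 steps and a refresh interval of 4, the heavy model runs 3 times and the light model 12 times; how much compute that saves depends on the relative cost of the two networks.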

2605.18184 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

固定外部摄像头作为主动3D场景图生成的共同先验地图

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

发表机构 * Mobile Robotics Group (MRG)(移动机器人组) Visual and Multimodal Applied Learning Lab (VANDAL)(视觉与多模态应用学习实验室)

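AI summary: This paper proposes using fixed external RGB cameras as Common Prior Maps for active, incremental 3D scene graph generation. Fusing data from onboard robot cameras and fixed external cameras improves the efficiency and accuracy of scene understanding.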

AI中文摘要

常用的先验信息,如BIM模型、平面图和遥感图像,可以为自主机器人系统提供有价值的几何和语义上下文。在本文中,我们将固定外部RGB摄像头的观测视为共同先验地图(CPMs):环境的广角视图,在任何机器人运动开始之前初始化一个语义和几何场景先验。我们提出一个仅使用RGB的框架,用于主动、渐进式的3D场景图(3DSG)生成,该框架在单一硬件无关的管道中无缝融合来自机器人 onboard 摄像头和固定外部摄像头的观测。通过仅依赖RGB观测并通过前馈3D重建模型进行处理,系统将所有摄像头——机器人 onboard 或外部——视为相同,无需硬件修改。基于图的主动语义探索框架然后直接利用部分场景图,引导机器人向高语义不确定性区域前进,逐步完成和细化先验。实验表明,使用单个外部摄像头初始化场景图可使初始物体召回率提高高达+79%,并且先验的更丰富上下文显著提高了后续主动探索的效率。

英文摘要

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras - onboard or external - identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

2605.18177 2026-05-19 cs.CV 版本更新

Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

基于令牌空间的掩码预测用于高效的视觉变换器分割

Calvin Galagain, Martyna Poreba, François Goulette

发表机构 * Université Paris-Saclay, CEA List(巴黎-萨克雷大学,CEA List) U2IS, ENSTA Paris, Institut Polytechnique de Paris(U2IS,巴黎ENSTA,巴黎理工学院)

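AI summary: This paper proposes TokenMask, which computes mask logits directly from query-token affinities and interpolates in logit space, reducing computational and memory requirements while maintaining accuracy and improving segmentation efficiency.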

Comments CVPR, EVW 2026

AI中文摘要

基于查询的视觉变换器分割模型通常重建密集的空间特征图以预测掩码,继承了卷积架构的设计模式。我们证明这种显式的图像空间重建是不必要的。我们引入TokenMask,一种令牌空间掩码头,直接从查询令牌亲和力计算掩码logits,并在logit空间而非特征空间中进行插值。这种重新表述保留了原始的线性评分机制,同时简化了计算结构。在多样化的ViT后端、数据集和分割任务中,TokenMask通过减少计算和内存需求,同时保持竞争性的准确性,提高了在NVIDIA Jetson AGX Orin上的实际速度。总体而言,TokenMask为嵌入式视觉系统提供了一种更简单且更易于部署的设计。

英文摘要

Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
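One way to see why interpolating in logit space can preserve a linear scoring mechanism: if mask logits are dot products between queries and token features, and upsampling is a linear operator, then scoring upsampled features and upsampling the coarse logits give the same result by associativity of matrix products. The sketch below checks this numerically on a 1D toy grid. It is an illustration under assumptions (1D linear interpolation, random features, hypothetical names), not the TokenMask implementation.

```python
import numpy as np

def linear_upsample_matrix(n_in, n_out):
    """Build an (n_out, n_in) matrix that performs 1D linear interpolation."""
    U = np.zeros((n_out, n_in))
    pos = np.linspace(0, n_in - 1, n_out)      # sample positions on the coarse grid
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n_in - 1)
    frac = pos - lo
    U[np.arange(n_out), lo] += 1.0 - frac
    U[np.arange(n_out), hi] += frac
    return U

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))    # 3 mask queries, feature dimension 8
tokens = rng.normal(size=(4, 8))     # 4 patch tokens on a coarse grid
upsample = linear_upsample_matrix(4, 16)

# Feature-space route: upsample the D-dimensional features, then score them.
feature_space_logits = queries @ (upsample @ tokens).T
# Logit-space route: score the coarse tokens, then upsample the logits.
logit_space_logits = (queries @ tokens.T) @ upsample.T

same = np.allclose(feature_space_logits, logit_space_logits)
```

The logit-space route interpolates one value per query at each position instead of a full feature vector, which is where the reduced compute and memory would come from when the feature dimension exceeds the number of queries.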

2605.18176 2026-05-19 cs.CV cs.AI 版本更新

MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

MARS:EgoVis 2026 CASTLE挑战的技术报告

Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室) Shandong Jianzhu University(山东建筑大学)

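AI summary: This paper presents MARS, a system for the CASTLE Challenge at EgoVis 2026 that uses multimodal agentic reasoning to answer complex questions requiring multiple information sources. Its core method is multimodal evidence selection, and its main contribution is effective reasoning over multi-source data.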

Comments The Runner-up Solution for CASTLE Challenge @ EgoVis 2026

AI中文摘要

本报告介绍了MARS,即多模态代理推理与源选择系统,是参与EgoVis 2026 CASTLE挑战的系统。参赛者必须在CASTLE 2024数据集上回答185个封闭式问题。与以往单视频眼动基准不同,CASTLE要求对四天活动、15个同步视角、官方 transcripts 及多种辅助模态(包括个人照片、辅助视频、注视、热成像和心率测量)进行推理。MARS将任务视为多模态源的代理证据选择问题,而非纯粹文本流程。MARS首先遵循官方CASTLE目录组织,从视频和 transcripts 两个主要来源以及注视、心率、照片和热成像四个辅助来源构建证据记忆。长视频仅转换为caption和基于DeepSeek的摘要,因为CASTLE视频太长无法直接输入模型上下文;此步骤压缩时间证据,同时保留照片和其他辅助媒体作为源特定证据。在推理时,一个GPT-5.4决策代理反复选择是否继续推理、请求特定缺失模态、生成答案或回退到随机选项,当证据不足时。所得到的系统在最终CASTLE挑战排行榜上获得第二名。我们的代码可在https://github.com/Hyu-Zhang/MARS获取。

英文摘要

This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heart-rate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources (videos and transcripts) and four auxiliary sources (gaze, heart rate, photos, and thermal imagery). Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our code is available at https://github.com/Hyu-Zhang/MARS.

2605.18173 2026-05-19 cs.CV 版本更新

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

你需要文本校正吗?用于无校正场景文本识别的软注意力掩码嵌入

Antonio Colombo, Giovanni Bianchi

发表机构 * School of Information, Polytechnic University of Turin(都灵理工大学信息学院)

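AI summary: This paper proposes a Soft Attention Mask Embedding (SAME) module that uses the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are hierarchically embedded with predicted masks to produce refined text-boundary-aware masks that suppress background noise. Building on this module, it presents SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules.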

AI中文摘要

端到端场景文本识别,即在一个框架内统一文本检测和识别,已因深度学习的进步而取得显著进展。然而,大多数现有方法仍然受到多尺度变化、任意文本形状和复杂背景干扰导致的不完整掩码提案的影响,从而降低识别准确性。在本文中,我们提出了一种新的软注意力掩码嵌入模块(SAME),该模块利用Transformer编码器的全局感受野来编码高级特征并计算软注意力权重,然后与预测的掩码进行分层嵌入,生成精细的文本边界感知掩码,从而有效抑制背景噪声。基于该模块,我们提出了SAME-Net,一个鲁棒的端到端文本识别框架,无需字符级标注或辅助文本校正模块。由于软注意力机制是完全可微分的,识别损失梯度可以反向传播通过SAME模块到检测分支,从而实现检测和识别目标的联合优化。在具有挑战性的基准测试中进行了广泛的实验,证明了我们方法的有效性:SAME-Net在任意形状的Total-Text数据集上实现了84.02%的端到端H-mean,比之前的最先进方法GLASS在全词典准确率上高出1.02%,且无需额外训练数据;在多方向ICDAR 2015数据集上获得了具有竞争力的83.4%强词典结果。

英文摘要

End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02% in full-lexicon accuracy without additional training data, and obtains competitive 83.4% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.

2605.18162 2026-05-19 cs.CV cs.AI 版本更新

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

通过几何逻辑一致性实现视觉语言模型中的自演化空间推理

Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang, Piotr Koniusz, Yirong Chen, Ding Wang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The City University of New York(纽约城市大学) The Chinese University of Hong Kong(香港中文大学) Data61 CSIRO(Data61澳大利亚国家科学研究院) University of New South Wales(新南威尔士大学) Australian National University(澳大利亚国立大学)

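AI summary: This paper proposes SAGE, a framework that uses geometric and linguistic duality operations to enable self-evolving spatial reasoning in vision-language models, improving robustness and generalization on spatial reasoning tasks.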

Comments 23 pages, 7 figures, 3 tables

AI中文摘要

视觉语言模型(VLMs)在视觉和语言任务上取得了显著进展,但其空间推理能力仍然脆弱:能够正确回答原始输入的模型在面对具有可预测答案映射的配对变换时仍可能失败,揭示了实例级正确性与鲁棒空间推理之间的差距。为此,我们提出空间对齐通过几何演化(SAGE),一种自演化框架,通过几何和语言二元操作在VLMs中强制逻辑一致性。SAGE将二元一致性作为辅助奖励纳入GRPO训练,鼓励模型在原始和变换输入之间产生逻辑一致的答案。一个动态操作池持续探测不一致,促进具有挑战性的操作并淘汰已掌握的操作,使训练聚焦于最有信息量的信号。SAGE具有模型无关性,比先前的GRPO方法更数据高效,并可作为轻量级的后训练阶段应用于任何现有的VLM。在视频和空间推理基准上的实验表明,SAGE在强基线模型上表现一致提升,并增强了对未见数据的泛化能力。

英文摘要

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.
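A "paired transformation with a predictable answer mapping" can be made concrete with a horizontal image flip, under which "left" and "right" swap while "above" and "below" stay fixed. The sketch below scores whether a model's two answers respect that mapping and adds the score to a correctness reward. The flip operation, the answer vocabulary, and the 0.5 weight are all assumptions chosen for illustration; the abstract does not list SAGE's operations or reward weights.

```python
# Hypothetical answer mapping for one duality operation: a horizontal flip.
FLIP_MAP = {"left": "right", "right": "left", "above": "above", "below": "below"}

def duality_reward(answer_original, answer_transformed, answer_map=FLIP_MAP):
    """Return 1.0 if the transformed answer matches the mapped original answer."""
    expected = answer_map.get(answer_original)
    if expected is not None and answer_transformed == expected:
        return 1.0
    return 0.0

def total_reward(correct, answer_original, answer_transformed, weight=0.5):
    """Correctness reward plus a weighted consistency bonus (weight is an assumption)."""
    return float(correct) + weight * duality_reward(answer_original, answer_transformed)

consistent = total_reward(True, "left", "right")    # correct and consistent
inconsistent = total_reward(True, "left", "left")   # correct on the original only
```

The second case is the failure mode the abstract describes: the model answers the original input correctly but gives an answer on the flipped input that contradicts it, so it earns no consistency bonus.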

2605.18156 2026-05-19 cs.CV 版本更新

Semi-LAR: Semi-supervised Contrastive Learning with Linear Attention for Removal of Nighttime Flares

Semi-LAR: 基于线性注意力的半监督对比学习用于夜间光斑去除

Xiyu Zhu, Wei Wang, Kui Jiang, Zhengguo Li

发表机构 * School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China(武汉科技大学计算机科学与技术学院) Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China(哈尔滨工业大学计算机学院) SRO department, Institute for Infocomm Research, A*STAR, Singapore(新加坡资讯通信研究院SRO部门)

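AI summary: This paper proposes a semi-supervised contrastive learning framework that jointly addresses pseudo-label reliability and representation discrimination, mitigating error accumulation in nighttime flare removal. Experiments show that the framework is model-agnostic and improves performance.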

AI中文摘要

由于光斑瑕疵的大空间范围及其与场景结构的纠缠,光斑去除具有挑战性,而现有方法严重依赖大规模配对数据。我们提出了一种半监督光斑去除框架,通过联合处理伪标签可靠性与表征歧视性,实现了从未标记图像中稳定学习。我们提出了一种自适应伪标签存储库,通过无参考质量评估、动量更新和无效标签过滤逐步细化伪监督,有效缓解了误差累积。此外,我们提出了一种光斑感知的对比损失,明确将受光斑污染的输入视为负样本,并进行基于块的对比学习,鼓励表征在区分光斑模式的同时保持与可靠伪目标的一致性。在多个光斑基准上的广泛实验表明,所提出的框架具有模型无关性,并且在性能和鲁棒性方面均表现出一致的提升。

英文摘要

Lens flare removal is challenging due to the large spatial extent of flare artifacts and their entanglement with scene structures, while existing methods heavily rely on large-scale paired data. We propose a semi-supervised flare removal framework that enables stable learning from unlabeled images by jointly addressing pseudo-label reliability and representation discrimination. We propose an adaptive pseudo-label repository that progressively refines pseudo supervision through no-reference quality assessment, momentum-based updates, and invalid label filtering, effectively mitigating error accumulation. Moreover, we propose a flare-aware contrastive loss that explicitly treats flare-contaminated inputs as negatives and performs patch-level contrastive learning, encouraging representations that are discriminative against flare patterns while remaining consistent with reliable pseudo targets. Extensive experiments on multiple flare benchmarks demonstrate that the proposed framework is model-agnostic and consistently improves performance and robustness.

2605.12451 2026-05-19 cs.CV 版本更新

FuTCR: Future-Targeted Contrast and Repulsion for Continual Panoptic Segmentation

FuTCR: 未来目标对比与排斥用于持续全景分割

Nicholas Ikechukwu, Keanu Nichols, Deepti Ghadiyaram, Bryan A. Plummer

发表机构 * Boston University(波士顿大学)

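AI summary: This paper proposes FuTCR, a framework that restructures representations to address the difficulty of distinguishing newly added categories from background in continual panoptic segmentation, using pixel-to-region contrast and repulsion to improve new-class performance. Across six CPS settings and a range of dataset sizes, FuTCR improves relative new-class panoptic quality over the state of the art by up to 28% while preserving or improving base-class performance.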

Comments Revised author affiliation

AI中文摘要

持续全景分割(CPS)需要能够快速适应新类别的方法。由于该密集预测任务的性质,训练图像可能包含标记和未标记的对象。由于对这些未标记对象一无所知,现有方法通常在训练时将任何未标记像素归为一个'背景'类别。实际上,在训练过程中,它们反复告诉模型所有不同的背景类别都是相同的(即使它们不是)。这使得学习区分随着新增的背景类别而变得具有挑战性,因为这些新类别可能需要使用模型之前被告诉不重要并被忽略的信息。因此,我们提出了一种面向未来的对比和排斥(FuTCR)框架,通过在新类别引入之前重构表示来解决这一限制。FuTCR首先通过将模型预测的掩码中像素始终被分类为背景但表现出非背景logits的区域进行分组,发现自信的未来样区域。接着,FuTCR通过像素到区域对比从这些未标记区域构建连贯的原型,同时同时排斥背景特征远离已知类原型,以显式保留代表空间给未来类别。在六个CPS设置和多种数据集规模的实验中,FuTCR在相对新类别全景质量上比最先进的方法提升高达28%,同时保持或提升基础类性能,提升幅度高达4%。

英文摘要

Continual Panoptic Segmentation (CPS) requires methods that can quickly adapt to new categories over time. The nature of this dense prediction task means that training images may contain a mix of labeled and unlabeled objects. As nothing is known about these unlabeled objects a priori, existing methods often simply group any unlabeled pixel into a single "background" class during training. In effect, during training, they repeatedly tell the model that all the different background categories are the same (even when they aren't). This makes learning to identify different background categories as they are added challenging since these new categories may require using information the model was previously told was unimportant and ignored. Thus, we propose a Future-Targeted Contrastive and Repulsive (FuTCR) framework that addresses this limitation by restructuring representations before new classes are introduced. FuTCR first discovers confident future-like regions by grouping model-predicted masks whose pixels are consistently classified as background but exhibit non-background logits. Next, FuTCR applies pixel-to-region contrast to build coherent prototypes from these unlabeled regions, while simultaneously repelling background features away from known-class prototypes to explicitly reserve representational space for future categories. Experiments across six CPS settings and a range of dataset sizes show FuTCR improves relative new-class panoptic quality over the state-of-the-art by up to 28%, while preserving or improving base-class performance with gains up to 4%.

2605.12413 2026-05-19 cs.CV 版本更新

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

超越定位:从全景图像中对MLLMs的视角条件空间推理进行综合诊断

Yuangong Chen, Wai Keung Wong, Jiaxing Li, Ioannis Patras, Xu Zheng

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Guangzhou University(广州大学) Queen Mary University of London(伦敦玛丽女王大学) HKUST(Guangzhou)(香港科技大学(广州))

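AI summary: This paper studies the spatial reasoning of multimodal large language models (MLLMs) under changing viewpoints. It introduces the PCSR-Bench benchmark, evaluates 14 representative MLLMs across tasks, reveals a substantial gap in spatial reasoning, and explores how far reinforcement learning can close it.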

Comments 10 pages, 4 figures

AI中文摘要

多模态大语言模型(MLLMs)在视觉感知方面表现出色,但在视角变化下的空间推理能力有限。本文将这一挑战定义为视角条件空间推理(PCSR),研究了在360度全景图像中,广阔的场景覆盖减少了部分观察的歧义,但并不消除视角依赖推理的必要性。为了评估这一能力,我们引入了PCSR-Bench,这是一个包含来自2600张全景图像、26种室内环境的84,373个问题-答案对的诊断基准。PCSR-Bench包含八个任务,涵盖基础感知(如物体计数、相对距离和相对方向)和高级PCSR,包括组合链、以自身为中心的旋转、视角重新锚定、自身扭曲和有限视角可见性。我们评估了14种代表性MLLMs,并观察到显著的感知-推理差距:在基础相对方向任务上准确率达到57.59%,但在以自身为中心的旋转任务上降至13.49%,在以自身为中心的扭曲任务上降至7.13%,在开放性组合推理任务上降至0.64%。为了探索这一差距的可塑性,我们对一个7B规模的模型进行了基于强化学习的诊断研究。奖励塑造将匹配的7B基线从31.10%提升到60.06%,表明PCSR是部分可塑性而非完全不可变。然而,这些收益是任务选择性的,对奖励设计敏感,包括权重分配和奖励制定,并在一定程度上依赖于评估协议。这些结果将PCSR定位为当前MLLMs的关键瓶颈,并突显了在有针对性优化下的有限但有意义的恢复空间。

英文摘要

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on ego-distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partially plastic rather than fully immutable. Still, the gains are task-selective, sensitive to reward design (both weight allocation and reward formulation), and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.

2605.11881 2026-05-19 cs.CV 版本更新

Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data

从异构多视图数据学习子空间保持的稀疏注意力图

Jie Chen, Yuanbiao Gou, Chuanbin Liu, Zhu Wang, Xi Peng

发表机构 * College of Computer Science(计算机科学学院) School of Economics and Management(经济管理学院) China University of Petroleum (Beijing)(中国石油大学(北京)) School of Artificial Intelligence(人工智能学院)

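AI summary: This paper proposes SAGL, a sparse attention graph learning method that learns subspace-preserving sparse attention graphs from heterogeneous multiview data to achieve semantic alignment across heterogeneous views. It combines bilinear attention factorization and a dynamic sparsity gating mechanism with α-entmax to generate sparse attention graphs, and supports the method with theoretical analysis and experiments.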

Comments 18 pages

AI中文摘要

从大规模未标记数据中通过各种预训练模型提取的高维特征被称为异构多视图数据。现有大多数无监督迁移学习方法在利用多视图互补信息时无法忠实恢复内在子空间结构。因此,构建保持这些底层子空间结构的稀疏相似度图以实现跨异构视图的语义对齐是一个基本挑战。本文提出了一种稀疏注意力图学习(SAGL)方法,从异构多视图数据中学习子空间保持的稀疏注意力图。具体而言,我们引入了一种双线性注意力因子分解方案,以捕捉高维特征之间的不对称相似性,从而突破传统表示学习技术中的对称瓶颈。随后,动态稀疏门控机制预测一个特征特定的压缩因子,以适应性地控制邻居的拓扑贡献。此外,我们采用α-entmax生成子空间保持的稀疏注意力图以生成单个视图的图。SAGL利用这些视图特定的图进行稀疏信息聚合,产生用于多视图学习任务的判别表示。此外,我们还提供了一种严谨的理论分析,将可微稀疏注意力与概率单纯形约束联系起来。在多个基准数据集上的广泛实验表明,SAGL在很大程度上优于最先进的无监督迁移学习方法。

英文摘要

The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace structures when exploiting complementary information across multiple views. Therefore, a fundamental challenge involves constructing sparse similarity graphs that preserve these underlying subspace structures for achieving semantic alignment across heterogeneous views. In this paper, we propose a sparse attention graph learning (SAGL) method that learns subspace-preserving sparse attention graphs from heterogeneous multiview data. Specifically, we introduce a bilinear attention factorization scheme to capture asymmetric similarities among the high-dimensional features, which breaks the symmetry bottleneck that is inherent in the traditional representation learning techniques. A dynamic sparsity gating mechanism then predicts a feature-specific compression factor for adaptively controlling the topological contributions of neighbors. Furthermore, we employ a structured sparse projection via α-entmax to generate subspace-preserving sparse attention graphs for individual views. SAGL leverages these view-specific graphs to conduct sparse information aggregation, yielding discriminative representations for multiview learning tasks. In addition, we provide a rigorous theoretical analysis that bridges differentiable sparse attention and probability simplex constraints. Extensive experiments conducted on multiple benchmark datasets demonstrate that SAGL consistently outperforms the state-of-the-art unsupervised transfer learning approaches.
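To show how a projection onto the probability simplex yields exactly sparse attention weights, the sketch below implements sparsemax, which is the special case of α-entmax at α = 2. This is an illustration only: the abstract does not state which α SAGL uses or how the predicted compression factor enters, so neither is modeled here.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector onto the probability simplex.

    Equivalent to alpha-entmax with alpha = 2. Unlike softmax, low-scoring
    entries receive exactly zero weight.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1.0 + k * z_sorted > cumsum    # True for entries kept in the support
    k_max = k[support][-1]                   # size of the support
    tau = (cumsum[k_max - 1] - 1.0) / k_max  # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

# One row of attention scores between a sample and three candidate neighbors.
weights = sparsemax([1.0, 0.8, 0.1])
```

Applied row by row to a score matrix, this produces a graph in which each node keeps only a few neighbors with nonzero weight, and each row still sums to one.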

2605.11710 2026-05-19 cs.LG cs.CV 版本更新

Unlocking Compositional Generalization in Continual Few-Shot Learning

解锁持续少样本学习中的组合泛化

Phu-Quy Nguyen-Lam, Phu-Hoa Pham, Dao Sy Duy Minh, Chi-Nguyen Tran, Huynh Trung Kiet, Long Tran-Thanh

发表机构 * Faculty of Information Technology, University of Science, Vietnam National University(信息科技学院,科学大学,越南国家大学) Department of Computer Science, University of Warwick(计算机科学系,沃里克大学)

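AI summary: This paper proposes a new paradigm for continual few-shot learning that strictly decouples representation learning from compositional inference, enabling efficient generalization to novel concepts and achieving state-of-the-art performance on multiple benchmarks.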

Comments 10 pages

AI中文摘要

基于对象的表示方法在少样本学习中具有关键属性:而不是将场景视为单一单元,模型可以将其分解为个体对象级别的部分,这些部分可以在不同概念之间进行匹配和比较。在实践中,这种潜力很少被实现。持续学习者要么将场景压缩成全局嵌入,要么通过部分级匹配目标进行训练,这使表示过于紧密地依赖于已见过的模式,从而无法泛化到真正的新概念。在本文中,我们识别出这种根本性的结构冲突,并开创了一种新的范式,严格解耦表示学习与组合推理。利用自监督视觉变换器(ViTs)固有的片段级语义几何,我们的框架采用双阶段策略。在训练期间,槽表示完全优化为整体类别身份,保留高度可泛化的对象级几何结构。在推理期间,保留的槽被动态组合以匹配新场景。我们证明了这种范式提供了双重结构优势:冻结的主干自然防止了表示漂移,而我们的轻量级、整体优化保持了特征对新概念转移的能力。广泛的实验验证了这种方法,在标准持续学习基准中实现了最佳的未见概念泛化和最小的遗忘。

英文摘要

Object-centric representations promise a key property for few-shot learning: Rather than treating a scene as a single unit, a model can decompose it into individual object-level parts that can be matched and compared across different concepts. In practice, this potential is rarely realized. Continual learners either collapse scenes into global embeddings, or train with part-level matching objectives that tie representations too closely to seen patterns, leaving them unable to generalize to truly novel concepts. In this paper, we identify this fundamental structural conflict and pioneer a new paradigm that strictly decouples representation learning from compositional inference. Leveraging the inherent patch-level semantic geometry of self-supervised Vision Transformers (ViTs), our framework employs a dual-phase strategy. During training, slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries. At inference, preserved slots are dynamically composed to match novel scenes. We demonstrate that this paradigm offers dual structural benefits: The frozen backbone naturally prevents representation drift, while our lightweight, holistic optimization preserves the features' capacity for novel-concept transfer. Extensive experiments validate this approach, achieving state-of-the-art unseen-concept generalization and minimal forgetting across standard continual learning benchmarks.

2605.11654 2026-05-19 cs.CV cs.AI cs.RO 版本更新

Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

通过基于原型的语义部分发现实现抗天气的跨视角地理定位

Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Long Tran-Thanh

发表机构 * Faculty of Information Technology, University of Science, Vietnam National University(信息技术学院,科学大学,越南国家大学) Department of Computer Science, University of Warwick(计算机科学系,沃里克大学)

AI总结 本文提出SkyPart,一种面向基于补丁的视觉变换器的轻量级可替换头,在补丁网格上进行显式的部件分组。SkyPart包含四个有理论依据的组件:(i)可学习原型通过单次余弦分配竞争补丁标记;(ii)仅在训练期间使用的海拔条件线性调制,使检索嵌入在推理时不依赖海拔;(iii)在活跃原型上的图注意力读出;(iv)Kendall不确定性加权的多目标损失,其驻点是帕累托驻点。SkyPart仅有26.95M参数和22.14 GFLOPs,是性能领先方法中最小的,并在SUES-200、University-1652和DenseUAV上取得了新的最先进结果。在包含十种条件的WeatherPrompt损坏基准下,其相对最强基线的优势进一步扩大。

Comments 37 pages, 7 figures, 6 tables

详情
AI中文摘要

跨视角地理定位(CVGL),即将倾斜的无人机视角图像与带地理参考的卫星图块进行匹配,已成为GNSS信号被干扰、欺骗或不可用时无人机自主导航的关键替代方案。尽管近年来进展显著,仍存在三个局限:(1)全局描述符设计将补丁网格压缩成单个向量,没有在视角差异下把布局与纹理分离;(2)与海拔相关的尺度变化被保留在学习到的嵌入中,而没有被边缘化;(3)多目标训练依赖人工调节的标量权重来组合梯度尺度互不兼容的损失。我们提出SkyPart,一种面向基于补丁的视觉变换器(ViTs)的轻量级可替换头,在补丁网格上进行显式的部件分组。SkyPart包含四个有理论依据的组件:(i)可学习原型通过单次余弦分配竞争补丁标记;(ii)仅在训练期间使用的海拔条件线性调制,使检索嵌入在推理时不依赖海拔;(iii)在活跃原型上的图注意力读出;(iv)Kendall不确定性加权的多目标损失,其驻点是帕累托驻点。SkyPart仅有26.95M参数和22.14 GFLOPs,是性能领先方法中最小的,并在单次前向、无重排序、无TTA的协议下,在SUES-200、University-1652和DenseUAV上取得了新的最先进结果。在包含十种条件的WeatherPrompt损坏基准下,其相对最强基线的优势进一步扩大。

英文摘要

Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
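
Component (iv) can be illustrated with a minimal sketch of Kendall-style uncertainty weighting. This is an illustration under stated assumptions, not SkyPart's implementation: the function name `uncertainty_weighted_loss`, the parameterization s_i = log(sigma_i^2), and the regression-style form 0.5 * exp(-s_i) * L_i + 0.5 * s_i are choices made here, since the abstract does not give the exact loss.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine several objectives with learnable uncertainty weights.

    losses   : per-objective loss values L_i (hypothetical inputs)
    log_vars : one learnable s_i = log(sigma_i^2) per objective
    Returns sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per loss")
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(losses, log_vars))


# With all s_i = 0, every loss simply gets weight 0.5.
total = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
```

For a fixed positive loss value L_i, the corresponding term is minimized at s_i = log(L_i), so objectives on larger scales are down-weighted by the optimizer rather than by a hand-tuned scalar.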

2604.03212 2026-05-19 cs.CV 版本更新

ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow

ProtoFlow: 通过低曲率原型流缓解类别增量遥感分割中的遗忘

Jiekai Wu, Rong Fu, Chuangqi Li, Zijian Zhang, Guangxin Wu, Hao Zhang, Shiyin Lin, Jianyuan Ni, Yang Li, Dongxu Zhang, Amir H. Gandomi, Simon Fong, Pengbin Feng

发表机构 * Faculty of Health Data Science, Juntendo University(顺天堂大学健康数据科学学院) The Institute of Collaborative Innovation, University of Macau(澳门大学协同创新研究所) Department of Information and Computing Sciences, Faculty of Science, Utrecht University(乌得勒支大学科学学院信息与计算科学系) Department of Computer and Information Science, University of Pennsylvania(宾夕法尼亚大学计算机与信息科学系) School of Computer Science, University of Chinese Academy of Sciences(中国科学院大学计算机科学学院) Department of Computer & Information Science & Engineering, University of Florida(佛罗里达大学计算机与信息科学与工程系) Department of Computer Science, Juniata College(朱尼塔学院计算机科学系) National Engineering Research Center for Beijing Biochip Technology(北京生物芯片工程技术研究中心) CapitalBio Corporation(博奥生物) Faculty of Engineering & Information Technology, University of Technology Sydney(悉尼科技大学工程与信息技术学院) University Research and Innovation Center (EKIK), Obuda University(欧布达大学研究与创新中心(EKIK)) Faculty of Science and Technology, University of Macau(澳门大学科学与技术学院) Department of Mathematics, University of Southern California(南加州大学数学系)

AI总结 本文提出ProtoFlow,一种时间感知的原型动态框架,通过将类别原型建模为轨迹并学习其演变,以缓解遥感分割中的遗忘问题,实验表明其在多个基准上取得了显著提升。

详情
AI中文摘要

遥感分割在实际部署中本质上是持续的:新的语义类别不断出现,采集条件也随季节、城市和传感器而变化。尽管已有进展,许多增量方法仍将各训练步骤视为孤立的更新,导致表示漂移和遗忘得不到充分控制。我们提出了ProtoFlow,一种时间感知的原型动态框架,将类别原型建模为轨迹,并通过显式的时间向量场学习其演变。通过同时约束低曲率运动和类间分离,ProtoFlow在整个增量学习过程中稳定了原型几何。在标准的类别增量和领域增量遥感基准上的实验表明,ProtoFlow相对强基线取得了一致的提升,mIoUall最高提升1.5-2.0个百分点,同时减少了遗忘。这些结果表明,显式建模原型随时间的演变是实现鲁棒持续遥感分割的一种实用且可解释的策略。开源代码:https://github.com/dudududke/protoflow

英文摘要

Remote sensing segmentation in real deployment is inherently continual: new semantic categories emerge, and acquisition conditions shift across seasons, cities, and sensors. Despite recent progress, many incremental approaches still treat training steps as isolated updates, which leaves representation drift and forgetting insufficiently controlled. We present ProtoFlow, a time-aware prototype dynamics framework that models class prototypes as trajectories and learns their evolution with an explicit temporal vector field. By jointly enforcing low-curvature motion and inter-class separation, ProtoFlow stabilizes prototype geometry throughout incremental learning. Experiments on standard class- and domain-incremental remote sensing benchmarks show consistent gains over strong baselines, including improvements of up to 1.5-2.0 points in mIoUall, together with reduced forgetting. These results suggest that explicitly modeling temporal prototype evolution is a practical and interpretable strategy for robust continual remote sensing segmentation. Open-source code: https://github.com/dudududke/protoflow
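
One common way to express a low-curvature constraint on a trajectory is a second-order finite-difference penalty. The sketch below assumes each prototype is stored as a plain list of floats, one per incremental step. `curvature_penalty` is a hypothetical helper, and ProtoFlow's actual regularizer, which is defined through its temporal vector field, may well differ.

```python
def curvature_penalty(trajectory):
    """Mean squared second difference of a prototype trajectory.

    trajectory : list of prototype vectors, one per incremental step
                 (hypothetical layout: each vector is a list of floats)
    A trajectory that moves in a straight line at constant speed scores 0.
    """
    if len(trajectory) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
        # Discrete acceleration: p_{t+1} - 2 p_t + p_{t-1}
        total += sum((n - 2.0 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
    return total / (len(trajectory) - 2)


straight = curvature_penalty([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
bent = curvature_penalty([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
```

A straight, evenly paced path incurs no penalty, while a sharp turn between steps is penalized, which is the kind of behavior a low-curvature prior is meant to encourage.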

2604.02060 2026-05-19 cs.CV cs.RO 版本更新

CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

CompassAD: 功能相互竞争的物体中意图驱动的3D可供性(affordance)定位

Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou, Xiangyu Chen, Shilin Shan, Boyu Ma, Bofan Lyu, Gen Li, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University, Singapore(MARS实验室,南洋理工大学,新加坡)

AI总结 该研究提出了一种新的3D可供性(affordance)任务设定,即意图驱动的易混淆可供性定位:根据隐含的自然语言意图,在多物体点云中为正确的物体预测逐点可供性掩码。研究构建了CompassAD基准并提出CompassNet框架,在已见和未见查询上均取得最先进的结果,并通过机械臂部署验证了其在真实世界抓取中的有效性。

详情
AI中文摘要

当被要求“切蛋糕”时,机器人必须选择刀而不是旁边的剪刀,尽管两者都具备切割功能。在真实场景中,多个物体可能具有相同的可供性(affordance),但在给定的任务语境下只有一个是合适的。我们称这类情况为易混淆对。然而,现有的3D可供性方法大多回避了这一挑战:它们只评估孤立的单个物体,且查询中往往直接给出类别名称。我们形式化定义了意图驱动的易混淆可供性定位(Intent-Driven Confusable Affordance Grounding),这是一种新的3D可供性任务设定,要求根据隐含的自然语言意图,在多物体点云中为正确的物体预测逐点可供性掩码。为研究该问题,我们构建了CompassAD,这是首个聚焦于易混淆多物体组合中隐含意图的基准。它包含30个易混淆物体对,涵盖16种可供性类型、6,422个组合以及88K+个查询-答案对。此外,我们提出了CompassNet,一个包含两个针对该任务设计的专用模块的框架。实例边界交叉注入(ICI)将语言-几何对齐约束在物体边界之内,以防止跨物体的语义泄漏。双层对比细化(BCR)在几何组和点两个层面上强化区分,使目标表面与易混淆表面之间的差异更加清晰。大量实验表明,该方法在已见和未见查询上均取得了最先进的结果,在机械臂上的部署也证实其能有效迁移到易混淆多物体组合中的真实世界抓取。

英文摘要

When told to "cut the cake," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Intent-Driven Confusable Affordance Grounding, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusing multi-object compositions. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 compositions, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object compositions.

2604.00634 2026-05-19 cs.RO cs.CV 版本更新

LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics

LiPS: 为资源受限机器人设计的轻量级全景分割

Calvin Galagain, Martyna Poreba, François Goulette, Cyrill Stachniss

发表机构 * Université Paris-Saclay, CEA LIST(巴黎-萨克雷大学,CEA LIST) U2IS, ENSTA, Institut Polytechnique de Paris(U2IS、ENSTA、巴黎理工学院) University of Bonn, Center for Robotics(波恩大学,机器人中心)

AI总结 本文提出LiPS,一种轻量级全景分割方法,在保留基于查询的解码的同时简化特征提取和融合路径,显著降低计算需求,以远高于重型模型的吞吐量取得与之相当的精度。

Comments Accepted to IEEE International Conference on Image Processing (ICIP) 2026, Paper #2070

详情
AI中文摘要

全景分割将语义理解与对象级推理统一起来,是机器人感知的关键支撑技术。然而,最先进模型的复杂度不断增加,使其不适合部署在移动机器人等资源受限的平台上。我们提出一种名为LiPS的新方法来应对高效全景分割的挑战:其轻量级设计保留了基于查询的解码,同时引入精简的特征提取和融合路径,旨在大幅降低计算需求的同时提供强大的全景分割性能。在标准基准上的评估表明,LiPS的精度与重得多的基线相当,同时吞吐量(以每秒帧数衡量)最高提升4.5倍,计算量减少近6.8倍。这种效率使LiPS成为连接现代全景模型与真实世界机器人应用的重要桥梁。

英文摘要

Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose a novel approach called LiPS that addresses the challenge of efficient-to-compute panoptic segmentation with a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims to provide strong panoptic segmentation performance while substantially lowering the computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines, while providing up to 4.5 times higher throughput, measured in frames per second, and requiring nearly 6.8 times fewer computations. This efficiency makes LiPS a highly relevant bridge between modern panoptic models and real-world robotic applications.

2603.23194 2026-05-19 cs.GR cs.CV cs.LG 版本更新

PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning

PhysSkin: 通过自监督神经蒙皮实现实时且可泛化的基于物理的动画

Yuanhang Lei, Tao Cheng, Xingxuan Li, Boming Zhao, Siyuan Huang, Ruizhen Hu, Peter Yichen Chen, Hujun Bao, Zhaopeng Cui

发表机构 * State Key Laboratory of CAD&CG(CAD与计算机图形学国家重点实验室) BIGAI Shenzhen University(深圳大学) University of British Columbia(不列颠哥伦比亚大学)

AI总结 本文提出PhysSkin框架,通过自监督学习策略,在多样的3D形状和离散化形式上实现实时的基于物理的动画,其核心是神经蒙皮场自动编码器和物理感知的学习策略。

Comments Accepted by CVPR 2026 Highlight. Project Page: https://zju3dv.github.io/PhysSkin/

详情
AI中文摘要

实现能够在多样的3D形状和离散化形式之间泛化的实时基于物理的动画,仍然是一项根本性挑战。我们提出PhysSkin,一个融入物理信息的框架来应对这一挑战。借鉴线性混合蒙皮(Linear Blend Skinning)的思想,我们学习连续的蒙皮场作为基函数,将运动子空间坐标提升为全空间形变,其中子空间由控制柄(handle)变换定义。为了生成无网格、与离散化无关、物理一致且能在多样3D形状上良好泛化的蒙皮场,PhysSkin采用了一种新的神经蒙皮场自动编码器,由基于Transformer的编码器和交叉注意力解码器组成。此外,我们还提出了一种新的物理感知自监督学习策略,结合了即时(on-the-fly)蒙皮场归一化和冲突感知的梯度校正,从而有效平衡能量最小化、空间平滑性和正交性约束。PhysSkin在可泛化的神经蒙皮上表现出色,并实现了实时的基于物理的动画。

英文摘要

Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer-based encoder and a cross-attention decoder. Furthermore, we also develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints. PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.
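
The Linear Blend Skinning idea that the abstract builds on can be sketched in a few lines: a point is deformed by a weighted sum of handle transformations, x' = sum_j w_j(x) (A_j x + t_j). The 2D setting, the function name `linear_blend_skinning`, and the hand-supplied weights are assumptions for illustration. In PhysSkin the weights would come from the learned skinning field, which is not reproduced here.

```python
def linear_blend_skinning(point, weights, transforms):
    """Deform a 2D point as x' = sum_j w_j * (A_j x + t_j).

    point      : (x, y)
    weights    : skinning weights w_j for this point, one per handle
    transforms : list of (A, t) handle transformations, with A a 2x2
                 nested list and t a length-2 list
    """
    x, y = point
    out = [0.0, 0.0]
    for w, (A, t) in zip(weights, transforms):
        out[0] += w * (A[0][0] * x + A[0][1] * y + t[0])
        out[1] += w * (A[1][0] * x + A[1][1] * y + t[1])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
handles = [(identity, [0.0, 0.0]),   # handle 0 stays put
           (identity, [2.0, 0.0])]   # handle 1 translates by +2 in x
deformed = linear_blend_skinning((1.0, 1.0), [0.5, 0.5], handles)
```

A point weighted equally between the two handles moves half of the translation, which is why the handle transformations act as low-dimensional subspace coordinates for the full deformation.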

2603.14936 2026-05-19 cs.CV 版本更新

Bridging the Intention-Expression Gap: Aligning Multi-Dimensional Preferences via Hierarchical Relevance Feedback in Text-to-Image Diffusion

弥合意图-表达鸿沟:在文本到图像扩散模型中通过层次化相关反馈对齐多维偏好

Wenxi Wang, Hongbin Liu, Mingqian Li, Junyan Yuan, Junqi Zhang

发表机构 * Tongji University(同济大学)

AI总结 本文提出一种层次相关反馈驱动框架,通过在文本到图像扩散模型中对齐多维特征,解决用户意图与表达之间的鸿沟问题,提升模型对多维偏好的识别能力。

详情
AI中文摘要

用户往往具有明确的视觉意图,但难以用语言准确表达。这种意图-表达鸿沟使得在文本到图像扩散模型中对齐生成图像与潜在视觉偏好成为基本挑战。现有方法要么需要模型训练,牺牲灵活性,要么依赖文本反馈,加重认知负担。尽管最近的无训练方法使用基于点击的二元偏好反馈来减少用户努力,但它们迫使基础模型(FMs)在语义层面推断偏好。当面对多维偏好时,FMs会受到推断过载的影响,并且无法在冲突的用户信号下识别出确切的首选特征值。因此,一种灵活的多维特征对齐框架仍然缺失。为了解决这个问题,我们提出了一个层次相关反馈驱动(HRFD)框架。认识到多个特征难以同时收敛,HRFD将它们组织成三级层次,并适应相关反馈以强制粗到细的收敛,从而减少认知负担。为了绕过FM推断过载,HRFD将过程分解为独立的单特征偏好推断任务。此外,为了克服FM在识别首选值上的失败,HRFD采用统计推断来量化“喜欢”和“不喜欢”图像集之间特征分布差异,实现稳健且透明的偏好测量。关键的是,HRFD完全在外部文本空间中运行,严格无训练且模型无关。广泛的实验表明,HRFD能够有效捕捉用户的真正视觉意图,显著优于基线方法。

英文摘要

Users often possess a clear visual intent but struggle to articulate it precisely in language. This intention-expression gap makes aligning generated images with latent visual preferences a fundamental challenge in text-to-image diffusion models. Existing methods either require model training, sacrificing flexibility, or rely on textual feedback, imposing a heavy cognitive burden. Although recent training-free methods use click-based binary preference feedback to reduce user effort, they force Foundation Models (FMs) to infer preferences at the semantic level. When faced with multi-dimensional preferences, FMs suffer from inference overload and fail to identify exact preferred feature values under conflicting user signals. Consequently, a flexible framework for multi-dimensional feature alignment remains absent. To address this, we propose a Hierarchical Relevance Feedback-Driven (HRFD) framework. Recognizing that multiple features struggle to converge simultaneously, HRFD organizes them into a three-tier hierarchy and adapts relevance feedback to enforce coarse-to-fine convergence, minimizing cognitive load. To bypass FM inference overload, HRFD decouples the process into independent single-feature preference inference tasks. Furthermore, to overcome FMs' failure in identifying preferred values, HRFD employs statistical inference to quantify the distribution divergence of features between "liked" and "disliked" image sets, achieving robust and transparent preference measurement. Crucially, HRFD operates entirely within the external text space, remaining strictly training-free and model-agnostic. Extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches.
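
The single-feature preference inference step can be pictured as comparing how often each feature value occurs among "liked" versus "disliked" images. The sketch below is an assumption, not HRFD's statistic: the abstract does not name the divergence it uses, so a frequency difference and total variation distance are swapped in here, and the names `preference_scores` and `total_variation` are hypothetical.

```python
from collections import Counter


def preference_scores(liked, disliked):
    """Per-value frequency difference between liked and disliked sets.

    liked, disliked : lists of values of ONE feature (e.g. color tone),
                      one entry per image the user clicked
    Positive score: the value is over-represented among liked images.
    """
    liked_counts, disliked_counts = Counter(liked), Counter(disliked)
    n_liked, n_disliked = max(len(liked), 1), max(len(disliked), 1)
    values = set(liked) | set(disliked)
    return {v: liked_counts[v] / n_liked - disliked_counts[v] / n_disliked
            for v in values}


def total_variation(liked, disliked):
    """Total variation distance between the two empirical distributions."""
    return 0.5 * sum(abs(s) for s in preference_scores(liked, disliked).values())


scores = preference_scores(["warm", "warm", "cool"], ["cool", "cool", "warm"])
preferred = max(scores, key=scores.get)
```

Handling one feature at a time keeps each inference task small. A feature whose liked and disliked distributions barely differ (low divergence) carries little evidence about the user's preference.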

2603.13708 2026-05-19 cs.CV 版本更新

RSEdit: Text-Guided Image Editing for Remote Sensing

RSEdit:面向遥感的文本引导图像编辑

Chen Zhenyuan, Zhang Zechuan, Zhang Feng

发表机构 * School of Earth Sciences, Zhejiang University(浙江大学地球科学学院) Zhejiang Provincial Key Laboratory of Geographic Information Science(浙江省地理信息科学重点实验室) Key Laboratory of Spatio-temporal Information and Intelligent Services (LSIIS), Ministry of Natural Resources of the People’s Republic of China(中华人民共和国自然资源部时空信息与智能服务重点实验室) ReLER, CCAI, Zhejiang University(浙江大学CCAI ReLER实验室)

AI总结 本文提出RSEdit,一种基于生成模型的遥感图像编辑方法,通过研究文本到图像模型的条件策略,实现了在保持地理空间结构的同时,生成指令忠实的图像编辑结果。

Comments accepted by IEEE GRSL

详情
AI中文摘要

在本文中,我们探索利用生成模型在遥感领域进行文本引导的图像编辑。我们提出RSEdit,一组从U-Net到DiT、具有多种配置的模型。具体来说,我们首次全面研究了基于现成文本到图像模型构建图像编辑模型时的条件策略。实验表明,RSEdit在保持地理空间结构的同时,实现了最忠实于指令的编辑效果。我们发布了代码和检查点。

英文摘要

In this paper, we explore text-guided image editing in the remote sensing domain using generative modeling. We propose RSEdit, a collection of models from U-Net to DiT with various configurations. Specifically, we present the first comprehensive study of conditioning strategies for building image editing models from off-the-shelf text-to-image ones. Our experiments show that RSEdit achieves the best instruction-faithful edits while preserving geospatial structure. We release the code at https://github.com/Bili-Sakura/RSEdit-Preview and checkpoints at https://huggingface.co/collections/BiliSakura/rsedit.

2603.09668 2026-05-19 cs.CV 版本更新

DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics

DiffWind: 基于物理的可微风驱物体动力学建模

Yuanhang Lei, Boming Zhao, Zesong Yang, Xingxuan Li, Tao Cheng, Haocheng Peng, Ru Zhang, Yang Yang, Siyuan Huang, Yujun Shen, Ruizhen Hu, Hujun Bao, Zhaopeng Cui

发表机构 * State Key Laboratory of CAD & CG(CAD与计算机图形学国家重点实验室) State Key Laboratory of General Artificial Intelligence(通用人工智能国家重点实验室) Ant Group(蚂蚁集团) Shenzhen University(深圳大学)

AI总结 本文提出DiffWind,一种融入物理信息的可微框架,统一了风-物体相互作用建模、基于视频的重建和正向模拟。该方法将风表示为基于网格的物理场,将物体表示为由3D高斯泼溅(3D Gaussian Splatting)导出的粒子系统,并用物质点法(MPM)建模二者的相互作用,从而重建风驱物体的动力学。此外,本文还提出了WD-Objects数据集,大量实验表明该方法在重建精度和模拟保真度方面显著优于现有动态场景建模方法。

Comments Accepted by ICLR 2026. Project page: https://zju3dv.github.io/DiffWind/

详情
AI中文摘要

由于风不可见且具有时空变化性,加之物体形变复杂,从视频观测中建模风驱物体动力学极具挑战性。我们提出DiffWind,一种融入物理信息的可微框架,统一了风-物体相互作用建模、基于视频的重建和正向模拟。具体来说,我们将风表示为基于网格的物理场,将物体表示为由3D高斯泼溅(3D Gaussian Splatting)导出的粒子系统,二者的相互作用由物质点法(MPM)建模。为了恢复风驱物体动力学,我们引入了一个重建框架,通过可微渲染和可微模拟联合优化时空风力场和物体运动。为了保证物理有效性,我们引入格子玻尔兹曼方法(LBM)作为物理约束,使结果符合流体动力学规律。除重建之外,我们的方法还自然支持在新的风况下进行正向模拟,并支持风重定向(wind retargeting)等新应用。我们还提出了WD-Objects,一个包含合成与真实世界风驱场景的数据集。大量实验表明,我们的方法在重建精度和模拟保真度方面显著优于此前的动态场景建模方法,为基于视频的风-物体相互作用建模开辟了新途径。

英文摘要

Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.

2603.09405 2026-05-19 cs.CV 版本更新

YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search

YOLO-NAS-Bench: 一种具有自进化预测器的代理基准,用于YOLO架构搜索

Zhe Li, Xiaoyu Ding, Jiaxin Zheng, Yongtao Wang

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机研究所)

AI总结 本文提出YOLO-NAS-Bench,一种针对YOLO检测器的代理基准,通过自进化机制提升预测器的准确性,从而在YOLO架构搜索中实现高效评估。

Comments Accepted as Oral at CVPR 2026 Workshop on Neural Architecture Search (NAS)

详情
AI中文摘要

目标检测的神经架构搜索(NAS)受到高昂评估成本的严重制约,因为在COCO上完整训练每个候选YOLO架构需要数天的GPU时间。同时,现有NAS基准大多面向图像分类,检测领域缺少可比的NAS评估基准。为填补这一空白,我们提出YOLO-NAS-Bench,首个专门面向YOLO风格检测器的代理基准。YOLO-NAS-Bench定义的搜索空间涵盖骨干网络和颈部的通道宽度、块深度和算子类型,覆盖YOLOv8到YOLO12的核心模块。我们通过随机、分层和拉丁超立方策略采样1,000个架构,在COCO-mini上训练,并构建LightGBM代理预测器。为了提升预测器在与NAS最相关的高性能区域的精度,我们提出自进化机制:在每次迭代中用预测器自身发现并评估有信息量的架构,使预测器的训练分布逐步对齐高性能前沿。该方法将架构池扩充到1,500个,并将集成预测器的$R^2$从0.770提升至0.815,Sparse Kendall Tau从0.694提升至0.752,显示出很强的预测精度和排序一致性。将最终预测器用作进化搜索的适应度函数,我们发现了在COCO-mini上以相近延迟超越所有官方YOLOv8-YOLO12基线的架构,验证了预测器对高性能检测架构的判别能力。代码见 https://github.com/VDIGPKU/YOLO-NAS-Bench

英文摘要

Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor's training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor's $R^2$ from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor's discriminative power for top-performing detection architectures. The code is available at https://github.com/VDIGPKU/YOLO-NAS-Bench.
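
The two predictor-quality metrics quoted above can be made concrete with a small, dependency-free sketch. It computes the coefficient of determination and the plain Kendall tau (tau-a, no tie correction) between predicted and measured accuracies. The sparse variant reported in the abstract is not reproduced, and the sample numbers are made up for illustration rather than taken from the benchmark.

```python
def r2_score(pred, true):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, true))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot


def kendall_tau(pred, true):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (pred[i] - pred[j]) * (true[i] - true[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical measured vs. predicted mAP for four architectures.
measured = [0.30, 0.35, 0.40, 0.45]
predicted = [0.31, 0.34, 0.42, 0.41]
tau = kendall_tau(predicted, measured)   # one swapped pair out of six
r2 = r2_score(predicted, measured)
```

For architecture search the ranking metric matters most: a surrogate only needs to order candidates correctly, so a single swapped pair near the top of the ranking can be more costly than a uniform offset in predicted accuracy.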

2603.04727 2026-05-19 cs.CV cs.AI 版本更新

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

多模态大语言模型是否准备好用于监控?对零样本异常检测在现实中的检验

Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi

发表机构 * Electrical and Computer Engineering Department(电气与计算机工程系)

AI总结 本文研究了多模态大语言模型在现实中的零样本异常检测性能,发现其存在保守偏差,通过特定指令可以提升F1分数,但召回率仍是关键瓶颈。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视频理解方面展示了出色的通用能力,但其在现实中的视频异常检测(VAD)可靠性仍待探索。与传统依赖重建或姿态线索的流程不同,MLLMs实现了将异常检测视为语言引导推理任务的范式转变。本文通过将VAD重新表述为二分类任务,在弱时间监督下系统评估了最先进的MLLMs在ShanghaiTech和CHAD基准上的性能。我们研究了提示特异性及时间窗口长度(1s-3s)对性能的影响,重点分析精度-召回率的权衡。研究发现,在零样本设置中存在显著的保守偏差;尽管模型表现出高置信度,但倾向于选择'正常'类,导致高精度但召回率崩溃,限制了实际应用。我们证明,针对类别的特定指令可显著改变这一决策边界,使ShanghaiTech的峰值F1分数从0.09提升至0.64,但召回率仍是关键瓶颈。这些结果突显了MLLMs在嘈杂环境中的显著性能差距,并为未来在召回导向提示和模型校准方面的研究提供了基础,这对需要复杂视频理解和推理的开放世界监控任务提出了要求。

英文摘要

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
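
The conservative bias described above is easiest to see in the arithmetic of precision, recall, and F1. The counts in this sketch are invented for illustration and are not results from the paper. They only show how a model that almost always answers 'normal' can have perfect precision and still a very low F1.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts (anomaly = positive)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)


# Hypothetical conservative model: flags 5 of 100 anomalous clips, never wrongly.
conservative = precision_recall_f1(tp=5, fp=0, fn=95)

# Hypothetical model after more specific prompting: more hits, some false alarms.
prompted = precision_recall_f1(tp=60, fp=30, fn=40)
```

Because F1 is the harmonic mean of the two quantities, it is pulled toward the smaller one. This is why recall remains the bottleneck even when precision is high.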

2602.22941 2026-05-19 cs.CV 版本更新

Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

基于摇摄和变焦视频的皮划艇静水多人艇速度与桨频重建

Julian Ziegler, Daniel Matthes, Finn Gerdts, Patrick Frenzel, Torsten Warnke, Matthias Englert, Tina Koevari, Mirco Fuchs

发表机构 * Laboratory for Biosignal Processing, Leipzig University of Applied Sciences, Leipzig, Germany(生物信号处理实验室,莱比锡应用科学大学,莱比锡,德国) Research Group Canoeing, Institute for Applied Training Science (IAT), Leipzig, Germany(皮划艇研究组,应用训练科学研究所(IAT),莱比锡,德国) German Canoe Federation, Duisburg, Germany(德国皮划艇联合会,杜伊斯堡,德国)

AI总结 本文提出了一种基于摇摄和变焦视频重建皮划艇静水多人艇速度和桨频的方法:利用YOLOv8检测浮标和运动员,借助已知的浮标网格估计单应性矩阵,通过基于U-Net的船头校准估计船只位置,利用光流实现鲁棒跟踪,并提取桨频信息。实验结果表明,速度和桨频的MAPE分别为0.011和0.009,可提供高精度的自动化反馈。

详情
AI中文摘要

由速度和桨频曲线定义的配速策略,对皮划艇静水项目取得最佳表现至关重要。尽管GPS是分析的金标准,但其可用性有限,因此需要基于视频的自动化方案。本文提出了一个扩展框架,可从摇摄和变焦的视频中重建所有静水项目(K1-K4,C1-C2)和距离(200m-500m)的表现指标。我们的方法使用YOLOv8检测浮标和运动员,并利用已知的浮标网格估计单应性矩阵。我们通过基于U-Net的船头校准学习特定船型的运动员偏移量,从而推广了船只位置的估计。此外,我们利用光流实现了鲁棒的跟踪方案,以适应多人艇船型。最后,我们介绍了从姿态估计或运动员边界框本身提取桨频信息的方法。与精英赛事GPS数据的对比评估显示,速度的MAPE为0.011 [0.008 0.014](Spearman rho = 0.974),桨频的MAPE为0.009 [0.006 0.013](Spearman rho = 0.975)。这些方法无需传感器,只需极少的手动初始化工作,即可为教练提供高精度的自动化反馈。

英文摘要

Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalize the estimation of the boat position by learning a boat-specific athlete offset using a U-Net-based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity MAPE of 0.011 [0.008 0.014] (Spearman rho = 0.974) and a stroke rate MAPE of 0.009 [0.006 0.013] (Spearman rho = 0.975). The methods provide coaches with highly accurate, automated feedback with minimal manual initialization work required, and without requiring sensors.
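
The reported error metric is straightforward to compute once video-derived values are aligned with the GPS reference. The sketch below assumes, as the magnitude of the reported values suggests, that MAPE is expressed as a fraction rather than a percentage. The velocity samples are invented for illustration.

```python
def mape(reference, estimate):
    """Mean absolute percentage error, returned as a fraction (0.01 = 1%).

    reference : ground-truth samples (e.g. GPS velocity in m/s), all non-zero
    estimate  : video-derived samples aligned to the same time stamps
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - r) / abs(r)
               for r, e in zip(reference, estimate)) / len(reference)


# Hypothetical aligned velocity samples in m/s.
gps_velocity = [5.0, 4.0]
video_velocity = [5.05, 3.96]
velocity_error = mape(gps_velocity, video_velocity)   # each sample is 1% off
```

On this scale, a MAPE of about 0.01 corresponds to an average deviation of roughly one percent from the GPS reference.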

2602.21707 2026-05-19 eess.IV cs.CV cs.LG math.OC 版本更新

Learning spatially adaptive sparsity level maps for arbitrary convolutional dictionaries

学习任意卷积字典的空间自适应稀疏度图

Joshua Schulz, David Schote, Christoph Kolbitsch, Kostas Papafitsoros, Andreas Kofler

发表机构 * Physikalisch-Technische Bundesanstalt (PTB), Braunschweig and Berlin, Germany(德国联邦物理技术研究院(PTB),不伦瑞克和柏林,德国) School of Mathematical Sciences, Queen Mary University of London, UK(伦敦玛丽女王大学数学科学学院,英国)

AI总结 本文扩展了一种基于神经网络推断的空间自适应稀疏度图的图像重建方法:通过改进的网络设计和专门的训练策略,实现了滤波器排列不变性,并支持在推理时更换卷积字典;在低场MRI上展示了使用不同字典的优势。

Comments accepted for publication at ICIP 2026; differs from previous versions after a bugfix in one of the used packages; corresponds to the final camera-ready version submitted to the conference

详情
AI中文摘要

最先进的学习型重建方法通常依赖黑盒模块,尽管性能强大,其可解释性和鲁棒性仍受到质疑。本文以最近提出的一种图像重建方法为基础,该方法通过神经网络推断的空间自适应稀疏度图,将数据驱动的信息嵌入基于模型的卷积字典正则化中。借助改进的网络设计和专门的训练策略,我们扩展了该方法,使其具备滤波器排列不变性,并可在推理时更换卷积字典。我们将该方法应用于低场MRI,并与其他几种近期的深度学习方法进行了比较(包括在体数据上的比较),展示了使用不同字典的优势。我们进一步评估了该方法在分布内和分布外数据上测试时的鲁棒性。在分布外数据上测试时,所提方法受数据分布偏移的影响小于其他学习型方法,我们将此归因于其底层基于模型的重建组件降低了对训练数据的依赖。

英文摘要

State-of-the-art learned reconstruction methods often rely on black-box modules that, despite their strong performance, raise questions about their interpretability and robustness. Here, we build on a recently proposed image reconstruction method, which is based on embedding data-driven information into a model-based convolutional dictionary regularization via neural network-inferred spatially adaptive sparsity level maps. By means of improved network design and dedicated training strategies, we extend the method to achieve filter-permutation invariance as well as the possibility to change the convolutional dictionary at inference time. We apply our method to low-field MRI and compare it to several other recent deep learning-based methods, also on in vivo data, where the benefit of using a different dictionary is demonstrated. We further assess the method's robustness when tested on in- and out-of-distribution data. When tested on the latter, the proposed method suffers less from the data distribution shift compared to the other learned methods, which we attribute to its reduced reliance on training data due to its underlying model-based reconstruction component.
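
A spatially adaptive sparsity level map can be read as a per-pixel weight on an l1 penalty, whose proximal operator is soft-thresholding with a location-dependent threshold. The sketch below shows only that elementwise operator on flat lists. Using it inside an iterative solver, and treating the network output directly as the threshold map, are assumptions here. The paper's reconstruction pipeline is not reproduced.

```python
def soft_threshold(values, thresholds):
    """Elementwise soft-thresholding with a per-element threshold.

    values     : sparse-code coefficients (flattened feature map)
    thresholds : non-negative threshold per coefficient; in this sketch
                 it stands in for a spatially adaptive sparsity level map
    Solves argmin_z 0.5 * (z - v)^2 + t * |z| for every (v, t) pair.
    """
    out = []
    for v, t in zip(values, thresholds):
        magnitude = max(abs(v) - t, 0.0)
        out.append(magnitude if v >= 0 else -magnitude)
    return out


# A large threshold suppresses the third coefficient entirely.
codes = soft_threshold([3.0, -2.0, 0.5], [1.0, 0.5, 1.0])
```

Because the map only enters through the thresholds, the same map-predicting network can in principle be paired with a different set of filters. This is consistent with the abstract's claim that the dictionary can be exchanged at inference time.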

2602.08206 2026-05-19 cs.CV 版本更新

Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

基于地理推理的词汇无关遥感语义分割

Chufeng Zhou, Jian Wang, Xinyuan Liu, Xiaokang Zhang

发表机构 * School of Electronic Information, Wuhan University of Science and Technology(武汉科技大学电子信息学院) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications(北京邮电大学网络与交换技术国家重点实验室) Oriental Space Port Research Institute(东方航天港研究院) School of Artificial Intelligence, Wuhan University(武汉大学人工智能学院)

AI总结 本文提出了一种基于地理推理的词汇无关遥感语义分割框架GR-CoT,通过离线知识蒸馏流和在线实例推理流解决遥感开放词汇语义分割中的语义歧义问题,提升复杂场景下的分割性能和语义一致性。

Comments 5 pages, 3 figures

详情
AI中文摘要

开放词汇语义分割已成为遥感领域的重要研究方向,因为它能够识别预定义土地覆盖类别之外的目标。然而,现有方法主要依赖被动的视觉-文本匹配,在地理环境复杂的场景中往往难以应对语义歧义,尤其是当不同类别表现出相似的光谱或结构模式时。为了解决这个问题,我们提出了一个用于遥感开放词汇语义分割的地理空间推理思维链(GR-CoT)框架。GR-CoT由离线知识蒸馏流和在线实例推理流组成。前者为易混淆类别构建类别判读标准,后者执行宏观场景锚定、视觉特征解耦和知识驱动的决策合成,以生成适应图像的词汇表供下游分割使用。在LoveDA和GID5基准上的实验表明,所提框架提高了整体分割性能,并在复杂场景中产生了语义上更一致的预测。

英文摘要

Open-vocabulary semantic segmentation has become an important direction in remote sensing, as it enables recognition beyond predefined land-cover categories. However, existing methods mainly depend on passive visual-text matching and often struggle with semantic ambiguity in geographically complex scenes, especially when different classes exhibit similar spectral or structural patterns. To address this issue, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework for remote sensing open-vocabulary semantic segmentation. GR-CoT consists of an offline knowledge distillation stream and an online instance reasoning stream. The former constructs category interpretation standards for confusing classes, while the latter performs macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis to generate an image-adaptive vocabulary for downstream segmentation. Experiments on the LoveDA and GID5 benchmarks indicate that the proposed framework improves overall segmentation performance and yields more semantically coherent predictions in complex scenes.

2601.14330 2026-05-19 cs.CV cs.LG 版本更新

LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models

LURE: 用于扩散模型多概念重新唤醒的潜在空间解阻

Mengyu Sun, Ziyuan Yang, Andrew Beng Jin Teoh, Junxu Liu, Haibo Hu, Yi Zhang

发表机构 * Sichuan University(四川大学) The Hong Kong Polytechnic University(香港理工大学) Nanyang Technological University(南洋理工大学) Yonsei University(延世大学)

AI总结 本文提出LURE方法,通过重建潜在空间和引导采样轨迹,实现多概念的高保真重新唤醒,解决了现有方法在多概念场景下的梯度冲突和特征纠缠问题。

详情
AI中文摘要

概念擦除旨在抑制扩散模型中的敏感内容,但最近的研究表明,被擦除的概念仍可能被重新唤醒,揭示了擦除方法的脆弱性。现有重新唤醒方法主要依赖于提示级优化来操控采样轨迹,忽略了其他生成因素,限制了对底层动态的全面理解。在本文中,我们将生成过程建模为一个隐式函数,以实现对多个因素的全面理论分析,包括文本条件、模型参数和潜在状态。我们理论证明,扰动每个因素可以重新唤醒被擦除的概念。基于这一见解,我们提出了一种新的概念重新唤醒方法:用于概念重新唤醒的潜在空间解阻(LURE),通过重建潜在空间并引导采样轨迹来重新唤醒被擦除的概念。具体而言,我们的语义重新绑定机制通过将去噪预测与目标分布对齐来重建潜在空间,以重新建立断裂的文本-视觉关联。然而,在多概念场景中,朴素的重建会导致梯度冲突和特征纠缠。为了解决这个问题,我们引入了梯度场正交化,强制特征正交以防止相互干扰。此外,我们的潜在语义识别引导采样(LSIS)通过后验密度验证确保重新唤醒过程的稳定性。广泛的实验表明,LURE能够在多种擦除任务和方法中同时实现多个被擦除概念的高保真重新唤醒。

英文摘要

Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification. Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.
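
The interference problem in the multi-concept case can be pictured with a plain vector projection: removing from one update direction its component along another leaves the two orthogonal. This is a generic Gram-Schmidt style step swapped in for illustration. It is not the paper's Gradient Field Orthogonalization, whose exact formulation the abstract does not give, and `project_out` is a hypothetical helper.

```python
def project_out(g, h):
    """Return g minus its component along h, so the result is orthogonal to h.

    g, h : update directions for two different concepts, as lists of floats
    """
    h_norm_sq = sum(x * x for x in h)
    if h_norm_sq == 0.0:
        return list(g)          # nothing to remove
    scale = sum(a * b for a, b in zip(g, h)) / h_norm_sq
    return [a - scale * b for a, b in zip(g, h)]


g_concept_a = [1.0, 1.0]
g_concept_b = [1.0, 0.0]
g_a_clean = project_out(g_concept_a, g_concept_b)
```

After the projection, a step along `g_a_clean` no longer changes anything in the direction used by the second concept. This is the intuition behind preventing mutual interference.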

2601.13839 2026-05-19 cs.CV 版本更新

DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

DisasterVQA: 一个用于灾难场景的视觉问答基准数据集

Aisha Al-Mohannadi, Ayisha Firoz, Yin Yang, Muhammad Imran, Ferda Ofli

发表机构 * Qatar Computing Research Institute(卡塔尔计算研究所) Hamad Bin Khalifa University(哈马德·本·哈利法大学) College of Science & Engineering(科学与工程学院) Qatar University(卡塔尔大学)

AI总结 本文提出DisasterVQA数据集,用于灾难场景中的感知与推理任务,通过1395张真实图像和4405对专家精心编写的问答对,评估了七种最先进的视觉-语言模型在灾难响应中的性能,发现模型在细粒度定量推理、物体计数和上下文敏感解释方面存在不足。

Comments Accepted at ICWSM 2026

详情
AI中文摘要

社交媒体图像在自然灾害和人为灾害中提供低延迟的情报信息源,能够实现快速损害评估和响应。尽管视觉问答(VQA)在通用领域表现出色,但其在灾难响应中所需的复杂和安全关键推理的适用性仍不明确。我们引入了DisasterVQA基准数据集,专门用于危机情境中的感知和推理。DisasterVQA包含1395张真实世界图像和4405对专家精心编写的问答对,涵盖洪水、野火和地震等多种事件。基于人道主义框架,包括FEMA ESF和OCHA MIRA,该数据集包含二元、多选和开放式问题,覆盖情境意识和操作决策任务。我们评估了七种最先进的视觉-语言模型,并发现性能在问题类型、灾难类别、地区和人道主义任务上存在差异。尽管模型在二元问题上实现高准确率,但在细粒度定量推理、物体计数和上下文敏感解释方面表现不佳,尤其是在代表性不足的灾难场景中。DisasterVQA提供了一个具有挑战性和实用性的基准,以指导开发更稳健和具有操作意义的视觉-语言模型用于灾害响应。该数据集可通过https://doi.org/10.5281/zenodo.18267769公开获取。

英文摘要

Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://doi.org/10.5281/zenodo.18267769.

2512.18953 2026-05-19 cs.CV 版本更新

Symmetry Matters: Auditing and Symmetrizing 3D Generative Models

对称性至关重要:审计和对称化3D生成模型

Nicolas Caytuiro, Ivan Sipiran

发表机构 * University of Chile(智利大学)

AI总结 本文研究了无条件点云生成中对称性的保持问题,通过审计多个3D生成模型所生成形状的对称性并计算基于Chamfer距离的归一化对称性分数,发现现有模型在对称性感知评估协议下存在持续的对称性差距。通过分析训练数据和引入对称性感知干预,作者提出了在半对象数据集上训练生成模型并在采样时进行反射重建的方法,从而提高几何一致性和视觉合理性。

Comments 12 pages, 8 figures, 4 tables

详情
AI中文摘要

对称性是许多物体类别中强有力的先验知识,但标准的3D生成模型基准很少报告这一先验是否被保留。我们研究了无条件点云生成中的对称性保持问题。我们首先审计了几种3D生成模型所生成形状的对称性,并基于Chamfer距离(CD)计算归一化对称性分数。我们表明,尽管当前3D生成模型在标准评估下取得竞争性结果,但当应用对称性感知评估协议时,它们显示出持续的对称性差距。为了测试这个差距是否仅仅继承自训练数据,我们评估了这些模型在由ShapeNet衍生的镜像物体数据集上的表现,并分析了训练过程中的对称性动态。我们在采样和潜在空间层面运用机制可解释性技术,进一步表明反射对称性并未被可靠地编码在学到的生成过程中。最后,为了解决这个差距,我们提出了一种数据导向的对称性感知干预:在半对象数据集上训练生成模型,并在采样时通过反射重建完整物体。在多个模型架构上,这种干预显著提高了几何一致性和视觉合理性,同时在标准度量下仍具竞争力。这些发现表明,需要伴随标准基准进行对称性感知评估,未来的3D生成模型应显式地将这一先验纳入训练或采样过程中。

英文摘要

Symmetry is a strong prior present in many object categories, yet standard benchmarks for 3D generative models rarely report whether this prior is preserved. We study symmetry preservation in unconditional point cloud generation. We first audit the symmetry of generated shapes by several 3D generative models and compute a normalized symmetry score based on the Chamfer Distance (CD). We show that although current 3D generative models achieve competitive results under standard evaluation, they reveal a persistent symmetry gap when a symmetry-aware evaluation protocol is applied. To test whether this gap is merely inherited from the training data, we evaluate these models over a mirrored-objects dataset derived from ShapeNet and analyze symmetry dynamics during training. Mechanistic interpretability techniques were employed at the sampling and latent levels to further show that reflection symmetry is not reliably encoded in the learned generative process. Finally, to address this gap, we propose a data-centric symmetry-aware intervention: training generative models on a half-objects dataset and reconstructing full objects by reflection during sampling. Across multiple backbones, this intervention substantially improves geometric consistency and visual plausibility while remaining competitive under standard metrics. These findings suggest that symmetry-aware evaluation is needed alongside standard benchmarks, and future 3D generative models should incorporate this prior explicitly, either during training or sampling.
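
The abstract does not spell out how the normalized symmetry score is computed, so the following is only a minimal sketch of one plausible form: the Chamfer Distance between a point cloud and its mirror image, divided by the squared bounding-box diagonal. The symmetry plane (x = 0), the normalization, and the function names are all assumptions made for illustration, not the authors' definition.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets (lists of (x, y, z) tuples).

    Sums the mean squared nearest-neighbour distance in both directions.
    Brute force, O(len(a) * len(b)), which is fine for a small illustration.
    """
    def one_way(src, dst):
        total = 0.0
        for s in src:
            total += min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
        return total / len(src)

    return one_way(a, b) + one_way(b, a)


def reflect_x(points):
    """Mirror a point set across the plane x = 0 (assumed symmetry plane)."""
    return [(-x, y, z) for x, y, z in points]


def symmetry_score(points):
    """Hypothetical normalized score: CD(P, mirror(P)) / squared bounding-box diagonal.

    0.0 means perfectly mirror-symmetric about x = 0; larger means less symmetric.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    diag_sq = sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs))
    if diag_sq == 0:
        return 0.0  # a single point (or identical points) is trivially symmetric
    return chamfer_distance(points, reflect_x(points)) / diag_sq
```

Dividing by a size term makes the score comparable across shapes of different scale; any other scale-invariant normalization would serve the same purpose in this sketch.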

2512.11446 2026-05-19 cs.CV 版本更新

YawDD+: Frame-level Annotations for Accurate Yawn Prediction

YawDD+: 用于准确打哈欠预测的帧级标注

Ahmed Mujtaba, Gleb Radchenko, Marc Masana, Radu Prodan

发表机构 * Embedded Systems Division, Silicon Austria Labs(Silicon Austria Labs嵌入式系统部门) Institute of Visual Computing, Graz University of Technology(格拉茨技术大学视觉计算研究所) Department of Computer Science, University of Innsbruck(因斯布鲁克大学计算机科学系)

AI总结 本文提出了一种半自动化标注流程,通过人工在循环验证来标注YawDD视频以获得更准确的帧级标注,从而在边缘设备上提升模型训练效果,实现更高效的疲劳驾驶检测。

Comments This paper is accepted in the 33rd IEEE International Conference on Image Processing (ICIP) 2026

详情
AI中文摘要

驾驶员疲劳仍然是道路事故的主要原因,导致24%的碰撞事故。尽管打哈欠是疲劳的早期行为指标,但现有方法面临挑战,因为视频标注数据集中存在系统性噪声,源于粗略的时间标注。训练稳健的机器学习(ML)模型需要丰富的监督标签,以帮助从训练数据中学习显著特征。此外,在边缘设备上高效训练和推断模型对于疲劳驾驶检测任务至关重要,以在不依赖云基础设施的情况下实现车辆上的准确实时决策。为了解决这个问题,我们开发了一种半自动标注流程,通过人工在循环验证来标注YawDD视频以获得更准确的帧级标注,从而在边缘平台如NVIDIA Jetson NANO上更准确地训练模型。在YawDD+上训练已建立的MNasNet分类器和YOLOv11检测器架构,比视频级监督提高了多达6%的帧准确率和5%的mAP,分别在Jetson NANO和AGX上实现了99.34%的分类准确率和95.69%的检测mAP。此外,MNasNet在AGX上仅用8.69分钟/epoch完成一个周期,同时提供高达115帧/秒(FPS)的推断时间,证明了增强的数据质量本身支持边缘设备上的驾驶员疲劳监测系统,而无需服务器端计算。YawDD+数据集和训练好的模型已在线上提供。

英文摘要

Driver fatigue remains a leading cause of road accidents, responsible for 24% of crashes. While yawning serves as an early behavioral indicator of fatigue, existing approaches face significant challenges due to the presence of systematic noise in video-annotated datasets arising from coarse temporal annotations. Training robust machine learning (ML) models requires rich supervisory labels that help learn salient features from the training data. Moreover, efficient training and inference on edge devices are crucial in driver fatigue detection tasks to enable accurate real-time decisions on vehicles without reliance on cloud infrastructure. To address this issue, we develop a semi-automated labeling pipeline with human-in-the-loop verification that annotates YawDD videos at the frame level, yielding YawDD+ and enabling more accurate model training on edge platforms such as NVIDIA Jetson NANO. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP on Jetson NANO and AGX. Moreover, MNasNet needs just 8.69 min per training epoch while running inference at up to 115 frames per second (FPS) on AGX, confirming that enhanced data quality alone supports on-device driver fatigue monitoring systems without server-side computation. The YawDD+ dataset and trained models are available online.

2512.05136 2026-05-19 cs.CV cs.AI 版本更新

Fine-tuning an ECG Foundation Model to Predict Coronary CT Angiography Outcomes

微调一种心电图基础模型以预测冠状动脉CT血管造影结果

Yujie Xiao, Qinghao Zhao, Gongzheng Tang, Hao Zhang, Zhuoran Kan, Deyun Zhang, Jun Li, Guangkun Nie, Xiaocheng Fang, Haoyu Wang, Shun Huang, Tong Liu, Jian Liu, Kangyin Chen, Shenda Hong

发表机构 * Institute of Medical Technology, Peking University Health Science Center(北京大学医学部医学技术研究院) National Institute of Health Data Science, Peking University(北京大学国家健康数据科学研究院) Department of Cardiology, Peking University People’s Hospital(北京大学人民医院心内科) Tianjin Key Laboratory of Ionic-Molecular Function of Cardiovascular Disease, Department of Cardiology, Tianjin Institute of Cardiology, The Second Hospital of Tianjin Medical University(天津市心血管病离子与分子机能重点实验室,天津医科大学第二医院心脏科,天津心脏病学研究所) Heart Voice Medical Technology(心声医疗科技) School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院)

AI总结 本文通过微调心电图基础模型来预测冠状动脉CT血管造影(CCTA)结果,采用多中心研究设计,以CCTA作为解剖参考标准,开发并验证了用于预测血管特异性冠状动脉狭窄的AI-ECG模型,并报告了模型在内部和外部验证中的表现及其临床应用价值。

详情
AI中文摘要

CAD仍然是全球公共卫生的主要负担,然而可扩展的筛查工具有限。尽管CCTA是一线非侵入性诊断方法,但其使用受到资源需求和辐射暴露的限制。AI-ECG可能为CAD风险分层提供补充方法。在这项多中心研究中,我们开发并验证了使用CCTA作为解剖参考标准的AI-ECG模型,以预测血管特异性冠状动脉狭窄。在内部验证中,模型在各血管上的AUC值为0.683-0.744,并表现出一致的外部性能。在临床正常ECG中保持了鉴别能力,并在各亚组中保持了广泛稳定性。模型预测的概率随着CCTA定义的狭窄严重程度单调增加。模型概率通过预定义的基于灵敏度和特异性的阈值转换为血管特异性低、中、高风险分层。校准分析显示预测风险与观察风险之间的一致性,而DCA表明与“全部治疗”和“不治疗”策略相比,具有净临床获益。将AI衍生的风险分层与基于指南的PTP类别相结合,提高了排除性能,减少了灰色区域比例,并与单独使用PTP相比实现了正NRI。在纵向随访队列中,Kaplan-Meier分析显示模型定义的风险组在主要不良心血管事件风险上存在明显分离。波形和归因分析进一步识别了与高风险预测相关的结构化ECG形态差异和具有生理意义的信号区域。这些发现支持AI-ECG作为补充CAD筛查、解剖风险估计和临床分诊的可行工具,但需要进一步的前瞻性研究来确认其临床影响。

英文摘要

CAD remains a major global public health burden, yet scalable screening tools are limited. Although CCTA is a first-line non-invasive diagnostic modality, its use is constrained by resource requirements and radiation exposure. AI-ECG may offer a complementary approach for CAD risk stratification. In this multicenter study, we developed and validated an AI-ECG model using CCTA as the anatomical reference standard to predict vessel-specific coronary stenosis. In internal validation, the model achieved AUC values of 0.683-0.744 across vessels and showed consistent external performance. Discrimination was maintained in clinically normal ECGs and remained broadly stable across subgroups. Model-predicted probabilities increased monotonically with CCTA-defined stenosis severity. Model probabilities were converted into vessel-specific low-, intermediate-, and high-risk strata using predefined sensitivity- and specificity-based thresholds. Calibration analysis showed agreement between predicted and observed risk, while DCA indicated net clinical benefit over treat-all and treat-none strategies. Integrating AI-derived risk strata with guideline-based PTP categories improved rule-out performance, reduced the gray-zone proportion, and achieved positive NRI compared with PTP alone. In a longitudinal follow-up cohort, Kaplan-Meier analysis showed clear separation of major adverse cardiovascular event risk across model-defined risk groups. Waveform- and attribution-based analyses further identified structured ECG morphology differences and physiologically meaningful signal regions associated with high-risk predictions. These findings support AI-ECG as a feasible tool for complementary CAD screening, anatomical risk estimation, and clinical triage, while prospective studies are needed to confirm its clinical impact.
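
The comparison against treat-all and treat-none strategies reads most naturally as decision curve analysis (assuming DCA is used here in that standard sense), where the net benefit at a threshold probability p_t is TP/n - (FP/n) * p_t / (1 - p_t). The sketch below illustrates that standard formula only. The function names are hypothetical, and the paper's actual thresholds and data are not reproduced.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every patient whose predicted risk is >= threshold.

    y_true: list of 0/1 outcomes; y_prob: predicted probabilities;
    threshold: threshold probability p_t, with 0 < p_t < 1.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # weight of a false positive relative to a true positive
    return tp / n - odds * fp / n


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat everyone, i.e. every predicted risk is 1.0."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


def net_benefit_treat_none():
    """Reference strategy: treat no one; net benefit is zero by definition."""
    return 0.0
```

A model shows net clinical benefit at a given threshold when `net_benefit` exceeds both reference values; a decision curve plots this quantity over a range of thresholds.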

2512.01843 2026-05-19 cs.CV 版本更新

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

PhyDetEx: 检测和解释T2V模型的物理合理性

Zeqing Wang, Keze Wang, Lei Zhang

AI总结 本文提出PhyDetEx,通过构建PID数据集和轻量级微调方法,评估T2V模型在生成物理合理视频方面的性能,并发现尽管模型在视频生成上有所进步,但理解并遵循物理定律仍存在挑战。

Comments 23 pages, 10 figures

详情
AI中文摘要

受模型容量和训练规模增长的推动,文本到视频(T2V)生成模型在视频质量、长度和遵循指令的能力方面取得了显著进展。然而,这些模型是否能理解物理并生成物理上合理的视频仍是一个悬而未决的问题。尽管视觉语言模型(VLMs)已被广泛用于各种应用中的通用评估,但它们难以识别生成视频中的物理不可能内容。为研究此问题,我们构建了一个PID(物理不可信检测)数据集,包含由500个手动标注视频组成的测试集和由2,588对视频组成的训练集,其中每个不可信视频都是通过仔细改写其对应真实视频的描述来生成的,以诱导T2V模型生成物理上不可信的内容。利用构建的数据集,我们提出了一种轻量级微调方法,使VLMs不仅能检测物理不可信事件,还能针对被违反的物理原理生成文本解释。将微调后的VLM作为物理合理性检测器和解释器,即PhyDetEx,我们对一系列最先进的T2V模型进行基准测试,以评估它们对物理定律的遵守程度。我们的发现表明,尽管最近的T2V模型在生成物理合理内容方面取得了显著进展,但理解和遵守物理定律仍是一个具有挑战性的问题,特别是对于开源模型。我们的数据集、训练代码和检查点可在https://github.com/Zeqing-Wang/PhyDetEx获取。

英文摘要

Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains an open question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations of the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.

2511.00392 2026-05-19 cs.RO cs.AI cs.CV 版本更新

SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping

SonarSweep: 通过平面扫描融合声纳与视觉以实现鲁棒的3D重建

Lingpeng Chen, Jiakun Tang, Apple Pui-Yi Chui, Ziyang Hong, Junfeng Wu

发表机构 * Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Chinese University of Hong Kong, Hong Kong(香港中文大学) Department of Automation, Harbin Institute of Technology(哈尔滨工业大学自动化系)

AI总结 本文提出SonarSweep,一种端到端的深度学习框架,通过将平面扫描算法应用于声纳与视觉数据的跨模态融合,克服了单一模态方法在水下环境中3D重建的局限性,实现了更精确和稳定的深度图生成。

Comments 8 pages, 9 figures, conference

详情
AI中文摘要

在视觉退化的水下环境中实现准确的3D重建仍是一个严峻的挑战。单一模态方法不足:基于视觉的方法因可见性差和几何约束而失败,而声纳则因固有的高度歧义和低分辨率而受限。因此,先前的融合技术依赖于启发式方法和错误的几何假设,导致显著的伪影和无法建模复杂场景。在本文中,我们引入了SonarSweep,一种新颖的端到端深度学习框架,通过将原理性的平面扫描算法应用于声纳与视觉数据的跨模态融合,克服了这些限制。在高保真模拟和真实环境中的大量实验表明,SonarSweep能够一致地生成密集且准确的深度图,在挑战性条件下,特别是在高浊度情况下,显著优于最先进的方法。为了促进进一步研究,我们将公开我们的代码和一个新型的数据集,该数据集包含同步的立体相机和声纳数据,这是首次公开的此类数据集。

英文摘要

Accurate 3D reconstruction in visually-degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel, end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly in high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.

2510.11391 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward: 一种用于文档结构化和风格化的文档奖励模型

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

发表机构 * CUHK(香港中文大学) UCAS(中国科学院大学) XJTU(西安交通大学) UMich(密歇根大学) Microsoft(微软)

AI总结 本文提出DocReward,一种用于评估文档结构和风格的奖励模型,通过构建包含117,000对文档的DocPair数据集,采用Bradley-Terry损失训练,有效提升了文档生成的结构和风格专业性。

详情
AI中文摘要

近期的代理工作流程自动化了专业文档生成,但主要关注文本质量,忽视了结构和风格的专业性,这对于可读性同样至关重要。这一差距主要源于缺乏有效的奖励模型,无法引导代理生成结构和风格专业的文档。我们引入DocReward,一种评估文档结构和风格的文档奖励模型。为此,我们提出了一种文本质量无关的框架,确保评估不受内容质量的影响,并构建了包含117,000对文档的DocPair数据集,涵盖32个领域和267种类型。每对文档内容相同,但结构和风格专业性不同。DocReward使用Bradley-Terry损失进行训练。在人工标注的基准测试中,DocReward在相同设置下比GPT-5高出14.6个百分点。强化学习实验进一步表明,DocReward能有效引导代理生成具有更一致结构和风格专业性的文档,突显了其实际应用价值。

英文摘要

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.
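
For readers unfamiliar with the Bradley-Terry loss mentioned above, here is a minimal sketch of its standard pairwise form: given scalar reward scores for the preferred and the other document of a pair, the loss is -log sigmoid(s_preferred - s_other). The function name and the scalar-score interface are assumptions for illustration; the abstract does not describe DocReward's actual training code.

```python
import math


def bradley_terry_loss(score_preferred, score_other):
    """Pairwise Bradley-Terry loss: -log(sigmoid(score_preferred - score_other)).

    Written as softplus(-d), in a numerically stable form that avoids overflow
    for large |d|: softplus(x) = max(x, 0) + log(1 + exp(-|x|)).
    """
    d = score_preferred - score_other
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def mean_pairwise_loss(pairs):
    """Average the loss over (score_preferred, score_other) pairs."""
    return sum(bradley_terry_loss(a, b) for a, b in pairs) / len(pairs)
```

The loss equals log 2 when the two scores tie and shrinks toward zero as the preferred document's score pulls ahead, which is what drives a reward model to rank the more professional document of each pair higher.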

2510.06809 2026-05-19 cs.CV 版本更新

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

VA-Adapter:将超声基础模型适应于超声心动图探头引导

Teng Wang, Haojun Jiang, Yuxuan Wang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang

发表机构 * Department of Automation, BNRist, Tsinghua University, Beijing, China(自动化系、BNRist、清华大学、北京、中国) School of Computer Science and Technology, Xidian University(计算机科学与技术学院、西安电子科技大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Chinese PLA General Hospital(中国人民解放军总医院)

AI总结 本文提出VA-Adapter,通过将超声基础模型与理解个体三维结构的能力相结合,提高超声心动图探头引导的精度和效率,实验表明其在参数量较少的情况下表现优于现有模型。

Comments MICCAI2026 Early Accept Paper

详情
AI中文摘要

超声心动图是检测心脏疾病的关键工具,但其操作难度高导致专业人员短缺。探头引导系统通过辅助获取高质量图像,提供了降低操作门槛的有前景的解决方案。然而,由于显著的个体差异,稳健的探头引导仍具挑战性。这种差异表现为二维图像中低级特征的差异,这使得图像特征理解复杂化,以及个体三维结构的差异,这给精确导航带来挑战。为了解决这些挑战,我们首先提出利用超声基础模型从大量数据集中学习的稳健图像表示。然而,将这些模型应用于探头导航是困难的,因为它们缺乏对个体三维结构的理解。为此,我们精心设计了视觉-动作适配器(VA-Adapter)以在线注入理解个体三维结构的能力。具体来说,通过将VA-Adapter嵌入基础模型的图像编码器中,模型可以从历史视觉-动作序列中推断心脏解剖结构,模拟超声技师的认知过程。在包含超过131万样本的数据集上进行的广泛实验表明,VA-Adapter在参数量少约33倍的情况下优于现有探头引导模型。代码可在https://github.com/LeapLabTHU/VA-Adapter上获得。

英文摘要

Echocardiography is a critical tool for detecting heart diseases, yet its steep operational difficulty causes a shortage of skilled personnel. Probe guidance systems, which assist in acquiring high-quality images, offer a promising solution to lower this operational barrier. However, robust probe guidance remains challenging due to significant individual variability. This variability manifests as differences in low-level features within two-dimensional (2D) images, which complicate image feature understanding, and differences in individual three-dimensional (3D) structures, which pose challenges for precise navigation. To address these challenges, we first propose leveraging the robust image representations learned by ultrasound foundation models from vast datasets. Yet, applying these models to probe navigation is non-trivial due to their lack of understanding of individual 3D structures. To this end, we meticulously design a Vision-Action Adapter (VA-Adapter) that injects the capability of understanding individual 3D structures online. Specifically, by embedding the VA-Adapter into the foundation model's image encoder, the model can infer cardiac anatomy from historical vision-action sequences, mimicking the cognitive process of a sonographer. Extensive experiments on a dataset with over 1.31M samples demonstrate that the VA-Adapter outperforms strong probe guidance models while requiring approximately 33 times fewer trained parameters. Code is available at https://github.com/LeapLabTHU/VA-Adapter.

2510.04382 2026-05-19 eess.IV cs.CV cs.NA math.NA 版本更新

Adaptive double-phase Rudin--Osher--Fatemi denoising model

自适应双相Rudin-Osher-Fatemi去噪模型

Wojciech Górny, Michał Łasica, Alexandros Matsoukas

发表机构 * Faculty of Mathematics, Universität Wien(维也纳大学数学系) Faculty of Mathematics, Informatics and Mechanics, University of Warsaw(华沙大学数学、信息学与力学系) Institute of Mathematics of the Polish Academy of Sciences(波兰科学院数学研究所) Department of Mathematics, School of Applied Mathematical and Physical Sciences, National Technical University of Athens(雅典国立技术大学应用数学与物理科学学院数学系)

AI总结 本文提出了一种基于双相积分函数的自适应ROF去噪模型,旨在减少阶梯效应并保留图像边缘,通过在合成和自然图像上测试性能,展示了在SSIM、PSNR和LPIPS等相似性度量上优于传统模型的表现。

Comments 23 pages, 16 figures, supplementary material available at: https://github.com/wojciechgorny/double-phase-ROF-model/

详情
AI中文摘要

尽管距Rudin--Osher--Fatemi(ROF)关于总变分(TV)去噪的开创性论文发表已超过30年,该模型仍然具有现实意义,尤其是在天文成像等科学应用中。然而,众所周知它会产生阶梯效应等伪影。为应对这一问题,人们提出了该模型的许多变体。最近,在数学分析界对双相问题开展大量研究的背景下,有人提出将一种由TV和一个加权二次增长项组成的双相型积分泛函作为图像恢复的正则化项。在此,我们基于该正则化项提出了ROF去噪模型的一种自适应变体。它旨在相对于经典ROF模型减少阶梯效应,同时以类似的方式保留图像的边缘。我们实现了该模型,并在一系列噪声水平下的合成图像和自然图像上测试其性能。与可解释性和ROF相近的已有模型相比,我们在SSIM、PSNR和LPIPS等相似性度量上观察到更好或相当的表现,同时阶梯效应明显减少。

英文摘要

Even though more than 30 years have passed since the seminal Rudin--Osher--Fatemi (ROF) paper on total variation (TV) denoising, it remains relevant, in particular in scientific applications such as astronomical imaging. However, it is known to suffer from artifacts such as the staircasing effect. Many variants of the model have been proposed with the aim of countering this. Recently, against the backdrop of immense research output on double-phase problems in the mathematical analysis community, a double-phase type integral functional, comprising TV and a weighted term of quadratic growth, was suggested as a regularizer for image restoration. Here, we propose an adaptive variant of the ROF denoising model based on that regularizer. It is designed to reduce staircasing with respect to the classical ROF model, while preserving the edges of the image in a similar fashion. We implement the model and test its performance on synthetic and natural images over a range of noise levels. Compared to established models with similar interpretability to ROF, we observe an improved or similar performance in terms of similarity metrics SSIM, PSNR, and LPIPS, while the staircasing effect is visibly reduced.
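
For orientation, the classical ROF problem and a schematic double-phase regularizer of the kind described above can be written as follows. The first formula is the standard ROF model. The second is only a sketch of "TV plus a weighted term of quadratic growth": the symbols $a(x)$ and $\mathcal{R}_a$ are assumed notation, and the abstract does not say how the weight is chosen adaptively.

```latex
% Classical ROF model: f is the noisy image, \lambda > 0 the fidelity weight.
\min_{u \in BV(\Omega)} \; \int_\Omega |Du| \;+\; \frac{\lambda}{2} \int_\Omega (u - f)^2 \, dx

% Schematic double-phase type regularizer: TV plus a weighted quadratic-growth term.
% The weight a(x) \ge 0 is assumed notation, not taken from the paper.
\mathcal{R}_a(u) \;=\; \int_\Omega |\nabla u| \, dx \;+\; \int_\Omega a(x) \, |\nabla u|^2 \, dx
```

Under this reading, regions where $a(x) = 0$ behave like plain TV and can keep sharp edges, while regions where $a(x) > 0$ are penalized quadratically, which discourages the piecewise-constant plateaus responsible for staircasing.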

2509.22244 2026-05-19 cs.CV 版本更新

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

FlashEdit: 解耦速度、结构和语义以实现精确图像编辑

Junyi Wu, Zhiteng Li, Haotong Qin, Yulun Zhang, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) ETH Zürich(苏黎世联邦理工学院)

AI总结 本文提出FlashEdit,一种高效的局部图像编辑框架,通过解耦速度、结构和语义来实现精确编辑,实验表明其在保真度和效率之间取得了良好的平衡。

Comments Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit

详情
AI中文摘要

基于扩散模型的文本引导图像编辑已取得出色的质量,但往往存在延迟过高的问题。我们提出FlashEdit,一种面向标准的基于反演的编辑设置的实时局部图像编辑框架。其效率和精度源于三个关键创新:(1)循环一致的一步反演(COSI)流程,通过循环一致性促进与流形对齐的一步反演;(2)背景屏蔽(BG-Shield)技术,通过结构性自注意力干预更好地保留非编辑区域;(3)稀疏化空间交叉注意力(SSCA)机制,通过抑制语义泄漏实现精确编辑。在PIE-Bench上的实验表明,FlashEdit在内容保留与效率之间取得了良好的权衡,编辑可在0.2秒内完成,比基于DDIM的多步编辑快150倍以上。我们的代码将在https://github.com/JunyiWuCode/FlashEdit上公开发布。

英文摘要

Text-guided image editing with diffusion models has achieved remarkable quality but often suffers from prohibitive latency. We introduce FlashEdit, a real-time localized image editing framework for the standard inversion-based editing setting. Its efficiency and precision stem from three key innovations: (1) a Cycle-Consistent One-Step Inversion (COSI) pipeline that encourages manifold-aligned one-step inversion through cycle consistency; (2) a Background Shield (BG-Shield) technique that improves preservation of non-edited regions via structural self-attention intervention; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that promotes precise edits by suppressing semantic leakage. Experiments on PIE-Bench demonstrate a strong preservation-efficiency trade-off, with edits completed in under 0.2 seconds and an over 150$\times$ speedup over DDIM-based multi-step editing. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.

2508.17431 2026-05-19 cs.CV cs.AI cs.LG 版本更新

FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

FedKLPR: 基于KL引导的剪枝感知联邦学习用于行人重识别

Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

发表机构 * Media IC and System Lab, the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University(媒体IC与系统实验室,电子工程研究所及电气工程系,国立台湾大学)

AI总结 本文提出FedKLPR框架,通过KL散度引导训练、非结构化剪枝和跨轮次恢复技术,解决联邦学习在行人重识别中的统计异质性和通信开销问题,实验表明其在通信开销和准确性方面均优于现有方法。

Comments 10 pages, 3 figures, 5 tables, submitted to IEEE Transactions on Multimedia

详情
AI中文摘要

行人重识别(re-ID)是智能监控和公共安全中的基本任务。联邦学习(FL)提供了一种隐私保护的协同模型训练范式,无需集中数据收集。然而,由于非独立同分布(non-IID)客户端数据导致的统计异质性和频繁传输大规模模型带来的通信开销,将FL应用于现实世界中的re-ID系统仍然具有挑战性。为了解决这些挑战,我们提出了FedKLPR,一种用于行人重识别的轻量且通信高效的联邦学习框架。FedKLPR包含三个关键组件。首先,引入KL散度引导训练,包括KL散度正则化损失(KLL)和KL散度聚合权重(KLAW),以缓解统计异质性并提高非IID设置下的收敛稳定性。其次,引入非结构化剪枝以减少通信开销,并提出剪枝率聚合权重(PRAW)以衡量剪枝后客户端参数的相对重要性。PRAW与KLAW结合,构成KL散度-剪枝加权聚合(KLPWA),从而能够在异构数据分布下有效聚合剪枝后的本地模型。第三,跨轮次恢复(CRR)在各通信轮次间自适应地控制剪枝,以防止过度压缩并保持模型准确性。在八个基准数据集上的实验表明,FedKLPR在保持竞争性准确性的同时实现了显著的通信节省。与现有最先进方法相比,FedKLPR在ResNet-50上将通信成本减少了40%--42%,并实现了更优异的总体性能。

英文摘要

Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm for collaborative model training without centralized data collection. However, deploying FL in real-world re-ID systems remains challenging due to statistical heterogeneity caused by non-IID client data and the substantial communication overhead incurred by frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, KL-Divergence-Guided training, including the KL-Divergence Regularization Loss (KLL) and KL-Divergence-aggregation Weight (KLAW), is introduced to mitigate statistical heterogeneity and improve convergence stability under non-IID settings. Second, unstructured pruning is incorporated to reduce communication overhead, and the Pruning-ratio-aggregation Weight (PRAW) is proposed to measure the relative importance of client parameters after pruning. Together with KLAW, PRAW forms KL-Divergence-Prune Weighted Aggregation (KLPWA), enabling effective aggregation of pruned local models under heterogeneous data distributions. Third, Cross-Round Recovery (CRR) adaptively controls pruning across communication rounds to prevent excessive compression and preserve model accuracy. Experiments on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40%--42% on ResNet-50 while achieving better overall performance.

2508.16663 2026-05-19 cs.CV cs.AI cs.LG 版本更新

The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

The Loupe: 一种用于增强视觉变换器中判别特征的即插即用注意力模块

Naren Sengodan

发表机构 * Jain University(贾因大学)

AI总结 本文提出The Loupe模块,一种插入视觉变换器中间特征阶段的轻量级即插即用空间门控模块,利用小型CNN预测单通道空间掩码,并用该掩码对特征激活重新加权,以交叉熵目标和l1稀疏项进行端到端训练,从而提升细粒度视觉分类性能。

详情
AI中文摘要

细粒度视觉分类(FGVC)要求模型关注细微的、与任务相关的区域,而非宽泛的物体上下文。我们提出了The Loupe,一种用于层次化视觉变换器的轻量级即插即用空间门控模块。该模块插入在中间特征阶段,使用小型CNN预测单通道空间掩码,并用该掩码对特征激活重新加权,整个模型以交叉熵目标和l1稀疏项进行端到端训练。在CUB-200-2011数据集上,The Loupe将Swin-Base的准确率从88.36%提升至91.72%,将Swin-Tiny的准确率从85.14%提升至88.61%,新增参数不到0.1%。消融实验表明,性能提升依赖于插入点和稀疏正则化项,这说明在此设置下受控的空间门控比朴素的多尺度掩蔽更有效。定性结果表明,学习到的掩码通常与具有判别性的鸟类部位对齐,但该模块不能替代部件级监督,并且在存在遮挡或部件内部细粒度差异时可能会失效。

英文摘要

Fine-Grained Visual Classification (FGVC) requires models to focus on subtle, task-relevant regions rather than broad object context. We present The Loupe, a lightweight plug-and-play spatial gating module for hierarchical Vision Transformers. The module is inserted at an intermediate feature stage, predicts a single-channel spatial mask with a small CNN, and uses that mask to reweight feature activations during end-to-end training with a cross-entropy objective and an l1 sparsity term. On CUB-200-2011, The Loupe improves Swin-Base from 88.36% to 91.72% and Swin-Tiny from 85.14% to 88.61%, with under 0.1% additional parameters. Ablations show that the improvement depends on the insertion point and the sparsity regularizer, suggesting that controlled spatial gating is more effective than naive multi-scale masking in this setting. Qualitative results indicate that the learned masks often align with discriminative bird parts, although the module is not a substitute for part-level supervision and can fail under occlusion or fine-grained intra-part differences.
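
The gating step described above can be sketched in a few lines of dependency-free Python: a single-channel mask reweights every feature channel, and an l1 term on the mask is added to the classification loss. The sigmoid squashing, the mean-based l1 normalization, the `sparsity_weight` value, and all function names are assumptions for illustration. The small CNN that predicts the mask logits is left out entirely.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_spatial_gate(features, mask_logits):
    """Reweight a [C][H][W] feature map with a single-channel [H][W] mask.

    mask_logits stands in for the output of the mask-predicting CNN (not shown);
    the sigmoid that maps logits to (0, 1) is an assumption of this sketch.
    Returns the gated features and the mask itself.
    """
    mask = [[sigmoid(v) for v in row] for row in mask_logits]
    gated = [
        [[channel[i][j] * mask[i][j] for j in range(len(mask[i]))]
         for i in range(len(mask))]
        for channel in features
    ]
    return gated, mask


def l1_sparsity(mask):
    """Mean absolute mask value; pushing it down keeps the gate spatially selective."""
    count = sum(len(row) for row in mask)
    return sum(abs(v) for row in mask for v in row) / count


def total_loss(cross_entropy, mask, sparsity_weight=0.01):
    """Training objective sketch: cross-entropy plus a weighted l1 sparsity term."""
    return cross_entropy + sparsity_weight * l1_sparsity(mask)
```

Because the same mask multiplies all channels, the module only decides where to look, not which channels to use, which fits the abstract's description of it as a spatial gate.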

2508.10678 2026-05-19 cs.CV 版本更新

HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection

HyperTea: 一种基于超图的时序增强与对齐网络用于移动红外小目标检测

Zhaoyuan Qi, Weihua Gao, Wenlong Niu, Jie Tang, Yun Li, Xiaodong Peng

发表机构 * Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences(空间系统电子信息技术重点实验室,国家空间科学中心,中国科学院) School of Advanced Interdisciplinary Studies, University of Chinese Academy of Sciences(中国科学院大学交叉学科学院) School of Computer Science and Technology, University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院)

AI总结 本文提出HyperTea网络,通过整合全局和局部时序视角,有效建模特征的高阶时空相关性,首次将CNN、RNN和HGNN结合用于MIRSTD,显著提升检测性能。

Comments Accepted by Knowledge-Based Systems

详情
AI中文摘要

在实际应用场景中,由于目标尺寸小、强度弱且运动模式复杂,移动红外小目标检测(MIRSTD)仍然极具挑战性。现有方法通常仅建模特征节点之间的低阶相关性,并在单一时间尺度上进行特征提取和增强。尽管超图已被广泛用于高阶相关性学习,但其在MIRSTD中受到的关注仍然有限。为了探索超图的潜力并增强多时间尺度特征表示,我们提出HyperTea,它整合了全局和局部时序视角,有效建模特征的高阶时空相关性。HyperTea由三个模块组成:全局时序增强模块(GTEM)通过语义聚合和传播实现全局时序上下文增强;局部时序增强模块(LTEM)用于捕捉相邻帧之间的局部运动模式,进而增强局部时序上下文;此外,我们进一步开发了一个时序对齐模块(TAM)以解决潜在的跨尺度特征错位问题。据我们所知,HyperTea是首个将卷积神经网络(CNNs)、循环神经网络(RNNs)和超图神经网络(HGNNs)结合用于MIRSTD的工作,显著提升了检测性能。在DAUB和IRDST上的实验表明其达到了最先进(SOTA)的性能。我们的源代码可在https://github.com/Lurenjia-LRJ/HyperTea上获得。

英文摘要

In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; additionally, we further develop a temporal alignment module (TAM) to address potential cross-scale feature misalignment. To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source code is available at https://github.com/Lurenjia-LRJ/HyperTea.

2508.07292 2026-05-19 cs.AI cs.CL cs.CV 版本更新

EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

EndoCogniAgent: 闭环代理推理与自我一致性验证用于内窥镜诊断

Yi Tang, Kai-Ni Wang, Yang Chen, Xiaopu He, Guangquan Zhou

发表机构 * School of Biological Science and Medical Engineering, Southeast University(东南大学生物科学与医学工程学院) Jiangsu Key Laboratory of Biomaterials and Devices, Southeast University(江苏省生物材料与器件重点实验室) State Key Laboratory of Digital Medical Engineering, Southeast University(国家数字医学工程重点实验室) The First Affiliated Hospital of Nanjing Medical University(南京医科大学第一附属医院) Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing(江苏省联合国际医学信息处理联合实验室) Laboratory of Image Science and Technology, the School of Computer Science and Engineering, Southeast University(图像科学与技术实验室,东南大学计算机科学与工程学院)

AI总结 该研究提出EndoCogniAgent框架,通过闭环代理推理和自我一致性验证提升内窥镜诊断的准确性与可靠性,其核心方法是将诊断过程建模为受控状态更新过程,并引入EndoAgentBench基准进行评估。

Comments 10 pages, 8 figures, 2 tables. Revised version with major updates on methodology and extended evaluation on EndoAgentBench. Code and data are available at https://github.com/Tyyds-ai/EndoCogniAgent

详情
AI中文摘要

内窥镜诊断是一个迭代过程,临床医生逐步获取、比较和验证局部视觉证据以得出结论。当前AI系统未能充分支持此过程,因为细粒度证据获取和多步推理仍弱相关,导致两种失败模式:幻觉证据和未纠正的误差累积,影响诊断可靠性。我们提出EndoCogniAgent,一种闭环代理框架,将内窥镜诊断建模为受控状态更新过程。在每次推理轮次中,中央计划器选择下一步证据获取动作,专用专家工具提取相应观察,自我一致性验证机制沿两个维度检查观察:知识一致性与输入图像以及时间一致性与先前验证的发现,然后更新诊断状态。验证的观察被纳入演进状态以指导后续计划,而缺乏充分支持的发现则保留并带有纠正反馈,引导计划器进行进一步验证。我们进一步引入EndoAgentBench,一个以工作流程为导向的基准,包含来自11个内窥镜数据集的6132个问题-答案对,旨在评估诊断代理在全面诊断链中的表现,从细粒度视觉感知到高水平诊断推理。实验显示,EndoCogniAgent在感知任务上达到85.23%的平均准确率,在推理任务上达到71.13%的临床接受率,消融分析确认自我一致性验证和事件状态维护对这些提升至关重要。

英文摘要

Endoscopic diagnosis is an iterative process in which clinicians progressively acquire, compare, and verify local visual evidence before reaching a conclusion. Current AI systems do not adequately support this process because fine-grained evidence acquisition and multi-step reasoning remain weakly coupled. This gives rise to two failure modes, hallucinated evidence and uncorrected error accumulation, that undermine diagnostic reliability. We propose EndoCogniAgent, a closed-loop agentic framework that formulates endoscopic diagnosis as a controlled state update process. At each reasoning round, a central planner selects the next evidence acquisition action, specialized expert tools extract the corresponding observation, and a self-consistency validation mechanism examines the observation along two dimensions, knowledge consistency against the input image and temporal consistency with prior validated findings, before updating the diagnostic state. Validated observations are admitted into the evolving state to condition subsequent planning, while insufficiently supported findings are retained with corrective feedback that redirects the planner toward additional verification. We further introduce EndoAgentBench, a workflow-oriented benchmark comprising 6,132 question-answer pairs from 11 endoscopic datasets, designed to evaluate diagnostic agents across a comprehensive diagnostic chain, from fine-grained visual perception to high-level diagnostic reasoning. Experiments show that EndoCogniAgent achieves 85.23\% average accuracy on perception tasks and 71.13\% clinical acceptance rate on reasoning tasks, with ablation analysis confirming that self-consistency validation and episodic state maintenance are individually critical to these gains.
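
The closed-loop state update described in this abstract can be pictured with a short sketch. This is only an illustration under assumptions, not the authors' implementation: `plan`, `run_tool`, and `validate` are hypothetical callables standing in for the central planner, the expert tools, and the self-consistency validation step, and `DiagnosticState` is an invented container.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiagnosticState:
    """Hypothetical evolving state: admitted findings plus corrective notes."""
    validated: List[str] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)


def closed_loop(plan, run_tool, validate, max_rounds=5):
    """Run plan -> observe -> validate -> update for a bounded number of rounds."""
    state = DiagnosticState()
    for _ in range(max_rounds):
        action = plan(state)              # planner picks the next evidence action
        if action is None:                # planner signals that it is done
            break
        observation = run_tool(action)    # expert tool extracts an observation
        ok, reason = validate(observation, state)
        if ok:
            # Validated observations condition later planning.
            state.validated.append(observation)
        else:
            # Unsupported findings are kept with feedback for re-verification.
            state.feedback.append(f"{observation}: {reason}")
    return state
```

The planner only ever sees the state, so a rejected observation influences the next round through the recorded feedback rather than by silently entering the set of validated findings.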

2508.06038 2026-05-19 cs.CV cs.AI 版本更新

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

Fourier Compressor: 频域视觉令牌压缩用于视觉-语言模型

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

发表机构 * LUMIA Lab(LUMIA实验室) School of Artificial Intelligence(人工智能学院) Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Noah’s Ark Lab(诺亚实验室) Huawei Technologies Ltd.(华为技术有限公司) School of Computer Science(计算机科学学院)

AI总结 本文提出了一种基于频域的视觉令牌压缩策略,通过傅里叶变换减少计算开销并提升效率,同时保持语义准确性,实验表明其在图像和视频任务中均表现出色。

详情
AI中文摘要

视觉-语言模型(VLMs)由于高分辨率图像和视频输入引入的大量视觉令牌,导致计算开销和推理延迟显著增加。现有的无参数令牌压缩方法通常依赖于令牌选择或合并,但可能丢弃大量视觉信息或扭曲原始表示分布,导致在高压缩比下性能下降。为此,我们探索了一种更有效且高效的视觉令牌压缩策略,重点在频域方向。受图像压缩中频域变换(如JPEG)的成功启发,我们系统分析了视觉表示中的频域冗余,并揭示了不同频带中语义信息的非均匀分布。基于此,我们引入了傅里叶压缩器,一种有效、无参数且高度通用的模块,通过FFT(复杂度为O(n log n))在频域内去除视觉表示的冗余。实现过程中无额外参数,计算开销极小且保持语义保真度。在图像基准测试中,我们的方法在保留超过96%原始准确率的同时,将推理FLOPs减少高达83.8%,生成速度提升31.2%。它在图像和视频理解任务中均表现出色,且在LLaVA和Qwen-VL架构中均能稳定泛化,证明其在高效VLMs中的实用价值。

英文摘要

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we explore a more effective and efficient visual token compression strategy in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.
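
To make the idea of frequency-domain token compression concrete, here is a minimal sketch under stated assumptions. The abstract does not say which frequency bands are kept or which transform is applied, so this example assumes a plain low-pass filter along the token axis and uses a naive quadratic-time DCT as a stand-in for an FFT-based transform. The names `dct`, `idct`, and `compress_tokens` are hypothetical and are not taken from the paper's code.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a real sequence (naive, quadratic time)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(c)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c[k]
                * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
            for i in range(n)]


def compress_tokens(tokens, keep):
    """Shrink a list of token vectors to `keep` tokens by low-pass filtering.

    Each feature channel is transformed along the token axis, only the
    `keep` lowest-frequency coefficients are retained, and a shorter
    inverse transform produces the compressed sequence.
    """
    n, dim = len(tokens), len(tokens[0])
    gain = math.sqrt(keep / n)  # keeps amplitudes comparable across lengths
    columns = []
    for d in range(dim):
        coeffs = dct([t[d] for t in tokens])
        columns.append(idct([gain * c for c in coeffs[:keep]]))
    return [[columns[d][i] for d in range(dim)] for i in range(keep)]
```

With `keep` equal to the input length the round trip returns the original tokens, and a constant sequence stays constant at any length, which is a quick sanity check that only high-frequency variation is discarded.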

2507.22136 2026-05-19 cs.CV 版本更新

Color as the Impetus: Transforming Few-Shot Learner

颜色作为动力:转换少样本学习者

Chaofei Qi, Zhitai Liu, Jianbin Qiu

AI总结 本文提出了一种基于颜色感知机制的少样本学习框架,通过强调不同通道的颜色信息来提升特征提取和分类性能,同时引入知识蒸馏方法增强元学习能力。

Comments This work is currently being redone. It requires significant revisions and polishing. Additionally, the title will also be revised. Therefore, this version is no longer needed.

详情
AI中文摘要

人类具备天生的元学习能力,部分归因于其出色的色彩感知能力。在本文中,我们开创性地从模拟人类色彩感知机制的角度出发,提出了少样本学习的新视角。我们提出了ColorSense Learner,一种生物启发的元学习框架,利用跨通道特征提取和交互学习。通过在不同通道中战略强调不同的颜色信息,我们的方法有效过滤了无关特征,同时捕捉到判别性特征。颜色信息代表了最直观的视觉特征,但传统元学习方法大多忽略了这一方面,而专注于类别间的抽象特征区分。我们的框架通过协同的色彩通道交互弥合了这一差距,使能够更好地提取类内共同性并扩大类间差异。此外,我们引入了基于知识蒸馏的元蒸馏器ColorSense Distiller,该方法利用先验教师知识来增强学生网络的元学习能力。我们对十一个多少样本基准进行了全面的粗粒度/细粒度和跨域实验进行验证。大量实验表明,我们的方法具有极强的泛化能力、鲁棒性和可迁移性,并且能够轻松地从颜色感知的角度处理少样本分类。

英文摘要

Humans possess innate meta-learning capabilities, partly attributable to their exceptional color perception. In this paper, we offer a new viewpoint on few-shot learning by simulating human color perception mechanisms. We propose the ColorSense Learner, a bio-inspired meta-learning framework that capitalizes on inter-channel feature extraction and interactive learning. By strategically emphasizing distinct color information across different channels, our approach effectively filters irrelevant features while capturing discriminative characteristics. Color information represents the most intuitive visual feature, yet conventional meta-learning methods have predominantly neglected this aspect, focusing instead on abstract feature differentiation across categories. Our framework bridges the gap via synergistic color-channel interactions, enabling better intra-class commonality extraction and larger inter-class differences. Furthermore, we introduce a meta-distiller based on knowledge distillation, ColorSense Distiller, which incorporates prior teacher knowledge to augment the student network's meta-learning capacity. We conducted comprehensive coarse/fine-grained and cross-domain experiments on eleven few-shot benchmarks for validation. These experiments show that our methods have strong generalization ability, robustness, and transferability, and effortlessly handle few-shot classification from the perspective of color perception.

2507.22057 2026-05-19 cs.CV 版本更新

MetaLab: Few-Shot Game Changer for Image Recognition

MetaLab: 图像识别中的少样本突破

Chaofei Qi, Zhitai Liu, Jianbin Qiu

AI总结 本文提出了一种高效的少样本图像识别方法MetaLab,通过CIELab引导的相干元学习框架,实现了高准确率、鲁棒性和有效泛化能力,接近人类识别水平。

Comments This work is currently being redone. It requires significant revisions and polishing. Additionally, the title will also be revised. Therefore, this version is no longer needed.

详情
AI中文摘要

困难的少样本图像识别具有显著的应用前景,但与传统大规模图像识别相比仍存在显著的技术差距。本文提出了一种高效的原生方法,称为CIELab引导的相干元学习(MetaLab)。结构上,我们的MetaLab由两个协作的神经网络组成:LabNet,能够对CIELab颜色空间进行域转换并提取丰富的分组特征,以及相干LabGNN,能够促进亮度图和颜色图之间的相互学习。为了充分验证,我们在四个粗粒度基准、四个细粒度基准和四个跨域少样本基准上进行了广泛的比较研究。具体而言,我们的方法在每个类别仅使用一个样本时能够实现高准确率、鲁棒性能和有效的泛化能力。总体而言,所有实验都表明,我们的MetaLab可以达到99%的准确率,接近人类识别水平,仅需少量的视觉偏差。

英文摘要

Difficult few-shot image recognition has significant application prospects, yet substantial technical gaps remain compared with conventional large-scale image recognition. In this paper, we propose an efficient original method for few-shot image recognition, called CIELab-Guided Coherent Meta-Learning (MetaLab). Structurally, MetaLab comprises two collaborative neural networks: LabNet, which performs domain transformation for the CIELab color space and extracts rich grouped features, and coherent LabGNN, which facilitates mutual learning between the lightness graph and the color graph. For thorough validation, we conducted extensive comparative studies on four coarse-grained benchmarks, four fine-grained benchmarks, and four cross-domain few-shot benchmarks. Specifically, our method achieves high accuracy, robust performance, and effective generalization with a single sample per class. Overall, all experiments demonstrate that MetaLab can approach 99\% $\uparrow\downarrow$ accuracy, reaching the human recognition ceiling with little visual deviation.

2507.22041 2026-05-19 cs.CV 版本更新

Shallow Deep Learning Can Still Excel in Fine-Grained Few-Shot Learning

浅层深度学习仍能在细粒度少样本学习中表现出色

Chaofei Qi, Chao Ye, Zhitai Liu, Weiyang Lin, Jianbin Qiu

AI总结 本文研究了浅层深度网络在细粒度少样本学习中的表现,提出了一种位置感知星座网络(LCN-4),通过改进的特征聚类模块有效减少损失,验证了浅层网络在该任务中的有效性。

Comments This work is currently being redone. It requires significant revisions and polishing. Additionally, the title will also be revised. Therefore, this version is no longer needed.

详情
AI中文摘要

深度学习已在广泛领域得到广泛应用,包括依赖深度骨干网络的细粒度少样本学习(FGFSL)。然而,较浅的深度骨干网络如ConvNet-4不常被选用,因为它们倾向于提取大量非抽象的视觉属性。本文重新评估了网络深度与完全编码少样本实例能力之间的关系,并探讨浅层深度架构是否能实现与主流深度骨干网络相当或更优的性能。受Vanilla ConvNet-4的启发,我们提出了一种位置感知星座网络(LCN-4),配备先进的位置感知特征聚类模块。该模块能够高效编码和整合空间特征融合、特征聚类和隐蔽特征位置,从而显著减少整体损失。具体而言,我们创新性地提出了一种通用网格位置编码补偿,有效解决特定普通卷积在特征提取过程中位置信息缺失的问题。此外,我们进一步提出了一种通用频域位置嵌入技术,以补偿聚类特征中的位置损失。我们在三个代表性的细粒度少样本基准上进行了验证。相关实验表明,LCN-4显著优于基于ConvNet-4的最新方法,并实现了与大多数ResNet12方法相当或更优的性能,证实了我们的猜想。

英文摘要

Deep learning has been used extensively across a wide spectrum of domains, including fine-grained few-shot learning (FGFSL), which depends heavily on deep backbones. Nonetheless, shallower backbones such as ConvNet-4 are not commonly preferred because they are prone to extracting a larger quantity of non-abstract visual attributes. In this paper, we first re-evaluate the relationship between network depth and the ability to fully encode few-shot instances, and examine whether a shallow deep architecture can achieve performance comparable or superior to mainstream deep backbones. Inspired by vanilla ConvNet-4, we introduce a location-aware constellation network (LCN-4), equipped with a location-aware feature clustering module. This module can proficiently encode and integrate spatial feature fusion, feature clustering, and recessive feature location, thereby significantly reducing the overall loss. Specifically, we put forward a general grid position encoding compensation to address the positional information lost during feature extraction by certain ordinary convolutions. Additionally, we propose a general frequency-domain location embedding technique to offset the location loss in clustering features. We carried out validation on three representative fine-grained few-shot benchmarks. Experiments show that LCN-4 notably outperforms state-of-the-art methods based on ConvNet-4 and achieves performance on par with or superior to most ResNet12-based methods, confirming our conjecture.

2505.18991 2026-05-19 cs.CV 版本更新

Fast Kernel-Space Diffusion for Remote Sensing Pansharpening

快速核空间扩散用于遥感全色锐化

Hancong Jin, Zihan Cao, Liang-jian Deng, Jingjing Li

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文提出KSDiff框架,通过整合低秩核心张量生成器和统一因子生成器,利用结构感知的多头注意力机制生成增强全局上下文的卷积核,以提升全色锐化质量并加速推理,实验表明其在性能和效率上均优于现有方法。

Comments CVPR 2026 Findings

详情
AI中文摘要

全色锐化旨在将高分辨率全色(PAN)图像和低分辨率多光谱(LRMS)图像融合为一幅具有精细空间细节和丰富光谱信息的单一图像。尽管深度学习方法取得了进展,但现有方法往往无法捕捉遥感数据分布中固有的全局先验。基于扩散模型的方法因强大的分布映射能力而成为有前途的解决方案,但它们存在推理延迟大的问题。我们引入KSDiff,一种快速核空间扩散框架,通过整合低秩核心张量生成器和统一因子生成器,利用结构感知的多头注意力机制生成增强全局上下文的卷积核,以提升全色锐化质量并加速推理。我们进一步提出一种针对全色锐化的两阶段训练策略,便于集成到现有全色锐化架构中。实验表明,KSDiff在性能上优于最近的有前途的方法,并且在扩散基线全色锐化方法上实现了超过500倍的推理速度提升。消融研究、可视化和进一步评估证实了我们方法的有效性。代码将在可能接受时发布。

英文摘要

Pansharpening seeks to fuse high-resolution panchromatic (PAN) and low-resolution multispectral (LRMS) images into a single image with both fine spatial and rich spectral detail. Despite progress in deep learning-based approaches, existing methods often fail to capture global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities; however, they suffer from heavy inference latency. We introduce KSDiff, a fast kernel-space diffusion framework that generates convolutional kernels enriched with global context to enhance pansharpening quality and accelerate inference. Specifically, KSDiff constructs these kernels through the integration of a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, facilitating integration into existing pansharpening architectures. Experiments show that KSDiff achieves superior performance compared to recent promising methods, with over $500 \times$ faster inference than diffusion-based pansharpening baselines. Ablation studies, visualizations, and further evaluations substantiate the effectiveness of our approach. Code will be released upon possible acceptance.

2505.16278 2026-05-19 cs.CV cs.AI cs.RO 版本更新

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

DriveMoE:面向端到端自动驾驶的视觉-语言-动作混合专家模型

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

发表机构 * Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学计算机科学学院与人工智能学院) Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究院) Shanghai Key Laboratory of Multimodal Embodied AI(上海多模态具身人工智能重点实验室) AnyScale AI Project(AnyScale AI项目)

AI总结 本文提出DriveMoE,一种基于混合专家架构的端到端自动驾驶框架,通过场景专用的视觉混合专家和技能专用的动作混合专家,实现了对复杂驾驶场景的有效处理,展示了在自动驾驶任务中结合视觉和动作混合专家的有效性。

Comments Accepted by CVPR 2026, Project Page: https://thinklab-sjtu.github.io/DriveMoE/

详情
AI中文摘要

端到端自动驾驶(E2E-AD)需要有效处理多视角传感器数据和稳健处理多样且复杂的驾驶场景,特别是罕见的激进转弯等场景。最近混合专家(MoE)架构在大语言模型(LLMs)中的成功表明,参数的专业化能够实现强大的可扩展性。在本工作中,我们提出了DriveMoE,一种新的基于MoE的E2E-AD框架,包含场景专用的视觉MoE和技能专用的动作MoE。DriveMoE基于我们$π_0$视觉-语言-动作(VLA)基线(最初来自具身AI领域),称为Drive-$π_0$。具体而言,我们通过训练一个路由器,根据驾驶上下文动态选择相关摄像头,将视觉MoE添加到Drive-$π_0$中。这种设计模仿了人类驾驶认知,即司机选择性地关注关键视觉线索,而不是穷尽处理所有视觉信息。此外,我们通过训练另一个路由器来激活针对不同驾驶行为的专用专家模块,通过显式的行为专业化,DriveMoE能够处理多样化的场景而不受现有模型中模式平均的困扰。在Bench2Drive闭环评估实验中,DriveMoE实现了最先进的性能,证明了在自动驾驶任务中结合视觉和动作MoE的有效性。我们将发布DriveMoE和Drive-$π_0$的代码和模型。

英文摘要

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add Vision MoE to Drive-$π_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from the mode averaging that affects existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.
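
The routing idea behind a Mixture-of-Experts layer can be illustrated with a generic top-k router. This is a textbook-style sketch, not DriveMoE's actual router: the abstract does not describe the gating function, so softmax gating with top-k selection is an assumption, and `route_top_k` and `combine` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def combine(weights, expert_outputs):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    return [sum(w * expert_outputs[i][d] for i, w in weights.items())
            for d in range(dim)]
```

Because only the selected experts contribute, each behavior can be served by its own expert instead of one averaged output. The same selection pattern could in principle pick camera views rather than action experts, although how the paper implements either router is not stated in the abstract.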

2503.14346 2026-05-19 cs.CV 版本更新

3D Densification for Multi-Map Monocular VSLAM in Endoscopy

3D致密化用于内窥镜多地图单目视觉SLAM

X. Anadón, Javier Rodríguez-Puigvert, J. M. M. Montiel

发表机构 * Universidad de Zaragoza(萨拉戈萨大学)

AI总结 本文提出了一种方法,通过去除异常值和增强地图密度,改进了内窥镜多地图单目视觉SLAM中的3D环境表示,实现了在临床应用中更精确的3D地图重建。

详情
AI中文摘要

将多地图稀疏单目视觉同时定位与建图(VSLAM)应用于单目内窥镜序列,已被证明能够在内窥镜检查中因运动模糊、暂时遮挡、器械交互或水流冲洗而频繁丢失跟踪后稳健地恢复跟踪。稀疏多地图足以实现稳健的相机定位,但对环境的表示能力很差:地图噪声大,重建不准确的3D点比例高(包含显著的异常值),更重要的是其密度低到临床应用无法接受。我们提出一种方法,用于去除异常值并对当前最先进的稀疏内窥镜多地图系统CudaSIFT-SLAM的地图进行致密化。该方法利用鲁棒的LMedS,将神经网络LightDepth给出的尺度未定的稠密深度预测与稀疏的CudaSIFT子地图对齐。我们的系统在滤除异常值的同时缓解了单目深度估计固有的尺度模糊问题,从而得到可靠的致密3D地图。我们在C3VD结肠体模数据集上给出了实验证据:致密地图的RMS精度为4.15毫米,且计算时间可接受。我们还报告了在Endomapper数据集真实结肠镜检查上的定性结果。

英文摘要

Multi-map sparse monocular visual Simultaneous Localization and Mapping (VSLAM) applied to monocular endoscopic sequences has proven effective at robustly recovering tracking after the frequent losses in endoscopy caused by motion blur, temporal occlusion, tool interaction, or water jets. The sparse multi-maps are adequate for robust camera localization; however, they represent the environment poorly: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, their density is unacceptably low for clinical applications. We propose a method to remove outliers and densify the maps of CudaSIFT-SLAM, the state of the art in sparse endoscopic multi-map SLAM. Up-to-scale dense depth predictions from the neural network LightDepth are aligned with the sparse CudaSIFT submaps using LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at affordable computing time, on the C3VD phantom colon dataset. We also report qualitative results on real colonoscopies from the Endomapper dataset.
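
The robust alignment step can be illustrated with a small least-median-of-squares (LMedS) estimator. This is a sketch under assumptions rather than the paper's pipeline: it fits a single scale factor between predicted depths and sparse map depths, whereas the actual alignment may use a richer model, and it tries every point as a candidate instead of sampling at random. `lmeds_scale` and `inlier_mask` are hypothetical names.

```python
from statistics import median


def lmeds_scale(pred, sparse):
    """Estimate the scale s that minimizes the median of (z - s * d)^2.

    pred:   up-to-scale predicted depths d_i at the sparse point locations
    sparse: depths z_i of the same points in the sparse map
    Each valid point proposes a candidate s = z / d; the candidate with the
    smallest median squared residual wins, so up to roughly half of the
    points can be outliers without corrupting the estimate.
    """
    best_scale, best_median = None, float("inf")
    for d, z in zip(pred, sparse):
        if d <= 0:
            continue  # skip invalid depth predictions
        s = z / d
        med = median((zj - s * dj) ** 2 for dj, zj in zip(pred, sparse))
        if med < best_median:
            best_scale, best_median = s, med
    return best_scale, best_median


def inlier_mask(pred, sparse, scale, tol):
    """Flag sparse points whose depth agrees with the scaled prediction."""
    return [abs(z - scale * d) <= tol for d, z in zip(pred, sparse)]
```

Minimizing the median rather than the mean is what lets one estimate serve two purposes: it fixes the scale of the dense prediction and exposes the sparse points that disagree with it.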

2502.07360 2026-05-19 q-bio.QM cs.CV 版本更新

Supervised contrastive learning for cell stage classification of animal embryos

基于监督对比学习的动物胚胎细胞阶段分类

Yasmine Hachani, Patrick Bouthemy, Elisa Fromont, Sylvie Ruffini, Ludivine Laffont, Alline de Paula Reis

发表机构 * Inria center at Rennes University, France(法国雷恩大学Inria中心) University of Rennes, IRISA, France(法国雷恩大学IRISA研究所) Paris-Saclay University, UVSQ, INRAE, BREED, France(法国巴黎-萨克雷大学、UVSQ、INRAE、BREED研究所) The National Veterinary School of Alfort (EnvA), France(法国阿尔福国立兽医学校(EnvA))

AI总结 本文提出了一种基于监督对比学习和焦点损失的深度学习方法,用于自动分类动物胚胎的细胞阶段,解决了低质量图像、类别模糊和数据分布不均等挑战,并在牛胚胎和小鼠胚胎数据集上实现了优于现有方法的性能。

Journal ref Scientific Reports, 2026

详情
AI中文摘要

视频显微镜结合机器学习为研究体外生成(IVP)胚胎的早期发育提供了有前景的方法。然而,手动标注发育事件,特别是细胞分裂,对于生物学家来说是耗时的,且无法扩展到实际应用。我们旨在利用深度学习方法自动分类来自2D时间延时显微镜视频的胚胎细胞阶段。我们专注于牛胚胎发育的分析,因为我们的主要应用是牛养殖,并创建了牛胚胎细胞阶段(ECS)数据集。挑战有三个:(1)低质量图像和牛暗细胞使细胞阶段识别困难,(2)发育阶段边界处的类别模糊,以及(3)数据分布不平衡。为了解决这些挑战,我们引入了CLEmbryo,一种结合监督对比学习和焦点损失的新型方法,并使用轻量级3D神经网络CSN-50作为编码器。我们还展示了我们的方法具有良好的泛化能力。CLEmbryo在我们的牛ECS数据集和公开可用的NYU小鼠胚胎数据集上均优于现有最先进的方法。

英文摘要

Videomicroscopy, when combined with machine learning, offers a promising approach for studying the early development of in vitro produced (IVP) embryos. However, manually annotating developmental events, and more specifically cell divisions, is time-consuming for a biologist and cannot scale up for practical applications. We aim to automatically classify the cell stages of embryos from 2D time-lapse microscopy videos with a deep learning approach. We focus on the analysis of bovine embryonic development using video microscopy, as we are primarily interested in the application of cattle breeding, and we have created a Bovine Embryos Cell Stages (ECS) dataset. The challenges are three-fold: (1) low-quality images and bovine dark cells that make the identification of cell stages difficult, (2) class ambiguity at the boundaries of developmental stages, and (3) imbalanced data distribution. To address these challenges, we introduce CLEmbryo, a novel method that leverages supervised contrastive learning combined with focal loss for training, and the lightweight 3D neural network CSN-50 as an encoder. We also show that our method generalizes well. CLEmbryo outperforms state-of-the-art methods on both our Bovine ECS dataset and the publicly available NYU Mouse Embryos dataset.
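
The two training losses named in this abstract have standard textbook forms, sketched below in plain Python. How CLEmbryo weights or schedules them is not stated in the abstract, so nothing here should be read as the authors' configuration. The function names and the default values of `gamma` and `temperature` are assumptions.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    Down-weights well-classified samples, which helps with class imbalance.
    With gamma = 0 it reduces to ordinary cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalized embeddings.

    For each anchor, samples sharing its label are positives; the loss
    pulls positives together relative to all other samples in the batch.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        sims = {j: sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
                / temperature for j in range(n) if j != i}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

The contrastive term shapes the embedding space so that frames from the same cell stage cluster together, while the focal term keeps rare stages from being swamped by frequent ones.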

2605.18132 2026-05-19 cs.CV cs.AI 版本更新

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

谁生成了这个3D资产?学习生成3D模型的来源归属

Sihan Ma, Siyuan Liang, Dacheng Tao

发表机构 * College of Computing & Data Science, Nanyang Technological University, Singapore(南洋理工大学计算机与数据科学学院)

AI总结 该研究提出了一种方法,用于确定给定3D资产是由哪种生成模型创建的,通过构建首个被动来源归属基准,发现生成3D模型留下稳定的指纹特征,从而建立了可信的3D内容来源的新标准。

详情
AI中文摘要

生成3D模型被应用于游戏、机器人和沉浸式创作,因此来源归属至关重要:给定一个3D资产,我们能否确定并识别出是哪种生成模型创建的?该问题面临两个核心挑战:分散的归属信号,其中3D指纹分布在多视角、几何和频率域提示中;以及现实部署约束,其中稀少的标签、退化的提示和混合真实/合成资产会破坏归属的可靠性。为了系统研究该问题,我们构建了迄今为止首个被动来源归属基准,涵盖22种代表性的3D生成器,在标准、少样本和现实部署协议下。基于此基准,我们发现生成3D模型留下两种稳定的指纹:跨视角不一致性和体现在几何统计和频率域提示中的结构伪影。为了捕捉这些分散的信号,我们提出了一种层次多视角多模态Transformer,融合每个视角的外观、几何和频率域特征,并在跨视角建模全局关系。大量实验表明性能优异,在全监督下达到97.22%的准确率,在仅有1%训练数据时达到77.17%的准确率,对应每个生成器少于五个样本。这些结果表明现代3D生成器留下稳定且可归属的指纹,建立了可信3D内容来源的新基准和方法论基础。

英文摘要

Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.

2605.18130 2026-05-19 cs.CV 版本更新

Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis

Rad-VLSM:一种结合语义辅助提示的跨模态框架用于医学分割与诊断

Fengyi Zhang, Xujie Zeng, Mohan Liu, Zengyi Wang, Yalong Jiang

发表机构 * Student Member, IEEE(IEEE学生会员) Member, IEEE(IEEE会员)

AI总结 本文提出Rad-VLSM框架,通过语义引导的提示机制,提升医学图像分割与诊断的准确性,解决现有模型易受背景组织和无关视觉相关性干扰的问题。

详情
AI中文摘要

医学图像分割在支持诊断而非仅仅生成病变掩码时更具临床价值。然而,诊断相关的病变线索往往微妙且局部化,而现有模型可能受背景组织、声学伪影和无关视觉相关性干扰。为了解决这个问题,我们提出了Rad-VLSM,一种两阶段跨模态框架,用于语义辅助的病变聚焦、鲁棒分割和视觉基础诊断。第一阶段中,基于BLIP-2的视觉-语言对齐模块在语义引导下识别病变相关候选区域,并将其转换为框提示。第二阶段中,这些提示被输入基于SAM的多任务网络,其中多候选区域聚合策略提高提示稳定性并引导病变分割。预测的掩码随后用作诊断的空间先验,视觉-放射组学融合头将病变感知的视觉特征与选定的放射组学描述符整合。通过使用语义信息进行定位而非直接预测,Rad-VLSM减少了文本到诊断的依赖,并将诊断基于病变层面的证据。在私有临床乳腺超声数据集和公共基准测试中,Rad-VLSM在分割和诊断性能方面表现强劲,具有良好的泛化能力。

英文摘要

Medical image segmentation is more clinically valuable when it supports diagnosis rather than merely producing lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, while existing models may be distracted by background tissues, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks are then used as spatial priors for diagnosis, and a visual-radiomics fusion head integrates lesion-aware visual features with selected radiomics descriptors. By using semantic information for localization rather than direct prediction, Rad-VLSM reduces text-to-diagnosis dependence and grounds diagnosis in lesion-level evidence. Experiments on a private clinical breast ultrasound dataset and public benchmarks show that Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.

2605.18115 2026-05-19 cs.CV 版本更新

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

WinTok: 一种通过分解视觉理解和生成来实现双赢的混合分词器

Yiwei Guo, Shaobin Zhuang, Zhipeng Huang, Canmiao Fu, Chen Li, Jing Lyu, Yali Wang

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院) WeChat Vision, Tencent Inc.(腾讯微信视觉部) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出WinTok,一种通过分解视觉理解和生成任务来实现双赢的混合分词器,通过引入可迁移的语义分词来减少跨任务干扰,从而在多个基准测试中提升了重建、理解和生成性能。

详情
AI中文摘要

构建统一的视觉分词器对于弥合视觉理解和生成之间的差距至关重要。然而,现有方法在处理这两个任务之间的固有冲突时存在困难,因为单一的分词空间被迫同时支持高层语义抽象和低层像素重建。我们提出了WinTok,一种简洁的混合分词器,通过显式解耦这两个目标实现了双赢性能。WinTok通过添加一组可学习的语义分词来补充像素分词,有效地减轻了跨任务干扰,而无需付出双分词器的计算开销。为进一步增强理解能力,我们引入了不对称的分词蒸馏机制:语义分词通过任何视觉基础模型预训练的语义嵌入进行引导,使它们能够继承强大的辨别能力,同时保持灵活性。在10个具有挑战性的基准测试中,WinTok在重建、理解和生成方面都实现了持续的改进。仅在5000万开源数据上训练,WinTok在分类准确率上超越了强大的基线UniTok 11.2%,尽管其使用的训练数据显著少于其他方法。代码已发布在https://github.com/markywg/WinTok。

英文摘要

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.

2605.18111 2026-05-19 cs.CL cs.CV 版本更新

How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

LLMs在回答孟加拉语医学视觉问题方面的表现如何?数据集与基准测试

Rafid Ahmed, Intesar Tahmid, Mir Sazzat Hossain, Tasnimul Hossain Tomal, Md Fahim, Md Farhad Alam Bhuiyan

发表机构 * Penta Global Limited Center for Computational & Data Sciences, Independent University(独立大学计算与数据科学中心)

AI总结 本文提出BanglaMedVQA数据集,用于评估当前基础模型在孟加拉语医学视觉问答任务中的表现,发现其性能显著低于英语基准,揭示了低资源语言在医学推理中的挑战。

Comments 14 pages, 7 figures, 5 tables, Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:1-14, 2026

详情
AI中文摘要

近年来,大型语言模型(LLMs)和大型视觉语言模型(LVLMs)的进步使通用系统在复杂推理任务中展现出有希望的能力,包括医学领域。医学视觉问答(MedVQA)尤其受益于这些发展。然而,尽管孟加拉语是全球最广泛使用的语言之一,但尚不存在针对它的MedVQA基准。为解决这一缺口,我们引入了BanglaMedVQA数据集,包含经过临床验证的图像-问题-答案三元组,并对当前基础模型在该资源上的全面评估。与先前发现的当前模型在英语MedVQA基准上表现不佳一致,我们的分析显示孟加拉语性能显著更低,反映了低资源语言固有的挑战。即使表现最佳的模型如Gemini和GPT-4.1 mini也未能准确回答专门的诊断问题,表明在细粒度医学推理方面存在严重限制。虽然某些开源模型如Gemma-3偶尔在一般类别中优于这些模型,但它们在临床复杂问题上也表现不佳,凸显了对顶级评估方法的迫切需求。

英文摘要

Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer triplets, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for high-quality evaluation methods.

2605.18109 2026-05-19 cs.AI cs.CV cs.RO 版本更新

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

TaskGround:全场景家庭推理的结构化可执行任务推断

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University(清华大学) Microsoft Research Asia(微软亚洲研究院) Peking University(北京大学) Fudan University(复旦大学) Zhejiang University(浙江大学)

AI总结 本文提出TaskGround框架,通过结构化任务推断提升全场景家庭推理能力,其核心贡献是引入FullHome评估套件,验证了在家庭场景中执行任务结构推断的重要性,并展示了紧凑本地模型在实际家庭部署中的有效性。

Comments Project page: https://aaronfengzy.github.io/TaskGround/

详情
AI中文摘要

在真实家庭部署中,家庭代理通常必须从完整的家庭场景和处于特定情境的家庭请求出发,而不是从干净的任务规范出发。此类请求要求代理识别与任务相关的实体,恢复意图的任务条件,并从周围场景上下文中解决顺序约束。我们正式将这种能力定义为全场景家庭推理:给定一个完整的家庭场景和一个处于特定情境的家庭请求,代理必须在生成接地技能级动作序列之前推断出可执行的任务结构。这种设置具有挑战性,因为完整的家庭场景包含大量与任务无关的信息,使直接完整场景提示效率低下且容易出错。在实际部署中,这一挑战进一步被隐私和本地计算限制放大,这些限制更倾向于紧凑的开放权重模型,其具有有限的长上下文推理能力。我们提出TaskGround,一种无需训练且模型无关的Ground-Infer-Execute框架,该框架将完整的场景接地为紧凑的任务相关场景切片,推断出可执行的任务结构,并将其编译为接地的技能级动作序列。为了评估这一设置,我们引入了FullHome,一个经过人类验证的400个家庭任务评估套件,涵盖多样化的家庭规模环境以及目标导向和过程约束要求。在FullHome上,TaskGround在专有和开放权重模型上均大幅提升了任务成功率。值得注意的是,它使Qwen3.5-9B在直接完整场景提示下与GPT-5竞争,同时将总输入token成本减少了多达18倍。我们的结果识别了执行任务结构推断为全场景家庭推理中的关键瓶颈,并表明结构化接地可以显著提高紧凑本地模型在实际家庭部署中的有效性。

英文摘要

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.

2605.18101 2026-05-19 cs.CV cs.AI 版本更新

SENSE: Satellite-based ENergy Synthesis for Sustainable Environment

SENSE: 基于卫星的能源合成以实现可持续环境

Kailai Sun, Mingyi He, Heye Huang, Can Rong, Alok Prakash, Baoshen Guo, Shenhao Wang, Jinhua Zhao

发表机构 * Massachusetts Institute of Technology(麻省理工学院) University of Florida(佛罗里达大学)

AI总结 本文提出SENSE,一种统一的生成性城市建筑能耗框架,通过结合生成扩散模型和大规模视觉模型知识,生成高分辨率的城市卫星图像和对齐的高质量建筑能耗和高度地图,以提高城市可持续发展预测性能。

Comments Accepted by KDD 2026 (Oral)

详情
AI中文摘要

城市建筑能耗建模在实现联合国可持续发展目标7和11中起着关键作用。尽管基于卫星图像和深度学习的研究已取得显著进展,但仍存在许多挑战:大多数现有研究本质上是预测性的,无法反映城市规划的生成性;虽然生成式AI和扩散模型在卫星图像中实现了指数级增长,但缺乏城市功能生成(例如能耗层);第三,高质量高分辨率建筑能耗数据与卫星图像的对齐数据有限且稀缺。本文提出SENSE(基于卫星的能源合成以实现可持续环境),一种统一的生成性城市建筑能耗(UBEM)框架,联合合成逼真的城市卫星图像和对齐的高质量建筑能耗和高度地图。通过在道路网络和城市密度指标上进行条件控制,SENSE基于可控扩散模型,利用大规模视觉模型学习到的知识,生成城市建筑能耗和高度信息(注释)在潜在空间中。在四个城市(纽约市、波士顿、里昂、釜山)上的实验表明,SENSE实现了高视觉保真度和强物理一致性,满足ASHRAE标准度量。实验表明,SENSE可以使用少于20%的标注能耗数据生成足够的注释合成数据,将下游预测性能提升10% IoU。与最先进的城市能耗预测方法相比,SENSE显著降低了预测误差(预测误差减少了3%-11% NMBE和1%-9% CVRMSE)。本研究为城市科学、能源科学和建筑科学提供了能耗效率的城市规划和物理生成解决方案。数据集和代码:https://huggingface.co/datasets/skl24/MUSE和https://github.com/kailaisun/GenAI4Urban-Energy/.

英文摘要

Urban Building Energy Modeling plays a critical role in achieving the United Nations' Sustainable Development Goals 7 and 11. Although existing studies based on satellite imagery and deep learning have achieved remarkable progress, many challenges exist: most existing studies are inherently predictive, failing to reflect the generative nature of urban planning; although generative AI and diffusion models have seen explosive growth in satellite imagery, they lack the urban functional generation (e.g., energy layer); third, aligned high-quality high-resolution building energy data with satellite imagery is limited and scarce. Here we propose SENSE (Satellite-based ENergy Synthesis for Sustainable Environment), a unified generative UBEM framework that jointly synthesizes realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. By conditioning on road networks and urban density metrics, SENSE, based on a controllable diffusion model, leverages the knowledge learned by large vision models to generate urban building energy consumption and height information (annotations) in the latent space. Experiments across four cities (New York City, Boston, Lyon, Busan) demonstrate that SENSE achieves high visual fidelity and strong physical consistency, satisfying the ASHRAE standard metric. Experiments demonstrate that SENSE can generate enough annotated synthetic data using less than 20% labeled energy data, boosting downstream prediction performance by 10% IoU. Compared to SOTA urban energy prediction methods, SENSE significantly reduced prediction error (reduced 3%-11% NMBE and 1%-9% CVRMSE). This study offers an energy-efficiency urban planning and physical generation solution for urban science, energy science and building science. The dataset and code: https://huggingface.co/datasets/skl24/MUSE and https://github.com/kailaisun/GenAI4Urban-Energy/.

2605.18063 2026-05-19 cs.CV cs.LG 版本更新

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

MixCount数据集:弥合开放词汇物体计数的数据缺口

Corentin Dumery, Niki Amini-Naieni, Shervin Naini, Pascal Fua

发表机构 * EPFL(洛桑联邦理工学院) University of Oxford(牛津大学) Northwestern University(西北大学)

AI总结 本文提出MixCount数据集,通过自动生成管道解决开放词汇物体计数中混合物体场景下的数据不足问题,展示了在真实世界基准上的显著提升。

Comments Co-first authors. Dataset and project page https://corentindumery.github.io/projects/mixcount.html

详情
AI中文摘要

物体计数是一个基础的视觉任务,已有超过十年的专门研究,但最先进的模型在混合物体设置中仍系统性地失败,这在工业检测和产品分拣等现实应用中占主导地位。我们证明,这一差距主要是由现有训练和评估数据的限制造成的:真实的计数数据集标注成本过高且存在标签噪声,而现有的合成替代方案缺乏多样性和现实感。我们通过MixCount数据集和基准来解决这一问题,该数据集旨在针对当前计数模型的失败模式。为了克服构建和标注此类数据的高成本,我们开发了一种自动生成管道,能够大规模合成图像、细粒度文本描述和像素级计数注释,消除了此前数据集中的标注模糊性。在MixCount上评估最先进的计数模型会暴露混合物体设置下的严重退化。更重要的是,将这些模型在我们的合成数据上训练,在真实世界基准上取得了显著提升,将FSC-147的MAE降低了20.14%,在PairTally上降低了18.3%。这些结果确立了MixCount作为细粒度计数的基准和训练数据集,并证明了我们的管道能够产生实际上无限的标注数据,从而解决了计数模型中长期存在的瓶颈问题。

英文摘要

Object counting is a foundational vision task with over a decade of dedicated research, yet state-of-the-art models still fail systematically in the mixed-object setting that dominates real-world applications such as industrial inspection and product sorting. We show that this gap is strongly driven by limitations in existing training and evaluation data: real counting datasets are prohibitively expensive to annotate and suffer from labeling noise, while existing synthetic alternatives lack diversity and realism. We address this with MixCount, a dataset and benchmark for mixed-object counting designed to target the failure modes of current counting models. To overcome the high cost of constructing and labeling such data, we develop an automatic generation pipeline that synthesizes images, fine-grained textual descriptions, and pixel-perfect counting annotations at scale, eliminating the labeling ambiguity that plagues prior datasets. Evaluating state-of-the-art counting models on MixCount exposes severe degradation in the mixed-object setting. More importantly, training these models on our synthesized data yields substantial gains on real-world benchmarks, reducing MAE by 20.14% on FSC-147 and by 18.3% on PairTally. These results establish MixCount as both a benchmark and a training dataset for fine-grained counting, and demonstrate that our pipeline, which produces effectively unlimited labeled data, helps address a long-standing bottleneck in counting models.

2605.18060 2026-05-19 cs.CV 版本更新

Embedded ConvNet Ensembles: A Lightweight Approach to Recognize Arabic Handwritten Characters

嵌入式卷积网络集合:一种轻量级的阿拉伯手写字符识别方法

Mohsine El Khayati, Rachid Elouahbi, Abdelillah Semma

发表机构 * Systems theory and informatics laboratory(系统理论与信息学实验室) Moulay Ismail University of Meknes(梅克内斯穆莱·伊斯梅尔大学) Laboratory of Computer Science and Applications(计算机科学与应用实验室) Computer Science Dept.(计算机科学系)

AI总结 本文提出了一种轻量级嵌入式卷积网络与集成学习相结合的方法,用于实现阿拉伯手写字符识别,通过实验验证了轻量模型在准确率上的优势以及集成学习对性能的提升。

Comments Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026

详情
AI中文摘要

阿拉伯手写字符识别(AHCR)近年来借助深度卷积神经网络(ConvNets)取得了显著进展。然而,文献中的许多模型层数深,在参数量和FLOPs方面计算成本高,限制了它们在日益普及的资源受限设备上的部署。本研究通过结合轻量级嵌入式ConvNet模型和集成学习技术来填补这一空白。我们进行了大量实验以确定AHCR的最佳实践,考察了训练超参数、学习策略、模型选择和集成方法。结果表明,嵌入式模型可以达到与更重的架构相当甚至更高的准确率。集成学习仅以适度的计算开销进一步提升性能,在具有挑战性的训练场景中尤为明显。在各种集成策略中,软投票取得了最佳的整体结果。

英文摘要

Arabic Handwritten Character Recognition (AHCR) has recently advanced significantly with deep Convolutional Neural Networks (ConvNets). However, many models in the literature are deep and computationally expensive in terms of parameters and FLOPs, limiting their deployment on resource-constrained devices, which are increasingly common. This study addresses this gap by proposing a combination of lightweight embedded ConvNet models and ensemble learning techniques. Extensive experiments were conducted to identify best practices in AHCR, considering training hyperparameters, learning strategies, model choices, and ensemble methods. Results show that embedded models can achieve accuracy comparable to, or even surpassing, heavier architectures. Ensemble learning further enhances performance with only modest computational overhead, particularly under challenging training scenarios. Among the ensembling strategies, soft voting yielded the best overall results.
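
The abstract reports that soft voting was the best ensembling strategy but does not spell out the computation. The following is a minimal sketch of the generic technique, not the authors' implementation: soft voting averages each model's class probabilities before taking the argmax, whereas hard voting counts top-1 predictions. The function names and the probability values are hypothetical and chosen only to show a case where the two strategies disagree.

```python
def soft_vote(prob_rows):
    """Average per-model class probabilities; return (winning class, averaged probabilities)."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg


def hard_vote(prob_rows):
    """Majority vote over each model's top-1 class (ties go to the lowest class index)."""
    votes = [max(range(len(row)), key=lambda c: row[c]) for row in prob_rows]
    return max(sorted(set(votes)), key=votes.count)


# Hypothetical softmax outputs of three small models for one character image (3 classes).
probs = [
    [0.40, 0.35, 0.25],  # model A: weakly prefers class 0
    [0.45, 0.40, 0.15],  # model B: weakly prefers class 0
    [0.05, 0.90, 0.05],  # model C: strongly prefers class 1
]

soft_class, avg = soft_vote(probs)
hard_class = hard_vote(probs)
print("soft voting picks class", soft_class, "with averaged probabilities", avg)
print("hard voting picks class", hard_class)
```

In this toy case the two weak votes for class 0 win under hard voting, while soft voting lets the single confident model tip the decision to class 1. That sensitivity to confidence is the usual argument for soft voting when ensemble members are small.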

2605.18058 2026-05-19 cs.CV 版本更新

Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models

阿拉伯手写识别的威胁:调查嵌入式卷积网络模型上的黑盒对抗攻击

Mohsine EL Khayati, Abdelillah Semma, Abdelaziz Courr, Rachid Elouahbi

发表机构 * Systems theory and informatics laboratory(系统理论与信息学实验室) Moulay Ismail University of Meknes(梅克内斯穆莱·伊斯梅尔大学) Department of Computer Science(计算机科学系) EST of Sidi Bennour(西迪本努尔高等技术学院) Chouaib Doukkali University(舒艾卜·杜卡利大学) Faculty of Education Sciences(教育科学学院) University Mohammed V(穆罕默德五世大学) Laboratory of Computer Science and Applications(计算机科学与应用实验室)

AI总结 本研究探讨了阿拉伯手写识别系统对黑盒对抗攻击的脆弱性,通过实验揭示了高精度模型在面对对抗攻击时的易受攻击性,强调了加强模型安全性和可靠性的必要性。

Comments Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026

详情
AI中文摘要

阿拉伯手写识别(AHR)通过深度学习模型取得了显著进展。AHR研究主要关注性能,而安全性却很少受到重视。本研究通过展示高性能模型对对抗黑盒攻击的易受攻击性,提供了一条新的研究方向。研究聚焦于黑盒攻击,反映了现实场景中攻击者对模型架构没有先验知识的情况。在两个包含阿拉伯手写字符的基准AHR数据集上进行了大量实验。结果表明攻击的有效性,其中Pixle攻击在大多数模型上达到了99-100%的攻击成功率。其他较为温和的攻击在大多数实验中达到了50-96%的成功率。尽管攻击成功率较高,但攻击保持了字符的结构完整性,使其在人眼几乎不可察觉。研究结果表明,所研究的模型对对抗操纵具有更高的易受性。这突显了加强这些模型安全性和可靠性以确保其在AHR实际应用中的必要性。

英文摘要

Arabic handwriting recognition (AHR) has made significant progress with deep learning models. AHR research has largely focused on performance, with security receiving little attention. This study provides what appears to be a new line of inquiry by demonstrating the vulnerability of high-performing models to adversarial black-box attacks. The focus on black-box attacks reflects real-world scenarios where the attacker has no prior knowledge of the model architecture. Extensive experiments were conducted on two benchmark AHR datasets containing Arabic handwritten characters. Results demonstrated the effectiveness of the attacks, with the Pixle attack achieving an attack success rate of 99-100% on most models. Other, less aggressive attacks achieved success rates of 50-96% across most experiments. Despite the high attack success rates, the attacks maintain the structural integrity of the characters, rendering the perturbations almost imperceptible to the human eye. The findings indicate the high vulnerability of the studied models to adversarial manipulation. This underscores the need to strengthen efforts to secure these models and ensure their reliability in real-world AHR applications.

2605.18054 2026-05-19 eess.IV cs.CV cs.MM 版本更新

CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery

CATRF:用于体积分发的编码自适应三平面辐射场

Tung-I Chen, Lingdong Wang, Subhransu Maji, Ramesh K. Sitaraman

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 本文提出CATRF,一种面向体积内容分发的编解码器自适应三平面辐射场方法:在训练过程中将二维特征平面量化并打包到对编解码器友好的画布中,执行标准编解码器往返,并通过直通估计器将不可微的编解码器流水线插入训练环路,使辐射场特征能够直接适应客户端编解码器失真,而无需引入任何可学习的编解码器参数。

详情
AI中文摘要

体积媒体有望催生下一代内容分发应用,但其带宽需求仍是关键瓶颈。隐式和混合体积表示减小了模型大小,但仍需精心编码才能达到接近2D视频的比特率。我们提出CATRF,一种面向平面分解辐射场的、将标准编解码器置于训练环路中的压缩框架。在训练过程中,我们将二维特征平面量化并打包到对编解码器友好的画布中,执行一次标准编解码器往返(JPEG/VP9/HEVC/AV1),然后在体渲染之前对解码后的特征进行解包和反量化。我们使用直通估计器(STE)将不可微的标准编解码器流水线插入训练环路,使辐射场特征能够直接适应真实的客户端编解码器失真,而无需引入任何可学习的编解码器参数。在静态和动态基准上,CATRF的率失真权衡始终优于与编解码器无关的基线和将学习型编解码器置于环路中的基线,并且在压缩效率和解码速度上也优于近期的压缩3DGS方法。这些结果为自由视点视频流媒体指出了一条通往低比特率、抗压缩失真的体积表示的实用路径。

英文摘要

Volumetric media promises next-generation content delivery applications, but its bandwidth demand remains a key bottleneck. Implicit and hybrid volumetric representations reduce model sizes, yet still require careful coding to reach 2D video-like bitrates. We present CATRF, a standard-codec-in-the-loop compression framework for plane-factorized radiance fields. During training, we quantize and pack 2D feature planes into codec-friendly canvases, run a standard codec roundtrip (JPEG/VP9/HEVC/AV1), then unpack and dequantize the decoded features before volume rendering. We use a straight-through estimator (STE) to insert the non-differentiable, standard codec pipeline into the training loop, allowing radiance-field features to adapt directly to the real, client-side codec distortions without introducing any learned codec parameters. On both static and dynamic benchmarks, CATRF consistently achieves a better rate-distortion trade-off over codec-agnostic and learned-codec-in-the-loop baselines, and also outperforms recent compressed 3DGS methods in both compression efficiency and decoding speed. These results highlight a practical path toward low-bitrate, compression-resilient volumetric representations for free-viewpoint video streaming.
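
The straight-through estimator mentioned in the abstract can be illustrated on a toy problem. The sketch below is not CATRF's pipeline: it assumes a uniform scalar quantizer as a stand-in for the real codec round trip (JPEG/VP9/HEVC/AV1), a squared-error loss in place of volume rendering, and hand-written gradients in place of an autograd framework. All function names, the grid size, and the learning rate are hypothetical. What it does show is the core idea: the forward pass sees the distorted (quantized) value, while the backward pass treats the non-differentiable round trip as the identity.

```python
def quantize_roundtrip(x, levels=16, lo=-1.0, hi=1.0):
    """Stand-in for a codec round trip: clamp, snap to a uniform grid, map back to float."""
    x = min(max(x, lo), hi)
    step = (hi - lo) / (levels - 1)
    return lo + round((x - lo) / step) * step


def loss(features, targets):
    """Squared error measured on the decoded (distorted) features."""
    return sum((quantize_roundtrip(f) - t) ** 2 for f, t in zip(features, targets))


def ste_step(features, targets, lr=0.1):
    """One gradient step using the STE: d(decoded)/d(feature) is taken to be 1."""
    updated = []
    for f, t in zip(features, targets):
        decoded = quantize_roundtrip(f)  # forward pass uses the distorted value
        grad = 2.0 * (decoded - t)       # backward pass skips the rounding
        updated.append(f - lr * grad)
    return updated


features = [0.9, -0.7, 0.1]  # hypothetical feature-plane entries
targets = [0.2, 0.2, 0.2]    # hypothetical values the decoded features should reach

initial = loss(features, targets)
for _ in range(50):
    features = ste_step(features, targets)
final = loss(features, targets)
print("loss before:", initial, "loss after:", final)
```

Without the STE the gradient of the rounding step would be zero almost everywhere and the features would never move. With it, each feature drifts until its decoded value lands on the target grid point.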

2605.18052 2026-05-19 cs.CV 版本更新

Efficient 3D Content Reconstruction and Generation

高效3D内容重建与生成

Jiahao Li

发表机构 * TOYOTA TECHNOLOGICAL INSTITUTE AT CHICAGO(芝加哥丰田技术学院)

AI总结 本文提出了一种高效的3D内容生成和重建方法,通过结合多视图扩散和稀疏视图3D重建,实现了高质量的3D资产生成,并开发了FastMap算法以提高3D重建的速度和精度。

详情
AI中文摘要

自动3D内容创建旨在用能够从文本或图像直接合成或恢复3D资产的系统取代劳动密集型的建模和扫描流程。其应用范围涵盖视频游戏、虚拟现实、机器人技术和模拟,使资产原型设计、多样化的交互世界生成和高效的3D数据收集成为可能。当前解决方案主要遵循两种互补的范式:(i)文本或图像到3D生成,学习3D几何和外观的先验知识,以从自然语言或单视图图像创建新资产;(ii)3D重建,从RGB图像估计相机姿态和几何结构。本论文在两个方向上都取得了进展。在生成方面,我介绍了Instant3D,它结合了多视图扩散和前馈稀疏视图3D重建,可在5-20秒内生成高质量的资产。在重建方面,我开发了FastMap,一种结构从运动流水线,通过使用一阶优化与广泛融合的GPU内核,实现了比现有最先进方法快10倍的速度提升,同时保持了可比的姿态精度和下游新视图合成质量。

英文摘要

Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to 10x speedup over prior state-of-the-art by using first-order optimization with fused GPU kernels extensively, while maintaining comparable pose accuracy and downstream novel view synthesis quality.

2605.18041 2026-05-19 cs.CV 版本更新

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

OmniSelect: 动态模态感知的令牌压缩用于高效多模态大语言模型

Morunliu Yang, Ruotao Xu, Le Li, Yue Wang, Jianxin Zhang, Juntao Li, Yihang Lou, Siwei Feng, Peifeng Li

发表机构 * Soochow University(苏州大学) Peking University(北京大学)

AI总结 本文提出OmniSelect,一种无需训练的模态自适应令牌剪枝框架,通过动态选择压缩策略来提高多模态大语言模型的效率,通过轻量级AudioCLIP模型估计跨模态相关性,并根据相关性得分在不同时间组中进行细粒度令牌剪枝,从而在不增加训练成本的情况下实现高效的多模态令牌压缩。

详情

英文摘要

Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.

2605.18039 2026-05-19 cs.CV 版本更新

SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals

SGSoft: 通过模板引导的软信号学习融合语义-几何特征以实现3D形状对应

Soyeon Yoon, Chang Wook Seo, Hyunjung Shim

发表机构 * KAIST AI(韩国科学技术院人工智能研究所) Anigma Technologies(Anigma科技公司)

AI总结 本文提出SGSoft方法,通过模板引导的软信号学习融合语义-几何特征,实现3D形状对应,解决了结构变化、非等距变形和拓扑不一致的挑战,实现了最先进的跨类别泛化和最佳精度-效率权衡。

详情
AI中文摘要

学习变形3D形状之间的密集对应关系仍是一个长期挑战,由于结构变化、非等距变形和不一致拓扑。现有方法通常在通用性、几何保真度和效率之间进行权衡。我们通过提出SGSoft,一个统一的内在流程,解决这个问题:(i) 在标准模板上构建测地线对应场;(ii) 学习由预训练语义先验引导的多模态密集描述符,利用该测地线对应场监督;(iii) 通过描述符空间的最近邻搜索在单次前向传递中检索密集对应关系。这种公式在大姿态变化、结构差异和重新网格化下实现了稳定且拓扑不变的监督。SGSoft在跨类别泛化方面达到最先进的水平,同时在先前方法中提供了最佳的精度-效率权衡。它还实现了近实时推断,无需预对齐、成对优化或后处理。学习的描述符可以有效地转移到下游任务,如语义分割和变形转移,建立了一种可扩展且可部署的密集3D对应范式。

英文摘要

Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy-efficiency trade-off among prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence.
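
Step (iii) of the pipeline, retrieving correspondences by nearest-neighbor search in descriptor space, reduces to a simple lookup once per-vertex descriptors exist. The sketch below assumes the descriptors are already computed and uses brute-force squared L2 distance. The 2-D descriptor values and the function name are hypothetical; a real system would use learned, higher-dimensional descriptors and likely an accelerated index.

```python
def nearest_neighbor_matches(src_desc, tgt_desc):
    """For each source vertex, return the index of the closest target descriptor (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(tgt_desc)), key=lambda j: sq_dist(d, tgt_desc[j]))
        for d in src_desc
    ]


# Hypothetical per-vertex descriptors for two shapes with three vertices each.
src = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
tgt = [[0.0, 1.0], [0.55, 0.45], [1.0, 0.0]]

matches = nearest_neighbor_matches(src, tgt)
print(matches)  # source vertex i corresponds to target vertex matches[i]
```

Because each match is an independent lookup, no pairwise optimization between the two shapes is needed, which is consistent with the single feed-forward pass the abstract describes.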

2605.18038 2026-05-19 cs.CV 版本更新

Patch Ensembles for Robust Salmon Re-Identification with Weak Trajectory Labels

基于补丁的鲁棒性鲑鱼重识别方法:使用弱轨迹标签

Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester

发表机构 * Department of Computer Science, Norwegian University of Science and Technology(挪威科技大学计算机科学系) SINTEF Ocean(SINTEF海洋研究中心) Department of Engineering Cybernetics, Norwegian University of Science and Technology(挪威科技大学工程控制论系)

AI总结 本文提出了一种基于补丁的重识别框架,通过融合补丁级预测来决定鲑鱼身份,利用侧线预测提取纹理锚定的补丁和补丁切片,通过多摄像头实验设置构建跨摄像头测试集,实验证明该方法在同轨迹验证和跨摄像头测试中均优于全图像基线,展示了更好的泛化能力和鲁棒性。

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP)

详情
AI中文摘要

在商业网箱中,鲑鱼重识别具有挑战性,因为种群数量大,这要求严格准确性并使大规模标记数据获取不可行。轨迹ID可以作为代理标签,但会引入轨迹ID偏差。为了解决这些挑战,我们提出了一种基于补丁的重识别框架,将补丁级预测融合到鲑鱼身份决策中。一个关键组件是预测鲑鱼的侧线,从而提取纹理锚定的补丁和补丁切片。为了实现真实的评估,我们引入了一个实验设置,使用多个相距6米的摄像头,允许同一鱼在不同轨迹中被记录。这使得通过手动匹配确认构建跨摄像头测试集成为可能。我们的集成方法在同轨迹验证中(0.932到0.965 mAP)和跨摄像头测试中(0.609到0.860 mAP)均优于全图像基线。跨摄像头设置的显著改进证明了改进的通用性和鲁棒性。代码和数据:https://github.com/espenbh/salmon-reid-patch-ensemble。

英文摘要

Salmon re-identification in commercial net-pens is challenging due to large populations, which impose strict accuracy requirements and make large-scale labeled data acquisition infeasible. Trajectory IDs can be used as proxy labels, but this introduces trajectory-ID bias. To address these challenges, we propose a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. A key component is the prediction of the salmon's lateral line, enabling extraction of texture-anchored patches and patch slices. To enable realistic evaluation, we introduce an experimental setup using multiple cameras placed 6 m apart, allowing the same fish to be recorded in different trajectories. This enables the construction of a cross-camera test set through manual match confirmation. Our ensemble approach outperforms the full-image baseline in same-trajectory validation (0.932 to 0.965 mAP) and cross-camera testing (0.609 to 0.860 mAP). The substantial improvements in the cross-camera setting demonstrate improved generalizability and robustness. Code and data: https://github.com/espenbh/salmon-reid-patch-ensemble.

2605.18029 2026-05-19 cs.CV 版本更新

What Matters for Grocery Product Retrieval with Open Source Vision Language Models

在开源视觉语言模型中,什么因素影响杂货产品检索

Emmanuel G. Maminta, Rowel O. Atienza

发表机构 * AI Graduate Program, University of the Philippines, Diliman, Quezon City(菲律宾大学迪利曼分校人工智能研究生项目,奎松市) EEEI, University of the Philippines, Diliman, Quezon City(菲律宾大学迪利曼分校电气与电子工程学院,奎松市)

AI总结 本文研究了开源视觉语言模型在杂货产品检索任务中的零样本表现,发现数据质量比规模更重要,高效模型可以胜出,并且Recall@5与Recall@1之间仍存在精度差距。

Comments Accepted in the 28th International Conference on Pattern Recognition (ICPR 2026)

详情
AI中文摘要

多模态产品检索(MPR)是免结账零售和自动化库存系统的基础,但它需要细粒度的SKU区分能力,而标准视觉语言基准无法体现这一点。我们首次在GroceryVision挑战赛的MPR任务上对190个开源VLM进行了系统的零样本评估,分别考察预训练数据、架构和输入分辨率的影响。我们的分析得出三个可操作的发现。(1)数据质量胜过规模。从原始网络爬取数据切换到过滤后的数据集可带来高达16.6%的准确率提升,超过将模型参数翻倍的收益。(2)高效模型可以胜出。MobileCLIP-B(150M参数)优于在噪声数据上训练的351M参数模型。我们引入了效率指标“语义功率密度”(ϕ),该指标会惩罚低于阈值的准确率。(3)精度差距依然存在。最先进的模型在Recall@5上达到94.5%,但在Recall@1上下降17.5%,表明对比嵌入能有效地按类别聚类,却无法对视觉上相似的SKU进行排序。代码和评估脚本可在https://github.com/upeee/openmpr获取。

英文摘要

Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. (1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. (2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce semantic power density (ϕ), an efficiency metric that penalizes sub-threshold accuracy. (3) A precision gap persists. State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.
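
The gap between Recall@5 and Recall@1 reported above is easier to read with the metric written out. The sketch below uses the standard single-ground-truth definition of Recall@k and a hypothetical set of four queries. The SKU names and rankings are invented for illustration and are not taken from the GroceryVision data or the authors' evaluation scripts.

```python
def recall_at_k(ranked_ids, true_ids, k):
    """Fraction of queries whose ground-truth item appears among the top-k retrieved IDs."""
    hits = sum(1 for ranked, truth in zip(ranked_ids, true_ids) if truth in ranked[:k])
    return hits / len(true_ids)


# Hypothetical retrieval output: the right product family is always retrieved,
# but visually similar SKUs are sometimes ranked above the correct one.
ranked = [
    ["cola_330ml", "cola_500ml", "cola_zero_330ml"],
    ["cola_500ml", "cola_330ml", "cola_zero_330ml"],
    ["oat_milk_1l", "soy_milk_1l", "almond_milk_1l"],
    ["soy_milk_1l", "almond_milk_1l", "oat_milk_1l"],
]
truth = ["cola_330ml", "cola_330ml", "oat_milk_1l", "oat_milk_1l"]

r1 = recall_at_k(ranked, truth, 1)
r3 = recall_at_k(ranked, truth, 3)
print("Recall@1:", r1, "Recall@3:", r3)
```

Here every correct SKU is within the top three, yet only half are ranked first. This is the same pattern the abstract describes: embeddings that cluster categories well but cannot order near-identical products.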

2605.18018 2026-05-19 cs.CV cs.AI cs.HC 版本更新

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

See What I Mean: 对齐视觉与语言表示以实现视频细粒度物体理解

Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

发表机构 * VCIP, CS, Nankai University(南开大学计算机科学与技术学院) Tongyi Lab, Alibaba Group(阿里巴巴集团通义实验室) NKIARI, Shenzhen Futian(南开大学国际先进研究院,深圳福田)

AI总结 本文提出SWIM方法,通过对齐视觉和语言表示,仅从文本提示中实现细粒度物体理解,解决了传统方法需要显式视觉提示的问题,通过构建NL-Refer数据集和多层交叉注意力图提升文本-视觉对齐性能。

Journal ref CVPR 2026

详情
AI中文摘要

我们提出了SWIM(See What I Mean),一种新颖的训练策略,通过对齐视觉和语言表示,仅从文本提示中实现细粒度物体理解。与需要显式视觉提示(如掩码或点)的传统方法不同,SWIM仅在训练期间利用掩码监督来指导跨模态注意力,使模型在推理时能够自动关注用户指定的物体。我们对预训练多模态大语言模型(MLLMs)的交叉注意力分析揭示了一种系统性差异:属性词在视觉模态中产生尖锐、局部化的激活,而物体名词由于语义参考偏差和分布式高层表示产生扩散和分散的模式。为了解决这种不对齐问题,我们构建了NL-Refer数据集,其中每个物体掩码都配以精确的自然语言指引用。SWIM从物体名词中提取多层交叉注意力图,并强制与真实掩码保持空间一致性。实验结果表明,SWIM显著提高了文本-视觉对齐性能,并在细粒度物体理解基准上优于基于视觉提示的方法。代码和数据可在https://github.com/HumanMLLM/SWIM获取。

英文摘要

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.

2605.18013 2026-05-19 cs.CV cs.AI 版本更新

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

TinySAM 2: 极端内存压缩用于高效的跟踪任何模型

Zhaoyuan Ding, Yijing Yang, Han Shu, Xinghao Chen

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 本文提出TinySAM 2,一种轻量级视频分割模型,通过引入内存质量管理机制和联合空间-时间令牌压缩,有效降低了内存存储和计算成本,实现了在DAVIS和SA-V等挑战性数据集上达到SAM 2.1 90%性能,仅使用7%内存令牌和3%训练数据。

Comments 12 pages, 6 figures

详情
AI中文摘要

Segment Anything Model 2 (SAM 2) 作为视频分割领域的核心基础模型,在半监督视频对象分割和跟踪任何任务中表现出色。然而,SAM 2的多阶段图像编码器和内存模块复杂的计算特性提高了模型在实际应用中的部署难度。为了解决这个问题,我们提出了TinySAM 2,一种在性能和效率之间取得平衡的轻量级视频分割模型。首先,引入了一个内存质量管理机制,用于选择并保留高信息量的历史帧作为内存。此外,提出了一种联合空间-时间令牌压缩方法,通过空间域上的平均池化压缩冗余令牌,在时间域上基于令牌级相似性测量选择信息令牌。此外,采用RepViT作为轻量级图像编码器,进一步减少模型参数。在DAVIS和SA-V等挑战性数据集上的大量实验表明,TinySAM 2在性能上达到了SAM 2.1的90%,仅使用7%的内存令牌和3%的训练数据。本研究有效缓解了SAM 2在参数数量、计算负载和部署成本方面的瓶颈,为视频分割模型在设备上的广泛应用提供了资源高效的解决方案。

英文摘要

Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the complex computational characteristics of SAM 2's multi-stage image encoder and memory module have raised the barrier to the model's deployment in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain highly informative historical frames as the memory. In addition, a joint spatial-temporal token compression scheme is proposed to reduce the memory storage and computational cost. Specifically, average pooling is first employed to compress redundant tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on a token-level similarity measurement. Besides, we take RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1 with only 7% of the memory tokens and 3% of the training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment costs associated with SAM 2, providing a resource-efficient solution for the widespread application of video segmentation models on devices.

2605.18012 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SAS: Semantic-aware Sampling for Generative Dataset Distillation

SAS: 面向生成式数据集蒸馏的语义感知采样

Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama

发表机构 * Hokkaido University(北海道大学) University of Toronto(多伦多大学) The University of Tokyo(东京大学)

AI总结 本文提出了一种语义感知的数据集蒸馏方法,通过利用CLIP作为语义先验,设计三个语义评分函数来量化类别相关性、类别间分离性和集合内多样性,从而生成紧凑且语义区分度高的数据集。

Comments Published as a journal paper in IEEE OJSP

详情
AI中文摘要

深度神经网络在广泛的任务中取得了显著的性能,但这种成功往往伴随着由于大规模训练数据带来的巨大计算和存储成本。数据集蒸馏通过构建紧凑且信息丰富的数据集,以实现高效的模型训练同时保持下游性能。然而,大多数现有方法主要强调匹配数据分布或下游训练统计,对蒸馏数据中高阶语义信息的保留有限。在本文中,我们引入了语义感知的视角进行数据集蒸馏,通过利用对比语言-图像预训练(CLIP)作为语义先验进行后采样。我们的目标是获得不仅紧凑而且语义上类别区分度高且多样化的蒸馏数据集。为此,我们设计了三个语义评分函数,以量化预训练语义空间中的类别相关性、类别间分离性和集合内多样性。基于现有蒸馏方法生成的图像池,我们进一步开发了一种两阶段策略进行有效的采样:第一阶段过滤语义区分度高的样本以形成可靠的候选集,第二阶段进行动态多样性感知选择以减少冗余并保持语义覆盖。在多个数据集、图像池和下游模型上的广泛实验显示了一致的性能提升,突显了在数据集蒸馏中整合语义信息的有效性。

英文摘要

Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.

2605.18010 2026-05-19 cs.CV cs.GR 版本更新

Functionalization via Structure Completion and Motion Rectification

通过结构补全和运动校正实现功能化

Mingrui Zhao, Sai Raj Kishore Perla, Kai Wang, Sauradip Nag, Duc Anh Nguyen, Jiayi Peng, Ruiqi Wang, Angel X. Chang, Manolis Savva, Ali Mahdavi-Amiri, Hao Zhang

发表机构 * Simon Fraser University(西蒙弗雷泽大学) ShanghaiTech University(上海科技大学)

AI总结 本文提出了一种新的任务,即对象功能化,旨在将视觉上合理但不功能的3D模型转换为功能性和物理上可操作的模型。通过将功能化问题建模为新的功能图上的图补全问题,开发了神经图功能化器(GraFu)来补全不完整的图,从而生成3D几何结构,并校正错误的人工标注和预测运动。

详情
AI中文摘要

获取和创建3D资产长期以来主要由视角或外观驱动。因此,现有的数字3D模型往往缺乏实现其预期功能所必需的结构组件,例如关节、支撑结构、内部结构或交互元素。同时,即使是人工标注的运动也经常存在误差,导致物理上不合理的行为。我们引入了对象功能化,这是一种新的任务,旨在将视觉上合理但不具备功能的3D模型转换为具备功能且物理上可操作的模型。我们将功能化建模为一种新的功能图表示上的图补全问题,其中带标签的节点代表对象部件,带标签的边编码功能和接触关系,而可移动的节点携带运动属性,使得结构上的功能缺陷表现为缺失的节点或错误的边。我们开发了神经图功能化器(GraFu)来补全表示非功能3D对象的不完整图。补全后的图随后驱动一个几何实现阶段,将预测的连接件和结构元素实例化为3D,并带来一个引人注目的附带效果:校正错误的人工标注运动和预测运动。为了支持训练和评估,我们以家具这一丰富且具有挑战性的目标类别为重点,引入了FurFun-233,一个包含233对非功能化和功能化家具模型的数据集。在PartNet-Mobility("零样本")和HSSD测试集上,我们的方法在运动预测精度上与最先进的方法相当,同时在碰撞和连通性方面显著提升了功能性。

英文摘要

Acquisition and creation of 3D assets have been largely view- or appearance-driven. As a result, existing digital 3D models often lack the requisite structural components to function as intended, such as joints, supports, interiors, or interaction elements. At the same time, even human-annotated motions are frequently error-prone, leading to physically implausible behavior. We introduce object functionalization, a novel task aimed at transforming visually plausible but non-functional 3D models into functional and physically operable ones. We formulate functionalization as a graph completion problem over a new functional graph representation, where labeled nodes represent object parts, labeled edges encode functional and contact relations, and movable nodes carry motion attributes, so that structural functional deficiencies manifest as missing nodes or incorrect edges. We develop a neural Graph Functionalizer (GraFu) to complete an incomplete graph representing a non-functional 3D object. The completed graph then drives a geometry realization stage that instantiates predicted connectors and structural elements in 3D, with the compelling side effect of rectifying erroneous human-annotated and predicted motions. To support training and evaluation, focusing on furniture as a rich and challenging target category, we introduce FurFun-233, a dataset of 233 paired non-functional and functionalized furniture models. On PartNet-Mobility ("zero-shot") and HSSD test sets, our method matches state-of-the-art methods in motion prediction accuracy while substantially improving functionality in terms of collision and connectivity.

2605.18006 2026-05-19 eess.IV cs.CV cs.MM 版本更新

Inter-LPCM: Learning-based Inter-Frame Predictive Coding for LiDAR Point Cloud Compression

Inter-LPCM: 基于学习的帧间预测编码用于激光雷达点云压缩

Chang Sun, Hui Yuan, Shiqi Jiang, Chongzhen Tian, Guanghui Zhang, Raouf Hamzaoui

发表机构 * School of Control Science and Engineering, Shandong University(控制科学与工程学院,山东大学) Key Laboratory of Machine Intelligence and System Control, Ministry of Education(教育部机器智能与系统控制重点实验室) School of Computer Science and Technology, Shandong University(计算机科学与技术学院,山东大学) School of Engineering and Sustainable Development, De Montfort University(工程与可持续发展学院,德蒙福特大学)

AI总结 本文提出Inter-LPCM,一种基于学习的帧间预测编码方法,用于改进激光雷达点云压缩中的几何冗余去除,通过引入delta编码策略、帧间半径预测模型和轻量级注意力预测模型,结合RD优化的量化方法和针对每个球坐标分量的熵编码模型,提高压缩效率和质量。

Comments 14 pages, 12 figures

详情
AI中文摘要

由于激光雷达传感器以固定角分辨率获取点云,所得数据可以在球坐标系中被系统地参数化并高效压缩。传统基于球坐标系的点云压缩方法在率失真(RD)性能方面表现出色,基于几何的点云压缩(G-PCC)标准中的预测几何编码(PredGeom)是其中的典型例子。尽管PredGeom包含帧间预测模式,但其依赖于简单的线性模型,限制了其捕捉复杂运动模式和结构依赖的能力。同时,现有基于学习的球域压缩方法并未利用帧间相关性来减少几何冗余。为了解决这些限制,我们提出了一种基于学习的帧间预测编码方法,称为Inter-LPCM。对于方位角预测,我们采用基于预定义角分辨率的delta编码策略。为了改进半径压缩,我们引入了帧间半径预测(Inter-RP)模型,该模型利用当前帧和已配准参考帧中的邻近点来估计当前点的半径。此外,我们设计了一个轻量级注意力预测(LAEP)模型,通过捕捉不同坐标间的长距离几何相关性来预测仰角。对于量化,我们提出了一种RD优化的方法来选择球坐标系中的量化步长。对于熵编码,我们为每个球坐标分量设计了不同的模型。这些模型适应于每个坐标的统计先验,从而实现更准确的概率估计。我们的源代码已公开,地址为 https://github.com/SDUChangSun/Inter-LPCM

英文摘要

Because LiDAR sensors acquire point clouds with a fixed angular resolution, the resulting data can be systematically parameterized and efficiently compressed in the spherical coordinate system. Traditional spherical coordinate-based point cloud compression methods have demonstrated strong rate-distortion (RD) performance, with the predictive geometry coding (PredGeom) method in the geometry-based point cloud compression (G-PCC) standard being a prominent example. Although PredGeom includes an inter-frame prediction mode, it relies on a simple linear model, which limits its ability to capture complex motion patterns and structural dependencies. Meanwhile, existing learning-based compression methods in the spherical domain do not exploit inter-frame correlations to reduce geometry redundancy. To address these limitations, we propose a learning-based inter-frame predictive coding method, termed Inter-LPCM. For azimuth prediction, we employ a delta coding strategy based on the predefined angular resolution. To improve radius compression, we introduce an inter-frame radius predictive (Inter-RP) model that estimates the current point's radius using neighboring points from both the current frame and the registered reference frame. In addition, we design a lightweight attention-based prediction (LAEP) model to predict elevation angles by capturing long-range geometric correlations across different coordinates. For quantization, we propose an RD-optimized method to select quantization steps in the spherical coordinate system. For entropy coding, we design distinct models for each spherical coordinate component. These models are adapted to the statistical priors of each coordinate, enabling more accurate probability estimation. Our source code is publicly available at https://github.com/SDUChangSun/Inter-LPCM
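
The delta coding idea for azimuth can be sketched as follows. This is a minimal illustration under stated assumptions, not the Inter-LPCM codec: it assumes every azimuth lies close to a multiple of a known angular resolution, snaps it to that grid, and stores only the integer step between consecutive points. The function names and the sample values are hypothetical.

```python
def encode_azimuth(azimuths_deg, resolution_deg):
    """Quantize azimuths to the sensor grid and keep only index differences."""
    indices = [round(a / resolution_deg) for a in azimuths_deg]
    deltas = [indices[0]]
    for previous, current in zip(indices, indices[1:]):
        deltas.append(current - previous)
    return deltas

def decode_azimuth(deltas, resolution_deg):
    """Rebuild grid-aligned azimuths by accumulating the deltas."""
    decoded = []
    running_index = 0
    for delta in deltas:
        running_index += delta
        decoded.append(running_index * resolution_deg)
    return decoded

# Hypothetical scan with 0.2 degree resolution; one grid position is skipped.
azimuths = [0.0, 0.2, 0.4, 0.8, 1.0]
deltas = encode_azimuth(azimuths, 0.2)
restored = decode_azimuth(deltas, 0.2)
print(deltas)
```

Because most steps equal one grid position, the delta stream is dominated by a single symbol, which is what makes it cheap for an entropy coder. Any deviation from the grid is lost in this sketch; a real codec would handle it separately.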

2605.17997 2026-05-19 cs.LG cs.AI cs.CV 版本更新

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

MARR: 模块自适应残差重建用于低比特后训练量化

Le Su, Xing Luo, Zhi Jin

发表机构 * Peng Cheng Laboratory(鹏城实验室)

AI总结 本文提出MARR,一种模块自适应残差重建方法,通过为每个模块分配特定的缩放系数,平衡残差相关的HA偏差和累积误差校正,从而在低比特量化中提升性能。

详情
AI中文摘要

近年来,基于残差重建的模型量化方法在低比特后训练量化(PTQ)中取得了有希望的性能,通过引入跨层残差来减少来自先前层的误差积累。然而,这些残差也可能引入额外的偏差,这种偏差源于基于重建的PTQ所依赖的Hessian近似(HA)假设,导致量化性能不理想。在本文中,我们分析发现,将残差项乘以一个缩放系数,可以提供一种直接的方法来缓解与残差强度相关的HA偏差,同时保持累积误差校正。更重要的是,我们观察到这种权衡依赖于具体模块,因此单一的全局残差强度不足以在不同模块之间平衡有效的校正和残差相关的偏差。基于这些观察,我们提出了模块自适应残差重建(MARR),为每个模块分配模块特定的缩放系数,以自适应地平衡累积误差校正和残差相关的HA偏差。为了避免昂贵的逐模块系数搜索并获得稳定的系数估计,我们设计了一种基于比例-积分-微分(PID)的自适应更新策略,利用重建误差作为反馈,逐步细化此系数。在多个典型的大语言模型(LLMs)和视觉变换器(ViTs)上的实验表明了MARR在低比特量化(小于等于4位)下的有效性:相对于基于残差重建的最先进方法,在LLMs上实现了高达20.2%的性能提升,在ViTs上实现了高达4.6%的相对提升。代码将在论文被接收后公开发布。

英文摘要

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, we analyze that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over the residual reconstruction state-of-the-art methods. Code will be made publicly available upon acceptance.
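
To make the PID-based update concrete, here is a generic sketch of a PID controller adjusting one module's scaling coefficient. It is not MARR's actual update rule: the gains, the clamping range, the sign convention (positive feedback lowers the coefficient), and the class name `PIDCoefficient` are assumptions, and `feedback` stands for whatever reconstruction-error signal is fed back.

```python
class PIDCoefficient:
    """Hypothetical PID controller refining one module's residual scaling coefficient."""

    def __init__(self, initial=1.0, kp=0.1, ki=0.01, kd=0.05, low=0.0, high=1.0):
        self.value = initial
        self.kp, self.ki, self.kd = kp, ki, kd
        self.low, self.high = low, high
        self.integral = 0.0      # running sum of feedback (integral term)
        self.previous = None     # last feedback value (for the derivative term)

    def update(self, feedback):
        """Apply one PID step; feedback is an error signal to be driven toward zero."""
        self.integral += feedback
        derivative = 0.0 if self.previous is None else feedback - self.previous
        self.previous = feedback
        adjustment = (self.kp * feedback
                      + self.ki * self.integral
                      + self.kd * derivative)
        # Assumed convention: larger error means the residual is scaled down.
        self.value = min(self.high, max(self.low, self.value - adjustment))
        return self.value

# Made-up feedback sequence that shrinks as the coefficient settles.
controller = PIDCoefficient()
history = [controller.update(f) for f in [0.5, 0.3, 0.1]]
print(history)
```

The integral term keeps nudging the coefficient while a persistent error remains, and the derivative term damps the step when the error is already falling, which is the usual reason for preferring PID over a plain proportional update.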

2605.17990 2026-05-19 cs.CV cs.HC 版本更新

Low Latency Gaze Tracking via Latent Optical Sensing

通过潜在光学感知实现低延迟的注视跟踪

Yidan Zheng, Matheus Souza, Kaizhang Kang, Qiang Fu, Hadi Amata, Wolfgang Heidrich

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 本文提出了一种实时注视跟踪系统,通过全被动光学编码器直接获取任务相关的潜在特征,利用微透镜阵列和共同设计的二值铬掩膜进行空间复用光学编码,产生足以估计注视方向的紧凑测量集,从而减少计算开销并降低延迟。

详情
AI中文摘要

我们提出了一种实时注视跟踪系统,该系统通过全被动光学编码器直接获取任务相关的潜在特征。与处理全分辨率图像不同,我们的方法利用微透镜阵列和共设计的二进制铬掩膜进行空间复用光学编码,产生一组紧凑的测量,足以用于注视估计。通过在光学域内整合传感和特征提取,所提出的系统消除了对高带宽图像读取的需要,并显著减少了计算开销。编码的测量通过4x4光电晶体管阵列捕获,并通过轻量级神经网络映射到注视方向。我们的概念验证原型实现了端到端的感知到推理延迟为3.4 ms,优于已发表的研究系统。我们在模拟和真实世界数据上展示了本方法的有效性,实现了与传统基于摄像头的管道相比具有竞争力的注视估计精度,同时显著提高了延迟和能效。本文工作展示了任务驱动的光学感知在超低延迟、计算高效的人机交互系统中的潜力。

英文摘要

We present a real-time gaze tracking system that directly acquires task-relevant latent features using a fully passive optical encoder. Instead of forming and processing full-resolution images, our approach leverages a microlens array with a co-designed binary chromium mask to perform spatially multiplexed optical encoding, producing a compact set of measurements sufficient for gaze estimation. By integrating sensing and feature extraction in the optical domain, the proposed system eliminates the need for high-bandwidth image readout and substantially reduces computational overhead. The encoded measurements are captured by a 4 x 4 phototransistor array and mapped to gaze direction using a lightweight neural network. Our proof-of-concept prototype enables an end-to-end sensing-to-inference latency of 3.4 ms, outperforming published research systems. We demonstrate the effectiveness of our approach on both simulated and real-world data, achieving competitive gaze estimation accuracy while significantly improving latency and energy efficiency compared to conventional camera-based pipelines. This work highlights the potential of task-driven optical sensing for ultra-low-latency, computationally efficient human-computer interaction systems.

2605.17984 2026-05-19 eess.IV cs.CV cs.RO 版本更新

See Silhouettes in Motion with Neuromorphic Vision

用神经形态视觉感知运动中的轮廓

Pei Zhang, Shijie Lin, Zhou Ge, Jinpeng Chen, Wei Pu

发表机构 * School of Electrical Engineering, Guangxi University(广西大学电气工程学院) Department of Computer Science, The University of Hong Kong(香港大学计算机科学系) School of Mechatronic Engineering and Automation, Shanghai University(上海大学机电工程与自动化学院) SHU General Intelligent Robotics Research Institute(SHU通用智能机器人研究院) School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications(北京邮电大学计算机科学学院(国家试点软件学院)) School of Information and Communication Engineering, University of Electronic Science and Technology of China(电子科技大学信息与通信工程学院)

AI总结 本文提出了一种双模态方法,利用帧和事件的协同作用,在仅CPU的设备上实现实时高帧率二值化,有效减少运动模糊并提升在恶劣光照下的性能,为资源受限边缘平台的轻量感知和交互铺平道路。

Comments 12 pages, 12 figures, and 3 tables. This work is under review. Project page: https://github.com/pz-even/event_binarization

详情
AI中文摘要

准双峰(quasi-bimodal)物体,如文本、道路标志和条形码,在日常视觉交流中发挥基本而关键的作用。通过将其简化为清晰的轮廓,二值化使用最简语言传达必要的视觉线索,以实现最大下游效率。然而,基于帧的成像在无人机、自动驾驶汽车和水下航行器等移动平台上往往面临困难。在这些动态场景中,快速运动和恶劣光照会使成像失效,导致严重的运动模糊和关键细节的丢失。为克服这些限制,基于事件相机的神经形态视觉具有微秒级时间分辨率和高动态范围,成为自然的解决方案。在此事件驱动的感知范式基础上,我们提出了一种简单而有效的双模态方法,利用帧和事件之间的协同作用,在仅CPU的设备上实现实时、高帧率的二值化。广泛的评估表明,该方法在减少运动模糊方面与领先技术具有竞争力,并在挑战性光照条件下提供显著改进。此外,我们的异步工作流程绕过了会使传统时间分组重建失效的事件稀缺问题,即使在极高的千赫兹帧率下也能保持清晰的目标形状。其二值化结果进一步作为可靠的表示,促进了各种下游任务。本文为在资源受限边缘平台上的具身智能轻量感知和交互铺平了道路。

英文摘要

Quasi-bimodal objects, such as text, road signs, and barcodes, play a basic yet vital role in daily visual communication. By boiling these down to clear silhouettes, binarization uses a minimal language to convey essential vision cues for maximum downstream efficiency. The catch is that frame-based imaging often struggles on mobile platforms like drones, self-driving cars, and underwater vehicles. In these dynamic scenes, rapid motion and harsh lighting can make it blind, causing severe motion blur and erasing crucial details. To overcome these limits, neuromorphic vision via event cameras, featuring microsecond-level temporal resolution and high dynamic range, steps in as a natural solution. Building upon this event-driven sensing paradigm, we introduce a simple yet effective dual-modal approach that harnesses the synergy between frames and events to achieve real-time, high-frame-rate binarization on CPU-only devices. Extensive evaluations show that it achieves competitive performance against leading techniques in reducing motion blur, while delivering impressive improvements under challenging illumination. In addition, our asynchronous workflow bypasses the event scarcity that breaks traditional time-binning reconstruction, maintaining clear target shapes even at extreme kilohertz frame rates. Its binary results further serve as reliable representations that facilitate a range of downstream tasks. This work paves the way towards lightweight perception and interaction in embodied intelligence on resource-constrained edge platforms.

2605.17980 2026-05-19 cs.CV 版本更新

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

学习平衡:用于基于参考的遥感图像超分辨率的解耦孪生扩散变换器

Bin Luo, Runmin Dong, Zhaoyang Luo, Jinxiao Zhang, Jiyao Zhao, Fan Wei, Haohuan Fu

发表机构 * Tsinghua Shenzhen International Graduate School, Shenzhen, China(清华大学深圳国际研究生院) Sun Yat-sen University, Zhuhai, China(中山大学) National Supercomputing Center in Shenzhen, Shenzhen, China(深圳国家超算中心) Tsinghua University, Beijing, China(清华大学)

AI总结 本文提出DS-DiT解耦孪生扩散变换器,通过在注意力层面解耦低分辨率和参考信息的交互,解决基于参考的超分辨率中对参考信息过度依赖或利用不足的问题,提升生成质量。

详情
AI中文摘要

基于扩散的方法在大缩放因子下的遥感图像超分辨率中展现出显著潜力,特别是在基于参考的超分辨率(RefSR)中,高分辨率参考图像提供关键的细粒度纹理先验。然而,现有方法往往面临一种权衡:过度依赖参考信息会导致纹理伪影,而利用不足则导致细节恢复不足。为了解决这些问题,我们提出了DS-DiT,一种解耦孪生扩散变换器方法,该方法在注意力层面解耦低分辨率和参考信息的交互。通过使低分辨率结构先验和参考纹理信息能够各自独立地与含噪潜变量交互,该框架有效缓解了不同来源之间的竞争。此外,为了补偿全局注意力有限的局部建模能力,我们引入了Patch-Level Weights(PLW)模块,该模块可自适应地调节条件源的融合。此外,这种孪生架构在推理过程中支持自引导(autoguidance)策略,通过利用强参考和弱参考条件之间的预测差异来增强重建。这种方法在不额外训练的情况下提升了生成质量。在多个数据集和缩放因子上的实验结果表明,DS-DiT在定量指标和视觉保真度上均优于现有方法。

英文摘要

Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, existing methods often suffer from a trade-off between over-reliance on reference information, which leads to texture artifacts, and underutilization, which results in insufficient detail recovery. To address these issues, we propose DS-DiT, a Decoupled Siamese Diffusion Transformer method that decouples low-resolution and reference interactions at the attention level. By enabling low-resolution structural priors and reference texture information to interact independently with the noisy latent, the framework effectively mitigates inter-source competition. Furthermore, to compensate for the limited local modeling ability of global attention, we introduce a Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources. In addition, this siamese architecture facilitates an autoguidance strategy during inference, which enhances reconstruction by exploiting the prediction discrepancy between strong and weak reference conditions. This approach boosts generation quality without additional training. Experimental results across multiple datasets and scaling factors demonstrate that DS-DiT outperforms existing methods in both quantitative metrics and visual fidelity.

2605.17969 2026-05-19 cs.CV 版本更新

Generation Navigator: A State-Aware Agentic Framework for Image Generation

生成导航器:一种状态感知的图像生成代理框架

Jinming Liu, Ruoyu Feng, Yuqi Wang, Wenjun Zeng, Xin Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology(东方理工大学) Independent(独立)

AI总结 本文提出了一种基于状态的图像生成代理框架Generation Navigator,通过将图像生成问题重新表述为状态条件下的动作生成问题,解决了传统方法中在强化学习训练中因信用分配问题导致的不足,通过PRE-GRPO算法提升了生成质量与推理准确性。

详情
AI中文摘要

尽管文本到图像生成技术取得了快速进展,但忠实实现用户意图仍然具有挑战性,通常需要手动多轮尝试和错误。为了自动化此过程,现有系统依赖于简单的提示重写或由手工规则驱动的闭环代理,而不是学习适应不断变化的生成过程。在本文中,我们将图像生成重新表述为一个状态条件下的动作生成问题,并提出Generation Navigator,一个多轮T2I代理,能够学习动态引导生成轨迹并输出下一步动作。然而,通过强化学习训练此代理会引入关键的信用分配挑战:仅根据单一状态奖励轨迹会将所有动作视为同等信用,忽略了各轮次质量动态变化,并无法区分那些提升轨迹的动作与那些降质或浪费轮次而无进展的动作。我们通过PRE-GRPO(峰值保留-效率组相对策略优化)算法解决这一问题,这是一种轨迹级强化学习目标,明确奖励发现高质量图像(峰值)、避免后续轮次质量下降(保留)以及最小化不必要的轮次(效率)。实验表明,在多个基准测试中取得了显著提升,达到了0.90的WISE分数和79.06%的T2I-ReasonBench推理准确率。

英文摘要

Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.

2605.17954 2026-05-19 cs.CV cs.AI cs.LG 版本更新

A More Word-like Image Tokenization for MLLMs

一种面向多模态大语言模型的更像单词的图像标记化方法

Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee

发表机构 * Seoul National University(首尔国立大学) Ewha Womans University(梨花女子大学)

AI总结 本文提出了一种解耦视觉标记化方法(DiVT),通过将图像块嵌入聚类为语义单元,使每个标记对应于独特的视觉概念,从而提升多模态模型的性能和效率。

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

详情
AI中文摘要

现代多模态大语言模型(MLLMs)通常保持语言模型不变,并训练一个视觉投影器,将像素映射到其嵌入空间中的标记序列,使图像能以与文本相同的形式呈现。然而,语言模型已优化以操作离散且具有语义意义的标记,而现有视觉投影器将图像转换为长流的连续且高度相关的嵌入。这导致视觉标记的行为不同于LLM最初训练以理解的单词状单元。我们提出了一种新的解耦视觉标记化(DiVT),将图像块嵌入聚类为连贯的语义单元,使得每个标记对应于一个独特的视觉概念,而不是一个刚性的网格单元。DiVT进一步根据图像复杂度调整其标记预算,提供显式的精度-计算权衡,既不修改视觉编码器也不修改语言模型。在多样化的多模态基准测试中,DiVT在显著较少的视觉标记下匹配或超越基线,展示了在有限标记预算下的鲁棒性,显著降低了内存成本和延迟,同时使视觉输入更兼容于LLM。我们的代码可在https://github.com/snuviplab/DiVT上获得。

英文摘要

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off while modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets and significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.

2605.17952 2026-05-19 cs.CV 版本更新

Counting Machine Parts

机器零件计数

Benedict Florance Arockiaraj, Elizabeth Dinella, Ankit Billa, Ajay Anand

AI总结 本文研究了机器零件计数问题,提出了一种基于FamNet的改进方法,通过引入额外损失项进行训练,并在给定数据集上评估了传统图像处理流程、实例分割和密度图估计等基线方法的性能,最终实现了1.96的MAE指标。

详情
AI中文摘要

图像中物体计数任务在许多领域都有应用。例如,人群计数、库存计数和细胞计数已成为近期研究的焦点。估计物体数量的主要挑战包括重叠物体、物体尺度问题、遮挡和光照条件变化。在本报告中,我们探索了机器洗涤零件计数问题。我们的技术是FamNet的扩展,加入了额外的损失项,并在给定数据集上进行训练。我们与三种基线方法进行了比较:传统图像处理流程、实例分割和密度图估计。我们通过计算真实物体数量与模型输出之间的平均绝对误差(MAE)和均方根误差(RMSE)来评估这些算法的性能。我们的方法实现了1.96的MAE。

英文摘要

Counting objects in an image is a task applicable across many domains. For instance, crowd counting, inventory counting, and cell counting have been the focus of recent research. The major challenges in estimating the count of objects include overlapping objects, object scale issues, occlusions, and varying lighting conditions. In this report, we explore the problem of counting machine washer parts. Our technique is an extension of FamNet with an additional loss component, trained on the given dataset. We compare to three baseline methods: a traditional image processing pipeline, instance segmentation, and density map estimation. We evaluate the performance of these algorithms by computing the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) between the true object counts and the model outputs. Our approach achieves a performance of 1.96 MAE.
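
For reference, the two evaluation metrics named above can be computed as in the sketch below. The counts are made-up illustrative values, not data from the report, and the function names are placeholders.

```python
import math

def mean_absolute_error(true_counts, predicted_counts):
    """MAE: average absolute difference between true and predicted counts."""
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)

def root_mean_squared_error(true_counts, predicted_counts):
    """RMSE: square root of the average squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(true_counts, predicted_counts)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical per-image counts for four test images.
true_counts = [10, 12, 7, 20]
predicted_counts = [9, 14, 7, 17]

mae_value = mean_absolute_error(true_counts, predicted_counts)
rmse_value = root_mean_squared_error(true_counts, predicted_counts)
print(mae_value, rmse_value)
```

RMSE weights large miscounts more heavily than MAE, so reporting both indicates whether errors are spread evenly or concentrated in a few images.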

2605.17949 2026-05-19 cs.CV 版本更新

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

SkyNative: 一种面向遥感视觉证据推理的原生多模态框架

Xiao Yang, Ronghao Fu, Zhiwen Lin, Zhuoran Duan, Jiashun Zhu, Jiasen Hu, Lang Sun, Weipeng Zhang, Jiaqi Liu, Xu Na, Haoran Liu, Weijie Zhang, Bo Yang

发表机构 * College of Computer Science and Technology, Jilin University, China(吉林大学计算机科学与技术学院) Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education(教育部符号计算与知识工程重点实验室)

AI总结 本文提出SkyNative,一种原生多模态框架,通过去除预训练视觉骨干,直接在语言模型token空间中表示图像为原始patch tokens,以提升遥感图像的细粒度空间推理能力。

详情
AI中文摘要

遥感视觉-语言模型通常依赖预训练的视觉编码器将图像转换为语义特征后再进行语言模型推理。尽管在场景级理解上有效,这种流程可能过早压缩局部视觉证据,使细粒度空间推理容易受到语言先验的影响,尤其是在超高分辨率遥感图像中。我们提出了SkyNative,一种面向遥感的原生多模态框架,采用无编码器架构,去除预训练视觉骨干,直接在语言模型token空间中表示图像为原始patch tokens。为协调低级视觉patches与文本tokens,SkyNative引入了模态感知的解耦机制,该机制在统一的自回归骨干中使用模态特定的参数。我们进一步引入了一个视觉依赖基准,通过逐步视觉退化和误导性文本提示来诊断模型是否基于图像证据得出答案。在标准遥感理解任务和大格式空间推理评估中,SkyNative展示了更强的图像基础感知能力和改进的抗提示诱导语言先验能力。这些结果表明,原生patch级多模态建模是可靠遥感视觉-语言推理的有前景方向。

英文摘要

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.

2605.17933 2026-05-19 cs.CV 版本更新

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

AtlasVA: 无教师视觉技能记忆用于无需教师的VLM代理

Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang, Zhihao Wen

发表机构 * Ant Group(蚂蚁集团) University of Science and Technology of China(中国科学技术大学) Westlake University(西湖大学) University of Michigan - Ann Arbor(密歇根大学-安娜堡分校) Sun Yat-sen University(中山大学)

AI总结 本研究提出AtlasVA,一种无需教师的视觉技能记忆框架,通过空间热图、视觉示例和符号文本技能三层结构,统一感知、记忆和优化,实现在无需外部LLM监督下的强化学习性能提升。

详情
AI中文摘要

视觉语言模型(VLM)代理越来越多地依赖记忆增强的强化学习来在长时程任务中重用经验,但大多数现有框架将记忆存储为文本,并依赖专有教师模型来总结或细化记忆。这种设计与空间决策不匹配:几何先验被压缩成有损的语言,稀疏交互通常通过延迟的文本反馈来监督,而不是通过密集的、基于视觉的信号。我们主张VLM代理的可重用经验应保持视觉基础。基于这一见解,我们提出了AtlasVA,一种无需教师的视觉技能记忆框架,将记忆组织为三个互补的层次:空间热图、视觉示例和符号文本技能。AtlasVA进一步通过轨迹统计和轻量级网格启发式方法直接演化危险图谱和亲和图谱,并将这些自演化图谱作为基于势函数的塑形奖励用于强化学习。这种设计统一了感知、记忆和优化,无需外部LLM监督。在Sokoban、FrozenLake、3D具身导航和3D机器人操作基准测试中,实验表明AtlasVA始终优于以文本为中心的记忆基线和有竞争力的VLM代理,尤其在空间密集型任务上收益显著。主页:https://wangpan-ustc.github.io/AtlasvaWeb

英文摘要

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
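
Potential-based shaping adds gamma * Phi(s') - Phi(s) to the environment reward, where Phi is a potential over states. The sketch below applies this standard formula to grid atlases. How AtlasVA actually combines its atlases is not stated in the abstract, so taking the potential as affinity minus danger is an assumption, and the grid values are made up.

```python
# Hypothetical 2x3 atlases over grid cells; higher affinity is better, higher danger is worse.
AFFINITY = [[0.0, 0.2, 1.0],
            [0.0, 0.1, 0.5]]
DANGER = [[0.0, 0.0, 0.0],
          [0.9, 0.0, 0.0]]

def potential(cell):
    """Assumed potential: affinity minus danger at a (row, col) cell."""
    row, col = cell
    return AFFINITY[row][col] - DANGER[row][col]

def shaped_reward(env_reward, state, next_state, gamma=0.99):
    """Environment reward plus the potential-based shaping term."""
    return env_reward + gamma * potential(next_state) - potential(state)

# With zero environment reward, shaping alone separates good and bad moves.
toward_goal = shaped_reward(0.0, (0, 1), (0, 2))
into_danger = shaped_reward(0.0, (0, 0), (1, 0))
print(toward_goal, into_danger)
```

Shaping terms of this form are known to leave the optimal policy unchanged, which makes them a safe way to densify a sparse reward.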

2605.17918 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Domain Transfer Becomes Identifiable via a Single Alignment

通过单个对齐使领域转移变得可识别

Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu

发表机构 * School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA(电气工程与计算机科学系,俄勒冈州立大学,科瓦利斯,俄勒冈,美国)

AI总结 本文提出了一种新的方法,通过结构稀疏性条件和单个配对锚样本实现领域转移的可识别性,减少了对监督信号的依赖,并提出了高效的雅可比稀疏性正则化器以支持高维学习。

详情
AI中文摘要

领域转移(DT)将源分布映射到目标分布,并支持无监督的图像到图像翻译、单细胞分析和跨平台医学影像等任务。然而,DT本质上是不适定的:前推映射通常不可识别,因为保持测度的自同构(MPAs)在保持边缘分布的同时改变跨领域对应关系,导致内容错位的翻译。最近的工作表明,通过联合转移多个对应的源/目标条件分布可以消除MPAs,但标记这些条件分布的监督信号在实践中并不总是可用。我们开发了一条通往DT可识别性的替代路线。在雅可比支撑模式的结构稀疏性条件下,我们证明了分布匹配加上单个配对锚样本足以识别真实转移,所需的监督远少于先前方法。为了支持实际的高维学习,我们进一步提出了一种基于随机掩码有限差分的高效雅可比稀疏性正则化器,得到一个无需显式计算雅可比的可扩展替代项。在合成和真实世界DT任务上的实验结果验证了该理论。

英文摘要

Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer -- requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.

2605.17915 2026-05-19 cs.CV 版本更新

SurgLQA: Scalable Long-Horizon Surgical Video Question Answering

SurgLQA: 可扩展的长时程外科视频问答

Diandian Guo, Xikai Yang, Ruiyang Li, Jialun Pei, Pheng-Ann Heng

发表机构 * The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出SurgLQA框架,通过融合时间一致性巩固和时间接地多策略扩展方法,解决长时程外科视频问答中的长程动态建模问题,提升手术流程中的推理能力。

Comments MICCAI 2026 Early Accept

详情
AI中文摘要

外科视频问答(VideoQA)提供了一个有前景的动态术中解释范式,能够为临床环境中的实时决策支持和上下文感知检索提供支持。然而,现有方法主要局限于图像或短片段,限制了其对长程手术流程中因果依赖关系的建模能力。为解决这一挑战,我们提出了SurgLQA,一个统一的长时程VideoQA框架,用于可扩展的外科推理。该框架集成了忠实时间一致性巩固(FTC),利用内在时间线索构建紧凑的长程表示,同时保持细粒度的时间保真度。进一步,我们开发了时间接地多策略扩展(TMS),一种适应性测试时间推理范式,能够在时间接地上下文中战略性地调整策略层面的推理能力。为了促进系统评估,我们重构了一个长时程结肠镜VideoQA基准,Colon-LQA,并在Colon-LQA和REAL-Colon-VQA上进行了广泛的实验。实验结果表明,我们的方法在长程推理中通过时间接地推理实现了持续的性能提升。代码链接:https://github.com/RascalGdd/SurgLQA。

英文摘要

Surgical Video Question Answering (VideoQA) provides a promising paradigm for dynamic intraoperative interpretation, enabling real-time decision support and context-aware retrieval in clinical environments. Nevertheless, existing approaches are predominantly restricted to images or short clips, limiting their ability to model long-range procedural dynamics and causal dependencies across extended surgical workflows. To address this challenge, we propose SurgLQA, a unified long-horizon VideoQA framework for scalable surgical reasoning. This framework incorporates Faithful Temporal Consolidation (FTC), which leverages intrinsic temporal cues to construct compact long-range representations while preserving fine-grained temporal fidelity. Further, we develop Temporally-Grounded Multi-Policy Scaling (TMS), an adaptive test-time inference paradigm that strategically adjusts policy-level reasoning capacity within temporally grounded contexts. To facilitate systematic evaluation, we restructured a long-duration colonoscopy VideoQA benchmark, Colon-LQA, and conducted extensive experiments on Colon-LQA and REAL-Colon-VQA. Experimental results demonstrate that our approach achieves consistent performance gains in long-range reasoning with temporally grounded inference. Code link: https://github.com/RascalGdd/SurgLQA.

2605.17912 2026-05-19 cs.RO cs.CV 版本更新

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

WorldArena 2.0: 扩展模态、功能和平台的具身世界模型基准测试

Yu Shang, Yinzhou Tang, Yiding Ma, Zhuohang Li, Lei Jin, Weikang Su, Xin Jin, Zhaolu Wang, Ziyou Wang, Xin Zhang, Haisheng Su, Weizhen He, Wei Wu, Haoyi Duan, Gordon Wetzstein, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Chen Gao, Yong Li

发表机构 * Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) Stanford University(斯坦福大学) The University of Hong Kong(香港大学) Princeton University(普林斯顿大学) Chinese Academy of Sciences(中国科学院) University of Science and Technology of China(中国科学技术大学) Peking University(北京大学) National University of Singapore(新加坡国立大学)

AI总结 本文提出WorldArena 2.0,扩展了具身世界模型的评估,涵盖模态、功能和平台三个维度,提供全面的测试平台以评估具身世界模型的进展。

详情
英文摘要

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned future and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 comprehensively evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

2605.17907 2026-05-19 cs.CV cs.AI 版本更新

One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

一个模型翻译它们所有:面向异构协作感知的通用任意到任意翻译

Yang Li, Weize Li, Quan Yuan, Congzhang Shao, Guiyang Luo, Yunqi Ba, Xuanhan Zhu, Xinyuan Ding, Xiaoyuan Fu, Jinglin Li

发表机构 * State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室)

AI总结 本文提出UniTrans,一种通用任意到任意特征模态翻译模型,通过预训练一组翻译专家参数并学习其组合系数来实现零样本翻译,从而在OPV2V-H和DAIR-V2X数据集上实现了优于现有方法的性能。

Comments 19 pages, accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

通过共享中间特征,协作感知扩展了每个代理的感知能力,但现实世界中的特征模态异质性仍然是有效融合的关键障碍。大多数现有方法,包括直接适应和协议基于的转换,通常依赖于为新出现的特征模态训练适配器,往往需要额外的重新训练或微调。这种重复训练成本高,并且由于模型和数据隐私限制,在跨制造商之间不可行,限制了现实世界的可扩展性。为了解决这个问题,我们提出了UniTrans,一种通用的任意到任意特征模态翻译模型,该模型可以即时实例化任意模态的翻译器。UniTrans预训练了一组翻译专家参数,并学习其组合系数作为源到目标模态映射的函数。映射是在模态内在的潜在空间中进行测量,其中内在编码器从单帧中间特征中提取模态特定但场景不变的代码,使UniTrans能够以零样本的方式实例化翻译器。在OPV2V-H和DAIR-V2X上的实验表明,UniTrans在模拟和现实世界中均优于现有方法,通过通用模型实现了高效的任意到任意翻译。代码可在https://github.com/CheeryLeeyy/UniTrans上获得。

英文摘要

By sharing intermediate features, collaborative perception extends each agent's sensing beyond standalone limits, but real-world feature modality heterogeneity remains a key barrier to effective fusion. Most existing methods, including direct adaption and protocol-based transformation, typically rely on training adapters for newly emerging feature modalities and often require additional retraining or fine-tuning. Such repeated training is costly and is often infeasible across manufacturers due to model and data privacy constraints, limiting real-world scalability. To address this issue, we propose UniTrans, a universal any-to-any feature modality translation model that instantiates translators on the fly for arbitrary modalities. UniTrans pretrains a bank of translator expert parameters and learns their combination coefficients as a function of source-to-target modality mapping. The mapping is measured in a modality-intrinsic latent space, where an intrinsic encoder extracts modality-specific yet scene-invariant codes from single-frame intermediate features, enabling UniTrans to instantiate translators in a zero-shot manner. Experiments on OPV2V-H and DAIR-V2X demonstrate that UniTrans consistently outperforms state-of-the-art methods in both simulated and real-world settings, enabling efficient any-to-any translation through a universal model. The code is available at https://github.com/CheeryLeeyy/UniTrans.

2605.17904 2026-05-19 cs.CV 版本更新

Beyond Euclidean Prototypes: Spectral Disentanglement and Geodesic Matching for Few-Shot Medical Image Segmentation

超越欧几里得原型:基于谱分解和测地匹配的少样本医学图像分割

Penghao Jia, Zhiyong Huang, Mingyang Hou, Zhi Yu, Shuai Miao, Jiahong Wang, Yan Yan

发表机构 * School of Microelectronics and Communication Engineering, Chongqing University(重庆大学微电子与通信工程学院) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)

AI总结 本文提出Spectral-Geodesic Prototype Network (SGP-Net),通过谱原型银行和测地匹配器解决少样本医学图像分割中的原型纠缠和拓扑盲匹配问题,实现对形状、纹理和边界线索的解耦编码。

详情
AI中文摘要

少样本医学图像分割(FSMIS)旨在从一个或几个标注的支持图像中勾勒出新的解剖目标,以应对医学影像中的标注稀缺问题。尽管近期取得了进展,但基于原型的方法仍然受到两个相互耦合的限制:1)线索纠缠,即单个空间域原型被迫同时概括器官轮廓、实质纹理和边界外观,因此支持图像与查询图像在任一线索上的不匹配都会无差别地传播到其他线索;2)拓扑盲匹配,即余弦相似度在环境欧几里得空间中测量距离,忽略了底层特征流形的连通性,导致低对比度器官内部出现碎片化激活,并泄漏到邻近组织。为此,我们提出了Spectral-Geodesic Prototype Network (SGP-Net),其围绕一个由两个耦合组件组成的Spectral-Geodesic Prototype Module构建。Spectral Prototype Bank (SPB)通过可学习的径向傅里叶滤波器将支持和查询特征分解为低、中、高频带,从而为每个类别生成三个解耦的原型,分别编码形状、纹理和边界线索。Geodesic Matcher (GM)则用测地距离的可微热扩散近似替代余弦相似度,沿特征亲和图传播匹配信号,使流形上的像素积累一致的响应,而流形外的相似区域则被抑制。在三个公开FSMIS基准上的实验表明,SGP-Net取得了与近期最先进方法相当的有竞争力的性能。

英文摘要

Few-Shot Medical Image Segmentation (FSMIS) aims to delineate novel anatomical targets from one or a few annotated support images, addressing the annotation scarcity in medical imaging. Notwithstanding recent advancements, current prototype-based methods are bottlenecked by two coupled limitations: 1) cue entanglement, where a single spatial-domain prototype is forced to summarise organ silhouette, parenchymal texture and boundary appearance simultaneously, so any support-query mismatch on one cue propagates indiscriminately to the others; and 2) topology-blind matching, where cosine similarity measures distance in the ambient Euclidean space and ignores the connectivity of the underlying feature manifold, causing fragmented activations inside low-contrast organs and leakage into neighbouring tissues. To this end, we propose Spectral-Geodesic Prototype Network (SGP-Net), built around a Spectral-Geodesic Prototype Module with two coupled components. A Spectral Prototype Bank (SPB) decomposes support and query features into low-, mid- and high-frequency bands via learnable radial Fourier filters, yielding three disentangled prototypes per class that separately encode shape, texture and boundary cues. A Geodesic Matcher (GM) then replaces cosine similarity with a differentiable heat-diffusion approximation of geodesic distance, propagating matching signals along a feature affinity graph so that on-manifold pixels accumulate consistent responses while off-manifold look-alikes are suppressed. Experiments on three public FSMIS benchmarks demonstrate that SGP-Net achieves competitive performance against recent state-of-the-art methods.

2605.17875 2026-05-19 cs.CV 版本更新

HexagonalWarriorMamba: Superior Threshold-Dependent Multi-label Classification of 12-Lead ECG Cardiac Abnormalities

HexagonalWarriorMamba: 12导联ECG心脏异常的阈值依赖多标签分类的更优方法

Huawei Jiang, Husna Mutahira, Shibo Wei, Jiahang Li, Vladimir Shin, Juneho Yi, Dongryeol Ryu, Wonyoung Park, Mannan Saeed Muhammad

发表机构 * Sungkyunkwan University, Department of Computer Science and Engineering(成均馆大学计算机科学与工程系) Sogang University, Department of Computer Science and Engineering(西江大学计算机科学与工程系) Gwangju Institute of Science and Technology, Department of Biomedical Science and Engineering(光州科学技术院生物医学科学与工程系) Tianjin Normal University, School of Artificial Intelligence(天津师范大学人工智能学院) Financial University under the Government of the Russian Federation, Department of Artificial Intelligence(俄罗斯联邦政府财政金融大学人工智能系) Sungkyunkwan University, Department of Electrical and Computer Engineering(成均馆大学电气与计算机工程系) Queen Mary University of London, School of Electronic Engineering and Computer Science(伦敦玛丽女王大学电子工程与计算机科学学院)

AI总结 本文提出HexagonalWarriorMamba框架,通过将12导联ECG视为单通道2D图像而非传统1D时间序列,改进了传统深度学习模型在处理ECG信号长程依赖关系方面的不足,实现了对心脏异常的更优多标签分类。

Comments Submitted to Scientific Reports

详情
AI中文摘要

从12导联心电图(ECG)中准确自动诊断心脏异常对于管理心血管疾病至关重要。然而,传统深度学习模型在检测并发状况方面仍面临挑战,因为它们通常难以建模ECG信号固有的长程依赖性。本文提出HexagonalWarriorMamba(HWMamba),一种基于Mamba架构的框架,将12导联ECG视为单通道2D图像而非传统1D时间序列。通过整合分层架构与2D选择性扫描机制,HWMamba被设计用于建模数据中的全局上下文和复杂空间关系。该模型在PhysioNet/Computing in Cardiology挑战2021数据集上进行评估,该数据集包含26个诊断标签,涵盖来自四个国家和三个大洲的七个机构的记录。结果表明,HWMamba在五个关键的阈值依赖指标上均优于当前最先进的方法,包括挑战分数和子集准确率。这些改进在保持宏AUROC接近SOTA性能的同时,提供了来自训练数据的有效阈值选择与强大的判别能力之间的平衡。这种Hexagonal Warrior表现,反映了在多个评估维度上的一致性能,使HWMamba成为多标签ECG分类的稳健且多功能的方法。

英文摘要

The accurate automated diagnosis of cardiac abnormalities from 12-lead electrocardiograms (ECGs) is critical for managing cardiovascular disease. However, detecting concurrent conditions remains a challenge for traditional deep learning models, which often have limited ability to model the long-range dependencies inherent in ECG signals. This manuscript proposes HexagonalWarriorMamba (HWMamba), a framework built on the Mamba architecture that processes 12-lead ECGs as single-channel 2D images rather than conventional 1D time series. By integrating a hierarchical architecture with a 2D Selective Scan mechanism, HWMamba is designed to model global context and complex spatial relationships within the data. The model is evaluated on the PhysioNet/Computing in Cardiology Challenge 2021 dataset, which includes 26 diagnostic labels and comprises recordings collected from seven institutions across four countries and three continents. Results demonstrate that HWMamba outperforms current state-of-the-art (SOTA) methods across five key threshold-dependent metrics, including Challenge Score and Subset Accuracy. These improvements provide a balance between strong discriminative capability and effective threshold selection derived from the training data, while maintaining near-SOTA performance in Macro AUROC. This Hexagonal Warrior performance, reflecting consistent performance across multiple evaluation dimensions, positions HWMamba as a robust and versatile approach for multi-label ECG classification.

2605.17869 2026-05-19 cs.CV 版本更新

PySIFT: GPU-Resident Deterministic SIFT for Deep Learning Vision Pipelines

PySIFT:用于深度学习视觉流水线的GPU驻留确定性SIFT

Sivakumar K. S., Mohammad Daniyalur Rahman, Gopi Raju Matta

发表机构 * Indian Institute of Technology Madras(印度理工学院马德拉斯分校)

AI总结 本文研究了经典SIFT在深度学习视觉流水线中的应用,展示了其在准确性和速度上的优势,并提出了PySIFT,一种完全在GPU上驻留的SIFT实现,能够提供确定性的输出和高效的性能。

Comments 9 pages, 6 figures

详情
AI中文摘要

在局部特征研究中,一个普遍的假设是:经典手工描述符是精度受限的过时产物,最好由学习的替代方案取代。我们证明这一假设是错误的。通过覆盖四个基准(HPatches、ROxford5K、IMC Phototourism、MegaDepth)的8种配置消融研究,我们展示了经典SIFT结合DSP多尺度池化在所有准确性指标上均优于神经描述符和方向替代方案(HardNet、OriNet),同时运行速度快2-18倍,并且学习的匹配器(LightGlue)是对经典特征的补充而非取代。这一结论重新定义了十年来的工作:不是“取代SIFT”,而是“与SIFT组合”,即经典提取与学习匹配相结合,且仅在几何上下文需要时才使用后者。这一发现此前之所以未被注意到,是因为此前没有任何GPU SIFT能够将完整流水线保留在VRAM中,或提供用于受控的经典与学习方法对比消融的模块化设计。我们提出了PySIFT,第一个完全驻留于GPU的SIFT,使用CuPy/Numba CUDA内核实现,并通过DLPack零拷贝传递到下游DL框架:无论关键点数量如何,元数据交换均为亚毫秒级的O(1)操作。在一台配备NVIDIA RTX 3050(4 GB VRAM)的笔记本电脑上,PySIFT实现了:(i)在HPatches上比OpenCV SIFT更高的平均匹配准确率(MMA);(ii)在高分辨率MegaDepth上每对图像快383毫秒;(iii)在跨数据集基准测试中更高的几何精度(在MegaDepth上AUC@10°提高5.6个百分点,在IMC Phototourism上获得更多内点);(iv)逐位确定性的输出:不同运行之间关键点和描述符完全相同,且检测结果即使跨GPU架构也能完全复现。学习的提取器若不显著牺牲性能就无法提供这一保证,并且由于cuDNN的算法选择依赖于架构,它们在跨GPU架构时根本无法实现这一保证。PySIFT是开源的,无需C++编译。

英文摘要

A widespread assumption in local feature research holds that classical handcrafted descriptors are accuracy-limited relics best replaced by learned alternatives. We show this is wrong. Through an 8-configuration ablation spanning four benchmarks (HPatches, ROxford5K, IMC Phototourism, MegaDepth), we demonstrate that classical SIFT with DSP multi-scale pooling outperforms neural descriptor and orientation replacements (HardNet, OriNet) on every accuracy metric--while running 2--18$\times$ faster--and that learned matchers (LightGlue) complement rather than supersede classical features. The conclusion reframes a decade of work: not "replace SIFT" but "compose with SIFT," classical extraction paired with learned matching only where geometric context demands it. This finding was invisible because no prior GPU SIFT kept the complete pipeline in VRAM or offered modularity for controlled classical-vs-learned ablations. We present PySIFT, the first fully GPU-resident SIFT, implemented in CuPy/Numba CUDA kernels with DLPack zero-copy handoff to downstream DL frameworks--submillisecond O(1) metadata swap regardless of keypoint count. On a laptop-grade NVIDIA RTX 3050 (4 GB VRAM), PySIFT achieves: (i) higher Mean Matching Accuracy (MMA) than OpenCV SIFT on HPatches, (ii) 383 ms faster per pair on high-resolution MegaDepth, (iii) higher geometric accuracy on cross-dataset benchmarks (+5.6 pp AUC@10${}^\circ$ on MegaDepth, more inliers on IMC Phototourism), and (iv) bitwise deterministic output--identical keypoints and descriptors across runs, with detection reproducing identically even across GPU architectures: a guarantee that learned extractors cannot match without significant performance sacrifice, and cannot achieve at all across GPU architectures due to cuDNN's architecture-dependent algorithm selection. PySIFT is open-source, requiring no C++ compilation.

2605.17865 2026-05-19 cs.CV 版本更新

Imaging Hidden Objects with Consumer LiDAR via Motion Induced Sampling

通过运动诱导采样用消费级LiDAR成像隐藏物体

Siddharth Somasundaram, Aaron Young, Akshat Dave, Adithya Pediredla, Ramesh Raskar

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Dartmouth College(达特茅斯学院)

AI总结 本文提出了一种多帧融合策略,利用运动诱导孔径采样模型,在消费级LiDAR上实现了非线视成像,实现了隐藏物体的3D重建、多物体跟踪和相机定位,并展示了消费级硬件无需额外设置即可实现非线视成像的潜力。

详情
英文摘要

LiDARs are being increasingly deployed for consumer imaging in handheld, wearable, and robotic applications. These sensors can capture the time-of-flight of light at picosecond resolution, which, in principle, enables them to capture information about objects hidden from their field of view. While such non-line-of-sight (NLOS) imaging capabilities have been shown on research-grade LiDARs, they are challenging to achieve on consumer devices due to poor signal quality resulting from low laser power, low spatial resolution, and object and camera motion. Inspired by burst photography and synthetic aperture radar, we propose a multi-frame fusion strategy to overcome these challenges and demonstrate NLOS imaging on consumer LiDAR. We first introduce the motion-induced aperture sampling model to unify the effects of object shape, object motion, and camera motion under a single measurement model. Using this model, we demonstrate several NLOS capabilities on a smartphone-grade LiDAR: (1) 3D reconstruction, (2) single and multi-object tracking, and (3) camera localization using hidden objects. Previously, NLOS imaging capabilities were largely restricted to bulky and expensive research-grade hardware that requires extensive setup and calibration. Our results represent a shift towards plug-and-play NLOS imaging, where anyone can image hidden objects with off-the-shelf hardware (under $100) and no additional setup. We believe that democratization of such capabilities will advance consumer applications of NLOS imaging.

2605.17850 2026-05-19 stat.ML cs.CV cs.LG cs.NA math.NA math.PR 版本更新

Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures

通过路径测度的序列蒙特卡洛实现扩散模型的简单近似与无导数推理时间缩放

Chenyang Wang, Weizhong Wang, Yinuo Ren, Jose Blanchet, Yiping Lu

发表机构 * School of Mathematical Sciences, Peking University, Beijing, China; School of Mathematical Sciences, Fudan University, Shanghai, China; Department of Industrial Engineering & Management Sciences, Northwestern University, Evanston, IL, United States; Institute for Computational & Mathematical Engineering, Stanford University, Stanford, CA, United States; Management Science & Engineering, Stanford University, Stanford, CA, United States

AI总结 本文提出URGE算法,一种无需梯度的推理时间缩放方法,通过路径重要性重加权提升扩散模型样本质量,同时在合成测试和扩散模型基准中表现出色,且实现简单且无梯度依赖。

Comments accepted by ICML 2026

详情
AI中文摘要

基于扩散的生成模型越来越依赖推理时引导(例如添加漂移项或对专家混合进行重新加权)来提高任务特定目标上的样本质量。然而,大多数现有技术需要反复进行分数或梯度求值,从而引入偏差、高计算开销或两者兼有。我们提出URGE(Unbiased Resampling via Girsanov Estimation),一种无导数的推理时缩放算法,通过Girsanov测度变换进行路径级重要性重加权。URGE不像先前工作那样计算基于梯度的粒子权重,而是为每条模拟轨迹附加一个简单的乘法权重,并定期重采样,无需任何分数、Hessian或PDE求值。我们建立了路径级SMC与粒子级SMC之间的等价性:Girsanov路径权重存在一个后向条件期望,可恢复先前的粒子级权重,从而保证两种方案产生相同的无偏终端分布。在实验中,URGE在合成测试和扩散模型基准上优于现有的推理时引导基线,生成质量更好,同时实现起来显著更简单且完全不依赖梯度。

英文摘要

Diffusion-based generative models increasingly rely on inference-time guidance, such as adding a drift term or reweighting a mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce URGE (Unbiased Resampling via Girsanov Estimation), a derivative-free inference-time scaling algorithm that performs path-wise importance reweighting via a Girsanov change of measure. Instead of computing gradient-based particle weights as in previous work, URGE attaches a simple multiplicative weight to each simulated trajectory and periodically resamples. No score, Hessian, or PDE evaluation is required. We establish an equivalence between path-wise and particle-wise SMC: the Girsanov path weight admits a backward conditional expectation that recovers the previous particle-level weights, guaranteeing that both schemes produce the same unbiased terminal law. Empirically, URGE outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving better generation quality while being significantly simpler to implement and fully gradient-free.
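
The weight-and-resample loop described in this abstract can be illustrated with a generic sequential Monte Carlo skeleton. This is only a sketch under stated assumptions: `step_fn` and `log_weight_fn` are hypothetical placeholders supplied by the caller, the incremental log-weight is NOT the Girsanov weight derived in the paper (the abstract does not give its formula), and multinomial resampling is one common choice rather than the authors' confirmed scheme.

```python
import math
import random


def resample(particles, log_weights, rng):
    """Multinomial resampling: draw particles in proportion to their weights.

    Returns the resampled particles and reset (uniform) log-weights.
    """
    m = max(log_weights)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(lw - m) for lw in log_weights]
    chosen = rng.choices(particles, weights=weights, k=len(particles))
    return chosen, [0.0] * len(particles)


def run_smc(n_particles, n_steps, step_fn, log_weight_fn, resample_every, rng):
    """Simulate trajectories, accumulate multiplicative weights, resample periodically.

    step_fn(x, rng) -> next state (hypothetical base-process simulator).
    log_weight_fn(x, y) -> incremental log-weight for the move x -> y
    (a placeholder; the paper's Girsanov weight is not reproduced here).
    """
    particles = [0.0] * n_particles
    log_w = [0.0] * n_particles
    for t in range(1, n_steps + 1):
        moved = [step_fn(x, rng) for x in particles]
        # Multiplying weights corresponds to adding log-weights.
        log_w = [lw + log_weight_fn(x, y)
                 for lw, x, y in zip(log_w, particles, moved)]
        particles = moved
        if t % resample_every == 0:
            particles, log_w = resample(particles, log_w, rng)
    return particles, log_w
```

A caller might pass a Gaussian random-walk `step_fn` and a reward-tilted `log_weight_fn`; no gradient of the reward is needed anywhere in the loop, which is the property the abstract emphasizes.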

2605.17834 2026-05-19 cs.CV 版本更新

Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation

稳定、扩展与增强MeanFlow用于大规模扩散蒸馏

Xiao He, Yang Li, Peizhen Zhang, Songtao Liu, Zhao Zhong, Nannan Wang

发表机构 * State Key Laboratory of Integrated Services Networks(综合业务网理论及关键技术国家重点实验室) Xidian University(西安电子科技大学) Tencent Hunyuan(腾讯混元)

AI总结 本文提出了一种稳定MeanFlow的方法,通过引入暖启动技术并结合轨迹分布对齐,提高了大规模工业模型蒸馏的性能和泛化能力。

Comments 10 pages

详情
AI中文摘要

扩散模型表现出卓越的生成能力,但其高延迟限制了实际部署。许多研究尝试减少采样步数以加速推理。其中,MeanFlow因其简洁的公式和出色的性能而受到广泛关注。然而,其优化目标的不稳定性以及“均值趋向偏差”(mean-seeking bias)限制了其在蒸馏大规模工业模型中的应用。为了稳定MeanFlow以用于大规模模型蒸馏,我们首先引入了一种预热(warm-up)技术,将MeanFlow原始的微分解替换为离散解。这种设计避免了因MeanFlow目标中包含来自未充分训练模型的stop-gradient项而导致的训练崩溃。一旦模型获得拟合平均速度场的初步能力,我们就将优化目标切换回微分解,以实现进一步的细化。同时,为了缓解MeanFlow在复杂目标分布下进行极少步推理时的“均值趋向偏差”,我们引入轨迹分布对齐作为辅助目标,促使学生模型的轨迹分布更接近教师模型的轨迹分布。我们提出的蒸馏框架在应用于文本到图像(T2I)模型FLUX.1-dev(高达12B参数)时,性能优于现有蒸馏方法。此外,当扩展到80B参数的最先进(SOTA)T2I模型HunyuanImage 3.0时,我们的方法仍然表现出稳健的泛化能力和强劲的性能。

英文摘要

Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to reduce sampling steps to accelerate inference. Among them, MeanFlow has attracted considerable attention due to its concise formulation and remarkable performance. Nevertheless, the instability of its optimization objective and its "mean-seeking bias" have limited its applicability to distilling large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique, in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids training collapse caused by the MeanFlow target containing a stop-gradient term from an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the "mean-seeking bias" of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model's trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework achieves superior performance compared to existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (up to 12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.

2605.17826 2026-05-19 cs.CV cs.AI 版本更新

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

CounterCount: 一种用于视觉语言模型计数偏差诊断的框架

Reem Alzahrani, Hassan Alshanqiti, Bushra Bin Hemid, Zaid Alyafeai, Abdelrahman Eldesokey, Bernard Ghanem

发表机构 * KAUST(阿卜杜拉国王科技大学) University of Edinburgh(爱丁堡大学) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 本文提出CounterCount框架,通过对比事实性与反事实性图像来诊断视觉语言模型在计数任务中的偏差问题,揭示模型对物体级先验知识的依赖,并提出统一的注意力调节策略提升反事实计数准确性。

详情
AI中文摘要

视觉语言模型(VLMs)在多模态推理方面表现出色,但尚不清楚其答案是基于视觉证据还是由学习的语言和世界先验知识驱动。计数提供了一个精确的测试环境:当视觉证据与常识物体知识冲突时,模型必须依赖图像而非典型计数。我们引入CounterCount,一种用于VLMs的反事实计数诊断框架,包含配对的事实性和反事实性图像、编辑过的计数相关属性、验证答案和局部化证据注释。评估最近的VLMs,我们发现其在事实性图像上表现强劲,但在反事实属性变化下持续退化,表明即使存在矛盾的视觉证据,模型仍依赖物体级先验知识。利用局部化注释,我们发现这些失败不仅由于缺失或模糊的视觉证据,而是由于模型对计数相关视觉token的注意力权重不足。我们引入一种统一的推理时间注意力调节策略,重新加权所选的视觉token,使多个VLMs的反事实计数准确率提高高达8%。总体而言,CounterCount揭示了先验驱动的计数失败,并为设计未来的VLMs提供了诊断见解。

英文摘要

Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.

2605.17823 2026-05-19 cs.CV cs.AI 版本更新

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

为什么我们看向所看之处:最大化场景理解的中央凹视觉语言模型涌现出类人注视模式

Shravan Murlidaran, Ziqi Wen, Sana Shehabi, Miguel P. Eckstein

发表机构 * Psychological & Brain Sciences, University of California, Santa Barbara(加州大学圣芭芭拉分校心理学与脑科学系) Electrical and Computer Engineering, University of California, Santa Barbara(加州大学圣芭芭拉分校电气与计算机工程系) Computer Science, University of California, Santa Barbara(加州大学圣芭芭拉分校计算机科学系)

AI总结 研究探讨了人类自由观看时注视模式的形成机制,发现最大化场景理解的视网膜视觉语言模型能够产生类似人类的注视模式,表明这种模式可能是优化场景理解的副产品。

详情
AI中文摘要

当人类在没有特定任务的情况下观察场景(自由观看)时,他们最初会将眼动定向到场景中心,然后注视人物、文本、被注视或抓取的物体以及具有语义意义的区域。这些标志性注视模式所反映的内容以及它们是否优化了底层感知任务仍不清楚。我们显示,一个具有模拟视网膜视觉的计算代理,经过训练以优化场景理解,会表现出人类样的注视模式。相比之下,经过训练以搜索或分类场景的代理版本,或配备比人类更好的或更差的周边视觉的版本,预测人类注视模式的准确性较低。因此,人类自由观看的注视模式可能是在生物视网膜视觉约束下优化场景理解的副产品。

英文摘要

When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.

2605.17822 2026-05-19 cs.CV 版本更新

Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection

释放傅里叶形状的表示能力以攻击红外目标检测

Yixing Yong, Jian Wang, Ming Lei, Lijun He, Fan Li

发表机构 * School of Information and Communications Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China(信息与通信工程学院,电子与信息工程学院,西安交通大学,西安,中国) School of Physics, Xi'an Jiaotong University, Xi'an, China(物理学院,西安交通大学,西安,中国)

AI总结 本文提出了一种基于傅里叶形状的红外目标检测攻击方法,通过引入可学习的傅里叶形状,克服了传统形状方法在表示能力和优化能力之间的根本权衡问题,实现了高效的梯度优化生成具有欺骗性的形状,使人类目标逃避检测。

详情
AI中文摘要

红外目标检测在自动驾驶和监控中至关重要,但仍然容易受到物理对抗攻击的威胁。与RGB域不同,攻击必须操控热信号,使得热阻材料的几何形状成为主要的对抗信息载体。当前基于形状的方法在表示能力和优化能力之间存在根本性的权衡,限制了攻击效果。在本文中,我们通过将可学习的傅里叶形状引入红外域,克服了这一困境。我们利用端到端可微框架,将一组紧凑的傅里叶系数,定义形状边界,通过 winding number theorem 解析地映射到像素空间的掩码。这使得能够通过梯度优化高效生成具有欺骗性的形状,使人类目标逃避检测。广泛的数字和物理实验提供了全面的评估,并验证了我们的优越性能。我们得到的物理贴片实现了惊人的鲁棒性,成功逃避了不同距离、角度、姿态和个体的检测器,且在距离大于25米(置信度=0.5)时攻击成功率超过88%。代码可在 https://github.com/Yongyx99/Fourier-shape-attack 上获得。

英文摘要

Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometric shape of heat-blocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack effectiveness. In this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5). Code is available at https://github.com/Yongyx99/Fourier-shape-attack.
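
The mapping from Fourier coefficients to a pixel mask mentioned in this abstract can be illustrated with a small, non-differentiable sketch. Everything here is an assumption made for illustration: the boundary is parameterized as z(t) = sum_k c_k exp(i k t), the curve is sampled into a polygon, and the winding number is computed by summing turning angles. The paper describes an analytic, differentiable mapping; this discretized version only shows the underlying idea, and the function names are hypothetical.

```python
import cmath
import math


def boundary(coeffs, n_samples=256):
    """Sample the closed curve z(t) = sum_k c_k * exp(i*k*t) for t in [0, 2*pi)."""
    points = []
    for s in range(n_samples):
        t = 2 * math.pi * s / n_samples
        points.append(sum(c * cmath.exp(1j * k * t) for k, c in coeffs.items()))
    return points


def winding_number(p, points):
    """Number of times the sampled closed polygon winds around the point p."""
    total = 0.0
    for a, b in zip(points, points[1:] + points[:1]):
        # Signed angle subtended at p by the polygon edge a -> b.
        total += cmath.phase((b - p) / (a - p))
    return round(total / (2 * math.pi))


def rasterize(coeffs, size, extent):
    """Binary mask over [-extent, extent]^2; a pixel is 1 if its centre is enclosed."""
    points = boundary(coeffs)
    mask = []
    for row in range(size):
        y = extent * (1 - 2 * (row + 0.5) / size)
        line = []
        for col in range(size):
            x = extent * (2 * (col + 0.5) / size - 1)
            line.append(1 if winding_number(complex(x, y), points) != 0 else 0)
        mask.append(line)
    return mask
```

For example, `coeffs = {1: 1 + 0j}` describes the unit circle, and adding higher-order terms such as `{1: 1 + 0j, -2: 0.3 + 0j}` deforms it into a lobed shape. A few coefficients can therefore describe a wide family of closed shapes, which is the representational argument the abstract makes.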

2605.17818 2026-05-19 cs.CV 版本更新

Evidence-Guided Unknown Rejection for High-Confidence Near-Known Unknowns

基于证据的未知拒绝用于高置信度近似未知物

Xi Chen, Yingjun Xiao, Gang Fang

发表机构 * Xi Chen 1(陈曦 1) Yingjun Xiao 2(肖英俊 2) Gang Fang 3(方刚 3)

AI总结 本文提出EGUR-A方法,通过改变决策方式从判断样本得分是否足够高到判断预测已知类别是否有足够证据接受样本,从而减少高置信度的误判接受。

Comments 8 pages, 2 figures, 8 tables

详情
AI中文摘要

开放集识别系统面临一个被忽视的失败模式:高置信度的近似未知物,这些样本位于已知标签集之外,但足够接近已知类别,使得闭合集分类器以高置信度接受它们。我们证明这种失败在标量阈值方法中普遍存在,包括最近的后处理检测器,并且更强的编码器可能放大而非消除风险。我们提出EGUR-A,将决策从『这个样本的得分是否足够高?』转变为『这个预测的已知类别是否有足够的证据来接受这个样本?』EGUR-A结合类别条件的局部接受证据与全局残差证据,并从已知样本统计中选择其相对权重,而无需未知验证数据。在CUB、FGVC-Aircraft和ImageNet-hard上,EGUR-A显著减少了在匹配已知拒绝操作点处的高置信度误判接受。结果不是更强的阈值,而是不同的问题:已知类别是否有权接受样本。

英文摘要

Open-set recognition systems face a neglected failure mode: high-confidence near-known unknowns, which lie outside the known label set but are close enough to known classes that a closed-set classifier accepts them with high confidence. We show that this failure is widespread across scalar-threshold methods, including recent post-hoc detectors, and that stronger encoders can amplify rather than remove the risk. We propose EGUR-A, which changes the decision from "is this sample's score high enough?" to "does this predicted known class have sufficient evidence to accept this sample?" EGUR-A combines class-conditional local acceptance evidence with global residual evidence, and selects their relative weight from known-sample statistics without unknown validation data. Across CUB, FGVC-Aircraft, and ImageNet-hard, EGUR-A substantially reduces high-confidence false known acceptance at matched known-rejection operating points. The result is not a stronger threshold; it is a different question: whether a known class is entitled to accept a sample.

2605.17807 2026-05-19 cs.CV cs.AI 版本更新

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

课程组策略优化:适应性采样以释放文本到图像生成的潜力

Baoteng Li, Xianghao Zang, Xinran Wang, Xiangyu Na, Zhixiang He, Hao Sun, Chi Zhang, Zhongjiang He, Tianwei Cao, Kongming Liang, Zhanyu Ma

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院) Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance(北京多模态数据智能感知与治理重点实验室)

AI总结 本文提出了一种适应性课程训练框架CGPO,通过动态调整采样策略来提高文本到图像生成的训练效率,同时解决多类别数据集中的数据不平衡问题。

详情
AI中文摘要

文本到图像(T2I)生成在近年来取得了显著进展。同时,强化学习方法,尤其是基于组相对策略优化(GRPO)的方法,引起了广泛关注,并已成功应用于T2I任务。然而,训练过程中常用的均匀采样策略往往忽略了样本难度与模型当前学习能力之间的匹配,导致训练效率低下。我们认为,提高训练效率需要持续优先选择那些与模型不断演进的能力相匹配、且仍可被有效学习的提示。为此,我们提出了课程组策略优化(CGPO),一种自适应课程训练框架。在训练过程中,每个提示生成一组由奖励模型评分的图像。我们使用组内奖励的方差作为衡量提示不一致性的在线代理指标。较高的方差表明模型已部分捕捉到提示的要求,但尚未达到稳定的掌握。此类提示更可能提供有用的学习信号,因此我们相应地提高其采样概率。此外,为了解决多类别数据集中的数据不平衡问题,我们设计了一种基于比例公平优化的类别校准方法,以平衡各类别之间的训练难度。在GenEval、T2I-CompBench++和DPG Bench上的实验表明,我们的框架有效提高了生成性能。

英文摘要

Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model's current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model's evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.
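
The variance-driven sampling idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: sampling probabilities are taken to be proportional to the per-prompt reward variance plus a small floor (the abstract only says the probabilities are increased "accordingly"), the `floor` parameter and function names are hypothetical, and the category calibration step is not reproduced.

```python
import random
from statistics import pvariance


def sampling_probs(group_rewards, floor=1e-3):
    """Map each prompt's group of rewards to a sampling probability.

    group_rewards: dict of prompt -> list of reward-model scores for its image group.
    floor: hypothetical constant so zero-variance prompts are still sampled occasionally.
    """
    weights = {p: pvariance(r) + floor for p, r in group_rewards.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def sample_prompts(probs, k, rng):
    """Draw k prompts (with replacement) according to the given probabilities."""
    prompts = list(probs)
    return rng.choices(prompts, weights=[probs[p] for p in prompts], k=k)


# Illustrative rewards: "mastered" and "too_hard" give consistent scores,
# while "partially_learned" gives mixed scores and so has high variance.
rewards = {
    "mastered": [0.9, 0.9, 0.9, 0.9],
    "too_hard": [0.0, 0.0, 0.0, 0.0],
    "partially_learned": [0.1, 0.9, 0.2, 0.8],
}
probs = sampling_probs(rewards)
batch = sample_prompts(probs, k=8, rng=random.Random(0))
```

Under this scheme, prompts whose image groups all score alike (consistently good or consistently bad) receive little probability mass, while prompts with mixed outcomes dominate the batch.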

2605.17799 2026-05-19 cs.CV cs.LG 版本更新

Is Complex Training Necessary for Long-Tailed OOD Detection? A Re-think from Feature Geometry

长尾分布外检测是否需要复杂的训练?从特征几何角度的重新思考

Ningkang Peng, Xuanming Chen, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学)

AI总结 本文重新审视长尾分布外检测问题,提出通过特征几何方法简化检测过程,改进Mahalanobis距离计算,提升检测性能。

详情
AI中文摘要

长尾分布外(LT-OOD)检测通常依靠专门的训练方法来解决,包括辅助分布外(OOD)数据、弃权头、对比目标、能量损失或梯度冲突控制。我们表明,这些训练机制可能掩盖了一个更简单的问题:冻结的长尾表示可能已经包含有用的OOD证据,但原始Mahalanobis距离会被与类别频率耦合的特征半径以及样本支持不足的尾部协方差所扭曲。我们提出了超球面池化Mahalanobis(HPM),一种事后检测器,它将特征归一化到单位球面,并用池化的、经岭正则化的度量替换类特定协方差,同时保留类均值作为语义锚点。在CIFAR-LT实验和ImageNet-100-LT近OOD边界分析中,HPM改进了原始Mahalanobis评分;对于先验校准经验风险最小化(PC-ERM),它在CIFAR-10-LT上将AUROC从46.49提升到85.67,在CIFAR-100-LT上从50.40提升到78.35。这个简单的PC-ERM+HPM流程还在CIFAR-100-LT上取得了最佳的对数效率分数(LES;3.08),以显著更低的训练时间成本,保留了所比较的事后评分中最佳CIFAR-100-LT AUROC的约95%。这些结果表明,在长尾分布外检测中,应将表示质量、检测器几何和训练复杂性作为相互独立的因素分别评估。

英文摘要

Long-tailed out-of-distribution (LT-OOD) detection is often addressed with specialized training, including auxiliary out-of-distribution (OOD) data, abstention heads, contrastive objectives, energy losses, or gradient-conflict control. We show that these training mechanisms can obscure a simpler issue: frozen long-tailed representations may already contain useful OOD evidence, but raw Mahalanobis distance is distorted by frequency-coupled feature radius and poorly supported tail covariance. We propose Hyperspherical Pooled Mahalanobis (HPM), a post-hoc detector that normalizes features onto the unit sphere and replaces class-specific covariance with a pooled, ridge-regularized metric while keeping class means as semantic anchors. In CIFAR-LT experiments and an ImageNet-100-LT near-OOD boundary analysis, HPM improves raw Mahalanobis scoring; for Prior-Calibrated ERM (PC-ERM), it raises AUROC from 46.49 to 85.67 on CIFAR-10-LT and from 50.40 to 78.35 on CIFAR-100-LT. This simple PC-ERM+HPM pipeline also achieves the best Log Efficiency Score (LES; 3.08) on CIFAR-100-LT, retaining roughly 95% of the best CIFAR-100-LT AUROC observed among the compared post-hoc scores at substantially lower training-time cost. These results argue for evaluating representation quality, detector geometry, and training complexity as separate factors in LT-OOD detection.
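
One way to read the HPM description as a formula is given below. Several details are assumptions that the abstract does not state: the ridge weight $\lambda$, computing the class means and the pooled covariance on the normalized features, and using the minimum distance over classes as the OOD score.

```latex
\begin{aligned}
z &= \frac{f(x)}{\lVert f(x) \rVert_2}, \qquad
\mu_c = \frac{1}{N_c} \sum_{i:\, y_i = c} z_i, \\
\Sigma_{\mathrm{pool}} &= \frac{1}{N} \sum_{c=1}^{C} \sum_{i:\, y_i = c}
  (z_i - \mu_c)(z_i - \mu_c)^{\top}, \\
s(x) &= \min_{c} \; (z - \mu_c)^{\top}
  \left( \Sigma_{\mathrm{pool}} + \lambda I \right)^{-1} (z - \mu_c).
\end{aligned}
```

Normalizing to the unit sphere removes the feature radius from the distance, and pooling gives tail classes the same metric as head classes instead of a covariance estimated from a handful of samples.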

2605.17795 2026-05-19 cs.LG cs.CV 版本更新

When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection

当准确性不够时:噪声标签学习与分布外检测之间的不确定性崩溃

Ningkang Peng, Jingyang Mao, Runhan Zhou, Peirong Ma, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学)

AI总结 本文研究了噪声标签学习与分布外检测之间的不确定性崩溃问题,提出了一种通用的ACC-OOD基准,揭示了高准确率并不保证分布外可靠性,提出虚拟边距正则化方法来缓解这一问题。

详情
AI中文摘要

噪声标签学习(LNL)通常以封闭集分类准确率作为基准指标,但部署时往往需要分类器能够拒绝分布外(OOD)输入。我们提出了一种与学习器无关的ACC-OOD基准,它冻结LNL检查点,并在合成和真实标签噪声下,使用标准化的近/远OOD路由和事后评分对其进行评估。该基准揭示了一种反复出现的失败模式:高封闭集准确率并不保证OOD可靠性,因为在噪声训练下,低置信度、被错误分类的分布内样本可能与OOD输入所占据的得分和特征区域重叠。我们将这种病理现象称为不确定性崩溃。这种结构性重叠可能导致高准确率的LNL方法在标准OOD评分下丧失ID错误/OOD界面处的可分性。作为干预措施,我们研究了虚拟边距正则化(VMR),这是一种主要结合PSSCL进行演示的轻量级修复探针,它在可信ID批次上合成边界虚拟异常值并扩大能量边距。在所测试的设置中,VMR在不替换宿主目标、也不牺牲封闭集准确率的情况下,部分缓解了由崩溃引起的远OOD失败。这些结果支持在LNL基准中同时报告封闭集泛化、开放世界可靠性以及结构重叠诊断。

英文摘要

Learning with noisy labels (LNL) is typically benchmarked by closed-set classification accuracy, yet deployment often requires classifiers to reject out-of-distribution (OOD) inputs. We present a learner-agnostic ACC-OOD benchmark that freezes LNL checkpoints and evaluates them with standardized near-/far-OOD routing and post-hoc scores across synthetic and real label noise. The benchmark reveals a recurring failure mode: high closed-set accuracy does not ensure OOD reliability, because low-confidence, misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs under noisy training. We term this pathology uncertainty collapse. This structural overlap can make high-accuracy LNL methods lose separability at the ID-error/OOD interface under standard OOD scores. As an intervention, we study Virtual Margin Regularization (VMR), a lightweight repair probe demonstrated mainly with PSSCL that synthesizes boundary virtual outliers on trusted ID batches and widens the energy margin. VMR partially reduces the collapse-induced far-OOD failure without replacing the host objective or sacrificing closed-set accuracy in the tested settings. These results support LNL benchmarks that co-report closed-set generalization, open-world reliability, and structural overlap diagnostics.

2605.17780 2026-05-19 cs.CV 版本更新

Network Knowledge Prior Guided Learning for Data-Efficient Surface Defect Detection

面向数据高效表面缺陷检测的网络知识先验引导学习

Hang-Cheng Dong, Guodong Liu, Dong Ye, Bingguo Liu

发表机构 * School of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学仪器科学与工程学院) Harbin Institute of Technology Suzhou Research Institute, Suzhou, China(哈尔滨工业大学苏州研究院)

AI总结 本文提出了一种基于网络知识先验的知识引导损失函数,通过在训练过程中整合模型可解释性,提升数据高效表面缺陷检测的性能和可信赖度。

详情
AI中文摘要

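基于深度学习的方法已成为工业缺陷检测的事实标准。然而,它们对数据的高度依赖和固有的“黑箱”特性,常常导致实际应用中的性能瓶颈和可信度受限。为应对这些挑战,本文提出了一种新的知识引导损失函数,将模型可解释性无缝融入训练过程,且不增加任何推理开销。该方法分为两个阶段:首先,训练一个初级分类网络,并生成其以显著图形式呈现的解释作为先验知识;其次,建立多任务学习框架,其中主任务执行分类,辅助任务约束最终模型与初级模型的显著图保持一致。这种一致性由专门的知识引导损失项来保证,该损失项起到强正则化的作用,引导模型学习鲁棒的特征表示。在多个公开缺陷数据集上的大量实验表明,该方法在准确率和AP方面持续提升了基线模型的性能。此外,可视化分析显示,所提方法产生的显著图更集中、更易于人类理解。这项工作为弥合模型性能与可解释性之间的差距提供了一种简单而有效的范式,为工业质检中更可靠、更高性能的视觉系统铺平了道路。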

英文摘要

Deep learning-based methods have become the de facto standard for industrial defect detection. However, their data-hungry nature and inherent "black-box" characteristics often lead to performance bottlenecks and limited trustworthiness in real-world applications. To address these challenges, this paper proposes a novel knowledge-guided loss function that seamlessly integrates model interpretability into the training process without incurring any additional inference cost. Our method operates in two phases: first, a primary classification network is trained, and its explanations, in the form of saliency maps, are generated as prior knowledge. Second, a multi-task learning framework is established, where the main task performs classification, and an auxiliary task imposes consistency between the saliency maps of the final model and the primary model. This consistency is enforced by a dedicated knowledge-guided loss term, effectively acting as a powerful regularizer to steer the model towards robust feature representations. Extensive experiments on multiple public defect datasets demonstrate that our approach consistently enhances the performance of baseline models in terms of accuracy and AP. Moreover, visual analysis reveals that the proposed method yields more concentrated and human-intelligible saliency maps. This work presents a simple yet effective paradigm for bridging the gap between model performance and interpretability, paving the way for more reliable and high-performing vision systems in industrial quality inspection.
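
The two-phase objective can be written schematically as follows. The abstract gives no explicit form, so the distance $d$ between saliency maps, the weight $\lambda$, and the notation are assumptions made for illustration.

```latex
\mathcal{L}(\theta) =
  \mathcal{L}_{\mathrm{cls}}\bigl(f_\theta(x),\, y\bigr)
  + \lambda \, d\bigl(S_\theta(x),\, S_{\mathrm{prior}}(x)\bigr)
```

Here $S_{\mathrm{prior}}$ is the saliency map of the phase-one network, held fixed, and $S_\theta$ is the saliency map of the model being trained. Only $f_\theta$ is evaluated at test time, which is consistent with the claim of no additional inference cost.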

2605.17777 2026-05-19 cs.CV 版本更新

Efficient Sparse-to-Dense Visual Localization via Compact Gaussian Scene Representation and Accelerated Dense Pose Estimation

通过紧凑的高斯场景表示和加速的密集姿态估计实现高效的稀疏到密集视觉定位

Zizhuo Li, Songchu Deng, Linfeng Tang, Jiayi Ma

发表机构 * Electronic Information School, Wuhan University(武汉大学电子信息学院) School of Robotics, Wuhan University(武汉大学机器人学院) Electronic Information School and the School of Robotics, Wuhan University(武汉大学电子信息学院和机器人学院)

AI总结 本文提出了一种高效的视觉定位方法LiteLoc,通过去除冗余的颜色场和精简密集位姿估计中的匹配,显著提升了内存和计算效率,同时保持了定位性能。

Comments IEEE/CAA JAS 2026

详情
AI中文摘要

本文提出LiteLoc,一种基于3D高斯泼溅(3DGS)的新型高效定位器。此前最先进的稀疏到密集定位器STDLoc展现了出色的定位能力,但存在严重的存储冗余和计算延迟问题。通过重新审视其设计决策,我们得到两个简单却非常有效的改进,使LiteLoc在内存和计算效率上大幅提升,同时更易于训练。一个关键发现是,直接继承自Feature 3DGS的颜色场对定位在功能上并无用处;然而,为重建高频光度细节,它需要过多的高斯基元,导致颜色与特征紧密耦合的表示,带来显著的内存开销和次优的特征场优化。为此,我们提出了一种无颜色的解耦特征场,仅保留任务必需的特征属性来构建紧凑的高斯场景表示,从而消除约94%的冗余存储,且不损失与定位相关的信息。我们进一步发现,主要的计算瓶颈在于密集的透视n点(PnP)求解器,其中大多数匹配提供的几何约束已趋于饱和,带来的精度增益递减。因此,我们提出了一种精简策略,将密集匹配提炼为占比5%的代表性匹配子集,使鲁棒估计获得近19倍的加速,而性能下降可以忽略不计。大量实验表明,LiteLoc在多个场景中超越了STDLoc,并具有可观的效率优势,为对延迟敏感的视觉定位开辟了令人期待的前景。

英文摘要

This letter presents LiteLoc, a novel and efficient localizer built on 3D Gaussian Splatting (3DGS). The previous state-of-the-art (SoTA) sparse-to-dense localizer, STDLoc, has shown remarkable localization capability but suffers from severe storage redundancy and computational latency. By revisiting its design decisions, we derive two simple yet highly effective improvements that cumulatively make LiteLoc much more efficient in both memory and computation, while also being easier to train. One key observation is that the color field, inherited directly from Feature 3DGS, is functionally useless for localization. Yet, its reconstruction of high-frequency photometric details necessitates excessive Gaussian primitives, resulting in a tightly coupled color-feature representation with significant memory overhead and sub-optimal feature field optimization. To resolve this, we propose a color-free decoupled feature field that constructs a compact Gaussian scene representation by retaining only task-essential feature attributes, thereby eliminating approximately 94% of redundant storage with no loss of localization-relevant information. We further find that the primary computational bottleneck lies in the dense Perspective-n-Point (PnP) solver, where most matches contribute saturated geometric constraints with diminishing accuracy gains. Accordingly, we propose a condensing strategy that distills dense matches into a subset of 5% representative matches, enabling a nearly 19-fold speedup in robust estimation with negligible performance drop. Extensive experiments show that LiteLoc surpasses STDLoc in multiple scenes with considerable efficiency benefits, opening up exciting prospects for latency-sensitive visual localization.

2605.17772 2026-05-19 cs.CV 版本更新

Towards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework

通过联合多目标和多模型优化框架实现通用物理对抗攻击

Ziyang Liu, Hongyuan Wang, Zijian Wang, Yinxi Lu, Yunzhao Zang, Zhiqiang Yan, Qianhao Ning

发表机构 * Research Center for Space Optical Engineering, Harbin Institute of Technology(哈尔滨工业大学空间光学工程研究中心) Zhengzhou Research Institute, Harbin Institute of Technology(哈尔滨工业大学郑州研究院)

AI总结 本文提出了一种联合多目标和多模型优化框架(JMOF),利用定量相似性分析选择最优的替代模型集合,以缓解物理对抗攻击对单一替代模型和单一优化目标的过拟合;框架通过双层机制平衡攻击效率与深度泛化,并通过正交梯度对齐策略解决跨模型梯度冲突,从而提升攻击效果和跨任务泛化能力。

Comments Under review

详情
AI中文摘要

物理对抗攻击往往会过拟合于单一的替代模型和优化目标。虽然集成攻击可以缓解这一问题,但现有方法在受限的物理纹理空间中面临严重的梯度冲突,显著降低了跨模型迁移性。为弥合这一差距,本文提出了一种联合多目标和多模型优化框架(JMOF),该框架利用定量相似性分析来选择最优的替代模型集合。在JMOF中,双层机制同时抑制预测输出并平坦化中间特征分布,在攻击效率与深度泛化之间取得平衡。此外,正交梯度对齐(OGA)策略解决跨模型梯度冲突,将相互排斥的梯度转化为协同的优化方向。大量仿真和真实世界实验表明,JMOF在面对多种黑盒检测器时优于最先进的基线方法。尤为关键的是,JMOF表现出显著的跨视觉任务泛化能力,生成的攻击能够同时欺骗目标检测模型以及语义分割或单目深度估计模型。这项研究拓展了物理对抗攻击的泛化极限,为评估现实部署中视觉AI的脆弱性提供了稳健的框架。

英文摘要

Physical adversarial attacks often overfit single surrogate models and optimization objectives. While ensemble attacks can mitigate this, existing methods struggle with severe gradient conflicts within restricted physical texture spaces, significantly degrading cross-model transferability. To bridge this gap, this paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that leverages quantitative similarity analysis to select the optimal surrogate model ensemble. Within JMOF, a dual-level mechanism jointly suppresses prediction outputs and flattens intermediate feature distributions, balancing attack efficiency with deep generalization. Additionally, an Orthogonal Gradient Alignment (OGA) strategy resolves cross-model gradient conflicts, transforming mutually repulsive gradients into synergistic optimization directions. Extensive simulated and real-world experiments demonstrate that JMOF outperforms state-of-the-art baselines against diverse black-box detectors. Crucially, JMOF exhibits substantial cross-vision-task generalization, generating attacks capable of simultaneously deceiving object detection and semantic segmentation or monocular depth estimation models. This research advances the generalization limits of physical adversarial attacks, providing a robust framework for evaluating visual AI vulnerabilities in real-world deployments.
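
The abstract does not say how OGA aligns gradients. As a generic illustration of what resolving a cross-model gradient conflict can look like, the sketch below uses a PCGrad-style projection. This is a swapped-in, well-known technique and should not be read as the paper's OGA; all names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out_conflict(g, other):
    """Remove from g its component along `other` when the two conflict.

    Gradients conflict when their inner product is negative. In that case
    the returned vector is orthogonal to `other`; otherwise g is unchanged.
    """
    d = dot(g, other)
    if d >= 0:
        return list(g)
    scale = d / dot(other, other)
    return [a - scale * b for a, b in zip(g, other)]


def combine(g1, g2):
    """Sum the two gradients after mutual conflict removal."""
    p1 = project_out_conflict(g1, g2)
    p2 = project_out_conflict(g2, g1)
    return [a + b for a, b in zip(p1, p2)]


g_model_a = [1.0, 0.0]   # texture gradient from surrogate A (toy values)
g_model_b = [-1.0, 1.0]  # texture gradient from surrogate B (toy values)
update = combine(g_model_a, g_model_b)
```

With these toy values the combined update has a non-negative inner product with both original gradients, so a step along it does not undo progress on either surrogate to first order.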

2605.17766 2026-05-19 cs.CV 版本更新

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

LatentUMM: 双重潜在对齐用于统一多模态模型

Yinyi Luo, Wenwen Wang, Hayes Bai, Marios Savvides, Jindong Wang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) William & Mary(威廉与玛丽学院)

AI总结 本文提出LatentUMM,通过构建增强的共享潜在空间,显式对齐映射到和从潜在空间的转换,提高跨模态一致性。实验表明,该方法在多种架构上一致提升了多模态一致性。

详情
AI中文摘要

统一多模态模型(UMMs)通过学习共享的潜在空间,在理解和生成方面取得优异表现,但往往在这些能力之间存在功能不一致。我们发现,这一问题并非源于共享表示的不足,而是源于映射到和从潜在空间的转换之间缺乏显式对齐。因此,生成和重新编码可能遵循不一致的轨迹,在模态转换时导致语义漂移。在本文中,我们提出了LatentUMM,一个构建增强共享潜在空间的框架,以显式对齐这些转换并提高跨模态一致性。LatentUMM包含两个阶段。第一阶段,双潜在对齐在模态和容量层面强制一致性:跨模态对齐使用更强的嵌入模型来施加结构化的跨模态语义,而双容量对齐在生成和重新编码下强制双向一致性。第二阶段,潜在动态稳定化通过随机潜在滚动和偏好优化提高鲁棒性,倾向于保留语义一致性的轨迹。实验表明,LatentUMM在多种架构上一致提高了多模态一致性。代码可在:https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM。

英文摘要

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.

2605.17759 2026-05-19 cs.CV 版本更新

FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

FrequencyBooster: 高保真像素扩散的全频建模

Lichen Ma, Zipeng Guo, Yu He, Xiaolong Fu, Luohang Liu, Jingling Fu, Junshi Huang, Yan Li

AI总结 本文提出FrequencyBooster,一种能够提升像素扩散模型全频建模能力的框架,通过高容量解码器提取高频细节和低频语义,从而在保持全局结构的同时实现更精确的像素生成。

详情
AI中文摘要

为克服基于VAE的潜在扩散模型固有的保真度瓶颈和优化偏差,像素空间扩散模型作为一种具有吸引力的端到端范式而出现。然而,现有的像素扩散模型往往难以在计算效率与高频率细节保留之间取得平衡。它们通常依赖于基于块的压缩或受限的局部解码,导致一种'频谱妥协',即高频和精细像素信息被抑制。为了解决这些挑战,我们提出了FrequencyBooster,一种新的框架,旨在为像素扩散模型赋予全频建模能力,而无需显著的开销。该方法的核心是一个高容量解码器,专门用于提取详尽的高频细节和低频语义,后者来源于Diffusion Transformer (DiT) 主干网络。与以往牺牲全局上下文以换取局部细化的工作不同,FrequencyBooster利用高维特征表示,在保持全局结构完整性的同时实现了更优的像素级精度。在ImageNet上的大量实验表明,我们的方法效果显著:在仅320个epoch内,我们的模型在256×256分辨率下达到最先进的FID为1.60。此外,在512×512分辨率下,FrequencyBooster达到FID为1.69,显著优于现有的像素空间和潜在空间生成模型。

英文摘要

To circumvent the inherent fidelity bottlenecks and optimization misalignment of VAE-based latent diffusion, pixel-space diffusion models have emerged as a compelling end-to-end paradigm. However, existing pixel diffusion models often struggle to balance computational efficiency with the preservation of high-frequency details. They frequently resort to patch-based compression or restricted local decoding, leading to a "spectral compromise" where high-frequency and fine-grained pixel information are suppressed. To address these challenges, we propose FrequencyBooster, a novel framework designed to empower pixel diffusion with full-frequency modeling capabilities without prohibitive overhead. The core of our method is a high-capacity decoder that specializes in extracting exhaustive high-frequency details and low-frequency semantics, the latter of which is derived from a Diffusion Transformer (DiT) backbone. Unlike prior works that sacrifice global context for local refinement, FrequencyBooster leverages high-dimensional feature representations to maintain global structural integrity while achieving superior pixel-level precision. Extensive experiments on ImageNet demonstrate the effectiveness of our approach: our model achieves a state-of-the-art FID of 1.60 at $256 \times 256$ resolution within only 320 epochs. Furthermore, at $512 \times 512$ resolution, FrequencyBooster attains an FID of 1.69, significantly outperforming existing pixel-space and latent-space generative models.

2605.17748 2026-05-19 cs.CV 版本更新

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

通过全局-局部自适应交互释放视觉Transformer在图像质量评估中的潜力

Yu Li, Puchao Zhou, Yachun Mi, Yanfeng Wu, Xiaoming Wang, Shaohui Liu

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Meituan(美团)

AI总结 本文提出了一种全局-局部自适应交互框架,通过双流特征提取机制和交互式全局-局部融合,提升图像质量评估的预测精度和鲁棒性,同时减少可训练参数数量。

Journal ref Proceedings of the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. [10567]-[10571], 2026

详情
AI中文摘要

在盲图像质量评估(BIQA)领域,准确预测自然环境中真实失真图像的感知质量仍然极具挑战性,因为存在多样的复杂失真。尽管现有方法已取得显著准确性,但其可扩展性常受限于主观注释的高成本和可用数据集的有限规模。近年来,大规模预训练视觉模型的进步引入了强大的语义和表征能力,但其在IQA任务中的应用受到显著的计算需求和次优微调效率的阻碍。为克服这些限制,我们引入了全局-局部交互适配器(GLIA),一种新的框架,通过双流特征提取机制与交互式全局-局部融合有效利用预训练的视觉Transformer。通过同时保留全局语义信息和细粒度局部细节,我们的方法在显著减少可训练参数的同时,实现了优越的预测精度和鲁棒性。在多个基准上的广泛实验验证了我们方法的有效性和优越性。

英文摘要

In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.

2605.17743 2026-05-19 cs.CV 版本更新

MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

MoASE++: 基于领域自适应同策略蒸馏的激活稀疏专家混合模型用于持续测试时适应

Ronyu Zhang, Aosong Cheng, Gaole Dai, Yulin Luo, Jiaming Liu, Li Du, Huanrui Yang, Dan Wang, Leyuan Fang, Yuan Du, Shanghang Zhang

发表机构 * Nanjing University and The Hong Kong Polytechnic University(南京大学和香港理工大学) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学多媒体信息处理国家重点实验室,计算机科学学院) University of Arizona(亚利桑那大学) Hong Kong University of Science and Technology(香港科技大学) School of Artificial Intelligence and Robotics, Hunan University(湖南大学人工智能与机器人学院) Nanjing University(南京大学)

AI总结 本文提出MoASE++,将激活稀疏专家混合模型与领域自适应同策略蒸馏相结合,在持续测试时适应中将领域无关的结构与领域特定的纹理解耦,并抑制无标签数据流上的确认偏差,提升模型在动态视觉环境中的持续适应能力。

详情
AI中文摘要

持续测试时适应旨在将源域预训练模型适应到非平稳、无标签的目标数据流,同时保持已有能力,但偏向纹理的骨干网络容易导致误差累积和灾难性遗忘。受人类视觉系统中形状与纹理解耦过程的启发,我们引入MoASE,一种即插即用的专家混合模块,它利用带有空间可微Dropout(SDD)的激活稀疏专家,将领域无关的结构与领域特定的纹理解耦,形成互补的高激活和低激活通路,同时用高秩和低秩瓶颈使表示多样化。激活稀疏门产生随输入自适应的SDD阈值以精确选择token,领域感知路由器利用对纹理敏感的线索为每个样本分配专家权重。为抑制无标签数据流上的确认偏差并稳定监督,我们进一步引入领域自适应同策略蒸馏,构成MoASE++,其中包括以EMA为锚的同策略反向KL蒸馏,以及以熵和置信度为条件的增强策略,使同一视图上的预测保持对齐,并改善鲁棒性与可塑性之间的平衡。在分类(CIFAR-10/100-C、ImageNet-C)和语义分割(Cityscapes->ACDC)上的大量实验表明,该方法取得了一致的最先进性能,为动态视觉环境中的持续适应提供了一种有原则、可控的方法。

英文摘要

Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.
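
The two ingredients named in "EMA-anchored on-policy reverse KL distillation" can be sketched on toy values. This is an illustrative reading, not the authors' code: the decay value, the flat parameter list, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the expectation is taken under the student.

    Assumes moderate logits so that no teacher probability underflows to 0.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(s * (math.log(s) - math.log(t))
               for s, t in zip(p_s, p_t) if s > 0)


def ema_update(teacher_params, student_params, decay=0.99):
    """Move the teacher (anchor) slowly toward the current student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


loss = reverse_kl([2.0, 0.5, -1.0], [1.5, 0.7, -0.5])
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
```

Reverse KL weights the mismatch by the student's own probabilities, which is what makes the objective on-policy. The slowly moving EMA teacher gives a target that does not chase each noisy batch.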

2605.17742 2026-05-19 cs.CV cs.HC 版本更新

UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation

UST-Hand: 一种面向3D自监督手姿态估计的不确定性感知时空点云交互网络

Tianhao Han, Haoyang Zhang, Liang Xie, Haochen Chang, Kun Gao, Yuan Cheng, Pengfei Ren, Erwei Yin

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) Beijing University of Posts and Telecommunications(北京邮电大学) Sun Yat-sen University(中山大学) Peking University(北京大学) Defense Innovation Institute, Academy of Military Sciences(国防科技创新院,军事科学学院) Tianjin Artificial Intelligence Innovation Center(天津人工智能创新中心)

AI总结 本文提出UST-Hand,一种通过估计手姿态不确定性分布并构建概率点云特征空间的自监督学习框架,以更稳定地建模复杂的时空关系,在三个具有挑战性的数据集上实现了最先进的性能,在平均逐顶点位置误差(MPVPE)上较现有自监督方法最多改善37.8%。

Comments Accepted by CVPR 2026

详情
AI中文摘要

手动标注准确的3D手姿态非常耗时且劳动密集。现有的自监督手姿态估计方法利用输入图像与渲染输出之间的差异或多视角一致性约束作为驱动因素来优化网络并逐步提高姿态精度。然而,这些方法对噪声伪标签高度敏感,并忽略了充分利用细粒度空间相关性的重要性,这削弱了模型训练的稳定性。为了解决这些问题,我们提出了UST-Hand,一种自监督学习框架,该框架估计手姿态的不确定性分布,并构建一个概率点云特征空间,从而能够建模复杂的时空关系。UST-Hand采用条件归一化流模型来捕捉手姿态分布,并采样多样化的假设,从而在噪声伪标签监督下实现更稳定的稳健学习。这些多重假设被映射到统一的概率3D点云空间中进行多视角和时间特征交互,全面探索手部运动模式和细粒度空间相关性。在三个具有挑战性的数据集上的大量实验表明,UST-Hand实现了最先进的性能,在平均逐顶点位置误差(MPVPE)上较现有自监督方法最多改善37.8%。

英文摘要

Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multi-view consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, which enables the complex spatiotemporal relationship modeling. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and samples diverse hypotheses, facilitating robust learning under noisy pseudo-labels supervision with enhanced stability. These multi-hypothesis are mapped to a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).

2605.17729 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

面向疫情韧性胸部X光分析的领域增量学习

Danu Kim

发表机构 * Danu Kim(丹努·金)

AI总结 本文提出了一种基于回放的领域增量持续学习方法,用于在跨领域变化中保持肺炎检测的鲁棒性和一致性,通过类感知平衡回放和类感知损失实现平衡的类表示和动态重加权,实验表明该方法在领域偏移的PneumoniaMNIST数据集上达到88.66%的平均准确率,优于经验回放、微调和联合训练基线。

Comments Published in Korea Software Congress (2025)

详情
AI中文摘要

深度学习模型在肺炎检测中实现了高准确性,但其在临床领域中的泛化能力受限于成像设备、获取协议和机构条件的差异。本研究引入了一种基于回放的领域增量持续学习方法,旨在使模型能够持续适应跨领域变化而不发生灾难性遗忘。所提出的方法结合了类感知平衡回放以在受限内存中保持平衡的类表示,以及类感知损失以在训练过程中动态重新加权类不平衡。在包含五个模拟领域的领域偏移PneumoniaMNIST数据集上进行的实验表明,所提出的方法实现了88.66%的平均准确率,优于经验回放、微调和联合训练基线。这些发现突显了所提出方法在跨临床环境变化中实现稳健和一致肺炎检测的有效性。

英文摘要

Deep learning models achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.
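
A class-balanced replay memory and a class-reweighting rule can be sketched as follows. The abstract does not give the exact eviction or weighting rules, so the policy below (evict from the largest class, inverse-frequency weights normalized to mean 1) and all names are assumptions.

```python
import random
from collections import defaultdict


class ClassBalancedReplay:
    """Fixed-size memory that evicts from whichever class is largest."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.slots = defaultdict(list)  # label -> stored samples
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(v) for v in self.slots.values())

    def add(self, sample, label):
        self.slots[label].append(sample)
        # Over budget: drop a random sample from the largest class.
        while len(self) > self.capacity:
            largest = max(self.slots, key=lambda c: len(self.slots[c]))
            victim = self.rng.randrange(len(self.slots[largest]))
            self.slots[largest].pop(victim)


def inverse_frequency_weights(counts):
    """Per-class loss weights proportional to 1/count, with mean 1."""
    inv = {c: 1.0 / n for c, n in counts.items()}
    mean_inv = sum(inv.values()) / len(inv)
    return {c: w / mean_inv for c, w in inv.items()}


buffer = ClassBalancedReplay(capacity=20)
for i in range(90):
    buffer.add(("normal", i), label=0)
for i in range(10):
    buffer.add(("pneumonia", i), label=1)
weights = inverse_frequency_weights({0: 90, 1: 10})
```

On this 90/10 toy stream the memory ends with ten samples per class, and the rare class receives a loss weight nine times that of the common class.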

2605.17727 2026-05-19 cs.CV 版本更新

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

GraSP-VL: 长度作为视觉-语言表示的语义粒度接口

Zesheng Li, Chengchang Pan, Honggang Qi

发表机构 * University of the Chinese Academy of Sciences(中国科学院大学)

AI总结 本文研究如何将嵌入长度转化为可控的语义访问接口,提出GraSP-VL方法,通过学习共享的近正交前缀变换,实现视觉-语言嵌入的语义层次递进接口,并在多个数据集上验证了其有效性。

Comments Preprint

详情
AI中文摘要

冻结的视觉-语言嵌入包含从物体身份到属性、关系和完整描述意义的多级语义信号,但这些信号通过固定长度的向量接口暴露。我们研究是否可以将嵌入长度转化为可控的语义访问接口。我们提出了GraSP-VL,它在冻结VLM嵌入上学习了一个共享的近正交前缀变换。GraSP-VL实现了语义马特罗什卡接口:短前缀被分配粗粒度的语义角色,而更长的前缀逐步暴露更细粒度的语言基础区分。由于变换在图像和文本嵌入之间共享,并且保持了全维度几何,前缀行为的变化不会改写原始VLM空间。在包含20,147个示例的COCO/Flickr30K注释池上,GraSP-VL达到了阶梯评分53.01和难负样本选择性89.76,同时保持全空间漂移低于10^-6。它还转移到SugarCrepe-clean数据集,达到86.03的对象准确率和11.96的平均外部涌现,并保持全维度零样本CIFAR-100准确率。这些结果表明,冻结的VLM嵌入可以重新组织为可截断的语义前缀接口,而不是仅仅压缩。

英文摘要

Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below $10^{-6}$. It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.
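
The claim that a shared orthogonal transform changes prefix behavior while leaving full-dimensional geometry intact can be checked on a toy case. The sketch uses a fixed 2D rotation as a stand-in for the learned transform, which is an assumption made purely for illustration.

```python
import math


def rotate(v, theta):
    """Apply a 2D rotation, the simplest orthogonal transform."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def dot(u, v, k=None):
    """Inner product over the first k coordinates (all when k is None)."""
    k = len(u) if k is None else k
    return sum(a * b for a, b in zip(u[:k], v[:k]))


image_emb = [0.6, 0.8]
text_emb = [0.8, 0.6]
theta = math.pi / 4  # stand-in for a learned transform (assumption)

# The same transform is applied to both modalities.
image_t = rotate(image_emb, theta)
text_t = rotate(text_emb, theta)

full_before = dot(image_emb, text_emb)
full_after = dot(image_t, text_t)
prefix_before = dot(image_emb, text_emb, k=1)
prefix_after = dot(image_t, text_t, k=1)
```

The full-length similarity is the same before and after the rotation, while the length-1 prefix similarity changes. That is the degree of freedom a prefix transform can use to decide which information lands in the leading coordinates.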

2605.17719 2026-05-19 cs.CV 版本更新

Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation

Patch-MoE Mamba: 一种用于医学图像分割的基于补丁顺序的专家混合状态空间架构

Diego Adame, Fabian Vazquez, Jose A. Nunez, Huimin Li, Jinghao Yang, Erik Enriquez, DongChul Kim, Haoteng Tang, Bin Fu, Pengfei Gu

发表机构 * University of Texas Rio Grande Valley(德克萨斯大学里奥格兰德河谷分校)

AI总结 本文提出了一种基于补丁顺序的专家混合状态空间架构Patch-MoE Mamba,以解决现有Mamba分割模型在像素级方向扫描破坏局部二维空间结构以及简单求和融合方向无法适应多样物体大小、形状和边界的问题。

详情
AI中文摘要

基于CNN和Transformer的架构在医学图像分割中已取得优异性能,但CNN在建模长距离依赖性方面存在限制,而Transformer则常面临二次计算和内存复杂度的问题。状态空间模型,尤其是基于Mamba的网络,提供了一种高效的替代方案,具有线性序列复杂度。然而,现有的Mamba分割模型仍面临两个限制:像素级方向扫描会破坏局部二维空间结构,而简单的求和融合方向无法适应多样化的物体大小、形状和边界。为了解决这些问题,我们提出了Patch-MoE Mamba,一种用于医学图像分割的基于补丁顺序的专家混合状态空间架构。它引入了一种分层的补丁顺序扫描机制,能够在保留局部空间邻域的同时捕捉多尺度上下文,并引入了基于MoE的方向融合模块,通过四个方向专家、一个可学习的连接专家和残差方向聚合,自适应地结合多个Mamba扫描器输出。在五个公开的息肉分割基准和ISIC 2017/2018皮肤病变分割数据集上的实验表明了Patch-MoE Mamba的有效性和通用性。

英文摘要

CNN- and Transformer-based architectures have achieved strong performance in medical image segmentation, but CNNs are limited in modeling long-range dependencies, while Transformers often suffer from quadratic computational and memory complexity. State space models, especially Mamba-based networks, offer an efficient alternative with linear sequence complexity. However, existing Mamba segmentation models still face two limitations: pixel-wise directional scanning can disrupt local 2D spatial structure, and simple summation-based fusion of scan directions cannot adapt well to diverse object sizes, shapes, and boundaries. To address these issues, we propose Patch-MoE Mamba, a patch-ordered mixture-of-experts state space architecture for medical image segmentation. It introduces a hierarchical patch-ordered scanning mechanism that preserves local spatial neighborhoods while capturing multi-scale context, and an MoE-based directional fusion module that adaptively combines multiple Mamba scanner outputs using four directional experts, a learnable concatenation expert, and residual directional aggregation. Experiments on five public polyp segmentation benchmarks and the ISIC 2017/2018 skin lesion segmentation datasets demonstrate the effectiveness and generality of Patch-MoE Mamba.

2605.17686 2026-05-19 cs.CV 版本更新

Brain-inspired spike-timing plasticity for reliable label-efficient event-camera vision

脑启发式脉冲时间依赖性可塑性用于可靠的标签高效事件相机视觉

Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad

发表机构 * School of Electrical and Computer Engineering, University of Oklahoma(俄克拉荷马大学电气与计算机工程学院)

AI总结 本文提出了一种基于脑启发式脉冲时间依赖性可塑性(STDP)的事件相机视觉方法,通过三个局部STDP模块实现无需GPU支持的单线程处理,提升了标签效率和检测性能。

详情
AI中文摘要

事件相机目标检测器的部署受到逐帧标注需求和GPU计算需求的限制。本文引入了三个局部脉冲时间依赖可塑性(STDP)模块,即序列模块、候选模块和管道(tube)可靠性模块,它们在单个CPU线程上运行,无需GPU支持。在FRED无人机基准上,所提框架覆盖三个标签高效的监督层级:严格零标签检测器达到53.8%的mAP@30,约26比特的训练集衍生信息可达到76.9%的mAP@30,而STDP候选可靠性门达到78.60±0.42%的mAP@30。在采集顺序漂移下,群组门比流式k-means高出2.03±0.58个百分点,20次试验全部为正,而无漂移对照则否证了该效应。STDP将单模型方差降低了6.6倍,单个训练好的门即可与44个种子的集成界相当。该门可迁移到Intel Lava,top-2一致率为89%。在EVUAV基准上,管道级STDP层在Pd≥88%时将虚警从454降至331e-4。密集梯度训练的检测器在构造上无法提供这种梯度训练、密集矩阵乘法和无局部可塑性操作的组合。

英文摘要

Deploying event-camera object detectors is constrained by per-frame labeling requirements and GPU compute demands. This work introduces three local spike-timing-dependent plasticity (STDP) modules, including sequence, candidate, and tube-reliability modules, that operate on a single CPU thread without GPU support. On the FRED drone benchmark, the proposed framework spans three label-efficient supervision tiers. A strict zero-label detector achieves 53.8% mAP@30, approximately 26 train-derived bits achieve 76.9% mAP@30, and an STDP candidate-reliability gate achieves 78.60 +/- 0.42% mAP@30. Under acquisition-order drift, the cohort gate outperforms streaming k-means by 2.03 +/- 0.58 percentage points across 20 of 20 positive trials, while a no-drift control falsifies the effect. STDP reduces single-model variance by 6.6 times, and one trained gate matches a 44-seed ensemble bound. The gate transfers to Intel Lava with 89% top-2 agreement. On the EVUAV benchmark, a tube-level STDP layer reduces false alarms from 454 to 331e-4 at Pd >= 88%. Dense gradient-trained detectors cannot provide this combination of gradient training, dense matrix multiplication, and local plasticity-free operation by construction.

2605.17685 2026-05-19 cs.CV cs.AI cs.CR cs.SY eess.SP eess.SY 版本更新

Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition

基于注意力引导的1D和2D CNN融合用于鲁棒的基于ECG的生物识别

Islameddine Arioua, Amir Benzaoui, Abdelhafid Zeroual, Lotfi Houam

Affiliations * PIMIS Laboratory, Electronics and Telecommunications Department; Université du 8 Mai 1945; Electrical Engineering Department, University of 20 August 1955; Department of Electrical Engineering, Faculty of Science and Applied Sciences; Larbi Ben M'hidi University; Department of Electronics and Communications, University of Larbi Tebessi

AI summary This paper proposes a hybrid framework that combines 1D and 2D CNNs and uses an attention-guided fusion mechanism to improve the robustness and performance of ECG biometric recognition. Experiments report high identification accuracy on several datasets.

Journal ref Digital Signal Processing 2026

详情
AI中文摘要

基于心电图(ECG)的生物识别已作为一种安全的身份验证和活体检测的有希望的解决方案。然而,大多数现有方法依赖于单模深度学习架构,单独处理一维(1D)时间信号或二维(2D)时频表示,限制了鲁棒性和泛化能力。为了解决这个问题,本文提出了一种将1D和2D卷积神经网络(CNNs)整合到统一端到端架构中的混合框架。1D分支从原始ECG信号中提取时序和形态学特征,而2D分支从时频表示中捕获判别性的频谱信息。注意力引导的融合机制根据输入特性动态加权两种模态,克服了传统静态融合策略的局限性。该框架在三个基准数据集(ECG-ID、MIT-BIH和PTB)上进行了评估,包括健康受试者和患有心脏病理学的患者,分别实现了99.56%、100.00%和99.89%的识别准确率。为了评估长期生物稳定性,还进行了多会话Heartprint数据集的实验,该数据集跨越十年。所提出的方法在相同会话中实现了98.54%(S1)、99.09%(S2)、94.93%(S3R)和96.08%(S3L)的准确率,跨会话评估达到了56.33%(S1-S2)和53.27%(S2-S3R),证明了其在时间上的稳定生物特征捕获能力。最优配置结合了InceptionTime用于1D处理,ResNet-34用于2D分析,以及基于注意力的融合。消融研究证实,所提出的注意力机制在传统融合方法中始终表现更优。总体而言,所提出的框架为ECG生物识别提供了一种稳健、可扩展且高性能的解决方案。

英文摘要

Electrocardiogram (ECG)-based biometric recognition has emerged as a promising solution for secure authentication and liveness detection. However, most existing methods rely on unimodal deep learning architectures that independently process either one-dimensional (1D) temporal signals or two-dimensional (2D) time-frequency representations, limiting robustness and generalization. To address this issue, this paper proposes a hybrid framework integrating 1D and 2D convolutional neural networks (CNNs) within a unified end-to-end architecture. The 1D branch extracts temporal and morphological features from raw ECG signals, while the 2D branch captures discriminative spectral information from time-frequency representations. An attention-guided fusion mechanism dynamically weights both modalities according to input characteristics, overcoming the limitations of conventional static fusion strategies. The framework was evaluated on three benchmark datasets (ECG-ID, MIT-BIH, and PTB), including healthy subjects and patients with cardiac pathologies, achieving identification accuracies of 99.56%, 100.00%, and 99.89%, respectively. To assess long-term biometric permanence, experiments were also conducted on the multi-session Heartprint dataset spanning ten years. The proposed approach achieved same-session accuracies of 98.54% (S1), 99.09% (S2), 94.93% (S3R), and 96.08% (S3L), while cross-session evaluations reached 56.33% (S1-S2) and 53.27% (S2-S3R), demonstrating the ability to capture stable biometric signatures over time. The optimal configuration combines InceptionTime for 1D processing, ResNet-34 for 2D analysis, and attention-based fusion. Ablation studies confirm that the proposed attention mechanism consistently outperforms conventional fusion approaches. Overall, the proposed framework provides a robust, scalable, and high-performance solution for ECG biometric recognition.

2605.17682 2026-05-19 cs.CV 版本更新

GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning

GEM:用于占用预测和运动规划的高斯演化模型

Cheng Chen, Hao Huang, Saurabh Bagchi

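Affiliations * Purdue University; New York University Abu Dhabi

AI summary This work proposes GEM, a Gaussian Evolution Model for efficient occupancy forecasting and motion planning. It addresses the limits of conventional approaches in temporal flexibility, scene evolution, and matching continuous-time dynamics.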

详情
AI中文摘要

未来3D语义占用预测和运动规划是自动驾驶的核心,需要模型能够推断周围场景的演变和车辆的行动。现有占用世界模型通常将场景离散化为潜在嵌入、体素特征或量化标记,并通过固定步长自回归生成预测未来状态。这限制了时间灵活性,掩盖了场景演变,长时间预测会积累误差,并且难以匹配真实驾驶场景的连续时间动态。我们提出了GEM,一种用于非自回归占用世界建模的高斯演化模型,其中驾驶场景被表示为学习的动态显式连续4D高斯原语。与逐步推演未来占用状态不同,GEM可以直接查询高斯世界表示中的任意时间戳,并将相应的条件3D高斯分布投射到语义占用体积中。这使得能够高效地进行全时间范围预测,同时保留紧凑且可解释的场景表示。通过解耦空间几何、时间支持和原语运动,GEM使预测的世界更容易检查,因为每个原语的演变可以连续随时间跟踪。相同表示也支持运动规划,通过从学习的高斯世界预测未来的车辆轨迹。大量实验表明,GEM在未来的语义占用预测和强大的运动规划性能方面均达到最先进的水平,同时提供灵活的时间查询。

英文摘要

Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed-step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous-time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive's evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state-of-the-art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.
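The step of querying a 4D Gaussian at an arbitrary timestamp can be illustrated with standard Gaussian conditioning. The sketch below is a minimal example under stated assumptions: the primitive is a plain Gaussian over (x, y, z, t) with a full 4x4 covariance, and the temporal weight is a simple marginal density factor. GEM's actual parameterization and learned dynamics are not described in the abstract, and the function name is hypothetical.

```python
import math


def condition_on_time(mean, cov, t):
    """Condition a Gaussian over (x, y, z, t) on a query time t.

    mean: [mx, my, mz, mt]; cov: 4x4 symmetric covariance (nested lists).
    Returns (mean_3d, cov_3d, weight), where weight is an unnormalized
    temporal factor exp(-0.5 * (t - mt)^2 / var_t) (an assumed choice).
    """
    var_t = cov[3][3]
    cross = [cov[i][3] for i in range(3)]  # covariance between space and time
    dt = t - mean[3]
    mean_3d = [mean[i] + cross[i] / var_t * dt for i in range(3)]
    cov_3d = [
        [cov[i][j] - cross[i] * cross[j] / var_t for j in range(3)]
        for i in range(3)
    ]
    weight = math.exp(-0.5 * dt * dt / var_t)
    return mean_3d, cov_3d, weight


# A primitive whose x position is correlated with time, i.e. it moves along x.
mean = [0.0, 0.0, 0.0, 1.0]
cov = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 1.0],
]
mean_3d, cov_3d, weight = condition_on_time(mean, cov, t=3.0)
```

Any timestamp can be queried directly this way, without stepping through intermediate states, which is the contrast with fixed-step autoregressive rollout drawn in the abstract.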

2605.17673 2026-05-19 cs.CV 版本更新

A simple approach for biometrics: Finger-knuckle prints recognition based on a Sobel filter and similarity measures

一种简单的生物识别方法:基于Sobel滤波器和相似性度量的指纹-指节印识别

E. O. Rodrigues, T. M. Porcino, Aura Conci, Aristofanes C. Silva

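Affiliations * Department of Computer Science, Universidade Federal Fluminense; Department of Electrical Engineering, Universidade Federal do Maranhão

AI summary This paper proposes a simple finger-knuckle print recognition method that uses a Sobel filter for edge detection, a simple noise reduction step, and similarity measures. The resulting binary images are efficient to process and store, and experiments report up to 17.02% correct recognitions on a large dataset.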

Journal ref 2016 International Conference on Systems, Signals and Image Processing (IWSSIP)

详情
AI中文摘要

本文的目标是提出一种新的指纹-指节印识别方法,该方法本质上是手指指节区域的数字照片。我们采用了非常简单的视觉计算概念,如基于Sobel算子的边缘检测滤波器和简单的噪声减少算法。这些操作非常快速,能够产生二值图像,这些图像在处理和存储上都非常高效。此外,除了预处理之外,还考虑并评估了某些相似性度量以用于该任务。在预处理输入手指后,将其与数据集中所有手指的图像一一进行比较。我们获得了在大规模数据集上高达17.02%的成功识别率(真阳性率).

英文摘要

The objective of this work is to propose a novel methodology for recognizing finger-knuckle prints, which are essentially digital photos of the finger-knuckle region. We employ very simple visual computing concepts, such as a filter based on the Sobel operator for finding edges and a simple noise reduction algorithm. These operations are exceptionally fast and produce binary images, which are very efficient to process and store. Alongside this preprocessing, several similarity measures were also considered and evaluated for the task. After preprocessing, an input finger image is compared to all the finger images in the dataset, one by one. We obtained up to 17.02% successful recognitions (true positive rate) on a large dataset.
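The preprocessing described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the threshold value is arbitrary, the noise reduction step is omitted, and Jaccard overlap is used only as a stand-in because the abstract does not name the similarity measures that were evaluated.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_binary(img, threshold):
    """Return a binary edge map: 1 where the Sobel gradient magnitude >= threshold.

    img is a list of rows of grayscale values; border pixels are left at 0.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                out[y][x] = 1
    return out


def jaccard(a, b):
    """Overlap between two binary images of equal size (stand-in similarity)."""
    inter = sum(p & q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    union = sum(p | q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return inter / union if union else 1.0


# A 5x5 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_binary(image, threshold=500)
```

Recognition then amounts to computing the similarity between the query edge map and every stored edge map and taking the best match, which is the one-by-one comparison the abstract describes.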

2605.17668 2026-05-19 cs.CV 版本更新

Deep learning-based compression of giga-resolution whole slide images

基于深度学习的高分辨率全切片图像压缩

Maren Høibø, Etienne Gaucher, Ingerid Reinertsen, Marit Valla, Erik Smistad

Affiliations * Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology (NTNU); Clinic of Laboratory Medicine, St. Olavs hospital, Trondheim University Hospital; Department of Health Research, SINTEF Digital; Department of Circulation and Medical Imaging, Norwegian University of Science and Technology (NTNU)

AI summary This paper studies deep learning-based compression of whole slide images, comparing deep learning-based glass removal and compression against conventional codecs (JPEG, JPEG-2000, JPEG-XL). Deep learning compression reduces file size more effectively but takes longer to decompress.

详情
AI中文摘要

数字病理学的实施导致全切片图像(WSI)数量增加。WSI的大小对存储构成挑战。目前,WSI使用JPEG等编码器压缩,每个WSI占用数GB空间,存储玻璃导致大量空间浪费。本研究探讨并比较了基于深度学习的组织分割用于去除玻璃和深度学习压缩方法与JPEG、JPEG-2000和JPEG-XL。创建了包含完整玻璃、玻璃替换为单色像素以及玻璃替换为零字节瓷砖的图像金字塔(N=21),并使用JPEG、JPEG-XL和深度学习模型进行压缩。此外,几种压缩模型在组织切片数据集上进行了评估,并与JPEG、JPEG-2000和JPEG-XL进行了比较。去除玻璃显著减少了JPEG和JPEG-XL的文件大小。与JPEG压缩相比,基于深度学习的图像压缩将WSI大小减少了43-72%,而基于深度学习的玻璃去除将WSI大小减少了0.3-33%和6-62%(仅使用单色像素和去除所有玻璃瓷砖)。结合两者,总大小减少为44-80%,表明基于深度学习的图像压缩能高效压缩玻璃瓷砖,而JPEG则不能。在组织切片数据集上,最好的基于深度学习的压缩模型在每块切片上平均节省了约35-40%的存储空间,同时保持平均SSIM值高于0.95,而JPEG-XL和JPEG-2000分别节省了17%和14%,同时保持SSIM值为0.96。然而,深度学习模型的解压时间比JPEG和JPEG-XL更高。

英文摘要

Implementation of digital pathology leads to an increased number of whole slide images (WSIs). The large size of WSIs makes storage challenging. Today, WSIs are compressed with codecs like JPEG, resulting in several gigabytes per WSI, and large amounts of space are wasted storing glass. In this study, deep learning-based tissue segmentation for glass removal and deep learning compression methods were explored and compared with JPEG, JPEG-2000 and JPEG-XL. Image pyramids (N=21) with intact glass, glass replaced by single-colored pixels, and glass replaced by zero-byte tiles were created and compressed with JPEG, JPEG-XL and a deep learning model. Additionally, several compression models were evaluated on a tissue patch dataset and compared with JPEG, JPEG-2000 and JPEG-XL. Removing glass reduced file sizes considerably for JPEG and JPEG-XL. Deep learning-based image compression reduced the WSI size by 43-72% compared to JPEG compression, whereas deep learning-based glass removal reduced the WSI size by 0.3-33% when replacing glass with single-colored pixels and by 6-62% when removing all-glass tiles. Combining the two gave a small improvement, to a 44-80% total size reduction, which indicates that deep learning-based image compression is able to efficiently compress glass tiles, whereas JPEG is not. On the tissue patch dataset, the best deep learning-based compression models saved on average ~35-40% per patch compared to JPEG while keeping an average SSIM above 0.95, whereas JPEG-XL and JPEG-2000 saved 17% and 14%, respectively, while keeping an SSIM of 0.96. However, the deep learning models had higher decompression times than JPEG and JPEG-XL.

2605.17661 2026-05-19 cs.RO cs.CV 版本更新

Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping

Mono-Hydra++: 基于多任务学习的实时单目场景图构建用于3D室内映射

U. V. B. L. Udugama, George Vosselman, Francesco Nex

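Affiliations * Department of Earth Observation Science, University of Twente

AI summary This paper presents Mono-Hydra++, a real-time monocular RGB plus IMU pipeline based on multi-task learning for indoor 3D metric-semantic mapping and hierarchical 3D scene graph construction. By combining the M2H-MX multi-task model with a deep-feature visual-inertial odometry front end, it delivers real-time metric-semantic mapping and scene graph construction on resource-constrained robotic platforms without active depth sensors.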

Comments Submitted to ISPRS Journal of Photogrammetry and Remote Sensing. 50 pages, figures and tables included. Code: https://github.com/BavanthaU/mono-hydra-pp.git

详情
AI中文摘要

自主敏捷机器人需要的不仅仅是度量几何:它们必须理解物体、房间、地点和空间关系,以进行搜索、检查、探索和人机交互。传统度量地图支持定位和避障,但不提供这种语义和关系结构。3D场景图通过将几何与物体级和房间级的理解连接起来,填补了这一空白。在敏捷平台上构建此类表示仍然困难,因为空中和轻量级机器人受到严格的载荷、电力和计算限制,使RGB-D相机和LiDAR传感器在许多机载设置中不切实际。我们提出了Mono-Hydra++,一种实时单目RGB加IMU流水线,用于室内度量语义映射和分层3D场景图构建。该系统结合了M2H-MX,一种基于DINOv3的多任务模型,用于深度和语义,以及深度特征视觉惯性里程计前端,稀疏预测深度约束在VIO推导的姿态图中,语义遮蔽用于动态区域,以及在Mono-Hydra后端体积融合前的姿态感知时间对齐。在Go-SLAM ScanNet评估子集中,Mono-Hydra++在仅使用单目RGB加IMU输入的情况下,其平均轨迹误差比我们比较中的最强RGB-D基线低1.6%,在校准的7-Scenes中,其平均ATE比最强的竞争校准基线提高了29.8%。我们进一步在真实ITC建筑部署中验证了Mono-Hydra++,使用RealSense RGB加IMU,并通过在Jetson Orin NX 16GB上部署ONNX/TensorRT FP16 M2H-MX-L感知模型,以25.53 FPS的速度证明了嵌入可行性。这些结果表明,Mono-Hydra++可以在不依赖主动深度传感器的情况下,为资源受限的机器人平台提供实时度量语义映射和场景图构建。

英文摘要

Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human-robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object-level and room-level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real-time monocular RGB plus IMU pipeline for indoor metric-semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3-based multi-task model for depth and semantics, with a deep-feature visual-inertial odometry front end, sparse predicted depth constraints in the VIO-derived pose graph, semantic masking for dynamic regions, and pose-aware temporal alignment before volumetric fusion in the Mono-Hydra backend. On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline in our comparison, while using only monocular RGB plus IMU input. On calibrated 7-Scenes, it improves average ATE by 29.8% over the strongest competing calibrated baseline. We further validate Mono-Hydra++ in a real ITC building deployment using RealSense RGB plus IMU and demonstrate embedded feasibility by deploying the ONNX/TensorRT FP16 M2H-MX-L perception model at 25.53 FPS on a Jetson Orin NX 16GB. These results show that Mono-Hydra++ can provide real-time metric-semantic mapping and scene graph construction for resource-constrained robotic platforms without relying on active depth sensors.

2605.17640 2026-05-19 cs.IR cs.CV 版本更新

MARQUIS: A Three-Stage Pipeline for Video Retrieval-Augmented Generation

MARQUIS:视频检索增强生成的三阶段流水线

Debashish Chakraborty, Dengjia Zhang, Jialiang Jin, Hanting Liu, Katherine Guerrerio, Hanxiang Qin, Tyler Skow, Alexander Martin, Reno Kriz, Benjamin Van Durme

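Affiliations * Johns Hopkins University; Human Language Technology Center of Excellence

AI summary This paper presents MARQUIS, a three-stage pipeline that addresses retrieval and generation shortcomings in video retrieval-augmented generation through query expansion, fusion, and reranking; calibrated structured evidence extraction; and article generation from the extracted evidence. It improves both retrieval performance and the quality of generated articles.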

Comments Accepted as an oral presentation at the ACL 2026 Workshop MAGMaR Systems. 27 pages, 4 figures. Code can be found here: https://github.com/debashishc/marquis

详情
AI中文摘要

视频检索增强生成需要系统从大规模语料中检索相关音频视觉证据,并将其合成为连贯且有属性的文本。当前方法在两个端点都存在困难:检索方法在复杂多维查询上表现不佳,这些查询无法用单一嵌入捕捉,而生成方法缺乏跨多个视频的高层推理能力,并且在长多视频上下文中面临内存限制。我们提出了MARQUIS:一个三阶段流水线,通过(1)查询扩展、融合和重排序,(2)校准的结构化证据提取,以及(3)从提取的证据生成文章,可选地由RLM控制。在MAGMaR2026共享任务中,我们提升了检索性能从0.195到0.759(nDCG@10)。对于文章生成,ITER-QA-BASE在CAG基线上将平均人类评分从3.09提高到3.83,而MARQUIS-RLM获得3.30的人类评分并在非QA系统中实现最强的引用召回率。

英文摘要

Retrieval-augmented generation from videos requires systems to retrieve relevant audiovisual evidence from large corpora and synthesize it into coherent, attributed text. Current approaches struggle at both ends: retrieval methods fail on complex, multi-faceted queries that cannot be captured by a single embedding, while generation methods lack the high-level reasoning needed to synthesize across multiple videos and face memory constraints over long, multi-video contexts. We present MARQUIS: a three-stage pipeline that addresses these limitations through (1) query expansion, fusion, and reranking, (2) calibrated structured evidence extraction, and (3) article generation from extracted evidence, optionally controlled by an RLM. On the MAGMaR2026 shared task, we improve retrieval performance from 0.195 to 0.759 (nDCG@10). For article generation, ITER-QA-BASE improves average human score from 3.09 to 3.83 over the CAG baseline, while MARQUIS-RLM achieves a human score of 3.30 and the strongest citation recall among non-QA systems.
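The retrieval numbers above are reported as nDCG@10. For readers unfamiliar with the metric, here is a minimal sketch of one common formulation (linear gain, log2 discount). Which gain variant the shared task uses is not stated in the abstract, and the sketch assumes the input list contains the relevance of every judged item so that the ideal ranking can be derived from it.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    relevances[i] is the graded relevance of the item ranked at position i.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0])   # relevant item ranked first
swapped = ndcg_at_k([0, 1])      # relevant item ranked second
```

A score of 1.0 means the relevant items are ranked in the best possible order; pushing a relevant item down the list lowers the score through the logarithmic discount.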

2605.17638 2026-05-19 cs.CV 版本更新

TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts

TouchMap-OR: 医院内多视角手-表面接触的3D映射

Sophokles Ktistakis, Rui Wang, Bastian Grande, Hugo Sax

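Affiliations * ETH Zurich; Institute for Anesthesiology and Perioperative Medicine, University Hospital Zurich; Department of Public and Global Health, University of Zurich

AI summary This paper presents TouchMap-OR, a multi-view RGB-D vision system for identity-resolved reconstruction of hand-surface contacts in operating rooms. It uses the semantic structure of the clinical environment to infer when and where contacts occur, fuses multi-view hand reconstructions and associates them with tracked clinicians to obtain consistent hand trajectories, and builds a semantic 3D model of the operating room so that those trajectories can be mapped to specific surfaces.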

详情
AI中文摘要

临床医生、患者和医疗设备之间的手-表面互动在医疗程序中起着核心作用,在病原体传播中起关键作用。然而,这些互动仍然大多未被观察到,因为目前的感染预防实践依赖于手动观察,无法重建详细的接触历史。在本工作中,我们提出了在手术室中身份分辨的手-表面互动重建问题,并引入了TouchMap-OR,一种多视角RGB-D视觉系统,该系统能够建模医生、可变形手部几何结构以及临床环境的语义结构,以推断接触发生的时间和位置。该系统在多摄像机之间重建全局一致的多个人3D骨骼轨迹,同时从RGB观测与深度数据对齐的数据中估计可变形MANO手部网格。多视角手部重建被融合并关联到追踪的医生,以获得一致的左右手轨迹。通过多视角分割和深度融合构建手术室的语义3D模型,使重建的手部轨迹能够映射到特定表面,包括医疗设备、可移动物体和患者身体部位。利用时间手-表面接近性推断接触事件,描述了哪位医生接触了哪个表面以及何时。我们在三个真实的麻醉诱导记录上评估了TouchMap-OR,手动标注了接触事件。TouchMap-OR在二元接触F1值上达到0.75,优于基于跟踪的基线方法,同时保持了可比的多个人跟踪精度,并实现了0.96的身份分配精度。

英文摘要

Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.
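The final step, turning temporal hand-surface proximity into contact episodes, can be sketched as simple thresholding with a minimum duration. This is an illustration under assumptions: the distance threshold, the minimum episode length, and the function name are hypothetical, and the abstract does not describe the rule TouchMap-OR actually applies.

```python
def contact_episodes(distances, threshold, min_frames):
    """Group consecutive close frames into contact episodes.

    distances: per-frame hand-to-surface distance (e.g. in metres).
    Returns a list of (start_frame, end_frame) pairs, both inclusive,
    keeping only runs of at least min_frames frames at or below threshold.
    """
    episodes = []
    start = None
    for i, d in enumerate(distances):
        if d <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                episodes.append((start, i - 1))
            start = None
    if start is not None and len(distances) - start >= min_frames:
        episodes.append((start, len(distances) - 1))
    return episodes


# One hand, one surface: a 3-frame touch, a 1-frame blip, and a 2-frame touch.
trace = [0.10, 0.02, 0.01, 0.02, 0.20, 0.01, 0.30, 0.02, 0.02]
episodes = contact_episodes(trace, threshold=0.03, min_frames=2)
```

Running this per clinician hand and per surface would give records of who touched what and when; the minimum duration suppresses single-frame blips caused by reconstruction noise.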

2605.17633 2026-05-19 cs.CV cs.AI 版本更新

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

SparseSAM: Segment Anything模型中激活的结构稀疏化

Hoai-Chau Tran, Chi H. Nguyen, Duy M. H. Nguyen, Mathias Niepert, Fan Lai, Khoa D. Doan

Affiliations * University of Illinois at Urbana-Champaign; College of Engineering & Computer Science, VinUniversity; VinUni-Illinois Smart Health Center, VinUniversity; DFKI Max Planck Research School for Intelligent Systems (IMPRS-IS); University of Stuttgart

AI summary This paper proposes SparseSAM, a training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity, speeding up inference and reducing memory use while maintaining segmentation quality.

详情
AI中文摘要

Segment Anything Model (SAM) 实现了强大的开放词汇分割,但其基于ViT的图像编码器在推理延迟和内存方面占主导地位。现有的激活压缩方法,如标记合并,通过减少标记长度来处理,但引入了非平凡的运行时开销,并在高压缩下导致灾难性质量下降。其他应用稀疏注意力的方法仅关注注意力本身,使MLP完全密集,并限制了可达到的速度提升。我们提出了SparseSAM,一种(i)无需训练的结构稀疏化框架,该框架在加速注意力和MLP层的同时保持token身份。SparseSAM引入了(ii)Stripe-Sort Attention,它使用确定性的Z序排列将密集注意力转换为静态的硬件友好的稀疏模式,消除了动态掩码的开销。SparseSAM进一步引入了(iii)残差一致性MLP,只将信息性token路由通过MLP,同时通过残差路径传播剩余token。在四个分割基准测试中,SparseSAM在0.4密度下仅损失0.004 mIoU,在0.3密度下损失0.021 mIoU,相较于标记合并方法的改进,准确率损失减少了2.10倍,同时实现了2倍更快的推理速度和2.8倍的内存减少。

英文摘要

The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the number of tokens to process, yet introduce non-trivial runtime overhead and suffer catastrophic quality drops under high compression. Other methods that apply sparse attention focus on attention alone, leaving the MLP fully dense and capping the achievable speedup. We propose SparseSAM, (i) a training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces (iii) a Residual-Consistency MLP that routes only informative tokens through the MLP while propagating the remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at a density of 0.4 and 0.021 mIoU at 0.3, a 2.10x reduction in accuracy loss versus token-merging approaches, while achieving 2x faster inference and a 2.8x memory reduction.
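The deterministic Z-order permutation mentioned above can be illustrated with a standard Morton-code ordering of a 2D token grid. The sketch shows only the space-filling curve itself; how Stripe-Sort Attention builds its sparse stripes from this ordering is not described in the abstract, and the function names are hypothetical.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-order (Morton) code."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def z_order_permutation(height, width):
    """Return raster-scan token indices sorted along the Z-order curve."""
    tokens = [(r, c) for r in range(height) for c in range(width)]
    return sorted(range(len(tokens)), key=lambda i: morton_index(*tokens[i]))


# For a 4x4 grid, tokens from each 2x2 block end up adjacent in the sequence.
perm = z_order_permutation(4, 4)
```

Because the permutation depends only on the grid shape, it can be computed once and reused, so spatially close tokens stay close in the sequence without any per-input masking.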

2605.17624 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning

通过不变/等变半监督学习进行部分标注数据集上的多任务学习

Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki

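Affiliations * KTH Royal Institute of Technology; Univrses AB

AI summary This paper studies the potential of invariant and equivariant semi-supervised learning for training multi-task models on partially labeled datasets. It evaluates FixMatch and its equivariant extension Dense FixMatch on Cityscapes and BDD100K for object detection and semantic segmentation, and finds that both outperform supervised baselines in most situations, with the largest gains when few labeled samples are available.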

Comments https://github.com/miquelmarti/DenseFixMatch

详情
AI中文摘要

我们研究了不变和等变半监督学习在处理部分标注数据集上多任务模型训练挑战的潜力。具体而言,我们使用流行的FixMatch方法进行不变半监督学习,并采用其等变扩展Dense FixMatch。我们在Cityscapes和BDD100K数据集上评估了它们在计算机视觉中普遍的目标检测和语义分割任务中的性能。我们考虑了每个任务标注子集的不同大小以及它们之间的不同重叠情况。我们的结果表明,对于不变和等变半监督学习,大多数情况下都优于监督基线,特别是在任务中可用标注样本较少时,改进最为显著,且后者方法通常表现更好。我们的研究表明,不变/等变学习是有限标注数据下多任务学习的一个有前途的方向。

英文摘要

We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.
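As a reminder of how FixMatch uses unlabeled data, the sketch below computes its unlabeled loss term: a pseudo-label is taken from the weakly augmented view only when the prediction is confident, and the strongly augmented view is trained to match it. This is a plain-Python illustration of the classification case with an assumed threshold of 0.95; the equivariant Dense FixMatch variant would additionally map pseudo-labels through the geometric transform, which is omitted here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixmatch_unlabeled_loss(weak_logits, strong_logits, tau=0.95):
    """Mean cross-entropy on strong views against confident weak-view pseudo-labels.

    Samples whose weak-view confidence is below tau contribute zero loss,
    but still count in the denominator (the batch size).
    """
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        probs = softmax(weak)
        confidence = max(probs)
        pseudo_label = probs.index(confidence)
        if confidence >= tau:
            total += -math.log(softmax(strong)[pseudo_label])
    return total / len(weak_logits)


# First sample is confident (kept); second is near 50/50 (masked out).
weak = [[10.0, 0.0], [0.1, 0.0]]
strong = [[2.0, 0.0], [0.0, 5.0]]
loss = fixmatch_unlabeled_loss(weak, strong)
```

The confidence mask is what keeps early, unreliable predictions from being reinforced as training targets.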

2605.17620 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

SynVA:一种用于血管生成和动脉瘤编辑的模块化工具包

Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk

Affiliations * Visual Computing and Artificial Intelligence, Kiel University, Germany; Institute for Medical Informatics and Statistics, Kiel University, Germany; Clinic for Neuroradiology, Medical Faculty, Magdeburg University, Germany; Department of Radiology and Neuroradiology, University Hospital Schleswig-Holstein, Germany; Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poland

AI summary This paper presents SynVA, a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. It combines new flow-matching-based methods with learning-based approaches to generate realistic vessel geometries and anatomically plausible aneurysms, and releases a large labeled dataset to support medical image analysis.

详情
AI中文摘要

颅内动脉瘤(IAs)以不可预测的生长和破裂风险为特征,是导致中风的主要原因,可能引发致命性出血,具有高死亡率和长期残疾。随着人口老龄化,脑血管疾病的发病率和整体负担预计会增加,凸显了需要可扩展的方法来分析复杂的医疗数据并提高对这些疾病的群体层面理解的必要性。尽管数字孪生和深度学习为提高诊断、预后和治疗提供了有希望的途径,但其效果受到大规模高质量医疗数据和相应标签稀缺的限制。我们提出了SynVA,一种用于血管网格生成和解剖学一致动脉瘤合成的模块化工具包。SynVA结合了基于流匹配的新型方法生成健康血管网格与基于学习的方法生成解剖条件下的动脉瘤网格——动脉瘤是从已有的血管几何结构计算而来的,而不是孤立生成。此外,我们引入了基于生理学原理和统计先验的SynVA过程模型,用于血管和动脉瘤合成,从而能够生成大规模数据集(例如用于训练基于网格的生成模型)。为此,我们发布了包含50,000个完全标注网格样本的数据集,用于各种下游视觉任务,如语义分割。广泛的定量和定性评估证明了SynVA能够生成逼真的血管几何和解剖学合理的动脉瘤。具体而言,我们的实验表明,某些方法生成的动脉瘤形状更符合专家人类感知,而其他方法在定量相似性度量上与真实动脉瘤的重建表现更优。

英文摘要

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

2605.17610 2026-05-19 cs.CV cs.CL 版本更新

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

SafeLens: 一种高效且可靠的视频护栏系统,采用快速和缓慢筛查

Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra

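Affiliations * University of South Florida; University of California, Davis

AI summary This work proposes SafeLens, a video guardrail framework whose fast-and-slow inference architecture enables efficient video content moderation. It also builds a high-quality dataset and adds structured Chain-of-Thought traces to address the limits of training-time scaling, achieving state-of-the-art performance on real-world and AI-generated video benchmarks at significantly lower inference cost.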

详情
AI中文摘要

在线视频平台和AI生成内容的快速增长使得可靠的视频护栏成为安全性和现实部署的关键挑战。尽管大多数视频可通过快速模式识别筛查,但一小部分需要对时间复杂的内容和细致的政策约束进行深入推理。现有方法通常依赖于在所有输入上统一应用大型视觉-语言模型,导致推理成本高且计算资源分配效率低。我们提出了SafeLens视频护栏框架,引入快速和缓慢的推理架构,以实现高效且准确的内容审核,根据输入的不同具有可变的计算成本。此外,我们通过应用影响引导过滤对SafeWatch数据集进行处理,仅保留原始数据的2.4%。为进一步解决训练时间扩展的限制,我们通过在过滤数据中添加结构化的Chain-of-Thought追踪来实现测试时间推理。在实际和AI生成视频基准测试中,SafeLens实现了最先进的性能,优于强大的开源视频护栏(如SafeWatch-8B、OmniGuard-7B)和闭源模型(如GPT-5.4、Gemini-3.1-pro),同时显著降低推理成本,证明了高效设计比仅扩大数据或模型大小更有效。

英文摘要

The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment. While most videos can be screened through fast pattern recognition, a small subset requires deeper reasoning over temporally complex content and nuanced policy constraints. Existing approaches typically rely on large vision-language models applied uniformly across all inputs, resulting in high inference costs and inefficient allocation of computation. We propose SafeLens, a video guardrail framework that introduces a fast-and-slow inference architecture for efficient and accurate content moderation with variable computational cost across inputs. Additionally, we construct a high-quality dataset by applying influence-guided filtering to the SafeWatch Dataset, retaining only 2.4% of the original data. To further address limitations of training-time scaling, we enable test-time reasoning by augmenting the filtered data with structured Chain-of-Thought traces. Across real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, outperforming strong open-source video guardrails (e.g., SafeWatch-8B, OmniGuard-7B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-pro) while significantly reducing inference cost, demonstrating that efficient design can be more effective than scaling data or model size alone.

2605.17591 2026-05-19 cs.CV 版本更新

Error-Decomposed Class-Conditional Fusion for Statistically Guaranteed Hard-Category Robust Perception

误差分解类条件融合用于统计保证的硬类别鲁棒感知

Guowei Luo, Ziqi Shi, Zhao Xie

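Affiliations * Hefei University of Technology, Hefei, China; Lishui University, Lishui, China

AI summary This paper proposes Error-Decomposed Class-Conditional Fusion (ED-CCF), which targets the hard-category reliability problem by improving performance on critical categories while preserving overall stability, aiming at statistically guaranteed robust perception.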

Comments 14 pages, 8 figures. Preprint

详情
AI中文摘要

聚合目标检测指标本质上会掩盖在操作关键的长尾少数类别中的灾难性和可重复性故障。本文正式将这种普遍的脆弱性定义为硬类别可靠性问题(HCRP):严格纠正脆弱类别而不影响稳定类别性能边界的基本架构挑战。为系统性消除这一限制,我们提出了误差分解类条件融合(ED-CCF),一种优雅的决策层推断框架。不同于启发式全局后处理,ED-CCF将预测投射到复杂的四状态误差分类学中,在严格的经验验证下动态激活校准路径。在高度受限的600张图像验证基准上,隔离cz作为关键脆弱性(HCEC=0.86,BSR=0.14),我们的框架实现了突破性进展:在提升cz mAP50从0.089343到0.109353(巨大的+22.4%相对增长)的同时,完美保持了全局稳定性的帕累托最优性(将所有mAP50从0.581925提升到0.584864)。通过在50对子集试验中进行彻底验证,展示了压倒性的96%胜率和严格的布农尼校正威尔科xon显著性(p<0.05),这项工作从根本上重新定义了输出层融合作为安全关键视觉感知的可审计、统计保证范式。

英文摘要

Aggregate object detection metrics inherently mask catastrophic and repeatable failures in operationally critical, long-tail minority classes. This paper formally defines this pervasive vulnerability as the Hard-Category Reliability Problem (HCRP): the fundamental architectural challenge of strictly rectifying vulnerable categories without compromising the performance boundaries of stable classes under stringent protocols. To systematically dismantle this limitation, we propose Error-Decomposed Class-Conditional Fusion (ED-CCF), an elegant decision-layer inference framework. Diverging from heuristic global post-processing, ED-CCF projects predictions into a sophisticated quad-state error taxonomy, dynamically activating calibration pathways exclusively upon rigorous empirical justification. On a highly constrained 600-image validation benchmark, isolating cz as the critical vulnerability (HCEC=0.86, BSR=0.14), our framework achieves a targeted breakthrough: it elevates cz mAP50 from 0.089343 to 0.109353 (a massive +22.4% relative surge) while flawlessly preserving the Pareto optimality of global stability (raising all mAP50 from 0.581925 to 0.584864). Backed by exhaustive validation across 50 paired subset trials demonstrating an overwhelming 96% win rate and strict Bonferroni-corrected Wilcoxon significance (p<0.05), this work fundamentally redefines output-level fusion as an auditable, statistically guaranteed paradigm for safety-critical visual perception.

2605.17588 2026-05-19 cs.CV 版本更新

MSIQ: Moment-based Scale-Invariant Quality Measure for Single Image Super-Resolution

MSIQ: 基于矩的尺度不变质量度量用于单图像超分辨率

Leonid Bedratyuk

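Affiliations * Khmelnytsky National University, Faculty of Information Technology, Ukraine

AI summary This paper proposes MSIQ, a moment-based scale-invariant quality measure for evaluating single image super-resolution (SISR) results. It compares the normalized central geometric moments of two images, so images with different spatial resolutions can be compared directly. The measure is mathematically deterministic and has an analytical form, and it addresses two weaknesses of conventional metrics: they do not assess preservation of geometric structure, and they introduce interpolation error through forced resizing.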

Comments 23 pages

详情
AI中文摘要

评估单图像超分辨率(SISR)结果的质量仍然是一个开放的方法学问题。常见的全参考度量(PSNR, SSIM, LPIPS)没有明确评估图像几何结构的保持,这对于基于尺度的重建正确性至关重要。此外,它们需要将图像强制对齐到相同大小(强制缩放),这在评估过程中引入了外部插值误差。本文提出了一种诊断性的尺度不变质量度量MSIQ(基于矩的尺度不变质量度量),基于两幅图像的归一化中心几何矩的比较。MSIQ能够在不缩放的情况下直接比较不同空间分辨率的图像,具有数学确定性(模型无关)和解析形式。为了为该方法提供理论基础,我们引入了一个概念区分,即度量在跟踪退化方面的能力(跟踪能力)与它们的几何选择性(几何特异性)之间的区别。实验验证确认了MSIQ在均匀缩放下的稳定性,并同时揭示了传统度量对插值方法选择的高敏感性。结果表明,MSIQ具有显著的几何选择性:所提出的方法有效地区分了几何变形与非几何伪影,特别是JPEG压缩,不同于基于像素和感知的度量。还显示,MSIQ对结构扰动的响应在不同SR算法类别中保持稳定,包括具有不同架构的DNN模型。所提出的方法是一种补充的诊断工具,适用于几何保真度优先的领域,特别是医学成像和遥感。

英文摘要

Assessing the quality of single image super-resolution (SISR) results remains an open methodological problem. Common full-reference metrics (PSNR, SSIM, LPIPS) do not explicitly evaluate the preservation of the geometric structure of images, which is critical for the correctness of scale-based reconstruction. In addition, they require the forced alignment of images to the same size (forced resizing), which introduces an external interpolation error into the evaluation process. This paper proposes a diagnostic scale-invariant quality measure, MSIQ (Moment-based Scale-Invariant Quality), based on the comparison of normalized central geometric moments of two images. MSIQ enables direct comparison of images with different spatial resolutions without resizing, is mathematically deterministic (model-free), and has an analytical form. To provide a theoretical basis for the approach, we introduce a conceptual distinction between the ability of metrics to monotonically track degradation (tracking ability) and their geometric selectivity (geometric specificity). The experimental validation confirmed the stability of MSIQ under uniform scaling and, at the same time, revealed the high sensitivity of traditional metrics to the choice of interpolation method. The results show that MSIQ has pronounced geometric selectivity: the proposed measure effectively separates geometric deformations from non-geometric artifacts, in particular JPEG compression, unlike pixel-based and perceptual metrics. It is also shown that the response of MSIQ to structural perturbations remains stable across different classes of SR algorithms, including DNN models with different architectures. The proposed measure is a complementary diagnostic tool for domains where geometric fidelity has priority, in particular medical imaging and remote sensing.
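The building block named in the abstract, normalized central geometric moments, can be sketched directly. The example below computes eta_pq = mu_pq / mu_00^(1 + (p + q) / 2) and checks that it barely changes when an image is enlarged. It is not the MSIQ formula itself, which the abstract does not give; the helper names and the test image are assumptions made for illustration.

```python
def normalized_central_moment(img, p, q):
    """Return eta_pq = mu_pq / mu_00 ** (1 + (p + q) / 2).

    img is a grayscale image as a list of rows (y index) of values (x index).
    """
    m00 = sum(v for row in img for v in row)
    x_bar = sum(x * v for row in img for x, v in enumerate(row)) / m00
    y_bar = sum(y * v for y, row in enumerate(img) for v in row) / m00
    mu_pq = sum(
        (x - x_bar) ** p * (y - y_bar) ** q * v
        for y, row in enumerate(img)
        for x, v in enumerate(row)
    )
    return mu_pq / m00 ** (1 + (p + q) / 2)


def upscale_nearest(img, k):
    """Enlarge an image by an integer factor k using pixel replication."""
    return [[v for v in row for _ in range(k)] for row in img for _ in range(k)]


low_res = [[1] * 12 for _ in range(8)]     # 12 x 8 bright rectangle
high_res = upscale_nearest(low_res, 3)     # same shape at 36 x 24
eta_low = normalized_central_moment(low_res, 2, 0)
eta_high = normalized_central_moment(high_res, 2, 0)
# On a pixel grid the invariance is approximate: the two values differ
# slightly because of discrete sampling, and the gap shrinks as images grow.
```

Since the two values can be computed at each image's native resolution, no resizing step is needed before comparing them, which is the practical point the abstract makes against forced resizing.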

2605.17584 2026-05-19 cs.CV 版本更新

VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos

VVitCutLER: 向视频中无监督的目标检测和分割迈进

Zhijing Lu, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal

发表机构 * RPTU University of Kaiserslautern-Landau(凯撒斯劳滕-兰道莱茵兰-普法尔茨工业大学) German Research Center for Artificial Intelligence(德国人工智能研究中心)

AI总结 该研究提出VVitCutLER框架,通过引入时间一致性提升视频中伪标签的质量,从而改进目标检测和实例分割的性能,减少时间不稳定性。

Comments 11 figures, CVPR workshop

详情
AI中文摘要

无监督像素级视频理解在现实场景中仍具有挑战性,因为运动模糊、遮挡和快速物体动态常导致时间漂移和闪烁的伪标签。我们提出VVitCutLER,一个用于视频目标检测和实例分割的无监督框架,通过时间一致性提高伪标签的质量。我们的核心贡献是VitCut,一个时间稳定的伪标签生成器,通过跨帧区域一致性减少在场退化期间的误差积累。同时,VitCut使用蒸馏解码器实现有效的实例掩码预测。然后,基于VitCut,VVitCutLER进一步整合跨帧特征聚合以增强视频级的鲁棒性。在标准视频基准上的广泛实验表明,VVitCutLER显著提高了检测和分割性能,同时减少了时间不稳定性。这些结果突显了时间一致性监督对鲁棒像素级视频理解的重要性。

英文摘要

Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. Meanwhile, VitCut uses a distillation decoder to achieve effective instance mask prediction. Then, based on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability. These results highlight the importance of temporally consistent supervision for robust pixel-level video understanding.

2605.17583 2026-05-19 cs.CV 版本更新

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

AgentSteerTTS: 一个用于复合指令文本到语音的多智能体闭环框架

Bin Kang, Shaoguo Wen, Yang Fan, Shunlong Wu, Junjie Wang, Yulin Li, Junzhi Zhao, Junle Wang, Zhuotao Tian

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Shenzhen Loop Area Institute(深圳河套学院) Tencent Turinglab(腾讯图灵实验室) Tsinghua University(清华大学) Southwest Jiaotong University(西南交通大学)

AI总结 本文提出AgentSteerTTS,一个多智能体闭环框架,通过引入对抗解耦代理、双流锚定控制器和快速-慢反馈代理,实现了对复合指令的意图忠实表达控制,实验表明其在复合指令基准和公开测试集上显著提升了性能。

Comments Project page: https://kane2kang.github.io/AgentSteerTTS/

详情
AI中文摘要

尽管现有的文本到语音(TTS)模型表现出高度的表达性,但对复合指令的细粒度控制仍然具有挑战性,因为离散的文本意图与连续的语音实现之间存在结构不匹配。受人类认知解耦的启发,我们引入了AgentSteerTTS,一个用于意图忠实表达控制的多智能体闭环框架。首先,在我们的框架中,对抗解耦代理通过学习分离的身份和情感-语调子空间,并利用泄漏抑制正则化来减轻说话者-情感泄漏。接下来,双流锚定控制器利用大规模的语音原型库来使抽象意图具体化:检索代理选择表达锚点,而合成代理通过门控注意力融合它们为连续控制向量。最后,快速-慢反馈代理通过潜在梯度校正来细化输出强度,并利用高层感知批评来解决语义-语音不匹配。在复合指令基准和公开测试集上的实验表明,AgentSteerTTS在基线模型上产生了一致且显著的改进,证明了所提出方法的有效性。

英文摘要

While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements over the baselines, demonstrating the effectiveness of the proposed method.

2605.17577 2026-05-19 cs.CV 版本更新

TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models

TAME: 通过混合专家架构实现视觉语言模型的测试时对抗提示调优

Xin Wang, Yixu Wang, Jiaming Zhang, Ruofan Wang, Jiaqi Yu, Kai Chen, Jingjing Chen, Xingjun Ma, Yu-Gang Jiang

发表机构 * Fellow, IEEE(IEEE会士)

AI总结 本文提出TAME,一种基于混合专家架构的测试时防御方法,旨在提升视觉语言模型在对抗扰动下的鲁棒性,同时保持对清洁样本的泛化能力。

详情
AI中文摘要

大规模预训练的视觉语言模型(VLMs),如CLIP,在零样本泛化方面表现强大,但对不可察觉的对抗扰动高度敏感,这在开放世界部署中引发了严重安全问题。为了在不需下游任务特定重新训练的情况下增强鲁棒性,我们提出了TAME,一种新颖的测试时防御方法。基于我们之前的测试时对抗提示调优(TAPT),TAME通过将TAPT的单一自适应提示替换为输入条件化的混合专家(MoE)框架进行架构重构,从而实现更表达力和适应性的防御。具体而言,TAME维护一个可学习的专家提示库,并利用输入依赖的路由机制,在推理时为每个未标记的测试样本聚合定制化的提示混合。这种测试时防御机制由三个无监督目标驱动:(1)多视图预测熵最小化,(2)逐层对齐视觉标记统计到预计算的干净和对抗参考分布,以及(3)MoE正则化以实现平衡的专家利用和提示多样性。我们在11个基准数据集上评估了TAME,包括ImageNet和10个额外的零样本数据集。结果表明,TAME在AutoAttack下将原始CLIP的零样本对抗鲁棒性提高了至少49.1%,同时在清洁样本上保持了良好的泛化能力。TAME还普遍优于现有对抗提示调优方法,平均鲁棒性提升至少30.2%。

英文摘要

Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT's single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.
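
The first of the three objectives, multi-view prediction entropy minimization, can be illustrated with a small sketch. The abstract does not say how the views are aggregated; averaging the per-view softmax distributions before taking the entropy is an assumption borrowed from common test-time prompt tuning practice, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_entropy(view_logits):
    """Entropy of the class distribution averaged over augmented views.

    Assumption: views are combined by averaging their softmax outputs;
    the paper's exact aggregation may differ.
    """
    probs = [softmax(logits) for logits in view_logits]
    n_views = len(probs)
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / n_views for c in range(n_classes)]
    return -sum(p * math.log(p) for p in avg if p > 0)


# Views that agree on a class give low entropy; views that disagree give high entropy.
agreeing = [[5.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
conflicting = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
entropy_agree = marginal_entropy(agreeing)
entropy_conflict = marginal_entropy(conflicting)
```

Minimizing this quantity with respect to the prompt parameters pushes the model toward predictions that are both confident and consistent across views; the gradient step itself is omitted here because it depends on the model.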

2605.17573 2026-05-19 cs.CV cs.CR 版本更新

Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks

社交媒体中的深度伪造检测:利用3D卷积神经网络进行时序特征分析

Mohammadreza Rashidi, Raja Hashim Ali, Sami Ur Rahman

发表机构 * Department of Computer Science AI(计算机科学系人工智能部门) Media Analysis Lab Berlin, Germany(媒体分析实验室柏林德国)

AI总结 本文提出了一种基于R3D-18的3D卷积神经网络检测器,通过结合二元交叉熵损失与时间一致性正则化损失,提升深度伪造检测在高分辨率和跨数据集场景下的准确性,证明了时间特征比空间特征在社交媒体重编码中更具鲁棒性。

Comments 13 pages, 6 figures

详情
AI中文摘要

合成面部视频在社交媒体上传播的速度比平台审核速度更快,导致虚假信息和身份攻击的成本上升。帧级深度伪造检测器在生成器质量增加时性能急剧下降;高质量的128x128 GAN输出在空间仅准确性上减少五个百分点,而时间不一致性的特征基本保持不变。我们通过基于R3D-18的3D卷积神经网络检测器解决这一差距,该检测器使用复合损失函数,结合二元交叉熵与时间一致性正则化。模型处理来自DeepfakeTIMIT数据集的16帧片段,并初始化自Kinetics-400动作识别权重。我们在128x128分辨率的内数据集评估中报告了92.8%的准确率;在不微调的情况下跨数据集转移到FaceForensics++达到76.4%,微调后有所提升。消融研究显示,迁移学习贡献了7.2个百分点,面部跟踪增加了3.5个百分点,而时间一致性正则化在高质量伪造中提供了额外的增益。结果表明,时间特征比空间特征在社交媒体重编码中更具泛化能力,提供了一个能够存活的检测信号。

英文摘要

Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.

2605.17571 2026-05-19 cs.CV cs.LG 版本更新

Stable Routing for Mixture-of-Experts in Class-Incremental Learning

混合专家在类增量学习中的稳定路由

Zirui Guo, Quan Cheng, Da-Wei Zhou, Lijun Zhang

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院)

AI总结 本文研究了在类增量学习中混合专家模型的稳定路由问题,提出了一种稳定路由框架StaR-MoE,通过敏感性感知路由对齐和不对称容量正则化,提高了模型对新类别的适应能力和旧类别的知识保留能力。

详情
AI中文摘要

类增量学习(CIL)要求模型在学习新类别时保持先前知识。最近,结合预训练模型与混合专家(MoE)的方法在CIL中受到越来越多关注:它们通常在学习过程中扩展专家,并使用路由器分配权重。然而,现有MoE方法往往忽视了专家扩展引起的路由漂移。一旦引入新的专家,路由器可能会将样本从早期类别重新分配给新加入的专家,从而扰动已建立的专家组合,即使旧专家保持冻结。我们主张,可扩展的MoE在CIL中需要两个互补的性质:稳定的旧类路由用于知识保留和足够的容量利用用于新类适应。为此,我们提出了Stable Routing for MoE(StaR-MoE),一种用于可扩展MoE的路由级别框架。通过结合敏感性感知的路由对齐,StaR-MoE通过敏感性引导的约束将当前旧类路由行为与历史路由分布对齐。同时,StaR-MoE引入了不对称容量正则化,以鼓励有效利用扩展的专家池,而不影响类特定的路由专业化。在四个标准CIL基准上的广泛实验表明,StaR-MoE在平均准确率和最后准确率上均优于现有最先进方法,突显了稳定路由的重要性。

英文摘要

Class-incremental learning (CIL) requires models to learn new classes sequentially while preserving prior knowledge. Recently, approaches that combine pre-trained models with mixture-of-experts (MoE) have received increasing attention in CIL: they typically expand experts during learning and employ a router to assign weights across experts. However, existing MoE methods often overlook routing drift induced by expert expansion. Once new experts are introduced, the router may reassign samples from earlier classes to newly added experts, thereby perturbing previously established expert compositions and causing interference even when old experts remain frozen. We argue that expandable MoE in CIL requires two complementary properties: stable old-class routing for knowledge preservation and sufficient capacity utilization for new-class adaptation. To this end, we propose Stable Routing for MoE (StaR-MoE), a routing-level framework for expandable MoE in CIL. By incorporating sensitivity-aware routing alignment, StaR-MoE aligns current old-class routing behavior with historical routing distributions through sensitivity-guided constraints. Complementarily, StaR-MoE introduces asymmetric capacity regularization to encourage effective utilization of the expanded expert pool without compromising class-specific routing specialization. Extensive experiments across four standard CIL benchmarks demonstrate that StaR-MoE consistently improves both average and last accuracy over state-of-the-art methods, highlighting the importance of stable routing.

2605.17566 2026-05-19 cs.CV 版本更新

Rethinking Point Clouds as Sequences: A Causal Next-Token Predictive Learning Framework

重新思考点云作为序列:一种因果性下一标记预测学习框架

Yumeng Yao, Jingzhi Dong, Haowen Gu, Tao Chen, Zonghan Wu, Xiaoshui Huang, Yazhou Yao

发表机构 * Nanjing University of Science and Technology(南京理工大学) Shanghai Jiao Tong University(上海交通大学) Hangzhou City University(杭州城市学院) East China Normal University(华东师范大学)

AI总结 本文提出PointNTP,将点云预训练重新定义为全因果、无解码器的潜在下一标记预测问题,通过局部补丁分割和结构化3D标记序列生成,实现对点云结构依赖的直接建模,无需重建解码器或显式几何恢复,实验表明其在多个下游任务中表现优异。

Comments 10 pages, 2 figures. Code will be released upon acceptance

详情
AI中文摘要

随着多模态基础模型和预测预训练的快速发展,一个重要的开放问题是如何为3D点云配备一种更符合下一标记和下一嵌入学习的预训练范式。现有点云自监督方法大多基于掩码重建或显式几何生成,因此仍局限于输入恢复而非预测依赖建模。本文引入PointNTP,将点云预训练重新定义为全因果、无解码器的潜在下一标记预测问题。具体而言,每个点云首先被分割成局部补丁,并根据补丁中心几何结构化为3D标记序列。生成的序列随后通过因果Transformer进行建模,采用仅前缀条件的训练方式,并通过停止梯度目标稳定移位预测目标。该设计使模型能够在潜在空间中直接学习结构依赖,而无需重建解码器或显式几何恢复。大量实验表明,所提出的PointNTP在多个下游任务中表现优异:在ScanObjectNN的OBJ_BG、OBJ_ONLY和PB_T50_RS上分别达到93.8%(+0.5%)、92.6%(+0.3%)和89.3%(+1.1%);在ShapeNetPart上获得85.0%(+0.1%)的Cls.mIoU;在S3DIS Area 5上达到71.1%的mAcc。总体而言,无解码器的因果潜在预测提供了一种简单、可扩展且可能模态无关的点云自监督学习范式,为3D数据的基础式预测学习提供了新的视角。

英文摘要

With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction objective stabilized by stop-gradient targets. This design enables the model to learn structural dependencies directly in latent space, without reconstruction decoders or explicit geometric recovery. Extensive experiments demonstrate that the proposed PointNTP is highly competitive across multiple downstream tasks: it achieves 93.8%(+0.5%), 92.6%(+0.3%), and 89.3%(+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS of ScanObjectNN, respectively; obtains 85.0%(+0.1%) in Cls.mIoU on ShapeNetPart; and reaches 71.1% mAcc on S3DIS Area 5. Overall, decoder-free causal latent prediction provides a simple, scalable, and potentially modality-agnostic paradigm for point-cloud self-supervised learning, offering a new 3D perspective on foundation-style predictive learning for 3D data.

2605.17564 2026-05-19 cs.CV 版本更新

A Conditional U-Net Pipeline with Pre- and Post-Processing for Aerial RGB-to-Thermal Image Translation

具有预处理和后处理的条件U-Net管道用于航空RGB到热图像转换

Tseten Sherpa, Sikandar Ali, Shubham Parab, Haoyun Feng, Matthew Dennis, Keenan Gibbons, Verrah Otiende, Geoffrey H. Siwo

发表机构 * Department of Data Science, University of Michigan, Ann Arbor, MI, USA(数据科学系,密歇根大学,安阿伯,MI,美国) Department of Information Science, University of Michigan, Ann Arbor, MI, USA(信息科学系,密歇根大学,安阿伯,MI,美国) Department of Computer Science, University of Michigan, Ann Arbor, MI, USA(计算机科学系,密歇根大学,安阿伯,MI,美国) Arcknow, New York, USA(Arcknow,纽约,美国) School of Environmental Sustainability, University of Michigan, Ann Arbor, MI, USA(可持续环境学院,密歇根大学,安阿伯,MI,美国) SmithGroup, Ann Arbor, MI, USA(SmithGroup,安阿伯,MI,美国) Michigan Institute for Data and AI in Society (MIDAS), University of Michigan, Ann Arbor, MI, USA(密歇根数据与人工智能社会研究院(MIDAS),密歇根大学,安阿伯,MI,美国) United States International University (USIU), Nairobi, Kenya(美国国际大学(USIU),内罗毕,肯尼亚) Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA(学习健康科学系,密歇根大学医学院,安阿伯,MI,美国) Department of Pharmacology, University of Michigan Medical School, Ann Arbor, MI, USA(药理学系,密歇根大学医学院,安阿伯,MI,美国) Center for Global Health Equity, University of Michigan, Ann Arbor, MI, USA(全球健康公平中心,密歇根大学,安阿伯,MI,美国)

AI总结 本文提出了一种基于条件U-Net的简单架构,结合天气数据和针对性预处理与后处理技术,以提高航空RGB到热图像转换的性能,实验结果显示其在PSNR、SSIM和LPIPS指标上优于现有方法。

Comments 8 pages, 7 figures, NeurIPS 2026

详情
AI中文摘要

配对的RGB-热图像数据在图像融合、目标跟踪和异常检测等应用中显示出显著的实用性;然而,其广泛应用受到对齐的RGB-热图像对有限的限制。RGB到热图像(及反之)转换已成为解决这一挑战的实用解决方案。先前的方法包括条件生成对抗网络(cGANs)如ThermalGAN和基于可扩展插值转换器(SiT)的架构如ThermalGen,已显示出在航空到热图像转换中的强大潜力。在本工作中,我们探索了替代架构,这些架构在保持性能的同时优先考虑简洁性。具体而言,我们提出了一种在瓶颈层中结合天气数据的条件U-Net,辅以在Pix2Pix GAN架构中应用的针对性预处理和后处理技术。我们利用612对RGB和热图像的训练集,并在五折交叉验证后,最终在保留的测试集上进行评估。我们的条件U-Net模型表现最佳,峰值信噪比(PSNR)为14.5485,结构相似性指数测量(SSIM)为0.8095,学习感知图像块相似性(LPIPS)为0.1666。这些结果优于基础ThermalGen模型,后者分别达到了PSNR、SSIM和LPIPS分数为7.56、0.2444和0.6317。我们发现,虽然饱和度增强和对比度增强的预处理以及高斯模糊的后处理提供了可观察的改进,但结合条件数据的效果最为显著。我们的发现巩固了将辅助元数据整合到热图像生成中的潜力,表明此类信息可以作为准确热重建至关重要的环境条件的代理。

英文摘要

Paired RGB-thermal data has shown significant utility across a range of applications, including image fusion, object tracking, and anomaly detection; however, its broader adoption is constrained by the limited availability of aligned RGB-thermal image pairs. RGB-to-thermal (and vice versa) image translation has emerged as a practical solution to this challenge. Prior approaches including conditional generative adversarial networks (cGANs) such as ThermalGAN and Scalable Interpolant Transformer (SiT)-based architectures such as ThermalGen have demonstrated strong potential for aerial-to-thermal image translation. In this work, we explore alternative architectures that prioritize simplicity while maintaining performance. Specifically, we propose a conditional U-Net that incorporates weather data at the bottleneck layer, complemented by targeted preprocessing and post-processing techniques applied within the Pix2Pix GAN architecture. We utilize a training set of 612 paired RGB and thermal images, and evaluate over 5-fold cross-validation, ultimately testing on a held-out test set. Our conditional U-Net model performed best, with a peak signal-to-noise ratio (PSNR) of 14.5485, structural similarity index measure (SSIM) of 0.8095, and learned perceptual image patch similarity (LPIPS) of 0.1666. These results outperformed the base ThermalGen model, which attained PSNR, SSIM, and LPIPS scores of 7.56, 0.2444, and 0.6317 respectively. We find that while saturation boost and contrast enhancement for preprocessing and Gaussian blur for post-processing provide observable improvements, the incorporation of conditioning data was most effective. Our findings cement the potential of integrating auxiliary metadata into thermal image generation, suggesting that such information can serve as a proxy for environmental conditions critical to accurate thermal reconstruction.
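
To put the reported PSNR figures in context, here is a small sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE). It assumes the scores are in decibels and that both models were evaluated on the same dynamic range, neither of which the abstract states; the function names are illustrative.

```python
import math


def psnr(reference, estimate, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D lists."""
    n = sum(len(row) for row in reference)
    mse = sum(
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_from_psnr_gap(psnr_high, psnr_low):
    """How many times smaller the MSE is at psnr_high than at psnr_low,
    assuming both use the same max_val."""
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)


# A constant error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
example_db = psnr([[0.5, 0.5]], [[0.6, 0.6]])

# Under the assumptions above, 14.5485 dB vs 7.56 dB is roughly a 5x lower MSE.
reported_ratio = mse_ratio_from_psnr_gap(14.5485, 7.56)
```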

2605.17555 2026-05-19 cs.LG cs.CV 版本更新

PFlow-T: A Persistence-Driven Forward Process for Topology-Controlled Generation

PFlow-T:基于持续性的拓扑控制生成过程

Snigdha Chandan Khilar

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出PFlow-T,一种基于持续性的前向过程生成模型,通过持续同调来控制拓扑结构,实现了对Betti数的生成和处理非分布任务的改进。

详情
AI中文摘要

当前拓扑感知的扩散模型由于使用高斯噪声进行破坏而存在架构不匹配的问题,通过条件侧通道恢复结构特征。为解决此问题,我们引入PFlow-T,一种生成模型,其前向过程完全基于持续同调。在PFlow-T中,时间度量的是H1拓扑特征如孔的破坏,而非高斯噪声注入。此前向过程根据特征的持续性来消除特征。反向网络则直接反转这种结构破坏以在一步内预测干净状态。在MNIST数字零、一和八上的测试显示,PFlow-T在生成请求的Betti数和处理非分布任务方面显著优于基线模型。PFlow-T是首个使用持续同调作为前向过程的生成架构,尽管我们注意到它目前仅限于低分辨率像素空间代理。

英文摘要

Current topology-aware diffusion models face an architectural mismatch by using Gaussian noise for corruption while recovering structural features through conditional side channels. To fix this, we introduce PFlow-T, a generative model that bases its forward process entirely on persistent homology. In PFlow-T, time measures the destruction of H1 topological features, like holes, rather than Gaussian noise injection. This forward process eliminates features based on their persistence. The reverse network then directly inverts this structured corruption to predict the clean state in one step. Tests on MNIST digits zero, one, and eight show PFlow-T significantly outperforms a baseline model in generating requested Betti numbers and handling out-of-distribution tasks. PFlow-T is the first generative architecture using persistent homology for the forward process, although we note it is currently limited to low-resolution pixel-space proxies.

2605.17527 2026-05-19 cs.CV 版本更新

Designing streetscapes from street-view imagery using diffusion models

利用扩散模型从街景图像中设计街道景观

Yuzhou Chen, Yuebing Liang, Lingqian Hu, Kailai Sun, Qingqi Song, Chang Zhao, Shenhao Wang

发表机构 * Department of Urban and Regional Planning, University of Florida(城市与区域规划系,佛罗里达大学) Singapore-MIT Alliance for Research and Technology Centre (SMART)(新加坡-麻省理工联合研究中心(SMART)) Department of Landscape Architecture and Urban Planning, Texas A&M University(景观建筑与城市规划系,德克萨斯农工大学) Department of Agronomy, University of Florida(农学系,佛罗里达大学)

AI总结 本文提出了一种生成多模态AI框架,通过目标视觉指标生成替代的街道景观,提升了城市规划和设计中的视觉探索能力。

详情
AI中文摘要

街景图像(SVI)被广泛用于量化城市环境的关键指标,如绿化率、天空和道路视图指数。然而,现有研究大多集中在测量当前的街道景观,很少支持生成替代或不存在的城市场景,这是地理学学科如城市规划和设计中的核心任务。为解决这一差距,我们提出了一种生成多模态AI框架,该框架能够根据目标视觉指标合成替代的街道景观,从而直接探索城市场景。我们首先构建了一个多模态数据集,将SVI与文本描述、分割图、道路掩码以及芝加哥和奥兰多的视觉元素定量指标对齐。使用这个数据集,我们证明扩散模型能够生成逼真且语义一致的街道景观图像,同时响应文本和图像控制。我们的定量评估显示,结合视觉控制可以提高语义一致性,使LPIPS指数降低约6%,同时保持整体视觉真实性。此外,整体语义一致性在奥兰多提高了23.7%,在芝加哥提高了46.4%,通过mIoU指数测量,类别层面的提升甚至超过了100%的改进,特别是在建筑视图指数方面。通过视觉和文本提示,可以精细控制街道景观的生成,当文本和视觉控制冲突时,图像控制始终占主导地位,表明了清晰的控制层次以及进一步发展视觉控制对于城市场景生成的重要性。总体而言,本文为使用SVI和扩散模型进行街道景观生成建立了重要的基准,并展示了生成式AI如何成为一种实用、可扩展且可控的城市场景探索方法。

英文摘要

Street-view imagery (SVI) is widely used to quantify key indicators of the urban environment, such as greenery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative and non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements in Chicago and Orlando. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and imagery controls. Our quantitative evaluations show that incorporating visual controls can improve semantic consistency, reducing the LPIPS index by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency increases by 23.7% in Orlando and 46.4% in Chicago, as measured by the mIoU index, with class-wise gains exceeding even 100% improvement for building view indices. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts, and when textual and visual controls conflict, imagery controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation using SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.
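
The two quantities this abstract leans on, view indices and mIoU, can be sketched on segmentation label maps. This is a minimal illustration, assuming a view index is the fraction of pixels assigned to a class (a common convention the abstract does not spell out) and using made-up class ids; it is not the authors' evaluation code.

```python
def view_index(label_map, class_id):
    """Fraction of pixels labeled class_id (assumed definition of a 'view index')."""
    total = sum(len(row) for row in label_map)
    hits = sum(1 for row in label_map for v in row if v == class_id)
    return hits / total


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += int(p == c and t == c)
                union += int(p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical class ids: 0 = road, 1 = greenery.
target_map = [[0, 1], [1, 1]]     # segmentation of the conditioning image
generated_map = [[0, 0], [1, 1]]  # segmentation of the generated streetscape

greenery_index = view_index(target_map, 1)
consistency = mean_iou(generated_map, target_map, num_classes=2)
```

In this toy case the road class has IoU 1/2 and the greenery class 2/3, so the mean is 7/12; a generated image that followed the target layout exactly would score 1.0.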

2605.17506 2026-05-19 cs.CV 版本更新

Degradation Frequency Curve: An Explicit Frequency-Quantified Representation for All-in-One Image Restoration

退化频率曲线:一种用于全能图像恢复的显式频率量化表示

Xinghua Huang, Zhixiong Yang, Chen Wu, Shengxi Li, Shuaifeng Zhi, Yue Zhang, Qibin Hou, Xin Deng, Jingyuan Xia

发表机构 * College of Electronic Science, National University of Defense Technology(国防科技大学电子科学学院) College of Electronic Engineering, Beihang University(北航电子工程学院) VCIP, School of Computer Science, Nankai University(南开大学计算机科学学院)

AI总结 本文提出退化频率曲线(DFC),一种显式量化退化影响的频率域表示方法,通过测量频带内的残差到退化能量比来量化退化响应,从而为全能图像恢复提供有效的表示基础,提升了在复杂退化条件下的性能和泛化能力。

详情
AI中文摘要

全能盲图像恢复中的基本困难在于退化通常被视为隐含在退化到清洁映射中的隐式因素,而不是可以测量和操作的显式对象。这种限制在混合、复合或未见的退化条件下更加明显,其中退化效应难以分配到预定义标签或任务特定参数。我们提出退化频率曲线(DFC),一种结构化的频谱表示,通过测量频域内带状的残差到退化能量比来量化退化响应。DFC将视觉纠缠且难以描述的退化效应转换为可测量的退化坐标空间。此外,DFC可以自适应地分解为带状频谱标记,允许局部退化响应被表示为可重用的恢复先验。基于这种表示,我们开发了DFC引导图像恢复器(DFC-IR),一种基于标记的多尺度框架,逐步从中间恢复中估计DFC,并利用所得频谱标记以粗到细的方式指导退化感知恢复。在标准、复合、未见和现实世界退化基准上的广泛实验表明,DFC为全能恢复提供了有效的表示基础,导致在复杂退化配置下达到最先进的性能和改进的泛化能力。

英文摘要

A fundamental difficulty in all-in-one blind image restoration is that degradation is usually treated as an implicit factor hidden in degraded-to-clean mapping, rather than as an explicit object that can be measured and manipulated. This limitation becomes more pronounced under mixed, compound, or unseen degradation conditions, where degradation effects are hard to assign to predefined labels or task-specific parameters. We propose the Degradation Frequency Curve (DFC), a structured spectral representation that quantifies degradation responses by measuring band-wise residual-to-degraded energy ratios in the frequency domain. DFC converts visually entangled and hard-to-describe degradation effects into a measurable degradation coordinate space. Moreover, DFC can be adaptively decomposed into band-wise spectral tokens, allowing local degradation responses to be represented as reusable restoration priors. Based on this representation, we develop the DFC-guided Image Restorer (DFC-IR), a token-conditioned multi-scale framework that progressively estimates DFCs from intermediate restorations and uses the resulting spectral tokens to guide degradation-aware restoration in a coarse-to-fine manner. Extensive experiments on standard, composite, unseen, and real-world degradation benchmarks show that DFC provides an effective representation basis for all-in-one restoration, leading to state-of-the-art performance and improved generalization under complex degradation profiles.
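
The idea of a band-wise residual-to-degraded energy ratio can be made concrete on a toy 1D signal. Everything specific here is an assumption for illustration: the residual is taken as degraded minus a restoration estimate, bands are equal-width slices of the one-sided spectrum, and a plain DFT replaces whatever 2D transform and adaptive band decomposition the paper actually uses.

```python
import cmath
import math


def spectrum_energy(signal):
    """Squared DFT magnitude for the non-negative frequencies of a real 1D signal."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def band_energy_ratios(degraded, estimate, num_bands, eps=1e-12):
    """Per-band ratio of residual energy to degraded-signal energy.

    Assumptions: residual = degraded - estimate; equal-width frequency bands.
    """
    residual = [d - e for d, e in zip(degraded, estimate)]
    e_res = spectrum_energy(residual)
    e_deg = spectrum_energy(degraded)
    n_bins = len(e_deg)
    ratios = []
    for b in range(num_bands):
        lo = b * n_bins // num_bands
        hi = (b + 1) * n_bins // num_bands
        ratios.append(sum(e_res[lo:hi]) / (sum(e_deg[lo:hi]) + eps))
    return ratios


# Low-frequency content plus a high-frequency disturbance.
N = 16
clean = [math.cos(2 * math.pi * 1 * t / N) for t in range(N)]
disturbance = [0.5 * math.cos(2 * math.pi * 7 * t / N) for t in range(N)]
degraded_signal = [c + d for c, d in zip(clean, disturbance)]

curve = band_energy_ratios(degraded_signal, clean, num_bands=3)
```

With a perfect estimate of the clean signal, the curve is near zero in the low band (nothing there was removed) and near one in the high band (all of that energy was degradation), so the curve localizes the degradation in frequency.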

2605.17504 2026-05-19 cs.CV cs.AI 版本更新

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

从分布视角看视觉机制可解释性:KL最小软约束原理

Guancheng Zhou, Yisi Luo, Zhengfu He, Zhenyu Jin, Xuyang Ge, Wentao Shu, Deyu Meng, Xipeng Qiu

发表机构 * School of Mathematics and Statistics(数学与统计学学院) Ministry of Education Key Lab of Intelligent Networks and Network Security(教育部智能网络与网络安全重点实验室) Shanghai Innovation Institute(上海创新研究院) Fudan University(复旦大学)

AI总结 本文提出了一种基于分布的视觉机制可解释性方法,通过KL最小化优化问题来平衡可解释性和模型忠实性,利用能量引导的扩散后验采样实现,并在DINOv3模型上验证了其有效性。

详情
AI中文摘要

当前视觉机制可解释性(MI)的主要范式仍局限于通过启发式方法(如Top-K激活检索或正则化优化)解释视觉模型的内部单元。在本文中,我们建立了视觉MI的理论分布视角,该视角模型了特征激活对自然图像分布的影响,从而构建了一个KL最小化优化问题来建模MI任务。在此框架下,识别了先前MI范式中的统计偏差,揭示这些范式可能在人类感知上不可解释(即偏离自然图像分布)或在机械上不忠实于视觉模型(即无法激活模型特征)。为了解决这些偏差,我们提出了一种基于KL最小化软约束原理的视觉MI模型,该模型在理论上平衡了可解释性和忠实性。我们通过能量引导的扩散后验采样实现了这一原理。广泛的实验验证了所提出分布视角的理论正确性,并展示了我们的范式在DINOv3视觉模型上的实际有效性。

英文摘要

Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-$K$ activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.

2605.17500 2026-05-19 cs.LG cs.CV 版本更新

The Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation

沉默的画笔:评估AI艺术生成中的艺术风格泄露

Ninad Joshi, Ashutosh Ranjan, Vivek Srivastava, Shirish Karande

发表机构 * TCS Research(TCS研究)

AI总结 本文研究了AI艺术生成中由于模型学习并复现艺术风格而产生的无意风格复现问题,提出了一种评估方法Art Arena,用于衡量艺术作品的编码强度、交互情况以及在无明确提示的情况下风格特征的重现频率。

详情
AI中文摘要

生成式文本到图像模型通常是在大规模网络爬取数据集上训练的,这些数据集包含多样化的视觉内容,如受版权保护和风格独特的艺术品,引发了关于所有权、归属和受保护视觉表达的无意重用的担忧。一个关键问题是,模型可以从这些数据中学习风格模式,并在生成输出中复现这些模式,而无需在提示中显式引用。我们称这种现象为The Silent Brush,即使在未被请求的情况下,所学的风格也会再次出现。现有的评估方法主要集中在近似重复检索或成员推断,而没有考虑到这种跨提示的无意风格复现形式。为了解决这些差距,我们首先制定了评估The Silent Brush的指导原则。然后引入Art Arena评估协议,用于衡量艺术作品的编码强度、交互情况以及在无明确提示的情况下其风格特征在生成输出中重现的频率。我们对广泛使用的文本到图像扩散模型,包括Stable Diffusion v1.5、Stable Diffusion XL (SDXL)和SANA-1.5进行了评估,并设计使其能够跨文本到图像生成系统通用。我们的结果表明,The Silent Brush源于艺术作品之间表示强度和交互动态的差异,导致模型生成中的不对称混合。代码和评估资源可在:https://anonymous.4open.science/r/ArtArena-EBE4获取。

英文摘要

Generative text-to-image models are typically trained on large-scale web-scraped datasets that include diverse visual content such as copyrighted and stylistically distinctive artworks, raising concerns about ownership, attribution, and the unintended reuse of protected visual expressions. A key issue is that models can learn stylistic patterns from this data and reproduce them in generated outputs without any explicit reference in the prompt. We refer to this phenomenon as The Silent Brush, where such learned styles reappear even when they are not requested. Existing evaluation methods mainly focus on near-duplicate retrieval or membership inference and do not account for this form of unintended stylistic resurfacing across prompts. To address these gaps, we first formulate guiding principles for evaluation of The Silent Brush. We then introduce Art Arena, an evaluation protocol that measures how strongly artworks are encoded, how they interact, and how frequently their stylistic traits reappear in generated outputs without explicit mention in prompts. We evaluate Art Arena on widely used text-to-image diffusion models, including Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and SANA-1.5, and design it to generalize across text-to-image generative systems. Our results show that The Silent Brush arises from differences in representational strength and interaction dynamics between artworks, leading to asymmetric blending in model generations. Code and evaluation resources are available at: https://anonymous.4open.science/r/ArtArena-EBE4.

2605.17493 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph 版本更新

Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

超越线性叠加:利用KAN-SAE在AI天气模型中发现气候特征

Minjong Cheon

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 本文提出KAN-SAE,一种基于Kolmogorov-Arnold网络的稀疏自编码器,通过非线性激活函数揭示天气预测模型中的气候特征,相比线性基线提升了72%的活跃特征数量和降低了20%的特征冗余。

详情
AI中文摘要

深度学习天气预测模型在预测能力上表现出色,但其内部如何表示物理气候现象仍不明确。通过稀疏自编码器(SAEs)实现的机理可解释性提供了一种分解这些表示的有原则方法,但现有SAEs假设严格线性特征叠加,这与现代变压器中编码的高度非线性大气动力学不匹配。我们引入KAN-SAE,一种稀疏自编码器,其编码器将标准ReLU替换为可学习的每特征B-样条激活函数,这些激活函数来自Kolmogorov-Arnold网络(KANs),使每个潜在维度能够发展出自己的非线性门控配置。应用于Sonny时,KAN-SAE发现了975个活跃特征(相比线性基线的566个,提升了72%),并具有20%更低的特征冗余和可比的重建保真度。在无任何气候监督的情况下,KAN-SAE识别出一个在西欧空间集中的可解释热浪特征,并通过因果操控实验验证了西太平洋台风追踪器。我们的结果表明,非线性激活对于深度学习天气预测模型的机理可解释性至关重要,恢复了对线性基线不可见的气候特征。

英文摘要

Deep learning weather prediction models achieve remarkable predictive skill yet remain largely opaque: we know little about how they represent physical climate phenomena internally. Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature superposition - a constraint ill-suited for the highly nonlinear atmospheric dynamics encoded in modern transformers. We introduce KAN-SAE, a sparse autoencoder whose encoder replaces the standard ReLU with learnable per-feature B-spline activations drawn from Kolmogorov-Arnold Networks (KANs), allowing each latent dimension to develop its own nonlinear gating profile. Applied to Sonny, KAN-SAE discovers 975 alive features (vs. 566 for a linear baseline, a 72% improvement) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without any climate supervision, KAN-SAE identifies an interpretable European heatwave feature spatially concentrated over western Europe, and a western Pacific typhoon tracker confirmed by causal steering experiments. Our results demonstrate that nonlinear activations are essential for mechanistic interpretability of deep learning weather prediction models, recovering climate features that remain invisible to linear baselines.

2605.17489 2026-05-19 cs.CV 版本更新

Employing Vision-Language Models for Face Image Quality Assessment

利用视觉-语言模型进行人脸图像质量评估

Erdi Sarıtaş, Eren Onaran, Vitomir Štruc, Hazım Kemal Ekenel

发表机构 * Department of Computer Engineering, Istanbul Technical University(伊斯坦布尔技术大学计算机工程系) Faculty of Electrical Engineering, University of Ljubljana(卢布尔雅那大学电气工程学院) Division of Engineering, NYU Abu Dhabi(纽约大学阿布扎比分校工程学部)

AI总结 本文研究了利用现成的视觉-语言模型(VLMs)在零样本设置下进行人脸图像质量评估(FIQA)的潜力,通过综合评估框架评估VLM性能,并发现模型架构对生物识别效用性能有显著影响,VLMs的输出与传统方法一致,但生成的分数可能因提示而异。

详情
AI中文摘要

人脸图像质量评估(FIQA)是生物识别流水线中的关键控制步骤。它确保只有可靠的样本被处理,以保持系统精度。最先进的FIQA方法实用性高,但通常以“黑箱”方式运行:它们只输出标量分数,而不提供人类可理解的依据。这种透明性的缺失限制了它们在人在回路场景(例如自动边境控制)中的有效性,因为这类场景需要可操作的反馈。在本文中,我们研究现成的视觉-语言模型(VLM)在零样本设置下执行FIQA、以弥合这一差距的潜力。我们提出了一个全面的评估框架来评估VLM的性能,其中包括通过误差-拒绝曲线对传统FIQA方法进行基准测试。此外,我们使用从监控场景到合成生成的多种数据集,分析了它们的可解释性、一致性以及对提示变化的鲁棒性。结果表明,生物识别效用性能在很大程度上取决于架构,而不仅仅取决于参数数量。大多数VLM的输出与传统方法一致。我们还发现,VLM的排序性能和生成的分数可能随提示而变化。合成消融研究显示,增加参数数量虽然可以提高内部一致性,但退化检测性能反而不如较小的模型。这些发现表明,使用VLM进行零样本FIQA分数估计很有前景,可以作为可解释性模块有效补充传统FIQA流水线。代码可在https://github.com/ThEnded32/VLM4FIQA.git获得。

英文摘要

Face Image Quality Assessment (FIQA) is a crucial control step in biometric pipelines. It ensures only reliable samples are processed to maintain system accuracy. State-of-the-art FIQA methods achieve high utility but typically operate as "black boxes." They produce scalar scores without human-interpretable justifications. This lack of transparency limits their effectiveness in human-in-the-loop scenarios, such as automated border control, where actionable feedback is essential. In this paper, we investigate the potential of off-the-shelf Vision-Language Models (VLMs) to bridge this gap by performing FIQA in a zero-shot setting. We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves. Additionally, using a diverse set of datasets, ranging from surveillance-oriented to synthetically generated, we analyzed their interpretability, consistency, and robustness to prompt changes. Our results show biometric utility performance depends significantly on architecture, not merely on parameter count. Most VLMs' outputs align with those of traditional methods. We also find that VLM ranking performance and the generated scores may vary across prompts. Our synthetic ablation study shows that while increasing the parameter count can improve internal consistency, it yields worse degradation-detection performance than smaller models. These findings suggest that zero-shot FIQA score estimation using VLMs is promising and could effectively complement conventional FIQA pipelines as an interpretability module. The codes are available at https://github.com/ThEnded32/VLM4FIQA.git.
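The error-versus-reject curves mentioned above can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: it assumes each sample carries a single quality score and a binary flag saying whether the recognition system erred on it, whereas biometric practice usually computes the error from comparison scores of image pairs at a fixed threshold. The function name and the toy data are hypothetical.

```python
def error_vs_reject(quality, is_error, reject_fractions):
    """Return (reject_fraction, error_rate) points.

    For each fraction r, the int(r * n) lowest-quality samples are rejected
    and the error rate is measured on the samples that remain.
    """
    if len(quality) != len(is_error):
        raise ValueError("quality and is_error must have the same length")
    n = len(quality)
    worst_first = sorted(range(n), key=lambda i: quality[i])
    curve = []
    for r in reject_fractions:
        kept = worst_first[int(r * n):]
        rate = sum(is_error[i] for i in kept) / len(kept) if kept else 0.0
        curve.append((r, rate))
    return curve


# Toy data: the two lowest-quality samples are the ones that cause errors.
quality = [0.1, 0.9, 0.5, 0.3]
is_error = [1, 0, 0, 1]
curve = error_vs_reject(quality, is_error, [0.0, 0.25, 0.5])
```

A useful quality score makes the error rate fall as the reject fraction grows, which is the behaviour the toy data shows; a score unrelated to recognition errors would leave the curve roughly flat.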

2605.17488 2026-05-19 cs.CV cs.MM cs.SD 版本更新

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Omni-Customizer: 用于联合音频-视频生成的端到端多模态定制

Yuheng Chen, Qingdong He, Teng Hu, Yuji Wang, Yabiao Wang, Lizhuang Ma, Jiangning Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文提出Omni-Customizer,一种端到端多模态定制框架,旨在实现精确的多模态身份信息绑定和无缝融合,通过引入Omni-Context Fusion模块和Masked TTS Cross-Attention机制,提升多模态定制生成的性能。

详情
AI中文摘要

联合音频和视频生成的领域已因强大基础模型的出现而发生根本性变革。尽管取得了进展,但实现多模态定制,以在多个相互作用的主体中同时保持视觉身份和语音音色的一致性,仍然鲜有研究。为弥合这一差距,我们提出了Omni-Customizer,一种端到端框架,专门针对多模态身份信息的精确绑定和无缝融合。具体而言,我们引入了Omni-Context Fusion(OCF)模块,该模块能有效丰富基础文本提示,加入密集的多模态身份提示,同时引入Masked TTS Cross-Attention(MTP-CA)机制,专门设计以防止严重的"语音泄漏"问题。在该架构中,我们提出了语义锚定的多模态RoPE(SA-MRoPE),用于将视觉和音频参考标记以及TTS嵌入锚定到其对应的语义描述,从而实现结构化的多模态融合和稳健的身份绑定。此外,我们设计了一种全面的训练策略,结合交错音频-视频调度以快速适应多语言场景而不影响基础先验,以及渐进式内对到跨对课程学习以促进高阶和稳健的身份特征学习。大量实验表明,Omni-Customizer在双模态定制生成中实现了最先进的性能,其在视觉身份相似性、音色一致性、精确音频-视频同步以及整体视频-音频保真度方面均表现出色。

英文摘要

The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.

2605.17483 2026-05-19 cs.CV 版本更新

On Applicability of Synthetic Datasets for Facial Expression Recognition

关于合成数据集在面部表情识别中的适用性

Ali Azmoudeh, Erdi Sarıtaş, Ömer Yıldırım, Hazım Kemal Ekenel

发表机构 * Department of Computer Engineering, Istanbul Technical University(伊斯坦布尔技术大学计算机工程系) Department of Informatics, University of Zurich(苏黎世大学信息学系) Division of Engineering, NYU Abu Dhabi(纽约大学阿布扎比分校工程学部)

AI总结 本文研究了合成数据集在面部表情识别中的应用,提出三种隐私保护策略来构建平衡的数据集,并通过实验验证了合成数据在缓解类别不平衡和隐私限制方面的有效性。

详情
AI中文摘要

面部表情识别面临两个核心挑战。第一个是公共数据集中类别不平衡的问题,这会扭曲学习过程并削弱泛化能力。第二个是隐私和数据收集限制的问题,这限制了面部图像的共享并阻碍了大而平衡数据集的创建。为了解决这些问题,我们考察了三种互补的策略,用于在标准七种离散面部表情类别设置下构建隐私保护的面部表情识别(FER)数据集。我们的策略是:(i)在置信度阈值方案下使用教师模型对大规模未标记面部集合进行伪标签;(ii)使用扩散模型进行提示驱动的合成,条件于人口统计学属性;(iii)任务感知的基于GAN的表情编辑,该方法在保持身份和真实感的同时修改面部表情。在训练和评估中,我们采用了广泛使用的数据集,包括AffectNet、RAF-DB和FER2013。我们利用合成数据集DigiFace、DCFace和EmoNet-Face BIG作为伪标签的未标记源。此外,我们利用FFHQ数据集作为生成合成的来源。主要实验使用经典CNN主干网络IR50进行,我们还探索了更复杂的架构POSTERv1,以评估其可行性和鲁棒性。通过跨数据集评估,我们分析了每种策略在整理数据集中的权衡。研究结果展示了合成数据如何有效替代或与真实数据集结合,以缓解不平衡和隐私限制的问题。代码和生成数据集:https://www.github.com/AliAZ98/SyntFER

英文摘要

Facial Expression Recognition (FER) faces two core challenges. The first is class imbalance in public datasets, which skews the learning process and weakens generalization. The second is related to privacy and data collection constraints, which limit the sharing of facial images and restrict the creation of large, balanced datasets. To address these issues, we examine three complementary strategies for constructing privacy-preserving FER datasets in the standard setting of seven discrete facial expression classes. Our strategies are: (i) pseudo-labeling large unlabeled face collections with a teacher model under a confidence-thresholding scheme, (ii) prompt-driven synthesis using diffusion models conditioned on demographic attributes, and (iii) task-aware GAN-based expression editing that modifies facial expression while preserving identity and realism. For training and evaluation, we employed widely adopted datasets, including AffectNet, RAF-DB, and FER2013. We utilized the synthetic datasets DigiFace, DCFace, and EmoNet-Face BIG as unlabeled sources for pseudo-labeling. Additionally, we utilized the FFHQ dataset as the source for generative synthesis. The main experiments are conducted using a classic CNN backbone, IR50, and we also explore a more complex architecture, POSTERv1, to assess its feasibility and robustness. Using cross-dataset evaluations, we analyze the trade-offs each strategy presents in curated datasets. The findings demonstrate how synthetic data can effectively substitute or be combined with real datasets to mitigate imbalance and privacy limitations. Code and generated datasets: https://www.github.com/AliAZ98/SyntFER

2605.17478 2026-05-19 cs.CV 版本更新

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

Mamba-VGGT: 基于外部滑动窗口Mamba记忆的持久长序列视频几何基础Transformer

Tianchen Deng, Zhenxiang Xiong, Nailin Wang, Fangjinhua Wang, Jiuming Liu, Jianfei Yang, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) ETH Zurich(苏黎世联邦理工学院) Cambridge University(剑桥大学) Nanyang Technological University(南洋理工大学)

AI总结 本文提出Mamba-VGGT框架,通过引入滑动窗口Mamba记忆模块,解决传统VGGT在长序列视频中的几何遗忘和累积漂移问题,提升3D场景重建的精度与稳定性。

详情
AI中文摘要

Visual Geometry Grounded Transformer(VGGT)在高保真3D场景重建中树立了新的基准。然而,随着序列长度增加,这些模型会出现灾难性的几何遗忘和累积漂移,主要原因是全局注意力的二次复杂度迫使其采用截断的时间窗口。为克服由此产生的几何漂移,我们提出了Mamba-VGGT,一种能够进行持久长程推理的增强型VGGT框架。我们的关键贡献是滑动窗口Mamba(SWM)记忆模块,该模块在各时间窗口之间维护一个显式的外部记忆token。该模块利用选择性状态空间建模来提炼和传播全局几何先验,有效绕过了传统Transformer的内存限制。为了在不破坏预训练VGGT高度优化的空间特征的前提下整合这些长期时间线索,我们提出了零初始化空间记忆注入器(Zero-Init Spatial Memory Injector)。该注入器利用零卷积层,自适应地将持久记忆融合到patch token流中,确保结构稳定性和无缝的特征对齐。大量实验表明,我们的方法在保持空间一致性和减少轨迹累积误差方面显著优于现有基于VGGT的方法。我们的工作为大规模3D环境中基于几何的世界建模提供了可扩展、线性复杂度的解决方案。

英文摘要

Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulation drift, primarily due to the quadratic complexity of global attention which necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial features of the pre-trained VGGT, we propose a Zero-Init Spatial Memory Injector. Utilizing zero-convolutional layers, this injector adaptively fuses persistent memory into the patch token stream, ensuring structural stability and seamless feature alignment. Extensive experiments demonstrate that our approach significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing trajectory accumulation errors. Our work provides a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.

2605.17456 2026-05-19 cs.CV cs.AI 版本更新

GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging

GCE-MIL: 面向全切片成像多实例学习的忠实且可恢复的证据

Xiangyu Li, Ran Su

发表机构 * College of Intelligence and Computing(智能与计算学院)

AI总结 该研究提出GCE-MIL方法,通过直接优化S/N/R(充分性、必要性、可恢复性)标准,提升全切片成像多实例学习的预测性能和证据质量,提高了宏F1分数和C-index,并缩小了连续-离散差距。

Comments 10 pages, 17 figures, 24 tables

详情
AI中文摘要

多实例学习(MIL)是全切片图像(WSI)分类和生存预测的标准方法,其中基于注意力的模型将图像块特征聚合为切片级预测。这些模型将注意力权重视为预测的证据,但注意力是针对分类优化的,而不是针对识别哪些图像块真正支持诊断而优化的。这种混淆导致三种失败:所选图像块不充分(仅保留它们会使宏F1分数下降0.078)、不必要(移除它们几乎不改变预测)以及不可恢复(连续的注意力分数与推理时使用的离散图像块子集不一致)。核心前提是,证据质量应通过显式标准(充分性、必要性和可恢复性,S/N/R)直接优化,而不是作为分类的副产品继承而来。GCE-MIL是一种与骨干网络无关的封装器,通过三种注入模式和三个证据组件实现:将选择与领域特定概念对齐的grounding机制,作为干预式证据搜索可微分代理的noisy-OR覆盖,以及通过边际引导修复将连续选择器转换为离散子集的阈值加修复恢复。在9个骨干网络和9个数据集(81种配置)上,GCE-MIL将平均宏F1分数提高了0.024,C-index提高了0.014,将连续-离散差距减少了4-7,将补集退化增加了2-4。在离散恢复之后可选地进行图像块预过滤,推理速度最高可提升至5倍,同时保留0.989的完整包(full-bag)效用。

英文摘要

Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, where attention-based models aggregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classification, not for identifying which patches actually support the diagnosis. This conflation leads to three failures: selected patches are insufficient (keeping them alone drops Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria, namely Sufficiency, Necessity, and Recoverability (S/N/R), rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that acts as a differentiable proxy for interventional evidence search, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional tile prefiltering after discrete recovery, inference runs up to 5$\times$ faster while retaining 0.989 full-bag utility.
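The noisy-OR coverage term named above has a standard closed form, sketched below. The sketch assumes each patch i has a selection probability p_i of supplying the evidence independently of the others; how GCE-MIL actually defines these probabilities and combines the term with its other losses is not stated in the abstract, and the function names are hypothetical.

```python
import math


def noisy_or_coverage(selection_probs):
    """Probability that at least one patch 'fires': 1 - prod_i (1 - p_i)."""
    for p in selection_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 - math.prod(1.0 - p for p in selection_probs)


def noisy_or_gradient(selection_probs):
    """Partial derivatives d coverage / d p_i = prod_{j != i} (1 - p_j)."""
    return [
        math.prod(1.0 - q for j, q in enumerate(selection_probs) if j != i)
        for i in range(len(selection_probs))
    ]


coverage = noisy_or_coverage([0.5, 0.5])
gradient = noisy_or_gradient([0.5, 0.5])
```

Because every partial derivative is a smooth product of the other patches' complements, the coverage can be optimized by gradient descent, which is what makes it usable as a differentiable stand-in for a discrete search over patch subsets.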

2605.17451 2026-05-19 cs.CV 版本更新

DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking

DeTrack:面向无人机具身跟踪的基准与高度感知双世界模型

Guyue Hu, Haoming Liu, Siyuan Song, Chenglong Li, Feng Chen, Jin Tang

发表机构 * Hefei Si Valley Technology Development Co., Ltd(合肥蜀山科技发展有限公司) Institute of Embodied Intelligence, Anhui University(安徽大学具身智能研究院)

AI总结 本文提出DeTrack任务,要求无人机在交互式3D环境中利用在线第一人称观测和主动飞行控制进行目标跟踪,并提出AaDWorlds框架,以缓解由飞行高度引起的目标可见性与飞行安全之间的矛盾。

详情
AI中文摘要

空中目标跟踪在公共安全、应急救援、野生动物监测等领域有广泛应用。然而,现有空中跟踪基准主要基于从固定摄像头位置或预设飞行路径采集的被动2D视频序列,其中无人机被视为被动相机,而非能够在动态3D场景中主动感知、交互并控制自身运动的具身智能体。本文定义了一项新的无人机具身跟踪任务DeTrack,要求无人机在交互式3D环境中利用在线第一人称观测和主动飞行控制,以闭环方式跟踪目标。我们构建了一个包含11,368条目标轨迹的大规模基准,涵盖多样化的场景、渲染条件、语义区域和移动干扰物,并提供了针对目标可见性、跟踪准确性和轨迹成功率的评估指标。我们进一步提出了AaDWorlds,一种用于无人机具身跟踪的高度感知双世界模型框架。AaDWorlds包含一个高度感知的感知模块和两个世界模型,后者分别在高空和低空两种状态下想象未来状态。通过结合伪高度感知观测和想象的未来状态,AaDWorlds缓解了目标可见性与飞行安全之间由飞行高度引起的固有矛盾。在DeTrack基准上的实验表明,AaDWorlds在所有评估指标上均提升了闭环跟踪性能。

英文摘要

Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.

2605.17449 2026-05-19 cs.CV cs.AI 版本更新

Spatial Blindness in Whole-Slide Multiple Instance Learning

全切片多实例学习中的空间盲区

Xiangyu Li, Ran Su

发表机构 * College of Intelligence and Computing(智能与计算学院)

AI总结 本文研究了全切片多实例学习中由于空间信息处理不足导致的分类误差问题,提出ResTopoMIL模型通过引入不变原型直方图和坐标洗牌约束来提升模型对空间关系的敏感性,从而在多个公开数据集上提升了分类和生存预测性能。

Comments 28 pages, 8 figures, 16 tables

详情
AI中文摘要

当在图像块嵌入之上加入图网络、Transformer或状态空间模块后,全切片MIL模型往往被称为上下文感知模型。我们证明这一标签可能具有误导性。在组织结构本身是诊断信号一部分的病理任务中,若干强MIL基线在图像块坐标被随机置换后,切片级AUC几乎不变。它们的预测是准确的,但主要依赖成分构成(compositional)而非空间结构。我们将这种失败模式称为空间盲区。我们的解释基于优化:在切片级监督下,密集的外观统计信息较早被学到,留给稀疏空间关系的梯度很弱。ResTopoMIL的解决办法是先拟合一个置换不变的原型直方图,然后将其冻结,同时让一个轻量级图分支在坐标打乱约束下学习残差。该架构有意保持简单;干预之处在于空间分支的训练方式。在9个公开WSI基准上,ResTopoMIL以1.15M参数提升了分类和生存预测性能,恢复了对坐标扰动的敏感性,并在CAMELYON-16上给出了更强的定位证据。

英文摘要

Whole-slide MIL models are often called context-aware once graphs, Transformers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide-level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.
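The coordinate-permutation diagnostic described above can be sketched in a few lines. This is an illustrative toy, not the paper's protocol: it compares a single scalar prediction before and after shuffling patch coordinates, whereas the abstract reports the effect on slide-level AUC over a dataset. The two toy models and all function names are hypothetical.

```python
import random


def shuffle_coordinates(coords, seed=0):
    """Return the same coordinates reassigned to patches in a random order."""
    rng = random.Random(seed)
    shuffled = list(coords)
    rng.shuffle(shuffled)
    return shuffled


def coordinate_shuffle_gap(model, features, coords, seed=0):
    """Absolute change in the prediction when patch coordinates are permuted."""
    original = model(features, coords)
    permuted = model(features, shuffle_coordinates(coords, seed))
    return abs(original - permuted)


def composition_only_model(features, coords):
    # Ignores coords entirely: a permutation-invariant summary of appearance.
    return sum(features) / len(features)


def spatial_model(features, coords):
    # Weights each patch feature by its x-coordinate, so layout matters.
    return sum(f * x for f, (x, _y) in zip(features, coords)) / len(features)


features = [0.9, 0.1, 0.8, 0.2]
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
blind_gap = coordinate_shuffle_gap(composition_only_model, features, coords)
spatial_gap = coordinate_shuffle_gap(spatial_model, features, coords)
```

A gap of exactly zero, as for `composition_only_model`, is the signature of spatial blindness in this toy setting: the model cannot be using tissue layout if rearranging the layout never changes its output.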

2605.17447 2026-05-19 cs.CV cs.CL 版本更新

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

FastOCR: 通过KV缓存剪枝实现高效的动态视觉聚焦文档解析

Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出FastOCR,一种无需训练的框架,通过动态视觉聚焦技术解决文档解析中的高效KV缓存剪枝问题,显著提升处理速度和准确性。

详情
AI中文摘要

视觉-语言模型(VLM)在光学字符识别(OCR)中展现出强大潜力,但编码密集文档所需的大量视觉token导致推理成本过高。现有剪枝方法依赖物理驱逐,例如在prefill阶段永久丢弃视觉token。这一策略对自然图像有效,但在OCR中会彻底失效,因为几乎每个视觉token都可能对应一个字符或结构元素,任何不可逆的丢失都会导致准确率急剧下降。我们观察到,尽管文档图像在全局上看似密集、难以剪枝,模型对它们的注意力实际上在时间上是稀疏的:在每个解码步骤,注意力集中于一小块区域,并随步骤逐渐移动,正如人类读者依次注视一个个词语,而不是一次感知整页内容。受这种动态视觉聚焦(Dynamic Visual Fixation)现象的启发,我们将难以处理的全局剪枝问题转化为可处理的局部动态问题,并提出FastOCR,一种无需训练的框架,包含两个互补模块。具体而言,Focal-Guided Pruning识别出少量焦点层,并在每一步从中选出与任务最相关的视觉token;Cross-Step Fixation Reuse则利用注视区域的逐渐移动,以上一步的结果热启动当前步骤。FastOCR动态调整被关注的token,而不从缓存中驱逐任何token,从而避免了永久性的信息丢失。大量实验表明,FastOCR可作为即插即用的加速模块,在五个不同规模和架构的VLM上一致泛化。在Qwen2.5-VL上,FastOCR在每个解码步骤仅关注5%的视觉token,即可保留未剪枝模型98%的准确率,同时将注意力延迟降低了3.0倍。

英文摘要

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.
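The difference between evicting tokens and merely choosing which cached tokens to attend to at each step can be sketched as below. This is a toy illustration under stated assumptions: the per-token relevance scores are made-up numbers standing in for whatever the focal layers provide, the selection rule is a plain top-k, and the function name `select_attended_tokens` is hypothetical. It does not reproduce Focal-Guided Pruning or Cross-Step Fixation Reuse.

```python
import math


def select_attended_tokens(scores, keep_fraction=0.05):
    """Indices of the highest-scoring tokens, in cache order.

    At least one token is kept. The cache itself is never modified, so a token
    skipped at this step can still be selected at a later one.
    """
    k = max(1, math.ceil(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# A toy cache of 8 visual tokens and hypothetical relevance scores at two
# consecutive decoding steps; the attended region drifts one token rightwards.
cache = ["tok%d" % i for i in range(8)]
step1_scores = [0.01, 0.40, 0.35, 0.05, 0.02, 0.01, 0.01, 0.01]
step2_scores = [0.01, 0.05, 0.38, 0.42, 0.03, 0.01, 0.01, 0.01]

step1 = select_attended_tokens(step1_scores, keep_fraction=0.25)
step2 = select_attended_tokens(step2_scores, keep_fraction=0.25)
```

Token 3 is skipped at the first step and attended at the second, which would be impossible had it been evicted from the cache after the first step.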

2605.17436 2026-05-19 cs.CV cs.CL 版本更新

Medical Context Distorts Decisions in Clinical Vision Language Models

医学语境扭曲了临床视觉语言模型的决策

David Restrepo, Ira Ktena, Maria Vakalopoulou, Stergios Christodoulidis, Enzo Ferrante

发表机构 * MICS(复杂系统数学与信息学实验室) CentraleSupélec - Université Paris-Saclay(巴黎中央理工-高等电力学院 - 巴黎萨克雷大学) Cancer Data Science Unit(癌症数据科学单元) IHU PRISM National Institute in Precision Oncology(IHU PRISM国家精准肿瘤学研究所) University Paris-Saclay(巴黎萨克雷大学) CentraleSupelec(巴黎中央理工-高等电力学院) Gustave Roussy(古斯塔夫·鲁西研究所) INSERM(法国国家健康与医学研究院) CONICET(阿根廷国家科学与技术研究理事会) Universidad de Buenos Aires(布宜诺斯艾利斯大学)

AI总结 本文研究了医学语境对临床视觉语言模型决策的影响,发现模型在整合医学记录的视觉和文本信息时存在模态依赖、无关历史依赖和提示敏感性等问题,强调了在临床应用前需要建立明确的保障措施。

详情
AI中文摘要

视觉-语言模型(VLM)越来越多地被提议用于临床决策支持,但在需要整合医学记录中视觉和文本上下文的真实场景中,其可靠性仍缺乏充分刻画。本文识别了三种失败模式:(1)模态上过度依赖文本而非图像,(2)对无关临床病史的虚假依赖,以及(3)对语义等价输入的提示敏感性。我们使用MIMIC-CXR,在胸部X光任务上评估了一组多样的通用领域和医学微调的开源与闭源VLM。通过系统地操纵图像-文本对齐、临床病史和提示表述,我们发现即使存在视觉证据,VLM的决策仍由文本模态主导。此外,我们观察到VLM会受到无关报告的强烈影响,而微小的提示变化就可能推翻原本正确的基于图像的预测。我们的发现强调,在考虑将这些模型用于临床实践之前,需要建立明确的保障措施并进行压力测试。

英文摘要

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.

2605.17433 2026-05-19 cs.CV 版本更新

VISTA: Variance-Gated Inter-Sequence Test-Time Adaptation for Multi-Sequence MRI Segmentation

VISTA: 用于多序列MRI分割的方差门控跨序列测试时适应

Zhipeng Deng, Jiale Zhou, Wenhan Jiang, Haolin Wang, Xun Lin, Yafei Ou, Yefeng Zheng

发表机构 * Westlake University(西湖大学) Hokkaido University(北海道大学) The Chinese University of Hong Kong(香港中文大学) RIKEN(理化学研究所)

AI总结 本文提出VISTA框架,解决多序列MRI分割中模态交互偏移问题,通过设计跨序列干预生成器和跨视图分歧感知伪标签方法,提升模型在临床环境下的适应能力,实验表明在不同群体上性能优于现有方法。

Comments MICCAI2026 early accept

详情
AI中文摘要

在新的临床环境中部署多序列磁共振成像(MRI)分割模型具有挑战性,因为扫描仪和采集协议存在差异。尽管现有的测试时适应(TTA)方法能够处理基本的逐模态偏移,但它们在根本性的双重偏移问题下常常失效,因为其适应信号无法捕捉破坏跨序列一致性的模态交互偏移。为此,我们提出了方差门控跨序列测试时适应(VISTA),一种应对模态交互偏移的无源(source-free)框架。首先,我们设计了跨序列干预生成器(ISIG),通过在序列之间交换低频频谱和由熵定位的图像块来生成一组一致性探针,在保持解剖语义的同时挑战跨序列依赖关系。其次,我们引入了跨视图分歧感知伪标签(CDPL),利用跨视图分歧方差建立体素级可靠性度量,以动态门控自训练并强制干预一致性,促使网络依赖稳健的解剖语义。大量实验将模型从标准成人MRI(BraTS-GLI-Pre)适应到非洲低场(BraTS-SSA)和儿童(BraTS-PED)队列,结果表明该方法在临床偏移下优于竞争方法,相对源模型的绝对Dice提升分别为+1.89%(SSA)和+2.82%(PED)。代码可在https://github.com/dzp2095/VISTA获取。

英文摘要

Deploying multi-sequence magnetic resonance imaging (MRI) segmentation models to new clinical environments is challenging due to variations in scanners and acquisition protocols. Although existing TTA methods handle basic per-modality shifts, they often fail under a fundamental dual-shift problem, as their adaptation signals fail to capture modality-interaction shifts that disrupt inter-sequence consistency. To address this, we propose Variance-gated Inter-Sequence Test-time Adaptation (VISTA), a source-free framework that tackles modality-interaction shifts. First, we design an Inter-Sequence Intervention Generator (ISIG) that generates a set of consistency probes by swapping low-frequency spectra and entropy-localized patches across sequences, preserving anatomical semantics while challenging inter-sequence dependencies. Second, we introduce Cross-View Disagreement-Aware Pseudo Labeling (CDPL), which establishes a voxel-wise reliability metric using cross-view disagreement variance to dynamically gate self-training and enforce interventional consistency, encouraging the network to rely on robust anatomical semantics. Extensive experiments adapting from standard adult MRI (BraTS-GLI-Pre) to African low-field (BraTS-SSA) and pediatric (BraTS-PED) cohorts show improved performance over competing methods under clinical shifts, achieving absolute Dice improvements of +1.89% (SSA) and +2.82% (PED) over the source model. The code is available at https://github.com/dzp2095/VISTA.

2605.17429 2026-05-19 cs.LG cs.CV 版本更新

Radial-Angular Geometry for Reliable Update Diagnosis in Noisy-Label Learning

径向-角向几何用于噪声标签学习中的可靠更新诊断

Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Weiguang Qu, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学) Nanjing University of Chinese Medicine(南京中医药大学)

AI总结 本文提出了一种基于径向-角向几何的方法,用于在噪声标签学习中可靠地诊断更新:通过比较观测标签梯度与EMA教师诱导的参考梯度,区分方向一致的困难干净样本更新与由损坏标签引起的冲突更新。

详情
AI中文摘要

噪声标签方法通常依据损失、置信度或熵等前向空间信号来估计样本可靠性。这些信号能反映样本是否难以预测,但并不直接检验其观测标签是否会引起可靠的参数更新。这一差距很重要,因为困难的干净样本和错误标注的样本可能具有相近的损失,却会引起不同的更新。我们将可靠性估计重新表述为对观测标签更新的诊断。逐样本的经验Fisher迹给出了更新能量在反向空间中的度量:对于分类器层,它可分解为预测残差项和特征敏感性项,因此包含了标量损失之外的信息。然而,迹仍然只是径向的幅度信号,无法判断一次大的更新是有益还是有害。因此,我们提出了相对几何冲突(RGC),它将观测标签的梯度与由EMA教师诱导的参考梯度进行比较。冲突项有助于区分幅度大但方向一致的困难干净样本更新与由损坏标签引起的幅度大且方向冲突的更新。在合成和真实世界的噪声标签基准上,RGC在我们的评估协议下改善了困难干净样本的保留情况并提高了准确率。

英文摘要

Noisy-label methods often estimate sample reliability from forward-space signals such as loss, confidence, or entropy. These signals indicate whether a sample is difficult to predict, but they do not directly test whether its observed label induces a reliable parameter update. This gap matters because hard clean samples and mislabeled samples can have similar loss while inducing different updates. We recast reliability estimation as diagnosis of the observed-label update. The sample-wise empirical Fisher trace gives a backward-space measure of update energy: for the classifier layer, it factorizes into a prediction-residual term and a feature-sensitivity term, so it captures information beyond scalar loss. Trace, however, is still a radial magnitude signal and cannot decide whether a large update is useful or harmful. We therefore propose Relative Geometric Conflict (RGC), which compares the observed-label gradient with a reference gradient induced by an EMA teacher. The conflict term helps distinguish large but aligned hard-clean updates from large conflicting updates caused by corrupted labels. Across synthetic and real-world noisy-label benchmarks, RGC improves hard-clean preservation and accuracy under our evaluation protocol.
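The factorization and the radial/angular split described above can be checked on a toy softmax classifier layer. For logits z = W h with cross-entropy loss, the gradient with respect to W is the outer product (p - y) h^T, so its squared norm is ||p - y||^2 * ||h||^2. The cosine used below as the angular signal is one plausible reading of the comparison with the teacher-induced gradient; the paper's exact RGC definition is not given in the abstract, and all function names are hypothetical.

```python
import math


def one_hot(index, num_classes):
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def residual(probs, target):
    """Prediction residual p - y for a (possibly soft) target y."""
    return [p - t for p, t in zip(probs, target)]


def squared_norm(vector):
    return sum(x * x for x in vector)


def classifier_fisher_trace(probs, target, features):
    """Radial signal: ||p - y||^2 * ||h||^2 for the classifier layer."""
    return squared_norm(residual(probs, target)) * squared_norm(features)


def explicit_gradient_sq_norm(probs, target, features):
    """Squared Frobenius norm of the outer product (p - y) h^T, entry by entry."""
    r = residual(probs, target)
    return sum((ri * hj) ** 2 for ri in r for hj in features)


def gradient_cosine(probs, observed_target, reference_target):
    """Angular signal: cosine between the two classifier-layer gradients.

    Both gradients share the factor h, so the cosine depends on residuals only.
    """
    a = residual(probs, observed_target)
    b = residual(probs, reference_target)
    denom = math.sqrt(squared_norm(a) * squared_norm(b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom > 0 else 0.0


probs = [0.7, 0.2, 0.1]        # student prediction
teacher = [0.9, 0.05, 0.05]    # hypothetical EMA-teacher soft target
features = [1.0, 2.0]          # penultimate-layer features h

trace = classifier_fisher_trace(probs, one_hot(0, 3), features)
aligned = gradient_cosine(probs, one_hot(0, 3), teacher)   # label agrees with teacher
conflict = gradient_cosine(probs, one_hot(2, 3), teacher)  # label contradicts teacher
```

In this toy case the label that contradicts the teacher yields a negative cosine while the agreeing label yields a positive one, which is the kind of directional information a magnitude-only trace cannot provide.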

2605.17423 2026-05-19 cs.CV 版本更新

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

Soap2Soap:通过多智能体协作实现长 cinematic 视频重制

Yiren Song, Huilin Zhong, Kevin Qinghong Lin, Haofan Wang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore(新加坡国立大学Show实验室) University of Oxford(牛津大学) Lovart AI(Lovart人工智能)

AI总结 本研究提出 Soap2Soap 框架,通过多智能体协作实现长篇影视视频重制,解决视频到视频生成中的长期一致性与叙事保真度问题。

详情
AI中文摘要

我们研究剧集级影视重制,这是一个长时域的视频到视频生成问题:通过风格化或演员替换对完整的剧集或影片进行本地化,同时在数百个镜头中严格保持叙事结构、动作编排和角色身份。现有视频生成和编辑流水线在这一场景下常常失效,因为在大幅相机运动和视角变化下,身份漂移、背景突变和语义侵蚀会不断累积。我们提出 Soap2Soap,一个多智能体框架,通过双桥一致性(Dual-Bridge Consistency)机制强制实现长期的语言-视觉一致性:一是作为持久语义骨架的场景感知JSON剧本,二是在场景和镜头两个层级动态分配的视觉参考锚点。为了在视频合成之前抑制漂移,我们引入批量关键帧一致性,通过基于网格的表述在共享的潜在上下文中联合生成多个关键帧。一个闭环验证智能体进一步审核身份、稳定性和对齐情况,以触发选择性重新生成。在 SoapBench 上的实验显示,与商业视频生成 API 相比,该方法在长期一致性和叙事保真度方面有显著提升。

英文摘要

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.

2605.15735 2026-05-19 cs.CV cs.AI 版本更新

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

UAM:VLA训练中遗忘的双流视角

Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出UAM模型,通过双流架构解决VLA训练中因单一编码器导致的多模态能力下降问题,表明语义保留可以通过架构分离而非冻结权重或辅助数据实现,并在多种任务中取得高成功率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通常通过在动作数据上微调预训练的视觉-语言模型(VLM)来构建。然而,我们证明这种标准做法会系统性地削弱VLM的多模态能力,我们将这种副作用称为“具身税”(embodiment tax)。但VLA一定要遗忘吗?受生物视觉双流组织的启发,我们将这种退化归因于一个结构性瓶颈:当前的VLA要求单一编码器同时支持基于语言的语义和与控制相关的视觉特征,而生物视觉则将识别与视觉运动控制分到不同的通路中。基于这一观点,我们提出了统一动作模型(UAM),它增加了一个并行的背侧专家(Dorsal Expert),类比于大脑的背侧通路。为了使背侧专家成为有效的第二通路并减轻VLM的控制学习负担,我们用预训练的生成模型对其初始化,并用预测视觉动态的中层推理目标对其训练。这一设计使我们能够仅用动作数据端到端地训练整个VLA:无需参数冻结、无需梯度截断、无需辅助的视觉-语言联合训练,UAM保留了底层VLM超过95%的多模态能力,同时在多种考察分布外泛化的操作任务(包括未见物体、新的物体-目标组合和指令变化)上取得了各基线中最高的平均成功率。这些结果共同表明,VLA中的语义保留可以源自架构分离本身,而不必依靠冻结权重或辅助数据回放来强制实现,并且这种被保留的语义能力可以自然地从VLM迁移为动作层面的语义泛化。

英文摘要

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over $95\%$ of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

2605.15586 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

拥抱偏置转移矩阵以实现多类互补标签学习

Tan-Ha Mai, Chao-Kai Chiang, Han-Hwa Shih, Gang Niu, Masashi Sugiyama, Hsuan-Tien Lin

发表机构 * National Taiwan University(国立台湾大学) The University of Tokyo(东京大学) RIKEN Center for Advanced Intelligence Project(日本理化学研究院先进智能项目中心)

AI总结 本文提出了一种新的框架BICL,通过设计有偏(非均匀)的互补标签生成过程,克服传统互补标签学习难以扩展到多类设置的限制,在CIFAR-100和TinyImageNet-200上的准确率达到传统方法的七倍以上。

Comments 33 pages, 16 figures, 18 tables

详情
AI中文摘要

互补标签学习(CLL)是一种弱监督范式,其中实例被标记为不属于其类别的标签。尽管已有十年的研究,CLL方法主要在10类分类任务中具有竞争力,而扩展到大规模标签空间仍然是一个持久的瓶颈。这种限制源于传统方法对均匀标签生成的假设,这在多类设置中严重稀释了学习信号。在本文中,我们证明通过故意设计偏置(非均匀)的生成过程,将互补标签限制在类别的子集,可以克服这一长期存在的障碍。这一发现促使我们提出Bias-Induced Constrained Labeling(BICL),一个涵盖数据收集到训练的原理性框架,利用这种偏置。BICL在CIFAR-100和TinyImageNet-200上实现了有效学习,比传统方法的准确率提高了超过七倍。我们的发现为在现实应用中使CLL适用于多类问题开辟了新的道路。

英文摘要

Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.
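The contrast between uniform and biased label generation can be made concrete with transition matrices, where row y gives the probability of each complementary label for true class y. The sketch below is only an illustration: `subset_transition` and its `allowed` parameter are hypothetical, and the actual BICL construction is not described in the abstract.

```python
def uniform_transition(num_classes):
    """Uniform assumption: every wrong class is equally likely as the complementary label."""
    k = num_classes
    return [[0.0 if c == y else 1.0 / (k - 1) for c in range(k)] for y in range(k)]


def subset_transition(num_classes, allowed):
    """Hypothetical biased process: complementary labels come only from `allowed`.

    `allowed` must contain at least two classes so that every true class
    still has a wrong class to draw from.
    """
    rows = []
    for y in range(num_classes):
        candidates = [c for c in allowed if c != y]
        weight = 1.0 / len(candidates)
        rows.append([weight if c in candidates else 0.0 for c in range(num_classes)])
    return rows


uniform = uniform_transition(100)
biased = subset_transition(100, allowed=[0, 1, 2])
print(uniform[5][0], biased[5][0])
```

With 100 classes the uniform matrix spreads each row over 99 entries, while the restricted matrix concentrates each row on two or three entries, which is one way to read the abstract's remark that uniform generation dilutes the learning signal.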

2605.15487 2026-05-19 cs.LG cs.CV eess.IV 版本更新

Learning Normalized Energy Models for Linear Inverse Problems

学习归一化能量模型以解决线性逆问题

Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth

发表机构 * Rice University(莱斯大学) Flatiron Institute(Flatiron研究所) New York University(纽约大学)

AI总结 本文提出了一种新的基于能量的模型,用于解决线性逆问题,通过引入基于协方差的正则化项来保证不同测量条件下的一致性,从而计算出归一化的后验密度,无需额外训练或微调,同时实现了能量引导的自适应采样、无偏的Metropolis-Hastings修正步骤以及通过贝叶斯规则对退化算子进行盲估计。

Comments ICML 2026

Journal ref Int'l Conf Machine Learning (ICML), Jul 2026. https://openreview.net/forum?id=PlFJwgaaDK

详情
AI中文摘要

生成扩散模型可以为成像中的逆问题提供强大的先验概率模型,但现有实现存在两个关键限制:(i) 先验密度以隐式方式表示,(ii) 它们依赖于似然近似,这会引入采样偏差。我们通过引入一种新的基于能量的模型来解决这些挑战,该模型以去噪为目标进行训练,并带有基于协方差的正则化项,以确保在不同测量条件下的一致性。训练后的模型能够为各种线性逆问题计算归一化的后验密度,而无需额外的重新训练或微调。除了保留扩散模型的采样能力外,这还带来了以前无法实现的能力:可实时调整采样计划的能量引导自适应采样、无偏的Metropolis-Hastings修正步骤,以及通过贝叶斯规则对退化算子进行盲估计。我们在多个数据集(ImageNet、CelebA、AFHQ)和任务(修复、去模糊)上验证了该方法,结果表明其性能与现有基线相当或更优。

英文摘要

Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: $(i)$ the prior density is represented implicitly, and $(ii)$ they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on-the-fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.
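A Metropolis-Hastings correction needs density ratios, which is why an explicit energy makes it available. The sketch below is a generic random-walk version on a toy 1D energy, not the authors' sampler: `toy_energy`, the proposal width, and the chain length are all assumptions chosen for illustration.

```python
import math
import random


def mh_step(x, energy, proposal_std, rng):
    """One random-walk Metropolis-Hastings step targeting p(x) proportional to exp(-energy(x))."""
    candidate = x + rng.gauss(0.0, proposal_std)
    # The proposal is symmetric, so the log acceptance ratio is just the energy drop.
    log_accept = energy(x) - energy(candidate)
    if log_accept >= 0 or rng.random() < math.exp(log_accept):
        return candidate, True
    return x, False


def toy_energy(x):
    # Hypothetical stand-in for a learned energy: a standard Gaussian, E(x) = x^2 / 2.
    return 0.5 * x * x


def run_chain(n_steps=20000, seed=0):
    rng = random.Random(seed)
    x, samples = 3.0, []
    for _ in range(n_steps):
        x, _ = mh_step(x, toy_energy, 1.0, rng)
        samples.append(x)
    return samples


samples = run_chain()
mean = sum(samples) / len(samples)
print(mean, sum((s - mean) ** 2 for s in samples) / len(samples))
```

Because the accept/reject rule uses only energy differences, the normalizing constant never has to be known for this step, although the abstract's other capabilities (such as comparing posteriors across operators) do rely on normalization.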

2605.11817 2026-05-19 cs.RO cs.CV 版本更新

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

看见关键之处:用于可泛化视觉-语言-动作模型的可微网格采样剪枝

Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia, Chengbin Du, Yunke Wang, Chang Xu

发表机构 * The University of Sydney(悉尼大学) City University of Hong Kong(香港城市大学)

AI总结 本文提出了一种基于可微网格采样的视觉-语言-动作模型压缩方法,通过连续的token重采样,仅用不到10%的原始视觉token即可保留关键空间信息,在不降低成功率的情况下减少76%的FLOPs。

Journal ref Proceedings of the Forty-third International Conference on Machine Learning, 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出显著潜力,但其高计算成本阻碍了实时部署。现有token剪枝方法面临根本性的权衡:通过剪枝进行激进压缩会不可避免地丢弃接触点等关键几何细节,导致性能严重下降。这迫使人们做出妥协,限制了可达到的压缩率以及潜在的加速效果。我们认为,要打破这种权衡,需要将压缩重新理解为视觉编码器中几何感知的连续token重采样。为此,我们提出了可微网格采样器(GridS),一个即插即用的模块,用于在VLA中对视觉token进行任务感知的连续重采样。通过自适应地预测最小的显著坐标集并利用可微插值提取特征,GridS在保留关键空间信息的同时实现了大幅压缩(使用不到10%的原始视觉token)。在LIBERO基准和真实机器人平台上的实验表明,GridS在不降低成功率的情况下减少了76%的FLOPs,验证了迄今报道的最低可行视觉token数量。代码可在https://github.com/Fediory/Grid-Sampler上获得。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in the success rate, validating the lowest feasible visual token count reported to date. The code is available at https://github.com/Fediory/Grid-Sampler.
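Extracting features "via differentiable interpolation" usually means something like bilinear sampling at continuous coordinates. The sketch below shows that operation on a plain 2D grid of floats; it is an assumption about the general technique, not the GridS implementation, and the coordinate predictor that would supply `(x, y)` is omitted.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a 2D grid of floats at continuous (x, y); x indexes columns, y indexes rows."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so the four neighbours always exist.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    # Blend along x on the two neighbouring rows, then blend those results along y.
    top = (1 - tx) * feature_map[y0][x0] + tx * feature_map[y0][x1]
    bottom = (1 - tx) * feature_map[y1][x0] + tx * feature_map[y1][x1]
    return (1 - ty) * top + ty * bottom


grid = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_sample(grid, 0.5, 0.5))  # 1.5, the average of the four corners
```

The output is piecewise linear in `x` and `y`, so in an autodiff framework gradients can flow back to the predicted coordinates, which is what distinguishes continuous resampling from discrete token pruning.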

2605.11567 2026-05-19 cs.CV 版本更新

Dynamic Execution Commitment of Vision-Language-Action Models

视觉-语言-动作模型的动态执行承诺

Feng Chen, Xianghui Wang, Yuxuan Chen, Boying Li, Yefei He, Zeyu Zhang, Yicheng Wu

发表机构 * University of Adelaide(阿德莱德大学) Sichuan University(四川大学) Shanghai Jiao Tong University(上海交通大学) Monash University(莫纳什大学) Zhejiang University(浙江大学) Imperial College London(伦敦帝国理工学院)

AI总结 本文提出A3机制,通过将动态执行承诺重新定义为自推测前缀验证问题,解决了视觉-语言-动作模型在动态或分布外情况下执行鲁棒性和推理吞吐量之间的平衡问题。

Comments code is available at https://inceptionwang.github.io/A3/

详情
AI中文摘要

视觉-语言-动作(VLA)模型主要采用动作分块方法,即在单次前向传递中预测并承诺一系列连续的低层动作,以摊销大规模主干网络的推理成本并减少每步延迟。然而,将这些多步骤预测提交到现实世界执行需要在成功率和推理效率之间进行平衡,这一决策通常由针对特定任务调整的固定执行时间范围控制。此类启发式方法忽略了预测可靠性与状态依赖性的关系,导致在动态或分布外情况下表现脆弱。在本文中,我们引入了A3,一种自适应动作接受机制,将动态执行承诺重新定义为自推测前缀验证问题。A3首先通过群体采样计算轨迹级的动作共识分数,然后选择一个代表性的草稿并优先验证下游部分。具体而言,它强制执行:(1)共识有序的条件不变性,通过判断在高共识动作条件下重新解码后低共识动作是否保持一致来验证低共识动作;以及(2)前缀封闭的序列一致性,通过只接受从开始处最长连续验证动作序列来保证物理运行完整性。因此,执行时间范围自然成为满足内部模型逻辑和序列执行约束的最长可验证前缀。在多种VLA模型和基准测试中,实验表明A3消除了手动调整时间范围的需要,同时在执行鲁棒性和推理吞吐量之间实现了更优的平衡。

英文摘要

Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
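The prefix-closed rule described above reduces to a short scan once each action in the chunk has a verification flag. This is a minimal sketch of that one rule only; how A3 computes the flags (consensus scoring and conditional re-decoding) is not shown, and `verified` is a hypothetical input.

```python
def longest_verified_prefix(verified):
    """Return how many leading actions are verified before the first failure."""
    horizon = 0
    for ok in verified:
        if not ok:
            break  # a later verified action cannot be executed past an unverified one
        horizon += 1
    return horizon


# The fourth action passed verification but sits behind a failure, so it is not executed.
print(longest_verified_prefix([True, True, False, True]))  # 2
```

The returned count plays the role of the execution horizon: it varies per chunk rather than being a constant tuned per task.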

2605.10239 2026-05-19 cs.CV 版本更新

AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting

AdaptSplat: 为前馈3D高斯溅射适配视觉基础模型

Mingwei Xing, Xinliang Wang, Yifeng Shi

发表机构 * Ke Holdings Inc.(贝壳控股有限公司)

AI总结 本文提出AdaptSplat,通过在通用架构中引入一个仅含1.5M参数的轻量级适配器,有效提升了前馈3D高斯溅射在跨领域泛化和高频几何保真度方面的性能。

详情
AI中文摘要

本文探讨了一种简单而强大的轻量级适配器设计,用于前馈3D高斯溅射(3DGS)。现有方法通常在图像特征提取→多视角交互→特征解码的通用流程上应用复杂的、架构特定的设计。然而,受限于3D训练数据的规模瓶颈和深度网络的低通滤波效应,这些方法在跨领域泛化和高频几何保真度方面仍显不足。为了解决这些问题,我们提出了AdaptSplat,证明在不使用复杂组件工程的情况下,仅在通用架构中引入一个仅含1.5M参数的适配器就足以实现优越的性能。具体而言,我们设计了一个轻量级的频率保持适配器(FPA),从强大视觉基础模型主干的浅层特征中提取方向感知的高频结构先验,并通过高频位置编码和自适应残差调制无缝地将其整合到通用流程中。这有效补偿了深度特征中过度平滑导致的高频衰减,提高了高斯基元在复杂表面和尖锐边界上的拟合精度。大量实验表明,AdaptSplat在多个标准基准上实现了最先进的前馈重建性能,并在跨领域泛化方面表现稳定。代码可在:https://github.com/xmw666/AdaptSplat 获取。

英文摘要

This work explores a simple yet powerful lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). Existing methods typically apply complex, architecture-specific designs on top of the generic pipeline of image feature extraction $\rightarrow$ multi-view interaction $\rightarrow$ feature decoding. However, constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, these methods still fall short in cross-domain generalization and high-frequency geometric fidelity. To address these problems, we propose AdaptSplat, which demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, we design a lightweight Frequency-Preserving Adapter (FPA) that extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone, and seamlessly integrates them into the generic pipeline via high-frequency positional encodings and adaptive residual modulation. This effectively compensates for the high-frequency attenuation caused by over-smoothing in deep features, improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. Extensive experiments demonstrate that AdaptSplat achieves state-of-the-art feed-forward reconstruction performance on multiple standard benchmarks, with stable generalization across domains. Code available at: https://github.com/xmw666/AdaptSplat.

2605.10185 2026-05-19 cs.CV cs.AI 版本更新

DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors

DynGhost: 用于量子探测器动态鬼成像的时序建模Transformer

Vittorio Palladino, Ahmet Enis Cetin

发表机构 * Politecnico di Milano(米兰理工大学) University of Illinois at Chicago(伊利诺伊大学芝加哥分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出DynGhost,一种基于Transformer的动态鬼成像方法,通过交替的空间和时间注意力模块解决传统方法在动态场景和低光条件下的局限性,利用量子感知训练框架提升真实硬件下的性能。

Comments 6 pages, 8 figures

详情
AI中文摘要

鬼成像通过将结构化照明图案与标量强度测量相关联,从单像素桶探测器重建空间信息。尽管深度学习方法在静态场景中取得了显著成果,但存在两个关键局限:现有架构未能利用帧间的时间相干性,导致动态鬼成像问题未得到解决,且假设加性高斯噪声模型,而实际单光子硬件遵循泊松统计。我们提出了DynGhost(动态鬼成像Transformer),通过交替的空间和时间注意力块解决这两个限制。基于物理准确的探测器模拟(SNSPDs、SPADs、SiPMs)和Anscombe方差稳定化归一化,我们的量子感知训练框架解决了导致经典模型在真实硬件约束下失效的分布偏移。在多个基准测试中,DynGhost在动态和光子匮乏设置中优于传统重建方法和现有深度学习架构。

英文摘要

Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.
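The Anscombe transform mentioned above maps Poisson counts to values with roughly unit variance, which is why it helps when a model is trained on photon-counting data. The sketch below checks that property numerically; the Poisson sampler, the mean of 20 counts, and the sample size are assumptions for the demonstration, and nothing here reproduces the paper's detector simulation.

```python
import math
import random


def anscombe(count):
    """Map a Poisson count to a value whose variance is approximately 1."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def inverse_anscombe(value):
    """Plain algebraic inverse; known to be biased at very low counts."""
    return (value / 2.0) ** 2 - 3.0 / 8.0


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def stabilized_variance(mean, n=20000, seed=0):
    """Empirical variance of Anscombe-transformed Poisson samples."""
    rng = random.Random(seed)
    values = [anscombe(sample_poisson(mean, rng)) for _ in range(n)]
    avg = sum(values) / n
    return sum((v - avg) ** 2 for v in values) / n


print(stabilized_variance(20.0))  # close to 1, whereas the raw counts have variance near 20
```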

2605.08163 2026-05-19 cs.CV cs.AI cs.CL 版本更新

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

MULTITEXTEDIT:图像内文本编辑中跨语言退化的基准测试

Liwei Cheng, Shibo Feng, Lunjie Zhou, Yixuan Guan, Dayan Guan

发表机构 * Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本文提出MULTITEXTEDIT基准测试,通过12种语言、5种视觉领域和7种编辑操作的3600个实例,评估跨语言文本-图像编辑中退化问题,引入语言保真度指标并发现模型在文本准确性和脚本保真度上的显著退化。

Comments 11 pages, 5 figures

详情
AI中文摘要

文本-图像编辑已成为视觉内容创作的关键能力,但现有基准测试大多以英语为中心且常将视觉合理性与语义正确性混为一谈。我们引入MULTITEXTEDIT,一个包含3,600个实例的受控基准测试,涵盖12种语言类型、5种视觉领域和7种编辑操作。每个实例的语言变体共享相同的视觉基础,并配有人工编辑的参考文本和区域掩码,从而隔离语言变量以进行跨语言比较。为捕捉粗粒度文本匹配度指标所遗漏的脚本级错误,如缺失变音符号、RTL顺序颠倒和混合脚本渲染,我们引入了一个由两阶段LVM协议评分的语言保真度(LSF)度量,其与母语者标注员的二次加权κ值达到0.76。评估12个开源和专有系统时,发现所有模型在跨语言退化方面表现显著,最大退化出现在希伯来语和阿拉伯语上,最小退化出现在荷兰语和西班牙语上,且集中在文本准确性和脚本保真度而非粗粒度结构维度上。我们还发现普遍存在的语义和像素不匹配,其中输出保持全局布局和背景保真度,但扭曲了脚本特定的形态。

英文摘要

Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted $\kappa$ of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.
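The agreement figure quoted above is a quadratic-weighted kappa, which penalizes disagreements by the squared distance between ordinal ratings. The sketch below is the standard formula on integer ratings; the number of rating levels and the example ratings are assumptions, since the abstract does not state the scale used.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic weights for integer ratings in range(num_levels)."""
    n = len(rater_a)
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    weighted_observed = weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    # Undefined (division by zero) if both raters use one identical level throughout.
    return 1.0 - weighted_observed / weighted_expected


print(quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], 3))  # 1.0 for perfect agreement
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.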

2605.07790 2026-05-19 cs.LG cs.CV 版本更新

Hessian Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation

Hessian Surgery: 通过Hessian尖峰扰动实现类目标后处理重平衡

Hugo Vigna, Samuel Bontemps

发表机构 * CentraleSupélec – Université Paris-Saclay(巴黎中央理工-高等电力学院 – 巴黎萨克雷大学) ESILV – Léonard de Vinci(ESILV – 莱昂纳多·达·芬奇)

AI总结 本文提出Hessian Surgery方法,沿尖峰特征向量扰动模型权重,在无需重新训练的情况下重平衡各类准确率,并在CIFAR-10和ISIC-2019上就平衡准确率和标准差两项指标取得了令人鼓舞的结果。

Comments The code is available here: https://github.com/hugovigna/hessian-surgery.git

详情
AI中文摘要

训练好的深度网络的Hessian谱表现出一种特征结构:由近零特征值构成的连续主体(bulk)和少量较大的离群特征值(尖峰),这印证了随机矩阵理论在深度学习中的相关性。尖峰数量等于类别数减一。尽管先前工作描述了这种结构,但尚无方法在实际操作中利用它来提高分类性能。我们提出Hessian Surgery,一种后处理优化方法,沿尖峰特征向量直接扰动模型权重,在无需重新训练的情况下重平衡各类准确率。我们引入(i)一个尖峰-类别敏感度矩阵,量化每个类别的准确率沿每个尖峰特征向量的方向导数,(ii)一种针对扰动系数的约束优化,在提升弱类的同时保持强类,以及(iii)一种自适应幅度控制,根据迭代级的改进信号提高或降低扰动预算。我们在CIFAR-10和ISIC-2019上就平衡准确率和标准差两项指标获得了令人鼓舞的结果。

英文摘要

The Hessian spectrum of trained deep networks exhibits a characteristic structure: a continuous bulk of near-zero eigenvalues and a small number of large outlier eigenvalues (spikes), confirming the relevance of Random Matrix Theory in deep learning. The spike count matches the number of classes minus one. While prior work has described this structure, no method has exploited it operationally to improve classification performance. We propose Hessian Surgery, a post-hoc optimization method that directly perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining. We introduce (i) a spike-class sensitivity matrix that quantifies the directional derivative of each class's accuracy along each spike eigenvector, (ii) a constrained optimization of perturbation coefficients that targets weak classes while preserving strong ones, and (iii) an adaptive amplitude control that raises or lowers the perturbation budget based on iteration-level improvement signals. We obtain encouraging results on CIFAR-10 and ISIC-2019 on both balanced accuracy and standard deviation.

2605.02198 2026-05-19 cs.CV 版本更新

SlimDiffSR: Toward Lightweight and Efficient Remote Sensing Image Super-Resolution via Diffusion Model Distillation

SlimDiffSR: 向轻量高效遥感图像超分辨率迈进:通过扩散模型蒸馏

Ce Wang, Zhenyu Hu, Wanjie Sun

发表机构 * School of Remote Sensing and Information Engineering, Wuhan University(武汉大学遥感与信息工程学院)

AI总结 本文提出SlimDiffSR,一种轻量高效的基于扩散模型的遥感图像超分辨率框架,通过引入不确定性引导的时间步分配策略和结构化剪枝策略,提升模型效率和重建质量。

详情
AI中文摘要

扩散模型最近在图像超分辨率(SR)中取得了显著性能,但其高计算成本限制了在遥感应用中的实际部署。为了解决这个问题,我们提出了SlimDiffSR,一种轻量高效的基于扩散模型的框架,用于真实场景的遥感图像超分辨率。与依赖固定时间步的现有单步扩散方法不同,我们首先引入了不确定性引导的时间步分配策略,以构建一个更强的单步教师模型,其中重建难度与扩散时间步显式关联,从而实现自适应的生成强度。在此基础上,我们进一步提出了一种针对遥感图像的结构化剪枝策略,系统地移除冗余的语义模块,并用轻量级设计替换标准操作,包括频率可分离卷积、方向可分离卷积以及查询驱动的全局聚合模块。这些组件显式利用了遥感数据的独特特性,如稀疏的高频细节、强方向模式和长距离空间依赖性。为了增强知识迁移,我们在蒸馏过程中引入最大均值差异(MMD),以对齐教师和学生模型之间的特征分布。在多个遥感基准上的大量实验表明,SlimDiffSR在效率和重建质量之间实现了良好的平衡。特别是,与多步扩散模型相比,它实现了高达200倍的推理加速和20倍的模型参数减少,同时在感知质量方面具有竞争力,并在效率上明显优于现有的轻量级扩散基线。代码可在:https://github.com/wwangcece/SlimDiffSR 获取。

英文摘要

Diffusion models have recently achieved remarkable performance in image super-resolution (SR), but their high computational cost limits practical deployment in remote sensing applications. To address this issue, we propose SlimDiffSR, a lightweight and efficient diffusion-based framework for real-world remote sensing image super-resolution. Unlike existing single-step diffusion methods that rely on fixed timesteps, we first introduce an uncertainty-guided timestep assignment strategy to construct a stronger single-step teacher model, where reconstruction difficulty is explicitly linked to diffusion timesteps, enabling adaptive generative strength. Building upon this teacher, we further present a structured pruning strategy tailored to remote sensing imagery, which systematically removes redundant semantic modules and replaces standard operations with lightweight designs, including frequency-separable convolution, direction-separable convolution, and a query-driven global aggregation module. These components explicitly exploit the unique characteristics of remote sensing data, such as sparse high-frequency details, strong directional patterns, and long-range spatial dependencies. To enhance knowledge transfer, we incorporate Maximum Mean Discrepancy (MMD) into the distillation process to align feature distributions between the teacher and student models. Extensive experiments on multiple remote sensing benchmarks demonstrate that SlimDiffSR achieves a favorable balance between efficiency and reconstruction quality. In particular, it attains up to $200\times$ inference acceleration and a $20\times$ reduction in model parameters compared with multi-step diffusion models, while achieving competitive perceptual quality and clearly outperforming existing lightweight diffusion baselines in efficiency. The code is available at: https://github.com/wwangcece/SlimDiffSR.
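Maximum Mean Discrepancy compares two sets of feature vectors through kernel averages, so it can serve as a distillation loss that aligns teacher and student feature distributions. The sketch below is a generic biased estimator of squared MMD; the RBF kernel, the bandwidth, and the toy features are assumptions, since the abstract does not specify the kernel or which features are aligned.

```python
import math


def rbf_kernel(u, v, bandwidth):
    """Gaussian kernel between two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimator of squared MMD between sample sets xs and ys."""
    def mean_kernel(p, q):
        return sum(rbf_kernel(u, v, bandwidth) for u in p for v in q) / (len(p) * len(q))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


teacher = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical teacher features
student = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]   # hypothetical student features
print(mmd_squared(teacher, teacher), mmd_squared(teacher, student))
```

The estimator is zero when the two sample sets coincide and grows as the distributions separate, which is the behaviour a feature-alignment loss needs.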

2604.24763 2026-05-19 cs.CV 版本更新

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Tuna-2:像素嵌入在多模态理解和生成中优于视觉编码器

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong

发表机构 * Meta AI The University of Hong Kong(香港大学) University of Waterloo(滑铁卢大学)

AI总结 本文提出Tuna-2,一种基于像素嵌入的统一多模态模型,通过直接使用像素嵌入进行多模态理解和生成,展示了统一像素空间建模在高质量图像生成中可以与潜在空间方法竞争,并证明了预训练视觉编码器在多模态建模中并非必要。

Comments Project page: https://tuna-ai.org/tuna-2

详情
AI中文摘要

统一多模态模型通常依赖于预训练的视觉编码器,并使用独立的视觉表示进行理解和生成,导致两种任务之间存在不一致,阻碍了从原始像素进行端到端优化。我们引入Tuna-2,一种原生统一多模态模型,直接基于像素嵌入进行视觉理解和生成。Tuna-2通过使用简单的补丁嵌入层来编码视觉输入,大幅简化了模型架构,完全摒弃了诸如VAE或表示编码器等模块化视觉编码器设计。实验表明,Tuna-2在多模态基准测试中实现了最先进的性能,证明了统一像素空间建模能够与潜在空间方法在高质量图像生成中竞争。此外,虽然基于编码器的变体在早期预训练中收敛更快,但Tuna-2的无编码器设计在大规模情况下实现了更强的多模态理解,特别是在需要细粒度视觉感知的任务中。这些结果表明,预训练视觉编码器在多模态建模中并非必要,端到端的像素空间学习为生成和感知的更强视觉表示提供了一条可扩展的路径。

英文摘要

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

2604.20155 2026-05-19 cs.CV 版本更新

GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

GSCompleter: 一种无需蒸馏的插件,可在数秒内完成度量感知的3D高斯溅射补全

Ao Gao, Jingyu Gong, Xin Tan, Zhizhong Zhang, Lizhuang Ma, Yuan Xie

发表机构 * School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) Shanghai Innovation Institute(上海创新研究院) Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University(重庆精密光学重点实验室,华东师范大学重庆研究院) Shanghai Key Laboratory of Computer Software Evaluating and Testing(上海计算机软件评测测试重点实验室) Department of Computer Science and Engineering, Shanghai Jiao Tong University(上海交通大学计算机科学与工程系)

AI总结 本文提出了一种无需蒸馏的GSCompleter插件,通过稳定的“生成-注册”流程实现度量感知的3D高斯溅射补全,提高了补全质量和效率,并在三个基准上取得了新的最先进结果。

详情
AI中文摘要

3D高斯溅射(3DGS)凭借其显式表示和效率,彻底改变了高保真神经渲染。然而,从稀疏视角重建场景会因覆盖范围有限而出现严重的几何空洞和漂浮物。当前的场景补全方法通常依赖于迭代的“修复-蒸馏”范式,这种范式计算开销大、优化容易不稳定,且容易过拟合。为了解决这些限制,我们提出了GSCompleter,一种无需蒸馏的插件,将场景补全转变为稳定的“生成-注册”流程。具体而言,GSCompleter合成视觉上合理的2D参考图像,并通过稳健的立体锚点视角选择机制将其显式提升为具有一致度量尺度的3D高斯基元。这些新生成的基元随后通过新颖的射线约束注册策略无缝集成到全局场景中。通过用快速的几何注册替代不稳定的蒸馏,GSCompleter在三个基准上表现出优越的3DGS补全性能,在质量和效率上均优于各种基线,并取得了新的最先进(SOTA)结果。

英文摘要

3D Gaussian Splatting (3DGS) has revolutionized high-fidelity neural rendering with its explicit representation and efficiency. However, reconstructing scenes from sparse viewpoints suffers from severe geometric voids and floaters due to limited coverage. Current scene completion methods typically rely on an iterative "Repair-then-Distill" paradigm, which is computationally intensive, prone to unstable optimization, and susceptible to overfitting. To address these limitations, we propose GSCompleter, a distillation-free plugin that shifts scene completion to a stable "Generate-then-Register" workflow. Specifically, GSCompleter synthesizes visually plausible 2D reference images and explicitly lifts them into 3D Gaussian primitives with a consistent metric scale via a robust Stereo-Anchor View Selection mechanism. These newly generated primitives are then seamlessly integrated into the global scene using a novel Ray-Constrained Registration strategy. By replacing unstable distillation with rapid geometric registration, GSCompleter exhibits superior 3DGS completion performance across three benchmarks, enhancing both quality and efficiency over various baselines and achieving new state-of-the-art (SOTA) results.

2604.16429 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph 版本更新

(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models

(稀疏) 注意细节:在基于机器学习的天气预测模型中保持频谱保真度

Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent

发表机构 * AMLab(阿姆斯特丹机器学习实验室) University of Amsterdam(阿姆斯特丹大学)

AI总结 本文提出Mosaic模型,通过学习到的函数扰动生成集合成员,并利用网格对齐的块稀疏注意力机制,在原生分辨率网格上操作,以线性成本捕捉长距离依赖关系,从而在1.5°分辨率下达到或超越在更精细分辨率上训练的模型的性能,并在1.5°模型中取得了最先进的结果。

Comments Accepted to ICML 2026

详情
AI中文摘要

我们介绍Mosaic,一种概率天气预测模型,旨在解决基于机器学习的天气预测中频谱退化问题的三种失败模式:频谱阻尼(统计学)、高频混叠(架构学)和残余高频泄漏(参数学)。Mosaic通过学习的功能扰动生成集合成员,并通过网格对齐的块稀疏注意力机制在原分辨率网格上操作,该机制是一种硬件对齐的机制,通过在空间相邻查询之间共享键和值,以线性成本捕捉长距离依赖关系。在1.5°分辨率和214M参数下,Mosaic在关键变量上达到或超越了在6倍更精细分辨率上训练的模型的性能,并在1.5°模型中实现了最先进的结果,生成了经过良好校准的集合,其个体成员在所有解析频率上表现出近乎完美的频谱对齐。一个24成员、10天的预测在单个H100 GPU上不到12秒。代码可在https://github.com/maxxxzdn/mosaic上获得。

英文摘要

We introduce Mosaic, a probabilistic weather forecasting model that addresses three failure modes of spectral degradation in ML-based weather prediction: spectral damping (statistical), high-frequency aliasing (architectural), and residual high-frequency leakage (parametric). Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via mesh-aligned block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on 6$\times$ finer resolution on key variables and achieves state-of-the-art results among 1.5° models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12 s on a single H100 GPU. Code is available at https://github.com/maxxxzdn/mosaic.

2603.27341 2026-05-19 cs.AI cs.CV cs.LG 版本更新

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

外科AI的比较研究:数据、计算和扩展的潜力与局限

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

发表机构 * Center for Applied AI, Chicago Booth(芝加哥大学布斯商学院应用人工智能中心) Surgical Data Science Collective(外科数据科学联盟) Children’s National Hospital(国家儿童医院) Operations Management & Tolan Center for Healthcare, Chicago Booth(芝加哥大学布斯商学院运营管理与Tolan医疗中心)

AI总结 本文通过2026年最先进的AI方法,研究了外科手术工具检测中的性能和限制,发现即使使用多十亿参数模型和大量训练数据,当前的视觉语言模型在神经外科手术工具检测任务中仍表现不足,且模型规模和训练时间的增加对性能提升效果有限,表明当前AI在手术应用中仍面临显著挑战。

详情
AI中文摘要

最近的人工智能(AI)模型在多个生物医学任务基准上已匹配或超越了人类专家,但特别是在外科手术基准方面,这些基准往往缺失于主要的医学基准套件中。由于手术需要整合多种任务,一般能力的AI模型可能成为协作工具,如果性能可以得到提升。一方面,通过扩展架构大小和训练数据的常规方法具有吸引力,尤其是由于每年有数百万小时的手术视频数据生成。另一方面,为AI训练准备手术数据需要显著更高的专业水平,并且在该数据上训练需要昂贵的计算资源。这些权衡描绘了现代AI是否以及在多大程度上能够帮助外科实践的不确定图景。在本文中,我们通过使用2026年最先进的AI方法进行外科手术工具检测的案例研究来探讨这个问题。我们证明,即使使用多十亿参数模型和大量训练,当前的视觉语言模型在看似简单的神经外科手术工具检测任务中仍表现不足。此外,我们展示了扩展实验,表明增加模型规模和训练时间仅导致相关性能指标的边际改善。因此,我们的实验表明,当前模型在手术使用案例中仍可能面临重大障碍。此外,一些障碍无法通过额外的计算能力简单地“解决”并持续存在于不同的模型架构中,提出了数据和标签可用性是否是唯一限制因素的问题。我们讨论了这些约束的主要贡献者,并提出了潜在的解决方案。

英文摘要

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites. Since surgery requires integrating disparate tasks, generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

2603.23672 2026-05-19 cs.RO cs.CV 版本更新

Bio-Inspired Event-Based Visual Servoing for Ground Robots

生物启发的基于事件的视觉伺服控制用于地面机器人

Maral Mordad, Kian Behzad, Debojyoti Biswas, Noah J. Cowan, Milad Siami

发表机构 * Department of Electrical & Computer Engineering, Northeastern University(东北大学电气与计算机工程系) Laboratory for Computational Sensing and Robotics, Johns Hopkins University(约翰霍普金斯大学计算感知与机器人实验室) Department of Mechanical Engineering, Johns Hopkins University(约翰霍普金斯大学机械工程系)

AI总结 本文提出了一种基于生物启发的1D事件视觉伺服框架,用于在结构化环境中运行的地面机器人,通过动态视觉传感器和多模式刺激直接合成非线性状态反馈项,实现了高效低延迟的控制。

详情
AI中文摘要

生物感觉系统本质上是自适应的,能够过滤掉恒定刺激并优先处理相对变化,可能提高计算和代谢效率。受广泛动物主动感知行为的启发,本文介绍了一种原理性的1D基于事件的视觉伺服框架,用于在结构化环境中运行的地面机器人。利用动态视觉传感器(DVS),我们证明通过将固定的空间核应用于由结构化对数强度变化模式生成的异步事件流,所得到的网络事件流能够分析性地隔离特定的运动状态组合。我们建立了该事件率估计器的一般理论界,并证明线性和二次空间剖面分别隔离了机器人的速度和位置-速度乘积。利用这些特性,我们采用多模式刺激直接合成非线性状态反馈项,而无需传统状态估计。为克服事件感知中在平衡点固有的线性可观测性损失,我们提出了一种生物启发的主动感知极限环控制器。在1/10比例自主地面车辆上的实验验证证实了所提出直接感知方法的有效性、极低延迟和计算效率。

英文摘要

Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper introduces a principled 1D event-based visual servoing framework for ground robots operating in structured environments. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific combinations of kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot's velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state feedback term entirely without traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extremely low latency, and computational efficiency of the proposed direct-sensing approach.

2603.21787 2026-05-19 cs.CV 版本更新

Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTevent

在MTevent上对用于工业多类识别的循环式基于事件的目标检测进行基准测试

Lokeshwaran Manohar, Moritz Roidl

发表机构 * Chair of Material Handling and Warehousing, TU Dortmund University, Dortmund, Germany(多特蒙德工业大学物料搬运与仓储教席,多特蒙德,德国)

AI总结 本文研究了在MTevent数据集上使用循环ReYOLOv8s进行工业多类识别的性能,并通过非循环YOLOv8s作为基线分析时间记忆的影响,发现事件域预训练对性能提升更有效。

Comments Accepted at the Neuromorphic Field Robotics and Automation Workshop, ICRA 2026

详情
AI中文摘要

事件相机因提供高时间分辨率、高动态范围和减少运动模糊而在工业机器人中具有吸引力。然而,大多数基于事件的目标检测研究集中在户外驾驶场景或有限类别设置上。在本工作中,我们在MTevent上评估了循环ReYOLOv8s用于工业多类识别,并使用非循环YOLOv8s变体作为基线来分析时间记忆的影响。在MTevent验证分割上,最佳的从头开始的循环模型(C21)达到了0.285 mAP50,比非循环YOLOv8s基线(0.260)提高了9.6%。事件域预训练效果更显著:GEN1初始化的微调在剪辑长度21时达到了最佳整体结果0.329 mAP50,并且与从头开始训练不同,GEN1预训练模型在剪辑长度上持续改进。PEDRo初始化下降到0.251,表明源域预训练不匹配可能不如从头开始训练有效。持续失败模式主要由类别不平衡和人-物体交互主导。总体而言,我们将这项工作定位为对工业环境中循环事件基检测的聚焦基准测试和分析研究。

英文摘要

Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTevent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTevent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6\% relative improvement over the non-recurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
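The relative-improvement figure quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_change` is ours, and the GEN1 and PEDRo percentages are derived here from the mAP50 values in the abstract rather than reported by the authors.

```python
def relative_change(score: float, baseline: float) -> float:
    """Relative change of `score` with respect to `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


# mAP50 values taken from the abstract (MTevent validation split).
baseline = 0.260  # non-recurrent YOLOv8s
results = {
    "scratch recurrent (C21)": 0.285,
    "GEN1-initialized, clip length 21": 0.329,
    "PEDRo-initialized": 0.251,
}

for name, map50 in results.items():
    # The first line reproduces the 9.6% quoted in the abstract; the other
    # two are derived here and are not stated by the authors.
    print(f"{name}: {relative_change(map50, baseline):+.1f}%")
```

Note that these are relative changes; the absolute gain of the best scratch model over the baseline is 0.025 mAP50.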

2603.13652 2026-05-19 cs.CV 版本更新

Causal Attribution via Activation Patching

通过激活修补进行因果归因

Amirmohammad Izadi, Mohammadali Banayeeanzade, Alireza Mirrokni, Hosein Hasani, Mobin Bagherian, Faridoun Mehri, Mahdieh Soleymani Baghshah

发表机构 * Sharif University of Technology(谢里夫理工大学)

AI总结 本文提出了一种新的因果归因方法CAAP,通过直接干预内部激活来估计图像补丁对Vision Transformer预测的贡献,从而产生更准确和局部化的归因结果。

详情
AI中文摘要

针对Vision Transformers(ViTs)的归因方法旨在识别影响模型预测的图像区域,但产生忠实且良好的局部化归因仍具有挑战性。现有归因方法面临多个限制,基于梯度、相关性传播和注意力的方法依赖于局部近似,而扰动或优化方法则干预输入、令牌或替代物,而非内部补丁表示。关键挑战在于类别相关证据是通过跨层的补丁令牌相互作用形成的;仅操作输入变化、注意力权重或反向相关性信号的方法可能只能提供补丁重要性的间接代理,而非直接测试上下文化补丁表示的预测效果。我们提出通过激活修补进行因果归因(CAAP),通过直接干预内部激活来估计单个图像补丁对ViT预测的贡献,而非使用学习的掩码或合成扰动模式。对于每个补丁,CAAP将对应的源图像激活插入中性目标上下文中的中间层范围,并使用由此产生的目标类别分数作为归因信号。所得到的归因图反映了补丁相关内部表示对模型预测的因果贡献。因果干预作为一种原则性的测量方法,通过在初始表示形成后捕捉语义证据,同时避免晚期层的全局混合,这可能减少空间特异性。在多个ViT骨干网络和标准度量指标上,CAAP在各种设置中均优于现有方法,并产生更忠实且局部化的归因结果。

英文摘要

Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing attribution methods face several limitations, with gradient-based, relevance-propagation, and attention-based methods relying on local approximations, while perturbation or optimization-based methods intervene on inputs, tokens, or surrogates rather than internal patch representations. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers; methods that operate only on input changes, attention weights, or backward relevance signals may therefore provide indirect proxies for patch importance rather than directly testing the predictive effect of contextualized patch representations. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations to the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP consistently outperforms existing methods in various settings and produces more faithful and localized attributions.
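The patching procedure described in the abstract can be illustrated on a toy model. The sketch below is not the authors' implementation: the "model" is a hypothetical stack of mixing layers over scalar tokens, the class score is a plain token average, and the names `layer`, `forward`, and `attribution_map` are assumptions made for illustration. It only shows the control flow: cache source activations, then, for each patch, overwrite that patch's activation in a neutral run over an intermediate layer range and read off the score.

```python
def layer(h, mix=0.1):
    """Toy 'transformer layer': every token adds a fraction of the token mean."""
    mean = sum(h) / len(h)
    return [x + mix * mean for x in h]


def forward(tokens, n_layers, patch=None):
    """Run the toy model and cache the input activations of every layer.

    `patch` is (token_index, layer_range, source_cache): at each layer in
    `layer_range`, the chosen token's activation is overwritten with the
    cached source activation before the layer is applied.
    """
    h = list(tokens)
    cache = []
    for l in range(n_layers):
        if patch is not None and l in patch[1]:
            idx, _, src_cache = patch
            h[idx] = src_cache[l][idx]
        cache.append(list(h))
        h = layer(h)
    score = sum(h) / len(h)  # stand-in for a target-class score
    return score, cache


def attribution_map(source, neutral, n_layers, layers):
    """One score per patch: patch that token from the source into the neutral run."""
    _, src_cache = forward(source, n_layers)
    return [
        forward(neutral, n_layers, patch=(i, layers, src_cache))[0]
        for i in range(len(source))
    ]


source = [0.0, 2.0, 0.0, 1.0]   # hypothetical patch tokens of the source image
neutral = [0.0] * 4             # hypothetical neutral target context
amap = attribution_map(source, neutral, n_layers=4, layers=range(1, 3))
print(amap)
```

Because patching starts at layer 1 rather than layer 0, even the zero-valued source tokens carry some contextual signal picked up from their neighbours, which is the sense in which the intervention acts on contextualized representations rather than on raw inputs.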

2603.10935 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Spherical VAE with Cluster-Aware Feasible Regions: Guaranteed Prevention of Posterior Collapse

具有聚类感知可行区域的球形VAE:保证防止后验崩溃

Zegu Zhang, Jian Zhang

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出了一种理论保证非崩溃解的新型框架,通过利用球壳几何和聚类感知约束,防止VAE中的后验崩溃问题,并在合成和现实数据集上实现了100%的崩溃预防。

Comments 8 pages, 6 figures

详情
AI中文摘要

变分自编码器(VAEs)经常受到后验崩溃的影响,即近似后验退化为先验时,潜在变量变得不含信息。尽管最近的研究将崩溃描述为由数据协方差性质决定的相变,但现有方法主要旨在避免而非消除崩溃。我们引入了一种新的框架,通过利用球壳几何和聚类感知约束,从理论上保证非崩溃解。我们的方法将数据变换到球壳上,通过K-means计算最优聚类分配,并在聚类内方差$W$和崩溃损失$δ_{\text{collapse}}$之间定义一个可行区域。我们证明当重构损失被限制在这个区域内时,崩溃解在数学上被排除在可行参数空间之外。关键的是,我们引入了范数约束机制,确保解码器输出与球壳几何保持兼容,而不限制表示能力。与以往方法不同,我们的方法提供了严格的理论保证,计算开销小,且不对解码器输出施加限制。在合成和真实数据集上的实验表明,在传统VAE完全失败的条件下,该方法实现了100%的崩溃预防,重构质量达到或超过最先进的方法。我们的方法不需要显式的稳定性条件(例如$σ^2 < λ_{\max}$),并且适用于任意神经网络架构。代码可在https://github.com/tsegoochang/spherical-vae-with-Cluster获取。

英文摘要

Variational autoencoders (VAEs) frequently suffer from posterior collapse, where the latent variables become uninformative as the approximate posterior degenerates to the prior. While recent work has characterized collapse as a phase transition determined by data covariance properties, existing approaches primarily aim to avoid rather than eliminate collapse. We introduce a novel framework that theoretically guarantees non-collapsed solutions by leveraging spherical shell geometry and cluster-aware constraints. Our method transforms data to a spherical shell, computes optimal cluster assignments via K-means, and defines a feasible region between the within-cluster variance $W$ and collapse loss $δ_{\text{collapse}}$. We prove that when the reconstruction loss is constrained to this region, the collapsed solution is mathematically excluded from the feasible parameter space. Critically, we introduce norm constraint mechanisms that ensure decoder outputs remain compatible with the spherical shell geometry without restricting representational capacity. Unlike prior approaches, our method provides a strict theoretical guarantee with minimal computational overhead without imposing constraints on decoder outputs. Experiments on synthetic and real-world datasets demonstrate 100\% collapse prevention under conditions where conventional VAEs completely fail, with reconstruction quality matching or exceeding state-of-the-art methods. Our approach requires no explicit stability conditions (e.g., $σ^2 < λ_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/spherical-vae-with-Cluster.
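A rough sketch of how the two ends of such a feasible region could be computed is given below. It rests on assumptions that the abstract does not confirm: the shell is taken to be the unit sphere, $W$ is taken as the mean squared distance of each point to its K-means centroid, and $δ_{\text{collapse}}$ is taken as the mean squared distance to the global mean (the loss of a decoder that ignores the latent code and outputs a constant). All function names are hypothetical, and the K-means loop uses a deterministic initialization purely for illustration.

```python
import math


def to_shell(x, radius=1.0):
    """Project a vector onto a sphere of the given radius (assumed shell)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [radius * v / norm for v in x]


def mean(points):
    dim = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]


def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd iterations; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels, centers


def feasible_region(points, k):
    """Return (W, delta_collapse) under the assumptions stated above."""
    labels, centers = kmeans(points, k)
    w = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels)) / len(points)
    g = mean(points)
    delta_collapse = sum(sqdist(p, g) for p in points) / len(points)
    return w, delta_collapse


raw = [(1.0, 0.1), (-1.0, 0.1), (1.0, -0.1), (-1.0, -0.1)]
data = [to_shell(p) for p in raw]
W, delta = feasible_region(data, k=2)
print(f"W = {W:.4f}, delta_collapse = {delta:.4f}")
```

Under these definitions $W \le δ_{\text{collapse}}$ always holds, since the total variance splits into within-cluster and between-cluster parts; a reconstruction loss kept strictly below $δ_{\text{collapse}}$ cannot be attained by a constant-output decoder.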

2603.00607 2026-05-19 cs.CV cs.AI 版本更新

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

IdGlow: 多主体生成中的动态身份调节

Honghao Cai, Xiangyuan Wang, Jing Li, Yunhao Bai, Tianze Zhou, Haohua Chen, Chao Hui, Changhao Qiao, Runqi Wang, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li

发表机构 * Xiaohongshu Inc.(小红书公司) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 本文提出IdGlow框架,通过任务自适应的时间步调度和视觉语言模型解决多主体生成中的稳定性与可塑性矛盾,提升面部真实感与商业级美学质量。

详情
AI中文摘要

多主体图像生成需要在一致的场景中无缝协调多个参考身份。然而,现有方法依赖刚性空间掩码或局部注意力,往往在需要复杂结构变形的任务中(如保持身份的年龄变换)面临'稳定性-可塑性困境'。为此,我们提出IdGlow,一种基于流匹配扩散模型的无掩码、分阶段框架。在监督微调(SFT)阶段,我们引入任务自适应的时间步调度,与扩散生成动力学对齐:一种线性衰减调度,逐步放松约束以生成自然群体组成,以及一个时间门控机制,将身份注入集中于关键语义窗口,成功保留成人面部语义而不覆盖儿童样结构。为解决属性泄漏和语义模糊问题而无需显式布局输入,我们进一步整合了基于badcase驱动的视觉语言模型(VLM)进行精确的上下文感知提示合成。在第二阶段,我们设计了细粒度群体级直接偏好优化(DPO)方法,采用加权边距公式,同时消除多主体伪影、提升纹理和谐度,并重新校准身份保真度以适应现实分布。在两个具有挑战性的基准测试——直接多人物融合和年龄变换群体生成——上的大量实验表明,IdGlow从根本上缓解了稳定性-可塑性冲突,实现了在最先进的面部保真度和商业级美学质量之间的优越帕累托平衡。

英文摘要

Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.
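The two timestep schedules named in the abstract can be pictured with simple scalar functions of a normalized time `t` in [0, 1]. This is a hypothetical sketch: the abstract gives neither the functional forms, the direction of time, nor the window boundaries, so the endpoint weights and the window `[0.3, 0.7]` below are placeholders chosen for illustration.

```python
def linear_decay(t: float, w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Constraint weight moving linearly from w_start at t=0 to w_end at t=1."""
    return w_start + (w_end - w_start) * t


def temporal_gate(t: float, lo: float = 0.3, hi: float = 0.7) -> float:
    """Identity-injection gate: 1 inside the window [lo, hi], 0 outside.

    The window bounds are placeholders, not values from the paper.
    """
    return 1.0 if lo <= t <= hi else 0.0


for step in range(0, 11, 2):
    t = step / 10
    print(f"t={t:.1f}  constraint={linear_decay(t):.2f}  gate={temporal_gate(t):.0f}")
```

The first schedule relaxes a constraint gradually over the whole trajectory, while the second confines an intervention to a bounded window, which matches the abstract's distinction between progressive relaxation and concentrated identity injection.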

2602.22667 2026-05-19 cs.CV 版本更新

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

单目开放词汇占用预测用于室内场景

Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 该研究提出了一种仅几何监督的方法,用于室内场景的单目开放词汇占用预测,通过引入基于Poisson的不透明度感知方法和渐进式温度衰减调度,提高了几何与语义对齐的稳定性和精度,并在Occ-ScanNet数据集上取得了较高的IoU和mIoU。

Comments Accepted at CVPR2026 Oral

详情
AI中文摘要

开放词汇3D占用预测对具身智能体至关重要,它们需要理解复杂的室内环境,其中语义类别丰富且不断演变,超出固定的分类体系。尽管最近的研究已在户外驾驶场景中探索了开放词汇占用预测,但这些方法迁移到室内时效果不佳,因为室内几何更密集、布局更复杂、语义也细粒度得多。为应对这些挑战,我们采用仅几何监督的范式,只使用二值占用标签(占用 vs 空闲)。我们的框架建立在3D语言嵌入高斯之上,它作为统一的中间表示,将细粒度3D几何与语言对齐的语义嵌入耦合起来。在几何方面,我们发现现有的高斯到占用算子在如此弱的监督下无法收敛,因此引入一种不透明度感知的、基于Poisson的方法来稳定体积聚合。在语义方面,直接对齐渲染特征与开放词汇分割特征会受到特征混合的影响;因此我们提出渐进式温度衰减调度,在溅射过程中逐步锐化不透明度,从而加强高斯与语言的对齐。在Occ-ScanNet上,我们的框架在开放词汇设置下取得59.50 IoU和21.05 mIoU,在IoU上超过所有现有占用方法,并在mIoU上大幅优于先前的开放词汇方法。代码将在https://github.com/JuIvyy/LegoOcc发布。

英文摘要

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
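One way to read the two mechanisms in the abstract is sketched below. Both readings are assumptions rather than the paper's formulation: "Poisson-based" is interpreted as treating the summed, opacity-weighted Gaussian contributions at a voxel as a Poisson intensity, so that the occupancy probability is one minus the probability of zero events; "temperature decay" is interpreted as a temperature-scaled sigmoid whose temperature shrinks during training. The function names and default temperatures are hypothetical.

```python
import math


def occupancy_prob(contributions):
    """P(occupied) = 1 - exp(-lambda), with lambda the summed contributions.

    `contributions` holds opacity-weighted Gaussian densities at one voxel
    (an assumed interpretation of 'opacity-aware, Poisson-based').
    """
    lam = sum(contributions)
    return 1.0 - math.exp(-lam)


def temperature(step, total_steps, t_start=1.0, t_end=0.1):
    """Exponentially decay the temperature from t_start to t_end (assumed form)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac


def opacity(logit, temp):
    """Temperature-scaled sigmoid: lower temperature gives sharper opacities."""
    return 1.0 / (1.0 + math.exp(-logit / temp))


print(occupancy_prob([0.5, 0.5]))
for step in (0, 50, 100):
    temp = temperature(step, 100)
    print(step, round(temp, 3), round(opacity(1.0, temp), 4), round(opacity(-1.0, temp), 4))
```

As the temperature falls, opacities with positive logits move toward 1 and those with negative logits toward 0, so each rendered feature is dominated by fewer Gaussians, which is one plausible route to the reduced feature mixing the abstract describes.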

2602.20200 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

全局先验与局部一致性:双内存增强的视觉-语言-动作模型用于高效机器人操作

Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) PengCheng Laboratory(鹏城实验室) Shenzhen Loop Area Institute(深圳河套学院) Huawei Noah’s Ark Lab(华为诺亚方舟实验室)

AI总结 本文提出OptimusVLA模型,通过引入全局先验记忆和局部一致性记忆,解决机器人操作中动作生成效率低和鲁棒性差的问题,从而在多个基准测试中实现了更高的成功率和更快的推理速度。

Comments Accepted by CVPR 2026

详情
AI中文摘要

分层视觉-语言-动作(VLA)模型已迅速成为机器人操作的主导范式。它通常由用于感知与理解的视觉-语言骨干网络和用于动作生成的生成式策略组成。然而,其性能日益受到动作生成过程的制约。(i) 推理效率低。各向同性噪声先验与目标动作分布之间存在显著的分布差距,这增加了去噪步数和不可行样本的出现率。(ii) 鲁棒性差。现有策略仅以当前观测为条件,忽视了历史序列的约束,因而缺乏对任务进展和时间一致性的感知。为了解决这些问题,我们提出OptimusVLA,一种包含全局先验记忆(GPM)和局部一致性记忆(LCM)的双记忆VLA框架。GPM用从语义相似轨迹中检索到的任务级先验替代高斯噪声,从而缩短生成路径并减少函数评估次数(NFE)。LCM对已执行的动作序列进行动态建模以推断任务进展,并注入学习到的一致性约束,以保证轨迹的时间连贯性和平滑性。在三个仿真基准上,OptimusVLA始终优于强基线:在LIBERO上达到98.6%的平均成功率,在CALVIN上比pi_0提高13.5%,在RoboTwin 2.0 Hard上达到38%的平均成功率。在真实世界评估中,OptimusVLA在泛化和长时程测试套件上均排名第一,分别比pi_0高出42.9%和52.4%,同时实现了2.9倍的推理加速。

英文摘要

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency. There is a pronounced distributional gap between isotropic noise priors and target action distributions, which increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA ranks best on the Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering a 2.9x inference speedup.
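The idea of replacing pure Gaussian noise with a retrieved task-level prior can be sketched as a nearest-neighbour lookup. Everything below is a hypothetical illustration rather than the OptimusVLA design: the memory is a list of (embedding, trajectory) pairs, actions are scalars, similarity is cosine similarity, and the names `retrieve_prior` and `initial_sample` as well as the `noise_scale` value are assumptions.

```python
import math
import random


def cosine(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b)))


def retrieve_prior(query, memory, k=2):
    """Average the trajectories of the k entries most similar to `query`."""
    ranked = sorted(memory, key=lambda entry: cosine(query, entry[0]), reverse=True)[:k]
    horizon = len(ranked[0][1])
    return [sum(traj[t] for _, traj in ranked) / len(ranked) for t in range(horizon)]


def initial_sample(query, memory, noise_scale=0.1, seed=0):
    """Start the generative process near the retrieved prior, not at pure noise."""
    rng = random.Random(seed)
    prior = retrieve_prior(query, memory)
    return [a + noise_scale * rng.gauss(0.0, 1.0) for a in prior]


# Hypothetical memory: (task embedding, action trajectory) pairs.
memory = [
    ([1.0, 0.0], [1.0, 1.0]),
    ([0.9, 0.1], [0.8, 1.2]),
    ([0.0, 1.0], [-5.0, -5.0]),
]
query = [1.0, 0.05]
prior = retrieve_prior(query, memory, k=2)
print(prior)
print(initial_sample(query, memory))
```

Starting closer to the target action distribution is what would allow fewer denoising steps, which is the efficiency argument the abstract makes for GPM.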

2602.12755 2026-05-19 cs.CV 版本更新

Towards reconstructing experimental sparse-view X-ray CT data with diffusion models

迈向利用扩散模型重建实验稀疏视角X射线CT数据

Nelas J. Thomsen, Xinyuan Wang, Felix Lucka, Ezgi Demircan-Tureyen

发表机构 * Martin-Luther-University Halle-Wittenberg, Institute of Physics, Halle, Germany(马丁路德·哈勒-维滕贝格大学物理研究所,哈勒,德国) Centrum Wiskunde & Informatica, Computational Imaging Group, Amsterdam, The Netherlands(荷兰国家数学与计算机科学研究中心计算成像组,阿姆斯特丹,荷兰)

AI总结 本文研究了如何利用扩散模型重建稀疏视角X射线CT数据,探讨了训练数据不匹配(域偏移)和正向模型不匹配对实验数据应用的影响,发现域偏移在不同程度上影响模型性能,而正向模型不匹配可通过退火似然权重调度缓解。

Comments 5 pages + references, 4 figures, 2 tables, conference paper

详情
AI中文摘要

基于扩散的图像生成器有望作为先验,用于求解稀疏视角X射线计算机断层扫描(CT)等不适定逆问题。由于大多数研究只考虑合成数据,尚不清楚训练数据不匹配(“域偏移”)或正向模型不匹配是否会妨碍其成功应用于实验数据。我们测量了一个与合成Shepp-Logan体模相似的物理体模的CT数据,并在与其存在不同程度域偏移的合成图像数据集上训练扩散先验。随后,我们在分解扩散采样(Decomposed Diffusion Sampling)方案中使用这些先验,处理难度逐步增加、最终过渡到实验数据的稀疏视角CT数据集。结果表明,域偏移的作用是微妙的:严重的不匹配会导致模型崩溃和幻觉,而多样化的先验可以匹敌甚至超过匹配良好但较窄的先验。正向模型不匹配会把图像样本拉离先验流形,从而产生伪影,但可以通过退火的似然权重调度来缓解,同时还能提高计算效率。总体而言,我们证明了性能增益不会直接从合成数据迁移到实验数据,未来的方法开发必须在真实世界基准上进行验证。

英文摘要

Diffusion-based image generators are promising priors for ill-posed inverse problems like sparse-view X-ray Computed Tomography (CT). As most studies consider synthetic data, it is not clear whether training data mismatch (``domain shift'') or forward model mismatch complicate their successful application to experimental data. We measured CT data from a physical phantom resembling the synthetic Shepp-Logan phantom and trained diffusion priors on synthetic image data sets with different degrees of domain shift towards it. Then, we employed the priors in a Decomposed Diffusion Sampling scheme on sparse-view CT data sets with increasing difficulty leading to the experimental data. Our results reveal that domain shift plays a nuanced role: while severe mismatch causes model collapse and hallucinations, diverse priors match or exceed well-matched but narrow priors. Forward model mismatch pulls the image samples away from the prior manifold, which causes artifacts but can be mitigated with annealed likelihood weight schedules that also increase computational efficiency. Overall, we demonstrate that performance gains do not immediately translate from synthetic to experimental data, and future development must validate against real-world benchmarks.

2602.11130 2026-05-19 cs.LG cs.CV 版本更新

Meltdown: Circuits and Bifurcations in Point-Cloud-Conditioned 3D Diffusion Transformers

Meltdown: 点云条件化3D扩散变换器中的电路与分叉

Maximilian Plattner, Fabian Paischer, Johannes Brandstetter, Arturs Berzins

发表机构 * Institute for Machine Learning, JKU Linz(林茨约翰内斯·开普勒大学机器学习研究所)

AI总结 该研究探讨了点云条件化3D扩散变换器在输入变化下的失败模式,揭示了Meltdown现象,通过机制性案例研究展示了其成因,并提出了PowerRemap方法以抑制该现象。

详情
AI中文摘要

稀疏点云是3D表面重建中常见的输入模态,包括手术导航和自主感知等安全关键场景。近期的点云条件化3D扩散变换器通过利用学习到的先验,在该场景下取得了最先进的结果。我们表明这些模型在现实的输入变化下可能发生灾难性失败,并通过机理性案例研究解释其原因。我们识别出一种称为Meltdown的失败模式:对稀疏输入点云施加微小的表面内扰动,就可能使重建输出碎裂成数百个互不连通的部分。在我们研究的两个开放权重的最先进架构(WaLa、Make-a-Shape)上,对抗搜索在真实世界数据集(GSO、SimJEB)以及DDPM和DDIM两种采样下,能在89.9-100%的形状上触发Meltdown。我们沿前向传播追踪Meltdown:它取决于点在表面上分布的均匀程度,经点云编码器忠实传递,并由扩散骨干网络中单次去噪早期的交叉注意力写入所确定。扩散轨迹集合在该确定步骤附近表现出对称性破缺,这与反向过程的分岔一致。通过一组幅度匹配的对照实验,我们证明模型所确定的变量具有方向性,集中在该写入的扰动漂移的一个低秩子空间中。受此启发,我们提出PowerRemap,一种测试时控制方法,通过重塑局部写入的奇异值谱来抑制这种漂移,在WaLa上的挽救率为98.3%,在Make-a-Shape上为84.6%。这些结果将回路级的交叉注意力机制与轨迹级的失败解释联系起来,展示了机理分析如何解释并引导条件扩散变换器的行为。

英文摘要

Sparse point clouds are a common input modality for 3D surface reconstruction, including in safety-critical settings such as surgical navigation and autonomous perception. Recent point-cloud-conditioned 3D diffusion transformers achieve state-of-the-art results in this regime by leveraging learned priors. We show that these models can fail catastrophically under realistic input variation, and present a mechanistic case study of why. We identify a failure mode we call Meltdown: tiny on-surface perturbations to a sparse input point cloud can fracture the reconstructed output into hundreds of disconnected pieces. Adversarial search recovers Meltdown in 89.9-100% of shapes across the two open-weight state-of-the-art architectures we study (WaLa, Make-a-Shape) on real-world datasets (GSO, SimJEB) and under both DDPM and DDIM sampling. We trace Meltdown along the forward pass: it is governed by how uniformly the points are distributed on the surface, faithfully transduced through the point-cloud encoder, and committed by a single early-denoising cross-attention write in the diffusion backbone. Diffusion-trajectory ensembles exhibit symmetry-breaking near this commit step, consistent with a bifurcation of the reverse process. Through a suite of matched-magnitude controls, we show that the variable on which the model commits is directional, concentrated in a low-rank subspace of the write's perturbation drift. Motivated by this finding, we introduce PowerRemap, a test-time control that reshapes the singular spectrum of the localized write to suppress this drift, with rescue rates of 98.3% on WaLa and 84.6% on Make-a-Shape. Together, these results link a circuit-level cross-attention mechanism to a trajectory-level account of the failure, demonstrating how mechanistic analysis can explain and guide behavior in conditional diffusion transformers.

2601.14568 2026-05-19 cs.CV cs.AI 版本更新

Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement

打破精度-资源困境:一种轻量级自适应视频推理增强

Wei Ma, Shaowu Chen, Junjie Ye, Peichang Zhang, Lei Huang

发表机构 * State Key Laboratory of Radio Frequency Heterogeneous Integration (Shenzhen University)(射频异质异构集成全国重点实验室(深圳大学)) Institute of Applied Artificial Intelligence of the Guangdong–HongKong–Macao Greater Bay(粤港澳大湾区应用人工智能研究院) Henan Academy of Science Applied Physics Institute Co.,Ltd.(河南省科学院应用物理研究所有限公司)

AI总结 本文提出了一种轻量级自适应视频推理增强框架,通过动态切换不同规模的模型来平衡资源利用与推理性能。

Comments 5 pages, 5 figures

详情
AI中文摘要

现有的视频推理(VI)增强方法通常通过扩大模型规模和采用复杂的网络架构来提高性能。尽管这些方法展示了最先进的性能,但往往忽视了资源效率和推理有效性之间的权衡,导致资源利用效率低下和次优的推理性能。为了解决这个问题,本文开发了一种基于关键系统参数和推理相关指标的模糊控制器(FC-r)。在FC-r的指导下,提出了一种VI增强框架,利用相邻视频帧中目标的时空相关性。根据目标设备的实时资源条件,该框架可以在VI过程中动态切换不同规模的模型。实验结果表明,所提出的方法有效实现了资源利用和推理性能之间的平衡。

英文摘要

Existing video inference (VI) enhancement methods typically aim to improve performance by scaling up model sizes and employing sophisticated network architectures. While these approaches demonstrate state-of-the-art performance, they often overlook the trade-off between resource efficiency and inference effectiveness, leading to inefficient resource utilization and suboptimal inference performance. To address this problem, a fuzzy controller (FC-r) is developed based on key system parameters and inference-related metrics. Guided by the FC-r, a VI enhancement framework is proposed that leverages the spatiotemporal correlation of targets across adjacent video frames. Given the real-time resource conditions of the target device, the framework can dynamically switch between models of varying scales during VI. Experimental results demonstrate that the proposed method effectively balances resource utilization and inference performance.

2601.06943 2026-05-19 cs.CV cs.AI 版本更新

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

观看、推理与搜索:一个面向开放网络的视频深度研究基准,用于代理视频推理

Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Jisheng Dang, Rui Xu, Sen Hu, Jianheng Hou, Chengwei Qin, Xiaobin Hu, Kunyi Wang, Zhi Yang, Hao Peng, Hong Peng, Ronghao Chen, Huacan Wang

发表机构 * LZU(兰州大学) HKUST(GZ)(香港科技大学(广州)) UBC(不列颠哥伦比亚大学) FDU(复旦大学) PKU(北京大学) USC(美国南加州大学) NUS(新加坡国立大学) UCAS(中国科学院大学) HKUST(香港科技大学) QuantaAlpha(量子Alpha)

AI总结 本文提出VideoDR基准,用于研究开放网络环境下视频代理推理,通过跨帧视觉锚点提取、交互式网络检索和多跳推理验证,揭示了长检索链中维持初始视频锚点、目标漂移和长时程一致性等关键挑战。

详情
AI中文摘要

在现实世界视频问答场景中,视频往往只提供局部视觉线索,而可验证答案分布在开放网络中;模型因此需要联合执行跨帧线索提取、迭代检索和基于多跳推理的验证。为弥合这一差距,我们构建了首个视频深度研究基准VideoDR。VideoDR专注于视频条件的开放领域视频问答,要求进行跨帧视觉锚点提取、交互式网络检索和基于联合视频-网络证据的多跳推理;通过严格的真人标注和质量控制,我们获得了涵盖六个语义领域的高质量视频深度研究样本。我们评估了多种闭源和开源多模态大语言模型在Workflow和Agentic范式下的表现,结果表明Agentic并不始终优于Workflow:其收益取决于模型在长检索链中维持初始视频锚点的能力。进一步分析表明,目标漂移和长时程一致性是核心瓶颈。总之,VideoDR为研究开放网络环境下视频代理提供了系统性的基准,并揭示了下一代视频深度研究代理的关键挑战。

英文摘要

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

2601.06163 2026-05-19 cs.CV cs.LG 版本更新

Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking

Forget-It-All: 通过概念感知神经元掩码实现多概念机器去学习

Kaiyuan Deng, Bo Hui, Gen Li, Jie Ji, Minghai Qin, Geng Yuan, Xiaolong Ma

发表机构 * The University of Arizona(亚利桑那大学) The University of Tulsa(塔尔萨大学) Clemson University(克莱姆森大学) Western Digital Corporation(西部数据公司) University of Georgia(佐治亚大学)

AI总结 该研究提出Forget-It-All框架,通过利用模型稀疏性,解决多概念去学习问题,有效提升去学习效果并保持生成质量。

Comments Accepted to ICML 2026

Journal ref Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

文本到图像(T2I)扩散模型的广泛应用引发了对其可能生成版权、不当或敏感图像的担忧。作为实际解决方案,机器去学习旨在在不重新训练的情况下删除不需要的概念。尽管现有方法在单概念去学习中有效,但去除多个概念时往往面临显著挑战,包括去学习效果、生成质量和对超参数和数据集的敏感性。我们通过利用模型稀疏性,从独特角度看待多概念去学习,并提出Forget It All(FIA)框架。FIA首先引入对比概念显著性以量化每个权重连接对目标概念的贡献。然后通过结合时间信息和空间信息,识别出概念敏感神经元,确保只选择那些一致响应目标概念的神经元。最后,FIA从识别的神经元中构建掩码,并将其融合成统一的多概念掩码,其中对一般内容生成有广泛支持的无概念神经元被保留,而概念特定神经元被修剪以去除目标。FIA是无训练的,需要最少超参数调整即可用于新任务,实现即插即用。在三个不同的去学习任务上进行了广泛的实验,证明FIA在多概念去学习中实现了更可靠的性能,提高了遗忘效果同时保持生成的保真度和质量。代码可在https://github.com/kaiyuan02415/Forget-It-All获取。

英文摘要

The widespread adoption of text-to-image (T2I) diffusion models has raised concerns about their potential to generate copyrighted, inappropriate, or sensitive imagery. As a practical solution, machine unlearning aims to erase unwanted concepts without retraining from scratch. While most existing methods are effective for single-concept unlearning, they often struggle when removing multiple concepts, causing significant challenges in unlearning effectiveness, generation quality, and sensitivity to hyperparameters and datasets. We take a unique perspective on multi-concept unlearning by leveraging model sparsity and propose the Forget It All (FIA) framework. FIA first introduces Contrastive Concept Saliency to quantify each weight connection's contribution to a target concept. It then identifies Concept Sensitive Neurons by combining temporal and spatial information, ensuring that only neurons consistently responsive to the target concept are selected. Finally, FIA constructs masks from the identified neurons and fuses them into a unified multi-concept mask, where Concept Agnostic Neurons that broadly support general content generation are preserved while concept-specific neurons are pruned to remove the targets. FIA is training-free and requires minimal hyperparameter tuning for new tasks, enabling plug-and-play use. Extensive experiments across three distinct unlearning tasks demonstrate that FIA achieves more reliable multi-concept unlearning, improving forgetting effectiveness while maintaining generation fidelity and quality. Code is available at https://github.com/kaiyuan02415/Forget-It-All
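The selection-and-fusion logic described in the abstract can be pictured with plain set operations. This is a hypothetical sketch, not the FIA implementation: saliency scores are made-up numbers, neurons are integer indices, "consistently responsive" is interpreted as exceeding a threshold at every timestep, and the names `consistent_neurons` and `fuse_masks` are assumptions.

```python
def consistent_neurons(saliency_per_step, threshold):
    """Indices whose saliency exceeds `threshold` at every timestep."""
    per_step = [
        {i for i, s in enumerate(step) if s > threshold}
        for step in saliency_per_step
    ]
    return set.intersection(*per_step)


def fuse_masks(concept_masks, agnostic):
    """Union the per-concept masks, then spare the concept-agnostic neurons."""
    return set().union(*concept_masks.values()) - agnostic


# Hypothetical saliency of three neurons at two diffusion timesteps.
saliency = [[0.9, 0.1, 0.8], [0.7, 0.6, 0.9]]
selected = consistent_neurons(saliency, threshold=0.5)
print(selected)

# Hypothetical per-concept masks and a set of concept-agnostic neurons to keep.
masks = {"concept_a": {0, 2}, "concept_b": {2, 3}}
pruned = fuse_masks(masks, agnostic={2})
print(pruned)
```

In this toy run neuron 2 is flagged by both concepts yet preserved because it is marked concept-agnostic, mirroring the abstract's point that broadly useful neurons are kept while concept-specific ones are pruned.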

2601.06162 2026-05-19 cs.LG cs.CV 版本更新

Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models

忘却众多,忘却正确:扩散模型中可扩展且精确的概念反学习

Kaiyuan Deng, Gen Li, Yang Xiao, Bo Hui, Xiaolong Ma

发表机构 * The University of Arizona(亚利桑那大学) Clemson University(克莱姆森大学) The University of Tulsa(塔尔萨大学)

AI总结 本文提出了一种名为ScaPre的统一框架,用于在大规模扩散模型中实现精确的概念反学习,通过解决冲突更新、不精确机制和依赖额外数据的问题,提高了反学习的效率和精度。

Comments Accepted at ICLR 2026

Journal ref International Conference on Learning Representations (ICLR) 2026

详情
AI中文摘要

文本到图像的扩散模型已取得显著进展,但其使用引发了版权和滥用问题,促使研究机器反学习。然而,将多概念反学习扩展到大规模场景仍然困难,因为存在三个挑战:(i)冲突的权重更新会阻碍反学习或降低生成质量;(ii)不精确的机制会导致对相似内容的损害;(iii)依赖额外数据或模块,造成可扩展性瓶颈。为了解决这些问题,我们提出了可扩展-精确概念反学习(ScaPre),一种专门针对大规模反学习的统一框架。ScaPre引入了冲突感知的稳定设计,整合了谱迹正则化和几何对齐,以稳定优化、抑制冲突并保持全局结构。此外,Informax解耦器识别与概念相关的参数并自适应地重新加权更新,严格将反学习限制在目标子空间内。ScaPre产生了一个高效的闭式解,无需额外数据或子模型。在对象、风格和显性内容上的全面实验表明,ScaPre能够有效移除目标概念并保持生成质量。它比最佳基线在可接受的质量限制内能忘却多达$\times \mathbf{5}$更多的概念,实现了大规模反学习的最先进精度和效率。代码可在https://github.com/kaiyuan02415/scapre获取。

英文摘要

Text-to-image diffusion models have achieved remarkable progress, yet their use raises copyright and misuse concerns, prompting research into machine unlearning. However, extending multi-concept unlearning to large-scale scenarios remains difficult due to three challenges: (i) conflicting weight updates that hinder unlearning or degrade generation; (ii) imprecise mechanisms that cause collateral damage to similar content; and (iii) reliance on additional data or modules, creating scalability bottlenecks. To address these, we propose Scalable-Precise Concept Unlearning (ScaPre), a unified framework tailored for large-scale unlearning. ScaPre introduces a conflict-aware stable design, integrating spectral trace regularization and geometry alignment to stabilize optimization, suppress conflicts, and preserve global structure. Furthermore, an Informax Decoupler identifies concept-relevant parameters and adaptively reweights updates, strictly confining unlearning to the target subspace. ScaPre yields an efficient closed-form solution without requiring auxiliary data or sub-models. Comprehensive experiments on objects, styles, and explicit content demonstrate that ScaPre effectively removes target concepts while maintaining generation quality. It forgets up to $\times \mathbf{5}$ more concepts than the best baseline within acceptable quality limits, achieving state-of-the-art precision and efficiency for large-scale unlearning. Code is available at https://github.com/kaiyuan02415/scapre

2601.01593 2026-05-19 cs.CV cs.MM 版本更新

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

超越补丁:面向多模态少样本字体生成的全局感知自回归模型

Haonan Cai, Yuxuan Luo, Zhouhui Lian

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机研究所) School of Electronics Engineering and Computer Science, Peking University(北京大学电子工程与计算机科学学院)

AI总结 本文提出GAR-Font,一种多模态少样本字体生成的自回归框架,通过全局感知分词器、多模态风格编码器和后处理流程,提升了字体生成的全局风格一致性和质量。

Comments 28 pages, Accepted as CVPR 2026 Conference Paper

详情
AI中文摘要

手动字体设计是一个将风格化视觉概念转化为一致的字形集的复杂过程。在自动少样本字体生成(FFG)中,模型常常难以在有限参考下保持结构完整性和风格忠实性。尽管自回归(AR)模型展示了出色的生成能力,但其在FFG中的应用受限于传统的补丁级标记化,这忽略了对字体合成至关重要的全局依赖关系。此外,现有FFG方法仍局限于图像到图像的范式,仅依赖视觉参考,忽略了语言在传达字体设计风格意图中的作用。为了解决这些限制,我们提出了GAR-Font,一种新的AR框架用于多模态少样本字体生成。GAR-Font引入了一个全局感知的分词器,能够有效捕捉局部结构和全局风格模式,一个多模态风格编码器通过轻量级的语言-风格适配器提供灵活的风格控制,无需进行高强度的多模态预训练,并且一个后处理流程进一步增强了结构完整性和风格一致性。大量实验表明,GAR-Font在现有FFG方法上表现更优,尤其在保持全局风格忠实性和在文本风格指导下获得更高质量的结果方面表现出色。

英文摘要

Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.

2512.12598 2026-05-19 cs.CV 版本更新

Setting the Stage: Text-Driven Scene-Consistent Image Generation

设定舞台:基于文本的场景一致图像生成

Cong Xie, Che Wang, Yan Zhang, Ruiqi Yu, Han Zou, Zheng Pan, Zhenpeng Zhan

发表机构 * Global Business Unit, Baidu Inc.(百度公司全球业务部)

AI总结 本文研究了场景构建任务,通过结合参考场景图像和文本条件生成符合空间关系的演员图像,提出了一种新的数据构建管道和对应关系引导的注意力损失,以提高场景一致性和文本-图像一致性。

详情
AI中文摘要

我们专注于场景构建的基础任务:给定一个参考场景图像和一个文本条件,该条件指定了要在场景中生成的演员类别及其与场景的空间关系,目标是生成一个输出图像,该图像保持与参考图像相同的场景身份,同时根据文本中描述的空间关系正确生成演员。现有方法在这一任务上面临困难,主要是由于高质量配对数据稀缺和生成目标不明确。为克服数据瓶颈,我们提出了一种新的数据构建管道,结合现实照片、实体移除和图像到视频扩散模型,生成具有多样化场景、视角和正确实体-场景关系的训练对。此外,我们引入了一种新的对应关系引导的注意力损失,利用跨视角线索来强制与参考场景的空间对齐。在我们构建的场景一致基准上进行的实验表明,我们的方法在自动指标和人类偏好研究中均优于最先进的基线。我们的方法能够生成具有多样化视角和构图的图像,同时忠实遵循文本指令并保持参考场景身份。

英文摘要

We focus on the foundational task of Scene Staging: given a reference scene image and a text condition specifying an actor category to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same scene identity as the reference image while correctly generating the actor according to the spatial relation described in the text. Existing methods struggle with this task, largely due to the scarcity of high-quality paired data and unconstrained generation objectives. To overcome the data bottleneck, we propose a novel data construction pipeline that combines real-world photographs, entity removal, and image-to-video diffusion models to generate training pairs with diverse scenes, viewpoints and correct entity-scene relationships. Furthermore, we introduce a novel correspondence-guided attention loss that leverages cross-view cues to enforce spatial alignment with the reference scene. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image alignment than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method generates images with diverse viewpoints and compositions while faithfully following the textual instructions and preserving the reference scene identity.

2512.01030 2026-05-19 cs.CV 版本更新

Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

Lotus-2: 通过强大的图像生成模型推进几何密集预测

Jing He, Haodong Li, Mingzhi Sheng, Ying-Cong Chen

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出Lotus-2,一种两阶段确定性框架,用于稳定、准确和精细的几何密集预测;它充分利用预训练生成先验,在单目深度估计上取得了新的最先进结果,并在表面法线预测上取得了极具竞争力的结果。

Comments v3: Fixed some typos. Project page: https://lotus-2.github.io/

详情
AI中文摘要

从单张图像中恢复像素级几何属性本质上是病态问题,原因在于外观歧义以及2D观测与3D结构之间的非单射映射。尽管判别式回归模型通过大规模监督取得了强大性能,但其成功受限于可用数据的规模、质量和多样性,以及有限的物理推理能力。最近的扩散模型表现出强大的世界先验,能够编码从大规模图像-文本数据中学习到的几何和语义信息,但直接沿用其随机生成式建模方式对于确定性几何推断是次优的:前者是为多样且高保真的图像生成优化的,而后者需要稳定且准确的预测。在本文中,我们提出Lotus-2,一种面向稳定、准确和精细的几何密集预测的两阶段确定性框架,旨在提供最优的适配方案,以充分利用预训练生成先验。具体而言,在第一阶段,核心预测器采用单步确定性建模、干净数据目标以及轻量级局部连续性模块(LCM),生成无网格伪影的全局一致结构。在第二阶段,细节锐化器在由核心预测器定义的流形内执行受约束的多步整流流(rectified flow)细化,通过无噪声的确定性流匹配来增强精细几何。仅使用59K训练样本,即不到现有大规模数据集的1%,Lotus-2在单目深度估计上取得了新的最先进结果,并在表面法线预测上取得了极具竞争力的结果。这些结果表明,扩散模型可以作为确定性世界先验,使高质量的几何推理能够超越传统的判别式和生成式范式。

英文摘要

Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality, and diversity of available data, as well as by limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaptation protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.

2511.19320 2026-05-19 cs.CV 版本更新

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

SteadyDancer: 一种具有第一帧保留的和谐且一致的人像图像动画方法

Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, Kai Ma

发表机构 * State Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室) Platform and Content Group (PCG), Tencent(腾讯平台与内容部) Shanghai AI Lab(上海人工智能实验室)

AI总结 本文提出SteadyDancer,一种基于图像到视频(I2V)范式的框架,通过条件协调机制、协同姿态调节模块和分阶段解耦目标训练管道,实现了和谐一致的人像动画,并首次确保第一帧的鲁棒保留。

Comments 10 pages, with supp

详情
AI中文摘要

在人类图像动画中,保留第一帧身份的同时确保精确运动控制是一个根本性挑战。主导的参考到视频(R2V)范式的图像到运动绑定过程忽略了现实应用中常见的时空对齐问题,导致身份漂移和视觉伪影等失败。我们引入SteadyDancer,一种基于图像到视频(I2V)范式的框架,实现了和谐且一致的动画,并首次确保第一帧保留的鲁棒性。首先,我们提出条件协调机制以协调两个冲突条件,从而在不牺牲保真度的情况下实现精确控制。其次,我们设计协同姿态调节模块以生成适应性且一致的姿态表示,该表示高度兼容参考图像。最后,我们采用分阶段解耦目标训练管道,分层优化模型以实现运动保真度、视觉质量和时间一致性。实验表明,SteadyDancer在外观保真度和运动控制方面均达到最先进的性能,同时比可比方法需要显著更少的训练资源。该模型已公开发布在https://mcg-nju.github.io/steadydancer-web。

英文摘要

Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods. The model has been publicly released at https://mcg-nju.github.io/steadydancer-web.

2511.17392 2026-05-19 cs.CV 版本更新

MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

MorphSeek: 用于可变形图像配准的细粒度潜在表示级策略优化

Runxun Zhang, Yizhou Liu, Li Dongrui, Bo XU, Jingwei Wei

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Sun Yat-sen University(中山大学) Fudan University(复旦大学) Hebei Medical University(河北医科大学)

AI总结 本文提出MorphSeek,一种在潜在特征空间中进行细粒度策略优化的方法,用于解决可变形图像配准中的高维变形空间和体素级监督稀缺问题,通过引入随机高斯策略头和组相对策略优化,实现了高效探索和粗到细的优化,提升了配准的Dice系数和标签效率。

Comments 20 pages

详情
AI中文摘要

可变形图像配准(DIR)仍然是医学图像分析中的基本但具有挑战性的问题,主要由于密集位移场的高维变形空间和体素级监督的稀缺性。现有的强化学习框架通常将此空间投影到低维表示,限制了其捕捉空间变异形变的能力。我们提出MorphSeek,一种细粒度的表示级策略优化范式,将DIR重新表述为潜在特征空间中的连续优化过程。MorphSeek在编码器上引入随机高斯策略头,以建模潜在特征的分布,从而实现高效的探索和粗到细的优化。该框架通过组相对策略优化整合了无监督预热和弱监督微调,其中多轨迹采样稳定了训练并提高了标签效率。在三个3D配准基准(OASIS脑MRI、LiTS肝脏CT和腹部MR-CT)上,MorphSeek在保持高标签效率的同时,通过最小的参数成本和低步骤级延迟开销,实现了优于竞争基线的Dice改进。除了优化器细节外,MorphSeek推进了一种表示级策略学习范式,实现了空间一致且数据高效的形变优化,提供了一种原理上可行、不依赖于特定主干网络和优化器的可扩展视觉对齐解决方案。

英文摘要

Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.
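The group-relative step of Group Relative Policy Optimization mentioned in the abstract can be sketched as follows. This is a minimal illustration of the generic GRPO advantage normalization (each sampled trajectory is scored against the mean and standard deviation of its own group), not MorphSeek's actual implementation; the function name, the `eps` constant, and the example Dice rewards are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical Dice rewards for four trajectories sampled for the same image pair.
dice_rewards = [0.70, 0.74, 0.78, 0.82]
advantages = group_relative_advantages(dice_rewards)
```

Trajectories that beat their group average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.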

2511.16361 2026-05-19 cs.CV 版本更新

Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

多阶匹配网络用于无对齐深度超分辨率

Zhengxue Wang, Zhiqiang Yan, Yuan Wu, Guangwei Gao, Xiang Li, Jian Yang

发表机构 * PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology(计算机科学与工程学院PCA实验室,南京理工大学) School of Computing, National University of Singapore(新加坡国立大学计算机学院) School of Computer Science, Nankai University(南开大学计算机学院) PCA Lab, School of Intelligence Science and Technology, Nanjing University(智能科学与技术学院PCA实验室,南京大学)

AI总结 本文提出了一种无对齐框架Multi-Order Matching Network (MOMNet),通过多阶匹配机制和多阶聚合策略,从不对齐的RGB数据中提取并选择最相关的信息,以提高深度超分辨率的性能和泛化能力。

详情
AI中文摘要

最近的引导深度超分辨率方法基于深度和RGB之间严格空间对齐的假设,实现高质量的深度重建。然而,在现实场景中,严格对齐的RGB-D数据受到固有硬件限制(例如物理分离的RGB-D传感器)和不可避免的校准漂移(由机械振动或温度变化引起)的阻碍。因此,现有方法在应用于不对齐的现实场景时往往会出现不可避免的性能下降。在本文中,我们提出了Multi-Order Matching Network (MOMNet),一种新颖的无对齐框架,能够自适应地从不对齐的RGB中检索并选择最相关的信息。具体而言,我们的方法首先采用多阶匹配机制,联合执行零阶、一阶和二阶匹配,以在多阶特征空间中全面识别与深度一致的RGB信息。为了有效整合检索到的RGB和深度信息,我们进一步引入了由多个结构检测器组成的多阶聚合模块。该策略利用多阶先验作为提示,促进从RGB到深度的特征选择性转移。广泛的实验表明,MOMNet在未对齐和对齐的数据集上均实现了优越的性能和泛化能力。

英文摘要

Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves superior performance and generalization across both unaligned and aligned datasets.

2511.16309 2026-05-19 cs.CV cs.LG 版本更新

Sparse Autoencoders are Topic Models

稀疏自编码器是主题模型

Leander Girrbach, Zeynep Akata

发表机构 * Technical University of Munich (TUM), Munich Center for Machine Learning (MCML), Helmholtz Munich(慕尼黑工业大学(TUM)、慕尼黑机器学习中心(MCML)、亥姆霍兹慕尼黑研究中心)

AI总结 本文提出将稀疏自编码器(SAEs)视为主题模型的新视角,通过构建连续主题模型(CTM)来解释嵌入空间,并推导出SAE的目标作为最大后验估计器,从而揭示SAE特征是主题性组件而非可调节方向。

Comments ICML 2026

详情
AI中文摘要

稀疏自编码器(SAEs)被用于分析嵌入,但其作用和实用价值存在争议。我们提出了一种新的视角,通过展示它们可以自然地被理解为主题模型。我们受到潜在狄利克雷分配(LDA)的启发,提出了一种连续主题模型(CTM)用于嵌入空间,并在此模型下推导出SAE目标作为最大后验估计器。这种观点表明SAE特征是主题性组件而非可调节方向。为了验证我们的理论发现,我们引入了SAE-TM主题建模框架,该框架:(1)训练SAE以学习可重用的主题原子;(2)将它们解释为下游数据中的词分布;(3)将它们合并到任意数量的主题中而无需重新训练。SAE-TM在文本和图像数据集上比强大的基线产生更连贯的主题,同时保持多样性。最后,我们分析了图像数据集中的主题结构,并追踪了日本木版画中主题随时间的变化。我们的工作将SAEs定位为跨模态大规模主题分析的有效工具。代码可在https://github.com/ExplainableML/SAE-TM获取。

英文摘要

Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We propose a continuous topic model (CTM) inspired by Latent Dirichlet Allocation (LDA) for embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. To confirm our theoretical findings, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code is available at https://github.com/ExplainableML/SAE-TM .
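As a generic illustration of how a sparse reconstruction objective can arise as a maximum a posteriori estimator, consider the textbook sparse-coding case. This is not the paper's CTM derivation, which the abstract does not spell out; the dictionary $D$, noise scale $\sigma$, sparsity weight $\lambda$, and the Gaussian likelihood and Laplace prior are all assumptions made for the sketch.

```latex
\hat{z} = \arg\max_{z} \bigl[ \log p(x \mid z) + \log p(z) \bigr]
        = \arg\min_{z} \; \frac{1}{2\sigma^{2}} \lVert x - D z \rVert_2^2 + \lambda \lVert z \rVert_1,
\qquad x \mid z \sim \mathcal{N}(D z, \sigma^{2} I), \quad p(z) \propto \exp\bigl(-\lambda \lVert z \rVert_1\bigr).
```

Under these assumptions the reconstruction term comes from the likelihood and the sparsity penalty from the prior, which is the general shape of argument the abstract describes.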

2511.11934 2026-05-19 cs.LG cs.CV 版本更新

A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts

基于表示和训练范式转变的分布外检测系统分析

Claudio César Claros Olivares, Austin J. Brockmeier

发表机构 * Department of Electrical & Computer Engineering(电气与计算机工程系) University of Delaware(特拉华大学)

AI总结 本文以表示为中心的视角,系统评估了用于分布外(OOD)检测的置信度评分函数(CSFs),分析了不同架构、训练范式和数据集的影响,并提出基于PCA的投影过滤方法,以及一种基于神经坍塌(NC)度量预测有竞争力的检测器候选名单的方法。

详情
AI中文摘要

我们以表示为中心的视角,对用于分布外(OOD)检测的置信度评分函数(CSFs)进行了系统性基准评估。我们的研究涵盖CNN和ViT主干网络、多种训练范式、四个图像分类源数据集(CIFAR-10、CIFAR-100、SuperCIFAR-100和TinyImageNet),以及依据CLIP衍生的语义距离划分为近、中、远三个区间的OOD数据集。为了比较这些设置下的CSFs,我们采用了一种控制多重比较的排名流程,该流程在无阈值排名指标(AURC和AUGRC)下识别出统计上无法区分的优胜者所构成的顶级团(clique)。主要的实证发现是:哪一类检测器具有竞争力,更多取决于学到的表示,而不仅仅是分数设计。对于CNN和ViT,简单的概率分数在误分类检测中占主导地位。在CNN上,基于间隔(margin)的分数在近OOD区间最强,而NNGuide、fDBD和CTM等几何感知分数随着偏移程度的加剧变得更具竞争力。在微调的ViT上,顶级团主要由基于重建和基于残差的分数领衔。为了解释这些排名变化,我们使用神经坍塌(NC)指标分析最后一层表示。得到的图景在不同架构中是一致的:当表示坍塌程度更高且与分类器权重对齐更好时,原型感知和边界感知分数更强,而弱坍塌区间则更有利于基于梯度和基于流形的分数。基于这些见解,我们提出两项贡献:一种简单的基于PCA的投影过滤过程,可以提高检测器性能;以及一种利用已训练分类器计算的NC度量来预测其有竞争力的分布外检测器候选名单的方法,无需任何额外的OOD数据。

英文摘要

We present a systematic benchmark of confidence scoring functions (CSFs) for out-of-distribution (OOD) detection through a representation-centric lens. Our study spans CNN and ViT backbones, multiple training paradigms, four image-classification source datasets (CIFAR-10, CIFAR-100, SuperCIFAR-100, and TinyImageNet), and OOD datasets grouped into near, mid, and far regimes using CLIP-derived semantic distances. To compare CSFs across these settings, we employ a multiple-comparison-controlled rank pipeline that identifies top cliques of statistically indistinguishable winners under threshold-free ranking metrics (AURC and AUGRC). The main empirical finding is that the competitive detector family depends more on the learned representation than on score design alone. For both CNNs and ViTs, simple probabilistic scores dominate misclassification detection. On CNNs, margin-based scores are strongest in near-OOD regimes, while geometry-aware scores such as NNGuide, fDBD, and CTM become more competitive as shift severity increases. On fine-tuned ViTs, the top cliques are led mainly by reconstruction- and residual-based scores. To interpret these ranking shifts, we analyze the last-layer representation using Neural Collapse (NC) metrics. The resulting picture is consistent across architectures: prototype- and boundary-aware scores become stronger when the representation is more collapsed and better aligned with classifier weights, whereas weaker-collapse regimes favor gradient- and manifold-based scores. Building on these insights, we propose two contributions: a simple PCA-based projection-filtering procedure that improves detector performance, and an approach that uses NC measurements computed from a trained classifier to predict its competitive out-of-distribution detector shortlist, without requiring any additional OOD data.
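To make the threshold-free AURC metric named above concrete, here is a minimal sketch of the usual empirical area under the risk-coverage curve: samples are sorted from most to least confident, and the selective risk (error rate among the samples kept so far) is averaged over all coverage levels. The function name, the tie handling, and the toy scores are assumptions for illustration; the benchmark's own evaluation code may differ.

```python
def aurc(confidences, errors):
    """Area under the risk-coverage curve (lower is better).

    confidences: one score per sample, higher means more trusted.
    errors: 1 if the prediction is wrong (or the sample is OOD), 0 otherwise.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    cumulative_errors = 0
    risks = []
    for kept, i in enumerate(order, start=1):
        cumulative_errors += errors[i]
        risks.append(cumulative_errors / kept)  # selective risk at coverage kept/n
    return sum(risks) / len(risks)


scores = [0.9, 0.8, 0.7, 0.6]
good_ranking = aurc(scores, [0, 0, 0, 1])  # the single error gets the lowest confidence
bad_ranking = aurc(scores, [1, 0, 0, 0])   # the single error gets the highest confidence
```

Both rankings have the same overall error rate, yet the score that pushes the failure to the bottom obtains the smaller AURC, which is why the metric needs no decision threshold.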

2511.08704 2026-05-19 cs.CV cs.LG 版本更新

Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?

重新思考生成式图像预训练:我们距离规模化下一像素预测还有多远?

Xinchen Yan, Chen Liang, Lijun Yu, Adams Wei Yu, Yifeng Lu, Quoc V. Le

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 本文研究了自回归下一像素预测的扩展特性,这是一种简单、端到端但尚未充分探索的统一视觉模型框架。作者在32x32分辨率的图像上训练Transformer模型,并评估三个目标指标:下一像素预测目标、ImageNet分类准确率,以及以Fréchet距离衡量的生成式补全质量。研究发现,最优扩展策略高度依赖任务,且随着图像分辨率的增加,模型大小必须比数据量增长得更快。外推结果表明,主要瓶颈是计算量,而非训练数据量。按计算量每年增长四到五倍推算,逐像素图像建模有望在未来五年内变得可行。

Comments Accepted by ICML2026

详情
AI中文摘要

本文研究了自回归下一像素预测的扩展特性,这是一种简单、端到端但尚未充分探索的统一视觉模型框架。从32x32分辨率的图像开始,我们使用IsoFlops配置,在最高7e19 FLOPs的计算预算下训练了一系列Transformer模型,并评估了三个不同的目标指标:下一像素预测目标、ImageNet分类准确率,以及以Fréchet距离衡量的生成式补全质量。首先,最优扩展策略高度依赖于任务。即使在固定的32x32分辨率下,图像分类和图像生成的最优扩展特性也不相同,其中生成最优设置要求数据量的增长速度是分类最优设置的三到五倍。其次,随着图像分辨率的增加,最优扩展策略表明模型大小必须比数据量增长得快得多。令人惊讶的是,通过外推我们的发现,我们发现主要瓶颈是计算量,而不是训练数据量。随着计算量持续以每年四到五倍的速度增长,我们预测逐像素图像建模将在未来五年内变得可行。

英文摘要

This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at a resolution of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. Even at a fixed resolution of 32x32, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
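The five-year forecast rests on compounding the stated annual compute growth. A back-of-the-envelope sketch of that arithmetic is below; the helper name is hypothetical, and applying the multiplier to the 7e19 FLOPs budget is only an illustration, not a calculation taken from the paper.

```python
def compute_multiplier(annual_growth, years):
    """Total growth factor after compounding a fixed annual factor."""
    return annual_growth ** years


low = compute_multiplier(4, 5)    # 4x per year for five years
high = compute_multiplier(5, 5)   # 5x per year for five years

# Illustrative only: scale the largest budget quoted in the abstract.
largest_budget_flops = 7e19
projected_low = largest_budget_flops * low
projected_high = largest_budget_flops * high
```

Under these assumptions, five years of growth corresponds to roughly three orders of magnitude more compute (1024x to 3125x).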

2510.26635 2026-05-19 eess.IV cs.CV 版本更新

SAMRI: Segment Any MRI

SAMRI:分割任何MRI

Zhao Wang, Wei Dai, Thuy Thanh Dao, Steffen Bollmann, Hongfu Sun, Craig Engstrom, Shekhar S. Chandra

发表机构 * School of Electrical Engineering and Computer Science, The University of Queensland, Australia(昆士兰大学电气工程与计算机科学学院,澳大利亚)

AI总结 SAMRI是面向MRI专门适配的Segment Anything Model,借助框提示和点提示实现更优的全身MRI分割与快速标注,尤其擅长体积小但临床上重要的结构。

详情
AI中文摘要

摘要:SAMRI是面向MRI专门适配的Segment Anything Model,借助框提示和点提示实现快速标注,并取得了更优的全身MRI分割效果,尤其是在体积小但临床上关键的结构上。目的:现有的SAM适配版本将MRI视为通用模态,忽略了多变的组织对比度、强度不均匀性以及临床上重要的小结构。我们提出了一种MRI专用的基础模型,具有强大的全身分割和零样本泛化能力,可直接用于任何MRI标注任务。方法:SAMRI仅微调SAM(ViT-B/16)的掩码解码器,保持编码器冻结以保留预训练表示并消除冗余的前向计算,与全模型重新训练相比,训练时间减少94%,可训练参数减少96%,FLOPs减少约99%。训练使用了来自30个数据集的110万个2D切片-掩码对,涵盖47个目标、T1/T2/FLAIR/DWI对比度和全身解剖结构,采用focal-Dice损失和边界框(可附加点)提示。目标大小按掩码面积分层(小:<0.5%;中:0.5-3.5%;大:>3.5%),并通过Wilcoxon符号秩检验评估显著性。结果:SAMRI在框+点提示下,在47个目标上取得了平均DSC 0.87±0.11,比MedSAM(0.74±0.24)高17.6%(p < 0.05),其中小结构(+42.4%)和中等结构(+26.9%)的提升最大。在六个零样本数据集上,SAMRI取得了平均DSC 0.85,优于基线。通过交互界面在标准硬件上推理仅需约4.5 GB显存。结论:在大规模MRI专用语料上仅微调解码器,即可实现更优的全身分割和强大的零样本泛化能力,尤其是在体积小但临床上重要的结构上。公开的代码、预训练模型和交互界面使SAMRI可用于MRI分割研究和临床工作流程。

英文摘要

Summary: SAMRI is an MRI-specialized adaptation of the Segment Anything Model achieving superior whole-body MRI segmentation, particularly for small and clinically critical structures, through box and point prompts for rapid annotation. Purpose: Existing SAM adaptations treat MRI as a generic modality, overlooking variable tissue contrast, intensity inhomogeneity, and clinically important small structures. We propose an MRI-specialized foundation model with strong whole-body segmentation and zero-shot generalization for direct use on any MRI annotation task. Methods: SAMRI fine-tunes only the mask decoder of SAM (ViT-B/16), keeping encoders frozen to preserve pretrained representations and eliminate redundant passes, reducing training time by 94%, trainable parameters by 96%, and FLOPs by ~99% versus full-model retraining. Training used 1.1 million 2D slice-mask pairs from 30 datasets spanning 47 targets, T1/T2/FLAIR/DWI contrasts, and whole-body anatomy, with focal-Dice loss and bounding-box (with optional point) prompts. Sizes were stratified by mask area (small: <0.5%; medium: 0.5-3.5%; large: >3.5%), and significance assessed by the Wilcoxon signed-rank test. Results: SAMRI with box+point prompts achieved mean DSC 0.87 +/- 0.11 across 47 targets, outperforming MedSAM (0.74 +/- 0.24) by 17.6% (p < 0.05), with largest gains for small (+42.4%) and medium (+26.9%) structures. On six zero-shot datasets, SAMRI achieved mean DSC 0.85, outperforming baselines. Inference requires only ~4.5 GB VRAM through an interactive interface on standard hardware. Conclusion: Decoder-only fine-tuning on a large, MRI-specific corpus delivers superior whole-body segmentation with strong zero-shot generalization, particularly for small and clinically salient structures. Public code, pretrained models, and an interactive interface make SAMRI deployable for MRI segmentation research and clinical workflows.
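The evaluation quantities in the abstract (Dice similarity coefficient, size strata, and the relative gain over MedSAM) can be sketched in a few lines. This is an illustrative reading, not the authors' evaluation code: the function names are hypothetical, masks are taken as flat 0/1 lists, mask area is assumed to be a percentage of the image area, and treating two empty masks as a perfect match is a convention chosen here.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def size_stratum(mask, image_pixels):
    """Bucket a mask by the share of the image it covers, using the abstract's cutoffs."""
    area_pct = 100.0 * sum(mask) / image_pixels
    if area_pct < 0.5:
        return "small"
    if area_pct <= 3.5:
        return "medium"
    return "large"


def relative_gain_pct(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


example_dice = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
gain_over_medsam = relative_gain_pct(0.87, 0.74)
```

With the reported means, the relative gain rounds to 17.6%, matching the figure quoted in the abstract.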

2510.18822 2026-05-19 cs.CV 版本更新

SAM 2++: Tracking Anything at Any Granularity

SAM 2++: 任意粒度下的任意目标跟踪

Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Tencent(腾讯) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 本文提出SAM 2++框架,通过统一的提示编码、输出解码和记忆表示设计,实现了对不同粒度的目标状态(如掩码、框和点)的统一跟踪,同时引入Tracking-Any-Granularity数据集以提升统一跟踪模型的训练和评估效果。

Comments 14 pages

详情
AI中文摘要

由于不同任务中目标状态的粒度各不相同,现有跟踪器大多针对单一任务设计,这种专用性限制了其泛化能力,使其无法有效利用多任务训练数据,并导致模型设计和参数上的冗余。尽管最近的统一视觉模型在不同任务间共享部分架构,但通常保留任务特定的接口,并忽视不同粒度背后共同的跟踪原理,距离真正统一的视频跟踪仍有差距。为统一视频跟踪任务,我们提出了SAM 2++,一个能够处理掩码、框和点等不同粒度目标状态的统一框架,其核心是对提示编码、输出解码和记忆表示的一体化设计。首先,为处理不同的目标粒度,我们设计了任务特定的提示,将多样化的任务输入映射为通用的提示嵌入,并配合统一解码器以共同的输出形式生成任务结果,而无需重新设计整体流程。其次,针对跟踪的核心操作即记忆匹配,我们引入了任务自适应的记忆机制,在统一不同粒度记忆的同时保留各自的状态语义,避免全参数共享造成粒度间的干扰。最后,我们引入Tracking-Any-Granularity,首个具有丰富三粒度标注的大规模、多样化视频跟踪数据集。它通过定制的数据引擎构建,结合分阶段的人工标注和模型辅助补全,为统一跟踪模型的训练、基准测试和分析提供了全面的资源。全面的实验表明,SAM 2++在不同粒度的多种跟踪任务上达到了新的最先进水平,建立了统一且稳健的跟踪框架。

英文摘要

Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task. This specificity limits their generalization, prevents them from effectively utilizing multi-task training data, and leads to redundancy in both model design and parameters. Although recent unified vision models share partial architectures across tasks, they usually retain task-specific interfaces and overlook the common tracking principle behind different granularities, leaving a gap for truly unified video tracking. To unify video tracking tasks, we present SAM 2++, a unified framework that can handle target states at different granularities, including masks, boxes, and points, through an integrated design of prompt encoding, output decoding, and memory representation. First, to handle different target granularities, we design task-specific prompts that map diverse task inputs into general prompt embeddings, together with a Unified Decoder that produces task results in a common output form without redesigning the overall pipeline. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities while preserving their distinct state semantics, preventing full parameter sharing from causing interference across granularities. Finally, we introduce Tracking-Any-Granularity, the first large and diverse video tracking dataset with rich annotations at three granularities. It is constructed through a customized data engine with phased manual annotation and model-assisted completion, providing a comprehensive resource for training, benchmarking, and analyzing unified tracking models. Comprehensive experiments confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.

2510.17363 2026-05-19 cs.CV cs.LG cs.RO 版本更新

M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

M2H:基于高效窗口交叉任务注意力的多任务学习用于单目空间感知

U. V. B. L Udugama, George Vosselman, Francesco Nex

发表机构 * Department of Earth Observation Science(地球观测科学系)

AI总结 本文提出M2H框架,通过高效的窗口交叉任务注意力模块,实现单目图像上的语义分割、深度估计、边缘检测和表面法线估计,同时在计算效率上优于现有方法。

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

详情
AI中文摘要

在边缘设备上部署实时空间感知需要高效的多任务模型,这些模型能够在利用互补任务信息的同时最小化计算开销。本文介绍了Multi-Mono-Hydra(M2H),一种新的多任务学习框架,用于从单张单目图像中进行语义分割、深度、边缘和表面法线估计。与传统方法依赖独立单任务模型或共享编码器-解码器架构不同,M2H引入了基于窗口的跨任务注意力模块,实现了结构化的特征交换同时保留任务特定的细节,提高了任务间预测的一致性。M2H基于轻量级的ViT-based DINOv2主干网络,优化了实时部署,并作为支持动态环境中3D场景图构建的单目空间感知系统的基础。全面评估显示,M2H在NYUDv2上优于最先进的多任务模型,在Hypersim上超越了单任务深度和语义基线,在Cityscapes数据集上实现了更优的性能,同时在笔记本硬件上保持计算效率。除了基准测试外,M2H还在真实世界数据上得到了验证,证明了其在空间感知任务中的实用性。

英文摘要

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

2509.25969 2026-05-19 cs.CV 版本更新

A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments

一种用于挑战性环境中鲑鱼福利监测的多用途跟踪框架

Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester

发表机构 * Norwegian University of Science and Technology(挪威科学技术大学) SINTEF Ocean(SINTEF海洋)

AI总结 本文提出了一种多用途跟踪框架,用于在具有挑战性的环境中实现鲑鱼福利的自动化监测,通过使用姿态估计网络提取鲑鱼的边界框及其对应的身体部位信息,以解决水下鲑鱼场景中的特定挑战,并构建了两个新的数据集来评估鲑鱼跟踪的挑战。

Comments Accepted to the Joint Workshop on Marine Vision 2025 (CVAUI & AAMVEM), held in conjunction with ICCV 2025

详情
AI中文摘要

基于计算机视觉(CV)的连续、自动化和精确的鲑鱼福利监测是减少工业网箱养鱼中鲑鱼死亡率和改善鲑鱼福利的关键步骤。现有的CV方法用于确定福利指标主要集中在单一指标上,并依赖于其他应用领域的对象检测器和跟踪器来帮助其福利指标计算算法。这在实际应用中带来了高资源需求,因为每个指标必须单独计算。此外,这些方法在水下鲑鱼场景中容易受到物体遮挡、相似物体外观和相似物体运动等困难的影响。为了解决这些挑战,我们提出了一种灵活的跟踪框架,该框架使用姿态估计网络提取鲑鱼及其对应身体部位的边界框,并利用身体部位的信息,通过专门的模块,来解决水下鲑鱼场景中的特定挑战。随后,高细节的身体部位跟踪被用于计算福利指标。我们构建了两个新的数据集,评估两个鲑鱼跟踪挑战:拥挤场景中的鲑鱼ID转移和转弯期间的鲑鱼ID切换。我们的方法在两个鲑鱼跟踪挑战中均优于当前最先进的行人跟踪器BoostTrack。此外,我们创建了一个用于计算鲑鱼尾鳍拍打波长的数据集,证明了我们的身体部位跟踪方法适合基于尾鳍分析的自动化福利监测。数据集和代码可在https://github.com/espenbh/BoostCompTrack上获得。

英文摘要

Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at https://github.com/espenbh/BoostCompTrack.

2509.19102 2026-05-19 cs.RO cs.AI cs.CV 版本更新

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon: 通过功能对象规范化学习姿态感知的动作原语以实现通用的机器人操作

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

发表机构 * TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg(汉堡大学信息学系TAMS(多模态系统技术)) Technical University of Munich(慕尼黑工业大学) Agile Robots SE(敏捷机器人有限公司)

AI总结 本文提出FUNCanon框架,通过功能对象规范化学习姿态感知的动作原语,以实现通用的机器人操作,该方法将长周期操作任务分解为由主体、动词和对象定义的动作片段,从而提升策略的可组合性和可重用性。

Comments project website: https://sites.google.com/view/funcanon, 11 pages

详情
AI中文摘要

从端到端演示中学习通用机器人技能,往往会得到任务特定的策略,难以泛化到训练分布之外。因此,我们提出FunCanon框架,将长时程操作任务转换为一系列动作片段,每个片段由主体、动词和对象定义。这些片段使策略学习聚焦于动作本身,而不是孤立的任务,从而实现组合性和复用。为了使策略具有姿态感知和类别通用性,我们对功能对象进行规范化,以实现功能对齐和操作轨迹的自动迁移,并利用大型视觉语言模型提供的可供性(affordance)线索将对象映射到共享的功能坐标系中。在这些对齐数据上训练的、以对象和动作为中心的扩散策略FuncDiffuser自然地遵循对象的可供性和姿态,从而简化了学习并提高了泛化能力。在模拟和真实基准上的实验展示了类别级泛化、跨任务行为复用和鲁棒的sim2real部署,表明功能规范化为复杂操作领域中可扩展的模仿学习提供了强有力的归纳偏置。演示细节和补充材料可在我们的项目网站上获得:https://sites.google.com/view/funcanon。

英文摘要

Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision-language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

2509.16391 2026-05-19 cs.LG cs.AI cs.CV 版本更新

CoUn: Empowering Machine Unlearning via Contrastive Learning

CoUn:通过对比学习赋能机器遗忘

Yasser H. Khalil, Mehdi Setayesh, Hongliang Li

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 本文提出CoUn框架,仅在保留数据上通过对比学习和监督学习调整已学到的数据表示,以提高机器遗忘的有效性;实验表明其在多个数据集和模型架构上均优于现有方法。

详情
AI中文摘要

机器遗忘(MU)旨在从已训练模型中移除特定“遗忘”数据的影响,同时保留模型对其余“保留”数据的知识。现有基于标签操纵或模型权重扰动的MU方法,遗忘效果往往有限。为此,我们提出CoUn,这一新的MU框架受到如下观察的启发:仅用保留数据从头重新训练的模型,会根据遗忘数据与保留数据的语义相似性对遗忘数据进行分类。CoUn通过对比学习(CL)和监督学习调整已学到的数据表示来模拟这一行为,且两者都只作用于保留数据。具体而言,CoUn(1)利用数据样本之间的语义相似性,通过CL间接调整遗忘数据的表示,(2)通过监督学习使保留数据的表示保持在各自的聚类内。在多种数据集和模型架构上的大量实验表明,CoUn在遗忘有效性上始终优于最先进的MU基线。此外,将我们的CL模块集成到现有基线中可以增强其遗忘有效性。

英文摘要

Machine unlearning (MU) aims to remove the influence of specific "forget" data from a trained model while preserving its knowledge of the remaining "retain" data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines empowers their unlearning effectiveness.
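The abstract describes combining contrastive learning with supervised learning on retain data only, but does not give the loss. The sketch below assumes one common choice, an InfoNCE term between two augmented views plus a cross-entropy term; the function names, the `weight` and `temperature` parameters, and the loss form itself are assumptions for illustration, not CoUn's actual objective.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def _log_sum_exp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(x - peak) for x in values))


def info_nce(view_a, view_b, temperature=0.5):
    """InfoNCE loss: row i of view_a should match row i of view_b."""
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    loss = 0.0
    for i, anchor in enumerate(a):
        logits = [_dot(anchor, other) / temperature for other in b]
        loss += _log_sum_exp(logits) - logits[i]
    return loss / len(a)


def cross_entropy(logits_batch, labels):
    """Mean softmax cross-entropy over a batch."""
    loss = 0.0
    for logits, label in zip(logits_batch, labels):
        loss += _log_sum_exp(logits) - logits[label]
    return loss / len(labels)


def retain_only_loss(view_a, view_b, logits_batch, labels, weight=1.0, temperature=0.5):
    """Assumed combined objective, evaluated on retain samples only."""
    return cross_entropy(logits_batch, labels) + weight * info_nce(view_a, view_b, temperature)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the aligned views give a much lower contrastive loss than the swapped ones, which is the pressure that pulls semantically similar samples together. Forget samples never enter the loss in this sketch; their representations would shift only indirectly through the shared encoder.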

2509.02351 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

序数自适应校正:一种以数据为中心的带噪声标签序数图像分类方法

Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi

发表机构 * School of Computer Engineering, Iran University of Science and Technology(伊朗科学技术大学计算机工程学院)

AI总结 本文提出了一种以数据为中心的序数图像分类方法ORDAC,利用标签分布学习来建模序数标签固有的模糊性和不确定性,并在训练中动态调整每个样本标签分布的均值和标准差,从而有效校正噪声标签并提高模型性能。

Comments 10 pages, 5 figures, 5 tables

详情
AI中文摘要

标注数据是训练计算机视觉任务中有监督深度学习模型的基本组成部分。然而,标注过程容易出现错误和噪声,在类别边界往往模糊的序数图像分类中尤其如此。此类标签噪声会显著降低机器学习模型的性能和可靠性。本文研究序数图像分类任务中标签噪声的检测与校正问题,为此提出一种新的以数据为中心的方法ORDinal Adaptive Correction(ORDAC),用于自适应校正噪声标签。该方法利用标签分布学习(LDL)来建模序数标签固有的模糊性和不确定性。在训练过程中,ORDAC动态调整每个样本标签分布的均值和标准差。该方法不丢弃可能含噪的样本,而是对其进行校正,从而充分利用整个训练数据集。我们在年龄估计(Adience)和疾病严重程度检测(糖尿病视网膜病变)基准数据集上,针对多种非对称高斯噪声场景评估了该方法。结果表明,ORDAC及其扩展版本(ORDAC_C和ORDAC_R)显著提升了模型性能。例如,在噪声比例为40%的Adience数据集上,ORDAC_R将平均绝对误差从0.86降低到0.62,并将召回率从0.37提高到0.49。该方法还能有效校正原始数据集中固有的噪声。这项研究表明,利用标签分布进行自适应标签校正,是在存在噪声数据时提升序数分类模型鲁棒性和准确性的有效策略。

英文摘要

Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.
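To make the label-distribution idea concrete, the sketch below builds a discretized Gaussian over ordinal classes from a per-sample mean and standard deviation, the two quantities the abstract says ORDAC adjusts during training. The function name and the normalization are assumptions; the paper's actual update rule for the mean and standard deviation is not given in the abstract and is not reproduced here.

```python
import math


def gaussian_label_distribution(mean, std, num_classes):
    """Soft label over classes 0..num_classes-1, shaped like a Gaussian.

    `mean` is the (possibly corrected) ordinal label and `std` (> 0) encodes
    how uncertain that label is. Both names are illustrative.
    """
    weights = [math.exp(-0.5 * ((k - mean) / std) ** 2) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


confident = gaussian_label_distribution(mean=3.0, std=1.0, num_classes=8)
uncertain = gaussian_label_distribution(mean=3.0, std=2.0, num_classes=8)
```

A small standard deviation concentrates mass on one class, while a larger one spreads it over neighboring classes. Shifting the mean moves the label itself, which is how a suspected noisy label could be corrected rather than discarded.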

2508.13977 2026-05-19 cs.CV 版本更新

ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving

ROVR-Open-Dataset:面向自动驾驶的大规模深度数据集

Xianda Guo, Ruijun Zhang, Yiqun Duan, Ruilin Wang, Matteo Poggi, Keyuan Zhou, Wenzhao Zheng, Wenke Huang, Gangwei Xu, Yanlun Peng, Yuan Si, Qin Zou

发表机构 * Wuhan University(武汉大学) Institute of Automation, Chinese Academy of Sciences (CASIA)(中国科学院自动化研究所) University of Technology Sydney(悉尼科技大学) University of Bologna(博洛尼亚大学) Zhejiang University(浙江大学) University of California, Berkeley(加州大学伯克利分校) Huazhong University of Science and Technology(华中科技大学) Great Wall Motor(长城汽车) ROVR Labs, Inc.(ROVR实验室)

AI总结 本文提出ROVR-Open-Dataset,一个大规模、多样化且成本效益高的深度数据集,用于提升自动驾驶中的空间感知能力;它提供覆盖多种场景、光照和天气条件的数据以及经过验证的稀疏真值,支持稳健的模型训练,并识别出当前架构共有的三种失败模式。

详情
AI中文摘要

深度估计是自动驾驶以及其他在开放城市环境中运行的无人系统进行空间感知的基本组成部分。KITTI、nuScenes和DDAD等现有深度数据集推动了该领域的发展,但在多样性和可扩展性方面存在局限,且在这些数据集上的基准性能已接近饱和。一个较少被讨论的约束是传感器经济性:这些数据集所依赖的定制多激光雷达装置成本高、耗电大,且难以在车队规模上复制,这限制了任何单一基准所能覆盖的地理和时间多样性。我们提出ROVR,一个大规模、多样化且成本效益高的深度数据集,旨在刻画真实世界驾驶的复杂性。ROVR包含20万帧高分辨率图像,涵盖高速公路、乡村和城市场景,覆盖昼夜周期和恶劣天气条件,采集于北美、欧洲和亚洲。我们还发布了标定、同步、预处理和隐私处理流程,使第三方能够复现该平台。轻量级的采集流程支持可扩展的数据收集,而稀疏但统计上充分的真值(已通过密度消融实验验证)支持稳健的模型训练。大量消融研究进一步刻画了不同场景类型、光照、天气条件和真值稀疏程度下的性能,并识别出当前架构共有的三种性质不同的失败模式:光度崩溃、几何混淆和范围饱和。数据集、数据加载器、标定与隐私处理流程以及评估代码已公开发布于https://xiandaguo.net/ROVR-Open-Dataset。

英文摘要

Depth estimation is a fundamental component of spatial perception for autonomous driving and other unmanned systems operating in open urban environments. Existing depth datasets such as KITTI, nuScenes, and DDAD have advanced the field but are limited in diversity and scalability, and benchmark performance on them is approaching saturation. A less discussed constraint is sensor economics: the bespoke multi-LiDAR rigs behind these datasets are expensive, power-hungry, and difficult to replicate at fleet scale, which caps the geographic and temporal diversity that any single benchmark can cover. We present ROVR, a large-scale, diverse, and cost-efficient depth dataset designed to capture the complexity of real-world driving. ROVR comprises 200K high-resolution frames across highway, rural, and urban scenarios, spanning day/night cycles and adverse weather conditions, collected across North America, Europe, and Asia. We additionally release the calibration, synchronization, preprocessing, and privacy pipeline so that the platform can be reproduced by third parties. The lightweight acquisition pipeline enables scalable collection, while sparse but statistically sufficient ground truth (validated by a density ablation) supports robust model training. Extensive ablation studies further characterize performance across scene types, illumination, weather conditions, and ground-truth sparsity levels, and identify three qualitatively distinct failure modes that current architectures share: photometric collapse, geometric confusion, and range saturation. The dataset, data loaders, calibration and privacy pipelines, and evaluation code are publicly available at https://xiandaguo.net/ROVR-Open-Dataset.

2507.06384 2026-05-19 eess.IV cs.CV 版本更新

Mitigating 3D Prostate Biparametric MRI Data Scarcity through Domain Adaptation using Locally-Trained Latent Diffusion Models for Prostate Cancer Detection

利用本地训练的潜在扩散模型进行领域适应,缓解3D前列腺双参数MRI数据稀缺问题,用于前列腺癌检测

Emerson P. Grabke, Babak Taati, Masoom A. Haider

发表机构 * Institute of Biomedical Engineering, University of Toronto(多伦多大学生物医学工程研究所) Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital(西奈山医院Lunenfeld-Tanenbaum研究所) KITE Research Institute, Toronto Rehabilitation Institute, University Health Network(大学健康网络多伦多康复研究所KITE研究所) Joint Department of Medical Imaging, University of Toronto, Princess Margaret Hospital, and Sinai Health systems(多伦多大学、玛格丽特公主医院与西奈医疗系统联合医学影像科) Department of Computer Science, University of Toronto(多伦多大学计算机科学系) Faculty Affiliate of the Vector Institute, Toronto(多伦多Vector研究所附属教员)

AI总结 本文提出CCELLA++,一种新的潜在扩散模型流程,用于同时生成3D双参数前列腺MRI(bpMRI),包括轴向T2加权(AxT2)、高b值扩散系列(HighB)和表观扩散系数图(ADC),以克服数据稀缺问题。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

目的:潜在扩散模型(LDM)有望缓解数据稀缺给医学图像判读的机器学习开发带来的挑战。近期的CCELLA LDM利用合成MRI训练分类器,提高了前列腺癌检测性能,但仅限于轴向T2加权(AxT2)序列,未研究跨机构的领域偏移,且优先采用PI-RADS而非组织病理学结果。方法:我们提出CCELLA++,一种新的LDM流程,可同时生成3D双参数前列腺MRI(bpMRI),包括AxT2、高b值扩散序列(HighB)和表观扩散系数图(ADC),以克服上述限制。我们研究了无源域适应:分类器先在单一机构的真实数据或LDM生成的合成数据上预训练,再在一个分布外的外部数据集的不同比例数据上微调。结果:CCELLA++的AxT2 Kernel Inception Distance与CCELLA相当(分别为0.0128和0.0131)。在外部数据集用量不超过12.5%(n≤166)时,CCELLA++合成bpMRI预训练在AP和AUC上均优于真实bpMRI预训练(p均<0.01);在外部数据用量不超过25%(n=332)时,其AUC优于无预训练(p均<0.05);在数据稀缺(n=83,AP和AUC的p<0.001)和完整数据(n=1329,AP和AUC的p<0.05)两种场景下,均优于CCELLA仅AxT2的预训练。结论:CCELLA++合成bpMRI能够提升下游分类器的泛化能力和性能,效果优于真实bpMRI或CCELLA生成的仅AxT2图像。未来的工作应量化医学图像质量,平衡bpMRI LDM的训练,并以额外信息作为LDM的条件。意义:CCELLA++生成的合成bpMRI在面向数据稀缺的外部机构进行领域适应时优于真实数据,推动了医学影像机器学习的发展。我们的代码可在https://github.com/grabkeem/CCELLA-plus-plus获取。

英文摘要

Objective: Latent diffusion models (LDMs) could mitigate data scarcity challenges affecting machine learning development for medical image interpretation. The recent CCELLA LDM improved prostate cancer detection performance using synthetic MRI for classifier training but was limited to the axial T2-weighted (AxT2) sequence, did not investigate inter-institutional domain shift, and prioritized PI-RADS over histopathology outcomes. Methods: We propose CCELLA++, a novel LDM pipeline for simultaneous 3D biparametric prostate MRI (bpMRI) generation, including the AxT2, high b-value diffusion series (HighB) and apparent diffusion coefficient map (ADC), to overcome these limitations. We investigated source-free domain adaptation with classifiers pretrained on single institution real or LDM-generated synthetic data prior to fine-tuning on fractions of an out-of-distribution, external dataset. Results: CCELLA++ achieved comparable AxT2 Kernel Inception Distance to CCELLA (0.0128, 0.0131 respectively). CCELLA++ synthetic bpMRI pretraining outperformed real bpMRI in AP and AUC up to 12.5% (n<=166) external dataset volume (p<0.01 all), no pretraining in AUC up to 25% external volume (n=332, p<0.05 all), and CCELLA AxT2-only pretraining in both data-scarce (n=83, p<0.001 AP and AUC) and full data (n=1329, p<0.05 AP and AUC) scenarios. Conclusion: CCELLA++ synthetic bpMRI can improve downstream classifier generalization and performance beyond real bpMRI or CCELLA-generated AxT2-only images. Future work should quantify medical image quality, balance bpMRI LDM training, and condition the LDM with additional information. Significance: CCELLA++ can generate synthetic bpMRI that outperforms real data for domain adaptation with data-scarce external institutions, advancing machine learning development for medical imaging. Our code is available at https://github.com/grabkeem/CCELLA-plus-plus

2507.01099 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

Geometry-aware 4D Video Generation for Robot Manipulation

面向机器人操作的几何感知4D视频生成

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

发表机构 * Stanford University(斯坦福大学) Toyota Research Institute(丰田研究院)

AI总结 本文提出了一种几何感知的4D视频生成模型,训练时通过跨视角点图对齐进行监督,以确保生成视频的多视角3D一致性,从而在每个视角给定单张RGB-D图像的情况下生成时空对齐的未来视频序列,且无需以相机位姿作为输入。

Comments ICLR 2026; Project website: https://robot4dgen.github.io

详情
AI中文摘要

理解并预测物理世界的动态可以增强机器人在复杂环境中的规划和交互能力。尽管最近的视频生成模型在建模动态场景方面显示出强大的潜力,但生成在时间上连贯且在不同相机视角间几何一致的视频仍然是一项重大挑战。为此,我们提出了一种4D视频生成模型,在训练过程中使用跨视角点图对齐来监督模型,以确保生成视频的多视角3D一致性。通过这种几何监督,模型学习到共享的3D场景表示,使其能够在每个视角给定单张RGB-D图像的情况下,从新视角生成时空对齐的未来视频序列,且无需以相机位姿作为输入。与现有基线方法相比,我们的方法在多个仿真和真实世界机器人数据集上产生了视觉上更稳定、空间上更对齐的预测。我们进一步表明,预测的4D视频可配合现成的6自由度位姿跟踪器恢复机器人末端执行器轨迹,从而得到能够很好地泛化到新相机视角的机器人操作策略。

英文摘要

Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.

2506.21499 2026-05-19 eess.IV cs.CV 版本更新

Lightweight Physics-Aware Zero-Shot Ultrasound Plane-Wave Denoising

轻量级物理感知零样本超声平面波去噪

Hojat Asgariandehkordi, Mostafa Sharifzadeh, Morteza Rezanejad, Hassan Rivaz

发表机构 * Department of Electrical and Computer Engineering, Concordia University(电气与计算机工程系,康科迪亚大学)

AI总结 本文提出了一种轻量级物理感知零样本去噪框架,用于低角度数CPWC超声成像,无需外部训练数据集或干净参考图像。该方法将可用的偏转角分为两个不相交子集,分别重建出具有不同角度相关伪影和噪声特征的复合图像,并在自监督残差学习框架中以这些图像为伪配对训练轻量级卷积神经网络,因而无需领域特定微调或配对数据集即可适应不同的解剖区域和采集设置。

详情
AI中文摘要

超声相干平面波复合(CPWC)通过合并多次偏转发射的回波来增强图像对比度。增加偏转角的数量通常能提高图像质量,但会显著降低帧率,并可能在快速运动的目标中引入模糊伪影。此外,复合图像仍易受噪声影响,在发射次数有限时尤其如此。在本工作中,我们提出了一种轻量级物理感知零样本去噪框架,用于低角度数CPWC超声成像,无需外部训练数据集或干净参考图像即可提高图像质量。所提出的方法将可用的偏转角分为两个不相交子集,每个子集用于重建一幅具有不同角度相关伪影和噪声特征的复合图像。这些重建图像随后在自监督残差学习框架中作为伪配对,直接在测试样本上训练一个轻量级卷积神经网络。由于底层组织结构在两个子集之间保持一致,而非相干伪影随偏转角的选择而变化,所提出的物理感知配对策略使网络能够区分解剖信息与不一致的噪声和伪影。与监督方法不同,所提出的方法不需要领域特定的微调或配对数据集,因而能够适应不同的解剖区域和采集设置。此外,该框架采用仅包含两个卷积层的高效架构,训练快速且计算成本低。

英文摘要

Ultrasound Coherent Plane-Wave Compounding (CPWC) enhances image contrast by combining echoes from multiple steered transmissions. While increasing the number of steering angles generally improves image quality, it significantly reduces frame rate and may introduce blurring artifacts in fast-moving targets. In addition, compounded images remain susceptible to noise, particularly when acquired using a limited number of transmissions. In this work, we propose a lightweight physics-aware zero-shot denoising framework for low-angle CPWC ultrasound imaging that improves image quality without requiring external training datasets or clean reference images. The proposed approach partitions the available steering angles into two disjoint subsets, each used to reconstruct compounded images with different angle-dependent artifacts and noise characteristics. These reconstructed images are then used as pseudo-pairs within a self-supervised residual learning framework to train a lightweight convolutional neural network directly on the test sample. Because the underlying tissue structures remain consistent across the subsets while the incoherent artifacts vary with steering angle selection, the proposed physics-aware pairing strategy enables the network to distinguish anatomical information from inconsistent noise and artifacts. Unlike supervised approaches, the proposed method does not require domain-specific fine-tuning or paired datasets, making it adaptable across different anatomical regions and acquisition settings. Furthermore, the proposed framework employs an efficient architecture composed of only two convolutional layers, enabling fast and computationally inexpensive training.
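The pairing step described above can be sketched as follows. The abstract does not say how the steering angles are partitioned, so the interleaved split here is an assumption, and the per-pixel averaging of real-valued frames is a simplification of coherent compounding. All function names are hypothetical.

```python
def split_angles(angles):
    """Split steering angles into two disjoint, interleaved subsets (assumed scheme)."""
    return angles[0::2], angles[1::2]


def compound(frames):
    """Average per-angle beamformed frames pixel by pixel.

    Each frame is a list of rows. Real CPWC sums complex or RF data coherently;
    plain averaging of real values is used here only to keep the sketch small.
    """
    count = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / count for c in range(cols)] for r in range(rows)]


def make_pseudo_pair(frames_by_angle):
    """Build two compounded images of the same tissue from disjoint angle subsets."""
    angles = sorted(frames_by_angle)
    subset_a, subset_b = split_angles(angles)
    image_a = compound([frames_by_angle[a] for a in subset_a])
    image_b = compound([frames_by_angle[a] for a in subset_b])
    return image_a, image_b


# Toy 1x2 "frames" for five steering angles (values are made up).
toy_frames = {-6: [[1.0, 2.0]], -3: [[3.0, 4.0]], 0: [[5.0, 6.0]], 3: [[7.0, 8.0]], 6: [[9.0, 10.0]]}
pair_a, pair_b = make_pseudo_pair(toy_frames)
```

The two images share anatomy but carry different angle-dependent artifacts, so one can serve as the input and the other as the target when training the small network on the test sample itself.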

2506.20522 2026-05-19 cs.CV 版本更新

AI-assisted radiographic analysis in detecting alveolar bone-loss severity and patterns

人工智能辅助的影像学分析用于检测牙槽骨丧失的严重程度和模式

Chathura Wimalasiri, Piumal Rathnayake, Shamod Wijerathne, Sumudu Rasnayaka, Dhanushka Leuke Bandara, Roshan Ragel, Vajira Thambawita, Isuru Nawinne

发表机构 * Faculty of Engineering, University of Peradeniya(佩拉德尼亚大学工程学院) Faculty of Dental Sciences, University of Peradeniya(佩拉德尼亚大学牙科学院) Simula Metropolitan Center for Digital Engineering(Simula大都会数字工程中心)

AI总结 本研究提出了一种新型的基于人工智能的深度学习框架,利用口内根尖(IOPA)X线片自动检测和量化牙槽骨丧失及其模式。该框架结合YOLOv8进行牙齿检测,并用Keypoint R-CNN模型识别解剖标志点,从而精确计算骨丧失严重程度;同时通过几何分析区分水平型与角型骨丧失模式,在1000张专家标注的X线片上取得了较高的准确率。

Comments This manuscript is 17 pages with 5 tables and 12 figures. The manuscript is under review at Nature Scientific Reports

详情
AI中文摘要

牙周炎是一种导致牙槽骨丧失的慢性炎症性疾病,显著影响口腔健康和生活质量。准确评估骨丧失的严重程度和模式对于诊断和治疗计划至关重要。在本研究中,我们提出了一种新型的基于人工智能的深度学习框架,利用口内根尖(IOPA)X线片自动检测和量化牙槽骨丧失及其模式。我们的方法结合YOLOv8进行牙齿检测,并用Keypoint R-CNN模型识别解剖标志点,从而精确计算骨丧失严重程度。此外,YOLOv8x-seg模型用于分割骨水平和牙齿掩码,并通过几何分析确定骨丧失模式(水平型与角型)。在一个包含1000张X线片的大规模专家标注数据集上进行评估,我们的方法在骨丧失严重程度检测(组内相关系数最高达0.80)和骨丧失模式分类(准确率87%)方面取得了较高的准确率。这一自动化系统提供了一种快速、客观且可重复的牙周评估工具,减少了对主观人工评估的依赖。通过将人工智能整合到牙科影像分析中,我们的框架有潜力改进牙周炎的早期诊断和个性化治疗计划,最终改善患者护理和临床结果。

英文摘要

Periodontitis, a chronic inflammatory disease causing alveolar bone loss, significantly affects oral health and quality of life. Accurate assessment of bone loss severity and pattern is critical for diagnosis and treatment planning. In this study, we propose a novel AI-based deep learning framework to automatically detect and quantify alveolar bone loss and its patterns using intraoral periapical (IOPA) radiographs. Our method combines YOLOv8 for tooth detection with Keypoint R-CNN models to identify anatomical landmarks, enabling precise calculation of bone loss severity. Additionally, YOLOv8x-seg models segment bone levels and tooth masks to determine bone loss patterns (horizontal vs. angular) via geometric analysis. Evaluated on a large, expertly annotated dataset of 1000 radiographs, our approach achieved high accuracy in detecting bone loss severity (intra-class correlation coefficient up to 0.80) and bone loss pattern classification (accuracy 87%). This automated system offers a rapid, objective, and reproducible tool for periodontal assessment, reducing reliance on subjective manual evaluation. By integrating AI into dental radiographic analysis, our framework has the potential to improve early diagnosis and personalized treatment planning for periodontitis, ultimately enhancing patient care and clinical outcomes.

2505.19155 2026-05-19 cs.CV cs.CL 版本更新

Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

稀疏到密集:无损加速LLM视频理解的免费午餐

Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu

发表机构 * Singapore Management University(新加坡管理大学) Sea AI Lab(Sea AI实验室) National University of Singapore(新加坡国立大学)

AI总结 本文提出了一种名为Sparse-to-Dense(StD)的解码策略,通过结合稀疏top-K注意力和密集全注意力模块,实现视频大语言模型(Video-LLMs)的无损加速,从而在处理长视频序列时显著提高处理速度。

Comments Accepted by ACL 2025

详情
AI中文摘要

由于当前视频大语言模型(Video-LLMs)的自回归性质,推理延迟会随输入序列长度的增长而增加,这给高效处理通常很长的视频序列带来了挑战。我们观察到,在解码过程中,Video-LLMs中大多数token的注意力分数趋于稀疏和集中,只有某些token需要完整的全注意力。基于这一观察,我们提出Sparse-to-Dense(StD),一种新颖的解码策略,它集成了两个不同的模块:一个使用稀疏top-K注意力,另一个使用密集全注意力。这两个模块协同工作,无损地加速Video-LLMs。快速(稀疏)模型以推测方式解码多个token,而慢速(密集)模型并行验证它们。StD是一种无需调优、即插即用的方案,在视频处理中可实现最高1.94倍的墙钟时间加速。它在保持模型性能的同时,只需极少的代码修改即可从标准Video-LLM无缝过渡到稀疏Video-LLM。

英文摘要

Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94× wall-clock speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
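The draft-then-verify loop behind this kind of lossless speedup can be sketched with generic greedy speculative decoding. The code below is not StD's implementation: the two "models" are toy next-token functions standing in for the sparse and dense attention paths, the verification is a sequential loop rather than one parallel forward pass, and `gamma` (the draft length) is an assumed parameter.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Reference decoding with the slow (dense) model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(draft: NextToken, target: NextToken, prompt: List[int],
                       max_new_tokens: int, gamma: int = 4) -> List[int]:
    """Draft `gamma` tokens with the fast model, keep only what the slow model confirms."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Draft phase: the fast model guesses several tokens ahead.
        proposal: List[int] = []
        for _ in range(gamma):
            proposal.append(draft(tokens + proposal))
        # Verify phase: a real system scores all guesses in one parallel pass.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the slow model's token
            if expected != guess:
                break  # first mismatch: discard the rest of the draft
        else:
            accepted.append(target(tokens + accepted))  # bonus token when all guesses hold
        tokens.extend(accepted)
    return tokens[:limit]


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after tokens 2, 5 and 8.
    return 0 if seq[-1] % 3 == 2 else (seq[-1] + 1) % 10


fast_output = speculative_decode(toy_draft, toy_target, [0], max_new_tokens=7, gamma=3)
slow_output = greedy_decode(toy_target, [0], max_new_tokens=7)
```

Every emitted token is the slow model's own greedy choice, so the output matches plain greedy decoding exactly. The gain comes from accepting several tokens per verification pass whenever the draft is right.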

2505.07813 2026-05-19 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

DexWild:面向真实场景机器人策略的灵巧人类交互

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出DexWild框架,通过结合人类和机器人示范数据,提升机器人在多样化环境中的泛化能力,实验表明其在未见环境中的成功率显著高于传统方法。

Comments In RSS 2025. Website at https://dexwild.github.io

详情
AI中文摘要

大规模、多样化的机器人数据集已成为使灵巧操作策略泛化到新环境的有希望途径,但获取此类数据集存在诸多挑战。虽然远程操作能提供高保真的数据集,但其高成本限制了可扩展性。相反,如果人们可以像在日常生活中一样使用自己的手来收集数据呢?在DexWild中,一个多样化的数据收集团队使用他们的手在多种环境和物体上收集数小时的交互数据。为了记录这些数据,我们创建了DexWild-System,一种低成本、移动且易于使用的设备。DexWild学习框架在人类和机器人示范数据上共同训练,相较于单独训练每个数据集,其性能得到提升。这种组合产生了能够泛化到新环境、任务和形态的稳健机器人策略,只需少量额外的机器人特定数据。实验结果表明,DexWild显著提高了性能,在未见环境中实现了68.5%的成功率,几乎是仅使用机器人数据训练的策略的四倍,并提供了5.8倍更好的跨形态泛化能力。视频结果、代码库和说明可在https://dexwild.github.io上找到。

英文摘要

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments (nearly four times higher than policies trained with robot data only) and offering 5.8× better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io

2505.06907 2026-05-19 cs.AI cs.CV cs.NE 版本更新

A Survey on Foundation Models for Personalized Federated Intelligence

面向个性化联邦智能的基础模型综述

Yu Qiao, Huy Q. Le, Avi Deb Raha, Phuong-Nam Tran, Apurba Adhikary, Mengchun Zhang, Loc X. Nguyen, Eui-Nam Huh, Dusit Niyato, Choong Seon Hong

发表机构 * School of Computing, Kyung Hee University(韩国庆熙大学计算机学院) Noakhali Science and Technology University(诺阿克利科学与技术大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 本文综述了基础模型在个性化联邦智能中的应用,探讨了联邦学习与基础模型的结合,提出了个性化联邦智能(PFI)这一新范式,旨在为实现人工个性化智能(API)提供基础支持。

Comments Accepted ACM Computing Survey

详情
AI中文摘要

大语言模型(LLM,如ChatGPT、Gemini和Grok)的兴起重塑了人工智能领域。作为基础模型(FMs)的典型代表,它们在生成类人内容方面表现出色,推动人工智能向通用人工智能(AGI)迈进。然而,它们规模庞大、隐私敏感且计算需求高,给面向终端用户的个性化定制带来了重大挑战。为弥合这一差距,我们提出人工个性化智能(API)的愿景,其重点是在确保隐私的同时使FMs适应个体用户。作为API的核心使能技术,我们提出个性化联邦智能(PFI),这一新范式不仅将联邦学习(FL)的隐私优势与FMs的泛化能力相结合,还将个性化置于核心位置。为此,我们首先综述为PFI奠定基础的FL和FMs最新进展。然后,我们探讨PFI流水线的核心阶段:边缘侧的高效个性化、可信的适应,以及通过检索增强生成实现的自适应细化。最后,我们指出实现PFI的未来方向。总体而言,本综述旨在为API的发展奠定基础,将其作为AGI的补充方向,并以PFI作为关键的使能范式。

英文摘要

The rise of large language models (LLMs), such as ChatGPT, Gemini, and Grok, has reshaped the AI landscape. As prominent instances of foundational models (FMs), they exhibit remarkable capabilities in generating human-like content, pushing the boundaries towards artificial general intelligence (AGI). However, their large-scale nature, privacy sensitivity, and substantial computational demands pose significant challenges for personalized customization for end users. To bridge this gap, we present the vision of artificial personalized intelligence (API), which focuses on adapting FMs to individual users while ensuring privacy. As a central enabler of API, we propose personalized federated intelligence (PFI), a new paradigm that not only integrates the privacy benefits of federated learning (FL) with the generalization capabilities of FMs but also places personalization at its core. To this end, we first survey recent advances in FL and FMs that lay the foundation for PFI. We then explore core stages of the PFI pipeline: efficient personalization at the edge, trustworthy adaptation, and adaptive refinement via retrieval-augmented generation. Finally, we highlight future directions for enabling PFI. Overall, this survey aims to lay a foundation for the development of API as a complementary direction to AGI, with PFI as a key enabling paradigm.

2501.13795 2026-05-19 cs.CV 版本更新

Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

基于视觉-语言模型的免训练零样本时序动作检测

Chaolei Han, Hongsong Wang, Jidong Kuang, Lei Zhang, Jie Gui

发表机构 * Southeast University School of Cyber Science and Engineering(东南大学网络安全科学与工程学院) Southeast University School of Computer Science and Engineering(东南大学计算机科学与工程学院) Nanjing Normal University School of Electrical Engineering and Automation(南京师范大学电气工程与自动化学院)

AI总结 本文提出一种免训练的零样本时序动作检测方法FreeZAD,利用现有的视觉-语言模型直接对未修剪视频中的未见活动进行分类和定位,无需额外微调或适应,并通过LogOIC、基于频率的动作性校准以及测试时适应策略提升性能。

Journal ref IEEE Transactions on Multimedia, 2026

详情
AI中文摘要

现有的零样本时序动作检测(ZSTAD)方法主要采用全监督或无监督策略来识别未见活动。然而,这些基于训练的方法容易受到领域偏移的影响且计算成本高,阻碍了其在现实场景中的应用。与以往工作不同,本文提出一种免训练的零样本时序动作检测(FreeZAD)方法,利用现有的视觉-语言(ViL)模型,直接对未修剪视频中的未见活动进行分类和定位,而无需任何额外的微调或适应。我们设计了LOGarithmic decay weighted Outer-Inner-Contrastive Score(LogOIC)和基于频率的动作性校准(Actionness Calibration),以减轻对显式时间建模的需求和对伪标签质量的依赖。此外,我们引入基于原型中心采样(PCS)的测试时适应(TTA)策略来扩展FreeZAD,使ViL模型能够更有效地适应ZSTAD。在THUMOS14和ActivityNet-1.3数据集上的大量实验表明,我们的免训练方法优于最先进的无监督方法,且仅需其1/13的运行时间。配备TTA后,增强的方法进一步缩小了与全监督方法之间的差距。

英文摘要

Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.

2412.18158 2026-05-19 cs.CV eess.IV 版本更新

Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion

语义解耦与组合用于具有高效LLM推理和生成扩散的通用图像编码

Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative(宁波市空间智能与数字衍生重点实验室) Ningbo Institute of Digital Twin(宁波数字孪生研究院) Eastern Institute of Technology(东方理工大学) University of Science and Technology of China(中国科学技术大学) Yokohama National University(横滨国立大学)

AI总结 本文提出UniCodec,一种基于语义解耦和组合生成的通用图像编码框架,通过高效LLM推理和生成扩散模型实现人类和机器需求的统一压缩,无需重新训练。

详情
AI中文摘要

基于学习的图像压缩方法表现出色,但通常高度专门化,要么面向人类感知,要么面向特定的机器视觉任务。这种专门化限制了其通用性,并且在面对新应用时需要高昂的重新训练成本。为此,我们提出UniCodec,一种通用编解码器,它建立在编码端语义解耦、解码端组合式生成的新范式之上。该框架旨在同时满足人类和机器的需求,无需针对特定任务重新训练。在编码端,UniCodec利用由大型语言模型(LLM)预先生成的任务特定标签码本。对于任何给定任务,定位(grounding)模型使用相应的码本进行任务感知的解耦,仅压缩最相关的图像区域。这一机制不仅节省了大量比特,也是系统实现快速、零重新训练适应的关键:切换到新任务只需选择一个新码本。解码端随后进行组合式生成:它将紧凑的解耦成分与生成式扩散模型的强大先验相结合,重建出高质量的完整图像,既具有面向人类感知的丰富细节,又具有面向机器视觉任务的精确特征。大量实验表明,UniCodec始终优于现有方法,有效弥合了以人为中心和以机器为中心的压缩之间的差距。

英文摘要

Learned image compression methods have shown impressive performance but are often highly specialized for either human perception or specific machine vision tasks. This specialization limits their versatility and requires costly retraining for new applications. To address this, we introduce UniCodec, a universal codec built on a novel paradigm of semantic disentanglement at the encoder and compositional generation at the decoder. This framework is designed to simultaneously serve both human and machine needs, eliminating the need for task-specific retraining. At the encoder, UniCodec leverages pre-generated, task-specific label codebooks created by a Large Language Model (LLM). For any given task, a grounding model uses the corresponding codebook to perform task-aware disentanglement, compressing only the most relevant image regions. This mechanism not only saves significant bits but is also the key to our system's rapid, zero-retraining adaptation: switching to a new task is as simple as selecting a new codebook. The decoder then performs compositional generation: it combines the compact, disentangled components with powerful priors from a generative diffusion model. This process reconstructs a high-quality, complete image optimized with rich detail for human perception and precise features for machine vision tasks. Extensive experiments demonstrate that UniCodec consistently outperforms existing methods, effectively bridging the gap between human-centric and machine-centric compression.

2409.15980 2026-05-19 cs.CV cs.AI 版本更新

Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection

利用无监督学习实现低成本视觉异常检测

Yunbo Long, Zhengyang Ling, Sam Brook, Duncan McFarlane, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 本研究提出一种低成本视觉异常检测系统,通过预训练模型和低成本硬件,利用少量数据实现高准确率的异常检测,适用于中小型企业。

详情
AI中文摘要

传统的基于机器学习的视觉检测系统需要大量数据收集和反复的模型训练来提高准确性。这些系统通常需要昂贵的相机和计算设备以及大量机器学习专业知识,这会给中小型企业带来沉重负担。本研究探索将无监督学习方法与预训练模型和低成本硬件相结合,构建一种高性价比的视觉异常检测系统。本研究旨在开发一种低成本的视觉异常检测方案,在使用极少训练数据的同时保持泛化性和可扩展性。该系统采用Anomalib中的无监督学习模型,并通过OpenVINO部署在价格低廉的Raspberry Pi硬件上。结果表明,该系统仅用10张正常产品图像,即可在Raspberry Pi上于90秒内完成异常检测的训练和推理,宏平均F1分数超过0.95。尽管系统对光照、产品摆放或背景等环境变化略为敏感,它仍为中小型制造企业提供了一种快速且经济的工厂自动化检测方法。代码可在https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning获取。

英文摘要

Traditional machine learning-based visual inspection systems require extensive data collection and repetitive model training to improve accuracy. These systems typically require expensive cameras, computing equipment, and significant machine learning expertise, which can substantially burden small and medium-sized enterprises. This study explores leveraging unsupervised learning methods with pre-trained models and low-cost hardware to create a cost-effective visual anomaly detection system. The research aims to develop a low-cost visual anomaly detection solution that uses minimal data for model training while maintaining generalizability and scalability. The system utilises unsupervised learning models from Anomalib and is deployed on affordable Raspberry Pi hardware through OpenVINO. The results show that this cost-effective system can complete anomaly detection training and inference on a Raspberry Pi in just 90 seconds using only 10 normal product images, achieving an F1 macro score exceeding 0.95. While the system is slightly sensitive to environmental changes like lighting, product positioning, or background, it remains a swift and economical method for factory automation inspection for small and medium-sized manufacturers. The code is available at https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning.

2409.12190 2026-05-19 cs.RO cs.CV 版本更新

Bundle Adjustment in the Eager Mode

急切模式下的捆绑调整

Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang

发表机构 * Spatial AI & Robotics (SAIR) Lab, University at Buffalo(空间人工智能与机器人实验室,布法罗大学) Georgia Institute of Technology(佐治亚理工学院) Purdue University(普渡大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出了一种与PyTorch无缝集成的高效急切模式捆绑调整库,通过稀疏感知的自动微分设计和GPU加速的稀疏运算,提升了在机器人应用中捆绑调整的运行效率和性能。

详情
AI中文摘要

捆绑调整(BA)是各种机器人应用中的关键技术,例如同步定位与建图(SLAM)、增强现实(AR)和摄影测量学。BA通过优化诸如相机姿态和3D地标等参数,使它们与观测结果对齐。随着深度学习在感知系统中的重要性日益增加,将BA与深度学习框架整合已成为提高可靠性和性能的迫切需求。然而,广泛使用的基于C++的BA库,如GTSAM、g²o和Ceres Solver,缺乏与现代深度学习库如PyTorch的原生整合。这种限制影响了它们的灵活性、调试简便性和整体实现效率。为了解决这一差距,我们引入了一种与PyTorch无缝集成的高效急切模式BA库。我们的方法包括稀疏感知的自动微分设计和针对二阶优化设计的GPU加速稀疏运算。我们的GPU急切模式BA在所有基准测试中均实现了显著的运行时间效率,与GTSAM、g²o和Ceres相比,平均加速分别为18.5×、22×和23×。

英文摘要

Bundle adjustment (BA) is a critical technique in various robotic applications such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA libraries, such as GTSAM, g$^2$o, and Ceres Solver, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA library seamlessly integrated with PyTorch with high efficiency. Our approach includes a sparsity-aware auto-differentiation design and GPU-accelerated sparse operations designed for 2nd-order optimization. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ across all benchmarks compared to GTSAM, g$^2$o, and Ceres, respectively.
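
As a point of reference, the objective that BA minimizes is commonly written as a sum of squared reprojection errors. The notation below is the standard textbook form, not taken from the paper; the symbols $T_i$, $X_j$, $u_{ij}$, $\pi$, and $\mathcal{O}$ are assumptions introduced for illustration.

```latex
\min_{\{T_i\},\,\{X_j\}} \; \sum_{(i,j)\in\mathcal{O}} \left\| \pi\!\left(T_i, X_j\right) - u_{ij} \right\|_2^2
```

Here $T_i$ is the pose of camera $i$, $X_j$ is the position of 3D landmark $j$, $u_{ij}$ is the observed pixel location of landmark $j$ in camera $i$, $\pi$ is the camera projection function, and $\mathcal{O}$ is the set of (camera, landmark) pairs that were actually observed. Each residual depends on only one camera and one landmark, so the Jacobian is sparse, which is the structure a sparsity-aware auto-differentiation design can exploit.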

2308.06197 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features

利用基本特征的深度知识蒸馏进行复杂面部表情识别

Angus Maiden, Bahareh Nakisa

发表机构 * School of Information Technology, Deakin University(迪肯大学信息技术学院)

AI总结 本文提出了一种基于持续学习的方法,通过知识蒸馏和新颖的预测排序记忆重放,达到了复杂面部表情识别的最先进水平,能够在少量样本下准确识别新复合表情类别。

Comments 13 pages, 9 figures, 6 tables, 3 algorithms. Code available at https://github.com/AngusMaiden/complex-FER

详情
AI中文摘要

复杂情绪识别是一种认知任务,迄今为止尚未达到与其他处于或高于人类认知水平的任务相同的优秀性能。通过面部表情识别情绪尤其困难,因为人类面部表达的情绪十分复杂。为了使机器在复杂面部表情识别方面达到人类的水平,可能需要实时综合知识和理解新概念,就像人类所做的那样。人类能够通过从记忆中蒸馏重要信息,仅凭少量示例学习新概念。受人类认知和学习的启发,我们提出了一种新的持续学习方法,用于复杂面部表情识别,通过在基本表情类别上构建和保留知识,能够使用少量训练样本准确识别新的复合表情类别。在本工作中,我们还使用GradCAM可视化来展示基本和复合面部表情之间的关系。我们的方法通过知识蒸馏和一种新颖的预测排序记忆重放来利用这种关系,达到了复杂面部表情识别持续学习的当前最先进水平,新类别的总体准确率为74.28%。我们还证明了使用持续学习进行复杂面部表情识别的性能远优于非持续学习方法,比最先进的非持续学习方法提高了13.95%。我们的工作也是首次将少样本学习应用于复杂面部表情识别,仅使用每个类别一个训练样本,就实现了100%的准确率,达到了最先进的水平。

英文摘要

Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.

2303.11675 2026-05-19 cs.CV 版本更新

ReBaR: Reference-Based Reasoning for Robust Pose Estimation from Monocular Images

ReBaR:基于参考的鲁棒单目图像姿态估计

Yongkang Cheng, Mingjiang Liang, Jifeng Ning, Gaoge Han, Wei Liu, Shaoli Huang

发表机构 * College of Information Engineering, Northwest A&F University(西北农林科技大学信息工程学院) University of Technology Sydney(悉尼科技大学) Tencent AI Lab(腾讯人工智能实验室) City University of Hong Kong(香港城市大学)

AI总结 本文提出ReBaR方法,通过学习参考特征来解决遮挡和深度模糊问题,实现从单目图像中鲁棒的人体姿态和形状估计。

Comments Accepted by Pattern Recognition

详情
AI中文摘要

ReBaR(Reference-Based Reasoning for Robust Human Pose and Shape Estimation),旨在从单视图像中估计人体形状和姿态。ReBaR通过学习部分回归推理的参考特征,有效解决了遮挡和深度模糊的挑战。我们的方法首先通过注意力引导机制提取身体和部分区域的特征。随后,这些特征用于编码额外的部分-身体依赖关系,以实现个体部分的回归,其中部分特征作为查询,身体特征作为参考。这种基于参考的推理使网络能够利用可见部分和身体参考信息推断被遮挡部分与身体的空间关系。ReBaR在三个基准数据集上优于现有方法,并在最近的新方法中仍保持竞争力。结果表明在处理深度模糊和遮挡方面有显著改进。这些结果强烈支持了我们基于参考的框架在从单目图像中估计人体形状和姿态的有效性。

英文摘要

ReBaR (Reference-Based Reasoning for Robust Human Pose and Shape Estimation) is designed to estimate human body shape and pose from single-view images. ReBaR effectively addresses the challenges of occlusions and depth ambiguity by learning reference features for part regression reasoning. Our approach starts by extracting features from both body and part regions using an attention-guided mechanism. Subsequently, these features are used to encode additional part-body dependencies for individual part regression, with part features serving as queries and the body feature as a reference. This reference-based reasoning allows our network to infer the spatial relationships of occluded parts with the body, utilizing visible parts and body reference information. ReBaR outperforms contemporary methods on three benchmark datasets and still maintains competitive advantages among recent new approaches, demonstrating significant improvement in handling depth ambiguity and occlusion. These results strongly support the effectiveness of our reference-based framework for estimating human body shape and pose from single-view images.

2605.17368 2026-05-19 cs.CV 版本更新

RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection

RadGenome-Anatomy: 通过物理基础的体积投影构建大规模解剖标注胸部X光图像数据集

Shuchang Ye, Mingyuan Meng, Hao Wang, Usman Naseem, Jinman Kim

发表机构 * The University of Sydney(悉尼大学) Zhongguancun Academy(中关村学院) Shanghai Jiao Tong University(上海交通大学) Macquarie University(麦考瑞大学)

AI总结 本文提出RadGenome-Anatomy数据集,通过物理基础的体积投影技术,构建包含超过1000万个分割掩码的大型解剖标注胸部X光图像数据集,用于改进医学图像分割和诊断任务。

详情
AI中文摘要

胸部X光图像的解剖结构标注对于医学图像分割和广泛的下游诊断任务至关重要。然而,直接在2D胸部X光图像上标注解剖结构是劳动密集型且本质上模糊的,因为3D解剖结构被投影到一个单一的2D平面上,其中边界可能会重叠、被遮挡或只部分可见。因此,现有的解剖标注胸部X光图像数据集在规模、解剖覆盖和标签可靠性方面仍然有限。为了解决这些限制,我们引入了RadGenome-Anatomy,这是最大的解剖标注胸部X光图像数据集,包含超过1000万个分割掩码,涵盖25,692例研究中的210种解剖结构。它通过标准放射学几何,将大规模3D解剖掩码从CT体积投影到2D放射学空间中构造而成。这将标注从直接追踪不确定的2D边界转移到定义体积空间中的解剖结构,其中在X光中重叠或部分不可见的结构仍能保持空间分离。因此,每个2D掩码代表了在体积空间中定义的结构的物理基础投影足迹。RadGenome-Anatomy的规模和广泛的解剖覆盖,包括重叠、部分可见或难以直接勾勒的结构,使研究几何测量作为胸部X光解释的明确证据成为可能。我们通过训练XAnatomy来预测结构特定的掩码并推导临床相关测量,实现了对心脏扩大、脊柱后凸和脊柱侧弯的诊断准确率分别为96.4%、95.6%和89.2%。

英文摘要

Anatomical structure labels for chest radiographs are essential for medical image segmentation and a broad range of downstream diagnostic tasks. However, annotating anatomy directly on 2D chest radiographs is labor-intensive and intrinsically ambiguous, as 3D anatomical structures are projected onto a single 2D plane where boundaries may overlap, be occluded, or appear only partially visible. Consequently, existing anatomy-labeled chest radiograph datasets remain limited in scale, anatomy coverage, and label reliability. To address these limitations, we introduce RadGenome-Anatomy, the largest anatomy-labeled chest radiograph dataset, containing over 10 million segmentation masks across 210 anatomical structures in 25,692 studies. It is constructed by projecting large-scale 3D anatomical masks from CT volumes into 2D radiographic space through canonical radiographic geometry. This shifts annotation from directly tracing uncertain 2D boundaries to defining anatomy in volumetric space, where structures that overlap or become partially invisible in radiographs remain spatially separable. As a result, each 2D mask represents the physically grounded projected footprint of a volumetrically defined structure. The scale and broad anatomical coverage of RadGenome-Anatomy, including structures that are overlapping, partially visible, or difficult to delineate directly, enable research on geometric measurements as explicit evidence for chest radiograph interpretation. We demonstrate this by training XAnatomy to predict structure-specific masks and derive clinically relevant measurements, achieving diagnostic accuracies of 96.4%, 95.6%, and 89.2% for cardiomegaly, kyphosis, and scoliosis, respectively.

2605.17367 2026-05-19 cs.CV 版本更新

Bridging Data Trials and Task Barriers: A Unified Framework for Sketch Biometric Identification

弥合数据试验与任务障碍:面向草图生物识别的统一框架

Decheng Liu, Bin Hu, Xinbo Gao, Dawei Zhou, Chunlei Peng, Nannan Wang, Ruimin Hu

发表机构 * IEEE

AI总结 本文提出了一种统一框架,用于解决草图生物识别中的跨模态和跨任务挑战,通过高效的合成草图生成和任务序列持续学习,提升模型的鲁棒性和泛化能力。

Comments The source code and models are publicly available at https://github.com/sHanbIgsUn/UFSB

详情
AI中文摘要

与现有的跨模态识别任务(例如异构人脸识别、草图重识别等)不同,我们引入了一种新的且实用的设置,称为草图生物识别,旨在在不同数据领域间持续训练一个统一的模型,即使涉及多样化的识别任务。草图生物识别面临挑战,包括真实的草图数据稀缺、高标注成本、隐私风险以及跨任务模型的泛化能力不足。现有方法通常依赖于有限的真实数据或单任务优化,难以有效解决跨模态和跨任务的联合挑战。本文提出了一种统一框架,整合了高效的合成草图生成和任务序列持续学习。首先,我们设计了一个高效的流程来生成大规模的高质量合成人物和人脸草图数据,这显著降低了成本并避免了隐私风险。同时,我们通过融合真实数据增强了模型的鲁棒性。其次,我们构建了一个通用的统一框架用于草图生物识别,该框架采用任务序列训练策略:模型首先在人物数据集上完成草图人物重识别学习;随后,通过可信样本重放技术保持获得的人物识别能力,并无缝地在人脸数据集上进行增量训练。这使一个模型能够同时处理多个草图生物识别任务的跨任务能力。为了支持上述草图生物识别的研究,我们构建了一个新的大规模基准,SketchUnified-BioID,并配备了几种实用的评估协议。

英文摘要

Different from existing cross-modality identification tasks (e.g., heterogeneous face recognition, sketch re-identification, etc.), we introduce a novel yet practical setting for these related identification tasks, named sketch biometric identification, which aims to continually train a unified model across different data domains and even diverse identification tasks. Sketch biometric identification faces challenges, including scarce real sketch data, high annotation costs, privacy risks, and insufficient generalization ability of cross-task models. Existing methods usually rely on limited real data or single-task optimization, making it difficult to effectively address the joint challenges of cross-modality and cross-task learning. This paper proposes a unified framework that integrates efficient synthetic sketch generation and task-sequential continual learning. First, we design an efficient pipeline to generate large-scale, high-quality synthetic person and face sketch data, which significantly reduces costs and avoids privacy risks. Meanwhile, we enhance the model's robustness by fusing real data. Second, we construct a universal unified framework for sketch biometric identification, which adopts a task-sequential training strategy: the model first completes sketch person re-identification learning on the person dataset; subsequently, it maintains the acquired person recognition capability through a trusted sample replay technique and seamlessly performs incremental training on the face dataset. This enables a single model to handle multiple sketch biometric identification tasks simultaneously. To support the study of sketch biometric identification, we built a new large-scale benchmark, SketchUnified-BioID, with several practical evaluation protocols.

2605.17365 2026-05-19 cs.CV 版本更新

Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

基于记忆的查询意图理解用于高效的基于聊天的图像检索

Xianke Chen, Daizong Liu, Yushuo Lou, Xin Tan, Xun Yang, Shuhui Wang, Xun Wang, Jianfeng Dong

发表机构 * School of Computer Science and Technology, and the School of Statistics and Mathematics, Zhejiang Gongshang University(计算机科学与技术学院,和统计与数学学院,浙江工商大学) School of Computer Science and Technology, Zhejiang Gongshang University, and the Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology(计算机科学与技术学院,浙江工商大学,和大数据与未来电子商务技术浙江省重点实验室) Wangxuan Institute of Computer Technology, Peking University(王选计算机技术研究所,北京大学) School of Information and Electronic Engineering, Zhejiang Gongshang University(信息与电子工程学院,浙江工商大学) School of Computer Science and Technology, East China Normal University(计算机科学与技术学院,华东师范大学) Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, CAS(中国科学院智能信息处理重点实验室,计算技术研究所,中国科学院) School of Information Science and Technology, University of Science and Technology of China(信息科学与技术学院,中国科学技术大学)

AI总结 本文提出了一种高效的基于聊天的图像检索任务中的记忆增强查询意图理解框架MAQIU,通过动态聚合和演化查询意图的语义表示,防止意图遗忘并增强长期语义完整性,从而在保持高计算效率的同时实现显著的性能提升。

详情
AI中文摘要

与传统的文本到图像检索任务不同,基于聊天的图像检索允许人机交互系统通过多轮对话逐步澄清和细化用户意图,从而实现更精细的检索结果。该任务的关键挑战在于在对话轮次中动态理解和更新用户的查询意图。尽管现有工作在这一新任务上取得了显著性能,但它们要么通过直接拼接所有先前查询到一个长文本序列,要么依赖大语言模型来从历史中重建当前查询,这些策略计算冗余且容易导致意图表示不一致。为了解决这些问题,本文提出了一种新的、高效的基于记忆的用户意图更新框架,称为记忆增强查询意图理解(MAQIU)。它引入了一个轻量级的记忆模块,动态聚合和演化查询意图的语义表示,同时进一步采用记忆回查机制以防止意图遗忘并增强长期语义完整性。此外,MAQIU还整合了历史图像检索结果作为视觉指导,使模型能够加强跨轮次的相关性并细化当前视觉理解。广泛的实验表明,MAQIU在保持高计算效率的同时实现了显著的性能提升,与先前基线ChatIR相比,将对话编码FLOPs减少了86.4%。源代码可在https://github.com/HuiGuanLab/MAQIU上获得。

英文摘要

Different from traditional text-to-image retrieval tasks, chat-based image retrieval allows the human-interactive system to iteratively clarify and refine user intent through multi-round dialogue, thereby achieving more fine-grained retrieval results. The key challenge in this task lies in dynamically understanding and updating the user's query intent across dialogue rounds. Although existing works have achieved great performance on this new task, they simply handle history query information either by directly concatenating all previous queries into a long textual sequence or by relying on large language models to reconstruct the current query from history. Such strategies are computationally redundant and easily lead to inconsistent intent representations as the dialogue progresses. To alleviate these issues, this paper proposes a novel and efficient memory-based user intent updating framework for the chat-based image retrieval task, called Memory-Augmented Query Intent Understanding (MAQIU). It introduces a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues, while a memory recall mechanism is further employed to prevent intent forgetting and enhance long-term semantic integrity. In addition, MAQIU also integrates historical image retrieval results as visual guidance, allowing the model to strengthen cross-round correlations and refine current visual understanding. Extensive experiments demonstrate that MAQIU achieves substantial performance gains while maintaining high computational efficiency, reducing dialogue encoding FLOPs by 86.4% compared with the prior baseline ChatIR. Source code is available at https://github.com/HuiGuanLab/MAQIU.

2605.17360 2026-05-19 cs.CV 版本更新

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Omni-DuplexEval: 评估实时双工全模交互

Chaoqun He, Mingyang Xiang, Yingjing Xu, Bokai Xu, Junbo Cui, Jie Zhou, Yuan Yao, Lijie Wen

发表机构 * Tsinghua University(清华大学) Tongji University(同济大学) ModelBest Inc.(ModelBest公司)

AI总结 本文提出Omni-DuplexEval基准,用于系统评估实时双工交互能力,通过两个互补场景评估模型生成连续响应和主动提醒的能力,并揭示现有模型在平衡响应及时性和内容连贯性方面的局限性。

Comments 22 pages, 6 figures

详情
AI中文摘要

实时双工交互对于在真实世界场景中运行的多模态AI系统至关重要,其中模型必须持续处理流式输入并适时响应。然而,大多数现有的多模态大语言模型(MLLMs)是在离线设置中评估的,其中整个视频输入在生成任何响应之前都被处理。尽管最近的工作开始探索实时双工MLLMs,但仍然没有全面的基准或自动评估方法用于这种设置。为了解决这一差距,我们提出了Omni-DuplexEval,一个用于系统评估实时双工交互的基准。该基准包含两个互补场景:(1)实时描述,评估生成连续、时间对齐的响应以跟踪演化的多模态输入的能力,以及(2)主动提醒,评估识别显著事件并适时响应的能力。Omni-DuplexEval包含660个视频,具有细粒度、人工标注的标签和精确的时间元数据,涵盖9个基于真实世界场景的任务,其中所有问题均以开放性查询形式提出。我们进一步引入了一个基于LLM-as-a-Judge的自动评估框架,通过时间戳感知和顺序推理联合评估响应内容对齐和响应时间,实现了与人类判断的高度一致。在最先进的双工MLLMs上的实验揭示了显著的局限性。表现最好的模型整体得分仅为39.6%,在主动提醒任务上仅得20.0%。我们的分析识别出两个关键挑战:模型在平衡及时响应与连贯、整体内容生成方面存在困难,且它们往往无法确定何时响应和生成什么内容。我们希望我们的工作能促进MLLMs的进一步发展。

英文摘要

Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.

2605.17356 2026-05-19 cs.CV 版本更新

UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings

UniPPTBench: 一种统一的演示生成基准,适用于多样化的输入设置

Bo Zhao, Maosheng Pang, Chen Zhang, Huan Yang, Yixin Cao, Wei Ji

AI总结 本文提出UniPPTBench,一个统一的演示生成基准,针对四种代表性的输入设置:模糊提示、长文档、多模态文档和多源生成,同时引入UniPPTEval评估协议,结合跨设置比较的共享指标和针对每个设置核心需求的定制指标,以提供更准确的评估框架。

详情
AI中文摘要

现有工作通常专注于孤立的输入设置下的演示生成,而现实中的使用案例涵盖了多样化的场景,包括模糊的用户提示、长文档、多模态材料和多个异质来源。此外,当前的评估往往不够场景特定。它们主要依赖于通用的演示质量标准,如视觉吸引力、布局质量以及整体连贯性,但未能评估不同输入设置所需的核心能力,包括基于事实的压缩、视觉-文本对齐以及跨来源合成。因此,该领域缺乏一个统一的基准和一个场景感知的评估框架,以准确诊断不同现实场景下的演示生成系统。我们提出了UniPPTBench,一个适用于四种代表性输入设置的统一基准:模糊提示、长文档、多模态文档和多源生成。我们进一步引入UniPPTEval,一种场景感知的评估协议,结合用于跨设置比较的共享指标和针对每个设置核心需求定制的场景特定指标。我们还提供了透明的参考基线以支持可重复的比较。在UniPPTBench上的实验揭示了不同设置之间的显著性能差异以及内容基础、多模态整合和跨来源合成中的反复失败模式。特别是,通用演示质量指标上的强大表现并不一定意味着在基于事实的场景中任务执行的强表现。共同,UniPPTBench和UniPPTEval为评估不同现实场景下的演示生成提供了忠实且诊断性的基础。代码和数据将公开可用。

英文摘要

Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.

2605.17354 2026-05-19 cs.CV 版本更新

GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction

GeoHand: 解锁先验几何知识以实现单目3D手形重建

Weiquan Lin, Yaoqing Hu, Liangchen Dai, Xu Tang, Xingyu Chen

发表机构 * School of Artificial Intelligence, Xidian University(西安电子科技大学人工智能学院) Zhongguancun Academy(中关村学院) School of Automation, Beijing Institute of Technology(北京理工大学自动化学院)

AI总结 本文提出GeoHand框架,通过解锁冻结的基础单目几何估计器MoGe2中的高质量几何先验,结合地图级GeoAdapter和门控跨模态token融合策略,实现高精度手形重建,尤其在严重遮挡和手-物体交互场景中表现优异。

详情
AI中文摘要

单目3D手形重建本质上是一个几何问题,然而仅依靠RGB外观特征往往难以解决由自遮挡和手-物体相互作用引起的严重歧义。虽然引入深度可以显式提供空间线索,但原始传感器捕获的深度图存在大量噪声和不完整性,限制了其在细粒度手形重建中的应用。为弥合这一差距,我们提出GeoHand,一种新颖的框架,能够从冻结的基础单目几何估计器MoGe2中解锁高质量几何先验。认识到这些先验偏向于通用场景,我们引入地图级GeoAdapter来重新校准空间特征,特别是适应于细节丰富的手形重建。此外,为了系统地整合这些适应后的先验而不过度干扰固有的RGB外观线索,我们采用门控跨模态token融合策略。最后,为了确保精确的局部运动,我们设计了关键点查询迭代细化器(KQIR),利用投影的关节位置查询几何感知的图像特征以进行空间修正。通过在统一管道中结合全局几何消歧和局部细化,GeoHand在FreiHAND、DexYCB和HO3Dv3上实现了最先进的性能,特别是在严重遮挡和手-物体交互场景中。

英文摘要

Monocular 3D hand reconstruction is intrinsically a geometric problem, yet RGB appearance features alone often struggle to resolve severe ambiguities caused by self-occlusions and hand-object interactions. While introducing depth can explicitly provide spatial cues, raw sensor-captured depth maps are extensively noisy and incomplete, limiting their usefulness for fine-grained hand reconstruction. To bridge this gap, we propose GeoHand, a novel framework that unlocks high-quality geometric priors from a frozen foundational monocular geometry estimator (MoGe2). Recognizing that these priors are oriented toward general scenes, we introduce a map-level GeoAdapter to recalibrate the spatial features, specifically adapting them for detailed hand reconstruction. Furthermore, to systematically integrate these adapted priors without overwhelming intrinsic RGB appearance cues, we employ a gated cross-modal token fusion strategy. Finally, to secure precise local articulation, we design a Keypoint-Queried Iterative Refiner (KQIR) that uses projected joint locations to query geometry-aware image features for spatial correction. By combining global geometric disambiguation with local refinement in a unified pipeline, GeoHand achieves state-of-the-art performance on FreiHAND, DexYCB, and HO3Dv3, especially under severe occlusions and hand-object interactions.

2605.17347 2026-05-19 cs.CY cs.CV cs.LG 版本更新

Position: Age Estimation Models Do Not Process Biometric Data

立场:年龄估计模型不处理生物特征数据

Nikita Marshalkin

发表机构 * Sumsub GmbH, Berlin, Germany(Sumsub公司,柏林,德国)

AI总结 本文研究了年龄估计模型是否处理生物特征数据,通过实验表明这些模型无法达到身份识别阈值,因此不涉及身份识别,呼吁研究者和监管机构提高透明度。

Comments 11 pages, 3 figures, 3 tables. Accepted as a position paper at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

当神经网络通过照片估计某人的年龄时,它是否处理生物特征数据?答案取决于网络在推断过程中是否生成身份区分的表示,这个问题对机器学习研究人员来说可能显得微不足道,但却会触发GDPR下的同意要求、BIPA下的法定损害赔偿,或欧盟AI法案下的高风险AI分类。然而,目前没有监管指导。本文提供了实证证据:在三个面部验证基准测试中评估的14个模型显示,年龄估计器比识别阈值低了数个数量级。年龄估计模型无法识别个体。我们呼吁研究者就系统存储了什么以及能做什么提供透明度,并呼吁监管机构区分短暂处理与模板存储。

英文摘要

When a neural network estimates someone's age from a photograph, does it process biometric data? The answer depends on whether identity-discriminative representations arise within the network during inference, a question that may seem trivial to ML researchers but triggers consent requirements under GDPR, statutory damages under BIPA, or high-risk AI classification under the EU AI Act. Yet no regulatory guidance addresses it. This position paper provides empirical evidence: 14 models evaluated across 3 face verification benchmarks show age estimators fall orders of magnitude short of identification thresholds. Age estimation models cannot identify individuals. We call on researchers to provide transparency about what systems store and can do, and on regulators to distinguish transient processing from template storage.

2605.17345 2026-05-19 cs.CV 版本更新

VoxShield: Protecting 3D Medical Datasets from Unauthorized Training via Frequency-Aware Inter-Slice Disruption

VoxShield: 通过频率感知的跨切片扰动保护3D医学数据集免受未经授权的训练

Xinyao Liu, Zhipeng Deng, Wenhan Jiang, Haolin Wang, Xun Lin, Yafei Ou, Yefeng Zheng

发表机构 * Westlake University(西湖大学) Dalian University of Technology(大连理工大学) Hokkaido University(北海道大学) The Chinese University of Hong Kong(香港中文大学) RIKEN(理化学研究所)

AI总结 本文提出VoxShield,一种通过频率感知的跨切片扰动机制,针对3D分割网络的体积归纳偏置,有效降低3D分割网络性能,同时保持视觉质量。

Comments Submitted version to MICCAI 2026 (Provisional Accept)

详情
AI中文摘要

公开3D医学图像分割(MIS)数据集的发布加速了临床研究,但同时也提高了未经授权的AI模型训练的风险。尽管不可学习的例子(UE)通过注入不可察觉的扰动来防止有效模型学习,但现有方法主要针对2D场景。它们忽略了3D医学体积中固有的体积空间相关性和跨切片解剖一致性,这些是3D分割网络的关键学习先验。为弥合这一差距,我们提出了VoxShield,一种UE框架,专门针对3D网络的体积归纳偏差。我们的核心见解是通过系统性地破坏3D架构依赖的跨切片连续性,可以根本破坏其空间聚合过程。具体来说,我们引入了一种跨切片频率一致性扰动机制,最大化相邻切片之间的频谱差异,沿z轴注入结构不一致性。此外,还加入了语义预测扰动模块。通过最大化干净和扰动logits之间的ℓ₁差异,它迫使注入的噪声穿透整个网络并破坏最终的语义映射。在BraTS19和FLARE21上的实验表明,VoxShield成功降低了3D分割性能,将DSC从80.0%降至接近0.0%,从88.6%降至6.8%。所有保护都通过最小扰动(ε=4/255)实现,以保持高质量的视觉保真度。代码可在https://github.com/KK266299/VoxShield上获得。

英文摘要

The release of public 3D medical image segmentation (MIS) datasets accelerates clinical research but simultaneously heightens risks of unauthorized AI model training. While Unlearnable Examples (UE) offer protection by injecting imperceptible perturbations to prevent effective model learning, existing methods primarily target 2D scenarios. They neglect the volumetric spatial correlations and inter-slice anatomical consistency inherent in 3D medical volumes, which serve as critical learning priors for 3D segmentation networks. To bridge this gap, we propose VoxShield, a UE framework that explicitly targets the volumetric inductive biases of 3D networks. Our core insight is that by systematically dismantling the cross-slice continuity that 3D architectures rely on, we can fundamentally impair their spatial aggregation process. Specifically, we introduce an Inter-Slice Frequency Consistency Disruption mechanism that maximizes the spectral divergence between adjacent slices, injecting structural incoherence along the $z$-axis. Complementing this structural attack, a Semantic Prediction Disruption module is incorporated. By maximizing the $\ell_1$ divergence between clean and perturbed logits, it forces the injected noise to penetrate the entire network and corrupt the final semantic mapping. Experiments on BraTS19 and FLARE21 demonstrate that VoxShield successfully degrades 3D segmentation performance, reducing the DSC from 80.0% to near 0.0% and from 88.6% to 6.8%, respectively. All protections are achieved with minimal perturbation ($\epsilon=4/255$) to preserve high visual fidelity. The code is available at https://github.com/KK266299/VoxShield.
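
The DSC figures quoted above refer to the Dice similarity coefficient. A minimal sketch of how it is computed on binary masks follows; the flattened toy masks and the function name `dice_coefficient` are hypothetical and are not drawn from the VoxShield repository.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened masks: 1 = voxel labeled as the structure of interest.
target = [0, 1, 1, 1, 0, 0, 1, 0]
pred = [0, 1, 1, 0, 0, 1, 1, 0]
print(f"DSC = {dice_coefficient(pred, target):.2f}")
```

A DSC near 0 means the predicted and reference masks barely overlap, which is the outcome the protection aims to induce in networks trained on the perturbed data.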

2605.17343 2026-05-19 cs.CV 版本更新

GraphMAR: Geometry-Aware Graph Learning Framework for Spatially Adaptive CT Metal Artifact Reduction

GraphMAR: 一种基于几何的图学习框架用于空间自适应的CT金属伪影减少

Zilong Li, Chenglong Ma, Yiming Lei, Yuanlin Li, Jing Han, Jiannan Liu, Huidong Xie, Junping Zhang, Yi Zhang, Hongming Shan

发表机构 * Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University(上海智能信息处理关键实验室,计算机科学与人工智能学院,复旦大学) Institute of Science and Technology for Brain-inspired Intelligence, Fudan University(脑启发式智能科学技术研究院,复旦大学) College of Computer Science and Technology, Qingdao University(计算机科学与技术学院,青岛大学) Department of Oral Maxillofacial Head and Neck Oncology, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine(口腔颌面头颈肿瘤科,上海第九人民医院,上海交通大学医学院) School of Cyber Science and Engineering, Sichuan University(网络科学与工程学院,四川大学)

AI总结 本文提出GraphMAR,一种基于几何的图学习框架,用于在图像域中实现空间自适应的CT金属伪影减少,通过引入图基的几何建模来显式识别伪影并提高恢复质量和可解释性。

详情
AI中文摘要

计算机断层扫描(CT)金属伪影减少(MAR)旨在减少由金属植入物和其他高密度物体引起的严重条纹伪影。有效的MAR通常需要准确的伪影定位和去除。sinogram域方法可以利用显式的几何线索,如金属痕迹,来识别金属损坏的测量,但需要原始投影数据,这在临床和实际场景中往往不可用。图像域方法更加灵活且广泛适用,但通常缺乏可比的几何指导,限制了它们定位伪影的能力,导致结果不理想。为了解决这一限制,我们提出了GraphMAR,一种用于显式伪影识别和图像域中空间自适应MAR的几何感知学习框架。关键思想是引入基于图的几何建模作为sinogram金属痕迹的图像域类比。具体来说,我们首先从金属掩模中构建几何图,并推导出一个几何密度图,根据植入物之间的几何关系粗略定位伪影易发区域。然后我们设计了GraphMoE,一个基于图的混合专家模块,该模块在特征空间中构建极坐标伪影图,并自适应地将不同专家路由到不同的空间区域进行MAR。通过将学习到的路由图与几何密度图对齐,GraphMAR在提供显式和可解释的伪影定位的同时,实现了区域自适应的伪影减少。在模拟和真实世界数据集上的实验表明,GraphMAR相比现有方法实现了更优的MAR性能。据我们所知,这是首次引入基于图的建模用于CT MAR,并在图像域中实现显式的伪影识别,提高了恢复质量和可解释性。

英文摘要

Computed tomography (CT) metal artifact reduction (MAR) aims to reduce the severe streaking artifacts induced by metallic implants and other high-density objects. Effective MAR generally requires both accurate artifact localization and artifact removal. Sinogram-domain methods can exploit explicit geometric cues, such as metal traces, to identify metal-corrupted measurements, while requiring raw projection data, which is often unavailable in clinical and practical scenarios. Image-domain methods are more flexible and widely applicable, yet they usually lack comparable geometric guidance, limiting their ability to localize artifacts and leading to suboptimal results. To address this limitation, we propose GraphMAR, a geometry-aware learning framework for explicit artifact identification and spatially adaptive MAR in the image domain. The key idea is to introduce graph-based geometric modeling as an image-domain analogue of sinogram metal traces. Specifically, we first construct a geometric graph from the metal mask and derive a geometric density graph that coarsely localizes artifact-prone regions according to inter-implant geometry. We then design GraphMoE, a graph-routed mixture-of-experts module that builds a polar-coordinate artifact graph in feature space and adaptively routes different experts to different spatial regions for MAR. By aligning the learned routing maps with the geometric density graph, GraphMAR provides explicit and interpretable artifact localization while enabling region-adaptive artifact reduction. Experiments on both simulated and real-world datasets demonstrate that GraphMAR achieves superior MAR performance compared with existing methods. To the best of our knowledge, this is the first work to introduce graph-based modeling for CT MAR and to enable explicit artifact identification in the image domain, improving both restoration quality and interpretability.

2605.17341 2026-05-19 cs.CV cs.AI 版本更新

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

通过跨模态语义对齐实现面向视觉-语言模型的单样本黑盒成员推断攻击

Jiaqing Li, Yajuan Lu, Xiaochuan Shi, Gang Wu, ZhongYuan Wang, Chao Liang

发表机构 * Wuhan University(武汉大学) Tarim University(塔里木大学)

AI总结 本文提出了一种基于跨模态语义对齐的新型成员推断攻击框架,针对视觉-语言模型在单样本和黑盒场景下的数据安全风险进行评估,通过量化联合嵌入空间中的对齐程度,显著提升了攻击性能。

详情
AI中文摘要

视觉-语言模型(VLMs)虽取得了显著成功,但其依赖大规模数据集和意外记忆训练数据,带来了重大数据安全风险。成员推断攻击(MIAs)旨在通过确定数据样本是否包含在模型训练集中来评估这些风险。然而,现有针对VLMs的MIAs方法面临关键瓶颈:灰盒方法依赖于内部logits,通常在实际应用程序接口(APIs)中受限,而黑盒方法依赖于大规模统计分布,在单样本场景中表现不佳。为此,我们从跨模态语义对齐的角度研究MIAs,并观察到成员图像由于训练记忆表现出显著更强的图像-描述对齐,而生成的非成员描述可能偏离原始视觉内容。基于这一洞察,我们提出了一种针对严格黑盒和单样本场景的新MIAs框架,该框架在联合嵌入空间中量化此类对齐,从而绕过这些不现实的假设。我们在三个开源和两个闭源VLMs上进行了广泛实验。在VL-MIA/Flicker数据集上,我们的方法在LLaVA-1.5上实现了0.821的AUC,显著优于现有基线。此外,它在各种图像扰动下仍保持稳健,突显了其实用性。

英文摘要

Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box methods depend on large-scale statistical distributions and therefore struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment, and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for strict black-box and single-sample settings that quantifies such alignment within a joint embedding space, thereby bypassing the unrealistic assumptions of logit access and large sample sets. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
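
The alignment idea in this abstract can be illustrated with a minimal sketch. It assumes the image and its generated caption have already been embedded into a joint space by some encoder (not shown); the function names, the threshold value, and the pairwise AUC estimator are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def is_member(image_emb, caption_emb, threshold=0.5):
    """Hypothetical single-sample rule: stronger alignment suggests membership.

    The threshold is an arbitrary placeholder, not a value from the paper.
    """
    return cosine_similarity(image_emb, caption_emb) > threshold

def auc(member_scores, nonmember_scores):
    """AUC as the probability that a member outscores a non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```

Because the score is computed per image-caption pair, the rule needs only one query sample and no access to model logits, which is the setting the abstract targets.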

2605.17336 2026-05-19 cs.RO cs.CV eess.SP 版本更新

Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

基于触觉的多模态融合在具身智能中的应用:视觉、语言和接触驱动范式的综述

Zhixiang Cao, Di Tian, Runwei Guan, Yanzhou Mu, Xiaolou Sun, Shaofeng Liang, Daizong Liu, Tao Huang, Yutao Yue, Henghui Ding, Bin Fang, Alex Zhou, Qing-Long Han, Hui Xiong

发表机构 * School of Electronic Science and Engineering, Xi’an Jiaotong University, China(西安交通大学电子科学与工程学院) Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China(香港科技大学(广州)人工智能学域) State Key Laboratory for Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) Purple Mountain Laboratory, China(紫金山实验室) Institute for Math & AI, Wuhan University, China(武汉大学数学与人工智能研究院) Centre for AI and Data Science Innovation and the School of Science and Engineering, James Cook University, Australia(詹姆斯库克大学人工智能与数据科学创新中心及科学与工程学院) School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China(北京邮电大学人工智能学院) Institute of Big Data, Fudan University, China(复旦大学大数据研究院) Linkerbot (Beijing) Technology Co., Ltd, China(北京链动科技有限公司) School of Engineering, Swinburne University of Technology, Melbourne(斯威本科技大学工程学院)

AI总结 本文综述了多模态触觉融合在具身智能中的研究,探讨了如何通过整合视觉、语言和触觉信息来提升物理交互与语义推理的结合,提出了一种分层的分类体系,并总结了当前的研究挑战和未来方向。

Comments 20 pages, 8 figures

详情
AI中文摘要

触觉感知是具身智能中的基本模态,能够提供关于接触几何、材料属性和交互动态的独特且直接反馈,这无法被远程传感器所替代。然而,单一的触觉感知在空间覆盖稀疏和缺乏全局语义上下文方面存在固有局限。随着深度学习和大语言模型的迅速发展,将触觉与视觉和语言相结合已成为连接物理交互与语义推理的关键。尽管进展迅速,现有研究仍分散在不同的数据集、传感模态和任务中,缺乏统一的理论框架。为解决这一差距,本文提供了截至2026年第一季度的多模态触觉融合研究的全面综述。我们提出了一种分层的分类体系,将该领域分为两个主要维度:多模态数据集和多模态方法。在数据方面,我们对从触觉-视觉数据集、触觉-语言数据集、触觉-视觉-语言数据集以及触觉-视觉-其他数据集等资源进行了分类。在方法方面,我们把先前的工作分为三个核心支柱:(1)多模态感知与识别,专注于物体理解和抓取预测;(2)跨模态生成,专注于触觉、视觉和文本之间的双向翻译;(3)多模态交互,强调反馈控制和语言引导的操作。此外,我们总结了代表性的触觉传感硬件,回顾了常用的评估指标和基准设置,并讨论了当前的挑战和有前途的未来方向。

英文摘要

Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile sensing with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, and lacks a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.

2605.17327 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

利用前馈3D模型实现单目视觉-惯性系统的高效无特征初始化

Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学) HKUST (GZ)(香港科技大学(广州)) Zhejiang University(浙江大学)

AI总结 本文提出了一种无需视觉特征跟踪的初始化框架,利用前馈3D模型预测的点云,从而提高了单目视觉-惯性导航系统的初始化可靠性与效率,实验表明其初始化成功率超过90%且数据需求显著减少。

详情
AI中文摘要

快速且可靠的初始化对于单目视觉-惯性导航系统(VINS)至关重要,因为它为后续的状态估计建立了初始条件。尽管已有稳步进展,但大多数现有方法仍严重依赖视觉特征对应关系,并需要3-4秒的传感器数据才能成功初始化,这限制了它们的适用性和效率。随着能够直接从图像预测点云的前馈3D模型的出现,我们从更简洁的角度重新审视视觉-惯性初始化问题。在本文中,我们提出了一种无特征初始化框架,利用前馈3D模型预测的尺度未定(up-to-scale)点云,从而避免了视觉特征跟踪和估计的需要。这种设计显著降低了系统复杂性并提高了初始化的可靠性。在公开数据集上的实验表明,所提出的无特征初始化方法实现了最高成功率,超过90%,并且显著减少了成功初始化所需的数据时长,通常降至1.2秒以下。我们进一步在自采集的数据集上验证了我们的方法,该数据集覆盖了各种室内和室外场景,展示了鲁棒性能,特别是在现有方法常失败的视觉退化环境中。代码和数据集可在 https://github.com/Yuantai-Z/FF-VIO-Init 获取。

英文摘要

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.

2605.17312 2026-05-19 cs.CV 版本更新

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers

VISTA: 基于扩散变换器的三元组监督视频风格迁移

Yiren Song, Wangzi Yao, Haofan Wang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore(新加坡国立大学Show实验室) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Lovart AI(Lovart人工智能)

AI总结 本文提出VISTA方法,通过引入大规模三元组数据和基于扩散变换器的框架,解决视频风格迁移中风格、内容和运动的联合建模与解耦问题,实现了高质量的风格迁移效果。

详情
AI中文摘要

视频风格迁移旨在在保持内容、结构和运动的同时将视频渲染成目标艺术风格。尽管图像风格化技术已迅速发展,但视频风格化仍具有挑战性,因为存在时间不一致的问题。现有的大多数方法对帧或关键帧进行风格化,并通过启发式的时序传播来强制一致性,这在遮挡、遮挡解除和长期运动下容易产生漂移和闪烁伪影。我们提出VISTA-1000,一个包含1000种风格和运动对齐三元组的数据集(风格参考、干净视频、风格化视频),并提出一种基于扩散变换器的上下文视频风格迁移框架,具有轻量级的风格适配器以实现稳健的风格提取。大量实验表明,该方法在风格保真度、时间一致性和内容保持方面均达到最佳性能。

英文摘要

Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and of a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate SOTA performance in style fidelity, temporal consistency, and content preservation.

2605.17311 2026-05-19 cs.CV 版本更新

SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

SpecSem-Net: 通过融合频谱和语义特征实现鲁棒的AI生成视频检测

Zixi Wei, Huixuaun Zhang, Xiaojun Wan

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机研究所)

AI总结 本文提出SpecSem-Net框架,通过引入语义引导的频谱去噪机制,有效检测高保真AI生成视频,实验表明其在基准和公开数据集上达到87.25%和95.59%的准确率。

详情
AI中文摘要

近期商业视频生成模型如Sora和Veo的显著视觉保真度,使得鲁棒的AI生成视频检测变得至关重要,以防止合成内容与真实视频难以区分并被用于虚假信息。然而,现有检测器往往因过度依赖日益逼真的语义特征而失败,忽视了细微的频谱伪影。本文提出SpecSem-Net,这是首个专门针对高保真AI生成视频检测引入语义引导频谱去噪机制的框架。具体而言,我们设计了一个频谱模块,通过基于傅里叶变换的过滤提取高频特征。此外,为减少频谱噪声引起的误判,我们采用门控融合机制,自适应融合语义上下文,有效缓解频谱噪声。此外,为了评估检测器在最新顶级生成模型上的性能,我们构建了一个包含5个顶级商业生成器的综合基准。广泛实验表明,SpecSem-Net在基准和公开数据集上均优于现有方法,分别达到87.25%和95.59%的准确率。

英文摘要

The remarkable visual fidelity of recent commercial video generative models, such as Sora and Veo, renders robust AI-generated video detection increasingly essential to prevent synthetic content from being indistinguishable from real videos and exploited for disinformation. However, existing detectors often fail due to an over-reliance on increasingly realistic semantic features, neglecting subtle spectral artifacts. In this paper, we propose SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. Specifically, we design a spectral module to extract high-frequency features via Fourier-Transform based filtering. Furthermore, to reduce misjudgments arising from spectral noise, we employ a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. Additionally, to evaluate detector performance on the latest top-tier generative models, we construct a comprehensive benchmark comprising 5 SOTA commercial generators. Extensive experiments demonstrate that SpecSem-Net outperforms existing methods, achieving accuracies of 87.25% and 95.59% on our benchmark and public datasets, respectively.
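
The "Fourier-Transform based filtering" step mentioned above can be sketched as a radial high-pass filter on a single grayscale frame. This is a generic illustration using NumPy, not the SpecSem-Net spectral module; the function name and the `cutoff_ratio` parameter are assumptions introduced here.

```python
import numpy as np

def high_frequency_residual(frame, cutoff_ratio=0.25):
    """Return the high-frequency component of a 2D grayscale frame.

    cutoff_ratio (an illustrative parameter) sets the radius, relative to
    half the shorter side, below which frequencies are discarded.
    """
    h, w = frame.shape
    # Move the zero-frequency (DC) term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    yy, xx = np.ogrid[:h, :w]
    radius = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    cutoff = cutoff_ratio * min(h, w) / 2.0
    # Keep only frequencies outside the low-frequency disc.
    filtered = spectrum * (radius > cutoff)
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))
```

A flat frame yields a residual of approximately zero, while a pixel-level checkerboard (the highest representable frequency) passes through unchanged, which is the behaviour one would want from a feature meant to expose subtle spectral artifacts.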

2605.17310 2026-05-19 cs.CV cs.AI 版本更新

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

注意力劫持:跨查询的视觉-语言模型响应操控

Zhiqiang Wang, Dongrui Liu, Yan Li, Zonghao Ying, Wei Xue, Wenhan Luo, Yike Guo

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Shanghai Jiao Tong University(上海交通大学) Beihang University(北京航空航天大学)

AI总结 本文研究了视觉-语言模型中跨查询响应操控问题,提出了一种新的对抗攻击方法Attention Hijacking,通过引导内部注意力分布保持图像主导模式,提高攻击在不同查询下的有效性。

详情
AI中文摘要

现有针对视觉-语言模型(VLMs)的对抗攻击可以将模型输出导向攻击者指定的目标响应,但当相同扰动输入与不同文本查询配对时,其效果往往会下降。本文研究了跨查询响应操控,即期望一个对抗示例在多样化的用户查询中保持有效。我们首先分析了现有攻击的局限性,发现成功迁移与在响应生成过程中保持图像主导的注意力模式密切相关。受此观察启发,我们提出了Attention Hijacking,一种新的对抗攻击方法,该方法明确引导内部注意力分布向持久的图像主导模式倾斜。通过放大视觉标记对目标响应标记的影响,同时抑制文本标记的竞争影响,我们的方法减少了被操控输出对特定查询措辞的依赖。在广泛使用的VLMs上的大量实验表明,Attention Hijacking显著提高了跨查询迁移性,适用于多样化的目标响应和未见查询。该方法也有效扩展到多种攻击场景,为VLMs中注意力稳定性在可迁移响应操控中的作用提供了新的见解。

英文摘要

Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different textual queries. This paper studies cross-query response manipulation, where a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with preserving an image-dominant attention pattern during response generation. Motivated by this observation, we propose Attention Hijacking, a novel adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of textual tokens, our method reduces the dependence of the manipulated output on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insights into the role of attention stability in transferable response manipulation for VLMs.
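
One way to make the notion of an "image-dominant attention pattern" concrete is to measure how much attention mass each response token places on image tokens. The sketch below is only a diagnostic under that assumed definition; the function names and the input layout (one attention row per response token, plus a boolean mask marking image tokens) are hypothetical and not taken from the paper.

```python
def image_attention_share(attention_row, image_token_mask):
    """Fraction of one response token's attention mass that falls on image tokens."""
    total = sum(attention_row)
    if total == 0:
        return 0.0
    image_mass = sum(a for a, is_image in zip(attention_row, image_token_mask) if is_image)
    return image_mass / total

def mean_image_attention_share(attention_rows, image_token_mask):
    """Average image share over all response tokens (one attention row per token)."""
    shares = [image_attention_share(row, image_token_mask) for row in attention_rows]
    return sum(shares) / len(shares)
```

Under this reading, a value close to 1 across different text queries would correspond to the persistent image-dominant pattern the abstract associates with successful cross-query transfer.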

2605.17309 2026-05-19 cs.CV cs.AI 版本更新

StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting

StyleText: 一个大规模数据集和基准,用于具有风格保留的场景文本修复

Aleksandr Simonyan, Nipun Jindal

发表机构 * Adobe Inc.(Adobe公司)

AI总结 本文提出StyleText,一个用于具有风格保留的场景文本修复的大规模数据集和基准,通过控制评估文本可读性和视觉一致性,利用共享场景上下文。

Comments Accepted at the SynData4CV Workshop, CVPR 2026. 8 pages + 1 page of references, 5 figures, 4 tables

详情
AI中文摘要

我们提出了StyleText,一个用于局部场景文本修复的大型数据集和基准,具有风格保留。StyleText包含28,518个图像-掩码-提示三元组,分为9,932个场景家族,使能够受控评估文本可读性和视觉一致性。我们通过自动化流程构建数据集,该流程结合LLM提示模板、基于Flux的源生成与键值(KV)缓存注入、基于OCR的语义过滤、多边形掩码提取以及掩码条件的FluxFill增强。我们定义了一个可重复的评估协议,使用归一化的OCR度量(词准确率和字符错误率)和CLIP图像-图像相似性,结合显式预处理。在StyleText上训练的FluxFill+LoRA基线在初始化基础上显著提高了OCR准确性,同时保持场景风格一致性,为未来的比较建立了有力的参考点。

英文摘要

We present StyleText, a large-scale dataset and benchmark for localized scene-text inpainting with style preservation. StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families, enabling controlled evaluation of text legibility and visual consistency under shared scene context. We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. We define a reproducible evaluation protocol using normalized OCR metrics (word accuracy and character error rate) and CLIP image-image similarity with explicit preprocessing. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.
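
The two OCR metrics named in the evaluation protocol, word accuracy and character error rate (CER), can be sketched as follows. The normalization used here (lowercasing and whitespace collapsing) is an assumption for illustration; the abstract does not specify the benchmark's exact preprocessing.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())

def edit_distance(a, b):
    """Levenshtein distance between two strings, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def character_error_rate(predicted, reference):
    """Edit distance divided by the reference length, after normalization."""
    ref = normalize(reference)
    hyp = normalize(predicted)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(hyp, ref) / len(ref)

def word_accuracy(predictions, references):
    """Share of predictions that exactly match their reference after normalization."""
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)
```

For example, an OCR reading of "opem" against the target "open" has one substitution over four characters, giving a CER of 0.25 and counting as a miss for word accuracy.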

2605.17303 2026-05-19 cs.CV 版本更新

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

LongDPM: 面向长单目视频的重叠感知4D重建

Chenyi Xu, Yihao Wu, Liqi Yan, Chao Yang, Jianhui Zhang, Fangli Guan, Pan Li

发表机构 * Hangzhou Dianzi University(杭州电子科技大学)

AI总结 本文提出LongDPM,一种重叠感知的长视频单目动态重建框架,通过重叠分块处理、置信度加权配准和动态身份关联,实现长距离的3D重建和跟踪,提升了PointOdyssey、Kubric-F和Kubric-G等数据集上的密集跟踪精度和相机姿态估计性能。

详情
AI中文摘要

从长单目视频中恢复动态3D场景,需要密集几何、相机运动和时间对应关系在共享坐标系中保持一致。现有方法面临两个关键挑战:(1)前馈重建模型提供准确的局部预测,但仅限于短片段;(2)长距离跟踪器保持对应关系但不产生密集的序列级重建。本文提出LongDPM,一种新的重叠感知框架,用于可扩展的长距离单目动态重建。首先,LongDPM以重叠分块的方式处理长视频,使推理内存受限于分块长度。其次,它通过带有静态感知重叠抽象的置信度加权配准,连接各分块的局部坐标系。第三,它在分块边界处关联动态身份,并融合匹配轨迹以恢复连贯的长距离3D运动。实验结果表明,LongDPM在长距离重建和跟踪性能上优于现有方法,在PointOdyssey、Kubric-F和Kubric-G数据集上相对于V-DPM降低了密集跟踪EPE,同时在相机姿态估计方面获得了最佳的TUM-dynamics ATE。

英文摘要

Recovering a dynamic 3D scene from a long monocular video requires dense geometry, camera motion, and temporal correspondence to remain consistent in a shared coordinate system. Existing methods face two key challenges: (1) feed-forward reconstruction models provide accurate local predictions but are limited to short clips, and (2) long-range trackers preserve correspondences without producing dense sequence-level reconstruction. This paper presents LongDPM, a novel overlap-aware framework for scalable long-range monocular dynamic reconstruction. First, LongDPM processes long videos in overlapping chunks, keeping inference memory bounded by the chunk length. Second, it connects chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction. Third, it associates dynamic identities across chunk boundaries and fuses matched trajectories to recover coherent long-range 3D motion. Experimental results demonstrate that LongDPM achieves superior long-range reconstruction and tracking performance, reducing dense tracking EPE over V-DPM on PointOdyssey, Kubric-F, and Kubric-G, while obtaining the best TUM-dynamics ATE for camera pose estimation.
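
The first step described above, splitting a long video into overlapping chunks so that memory depends on the chunk length instead of the video length, can be sketched as follows. The function name and its parameters are illustrative assumptions; the abstract does not state LongDPM's actual chunk length or overlap.

```python
def overlapping_chunks(num_frames, chunk_len, overlap):
    """Split frame indices [0, num_frames) into half-open (start, end) chunks.

    Consecutive chunks share `overlap` frames, which later registration
    steps can use to align chunk-local coordinate systems.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    if num_frames <= 0:
        return []
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return chunks
```

With 10 frames, a chunk length of 4, and an overlap of 2, this yields the chunks (0, 4), (2, 6), (4, 8), and (6, 10), so every neighbouring pair shares two frames.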

2605.17294 2026-05-19 cs.CV 版本更新

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

HierEdit: 基于区域的分层扩散用于高效的高分辨率编辑

Yuyao Zhang, Alexander Huang-Menders, Yu-Wing Tai

发表机构 * Dartmouth College(达特茅斯学院)

AI总结 本文提出HierEdit,一种区域感知的分层扩散框架,用于高效可扩展的高分辨率图像编辑。通过低分辨率代理生成参考并定位修改区域,结合分层局部窗口扩散模型和推理加速机制,实现无需高分辨率训练数据的快速高保真编辑。

详情
AI中文摘要

高分辨率图像编辑对于专业和创意应用至关重要,但现有的多模态扩散编辑器在计算效率上仍然不足,并且受限于相对较低的分辨率。当前方法要么冗余处理整个图像画布,要么依赖大规模高分辨率数据集,导致显著的训练和推理成本。我们引入HierEdit,一种区域感知的分层扩散框架,专门用于高效且可扩展的高分辨率图像编辑。我们的方法首先使用现成的编辑模型在低分辨率代理上进行编辑,生成参考并定位修改区域。一个分层局部窗口扩散模型(Local-Window MMDiT)仅在原始高分辨率图像中细化编辑区域,同时重用未修改的区域作为条件输入。低分辨率代理进一步提供结构指导和中间去噪监督(Inference Acceleration),确保一致的全局语义和稳定的生成,而无需完整的高分辨率注意力计算。这种针对性和分层的设计使图像编辑能够快速、高质量地达到4K分辨率,而无需任何专门的高分辨率训练数据。广泛的实验表明,HierEdit在商用分辨率数据集上实现了竞争性的视觉质量,同时显著加速了推理过程,并无缝扩展到超高清的4K编辑。请查看我们的项目页面:https://peteryyzhang.github.io/HierEdit-page/

英文摘要

High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (Local-Window MMDiT) then refines only the edited regions within the original high-resolution image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (Inference Acceleration), ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing. Project page: https://peteryyzhang.github.io/HierEdit-page/

2605.17284 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

CLAP:用于端到端自动驾驶的对比潜在空间提示优化

Ruiyang Zhu, Yuehan He, Boyuan Zheng, Zesen Zhao, Ahmad Chalhoub, Qingzhao Zhang, Z. Morley Mao

发表机构 * University of Michigan(密歇根大学) University of Arizona(亚利桑那大学)

AI总结 本文提出CLAP方法,通过对比潜在空间提示优化解决自动驾驶中罕见但安全关键的长尾场景问题,利用V2X通信获取数据并优化提示,从而提升规划性能。

Comments 9 pages + appendix

详情
AI中文摘要

端到端自动驾驶系统通过视觉-语言-动作(VLA)模型在常见驾驶场景中表现出色,但在罕见但安全关键的长尾场景如活跃施工区和复杂让行几何中表现脆弱。本文提出了一种方法,超越数据扩展和模型训练,解决长尾挑战场景。我们引入CLAP(对比潜在空间提示优化),一种位置感知的适应框架,通过车辆到一切(V2X)通信按需检索,将冻结的VLA驾驶模型与每条道路块的软提示相结合。我们的方法基于VLA潜在空间的两个观察:(i)在VLA的隐藏状态层,来自相同道路块的场景紧密聚集并占据潜在空间的紧凑区域;(ii)在单个道路块内,长尾和正常帧在潜在表示中高度混合,难以改进其中一个而不影响另一个。CLAP通过两阶段流程解决此问题:监督对比学习发现道路块特定的困难场景方向,随后方向性正则化提示优化选择性改进挑战帧同时保持正常帧性能。在NAVSIM基准上,使用各种最先进的VLA后端,CLAP将挑战场景规划错误减少了24%,在不回归正常帧的情况下显著提高了规划性能。

英文摘要

End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs' latent space: (i) at the VLA's hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.

2605.17270 2026-05-19 cs.CV 版本更新

Beyond Detection: A Structure-Aware Framework for Scene Text Tracking

超越检测:一种结构感知的场景文本跟踪框架

Chenmin Yu, Liu Yu, Daiqing Wu, Gengluo Li, Zeyu Chen, Yu Zhou

发表机构 * VCIP & TMCC & DISSec, College of Computer Science & College of Cryptology Cyber Science, Nankai University Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

AI总结 本文提出了一种结构感知的场景文本跟踪框架SymTrack,针对场景文本跟踪中的几何失真、视觉模糊和结构细节敏感性等挑战,通过双分支设计和自适应推理引擎实现了高效的文本跟踪,提升了视频文本处理的能力。

Comments Accepted at ICML 2026. Code is available at: https://github.com/EdisonYCM/SymTrack

详情
AI中文摘要

现代视觉目标跟踪器在一般目标上表现优异,但在处理场景文本时性能显著下降。尽管目前研究较少,但视频中的文本跟踪对于分割、移除和编辑等动态文本操作至关重要。为填补这一空白,本文将此特定任务正式定义为场景文本跟踪,并提出了首个系统性的工作。我们识别了该任务的三个主要挑战:1) 透视变化带来的严重几何失真,2) 不同实例之间的高视觉模糊性,3) 对细粒度结构细节的高敏感性。为解决这些问题,我们提出了SymTrack,一种无检测的统一框架,具有协同的双分支设计。它集成了Cross-Expert Calibration机制以减少语义偏差,以及Predictive Token Rectification机制以纠正结构不平衡,并辅以Adaptive Inference Engine以在运动约束下稳定预测。考虑到该任务缺乏专用基准,我们利用三个视频文本定位数据集构建了一个具有高质量注释的基准。大量实验表明,SymTrack在所有三个基准上均达到了新的最先进水平,在BOVText_SOT上比先前最佳跟踪器的AUC最多提高11.97%。总体而言,我们的工作促进了高效而全面的文本跟踪,为更通用的视频文本处理铺平了道路。

英文摘要

Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work for it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with a synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Considering the lack of dedicated benchmarks for this task, we utilize three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments demonstrate that SymTrack sets a new state of the art on all three benchmarks, outperforming previous best trackers by up to 11.97% AUC on BOVText_SOT. Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.

2605.17262 2026-05-19 cs.CV 版本更新

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

EgoIntrospect: 一个用于以用户为中心的内部状态推理的第一人称视角数据集和基准

Zeyu Wang, Chang Liu, Eduardus Tjitrahardja, Yuntao Wang, Borislav Pavlov, Fangfei Gou, Jose Manuel Davila, Dai Shi, Ran Xu, Yue Pan, Jiayi Tan, Shuting Chang, Qi Wang, Jinzhao Li, Jiacheng Hua, Yifei Huang, Jingwei Sun, Yu Zhang, Liuxin Zhang, Guocai Yao, Jia Jia, Yin Li, Qianying Wang, Yuanchun Shi, Miao Liu

发表机构 * Tsinghua University(清华大学) Tongji University(同济大学) Renmin University of China(中国人民大学) The University of Tokyo(东京大学) Lenovo Group(联想集团) Peking University(北京大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Shanghai Qi Zhi Institute(上海启智研究院)

AI总结 本文提出EgoIntrospect数据集,用于研究以用户为中心的内部状态推理,通过自注释揭示用户与AI助手的交互意图,评估多模态大语言模型从第一人称视角观察中推理用户内部状态的能力。

详情
AI中文摘要

尽管在第一人称视角视频数据集和基准方面已有大量努力,但理解用户内部状态这一对实现无缝AI助手体验至关重要的问题仍被忽视。在本文中,我们介绍了EgoIntrospect,这是第一个在用户驱动场景中采集的第一人称视角数据集,带有自我注释,明确揭示用户与AI助手的交互意图。EgoIntrospect使用跨设备设置收集,提供了同步的视频、音频、注视、运动和生理信号。它包含60名受试者共180小时的记录,平均每人记录3小时。利用EgoIntrospect,我们正式化了一套围绕用户内部状态的任务,包括情感体验、交互意图和认知记忆。我们进一步处理注释以构建基准,用于评估现代多模态大语言模型从第一人称视角观察中推理用户内部状态的能力。在我们基准上的实验表明,现有的多模态大语言模型难以有效利用多模态信号来推断用户的主观内部状态。该数据集和注释将向公众开放,以促进第一人称视角视觉和可穿戴AI助手的研究。项目页面:https://ego-introspect.github.io/

英文摘要

Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/

2605.17252 2026-05-19 cs.CV cs.GR 版本更新

Monocular Depth Perception Enhancement Based on Joint Shading/Contrast Model and Motion Parallax (JSM)

基于联合阴影/对比模型和运动视差(JSM)的单目深度感知增强

Seungchul Ryu, Hyunjin Yoo, Tara Akhavan

发表机构 * Faurecia Irystec Inc.(Faurecia Irystec公司)

AI总结 本文提出了一种新的JSM框架,用于增强单目深度感知,显著提高了深度体积感知和深度范围感知,该框架不仅适用于传统2D显示设备,也可用于3D显示设备,因为它能补充双目深度线索。

详情
AI中文摘要

立体3D显示器采用双目深度线索来提供深度感知。然而,用户需要配备昂贵的专用设备才能欣赏基于双目深度线索的深度感知。此外,立体显示器引起的视觉疲劳仍然是一个具有挑战性的问题。为了克服这一限制,本文提出了一种新的框架JSM,用于增强单目深度感知,显著提高了深度体积感知和深度范围感知。所提出的框架不仅可以在任何传统2D显示设备上提供增强的深度感知,还可以应用于3D显示设备,因为它能够补充双目深度线索。定性评估、消融研究和主观用户评估证明了所提出框架的优势和实用性。

英文摘要

Stereoscopic 3D displays rely on binocular depth cues to provide depth perception. However, users must be equipped with expensive special devices to experience depth perception based on binocular depth cues, and visual fatigue induced by stereoscopic displays remains a challenging open problem. To overcome these limitations, this paper proposes a novel framework, JSM, that enhances monocular depth perception, significantly improving both depth volume perception and depth range perception. The proposed framework not only provides enhanced depth perception on any conventional 2D display device but is also applicable to 3D display devices, since it is complementary to binocular depth cues. Qualitative evaluation, an ablation study, and a subjective user evaluation demonstrate the advantages and practicality of the proposed framework.

2605.17248 2026-05-19 cs.CV 版本更新

Image-to-Video Diffusion: From Foundations to Open Frontiers

图像到视频扩散:从基础到开放前沿

Xianlong Wang, Wenbo Pan, Shijia Zhou, Ke Li, Yuqi Wang, Zeyu Ye, Hangtao Zhang, Leo Yu Zhang, Xiaohua Jia

发表机构 * IEEE Publication Technology Group(IEEE出版技术组)

AI总结 本文研究了图像到视频扩散生成的核心问题,通过梳理任务定义、模型架构、数据集和评估指标,提出了一种基于架构和训练范式的分类方法,并总结了四个核心设计:条件编码、时间建模、噪声先验设计和空间时间上采样,探讨了代表性应用场景和主要开放挑战。

详情
AI中文摘要

基于扩散的图像到视频(I2V)生成已成为生成模型中的核心方向,通过将参考图像(可选条件)转换为时间一致的视频。与更广泛的视频生成设置相比,该任务对内容一致性、身份保持和运动一致性提出了更严格的要求。尽管文献增长迅速,现有工作大多在更广泛的主题中讨论I2V生成,仍缺乏专门的分类法和以该领域为中心的系统分析。本文通过将扩散I2V生成视为独立主题来填补这一空白。首先回顾了任务定义、模型架构、数据集和评估指标,然后通过基于架构和训练范式的分类法组织现有方法。进一步总结了四个核心设计,即条件编码、时间建模、噪声先验设计和空间时间上采样,并讨论了代表性应用场景和主要开放挑战。

英文摘要

Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature is growing rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, and discusses representative application scenarios together with major open challenges.

2605.17236 2026-05-19 cs.CV cs.AI 版本更新

Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

对视觉变换器在自动化宫颈癌分类中的系统评估:优化、统计验证与临床可解释性

Nisreen Albzour, Sarah S. Lam

发表机构 * School of Systems Science and Industrial Engineering, Binghamton University(宾汉姆顿大学系统科学与工业工程学院)

AI总结 本文研究了视觉变换器在自动化宫颈癌分类中的应用,通过优化和统计验证,展示了其在临床可解释性方面的优势。

详情
AI中文摘要

手动宫颈癌筛查的巴氏涂片分析受到观察者间差异、时间限制和专家资源有限的限制。尽管卷积神经网络(CNNs)已自动化了宫颈细胞分类,但它们在建模长距离空间依赖性和缺乏临床可解释性方面仍有局限。在本研究中,视觉变换器(ViT)架构被系统优化以提高自动化宫颈癌筛查的性能,从而提高了可解释性。通过赫尔勒夫数据集(917张图像:242张正常,675张异常)对ViT-Tiny进行优化,这是一种轻量级视觉变换器架构,旨在减少计算复杂性。通过全面评估增强策略、类别加权和超参数,最佳配置实现了94.9%-95.2%的交叉验证准确率,其中随机水平翻转和类别加权(0.7 x 1.3)被确定为最有效的因素。梯度加权类激活映射(Grad-CAM)分析证实,模型注意力对应于临床相关形态学特征,包括核区域、细胞边界和染色质纹理,这与细胞病理学标准一致。这些发现表明,视觉变换器可以提供准确且可解释的决策支持,以用于宫颈癌筛查,这满足了医疗AI部署所需的临床性能和透明性要求。

英文摘要

Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which resulted in improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, with random horizontal flipping and class weighting (0.7 x 1.3) identified as the most effective factors. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, in line with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, meeting both the clinical performance and transparency requirements essential for medical AI deployment.
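
Class weighting on an imbalanced dataset such as Herlev (242 normal vs. 675 abnormal images) is commonly applied through a weighted cross-entropy loss. The sketch below is a generic version of that loss, not the authors' training code. In particular, the abstract does not say which class receives 0.7 and which 1.3; assigning the larger weight to the minority "normal" class is an assumption made here.

```python
import math

# Assumed assignment: the minority class ("normal") gets the larger weight.
CLASS_WEIGHTS = {"normal": 1.3, "abnormal": 0.7}

def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of per-sample negative log-likelihoods.

    probs  : list of dicts mapping class name -> predicted probability
    labels : list of true class names
    The sum is normalised by the total weight of the batch, the same
    convention PyTorch's CrossEntropyLoss uses with reduction='mean'.
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum
```

Under these weights, a batch in which the normal sample is the poorly predicted one incurs a larger loss than the mirrored batch, which pushes the model to pay more attention to the under-represented class.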

2605.17214 2026-05-19 cs.AI cs.CL cs.CV 版本更新

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

ChemVA:推动大型语言模型在化学反应图示理解上的进步

Mingyang Rao, Kehua Feng, Zhihui Zhu, Jiangzhen Fu, Hao Yu, Keyan Ding, Huajun Chen

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University(浙江大学杭州全球科学与技术创新中心) Department of Chemistry, Fudan University(复旦大学化学系)

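AI summary: This paper addresses the visual deficit and semantic disconnect that limit existing systems on chemical reaction diagrams. It proposes the ChemVA framework, which uses a Visual Anchor mechanism and semantic alignment to improve the performance of large language models on chemical reasoning.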

详情
AI中文摘要

尽管大型语言模型(LLMs)已革新了科学文本处理,但在解释化学反应图示方面存在显著的能力差距。我们识别出两个限制当前系统的根本瓶颈:视觉缺陷,即通用视觉编码器难以解析密集分子图的严格拓扑连接性;以及语义断开,即标准线性字符串,如SMILES,无法有效激活模型的潜在化学推理能力。为弥合这些差距,我们提出了化学视觉激活(ChemVA)框架,该框架采用视觉锚机制通过混合粒度检测来定位功能团,随后采用语义对齐方法将视觉特征转换为实体名称,以最大限度地激活LLMs中的知识。我们在OCRD-Bench数据集上评估了我们的方法,该数据集包含密集的视觉-语义上下文和全面的反应覆盖,以评估从识别到推理的整个谱系。在OCRD-Bench上的大量实验表明,ChemVA实现了92.0%的结构识别准确率。通过弥合视觉和语义瓶颈,我们的框架在9种不同的LLMs上实现了约20个百分点的性能提升,使开放式权重模型能够与专有SOTA系统在复杂的化学推理任务中竞争。

英文摘要

While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.

2605.17198 2026-05-19 q-bio.NC cs.CV 版本更新

MIRAGE: Robust multi-modal architectures translate fMRI-to-image models from vision to mental imagery

MIRAGE:鲁棒的多模态架构将fMRI到图像模型从视觉扩展到心理意象

Reese Kneeland, Cesar Kadir Torrico Villanueva, Jordyn Ojeda, Shuhb Khanna, Jonathan Xu, Paul S. Scotti, Thomas Naselaris

发表机构 * University of Minnesota(明尼苏达大学) Medical AI Research Center (MedARC)(医学人工智能研究中心 (MedARC)) University of Sydney(悉尼大学) Stanford University(斯坦福大学) Alljoined Sophont Princeton Neuroscience Institute(普林斯顿神经科学研究所)

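AI summary: This paper proposes MIRAGE, which trains on multi-modal text and image features to decode mental images from brain activity, and reports SOTA performance on the NSD-Imagery benchmark.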

详情
AI中文摘要

为了在下游应用中有效,训练以从人类脑活动重建已见图像为目标的视觉解码模型必须能够泛化到内部生成的视觉表示,即心理意象。在分析最近发布的NSD-Imagery数据集时,我们发现虽然一些现代视觉解码器在心理意象重建中表现良好,但有些却失败了,且在已见图像重建中的最先进性能并不能保证在心理意象重建中的最先进性能。受这些发现的启发,我们开发了MIRAGE,一种专门设计用于训练视觉数据集并从脑活动交叉解码心理意象的方法。MIRAGE采用线性主干和多模态文本和图像特征作为输入到扩散模型。特征指标和人类评估者证明MIRAGE在NSD-Imagery基准上是心理意象重建的SOTA。通过消融分析,我们发现心理意象重建在解码器使用相对较少维度的图像特征并包含基于文本和高低层次图像特征的指导时表现最佳。我们的工作表明,给定正确的架构,现有大规模数据集使用外部刺激作为训练数据,可以用于解码心理意象,值得对心理意象重建的未来成功和实用性抱有乐观态度。

英文摘要

To be useful for downstream applications, vision decoding models that are trained to reconstruct seen images from human brain activity must be able to generalize to internally generated visual representations, i.e., mental images. In an analysis of the recently released NSD-Imagery dataset, we demonstrated that while some modern vision decoders can perform quite well on mental image reconstruction, some fail, and that state-of-the-art (SOTA) performance on seen image reconstruction is no guarantee of SOTA performance on mental image reconstruction. Motivated by these findings, we developed MIRAGE, a method explicitly designed to train on vision datasets and cross-decode mental images from brain activity. MIRAGE employs a linear backbone and multi-modal text and image features as input to a diffusion model. Feature metrics and human raters establish MIRAGE as SOTA for mental image reconstruction on the NSD-Imagery benchmark. With ablation analysis we show that mental image reconstruction works best when decoders use image features with relatively few dimensions and include guidance from text-based and both high- and low-level image-based features. Our work indicates that--given the right architecture--existing large-scale datasets using external stimuli are viable training data for decoding mental images, and warrant optimism about the future success and utility of mental image reconstruction.

2605.17197 2026-05-19 cs.LG cs.CV 版本更新

OPTNet: Ordering Point Transformer Network for Post-disaster 3D Semantic Segmentation

OPTNet:用于灾后3D语义分割的点排序Transformer网络

Nhut Le, Ehsan Karimi, Maryam Rahnemoonfar

发表机构 * Computer Science and Engineering, Lehigh University, Bethlehem PA 18015, US(计算机科学与工程系,理海大学,伯利恒 PA 18015,美国) Civil and Environmental Engineering, Lehigh University, Bethlehem PA 18015, US(土木与环境工程系,理海大学,伯利恒 PA 18015,美国)

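AI summary: This paper proposes OPTNet, a network for post-disaster semantic segmentation of 3D point clouds whose learnable Point Sorter module dynamically predicts a permutation that improves the locality of the attention mechanism.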

Comments Accepted for International Conference on Pattern Recognition (ICPR) 2026

详情
AI中文摘要

灾后损害评估需要快速且准确地对3D点云进行语义分割,以识别受损的基础设施,如损坏的建筑和道路。早期的点变换(如PTv1、PTv2)依赖于计算成本高的邻居搜索(k-NN)和最远点采样(FPS)。为了提高效率,最近的架构如Point Transformer V3(PTv3)采用了静态序列化方法,如Hilbert曲线或Z-order,来组织无序点以进行基于窗口的注意力。然而,这些固定顺序并不利于捕捉灾难场景的复杂几何结构。在本文中,我们提出了OPTNet(Ordering Point Transformer Network),它引入了一个可学习的点排序模块。OPTNet利用自监督的排序损失动态预测最优排列,以最大化注意力机制的局部性。我们在3DAeroRelief数据集上评估了我们的方法,显著优于最先进的基线。

英文摘要

Post-disaster damage assessment requires rapid and accurate semantic segmentation of 3D point clouds to identify critical infrastructure such as damaged buildings and roads. Early Point Transformers (e.g., PTv1, PTv2) relied on computationally expensive neighbor searching (k-NN) and Farthest Point Sampling (FPS). To improve efficiency, recent architectures like Point Transformer V3 (PTv3) adopted static serialization methods, such as Hilbert curves or Z-order, to organize unstructured points for window-based attention. However, these fixed orderings are not optimal for capturing the complex geometry of disaster scenes. In this paper, we propose OPTNet (Ordering Point Transformer Network), which introduces a learnable Point Sorter module. OPTNet utilizes a self-supervised ordering loss to dynamically predict an optimal permutation that maximizes the locality of the attention mechanism. We evaluate our method on the 3DAeroRelief dataset, significantly outperforming state-of-the-art baselines.
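For context, the static Z-order serialization that the abstract contrasts with OPTNet's learned ordering can be sketched as below. This illustrates only the fixed baseline idea (quantize points to a grid, interleave the coordinate bits, sort by the resulting code); it is not the Point Sorter module, and the function names, `grid_size`, and `bits` parameters are hypothetical choices for the example.

```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three grid indices into one Z-order key."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def z_order_indices(points, grid_size=0.5, bits=10):
    """Return point indices sorted along a Z-order curve over a uniform grid."""
    mins = [min(p[d] for p in points) for d in range(3)]

    def key(i):
        # Quantize each coordinate to a non-negative grid cell index.
        cell = [int((points[i][d] - mins[d]) / grid_size) for d in range(3)]
        return morton_code(cell[0], cell[1], cell[2], bits=bits)

    return sorted(range(len(points)), key=key)
```

Points that are close in the sorted sequence tend to be close in space, which is what makes window-based attention over the serialized points workable. The ordering is fixed by geometry alone; cell indices at or above 2**bits are truncated in this sketch.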

2605.17179 2026-05-19 cs.CV 版本更新

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

iMiGUE-3K:一种基于自监督学习的微手势分析大规模基准

Chengyan Wang, Haoyu Chen, Hui Wei, Yueyi Yang, Yunquan Chen, Guoying Zhao

发表机构 * CMVS, University of Oulu(奥卢大学CMVS实验室) KTH Royal Institute of Technology(皇家理工学院) ELLIS Institute Finland(芬兰ELLIS研究所)

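AI summary: This paper presents the large-scale iMiGUE-3K dataset and the MG-FMs foundation model for micro-gesture-based emotion understanding, using self-supervised learning to improve emotion recognition.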

详情
AI中文摘要

情感理解是情感计算和人工智能中的基本挑战。尽管现有方法主要关注面部表情和语音,但往往忽视了通过身体语言传达的丰富情绪线索。最近,微手势(MGs),即由内心感受驱动的无意识、下意识动作,作为其他线索的替代受到越来越多关注,但目前缺乏支持MG基础模型预训练的大规模数据集。为了推动MG研究,我们提出一个新的微手势情感理解基准,其关键贡献包括新的数据集(iMiGUE-3K)和一系列针对不同任务的基础模型。通过基于模型的众包数据收集策略,我们构建了iMiGUE-3K,这是迄今为止最大的MG数据集。该数据集包含332名职业网球运动员过去七年公开新闻采访的视频录像,总计超过3.4K个长视频片段和3700万帧。数据集包含32种微手势类别,具有丰富的描述性标注,是首个大规模、真实场景的视频数据集,用于基于手势的细粒度情感分析。基于iMiGUE-3K,我们提出MG-FMs,一种用于可迁移手势呈现学习的判别式基础模型。基于该基础模型,我们建立了五个全面的评估任务:微手势识别(无监督、半监督、监督)、微手势检索和微手势情感识别。我们对代表性方法的系统评估表明,基于微手势的分析显著提升了情感理解。我们希望这项工作能为微手势分析提供全面工具,并为未来心理诊断、情感计算和高级人机交互研究奠定坚实基础。

英文摘要

Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, no existing large-scale dataset supports the pre-training of MG foundation models. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, whose key contributions are a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings of public press interviews with 332 distinct professional tennis players over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture presentation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.

2605.17165 2026-05-19 cs.CV cs.LG 版本更新

Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

视频JEPA中的因子化潜在动态:辅助目标的实证研究

Santosh Premi

发表机构 * Adhikari(阿迪卡里)

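AI summary: This empirical study compares auxiliary objective variants for Video-JEPA and finds that many of them improve one downstream capability while degrading another. In the mixed-dataset setting, the factorized FWM-HW-LD objective improves performance on ImageNet-100 and SSv2.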

详情
AI中文摘要

联合嵌入预测架构(JEPA)是自监督视频表示学习的一个有前景的框架,但小型规模的视频JEPA训练中辅助目标的行为尚未得到充分表征。我们报告了在两个预训练阶段(单一数据集(UCF-101)和混合数据集(UCF-101 + Something-Something V2 + ImageNet-100))下对18种辅助目标变体进行的小规模实证研究。我们评估了冻结表示在三个互补基准上的表现:Diving-48(细粒度运动)、SomethingSomething V2(时间推理)和ImageNet-100(外观)。我们的实验表明,许多辅助目标表现出能力取舍:在一种下游能力上的收益往往伴随着另一种能力的退化。我们随后研究了FWM-HW-LD(带有硬区域加权的因子化世界模型与潜在动态),这是一种训练时的目标,将潜在表示分为外观和动态子空间,并对JEPA预测误差和潜在动态误差应用硬区域加权。在我们的混合数据集设置中,FWM-HW-LD相比参考基线在ImageNet-100上提高了+5.92个百分点,在SSv2上提高了+3.21个百分点,同时在Diving-48上保持在0.30个百分点以内。这些结果表明,潜在因子化是研究视频JEPA中辅助目标取舍的有效方向。

英文摘要

Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), Something-Something V2 (SSv2, temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed-dataset setting, FWM-HW-LD improves ImageNet-100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving-48. These results indicate that latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA.

2605.17160 2026-05-19 cs.LG cs.AI cs.CV 版本更新

When Bits Break Recourse: Counterfactual-Faithful Quantization

当比特破坏追索:反事实忠实量化

Chaymae Yahyati, Ismail Lamaakal, Khalid El Makkaoui, Ibrahim Ouahbi

发表机构 * Mohammed First University(穆罕默德第一大学)

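AI summary: This paper studies how quantization affects counterfactual explanations and algorithmic recourse. It defines two metrics, Validity Drop and Counterfactual Recourse Gap, to measure that effect, and proposes Counterfactual-Faithful Quantization, which on several datasets maintains accuracy while improving recourse stability.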

Comments 57 pages, 32 tables, 26 figures

详情
AI中文摘要

量化可以在低比特部署下保持预测准确性,但会无声地破坏算法追索(algorithmic recourse):一个在量化前能够翻转决策的可执行改变,在量化后可能失效,或变得显著更昂贵。我们通过有效性、成本和方向稳定性来形式化量化下的反事实敏感性,并引入两个指标:有效性下降(VD)和反事实追索差距(CRG),以揭示准确率无法反映的追索失败。我们提出反事实忠实量化(CFQ),它训练量化器参数和混合精度比特分配,在全局比特预算下,通过在教师模型的追索点上强制保持目标结果来保留反事实行为。基于间隔(margin)的分析给出了在有界量化扰动下追索可迁移的充分条件。在Adult、German Credit和COMPAS数据集上的实验表明,准确率相当的基线可能显著降低追索稳定性,而CFQ在各比特预算下保持准确率并大幅改善VD和CRG。

英文摘要

Quantization can preserve predictive accuracy under low-bit deployment while silently breaking algorithmic recourse: an actionable change that flips a decision before quantization may fail after quantization, or become substantially more costly. We formalize counterfactual sensitivity under quantization through validity, cost, and direction stability, and introduce two metrics, Validity Drop (VD) and Counterfactual Recourse Gap (CRG), that reveal recourse failures invisible to accuracy. We propose Counterfactual-Faithful Quantization (CFQ), which trains quantizer parameters and mixed-precision bit allocation to preserve counterfactual behavior by enforcing the target outcome at teacher recourse points under a global bit budget. A margin-based analysis gives a sufficient condition for recourse transfer under bounded quantization perturbations. Experiments on Adult, German Credit, and COMPAS show that accuracy-matched baselines can significantly degrade recourse stability, while CFQ maintains accuracy and substantially improves VD and CRG across bit budgets.

2605.17135 2026-05-19 cs.CV 版本更新

Collaborative Learning for Semi-Supervised LiDAR Semantic Segmentation

协同学习用于半监督激光雷达语义分割

Bin Yang, Alexandru Paul Condurache

发表机构 * Bosch Research, Robert Bosch GmbH, Stuttgart, Germany(博世研究院,罗伯特·博世有限公司,斯图加特,德国) Institute for Neuro- and Bioinformatics, University of Lübeck, Lübeck, Germany(神经与生物信息学研究所,吕贝克大学,吕贝克,德国)

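AI summary: This paper proposes CoLLiS, a collaborative learning framework that addresses the confirmation bias caused by relying on a single pseudo-label source in semi-supervised learning. Experiments show particularly strong results in low-label regimes.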

Comments The paper was accepted by ICML2026

详情
AI中文摘要

对大规模激光雷达点云进行注释用于3D语义分割成本高且耗时,这促使了半监督学习(SemiSL)的应用。标准的激光雷达SemiSL方法通常采用两步训练范式,其中伪标签从单一蒸馏源单独生成,无论是相同的还是另一个激光雷达表示。这种监督依赖于唯一的伪标签源,可能加剧确认偏见并在训练过程中传播错误,最终限制性能。为了解决这一挑战,我们引入了CoLLiS,一种新颖的框架,利用协同学习进行激光雷达半监督分割。与之前解耦伪标签生成和训练阶段的范式不同,CoLLiS通过将它们视为同等学生,在单步中协同训练多个表示。每个学生从多个表示中自适应地蒸馏,同时在线监控学生间的差异以解决矛盾的监督并有效缓解确认偏见。在三个数据集上的大量实验表明,CoLLiS在低标注情况下显著优于最先进的激光雷达SemiSL方法。

英文摘要

Annotating large-scale LiDAR point clouds for 3D semantic segmentation is costly and time-consuming, which motivates the use of semi-supervised learning (SemiSL). Standard LiDAR SemiSL methods typically adopt a two-step training paradigm, where pseudo-labels are separately generated from a single distillation source, either from the same or another LiDAR representation. Such supervision relies on a unique source of pseudo-labels, which can reinforce confirmation bias and propagate errors during training, ultimately limiting performance. To address this challenge, we introduce CoLLiS, a novel framework that leverages Collaborative Learning for LiDAR Semi-supervised segmentation. Unlike prior paradigms with decoupled pseudo-labeling and training phases, CoLLiS trains multiple representations collaboratively in a single step by treating them as coequal students. Each student is adaptively distilled from multiple representations, while inter-student disparities are monitored online to resolve contradictory supervision and effectively mitigate confirmation bias. Extensive experiments on three datasets demonstrate that CoLLiS consistently outperforms state-of-the-art LiDAR SemiSL methods, with particularly strong gains in low-label regimes.

2605.17133 2026-05-19 cs.CV cs.AI 版本更新

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

CAM-VFD: 跨注意力多模态视频伪造检测

Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy

发表机构 * Computer Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport(计算机工程系,工程与技术学院,阿拉伯科学、技术与海运交通学院)

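AI summary: In response to rapidly advancing Deepfake and video manipulation tools, this paper proposes CAM-VFD, a framework that detects video forgery by modeling cross-modal contradictions. It performs well on two generative video benchmarks and remains stable under common perturbations.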

详情
AI中文摘要

深度伪造技术和视频编辑工具的快速发展对多媒体取证、司法证据完整性以及信息真实性构成了重大挑战。当前的检测器依赖单一模态信号,将外观、几何和运动独立处理。然而,先进的生成器在保持单模态一致性的同时会产生跨模态矛盾,这些矛盾在取证上具有鉴别性但无法被单一模态检测器发现。本文提出CAM-VFD,即跨注意力多模态视频伪造检测框架,将跨模态矛盾建模为方向性取证信号。该框架采用跨注意力融合机制,其中基于CLIP的外观表示作为查询,与VideoMAE运动特征和MiDaS深度特征进行对比,从而识别视觉、时间及几何证据之间的矛盾。通过跨模态注意力差异分析验证了该设计,观察到真实与伪造分布在统计上可分离(p<0.001,Cohen's d=0.68)。在两个生成视频基准测试中的实验结果表明,CAM-VFD在GenVidBench上达到95.31%的Top-1准确率,在GenVideo上达到93.43%的准确率、90.63%的F1分数和96.56%的AUROC。此外,CAM-VFD在压缩、噪声、模糊和对抗扰动下表现出稳定的性能,表明跨模态推理可能在媒体取证中提高鲁棒性。代码已公开在https://github.com/Hoda-Osama/CAM-VFD/tree/main。

英文摘要

The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions ($p<0.001$, Cohen's $d=0.68$). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at https://github.com/Hoda-Osama/CAM-VFD/tree/main.
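The abstract reports the separation between real and fake attention-discrepancy distributions as Cohen's d = 0.68. As a reminder of what that statistic measures, here is a minimal sketch using the pooled-standard-deviation form. Which variant of the effect size the authors computed is not stated in the abstract, so this formula is an assumption, and the sample values in the usage note are made up for illustration.

```python
import math
import statistics


def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled SD."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd
```

For example, `cohens_d([2, 4, 6], [1, 3, 5])` evaluates to 0.5: the means differ by 1 and the pooled standard deviation is 2. By the usual rule of thumb, a value of 0.68 falls between a medium (0.5) and a large (0.8) effect.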

2605.17125 2026-05-19 cs.CV cs.LG 版本更新

Principal Component Analysis for Lunar Crater Detection

基于主成分分析的月球陨石坑检测

Travis Driver, John A. Christian

发表机构 * School of Aerospace Engineering, Georgia Institute of Technology(航空航天工程学院,佐治亚理工学院)

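AI summary: This paper proposes an automated crater template generation method based on principal component analysis to improve image-based crater identification, and reports better detection and localization than hand-picked templates on simulated lunar imagery.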

详情
AI中文摘要

光学导航是月球轨道器和着陆器任务中的关键组成部分。基于图像的陨石坑识别由于月球表面陨石坑丰富以及现有大量陨石坑目录的可用性,已成为光学导航的有前景技术。此外,由于月球陨石坑在形态上相对同质,模板匹配已被确定为识别的有前景方法。在本文中,我们提出EigenCrater,一种基于陨石坑数字高程图(DEM)的主成分分析的自动陨石坑模板生成方法。我们证明了在模拟月球图像上,该方法在检测和位置估计性能方面优于手工挑选的模板。

英文摘要

Optical navigation is a critical component for lunar orbiter and lander missions. Image-based crater identification has emerged as a promising technology for optical navigation due to the abundance of craters on the lunar surface and the availability of extensive crater catalogs. Moreover, due to the relative morphological homogeneity among lunar craters, template matching has been identified as a promising approach for identification. In this paper, we propose EigenCrater, an automated crater template generation method based on principal component analysis of crater digital elevation maps (DEMs). We demonstrate superior detection and position estimation performance relative to hand-picked templates on simulated lunar imagery.

2605.17120 2026-05-19 cs.CV 版本更新

Markerless Motion Capture for Biomechanical Whole-Body Kinematic Estimation in Infants

无标记运动捕捉用于婴儿生物力学全身运动学估计

Divya Joshi, J. D. Peiffer, Colleen Peyton, R. James Cotton

发表机构 * Center for Bionic Medicine, Shirley Ryan AbilityLab(仿生医学中心,Shirley Ryan AbilityLab) Dept. of Physical Therapy and Human Movement Science, Northwestern University(物理治疗与人类运动科学系,西北大学) Dept. of Biomedical Engineering, Northwestern University(生物医学工程系,西北大学) Dept. of Pediatrics, Northwestern University(儿科系,西北大学)

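AI summary: This study evaluates three state-of-the-art pose estimation frameworks for reconstructing infant kinematics, showing both the promise and the limitations of markerless motion capture for infant biomechanical analysis.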

Comments Accepted to EMBC 2026

详情
AI中文摘要

早期识别婴儿运动障碍依赖于专家对自发运动的视觉评估,这推动了自动化、客观方法的发展。本文系统评估了三种最先进的姿态估计框架(MeTRAbs-ACAE、SAM 3D Body和Sapiens)在8名婴儿13次录制的100个视频上的性能。通过重投影误差、几何一致性以及Procrustes对齐的3D位置误差量化关键点检测精度,并展示了将逆向运动学框架拟合到婴儿数据的可行性证明。虽然Sapiens在重投影误差和几何一致性方面表现最佳(分别为22.8像素和0.82),但SAM 3D Body提供了最全面的3D信息用于运动学重建,其Procrustes对齐的位置误差为19至28毫米。通过案例比较示例,证明了基于SAM 3D Body估计的生物力学模型能够区分与运动发育相关的婴儿典型运动模式,如临床专家所识别的。这些发现突显了3D姿态估计在婴儿生物力学中的潜力和当前限制,并为可扩展的视频基早期运动发育评估奠定了初步基础。

英文摘要

Early identification of motor impairment in infancy relies on expert visual assessment of spontaneous movement, motivating the development of automated, objective alternatives. One promising approach is using computer vision, which benefits from high quality pose estimation from video. In this study, we systematically evaluated three state-of-the-art pose estimation frameworks (MeTRAbs-ACAE, SAM 3D Body, and Sapiens) on 100 videos over 13 sessions of 8 infants recorded with a multi-view markerless motion capture system. We quantified keypoint detection accuracy using reprojection error, geometric consistency, and Procrustes-aligned 3D position error, and demonstrated proof-of-concept for fitting an inverse kinematic framework to infant data. While Sapiens achieved the lowest reprojection error and highest geometric consistency of the methods evaluated (22.8 pixels and 0.82, respectively), SAM 3D Body provided the most comprehensive 3D information for kinematic reconstruction with Procrustes-aligned position errors of 19 to 28 mm. We demonstrate in a case comparison example that biomechanical models fit to SAM 3D estimates distinguish representative movement patterns in infants related to motor development, as identified by a clinical expert. Together, these findings highlight both the promise and current limitations of 3D pose estimation for infant biomechanics and establish preliminary groundwork for scalable, video-based assessment of early motor development.

2605.17102 2026-05-19 cs.GR cs.CV 版本更新

VoxScene: Anchor-Conditioned Voxel Diffusion for Indoor Scene Arrangement

VoxScene: 基于锚点的体素扩散用于室内场景布置

Haotian Mao, Yuhan Huang, Jiatao Lin, Yang Zhao, Hui Wang, Yiheng Zhang, Yuwang Wang, Chenliang Zhou, Yan Zhang, Fangcheng Zhong, Xubo Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学) University of Cambridge(剑桥大学) Peking University(北京大学)

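AI summary: This paper proposes VoxScene, an anchor-conditioned voxel diffusion framework for 3D scene synthesis. By adopting an explicit, object-centric voxel representation, it addresses the physical collisions and structural entanglement that existing methods produce in dense environments, generates high-fidelity voxel grids, and improves the physical plausibility and shape diversity of scene arrangements.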

详情
AI中文摘要

我们提出了VoxScene,一种新颖的基于锚点的体素扩散框架,专门用于3D场景合成。当前数据驱动的布局生成技术通常依赖于边界代理或隐式表示,这忽略了体素结构。这种几何盲性不可避免地导致严重的物理碰撞和结构纠缠,特别是在密集环境中。为克服这些限制,我们转向显式的、以物体为中心的体素表示。我们的流程依次合成离散的体素占用,条件于先前的锚点和局部上下文。通过利用离散体素的互斥性质,我们的方法消除了空间歧义,即使在高度复杂的环境中也能保证无碰撞的布置。此外,生成的高保真体素网格作为判别性的几何查询,用于后续资产检索。广泛的实验表明,我们的方法具有普遍性,实现了最先进的物理合理性,并在形状多样性方面超越了现有布局规划器。

英文摘要

We present VoxScene, a novel anchor-conditioned voxel diffusion framework tailored for 3D scene synthesis. Current data-driven layout generation techniques typically rely on bounding proxies or implicit representations, which overlook volumetric structures. This geometric blindness inevitably leads to severe physical collisions and structural entanglement, particularly in densely populated environments. To overcome these limitations, we shift the paradigm to an explicit, object-centric voxel representation. Our pipeline sequentially synthesizes discrete volumetric occupancies conditioned on prior anchors and local context. By exploiting the mutually exclusive nature of discrete voxels, our approach eliminates spatial ambiguities and guarantees collision-free arrangements, even in highly complex environments. Furthermore, the synthesized high-fidelity voxel grids serve as discriminative geometric queries for downstream asset retrieval. Extensive experiments demonstrate the universality of our method, achieving state-of-the-art physical plausibility and unlocking shape diversity compared to existing layout planners.

2605.17095 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

警察执法视频中的视觉时间线:用于训练和分析的开放BWC操作上下文和活动编目

Angela Srbinovska, Christopher Homan, Adrian Martin, Ernest Fokoué

发表机构 * Rochester Institute of Technology(罗切斯特理工大学) Rochester Police Department(罗切斯特警察局) Office of Business Intelligence(业务智能办公室) School of Mathematics and Statistics(数学与统计学学院)

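AI summary: This paper presents a method that converts body-worn camera video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled under a privacy-conscious protocol, to make incident review and training workflows more efficient.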

Comments 13 pages, 10 figures, 9 tables

详情
AI中文摘要

执法机构正在积累大量体感摄像头(BWC)视频。然而,这些视频仍然在操作上是模糊的。也就是说,分析人员和培训人员仍然需要花费大量时间观看完整视频以确定关键事件的开始点,并识别活动转向更剧烈的物理活动的点。我们提出了一种方法,将BWC视频处理为时间对齐的固定长度10秒窗口序列,通过隐私保护协议进行处理和标记。每个窗口被标记为两个维度的信息:(i)窗口的操作上下文和(ii)窗口内的运动强度水平,对于因黑暗、模糊或遮挡导致证据不足的窗口,使用低证据标签。我们训练模型根据这两个轴分类窗口,使用从每个窗口中采样的帧,通过CLIP模型编码并汇总成窗口级别的表示。我们提取每个窗口的密集光流统计信息以捕捉运动强度。在测试窗口中,最佳上下文模型达到78.75%的准确率,最佳准确率活动模型达到88.33%。我们还包含了完整性审计,以展示结果以及视觉时间线表示如何支持更快的事件审查,并使警官培训流程更加实用。

英文摘要

Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage. However, this footage remains operationally opaque: analysts and trainers still have to invest considerable time watching full-length videos to pinpoint the start of key encounters and identify the points where activity shifts to something more physically intense. We present an approach that converts BWC video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled using a privacy-conscious protocol. Each window is labeled along two dimensions: (i) the operational context of the window and (ii) the level of motion intensity within the window, with low-evidence labels for windows where darkness, blur, or occlusion leaves insufficient evidence. We train models to classify windows along these two axes using frames sampled from each window, encoded with a CLIP model, and aggregated into a window-level representation. We also extract dense optical flow statistics for each window to capture motion intensity. On test windows, the best context model achieves 78.75% accuracy, and the best-accuracy activity model achieves 88.33%. We also include integrity audits of the results and show how the visual timeline representations support faster incident review and make the officer training workflow more practical.
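The windowing step described in the abstract can be sketched as follows: cut a video timeline into fixed-length 10-second windows and pick a few frame timestamps inside each one. Only the 10-second window length comes from the abstract. The handling of a trailing partial window, the number of frames per window, and the function names are assumptions made for illustration.

```python
def fixed_windows(duration_s, window_s=10.0, keep_partial=True):
    """Return (start, end) pairs that tile [0, duration_s] in window_s steps."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        is_full = end - start >= window_s - 1e-9
        # A trailing window shorter than window_s is kept only on request.
        if is_full or keep_partial:
            windows.append((start, end))
        start += window_s
    return windows


def sample_frame_times(window, n_frames=4):
    """Timestamps at the centers of n_frames equal sub-intervals of a window."""
    start, end = window
    step = (end - start) / n_frames
    return [start + step * (k + 0.5) for k in range(n_frames)]
```

For a 25-second clip, `fixed_windows(25.0)` yields three windows, the last one covering only 20 to 25 seconds. Each window's sampled frames would then be encoded and pooled into a window-level feature before the two labels are predicted.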

2605.17093 2026-05-19 cs.CV cs.CL 版本更新

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

HEED:基于密度加权残差对齐的混合视觉-语言模型蒸馏

Yihao Liang, Niraj K. Jha

发表机构 * Princeton University(普林斯顿大学)

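AI summary: This paper proposes HEED, which improves hybrid vision-language model distillation through density-weighted residual alignment and raises performance on OCR and document tasks. The gains hold across different teacher models and hybrid architectures, with no added inference-time cost.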

详情
AI中文摘要

将视觉-语言模型蒸馏为更快的混合架构(如3:1的Mamba-2/注意力混合)已成为提高推理效率的标准做法。聚合基准表明这种做法可行,但掩盖了选择性失败。当我们将Qwen3-VL-8B-Instruct蒸馏为3:1的Mamba-2/注意力混合模型时,学生模型在MMStar、MMBench和MMMU-Pro等视觉推理基准上与教师模型的差距保持在2分以内,但在光学字符识别和文档任务上下降13分。学生模型仍能理解场景,但丢失了回答问题所需的细粒度文本。我们将大部分失败定位到一类特定的位置。在高分辨率图像中,大多数图像块是天空、墙壁或平滑纹理,只有一小部分承载文本、边缘、物体边界或其他局部细节。在令牌级诊断中,密度最高的前10%图像块的残差漂移是密度最低的后10%图像块的3.6倍,其教师遮蔽答案贡献是后者的3.5倍。均匀加权将许多损失项分配给低信息量的背景图像块,而稀疏的、承载答案的图像块没有得到特殊保护。所需的干预极小:我们用密度加权残差对齐替代均匀残差对齐,并以图像块自不相似性作为无需训练的位置重要性代理。我们称之为HEED。与常规端到端蒸馏相比,HEED在OCRBench v2上提升8.7分,在10个基准的平均分上提升5.13分。该增益在不同教师模型和混合架构上均成立。经过标准后训练,学生模型在10个基准的平均分上达到教师级性能,吞吐量为4.12倍,在128k上下文下节省68%内存,且不增加参数和推理时间成本。

英文摘要

Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6$\times$ larger residual drift than the bottom 10% lowest-density patches and 3.5$\times$ larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain holds across different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with 4.12$\times$ throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.
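The idea of replacing uniform residual alignment with a density-weighted version can be illustrated with a toy sketch. The abstract names patch self-dissimilarity as the proxy but does not define it, so the definition below (one minus the mean cosine similarity of a patch to all other patches), the normalization of the weights, the squared-error alignment term, and all function names are assumptions, not the HEED implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def self_dissimilarity(patches):
    """Assumed proxy: 1 - mean cosine similarity to every other patch (needs >= 2 patches)."""
    scores = []
    for i, p in enumerate(patches):
        sims = [cosine(p, q) for j, q in enumerate(patches) if j != i]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


def density_weights(patches, eps=1e-8):
    """Normalize the dissimilarity scores into per-patch weights that sum to 1."""
    scores = self_dissimilarity(patches)
    total = sum(scores) + eps * len(scores)
    return [(s + eps) / total for s in scores]


def weighted_alignment_loss(student, teacher, weights):
    """Weighted sum of squared student-teacher residual differences per patch."""
    return sum(
        w * sum((a - b) ** 2 for a, b in zip(s, t))
        for w, s, t in zip(weights, student, teacher)
    )
```

In a toy input with three identical "background" patches and one distinct patch, the distinct patch receives about half of the total weight instead of a quarter, which is the kind of reallocation toward sparse, detail-bearing positions that the abstract describes.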

2605.17087 2026-05-19 cs.CV 版本更新

The Learnability Gap in Medical Latent Diffusion

医学潜在扩散中的可学习差距

Mischa Dombrowski, Felix Nützel, Bernhard Kainz

发表机构 * Department of Computing, Imperial College London(伦敦帝国理工学院计算机系)

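AI summary: This paper studies the learnability gap of latent diffusion models for medical images in the context of class imbalance: pretrained autoencoders encode discriminative features faithfully, yet the structure of their latent representations makes them hard for classifiers to learn from. The noise-conditioned latent classifiers and image-space distillation developed here improve efficiency and serve as diagnostic tools for latent space quality.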

详情
AI中文摘要

生成数据增强使用潜在扩散模型是解决医学影像类别不平衡问题的有前景策略,但当前方法侧重于感知保真度和领域特定自动编码器微调,而忽视了更根本的瓶颈。我们识别并正式化了可学习差距:大规模预训练自动编码器能够忠实编码医学分类的判别特征,如重建空间中的近无损性能所示,但其潜在表示以难以被分类器学习的方式结构化。在五个自动编码器家族和四个覆盖胸片、皮肤镜、计算机断层扫描和超声的医学基准上,我们证明这种差距无论架构、初始化策略或超参数调整如何,都持续存在,且医学领域微调无法关闭它。为了探测并部分缩小这一差距,我们开发了具有FiLM层的噪声条件潜在分类器和图像空间蒸馏,这些方法在效率和内存方面比图像空间模型分别提高了64倍和120倍,同时作为潜在空间质量的诊断工具。我们的分析提供了一个新的框架来评估自动编码器的潜在空间,并识别其结构而非保真度或领域特定性是关闭真实和合成医学训练数据性能差距的主要障碍。

英文摘要

Generative data augmentation with latent diffusion models is a promising strategy for addressing class imbalance in medical imaging, yet current approaches focus on perceptual fidelity and domain-specific autoencoder fine-tuning while neglecting a more fundamental bottleneck. We identify and formalize the learnability gap: large-scale pretrained autoencoders faithfully encode discriminative features for medical classification, as evidenced by near-lossless performance in reconstruction space, yet their latent representations are structured in ways that are difficult for classifiers to learn from. Across five autoencoder families and four medical benchmarks spanning chest radiography, dermatoscopy, computed tomography, and echocardiography, we show that this gap persists regardless of architecture, initialization strategy, or hyperparameter tuning, and that medical-domain fine-tuning of the autoencoder does not close it. To probe and partially narrow the gap, we develop noise-conditioned latent classifiers with FiLM layers and image-space distillation that offer 64x throughput and 120x memory gains over image-space models while serving as diagnostic tools for latent space quality. Our analysis provides a new framework for evaluating autoencoder latent spaces and identifies their structure, rather than their fidelity or domain specificity, as the primary obstacle to closing the performance gap between real and synthetic medical training data.

2605.17070 2026-05-19 cs.CV 版本更新

EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models

EPIC-Bench: 一种以感知为中心的细粒度具身视觉 grounding 的基准

Haozhe Shan, Xiancong Ren, Han Dong, Haoyuan Shi, Yingji Zhang, Jiayu Hu, Yi Zhang, Yong Dai, Bin Shen, Lizhen Qu, Zenglin Xu, Xiaozhu Ju

发表机构 * X-Humanoid Fudan University(复旦大学) University of Science and Technology of China(中国科学技术大学) University of Manchester(曼彻斯特大学) Monash University(莫纳什大学) Celonis AI University of New South Wales(新南威尔士大学)

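AI summary This paper presents EPIC-Bench, a perception-centric benchmark for fine-grained embodied visual grounding that systematically evaluates the visual perception of VLMs in real-world embodied environments. It contains 6.6k carefully annotated (image, text, mask) tuples covering 23 fine-grained tasks across three core stages of the embodied interaction pipeline: target localization, navigation, and manipulation. The evaluation shows that, although advanced reasoning models show promise, current VLMs generally struggle with complex visual-text alignment, with bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection.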

详情
AI中文摘要

尽管大型视觉-语言模型(VLMs)越来越多地被用作具身代理的感知骨干,但现有基准往往依赖于问答或多选格式。这些协议允许模型利用语言先验,而不是展示真正的视觉 grounding。为了解决这个问题,我们提出了 EPIC-Bench,即具身感知基准,这是一种细粒度 grounding 基准,旨在系统地评估 VLMs 在现实世界具身环境中的视觉感知能力。EPIC-Bench 包含 6.6k 个精心标注的元组(图像,文本,掩码),涵盖 23 个细粒度任务,横跨具身交互管道的三个核心阶段:目标定位、导航和操作。对超过 89 个领先 VLMs 的广泛评估显示,尽管先进推理模型显示出潜力,但当前 VLMs 在复杂视觉-文本对齐方面普遍存在困难。具体而言,模型在多目标计数、部分-整体关系理解以及 affordance 区域检测方面存在关键瓶颈。EPIC-Bench 为推进下一代视觉驱动的具身模型提供了稳健的基础和可操作的见解。

英文摘要

While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit linguistic priors rather than demonstrating genuine visual grounding. To address this, we present EPIC-Bench, Embodied PerceptIon BenChmark, a fine-grained grounding benchmark designed to systematically evaluate the visual perceptual capabilities of VLMs in real-world embodied environments. Comprising 6.6k meticulously annotated tuples (Image, Text, Mask), EPIC-Bench spans 23 fine-grained tasks across three core stages of the embodied interaction pipeline: Target Localization, Navigation, and Manipulation. Extensive evaluations of over 89 leading VLMs reveal that while advanced reasoning models show promise, current VLMs universally struggle with complex visual-text alignment for physical interactions. Specifically, models exhibit critical bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection. EPIC-Bench provides a robust foundation and actionable insights for advancing the next generation of vision-driven embodied models.

2605.17042 2026-05-19 cs.CV 版本更新

Thermal-Only Crowd Counting with Deployment-Time Privacy Protection

仅热成像的人群计数与部署时隐私保护

Yifei Qian, Zhongliang Guo, Chun Tong Lei, Bowen Deng, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound

Affiliations * School of Computer Science, University of Nottingham; School of Computer Science, University of St Andrews; Department of Data Science, City University of Hong Kong; Harbin Institute of Technology

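AI summary This paper proposes a crowd counting framework that uses only thermal data. Removing the dependence on RGB data reduces privacy exposure in public surveillance, and a depth-to-RGB diffusion model is used to mitigate thermal ambiguity and improve counting accuracy.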

详情
AI中文摘要

尽管RGB-热人群计数已显示出潜力,但该范式面临关键限制:RGB数据在公共监控中引发隐私问题,而多模态对齐问题会降低融合性能。我们提出首个专门设计用于隐私意识人群计数的热成像-only框架,在推理时消除RGB依赖,并显著减少公共监控部署中连续RGB捕获带来的隐私暴露。为缓解热成像模糊性,我们利用深度到RGB扩散模型作为跨模态桥梁,提取具有辨别力的特征以增强热表示。关键地,我们证明单步LCM去噪产生最忠实于深度条件信号结构内容的特征,而多步方法则逐步将特征与条件输入解耦并累积误差,从而降低计数准确性。在RGBT-CC和DroneRGBT数据集上的实验表明,我们的方法在性能上与最先进的RGB-热融合方法具有竞争力,且仅需在推理时使用热输入,消除了连续RGB捕获的需求,这在现实世界监控部署中是主要的隐私问题。代码将公开提供。

英文摘要

While RGB-Thermal crowd counting has shown promise, the paradigm faces critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework specifically designed for privacy-conscious crowd counting, eliminating RGB dependency at inference time and substantially reducing the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we leverage depth-to-RGB diffusion models as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we demonstrate that single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal, while multi-step approaches progressively decouple features from the conditioning input and accumulate errors that degrade counting accuracy. Experiments on RGBT-CC and DroneRGBT datasets show our method achieves competitive performance against state-of-the-art RGB-T fusion methods, while requiring only thermal input during inference, eliminating the need for continuous RGB capture that constitutes the primary privacy concern in real-world surveillance deployment. The code will be made publicly available.

2605.17019 2026-05-19 cs.CV 版本更新

StreamingEffect: Real-Time Human-Centric Video Effect Generation

StreamingEffect: 实时以人为中心的视频效果生成

Yiren Song, Cheng Liu, Yuxin Jiang, Mike Zheng Shou

Affiliations * Show Lab, National University of Singapore

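AI summary This paper proposes StreamingEffect, a framework for real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. It also builds VideoEffect-130K, to the authors' knowledge the largest human-centric video effect dataset, and achieves real-time, high-quality 720p video editing on a single H200 GPU.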

详情
AI中文摘要

实时以人为中心的视频效果生成对于直播人为主的应用如电子商务直播、娱乐和vlogging具有高度需求,但仍然困难,由于缺乏合适的数据和可部署的编辑模型。与通用视频生成不同,此任务需要实时视频到视频编辑,添加表现性效果的同时保持人类身份、背景内容和时间一致性。现有加速努力主要集中在文本到视频生成,而高效的视频编辑蒸馏仍 largely underexplored。在本文中,我们提出StreamingEffect,一个实时以人为中心的流视频效果框架。我们采用上下文视频编辑架构并训练高质量的双向教师,然后将其蒸馏为因果自回归学生,并进一步将采样步骤从50步减少到4步。我们还引入关键帧控制,允许参考效果帧在线注入并通过流进行传播以实现交互式编辑。为了解决数据瓶颈问题,我们构建了VideoEffect-130K,据我们所知,这是最大的以人为中心的视频效果数据集,包含70000个效果视频和60000个编辑视频,涵盖600个效果类别,这些类别是从短视频和编辑平台中挑选的。实验表明,我们的方法能够在单块H200 GPU上实现实时、高质量的720p视频编辑。

英文摘要

Streaming video effect generation is highly desirable for live human-centric applications such as e-commerce streaming, entertainment, and vlogging, yet remains difficult due to the lack of suitable data and deployable editing models. Unlike generic video generation, this task requires real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. Existing acceleration efforts mainly focus on text-to-video generation, while efficient distillation for video editing remains largely underexplored. In this paper, we present StreamingEffect, a real-time human-centric streaming video effect framework. We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps. We also introduce keyframe control, allowing reference effect frames to be injected online and propagated through the stream for interactive editing. To address the data bottleneck, we construct VideoEffect-130K, to our knowledge the largest human-centric video effect dataset, containing 70K effect videos and 60K editing videos across 600 effect categories curated from short-video and editing platforms. Experiments show that our method enables real-time, high-quality 720p video editing on a single H200 GPU.

2605.17014 2026-05-19 cs.CV 版本更新

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

RHINO:从单目视频中重建人类与新物体的交互

Lixin Xue, Chengwei Zheng, Georgios Paschalidis, Chen Guo, Manuel Kaufmann, Juan Zarate, Dimitrios Tzionas

Affiliations * ETH Zürich; The University of Tokyo; University of Amsterdam; Aristotle University of Thessaloniki

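AI summary This paper proposes RHINO, a three-step framework that reconstructs a 3D human, a novel object, and the static scene from monocular video, addressing the reconstruction of human-object interactions and improving 4D reconstruction and novel-view synthesis.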

Comments CVPR 2026. Project page: https://lxxue.github.io/RHINO

详情
AI中文摘要

从智能系统中重建人类、物体及其交互的3D结构是一个长期目标。通常输入是移动相机的RGB视频,使任务变得不明确;深度具有歧义,人类和物体相互遮挡,相机和物体运动交织,产生似动现象。大多数先前工作只针对人类或物体单独处理,忽略它们的相互作用,或假设已知的3D形状或相机,这在实际应用中不切实际。我们开发了RHINO(Reconstructing Human Interactions with Novel Objects),一个三步框架,从单目RGB视频中恢复3D的人、新(未见过的)操控物体和静态场景,以共同的世界框架。首先,我们利用3D感知的基础模型,获取稳定结构从运动(SfM)的提示,即使在低纹理区域也能稳定;这将导致从前景像素获得操控物体的粗略形状和似动现象,以及从背景像素获得粗略场景形状和相机运动。第二,我们通过现成的方法估计摄像机框架中的人类,并从似动现象中减去相机运动以提取物体运动;这将人类、物体和粗略场景形状注册到共同的世界框架中。第三,我们使用具有组件的神经场和每个组件的有符号距离场来细化形状。后者进一步使不同可微的接触先验,吸引表面同时惩罚相互穿透,提高最终重建的物理合理性。对于评估,我们捕捉了一个新的手持单目视频数据集,与体积4D捕捉阶段同步,提供地面真实形状和相机运动。RHINO在新视角合成和4D重建上优于最先进的基线。消融实验表明,每个阶段都做出了显著贡献。代码和数据可在https://lxxue.github.io/RHINO上获得。

英文摘要

Reconstructing people, objects, and their interactions in 3D is a long-standing goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. Code and data are available at https://lxxue.github.io/RHINO.
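
The second RHINO step, removing camera motion from apparent motion, can be illustrated with rigid transforms. The sketch below is not the authors' code: it assumes 2D rigid poses stored as 3x3 homogeneous matrices and the convention that a pose named `world_from_cam` maps camera coordinates to world coordinates. All function names are hypothetical.

```python
import math

def rigid(theta, tx, ty):
    """3x3 homogeneous matrix: rotate by theta, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse(t):
    """Inverse of a rigid transform [R | p]: [R^T | -R^T p]."""
    r00, r01, tx = t[0]
    r10, r11, ty = t[1]
    return [[r00, r10, -(r00 * tx + r10 * ty)],
            [r01, r11, -(r01 * tx + r11 * ty)],
            [0.0, 0.0, 1.0]]

def object_in_world(world_from_cam, cam_from_obj):
    """Compose camera pose with the apparent (camera-frame) object pose."""
    return matmul(world_from_cam, cam_from_obj)
```

For an object that is static in the world, the apparent pose `cam_from_obj` changes whenever the camera moves, but `object_in_world` returns the same pose for every frame. Any remaining change after this composition is attributable to the object itself.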

2605.17011 2026-05-19 cs.GR cs.CV cs.LG 版本更新

Topo-GS: Continuous Volumetric Embedding of High-Dimensional Data via Topological Gaussian Splatting

Topo-GS: 通过拓扑高斯散射实现高维数据的连续体积分嵌入

João Paulo Gois, Luis Gustavo Nonato

Affiliations * Universidade Federal do ABC (UFABC)

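AI summary This paper proposes Topo-GS, which uses a topology-aware strategy to turn high-dimensional data into a continuous volumetric representation. Optimization under local geometric constraints preserves local topological fidelity while making projection distortions explicit.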

Comments 7 pages, 2 figures

详情
AI中文摘要

降维算法将高维数据映射到可可视化的2D或3D空间,但传统上依赖于离散点云范式。这种离散抽象容易受到视觉遮挡和人工不连续性的影响,往往无法表示底层流形的连续密度。为了解决这些限制,我们引入Topo-GS,一个框架,重新利用3D高斯散射(3DGS)将多维投影作为无网格体积分重建过程。与标准光度损失不同,Topo-GS由局部几何约束驱动。通过解决正交Procrustes目标,优化强制了As-Rigid-As-Possible先验,同时显式对齐每个高斯的空间协方差到局部切空间。认识到解卷不同内在维数的数据需要不同的空间处理,我们利用拓扑感知策略,将损失公式定制以保持连续1D轨迹或连贯2D表面。定量和视觉评估表明,Topo-GS成功地将离散散点图转换为连续体积分表示,其中固有的投影扭曲显式表现为可观察的几何变化,同时保持与离散基线相当的局部拓扑保真度。

英文摘要

Dimensionality reduction algorithms map high-dimensional data into visualizable 2D or 3D spaces, but traditionally rely on a discrete point-cloud paradigm. This discrete abstraction is susceptible to visual occlusion and artificial discontinuities, often failing to represent the continuous density of the underlying manifold. To address these limitations, we introduce Topo-GS, a framework that repurposes 3D Gaussian Splatting (3DGS) to cast multidimensional projection as a meshless volumetric reconstruction process. Instead of standard photometric losses, Topo-GS is driven by local geometric constraints. By solving orthogonal Procrustes targets, the optimization enforces an As-Rigid-As-Possible prior while explicitly aligning the spatial covariance of each Gaussian to the local tangent space. Recognizing that unrolling data of varying intrinsic dimensionalities requires distinct spatial treatments, we utilize a topology-aware strategy that tailors the loss formulation to preserve either continuous 1D trajectories or cohesive 2D surfaces. Quantitative and visual evaluations demonstrate that Topo-GS successfully transforms discrete scatter plots into continuous volumetric representations, where inherent projection distortions explicitly manifest as observable geometric variations, while preserving local topological fidelity comparable to discrete baselines.
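
The abstract mentions orthogonal Procrustes targets and an As-Rigid-As-Possible prior without giving formulas. As a rough illustration only, the sketch below solves the rotation-only 2D special case in closed form: it finds the angle that best aligns a set of source vectors with target vectors in the least-squares sense. The general problem is usually solved with an SVD; the 2D restriction and the function names are assumptions made to keep the example dependency-free.

```python
import math

def best_rotation_angle(source, target):
    """Angle theta minimizing sum ||R(theta) s_i - t_i||^2 over 2D rotations."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)

def rotate(points, theta):
    """Rotate 2D points about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]
```

In an ARAP-style energy, a fit like this would typically be computed per local neighborhood on centered edge vectors, and the residual after applying the best rotation would measure how non-rigidly the neighborhood was deformed.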

2605.16990 2026-05-19 cs.CV 版本更新

DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing

DreamEdit3D: 多视角扩散模型的3D编辑个性化

Jinxin Ai, Matthias Nießner, Ziya Erkoç

Affiliations * Technical University of Munich

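AI summary This paper proposes DreamEdit3D, which personalizes multi-view diffusion models for 3D editing driven by natural language. It extracts semantic components and learns token embeddings for each one, yielding multi-view-consistent, high-quality 3D edits.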

Comments 24 pages, 5 figures

详情
AI中文摘要

尽管2D扩散模型在保持身份的个性化方面取得了显著成功,但将其能力扩展到3D资产仍是一个重大挑战,因为多视角一致性和空间控制的复杂性。受这些2D进展的启发,我们提出了一种新的个性化方法,用于文本引导的3D编辑,通过自然语言实现组合性和对象级控制。给定一个3D输入,我们渲染正交视图并提取对象级分割掩码以隔离语义组件。然后通过定制的两阶段优化策略学习每个组件的distinct token embeddings:多视角文本倒置与注意力对齐,随后对多视角扩散模型进行完整微调。在推理过程中,这些解耦的tokens与编辑提示无缝组合,生成多视角一致的图像,随后提升为高保真纹理3D网格。在多样化的编辑场景中的广泛评估表明,我们的方法成功地将2D个性化的优势转移到3D中,相比现有基线,在编辑忠实度和身份保持方面取得了最先进的成果。

英文摘要

While 2D diffusion models have achieved remarkable success in identity-preserving personalization, extending this capability to 3D assets remains a significant challenge due to the complexities of multi-view consistency and spatial control. Inspired by these 2D advancements, we present a novel personalization method for text-guided 3D editing that enables compositional, object-level control through natural language. Given a 3D input, we render orthogonal views and extract object-level segmentation masks to isolate semantic components. We then learn distinct token embeddings for each component through a tailored two-phase optimization strategy: multi-view textual inversion with attention alignment, followed by full fine-tuning of the multi-view diffusion model. During inference, these disentangled tokens seamlessly compose with editing prompts to generate multi-view consistent images, which are subsequently lifted into high-fidelity textured 3D meshes. Extensive evaluations across diverse editing scenarios demonstrate that our method successfully transfers the flexibility of 2D personalization to 3D, achieving state-of-the-art edit faithfulness and identity preservation compared to existing baselines.

2605.16981 2026-05-19 cs.CV 版本更新

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

重新思考长序列递归3D重建中的状态更新门

Kejun Ren, Lei Jin, Tianxin Huang, Lianming Xu, Li Wang

Affiliations * Beijing University of Posts and Telecommunications; School of Computing and Data Science, The University of Hong Kong

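AI summary This paper addresses a structural bottleneck in the state update gate for long-sequence recurrent 3D reconstruction. It proposes a frame-level gating mechanism: a scalar frame-level gate derived in closed form that needs no parameters, no training, and no extra forward pass, which markedly improves accuracy on multiple benchmarks while keeping memory use constant.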

Comments 17 pages, 7 figures

详情
AI中文摘要

在严格常量内存预算下进行流式3D重建的关键在于递归状态如何随着流的演进进行更新。我们对TTT3R风格的每token门在五个基准测试中进行了分析,发现了一个结构性瓶颈:门本质上在幅度上受到限制(中位数为0.31;从未超过0.6),并且几乎帧不变,导致每个状态token的有效内存范围仅为约3帧,这成为长序列漂移的结构性根源。我们追溯到一个缺失的轴:现有推理时间方法仅在每token、帧内级别上调节更新,而正交的帧级别问题——'每个帧应如何强地贡献于状态'——被视为内容无关。我们通过闭式解从帧间内部特征的变化推导出一个标量帧级门α_t ∈ (0, 1],这是一种连续放松的经典同时定位与建图(SLAM)关键帧选择方法,无需参数、训练或额外前向传递。在六个涵盖相机姿态、视频深度和3D重建的基准测试中,我们的门在长TUM-RGBD姿态序列中将ATE减少51%,在Bonn视频深度中将AbsRel减少12.8%,在KITTI长序列姿态估计中超越了LongStream和Keyframe-VO,同时保持严格常量内存使用,且无训练成本。

英文摘要

Streaming 3D reconstruction under a strict constant-memory budget hinges on how the recurrent state is updated as the stream evolves. We profile TTT3R-style per-token gates across five benchmarks and discover a structural bottleneck: the gate is intrinsically bounded in magnitude (median $0.31$; never exceeding $0.6$) and nearly frame-invariant, yielding an effective memory horizon of only $\sim$3 frames per state token, which serves as the structural origin of long-sequence drift. We trace this to a missing axis: existing inference-time methods modulate updates only at the per-token, intra-frame level, while the orthogonal frame-level question of \emph{how strongly each frame should contribute to the state} has been treated as content-independent. We close this gap with a scalar frame-level gate $α_t \in (0, 1]$ derived in closed form from frame-to-frame changes of internal features -- a continuous relaxation of classical Simultaneous Localization and Mapping (SLAM) keyframe selection that requires no parameters, no training, and no extra forward pass. Across six benchmarks spanning camera pose, video depth, and 3D reconstruction at sequence lengths up to $4,541$ frames, our gate cuts ATE by $51\%$ on long TUM-RGBD pose sequences, reduces AbsRel by $12.8\%$ on Bonn video depth, and on KITTI long-sequence pose estimation surpasses both LongStream and Keyframe-VO, while retaining strictly constant memory at zero training cost.
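
The link between a small gate and a short memory horizon can be checked with simple arithmetic. The sketch below assumes, as a simplification not stated in the abstract, that each state token is updated as an exponential moving average, `state = (1 - gate) * state + gate * new`. Under that assumption the weights on past frames sum to `1 / gate`, which for the reported median gate of 0.31 is about 3 frames. The `frame_gate` function is a purely hypothetical stand-in for the paper's closed-form gate, which the abstract does not specify.

```python
import math

def effective_horizon(gate):
    """Sum of the weights (1 - gate)**k over k >= 0, which equals 1 / gate."""
    return 1.0 / gate

def frame_gate(prev_feat, curr_feat, floor=1e-3):
    """Hypothetical scalar gate in (0, 1]: larger when features change more."""
    dot = sum(p * c for p, c in zip(prev_feat, curr_feat))
    norm = (math.sqrt(sum(p * p for p in prev_feat))
            * math.sqrt(sum(c * c for c in curr_feat)))
    change = 1.0 - dot / norm  # cosine distance, in [0, 2]
    return min(1.0, max(floor, change))
```

With this toy gate, a frame whose features match the previous frame contributes almost nothing to the state, while a frame with very different features is written at full strength, which mirrors the soft keyframe-selection idea described in the abstract.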

2605.16980 2026-05-19 cs.CV 版本更新

Statistical Hand Shape Modeling from Clinical CT Scans Using Deep Learning and Implicit Skinning

基于深度学习和隐式皮肤的临床CT扫描中手部形状统计建模

Gokce Guven, Hasan Fehmi Ates, Deniz Karasahin, Kaan Erdogan

Affiliations * Dept. of Computer Science and Engineering; Özyeğin University; Dept. of Artificial Intelligence and Data Engineering; Osteoid Inc.

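AI summary This paper proposes an AI-assisted reconstruction pipeline that uses deep learning and implicit skinning to segment and analyze hand anatomy in clinical CT scans, with statistical shape modeling intended to support biomechanics, ergonomics, and medical diagnostics.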

详情
AI中文摘要

准确的分割和手部解剖的统计形状建模对医学诊断、人机工程学和生物力学有重要影响。本研究提出了一种AI辅助的重建流程,用于从1,271例肘至手(e2h-CT)计算机断层扫描中分割和分析手部解剖结构。首先使用基于Pix2Pix的条件生成对抗网络去除CT体积中的石膏和背景伪影。清洁后的扫描随后在3D Slicer中处理,提取皮肤和骨掩膜,并将其转换为封闭曲面网格模型。分割的骨网格用于构建骨骼表示,使隐式皮肤能够将所有手模型对齐到标准化的解剖配置。随后,使用Geodesic Based Coherent Point Drift++(GBCPD++)算法对手部皮肤表面进行非刚性配准,以在不同受试者之间建立点对应关系。然后对配准后的模型应用主成分分析(PCA)以量化解剖形状的变异性。Pix2Pix预处理阶段在保留测试集上实现了Dice系数为0.9856和IoU为0.9720。统计建模在90例扫描中进行,其中手指完全可见且解剖上分离。所得的统计形状分布与美国陆军人体测量调查(ANSUR II)有很强的一致性,支持重建模型的解剖有效性。所提出的方法在生物力学建模、人机工程学优化、假肢设计和精准医疗诊断方面具有显著潜力。

英文摘要

Accurate segmentation and statistical shape modeling of hand anatomy have significant implications for medical diagnostics, ergonomics, and biomechanics. This study proposes an AI-assisted reconstruction pipeline for segmenting and analyzing hand anatomy from 1,271 elbow-to-hand (e2h-CT) computed tomography scans. A Pix2Pix-based conditional generative adversarial network is first employed to remove plaster cast and background artifacts from CT volumes. The cleaned scans are then processed in 3D Slicer to extract skin and bone masks, which are converted into closed-surface mesh models. Segmented bone meshes are used to construct skeletal representations, enabling implicit skinning to align all hand models into a standardized anatomical configuration. Subsequently, non-rigid registration is performed on the hand skin surfaces using the Geodesic Based Coherent Point Drift++ (GBCPD++) algorithm to establish point-wise correspondence across subjects. Principal Component Analysis (PCA) is then applied to the registered models to quantify anatomical shape variability. The Pix2Pix preprocessing stage achieved a Dice coefficient of 0.9856 and an IoU of 0.9720 on the held-out test set. Statistical modeling was performed on a subset of 90 scans in which the fingers were fully visible and anatomically separated. The resulting statistical shape distributions demonstrate strong agreement with the U.S. Army Anthropometric Survey (ANSUR II), supporting the anatomical validity of the reconstructed models. The proposed methodology demonstrates significant potential for advancing biomechanical modeling, ergonomic optimization, prosthetic design, and precision medical diagnostics.
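
The two segmentation scores reported for the preprocessing stage are closely related. The sketch below computes Dice and IoU for a pair of binary masks given as flat lists of 0/1 values; it is a generic illustration, not the authors' evaluation code, and it assumes at least one mask is non-empty.

```python
def dice_and_iou(pred, truth):
    """Dice coefficient and IoU for two equal-length binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_size = sum(1 for p in pred if p)
    truth_size = sum(1 for t in truth if t)
    union = pred_size + truth_size - inter
    dice = 2 * inter / (pred_size + truth_size)
    iou = inter / union
    return dice, iou
```

For a single pair of masks the two scores satisfy IoU = Dice / (2 - Dice). Plugging in Dice = 0.9856 gives roughly 0.9716, close to the reported IoU of 0.9720; scores averaged over a test set need not satisfy the identity exactly.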

2605.16973 2026-05-19 cs.CV cs.LG 版本更新

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

SHED: 风格均质化嵌入对齐用于领域泛化

Kai Gan, Tong Wei

Affiliations * School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China

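AI summary This paper proposes SHED, which addresses information asymmetry in domain generalization by aligning style-homogenized embeddings; experiments show state-of-the-art performance on multiple benchmarks.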

详情
AI中文摘要

领域泛化旨在通过嵌入分布偏移增强模型对未见领域的鲁棒性。尽管像CLIP这样的大规模视觉-语言模型表现出色,但其直接的图像-文本嵌入对齐却受到固有信息不对称的限制:图像编码了类别语义和领域特定的风格,而文本提示主要传达基本的类别线索。这种不对称性阻碍了在现实场景中对新领域的泛化。为此,我们提出了SHED,一种基于CLIP的新方法,通过对齐风格均质化的嵌入而不是CLIP编码器的原始表示。在训练过程中,SHED从图像嵌入(按源领域计算)和文本嵌入(在多样化的提示模板下平均并去除全局质心)中移除领域特定的风格质心。在推理过程中,考虑到目标领域信息的缺乏,SHED将多样化的文本领域质心投影到视觉空间,并通过成员加权聚合预测。在五个基准测试上的广泛实验表明,SHED在多个基准测试中取得了最先进的性能,显著优于先前方法(例如,在DomainNet上比标准微调高出+4.0%)

英文摘要

Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from encoders in CLIP. During training, SHED removes domain-specific style centroids from both image embeddings, computed per source domain, and text embeddings, which are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).
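
The training-time step of removing per-domain style centroids from image embeddings can be sketched in a few lines. This is a minimal illustration under the assumption that a style centroid is simply the mean embedding of a source domain; the function name and the toy data are hypothetical, and the text-side processing and inference-time weighting are not covered.

```python
def remove_style_centroids(embeddings, domains):
    """Subtract each domain's mean embedding from the embeddings of that domain."""
    by_domain = {}
    for vec, dom in zip(embeddings, domains):
        by_domain.setdefault(dom, []).append(vec)
    centroids = {
        dom: [sum(col) / len(vecs) for col in zip(*vecs)]
        for dom, vecs in by_domain.items()
    }
    return [
        [x - c for x, c in zip(vec, centroids[dom])]
        for vec, dom in zip(embeddings, domains)
    ]
```

In a toy case where two domains contain the same two classes shifted by a constant style offset, the homogenized embeddings of matching classes coincide across domains, which is the effect the alignment objective relies on.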

2605.16967 2026-05-19 cs.CV 版本更新

Expandable, Compressible, Mineable: Open-World Thermal Image Restoration

可扩展、可压缩、可挖掘:面向开放世界热成像修复的ECMRNet

Pu Li, Huafeng Li, Yafei Zhang, Wen Wang, Neng Dong, Jie Wen

Affiliations * Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Yunnan, China; School of Mathematics and Statistics, Yunnan University, Yunnan, China; Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology, Shenzhen, China

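AI summary This paper proposes ECMRNet, which approaches open-world thermal image restoration from a continual learning perspective. An expand-compress-mine closed-loop process enables continual adaptation to new degradations; Structural Entropy Pruning curbs model growth, and a Sub-degradation Knowledge Mining Module improves restoration under compound degradations.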

Comments Accepted by ICML2026

详情
AI中文摘要

在开放世界场景中,热红外(TIR)图像退化持续出现并演变,而现有大多数单一切换修复方法基于封闭集假设,难以持续适应新退化。为此,我们提出ECMRNet,即面向开放世界热成像修复的可扩展、可压缩、可挖掘修复网络。从概念上,ECMRNet将持续退化学习统一为一个“扩展-压缩-挖掘”闭环过程,通过可控进化实现对新退化的持续适应。从结构上,ECMRNet将中间表示分解为组隔离的子空间,并通过冻结历史组和等形扩展新组,实现严格参数隔离和快速适应新退化。为抑制任务积累后的模型增长,我们提出结构熵剪枝,通过二维结构熵最小化识别并移除冗余通道组,实现信息贡献驱动的自适应压缩。此外,我们设计了子退化知识挖掘模块,动态检索并重新组合历史表示中的可转移组件,以提高复合退化下的修复性能。实验结果表明,ECMRNet在多种单退化和复合退化场景中均实现了优越的整体性能,同时使用更少的参数和更低的计算成本。源代码可在https://github.com/Kust-lp/ECMRNet获取。

英文摘要

In open-world settings, thermal infrared (TIR) image degradations continuously emerge and evolve, while most existing all-in-one restoration methods are built on a closed-set assumption and struggle to continually adapt to novel degradations. To address this, we propose ECMRNet, an Expandable, Compressible, and Mineable Restoration Network for open-world TIR restoration from a continual learning perspective. Conceptually, ECMRNet unifies continual degradation learning as an "expand-compress-mine" closed-loop process, enabling sustained adaptation to new degradations with controllable evolution. Structurally, ECMRNet decomposes intermediate representations into group-isolated subspaces, and achieves strict parameter isolation and fast adaptation to new degradations by freezing historical groups and isomorphically expanding new ones. To curb model growth as tasks accumulate, we present Structural Entropy Pruning, which identifies and removes redundant channel groups via two-dimensional structural entropy minimization, achieving information contribution-driven adaptive compression. Moreover, we design a Sub-degradation Knowledge Mining Module that dynamically retrieves and recombines transferable components from historical representations to improve restoration under compound degradations. Experimental results demonstrate that ECMRNet achieves superior overall performance across diverse single and compound degradations while using fewer parameters and lower computational cost. The source code is available at https://github.com/Kust-lp/ECMRNet.

2605.16961 2026-05-19 cs.CV cs.AI 版本更新

Latent Action Control for Reasoning-Guided Unified Image Generation

潜在动作控制用于推理引导的统一图像生成

Fuxiang Zhai, Sixiang Chen, Yingjin Li, Shuaibo Li, Jianyu Lai, Tengjun Huang, Lei Zhu

Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

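AI summary This paper proposes Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions, enabling reasoning-guided image generation within a unified generator. LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, improving generation quality.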

详情
AI中文摘要

统一的多模态模型可以在共享的骨干网络中编码视觉理解和图像生成,但理解并不自动转化为控制:模型可能推断出对象、关系或知识提示,但无法在生成的图像中实例化。我们提出潜在动作控制(LAC),通过将推理表示为隐藏的连续动作,使推理过程可操作。给定提示,LAC会规划角色结构化的潜在轨迹,进行内部视觉草图、诊断和细化,并将这些动作注入到条件流生成的隐藏流中,而无需生成推理标记或中间图像。由于这些动作轨迹是未观察到的,LAC通过先验引导的变分潜在动作对齐从仅训练的语义先验、草图图像特征和监督停止信号中学习这些动作,随后通过Latent-Flow GRPO对齐潜在到图像的生成轨迹与终端视觉反馈。这为从推断的关系、绑定和知识提示到生成过程的控制路径提供了支持。在BAGEL-7B-MoT上实现后,LAC在GenEval、WISE和T2I-CompBench中一致提升了组合性和知识引导的生成,尤其是在空间关系、属性绑定和世界知识敏感提示上表现最佳。消融实验和潜在干预显示,学习的动作轨迹被生成器消耗,表明统一生成在理解不仅被编码,而是在生成过程中被操作时受益。

英文摘要

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.

2605.16951 2026-05-19 cs.CV 版本更新

Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing

Edit-GRPO: 一种用于图像编辑的保持局部性的策略优化框架

Shaodong Xu, Zexian Li, Zhendong Wang, Litong Gong, Tiezheng Ge, Wengang Zhou, Bo Zheng, Houqiang Li

Affiliations * Alibaba Group

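AI summary This paper proposes Edit-GRPO, a framework that separates editing and preservation objectives to reconcile locality preservation with global consistency in image editing, improving edit quality and reducing common artifacts such as context distortion.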

详情
AI中文摘要

图像编辑中的一个根本性挑战在于保持空间局部性:编辑应改进目标内容而不应无意地改变周围区域。然而,大多数基于优化的编辑方法将图像视为整体实体,导致全局策略更新,从而破坏局部性并引入不期望的上下文变化。我们观察到,这一问题源于局部编辑意图与全局应用的优化信号之间的不匹配。受此启发,我们提出Edit-GRPO,一种在优化图像编辑时保持局部性的策略优化框架,该框架明确地将编辑和保留目标分离。通过为编辑和非编辑区域分配区域特定的优化信号,Edit-GRPO使策略更新与编辑任务的空间结构对齐,从而实现局部改进同时保持全局视觉一致性。这种设计有效抑制了诸如上下文扭曲和边界不一致等常见伪影。在各种图像编辑场景中的广泛实验表明,与现有基于优化的方法相比,Edit-GRPO在显著提高局部性保持的同时,保持了强大的编辑性能,验证了所提框架的通用性和有效性。

英文摘要

A fundamental challenge in image editing lies in preserving spatial locality: edits should improve targeted content without inadvertently altering surrounding regions. However, most optimization-based editing approaches treat images as holistic entities, causing global policy updates that undermine locality and introduce undesired context changes. We observe that this issue stems from a mismatch between localized editing intent and globally applied optimization signals. Motivated by this insight, we propose Edit-GRPO, a locality-preserving policy optimization framework for image editing that explicitly decouples editing and preservation objectives. By assigning region-specific optimization signals to edit and non-edit areas, Edit-GRPO aligns policy updates with the spatial structure of editing tasks, enabling localized improvements while maintaining global visual coherence. This design effectively suppresses common artifacts such as context distortion and boundary inconsistency. Extensive experiments across diverse image editing scenarios demonstrate that Edit-GRPO significantly improves locality preservation while maintaining strong editing performance compared to existing optimization-based methods, validating the generality and effectiveness of the proposed framework.

2605.16949 2026-05-19 cs.CV 版本更新

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

超越点对点匹配:为加速扩散变换器的结构表示对齐

Shaodong Xu, Zhendong Wang, Litong Gong, Zexian Li, Wengang Zhou, Tiezheng Ge, Houqiang Li

Affiliations * University of Science and Technology of China

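AI summary This paper proposes sREPA, a framework that aligns the relational geometry of feature maps through explicit structural constraints, improving generation quality and accelerating convergence.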

详情
AI中文摘要

最近的扩散变换器(DiTs)进展表明,将噪声潜在状态与经过训练的语义特征对齐(如代表性对齐(REPA)所开创的)可以显著加速训练并提高生成保真度。随后的分析(例如iREPA)表明,这些收益主要来自于转移预训练视觉表示中包含的空间结构。然而,大多数现有对齐方法使用点对点匹配目标或依赖于隐式架构调整,这些方法未能显式建模视觉基础模型中固有的空间关系几何。我们主张,元素级监督不足以捕捉视觉表示中的丰富空间拓扑,有效的对齐应以显式结构约束的形式进行。为此,我们提出了sREPA,一种结构代表性对齐框架,以强制特征图的相对几何一致性,而不是仅仅匹配单个特征点。通过鼓励模型内部化预训练特征中的整体空间布局和结构相关性,sREPA在比最先进的对齐策略更快、更稳定的收敛以及改进的样本质量方面取得了成果。我们的代码和模型将被发布。

英文摘要

Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features, as pioneered by Representation Alignment (REPA), can substantially accelerate training and improve generation fidelity. Subsequent analysis (e.g., iREPA) suggests that these gains arise primarily from transferring spatial structure contained in pre-trained vision representations. However, most existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA, a structural REPresentation Alignment framework to enforce consistency in the relational geometry of feature maps, rather than merely matching individual feature points. By encouraging the model to internalize holistic spatial layouts and structural correlations from pre-trained features, sREPA achieves faster and more stable convergence, along with improved sample quality, compared to state-of-the-art alignment strategies. Our code and models will be released.
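
The contrast between point-wise matching and structural alignment can be made concrete with a toy example. The abstract does not state the actual sREPA objective, so the sketch below uses one plausible stand-in: comparing the pairwise cosine-similarity matrices of student and teacher patch features. Both loss functions and their names are hypothetical.

```python
import math

def cosine_gram(feats):
    """Pairwise cosine similarities between feature vectors (one per patch)."""
    normed = []
    for f in feats:
        n = math.sqrt(sum(v * v for v in f))
        normed.append([v / n for v in f])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]

def pointwise_loss(student, teacher):
    """Mean squared error between corresponding feature entries."""
    total = sum((s - t) ** 2
                for su, tu in zip(student, teacher)
                for s, t in zip(su, tu))
    return total / (len(student) * len(student[0]))

def structural_loss(student, teacher):
    """Mean squared error between the two pairwise-similarity matrices."""
    gs, gt = cosine_gram(student), cosine_gram(teacher)
    n = len(gs)
    return sum((gs[i][j] - gt[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

If the student features are a rotated copy of the teacher features, the point-wise loss is large even though every pairwise relation between patches is preserved, whereas the structural loss is zero. A relational loss of this kind also does not require the two feature spaces to share a dimension.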

2605.16937 2026-05-19 cs.CV 版本更新

DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

DEVIS-GRPO:释放GRPO用于动态极视合成

Yi Zuo, Huimin Wu, Lingling Li, Fang Liu, Licheng Jiao, Qing Li

Affiliations * Xidian University; State Key Laboratory of General Artificial Intelligence, BIGAI

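AI summary This paper proposes DEVIS-GRPO, a GRPO-based framework for trajectory-controlled video generation and, according to the authors, the first online policy gradient method for extreme-view video generation. Its core is a new sampling strategy, ADEVIS, which reaches large-view camera motions by progressively accumulating small-view increments, improving training efficiency and sampling diversity.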

详情
AI中文摘要

轨迹控制的视频生成已成为可控视频生成的关键。尽管当前方法在小视相机运动下表现良好,但在大视运动下显著退化。现有的极视合成解决方案通常需要专门的视频对,需要大量标注工作。为了解决这些限制,我们提出了动态极视合成-GRPO(DEVIS-GRPO),一种基于GRPO的框架,用于轨迹控制的视频生成,是首个在线策略梯度方法用于极视视频生成。我们的方法的核心是一种新颖的采样策略:累积动态极视合成(ADEVIS),通过逐步积累小视增量实现大视运动。该方法带来了两个关键优势:1)增强的训练效率,因为它消除了需要预热策略模型的需要,通过收集昂贵的配对大视视频;2)增加的采样多样性,通过灵活变化轨迹配置实现。最后,我们设计了多级一致性-质量奖励函数来选择高质量的样本用于模型优化。在Kubric-4D、iPhone和DL3DV数据集上的实验表明了我们的方法的优越性。在Kubric-4D上,我们在非遮挡区域相比第二好的方法在PSNR上提高了21.57%,在SSIM上提高了7.31%。在iPhone上,LPIPS减少了18.56%。

英文摘要

Trajectory-controlled video generation has become essential for controllable video generation. While current methods perform well under small-view camera motions, they degrade significantly with large-view motions. Existing solutions for extreme-view synthesis typically require dedicated video pairs, demanding substantial annotation effort. To address these limitations, we propose Dynamic Extreme VIew Synthesis-GRPO (DEVIS-GRPO), a GRPO-based framework for trajectory-controlled video generation, the first online policy gradient method for extreme view video generation. Central to our approach is a novel sampling strategy: Accumulative Dynamic Extreme VIew Synthesis (ADEVIS), which achieves large-view camera motions by progressively accumulating small-view increments. This method delivers two key advantages: 1) enhanced training efficiency, as it eliminates the need to warm-start the policy model by collecting expensive paired large-view videos, and 2) increased sampling diversity, achieved by flexibly varying trajectory configurations. Finally, we designed a multi-level consistency-quality reward function to select high-quality samples for model optimization. Experiments on the Kubric-4D, iPhone, and DL3DV datasets demonstrate our method's superiority. On Kubric-4D, we achieve relative improvements of 21.57% in PSNR and 7.31% in SSIM over the second-best method in non-occlusion areas. On iPhone, LPIPS is reduced by 18.56%.
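
Two ingredients named in the abstract lend themselves to a short sketch: the group-relative advantage used by GRPO-style methods, and the idea of reaching a large view change by accumulating small increments. The code below is illustrative only. It assumes rewards are normalized with the population standard deviation inside each sampled group, and it models a camera trajectory as a list of yaw increments in degrees, which is a simplification not taken from the paper.

```python
from itertools import accumulate
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def accumulate_view(increments_deg):
    """Cumulative yaw after each small step; the last value is the total change."""
    return list(accumulate(increments_deg))
```

Samples that score above their group average receive positive advantages and are reinforced, while those below are penalized. Varying the increment list is one simple way to picture how different trajectory configurations could diversify the sampled group.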

2605.16925 2026-05-19 cs.CV 版本更新

P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction

P2GS: 基于物理先验的高斯点云法用于光度一致的城市重建

Kota Shimomura, Hidehisa Arai, Tsubasa Takahashi, Takayoshi Yamashita, Hironobu Fujiyoshi

发表机构 * Chubu University(中部大学) Turing Inc.(图灵公司)

AI总结 本文提出P2GS,一种基于物理先验的高斯点云法,用于解决自动驾驶中由于异质相机管道和动态户外照明导致的光度不一致问题,通过联合分解视图不变的线性HDR光场、每视图曝光尺度和色调映射函数,提升光度一致性与光照一致性。

Comments Accepted CVPR2026 main

详情
AI中文摘要

3D高斯点云法(3DGS)最近作为一种强大的显式表示方法出现,使其能够实现快速、高保真的渲染,成为自动驾驶闭环模拟器和感知模型的有前途的基础。然而,传统3DGS隐式假设不同视图之间具有一致的曝光和色调映射。真实驾驶数据由于异质相机管道和动态户外照明而违反这一假设,将曝光差异和传感器噪声烘焙到光场中,导致在静态背景中产生伪影和不一致的照明,这对现实模拟至关重要。这些问题是自动驾驶中尤为突出的,因为稀疏的视点、变化的曝光和户外照明相互作用,而以往的工作主要针对动态物体重建,忽略了跨视图的光度一致性。为了解决这一限制,我们引入了P2GS,一种物理一致的高斯点云框架,仅从LDR图像中联合分解视图不变的线性HDR光场、每视图曝光尺度和色调映射函数。P2GS采用基于物理图像形成过程的统一优化策略,强制相对曝光一致性和HDR域光流正则化。这产生了一个对跨相机照明差异具有鲁棒性的光场,同时保持标准3DGS的实时效率。在真实和模拟驾驶环境中进行的实验表明,P2GS在LDR重建中匹配或超越了先前的方法,同时在多样化的场景中提供了显著改进的光度一致性、可靠的曝光归一化和物理一致的照明。

英文摘要

3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.
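
Illustrative sketch (not from the paper): the decomposition described above treats each LDR pixel as a tone-mapped, exposure-scaled version of a shared linear HDR radiance. The toy model below assumes a fixed gamma curve with clipping as the tone-mapping function; P2GS estimates its tone-mapping functions and exposure scales during optimization, so the curve, the numbers, and the function names here are hypothetical.

```python
def tone_map(x, gamma=2.2):
    """Hypothetical tone curve: clip to [0, 1], then gamma-encode."""
    return min(max(x, 0.0), 1.0) ** (1.0 / gamma)

def render_ldr(radiance, exposure, gamma=2.2):
    """LDR pixel = tone_map(exposure * linear HDR radiance)."""
    return tone_map(exposure * radiance, gamma)

def recover_radiance(ldr, exposure, gamma=2.2):
    """Invert the model for unclipped pixels."""
    return (ldr ** gamma) / exposure

radiance = 0.3                       # view-invariant scene radiance
view_a = render_ldr(radiance, 1.0)   # camera A
view_b = render_ldr(radiance, 2.0)   # camera B, twice the exposure

# The two LDR values differ, yet both map back to the same radiance
# once the per-view exposure and the tone curve are accounted for.
radiance_a = recover_radiance(view_a, 1.0)
radiance_b = recover_radiance(view_b, 2.0)
```

Fitting a radiance field directly to `view_a` and `view_b` would bake the exposure gap into the scene, which is the failure mode the abstract attributes to conventional 3DGS.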

2605.16922 2026-05-19 cs.CV 版本更新

Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation

基于图像点跟踪的运动线索用于LiDAR场景流估计

Youngdong Jang, Gyeongrok Oh, Jong Wook Kim, Hyunju Ryu, Hyung-gun Chi, SeungHyeon Kim, Seungryong Kim, Jonghyun Choi, Sangpil Kim

发表机构 * Korea University(韩国大学) Purdue University(普渡大学) KAIST(韩国科学技术院) Hyundai Motor Company(现代汽车公司)

AI总结 本文提出TrackCue框架,通过图像点跟踪获取密集轨迹以改进LiDAR场景流估计中的动态物体表示,通过视觉一致的运动补偿策略和视觉运动线索提升来实现更准确的静态-动态分类和更可靠的场景流学习。

详情
AI中文摘要

LiDAR场景流估计对于自动驾驶至关重要,因为它为每个点提供3D运动。自监督方法利用静态-动态分类来缓解静态和动态点之间的不平衡,从而获得针对性的监督。然而,现有方法依赖于稀疏几何观测进行此分类,使其容易受到数据稀疏性和遮挡的影响。由此产生的噪声标签会提供错误的运动指导并降低场景流学习的效果。为了解决这个问题,我们引入了TrackCue,一种基于跟踪的框架,用于改进LiDAR场景流估计中的动态物体表示。具体而言,TrackCue重新利用点跟踪来获取锚定在LiDAR点上的密集图像空间轨迹,提供超越稀疏几何观测的运动线索。此外,我们提出了一种视觉一致的运动补偿策略,该策略在图像平面中将跟踪轨迹与自我诱导的刚性轨迹进行比较,有效地将真正的物体运动与自我诱导的表观运动分离。为了将这些分离的运动线索转移到LiDAR领域,我们执行了视觉运动线索提升,将自我补偿的图像轨迹与LiDAR点相关联以进行静态-动态标签细化。结果,TrackCue产生更准确的静态-动态分类,并为场景流学习提供更可靠的监督。实验结果表明,TrackCue显著提高了动态标签的精度和F1分数,从而在自监督场景流估计中带来了性能提升。

英文摘要

LiDAR scene flow estimation is essential for autonomous driving, as it provides 3D motion for each point. Self-supervised approaches use static-dynamic classification to mitigate the imbalance between static and dynamic points, deriving targeted supervision. However, existing methods rely on sparse geometric observations for this classification, making them vulnerable to data sparsity and occlusions. The resulting noisy labels provide incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic object representation in LiDAR scene flow estimation. In particular, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. Furthermore, we present a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.
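
Illustrative sketch (not from the paper): the motion compensation step compares where a point was actually tracked in the image with where it would appear if it were static and only the ego vehicle moved. The example assumes a pinhole camera, pure ego translation without rotation, and a pixel threshold; the intrinsics, the threshold, and all function names are hypothetical.

```python
import math

def project(point, focal=1000.0, cx=640.0, cy=360.0):
    """Pinhole projection of a 3D point in the camera frame to pixels."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)

def ego_induced_pixel(point_prev, ego_translation):
    """Pixel where a static point lands after the camera translates."""
    moved = tuple(p - t for p, t in zip(point_prev, ego_translation))
    return project(moved)

def is_dynamic(tracked_pixel, point_prev, ego_translation, thresh_px=3.0):
    """Flag a point as dynamic when its tracked position deviates from
    the ego-induced position by more than thresh_px pixels."""
    expected = ego_induced_pixel(point_prev, ego_translation)
    return math.dist(tracked_pixel, expected) > thresh_px

point = (2.0, 0.0, 20.0)   # LiDAR point in the previous camera frame
ego = (0.0, 0.0, 1.0)      # ego vehicle drives 1 m forward

static_track = ego_induced_pixel(point, ego)   # follows ego motion only
moving_track = project((2.5, 0.0, 19.0))       # also shifted 0.5 m sideways

static_flag = is_dynamic(static_track, point, ego)
moving_flag = is_dynamic(moving_track, point, ego)
```

The residual between tracked and ego-induced pixels is attached to the LiDAR point that anchored the track, so the static or dynamic label can be refined in 3D.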

2605.16918 2026-05-19 cs.CV 版本更新

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

HighSync: 通过潜在扩散模型实现高质量唇部同步

Saeed Firouzi Daghigh, Majid Iranpour Mobarekeh, Mostafa Alavi, Mehdi Bagheri

发表机构 * Department of Computer Engineering and Information Technology, Payam Noor University(Payam Noor大学计算机工程与信息科技系)

AI总结 本文提出HighSync,一种端到端的扩散框架,用于生成与任意输入音频对齐的逼真说话人脸视频。该方法同时解决了图像质量和同步准确性之间的矛盾,是首个原生在512*512分辨率上运行的唇部同步模型,适用于电影和广播行业等专业生产环境。

Comments 12 pages, 7 figures, 5 tables

详情
AI中文摘要

我们提出了HighSync,一种端到端的扩散基框架,用于生成与任意输入音频对齐的逼真说话人脸视频。现有方法在图像质量和同步准确性之间难以取得平衡,产生视觉降质或时间不一致的唇部运动。HighSync同时解决这两个挑战,并且据我们所知,是首个在512*512分辨率上原生运行的唇部同步模型,使其成为电影和广播行业等专业生产环境中的可行解决方案。我们方法的核心是识别并系统消除一种数据泄漏现象,这种现象在先前工作中无声地破坏了时间建模,阻碍模型发展对音频信号的真实依赖。在感知质量和同步准确性指标上的全面评估证实,HighSync在两者上均实现了最先进的性能。源代码、预训练模型和补充视频结果可在https://github.com/saeed5959/high_sync上公开获取。

英文摘要

We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512*512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at: https://github.com/saeed5959/high_sync

2605.16911 2026-05-19 cs.CV 版本更新

VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction

VGGT-Occ:基于几何和密度的门控融合用于3D占用预测

Xun Chen, Tianchen Deng, Rui Wang, Fangjinhua Wang, Junyi Ma, Hongming Shen, Hesheng Wang, Danwei Wang

发表机构 * Nanyang Technological University(南洋理工大学) Shanghai Jiao Tong University(上海交通大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出VGGT-Occ,通过在整个管道中嵌入几何标记,引入投影感知可变形注意力(PA-DA)以注入几何信息,结合视图质量语义门控实现跨视图一致性,采用顺序粗到细解码器与门控融合优化效率和性能,实验证明其在3D语义占用预测中的有效性。

详情
AI中文摘要

3D语义占用预测需要准确的2D到3D特征提升,但当前方法限制相机几何到初始投影。后续操作如偏移学习、注意力加权和跨相机聚合仍缺乏几何感知,忽略了关键的物理约束。我们提出了VGGT-Occ,一个在完整管道中嵌入几何标记的框架。我们引入了投影感知可变形注意力(PA-DA)以在所有注意力阶段注入几何信息。PA-DA将3D偏移投影回图像平面,并利用投影雅可比作为加性偏置以抑制不可靠的观测。特征随后通过视图质量语义门控进行跨视图一致性整合。为了优化效率和性能,我们采用顺序粗到细解码器与门控融合,其中低分辨率特征被细化为更高分辨率,通过信息密度分配计算,同时显著减少解码器成本。广泛的评估证明了我们方法的有效性和准确性。在SurroundOcc-nuScenes上,VGGT-Occ在T=1时达到33.00%的IoU和21.08%的mIoU,在T=2推理时达到33.64%的IoU和21.43%的mIoU,优于现有方法,仅使用约4100万可训练参数。代码将公开发布。

英文摘要

3D semantic occupancy prediction requires accurate 2D-to-3D feature lifting, yet current methods restrict camera geometry to initial projections. Subsequent operations like offset learning, attention weighting, and cross-camera aggregation remain geometry-agnostic, ignoring essential physical constraints. We propose VGGT-Occ, a framework that embeds geometric tokens throughout the entire pipeline. We introduce Projection-Aware Deformable Attention (PA-DA) to inject geometry into all attention stages. PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations. Features are then integrated through a view-quality semantic gate for cross-view consistency. To optimize both efficiency and performance, we employ a sequential coarse-to-fine decoder with gated fusion, where low-resolution features are refined into higher resolutions, allocating computation by information density while substantially reducing decoder cost. Extensive evaluations demonstrate the effectiveness and accuracy of our approach. On SurroundOcc-nuScenes, VGGT-Occ achieves 33.00\% IoU and 21.08\% mIoU ($T{=}1$), and 33.64\% IoU and 21.43\% mIoU with $T{=}2$ inference, outperforming existing methods, with only ${\sim}41$M trainable parameters in the occupancy head. Code will be released publicly.

2605.16908 2026-05-19 cs.ET cs.CR cs.CV 版本更新

BIDO: A Biometric Identity Online Authentication Framework

BIDO:一种生物特征身份在线认证框架

Aditya Mithra, Sibi Chakkaravarthy S, Srinivas Kankanala

发表机构 * CyberMACS Kadir Has University(凯德里哈大学CyberMACS) DigitalFortress Private Limited & Indominus Labs Private Limited(DigitalFortress私人有限公司及Indominus Labs私人有限公司) Centre of Excellence, Artificial Intelligence & Robotics (AIR), School of Computer Science and Engineering(卓越中心,人工智能与机器人(AIR),计算机科学与工程学院) VIT-AP University, India(印度VIT-AP大学) Centre of Excellence, Artificial Intelligence & Robotics (AIR), School of Electronics and Communication Engineering(卓越中心,人工智能与机器人(AIR),电子与通信工程学院)

AI总结 本文提出BIDO框架,通过动态生成非居民Web认证凭证,实现无需存储长期生物特征模板的设备无关身份验证,达到NIST SP 800-63B中的AAL2标准,同时在多个面部基准测试中表现出高准确率和低误报率。

详情
AI中文摘要

安全系统需要在不需用户携带物理令牌、智能卡或专用硬件验证器的情况下实现持续且密码学上稳健的身份验证。本文提出了BIDO(Biometric Identity Online),一种设备无关的认证标准,能够在不存储长期生物特征模板、面部图像或其他形式的个人可识别信息(PII)的情况下,达到NIST SP 800-63B中的认证保证级别2(AAL2)。BIDO通过在每次认证事件中从活体生物特征测量中确定性地推导出椭圆曲线数字签名算法(ECDSA)的密钥材料,该过程使用用户定义的记忆化秘密进行盐化,从而消除了持久私钥存储的需求,同时允许从任何商用传感器终端进行验证。生成的凭证是非发现(非居民)Web认证(WebAuthn)凭证,完全兼容所有FIDO2启用的网站和服务,无需服务器端修改。一个多阶段流程,包括捕获200个有效生物特征样本、使用Dlib 68点面部地标预测器进行特征提取、仿射面部对齐、正面性门控、从双眼中点计算欧几里得距离、地板除法量化(除数q=8)、跨会话漂移稳定化以及多数投票SHA-256哈希绑定,产生验证种子(Vseed),从中临时生成WebAuthn凭证并在签名后立即零化。在三个主要面部基准测试(VGGFace2、LFW和MegaFace)上进行评估,达到99.51%的验证准确率(LFW)和92.14%的MegaFace挑战1在10^6干扰项中的排名1识别准确率,同时具有0.03%的密码学误接受率(FAR)和0.90%的误拒率(FRR)。

英文摘要

Security systems demand continuous, cryptographically robust identity verification without requiring subjects to carry physical tokens, smart cards, or dedicated hardware authenticators. This paper presents BIDO (Biometric Identity Online), a device-free authentication standard that achieves Authenticator Assurance Level 2 (AAL2) per NIST SP 800-63B without storing long-lived biometric templates, facial images, or any other form of Personally Identifiable Information (PII). BIDO derives Elliptic Curve Digital Signature Algorithm (ECDSA) key material deterministically from a live biometric measurement salted with a user-defined memorized secret at every authentication event, eliminating persistent private-key storage while enabling verification from any commodity sensor terminal. The generated credentials are non-discoverable (non-resident) Web Authentication (WebAuthn) credentials, fully compatible with all FIDO2-enabled websites and services without modification on the server side. A multi-stage pipeline, comprising capture of 200 valid biometric samples, feature extraction using the Dlib 68-point facial landmark predictor, affine face alignment, frontality gating, Euclidean distance computation from the inter-eye midpoint, floor-division quantization with divisor q = 8, inter-session drift stabilization, and majority-voting SHA-256 hash binding, produces a Verification Seed (Vseed) from which the WebAuthn credential is transiently derived and immediately zeroized after signing. Evaluated on three prominent face benchmarks (VGGFace2, LFW, and MegaFace), BIDO achieves 99.51% verification accuracy on LFW and 92.14% Rank-1 identification accuracy on MegaFace Challenge 1 at 10^6 distractors, with a cryptographic False Accept Rate (FAR) of 0.03% and a False Reject Rate (FRR) of 0.90%.
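
Illustrative sketch (not from the paper): three stages of the pipeline, floor-division quantization with q = 8, majority voting across samples, and SHA-256 binding with a memorized secret, can be mimicked on toy numbers. The distance values, the byte encoding of the payload, and the function names are assumptions; landmark extraction, alignment, drift stabilization, and ECDSA key derivation are omitted, and this is not a secure key-derivation routine.

```python
import hashlib
from collections import Counter

Q = 8  # quantization divisor stated in the abstract

def quantize(distances, q=Q):
    """Floor-divide each landmark distance so small jitter shares a bin."""
    return tuple(int(d // q) for d in distances)

def majority_vote(samples):
    """Per-feature mode across the quantized samples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*samples))

def derive_seed(raw_samples, secret):
    """Hash the voted bins together with the memorized secret."""
    voted = majority_vote([quantize(s) for s in raw_samples])
    payload = secret.encode("utf-8") + b"|" + ",".join(map(str, voted)).encode("ascii")
    return hashlib.sha256(payload).hexdigest()

# Two hypothetical capture sessions of the same face (3 distances each).
session_1 = [[41.2, 87.9, 120.3], [42.0, 88.5, 121.0], [47.5, 86.1, 119.8]]
session_2 = [[40.5, 84.0, 123.9], [43.1, 85.2, 125.0], [44.0, 90.0, 126.5]]

seed_1 = derive_seed(session_1, "correct horse")
seed_2 = derive_seed(session_2, "correct horse")
seed_other = derive_seed(session_1, "different secret")
```

Individual samples fall into neighbouring bins when a distance sits near a multiple of q (88.5 and 119.8 above), which is why voting over many samples is needed before hashing: a single differing bin changes the whole digest.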

2605.16905 2026-05-19 cs.LG cs.CV 版本更新

AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps

AIM:对抗性信息遮蔽用于显著图忠实性评估

Chia-Ying Hsieh, Hsin-Yuan Fang, Chun-Shu Wei

发表机构 * National Yang Ming Chiao Tung University(阳明交通大学)

AI总结 本文提出AIM方法,通过对抗性信息遮蔽框架评估显著图的忠实性及遮蔽操作的可靠性,通过对比不同遮蔽方式下的退化效果,减少遮蔽诱导的偏差,并揭示不同模态下符号和非符号归因之间的差异。

详情
AI中文摘要

后验显著性方法广泛用于解释深度神经网络,但其忠实性难以可靠评估。现有评估方法根据显著性诱导的特征排序进行特征遮蔽并测量性能退化,但这种退化可能受遮蔽操作干扰:零遮蔽可能产生分布外伪影,而基于插值的遮蔽可能保留残余预测信息。我们提出对抗性信息遮蔽(AIM),一种基于显著性的对抗性特征替换框架,用于评估显著图的忠实性和遮蔽操作的可靠性。AIM将选定特征替换为输入的对抗性对应值,并在互补的遮蔽顺序下比较退化效果。我们通过随机归因偏差和解释方法忠实性排名的稳定性来评估可靠性。在图像、音频和EEG任务中的实验表明,AIM相比零和插值遮蔽减少了遮蔽诱导的偏差,同时揭示了符号和非符号归因之间的模态依赖性差异。

英文摘要

Post-hoc saliency methods are widely used to interpret deep neural networks, but their faithfulness is difficult to evaluate reliably. Existing evaluations mask features according to saliency-induced feature ordering and measure performance degradation, but this degradation can be confounded by the masking operator: zero masking may create out-of-distribution artifacts, while interpolation-based masking may preserve residual predictive information. We propose Adversarial Information Masking (AIM), a saliency-guided adversarial feature replacement framework for evaluating both saliency-map faithfulness and masking-operator reliability. AIM replaces selected features with values from an adversarial counterpart of the input and compares degradation under complementary masking orders. We assess reliability using random-attribution bias and stability of explanation-method faithfulness rankings. Experiments on image, audio, and EEG tasks suggest that AIM reduces masking-induced bias compared with zero and interpolation-based masking, while revealing modality-dependent differences between signed and unsigned attributions.
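
Illustrative sketch (not from the paper): masking-based faithfulness evaluation replaces features in saliency order and measures how much the model output drops, once starting from the most salient features and once from the least salient. The toy linear model, the saliency scores, and the replacement values standing in for an adversarial counterpart are all hypothetical; how AIM constructs that counterpart is not specified in the abstract.

```python
def masked_input(x, replacement, saliency, k, most_salient_first=True):
    """Replace k features of x, chosen by saliency rank, with the
    corresponding values from `replacement`."""
    order = sorted(range(len(x)), key=lambda i: saliency[i],
                   reverse=most_salient_first)
    chosen = set(order[:k])
    return [replacement[i] if i in chosen else x[i] for i in range(len(x))]

def degradation(model, x, replacement, saliency, k, most_salient_first=True):
    """Drop in model output caused by the replacement."""
    masked = masked_input(x, replacement, saliency, k, most_salient_first)
    return model(x) - model(masked)

weights = [3.0, 0.1, 2.0, 0.05]

def model(v):
    """Toy linear scorer standing in for a trained network."""
    return sum(w * a for w, a in zip(weights, v))

x = [1.0, 1.0, 1.0, 1.0]
counterpart = [0.0, 0.5, 0.2, 0.9]   # stand-in for an adversarial input
saliency = [0.9, 0.1, 0.7, 0.05]     # attribution scores under test

drop_top = degradation(model, x, counterpart, saliency, k=2)
drop_bottom = degradation(model, x, counterpart, saliency, k=2,
                          most_salient_first=False)
```

A faithful saliency map should give a much larger drop in the most-salient-first order than in the complementary order; a small gap suggests either an unfaithful map or a masking operator that hides the effect.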

2605.16903 2026-05-19 cs.CV 版本更新

WOW-Seg: A Word-free Open World Segmentation Model

WOW-Seg: 无词开放世界分割模型

Danyang Li, Tianhao Wu, Bin Li, Zhenyuan Chen, Yang Zhang, Yuxuan Li, Ming-Ming Cheng, Xiang Li

发表机构 * NKIARI, Shenzhen Futian(深圳福田NKIARI) VCIP, CS, Nankai University(南开大学VCIP实验室) AAIS, Nankai University(南开大学AAIS实验室) Sichuan Agricultural University(四川农业大学) Peking University Shenzhen Graduate School(北京大学深圳研究生院)

AI总结 本文提出WOW-Seg模型,旨在解决开放世界图像分割中的目标精确分割与语义理解问题,通过引入Mask2Token模块和Cascade Attention Mask,提升模型性能,并构建了Region Recognition Dataset (RR-7K)数据集,在LVIS数据集上取得优异成果。

Comments Accepted by ICLR 2026. Code and benchmark dataset are available at https://github.com/AAwCAA/WOW-Seg-Meta

详情
AI中文摘要

开放世界图像分割旨在通过解决现实世界中无限开放的对象类别集,实现图像中目标的精确分割和语义理解。然而,传统封闭集分割方法难以适应复杂的开放世界场景,而基础分割模型如SAM在分割能力与语义理解之间存在明显差距。为弥合这一差距,我们提出了WOW-Seg,一种无词开放世界分割模型,用于对开放集类别中的对象进行分割和识别。具体而言,WOW-Seg引入了新颖的视觉提示模块Mask2Token,将图像掩码转换为视觉令牌并确保其与VLLM特征空间对齐。此外,我们引入了Cascade Attention Mask以解耦不同实例之间的信息。此方法减少了实例间干扰,显著提升了模型性能。我们进一步构建了一个开放世界区域识别测试基准:Region Recognition Dataset (RR-7K)。该数据集包含7,662个类别,代表目前最丰富的区域识别数据集。WOW-Seg在LVIS数据集上取得强劲成果,达到语义相似度89.7和语义IoU 82.4。这一表现超越了先前的SOTA,同时仅使用八分之一的参数量。这些结果凸显了WOW-Seg强大的开放世界泛化能力。代码及相关资源可在https://github.com/AAwcAA/WOW-Seg-Meta获取。

英文摘要

Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it represents the most extensive category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at https://github.com/AAwcAA/WOW-Seg-Meta.

2605.16901 2026-05-19 cs.CV 版本更新

CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model

CAR-SAM:跨注意力重建用于Segment Anything模型的后训练量化

Houji Wen, Jiangyong Yu, Jun Li, Dawei Yang

发表机构 * Nanjing University of Science and Technology(南京理工大学) Houmo AI

AI总结 本文提出CAR-SAM,一种针对Segment Anything模型的统一量化框架,通过引入MatMul-Aware Compensation机制和Joint Cross-Attention Reconstruction策略,解决后训练量化中注意力耗散和重建振荡问题,实现4位精度下的高效量化。

详情
AI中文摘要

Segment Anything Models (SAMs) 被广泛应用于计算机视觉中的通用图像分割,但在资源受限设备上部署具有挑战性,因为它们具有高计算和内存需求。后训练量化(PTQ)是一种广泛使用的模型压缩和加速技术。然而,现有的PTQ方法未能考虑SAM解码器中的跨注意力架构。这种退化主要源于SAMs特有的挑战:(1)注意力耗散,即解码器中对表示分割掩码至关重要的注意力信息在低比特量化下会坍缩成弥散且非语义的形式;(2)重建振荡,即双向Transformer(two-way transformer)内部的双向耦合引入了跨分支误差干扰并破坏了收敛的稳定性。为了解决这些问题,我们提出了CAR-SAM,一种专门针对SAMs的统一量化框架。首先,为了缓解注意力耗散,我们引入了MatMul-Aware Compensation(MAC)机制,将激活引起的量化误差从MatMul转移到前导线性权重。其次,为了缓解解码器优化中的振荡,我们开发了一种联合跨注意力重建(JCAR)策略,联合重建耦合的注意力分支,抑制振荡行为并促进稳定收敛。广泛的实验表明,CAR-SAM能够稳健地将SAM模型量化到4位精度,在SAM-B和SAM-L上分别比现有方法在mAP上提高了14.6%和6.6%。

英文摘要

Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder, and accuracy degrades as a result. This degradation primarily stems from the unique challenges posed by SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. Firstly, to mitigate attention dissipation, we introduce a MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to the preceding linear weights. Secondly, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L, respectively.
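
Illustrative sketch (not from the paper): the rounding error that post-training quantization introduces at 4 bits can be seen with plain symmetric uniform quantization. This is generic PTQ rounding on made-up weights, not the MAC or JCAR procedures, and the function names are hypothetical.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric uniform quantization to signed `bits`-bit codes and back."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes], codes

def mean_squared_error(a, b):
    """Average squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.61]   # toy weight values

deq4, codes4 = quantize_dequantize(weights, bits=4)
deq8, codes8 = quantize_dequantize(weights, bits=8)

err4 = mean_squared_error(weights, deq4)
err8 = mean_squared_error(weights, deq8)
```

With only 16 levels, small values such as 0.05 round to zero, which hints at how low-bit quantization can flatten fine-grained attention patterns; compensation schemes aim to move or cancel this error rather than merely shrink the step size.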

2605.16899 2026-05-19 cs.CV 版本更新

LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

LASAR:迈向基于潜在认知图的时空推理

Jinzhou Tang, Sidi Liu, Waikit Xiu, Weixing Chen, Keze Wang

发表机构 * Sun Yat-sen University(中山大学)

AI总结 本文提出LASAR架构,通过双记忆系统维护事件经历和语义认知图,并引入ST-CRL对比目标训练该架构,以提升长距离碎片化经验中对细粒度空间关系的编码能力,在标准VLN-CE和VSI-Bench基准上实现了零样本泛化能力的2%-3.5%提升。

详情
AI中文摘要

具身AI中的一个根本挑战是验证智能体是否构建了空间结构的内部模型,或者仅仅是学习模仿任务特定的专家轨迹。这至关重要,因为基于动作中心任务(如VLN)和推理中心任务(如EQA)的基础方法往往共享一个共同的局限性:缺乏迫使它们编码细粒度空间关系(如拓扑或距离)的训练信号。为了解决这一问题,我们首先提出LASAR,一种具有双记忆系统的架构,旨在维护事件经历和语义认知图。然后引入了时空上下文表示学习(ST-CRL),一种对比目标,用于训练该架构。ST-CRL利用从模拟中生成的注释时空上下文中的时空线索来构建样本对,从而从智能体的经验中形成内部认知图。实验表明,我们的方法在标准VLN-CE和VSI-Bench基准上的零样本泛化能力提升了2%-3.5%。我们还证明了所提出认知图具有高度的自一致性。

英文摘要

A fundamental challenge in embodied AI is verifying whether agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical because foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves 2\%-3.5\% gains in zero-shot generalization on both the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.

2605.16892 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

DriveSafe: 一种用于驾驶场景中风险检测与安全建议的框架

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C. V. Jawahar

发表机构 * IIIT-Hyderabad(IIIT-海得拉巴)

AI总结 本文提出DriveSafe框架,通过结构化自然语言描述实现风险感知场景理解,结合多模态上下文生成空间 grounded 的描述,用于下游风险评估和安全建议,实验表明其在DRAMA基准上达到最先进的性能。

Comments 8 pages

详情
AI中文摘要

全面的情景意识对于在安全关键环境中运行的自动驾驶车辆至关重要,因为它能够识别并缓解潜在风险。尽管最近的多模态大语言模型(MLLMs)在通用视觉-语言任务上表现出色,但我们的研究发现,零样本MLLMs在细粒度、空间接地的风险评估中仍不如领域特定的方法。为了解决这一差距,我们提出了DriveSafe,一种用于风险感知场景理解的框架,利用结构化自然语言描述。具体而言,我们的方法首先生成包含运动、空间和深度线索的多模态上下文的时空接地描述。这些描述随后用于下游的风险评估,明确识别危险物体、其位置以及它们所暗示的不安全行为,随后提供可操作的安全建议。为了进一步提高性能,我们采用描述-风险配对来微调一个轻量级的适配器模块,高效地将领域特定的知识注入基础LLM中。通过将风险评估条件化为显式的语言基础场景表示,DriveSafe在零样本MLLMs和先前的领域特定基线之上取得了显著的提升。在DRAMA基准上的全面实验表明了最先进的性能,而消融研究验证了我们关键设计选择的有效性。项目页面:https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

英文摘要

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

2605.16889 2026-05-19 cs.CV 版本更新

Controlling Decision Drift in Multimodal Sentiment Analysis with Missing Modalities

控制缺失模态下多模态情感分析中的决策漂移

Chenglizhao Chen, Yuchen Cao, Xinyu Liu, Mengke Song, Guisheng Zhang, Xiaomin Yu

发表机构 * Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China)(青岛软件研究所,计算机科学与技术学院,中国石油大学(华东)) Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software(山东省智能油气工业软件重点实验室) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出了一种两级参考对齐框架,旨在解决多模态情感分析中因缺失模态和质量不平衡导致的决策漂移问题,通过稳定参考提升鲁棒性,实验表明在不同缺失模态设置下方法有效,且在全模态输入下达到最先进的性能。

Comments Accepted by IJCAI 2026

详情
AI中文摘要

多模态情感分析依赖于文本、声学和视觉信号,但现实数据常面临模态缺失和质量不平衡的问题。现有方法通过可用模态生成缺失模态的特征,但不同模态的表达机制和情感动态差异可能导致生成特征偏离真实分布并误导预测。此外,不可靠的模态可能主导融合,导致不同模态组合间的表示漂移和情感表示不稳定。为了解决这些挑战,我们提出了一种两级参考对齐框架。该框架在特征表示和情感决策层面引入稳定参考,以提高模态缺失下的鲁棒性。首先级参考对齐利用完整模态样本来约束表示,并将不同模态组合对齐到共享的情感空间。第二级参考对齐通过原型检索和投票抑制不可靠模态,以决策层面实现跨模态一致性。结果表明,该框架在各种缺失模态模式下保持稳定可靠的情感预测。在CMU-MOSI和CMU-MOSEI数据集上的实验显示,方法在不同缺失模态设置下表现一致。在全模态输入下,所提方法达到最先进的性能,准确率(ACC)为86.28%和85.88%,F1值为86.24%和85.86%。

英文摘要

Multimodal sentiment analysis relies on textual, acoustic, and visual signals, yet real-world data often suffer from missing modalities and quality imbalance. Existing methods generate features for the missing modalities from the available ones, but differences in expression mechanisms and sentiment dynamics across modalities may cause the generated features to deviate from the true distributions and mislead prediction. In addition, unreliable modalities may dominate fusion, resulting in representation shift across modality combinations and unstable sentiment representations. To address these challenges, we propose a two-level reference alignment framework. The framework introduces stable references at the feature representation and sentiment decision levels to improve robustness when modalities are missing. First-level reference alignment leverages complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space. Second-level reference alignment enforces cross-modal consistency at the decision level by suppressing unreliable modalities through prototype retrieval and voting. As a result, the framework maintains stable and reliable sentiment predictions under diverse missing-modality patterns. Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across various missing-modality settings. Under full-modality input, the proposed method achieves state-of-the-art performance, with ACC of 86.28% and 85.88%, and F1 of 86.24% and 85.86%.
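
Illustrative sketch (not from the paper): decision-level alignment through prototype retrieval and voting can be pictured as each available modality retrieving its nearest sentiment prototype, followed by a majority vote. The two-dimensional features, the prototypes, the cosine similarity, and the plain majority rule are assumptions; the abstract does not specify how prototypes are built or how votes are weighted.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_label(feature, prototypes):
    """Label of the prototype most similar to the feature."""
    return max(prototypes, key=lambda label: cosine(feature, prototypes[label]))

def vote(modality_features, prototypes):
    """Majority vote over the labels retrieved by the available modalities."""
    votes = Counter(retrieve_label(f, prototypes)
                    for f in modality_features.values())
    return votes.most_common(1)[0][0]

prototypes = {"positive": [1.0, 0.2], "negative": [-1.0, 0.1]}

# Full-modality case: the visual feature is unreliable and disagrees.
full = {"text": [0.9, 0.1], "audio": [0.7, 0.4], "vision": [-0.8, 0.3]}
# Missing-modality case: only text is available.
text_only = {"text": [0.9, 0.1]}

decision_full = vote(full, prototypes)
decision_text_only = vote(text_only, prototypes)
```

Because the decision is anchored to shared prototypes rather than to one fused vector, a single unreliable or absent modality does not by itself shift the prediction in this toy setting.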

2605.16887 2026-05-19 cs.CV cs.LG 版本更新

Mind the Gap: Learning Modality-Agnostic Representations with a Cross-Modality UNet

Xin Niu, Enyi Li, Jinchao Liu, Yan Wang, Margarita Osadchy, Yongchun Fang

发表机构 * Tianjin Key Laboratory of Intelligent Robotics, College of Artificial Intelligence, Nankai University, China(天津智能机器人重点实验室,人工智能学院,南开大学,中国) Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, Nankai University, China(可信行为智能工程研究中心,教育部,南开大学,中国) Department of Computer Science, Haifa University, Israel(计算机科学系,海法大学,以色列) VisionMetric Ltd, Canterbury, Kent, UK(VisionMetric Ltd,坎特伯雷,肯特,英国)

AI总结 本文提出了一种紧凑的编码器-解码器神经模块(cmUNet),通过跨模态转换和模态内重建,学习模态无关的表示,同时保留身份相关的信息。此外,作者提出了MarrNet,通过将cmUNet连接到标准特征提取网络,实现跨模态匹配,并在多个挑战性任务上验证了其优越性能。

Comments Published in IEEE Transactions on Image Processing. See full abstract in the PDF file

Journal ref IEEE Transactions on Image Processing, vol. 33, pp. 655-670, 2024

详情
英文摘要

Cross-modality recognition has many important applications in science, law enforcement and entertainment. Popular methods to bridge the modality gap include reducing the distributional differences between representations of different modalities, learning indistinguishable representations, and explicit modality transfer. The first two approaches lose discriminant information while removing modality-specific variations. The third relies heavily on successful modality transfer and can suffer a catastrophic performance drop when explicit modality transfer is impossible or difficult. To tackle this problem, we propose a compact encoder-decoder neural module (cmUNet) to learn modality-agnostic representations while retaining identity-related information. This is achieved through cross-modality transformation and in-modality reconstruction, enhanced by an adversarial/perceptual loss which encourages indistinguishability of representations in the original sample space. For cross-modality matching, we propose MarrNet, where cmUNet is connected to a standard feature extraction network which takes the modality-agnostic representations as input and outputs similarity scores for matching. We validated our method on five challenging tasks, namely Raman-infrared spectrum matching, cross-modality person re-identification and heterogeneous (photo-sketch, visible-near infrared and visible-thermal) face recognition, where MarrNet showed superior performance compared to state-of-the-art methods. Furthermore, we observe that a cross-modality matching method can be biased to extract discriminant information from partial or even wrong regions when it cannot handle the modality gap, which subsequently leads to poor generalization. We show that robustness to occlusions can indicate how well a method bridges the modality gap.

2605.16879 2026-05-19 cs.CV 版本更新

Towards Generalized Image Manipulation Localization via Score-based Model

通过基于分数的模型实现通用图像操纵定位

Yunfei Wang, Bo Du, Zhe Yang, Xin Liu, Zhiyu Lin, Tianxin Xu, Ji-Zhe Zhou

发表机构 * Sichuan University(四川大学)

AI总结 本文提出DiffIML框架,通过引入基于分数的生成模型来解决图像操纵定位中的泛化问题,利用结构先验迭代恢复相干掩码,提升模型鲁棒性,并在多个基准测试中证明其优越的泛化能力。

Comments Accepted to ICMR 2026. 9 pages, 4 figures

详情
AI中文摘要

随着合成媒体的快速发展,图像操纵定位(IML)已成为多媒体取证中的关键组成部分,用于确保数字内容的完整性。然而,泛化仍然是核心挑战,因为现有的判别方法通常学习固定的决策边界,容易过拟合特定训练伪影,且无法适应未见过的操纵类型。为了解决这一问题,我们提出了DiffIML,一种新颖的框架,引入基于分数的生成模型到IML中。不同于直接估计硬边界,DiffIML近似分数函数,即对数似然的梯度,以捕捉掩码分布的内在几何拓扑。这一范式利用结构先验迭代地从噪声中恢复连贯的掩码,从而避免判别模型的脆弱性。在此框架下,扩散模型成为学习分数函数的有效数值求解器。为确保实用性,我们分别解决了标准扩散模型的效率和稳定性瓶颈:(1)利用轻量级的特定掩码VAE实现快速的潜在空间处理,并采用解耦架构和轻量级去噪UNet;(2)边缘监督和误差先验以减轻采样过程中的误差累积。在两个不同的协议上对八个非生成式和三个生成式基准进行的广泛实验表明,DiffIML在多个基准测试中均优于最先进的方法,实现了在多样化未见过的数据集上的显著泛化改进。代码将公开提供。

英文摘要

With the rapid evolution of synthetic media, Image Manipulation Localization (IML) has emerged as a critical component in multimedia forensics for ensuring the integrity of digital content. However, generalization remains a core challenge, as existing discriminative methods typically learn a fixed decision boundary that tends to overfit to specific training artifacts and fails to adapt to unseen manipulation types. To address this, we propose DiffIML, a novel framework that introduces score-based generative modeling to IML. Diverging from the direct estimation of hard boundaries, DiffIML approximates the score function, the gradient of the log-likelihood, to capture the intrinsic geometric topology of mask distributions. This paradigm leverages structural priors to iteratively recover coherent masks from noise, thereby circumventing the brittleness associated with discriminative models. Under this formulation, diffusion models serve as an effective numerical solver for the learned score function. To ensure practicality, we resolve the efficiency and stability bottlenecks of standard diffusion, respectively, by: (1) utilizing a Lightweight Mask-Specific VAE for fast latent-space processing and a decoupled architecture with a lightweight denoising UNet, and (2) applying edge supervision and an error prior to mitigate error accumulation during sampling. Extensive experiments under two distinct protocols on eight non-generative and three generative benchmarks demonstrate that DiffIML consistently outperforms state-of-the-art methods, yielding remarkable generalization improvements on diverse unseen datasets. The code will be publicly available.

2605.16877 2026-05-19 cs.CV 版本更新

Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions

基于对预测的方向导数影响的零样本忠实文本解释

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

发表机构 * Chiba University(千叶大学) National Institute of Informatics(国立情报学研究所)

AI总结 本文提出FaithTrace方法,通过计算类logit在分类器特征空间中沿文本诱导方向的方向导数来衡量文本解释对预测的影响,从而提升图像分类器零样本文本解释的忠实性。

Comments 11+8 pages, 8 figures, 6 tables

详情
AI中文摘要

零样本文本解释旨在通过探测内部表示使图像分类器更透明,而无需依赖任务特定监督或LVLMs。然而,现有方法常遗漏真正驱动预测的特征,导致解释对模型决策证据的忠实性有限。为此,我们提出FaithTrace。基于“忠实的解释应当描述对预测有强烈影响的概念”这一想法,FaithTrace直接测量解释诱导的表示在多大程度上改变类logit。我们引入影响评分,计算为分类器特征空间中类logit沿文本诱导方向的方向导数,并用其作为忠实性的代理指标。此外,我们将此影响评分扩展为定量评估指标,帮助填补文本解释忠实性评估的空白。实验表明,FaithTrace产生的解释比基线更忠实,有助于更准确地理解模型。代码将公开发布。

英文摘要

Zero-shot textual explanations aim to make image classifiers more transparent by probing their internal representations, without relying on task-specific supervision or LVLMs. However, existing methods often miss the features that truly drive the prediction, resulting in limited faithfulness to the evidence underlying the model's decision. To address this, we propose FaithTrace. Motivated by the idea that faithful explanations should describe concepts that strongly influence the prediction, FaithTrace directly measures how much the representation induced by the explanation changes the class logit. We introduce an influence score, computed as the directional derivative of the class logit along the text-induced direction in the classifier's feature space, and use it as a proxy for faithfulness. Moreover, we extend this influence score into quantitative evaluation metrics, helping fill the gap in faithfulness evaluation for textual explanations. Experiments show that FaithTrace yields more faithful explanations than baselines, facilitating a more accurate understanding of the model. The code will be publicly released.
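
The influence score described in this abstract is a directional derivative of a class logit. The sketch below only illustrates that quantity and is not the authors' implementation: the toy linear head, the feature vector, and the two candidate directions are hypothetical, and the derivative is estimated with a central finite difference so that any callable logit could be plugged in.

```python
import math

def directional_derivative(logit_fn, feature, direction, eps=1e-4):
    """Central finite-difference estimate of d/dt logit_fn(feature + t*u) at t=0,
    where u is the unit vector along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    plus = [f + eps * u for f, u in zip(feature, unit)]
    minus = [f - eps * u for f, u in zip(feature, unit)]
    return (logit_fn(plus) - logit_fn(minus)) / (2.0 * eps)

def make_linear_logit(weights, bias):
    """Hypothetical classifier head: logit(f) = w . f + b."""
    def logit(feature):
        return sum(w * f for w, f in zip(weights, feature)) + bias
    return logit

# Toy setup (all values are made up for illustration).
logit = make_linear_logit([2.0, -1.0, 0.5], bias=0.1)
feature = [0.3, 0.7, -0.2]

# Two hypothetical text-induced directions in feature space.
score_a = directional_derivative(logit, feature, [1.0, 0.0, 0.0])
score_b = directional_derivative(logit, feature, [0.0, 1.0, 0.0])
```

For a linear head the score equals the dot product of the weights with the unit direction, so the first direction raises the logit and the second lowers it. Under this reading, an explanation whose direction yields a larger positive score would count as more influential.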

2605.16873 2026-05-19 cs.CV 版本更新

HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

HAD:面向3D重建的幻觉感知扩散先验

Xi Liu, Weiwei Sun, Zhou Ren, Chris Broaddus, Siyu Huang, Laurent Guigues

发表机构 * Amazon AWS(亚马逊AWS) Clemson University(克莱姆森大学)

AI总结 本文提出HAD,一种面向3D重建的幻觉感知扩散先验,通过利用在大规模3D数据上预训练的前馈式新视角合成(NVS)网络的多视角推理能力,估计增强图像的像素级幻觉分数图,从而在渐进式3D重建过程中选择性地屏蔽不可靠像素,减少幻觉伪影,提升3D重建质量。

Comments Accepted by CVPR 2026

详情
AI中文摘要

扩散先验最近在通过在新视角上增强训练视角来提高稀疏视角3D重建质量方面表现出强大的能力,但不可避免地会引入幻觉内容——与输入视角不一致的伪影——进入最终的3D模型。为了解决这一挑战,我们提出了Hallucination-Aware Diffusion prior(HAD),它通过利用在大规模3D数据上预训练的前馈式新视角合成(NVS)网络的多视角推理能力,估计增强图像的像素级幻觉分数图。这些幻觉分数使得在渐进式3D重建过程中可以选择性地屏蔽不可靠像素,防止将不存在的伪影引入3D模型。为了进一步提高性能,我们在每个新视角上创建多个增强图像版本,通过将扩散先验条件化于不同的输入视角,然后将这些图像融合成最终图像,该图像利用了所有输入视角的更广泛上下文。我们证明了我们的方法在扩散辅助的3D重建中显著减少了幻觉伪影,从而在多个新视角合成基准上实现了最先进的性能。我们的项目在https://xiliu8006.github.io/HAD-Project-website/上公开可用。

英文摘要

Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content -- artifacts inconsistent with the input views -- into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis. Our project is publicly available at https://xiliu8006.github.io/HAD-Project-website/.

2605.16864 2026-05-19 cs.CV cs.AI 版本更新

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

基于度量的视觉基础模型特征融合用于分割任务

Yachan Guo, JoseLuis Gomez Zurita, Danna Xue, Yi Xiao, AntonioManuel Lopez Pena

发表机构 * Universitat Autònoma de Barcelona(巴塞罗那自治大学) Computer Vision Center(计算机视觉中心) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳))

AI总结 本文提出了一种基于度量的特征融合方法,通过评估不同视觉基础模型的特征空间,选择并聚合互补特征以提升密集预测任务的性能。

Comments Accepted to the CVPR 2026 Findings Track

详情
AI中文摘要

尽管大规模视觉基础模型(VFMs)在语义理解方面表现优异,但在实例感知的密集预测任务中仍显不足。它们在表示上存在不同的偏倚:例如,可提示的分割模型(如SAM2)专注于细粒度区域边界,而自监督模型(如DINOv3)强调物体层面的结构。这一观察表明,结合不同VFMs的互补特征可以增强下游密集预测任务。然而,简单的多VFMs融合很少带来可靠的增益,且如何利用其互补特征的可解释原则仍待探索。在本文中,我们提出了一种基于度量的方法,通过显式的评估分数选择并聚合不同VFMs的互补特征。具体而言,我们设计了一套无标签的度量标准,在特征空间的两个方面,结构一致性与边缘保真度,来评估VFM编码器的特征。在这些分数的指导下,我们识别出互补的边缘强与结构强编码器对,并通过主辅融合方案进行整合。这种特征融合不需要复杂的架构更改,并且仅在单个阶段进行训练。我们的模型在多个密集预测任务中相比基线模型表现出一致的性能提升,具有更好的物体层面语义和更准确的边界定位。代码可在https://github.com/gyc-code/metric-guided-fusion获取。

英文摘要

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.

2605.16861 2026-05-19 cs.CV cs.AI 版本更新

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

前缀自适应块扩散用于高效的文档识别

Mingxu Chai, Ziyu Shen, Chenyu Liu, Kaidi Zhang, Jiazheng Zhang, Dingwei Zhu, Zhiheng Xi, Ruoyu Chen, Jun Long, Jihua Kang, Tao Gui, Qi Zhang

发表机构 * Computation and Artificial Intelligence Innovative College, Fudan University, Shanghai, China(复旦大学计算与人工智能创新学院,上海,中国) Shanghai Innovation Institute, Shanghai, China(上海创新研究院,上海,中国) ByteDance, Shanghai, China(字节跳动,上海,中国)

AI总结 本文提出前缀自适应块扩散模型(PA-BDM),通过改进块内去噪和缓存机制,提升文档识别的效率和准确性。

Comments 17 pages, 6 figures

详情
AI中文摘要

块扩散模型(BDMs)支持并行生成、灵活长度输出和KV缓存,使其在高效文档解析中具有潜力。然而,现有BDMs将去噪和缓存提交绑定到固定的块边界:块内去噪时并行性缩小,而生成的token无法缓存直到整个块完成。此外,块内双向去噪与块间自回归冲突,导致信息流不一致,可能给结构敏感的识别带来困难。我们提出前缀自适应块扩散模型(PA-BDM),用从前缀到后缀的因果去噪替代块内双向去噪,并将块大小视为最大候选范围而非固定提交单位。PA-BDM使用置信度门控结构损失(CSL)在扩展训练到更长延续之前构建低熵前缀。在推理过程中,逐步前缀提交(PPC)则动态地将最长可靠的前缀提交到KV缓存,并从更新的前缀重置下一个候选范围,每一步都恢复大的并行解码空间。实验表明,3B PA-BDM在多个基准上实现了更高的识别得分,并且相比2.5B的MinerU-Diffusion将推理吞吐量提高了71.6%。

英文摘要

Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, making them promising for efficient document parsing. However, existing BDMs bind denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, while generated tokens cannot be cached until the whole block is completed. Moreover, intra-block bidirectional denoising conflicts with inter-block autoregression, creating inconsistent information flow that can challenge structure-sensitive recognition. We propose the Prefix-Adaptive Block Diffusion Model (PA-BDM), which replaces intra-block bidirectional denoising with causal denoising from prefix to suffix and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before extending training to longer continuations. During inference, Progressive Prefix Commitment (PPC) then dynamically commits the longest reliable prefix into the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at each step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6% over the 2.5B MinerU-Diffusion.
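
The commitment step described in this abstract, committing the longest reliable prefix and restarting from it, can be sketched as follows. This is a guess at the general shape only: the abstract does not say how reliability is decided, so the fixed confidence threshold, the "commit at least one token" rule, and the function names are all assumptions made for illustration.

```python
def longest_reliable_prefix(confidences, threshold=0.9):
    """Count leading tokens whose confidence is at least `threshold`.
    The threshold rule is an assumption, not taken from the paper."""
    count = 0
    for c in confidences:
        if c < threshold:
            break
        count += 1
    return count

def commit_step(cache, candidates, confidences, threshold=0.9):
    """Append the longest reliable prefix of `candidates` to `cache`.
    At least one token is committed so decoding always advances (assumption)."""
    k = max(longest_reliable_prefix(confidences, threshold), 1)
    return cache + candidates[:k], k

# One decoding step over a candidate range of four tokens (toy values).
cache, committed = commit_step(
    ["<doc>"], ["Tab", "le", "1", ":"], [0.99, 0.95, 0.40, 0.97]
)
```

Here the third token falls below the threshold, so only the first two candidates are committed and the next candidate range would start after them, even though the fourth token was confident.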

2605.16859 2026-05-19 cs.CV cs.AI 版本更新

VGGT-CD: Training-Free Robust Registration for 3D Change Detection

VGGT-CD:面向三维变化检测的免训练鲁棒配准

Wei Zhang, Songhua Li, Yihang Wu, Qiang Li, Qi Wang

发表机构 * Northwestern Polytechnical University(西北工业大学)

AI总结 本文提出VGGT-CD方法,通过将跨时间配准与动态变化干扰解耦,实现面向三维变化检测的免训练鲁棒配准,有效减少轨迹误差并提升配准速度。

Comments 13 pages, 5 figures. Code is available at: https://github.com/WZ-CS/VGGT-CD

详情
AI中文摘要

从多视角图像进行三维变化检测对于城市监控、灾难评估和自动驾驶至关重要。然而,现有方法大多在2D领域操作,其中视角变化被误认为物理变化且深度不可用。虽然视觉几何基础模型如VGGT能够从无位姿图像快速生成密集点云,但逐时相独立重建面临根本性障碍:不可预测的时相间尺度歧义、场景变化破坏对齐的配准-变化悖论,以及普遍存在的边缘飞点噪声。为了解决这些挑战,我们提出了VGGT-CD,一种免训练的流水线,将跨时间配准与动态变化干扰解耦。在粗阶段,稀疏关键帧联合推断建立统一的度量空间并产生初始Sim(3)先验。在细阶段,密集重建通过隔离静态背景对应关系进行净化。闭式质心对齐在锁定尺度和旋转的同时优化平移,并通过残差自检在数学上保证结果不劣化。在World Across Time数据集的11场景基准上评估,VGGT-CD在户外将绝对轨迹误差减少了44%,在室内减少了59%。其配准速度提升6倍以上,生成高纯度的3D变化地图,无需任务特定训练。

英文摘要

3D change detection from multi-view images is essential for urban monitoring, disaster assessment, and autonomous driving. However, existing methods predominantly operate in the 2D domain, where viewpoint variations are mistaken for physical changes and depth is unavailable. While visual geometry foundation models like VGGT rapidly produce dense point clouds from unposed images, independent per-epoch reconstruction encounters fundamental obstacles: unpredictable inter-epoch scale ambiguity, registration-change paradox where scene changes corrupt alignment, and pervasive edge-flying noise. To address these challenges, we present VGGT-CD, a training-free pipeline decoupling cross-temporal registration from dynamic-change interference. In the Coarse Stage, sparse keyframe joint inference establishes a unified metric space and yields an initial Sim(3) prior. In the Fine Stage, dense reconstructions are purified by isolating static-background correspondences. A closed-form centroid alignment refines the translation while locking scale and rotation, using a residual self-check to mathematically guarantee non-degradation. Evaluated on an 11-scene benchmark from the World Across Time dataset, VGGT-CD reduces Absolute Trajectory Error by 44% outdoors and 59% indoors. It completes registration over 6 times faster, producing high-purity 3D change maps without task-specific training.
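
The fine-stage idea in this abstract, a closed-form centroid alignment that refines only the translation and is kept only if a residual check passes, can be illustrated with a small sketch. It is not the VGGT-CD code: it assumes the coarse Sim(3) has already been applied to the source points, that static-background correspondences are given as paired lists, and that the self-check compares mean point-to-point distance before and after the shift.

```python
def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def mean_residual(src, dst):
    """Mean Euclidean distance between paired points."""
    total = 0.0
    for p, q in zip(src, dst):
        total += sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return total / len(src)

def refine_translation(src, dst):
    """Shift `src` so its centroid matches that of `dst` (scale and rotation
    stay fixed). The shift is kept only if it lowers the mean residual;
    otherwise the zero shift is returned (assumed form of the self-check)."""
    cs, cd = centroid(src), centroid(dst)
    shift = tuple(d - s for s, d in zip(cs, cd))
    moved = [tuple(a + t for a, t in zip(p, shift)) for p in src]
    before, after = mean_residual(src, dst), mean_residual(moved, dst)
    if after < before:
        return shift, after
    return (0.0, 0.0, 0.0), before

# Toy correspondences: dst is src translated by (0.5, -0.2, 0.1).
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
dst = [(0.5, -0.2, 0.1), (1.5, -0.2, 0.1), (0.5, 0.8, 0.1)]
shift, residual = refine_translation(src, dst)
```

Because the candidate shift is accepted only when the residual drops, the refined alignment in this sketch can never be worse than the coarse one under the chosen residual, which is the non-degradation property the abstract refers to.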

2605.16848 2026-05-19 cs.CV cs.AI cs.CL cs.LG 版本更新

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

基于模式的思考:通过模式诱导突破视觉规划中的感知瓶颈

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

发表机构 * State Key Lab of CAD&CG(CAD&CG国家重点实验室)

AI总结 本文提出通过模式诱导的方法,利用模式推理和模式诱导策略,使视觉语言模型在视觉规划任务中实现更高效和准确的感知与推理,解决传统模型在复杂输入下的感知瓶颈问题。

详情
AI中文摘要

从原始视觉输入进行规划仍然对当前的视觉-语言模型(VLMs)构成重大挑战,当输入复杂度超出其一步感知能力时。受最近在图像思考(TWI)中的进展启发,一种合理的解决方案是通过迭代获取和整合局部视觉证据,将感知过程分解为更简单的步骤。然而,尽管当前VLMs在一般TWI能力上训练良好,但其在规划领域中的感知瓶颈仍然存在。为解决这一挑战,我们将TWI视为一种工具,逐步构建并反映一个准确的内部世界模型。我们发现,由此产生的无训练规划策略使VLMs能够解决远超其初始能力的任务,但代价是过多的TWI操作会显著增加计算开销。为进一步提高效率,我们提出模式推理,一种新的TWI策略,使VLMs能够主动识别新任务中的已知视觉模式并直接推断局部世界模型结构。为了获得这些模式,我们提出模式诱导,一种在线归纳学习策略,将视觉模式视为复合且可重用的专家,这些专家是自主从经验中发现和优化的。在FrozenLake、Crafter和CubeBench领域中的实验评估表明,我们的方法在准确性和效率之间实现了良好的平衡。

英文摘要

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs) when the complexity of the input exceeds their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks far beyond their initial capabilities, but at the cost of many TWI operations, which significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in the FrozenLake, Crafter, and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

2605.16834 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

基于有限数据的细粒度多模态对齐的相对表示学习

Shiwon Kim, Yu Rang Park

发表机构 * Yonsei University(延世大学)

AI总结 本文提出了一种基于相对表示的学习方法,用于在有限数据条件下实现细粒度多模态对齐,通过学习token级别的跨模态结构来提升零样本分类、跨模态检索和零样本分割任务的性能。

详情
AI中文摘要

多模态预训练展示了强大的泛化性能,但在缺乏配对数据的领域中,这种范式往往难以实施。一种有前景的替代方法是事后多模态对齐,它通过有限数量的配对示例分别对预训练的单模态编码器进行对齐。然而,现有方法主要关注全局表示的对齐,忽略了片段-token关系。这可能阻碍了需要细粒度跨模态匹配的任务的迁移,超越粗粒度样本层面的语义。为了解决这个问题,我们提出了一种事后对齐方法,通过相对表示学习token级别的跨模态结构。具体来说,我们通过图像和文本与每种模态空间中一组可学习锚点的token级相似性来表示它们,这些锚点被训练以诱导一致的跨模态相似性模式,以匹配对。尽管仅学习锚点而没有重大的投影层,我们的方法在零样本分类、跨模态检索和零样本分割任务中均显著优于现有方法。这突显了在有限配对数据下,建模细粒度跨模态结构对于有效事后多模态对齐的重要性。

英文摘要

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.
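
A relative representation, as used in this abstract, describes each token by its similarities to a set of anchors instead of by its raw coordinates. The sketch below shows that encoding for one modality; cosine similarity and the fixed toy anchors are assumptions (in the paper the anchors are learnable and the abstract does not name the similarity function).

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(tokens, anchors):
    """Encode each token as its vector of similarities to the anchors."""
    return [[cosine(t, a) for a in anchors] for t in tokens]

# Image tokens live in a 3-d space, text tokens in a 2-d space (toy values).
image_tokens = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
image_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_tokens = [[3.0, 0.0]]
text_anchors = [[1.0, 0.0], [0.0, 1.0]]

image_rel = relative_representation(image_tokens, image_anchors)
text_rel = relative_representation(text_tokens, text_anchors)
```

Both modalities end up as vectors with one entry per anchor, so tokens from encoders with different embedding sizes become directly comparable once the anchors in the two spaces are trained to correspond.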

2605.16832 2026-05-19 cs.CV 版本更新

Coarse Semantic Injection for LLM-Conditioned Structured Indoor Prediction

粗粒度语义注入用于LLM条件的结构室内预测

Shuliang Zhu, Tomiwa Adey, Jinjia Zhou

AI总结 本文提出了一种接口保持的语义增强方法,用于LLM条件的结构解码,通过将语义证据与点云表示关联,将其编码为RGBB点接口,以提升结构室内预测的精度,特别是在拥挤场景中的开口(门窗)定位和单个家具实例检测。

详情
AI中文摘要

大型语言模型(LLMs)最近被用作结构解码器,用于从3D点云输入中进行室内理解。然而,点云编码器在体素化和稀疏池化后,往往低估了如门和窗等细长结构元素,并可能在拥挤场景中遗漏单个家具实例。我们提出了一种接口保持的语义增强方法,用于LLM条件的结构解码。关键思想是将语义证据与点云表示关联,将其缩减为粗粒度四组代码(家具、墙壁、开口和其他),并将其编码为RGBB点接口:红色表示家具,绿色表示墙壁,蓝色表示开口,黑色表示其他,其中RGBB表示在三个RGB通道中表示四种语义颜色状态,而不是额外的第四通道。该语义颜色代码在标记化之前附加到原始点属性之后,因此几何和语义共享相同的稀疏标记化路径,同时下游语言模型解码器和输出序列化保持不变。我们进一步引入了一个轻量级的路由语义位移模块,其辅助头仅用于训练时的比率/预算正则化和分析,以在稀疏池化后加强语义线索。整体流程可以使用RGB衍生的语义证据。在这些受控的语义源设置下,报告的指标在Structured3D、SpatialLM数据集和ARKitScenes上均有所提升,尤其是在拥挤场景中的开口定位和单个家具检测。消融实验澄清了语义源、颜色编码、标记融合和位移注入的作用,同时显示颜色/熵效应仍然非平凡。

英文摘要

Large language models (LLMs) have recently been used as structured decoders for indoor understanding from 3D point-token inputs. However, point cloud encoders often under-represent thin structural elements such as doors and windows after voxelization and sparse pooling, and may miss individual furniture instances in cluttered scenes. We propose an interface-preserving semantic augmentation for LLM-conditioned structured decoding. The key idea is to associate semantic evidence with the point-cloud representation, reduce it to a coarse four-group code (furniture, walls, openings, and others), and encode it as an RGBB point interface: red for furniture, green for walls, blue for openings, and black for others, where RGBB denotes four semantic color states represented in three RGB channels rather than an additional fourth channel. This semantic color code is appended to the original raw point attributes before tokenization, so geometry and semantics share the same sparse tokenization path while the downstream language model decoder and output serialization remain unchanged. We further introduce a lightweight routed semantic shift module, with an auxiliary head used only for training-time ratio/budget regularization and analysis, to strengthen semantic cues after sparse pooling. The overall pipeline can use RGB-derived semantic evidence. Under these controlled semantic-source settings, the reported metrics improve across Structured3D, the SpatialLM dataset, and ARKitScenes, especially for opening localization and per-instance furniture detection in cluttered scenes. Ablations clarify the roles of semantic source, color coding, token fusion, and shift injection, while also showing that color/entropy effects remain nontrivial.
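
The RGBB interface described in this abstract maps four coarse semantic groups onto three color channels and appends that color to each point's raw attributes. A minimal sketch of the mapping follows; the group-to-color assignment comes from the abstract, while the 0-1 color scale, the tuple layout, and the function name are assumptions.

```python
# Group-to-color assignment as stated in the abstract:
# red = furniture, green = walls, blue = openings, black = others.
SEMANTIC_COLORS = {
    "furniture": (1.0, 0.0, 0.0),
    "walls": (0.0, 1.0, 0.0),
    "openings": (0.0, 0.0, 1.0),
    "others": (0.0, 0.0, 0.0),
}

def append_semantic_color(point_attributes, group):
    """Return the raw point attributes followed by the 3-channel semantic color.
    Four semantic states fit in three channels, so no fourth channel is added."""
    return tuple(point_attributes) + SEMANTIC_COLORS[group]

# A point with xyz attributes labeled as part of a door or window.
door_point = append_semantic_color((1.0, 2.0, 3.0), "openings")
misc_point = append_semantic_color((0.5, 0.5, 0.5), "others")
```

Since the code only widens each point's attribute vector, the same sparse tokenization path can consume geometry and semantics together, which matches the abstract's claim that the decoder and output serialization stay unchanged.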

2605.16818 2026-05-19 cs.CV cs.AI 版本更新

Observation-Aligned Mask Priors for Learning Physical Dynamics from Authentic Occlusions

利用观测对齐的掩码先验从真实遮挡中学习物理动态

Chiyuan Ma, Zihan Zhou, Tianshu Yu

发表机构 * School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院)

AI总结 本文提出了一种基于观测对齐的遮罩先验方法,通过学习真实的遮罩分布来构建上下文-查询分区,从而在不完整数据上训练物理动态学习。该方法利用贝叶斯流网络预训练二进制遮罩,结合全局归一化交叉熵目标生成与稀疏观测对齐的样本特定遮罩,从而避免零查询死区和局部生成崩溃。

详情
AI中文摘要

直接从不完整观测中学习物理动态具有挑战性,因为真实的遮罩是结构化的、样本依赖的,并且常常不是随机缺失的,而现有方法通常依赖启发式遮罩规则或预定义的遮罩分布。我们提出Observation-Aligned Mask Priors框架,该框架学习真实的观测遮罩分布,并利用其构建上下文-查询分区以从不完整数据中训练。具体来说,我们先在二进制观测遮罩上预训练一个贝叶斯流网络(BFN)以捕捉真实的遮罩拓扑结构,然后通过全局归一化交叉熵目标引导BFN采样,生成与每个稀疏观测对齐的样本特定遮罩。遮罩与观测遮罩的交集定义为上下文,剩余的观测条目成为扩散模型的查询目标。我们证明,这种基于交集的分区使每个有效的观测维度都有严格正的概率被查询,防止零查询死区和局部生成崩溃。在三个具有真实卫星遮罩的现实世界海洋学数据集上,跨分辨率至256×256的实验显示,在MSE和PSNR上优于强扩散基线的一致改进。这些结果表明,从真实遮罩中学习遮罩先验是学习不完整物理观测的有效替代方法,无需访问完全观测的场数据。

英文摘要

Learning physical dynamics directly from incomplete observations is challenging because authentic occlusions are structured, sample-dependent, and often missing not at random, whereas existing methods typically rely on heuristic masking rules or predefined mask distributions. We propose Observation-Aligned Mask Priors, a framework that learns the distribution of authentic observation masks and uses it to construct context-query partitions for training from incomplete data. Specifically, we pretrain a Bayesian Flow Network (BFN) on binary observation masks to capture real occlusion topologies, then guide BFN sampling with a globally normalized cross-entropy objective to generate sample-specific masks aligned with each sparse observation. The intersection between the guided mask and the observed mask defines the context, and the remaining observed entries become query targets for a diffusion-based reconstruction model. We show that this intersection-based partitioning gives every valid observed dimension a strictly positive probability of being queried, preventing zero-query dead zones and local generative collapse. Experiments on three real-world oceanographic datasets with authentic satellite occlusions, across resolutions up to 256×256, show consistent improvements over strong diffusion baselines in MSE and PSNR. These results demonstrate that learning mask priors from authentic occlusions is an effective alternative to heuristic masking for learning from incomplete physical observations without access to fully observed fields.
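
The context-query partition described in this abstract follows directly from two binary masks. The sketch below assumes flattened masks stored as lists of booleans, with True meaning "observed" or "kept by the guided mask"; the guided mask would come from the BFN sampler, which is not modeled here.

```python
def partition(observed, guided):
    """Split observed entries into context and query sets.
    context = guided AND observed; query = observed entries not in the context."""
    context = [o and g for o, g in zip(observed, guided)]
    query = [o and not g for o, g in zip(observed, guided)]
    return context, query

# Toy flattened masks over five grid cells.
observed = [True, True, False, True, False]
guided = [True, False, True, False, False]
context, query = partition(observed, guided)
```

Every observed cell lands in exactly one of the two sets and unobserved cells land in neither, so the reconstruction model is never asked to predict a value for which no ground truth exists.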

2605.16817 2026-05-19 eess.IV cs.CV 版本更新

Adaptive Fused Prior Transfer for Controllable Generative Image Compression

自适应融合先验传输用于可控生成图像压缩

Yifei Pei, Ying Liu, Nam Ling

发表机构 * Department of Computer Science and Engineering, Santa Clara University(计算机科学与工程系,圣克拉拉大学)

AI总结 本文提出了一种可控生成图像压缩方法AFP-GIC,通过自适应融合先验传输实现解码端的高质量重建,减少了延迟并降低了参数数量,同时在低比特率下提升了视觉质量。

Comments 19 pages, 10 figures. This work has been submitted to IEEE Access for possible publication. Code is available at https://github.com/yifeipet/AFP_GIC

详情
AI中文摘要

已学习的图像压缩在速率-失真性能上取得了竞争性成果,但极低比特率的重建仍然困难,因为传输的表示通常无法保持精细纹理和局部结构。感知和生成编码器通过使用学习的重建先验来解决这个问题,可控编码器允许一个模型覆盖不同的比特率和重建偏好。然而,可控性本身并不能解决解码端的重建先验问题:在严苛的比特约束下,解码器必须从有限的传输信息中推断缺失的细节,而现有的基于代码书的可控设计通常依赖于单一代码书的token-based先验。本文提出了自适应融合先验传输用于可控生成图像压缩(AFP-GIC),一种可控编码器,将自适应融合的先验从冻结的预训练AdaCode模型中传输。编码端的融合先验特征指导潜在空间的形成,而解码器从压缩的表示和选定的控制变量中预测兼容的融合先验,从而在不传输融合先验本身的情况下实现先验引导的重建。一个激励性分析将解码端的融合先验对齐与重建误差上界相关联,并表明融合先验族包含单个代码书选择作为特殊情况。在统一的基准测试中,AFP-GIC将解码器延迟减少了18.1%,整体参数数量减少了3110万(20.5%)相对于DC-VIC。在Kodak、CLIC2020和DIV2K上的实验显示了具有竞争力的PSNR,最清晰的感知增益出现在NIQE分数和极低比特率的视觉比较中。

英文摘要

Learned image compression has achieved competitive rate-distortion performance, but very-low-bitrate reconstruction remains difficult because the transmitted representation often cannot preserve fine textures and local structures. Perceptual and generative codecs address this problem by using learned reconstruction priors, and controllable codecs allow one model to cover different bitrate and reconstruction preferences. However, controllability alone does not resolve the decoder-side reconstruction-prior problem: under severe bit constraints, the decoder must infer missing details from limited transmitted information, while existing codebook-based controllable designs generally rely on single-codebook token-based priors. This paper proposes Adaptive Fused Prior Transfer for Controllable Generative Image Compression (AFP-GIC), a controllable codec that transfers an adaptive fused prior from a frozen pretrained AdaCode model. Encoder-side fused-prior features guide latent formation, while the decoder predicts a compatible fused prior from the compressed representation and selected control variables, enabling prior-guided reconstruction without transmitting the fused prior itself. A motivating analysis relates decoder-side fused-prior alignment to a reconstruction-error upper bound and shows that the fused-prior family contains single-codebook choices as special cases. Under the unified benchmark, AFP-GIC reduces decoder latency by 18.1% and the overall parameter count by 31.10 million (20.5%) relative to DC-VIC. Experiments on Kodak, CLIC2020, and DIV2K show competitive PSNR, with the clearest perceptual gains in NIQE scores and very-low-bitrate visual comparisons.

2605.16810 2026-05-19 cs.CV 版本更新

Training-Free Occluded Text Rendering via Glyph Priors and Attention-Guided Semantic Blending

无需训练的遮挡文本渲染:通过字形先验和注意力引导的语义融合

Jingqi Hou, Hongtian Wang

发表机构 * College of Computer Science, Beijing University of Technology, Beijing, China(北京理工大学计算机学院)

AI总结 本文提出一种无需训练的遮挡文本渲染框架,通过预训练的FLUX.1-dev模型,解决文本生成中遮挡物位置和文本结构稳定性问题,采用双流推理和字形先验稳定文本结构,提升文本可读性和遮挡对齐效果。

Comments 9 pages, 3 figures, 3 tables

详情
AI中文摘要

我们提出一种无需训练的遮挡文本渲染框架,使用预训练的FLUX.1-dev主干网络。该任务要求模型生成可识别的字体并放置遮挡物在预期文本区域。现有文本到图像生成器在这一设置中仍然具有挑战性:遮挡物往往远离文本,而文本可能被扭曲或漂浮在遮挡物之上。为了解决这个问题,我们提出了一个重启双流推理框架,将文本布局保持与遮挡物插入解耦。基流提供干净的字形参考和相同步骤的键/值(K/V)特征,而编辑流则基于遮挡提示进行条件化。我们进一步采用来自FreeText的光谱字形先验思想,并将其适应于早期到中期去噪过程中稳定目标文本结构。在推理过程中,我们的方法局部化目标文本,从令牌条件化的注意力和字形支持中估计文本带区域,并推导出一个锚点感知的硬融合掩码用于遮挡物。在最终的编辑过程中,生成从相同的初始噪声开始,并在选定的注意力站点应用硬掩码引导的图像-令牌K/V替换,保持基流布局在掩码外,同时在掩码内注入来自编辑流的遮挡物外观。在代表性遮挡文本场景的实验中,显著提高了文本可读性,并在遮挡对齐方面具有竞争力,从而在不进行模型微调的情况下实现了更稳定的物体-文本组合。

英文摘要

We present a training-free framework for occluded text rendering with a pretrained FLUX.1-dev backbone. The task requires a model to render recognizable typography and place an occluding object over the intended text region. This setting remains difficult for existing text-to-image generators: the occluder often drifts away from the text, while the text may be distorted or appear to float on top of the occluding object. To address this problem, we propose a restarted dual-stream inference framework that decouples text-layout preservation from occluder insertion. A Base Stream provides a clean typographic reference and same-step key/value (K/V) features, while the Edit Stream is conditioned on the occlusion prompt. We further adopt the spectral glyph-prior idea from FreeText and adapt it to stabilize the target text structure during early-to-mid denoising. In the reasoning pass, our method localizes the target text, estimates a text-band region from token-conditioned attention and glyph support, and derives an anchor-aware hard fusion mask for the occluder. In the final edit pass, generation restarts from the same initial noise and applies hard mask-guided image-token K/V replacement at selected attention sites, preserving the Base layout outside the mask while injecting the occluder appearance from the Edit Stream inside the mask. Experiments on representative occluded text scenarios demonstrate substantially improved text readability and competitive occlusion alignment, yielding more stable object-on-text compositions without any model fine-tuning.

2605.16807 2026-05-19 cs.CV 版本更新

DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion

DecoRec: 通过物体级扩散进行单视图图像的分解3D场景重建

Yuhan Ping, Yuan Liu, Xiaoxiao Long, Peng Wang, Junhui Hou, Jianyi Zheng, Jia Pan, Xin Li, Cheng Lin

发表机构 * Department of Computer Science, the University of Hong Kong(香港大学计算机科学系) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) Faculty of Humanities and Arts, Macau University of Science and Technology(澳门科技大学人文艺术学院) Department of Computer Science and Engineering, Texas A&M University(德克萨斯A&M大学计算机科学与工程系) Department of Computer Science and Engineering, Macau University of Science and Technology(澳门科技大学计算机科学与工程系)

AI总结 本文提出DecoRec系统,通过物体级扩散方法实现单视图图像的分解3D场景重建,解决了现有方法在场景重建中出现的精度问题,并通过可微渲染和扩散引导细化技术提升重建效果。

详情
AI中文摘要

在本文中,我们介绍了DecoRec,一种新的系统,旨在将单视图2D图像提升为分解的3D场景网格。当前单视图场景重建方法通常依赖于物体检索或粗粒度3D体素或表面的回归,导致无法准确捕捉输入图像的外观和几何结构。缺乏高质量的大规模场景级数据集进一步加剧了从单视图图像直接生成3D场景的难度。为了实现高质量的3D场景生成,DecoRec利用最近的基于扩散的单视图物体重建方法,分别重建单个物体。随后提出一个细化流程,通过可微渲染技术和扩散引导细化技术有效地将这些重建的物体合并,提升外观和几何结构。我们的结果表明,DecoRec在几何和新合成方面实现了高质量的单视图场景重建,为下游应用如房间内部设计提供了显著的便利。

英文摘要

In this paper, we introduce DecoRec, a novel system designed to elevate single-view 2D images to a decomposed 3D scene mesh. Current methods for single-view scene reconstruction typically rely on object retrieval or the regression of coarse 3D voxels or surfaces, leading to inaccuracies in capturing the appearance and geometry of the input image. The lack of high-quality large-scale scene-level datasets further complicates direct 3D scene generation from single-view images. To achieve high-quality 3D scene generation from a single-view image, DecoRec takes advantage of recent diffusion-based single-view object reconstruction methods to reconstruct individual objects separately. Subsequently, a refinement pipeline is proposed to effectively merge these reconstructed objects, enhancing appearance and geometry through a differentiable rendering technique and diffusion-guided refinement. Our results demonstrate that DecoRec facilitates high-quality single-view scene reconstruction in both geometry and novel synthesis, offering significant benefits for downstream applications like room interior design.

2605.16806 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning

跨模态亲和对齐的多模态学习分析用于预测基于游戏的学习中学生协作满意度

Wen-Hsin Tsai, Chia-Ming Lee, Yuk-Ying Tung

发表机构 * Institute of Education, National Cheng Kung University(国立成功大学教育研究所) Institute of Intelligent System, National Yang Ming Chiao Tung University(国立阳明交通大学智能系统研究所) Department of Computer Science, University at Albany, State University of New York(纽约州立大学奥尔巴尼分校计算机科学系)

AI总结 本文提出了一种跨模态亲和对齐的多模态学习分析框架,通过建模模态间关系和对比学习来增强学生协作满意度预测的鲁棒性和可解释性。

Comments Accetped by CVPR 2026 CVxEdu Workshop

详情
AI中文摘要

协作式基于游戏的学习环境为小组知识构建提供了丰富的机遇,但自动预测学生协作满意度仍具挑战性。关键障碍是模态退化:在教育部署中,个体模态如眼动在学生群体间表现出不一致的信息量,导致基于隐式注意力的融合产生脆弱的多模态表示。我们提出了亲和对齐多模态学习分析(AAMLA)框架,其核心贡献是跨模态亲和引导的模态对齐(CAMA)模块,该模块通过亲和矩阵显式建模模态间关系,并通过对比学习强制跨模态一致性,从而实现对无信息模态的自适应抑制而不丢弃它们。AAMLA进一步应用模态特定的投影层,将异构特征,包括面部动作单元、头部姿态、眼动和交互痕迹日志,映射到统一的语义空间,然后再进行对齐。在EcoJourneys协作学习环境中的50名中学生实验表明,在标准和模态退化条件下,AAMLA在单模态基线和先前跨注意力方法上均表现出一致的改进,SHAP和t-SNE分析证实CAMA能够产生稳健且可解释的跨模态表示,用于学生协作建模。

英文摘要

Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.

2605.16805 2026-05-19 cs.CV 版本更新

NeuroLiDAR: Adaptive Frame Rate Depth Sensing via Neuromorphic Event-LiDAR Fusion

NeuroLiDAR: 通过神经形态事件-LiDAR融合实现自适应帧率深度感知

Darshana Rathnayake, Dulanga Weerakoon, Meera Radhakrishnan, Archan Misra

发表机构 * Singapore Management University(新加坡管理大学) Singapore-MIT Alliance for Research and Technology Centre(新加坡-麻省理工联合研究和技术中心) University of Technology Sydney(悉尼科技大学)

AI总结 本文提出NeuroLiDAR,通过融合稀疏LiDAR数据和密集的神经形态事件相机数据,实现了高达约66Hz的自适应帧率深度感知,减少了29%的深度重建误差。

Comments Accepted to ICRA 2026

详情
AI中文摘要

LiDARs被广泛用于3D深度重建,但其性能常受到固有硬件限制的制约,这些限制在范围、空间分辨率和帧率之间产生权衡。许多LiDAR系统通常以低帧率(例如5-10Hz)运行,优先考虑远距离传感而不是对快速场景变化的响应。我们提出了NeuroLiDAR,一种能够实现高达约66Hz有效帧率的自适应深度感知框架,通过融合时间稀疏的LiDAR数据与时间密集的神经形态事件相机数据。NeuroLiDAR集成了两个组件:基于事件的关键帧检测和基于事件的深度外推,以动态调整感知速率以响应场景动态。为了评估我们的方法,我们引入了ELiDAR数据集,涵盖了户外和室内场景,并展示了NeuroLiDAR在RMSE中将深度重建误差减少了约29%,同时实现了27.8-47.3Hz的自适应帧率。我们的代码和数据集可在https://github.com/darshanakgr/neurolidar上获得。

英文摘要

LiDARs are widely used for 3D depth reconstruction, but their performance is often limited by inherent hardware constraints that impose trade-offs between range, spatial resolution, and frame rate. Many LiDAR systems operate at low frame rates (e.g., 5-10 Hz), prioritizing long-range sensing over responsiveness to rapid scene changes. We present NeuroLiDAR, an adaptive depth sensing framework that achieves effective frame rates of up to $\approx$66 Hz by fusing temporally sparse LiDAR data with temporally dense inputs from neuromorphic event cameras. NeuroLiDAR integrates two components, event-based keyframe detection and event-guided depth extrapolation, to dynamically adjust the sensing rate in response to scene dynamics. To evaluate our approach, we introduce ELiDAR, a dataset spanning outdoor and indoor scenarios, and show that NeuroLiDAR reduces depth reconstruction error by $\approx$29\% in RMSE while achieving adaptive frame rates between 27.8 and 47.3 Hz. Our code and dataset are available at https://github.com/darshanakgr/neurolidar.
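One simple way to picture event-based keyframe detection is to accumulate event counts over short time windows and request a new depth keyframe once enough scene change has built up. The sketch below assumes exactly that rule; the function name, the per-window counts, and the threshold are hypothetical and are not taken from the NeuroLiDAR code.

```python
def select_keyframes(event_counts, threshold):
    """Return indices of time windows at which a new depth keyframe is requested.

    event_counts: number of events observed in each consecutive time window.
    threshold: accumulated event count that triggers a keyframe (assumed rule).
    """
    keyframes = []
    accumulated = 0
    for index, count in enumerate(event_counts):
        accumulated += count
        if accumulated >= threshold:
            keyframes.append(index)
            accumulated = 0  # start accumulating again after each keyframe
    return keyframes
```

Quiet windows add little to the running total, so keyframes become sparse when the scene is static and dense when it changes quickly, which is the adaptive-rate behavior the abstract describes.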

2605.16797 2026-05-19 cs.CV cs.RO 版本更新

EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

EgoKit: 迈向基于异构设备的统一低成本第一人称视角数据采集

Liuchuan Yu, Erdem Murat, Beichen Wang, Yan Zeng, Tingting Luo, Huizhen Zhou, Shanghao Li, Huining Feng, Zhigen Zhao, Ning Yang, Ke Jing, Yunhao Liu, Ruoya Sheng

发表机构 * George Mason University(乔治·梅森大学) ByteDance(字节跳动)

AI总结 本文提出EgoKit,一种统一六种异构设备的第一人称视角数据采集工具包,解决了不同设备间SDK差异和数据采集不一致的问题,同时提供统一的日志格式和手部追踪数据。

详情
AI中文摘要

第一人称视角视频越来越多地被用作机器人学习、活动理解及具身AI研究的数据源,但大规模采集在实践中仍然碎片化:每个候选主机设备,如Android手机、iPhone、iPad、智能眼镜或扩展现实(XR)头戴设备,都暴露了不同的SDK,对原始摄像机访问有不同的政策,以及对外部USB摄像机和设备内跟踪有不同的限制。因此,同步的第一人称视角和腕部视角采集通常要么依赖单一专有平台,要么依靠无法跨设备迁移的一次性定制装置。为了解决这一差距,我们提出了EgoKit,一种工具包,它在六个异构主机设备上暴露相同的第一人称视角录制流程。在所有支持的设备上,EgoKit提供相同的录制交互,并产生本地存储的视频,具有统一的日志格式;在XR头戴设备上,它还记录头部姿态和符合OpenXR标准的26关节手部追踪,与视频流对齐。配套的配件,包括两个带有支架的腕部摄像机、一个头带和一个USB-C集线器,使任何支持的主机都能添加腕部视角捕获,而无需定制硬件制造。EgoKit可在https://egokit.chuange.org/上获得。

英文摘要

Egocentric video is increasingly used as a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale remains fragmented in practice: each candidate host device, such as an Android phone, iPhone, iPad, smart glasses, or extended reality (XR) headset, exposes a different SDK, a different policy on raw camera access, and different limitations on external USB cameras and on-device tracking. Synchronized ego-view and wrist-view capture is therefore typically obtained by either committing to a single proprietary platform or building one-off rigs that do not transfer across devices. To address this gap, we present EgoKit, a toolkit that exposes the same egocentric recording workflow across six heterogeneous host devices. Across all supported devices, EgoKit presents the same recording interaction and produces locally stored video with a uniform log format; on XR headsets, it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication. EgoKit is available at https://egokit.chuange.org/.

2605.16796 2026-05-19 cs.CR cs.CV 版本更新

Watermarks Attack Watermarks: Re-Watermarking as a Generic Removal Strategy

水印攻击水印:重水印作为通用移除策略

Maria Bulychev, Neil G. Marchant, Benjamin I. P. Rubinstein

发表机构 * University of Melbourne(墨尔本大学)

AI总结 本文提出了一种新的攻击方法,通过重水印来移除现有水印,展示了其在水印检测和移除中的有效性,为水印安全性和抗攻击性提供了新的视角。

Comments 9 pages, 6 figures

详情
AI中文摘要

水印技术通过在输入图像中引入不可察觉的变化来触发检测器,以证明来源并保护知识产权。文献显示,对水印方案的攻击引起了极大的关注:攻击者显然有动机窃取受版权保护的内容或绕过立法规定的深度伪造保护。在本工作中,我们提出一个简单而强大的观察:对水印的攻击与水印本身一样,寻求对输入图像(现在已带有水印)的不可察觉变化以触发检测器。这种将水印攻击与水印本身进行类比的思路非常有启发性:水印可以用来攻击水印。我们的第一项贡献验证了这一假设。在涵盖96种数据集、受害者水印和攻击水印组合的严格实验中,我们展示了仅对已加水印的图像重新添加水印即可可靠地抑制原始信号,而无需梯度、代理模型或检测密钥。我们的第二项贡献是一个简单的分类器,用于检测给定图像中现有水印的存在和身份。令人惊讶的是,实验结果表明总体准确率高达0.878-0.953。这一结果作为独立的安全漏洞具有重要意义:研究表明,方法特定的攻击在移除方面比黑盒攻击更强大。综合来看,水印识别结合重水印可以将位准确率降低至少25%、最多48%。我们的工作构成了一种廉价、通用且高度有效的攻击流程,质疑了当前水印方案对如此简单的攻击的可靠性,以及现有复杂攻击的价值。

英文摘要

Watermarking introduces an imperceptible change to an input image that will trigger a detector, to assert provenance and protect intellectual property. The literature has shown great interest in attacks on watermarking schemes: attackers are clearly motivated to steal copyrighted material or circumvent legislated deepfake protections. In this work, we make a simple yet powerful observation: that such attacks on watermarking, like watermarks themselves, seek an imperceptible change to an input image (now already watermarked) that will trigger a detector. This analogy between watermark attacks and watermarking itself is highly suggestive: watermarks could be used to attack watermarks. Our first contribution validates this hypothesis. In rigorous experiments spanning 96 combinations of dataset, victim, and attack watermarks, we show that simply re-watermarking an already watermarked image reliably suppresses the original signal, without requiring gradients, surrogate models, or detection keys. Our second contribution is a simple classifier for detecting the presence and identity of an existing watermark in a given image. Surprisingly, experimental findings demonstrate overall accuracies of 0.878 to 0.953. This result is of independent interest as a security vulnerability: research shows that method-specific attacks achieve substantially stronger removal than black-box attacks. Taken together, watermark identification combined with re-watermarking reduces bit accuracy by at least 25% and up to 48%. Our work constitutes a cheap, generic, and highly effective attack pipeline, calling into question the reliability of current watermarking schemes against such a simple attack, as well as the value of existing sophisticated attacks.
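The headline numbers above are stated in terms of bit accuracy. As a reminder of what that metric measures, the sketch below computes the fraction of message bits that a decoder recovers correctly. The function name and the list-of-bits representation are assumptions made for illustration; the paper's own evaluation code may differ.

```python
def bit_accuracy(decoded_bits, embedded_bits):
    """Fraction of positions where the decoded message matches the embedded one."""
    if len(decoded_bits) != len(embedded_bits):
        raise ValueError("messages must have the same length")
    if not embedded_bits:
        raise ValueError("messages must be non-empty")
    matches = sum(1 for d, e in zip(decoded_bits, embedded_bits) if d == e)
    return matches / len(embedded_bits)
```

A value near 1.0 means the victim watermark is still readable, while a value near 0.5 is what random guessing would give for a binary message.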

2605.16795 2026-05-19 cs.CV cs.AI cs.GR 版本更新

3DPhysVideo: Consistency-Guided Flow SDE for Video Generation via 3D Scene Reconstruction and Physical Simulation

3DPhysVideo: 通过3D场景重建和物理模拟的一致性引导流SDE用于视频生成

Hwidong Kim, Yunho Kim, Tae-Kyun Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出了一种无需训练的管道,通过3D场景重建和物理模拟生成逼真视频,利用一致性引导流SDE分解预测速度以确保条件输入的一致性,从而在多物体和流体交互场景中实现从单张图像到物理合理视频的过渡。

Comments Project page: https://hwidong-kim.github.io/projects/3DPhysVideo

详情
AI中文摘要

视频生成模型取得了显著进展,但它们常常产生违背物理动态规律的视觉伪影。最近的工作如PhysGen3D通过网格重建和基于物理的渲染处理单张图像到3D物理,但在建模流体动力学、多物体交互和照片级真实感方面仍存在挑战。本文介绍了3DPhysVideo,一种新颖的无训练管道,能够从单张图像生成物理真实的视频。我们重新利用现成的视频模型进行两个阶段。首先,我们将其用作新的视图合成器,通过引导图像到视频(I2V)流模型使用渲染点云来重建完整的360度3D场景几何。其次,在应用物理求解器到此几何后,物理模拟的点云用于引导相同的I2V流模型以合成最终的高质量视频。一致性引导流SDE将I2V流模型预测的速度分解为去噪和一致性偏差,强制条件输入的一致性,使我们能够有效地重新利用模型进行3D重建和模拟引导的视频生成。在包括多物体和流体交互场景在内的多样化实验中,我们的方法成功地从单张图像过渡到物理合理的视频,同时在单个消费级GPU上运行高效。它在基于GPT的评分、VideoPhy基准和人类评估中优于最先进的基线。

英文摘要

Video generative models have made remarkable progress, yet they often yield visual artifacts that violate grounding in physical dynamics. Recent works such as PhysGen3D tackle single image-to-3D physics through mesh reconstruction and Physically-Based Rendering, but challenges remain in modeling fluid dynamics, multi-object interactions, and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf video model for two stages. First, we use it as a novel view synthesizer to reconstruct complete 360-degree 3D scene geometry by guiding the image-to-video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, the physically simulated point cloud is used to guide the same I2V flow model to synthesize final, high-quality videos. Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency to the conditional inputs, allowing us to effectively repurpose the model for both 3D reconstruction and simulation-guided video generation. In diverse experiments, including multi-object and fluid interaction scenes, our method successfully bridges the gap from single images to physically plausible videos, while remaining efficient to run on a single consumer GPU. It outperforms state-of-the-art baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation.

2605.16789 2026-05-19 cs.CV 版本更新

Accelerating Rectified Flow Models via Trajectory-Aware Caching

通过轨迹感知缓存加速修正流模型

Xiao Liu, Kai Liu, Naiyang Guan, Hongliang Lu, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Academy of Military Science(军事科学院) Huawei Technologies Ltd(华为技术有限公司)

AI总结 本文提出TACache框架,通过轨迹感知缓存技术,在无需训练的情况下加速生成高保真图像和视频,通过分解速度场并补偿误差,实现更高的采样速度。

Comments 22 pages, 14 figures

详情
AI中文摘要

扩散和修正流(RF)模型能够生成高保真的图像和视频,但其迭代的速度场评估计算成本高。现有的缓存方法通过跳过时间步来加速采样,但其粗略的近似会导致在长间隔跳过时积累误差并降低质量。我们提出TACache(轨迹感知缓存),一种无需训练的加速框架,遵循先跳过后补偿的范式。TACache将离散的速度加速度沿RF轨迹进行正交分解,将其分为平行分量和正交残差,隔离每一步近似误差的幅度和方向源。该框架分为两个阶段:离线阶段,累积变化阈值在幅度和方向指标上产生跳过计划并限制每个跳过间隔可延伸的距离;在线阶段,每个跳过的步骤中,离线统计数据与样本的历史正交方向结合,重建跳过的速度而无需额外的模型评估。在BAGEL、FLUX.1-dev和Wan2.1-1.3B上的实验表明,TACache在文本到图像生成中实现了高达4.14倍的速度提升,在文本到视频生成中实现了2.11倍的速度提升,并在所有参考保真度度量上优于先前的基于缓存的方法。代码将很快发布。

英文摘要

Diffusion and rectified flow (RF) models generate high-fidelity images and videos, but their iterative velocity-field evaluations are computationally expensive. Existing caching methods accelerate sampling by skipping timesteps, yet their coarse approximations introduce accumulated errors over long skip intervals and degrade quality under aggressive acceleration. We propose TACache (Trajectory-Aware Cache), a training-free acceleration framework following a skip-then-compensate paradigm. TACache performs an orthogonal decomposition of discrete velocity acceleration along the RF trajectory into a parallel component and an orthogonal residual, isolating the magnitude and directional sources of per-step approximation error. The framework operates in two stages: offline, cumulative variation thresholds on the magnitude and direction indicators yield the skip schedule and bound how far each skip interval may extend; online, at each skipped step the offline statistics are combined with the sample's historical orthogonal direction to reconstruct the skipped velocity without additional model evaluations. Experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B show that TACache achieves up to $4.14\times$ speedup on text-to-image generation and $2.11\times$ speedup on text-to-video generation, with consistent improvements over prior cache-based methods on all reference-based fidelity metrics. Code will be released soon.
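The orthogonal decomposition mentioned above can be sketched in a few lines. The sketch assumes the discrete velocity acceleration is the difference between two consecutive velocity predictions and that the reference direction is the previous velocity; both choices, and the function name, are illustrative assumptions rather than details confirmed by the abstract.

```python
def decompose_acceleration(v_prev, v_curr):
    """Split the discrete acceleration (v_curr - v_prev) into two parts.

    Returns (parallel, orthogonal), where parallel lies along v_prev and
    orthogonal is the residual, so that parallel + orthogonal == acceleration.
    """
    accel = [c - p for p, c in zip(v_prev, v_curr)]
    norm_sq = sum(p * p for p in v_prev)
    if norm_sq == 0.0:
        # No reference direction: treat the whole change as the residual.
        return [0.0] * len(accel), accel
    coeff = sum(a * p for a, p in zip(accel, v_prev)) / norm_sq
    parallel = [coeff * p for p in v_prev]
    orthogonal = [a - q for a, q in zip(accel, parallel)]
    return parallel, orthogonal
```

Under these assumptions the parallel part captures how the velocity magnitude changes along its current direction, and the orthogonal part captures how the direction turns, which matches the magnitude and direction indicators the abstract refers to.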

2605.16785 2026-05-19 cs.CV cs.AI 版本更新

Encoding Robust Topological Signatures for Hyperdimensional Computing

为超维计算编码鲁棒的拓扑特征

Arpan Kusari

发表机构 * University of Michigan Transportation Research Institute(密歇根大学交通研究院) University of Michigan(密歇根大学)

AI总结 本文提出了一种基于拓扑特征的超维计算方法,通过提取离散拓扑原始特征并结合RTS不变的形状签名,提高了超维计算在旋转、噪声和遮挡等扰动下的鲁棒性,实验表明其在多个数据集上优于传统方法。

详情
AI中文摘要

超维(HD)计算由于其简单性、快速的基于原型的推断和与在线更新的兼容性,为边缘学习提供了一个有吸引力的深度网络替代方案。然而,标准的基于像素的HD编码器较为脆弱:旋转、噪声或遮挡等小的分布偏移会显著降低准确性。我们从二值化形状中提取离散拓扑原始特征,尤其是孔洞,并将它们与旋转/平移/缩放(RTS)不变的形状签名配对。我们的方法构建RTS稳定的描述符:(i)外轮廓使用空间金字塔变体的Zernike矩,(ii)每个孔洞使用其径向签名的内在傅里叶描述符以及RTS规范化的相对几何。每个原始特征通过随机投影和角色绑定映射到双极超向量,基数可变的孔洞集合通过排列不变的捆绑聚合为单个图像超向量。为了避免过度加权任何线索,我们通过余弦相似度的后期融合,在验证集上学习Zernike和孔洞通道的非负可靠性权重。在MNIST和EMNIST数据集上受控扰动(旋转、高斯噪声、椒盐噪声、遮挡块、缩放)下的实验表明,拓扑引导的HD计算相比朴素HD基线显著提高了鲁棒性,在多个扰动家族上保持了高精度,并受益于轻量级在线训练。与在干净数据上训练的紧凑CNN相比,我们的方法在干净数据上的精度具有竞争力,同时对几种像素级扰动具有明显更强的鲁棒性,证明了显式拓扑结构是实现鲁棒HD表示的可行途径。代码在https://github.com/arpan-kusari/Topological-HDC提供。

英文摘要

Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype-based inference, and compatibility with online updates. However, standard pixel-based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives, most notably holes, from binarized shapes and pair them with rotation/translation/scale (RTS)-invariant shape signatures. Our method constructs RTS-stable descriptors for (i) the outer shape using a spatial-pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS-canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable-cardinality hole sets are aggregated by permutation-invariant bundling to form a single image hypervector. To avoid over-weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt-and-pepper, cutout, zoom) show that topology-guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel-level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at https://github.com/arpan-kusari/Topological-HDC.
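The hypervector operations named above (role binding, permutation-invariant bundling, cosine similarity) are standard in HD computing and can be sketched generically. The code below uses random bipolar hypervectors, element-wise multiplication for binding, and a majority vote for bundling. It is a textbook-style illustration with hypothetical function names, not the encoder from the linked repository.

```python
import random

def random_hv(dim, rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bind(a, b):
    """Binding by element-wise multiplication; binding twice with b undoes it."""
    return [x * y for x, y in zip(a, b)]

def bundle(hypervectors):
    """Bundling by element-wise majority vote (ties resolved to +1).

    The result does not depend on the order of the inputs, so a set with
    any number of members maps to a single hypervector.
    """
    return [1 if sum(column) >= 0 else -1 for column in zip(*hypervectors)]

def cosine(a, b):
    """Cosine similarity; bipolar vectors of length d all have norm sqrt(d)."""
    return sum(x * y for x, y in zip(a, b)) / len(a)
```

Because bundling is a per-coordinate vote, a bundled vector stays measurably similar to each of its members and nearly orthogonal to unrelated random hypervectors, which is what makes prototype comparison by cosine similarity workable.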

2605.16779 2026-05-19 cs.CV cs.AI 版本更新

A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis

一种基于无监督聚类分析的超二次曲面拟合整体方法

Mingyang Zhao, Sipu Ruan, Xiaohong Jia

发表机构 * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(数学科学国家重点实验室,数学与系统科学研究院,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) Robotics Institute, School of Mechanical Engineering and Automation, Beihang University(北京航空航天大学机械工程与自动化学院机器人研究所)

AI总结 本文提出了一种新的方法,用于在存在噪声和异常值的情况下对点云进行超二次曲面拟合,通过无监督聚类分析重新定义问题,实现了刚性和变形超二次曲面的一体化拟合,同时提供了闭式解析解和收敛性证明。

Comments 20 pages, Code: https://github.com/zikai1/SuperquadricFitting

Journal ref IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2026

详情
AI中文摘要

本文提出了一种新的方法,用于在存在噪声和异常值的情况下对点云进行超二次曲面拟合,该方法在多个领域具有广泛的应用。与以往仅专注于拟合刚性或变形超二次曲面或存在鲁棒性和数值稳定性问题的方法不同,我们的方法从无监督聚类的新视角重新定义问题,使刚性和变形超二次曲面的拟合能够在统一的框架中完成。我们的方法核心是一种受无监督聚类分析启发的稳定优化函数,其中我们将点云数据和潜在参数曲面的样本分别作为聚类成员和质心。然后,具有动态更新质心位置的聚类过程成为优化超二次曲面参数的直接代理,建立了几何拟合与聚类动态之间的原则性联系。我们进一步推导了聚类质心与聚类成员之间的成对计算与正交距离之间的关系,从而有效消除了耗时的曲面采样过程。此外,我们的公式为模糊成员度向量和协方差矩阵提供了闭式解析解,确保了高效迭代优化,并能够更有效地处理几何变形。此外,我们还提供了收敛性分析的理论证明,并证明了聚类启发的拟合方法通过内在增加目标函数的凸性来逃避局部极小值。实现已公开在https://github.com/zikai1/SuperquadricFitting。

英文摘要

This work presents a novel method for fitting superquadrics to point clouds contaminated by noise and outliers, a task with many applications in shape modeling across diverse fields. Unlike prior approaches that either focus exclusively on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further relate the pairwise computations between clustering centroids and clustering members to orthogonal distances, effectively eliminating the need for the time-consuming surface sampling process. Moreover, our formulation provides closed-form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iterative optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical convergence analysis and demonstrate that the clustering-inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at https://github.com/zikai1/SuperquadricFitting.
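For readers unfamiliar with superquadrics, the standard implicit (inside-outside) function for a rigid superquadric in its canonical frame is a useful reference point. The sketch below evaluates it for a single point. The symbols (scales a, b, c and shape exponents e1, e2) follow the common textbook convention and are assumptions here; the clustering-based objective proposed in the paper is a different formulation and is not reproduced.

```python
def inside_outside(point, scales, exponents):
    """Standard superquadric inside-outside function in the canonical frame.

    point: (x, y, z); scales: (a, b, c) > 0; exponents: (e1, e2) > 0.
    Returns F, where F < 1 is inside, F == 1 is on the surface, F > 1 is outside.
    """
    x, y, z = point
    a, b, c = scales
    e1, e2 = exponents
    xy_term = (abs(x / a) ** (2.0 / e2) + abs(y / b) ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + abs(z / c) ** (2.0 / e1)
```

With both exponents equal to 1 the shape reduces to an ellipsoid, so a point such as (a, 0, 0) evaluates to exactly 1.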

2605.16775 2026-05-19 cs.CV cs.AI cs.LG 版本更新

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

VolTA-3D: 基于3D体积分块对齐的脑MRI自监督学习

Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar

发表机构 * Institute of Health Informatics(健康信息学研究所) Sheikh Zayed Institute for Pediatric Surgical Innovation(谢赫扎耶德儿童外科创新研究所) School of Medicine and Health Sciences(医学与健康科学学院)

AI总结 本文提出VolTA-3D,一种用于脑MRI自监督学习的3D视觉Transformer框架,通过联合对齐全局类风格标记和局部块标记,增强体积分块表示的可迁移性,从而在多个下游任务中表现出更好的泛化能力和鲁棒性。

Comments Accepted at EMBC 2026

详情
AI中文摘要

自监督学习(SSL)通过利用大规模未标记数据推动了医学图像分析的发展。然而,在脑磁共振成像(MRI)中,大多数3D模型仍专用于分割或分类中的某一项任务,限制了其在不同数据集、成像协议和下游任务中的泛化能力。这种缺乏可迁移性限制了3D MRI模型的临床应用,尽管存在大量未标记的体数据。我们提出了VolTA-3D,一种自监督的3D视觉Transformer框架,旨在学习可迁移的体表示。VolTA-3D在学生-教师范式中联合对齐全局类风格标记和局部块标记,并强制细粒度结构重建。这种联合全局-局部对齐解决了脑MRI中有限的语义多样性和细微解剖特征,这对现有SSL方法构成了挑战。我们在多个分布外下游任务上评估了VolTA-3D,包括海马体分割和性别及阿尔茨海默病与健康对照的分类。在所有任务中,VolTA-3D学习的表示均优于随机初始化的基线,证明了其在域偏移下的改进可迁移性和鲁棒性。因此,在预训练过程中联合强制全局语义一致性和局部结构学习,使模型能够从未标记的脑MRI数据中学习更广泛的概念。总体而言,VolTA-3D支持有效的多任务下游性能,具有任务特定的适应性,是迈向通用化和临床可行的3D模型的一步。

英文摘要

Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabeled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer's disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific pretraining, a step towards generalizable and clinically viable 3D models.

2605.16774 2026-05-19 cs.CV cs.AI 版本更新

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

CANSURF:用于水面垃圾检测与跟踪的ASV视角易拉罐数据集与基准

Zaid Aljundi, Zahra F. Rahmatullah, Mostafa Elemam, Abdullah Moosa

发表机构 * School of Mathematical and Computer Sciences(数学与计算机科学学院) Heriot-Watt University Dubai(赫瑞瓦特大学迪拜分校) School of Engineering and Physical Sciences(工程与物理科学学院)

AI总结 本文提出了一种新的ASV视觉系统和水面易拉罐数据集,用于在水面条件下检测和跟踪小型反射性垃圾,如铝罐。数据集包含约7.3k张原始图像,经过十种增强方法扩展至约57k张训练/验证图像,涵盖了多样的光照和水面状态。基准测试显示,在CANSURF数据集上训练YOLOv11的性能比在通用数据集上训练提高了12倍,展示了数据集的价值。实验表明,YOLOv11+ByteTrack在稳定跟踪和多目标准确性方面表现最佳,而YOLOv11+SAHI提高了远距离易拉罐的召回率,但精度有所下降。考虑到任务需求,YOLOv11+SAHI在检测尽可能多的易拉罐方面表现更好。

Comments Published in the 2025 8th International Conference on Signal Processing and Information Security (ICSPIS). Published and available to view on IEEE Xplore

Journal ref Proc. 2025 8th Int. Conf. Signal Processing and Information Security (ICSPIS), 2025, pp. 1-6

详情
AI中文摘要

水面海洋垃圾仍然是自主清洁任务中的实际瓶颈,其中小型、反射性的目标(如铝罐)必须在强光、波纹和部分浸没条件下从远处检测。本文提出了一种ASV视觉系统和一个新的水面易拉罐数据集。该数据集包含约7.3k张从视频中提取并标注了边界框的原始图像,并通过十种增强类型扩展至约57k张训练/验证图像,涵盖了多样化的光照和水面状态。本文对一组针对水面作业定制的检测器和检测-跟踪管道进行了基准测试。在CANSURF上训练YOLOv11的性能比在通用数据集上训练提高了12倍,突显了数据集的价值。实验表明,YOLOv11+ByteTrack在稳定跟踪(较少的身份切换)和多目标准确性方面表现最佳,而YOLOv11+SAHI提高了远距离易拉罐的召回率,但代价是在全上下文输入中精度降低。鉴于任务配置为包含接近和抓取的单罐拾取,YOLOv11+SAHI在检测尽可能多的易拉罐方面表现更好。此前没有公开数据集针对从水面视角检测水上的铝罐;此数据集填补了这一空白,并支持可重复的评估。

英文摘要

Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations was benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset's value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy under, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile (single-can pickup with approach and grab), YOLOv11+SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.

2605.16769 2026-05-19 cs.CV 版本更新

GLT-PEFT: Gated Lie-Tucker Parameter-Efficient Fine-Tuning for Alzheimer's Disease Diagnosis with Hippocampal Segmentation Pretraining

GLT-PEFT: 基于门控李-塔克参数高效微调的阿尔茨海默病诊断方法(结合海马体分割预训练)

Guanghua He, Hancan Zhu, Gaohang Yu, An Zhang

发表机构 * Department of Mathematics, Hangzhou Dianzi University(杭州电子科技大学数学系) School of Mathematics, Physics and Information, Shaoxing University(绍兴文理学院数理信息学院) Department of Mathematics, Zhejiang University of Science and Technology(浙江科技大学数学系)

AI总结 本文提出GLT-PEFT方法,通过门控李-塔克分解实现高效参数微调,用于阿尔茨海默病诊断,结合海马体分割预训练,提升医学影像模型的适应性与鲁棒性。

详情
AI中文摘要

参数高效微调(PEFT)已成为在数据有限条件下适应预训练模型的有前景范式。然而,现有大多数PEFT方法针对矩阵结构参数设计,不适用于医学影像模型中的高维卷积核。此外,它们通常依赖加法更新,缺乏保持预训练参数几何结构的机制,而乘法(几何感知)更新难以在统一框架中整合。为了解决这一问题,本文提出GLT-PEFT,一种用于阿尔茨海默病(AD)诊断的门控李-塔克参数高效微调框架。所提出的方法将预训练的海马体分割模型转移到下游分类任务。塔克分解使3D卷积核实现张量感知的低秩适应,而基于李群的变换提供结构保持的乘法更新。门控机制进一步协调加法和乘法更新形式,实现统一且更稳定的微调策略。大量实验表明,GLT-PEFT在跨任务转移中实现有效效果,同时显著减少可训练参数,突显其在医学影像模型中的高效和鲁棒适应性。

英文摘要

Parameter-efficient fine-tuning (PEFT) has emerged as a promising paradigm for adapting pretrained models under limited data conditions. However, most existing PEFT methods are designed for matrix-structured parameters and are not well suited for high-dimensional convolutional kernels in medical imaging models. Moreover, they typically rely on additive updates and lack mechanisms to preserve the geometric structure of pretrained parameters, while multiplicative (geometry-aware) updates are difficult to integrate within a unified framework. To address this issue, this paper proposes GLT-PEFT, a gated Lie-Tucker parameter-efficient fine-tuning framework for Alzheimer's disease (AD) diagnosis. The proposed approach transfers a hippocampal segmentation pretrained model to a downstream classification task. Tucker decomposition enables tensor-aware low-rank adaptation of 3D convolutional kernels, while Lie group-based transformations provide structure-preserving multiplicative updates. A gating mechanism further reconciles additive and multiplicative update forms, resulting in a unified and more stable fine-tuning strategy. Extensive experiments demonstrate that GLT-PEFT achieves effective cross-task transfer while significantly reducing trainable parameters, highlighting its effectiveness for efficient and robust adaptation in medical imaging models.
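The parameter saving from a Tucker-format update of a 3D convolution kernel can be estimated by simple counting. The sketch below treats the kernel as a 5-way tensor (output channels, input channels, and three spatial sizes) and counts the core plus one factor matrix per mode. The function name and the example ranks are hypothetical and are not taken from the paper.

```python
from math import prod

def tucker_param_count(shape, ranks):
    """Number of parameters in a Tucker factorization of a tensor.

    shape: size of each tensor mode; ranks: Tucker rank chosen for each mode.
    The count is the core tensor (product of ranks) plus one factor matrix
    of size (mode size x mode rank) for every mode.
    """
    if len(shape) != len(ranks):
        raise ValueError("shape and ranks must have the same length")
    core = prod(ranks)
    factors = sum(size * rank for size, rank in zip(shape, ranks))
    return core + factors
```

For a 64 x 64 x 3 x 3 x 3 kernel (110,592 weights) and assumed ranks (8, 8, 3, 3, 3), the Tucker update has 1,728 core entries plus 1,051 factor entries, 2,779 parameters in total, which is about 2.5% of the dense kernel.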

2605.16768 2026-05-19 cs.CV eess.IV 版本更新

Axial-Relation Guided Fusion State Space Model for Optical-Elevation Sensing Image Segmentation

基于轴向关系引导的融合状态空间模型用于光学-高程遥感图像分割

Feng Gao, Zhilin Jin, Yanhai Gan, Junyu Dong, Qian Du

发表机构 * State Key Laboratory of Physical Oceanography, Ocean University of China(中国海洋大学物理海洋学国家重点实验室) Department of Electrical and Computer Engineering, Mississippi State University(密西西比州立大学电气与计算机工程系)

AI总结 本文提出了一种基于状态空间模型的框架,用于光学-高程遥感图像分割,通过引入多尺度状态空间模块和轴向关系引导融合模块,有效提升了多源遥感图像语义分割的性能和计算效率。

Comments Accepted by IEEE GRSL 2026

详情
AI中文摘要

多源遥感图像的语义分割是地球观测应用中的基本任务。现有方法常受限于多尺度上下文建模不足和跨模态特征融合欠佳,这限制了其在复杂高分辨率场景中的性能。为此,我们提出轴向关系引导融合Mamba(ARG-Mamba),一种基于状态空间模型的框架,用于光学-高程遥感图像分割。具体而言,我们引入了多尺度状态空间模块,以线性计算复杂度捕获细粒度局部细节和全局上下文依赖。此外,设计了轴向关系引导融合模块,以显式建模水平和垂直轴上的全局跨模态相关性,从而在光学和高程模态之间实现高效的特征融合。在ISPRS Vaihingen和Potsdam数据集上进行的广泛实验表明,ARG-Mamba在保持有利的计算效率的同时,始终优于最先进的方法。代码将在https://github.com/oucailab/ARG-Mamba上公开发布。

英文摘要

Semantic segmentation of multi-source remote sensing images is a fundamental task for Earth observation applications. Existing methods often struggle with insufficient multi-scale context modeling and suboptimal cross-modal feature fusion, limiting their performance in complex high-resolution scenes. To this end, we propose Axial-Relation Guided Fusion Mamba (ARG-Mamba), a state space model-based framework for optical-elevation remote sensing image segmentation. Specifically, we introduce a Multi-Scale State Space Module to capture both fine-grained local details and global contextual dependencies with linear computational complexity. Moreover, an Axial-Relation Guided Fusion Module is designed to explicitly model global cross-modal correlations along horizontal and vertical axes, enabling efficient feature fusion between optical and elevation modalities. Extensive experiments conducted on the ISPRS Vaihingen and Potsdam datasets demonstrate that our ARG-Mamba consistently outperforms state-of-the-art methods while maintaining favorable computational efficiency. The code will be made publicly available at https://github.com/oucailab/ARG-Mamba.

2605.16764 2026-05-19 cs.CV eess.IV 版本更新

Synthetic Aperture Radar Image Change Detection Based on Global Dynamic Context-Aware Network

基于全局动态上下文感知网络的合成孔径雷达图像变化检测

Baogui Huan, Chuanzheng Gong, Dezhong Chen, Feng Gao, Junyu Dong, Qian Du

发表机构 * State Key Laboratory of Physical Oceanography, Ocean University of China(中国海洋大学物理海洋学国家重点实验室) Department of Electrical and Computer Engineering, Mississippi State University(密西西比州立大学电气与计算机工程系)

AI总结 本文提出了一种专门用于合成孔径雷达图像变化检测的全局动态上下文感知网络GDNet,通过引入全局动态卷积模块和两阶段Mixup策略,有效整合局部细节与全局上下文信息,提升对不同变化模式的检测能力。

Comments Accepted by IEEE JSTARS 2026

详情
AI中文摘要

卷积神经网络(CNNs)已广泛且成功地应用于合成孔径雷达(SAR)图像变化检测任务。然而,传统卷积层固有地受到局部感受野的限制,主要捕捉空间局部模式,而忽视了对SAR图像中细微或大规模变化至关重要的全局上下文。为了解决这些限制,我们提出了一种专门针对SAR图像变化检测的全局动态上下文感知网络(GDNet)。我们的方法核心是一种新的全局动态卷积模块,该模块根据从输入特征中提取的全局语义信息,自适应地调节卷积核权重。通过动态整合长距离依赖关系,这种机制使网络能够整合局部细节和全局上下文,从而提高其检测不同变化模式的能力。此外,我们引入了精心设计的两阶段Mixup策略用于模型训练。与传统单阶段Mixup不同,我们的两阶段设计生成了更多样化和信息丰富的训练样本,有效正则化模型,即使在数据有限的情况下也能获得更稳定和可靠的分类结果。在三个SAR数据集上的广泛实验展示了所提GDNet相较于其他最先进方法的优越性。这些发现突显了全局动态建模和高级数据增强策略在推进SAR图像解释方面的潜力。源代码可在https://github.com/oucailab/GDNet获得。

英文摘要

Convolutional neural networks (CNNs) have been extensively and successfully applied to the task of synthetic aperture radar (SAR) image change detection. However, conventional convolutional layers are inherently limited by their local receptive fields, which mainly capture spatially localized patterns while neglecting the global context that is often crucial for accurately distinguishing subtle or large-scale changes in SAR imagery. To address these limitations, we propose a novel Global Dynamic Context-Aware Network (GDNet) specifically tailored for SAR image change detection. At the core of our approach lies a novel global dynamic convolution module, which adaptively modulates convolution kernel weights according to the global semantic information extracted from the input features. By dynamically incorporating long-range dependencies, this mechanism enables the network to integrate both local detail and global context, thus improving its ability to detect diverse change patterns. In addition, we introduce a carefully designed two-stage Mixup strategy for model training. Unlike conventional single-stage Mixup, our two-stage design generates more diverse and informative training samples, effectively regularizing the model and yielding more stable and reliable classification results even under limited data scenarios. Extensive experiments on three SAR datasets demonstrate the superiority of the proposed GDNet compared to other state-of-the-art methods. These findings highlight the potential of global dynamic modeling and advanced data augmentation strategies for advancing SAR image interpretation. Source codes are available at https://github.com/oucailab/GDNet.
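As background for the two-stage strategy mentioned above, conventional single-stage Mixup forms a convex combination of two training samples and of their labels, with the mixing coefficient drawn from a Beta distribution. The sketch below shows only that conventional baseline. The abstract does not specify how the two stages are arranged, so nothing here should be read as the GDNet training procedure, and the function names are illustrative.

```python
import random

def sample_lambda(alpha, rng):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard Mixup."""
    return rng.betavariate(alpha, alpha)

def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two samples and of their (one-hot) labels."""
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y
```

For example, mixing a "changed" sample with an "unchanged" one at a coefficient of 0.25 yields a soft label of 0.25 for the first class and 0.75 for the second.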

2605.16748 2026-05-19 cs.GR cs.AI cs.CV cs.LG cs.MA cs.MM 版本更新

Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation

Genflow Ad Studio:一种用于品牌一致、自我纠正视频生成的复合AI架构

Debanshu Das, Lavi Nigam, Sunil Kumar Jang Bahadur, Gopala Dhar

发表机构 * Google(谷歌)

AI总结 本文提出Genflow Ad Studio,一种复合AI架构,通过品牌DNA提取模块和对抗性多代理质量控制循环,提高了品牌一致的视频生成效率,将合规率从42%提升到89%。

Comments 6 pages, 2 figures, 2 tables. Accepted to the ACM Conference on AI and Agentic Systems (CAIS '26). Includes demo video and code repository links

Journal ref ACM Conference on AI and Agentic Systems (CAIS '26), May 26-29, 2026, San Jose, CA, USA

详情
AI中文摘要

近期生成视频模型的进步展示了高水平的视觉保真度,但其在企业环境中的整合受到时间不一致性和严重的品牌不一致性的限制。当前的单体架构难以强制执行严格的品牌约束,经常幻觉出未经批准的视觉资产。我们介绍了Genflow,一种复合AI系统,旨在生成媒体生产中强制执行品牌一致性。我们的架构集成了基于检索的'品牌DNA'提取模块,根据既定的企业形象指南对生成过程进行参数化。此外,我们实现了对抗性多代理质量控制(QC)循环。与单次生成流程不同,此流程采用评估代理,依据提取的参数反复评审生成的帧,促使生成模型细化输出,直到达成确定性共识。通过转向多阶段、自我纠正的流程,Genflow将品牌合规视频生成的产出率从42%提高到89%,建立了稳健的框架,用于可扩展的、企业级的生成系统。

英文摘要

Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval-based 'Brand DNA' extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi-Agent Quality Control (QC) loop. Instead of single-pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi-stage, self-correcting pipeline, Genflow improved the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.
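The generate, critique, and refine cycle described above can be pictured as a small control loop. The sketch below assumes that each evaluator returns a list of critiques (empty when it approves), that consensus means every evaluator approves, and that the loop gives up after a fixed number of refinements. The function name, the callable interfaces, and the round limit are hypothetical and do not describe Genflow's actual agents.

```python
def qc_loop(generate, evaluators, refine, max_rounds=5):
    """Run a generate -> critique -> refine loop until all evaluators approve.

    generate: callable returning an initial candidate.
    evaluators: callables mapping a candidate to a list of critiques.
    refine: callable mapping (candidate, critiques) to a revised candidate.
    Returns (candidate, refinement_rounds_used, approved).
    """
    candidate = generate()
    for round_index in range(max_rounds + 1):
        critiques = [c for evaluate in evaluators for c in evaluate(candidate)]
        if not critiques:
            return candidate, round_index, True
        if round_index < max_rounds:
            candidate = refine(candidate, critiques)
    return candidate, max_rounds, False
```

Returning an explicit approval flag keeps rejected outputs distinguishable from compliant ones, which is what a yield figure such as the 42% to 89% improvement would be computed from.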

2605.16745 2026-05-19 cs.CV 版本更新

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

EVA01: 通过混合变换器实现统一的原生3D理解和生成

Zongyuan Yang, Mingjing Yi, Wanli Ma, Chenzhuo Fan, Bocheng Li, Baolin Liu, Yuke Lou, Yingde Song, Yongping Xiong, Zhengdong Guo, Shimu Wang

发表机构 * SeeleAI Team(SeeleAI团队)

AI总结 本文提出EVA01框架,通过混合变换器架构扩展多模态大语言模型的模态边界,实现原生的3D网格理解和生成以及上下文感知编辑,提升文本到3D生成的保真度和多轮几何编辑能力。

Comments 28 pages, 10 figures, 6 tables. Technical report

详情
AI中文摘要

本文解决了将3D网格作为多模态大语言模型(MLLM)的原生模态进行整合的挑战。基于扩散的大型重建模型将语义理解与几何推理解耦,作为以密集2D像素先验为条件的无状态重建器运行。近期基于MLLM的方法将3D模态视为外部输出而非多模态序列的原生组件,只做增量式适配,而没有系统分析几何流形如何与MLLM特征空间对齐。我们引入EVA01,一个统一的框架,扩展MLLM的模态边界,原生纳入3D网格理解、生成以及上下文感知编辑。基于混合变换器(MoT)架构,EVA01将模型解耦为预训练的Understanding Expert($E_{\mathrm{und}}$)和结构上镜像的Generation Expert($E_{\mathrm{gen}}$),二者通过带有硬模态路由的共享全局自注意力耦合。该设计使MLLM主干的语义潜在空间与几何流形对齐,从而在不使用中间2D表示的情况下直接迁移多模态先验。结果表明,EVA01在原生文本到3D生成保真度方面达到最先进的水平,并解锁了具有身份保持的稳健长上下文多轮几何编辑能力,这一能力对无状态重建流程来说是根本无法实现的。我们的发现进一步为将2D基础模型与3D任务整合提供了架构洞察,可指导3D原生多模态系统的设计。项目页面:https://www.seeles.ai/research/pages/EVA01

英文摘要

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ($E_{\mathrm{und}}$) and a structurally mirrored Generation Expert ($E_{\mathrm{gen}}$), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01

2605.16742 2026-05-19 cs.CV stat.ME 版本更新

Diffeomorphic Cortical Alignment via Direct Warping of Streamline Endpoints

通过直接变形纤维束端点实现微分同胚皮层对齐

Yang Xiang, Martin Cole, Zhengwu Zhang

发表机构 * Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill(统计与运筹学系,北卡罗来纳大学教堂山分校) Department of Psychiatry, University of Rochester(精神病学系,罗切斯特大学)

AI总结 本文提出了一种基于连接性的皮层对齐方法,通过直接操作白质纤维束端点来对齐皮层表面,以提高纤维束层面的对应性,并在主要纤维束上实现更高的连接性重叠系数和更强的鲁棒性。

详情
AI中文摘要

皮层表面注册通常由局部几何描述符(例如沟回深度和曲率)驱动。尽管这种方法实现了几何对应,但忽略了白质解剖结构所施加的远距离连接约束。扩散磁共振成像束追踪提供了这些关键约束;然而,先前的连接性指导流程通常对预计算的连接性矩阵进行对齐,使优化高度敏感于连接性估计及其分辨率。在本文中,我们提出了一种新的基于连接性的皮层对齐方法,通过直接在白质纤维束端点上操作来对齐皮层表面。我们将束端点建模为产品流形Ω×Ω上的点云,其中Ω代表膨胀的皮层半球的球形域。我们的对齐方法通过迭代(i)通过最小化连接性不匹配计算Ω的小变形扭曲,并(ii)根据此扭曲更新端点。该方法依赖于一个几何框架,确保输出扭曲是微分同胚,并具有最终目标,即优化已知纤维束的匹配。在人类连接组计划(HCP)数据上的实验表明,该方法在纤维束层面实现了改进的对应性,实现了主要纤维束上的更高连接性重叠系数,并在Ω的网格分辨率下比最先进的方法如ENCORE和MSMAll表现出更强的鲁棒性。

英文摘要

Cortical surface registration is often driven by local geometric descriptors (e.g., sulcal depth and curvature). While this approach achieves geometric correspondence, it neglects the long-range wiring constraints imposed by white-matter anatomy. Diffusion MRI tractography offers these crucial constraints; however, prior connectivity-informed pipelines typically align precomputed connectivity matrices, making the optimization highly sensitive to connectivity estimation and its resolution. In this paper, we introduce a novel connectivity-based surface registration method that aligns cortical surfaces by operating directly on white-matter fiber-tract endpoints. We model tract endpoints as a point cloud on the product manifold $Ω\times Ω$, where $Ω$ represents the spherical domain of the inflated cortical hemispheres. Our alignment method iteratively (i) computes a small diffeomorphic warp for $Ω$ by minimizing connectivity mismatch, and (ii) updates the endpoints based on this warp. The method relies on a geometric framework that ensures output warps are diffeomorphisms and has a final goal that optimizes the matching of well-known fiber bundles. Experiments on Human Connectome Project (HCP) data demonstrate improved tract-level correspondence, achieving higher connectivity-level overlap coefficients on major fiber bundles and stronger robustness across grid resolutions for $Ω$ compared to state-of-the-art methods such as ENCORE and MSMAll.

2605.16737 2026-05-19 cs.RO cs.CV 版本更新

DriveSafer: End-to-End Autonomous Driving with Safety Guidance

DriveSafer: 结合安全指导的端到端自动驾驶

Shounak Sural, Raj Rajkumar

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出DriveSafer框架,通过减少致命性规划失败来提高端到端自动驾驶的安全性,而非单纯提升平均规划质量。

详情
AI中文摘要

端到端(E2E)自动驾驶模型近年来在性能上有了显著提升,尤其是在越来越具有挑战性的基准测试中。然而,现代生成式E2E规划器仍然在安全关键场景中存在大量致命性故障。我们发现许多此类故障源于物理约束和安全要求的违反,导致不安全行为。受此发现启发,本文专注于改进生成式端到端驾驶中的安全结果,通过有针对性地减少致命性规划失败,而不是提升平均规划质量。为此,我们提出了DriveSafer,一种面向端到端规划器的故障感知安全框架。DriveSafer通过利用训练时的安全约束和推理时的安全指导,明确引导生成式规划器朝向安全行为。与最先进的DiffusionDrive模型相比,在NAVSIM基准测试中,DriveSafer将致命性故障数量(PDMS=0)减少了48%,在可行驶区域合规性故障上减少了超过65%。

英文摘要

End-to-End (E2E) autonomous driving models have shown growing capability in recent years, with performance improving on increasingly challenging benchmarks. However, modern generative E2E planners still suffer from a substantial number of catastrophic failures in safety-critical scenarios. We find that many such failures arise from violations of physical constraints and safety requirements, leading to unsafe behavior. Motivated by this finding, in this paper, we focus on improving safety outcomes in generative end-to-end driving with a targeted reduction of catastrophic planning failures, instead of enhancing average planning quality. Towards this end, we propose DriveSafer, a failure-aware safety framework for end-to-end planners. DriveSafer explicitly steers generative planners towards safe behaviors leveraging both training-time safety constraints and inference-time safety guidance. Compared to the state-of-the-art DiffusionDrive model, on the NAVSIM benchmark, DriveSafer reduces the number of catastrophic failures (PDMS=0) by 48%, with over 65% reduction in drivable-area compliance failures.

2605.16732 2026-05-19 cs.CV cs.LG 版本更新

DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

DiRotQ:面向4位扩散变换器的旋转感知量化

Sayeh Sharify, Mahsa Salmani, Hesham Mostafa

发表机构 * d-Matrix

AI总结 本文提出DiRotQ,一种W4A4量化框架,通过旋转感知激活量化缓解扩散变换器在4位精度下的性能下降问题,同时引入VLM-as-a-Judge评估协议和Triton定制内核提升压缩下的效率与质量。

详情
AI中文摘要

扩散变换器(DiTs)在图像生成质量上达到最先进的水平,但在推理过程中带来显著的内存和计算成本。尽管激进的后训练量化(PTQ)到4位精度能带来显著的效率提升,但通常会导致严重的质量下降。现有方法,包括基于平滑的方法、混合精度方案、旋转技术以及低秩残差方法,部分缓解了这一问题,但仍与FP16/BF16性能存在明显差距。在本工作中,我们引入DiRotQ,一种W4A4 PTQ框架,通过旋转感知的激活量化来缓解这种质量下降。DiRotQ通过主成分分析(PCA)识别出捕捉主导激活方差的低秩子空间,以较高精度保留该子空间中的系数,同时将剩余分量量化为4位。在推理时,通过校准得出的正交变换将激活旋转到PCA基底中,而逆旋转则离线融合到层权重中。结合基于GPTQ的权重量化,DiRotQ使用PixArt-Σ在MJHQ-30K数据集上实现了15.9的FID(越低越好)和19.1 dB的PSNR(越高越好),优于先前最先进的SVDQuant(FID 18.9,PSNR 17.6)在同一INT W4A4设置下的表现。除了标准指标外,我们为扩散模型量化引入了VLM-as-a-Judge评估协议,这是该设置下的首次此类评估,可在激进压缩下对感知质量和提示对齐进行更全面的评估。在系统层面,我们实现了基于Triton的定制内核,以实现高效的端到端推理,在24 GB RTX 4090 GPU上将12B FLUX.1-dev模型的内存使用减少了2.1倍,并相对BF16基线实现了2.3倍的加速。

英文摘要

Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
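To make the rotate-then-quantize idea concrete, here is a minimal pure-Python sketch. It is not DiRotQ's implementation: the orthonormal `basis` is assumed to be given (the abstract derives it from PCA on calibration data), the symmetric per-vector 4-bit quantizer is a generic choice, and all function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def quantize_int4(values):
    """Symmetric 4-bit quantization: integer levels in [-8, 7] times one shared scale."""
    if not values:
        return []
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / 7.0
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def rotate_and_quantize(x, basis, k):
    """Rotate x into the basis (rows = orthonormal directions, leading ones first),
    keep the first k coefficients at full precision, quantize the rest to 4 bits."""
    coeffs = matvec(basis, x)
    return coeffs[:k] + quantize_int4(coeffs[k:])


def fuse_inverse_rotation(weight, basis):
    """Return W' = W Q^T so that W' (Q x) equals W x; this can be done offline."""
    return [[sum(w * q for w, q in zip(w_row, q_row)) for q_row in basis]
            for w_row in weight]


# Hypothetical 3-D example: a rotation about the z-axis stands in for the PCA basis.
theta = 0.3
Q = [[math.cos(theta), math.sin(theta), 0.0],
     [-math.sin(theta), math.cos(theta), 0.0],
     [0.0, 0.0, 1.0]]
W = [[0.5, -1.0, 2.0], [2.0, 0.25, -0.5]]
x = [2.0, 1.0, 0.05]

W_fused = fuse_inverse_rotation(W, Q)
exact = matvec(W, x)
approx = matvec(W_fused, rotate_and_quantize(x, Q, k=1))
print(exact, approx)
```

With `k` equal to the full dimension nothing is quantized and the fused product reproduces `W x` up to floating-point error. Lowering `k` trades accuracy for a larger 4-bit share.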

2605.16720 2026-05-19 cs.CV cs.LG 版本更新

Compositional Adversarial Training for Robust Visual Watermarking

组合对抗训练用于鲁棒的视觉水印

Anirudh Satheesh, Michael-Andrei Panaitescu-Liess, Andrew Xu, Georgios Milis, Heng Huang, Zikui Cai, Furong Huang

发表机构 * University of Maryland(马里兰大学)

AI总结 本文提出了一种组合对抗训练(CAT)框架,通过在结构化空间中构建组合转换的min-max问题,提升视觉水印的鲁棒性,实验表明其在多种攻击设置下优于随机增强基线。

详情
AI中文摘要

鲁棒水印通常使用随机后处理增强进行训练,但随机采样无法充分覆盖真实攻击管道的组合空间,很少遇到真正破坏检测的稀有组合。这导致训练不稳定且样本效率低。为此,我们将水印鲁棒性建模为结构化组合变换空间上的min-max问题。我们提出组合对抗训练(CAT),一种插件式框架,学习一个顺序的可微对抗者,它观察当前的水印图像,并在每一步选择一个攻击家族,以最大程度地干扰信息恢复。CAT将直通Gumbel-Softmax攻击选择与熵正则化相结合,使反向传播端到端可微,并聚合各攻击家族的梯度信息,从而实现更快、更平滑的收敛,而不会坍缩到单一攻击模式。我们在生成后水印VideoSeal 0.0、VideoSeal 1.0和PixelSeal以及生成中水印WMAR上,在单步和双步攻击套件下,于分布内和多个分布外图像与视频基准上评估CAT。在相同增强预算下,CAT始终优于随机增强基线,在困难的组合攻击和分布外评估上增益最大:在单步攻击设置中将整体水印容量最多提高63.5%,在组合设置中提高13.0%。在自回归设置中,CAT在困难的几何变换上将TPR@FPR=1%平均提高12%。这些结果表明,鲁棒视觉水印受益于针对自适应组合对抗者的训练,而非独立的随机破坏。

英文摘要

Robust watermarking is typically trained with random post-processing augmentation, but random sampling under-covers the combinatorial space of realistic attack pipelines and rarely encounters the rare compositions that actually break detection. This leads to unstable training and poor sample efficiency. We instead formulate watermark robustness as a min-max problem over a structured space of compositional transformations. We propose Compositional Adversarial Training (CAT), a plug-in framework that learns a sequential differentiable adversary that observes the current watermarked image and selects an attack family at each step to maximally disrupt message recovery. CAT combines straight-through Gumbel-Softmax attack selection with entropy regularization, allowing the backward pass to be end-to-end differentiable and to aggregate gradient information across attack families, yielding faster, smoother convergence without collapsing to a single attack mode. We evaluate CAT on the post-generation watermarks VideoSeal 0.0, VideoSeal 1.0, and PixelSeal and on the in-generation watermark WMAR under both single-step and two-step attack suites, on in-distribution and multiple out-of-distribution (OOD) image and video benchmarks. CAT consistently outperforms random-augmentation baselines trained with the same augmentation budget, with the largest gains on hard composed attacks and OOD evaluations, improving overall watermark capacity by up to $63.5\%$ in the single-step attack setting and $13.0\%$ in the compositional setting. In the autoregressive setting, CAT improves the TPR@FPR$=1\%$ by $12\%$ on average on difficult geometric transformations. These results show that robust visual watermarking benefits from training against adaptive compositional adversaries rather than independent random corruptions.
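The attack-selection step can be illustrated with the forward pass of Gumbel-Softmax sampling plus an entropy term. The sketch below is a generic illustration in plain Python, not CAT's code: the attack-family names, logits, and temperature are hypothetical. The straight-through gradient itself needs an autograd framework and is described only in a comment.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_select(logits, tau, rng):
    """Forward pass of Gumbel-Softmax selection.

    Returns (hard, soft): `hard` is the one-hot choice used in the forward pass;
    `soft` is the relaxed distribution. In a straight-through estimator the
    backward pass would use the gradient of `soft` in place of `hard`.
    """
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    soft = softmax(noisy)
    choice = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == choice else 0.0 for i in range(len(soft))]
    return hard, soft


def entropy(probs):
    """Shannon entropy in nats; a bonus on this term discourages mode collapse."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical attack families and adversary logits, for illustration only.
attack_families = ["blur", "crop", "jpeg", "rotate"]
adversary_logits = [0.1, 0.2, 0.3, 0.4]
rng = random.Random(0)
hard, soft = gumbel_softmax_select(adversary_logits, tau=0.5, rng=rng)
print(attack_families[hard.index(1.0)], entropy(softmax(adversary_logits)))
```

A lower temperature `tau` pushes `soft` toward the one-hot choice. Maximizing the entropy of the adversary's policy is one common way to keep it from settling on a single attack family.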

2605.16696 2026-05-19 cs.CV 版本更新

Face inpainting with Identity Preserving Latent Diffusion Models

基于身份保持的潜在扩散模型的面部修复

João Santos, Carlos Santiago, Manuel Marques

发表机构 * Institute for Systems and Robotics(系统与机器人研究所)

AI总结 本文提出ID-ControlNet,利用潜在扩散模型实现面部修复,通过身份嵌入保持身份一致性,实验表明其在CelebA-HQ等数据集上优于传统方法,接近最先进的身份感知方法。

详情
AI中文摘要

面部修复技术能够以视觉逼真方式恢复缺失或遮挡的面部区域,但保持最终输出的身份仍是一个基本挑战。身份一致性对于下游应用如人脸识别、数字取证和人机交互至关重要,其中细微的身份扭曲可能显著降低性能或信任度。尽管扩散基生成模型在图像修复中取得了显著进展,但它们通常难以忠实保留个体特定的面部特征。另一方面,现有身份感知方法通常依赖于昂贵的微调、辅助监督或对多样遮挡、姿态和面部变化的鲁棒性有限。为了解决这些限制,我们提出ID-ControlNet,一种基于潜在扩散模型的身份保持面部修复框架。基于ControlNet架构,我们的方法将扩散过程条件化为从预训练的人脸识别网络中提取的面部身份嵌入。这种设计使能够重建遮挡的面部区域,同时保持全局面部一致性和身份保真度。此外,我们引入了身份一致性和三元组损失训练策略,以显式地强制生成的面部与目标身份表示之间的对齐。在CelebA-HQ、FFHQ和新的E-Mask数据集上的大量实验表明,ID-ControlNet在身份保持方面显著优于标准扩散基修复方法,实现了与最先进身份感知方法相当的性能。

英文摘要

Face inpainting techniques recover missing or occluded facial regions in a visually realistic manner, but preserving the identity in the final output remains a fundamental challenge. Identity consistency is crucial for downstream applications such as face recognition, digital forensics, and human-computer interaction, where even subtle identity distortions can significantly degrade performance or trust. Although diffusion-based generative models have recently achieved remarkable progress in image inpainting, they often struggle to faithfully retain individual-specific facial characteristics. On the other hand, existing identity-aware methods typically rely on costly fine-tuning or auxiliary supervision, or exhibit limited robustness to diverse occlusions, poses, and facial variations. To address these limitations, we propose ID-ControlNet, an identity-preserving face inpainting framework built upon latent diffusion models. Based on the ControlNet architecture, our approach conditions the diffusion process on facial identity embeddings extracted from a pretrained face recognition network. This design enables reconstruction of occluded facial regions while maintaining global facial coherence and identity fidelity. Furthermore, we introduce an identity consistency and triplet loss training strategy that explicitly enforces alignment between the generated face and the target identity representation. Extensive experiments on CelebA-HQ, FFHQ, and a new E-Mask dataset demonstrate that ID-ControlNet significantly improves identity preservation over standard diffusion-based inpainting methods, achieving performance comparable to SOTA identity-aware approaches.
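The abstract does not spell out the loss formulas, so the following is only a sketch of what an identity consistency term and a triplet term on face embeddings commonly look like. The cosine distance, the margin value, and all function names are assumptions, and the embeddings here are toy vectors rather than outputs of a face recognition network.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def identity_consistency_loss(generated, target):
    """Assumed form: pull the generated face's embedding toward the target identity."""
    return cosine_distance(generated, target)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet hinge: the anchor should be closer to the positive
    (same identity) than to the negative (other identity) by at least `margin`."""
    return max(0.0,
               cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative)
               + margin)


# Toy 2-D embeddings, for illustration only.
generated = [1.0, 0.1]
same_person = [1.0, 0.0]
other_person = [0.0, 1.0]
print(identity_consistency_loss(generated, same_person))
print(triplet_loss(generated, same_person, other_person))
```

The triplet term is zero once the generated embedding is closer to the correct identity than to the impostor by the margin, so it only penalizes samples whose identity is still ambiguous.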

2605.16672 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-Object Tracking Consistently Improves Wildlife Inference

多目标跟踪一致地提升野生动物推断

Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, Terence L. van Zyl

发表机构 * World Wide Fund (WWF)(世界自然基金会) Centre for Artificial Intelligence Research (CAIR)(人工智能研究中心)

AI总结 本文利用多目标跟踪技术提升野生动物分类模型的鲁棒性,通过融合轨迹信息改进分类结果,实验表明在三个数据集上均提升了性能。

Comments Accepted for publication in IEEE 2026 29th International Conference on Information Fusion

详情
AI中文摘要

相机陷阱已成为生态研究和生物多样性保护中常用的野生动物监测工具。野生动物分类模型受益于野生动物视觉数据的增加,这些模型在经过整理的高质量数据集上能达到高水平的准确性。然而,其性能仍然易受现实环境约束的影响。在进行时间连续序列的推断时,它们常常产生不一致的预测。单个个体在帧之间的预测标签会迅速变化。本研究利用相机陷阱数据的时间特性来增强野生动物分类模型的推断预测。具体来说,我们采用几种标准的多目标跟踪(MOT)模型,将连续帧中的检测结果进行关联。经过整理的轨迹用于融合softmax类概率。融合的概率评分产生一个单一的共识类标签估计,以覆盖噪声引起的误分类。实验结果分析表明,我们的策略在所有数据集和每个指标上均优于独立分类器。具体而言,表现最好的MOT模型在三个MOT数据集上分别比分类器提高了5.1%、3.1%和2.0%的加权F1分数。

英文摘要

Camera traps have become a common tool for wildlife monitoring efforts in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences: the predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The analysis of the experimental results shows that our proposed strategy improves on a standalone classifier across all datasets and for each metric. Specifically, the best-performing MOT models gain 5.1%, 3.1% and 2.0% in weighted F1-score over the classifier across three MOT datasets.
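As a rough illustration of trajectory-level fusion, the sketch below averages per-frame softmax probabilities within each track and assigns the argmax as the consensus label. The abstract does not state which fusion rule is used, so the mean is an assumption, and the class names and track IDs are made up.

```python
from collections import defaultdict


def fuse_track_probabilities(detections):
    """Average the per-frame class probabilities of each track.

    `detections` is a list of (track_id, probs) pairs, one per detection.
    """
    sums = {}
    counts = defaultdict(int)
    for track_id, probs in detections:
        if track_id not in sums:
            sums[track_id] = [0.0] * len(probs)
        sums[track_id] = [s + p for s, p in zip(sums[track_id], probs)]
        counts[track_id] += 1
    return {t: [s / counts[t] for s in total] for t, total in sums.items()}


def consensus_labels(detections, class_names):
    """One label per track: the argmax of the fused probabilities."""
    fused = fuse_track_probabilities(detections)
    return {t: class_names[max(range(len(p)), key=p.__getitem__)]
            for t, p in fused.items()}


# Hypothetical example: track 1 flickers between classes frame by frame.
class_names = ["impala", "kudu"]
detections = [
    (1, [0.60, 0.40]),
    (1, [0.45, 0.55]),  # a per-frame classifier would call this frame "kudu"
    (1, [0.70, 0.30]),
    (2, [0.20, 0.80]),
]
print(consensus_labels(detections, class_names))
```

In this toy run the middle frame of track 1 would be labeled differently on its own, but the fused probabilities give the whole track a single label.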

2605.16671 2026-05-19 cs.AI cs.CV cs.CY cs.LG 版本更新

Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

野生环境中的可持续智能:通过知识自适应边缘专家代理实现生态监测民主化

Jiaxing Li, Hao Fang, Chi Xu, Miao Zhang, Jiangchuan Liu, William I. Atlas, Katrina M. Connors, Mark A. Spoljaric

发表机构 * Simon Fraser University(西蒙菲莎大学) Wild Salmon Center(野生鲑鱼中心) Pacific Salmon Foundation(太平洋鲑鱼基金会) Haida Fisheries Program(海达渔业计划)

AI总结 本文提出一种知识自适应边缘代理架构,通过分离视觉感知与推理,结合视觉编码器和动态知识库,实现生态监测的可持续发展,促进伦理AI协同开发。

Comments 10 pages

详情
AI中文摘要

快速的生物多样性丧失凸显了有效监测的紧迫性,但人工调查仍然耗费大量资源。尽管设备端AI提供了一种可扩展的替代方案,但其在野外的性能常常受到环境变化的挑战。当前方法严重依赖云资源,需要持续上传现场数据以重新训练模型。这种方法不适合偏远地区的部署,因为它会消耗有限的电力和网络连接。为了解决这些限制,本研究提出从模型适应转向知识适应。我们提出一种将视觉感知与推理分离的架构,结合视觉编码器和动态知识库。我们使用显式知识库,而不是将专家知识隐式编码到模型参数中。这种方法还通过以结构化形式保存专家见解来支持知识的可持续性。通过与生物学家和原住民社区的跨学科合作,这项工作推进了伦理AI的协同开发,促进负责任且尊重文化背景的生态系统管理。

英文摘要

Rapid biodiversity loss underscores the urgency of effective monitoring, yet manual surveys remain resource-intensive. While on-device AI offers a scalable alternative, its performance in the wild is often challenged by environmental variability. Current methods rely heavily on cloud resources, which requires continuous uploading of field data for model retraining. This approach is unsuitable for remote deployments because it consumes limited power and network connectivity. To address these constraints, this research proposes a shift from model adaptation to knowledge adaptation. We introduce an architecture that separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base. We use an explicit knowledge base instead of implicitly encoding expert knowledge into model parameters. This method also supports knowledge sustainability by preserving expert insights in a structured form. Through cross-disciplinary collaboration with biologists and Indigenous communities, this work advances ethical AI co-development, fostering responsible and culturally informed ecosystem management.

2605.16649 2026-05-19 cs.CV 版本更新

AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

AtlasVid: 通过解耦的全局-局部建模实现高效超高清长视频生成

Ziyang Mai, Yuyao Zhang, Yu-Wing Tai

发表机构 * Dartmouth College(达特茅斯学院)

AI总结 本文提出AtlasVid框架,通过解耦建模提升超高清长视频生成效率,实现60.9倍加速和更低训练成本,优于原生4K生成器。

详情
AI中文摘要

近期基于扩散的视频生成器在视觉保真度和提示可控性方面取得了显著进展,但将其扩展到超高清(UHR)长视频的成本仍然高得令人望而却步。这一难点在长单镜头生成中尤为突出:连续场景必须在不依赖剪辑转场或自回归镜头拼接的情况下,同时保持全局时间一致性和精细的空间细节。本文从解耦建模的角度重新审视这一挑战。我们认为现有视频扩散模型已经编码了强局部视觉先验,而主要瓶颈在于如何随着分辨率和时长的增加高效扩展全局时空建模。基于此见解,我们提出AtlasVid,一种解耦的全局-局部框架,用于高效UHR长视频生成。AtlasVid首先通过时间缩放RoPE生成低分辨率、低FPS的全局语义代理,从而扩展时间范围而不增加训练token数量。在该代理的引导下,高分辨率细节分支采用分层局部性保持注意力进行联合去噪。重新排列的时空窗口保持几何局部性,不对称的全局-局部注意力则注入对齐的语义指导并保留模型的预训练能力。此设计使模型具备分辨率无关的训练能力:模型仅在720P上训练,使用轻量LoRA适配,即可直接泛化到4K及更高分辨率,并用于更长(>10秒)的视频合成。实验表明,AtlasVid显著提升了超高清长视频生成的效率,在实现高质量UHR长视频生成的同时速度提升60.9倍,训练成本显著降低,性能甚至优于原生4K视频生成器。

英文摘要

Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation, where a continuous scene must preserve global temporal coherence and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlasVid, a decoupled global-local framework for efficient UHR long video generation. AtlasVid first generates a low-resolution and low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality, and asymmetric global-local attention injects aligned semantic guidance while preserving the model's pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (>10s) video synthesis. Experiments show that AtlasVid substantially improves the efficiency of ultra-high-resolution long video generation, achieving high-quality UHR long video generation with a 60.9x speedup, significantly lower training cost, and even better performance than native 4K video generators.

2605.14963 2026-05-19 cs.CV 版本更新

H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

H-OmniStereo:基于方向对齐法线先验的零样本全方位立体匹配

Chenxing Jiang, Zhe Tong, Pusen Gao, Peize Liu, Yang Xu, Chuan Fang, Ping Tan, Shaojie Shen

发表机构 * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology(电子与计算机工程系,香港科技大学)

AI总结 本文提出H-OmniStereo框架,通过构建高质量合成数据集和引入方向对齐法线估计器,解决全方位立体匹配中数据稀缺和视角先验退化问题,实现更高精度和跨视角一致性。

Comments 8 pages, 9 figures

详情
AI中文摘要

在顶底等距矩形图像上的立体匹配为全方位感知提供了有效框架,因为垂直对齐的极线使得可以采用主要由大规模数据集和单目先验驱动的先进透视立体架构。然而,此类适配的性能严重受限于全方位立体数据集的稀缺性以及球面畸变下透视单目先验的退化。为解决这些挑战,我们提出H-OmniStereo,一种零样本全方位立体匹配框架。首先,我们构建了包含超过280万对顶底等距矩形立体对的高质量合成数据集以扩大训练规模。其次,我们引入等距矩形单目法线估计器,专门在方向对齐坐标系中运行。除了提供抗畸变和跨视角一致的几何先验以在立体匹配中建立可靠的对应关系外,该设计还提升了训练效率,并能适应训练与测试视场角(FoV)的不匹配。大量实验表明,我们的方法在域外数据集上比现有方法更准确,并能以单个模型成功泛化到真实的消费级相机配置。模型和数据集将在https://github.com/JIANG-CX/H-OmniStereo发布。

英文摘要

Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator that operates in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. The model and dataset will be released at https://github.com/JIANG-CX/H-OmniStereo.

2605.14854 2026-05-19 cs.CV cs.AI 版本更新

FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

FactorizedHMR:视频人体网格恢复的混合框架

Patrick Kwon, Chen Chen

发表机构 * Institute of Artificial Intelligence(人工智能研究所) University of Central Florida(中佛罗里达大学)

AI总结 本文提出FactorizedHMR框架,通过确定性回归模块和概率流匹配模块分别处理人体不同部位的恢复问题,结合复合目标表示和几何感知监督提升模糊部位的恢复效果,实现在遮挡和漂移敏感度指标上的优势。

详情
AI中文摘要

人体网格恢复(HMR)本质上具有歧义性:在遮挡或弱深度线索下,同一图像证据可能由多个3D身体解释。这种歧义性并非均匀分布于全身,躯干姿态和根结构通常相对受约束,而远端关节如手臂和腿部则更不确定。基于此观察,我们提出FactorizedHMR,一种两阶段框架,分别处理这两种情形。一个确定性回归模块首先恢复稳定的躯干-根锚点,一个概率流匹配模块则完成剩余的非躯干关节。为使完成可靠,我们结合复合目标表示与几何感知监督和特征感知分类器自由引导,保留躯干-根锚点的同时提升易产生歧义的关节的单参考恢复。我们还引入了一个合成数据管道,提供在多种视角下的配对图像-相机-运动监督。在相机空间和世界空间基准测试中,FactorizedHMR与强基线竞争,尤其在遮挡密集恢复和漂移敏感世界空间指标上表现最突出。

英文摘要

Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

2605.13322 2026-05-19 cs.CV cs.LG 版本更新

KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

KamonBench:一种基于语法规则的数据集,用于评估视觉-语言模型中的组合因子恢复

Richard Sproat, Stefano Peluchetti

AI总结 KamonBench通过20000个合成复合徽章及辅助组件示例,提供评估视觉-语言模型中稀疏组合识别和因子恢复的可控测试环境,支持程序代码因子度量和可控因子对重组。

Comments Preprint

详情
AI中文摘要

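家纹(kamon)是日本文化的重要组成部分,也是组合式视觉识别的天然测试案例:每个家纹由少量符号选择组合而成,但可能描述的空间是稀疏的。我们提出KamonBench,一个基于语法的图像到结构基准,包含20,000个合成复合家纹及辅助组件示例。每个复合家纹都配有一条采用形式化家纹描述语言("kamon yōgo")的描述、一份分词的日语分析、一份英文翻译以及一段非语言的程序代码。由于每个合成家纹都由已知因子(即容器、修饰符和主题纹样)生成,KamonBench支持超越描述级准确率的评估:直接的程序代码因子度量、受控的因子对重组划分、固定容器-修饰符上下文下的反事实主题敏感性分组,以及因子可访问性的线性探针。我们给出了一个ViT编码器/Transformer解码器和两个VGG n-gram解码器(带有和不带有学习的位置掩码)的基线结果。因此,KamonBench为视觉-语言模型中的稀疏组合视觉识别和因子恢复提供了一个受控的测试平台。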

英文摘要

Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a description in a formal kamon description language ("kamon yōgo"), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.

2605.11871 2026-05-19 cs.CV 版本更新

$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

$h$-control:通过块条件吉布斯细化实现无需训练的相机控制

Yuzhu Wang, Xi Ye, Duo Su, Yangyang Xu, Jun Zhu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) South China University of Technology(华南理工大学)

AI总结 本文提出$h$-control,通过改进采样器结构,解决免训练视频生成中相机控制的逆向问题,提升轨迹一致性与视觉质量的平衡,实现在多个数据集上的最佳表现。

详情
AI中文摘要

对预训练的流匹配视频生成器进行无需训练的相机控制,是一个部分观测逆问题:深度扭曲的引导视频为潜变量位置的一个子集提供带噪证据,采样器必须将其与预训练先验相协调。现有方法难以在轨迹遵循度与视觉质量之间取得平衡,且启发式的引导强度调整缺乏鲁棒性。我们提出$h$-control,通过对采样器的结构性改动来解决这一困境:每个外层硬替换引导步骤都辅以内循环的块条件伪吉布斯细化,在同一噪声水平下作用于未观测补集,并可证明收敛到部分观测条件数据分布。为加速高维视频潜变量上的收敛,我们利用其条件局部性,将未观测补集划分为3D块,每个块由自定义的混合指示器跟踪,从而自适应地冻结已收敛的块。在RealEstate10K和DAVIS数据集上,$h$-control相对全部七个免训练和基于训练的竞争方法取得最佳FVD,并在每项报告的指标上优于所有免训练基线。

英文摘要

Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance the trade-off between trajectory adherence and visual quality and the heuristic guidance-strength tuning lacks robustness. We propose $h$-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, $h$-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.

2605.11208 2026-05-19 cs.CV 版本更新

Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

Hi-GaTA:用于外科视频报告生成的分层门控时间聚合适配器

Kedi Sun, Chaohui Dang, Yue Feng, James Glasbey, Theodoros N. Arvanitis, Le Zhang

发表机构 * School of Engineering, College of Engineering and Physical Sciences, University of Birmingham, Birmingham, UK(英国伯明翰大学工程学院) School of Computer Science, University of Birmingham, Birmingham, UK(英国伯明翰大学计算机科学学院) Department of Applied Health Sciences, University of Birmingham, Birmingham, UK(英国伯明翰大学应用健康科学系)

AI总结 本文提出Hi-GaTA框架,通过时间聚合压缩长视频序列生成LLM兼容的视觉前缀令牌,结合预训练的外科专用视频编码器和LoRA微调,实现高质量外科报告生成。

Comments 11 pages, 2 figures

详情
AI中文摘要

自动化、临床级的外科手术评估报告可减少文档负担并提供客观反馈,但面临视频时空表示与语言推理对齐困难及高质量隐私数据稀缺的挑战。为此,我们建立包含214个高质量模拟外科视频及外科医生撰写的评估报告的基准。基于此资源,我们提出包含Hi-GaTA的感知-对齐-推理框架,其中Hi-GaTA是一种新型轻量级时间适配器,通过短到长范围时间聚合高效压缩长视频序列为紧凑的LLM兼容视觉前缀令牌。为实现稳健的视觉感知,我们预训练了Sur40k,一种针对外科专用的ViViT风格视频编码器,在40,000分钟的公开外科视频上进行预训练以捕捉细粒度的时空手术先验。Hi-GaTA采用带有文本条件双交叉注意力的时间金字塔,并通过跨层门控融合和递增深度策略提高多尺度一致性。最后,我们使用LoRA微调LLM主干以在有限监督下实现连贯且风格一致的外科报告生成。实验表明,我们的方法在整体性能上最佳,且在强大的多模态大语言模型(MLLM)基线中表现出一致的优势。消融研究进一步验证了每个提出组件的有效性。

英文摘要

Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.

2605.10759 2026-05-19 cs.LG cs.CV 版本更新

Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models

Reinforce伴随匹配:扩展扩散与流匹配模型的强化学习后训练

Andreas Bergmeister, Stefanie Jegelka, Nikolas Nüsken, Carles Domingo-Enrich, Jakiw Pidstrigach

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) King's College London(伦敦国王学院) Microsoft Research New England(微软研究院新英格兰分部) University of Oxford(牛津大学)

AI总结 本文提出Reinforce Adjoint Matching(RAM),一种用于扩散和流匹配模型强化学习后训练的一致性损失,保留预训练的回归结构,无需SDE rollout、反向伴随扫描或奖励梯度;在Stable Diffusion 3.5M上,达到Flow-GRPO峰值奖励所需的训练步数最多可减少50倍。

详情
AI中文摘要

扩散和流匹配模型之所以能够扩展,是因为预训练是监督回归:对干净样本进行解析式加噪,模型再对闭式目标做回归。强化学习后训练使模型与奖励对齐。在图像生成中,这使样本能够正确组合物体、清晰渲染文本并符合人类偏好。现有方法依赖代价高昂的SDE rollout、奖励梯度或替代损失,牺牲了预训练的回归结构。我们证明这一结构可以延伸到强化学习后训练。在KL正则化的奖励最大化下,最优生成过程使干净端点的分布向奖励更高的样本倾斜,而加噪规律保持不变。将这一点与伴随匹配(adjoint matching)的最优性条件和REINFORCE恒等式相结合,我们推导出Reinforce Adjoint Matching(RAM):一种用奖励修正预训练目标的一致性损失。每一步,我们从当前模型中采样一个干净端点,评估其奖励,按预训练的方式对其加噪,然后做回归。无需SDE rollout、反向伴随扫描或奖励梯度。与预训练目标一样,RAM简单且可扩展。在Stable Diffusion 3.5M上,RAM在组合性、文本渲染和人类偏好方面取得最高奖励,达到Flow-GRPO峰值奖励所需的训练步数最多可减少50倍。

英文摘要

Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining's regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining target with the reward. At each step, we draw a clean endpoint from the current model, evaluate its reward, noise it as in pretraining, and regress. No SDE rollouts, backward adjoint sweeps, or reward gradients are required. Like the pretraining objective, RAM is simple and scales. On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference, reaching Flow-GRPO's peak reward in up to $50\times$ fewer training steps.
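The tilting statement above can be checked on a toy discrete distribution. The sketch below is not the RAM loss; it only illustrates the standard closed form for KL-regularized reward maximization, p*(x) ∝ p(x) exp(r(x)/β). The base distribution, the rewards, and the weight `beta` are made-up values, and the function names are hypothetical.

```python
import math

def tilt(base, rewards, beta):
    """Closed-form maximizer of E_q[r] - beta * KL(q || base) over distributions q."""
    weights = [p * math.exp(r / beta) for p, r in zip(base, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def objective(q, base, rewards, beta):
    """KL-regularized reward: E_q[r] - beta * KL(q || base)."""
    expected_reward = sum(qi * ri for qi, ri in zip(q, rewards))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, base) if qi > 0)
    return expected_reward - beta * kl

base = [0.5, 0.3, 0.2]      # toy "pretrained" distribution over three clean samples
rewards = [0.0, 1.0, 2.0]   # toy rewards (assumed values)
beta = 0.5                  # hypothetical KL weight

tilted = tilt(base, rewards, beta)
```

The tilted distribution moves mass toward the high-reward sample while staying a reweighting of the base distribution; a smaller `beta` tilts harder.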

2605.09619 2026-05-19 cs.CV 版本更新

GSMap: 2D Gaussians for Online HD Mapping

GSMap:用于在线高精度地图的2D高斯

Zhenxuan Zeng, Lingxuan Wang, Sheng Yang, Yanan He, Mingxia Chen, Wei Suo, Peng Wang

发表机构 * School of Computer Science, Northwestern Polytechnical University, China(西北工业大学计算机学院) Unmanned Vehicle Dept, Cainiao Inc., Alibaba Group, China(菜鸟网络无人车部门)

AI总结 GSMap通过可学习的2D高斯表示统一了向量化和光栅化方法,提升高精度地图的几何与拓扑学习性能,实验表明其在nuScenes和Argoverse2上表现优异。

Comments Preprint

详情
AI中文摘要

准确的高精度(HD)地图构建对自动驾驶至关重要,但现有方法面临根本性权衡:基于向量化的方案保持拓扑但难以保证几何精度,而基于光栅化的方案能提供精确的几何监督但输出不结构化。为此,我们提出GSMap,一种新的框架,通过可学习的2D高斯表示统一两种范式。每个地图元素被建模为一个有序的2D高斯序列,其中心对应于向量化的折线/多边形的顶点。这种形式使同时优化成为可能:(1)可微光栅化强制像素级几何约束,(2)拓扑感知的向量化保持结构规律性。在nuScenes和Argoverse2上的实验表明,基于高斯的表示有效统一了几何和拓扑学习,实现了显著的性能提升,并展示了与现有HD地图架构的良好兼容性。代码将在https://github.com/peakpang/GSMap上提供。

英文摘要

Accurate High-Definition (HD) map construction is critical for autonomous driving, yet existing methods face a fundamental trade-off: vectorization-based approaches preserve topology but struggle with geometric fidelity, while rasterization-based approaches enable precise geometric supervision but produce unstructured outputs. To bridge this gap, we propose GSMap, a novel framework that unifies both paradigms via a learnable 2D Gaussian representation. Each map element is modeled as an ordered sequence of 2D Gaussians, whose centers correspond to the vertices of the vectorized polyline/polygon. This formulation enables simultaneous optimization through: (1) Differentiable rasterization that enforces pixel-level geometric constraints, and (2) Topology-aware vectorization that maintains structural regularity. Experiments on both nuScenes and Argoverse2 demonstrate that our Gaussian-based representation effectively unifies geometric and topological learning, achieving significant performance improvements and demonstrating strong compatibility with existing HD mapping architectures. Code will be available at https://github.com/peakpang/GSMap
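One way to picture the representation described above is to splat an ordered list of polyline vertices onto a raster grid as 2D Gaussians. The sketch below assumes isotropic Gaussians with a fixed, hypothetical `sigma` and max-compositing across Gaussians; the paper's actual parameterization and rasterizer are not given in the abstract.

```python
import math

def rasterize_gaussians(centers, sigma, height, width):
    """Render an ordered list of (x, y) Gaussian centers into a height x width soft mask.

    Each pixel takes the maximum response over all Gaussians (an assumption made
    here for simplicity), so values lie in [0, 1].
    """
    image = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            best = 0.0
            for cx, cy in centers:
                d2 = (col - cx) ** 2 + (row - cy) ** 2
                best = max(best, math.exp(-d2 / (2.0 * sigma ** 2)))
            image[row][col] = best
    return image

# Toy lane divider: the ordered centers double as the vertices of the vector polyline.
polyline = [(1.0, 1.0), (3.0, 2.0), (5.0, 3.0)]
mask = rasterize_gaussians(polyline, sigma=0.8, height=5, width=7)
```

Because the same ordered centers serve as both raster primitives and polyline vertices, a pixel-level loss on `mask` and a vertex-level loss on `polyline` would act on shared parameters, which is the unification the abstract describes.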

2605.08925 2026-05-19 cs.CV 版本更新

ClickSeg3D: Few-Click Interactive Segmentation via Semantic Embeddings

ClickSeg3D: 通过语义嵌入实现少点击交互分割

Xueyang Kang, Zijian Yu, Kourosh Khoshelham, Liangliang Nan

发表机构 * School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电子与电气工程学院) University of Science and Technology of China(中国科学技术大学) Faculty of Architecture and the Built Environment, Delft University of Technology(代尔夫特理工大学建筑与环境学院)

AI总结 本文提出ClickSeg3D,通过语义嵌入实现高效3D实例分割,采用点集直接处理和多对象点击联合推理,提升分割精度与效率,适用于实时应用。

Comments 15 pages, 9 figures, 6 tables

详情
AI中文摘要

交互式分割利用用户提供的点击逐步细化预测,从而高效生成标签;当全监督标注成本高昂或需要泛化到未见类别时,这一点尤为关键。现有3D交互方法存在局限:大多数按顺序工作,每次迭代仅以二值掩码预测一个对象,而近期的一些方法依赖2D基础模型和相机对齐来弥合2D与3D之间的差距。为解决这些局限,我们提出一种新的交互式分割框架,直接在稀疏、随机下采样的3D点上操作,并在单次前向传播中处理多个对象的点击。该框架由基于点Transformer的编码器和层次化掩码解码器组成,后者整合了以可学习语义嵌入为条件的多级裁剪-合并操作。与以往在每次人工修正点击后都需要重复更新模型的交互方法不同,我们的方法对所有点击查询进行联合推理,建模实例间的关系,并通过空间和语义嵌入同时细化空间掩码和语义预测。大量实验表明,我们的模型相比强基线将mIoU提高20%以上,并在每个实例一次点击的设置下的跨数据集评估中取得8-10%的提升,通常每个对象只需一次点击。我们的方法为交互式3D实例分割提供了通用且高效的解决方案,尤其适用于机器人操作、导航和快速3D语义标注等实时应用。

英文摘要

Interactive segmentation allows efficient label generation by leveraging user-provided clicks to progressively refine predictions, which is critical when fully supervised labels are costly or generalization to unseen classes is needed. Existing 3D interactive methods are limited: most operate sequentially, predicting only one object per iteration with binary masks, while several recent approaches depend on 2D foundation models and camera alignment to bridge the 2D-3D gap. To address these limitations, we propose a novel interactive segmentation framework that operates directly on sparse, randomly downsampled 3D points and processes multiple object clicks in a single forward pass. Our framework consists of a point Transformer-based encoder and a hierarchical mask decoder, which integrates multi-level crop-and-merge operations conditioned on learnable semantic embeddings. Unlike prior interactive approaches that require repeated model updates after each manually corrective click, our method jointly reasons over all click queries, modeling inter-instance relationships and refining both spatial masks and semantic predictions through spatial and semantic embeddings. Extensive experiments demonstrate that our model improves the mIoU metric by over 20 percent compared to strong baselines and achieves 8-10 percent gains under cross-dataset evaluation for a one-click per instance setting, often requiring only a single click per object. Our approach provides a generalizable and efficient solution for interactive 3D instance segmentation, particularly suitable for real-time applications such as robotic manipulation, navigation, and rapid 3D semantic annotation.

2605.08567 2026-05-19 cs.CV 版本更新

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

ACWM-Phys:探究动作条件化视频世界模型中的广义物理交互

Haotian Xue, Yipu Chen, Liqian Ma, Zelin Zhao, Lama Moukheiber, Yuchen Zhu, Yongxin Chen

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出ACWM-Phys基准,用于评估动作条件化预测在多样物理动态下的性能,通过系统实验发现模型在物理规则和任务复杂度上存在泛化差异,指导了物理基础世界模型的设计。

详情
AI中文摘要

动作条件化世界模型(ACWMs)在视频预测和决策中展现出巨大潜力。然而,现有基准大多局限于第一人称视角(egocentric)导航或狭窄的、面向特定任务的机器人数据集,对通用世界理解所需的丰富物理交互覆盖有限。我们引入ACWM-Phys,一个新基准,用于在干净、可控且动作空间经过精心设计的模拟环境中,评估多样化物理动态下的动作条件化预测。ACWM-Phys包含涵盖刚体动力学、运动学、可变形物体交互和粒子动力学的训练和评估数据。为同时评估插值和泛化能力,我们设计了分布内和分布外协议,对交互模式或场景配置施加受控的偏移。由于基准构建于完全可控的模拟器中,ACWM-Phys支持精确的数据收集、可复现的评估,以及对模型在物理真实的世界建模方面能力的系统分析。通过在ACWM-DiT上的系统实验,我们发现分布外泛化不仅取决于物理类型,还取决于有效任务复杂度:模型在视觉简单、低维且几何结构清晰的交互上泛化良好,但在可变形接触、高维控制和复杂关节运动上性能下降更大。这表明模型仍严重依赖视觉外观模式,而非真正学到底层物理规律。消融实验显示,交叉注意力改善了高维动作条件化,因果VAE优于逐帧编码器,更大的动作空间更难建模,但能提供更丰富的控制信号,从而改善泛化。这些发现可为具有物理基础的世界模型的设计提供指导。

英文摘要

Action-conditioned world models (ACWMs) have shown strong promise for video prediction and decision-making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task-specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM-Phys, a new benchmark for evaluating action-conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in-distribution and out-of-distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark in a fully controllable simulator, ACWM-Phys enables precise data collection, reproducible evaluation, and systematic analysis of model capabilities for physically grounded world modeling. Through systematic experiments on ACWM-DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics. Ablations show that cross-attention improves high-dimensional action conditioning, causal VAEs outperform frame-wise encoders, and larger action spaces are harder to model but can improve generalization by providing richer control signals. These findings guide the design of physically grounded world models.

2605.02759 2026-05-19 cs.RO cs.CV 版本更新

DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation

DynoSLAM:基于生成图神经网络的动态SLAM用于现实世界的社交导航

Danil Tokhchukov, Veronika Morozova, Gonzalo Ferrer

发表机构 * Applied AI Institute(应用人工智能研究所)

AI总结 本文提出DynoSLAM,通过整合社交感知图神经网络,解决动态环境中SLAM的不确定性问题,提升机器人在拥挤环境中的导航能力。

Comments Code & Project page at https://github.com/makriot/dynoslam

详情
AI中文摘要

传统的同时定位与建图(SLAM)算法严重依赖静态环境假设,这极大限制了其在存在行人等移动实体的真实空间中的适用性。本文提出DynoSLAM,一种紧耦合的动态GraphSLAM架构,将社交感知的图神经网络(GNN)直接整合到因子图优化中。与使用僵化的匀速启发式或确定性单智能体神经先验的传统方法不同,我们的框架将行人运动预测建模为随机世界模型。通过对训练好的GNN进行蒙特卡洛rollout采样,我们捕捉人类交互的多模态认知不确定性,并通过动态马氏距离因子将其嵌入SLAM图中。大量模拟实验表明,这种随机建模不仅保持了高精度的回溯跟踪,还能避免确定性“argmax问题”导致的优化失败。最终,提取未来行人状态的经验均值和协方差矩阵,可为下游局部规划器提供数学上严谨的概率安全包络,使机器人能够在高度拥挤的环境中进行预判式、无碰撞的导航。

英文摘要

Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model. By utilizing Monte Carlo rollouts from a trained GNN, we capture the multimodal epistemic uncertainty of human interactions and embed it into the SLAM graph via a dynamic Mahalanobis distance factor. We demonstrate through extensive simulated experiments that this stochastic formulation not only maintains highly accurate retrospective tracking but also prevents the optimization failures caused by the deterministic "argmax problem". Ultimately, extracting the empirical mean and covariance matrices of future pedestrian states provides a mathematically rigorous, probabilistic safety envelope for downstream local planners, enabling anticipatory and collision-free robot navigation in densely crowded environments.
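The last step above, turning sampled futures into a mean, a covariance, and a Mahalanobis distance, can be sketched in a few lines. The rollouts below are hard-coded toy 2D positions standing in for GNN samples, and the helper names are hypothetical; how the paper weights this term inside the factor graph is not stated in the abstract.

```python
import math

def mean_and_covariance(samples):
    """Empirical mean and unbiased 2x2 covariance of a list of (x, y) samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis(point, mean, cov):
    """Mahalanobis distance of a 2D point, using the closed-form 2x2 inverse."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # [dx dy] * inverse(cov) * [dx dy]^T, with inverse(cov) = 1/det * [[d, -b], [-b, a]]
    q = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

# Toy stand-ins for Monte Carlo rollouts of one pedestrian's position one step ahead.
rollouts = [(2.0, 1.0), (2.4, 1.2), (1.8, 0.9), (2.2, 1.5), (1.6, 0.9)]
mean, cov = mean_and_covariance(rollouts)
```

A candidate robot position can then be scored with `mahalanobis(position, mean, cov)`: small values fall inside the predicted pedestrian spread, large values fall outside it.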

2605.02167 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution

用于可靠特征归因的流形对齐引导积分梯度

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

发表机构 * Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院(KAIST)金在哲人工智能研究生院)

AI总结 本文提出流形对齐的引导积分梯度(MA-GIG),在预训练变分自编码器的潜在空间中构建归因路径,减少偏离数据流形的噪声,提升特征归因的可靠性。

Comments 32 pages, 13 figures, 12 tables. Accepted to ICML 2026; includes appendix

详情
AI中文摘要

特征归因是诊断和信任深度神经网络的核心,积分梯度(Integrated Gradients,IG)因其公理化性质而被广泛使用。然而,当基线与输入之间的积分路径经过梯度噪声较大的区域时,IG可能产生不可靠的解释。引导积分梯度(Guided Integrated Gradients)通过自适应地更新梯度幅值较小的特征来降低这种敏感性,但在输入空间中进行引导仍会产生偏离数据流形的中间输入。为了解决这一局限,我们提出流形对齐的引导积分梯度(MA-GIG),在预训练变分自编码器的潜在空间中构建归因路径。通过解码中间潜在状态,MA-GIG使路径偏向学习到的生成流形,减少路径落入不合理输入空间区域的情况。通过定性与定量评估,我们证明MA-GIG通过在靠近输入的路径特征上聚合梯度,产生忠实的解释。因此,我们的方法减少了偏离流形的噪声,并在多个数据集和分类器上优于先前基于路径的归因方法。我们的代码可在https://github.com/leekwoon/ma-gig/获取。

英文摘要

Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating low-gradient-magnitude features, input-space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose Manifold-Aligned Guided Integrated Gradients (MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder. By decoding intermediate latent states, MA-GIG biases the path toward the learned generative manifold and reduces exposure to implausible input-space regions. Through qualitative and quantitative evaluations, we demonstrate that MA-GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method reduces off-manifold noise and outperforms prior path-based attribution methods across multiple datasets and classifiers. Our code is available at https://github.com/leekwoon/ma-gig/.
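For readers unfamiliar with the baseline method, plain Integrated Gradients along a straight path can be sketched as follows. This is ordinary IG, not Guided IG or MA-GIG; the toy function, its hand-written gradient, and the midpoint Riemann sum are illustrative assumptions.

```python
def f(x):
    """Toy differentiable 'model': f(x) = x0^2 + 3*x0*x1."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_f(x):
    """Analytic gradient of f."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

def integrated_gradients(x, baseline, steps=50):
    """Midpoint-rule approximation of IG along the straight line baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(x, baseline)
```

The attributions sum to `f(x) - f(baseline)`, which is the completeness axiom commonly cited for IG. Every intermediate `point` here lies on a straight line in input space, which is the kind of path the abstract says can leave the data manifold.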

2605.01815 2026-05-19 cs.CV 版本更新

Cross-Domain Adversarial Augmentation: Stabilizing GANs for Medical and Handwriting Data Scarcity

跨领域对抗增强:稳定GANs以应对医学和手写数据稀缺

Md. Sohanuzzaman Soad, Mahady Al Hady, S M Rafiuddin Rifat, Sudip Ghose

发表机构 * University of Asia Pacific(亚太大学)

AI总结 本文探讨了在低资源领域中使用生成数据增强技术,通过DCGAN模型生成合成样本,并评估其在医学影像和手写识别中的有效性,提升了有限数据下的分类性能。

Comments 11 Pages, 7 figures, 2 tables

详情
AI中文摘要

生成对抗网络(GANs)可通过生成额外训练样本来缓解计算机视觉任务中的数据稀缺问题。本文在两个低资源领域中探索生成数据增强:孟加拉语手写字符识别和胸部X光图像分析。我们使用基于DCGAN的模型训练64x64图像以生成合成样本,并通过Inception Score(IS)、Fréchet Inception Distance(FID)和t-SNE、UMAP等可视化方法评估其质量。为了衡量实用性,我们使用真实数据和真实与合成数据的组合训练图像分类器。实验结果表明,合成增强提高了数据多样性,并在有限数据设置中一致提升了分类性能。我们还研究了训练稳定性技术,包括梯度惩罚和谱归一化,并进行了合成到真实数据比例和样本过滤策略的消融研究。此外,我们讨论了医学图像评估、数据集许可和合成数据隐私问题的相关挑战。我们的方法简单、可复现,并为资源受限成像应用中的生成增强提供了强大的基线。

英文摘要

Generative Adversarial Networks (GANs) can help overcome data scarcity in computer vision tasks by generating additional training samples. In this work, we explore generative data augmentation in two low-resource domains: Bangla handwritten character recognition and chest X-ray image analysis. We use DCGAN-based models trained on 64x64 images to generate synthetic samples and evaluate their quality using Inception Score (IS), Fréchet Inception Distance (FID), and visualization methods such as t-SNE and UMAP. To measure practical usefulness, we train image classifiers using real data and a combination of real and synthetic data. Experimental results show that synthetic augmentation improves data diversity and consistently increases classification performance in limited-data settings. We also investigate training stability techniques, including gradient penalty and spectral normalization, and perform ablation studies on synthetic-to-real data ratios and sample filtering strategies. In addition, we discuss challenges related to medical image evaluation, dataset licensing, and privacy concerns of synthetic data. Our approach is simple, reproducible, and provides a strong baseline for generative augmentation in resource-constrained imaging applications.

2605.00793 2026-05-19 eess.IV cs.AI cs.CV 版本更新

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

基于感知注意力网络的真实临床低剂量肝脏CT无监督去噪

Zhilin Guan, Wei Zhang

发表机构 * Department of Computing(计算系) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本文提出基于感知注意力网络的无监督低剂量肝CT去噪框架,结合U-Net、注意力机制和残差网络,通过感知损失提升医疗图像特征提取,利用真实临床数据集和医学评价标准验证方法有效性,满足临床需求。

Comments 8 pages, 10 figures, 5 tables

详情
AI中文摘要

随着深度学习的发展,医学图像处理已广泛用于辅助临床研究。本文聚焦于利用深度学习进行低剂量计算机断层扫描(CT)的去噪问题。尽管低剂量CT减少了患者辐射暴露,但也引入了更多噪声,可能干扰医生的视觉解读并影响诊断结果。为了解决这个问题,受Cycle-GAN启发,本文提出了一种端到端的无监督低剂量CT去噪框架。该框架结合了U-Net结构进行多尺度特征提取、注意力机制进行特征融合、残差网络进行特征转换,并引入感知损失以提升网络对医疗图像特征的适应性。此外,我们构建了真实低剂量CT数据集,并设计了大量对比实验,通过图像基评估指标和医学评价标准验证所提方法。与经典方法相比,本文的主要优势在于解决了真实临床数据不能直接用于监督学习的限制,同时仍实现了优异的性能。实验结果也由影像医师专业评估,满足临床需求。

英文摘要

With the development of deep learning, medical image processing has been widely used to assist clinical research. This paper focuses on the denoising problem of low-dose computed tomography using deep learning. Although low-dose computed tomography reduces radiation exposure to patients, it also introduces more noise, which may interfere with visual interpretation by physicians and affect diagnostic results. To address this problem, inspired by Cycle-GAN for unsupervised learning, this paper proposes an end-to-end unsupervised low-dose computed tomography denoising framework. The proposed framework combines a U-Net structure for multi-scale feature extraction, an attention mechanism for feature fusion, and a residual network for feature transformation. It also introduces perceptual loss to improve the network for the characteristics of medical images. In addition, we construct a real low-dose computed tomography dataset and design a large number of comparative experiments to validate the proposed method, using both image-based evaluation metrics and medical evaluation criteria. Compared with classical methods, the main advantage of this paper is that it addresses the limitation that real clinical data cannot be directly used for supervised learning, while still achieving excellent performance. The experimental results are also professionally evaluated by imaging physicians and meet clinical needs.

2604.28134 2026-05-19 cs.CV 版本更新

MeshReGen: A Unified 3D Geometry Regeneration Framework

MeshReGen: 一种统一的3D几何再生框架

Geon Yeong Park, Roman Shapovalov, Rakesh Ranjan, Jong Chul Ye, Andrea Vedaldi, Thu Nguyen-Phuoc

发表机构 * KAIST(韩国科学技术院) Meta Reality Labs

AI总结 MeshReGen通过基于VecSet的条件机制,实现从2D图像和初始3D形状再生3D对象,支持增强、重建和编辑等任务,无需额外标注即可在多个任务中实现可控的3D生成。

Comments Project page: https://geonyeong-park.github.io/meshregen/ 32 pages, 18 figures, 6 tables. Includes Appendix

详情
AI中文摘要

我们考虑从2D图像和初始3D形状再生3D对象的问题。大多数3D生成器以单次操作方式工作,将文本或图像转换为3D对象,可控性有限。我们引入MeshReGen,一种基于初始3D形状的3D再生器。这种概念简单的公式允许我们支持多种有用的任务,包括3D增强、重建和编辑。MeshReGen使用基于VecSet的新条件机制,使再生器能够通过一致的细粒度细节更新或改进输入几何。MeshReGen通过自监督预训练任务和增强从现成的3D数据集中学习广泛适用的再生先验,无需额外标注。我们评估了MeshReGen的几何一致性和细粒度质量,在多个任务中实现了可控3D生成的最先进性能。

英文摘要

We consider the problem of regenerating 3D objects from 2D images and initial 3D shapes. Most 3D generators operate in a one-shot fashion, converting text or images to a 3D object with limited controllability. We introduce instead MeshReGen, a 3D regenerator that is conditioned on an initial 3D shape. This conceptually simple formulation allows us to support numerous useful tasks, including 3D enhancement, reconstruction, and editing. MeshReGen uses a new conditioning mechanism based on VecSet, which allows the regenerator to update or improve the input geometry with consistent fine-grained details. MeshReGen learns a widely applicable regeneration prior from off-the-shelf 3D datasets via self-supervised pretext tasks and augmentations, without additional annotations. We evaluate both the geometric consistency and fine-grained quality of MeshReGen, achieving state-of-the-art performance in controllable 3D generation across several tasks.

2604.10027 2026-05-19 cs.CV 版本更新

SinkTrack: Attention Sink based Context Anchoring for Large Language Models

SinkTrack: 基于注意力sink的上下文锚定用于大语言模型

Xu Liu, Guikun Chen, Wenguan Wang

发表机构 * The State Key Lab of Brain-Machine Intelligence(脑机智能国家重点实验室)

AI总结 SinkTrack通过将<BOS>作为信息锚点,注入关键上下文特征,缓解大语言模型的幻觉和上下文遗忘问题,实验显示在文本和多模态任务中均取得显著提升。

Comments ICLR 2026. Code: https://github.com/67L1/SinkTrack

详情
AI中文摘要

大语言模型(LLMs)面临幻觉和上下文遗忘问题,先前研究认为注意力漂移是主要原因,即LLMs的注意力转向新生成的token而远离初始输入上下文。为应对此问题,我们利用LLMs的一个相关内在特性:注意力sink——倾向于持续将高注意力分配给序列的第一个token(即<BOS>)。具体而言,我们提出了一种先进的上下文锚定方法SinkTrack,将<BOS>作为信息锚点,并将其表示中注入关键上下文特征(如来自输入图像或指令的特征)。因此,LLM在整个生成过程中始终保持对初始输入上下文的锚定。SinkTrack是无需训练的即插即用方法,且引入了极小的推理开销。实验表明,SinkTrack在文本(例如在SQuAD2.0上使用Llama3.1-8B-Instruct模型时提升21.6%)和多模态(例如在M3CoT上使用Qwen2.5-VL-7B-Instruct模型时提升22.8%)任务中均有效缓解了幻觉和上下文遗忘问题。其在不同架构和规模上的稳定提升凸显了其鲁棒性和泛化能力。我们还从信息传递的角度分析了其底层工作原理。源代码可在https://github.com/67L1/SinkTrack获取。

英文摘要

Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs' focus shifts towards newly generated tokens and away from the initial input context. To counteract this, we make use of a related, intrinsic characteristic of LLMs: attention sink -- the tendency to consistently allocate high attention to the very first token (i.e., <BOS>) of a sequence. Concretely, we propose an advanced context anchoring method, SinkTrack, which treats <BOS> as an information anchor and injects key contextual features (such as those derived from the input image or instruction) into its representation. As such, LLM remains anchored to the initial input context throughout the entire generation process. SinkTrack is training-free, plug-and-play, and introduces negligible inference overhead. Experiments demonstrate that SinkTrack mitigates hallucination and context forgetting across both textual (e.g., +21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multi-modal (e.g., +22.8% on M3CoT with Qwen2.5-VL-7B-Instruct) tasks. Its consistent gains across different architectures and scales underscore the robustness and generalizability. We also analyze its underlying working mechanism from the perspective of information delivery. Our source code is available at https://github.com/67L1/SinkTrack.

2604.08936 2026-05-19 cs.CV 版本更新

M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model

M-IDoL:面向医学基础模型的模态特定与多样化表示学习的信息分解

Yihang Liu, Longzhen Yang, Jiaxiong Yang, Ying Wen, Lianghua He, Heng Tao Shen

发表机构 * School of Computer Science and Technique(计算机科学与技术学院) Tongji University(同济大学) School of Communications and Electronic Engineering(通讯与电子工程学院) East China Normal University(华东师范大学)

AI总结 本文提出M-IDoL,通过信息分解提升医学基础模型的模态特异性和多样性,通过最大化跨模态熵和最小化内模态不确定性,在21个下游任务中实现优于现有模型的泛化能力。

详情
AI中文摘要

医学基础模型(MFMs)旨在从多模态医学图像中学习通用表示,以有效泛化到多样化的临床任务。然而,现有大多数MFMs面临信息模糊问题,将多模态表示融合到单一嵌入空间中,导致模态特异性和多样性下降。本文提出M-IDoL,一种自监督的MFMs,通过两个目标引入信息分解:i)通过将多模态表示分散到可分离的专家混合(MoE)子空间中,最大化跨模态熵以实现跨模态的表示特异性;ii)通过在每个MoE子空间内进行细粒度语义辨别,最小化内模态不确定性以丰富每个模态的表示多样性。通过在115万张医学图像上预训练,M-IDoL在21个下游临床任务中实现了优于20个基础模型的泛化能力,并学习了模态特定和多样化的表示,展示了跨模态特征簇的更清晰分离和每个模态内更细粒度的特征辨别。

英文摘要

Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blends multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised MFM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximizing inter-modality entropy by dispersing multimodal representations into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimizing intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature clusters across modalities and finer-grained feature discrimination within each modality.

2603.29167 2026-05-19 cs.CV 版本更新

JDCNet: Confidence-Gated Privileged-Modality Distillation for Cost-Preserving X-ray Inference

JDCNet:基于置信度门控的特权模态蒸馏用于成本保持的X射线推断

Bo Ma, Jinsong Wu, Weiqi Yan, Hongjiang Wei, Kun Liu

发表机构 * Auckland University of Technology(奥克兰理工大学) Guilin University of Electronic Technology(桂林电子科技大学) Hikvision Technology Co., Ltd(海康威视技术有限公司) Hebei University of Technology(河北工业大学)

AI总结 JDCNet是一种置信度门控的CT到X射线蒸馏框架:训练时使用特权模态CT,部署时保持固定成本的单模态X射线路径。在510名患者的同患者配对BIMCV队列上,两种配置相对ResNet-18监督基线通过了固定的迁移门槛,表明置信度门控的辅助目标比均匀软化的CT logits更易迁移。

详情
AI中文摘要

我们研究一个系统层面的视觉推断问题:在训练时使用昂贵的特权模态,同时保持固定成本的单模态部署路径。我们提出JDCNet,一种置信度门控的CT到X射线蒸馏框架:CT教师仅在其置信度超过阈值的训练样本上提供辅助的硬目标或温度缩放目标;部署时,学生仅以X射线为输入,其参数量、MAC和延迟与监督X射线基线一致。在一个包含510名患者的同患者配对BIMCV队列上,采用患者级5折交叉验证,两种JDCNet配置相对监督ResNet-18基线通过了固定的迁移门槛(transfer gate):3切片软KL监督得到ΔBA=+0.035(95% CI [+0.011, +0.057]),中间切片硬监督得到+0.033([+0.007, +0.058])。在相同的划分和门槛下,logit蒸馏、门控logit蒸馏、对比对齐、注意力迁移、特征提示(feature hints)、BiomedCLIP微调以及一个模块增强变体均未通过。因此,置信度门控的辅助目标是比均匀软化的CT logits更易迁移的通道;现有证据仅限于一个配对队列,在提出任何部署主张之前,仍需在外部配对队列上复现。

英文摘要

We study a systems-level visual inference problem: using an expensive privileged modality during training while preserving a fixed-cost, single-modality deployment path. We present JDCNet, a confidence-gated CT-to-X-ray distillation framework in which the CT teacher supplies an auxiliary hard or temperature-scaled target only on training samples whose teacher confidence exceeds a threshold; at deployment the student takes X-ray input alone and matches the parameter, MAC, and latency profile of the supervised X-ray baseline. On a 510-patient same-patient paired BIMCV cohort with patient-level 5-fold cross-validation, two JDCNet configurations clear a fixed transfer gate against the supervised ResNet-18 baseline: 3-slice soft-KL supervision yields $\Delta\mathrm{BA} = +0.035$ (95% CI $[+0.011, +0.057]$) and mid-slice hard supervision yields $+0.033$ ($[+0.007, +0.058]$). Under the same splits and gate, logit distillation, gated logit distillation, contrastive alignment, attention transfer, feature hints, BiomedCLIP fine-tuning, and a module-augmented variant do not pass. Confidence-gated auxiliary targets are therefore a more transferable channel than uniformly softened CT logits; the evidence is bounded to one paired cohort, so external paired-cohort replication is required before any deployment claim.
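The gating rule described above can be sketched as a per-sample loss. The threshold, temperature, and weight below are hypothetical values, and adding a KL term to cross-entropy is one plausible reading of "auxiliary temperature-scaled target", not the paper's confirmed loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def gated_distillation_loss(student_logits, teacher_logits, label,
                            threshold=0.8, temperature=2.0, weight=0.5):
    """Cross-entropy on the label, plus a soft KL term only if the teacher is confident."""
    student_probs = softmax(student_logits)
    loss = -math.log(student_probs[label])
    teacher_confidence = max(softmax(teacher_logits))
    if teacher_confidence > threshold:
        t_soft = softmax(teacher_logits, temperature)
        s_soft = softmax(student_logits, temperature)
        kl = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft))
        loss += weight * kl
    return loss

student = [0.2, 1.0]              # toy X-ray student logits
confident_teacher = [-2.0, 3.0]   # softmax confidence about 0.993, gate open
unsure_teacher = [0.1, 0.3]       # softmax confidence about 0.55, gate closed
```

With `unsure_teacher` the loss reduces to plain cross-entropy, so low-confidence CT predictions never reach the student; the teacher is used only inside the training loss, which is why the deployed student's cost is unchanged.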

2603.22570 2026-05-19 cs.CV 版本更新

CanViT: Toward Active-Vision Foundation Models

CanViT:迈向主动视觉基础模型

Yohaï-Eliel Berreby, Sabrina Du, Audrey Durand, B. Suresh Krishna

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克人工智能研究院) Université Laval(拉瓦尔大学)

AI总结 CanViT是首个任务和策略无关的主动视觉基础模型,利用场景相对RoPE将视网膜拓扑的Vision Transformer主干与场景级潜在工作空间(canvas)绑定,并提出非对称交叉注意力机制Canvas Attention,实现快速的序列推理和对高输出分辨率的扩展,在ADE20K和ImageNet-1k上刷新了主动视觉模型的最佳结果。

Comments v2: additional results: 84.5% IN1k accuracy after fine-tuning and effect of canvas resolution. Code and weights: https://github.com/m2b3/CanViT-PyTorch

详情
AI中文摘要

主动计算机视觉有望通过序列化、局部化的注视(glimpse)实现高效且符合生物学机制的感知,但缺乏可扩展的通用架构和预训练流程,使主动视觉基础模型(AVFM)仍未得到充分探索。我们引入CanViT,首个任务和策略无关的AVFM。CanViT利用场景相对RoPE,将视网膜拓扑的Vision Transformer主干与空间拓扑的场景级潜在工作空间(canvas)绑定在一起。与这一高容量工作记忆的高效交互由Canvas Attention支持,这是一种新的非对称交叉注意力机制。我们将思考(主干层面)与记忆(canvas层面)解耦,去除canvas一侧的自注意力和全连接层,从而实现快速的序列推理,并可扩展到高输出分辨率。我们提出一种无标签的主动视觉预训练方案,即策略无关的被动到主动稠密潜在蒸馏:从位置、缩放级别和长度均随机化的低分辨率注视序列中重建场景级DINOv3嵌入。我们从随机初始化开始,在1320万个ImageNet-21k场景(比以往主动模型多一个数量级)和10亿次随机注视上预训练CanViT-B,在单块H100上耗时166小时。在ADE20K分割上,冻结的CanViT-B仅用一次低分辨率注视即达到38.5% mIoU,超过此前最佳主动模型的27.6%,且推理FLOPs少20倍;同时也优于在FLOPs或输入量上与之对齐的DINOv3教师模型。给予更多注视后,CanViT-B在ADE20K上达到45.9% mIoU。在ImageNet-1k分类上,CanViT-B同样刷新了主动视觉的最佳水平,微调后top-1准确率达到84.5%。CanViT能够泛化到更长的rollout、更大的场景和新的策略。我们的工作缩小了被动与主动计算机视觉之间的巨大差距,展示了任务和策略无关的AVFM预训练的潜力。

英文摘要

Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines, leaving Active-Vision Foundation Models (AVFMs) underexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve fast sequential inference and scalability to high output resolutions. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes--an order of magnitude more than previous active models--and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model's 27.6% with 20x fewer inference FLOPs as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B also sets a new active-vision state of the art, with 84.5% top-1 accuracy after fine-tuning. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work narrows the wide gap between passive and active computer vision, demonstrating the potential of task- and policy-agnostic AVFM pretraining.

2603.21071 2026-05-19 cs.CV cs.AI 版本更新

CTFS : Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels

CTFS:面向极少标注条件下前视声呐图像语义分割的协作教师框架

Ping Guo, Chengzhou Li, Guanchen Meng, Qi Jia, Jinyuan Liu, Zhu Liu, Yu Liu, Zhongxuan Luo, Xin Fan

发表机构 * School of Software Technology, Dalian University of Technology(大连理工大学软件学院)

AI总结 本文提出CTFS框架,通过多教师协作机制提升声呐图像在极有限标注下的分割性能,通过跨教师可靠性评估机制减少噪声伪标签影响,实验显示在FLSMD数据集上2%标注时mIoU提升5.08%。

Comments Accepted to CVPR 2026 Finding. Code: https://github.com/pingggg516/CTFS

详情
AI中文摘要

作为最重要的水下传感技术之一,前视声呐具有独特的成像特性。声呐图像常受严重斑点噪声、低纹理对比度、声影和几何失真影响,使传统教师-学生框架在标注数据极其有限的条件下难以在声呐语义分割任务中取得满意性能。为解决此问题,我们提出一种用于前视声呐图像的协作教师语义分割框架。该框架引入由一个通用教师和多个声呐专用教师组成的多教师协作机制。通过采用多教师交替指导策略,学生模型在学习通用语义表示的同时捕捉声呐图像的独有特性,从而实现更全面和稳健的特征建模。考虑到声呐图像的上述挑战可能导致教师生成大量带噪伪标签,我们进一步设计了跨教师可靠性评估机制。该机制通过评估多视角和多教师间预测的一致性与稳定性,动态量化伪标签的可靠性,从而减轻带噪伪标签的负面影响。值得注意的是,在FLSMD数据集上,当仅标注2%的数据时,我们的方法在mIoU上比其他最先进的方法提高了5.08%。

英文摘要

As one of the most important underwater sensing technologies, forward-looking sonar exhibits unique imaging characteristics. Sonar images are often affected by severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions. These factors make it difficult for traditional teacher-student frameworks to achieve satisfactory performance in sonar semantic segmentation tasks under extremely limited labeled data conditions. To address this issue, we propose a Collaborative Teacher Semantic Segmentation Framework for forward-looking sonar images. This framework introduces a multi-teacher collaborative mechanism composed of one general teacher and multiple sonar-specific teachers. By adopting a multi-teacher alternating guidance strategy, the student model can learn general semantic representations while simultaneously capturing the unique characteristics of sonar images, thereby achieving more comprehensive and robust feature modeling. Considering the challenges of sonar images, which can lead teachers to generate a large number of noisy pseudo-labels, we further design a cross-teacher reliability assessment mechanism. This mechanism dynamically quantifies the reliability of pseudo-labels by evaluating the consistency and stability of predictions across multiple views and multiple teachers, thereby mitigating the negative impact caused by noisy pseudo-labels. Notably, on the FLSMD dataset, when only 2% of the data is labeled, our method achieves a 5.08% improvement in mIoU compared to other state-of-the-art approaches.

2603.19538 2026-05-19 cs.CV 版本更新

MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane

MoCA3D:单目3D边界框在图像平面中的预测

Changwoo Jeon, Rishi Upadhyay, Achuta Kadambi

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Yonsei University(延世大学)

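AI summary MoCA3D casts monocular 3D bounding box prediction as dense prediction and recovers image-plane geometry without camera intrinsics; experiments show better image-plane geometric consistency than existing methods.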

Comments 27 pages, 9 figures, including supplementary material

详情
AI中文摘要

单目3D物体理解通常被视为从2D区域到3D框的提升问题。然而,新兴下游应用需要图像平面几何(如投影3D框角点),这在缺乏已知内参时难以获得。我们引入MoCA3D,一种单目、类别无关的3D模型,能够在推理时无需相机内参,预测投影3D边界框角点及每个角点的深度。MoCA3D将像素空间定位和深度分配作为密集预测,通过角点热图和深度图实现。为评估图像平面几何保真度,我们提出像素对齐几何(PAG),直接测量图像平面角点和深度一致性。大量实验表明,MoCA3D在图像平面角点PAG上提升了22.8%,在3D IoU上保持竞争力,使用比现有方法少高达57倍的可训练参数。最后,我们将MoCA3D应用于之前在未知内参下难以实现的下游任务,展示了其在标准基线模型之外的实用性。

英文摘要

Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.
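
To make the dense-prediction formulation concrete, here is a minimal sketch of how one projected corner and its depth could be read out of a corner heatmap and a depth map. The argmax decoding and the name `decode_corner` are assumptions for illustration; the abstract does not specify the decoding rule MoCA3D uses.

```python
def decode_corner(heatmap, depth_map):
    """Return ((row, col), depth) at the peak of one corner heatmap.

    Hypothetical helper: takes the highest-scoring pixel as the corner
    location and reads the depth map at that same pixel.
    """
    best_score, best_rc = float("-inf"), (0, 0)
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best_score:
                best_score, best_rc = score, (r, c)
    r, c = best_rc
    return best_rc, depth_map[r][c]


# A 3x3 toy heatmap for a single box corner, with a matching depth map (metres).
heatmap = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.1],
    [0.0, 0.3, 0.0],
]
depth_map = [
    [5.0, 5.1, 5.2],
    [4.9, 4.2, 5.0],
    [5.0, 4.8, 5.1],
]
corner_px, corner_depth = decode_corner(heatmap, depth_map)
```

Repeating this for eight heatmaps would give the eight projected corners of one box, none of which requires the camera intrinsics.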

2603.19199 2026-05-19 cs.RO cs.CV 版本更新

FASTER: Rethinking Real-Time Flow VLAs

FASTER:重新思考基于流的实时VLA

Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao

发表机构 * The University of Hong Kong(香港大学) ACE Robotics(ACE机器人)

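AI summary This paper proposes FASTER, which introduces a Horizon-Aware Schedule to cut the reaction latency of real-time flow-based Vision-Language-Action (VLA) policies while preserving the quality of long-horizon trajectories in dynamic tasks.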

Comments Project page: https://innovator-zero.github.io/FASTER

详情
AI中文摘要

实时执行对于在物理世界中部署视觉-语言-动作(VLA)模型至关重要。现有异步推理方法主要优化轨迹平滑度,却忽视了对环境变化作出反应时的关键延迟。通过重新思考动作分块策略中的“反应”概念,本文系统分析了决定反应时间的因素。我们表明,反应时间服从由首个动作时间(TTFA)和执行时域共同决定的均匀分布。此外,我们揭示了在基于流的VLA中采用恒定调度的标准做法可能效率低下,它迫使系统在任何动作开始前完成所有采样步骤,从而成为反应延迟的瓶颈。为了解决这一问题,我们提出用于即时反应的快速动作采样方法(FASTER)。通过引入时域感知调度(Horizon-Aware Schedule),FASTER在流采样过程中自适应地优先处理近期动作,将即时反应的去噪过程压缩十倍(例如在π_{0.5}和X-VLA中),使其仅需一步,同时保持长时域轨迹的质量。结合流式客户端-服务器管线,FASTER显著降低了真实机器人上的有效反应延迟,在消费级GPU上部署时尤为明显。包括高动态乒乓球任务在内的真实世界实验表明,FASTER大幅提升了通用策略的实时响应能力,能够快速生成准确且平滑的轨迹。

英文摘要

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $\pi_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectories. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, show that FASTER unlocks substantially improved real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.

2603.18178 2026-05-19 cs.CV cs.AI 版本更新

VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

VLM-AutoDrive: 面向安全关键自动驾驶事件的视觉-语言模型后训练

Mohammad Qazim Bhat, Yufan Huang, Niket Agarwal, Hao Wang, Michael Woods, John Kenyon, Tsung-Yi Lin, Xiaodong Yang, Ming-Yu Liu, Kevin Xie

发表机构 * NVIDIA

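AI summary This paper proposes VLM-AutoDrive, a post-training framework that combines metadata-derived captions, LLM-generated descriptions, visual question answering pairs, and chain-of-thought reasoning supervision to improve how pretrained vision-language models detect safety-critical driving events.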

Comments 16 pages, 9 figures, submitted to arXiv

详情
AI中文摘要

随着第一人称视角行车记录仪(dashcam)视频的快速增长,检测碰撞和险些碰撞等安全关键事件成为重大挑战,这类场景短暂、罕见,且难以被通用视觉模型捕捉。尽管多模态大语言模型(MLLMs)展现出强大的通用推理能力,但由于领域和时间上的错位,它们在驾驶场景中表现不佳。我们引入VLM-AutoDrive,一种模块化的后训练框架,用于使预训练的视觉-语言模型(VLMs)适应高保真异常检测。该框架整合了由元数据生成的标题、LLM生成的描述、视觉问答(VQA)对以及思维链(CoT)推理监督,以实现领域对齐且可解释的学习。NVIDIA的Cosmos-Reason1 7B(CR1)等现成VLMs在零样本设置下的碰撞召回率接近零;经VLM-AutoDrive微调后,碰撞F1从0.00提升到0.69,整体准确率从35.35%提升到77.27%。VLM-AutoDrive提供了一种可扩展的方案,用于使通用VLMs适应安全关键、时间局部化的感知任务。在真实世界的Nexar行车记录仪视频上评估时,它在碰撞和险些碰撞检测上取得显著提升,同时生成可解释的推理轨迹,弥合了自动驾驶中感知、因果与决策推理之间的差距。

英文摘要

The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

2603.16947 2026-05-19 cs.CV cs.AI 版本更新

LightZeroNav: Zero-Shot Vision Language Navigation in Continuous Environments Based on Lightweight VLMs

LightZeroNav: 基于轻量级VLMs的连续环境中零样本视觉语言导航

Kun Luo, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou

发表机构 * Foshan Graduate School of Innovation, Northeastern University(东北大学佛山研究生创新学院) Faculty of Robot Science and Engineering, Northeastern University(东北大学机器人科学与工程学院) School of Aeronautic Science and Engineering, Beihang University(北京航空航天大学航空科学与工程学院) QingniaoAI, China(QingniaoAI,中国)

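AI summary This paper proposes LightZeroNav, which tackles three bottlenecks of zero-shot vision-language navigation in continuous environments with lightweight VLMs; without task-specific training or graph search, it achieves performance competitive with GPT-4o using only RGB observations and a lightweight Qwen3-VL-8B backbone.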

详情
AI中文摘要

尽管视觉语言导航(VLN)发展迅速,但在连续环境中使用轻量级视觉语言模型(VLMs)进行零样本VLN(VLN-CE)仍极具挑战性,因为这些模型的有限推理能力使长周期导航不可靠。本文提出LightZeroNav,以解决在使用轻量级VLMs进行零样本VLN-CE时的三大主要瓶颈,即多源输入的信息冗余、由噪声文本记忆引起的进度估计不准确以及动作执行与阶段转换之间的任务纠缠。仅使用RGB观测和轻量级开源Qwen3-VL-8B主干,LightZeroNav在无需特定训练、图搜索或路径预测器的情况下实现了与GPT-4o(~200B)相当的性能,证明了其在零样本VLN-CE中的有效性。

英文摘要

Although vision-language navigation (VLN) has progressed rapidly, zero-shot VLN in continuous environments (VLN-CE) remains highly challenging when using lightweight vision-language models (VLMs), whose limited reasoning capacity makes long-horizon navigation unreliable. In this paper, we propose LightZeroNav to tackle the three major bottlenecks when using lightweight VLMs in zero-shot VLN-CE, i.e., information redundancy from multi-source inputs, inaccurate progress estimation caused by noisy textual memory, and task entanglement between action execution and stage transition. Using only RGB observations and a lightweight open-source Qwen3-VL-8B backbone, LightZeroNav achieves competitive performance with GPT-4o (~200B) without task-specific training, graph search, or waypoint predictors, demonstrating its effectiveness in zero-shot VLN-CE.

2603.09286 2026-05-19 cs.CV 版本更新

CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation

CogBlender:迈向文本到图像生成中的连续认知干预

Shengqi Dang, Yi He, Jiaying Lei, Ziqing Qian, Nan Cao

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院)

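AI summary CogBlender uses a two-stage approach to achieve continuous, multi-dimensional intervention on cognitive properties in image generation, controlling psychological attributes such as valence and memorability.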

详情
AI中文摘要

除了传达语义信息,图像还具有能引发观看者特定心理反应(如记忆编码或情绪反应)的认知属性。尽管现代文本到图像(T2I)模型能有效生成语义连贯的内容,但它们难以控制认知属性(如效价、记忆性),往往无法契合用户的心理意图。为弥合这一差距,我们引入CogBlender算法,通过一种新颖的两阶段方法实现对认知属性的连续、多维干预。首先,我们构建离散的认知感知重写提示,即输入提示的若干变体,分别代表不同的极端认知状态。其次,我们在流匹配模型的速度场域内进行插值,将这些离散提示转化为连续控制信号。CogBlender根据目标认知评分动态混合由这些提示预测的速度场,从而平滑地引导生成轨迹,在最终图像中实现期望的认知属性。在四个认知属性(即效价、唤醒度、支配度和记忆性)上的大量实验表明,CogBlender实现了有效的认知干预。

英文摘要

Beyond conveying semantic information, images also possess cognitive properties that elicit specific psychological responses from viewers, such as memory encoding or emotional reactions. Although modern text-to-image (T2I) models generate semantically coherent content effectively, they struggle to control cognitive properties (e.g., valence, memorability) and often fail to align with the user's psychological intent. To bridge the gap, we introduce CogBlender, an algorithm that enables continuous and multi-dimensional intervention on cognitive properties through a novel two-stage approach. First, we construct discrete cognition-aware rewritten prompts, variants of the input prompt that represent distinct extreme cognitive states. Second, we translate these discrete prompts into continuous control signals by interpolating within the velocity-field domain of flow-matching models. By dynamically blending the velocity fields predicted from these prompts according to the target cognitive scores, CogBlender smoothly steers the generative trajectory to realize the desired cognitive properties in the final image. Extensive experiments across four cognitive properties (i.e., valence, arousal, dominance, and memorability) demonstrate that CogBlender achieves effective cognitive intervention.
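
The velocity-field blending step can be sketched in a few lines. This is a toy illustration under stated assumptions, not CogBlender itself: the linear mixing weight, the Euler integrator, and the constant stand-in velocity fields `field_low` and `field_high` are all hypothetical, whereas a real flow-matching model would predict each velocity from a rewritten prompt.

```python
def blend_velocity(v_low, v_high, score):
    """Linearly mix two velocity vectors; score in [0, 1] is the target level."""
    return [(1.0 - score) * a + score * b for a, b in zip(v_low, v_high)]


def sample(x0, field_low, field_high, score, steps=10):
    """Euler-integrate the blended velocity field from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = blend_velocity(field_low(x, t), field_high(x, t), score)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-ins for the velocities predicted under the two extreme prompts
# (e.g. "very low valence" and "very high valence").
def field_low(x, t):
    return [-1.0, 0.0]


def field_high(x, t):
    return [1.0, 2.0]


result = sample([0.0, 0.0], field_low, field_high, score=0.75)
```

With constant fields the result is simply the weighted average of the two endpoints, which makes it easy to see that sweeping `score` from 0 to 1 moves the sample continuously between the two extremes.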

2603.04870 2026-05-19 cs.CV 版本更新

Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

基于扩散的sRGB真实噪声生成:通过提示驱动的噪声表示学习

Jaekyun Ko, Dongjin Kim, Soomin Lee, Guanghui Wang, Tae Hyun Kim

发表机构 * Department of Computer Science, Hanyang University(汉阳大学计算机科学系) Mobile Experience (MX) Division, Samsung Electronics(三星电子移动体验部门) Department of Computer Science, Toronto Metropolitan University(多伦多都会大学计算机科学系)

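AI summary This paper proposes the Prompt-Driven Noise Generation (PNG) framework, which learns prompt features to generate realistic noisy images without relying on camera metadata, improving the generalizability and applicability of noise synthesis.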

Comments CVPR 2026

详情
AI中文摘要

在sRGB图像空间中去噪具有挑战性,由于噪声变化大。尽管端到端方法表现良好,但实际场景中其效果受限于真实噪声-清洁图像对的稀缺性,这些对昂贵且难以收集。为解决这一限制,已开发出几种生成方法,从有限数据中合成逼真的噪声图像。这些方法通常依赖相机元数据进行训练和测试以合成现实噪声。然而,缺乏元数据或设备间不一致限制了其实用性。因此,我们提出了一种新的框架,称为提示驱动噪声生成(PNG)。该模型能够获取高维提示特征,捕捉现实输入噪声的特征,并创建与输入噪声分布一致的多种逼真噪声图像。通过消除对显式相机元数据的依赖,我们的方法显著提高了噪声合成的通用性和应用性。全面的实验表明,我们的模型能够有效生成逼真的噪声图像,并在各种基准数据集上成功应用于去除现实噪声。

英文摘要

Denoising in the sRGB image space is challenging due to large noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.

2603.02667 2026-05-19 cs.CV cs.LG 版本更新

Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

统一对比学习与生成目标以实现视觉理解和文本到图像生成

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Jianpeng Cheng, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra

发表机构 * MIT Computer Science & Artificial Intelligence Laboratory Meta AI

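AI summary This paper proposes DREAM, a framework that uses Masking Warmup to reconcile text-image contrastive learning with text-to-image generation, improving performance across multiple tasks.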

详情
AI中文摘要

将文本-图像对比学习与文本到图像生成统一到一个端到端模型具有挑战性,因为两者需要不同的掩码策略:对比学习需要近完全可见的token,而掩码生成模型需要大量干扰。我们引入DREAM框架,通过Masking Warmup调度,在训练过程中逐步调整掩码分布的中心,使低和高掩码比率同时存在。这种共暴露使一个联合训练的编码器能够服务于两种目标。所得到的稳定优化解锁了语义对齐解码:在推理阶段,经过所有掩码比率训练的文本编码器可以评估部分生成的图像并选择最佳轨迹,仅需解码图像的12.5%,从而提高FID和吞吐量。DREAM在ImageNet线性探测(+1.1%)、5次转移(+4.1%)、ADE20K分割(+1.9%)和NYU深度估计(+6.25%)上优于CLIP,在CC12M FID上优于FLUID(+6.2%)的同时保持CLIP Score。这些收益表明,当正确统一文本-图像对比和生成目标时,它们是协同作用而非竞争。

英文摘要

Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs near-complete visible tokens, while masked generative modeling needs heavy corruption. We introduce DREAM, a unified framework that resolves this conflict through Masking Warmup, a schedule that shifts the center of the masking distribution over training, so low and high masking ratios coexist at every step. This co-exposure lets a single jointly-trained encoder serve both objectives. The resulting stable optimization unlocks Semantically Aligned Decoding at inference: the text encoder, trained against visual embeddings at all masking ratios, can score partially generated images and select the best trajectory with as little as 12.5% of the image decoded, improving both FID and throughput. DREAM outperforms its single-objective baselines, CLIP and FLUID: on ImageNet linear-probing (+1.1%), 5-shot transfer (+4.1%), ADE20K segmentation (+1.9%), and NYU depth estimation (+6.25%) over CLIP, and on CC12M FID (+6.2%) over FLUID while maintaining CLIP Score. Together, these gains show that text-image contrastive and generative objectives, when properly unified, are synergistic rather than competing.
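
One way to picture a schedule in which the center of the masking distribution shifts while low and high ratios still coexist is the sketch below. Everything specific here is an assumption for illustration: the clipped Gaussian, the low-to-high direction, the linear shift, and the values `start=0.3`, `end=0.8`, and `std=0.25` are not taken from the paper.

```python
import random


def masking_center(step, total_steps, start=0.3, end=0.8):
    """Center of the masking-ratio distribution, moved linearly over training."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_mask_ratio(step, total_steps, rng, std=0.25):
    """Draw one masking ratio around the current center, clipped to [0, 1]."""
    ratio = rng.gauss(masking_center(step, total_steps), std)
    return min(max(ratio, 0.0), 1.0)


rng = random.Random(0)
early = [sample_mask_ratio(0, 1000, rng) for _ in range(2000)]
late = [sample_mask_ratio(1000, 1000, rng) for _ in range(2000)]
```

Because the distribution is wide, both the early and the late batches contain lightly and heavily masked samples; only the balance between them changes as training progresses.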

2603.01993 2026-05-19 cs.CV 版本更新

Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

培养取证推理能力,实现可泛化的多模态操纵检测

Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng

发表机构 * School of Software Engineering, Xi’an Jiaotong University(西安交通大学软件工程学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) CSIRO(澳大利亚联邦科学与工业研究组织) Northwestern Polytechnical University(西北工业大学) University of Macau(澳门大学)

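AI summary This paper proposes REFORM, a reasoning-driven framework for multimodal manipulation detection that improves generalization and sets new state-of-the-art results on multiple datasets.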

Comments Accepted to ACL 2026

详情
AI中文摘要

近期生成式AI的发展显著提升了多模态媒体操纵的逼真度,给操纵检测带来了重大挑战。现有操纵检测和定位方法主要集中在结果导向的操纵类型分类,这不仅缺乏可解释性,还容易过拟合表面特征。本文认为,可推广的检测需要纳入显式的推理过程,而非仅分类有限的操纵类型。为此,我们提出REFORM,一个推理驱动的框架,将学习从结果拟合转向过程建模。REFORM采用三阶段课程,首先诱导推理依据,然后对齐推理与最终判断,最后通过强化学习优化逻辑一致性。为支持这一范式,我们引入ROM,一个具有丰富推理标注的大规模数据集。大量实验表明,REFORM在多个数据集上均取得新高准确率,包括在ROM上81.52%的准确率,在DGM4上76.65%的准确率,在MMFakeBench上74.9的F1分数。

英文摘要

Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.

2603.00952 2026-05-19 cs.CV 版本更新

Decoupling Motion and Geometry in 4D Gaussian Splatting

解耦4D高斯泼溅中的运动与几何

Yi Zhang, Yulei Kang, Jiangxin Sun, Beihao Xia, Jisheng Dang, Jian-Fang Hu

发表机构 * Sun Yat-sen University(中山大学) University of Trento(特伦托大学) Huazhong University of Science and Technology(华中科技大学) Lanzhou University(兰州大学)

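AI summary This paper proposes VeGaS, which introduces a Galilean shearing matrix and a Geometric Deformation Network to decouple Gaussian motion from geometric attributes, improving the modeling of complex non-linear motion; experiments on public datasets show state-of-the-art performance.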

详情
AI中文摘要

动态场景的高保真重建是一个重要但具有挑战性的问题。尽管最近的4D高斯泼溅(4DGS)已展示出建模时间动态的能力,但它将高斯运动和几何属性耦合在单一协方差公式中,限制了对复杂运动的表达能力,并常导致视觉伪影。为此,我们提出VeGaS,一种基于速度的新型4D高斯泼溅框架,将高斯运动与几何解耦。具体而言,我们引入伽利略剪切矩阵,显式纳入随时间变化的速度,以灵活建模复杂的非线性运动,同时严格隔离高斯运动对几何相关的条件高斯协方差的影响。此外,我们引入几何变形网络,利用时空上下文和速度线索细化高斯的形状和朝向,增强时间维度上的几何建模。在公开数据集上的大量实验表明,VeGaS达到了最先进的性能。

英文摘要

High-fidelity reconstruction of dynamic scenes is an important yet challenging problem. While recent 4D Gaussian Splatting (4DGS) has demonstrated the ability to model temporal dynamics, it couples Gaussian motion and geometric attributes within a single covariance formulation, which limits its expressiveness for complex motions and often leads to visual artifacts. To address this, we propose VeGaS, a novel velocity-based 4D Gaussian Splatting framework that decouples Gaussian motion and geometry. Specifically, we introduce a Galilean shearing matrix that explicitly incorporates time-varying velocity to flexibly model complex non-linear motions, while strictly isolating the effects of Gaussian motion from the geometry-related conditional Gaussian covariance. Furthermore, a Geometric Deformation Network is introduced to refine Gaussian shapes and orientations using spatio-temporal context and velocity cues, enhancing temporal geometric modeling. Extensive experiments on public datasets demonstrate that VeGaS achieves state-of-the-art performance.
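
For orientation, the textbook form of a Galilean shear may help. This is general background rather than the paper's exact parameterization: the symbols $S(\mathbf{v})$, $\Sigma$, $A$, $\mathbf{b}$, $c$ and the constant velocity $\mathbf{v}$ are assumptions for illustration, whereas the abstract describes a time-varying velocity.

```latex
S(\mathbf{v}) =
\begin{pmatrix} I_3 & \mathbf{v} \\ \mathbf{0}^\top & 1 \end{pmatrix},
\qquad
S(\mathbf{v}) \begin{pmatrix} \mathbf{x} \\ t \end{pmatrix}
= \begin{pmatrix} \mathbf{x} + \mathbf{v}\,t \\ t \end{pmatrix},
\qquad
\Sigma' = S(\mathbf{v})\,\Sigma\,S(\mathbf{v})^\top .

% With \Sigma = \begin{pmatrix} A & \mathbf{b} \\ \mathbf{b}^\top & c \end{pmatrix},
% the sheared blocks are
A' = A + \mathbf{v}\mathbf{b}^\top + \mathbf{b}\mathbf{v}^\top + c\,\mathbf{v}\mathbf{v}^\top,
\qquad
\mathbf{b}' = \mathbf{b} + c\,\mathbf{v},
\qquad
c' = c,

% so the spatial covariance conditioned on time is unchanged:
A' - \frac{\mathbf{b}'\mathbf{b}'^\top}{c'} = A - \frac{\mathbf{b}\mathbf{b}^\top}{c} .
```

Under this shear the conditional spatial mean at time $t$ moves by $\mathbf{v}\,t$ while the conditional spatial covariance stays fixed, which is one way to read the abstract's statement that motion is isolated from the geometry-related conditional covariance.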

2602.23058 2026-05-19 cs.CV cs.RO 版本更新

GeoWorld: Geometric World Models

GeoWorld:几何世界模型

Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley

发表机构 * ANU(澳大利亚国立大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

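AI summary GeoWorld uses a Hyperbolic JEPA and Geometric Reinforcement Learning to address the weaknesses of energy-based predictive world models in geometric structure and long-horizon prediction; experiments show success-rate gains of about 3% in 3-step planning and about 2% in 4-step planning.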

Comments Accepted to CVPR 2026

详情
AI中文摘要

基于能量的预测世界模型通过在潜在能量景观上推理而非生成像素,为多步视觉规划提供了一种有力的方法。然而,现有方法面临两大挑战:(i)其潜在表示通常在欧几里得空间中学习,忽略了状态之间潜在的几何和层次结构;(ii)它们难以进行长时域预测,导致在较长的rollout中性能迅速退化。为了解决这些挑战,我们引入GeoWorld,一种几何世界模型,通过双曲JEPA(Hyperbolic JEPA)将潜在表示从欧几里得空间映射到双曲流形,从而保留几何结构和层次关系。我们进一步引入几何强化学习用于基于能量的优化,实现双曲潜在空间中稳定的多步规划。在CrossTask和COIN上的大量实验表明,与最先进的V-JEPA 2相比,3步规划的成功率(SR)提升约3%,4步规划的成功率提升约2%。项目网站:https://steve-zeyu-zhang.github.io/GeoWorld。

英文摘要

Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
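
As a concrete picture of mapping Euclidean latents onto a hyperbolic manifold, the sketch below uses the Poincaré ball with curvature -1. The choice of this particular model of hyperbolic space is an assumption, since the abstract only says "hyperbolic manifolds"; the two formulas themselves are the standard exponential map at the origin and the standard Poincaré distance.

```python
import math


def exp_map_origin(v):
    """Map a Euclidean (tangent) vector into the unit Poincare ball."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * c for c in v]


def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(u):
        return sum(c * c for c in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - sq_norm(x)) * (1.0 - sq_norm(y))
    return math.acosh(1.0 + 2.0 * diff / denom)


latent = [0.3, 0.4]            # Euclidean latent with norm 0.5
z = exp_map_origin(latent)     # lands inside the unit ball
d = poincare_distance([0.0, 0.0], z)
```

Distances grow rapidly toward the boundary of the ball, which is why hyperbolic spaces are a common choice for embedding tree-like, hierarchical relations.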

2602.19710 2026-05-19 cs.CV cs.LG cs.RO 版本更新

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

面向通用视觉-语言-动作策略的通用姿态预训练

Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu

发表机构 * Tencent Robotics X(腾讯机器人X) Futian Laboratory(福田实验室) The Hong Kong University of Science and Technology(香港科技大学) Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

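AI summary This paper proposes Pose-VLA, which separates pre-training from post-training to address feature collapse and low training efficiency in vision-language-action models, extracting universal 3D spatial priors and then aligning them efficiently with robot-specific action spaces.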

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project website: https://hetolin.github.io/PoseVLA

Journal ref Robotics: Science and Systems, 2026

详情
AI中文摘要

现有视觉-语言-动作(VLA)模型将高层感知与稀疏的、特定于具身形态的动作监督纠缠在一起,因而常出现特征坍塌和训练效率低下的问题。由于这些模型通常依赖为视觉问答(VQA)优化的VLM主干,它们擅长语义识别,却常常忽视决定不同动作模式的细微3D状态变化。为解决这些错位,我们提出Pose-VLA,一种解耦范式,将VLA训练分为两个阶段:预训练阶段在统一的以相机为中心的空间中提取通用3D空间先验,后训练阶段在机器人特定的动作空间内高效完成具身对齐。通过引入离散姿态token作为通用表示,Pose-VLA将来自多种3D数据集的空间定位(grounding)与机器人演示中的几何级轨迹无缝整合。我们的框架采用两阶段预训练流程:先通过姿态建立基本的空间定位,再通过轨迹监督实现运动对齐。大量评估表明,Pose-VLA在RoboTwin 2.0上以79.5%的平均成功率取得最先进的结果,并在LIBERO上取得96.0%的有竞争力的表现。真实世界实验进一步展示了在每个任务仅使用100个演示的情况下对多样化物体的稳健泛化,验证了我们预训练范式的效率。

英文摘要

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.

2602.18584 2026-05-19 cs.LG cs.AI cs.CV 版本更新

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

GIST: 通过耦合优化几何进行指令微调的目标数据选择

Guanghui Min, Tianhao Huang, Ke Wan, Chen Chen

发表机构 * Department of Computer Science, University of Virginia, Charlottesville, USA(弗吉尼亚大学计算机科学系)

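AI summary This paper proposes GIST, which replaces axis-aligned scaling with subspace alignment to handle parameter coupling in parameter-efficient fine-tuning, enabling more efficient targeted data selection.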

Comments ICML 2026; 27 pages, 8 figures, 11 tables

详情
AI中文摘要

目标数据选择已成为高效指令微调中的关键范式,旨在为特定目标任务找出一小部分有影响力的训练示例。在实践中,影响力通常通过示例对参数更新的影响来衡量。为了使选择可扩展,许多方法利用优化器统计量(如Adam状态)作为更新几何的轴对齐替代(即对角预条件),隐式地将参数视为逐坐标独立。我们表明,在LoRA等参数高效微调(PEFT)方法中,这一假设不再成立。在这种设置下,所诱导的优化几何表现出强烈的跨参数耦合和不可忽略的非对角交互,而与任务相关的更新方向被限制在低维子空间中。受这一不匹配的启发,我们提出GIST(Gradient Isometric Subspace Transformation,梯度等距子空间变换),一种简单而有原则的替代方案,用稳健的子空间对齐取代轴对齐缩放。GIST通过奇异值分解(SVD)从验证梯度中恢复任务特定的子空间,将训练梯度投影到该耦合子空间,并根据示例与目标方向的对齐程度为其打分。大量实验表明,在相同的选择预算下,GIST仅用0.29%的存储和25%的计算时间,即可达到或超过当前最先进的基线。

英文摘要

Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., diagonal precondition), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via singular value decomposition (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments have demonstrated that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.
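
The recover, project, and score pipeline can be sketched in a dependency-free form. This is a simplified stand-in, not GIST itself: it keeps only a rank-1 subspace and uses power iteration in place of a full SVD, and the helper names and the cosine-style score are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_right_singular_vector(rows, iters=100):
    """Power iteration on G^T G, where G is the matrix whose rows are `rows`.

    Approximates the top right singular vector of G (a rank-1 stand-in for SVD).
    """
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]                       # G v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]  # G^T (G v)
        norm = math.sqrt(dot(w, w))
        v = [c / norm for c in w]
    return v


def alignment_scores(train_grads, val_grads):
    """Score each training gradient by its cosine with the validation direction."""
    u = top_right_singular_vector(val_grads)
    # Orient the direction so that it agrees with the mean validation gradient.
    mean_proj = sum(dot(g, u) for g in val_grads) / len(val_grads)
    sign = 1.0 if mean_proj >= 0 else -1.0
    return [sign * dot(g, u) / math.sqrt(dot(g, g)) for g in train_grads]


val_grads = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.0], [2.2, 0.0, 0.1]]
train_grads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
scores = alignment_scores(train_grads, val_grads)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Selecting the top of `ranking` under a fixed budget keeps the examples whose gradients point along the task direction and drops those that are orthogonal or opposed to it.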

2602.12280 2026-05-19 cs.CV 版本更新

Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

惊喜之笔:矢量素描中的渐进语义错觉

Huai-Hsun Cheng, Siang-Ling Zhang, Yu-Lun Liu

发表机构 * National Yang Ming Chiao Tung University

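AI summary This paper introduces the Progressive Semantic Illusions task, in which a single sketch changes meaning as strokes are added, and uses a dual-branch Score Distillation Sampling mechanism to handle the dual constraint, improving recognizability and illusion strength.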

Comments SIGGRAPH 2026. Project page: https://stroke-of-surprise.github.io/

详情
AI中文摘要

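传统视觉错觉依赖于空间操纵,如多视角一致性。本文引入渐进语义错觉,这是一种新的矢量素描任务:单幅素描随着笔触的逐步添加发生剧烈的语义转变。我们提出Stroke of Surprise,一种生成框架,通过优化矢量笔触来满足不同绘制阶段的不同语义解释。核心挑战在于“双重约束”:初始前缀笔触必须构成一个连贯的对象(如鸭子),同时在添加增量(delta)笔触后作为第二个概念(如羊)的结构基础。为此,我们提出一种序列感知的联合优化框架,由双分支分数蒸馏采样(SDS)机制驱动。不同于冻结初始状态的顺序方法,我们的方法动态调整前缀笔触,以发现对两个目标均有效的“公共结构子空间”。此外,我们引入一种新的叠加损失(Overlay Loss),用于强制空间互补性,确保结构融合而非遮挡。大量实验表明,我们的方法在可识别性和错觉强度上显著优于最先进的基线,成功地将视觉字谜(visual anagrams)从空间维度扩展到时间维度。项目主页:https://stroke-of-surprise.github.io/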

英文摘要

Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/

2602.11553 2026-05-19 cs.CV cs.AI 版本更新

Perception-based Image Denoising via Generative Compression

基于生成压缩的感知图像去噪

Nam Nguyen, Thinh Nguyen, Bella Bose

发表机构 * School of Electrical and Computer Engineering, Oregon State University, Corvallis, OR 97331, USA(电气与计算机工程学院,俄勒冈州立大学,科瓦利斯,OR 97331,USA)

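AI summary This paper proposes a generative compression framework for denoising that combines entropy-coded latent representations with perceptual measures, yielding perceptual improvements while maintaining competitive distortion performance.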

详情
AI中文摘要

图像去噪旨在去除噪声的同时保留结构细节和感知真实感,但失真驱动的方法常产生过度平滑的重建结果,在强噪声和分布偏移下尤其如此。本文提出一种用于基于感知的去噪的生成压缩框架:从熵编码的潜在表示中重建图像以实现复原,这种表示强制低复杂度结构,而生成式解码器借助学习感知图像块相似度(LPIPS)损失和Wasserstein距离等感知度量恢复逼真的纹理。本文介绍了两种互补的实例:(i)基于条件Wasserstein GAN(WGAN)的压缩去噪器,可显式控制率-失真-感知(RDP)权衡;(ii)基于条件扩散的重建策略,在压缩潜在表示的引导下进行迭代去噪。我们进一步为加性高斯噪声下基于压缩的最大似然去噪器建立了非渐近保证,包括重建误差和解码错误概率的界。在合成噪声和真实噪声基准上的实验表明,该方法在保持有竞争力的失真性能的同时带来一致的感知提升。

英文摘要

Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.

2602.08167 2026-05-19 cs.RO cs.AI cs.CV cs.LG 版本更新

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

行动预测具身推理的自监督自举

Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

发表机构 * Stanford(斯坦福大学) UC Berkeley(加州大学伯克利分校) NVIDIA(英伟达)

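AI summary This paper proposes R&B-EnCoRe, which uses self-supervised refinement to let models bootstrap embodied reasoning strategies from internet-scale knowledge, improving manipulation and navigation performance and reducing collision rate.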

Comments Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

具身思维链(CoT)推理显著增强了视觉-语言-动作(VLA)模型,但当前方法依赖刚性模板来指定推理原语(如场景中的物体、高层计划、结构可供性)。这些模板可能迫使策略处理无关信息,从而分散对关键动作预测信号的注意。这造成了一个瓶颈:没有成功的策略,就无法验证推理质量;没有高质量的推理,就无法构建稳健的策略。我们引入R&B-EnCoRe,使模型能够通过自监督细化,从互联网规模知识中自举具身推理。通过在重要性加权变分推断中将推理视为潜在变量,模型可以在没有外部奖励、验证器或人工标注的情况下,生成并蒸馏出由特定具身形态策略构成的精炼推理训练数据集。我们在操作(仿真中的Franka Panda、硬件上的WidowX)、足式导航(双足、轮式、自行车、四足)和自动驾驶等具身形态上验证了R&B-EnCoRe,所用的多种VLA架构参数规模为1B、4B、7B和30B。与不加区分地对所有可用原语进行推理的模型相比,我们的方法使操作成功率提升28%,导航评分提高101%,碰撞率指标降低21%。R&B-EnCoRe使模型能够蒸馏出可预测成功控制的推理,免去人工标注工程,同时将互联网规模知识落地于物理执行。

英文摘要

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.

2602.06523 2026-05-19 cs.CV cs.HC 版本更新

MicroBi-ConvLSTM: An Ultra-Lightweight Efficient Model for Human Activity Recognition on Resource Constrained Devices

MicroBi-ConvLSTM:一种用于资源受限设备上人类活动识别的超轻量高效模型

Mridankan Mandal

发表机构 * Department of Information Technology(信息技术系) Indian Institute of Information Technology, Allahabad Prayagraj, India(印度信息技术学院阿拉哈巴德分校,普拉亚格拉杰,印度)

AI总结 本文提出MicroBi-ConvLSTM模型,通过两阶段卷积特征提取和单层双向LSTM实现超轻量级架构,在保持线性复杂度的同时,参数量约为TinierHAR的1/2.9、DeepConvLSTM的1/11.9,并在多个基准测试中表现出竞争力。

详情
AI中文摘要

在资源受限的可穿戴设备上进行人类活动识别(HAR),需要模型在准确性与严格的内存和计算预算之间取得平衡。TinierHAR(34K参数)和TinyHAR(55K参数)等最先进的轻量级架构准确率很高,但计入操作系统开销后,会超出SRAM有限的微控制器的内存预算。本文提出MicroBi-ConvLSTM,一种超轻量级卷积循环架构,通过带4倍时间池化的两阶段卷积特征提取和单层双向LSTM,平均参数量仅为11.4K。其参数量约为TinierHAR的1/2.9、DeepConvLSTM的1/11.9,同时保持线性O(N)复杂度。在八个多样化的HAR基准上的评估表明,MicroBi-ConvLSTM在超轻量级范围内保持了有竞争力的性能:在UCI-HAR上宏平均F1为93.41%,在SKODA装配手势上为94.46%,在Daphnet冻结步态检测上为88.98%。系统性消融揭示了各组件的贡献依赖于任务:双向性有利于偶发事件检测,但对周期性运动只带来边际增益。在Raspberry Pi Pico 2和ESP32上的设备端部署在INT8量化和FP32全精度两条路径下都验证了硬件可行性。在INT8量化下,MicroBi-ConvLSTM是唯一在两个平台上都覆盖全部8/8数据集的架构,在Pico 2上平均延迟为72.8毫秒,在ESP32上与PyTorch的一致性为97.9%。在FP32部署下,它在所有成功的配置中(Pico 2为8/8,ESP32为7/8)达到100.0%的一致性,证实INT8下的保真度下降全部源于量化,而非架构限制。

英文摘要

Human Activity Recognition (HAR) on resource-constrained wearables requires models that balance accuracy against strict memory and computational budgets. State-of-the-art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed the memory budgets of microcontrollers with limited SRAM once operating system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional recurrent architecture achieving 11.4K parameters on average through two-stage convolutional feature extraction with 4x temporal pooling and a single bidirectional LSTM layer. This represents a 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait freeze detection. Systematic ablation reveals task-dependent component contributions, where bidirectionality benefits episodic event detection but provides marginal gains on periodic locomotion. On-device deployment on the Raspberry Pi Pico 2 and ESP32 validates hardware viability under both INT8 quantized and FP32 full-precision paths. Under INT8 quantization, MicroBi-ConvLSTM is the only architecture achieving full 8/8 dataset coverage on both platforms, with 72.8 ms average latency on Pico 2 and 97.9% PyTorch parity on ESP32. Under FP32 deployment, it achieves 100.0% parity on all successful configurations (8/8 Pico 2, 7/8 ESP32), confirming that all INT8 fidelity degradation is a quantization artifact rather than an architectural limitation.
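
To make the parameter budget in the abstract above more concrete, the following minimal sketch counts trainable weights for a two-stage 1D-convolution front end followed by a single bidirectional LSTM and a linear classifier. The layer widths, kernel size, and the function names (`conv1d_params`, `bilstm_params`, `model_params`) are hypothetical, chosen only for illustration; the abstract does not state the actual configuration. The LSTM count assumes the PyTorch convention of two bias vectors per direction.

```python
def conv1d_params(in_ch, out_ch, kernel):
    """Weights plus biases of one 1D convolution layer."""
    return in_ch * out_ch * kernel + out_ch


def bilstm_params(input_size, hidden_size):
    """Parameters of a single bidirectional LSTM layer.

    Each direction has 4 gates, each with an input matrix, a recurrent
    matrix, and two bias vectors (PyTorch-style bookkeeping).
    """
    per_direction = 4 * (
        hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    )
    return 2 * per_direction


def model_params(n_sensors, n_classes, c1=16, c2=16, kernel=5, hidden=16):
    """Total for conv -> conv -> BiLSTM -> linear; all sizes are assumptions."""
    total = conv1d_params(n_sensors, c1, kernel)
    total += conv1d_params(c1, c2, kernel)
    # Temporal pooling has no trainable parameters, so it adds nothing here.
    total += bilstm_params(c2, hidden)
    total += (2 * hidden) * n_classes + n_classes  # classifier on both directions
    return total


if __name__ == "__main__":
    print(model_params(n_sensors=9, n_classes=6))
```

With these illustrative sizes the recurrent layer dominates the count, which suggests why a single narrow BiLSTM layer is a natural place to save parameters; the real per-dataset counts depend on sensor and class dimensions that are not given here.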

2602.06037 2026-05-19 cs.CV 版本更新

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

基于几何的思考:用于空间推理的主动几何整合

Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, JiaWang Bian, Hang Xu, Xiaodan Liang

发表机构 * Shenzhen campus of Sun Yat-sen University(中山大学深圳校区) Yinwang Intelligent Technology Co. Ltd.(引望智能技术有限公司) Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Nanyang Technological University(南洋理工大学)

AI总结 本文提出GeoThinker框架,通过主动感知机制改进空间推理,通过空间接地融合和重要性门控实现几何与语义的精准整合,提升空间智能性能。

详情
AI中文摘要

近期多模态大语言模型(MLLM)在空间推理方面的进展越来越多地利用来自3D编码器的几何先验。然而,现有整合策略大多是被动的:几何信息作为全局流暴露给模型并被不加区分地融合,常导致语义-几何错位和冗余信号。我们提出GeoThinker,一个将范式从被动融合转向主动感知的框架。GeoThinker不做简单的特征混合,而是让模型根据内部推理需求选择性地检索几何证据。具体而言,GeoThinker在精心选择的VLM层上应用空间接地融合(Spatial-Grounded Fusion):语义视觉先验通过帧严格(frame-strict)交叉注意力选择性地查询并整合任务相关的几何信息,并由重要性门控(Importance Gating)进一步校准,使每帧注意力偏向任务相关的结构。全面的评估结果表明,GeoThinker在空间智能上达到了新的最先进水平,在VSI-Bench上取得72.6的峰值分数。此外,GeoThinker在具身指代和自动驾驶等复杂下游场景中表现出鲁棒的泛化能力和显著提升的空间感知。我们的结果表明,主动整合空间结构的能力对下一代空间智能至关重要。代码见 https://github.com/Li-Hao-yuan/GeoThinker 。

英文摘要

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
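
The frame-strict cross-attention and gating described in the abstract above can be illustrated with a small sketch. It is an assumption-laden toy, not the authors' implementation: semantic tokens may only attend to geometry tokens from the same frame, and "Importance Gating" is modeled here as an additive bias on the attention logits, which is one plausible reading. The function name `frame_strict_attention` and all inputs are hypothetical.

```python
import math


def frame_strict_attention(scores, query_frames, key_frames, gate_bias):
    """Return attention weights with cross-frame pairs masked out.

    scores[i][j]  : raw similarity of semantic token i to geometry token j
    query_frames  : frame index of each semantic (query) token
    key_frames    : frame index of each geometry (key) token
    gate_bias[j]  : assumed additive importance bias for geometry token j
    Assumes every query has at least one key in its own frame.
    """
    weights = []
    for i, row in enumerate(scores):
        logits = [
            s + gate_bias[j] if query_frames[i] == key_frames[j] else -math.inf
            for j, s in enumerate(row)
        ]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    w = frame_strict_attention(
        scores=[[0.0, 0.0, 5.0, 5.0], [1.0, 1.0, 0.0, 0.0]],
        query_frames=[0, 1],
        key_frames=[0, 0, 1, 1],
        gate_bias=[0.0, 0.0, 0.0, 0.0],
    )
    print(w)
```

Note that the large cross-frame scores in the first row are ignored entirely: the mask, not the similarity, decides which geometry tokens are eligible.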

2602.04802 2026-05-19 cs.CV 版本更新

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

VISTA-Bench: 视觉-语言模型是否真的能像纯文本一样理解可视化文本?

Qing'an Liu, Juntong Feng, Yuhao Wang, Xinzhe Han, Yujie Cheng, Yue Zhu, Haiwen Diao, Yunzhi Zhuge, Huchuan Lu

发表机构 * Dalian University of Technology(大连理工大学) Nanyang Technological University(南洋理工大学)

AI总结 VISTA-Bench通过对比纯文本和可视化文本问题,揭示了视觉-语言模型在处理可视化文本时的模态差距,发现模型在语义相同的情况下表现显著下降。

Comments 32 pages, 16 figures

详情
AI中文摘要

视觉-语言模型(VLMs)在文本与视觉输入的跨模态理解方面取得了令人瞩目的性能,但现有基准主要关注纯文本查询。在真实场景中,语言也经常以嵌入图像的可视化文本形式出现,这就提出了一个问题:当前VLMs能否同等地处理此类输入。本文引入VISTA-Bench,一个覆盖多模态感知、推理到单模态理解等领域的系统性基准,在受控渲染条件下对比纯文本问题与可视化文本问题,以评估模型对可视化文本的理解能力。对30多个代表性VLMs的广泛评估揭示了显著的模态差距:在纯文本查询上表现良好的模型,在等价语义内容以可视化文本呈现时往往大幅退化。感知难度越高,这一差距越大,说明在语义不变的情况下模型对渲染变化仍很敏感。总体而言,VISTA-Bench提供了一个有原则的评估框架,用于诊断这一局限,并引导研究走向跨分词文本与像素的更统一的语言表征。

英文摘要

Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such input requests comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 30 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels.

2602.00470 2026-05-19 cs.CV 版本更新

FG-TreeSeg: Flow-Guided Tree Crown Segmentation without Instance Annotations

FG-TreeSeg:无需实例标注的流引导树冠分割

Pengyu Chen, Fangzheng Lyu, Sicheng Wang, Cuizhen Wang

发表机构 * Department of Geography, University of South Carolina(南卡罗来纳大学地理系) Department of Geography, Virginia Polytechnic Institute and State University(弗吉尼亚理工大学地理系)

AI总结 本文提出FG-TreeSeg,通过将树冠建模为拓扑流场中的星形凸对象,利用Cellpose-SAM实现无需标注的树冠实例分割,实验表明其在不同传感器和冠层密度下均具有良好的泛化能力。

Comments 5 pages, 8 figures

Journal ref IEEE Geoscience and Remote Sensing Letters, 2026

详情
AI中文摘要

个体树冠分割是遥感中用于森林生物量估算和生态监测的重要任务。然而,在密集、相互重叠的冠层中进行准确勾画仍是瓶颈。监督式深度学习方法标注成本高、泛化能力有限,而新兴的基础模型(如Segment Anything Model)往往缺乏领域知识,导致在密集树丛中出现欠分割。为弥合这一差距,我们提出FG-TreeSeg,一种免训练的树冠实例分割框架,将生物医学成像中基于流的勾画方法迁移到遥感领域。FG-TreeSeg借助Cellpose-SAM,将树冠建模为拓扑流场中的星形凸对象,并依据向量收敛强制分离相互接触的树冠实例。在NEON和BAMFOREST数据集上的实验以及目视检查表明,该框架在不同传感器类型和冠层密度下均具有稳健的泛化能力,可为树冠实例分割和标签生成提供免训练的解决方案。

英文摘要

Individual tree crown segmentation is an important task in remote sensing for forest biomass estimation and ecological monitoring. However, accurate delineation in dense, overlapping canopies remains a bottleneck. While supervised deep learning methods suffer from high annotation costs and limited generalization, emerging foundation models (e.g., Segment Anything Model) often lack domain knowledge, leading to under-segmentation in dense clusters. To bridge this gap, we propose FG-TreeSeg, a training-free framework for tree crown instance segmentation that transfers flow-based delineation from biomedical imaging to remote sensing. By modeling tree crowns as star-convex objects within a topological flow field using Cellpose-SAM, the FG-TreeSeg framework forces the separation of touching tree crown instances based on vector convergence. Experiments on the NEON and BAMFOREST datasets and visual inspection demonstrate that our framework generalizes robustly across diverse sensor types and canopy densities, offering a training-free solution for tree crown instance segmentation and label generation.

2601.21531 2026-05-19 cs.CR cs.AI cs.CV 版本更新

On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

大型视觉-语言模型在视觉标记压缩下的对抗鲁棒性研究

Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye, Tianwei Zhang, Haibo Hu

发表机构 * The Hong Kong Polytechnic University, Hong Kong(香港理工大学) Nanyang Technological University, Singapore(南洋理工大学) Chongqing University, Chongqing, China(重庆大学) Research Centre for Privacy and Security Technologies in Future Smart Systems, PolyU(未来智能系统中的隐私与安全技术研究中心)

AI总结 本文研究了视觉标记压缩对大型视觉-语言模型对抗鲁棒性的影响,提出CAGE攻击方法,通过优化与压缩推理对齐,揭示压缩机制下的鲁棒性漏洞。

Comments Accepted by ICML 2026

详情
AI中文摘要

视觉标记压缩通过剪枝或合并视觉标记来加速大型视觉-语言模型(LVLMs),已被广泛使用,但其对抗鲁棒性仍未被探索。我们发现,现有基于编码器的攻击无法充分暴露压缩后LVLMs的鲁棒性漏洞,原因在于优化与推理之间的不匹配:扰动是在完整标记表示上优化的,而推理却要经过标记压缩瓶颈。为弥合这一差距,我们提出压缩对齐攻击(Compression-AliGnEd attack,CAGE),在不假设可访问已部署的压缩机制或其标记预算的前提下,使扰动优化与压缩推理对齐。CAGE结合了两部分:(i)期望特征破坏,把失真集中在各种可能预算下都较可能保留的标记上;(ii)排序失真对齐,主动使标记失真与排序分数对齐,以促使高度失真的证据被保留。在多种有代表性的即插即用压缩机制和数据集上,CAGE始终比基线攻击取得更低的鲁棒准确率。本文强调,忽视压缩的鲁棒性评估可能过于乐观,并呼吁对高效LVLMs开展压缩感知的安全评估与防御。

英文摘要

Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks cannot fully disclose the robustness vulnerabilities of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
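
One way to picture the "expected feature disruption" idea in the abstract above is to weight each token's distortion by how often it would survive pruning across a range of candidate budgets. The sketch below is only an illustration under stated assumptions: compression is modeled as top-k selection by a rank score, budgets are treated as equally likely, and the names `survival_weights` and `expected_disruption` are hypothetical rather than taken from the paper.

```python
def survival_weights(rank_scores, budgets):
    """Fraction of candidate budgets under which each token is kept.

    A token survives budget k if it is among the k highest-scoring tokens.
    Budgets are weighted uniformly (an assumption of this sketch).
    """
    order = sorted(range(len(rank_scores)), key=lambda i: rank_scores[i], reverse=True)
    position = {token: pos for pos, token in enumerate(order)}
    return [
        sum(1 for k in budgets if position[i] < k) / len(budgets)
        for i in range(len(rank_scores))
    ]


def expected_disruption(distortions, weights):
    """Survival-weighted sum of per-token feature distortion."""
    return sum(w * d for w, d in zip(weights, distortions))


if __name__ == "__main__":
    weights = survival_weights([0.9, 0.1, 0.5, 0.3], budgets=[1, 2, 4])
    print(weights)
    print(expected_disruption([1.0, 1.0, 1.0, 1.0], weights))
```

Under this weighting, distortion spent on a token that is pruned under most budgets contributes little to the objective, which matches the stated goal of concentrating perturbation on tokens likely to survive.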

2601.21458 2026-05-19 cs.CV 版本更新

Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization

从重建误差中挖掘伪造痕迹:一种用于多模态深度伪造时间定位的弱监督框架

Midou Guo, Qilin Yin, Wei Lu, Rui Yang

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) Sun Yat-sen University(中山大学) Alibaba Group(阿里巴巴集团)

AI总结 本文提出RT-DeepLoc框架,通过重建误差识别深度伪造,利用仅在真实数据上训练的MAE学习时空模式,结合非对称视频内对比损失(AICL)提升定位精度,实验表明其在大规模数据集上达到弱监督时序伪造定位的最新水平。

详情
AI中文摘要

现代深度伪造已演变为局部、间歇性的篡改,需要精细的时序定位来缓解严重的数字安全风险。帧级标注成本过高,使得仅依赖视频级标签的弱监督方法成为现实所需。为此,我们提出基于重建的时序深度伪造定位(RT-DeepLoc),这是一个通过重建误差识别伪造的弱监督时序伪造定位框架。该框架使用仅在真实数据上训练的掩码自编码器(MAE)来学习真实数据内在的时空模式,因此模型会对伪造片段产生显著的重建差异,在无需密集人工标注的情况下提供精确定位所缺少的细粒度线索。为稳健地利用这些指示信号,我们引入了新的非对称视频内对比损失(AICL)。AICL在这些重建线索的引导下聚焦于真实特征的紧凑性,从而建立稳定的决策边界,既增强局部判别力,又保持对先进生成模型所产生的未见伪造的泛化能力。在包括LAV-DF在内的大规模数据集上的大量实验表明,RT-DeepLoc在弱监督时序伪造定位任务上达到了最先进的性能。

英文摘要

Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization to mitigate severe digital security risks. The prohibitive cost of frame-level annotation makes weakly supervised methods, which rely only on video-level labels, a practical necessity. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for accurate localization without demanding dense human annotations. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries by advanced generative models. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.
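
The core intuition above, that forged segments show elevated reconstruction error under a model trained only on authentic data, can be sketched as a simple post-processing step. This is a deliberately simplified stand-in: it assumes per-frame errors are already available and applies a fixed threshold, whereas the paper's localization is learned with a contrastive loss. The function name `localize_segments` and the threshold value are hypothetical.

```python
def localize_segments(errors, threshold):
    """Group consecutive frames whose reconstruction error exceeds a threshold.

    Returns a list of (start_frame, end_frame) pairs, both inclusive.
    """
    segments = []
    start = None
    for t, err in enumerate(errors):
        if err > threshold and start is None:
            start = t  # a suspicious run begins
        elif err <= threshold and start is not None:
            segments.append((start, t - 1))  # the run ended on the previous frame
            start = None
    if start is not None:  # run extends to the final frame
        segments.append((start, len(errors) - 1))
    return segments


if __name__ == "__main__":
    frame_errors = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7]
    print(localize_segments(frame_errors, threshold=0.5))
```

A fixed threshold is brittle across videos with different baseline error levels, which is one reason a learned decision boundary is attractive in this setting.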

2601.20306 2026-05-19 cs.CV 版本更新

TPGDiff: Hierarchical Triple-Prior Guided Diffusion for Image Restoration

TPGDiff:用于图像修复的分层三先验引导扩散

Yanjie Tu, Qingsen Yan, Axi Niu, Jiacong Tang

发表机构 * School of Computer Science, Northwestern Polytechnical University, Xi'an, China(西北工业大学计算机学院,西安,中国) Shenzhen Research Institute of Northwestern Polytechnical University, Shenzhen, China(西北工业大学深圳研究院,深圳,中国)

AI总结 TPGDiff通过整合退化先验、结构先验和语义先验,实现图像修复的分层引导,提升严重退化区域的重建能力。

详情
AI中文摘要

一体化(all-in-one)图像修复旨在用单一的统一模型处理多种退化类型。现有方法通常依赖退化先验来指导修复,但往往难以重建严重退化区域的内容。近期工作虽利用语义信息促进内容生成,但将其注入扩散模型的浅层往往会破坏空间结构(例如产生模糊伪影)。为此,我们提出一种用于统一图像修复的三先验引导扩散(TPGDiff)网络。TPGDiff在整个扩散轨迹中引入退化先验,同时在浅层引入结构先验、在深层引入语义先验,为图像重建提供分层且互补的先验引导。具体而言,我们利用多源结构线索作为结构先验,以捕捉细粒度细节并引导浅层表示。作为补充,我们进一步开发了蒸馏驱动的语义提取器,以生成稳健的语义先验,确保即使在严重退化下深层也能获得可靠的高层引导。此外,我们采用退化提取器学习退化感知先验,使扩散过程在所有时间步上实现阶段自适应控制。在单一退化和多退化基准上的大量实验表明,TPGDiff在多样的修复场景中取得了更优的性能和泛化能力。我们的项目页面是:https://leoyjtu.github.io/tpgdiff-project.

英文摘要

All-in-one image restoration aims to address diverse degradation types using a single unified model. Existing methods typically rely on degradation priors to guide restoration, yet often struggle to reconstruct content in severely degraded regions. Although recent works leverage semantic information to facilitate content generation, integrating it into the shallow layers of diffusion models often disrupts spatial structures (e.g., blurring artifacts). To address this issue, we propose a Triple-Prior Guided Diffusion (TPGDiff) network for unified image restoration. TPGDiff incorporates degradation priors throughout the diffusion trajectory, while introducing structural priors into shallow layers and semantic priors into deep layers, enabling hierarchical and complementary prior guidance for image reconstruction. Specifically, we leverage multi-source structural cues as structural priors to capture fine-grained details and guide shallow-layer representations. To complement this design, we further develop a distillation-driven semantic extractor that yields robust semantic priors, ensuring reliable high-level guidance at deep layers even under severe degradations. Furthermore, a degradation extractor is employed to learn degradation-aware priors, enabling stage-adaptive control of the diffusion process across all timesteps. Extensive experiments on both single- and multi-degradation benchmarks demonstrate that TPGDiff achieves superior performance and generalization across diverse restoration scenarios. Our project page is: https://leoyjtu.github.io/tpgdiff-project.

2601.16527 2026-05-19 cs.LG cs.AI cs.CL cs.CV 版本更新

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

超越表面遗忘:多模态大语言模型中幻觉的锐度感知鲁棒擦除

Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang

发表机构 * College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics(南京航空航天大学计算机科学与技术学院) Institute for AI, Tsinghua University(清华大学人工智能研究院) Huzhou University(湖州大学) Institute of Dataspace, Hefei Comprehensive National Science Center(合肥综合性国家科学中心数据空间研究院) University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出SARE方法,通过有针对性的min-max优化和Targeted-SAM机制,解决多模态大语言模型中幻觉的鲁棒擦除问题,提升模型稳定性与擦除效果。

详情
AI中文摘要

多模态大语言模型虽然强大,但容易产生物体幻觉,即描述并不存在的实体,从而损害可靠性。尽管近期的遗忘(unlearning)方法试图缓解这一问题,我们发现了一个关键缺陷:结构脆弱性。我们通过实验证明,标准擦除只实现了表面抑制,使模型陷入尖锐极小值,经过轻量的重新学习后幻觉会灾难性地复现。为确保几何稳定性,我们提出SARE,将遗忘表述为有针对性的min-max优化问题,并使用Targeted-SAM机制显式地平坦化幻觉概念周围的损失景观。通过在模拟的最坏情况参数扰动下抑制幻觉,我们的框架确保擦除效果鲁棒,在权重偏移下保持稳定。大量实验表明,SARE在擦除效果上显著优于基线,同时保持通用生成质量。关键的是,它在重新学习和参数更新之后仍能持续抑制幻觉,验证了几何稳定化的有效性。

英文摘要

Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal that remains stable under weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.
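
The min-max formulation mentioned above follows the general pattern of sharpness-aware minimization (SAM): first move the weights to an approximate worst-case point within a small radius, then descend using the gradient measured there. The sketch below shows one generic SAM step on a plain Python list of weights. It is not the paper's Targeted-SAM mechanism, whose targeting of hallucinated concepts is not specified in the abstract; `sam_step`, `rho`, and `lr` are illustrative names and values.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One generic sharpness-aware update.

    grad_fn(weights) must return the loss gradient as a list of floats.
    """
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid division by zero
    # Inner maximization (first-order): step of length rho along the gradient.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    # Outer minimization: descend using the gradient at the perturbed point.
    grad_at_perturbed = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, grad_at_perturbed)]


if __name__ == "__main__":
    quadratic_grad = lambda w: [2.0 * x for x in w]  # gradient of sum(x^2)
    print(sam_step([1.0], quadratic_grad, lr=0.1, rho=0.1))
```

Because the descent direction is taken at the perturbed point, minima whose loss rises sharply nearby are penalized, which is the property the abstract relies on to resist relearning.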

2601.02353 2026-05-19 cs.CV cs.LG 版本更新

Meta-Learning Guided Pruning for Few-Shot Plant Pathology on Edge Devices

元学习引导的剪枝用于边缘设备上的少样本植物病理学

Mohammed Mudassir Uddin, Shahnawaz Alam, Mohammed Kaif Pasha, Dr Tasneem Bano Rehman, Dr Fahmina Taranum, Afroze Begum

发表机构 * Department of CSE, Muffakham Jah College of Engineering and Technology (MJCET)(计算机科学与工程系,穆法卡姆·贾赫工程与技术学院(MJCET))

AI总结 本文提出DACIS方法,结合神经网络剪枝与少样本学习,实现边缘设备上高效的植物病害识别,实验表明模型大小减小78%,同时保持原始精度的92.3%。

详情
AI中文摘要

偏远地区的农民需要快速可靠的植物病害识别方法,但往往无法使用实验室或高性能计算资源。深度学习模型可以根据叶片图像高精度地检测病害,但这些模型通常过大、计算开销过高,难以在Raspberry Pi等低成本边缘设备上运行。此外,收集数千张带标注的病害图像用于训练既昂贵又耗时。本文将神经网络剪枝(移除模型中不必要的部分)与少样本学习(使模型能从有限样本中学习)相结合,以同时应对这两个挑战。本文提出病害感知通道重要性评分(Disease-Aware Channel Importance Scoring,DACIS),用于识别神经网络中对区分不同植物病害最重要的部分,并将其集成到三阶段的Prune-then-Meta-Learn-then-Prune(PMP)流程中。在PlantVillage和PlantDoc数据集上的实验表明,所提出的方法将模型大小减少78%,同时保持原始精度的92.3%;压缩后的模型在Raspberry Pi 4上以每秒7帧的速度运行,使面向小农户的实时田间诊断变得可行。

英文摘要

Farmers in remote areas need quick and reliable methods for identifying plant diseases, yet they often lack access to laboratories or high-performance computing resources. Deep learning models can detect diseases from leaf images with high accuracy, but these models are typically too large and computationally expensive to run on low-cost edge devices such as Raspberry Pi. Furthermore, collecting thousands of labeled disease images for training is both expensive and time-consuming. This paper addresses both challenges by combining neural network pruning, removing unnecessary parts of the model, with few-shot learning, which enables the model to learn from limited examples. This paper proposes Disease-Aware Channel Importance Scoring (DACIS), a method that identifies which parts of the neural network are most important for distinguishing between different plant diseases, integrated into a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline. Experiments on PlantVillage and PlantDoc datasets demonstrate that the proposed approach reduces model size by 78% while maintaining 92.3% of the original accuracy, with the compressed model running at 7 frames per second on a Raspberry Pi 4, making real-time field diagnosis practical for smallholder farmers.

2512.23180 2026-05-19 cs.CV 版本更新

GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

GaussianDWM:用于统一场景理解和多模态生成的3D高斯驾驶世界模型

Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) MEGVII Technology(旷视科技) Mach Drive

AI总结 本文提出基于3D高斯表示的统一驾驶世界模型框架,实现3D场景理解和多模态生成,并通过语言引导采样策略和双条件生成模型提升生成效果,实验验证其在nuScenes和NuInteract数据集上的优越性能。

Comments Accepted by CVPR 2026

详情
AI中文摘要

驾驶世界模型(DWMs)随着生成模型的进步而快速发展。然而,现有DWMs缺乏3D场景理解能力,只能以输入数据为条件生成内容,无法解释或推理驾驶环境。此外,当前方法用点云或BEV特征表示3D空间信息,无法将文本信息与底层3D场景准确对齐。为解决这些局限,我们提出一种基于3D高斯场景表示的新型统一DWM框架,既支持3D场景理解,也支持多模态场景生成,并能为理解和生成任务提供上下文增强。我们的方法将丰富的语言特征嵌入每个高斯基元,直接把文本信息与3D场景对齐,从而实现早期模态对齐。此外,我们设计了一种新颖的任务感知语言引导采样策略,去除冗余的3D高斯,并将准确而紧凑的3D标记注入LLM。我们还设计了一个双条件多模态生成模型,其中视觉-语言模型捕获的信息作为高层语言条件,与低层图像条件结合,共同引导多模态生成过程。我们在nuScenes和NuInteract数据集上进行了全面研究,验证了框架的有效性。我们的方法达到了最先进的性能。我们将在GitHub上公开代码:https://github.com/dtc111111/GaussianDWM 。

英文摘要

Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub at https://github.com/dtc111111/GaussianDWM.

2512.16085 2026-05-19 cond-mat.mtrl-sci cs.CV 版本更新

Machine Learning Enabled Graph Analysis of Particulate Composites: Application to Solid-state Battery Cathodes

机器学习赋能的颗粒复合材料图分析:应用于固态电池正极

Zebin Li, Shimao Deng, Yijin Liu, Jia-Mian Hu

发表机构 * Department of Materials Science and Engineering, University of Wisconsin-Madison(威斯康星大学麦迪逊分校材料科学与工程系) Walker Department of Mechanical Engineering, University of Texas at Austin(德克萨斯大学奥斯汀分校沃克机械工程系)

AI总结 本文提出一种基于机器学习的框架,将多模态X射线图像转化为拓扑感知图,用于分析颗粒复合材料的微观结构与性能关系,以固态电池正极为例验证了三相交点和离子/电子传导通道的重要性。

详情
AI中文摘要

颗粒复合材料支撑了许多固态化学和电化学系统,其中多相边界和颗粒间连接等微观特征强烈影响系统性能。X射线显微镜技术的进步使得能够捕捉这些复杂微观结构的大规模、多模态图像,但如何利用这些数据发现新的物理见解并指导微观结构优化仍是一个重大挑战。本文开发了一种机器学习(ML)赋能的框架,能够自动将实验多模态X射线图像转换为可扩展、拓扑感知的图结构,用于提取物理见解并建立局部微观结构-性能关系,既在颗粒层面又在网络层面。以固态锂离子电池的多相颗粒正极为例,我们的ML赋能图分析证实了三相交点和同时离子/电子传导通道在实现理想局部电化学活性中的关键作用。本文的工作确立了基于图的微观结构表示作为连接多模态实验成像与功能理解的强大范式,并促进了在广泛颗粒复合材料中进行微观结构感知的数据驱动材料设计。

英文摘要

Particulate composites underpin many solid-state chemical and electrochemical systems, where microstructural features such as multiphase boundaries and inter-particle connections strongly influence system performance. Advances in X-ray microscopy enable capturing large-scale, multimodal images of these complex microstructures with an unprecedentedly high throughput. However, harnessing these datasets to discover new physical insights and guide microstructure optimization remains a major challenge. Here, we develop a machine learning (ML) enabled framework that enables automated transformation of experimental multimodal X-ray images of multiphase particulate composites into scalable, topology-aware graphs for extracting physical insights and establishing local microstructure-property relationships at both the particle and network level. Using the multiphase particulate cathode of solid-state lithium batteries as an example, our ML-enabled graph analysis corroborates the critical role of triple phase junctions and concurrent ion/electron conduction channels in realizing desirable local electrochemical activity. Our work establishes graph-based microstructure representation as a powerful paradigm for bridging multimodal experimental imaging and functional understanding, and facilitating microstructure-aware data-driven materials design in a broad range of particulate composites.

2512.07245 2026-05-19 cs.CV 版本更新

Zero-Shot Textual Explanations via Translating Decision-Critical Features

通过翻译决策关键特征实现零样本文本解释

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

发表机构 * Chiba University(千叶大学) National Institute of Informatics(日本国立情报学研究所)

AI总结 本文提出TEXTER方法,通过隔离决策关键特征生成更准确的文本解释,提升模型可解释性。

Comments Accepted to CVPR 2026 Findings

详情
AI中文摘要

文本解释用自然语言描述预测依据,使图像分类器的决策变得透明。大型视觉-语言模型虽能生成描述,但它们面向通用视觉理解而设计,并非针对特定分类器的推理。现有零样本解释方法将全局图像特征与语言对齐,得到的描述反映的是图像中可见的内容,而非驱动预测的因素。本文提出TEXTER,在对齐之前先隔离决策关键特征,从而克服这一局限。TEXTER识别对预测有贡献的神经元,并强调这些神经元所编码的特征(即决策关键特征),再将这些被强调的特征映射到CLIP特征空间,以检索反映模型推理的文本解释。稀疏自编码器可进一步提升可解释性,对Transformer架构尤其有效。大量实验表明,TEXTER比现有方法提供更忠实、更可解释的解释。代码可在 https://github.com/tttt-0814/TEXTER 获取。

英文摘要

Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons (i.e., the decision-critical features). It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER provides more faithful and interpretable explanations than existing methods. The code is available at https://github.com/tttt-0814/TEXTER.

2512.04331 2026-05-19 cs.CV 版本更新

Open Set Face Forgery Detection via Dual-Level Evidence Collection

基于双层证据收集的开放集人脸伪造检测

Zhongyi Cai, Bryce Gernon, Wentao Bao, Yifan Li, Matthew Wright, Yu Kong

发表机构 * Michigan State University(密歇根州立大学) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 本文提出双层证据检测方法,用于识别新型伪造类别,通过不确定性估计提升实际应用,实验显示在识别新伪造类别时性能优于现有方法。

Comments Accepted at IEEE FG 2026

详情
AI中文摘要

人脸伪造的激增日益削弱了人们对在线内容真实性的信心。随着生成算法的快速演进,新的伪造类别将不断出现,对现有人脸伪造检测方法构成严峻挑战。尽管人脸伪造检测近来有所进步,现有技术仍大多局限于二元的真实/伪造分类或对已知伪造类别的识别,无法识别全新伪造方法的出现。本文研究开放集人脸伪造检测(OSFFD)问题,该问题要求检测模型能够识别新的伪造类别。为增强其现实适用性,我们重新表述了OSFFD问题,并通过不确定性估计加以解决。具体而言,我们提出双层证据人脸伪造检测(DLED)方法,通过提取并整合空间层面和频率层面的类别特定证据来估计预测不确定性。在多种设置下的全面实验表明,所提出的DLED方法达到了最先进的性能。值得注意的是,在识别来自新伪造类别的伪造样本时,它平均比多种现有基线模型高出20%。同时,DLED方法在标准的二元真实/伪造人脸伪造检测任务上也具有竞争力。

英文摘要

The surge in face forgeries has increasingly undermined confidence in the authenticity of online content. As generation algorithms rapidly evolve, new fake categories will constantly emerge, severely challenging existing face forgery detection methods. Although face forgery detection has recently improved, current techniques remain largely confined to binary Real-vs-Fake classification or the recognition of known fake categories. Moreover, they fail to identify the emergence of entirely new forgery methods. In this work, we study the Open Set Face Forgery Detection (OSFFD) problem, which requires the detection model to identify novel fake categories. To enhance its real-world applicability, we reformulate the OSFFD problem and address it through uncertainty estimation. Specifically, we propose the Dual-Level Evidential face forgery Detection (DLED) approach, which estimates prediction uncertainty by extracting and integrating category-specific evidence on the spatial and frequency levels. Comprehensive experiments across diverse settings demonstrate that our proposed DLED approach achieves state-of-the-art performance. Notably, it surpasses various existing baseline models by a 20% margin on average when identifying forgeries from novel fake categories. Concurrently, our DLED method yields competitive performance on the standard binary Real-versus-Fake face forgery detection task.
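
Evidence-based uncertainty of the kind described above is commonly computed with the subjective-logic formulation used in evidential deep learning: non-negative per-class evidence defines Dirichlet parameters, and the uncertainty mass shrinks as total evidence grows. The sketch below applies that standard formula; combining the spatial and frequency evidence by simple addition is an assumption made here for illustration, and the function name `evidential_uncertainty` is hypothetical.

```python
def evidential_uncertainty(spatial_evidence, frequency_evidence):
    """Return (per-class belief masses, uncertainty mass).

    Evidence values must be non-negative. With K classes and combined
    evidence e_k, the Dirichlet strength is S = sum(e_k) + K, the belief
    for class k is e_k / S, and the uncertainty is K / S.
    """
    # Assumed fusion rule: add the two evidence vectors class by class.
    evidence = [s + f for s, f in zip(spatial_evidence, frequency_evidence)]
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


if __name__ == "__main__":
    # Strong evidence for class 0: low uncertainty, likely a known category.
    print(evidential_uncertainty([6.0, 0.0, 0.0], [3.0, 0.0, 0.0]))
    # No evidence for any known class: maximal uncertainty, a candidate novel fake.
    print(evidential_uncertainty([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

In an open-set setting, a sample whose uncertainty exceeds a chosen threshold could be flagged as a novel category instead of being forced into a known class; the threshold itself would need to be tuned on held-out data.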

2512.04329 2026-05-19 cs.CV cs.SE 版本更新

A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

一种基于检索增强生成的方法用于从神经网络中提取算法逻辑

Waleed Khalid, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg, Germany(计算机视觉实验室,CAIDAS,维尔茨堡大学,德国)

AI总结 本文提出NN-RAG方法,通过检索增强生成技术从神经网络代码库中提取并验证模块,实现了跨仓库的架构迁移与重复检测,提升了神经网络架构的可复现性和多样性。

详情
AI中文摘要

复用现有神经网络组件对研究效率至关重要,但在数千个开源仓库中发现、提取并验证此类模块仍然困难。我们引入NN-RAG,一个检索增强生成系统,可将大规模、异构的PyTorch代码库转换为可搜索、可执行的已验证神经模块库。与传统的代码搜索或克隆检测工具不同,NN-RAG执行作用域感知的依赖解析、保留导入的重建以及由验证器把关的晋升,确保每个检索到的代码块都是作用域封闭、可编译且可运行的。应用于19个主要仓库时,该流程提取了1,289个候选代码块,验证通过941个(73.0%),并表明其中超过80%在结构上是唯一的。通过多层次去重(精确、词法、结构),我们发现NN-RAG为LEMUR数据集贡献了绝大多数独特架构,约占全部新网络结构的72%。除数量之外,NN-RAG还独特地支持架构模式的跨仓库迁移:自动识别一个项目中的可复用模块,并在另一个上下文中以依赖完整的形式重新生成。据我们所知,尚无其他开源系统能在这一规模上提供这种能力。该框架的中立规范还允许选择性地与语言模型集成,用于合成或数据集注册,而无需重新分发第三方代码。总体而言,NN-RAG将碎片化的视觉代码转化为可复现、可溯源的算法发现基础,提供了首个既能量化又能扩展跨仓库可执行神经架构多样性的开源解决方案。

英文摘要

Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion -- ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework's neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code. Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
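
The structural level of the de-duplication mentioned above can be illustrated with Python's standard `ast` module: two code blocks count as structurally identical if their syntax trees match after identifier names are erased. This is a minimal sketch of that general idea, not NN-RAG's actual procedure, which the abstract does not detail; `structural_fingerprint` and the normalization choices are assumptions.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Erase variable, argument, function, and class names from a syntax tree."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_ClassDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node


def structural_fingerprint(source):
    """Hash of the name-normalized AST; equal hashes suggest the same structure."""
    tree = _Normalizer().visit(ast.parse(source))
    # ast.dump omits line and column attributes by default, so layout is ignored.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "def f(x):\n    y = x + 1\n    return y\n"
    b = "def g(inp):\n    out = inp + 1   # renamed copy\n    return out\n"
    c = "def f(x):\n    return x * 2\n"
    print(structural_fingerprint(a) == structural_fingerprint(b))
    print(structural_fingerprint(a) == structural_fingerprint(c))
```

Attribute names such as layer types are kept intact in this sketch, so two blocks that differ only in which layer class they call would still be treated as distinct structures.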

2511.19953 2026-05-19 cs.CV 版本更新

Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting

少监督,多观察:基于原型引导提示的免训练细胞核实例分割

Wen Zhang, Qin Ren, Wenjing Liu, Haibin Ling, Chenyu You

发表机构 * Stony Brook University(石溪大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文提出SPROUT框架,利用组织学先验知识构建切片特定的参考原型,并通过部分最优传输方案引导特征对齐,使SAM模型无需训练即可实现精准的细胞核分割。

Comments ICML 2026; 44 pages, 25 figures, 26 tables; Code at https://github.com/Y-Research-SBU/SPROUT

详情
AI中文摘要

准确的细胞核实例分割是计算病理学中的关键任务,支持数据驱动的临床洞察并促进下游转化应用。尽管大型视觉基础模型在零样本生物医学分割中显示出潜力,但大多数现有方法仍依赖密集监督和计算昂贵的微调。因此,免训练方法是一个有吸引力的研究方向,但尚未得到充分探索。本文介绍SPROUT,一种完全无需训练和标注的提示框架,用于细胞核实例分割。SPROUT利用组织学先验知识构建切片特定的参考原型,以缓解领域差距。这些原型通过部分最优传输方案逐步引导特征对齐。所得前景和背景特征被转换为正负点提示,使Segment Anything Model (SAM)能够在不进行任何参数更新的情况下生成精确的细胞核轮廓。在多个组织病理学基准上的广泛实验表明,SPROUT在无需监督或重新训练的情况下取得了有竞争力的性能,为可扩展的免训练细胞核实例分割建立了新范式。

英文摘要

Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.

2511.18801 2026-05-19 cs.CV 版本更新

PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion

PartDiffuser:通过离散扩散实现部件级3D网格生成

Yichen Yang, Hong Li, Haodong Zhu, Linin Yang, Guojun Lei, Sheng Xu, Baochang Zhang

发表机构 * Beihang University(北京航空航天大学) Communication University of China(中国传媒大学) Zhejiang University(浙江大学)

AI总结 PartDiffuser提出一种半自回归扩散框架,通过部件级方法生成高保真3D网格,有效平衡全局结构与局部细节。

详情
AI中文摘要

现有自回归方法在生成艺术家设计的网格时难以平衡全局结构一致性与高保真局部细节,易受误差累积影响。为解决此问题,我们提出PartDiffuser,一种新的半自回归扩散框架用于点云到网格生成。该方法首先对网格进行语义分割,然后以部件级方式进行操作:通过部件间的自回归确保全局拓扑,同时在每个语义部件内使用并行离散扩散过程精确重建高频几何特征。PartDiffuser基于DiT架构,引入了部件感知的交叉注意力机制,利用点云作为层次化的几何条件,动态控制生成过程,从而有效解耦全局和局部生成任务。实验表明,该方法在生成具有丰富细节的3D网格方面显著优于现有最先进模型,展现出适合实际应用的卓越细节表现。

英文摘要

Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.

2511.14223 2026-05-19 cs.CV 版本更新

StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

StreamingTalker:基于自回归扩散模型的音频驱动3D面部动画

Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng, Xinyu Zhu, Fan Jia, Xiaowei Zhou, Hujun Bao

发表机构 * State Key Laboratory of CAD&CG, Zhejiang University(浙江大学计算机辅助设计与图形学国家重点实验室) Ant Group(蚂蚁集团) College of Computer Science, Zhejiang University(浙江大学计算机学院)

AI总结 本文提出基于自回归扩散模型的StreamingTalker,解决音频驱动3D面部动画中长音频处理延迟和超训练范围的问题,通过动态条件生成高质量实时面部动作。

详情
AI中文摘要

本文聚焦于语音驱动的3D面部动画任务,旨在生成逼真且同步的面部动作,通过音频输入驱动。近期方法采用音频条件扩散模型生成表达自然的动画,但处理整个音频序列存在两个主要挑战:处理超出训练范围的音频序列效果差,且长音频输入会产生显著延迟。为解决这些问题,我们提出一种新的自回归扩散模型,以流式方式处理输入音频。该设计确保了不同音频长度的灵活性,并实现了低延迟,与音频持续时间无关。具体而言,我们选择少量过去帧作为历史动作上下文,并将其与音频输入结合,生成动态条件。该条件引导扩散过程迭代生成面部动作帧,实现高质量的实时合成。此外,我们实现了实时交互演示,突显了该方法的有效性和效率。代码将在https://zju3dv.github.io/StreamingTalker/上发布。

英文摘要

This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
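The streaming control flow described above (a bounded window of past frames combined with the current audio input to form a condition for the sampler) can be sketched as follows. The sampler itself is a stand-in: `generate_frame`, `toy_sampler`, and `history_size` are hypothetical names, and the real diffusion sampling loop is not reproduced here.

```python
from collections import deque


def stream_animation(audio_chunks, generate_frame, history_size=4):
    """Yield one motion frame per audio chunk as the audio arrives.

    `generate_frame` stands in for the conditioned diffusion sampler; it
    receives a dict holding the current audio chunk and the recent frames.
    """
    history = deque(maxlen=history_size)  # oldest frames drop out automatically
    for chunk in audio_chunks:
        condition = {"audio": chunk, "history": list(history)}
        frame = generate_frame(condition)
        history.append(frame)
        yield frame


def toy_sampler(condition):
    """Placeholder sampler (not a diffusion model): audio plus history sum."""
    return condition["audio"] + sum(condition["history"])
```

Because the history window is bounded, the per-frame cost in this sketch does not grow with the total audio length, which is the property the abstract attributes to the streaming design.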

2511.07329 2026-05-19 cs.LG cs.CV 版本更新

Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis

面向高级大语言模型分析的分形启发式计算架构构建

Yash Mittal, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg(维尔茨堡大学CAIDAS计算机视觉实验室)

AI总结 本文提出FractalNet框架,通过递归模板模式自动生成并评估卷积神经网络架构,实现高效稳定的网络结构探索,实验显示分形架构在五轮训练后达到80.18%的准确率。

详情
AI中文摘要

本文提出FractalNet,一种基于分形设计原理的框架,通过递归模板模式自动生成并评估卷积神经网络(CNN)架构。该框架通过递归分形模板系统地变化关键参数如分形深度、列宽和层配置,而非依赖计算成本高的神经架构搜索(NAS)方法。框架包含生成器、分形模板模块和运行器模块,生成1200多个CNN架构在CIFAR-10数据集上进行测试。使用PyTorch进行训练,采用随机梯度下降和自动混合精度及梯度检查点技术降低计算开销。实验结果显示分形架构具有稳定的训练动态和竞争性性能,五轮训练后验证准确率为60-70%,峰值准确率为80.18%。这些发现表明递归分形结构在平衡网络深度和宽度方面有效,并支持大规模自动化架构探索。

英文摘要

This paper proposes FractalNet, a framework based on fractal design principles that automatically generates and evaluates convolutional neural network (CNN) architectures using recursive template patterns. Rather than relying on computationally expensive Neural Architecture Search (NAS) methods, the framework explores a structured architecture space defined by recursive fractal templates that systematically vary key parameters such as fractal depth, column width, and layer configurations. The framework consists of three core components: a generator that produces candidate architectures via controlled permutations of convolutional, normalization, activation, and dropout layers; a fractal template module that enforces recursive multi-path structural patterns; and a runner module that manages model training, evaluation, and logging. Using this system, over 1,200 distinct CNN architectures were automatically generated and evaluated on the CIFAR-10 image classification benchmark. Training was performed in PyTorch using stochastic gradient descent with Automatic Mixed Precision (AMP) and gradient checkpointing to reduce computational overhead. Experimental results demonstrate that fractal-based architectures exhibit stable training dynamics and achieve competitive performance, with an average validation accuracy of 60-70% and a peak accuracy of 80.18% after only five training epochs. These findings suggest that recursive fractal structures provide an effective means of balancing network depth and width while supporting large-scale automated architecture exploration. The proposed framework offers a resource-efficient and interpretable approach to systematic neural architecture experimentation.
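A recursive fractal template can be illustrated with the classic fractal expansion rule, where a block with one column is a single convolution and a block with C+1 columns joins two stacked copies of the C-column block with one extra convolution. Whether this framework uses exactly that rule is an assumption; the abstract only says the templates are recursive and multi-path. The function names and the tuple encoding are hypothetical.

```python
def expand(columns: int):
    """Build a nested description of a fractal block with `columns` columns.

    Assumed rule: f_1 = conv, f_{C+1} = join(seq(f_C, f_C), conv).
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    if columns == 1:
        return "conv"
    sub = expand(columns - 1)
    return ("join", ("seq", sub, sub), "conv")


def count_convs(block) -> int:
    """Total number of convolution layers in the block."""
    if block == "conv":
        return 1
    return sum(count_convs(child) for child in block[1:])


def longest_path(block) -> int:
    """Number of convolutions on the deepest path through the block."""
    if block == "conv":
        return 1
    tag, *children = block
    depths = [longest_path(child) for child in children]
    # Sequential children add up; joined branches run in parallel.
    return sum(depths) if tag == "seq" else max(depths)
```

Under this rule a block with C columns has 2^C - 1 convolutions and a longest path of 2^(C-1), so the fractal depth parameter controls both capacity and depth at once.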

2510.16416 2026-05-19 cs.CV cs.AI 版本更新

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

SSL4RL:重新审视自监督学习作为视觉语言推理的内在奖励

Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang

发表机构 * Peking University(北京大学) MIT(麻省理工学院) TUM(慕尼黑工业大学) Meituan(美团) Peking University School of CIT, MCML, MDSIEECS and CSAIL(北京大学计算机学院、MCML、MDSIEECS以及CSAIL)

AI总结 本文提出SSL4RL框架,利用自监督学习任务作为强化学习的可验证奖励,提升视觉语言推理性能,并通过实验验证其在多模态模型对齐中的有效性。

详情
AI中文摘要

视觉语言模型(VLMs)通过整合大语言模型与视觉输入展现出显著能力,但往往无法充分利用视觉证据,要么依赖视觉任务中的语言先验,要么在推理中依赖文本捷径。尽管强化学习(RL)可以对齐模型与期望行为,但其在VLMs中的应用受限于缺乏可扩展且可靠的奖励机制。为克服这一挑战,我们提出了SSL4RL,一种新的框架,利用自监督学习(SSL)任务作为RL微调的可验证奖励源。我们的方法将SSL目标,如预测图像旋转或重建遮蔽块,转化为密集的自动奖励信号,消除了对人类偏好数据或不可靠的AI评估者的需求。实验表明,SSL4RL在视觉中心和视觉语言推理基准上显著提高了性能。此外,通过系统消融分析,我们识别出影响SSL4RL任务有效性的关键因素,如任务难度、模型规模和与目标领域语义的一致性,为未来工作提供了新的设计原则。我们还通过将其应用于图学习,展示了框架的通用性,其中取得了显著收益。SSL4RL建立了一种利用可验证的自监督目标对齐多模态模型的灵活且有效的方法论。

英文摘要

Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors (task difficulty, model scale, and semantic alignment with the target domain) that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
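To make the idea of a verifiable self-supervised reward concrete, here is a minimal sketch of the rotation-prediction case. It assumes images are small 2D lists and that the model's answer has already been parsed into a number of quarter turns; the function names are hypothetical and the binary reward is a simplification of whatever reward shaping the paper uses.

```python
import random


def rotate90(grid, k):
    """Rotate a 2D list counter-clockwise by k quarter turns."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def make_rotation_task(image, rng):
    """Create a (rotated image, ground-truth label) pair with no human labels."""
    k = rng.randrange(4)
    return rotate90(image, k), k


def rotation_reward(predicted_k, true_k):
    """Return 1.0 when the predicted quarter-turn count is right, else 0.0."""
    if not isinstance(predicted_k, int):
        return 0.0  # unparseable answers earn nothing
    return 1.0 if predicted_k % 4 == true_k else 0.0
```

The label comes from the transformation itself, so the reward can be checked automatically without human preference data or a judge model.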

2509.26037 2026-05-19 cs.AI cs.CV cs.LG 版本更新

CoLLM-NAS: Collaborative Large Language Models for Efficient Knowledge-Guided Neural Architecture Search

CoLLM-NAS:协作大型语言模型用于高效知识引导的神经架构搜索

Zhe Li, Zhiwei Lin, Yongtao Wang

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学计算机科学技术研究院)

AI总结 本文提出CoLLM-NAS,一种基于协作大型语言模型的两阶段神经架构搜索框架,通过导航和生成两个LLM及协调模块有效引导搜索过程,在提升效率的同时取得新的最先进结果。

Comments Accepted as Oral at CVPR 2026 Workshop on Neural Architecture Search (NAS)

详情
AI中文摘要

将大型语言模型(LLMs)与神经架构搜索(NAS)结合,为自动设计神经架构提供了新可能。然而,大多数现有方法存在关键局限,包括生成的架构无效、计算效率低以及性能不及传统NAS。本文提出协作式基于LLM的NAS(CoLLM-NAS),一种两阶段NAS框架,由两个互补的LLM驱动知识引导搜索。具体而言,我们提出有状态的导航LLM(Navigator)来指导搜索方向,无状态的生成LLM(Generator)来合成高质量候选架构,以及协调模块(Coordinator)来协调LLM间通信并管理评估过程。CoLLM-NAS将LLM对结构化神经架构的固有知识与来自迭代反馈和历史轨迹的渐进知识相结合,高效引导搜索过程。在ImageNet和NAS-Bench-201上的实验结果表明,CoLLM-NAS超越现有NAS方法和传统搜索算法,取得新的最先进结果,同时将搜索成本显著降低4-10倍。此外,CoLLM-NAS在多种搜索空间(如MobileNet、ShuffleNet和AutoFormer)中一致提升各种两阶段NAS方法(如OFA、SPOS和AutoFormer)的性能和效率,展示出优秀的泛化能力。

英文摘要

The integration of Large Language Models (LLMs) with Neural Architecture Search (NAS) has introduced new possibilities for automating the design of neural architectures. However, most existing methods face critical limitations, including architectural invalidity, computational inefficiency, and inferior performance compared to traditional NAS. In this work, we present Collaborative LLM-based NAS (CoLLM-NAS), a two-stage NAS framework with knowledge-guided search driven by two complementary LLMs. Specifically, we propose a stateful Navigator LLM to guide search direction, a stateless Generator LLM to synthesize high-quality candidates, and a Coordinator module to orchestrate inter-LLM communication and manage evaluation processes. CoLLM-NAS efficiently guides the search process by combining LLMs' inherent knowledge of structured neural architectures with progressive knowledge from iterative feedback and historical trajectory. Experimental results on ImageNet and NAS-Bench-201 show that CoLLM-NAS surpasses existing NAS methods and conventional search algorithms, achieving new state-of-the-art results while significantly reducing search costs by $4$--$10\times$. Furthermore, CoLLM-NAS consistently enhances the performance and efficiency of various two-stage NAS methods (e.g., OFA, SPOS, and AutoFormer) across diverse search spaces (e.g., MobileNet, ShuffleNet, and AutoFormer), demonstrating its excellent generalization.

2509.13270 2026-05-19 cs.CV cs.AI 版本更新

RadGame: An AI-Powered Platform for Radiology Education

RadGame:一种基于人工智能的放射学教育平台

Mohammed Baharoon, Siavash Raissi, John S. Jun, Thibault Heintz, Mahmoud Alabbad, Ali Alburkani, Sung Eun Kim, Kent Kleinschmidt, Abdulrahman O. Alhumaydhi, Mohannad Mohammed G. Alghamdi, Jeremy Francis Palacio, Mohammed Bukhaytan, Noah Michael Prudlo, Rithvik Akula, Brady Chrisler, Benjamin Galligos, Mohammed O. Almutairi, Mazeen Mohammed Alanazi, Nasser M. Alrashdi, Joel Jihwan Hwang, Sri Sai Dinesh Jaliparthi, Luke David Nelson, Nathaniel Nguyen, Sathvik Suryadevara, Steven Kim, Mohammed F. Mohammed, Yevgeniy R. Semenov, Kun-Hsing Yu, Abdulrhman Aljouie, Hassan AlOmaish, Adam Rodman, Pranav Rajpurkar

发表机构 * Harvard Medical School(哈佛医学院) Mass General Brigham(麻省总医院布莱根) Maastricht University(马斯特里赫特大学) Department of Medical Imaging, King Abdulaziz Medical City, Ministry of National Guard, Riyadh, Saudi Arabia(沙特阿拉伯利雅得国民警卫队部阿卜杜勒阿齐兹国王医疗城医学影像科) National Strategic Technology Research Institute, Seoul National University Hospital(首尔国立大学医院国家战略技术研究所) Saint Louis University School of Medicine(圣路易斯大学医学院) College of Medicine, King Saud bin Abdulaziz University for Health Sciences(沙特国王本·阿卜杜勒阿齐兹健康科学大学医学院) Tufts University School of Medicine(塔夫茨大学医学院) Department of Biomedical Informatics, Harvard Medical School(哈佛医学院生物医学信息学系)

AI总结 RadGame通过结合游戏化与大规模公开数据集,提供AI驱动的反馈,提升放射学教育中的定位和报告撰写能力,显著提高学习效果。

Comments ML4H Version

详情
AI中文摘要

我们介绍了RadGame,一种基于人工智能的游戏化放射学教育平台,针对两项核心技能:病灶定位和报告撰写。传统放射学培训依赖被动接触病例,或在带教放射科医生实时指导下的主动练习,限制了获得即时、可扩展反馈的机会。RadGame通过结合游戏化、大规模公开数据集和自动化AI反馈来弥补这一不足,为学习者提供清晰的结构化指导。在RadGame Localize中,玩家绘制边界框以定位异常,系统自动将其与公开数据集中放射科医生绘制的标注进行比较,并由视觉语言模型针对用户遗漏的发现生成可视化解释。在RadGame Report中,玩家根据胸部X光片、患者年龄和检查指征撰写影像所见,并获得基于放射学报告生成指标的结构化AI反馈,反馈会指出与公开数据集中放射科医生所写参考报告相比的错误和遗漏,最终给出表现和风格评分。在前瞻性评估中,在查看相同病例后,使用RadGame的参与者定位准确性提高了68%,而传统被动方法为17%;报告撰写准确性提高了31%,而传统方法为4%。RadGame展示了AI驱动的游戏化在提供可扩展、反馈丰富的放射学培训方面的潜力,并重新构想了医疗AI资源在教育中的应用。

英文摘要

We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and vision-language models generate visual explanations for the findings the user missed. In RadGame Report, players compose findings given a chest X-ray, patient age, and indication, and receive structured AI feedback based on radiology report generation metrics. The feedback highlights errors and omissions relative to a radiologist's written ground-truth report from public datasets and produces a final performance and style score. In a prospective evaluation, after seeing the same cases, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods, and a 31% improvement in report-writing accuracy compared to 4% with traditional methods. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.
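The abstract does not say how player boxes are compared with radiologist annotations. A common choice for this kind of comparison is intersection over union (IoU) with a threshold, sketched below as an assumption rather than as RadGame's actual scoring rule. Boxes are `(x1, y1, x2, y2)` tuples, and the function names and the 0.5 threshold are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def missed_findings(player_boxes, reference_boxes, threshold=0.5):
    """Reference boxes that no player box overlaps by at least `threshold`."""
    return [ref for ref in reference_boxes
            if all(iou(ref, box) < threshold for box in player_boxes)]
```

The list returned by `missed_findings` corresponds to the findings for which a learner would then need an explanation.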

2508.04227 2026-05-19 cs.CV cs.LG 版本更新

Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

视觉语言模型的持续学习:超越遗忘的综述与分类

Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian

AI总结 本文综述了视觉语言模型的持续学习挑战,提出四种核心范式以解决跨模态特征漂移和灾难性遗忘问题,强调零样本学习和智能体生态系统的发展。

详情
AI中文摘要

视觉语言模型(VLMs)和近期多模态大语言模型(MLLMs)通过前所未有的跨模态对齐和零样本泛化革新了人工智能。然而,使它们能够从非平稳数据中持续学习仍是一个重大挑战,因为它们的跨模态对齐和泛化能力特别容易受到灾难性遗忘的影响。不同于传统单模态持续学习(CL),VLMs面临独特的挑战,如跨模态特征漂移、由于共享架构导致的参数干扰以及零样本能力侵蚀。此外,生成式MLLMs表现出一种独特的“对齐税”,其中灾难性遗忘不仅表现为事实性遗忘,还表现为深度链式思维(CoT)推理的系统性崩溃。本文首次全面、诊断性地回顾了预测VLMs和生成式MLLMs的持续学习。我们系统地分解了上述失败模式,并提出了一个以挑战为导向的分类,包括四个核心范式:(1)多模态重播策略解决显式和隐式记忆漂移;(2)跨模态正则化强制拓扑和几何对齐;(3)参数高效适应利用动态路由和子空间投影;以及新兴的(4)模型融合与解耦范式。我们批判性地分析了评估协议的演变,强调了向双轨基准(领域 vs. 能力 CL)和微诊断 CoT 评估的转变。最后,我们绘制了未来研究的路线图,强调组合式零样本学习、具身AI与传感器融合以及自主智能体生态系统。所有资源均可在:https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models 上找到。

英文摘要

Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. Furthermore, generative MLLMs exhibit a unique "alignment tax," where catastrophic forgetting manifests not merely as factual amnesia, but as a systemic collapse of deep Chain-of-Thought (CoT) reasoning. This survey presents the first comprehensive, diagnostic review bridging continual learning for both predictive VLMs and generative MLLMs. We systematically deconstruct the aforementioned failure modes and propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms. We critically analyze the evolution of evaluation protocols, highlighting the essential shift toward dual-track benchmarks (Domain vs. Ability CL) and micro-diagnostic CoT evaluations. Finally, we chart a roadmap for future research, emphasizing compositional zero-shot learning, embodied AI with sensor fusion, and autonomous agentic ecosystems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.

2507.12969 2026-05-19 cs.LG cs.CV 版本更新

WaveletInception Networks for on-board Vibration-Based Infrastructure Health Monitoring

小波 inception 网络用于车载振动基基础设施健康监测

Reza Riahi Samani, Alfredo Nunez, Bart De Schutter

发表机构 * Delft Center for Systems and Control (DCSC), Delft University of Technology(代尔夫特理工大学系统与控制中心) Section of Railway Engineering, Department of Engineering Structures, Delft University of Technology(工程结构系铁路工程部)

AI总结 本文提出WaveletInception-BiGRU网络,通过可学习小波包变换提取频谱特征,结合Inception-残差网络进行多尺度特征学习,并利用BiGRU模块整合时间依赖性,实现无需预处理的振动信号分析,提升车载基础设施健康监测的准确性和自动化水平。

Comments Under review for the Journal of Engineering Application of Artificial Intelligence

详情
AI中文摘要

本文提出了一种深度学习框架,用于分析车载振动响应信号以进行基础设施健康监测。所提出的WaveletInception-BiGRU网络采用可学习的小波包变换(LWPT)进行早期频谱特征提取,随后通过一维Inception-残差网络(1D Inception-ResNet)模块进行多尺度、高级特征学习。双向门控循环单元(BiGRU)模块则整合时间依赖性,并纳入测量速度等运行条件。该方法能够有效分析在不同速度下记录的振动信号,无需显式的信号预处理。序列估计头进一步利用双向时间信息,对基础设施健康状况给出准确的局部评估。最终,该框架生成与基础设施物理布局在空间上对应的高分辨率健康剖面。基于真实测量数据的轨道刚度回归和过渡区分类案例研究表明,所提出的框架显著优于现有最先进方法,凸显了其在准确、局部化和自动化车载基础设施健康监测中的潜力。

英文摘要

This paper presents a deep learning framework for analyzing on-board vibration response signals in infrastructure health monitoring. The proposed WaveletInception-BiGRU network uses a Learnable Wavelet Packet Transform (LWPT) for early spectral feature extraction, followed by one-dimensional Inception-Residual Network (1D Inception-ResNet) modules for multi-scale, high-level feature learning. Bidirectional Gated Recurrent Unit (BiGRU) modules then integrate temporal dependencies and incorporate operational conditions, such as the measurement speed. This approach enables effective analysis of vibration signals recorded at varying speeds, eliminating the need for explicit signal preprocessing. The sequential estimation head further leverages bidirectional temporal information to produce an accurate, localized assessment of infrastructure health. Ultimately, the framework generates high-resolution health profiles spatially mapped to the physical layout of the infrastructure. Case studies involving track stiffness regression and transition zone classification using real-world measurements demonstrate that the proposed framework significantly outperforms state-of-the-art methods, underscoring its potential for accurate, localized, and automated on-board infrastructure health monitoring.

2506.11925 2026-05-19 cs.AR cs.AI cs.CV cs.LG 版本更新

Real-World Deployment of a Lane Change Prediction Architecture Based on Knowledge Graph Embeddings and Bayesian Inference

基于知识图谱嵌入和贝叶斯推断的车道变换预测架构的现实世界部署

M. Manzour, Catherine M. Elias, Omar M. Shehata, R. Izquierdo, M. A. Sotelo

发表机构 * Department of Computer Engineering, University of Alcalá(阿尔卡拉大学计算机工程系) Department of Computer Science, German University in Cairo(开罗德国大学计算机科学系) Department of Mechatronics, German University in Cairo(开罗德国大学机电系)

AI总结 本文提出基于知识图谱嵌入和贝叶斯推断的车道变换预测系统,通过现实硬件验证,实现了算法与道路部署的结合,提前3-4秒预测目标车辆车道变换,确保安全。

Journal ref 2025 IEEE International Conference on Vehicular Electronics and Safety (ICVES)

详情
AI中文摘要

近年来,车道变换预测研究取得显著进展,但大多数研究局限于仿真或数据集结果,未能实现算法与道路部署的结合。本文通过现实硬件展示了基于知识图谱嵌入(KGEs)和贝叶斯推断的车道变换预测系统。该系统包含感知模块和预测模块:感知模块感知环境,提取数值特征并转换为语言类别,与预测模块通信;预测模块执行KGE和贝叶斯推断模型,预测目标车辆的行驶动作并转换为纵向制动动作。现实硬件实验验证表明,该预测系统能提前3-4秒预测目标车辆的车道变换,为自动驾驶车辆提供充足反应时间,确保车道变换安全。

英文摘要

Research on lane change prediction has gained a lot of momentum in the last couple of years. However, most research is confined to simulation or results obtained from datasets, leaving a gap between algorithmic advances and on-road deployment. This work closes that gap by demonstrating, on real hardware, a lane-change prediction system based on Knowledge Graph Embeddings (KGEs) and Bayesian inference. Moreover, the ego vehicle employs a longitudinal braking action to ensure the safety of both itself and the surrounding vehicles. Our architecture consists of two modules: (i) a perception module that senses the environment, derives numerical input features, converts them into linguistic categories, and communicates them to the prediction module; and (ii) a pretrained prediction module that executes a KGE and Bayesian inference model to anticipate the target vehicle's maneuver and transforms the prediction into a longitudinal braking action. Real-world hardware experimental validation demonstrates that our prediction system anticipates the target vehicle's lane change three to four seconds in advance, providing the ego vehicle sufficient time to react and allowing the target vehicle to make the lane change safely.
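Two steps in this pipeline lend themselves to a small sketch: turning a numerical feature into a linguistic category, and updating a lane-change belief with Bayes' rule. The feature name, bin edges, and category labels below are hypothetical illustration values, not taken from the paper, and the update is plain Bayes' rule rather than the paper's KGE-based model.

```python
def categorize(value, bins, default):
    """Map a number to the label of the first bin whose upper bound exceeds it."""
    for upper_bound, label in bins:
        if value < upper_bound:
            return label
    return default


# Hypothetical bins for the target vehicle's lateral speed, in m/s.
LATERAL_SPEED_BINS = [(0.1, "stable"), (0.5, "drifting")]


def describe_lateral_speed(lateral_speed):
    """Convert a signed lateral speed into a linguistic category."""
    return categorize(abs(lateral_speed), LATERAL_SPEED_BINS, "cutting_in")


def bayes_update(prior, p_obs_given_change, p_obs_given_keep):
    """Posterior probability of a lane change after one observation."""
    numerator = prior * p_obs_given_change
    evidence = numerator + (1.0 - prior) * p_obs_given_keep
    return numerator / evidence if evidence > 0 else prior
```

In a deployed system the two likelihoods would come from the trained prediction module; here they are simply arguments supplied by the caller.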

2506.05442 2026-05-19 cs.CV cs.AI 版本更新

Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

结构化标注加速面向端到端自动驾驶的视觉-语言模型

Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) KargoBot

AI总结 本文提出结构化标注的NuScenes-S数据集和紧凑型FastDrive模型,提升自动驾驶中决策任务的效率与准确性,实验显示在结构化数据集上性能优异,推理速度提升超10倍。

详情
AI中文摘要

视觉-语言模型(VLMs)因其类人推理能力成为端到端自动驾驶的有前景方法。然而,现有VLMs与现实应用之间仍存在显著差距。主要限制是现有松散格式的语言描述数据集不适用于机器,可能引入冗余。此外,VLMs的高计算成本和大规模阻碍了推理速度和现实部署。为弥合这一差距,本文引入了结构化且简洁的基准数据集NuScenes-S,该数据集源自NuScenes数据集并包含适用于机器的结构化表示。此外,我们提出了FastDrive,一个参数仅为0.9B的紧凑型VLM基线。与现有参数超过7B且未结构化的VLMs(如LLaVA-1.5)相比,FastDrive能够理解和生成结构化且简洁的描述,以高效率生成机器友好的驾驶决策。大量实验表明,FastDrive在结构化数据集上实现了竞争性的性能,决策任务的精度提高了约20%,同时在推理速度上超越大规模参数基线超过10倍。此外,消融研究进一步聚焦于场景注释(如天气、时间)对决策任务的影响,证明了其在自动驾驶决策任务中的重要性。

英文摘要

Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remain between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, the high computational cost and massive scale of VLMs hinder inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on the structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while delivering an inference speedup of over 10x relative to the massive-parameter baseline. Additionally, ablation studies examine the impact of scene annotations (e.g., weather, time of day) on decision-making tasks, demonstrating their importance in autonomous driving.

2505.20914 2026-05-19 cs.CV 版本更新

Geometry-Editable and Appearance-Preserving Object Composition

几何可编辑且外观保留的对象合成

Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen

发表机构 * South China University of Technology(华南理工大学) Guangdong University of Technology(广东工业大学) Sun Yat-sen University(中山大学)

AI总结 本文提出DGAD模型,通过语义嵌入和交叉注意力机制实现几何可编辑与外观保留,提升对象合成的精度与真实性。

详情
AI中文摘要

通用对象合成(GOC)旨在无缝整合目标对象到背景场景中,同时保持其细粒度外观细节。最近的方法通过语义嵌入和扩散模型实现几何可编辑生成,但高紧凑嵌入仅编码高层语义线索,不可避免地丢弃细粒度外观细节。我们引入一种解耦的几何可编辑且外观保留扩散(DGAD)模型,首先利用语义嵌入隐式捕捉所需几何变换,然后通过交叉注意力检索机制对齐细粒度外观特征与几何编辑表示,从而在对象合成中实现精确几何编辑和忠实外观保留。具体而言,DGAD基于CLIP/DINO衍生和参考网络提取语义嵌入和外观保留表示,然后在解耦方式下无缝整合到编码和解码管道中。首先将语义嵌入整合到具有强大空间推理能力的预训练扩散模型中,隐式捕捉对象几何,从而实现灵活的对象操作和确保有效的可编辑性。然后设计一种密集交叉注意力机制,利用隐式学习的对象几何检索并空间对齐外观特征,确保忠实的外观一致性。在公共基准上的广泛实验验证了所提DGAD框架的有效性。

英文摘要

General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.
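The retrieval step can be pictured as standard scaled dot-product cross-attention, where each position of the geometry-edited representation queries the reference appearance features. The sketch below is a generic pure-Python version without learned projections or multiple heads; it does not reproduce DGAD's dense cross-attention module, and the argument names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """For each query, return an attention-weighted mix of the value vectors.

    queries: vectors from the geometry-edited representation
    keys, values: vectors from the reference appearance features
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs
```

A query that closely matches one key receives almost all of its output from that key's value, which is the sense in which attention "retrieves" the appearance features for a region.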

2505.17674 2026-05-19 cs.CV 版本更新

SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding

SVL:基于脉冲的视觉-语言预训练用于高效的3D开放世界理解

Xuerui Qiu, Peixi Wu, Yaozhi Wen, Shaowei Gu, Yuqi Pan, Xinhao Luo, Bo XU, Guoqi Li

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Future Technology, University of Chinese Academy of Sciences(中国科学院大学未来技术学院) Zhongguancun Academy(中关村学院) University of Science and Technology of China(中国科学技术大学) Peking University(北京大学)

AI总结 SVL提出基于脉冲的视觉-语言预训练框架,通过多尺度三元组对齐和可重参数化的视觉-语言整合,提升SNN在3D开放世界理解中的性能,实现高效零样本3D分类和多模态问答。

Comments ICML 2026 Spotlight

详情
AI中文摘要

脉冲神经网络(SNNs)提供了一种高能效的3D时空特征提取方式。然而,由于预训练策略不足,现有SNNs在性能上仍显著落后于人工神经网络(ANNs)。这些限制表现为泛化能力有限、任务特定性强和缺乏多模态理解,特别是在多模态问答和零样本3D分类等挑战性任务中。为克服这些挑战,我们提出了基于脉冲的视觉-语言(SVL)预训练框架,使SNNs能够实现开放世界3D理解,同时保持脉冲驱动的效率。SVL引入了两个关键组件:(i)多尺度三元组对齐(MTA),用于在3D、图像和文本模态之间进行无标签的三元组对比学习;(ii)可重参数化的视觉-语言整合(Rep-VLI),以实现轻量级推理,而无需依赖大型文本编码器。广泛的实验表明,SVL在零样本3D分类中实现了85.4%的top-1准确率,超过了先进的ANN模型,并在下游任务中持续优于先前的SNNs,包括3D分类(+6.1%)、DVS动作识别(+2.1%)、3D检测(+1.1%)和3D分割(+2.1%),且效率显著。此外,SVL使SNNs能够执行开放世界3D问答任务,有时甚至优于ANNs。据我们所知,SVL是首个可扩展、可泛化且硬件友好的3D开放世界理解范式,有效弥合了SNNs和ANNs在复杂开放世界理解任务中的差距。代码见 https://github.com/bollossom/SVL。

英文摘要

Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available at https://github.com/bollossom/SVL.
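Contrastive alignment across three modalities can be sketched as a sum of pairwise InfoNCE terms, where the i-th 3D, image, and text embeddings form a matched triplet. This is a single-scale, one-directional simplification with a hypothetical temperature of 0.07; it is not the paper's MTA objective, whose exact form the abstract does not give.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchor i to positive i among all positives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def triple_alignment_loss(points, images, texts):
    """Sum of pairwise contrastive terms over the three modality pairs."""
    return (info_nce(points, images)
            + info_nce(points, texts)
            + info_nce(images, texts))
```

No class labels appear anywhere in the loss; the only supervision is which embeddings belong to the same triplet.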

2412.11149 2026-05-19 cs.CV 版本更新

A Comprehensive Survey of Action Quality Assessment: Method and Benchmark

动作质量评估的全面综述:方法与基准

Kanglei Zhou, Ruizhi Cai, Liyuan Wang, Hubert P. H. Shum, Xiaohui Liang

AI总结 本文综述了动作质量评估的最新进展,提出模态驱动的分层分类体系,建立统一基准,分析方法演变和研究趋势,探讨当前挑战与未来方向。

Comments Published in Pattern Recognition. Project page and benchmark resources are available online

Journal ref Pattern Recognition, 2026, Article 113933

详情
AI中文摘要

动作质量评估(AQA)旨在自动评估人类动作的执行质量,并已广泛应用于体育分析、技能评估和医疗领域。然而,AQA研究往往是在异质数据集和评估设置下开发的,使得方法间的系统比较困难。为了解决这些挑战,我们呈现了AQA近期进展的全面综述。特别地,我们提出了一种模态驱动的分层分类体系,将现有方法分为基于视频、基于骨架和多模态方法,并分析代表性模型的方法学演变。我们进一步通过整合多样化数据集和标准化评估协议,为代表性的基于视频的AQA方法建立了统一基准,从而能够在准确性和计算效率两方面进行一致的比较。最后,我们分析了新兴研究趋势,识别了当前AQA研究中的关键挑战,并概述了从短期的方法学进步到长期由新兴AI范式带来的机遇的未来方向。该项目的网页可在https://ZhouKanglei.github.io/AQA-Survey上找到。

英文摘要

Action Quality Assessment (AQA) aims to automatically evaluate how well human actions are performed and has been widely applied in sports analysis, skill assessment, and healthcare. However, AQA studies are often developed under heterogeneous datasets and evaluation settings, making systematic comparison across methods difficult. To address these challenges, we present a comprehensive survey of recent advances in AQA. In particular, we propose a modality-driven hierarchical taxonomy that organizes existing methods into video-based, skeleton-based, and multi-modal approaches, and analyze the methodological evolution of representative models. We further establish a unified benchmark for representative video-based AQA methods by integrating diverse datasets and standardized evaluation protocols, enabling consistent comparison in terms of both accuracy and computational efficiency. Finally, we analyze emerging research trends, identify key challenges in current AQA research, and outline future directions ranging from near-term methodological advances to longer-term opportunities enabled by emerging AI paradigms. The project web page can be found at https://ZhouKanglei.github.io/AQA-Survey.

2412.00666 2026-05-19 cs.CV 版本更新

Explaining Object Detectors via Collective Contribution of Pixels

通过像素的集体贡献解释目标检测器

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

发表机构 * Chiba University(千叶大学) National Institute of Informatics(国立情报学研究所)

AI总结 本文提出基于Shapley值和交互的博弈论方法,以捕捉像素的个体和集体贡献,提升目标检测器解释的准确性。

Comments Accepted to CVPR 2026 (Highlight); code is available at: https://github.com/tttt-0814/VX-CODE

详情
AI中文摘要

视觉解释对于增强目标检测器的可靠性至关重要。目标检测器通过综合评估多个视觉特征来识别和定位实例。在生成解释时,忽视检测中的这些集体影响可能导致遗漏组合线索或捕捉到虚假相关性。然而,现有方法通常仅关注单个像素的贡献,忽视了多个像素的集体贡献。为了解决这一限制,我们提出了一种基于Shapley值和交互的博弈论方法,以显式捕捉个体和集体像素贡献。我们的方法为边界框定位和类别确定提供解释,突出对检测至关重要的区域。广泛实验表明,所提出的方法在识别重要区域的准确性上优于最先进的方法。代码可在https://github.com/tttt-0814/VX-CODE获取。

英文摘要

Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code is available at https://github.com/tttt-0814/VX-CODE
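
As a toy illustration of the game-theoretic idea described above, the following sketch computes exact Shapley values and a pairwise Shapley interaction index for a three-player game. Everything here is an assumption for illustration: the value function `toy_score`, the player names, and the helper names are hypothetical and are not taken from VX-CODE. In a detector setting the players would be image patches and the value function a detection score, and exact enumeration would be replaced by an approximation.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley value of each player for a set function v(frozenset) -> float."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (v(s | {i}) - v(s))
        phi[i] = total
    return phi


def shapley_interaction(players, v, i, j):
    """Pairwise Shapley interaction index between players i and j."""
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for k in range(n - 1):
        weight = factorial(k) * factorial(n - k - 2) / factorial(n - 1)
        for subset in combinations(others, k):
            s = frozenset(subset)
            # Joint effect of i and j beyond the sum of their separate effects.
            total += weight * (v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s))
    return total


def toy_score(s):
    """Hypothetical score: 'a' helps alone, 'a' and 'b' help much more together."""
    score = 0.0
    if "a" in s:
        score += 0.2
    if "a" in s and "b" in s:
        score += 0.6
    return score


PLAYERS = ["a", "b", "c"]
PHI = shapley_values(PLAYERS, toy_score)
I_AB = shapley_interaction(PLAYERS, toy_score, "a", "b")
```

In this toy game the individual values alone would suggest that `b` is moderately important, while the interaction index makes explicit that its contribution exists only jointly with `a`, which is the kind of collective effect the abstract refers to.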

2411.17917 2026-05-19 cs.CV cs.RO 版本更新

DECODE: Domain-aware Continual Domain Expansion for Motion Prediction

DECODE:面向领域的持续领域扩展用于运动预测

Boqi Li, Haojie Zhu, Henry X. Liu

发表机构 * Department of Civil and Environmental Engineering, University of Michigan(密歇根大学土木与环境工程系)

AI总结 DECODE提出一种持续学习框架,通过预训练模型逐步扩展领域专用模型,结合超网络和流机制实现高效模型选择与不确定性估计,有效降低遗忘率并提升预测精度。

Comments This work has been published in IEEE TPAMI Early Access

详情
AI中文摘要

运动预测对于自动驾驶车辆在复杂环境中有效导航和准确预测其他交通参与者行为至关重要。随着自动驾驶不断发展,整合新多样驾驶场景的需求促使频繁重新训练模型。为此,我们引入DECODE,一种新的持续学习框架,从预训练的通用模型开始,逐步发展专用领域模型。不同于现有持续学习方法试图开发一个能跨多样场景泛化的统一模型,DECODE独特地平衡了专用性与泛化性,动态调整以满足实时需求。所提框架利用超网络生成模型参数,显著降低存储需求,并结合归一化流机制基于似然估计进行实时模型选择。此外,DECODE利用深度贝叶斯不确定性估计技术合并最相关专用和通用模型的输出。这种整合确保在熟悉条件下最优性能,同时在不熟悉场景中保持鲁棒性。广泛评估证实了框架的有效性,实现显著低的遗忘率0.044和平均minADE 0.584米,显著超越传统学习策略,并在广泛驾驶条件下表现出适应性。

英文摘要

Motion prediction is critical for autonomous vehicles to effectively navigate complex environments and accurately anticipate the behaviors of other traffic participants. As autonomous driving continues to evolve, the need to assimilate new and varied driving scenarios necessitates frequent model updates through retraining. To address these demands, we introduce DECODE, a novel continual learning framework that begins with a pre-trained generalized model and incrementally develops specialized models for distinct domains. Unlike existing continual learning approaches that attempt to develop a unified model capable of generalizing across diverse scenarios, DECODE uniquely balances specialization with generalization, dynamically adjusting to real-time demands. The proposed framework leverages a hypernetwork to generate model parameters, significantly reducing storage requirements, and incorporates a normalizing flow mechanism for real-time model selection based on likelihood estimation. Furthermore, DECODE merges outputs from the most relevant specialized and generalized models using deep Bayesian uncertainty estimation techniques. This integration ensures optimal performance in familiar conditions while maintaining robustness in unfamiliar scenarios. Extensive evaluations confirm the effectiveness of the framework, achieving a notably low forgetting rate of 0.044 and an average minADE of 0.584 m, significantly surpassing traditional learning strategies and demonstrating adaptability across a wide range of driving conditions.
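
For readers unfamiliar with the minADE figure quoted above, here is a minimal sketch of how the metric is commonly computed: the average displacement error of the best of several candidate trajectories. This follows the usual definition in the motion prediction literature and is an assumption about, not a copy of, the DECODE evaluation code; the function names and the sample trajectories are hypothetical.

```python
from math import hypot


def ade(pred, gt):
    """Average displacement error: mean Euclidean distance over time steps."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)


def min_ade(candidates, gt):
    """minADE: the ADE of the candidate trajectory closest to the ground truth."""
    return min(ade(c, gt) for c in candidates)


# Hypothetical example: two predicted modes for a three-step future, in metres.
GT = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
MODE_A = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]   # offset by 1 m at every step
MODE_B = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.3)]   # only the last step is off
RESULT = min_ade([MODE_A, MODE_B], GT)
```

Because only the best mode counts, minADE rewards a predictor for covering the true future with at least one hypothesis rather than for the quality of every hypothesis.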

2407.15199 2026-05-19 cs.CV cs.CY 版本更新

Multiple Object Detection and Tracking in Panoramic Videos for Cycling Safety Analysis

全景视频中多目标检测与跟踪用于骑行安全分析

Jingwei Guo, Yitai Cheng, Meihui Wang, Ilya Ilyankou, Natchapon Jongwiriyanurak, Xiaowei Gao, Nicola Christie, James Haworth

发表机构 * Department of Civil, Environmental, and Geomatic Engineering(土木、环境与测绘工程系) University College London(伦敦大学学院) SpaceTimeLab(时空实验室) Department of Earth Science and Engineering(地球科学与工程系) Imperial College London(帝国理工学院) Centre for Transport Studies(交通研究中心)

AI总结 本文提出三步框架提升全景视频中多目标检测与跟踪性能,通过子图像分割投影提升检测精度,改进跟踪模型以处理边界连续性和类别信息,并在实际应用中验证车辆超车检测的有效性。

Journal ref IET Intelligent Transport Systems, Volume 20, no. 1 (2026): e70228

详情
AI中文摘要

骑行者面临不成比例的受伤风险,但传统碰撞记录过于稀疏,无法在细粒度空间和时间尺度上识别风险因素。最近,自然主义研究利用视频数据捕捉复杂的行为和基础设施风险因素。全景视频是一种有前景的格式,可以记录骑行者周围的360度视图。然而,其使用受到失真、大量小对象和边界连续性的限制,现有计算机视觉模型无法处理。本研究提出一个新颖的三步框架:(1)通过分割和投影原始360度图像为子图像来提升全景影像中的目标检测精度;(2)修改多目标跟踪模型以整合边界连续性和目标类别信息;(3)通过实际应用中的车辆超车检测任务进行验证。该方法使用由骑行者记录的伦敦道路全景视频进行评估。实验结果表明,方法在不同图像分辨率下均优于基线,实现了更高的平均精度。此外,改进的跟踪方法在识别切换次数上减少了10.0%,识别精度提高了2.7%。超车检测任务的F分数达到0.82,展示了所提方法在实际骑行安全场景中的实用性。

英文摘要

Cyclists face a disproportionate risk of injury, yet conventional crash records are too sparse to identify risk factors at fine spatial and temporal scales. Recently, naturalistic studies have used video data to capture the complex behavioural and infrastructural risk factors. A promising format is panoramic video, which can record 360$^\circ$ views around a rider. However, its use is limited by distortions, large numbers of small objects, and boundary continuity, which cannot be handled using existing computer vision models. This research proposes a novel three-step framework: (1) enhancing object detection accuracy on panoramic imagery by segmenting and projecting the original 360$^\circ$ images into sub-images; (2) modifying multi-object tracking models to incorporate boundary continuity and object category information; and (3) validating through a real-world application of vehicle overtaking detection. The methodology is evaluated using panoramic videos recorded by cyclists on London's roadways under diverse conditions. Experimental results demonstrate improvements over baselines, achieving higher average precision across varying image resolutions. Moreover, the enhanced tracking approach yields a 10.0% decrease in identification switches and a 2.7% improvement in identification precision. The overtaking detection task achieves a high F-score of 0.82, illustrating the practical effectiveness of the proposed method in real-world cycling safety scenarios.
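
The boundary continuity issue mentioned above can be illustrated with a small sketch: in an equirectangular frame the left and right edges are physically adjacent, so a box that leaves one edge re-enters at the other. The code below computes an IoU that treats the horizontal axis as circular. It is an illustrative assumption, not the authors' tracker modification; the box format `(x, y, width, height)` and all function names are hypothetical.

```python
def wrapped_overlap(a_start, a_len, b_start, b_len, period):
    """Overlap length of two intervals on a circle of circumference `period`.

    Assumes both starts lie in [0, period) and both lengths are <= period.
    """
    total = 0.0
    for shift in (-period, 0.0, period):
        lo = max(a_start, b_start + shift)
        hi = min(a_start + a_len, b_start + b_len + shift)
        total += max(0.0, hi - lo)
    return total


def panoramic_iou(box_a, box_b, image_width):
    """IoU of two (x, y, w, h) boxes where x wraps around at image_width."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    overlap_x = wrapped_overlap(ax % image_width, aw, bx % image_width, bw, image_width)
    overlap_y = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = overlap_x * overlap_y
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example on a 1000-pixel-wide panorama: box A crosses the right
# edge (covering x = 980..1020, i.e. 980..1000 and 0..20), box B sits at the left edge.
BOX_A = (980.0, 100.0, 40.0, 50.0)
BOX_B = (0.0, 100.0, 20.0, 50.0)
IOU_WRAPPED = panoramic_iou(BOX_A, BOX_B, 1000.0)
```

A conventional IoU would report zero overlap for these two boxes, which is one way an identity switch can arise when a vehicle crosses the image seam.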

2406.09333 2026-05-19 cs.CV 版本更新

Learning Spatial-Preserving Hierarchical Representations for Digital Pathology

学习空间保持的层次表示用于数字病理学

Weiyi Wu, Xingjian Diao, Chunhui Zhang, Chongyang Gao, Xinwen Xu, Siting Li, Jiang Gui

发表机构 * Dartmouth College(达特茅斯学院) Massachusetts General Hospital(麻省总医院) Northwestern University(西北大学)

AI总结 本文提出SPAN框架,通过保留空间关系和计算分配,提升数字病理学图像的层次表示能力,通过两种变体在多个数据集上验证了其有效性。

Journal ref CVPR 2026 (Findings Track)

详情
AI中文摘要

全切片图像(WSI)由于其十亿像素级分辨率和信息区域的稀疏分布,带来了根本性的计算挑战。现有方法常独立处理图像块或以扭曲空间上下文的方式重塑它们,从而掩盖了WSI固有的层次金字塔表示。我们引入稀疏金字塔注意力网络(SPAN),一种层次框架,能够在保留空间关系的同时将计算分配给信息区域。SPAN直接从单尺度输入构建多尺度表示,使WSI数据的精确层次建模成为可能。我们通过两种变体展示了SPAN的通用性:用于切片分类的SPAN-MIL和用于分割的SPAN-UNet。在多个公开数据集上的全面评估表明,SPAN有效捕捉了层次结构和上下文关系。我们的结果提供了明确证据,表明架构归纳偏置和层次表示同时提升了切片级和图像块级性能。通过解决WSI分析中的关键计算挑战,SPAN为计算病理学提供了有效的框架,并展示了大规模医学图像分析的重要设计原则。

英文摘要

Whole slide images (WSIs) pose fundamental computational challenges due to their gigapixel resolution and the sparse distribution of informative regions. Existing approaches often treat image patches independently or reshape them in ways that distort spatial context, thereby obscuring the hierarchical pyramid representations intrinsic to WSIs. We introduce Sparse Pyramid Attention Networks (SPAN), a hierarchical framework that preserves spatial relationships while allocating computation to informative regions. SPAN constructs multi-scale representations directly from single-scale inputs, enabling precise hierarchical modeling of WSI data. We demonstrate SPAN's versatility through two variants: SPAN-MIL for slide classification and SPAN-UNet for segmentation. Comprehensive evaluations across multiple public datasets show that SPAN effectively captures hierarchical structure and contextual relationships. Our results provide clear evidence that architectural inductive biases and hierarchical representations enhance both slide-level and patch-level performance. By addressing key computational challenges in WSI analysis, SPAN provides an effective framework for computational pathology and demonstrates important design principles for large-scale medical image analysis.

2312.03798 2026-05-19 cs.CV 版本更新

Single Image Reflection Removal with Patch Reflectance Prior

单图像反射去除与补丁反射率先验

Dongshen Han, Heechan Yoon, Hyukmin Kwon, Hyun-Cheol Kim, Hyon-Gon Choo, Seungkyu Lee, Chaoning Zhang

发表机构 * Kyunghee University(庆熙大学) Electronics and Telecommunications Research Institute(韩国电子通信研究院)

AI总结 本文提出基于图像块反射先验的单图像反射去除方法,通过反射先验提取网络学习非均匀反射先验,并利用Transformer U-Net架构实现反射去除,实验证明在真实世界基准上达到SIRR领域最先进的准确率。

详情
AI中文摘要

真实世界图像中的单图像反射去除(SIRR)是一项具有挑战性的任务,因为光在玻璃表面透射和反射的过程中会产生多样的图像退化。许多现有方法依赖特定的先验假设来解决这个问题。在本文中,我们提出了一种通用的反射强度先验,用于刻画反射现象的强度,并证明其有效性。为了学习反射强度先验,我们引入了反射先验提取网络(RPEN)。通过将图像分割成区域图像块,RPEN学习图像中非均匀的反射先验。我们提出了一种基于先验的反射去除网络(PRRN),使用简单的Transformer U-Net架构,该架构适配由RPEN提供的反射先验。在真实世界基准上的实验结果证明了我们方法的有效性,实现了SIRR领域最先进的准确率。

英文摘要

Single Image Reflection Removal (SIRR) in real-world images is a challenging task due to diverse image degradations occurring on the glass surface during light transmission and reflection. Many existing methods rely on specific prior assumptions to resolve the problem. In this paper, we propose a general reflection intensity prior that captures the intensity of the reflection phenomenon and demonstrate its effectiveness. To learn the reflection intensity prior, we introduce the Reflection Prior Extraction Network (RPEN). By segmenting images into regional patches, RPEN learns non-uniform reflection prior in an image. We propose Prior-based Reflection Removal Network (PRRN) using a simple transformer U-Net architecture that adapts reflection prior fed from RPEN. Experimental results on real-world benchmarks demonstrate the effectiveness of our approach achieving state-of-the-art accuracy in SIRR.

2310.20389 2026-05-19 eess.IV cs.CV 版本更新

High-Resolution Reference Image Assisted Volumetric Super-Resolution of Cardiac Diffusion Weighted Imaging

高分辨率参考图像辅助的心脏扩散加权成像体积分辨率提升

Yinzhe Wu, Jiahao Huang, Fanwen Wang, Pedro Ferreira, Andrew Scott, Sonia Nielles-Vallespin, Guang Yang

发表机构 * Department of Bioengineering, Imperial College London(帝国理工学院生物工程系) Cardiovascular Magnetic Resonance Unit, Royal Brompton Hospital(皇家布朗普顿医院心血管磁共振中心) National Heart and Lung Institute, Imperial College London(帝国理工学院国家心肺研究所)

AI总结 本文提出一种基于深度学习的体积分辨率提升框架,利用高分辨率b0 DWI作为输入,提升心脏扩散加权成像的图像质量,并证明了该框架在未见b值下的泛化能力。

Comments Accepted by SPIE Medical Imaging 2024

详情
AI中文摘要

扩散张量心脏磁共振(DT-CMR)是唯一用于非侵入性检查人体心脏微结构的活体方法。当前DT-CMR研究旨在提高对心脏微结构与健康心脏宏观功能关系以及微结构功能障碍与疾病关系的理解。为了获得最终DT-CMR指标,需要获取至少6个方向的扩散加权成像(DWI)。然而,由于DWI信噪比较低,标准体素尺寸在微结构尺度上相当大。在本研究中,我们探索了基于深度学习的方法在提高图像质量方面的潜力(在所有维度上提升4倍)。本研究提出了一种新的框架,通过将高分辨率b0 DWI作为额外模型输入,实现体积分辨率提升。我们证明了额外输入能够提供更高的超分辨图像质量。此外,模型还能超分辨未见过的b值的DWI,证明了该框架在心脏DWI超分辨率中的泛化能力。最后,我们建议在训练和推理中将高分辨率参考图像作为低分辨率图像的额外输入,以指导所有参数成像中的超分辨框架,尤其是在可用参考图像的情况下。

英文摘要

Diffusion Tensor Cardiac Magnetic Resonance (DT-CMR) is the only in vivo method to non-invasively examine the microstructure of the human heart. Current research in DT-CMR aims to improve the understanding of how the cardiac microstructure relates to the macroscopic function of the healthy heart as well as how microstructural dysfunction contributes to disease. To get the final DT-CMR metrics, we need to acquire diffusion weighted images of at least 6 directions. However, due to DWI's low signal-to-noise ratio, the standard voxel size is quite big on the scale for microstructures. In this study, we explored the potential of deep-learning-based methods in improving the image quality volumetrically (x4 in all dimensions). This study proposed a novel framework to enable volumetric super-resolution, with an additional model input of high-resolution b0 DWI. We demonstrated that the additional input could offer higher super-resolved image quality. Going beyond, the model is also able to super-resolve DWIs of unseen b-values, proving the model framework's generalizability for cardiac DWI superresolution. In conclusion, we would then recommend giving the model a high-resolution reference image as an additional input to the low-resolution image for training and inference to guide all super-resolution frameworks for parametric imaging where a reference image is available.

2305.07152 2026-05-19 cs.CV 版本更新

Intuitive Surgical SurgToolLoc and SurgVU Challenges Results: 2022-2025

直观外科SurgToolLoc和SurgVU挑战结果:2022-2025

Aneeq Zia, Max Berniker, Rogerio Garcia Nespolo, Xiaorui Zhang, Conor Perreault, Kiran Bhattacharyya, Xi Liu, Ziheng Wang, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Bo Liu, David Austin, Yiheng Wang, Michal Futrega, Jean-Francois Puget, Zhenqiang Li, Yoichi Sato, Ryo Fujii, Ryo Hachiuma, Mana Masuda, Hideo Saito, An Wang, Mengya Xu, Mobarakol Islam, Long Bai, Winnie Pang, Hongliang Ren, Chinedu Nwoye, Luca Sestini, Nicolas Padoy, Maximilian Nielsen, Samuel Schüttler, Thilo Sentker, Hümeyra Husseini, Ivo Baltruschat, Rüdiger Schmitz, René Werner, Aleksandr Matsun, Mugariya Farooq, Numan Saaed, Jose Renato Restom Viera, Mohammad Yaqub, Neil Getty, Fangfang Xia, Zixuan Zhao, Xiaotian Duan, Xing Yao, Ange Lou, Hao Yang, Jintong Han, Jack Noble, Jie Ying Wu, Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, Herag Arabian, Ning Ding, Knut Moeller, Weiliang Chen, Quan He, Muhammad Bilal, Taofeek Akinosho, Adnan Qayyum, Massimo Caputo, Hunaid Vohra, Michael Loizou, Anuoluwapo Ajayi, Ilhem Berrou, Faatihah Niyi-Odumosu, Charlie Budd, Oluwatosin Alabi, Tom Vercauteren, Ruoxi Zhao, Ayberk Acar, John Han, Jumanh Atoum, Yinhong Qin, Surong Hua, Lu Ping, Wenming Wu, Rongfeng Wei, Jinlin Wu, You Pang, Zhen Chen, Tim Jaspers, Amine Yamlahi, Piotr Kalinowski, Dominik Michael, Tim Rädsch, Marco Hübner, Danail Stoyanov, Stefanie Speidel, Lena Maier-Hein, Jie Tian, Ruxin Zhang, Khang Hoang Nguyen, Anh Quoc Nguyen, Tam Minh Nguyen, Khoi Dinh Tran, Minh Nguyen Dang Nhat, Trinh Thi Doan Pham, Linh Van Nguyen, Chunyang Jiang, Dewei Yang, Haitao Li, Yannick Prudent, Thibaut Boissin, Mahmood Alam, Shazad Ashraf, Andrew D. Beggs, Lukman Akanbi, Manuel D. Delgado, Narain Gupta, Amir M. Hajiyavand, Iqbal Qasim, Hafiz A. Alaka, Junaid Qadir, Shu Yang, Yihui Wang, Hao Chen, Shin Paul, Yosuke Yamagishi, Zhang Dong, Hongyun Li, Hongyu Gu, Xiaoliu Ding, Xiaoyao Liu, Xingyu Zhao, Mariana Ribeiro, Tiago Jesus, André Ferreira, Guilherme Barbosa, João Carvalho, Leonardo Barroso, Nuno Gomes, Rafael Peixoto, Rodrigo Ralha, Victor Alves, Stephanie, Nattapat Ittikosil, Achita Chitrapan, Quan Huu Cap, Jiayuan Huang, Shreyas C Dhake, Sergi Kavtaradze, Mobarak I Hoque, Ka Young Kim, Su Yong Yun, Young Tae Kim, Hyeon Bae Kim, Seong Tae Kim, Zuxing Deng, Ling Li, Jieyu Zheng, Xiaojian Li, Anthony Jarc

发表机构 * Intuitive Surgical, Inc.(Intuitive Surgical公司) Muroran Institute of Technology(室兰工业大学) Niigata University of Health and Welfare(新潟医疗福祉大学) Fujita Health University(藤田医科大学) NVIDIA, Inc.(NVIDIA公司) University of Tokyo(东京大学) Keio University(庆应义塾大学) Shun Hing Institute of Advanced Engineering(信兴高等工程研究所) NUS NUSRI SZ(新加坡国立大学 NUSRI SZ) University of Strasbourg IHU Strasbourg(斯特拉斯堡大学 IHU斯特拉斯堡) University Medical Center Hamburg-Eppendorf(汉堡-埃彭多夫大学医学中心)

AI总结 本文总结了2022-2025年间在机器人辅助手术中解决手术工具定位和手术视觉理解的挑战成果,探讨了相关机器学习问题的解决方法与贡献。

详情
AI中文摘要

机器人辅助(RA)手术有潜力改变外科干预。Intuitive Surgical致力于推动这些变革,并促进将使它们成为可能的机器学习模型和算法。为此,我们邀请外科数据科学社区参与通过医学影像计算和计算机辅助介入(MICCAI)会议举办的年度竞赛。每年的变化使社区面临解决复杂机器学习问题的挑战,特别是在高级RA应用中。本文记录了这些挑战的结果,重点是手术工具定位(SurgToolLoc)和手术视觉理解(SurgVU)。随这些挑战发布的公开数据集在另一篇arXiv:2501.09209 [1]论文中详细说明。

英文摘要

Robotic assisted (RA) surgery promises to transform surgical intervention. Intuitive Surgical is committed to fostering these changes and the machine learning models and algorithms that will enable them. With these goals in mind we have invited the surgical data science community to participate in a yearly competition hosted through the Medical Imaging Computing and Computer Assisted Interventions (MICCAI) conference. With varying changes from year to year, we have challenged the community to solve difficult machine learning problems in the context of advanced RA applications. Here we document the results of these challenges, focusing on surgical tool localization (SurgToolLoc) and surgical visual understanding (SurgVU). The publicly released dataset that accompanies these challenges is detailed in a separate paper arXiv:2501.09209 [1].

2605.16628 2026-05-19 cs.CV 版本更新

SCARED-C: Corrected Camera Poses for Endoscopic Depth Estimation

SCARED-C:用于内窥镜深度估计的修正相机姿态

John J. Han, Adam Schmidt, Max Allan, Jie Ying Wu, Omid Mohareri

发表机构 * Vanderbilt University(范德比大学) Intuitive Surgical, Inc.(Intuitive Surgical公司)

AI总结 SCARED-C通过修正相机姿态,将可靠RGB-D配对数从35增加到17135,采用COLMAP和尺度恢复步骤提升内窥镜深度估计的精度与可靠性。

详情
AI中文摘要

SCARED数据集是用于内窥镜深度估计的广泛使用的基准,提供通过结构光传感器捕获的真实3D重建。然而,非关键帧图像的深度图依赖于机器人运动学,引入了显著的姿态误差,使数据集中标注可靠的部分仅限于35个关键帧。我们提出了SCARED-C,即SCARED数据集的修正版本,将可靠RGB-D配对数从35扩展到17,135。我们的流程应用运动恢复结构(Structure-from-Motion)系统COLMAP对所有帧重新估计相机姿态,随后进行尺度恢复步骤,利用真实关键帧深度图将所得重建对齐到度量空间。我们通过(1)立体视差评估和(2)单目深度估计实验验证了修正后的姿态。修正后的数据集和代码已向社区公开。

英文摘要

The SCARED dataset is a widely used benchmark for endoscopic depth estimation, offering ground-truth 3D reconstructions captured with a structured light sensor. However, the depth maps for non-keyframe images rely on robot kinematics that introduce substantial pose errors, limiting the reliably labeled portion of the dataset to 35 keyframes. We present SCARED-C, a corrected version of the SCARED dataset that expands the number of reliable RGB-D pairs from 35 to 17,135. Our pipeline applies COLMAP, a Structure-from-Motion system, to re-estimate camera poses for all frames, followed by a scale recovery step that aligns the resulting reconstructions to metric space using the ground-truth keyframe depth maps. We validate the corrected poses through (1) stereo disparity evaluation and (2) monocular depth estimation experiments. The corrected dataset and code are publicly released to the community.
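
Structure-from-Motion output is only defined up to an unknown global scale, which is why the abstract mentions a scale recovery step. A minimal sketch of one common way to do this follows, assuming the scale is taken as the median ratio between ground-truth metric depths and the corresponding reconstruction depths at valid pixels. The median-ratio choice, the function names, and the sample values are assumptions for illustration; the abstract does not specify how SCARED-C performs this alignment.

```python
from statistics import median


def recover_scale(sfm_depths, metric_depths, min_depth=1e-6):
    """Estimate a global scale as the median of metric / SfM depth ratios.

    Pairs where either depth is missing (None) or not positive are skipped.
    """
    ratios = [
        m / s
        for s, m in zip(sfm_depths, metric_depths)
        if s is not None and m is not None and s > min_depth and m > min_depth
    ]
    if not ratios:
        raise ValueError("no valid depth pairs to estimate scale from")
    return median(ratios)


def apply_scale(translation, scale):
    """Rescale a camera translation vector into metric units."""
    return tuple(scale * t for t in translation)


# Hypothetical keyframe samples: SfM depths in arbitrary units, metric depths in mm.
SFM = [1.0, 2.0, 4.0, 0.0, 3.0]            # 0.0 marks an invalid pixel
METRIC = [50.0, 100.0, 200.0, 80.0, 600.0]  # the last pair is an outlier
SCALE = recover_scale(SFM, METRIC)
METRIC_T = apply_scale((1.0, 0.0, -2.0), SCALE)
```

The median is used here rather than the mean so that a few bad depth pairs, such as the outlier in the sample data, do not shift the estimated scale.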

2605.16603 2026-05-19 cs.CV 版本更新

Controlla: Learning Controllability via Graph-Constrained Latent Geometry

Controlla:通过图约束潜在几何学习可控性

Jamuna S. Murthy, Amin Karimi Monsefi, Rajiv Ramnath

发表机构 * Ramaiah Institute of Technology(拉迈亚理工学院) The Ohio State University(俄亥俄州立大学)

AI总结 Controlla通过图约束潜在几何学习可控性,结合图约束最优传输对多模态输入的语义属性和身份因子进行对齐,提升可控性、身份保持和跨模态对齐。

详情
AI中文摘要

可控多模态生成通常被表述为推理时的条件化问题,使用提示、引导或辅助模块。尽管有效,这些方法并未显式结构化语义属性的演变,可能导致身份漂移和跨模态不一致。我们提出Controlla,一种模块化因子化控制框架,将可控性视为结构化潜在几何的属性。Controlla从多模态输入中学习身份和属性因子,并利用图约束最优传输将它们对齐到图先验,鼓励属性遵循图一致轨迹的同时保持参考身份。为了评估这一设置,我们构建了AffectHuman-43K,一个考虑泄漏的多模态基准,用于参考导向的情感控制,并引入了对轨迹一致性和潜在解耦的几何感知度量。实验显示在可控性、身份保持和跨模态对齐方面有持续改进,此外还进行了图敏感性、可扩展性和鲁棒性的分析。

英文摘要

Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.

2605.16582 2026-05-19 cs.CV 版本更新

ArtMesh: Part-Aware Articulated Mesh Fields with Motion-Consistent Dynamics

ArtMesh:具有运动一致动态的部件感知铰接网格场

Sylvia Yuan, Dan Wang, Ravi Ramamoorthi, Xinrui Cui

发表机构 * University of California San Diego(加州大学圣地亚哥分校) University of North Texas(北得克萨斯大学)

AI总结 ArtMesh基于网格可微渲染骨干,从多视角图像中将铰接物体重建为连通三角网格,并在包含100个物体的新基准Articulate-100上超越了现有基于3DGS的方法。

详情
AI中文摘要

我们提出了ArtMesh,一种网格原生方法,用于根据起始和结束状态下的多视角图像,将铰接物体显式重建为带有逐部件刚性运动的连通三角网格。现有用于铰接重建的3D高斯泼溅(3DGS)流程继承了其泼溅基础的无结构点状几何,无法提供用于推断部件边界或沿物体连接关系施加运动一致性的表面拓扑。ArtMesh则建立在基于网格的可微渲染骨干之上,使部件感知动态能够直接作用于结构化拓扑。为使拓扑与铰接运动兼容,我们引入部件感知的受限Delaunay重网格化,生成三角形不跨越语义部件边界的连通子网格。随后,动态网格场利用作用于传输后网格顶点的双向逐顶点运动一致性,以及作用于渲染RGB-D观测的逐像素运动一致性来优化铰接运动。我们引入Articulate-100,一个包含100个铰接物体、涵盖16个PartNet-Mobility类别的新基准。在该基准上,ArtMesh在关节参数估计和部件级几何重建上优于现有基于3DGS的流程,且在可动部件较多的物体上提升最大。

英文摘要

We present ArtMesh, a mesh-native method for reconstructing articulated objects explicitly as connected triangle meshes with per-part rigid motion from multi-view images in start and end states. Existing 3D Gaussian Splatting pipelines for articulated reconstruction inherit the unstructured point-based geometry of their splatting base, which provides no surface topology for reasoning about part boundaries or enforcing motion consistency along the object's connectivity. ArtMesh instead builds on a mesh-based differentiable rendering backbone, enabling part-aware dynamics to act directly on the structured topology. To make the topology compatible with articulation, we introduce part-aware restricted Delaunay remeshing, producing connected submeshes whose triangles do not cross semantic part boundaries. The dynamic mesh field then optimizes articulation using bidirectional Vertex-wise Motion Consistency on transported mesh vertices and Pixel-wise Motion Consistency on rendered RGB-D observations. We introduce Articulate-100, a new benchmark of 100 articulated objects spanning 16 PartNet-Mobility categories. On this benchmark, ArtMesh outperforms prior 3DGS-based pipelines in joint parameter estimation and part-level geometric reconstruction, with the largest gains on objects with many movable parts.

2605.16572 2026-05-19 cs.CV 版本更新

TriALS: Triphasic-Aided Liver Lesion Segmentation Benchmark in Non-Contrast CT

TriALS: 三相辅助非增强CT肝脏病变分割基准

Marawan Elbatel, Mohamed Ghonim, Jiaji Mao, Zhuosheng Lin, Katharina Eckstein, Andrés Martínez Mora, Jonathan Deissler, Maximilian Rokuss, Constantin Ulrich, Zdravko Marinov, Wenhui Deng, Baoxun Li, Huijun Hu, Jun Shen, Mohanad Ghonim, Khadiga Omar Nassar, Mariam Elbakry, Menna Dyab, Amr Muhammad Abdo Salem, Nouran Elghitany, Noha Elghitany, Yi Qin, Xuanqi Huang, Haonan Wang, Shao-Woo Yen, Ahmed Elghamry Saba, Salma Ahmad, Xinyan Fang, Jiahao Zhang, Xiaodi Wang, Xinghua Ma, Gongning Luo, Jessica C. Delmoral, João Manuel R. S. Tavares, Ankan Deria, Adinath Dukre, Yutong Xie, Imran Razzak, Dongwook Kim, Matthew Choi, Hanxiao Zhang, Minghui Zhang, Xin You, Abdul Qayyum, Steven A. Niederer, Moona Mazher, Rachika E. Hamadache, Ricardo Montoya-del-Angel, Robert Martí, Xavier Lladó, Toufiq Musah, Livingstone Eli Ayivor, Enrique Almar-Munoz, Agnes Mayr, Kaouther Mouheb, Esther E. Bron, Stefan Klein, Ahmed Abouelhoda, Amira Adel, Susan Adil Ali, Rainer Stiefelhagen, Klaus H. Maier-Hein, Fabian Isensee, Aya Yassin, Xiaomeng Li

发表机构 * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology(香港科技大学电子与计算机工程系) AI Center of Excellence, Ain Shams University(艾因夏姆斯大学人工智能卓越中心) Department of Radiology, Ain Shams University(艾因夏姆斯大学放射科) Department of Radiology, Guangdong Provincial Key Laboratory of Malignant Tumor Epigenetics and Gene Regulation, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University(广东省恶性肿瘤表观遗传与基因调控重点实验室,中山大学孙逸仙纪念医院放射科) Nanfang Hospital, Southern Medical University(南方医科大学南方医院) Division of Medical Image Computing, German Cancer Research Center (DKFZ), Heidelberg, Germany(德国癌症研究中心(DKFZ)医学影像计算部,海德堡,德国) Medical Faculty Heidelberg, Heidelberg University(海德堡大学医学院) Faculty of Mathematics and Computer Science, Heidelberg University(海德堡大学数学与计算机科学学院) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 本文介绍TriALS挑战,基于150例多中心四期相CT数据评估肝脏病变自动分割算法;最佳方法在静脉期达到与人类水平相当的性能(Dice 0.754),但在非增强CT上降至0.57,算法表现主要取决于训练数据规模和预训练策略。

Comments TriALS challenge paper across MICCAI 2024 and 2025; data and code at https://github.com/xmed-lab/TriALS

详情
AI中文摘要

非增强CT(NCCT)上肝脏病变自动分割在临床中很重要但极具挑战性,特别是在缺乏对比剂的非洲和亚洲低资源地区。进展受限于缺乏标注的NCCT基准。本文描述了TriALS挑战,通过埃及和中国机构的150例多中心数据集(4相CT采集600体积)评估算法性能。在70例数据上评估,最佳方法在门静脉期Dice系数达0.754,但非增强CT下降至0.57。外部验证显示,领先方法在非增强CT上比现成模型提升最高28%。算法性能主要由训练数据规模和预训练策略决定。跨年比较揭示了非增强CT的持续感知障碍,仅扩大预训练无法克服。数据、标注和代码可在https://github.com/xmed-lab/TriALS获取。

英文摘要

Automated segmentation of liver lesions on non-contrast computed tomography (NCCT) is clinically important but fundamentally challenging, particularly in low-resource settings across Africa and Asia where contrast agents are frequently unavailable. Progress has been limited by the absence of annotated NCCT benchmarks. Here we describe the TriALS challenge for automated liver lesion segmentation under contrast-limited conditions, supported by a multi-centre dataset of 150 cases with four-phase CT acquisitions (600 volumes) from Egyptian and Chinese institutions. Algorithms were evaluated on 70 cases from three institutions, including an independent external cohort. The top-performing method achieved a mean venous-phase Dice of 0.754, consistent with human-level performance, yet dropped to 0.57 on NCCT. On external validation, the leading method outperformed off-the-shelf models by up to 28% in Dice on NCCT. Algorithm performance was most strongly predicted by training data scale and pre-training strategy. A cross-year comparison exposed a persistent perceptual barrier on NCCT that scaling pre-training alone cannot overcome. Data, annotations, and code are available at https://github.com/xmed-lab/TriALS.
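
The Dice scores quoted above measure the overlap between a predicted and a reference lesion mask. As a reminder of what the number means, here is a minimal sketch of the standard Dice coefficient on flattened binary masks. The function name, the `empty_value` convention for two empty masks, and the sample masks are assumptions for illustration; the official TriALS evaluation code may handle edge cases differently.

```python
def dice_coefficient(pred, target, empty_value=1.0):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two flattened binary masks.

    `empty_value` is returned when both masks are empty (a convention, assumed here).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return empty_value
    return 2.0 * intersection / total


# Hypothetical six-voxel example: prediction and reference share two voxels.
PRED = [1, 1, 1, 0, 0, 0]
TARGET = [0, 1, 1, 1, 0, 0]
DICE = dice_coefficient(PRED, TARGET)
```

On this scale, the reported drop from 0.754 on the venous phase to 0.57 on NCCT means the predicted and reference lesion volumes overlap considerably less when contrast is unavailable.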

2605.16550 2026-05-19 cs.CV cs.LG 版本更新

Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition

基于注意力的变换器聚合网络用于视频眼周识别

Luiz G F Carreira, Breno A Mariano, Victor H C de Melo, David Menotti, William Robson Schwartz

发表机构 * Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil(巴西米纳斯吉拉斯联邦大学计算机科学系) Department of Informatics, Federal University of Paraná, Curitiba, Brazil(巴西巴拉那联邦大学信息学系)

AI总结 本文提出一种基于变换器的聚合网络,用于视频眼周识别,通过特征嵌入和聚合模块提升识别鲁棒性,在COX Face数据集上优于传统方法,达到99.8%的TPR@1e-1和96.6%的Rank-5。

Comments Accepted at ICIP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. DOI to be added upon publication

详情
AI中文摘要

视频眼周识别是基于个体眼睛周围区域识别身份的任务。眼周区域是人脸最具有区分性的区域之一,使其适合识别任务。其作为生物特征模态的应用在监控环境中逐渐兴起,尤其是在传统生物特征如面部或虹膜识别因非受限采集条件而不可行时。本文提出了一种针对监控环境的视频眼周识别的注意力感知方法。该框架包含两个主要模块:特征嵌入和聚合。特征嵌入模块是一个深度卷积神经网络,将眼周数据映射到特征向量。聚合模块是一个仅含编码器的变换器,能够自适应地将帧级特征聚合为单一视频表示和静态参考图像的特征向量。在公开可用的COX Face数据集上的实验表明,所提方法的鲁棒性,一致优于传统聚合方案。在最佳情况下,该方法实现了99.8%的TPR@1e-1和96.6%的Rank-5。

英文摘要

Video periocular recognition is the task of recognizing an individual's identity based on the region around an individual's eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, consistently outperforming naive aggregation schemes. In the best scenario, the approach achieves $99.8\%$ of TPR@$1e^{-1}$ and $96.6\%$ of Rank-5.

2605.16519 2026-05-19 cs.CV eess.SP 版本更新

DepthPolyp: Pseudo-Depth Guided Lightweight Segmentation for Real-Time Colonoscopy

DepthPolyp:基于伪深度引导的轻量级分割用于实时结肠镜检查

Zhuoyu Wu, Wenhui Ou, Lexi Zhang, Pei-Sze Tan, Dongjun Wu, Junhe Zhao, Wenqi Fang, Raphaël C. -W. Phan

发表机构 * CyPhi AI Lab, Monash University, Malaysia Campus, Malaysia Department of Electronic & Computer Engineering, Hong Kong University of Science & Technology, Hong Kong, P.R. China Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, P.R. China Harbin Institute of Technology, Harbin, P.R. China

AI总结 本文提出DepthPolyp,一种基于伪深度引导的多任务学习轻量级分割框架,通过高效特征调制实现强跨数据集泛化能力,在实时结肠镜检查中表现出色。

Comments This paper has been accepted to the International Conference on Pattern Recognition (ICPR 2026)

详情
AI中文摘要

准确的结肠镜检查息肉分割对于早期结直肠癌检测至关重要,但现实临床环境中的运动模糊、镜面反射和照明不稳定性等挑战使现有方法在实际手术场景中性能显著下降。本文提出DepthPolyp,一种基于伪深度引导的多任务学习和高效特征调制的轻量级分割框架。该架构结合了层次化Ghost因子化进行紧凑特征生成,交错洗牌融合实现低成本跨尺度交互,以及动态组门控实现自适应组内特征加权。大量实验表明,DepthPolyp在训练于退化数据并在清洁和噪声目标领域评估时,表现出强跨数据集泛化能力,优于轻量级基线并能与大幅更大的模型竞争。在PolypGen真实手术视频评估中,DepthPolyp在参数量仅为3.57M、GMACs为0.86的情况下,能够在移动设备上以超过180 FPS的速度运行,使其在资源受限的临床环境中具有良好的实时部署能力。代码和预训练权重可在https://github.com/ReaganWu/DepthPolyp/获取。

英文摘要

Accurate polyp segmentation in colonoscopy is essential for early colorectal cancer detection, yet real-world clinical environments pose persistent challenges such as motion blur, specular reflections, and illumination instability. Most existing methods are optimized on clean benchmark images and suffer noticeable performance degradation when deployed in authentic surgical scenarios. We propose DepthPolyp, a lightweight and robust segmentation framework based on pseudo-depth-guided multi-task learning and efficient feature modulation. The architecture combines hierarchical Ghost factorization for compact feature generation, Interleaved Shuffle Fusion for low-cost cross-scale interaction, and Dynamic Group Gating for adaptive group-wise feature weighting. Extensive experiments demonstrate that DepthPolyp achieves strong cross-dataset generalization when trained on degraded data and evaluated on both clean and noisy target domains, consistently outperforming lightweight baselines and remaining competitive with substantially larger models. In real surgical video evaluation on PolypGen, DepthPolyp achieves better segmentation performance than models up to $20\times$ larger while preserving real-time inference speed. With only 3.57M parameters and 0.86 GMACs, the proposed method runs at over 180 FPS on mobile devices, making it well suited for real-time deployment in resource-constrained clinical environments. Code and pretrained weights are available at: https://github.com/ReaganWu/DepthPolyp/

2605.16515 2026-05-19 cs.CV cs.LG 版本更新

SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

SeamCam:通过多线索视觉可探测性量化无缝伪装

Amin Karimi Monsefi, Abolfazl Meyarian, Mridul Khurana, Shuheng Wang, Pouyan Navard, Cheng Zhang, Anuj Karpatne, Wei-Lun Chao, Rajiv Ramnath

发表机构 * The Ohio State University(俄亥俄州立大学) Path Robotics, USA(Path Robotics公司) Virginia Tech(弗吉尼亚理工大学) Boston University(波士顿大学)

AI总结 SeamCam通过将伪装评估转化为视觉定位问题,提出了一种量化动物伪装效果的指标,通过人类实验验证其有效性,并展示了其在扩散模型训练中的应用。

详情
AI中文摘要

当动物与周围环境无缝融合时,人们称其伪装有效,但目前尚无衡量这种无缝程度的标准化定量指标。本文将伪装评估转化为视觉定位问题来填补这一空白:伪装良好的动物即使在类别已知的情况下仍然难以被检测到。我们提出SeamCam(Seamless Camouflage)指标,用于量化从可用视觉证据中检测出动物的难易程度。给定图像和目标物种,SeamCam生成类别条件的检测提案,提取分割掩码,并找出其并集与真实掩码IoU最高的掩码子集。SeamCam分数等于1减去这一最大可恢复定位信号,分数越高表示伪装越强(即可探测性越低)。在一项包含94名参与者、2390次比较的人类二选一强制选择研究中,SeamCam与人类对伪装难度的判断达到78.82%的一致率,比现有最先进方法高约25%。随后,我们展示了SeamCam可作为直接偏好优化(DPO)的偏好信号,用于微调基于扩散的修复模型以生成伪装。与典型的扩散模型不同,这提供了一种成本较低、目标明确适用于伪装生成的训练方法。为支持严格的基准测试,我们进一步引入CamFG-1.5k,这是一个包含1521张高分辨率图像的精选数据集,其中动物在伪装生成之前完全可见,从而通过控制现有数据集中存在的遮挡伪影实现无偏评估。

英文摘要

Animals are described as effectively camouflaged when they blend seamlessly with their surroundings, yet no standardized quantitative measure of this seamlessness exists. We address this gap by framing camouflage evaluation as a visual localization problem: a well-camouflaged animal is one that remains difficult to detect even when its category is known. We introduce SeamCam (Seamless Camouflage), a metric that quantifies how detectable an animal is from the available visual evidence. Given an image and a target species, SeamCam generates category-conditioned detection proposals, extracts segmentation masks, and identifies the subset whose collective union yields the highest IoU with the ground-truth mask. The SeamCam score is one minus this maximum recoverable localization signal, where a higher score indicates stronger camouflage (i.e., lower detectability). In a human two-alternative forced-choice study with 94 participants and 2,390 comparisons, SeamCam achieves 78.82% agreement with human camouflage difficulty judgments, outperforming the state of the art by about 25%. We then demonstrate SeamCam's utility as a preference signal for Direct Preference Optimization (DPO) to fine-tune a diffusion-based inpainting model for camouflage generation. This offers an affordable training approach with an objective explicitly suited for camouflage generation, unlike typical diffusion models. To support rigorous benchmarking, we further introduce CamFG-1.5k, a curated dataset of 1,521 high-resolution images in which animals are fully visible prior to camouflage generation, enabling unbiased evaluation by controlling for occlusion artifacts present in existing datasets. https://7amin.github.io/SeamCam/
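The score definition in the SeamCam abstract (one minus the best IoU reachable by a union of proposal masks) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the subset is searched, so exhaustive search is assumed here, masks are represented as sets of pixel coordinates, and the function names are hypothetical.

```python
from itertools import combinations


def iou(mask_a, mask_b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


def seamcam_style_score(proposal_masks, gt_mask):
    """Return 1 - max IoU over the unions of all non-empty proposal subsets.

    Exhaustive search is an assumption made for illustration; it is
    exponential in the number of proposals and only suitable for toy inputs.
    """
    best = 0.0
    for size in range(1, len(proposal_masks) + 1):
        for subset in combinations(proposal_masks, size):
            merged = set().union(*subset)
            best = max(best, iou(merged, gt_mask))
    return 1.0 - best


gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
proposals = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}, {(5, 5)}]
print(seamcam_style_score(proposals, gt))
```

In this toy case the first two proposals together cover the ground truth exactly, so the animal is fully recoverable and the score is at its minimum; with no useful proposals the score approaches one.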

2605.16481 2026-05-19 cs.CV cs.AI 版本更新

Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval

视觉代理记忆:通过在线索引、分层记忆和代理检索实现在线长视频理解

Aiden Yiliu Li, Nels Numan, Anthony Steed

发表机构 * University College London(伦敦大学学院)

AI总结 本文提出视觉代理记忆框架,通过在线索引、分层记忆和代理检索实现长视频理解,实验显示其在OVO-Bench和MM-Lifelong数据集上均取得优异成绩。

详情
AI中文摘要

长视频理解所需要的不只是大上下文窗口,还需要一种记忆机制:决定保留哪些视觉证据,使其在长时间范围内保持可搜索,并让后续推理建立在可恢复的观察之上,而不只依赖压缩的潜在状态。我们提出了视觉代理记忆(VAM),一种无需训练的框架,包含三个组件。在线索引支持在流式约束下选择性证据保留。分层记忆将保留的证据组织成并行表示,使时间上下文与空间观察对齐。代理检索在生成基于证据的答案前搜索、检查和验证候选证据。在OVO-Bench上,VAM在所有报告的基线中取得了最高的RT+BT平均值(68.41),优于使用相同基础MLLM(Gemini 3 Flash,67.46)的端到端方法。在MM-Lifelong train@month的月度分割(105.6小时覆盖51天)上,VAM达到17.11%,仅次于使用GPT-5的ReMA(17.62%)。这些结果表明,长时间视频理解受益于将视觉记忆视为显式、可检查和可查询的基质。代码可在https://github.com/yiliu-li/Visual-Agentic-Memory获取。

英文摘要

Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of MM-Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT-5 (17.62%). These results suggest that long-horizon video understanding benefits from treating visual memory as an explicit, inspectable, and queryable substrate. Code is available at https://github.com/yiliu-li/Visual-Agentic-Memory.

2605.16477 2026-05-19 cs.LG cs.CV 版本更新

Seeking the Unfamiliar but Memorable: Conceptual Creativity as Meta-Learning

寻求不熟悉但难忘的概念:作为元学习的概念创造力

Mengye Ren

发表机构 * Agentic Learning AI Lab(代理学习人工智能实验室)

AI总结 本文提出概念创造力作为元学习,通过创作者生成候选刺激和评估者适应学习,产生可学习的创新内容。

Comments 25 pages

详情
AI中文摘要

创造一个新概念而不是检索一个熟悉的概念,意味着什么?在同一提示下对生成模型重复采样,只会得到风格相近、内容典型的变体。我们提出,创造力是指产生这样一类刺激:自适应的观察者初见时感到陌生,但只需少量接触即可快速学会。我们将其形式化为创作者-评估者对:创作者生成一个候选,评估者通过若干内循环学习步骤对其进行适应,评估者的改进量则成为创作者优化的奖励。我们以扩散模型作为创作者,以MNIST上的自动编码器评估者和用于自然图像的带低秩适配器的CLIP评估者来实例化该框架。扩散模型保持冻结,且不引入额外的语言条件;元学习梯度足以产生基础模型自身无法生成的风格变化和概念组合。

英文摘要

What does it mean to create a new concept, rather than retrieve a familiar one? Repeatedly sampling a generative model at the same prompt produces variations with similar styles and typical content. We propose that creativity is the production of stimuli that are unfamiliar to an adaptive observer at first sight, but quickly learnable from a few exposures. We formalize this as a Creator-Appraiser pair: a Creator generates a candidate, an Appraiser adapts to it for a few inner-loop learning steps, and the Appraiser's improvement becomes the reward the Creator optimizes through. We instantiate the framework with diffusion as the Creator, an autoencoder Appraiser on MNIST, and a CLIP Appraiser with a low-rank adapter for natural images. The diffusion model remains frozen with no additional language conditioning; the meta-learning gradient is enough to produce both stylistic variations and concept compositions that the base model does not generate on its own.
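The reward described in the Creator-Appraiser abstract (the Appraiser's improvement after a few inner-loop steps) can be sketched on a toy problem. Everything below is an assumption made for illustration: the Appraiser is a single scalar weight with squared-error loss, and the sketch only computes the reward value rather than differentiating through the inner loop as a meta-learning setup would.

```python
def appraiser_loss(weight, stimulus):
    """Squared error of a one-parameter toy Appraiser on a scalar stimulus."""
    return (weight - stimulus) ** 2


def creator_reward(stimulus, weight=0.0, lr=0.25, inner_steps=2):
    """Reward = Appraiser loss before adaptation minus loss after adaptation."""
    loss_before = appraiser_loss(weight, stimulus)
    for _ in range(inner_steps):
        grad = 2.0 * (weight - stimulus)  # d/dw of (w - s)^2
        weight -= lr * grad
    loss_after = appraiser_loss(weight, stimulus)
    return loss_before - loss_after


print(creator_reward(2.0))  # unfamiliar at first, but learnable
print(creator_reward(0.0))  # already familiar, so nothing to learn
```

A stimulus the Appraiser already predicts well yields no reward, while one that starts with high loss and is quickly fit yields a large reward, which is the "unfamiliar but learnable" pattern the abstract describes.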

2605.16476 2026-05-19 eess.IV cs.CV cs.LG 版本更新

Deep Learning for MRI Slice Interpolation: The Critical Role of Problem Formulation

深度学习在MRI切片插值中的应用:问题建模的关键作用

Shamit Savant

发表机构 * University of Florida, Gainesville, FL 32611, USA(佛罗里达大学)

AI总结 本文探讨了深度学习在前列腺成像中插值中间MRI切片的方法,发现问题建模对性能的影响远大于架构复杂度,通过改进插值方式提升了SSIM性能。

Comments 10 pages main text, 21 pages total with supplementary, 8 figures, supplementary material included

详情
AI中文摘要

临床MRI的层间(through-plane)分辨率通常远低于层内(in-plane)分辨率,限制了诊断价值。本文研究了用深度学习方法在前列腺成像中插值中间MRI切片,从而将层间分辨率有效提高一倍。作者评估了五种架构(CNN、U-Net、两种GAN变体和DDPM),发现问题建模对性能的影响远大于架构复杂度。通过将插值任务改为使用相邻切片(i-1,i+1)而非较远的切片(i-2,i+2),所有确定性架构的SSIM均提升了58%。U-Net模型取得最佳结果,PSNR为30.08 dB,SSIM为0.898,比线性插值基线提升10.1%。DDPM也参与了评估,但由于随机生成与确定性重建需求之间存在根本性不匹配,其重建质量较差。这些发现表明,在医学影像任务中,问题建模的影响可达架构复杂度的290倍。

英文摘要

Through-plane resolution in clinical MRI is typically much coarser than in-plane resolution, limiting diagnostic utility. This work investigates deep learning approaches to interpolate intermediate MRI slices in prostate imaging, effectively doubling through-plane resolution. I evaluated five architectures (CNN, U-Net, two GAN variants, and DDPM) and discovered that problem formulation has dramatically more impact than architectural complexity. By reformulating the interpolation task to use adjacent slices (i-1, i+1) rather than distant slices (i-2, i+2), I achieved a 58% improvement in SSIM performance across all deterministic architectures. The U-Net model achieved the best results with a PSNR of 30.08 dB and an SSIM of 0.898, representing a 10.1% improvement over the linear interpolation baseline. A DDPM was also evaluated but showed poor reconstruction quality due to a fundamental mismatch between stochastic generation and deterministic reconstruction requirements. These findings demonstrate that problem formulation can have 290x more impact than architectural sophistication in medical imaging tasks.
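The linear interpolation baseline and the PSNR metric mentioned in this abstract can be written down in a few lines. This is a minimal sketch, assuming slices are 2D lists of intensities normalized to [0, 1]; the function names are illustrative and the paper's actual preprocessing is not given in the abstract.

```python
import math


def linear_interpolate(prev_slice, next_slice):
    """Estimate slice i as the pixel-wise mean of slices i-1 and i+1."""
    return [
        [(p + n) / 2.0 for p, n in zip(prev_row, next_row)]
        for prev_row, next_row in zip(prev_slice, next_slice)
    ]


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    sq_errors = [
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    mse = sum(sq_errors) / len(sq_errors)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


slice_prev = [[0.0, 0.2], [0.4, 0.6]]
slice_next = [[0.4, 0.6], [0.8, 1.0]]
estimate = linear_interpolate(slice_prev, slice_next)
print(estimate, psnr(estimate, [[0.2, 0.4], [0.6, 0.8]]))
```

Swapping the inputs from (i-1, i+1) to (i-2, i+2) changes only which slices are passed in, which is the formulation change the abstract reports as mattering more than the choice of architecture.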

2605.16469 2026-05-19 eess.IV cs.CV 版本更新

Flow Matching with Optimized Subclass Priors for Medical Image Augmentation

利用优化子类先验的流匹配用于医学图像增强

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

发表机构 * Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91052 Erlangen, Germany(埃尔朗根-纽伦堡弗里德里希-亚历山大大学生物医学工程人工智能系) Department of Computing, Imperial College London, London SW7 2AZ, UK(伦敦帝国理工学院计算系)

AI总结 本文提出通过优化子类先验来提升医学图像增强中罕见疾病的生成质量,通过生成模型的潜在空间进行子类模式划分,并学习子类条件源分布以提高生成效果和多样性。

Comments 11 pages, 3 figures, 7 tables

详情
AI中文摘要

罕见疾病是医学影像诊断中的主要难点,但在临床数据集中严重不足,导致分类器恰恰在最需要可靠检测的疾病上失效。生成式增强可以补足缺失的尾部类别覆盖,但粗粒度疾病标签把多种亚型和采集设置聚合为多模态条件分布,使生成器偏向占主导的子模式;同时,共享的高斯源分布迫使罕见子群体经过不成比例的长传输路径。我们提出一种离线策略,在两个层面引入信息性先验:首先,在生成模型的潜在空间中通过高斯混合建模将每个粗粒度标签划分为连贯的子模式;其次,学习以子类为条件的源分布,对每个子模式的起始分布重新定位和缩放,从而缩短轨迹并减少子类内的离散度。为防止退化解,我们施加显式的几何控制,使归一化位移方向适度集中于可学习的原型周围,同时限制路径长度的异常值。在长尾胸部X光(MIMIC-LT、NIH-LT)和CT切片(CT-RATE)基准上,所提方法一致地提升了尾部类别的生成保真度和多样性(FID、IRS),并且是一种有前景的增强策略,相比无增强基线,能够在不同模态上可靠地提高下游平衡准确率和宏F1值。

英文摘要

Rare diseases dominate the diagnostic challenge in medical imaging yet are severely underrepresented in clinical datasets, causing classifiers to fail on exactly the conditions where reliable detection matters most. Generative augmentation can supply the missing tail-class coverage, but coarse disease labels aggregate diverse subtypes and acquisition settings into multi-modal conditionals that bias generators toward dominant submodes, while a shared Gaussian source forces rare subpopulations through disproportionately long transport paths. We propose an offline strategy that introduces informative priors at two levels: first, we partition each coarse label into coherent submodes via Gaussian mixture modeling in the generative model's latent space; second, we learn subclass-conditioned source distributions that re-center and re-scale the starting distribution per submode, shortening trajectories and reducing within-subclass dispersion. To prevent degenerate solutions we impose explicit geometric control, moderately concentrating normalized displacement directions around learnable prototypes while capping path-length outliers. On long-tailed chest X-ray (MIMIC-LT, NIH-LT) and CT slice (CT-RATE) benchmarks the proposed method consistently improves tail-class generation fidelity and diversity (FID, IRS) and is a promising augmentation strategy that reliably improves downstream balanced accuracy and macro-F1 over a non-augmented baseline across modalities.

2605.16468 2026-05-19 cs.CV cs.AI cs.CL cs.LG q-bio.NC 版本更新

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

可解释的神经编码机制揭示人类视觉皮层的精细功能选择性

Idan Daniel Grosbard, Mor Geva, Galit Yovel

发表机构 * Sagol School of Neuroscience(萨戈尔神经科学学院) Blavatnik School of Computer Science and AI(布拉瓦提克计算机科学与人工智能学院) School of Psychological Sciences(心理学科学学院)

AI总结 本文提出MINE框架,通过机制可解释工具揭示自然图像中驱动皮层 voxel 活动的特征,验证了特征对 voxel 响应的因果影响,并揭示了视觉皮层中精细的功能选择性。

Comments 40 pages, 28 figures

详情
AI中文摘要

理解人类视觉的核心目标是揭示驱动神经活动的视觉特征。已有研究利用人工神经网络作为编码模型预测皮层对自然图像的响应,揭示了激活类别选择区域的视觉内容。然而,现有方法多为相关性分析,将编码器视为黑箱,无法确定哪些图像特征驱动每个 voxel 的响应。本文提出机制可解释神经编码(MINE)框架,通过机制可解释工具定位自然图像中驱动毫米级(voxel 级)活动的特征。MINE利用语言对齐的图像表示预测每个 voxel 的响应,并生成语义可解释的特征描述,用于 voxel 的激活。进一步将这些 per-image 特征泛化为 per-voxel 功能轮廓。为验证 per-image 描述,我们显示它们足以生成激发 voxel 响应与原始图像响应匹配的图像,其准确性优于随机或低贡献控制生成的图像。此外,通过反事实插入或移除预测特征,可使激活在预期方向变化,提供因果证据。由 voxel 激活轮廓指导的反事实编辑产生更强的激活变化,表明轮廓忠实捕捉每个 voxel 的选择性。最后,将 MINE 应用于研究充分的类别选择脑区,显示其恢复了已知的类别偏好,同时揭示了每个区域内的精细 voxel 结构。总体而言,我们的结果确立了机制可解释性作为发现和验证神经功能精细假设的路径。

英文摘要

A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel's response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel's response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel's activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel's selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.

2605.16464 2026-05-19 cs.CV cs.AI 版本更新

MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation

MHMamba:多头Mamba用于3D脑肿瘤分割

Hanjun Tao, Hua Wang, Fan Zhang

发表机构 * Shandong Technology and Business University(山东工商学院) Ludong University(鲁东大学)

AI总结 本文提出MHMamba,结合U型结构与多头状态空间模型,提升3D脑肿瘤分割的长程表示与多模态训练稳定性,实验显示在BraTS数据集上整体准确率和边界平滑度显著提升。

Comments 10 pages, 3 figures, 4 tables

详情
AI中文摘要

脑肿瘤在形态和多模态对比方面具有高度异质性,手动逐层标注耗时且依赖经验,因此需要高效稳定的自动化分割方法。为解决CNN建模长程依赖的局限性和Transformer在3DMRI中的计算和内存开销问题,本文提出多头Mamba(MHMamba)。该方法结合U型架构与多头状态空间模型(Mamba),将通道维度拆分为并行SSM头并进行残差聚合,增强长程表示并提升多模态训练的稳定性,同时保持线性复杂度。为进一步对齐统计信息并增强病变响应,设计了多头输出的通道空间校准模块,并引入适应性融合机制在跳跃连接中动态连接全局语义与局部细节,从而提升边界一致性及小体积病变的检测能力。在BraTS2021和BraTS2023上进行了实验和消融测试,结果显示MHMamba在整体准确率、边界平滑度及肿瘤核心和小体积增强区域的敏感度上实现了稳定显著的提升,同时保持了基于Mamba建模的线性复杂度优势,验证了方法的有效性和通用性。

英文摘要

Brain tumors exhibit high heterogeneity in morphology and multimodal contrast, making manual slice-by-slice delineation time-consuming and experience-dependent, thus necessitating efficient and stable automated segmentation methods. To address the limitations of CNNs in modeling long-range dependencies, and the heavy computational and memory overhead and inter-block contextual incoherence of Transformers in 3D MRI, this paper proposes Multi-Head Mamba (MHMamba). This method combines a U-shaped architecture with a multi-head state-space model (Mamba), splitting the channel dimension into parallel SSM heads and aggregating them with residuals. This enhances long-range representation and improves the stability of multimodal training while maintaining linear complexity. To further align statistics and enhance lesion response, we designed a channel-space calibration module for multi-head outputs and introduced an adaptive fusion mechanism at skip connections to dynamically connect global semantics with local details, thereby improving boundary consistency and the detection of small-volume lesions. We conducted experiments and ablations on BraTS2021 and BraTS2023. The results showed that MHMamba achieved stable and significant improvements in overall accuracy, boundary smoothness, and sensitivity to tumor core and small-volume enhancement areas, while preserving the linear-complexity advantage of Mamba-based modeling, thus verifying the effectiveness and versatility of the method.

2605.16460 2026-05-19 cs.CV 版本更新

REC-RL: Referring expression counting via Gaussian and range-based reward optimization

基于高斯和范围的奖励优化的指称表达计数(REC-RL)

Hui Liu, Yunlai Teng, Kunlong Bai, Pengfei Qi, Haotian Yan, Liang Li, Junlan Feng

发表机构 * JIUTIAN Research, Beijing, China(九天研究,北京,中国)

AI总结 本文提出REC-RL框架,通过引入思考-范围-回答范式优化视觉推理过程,结合精度指导和结构化输出奖励,提升指称表达计数任务的性能和泛化能力。

Comments 5 pages

详情
AI中文摘要

指称表达计数(REC)是一种意图驱动的任务,需要上下文感知的视觉推理。尽管最近的视觉-语言模型将语言融入视觉理解,但大多数现有REC方法依赖基于规则的强化学习,奖励主要关注最终准确性,忽略了中间推理的质量。我们提出REC-RL,一种强化学习框架,引入思考-范围-回答范式,显式优化视觉推理过程。REC-RL采用群体相对策略优化,并采用两种轻量级奖励:一种结合基于范围的区间监督与基于高斯的精度指导的准确性奖励,另一种强制结构化输出的格式奖励。通过将中间焦点预测建模为内部决策过程,REC-RL避免了额外注释,并更好地与人类感知对齐。广泛的实验表明,REC-RL在强基线模型上表现一致提升,并在多个基准上表现出稳健的泛化能力。

英文摘要

Referring expression counting (REC) is an intention-driven task that requires context-aware visual reasoning. While recent vision-language models incorporate language for visual understanding, most existing REC methods rely on rule-based reinforcement learning with rewards focused primarily on final accuracy, overlooking the quality of intermediate reasoning. We propose REC-RL, a reinforcement learning framework that introduces a think-range-answer paradigm to explicitly optimize the visual reasoning process. REC-RL employs Group Relative Policy Optimization and two lightweight rewards: an accuracy reward that combines range-based interval supervision with Gaussian-based precision guidance, and a format reward that enforces structured outputs. By modeling intermediate focus prediction as internal decision-making, REC-RL avoids additional annotations and better aligns with human perception. Extensive experiments demonstrate consistent improvements over strong baselines and robust generalization across benchmarks.
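The two rewards named in the REC-RL abstract can be sketched as follows. The abstract does not give their functional form, so every detail here is an assumption: the range term is a 0/1 indicator that the true count lies inside the predicted interval, the Gaussian term decays with the squared count error, the two are mixed with a fixed weight, and the `<think>`, `<range>` and `<answer>` tags are hypothetical placeholders for the think-range-answer structure.

```python
import math
import re

# Hypothetical tag layout for a think-range-answer response.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<range>.*?</range>\s*<answer>.*?</answer>$",
    re.DOTALL,
)


def accuracy_reward(pred_count, pred_range, gt_count, sigma=1.0, range_weight=0.5):
    """Mix an interval-hit indicator with a Gaussian bonus on the count error."""
    low, high = pred_range
    range_term = 1.0 if low <= gt_count <= high else 0.0
    gauss_term = math.exp(-((pred_count - gt_count) ** 2) / (2.0 * sigma ** 2))
    return range_weight * range_term + (1.0 - range_weight) * gauss_term


def format_reward(response):
    """1.0 if the response follows the assumed tag structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(response.strip()) else 0.0


print(accuracy_reward(5, (4, 6), gt_count=5))
print(format_reward("<think>count the dogs</think><range>4-6</range><answer>5</answer>"))
```

The design intent suggested by the abstract is that the interval term gives coarse credit when the model looks in the right neighborhood, while the Gaussian term sharpens the reward around the exact count.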

2605.16458 2026-05-19 cs.CV cs.AI 版本更新

Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery

安全敏感医学图像修复中的保守AI:用于颅内动脉瘤相关信号恢复的残差受限CT-CTA增强

Weijun Ma

发表机构 * Independent Researcher(独立研究者) King George, Vancouver School Board(金乔治,温哥华学区) Vancouver, BC, Canada(温哥华,BC省,加拿大)

AI总结 本文提出了一种残差受限的2.5D修复框架,用于安全敏感的医学图像修复,通过编辑控制图限制修改范围,提升CT和CTA图像质量,减少误诊风险。

Comments Preprint manuscript, 16 pages, 4 figures, 3 tables. This manuscript presents a residual-bounded 2.5D CT/CTA restoration framework for conservative medical image enhancement and evaluates it using image-recovery, baseline comparison, Monte Carlo stability, anatomical localization, and external low-dose CT testing

详情
AI中文摘要

图像修复模型越来越多地应用于退化的医学扫描,但在安全敏感的环境中,必须在不对临床重要区域进行不受控制的修改的前提下提高图像质量。这在颅内CT和CT血管造影(CTA)中尤为重要,因为小血管和动脉瘤相关线索靠近高对比度的解剖边界。我们将医学图像修复视为保守AI问题,并提出了一种基于合成退化CT/CTA输入训练的残差受限2.5D修复框架。模型通过编辑控制图将学习到的残差添加到原始中心切片中,限制修改的幅度和空间范围。我们使用动脉瘤相关图像恢复矩阵、与高斯基线的配对比较、蒙特卡洛稳定性测试、有意义编辑的解剖定位以及低剂量CT的外部评估来评估该框架。在50个分布外的CT-CTA案例中,受限模型实现了平均目标增益0.0635,平均PSNR 37.51 dB,以及医源性编辑率4.0%。在1000次蒙特卡洛运行中,模型在85.4%的运行中保持净正收益,没有稳定负收益的案例。在外部低剂量CT中,模型在方向上有益,并且产生的修改足迹比基线小得多。有意义的编辑集中在大脑和颅骨区域,而无关解剖结构几乎没有变化。这些发现提供了初步的计算证据,表明在边界敏感的血管成像中残差受限的修复是可行的,但它们不证明临床诊断性能,需要专家审查和前瞻性验证后才能用于临床应用。

英文摘要

Image restoration models are increasingly applied to degraded medical scans, but in safety-sensitive settings they must improve image quality without uncontrolled modification of clinically important regions. This is especially relevant for intracranial CT and CT angiography (CTA), where small vessels and aneurysm-relevant cues lie near high-contrast anatomical boundaries. We frame medical image restoration as a conservative AI problem and present a residual-bounded 2.5D restoration framework trained on synthetically degraded CT/CTA inputs. The model adds a learned residual to the original center slice through an edit-control map that limits the magnitude and spatial extent of modification. We evaluate the framework using an aneurysm-relevant image-recovery matrix, paired comparison against a Gaussian baseline, Monte Carlo stability testing, anatomical localization of meaningful edits, and external evaluation on low-dose CT. On 50 out-of-distribution CT-CTA cases, the bounded model achieved a mean target gain of 0.0635, a mean PSNR of 37.51 dB, and an iatrogenic-edit rate of 4.0%. Across 1,000 Monte Carlo runs, it remained net positive in 85.4% of runs with no stably negative cases. On external low-dose CT, the model was directionally beneficial and produced a substantially smaller modification footprint than the baseline. Meaningful edits concentrated in brain and skull regions while unrelated anatomy showed negligible change. These findings provide preliminary computational evidence that residual-bounded restoration is feasible in boundary-sensitive vascular imaging, but they do not establish clinical diagnostic performance and require expert review and prospective validation before clinical use.
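The bounded update described in this abstract (a learned residual added to the original center slice through an edit-control map) can be sketched as below. The exact bounding rule is not stated in the abstract, so this sketch assumes a hard clip on the residual magnitude and an edit map with values in [0, 1]; pixels are given as flat lists and the names are hypothetical.

```python
def bounded_restore(center_slice, residual, edit_map, max_edit=0.05):
    """Return x + m * clip(r, -max_edit, max_edit) for each pixel.

    center_slice: original intensities x
    residual:     model-predicted residual r (assumed unbounded)
    edit_map:     per-pixel gate m, clamped to [0, 1]
    """
    restored = []
    for x, r, m in zip(center_slice, residual, edit_map):
        clipped = max(-max_edit, min(max_edit, r))
        gate = max(0.0, min(1.0, m))
        restored.append(x + gate * clipped)
    return restored


original = [0.5, 0.5, 0.5]
output = bounded_restore(original, residual=[0.3, -0.3, 0.01], edit_map=[1.0, 1.0, 0.0])
print(output)
```

Under these assumptions no pixel can move by more than `max_edit`, and pixels where the edit map is zero are returned unchanged, which is the conservative behavior the abstract argues for.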

2605.16456 2026-05-19 cs.CV 版本更新

Multi-hop Relational Contrastive Learning: Extending Spatial Contrastive Pre-training Beyond Pairwise Relations

多跳关系对比学习:超越成对关系的空间对比预训练

Sheikh Tanvir Ahmed, Md. Tanvir Raihan

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) United International University(国际联合大学)

AI总结 本文提出多跳关系对比学习框架,通过捕捉场景图中k跳路径的隐含空间依赖,提升空间感知能力,在GQA子集上实现了更优的检索和下游任务表现。

详情
AI中文摘要

理解物体间空间关系对场景理解至关重要,但大多数对比预训练方法仅建模成对关系,忽略了更丰富的组合和多跳交互。本文提出多跳关系对比学习(MRCL)框架,扩展空间对比学习到图结构的场景表示。通过追踪场景图中k跳路径,MRCL捕捉隐含的空间依赖,定义多级对比目标,鼓励嵌入在保持语义稳定性的同时响应空间布局。在GQA子集上,MRCL生成空间感知表示,提升内容基于图检索(NDCG@5=0.748)并持续改善下游任务,包括空间关系识别和图基问题回答。这些结果表明,多跳关系监督比仅成对方法提供更丰富的结构指导,导致更鲁棒、组合和几何感知的视觉表示。

英文摘要

Understanding how objects relate to each other in space is fundamental to scene understanding, yet most contrastive pre-training approaches only model pairwise relationships, leaving richer compositional and multi-hop interactions largely unexplored. We introduce Multi-Hop Relational Contrastive Learning (MRCL), a framework that extends spatial contrastive learning to graph-structured scene representations. By tracing k-hop paths through scene graphs built from detected objects, MRCL captures implicit spatial dependencies that go well beyond what direct object pairs can express. We define a multi-level contrastive objective spanning nodes, edges, and multi-hop paths, encouraging embeddings that remain stable across object semantics while staying responsive to spatial layout. On a GQA subset, MRCL produces spatially-aware representations that improve content-based graph retrieval (NDCG@5 = 0.748) and consistently benefit downstream tasks, including spatial relationship recognition and graph-based question answering. Together, these results suggest that multi-hop relational supervision offers substantially richer structural guidance than pairwise-only methods, leading to visual representations that are more robust, compositional, and geometry-aware.
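The k-hop paths that MRCL traces through a scene graph can be illustrated with a small enumeration routine. This is a sketch under stated assumptions, not the paper's method: the graph is a plain adjacency dictionary with relation-labelled edges, paths are required to be simple (no repeated nodes), and the object and relation names are made up.

```python
def k_hop_paths(graph, start, k):
    """Enumerate simple paths of exactly k edges starting at `start`.

    graph: {node: [(relation, neighbor), ...]}
    Returns a list of paths, each a tuple of (source, relation, target) triples.
    """
    paths = []

    def walk(node, path, visited):
        if len(path) == k:
            paths.append(tuple(path))
            return
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                walk(neighbor, path + [(node, relation, neighbor)], visited | {neighbor})

    walk(start, [], {start})
    return paths


scene = {
    "cup": [("on", "table")],
    "table": [("left of", "chair")],
    "chair": [],
}
print(k_hop_paths(scene, "cup", 2))
```

The 2-hop path from cup to chair encodes a relation that no single edge states directly, which is the kind of implicit spatial dependency a pairwise-only objective would not see.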

2605.16444 2026-05-19 cs.CV cs.AI 版本更新

Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images

扩散注意力专家模型用于预测和半自动定位肺癌组织病理图像中的STAS

Liangrui Pan, Jiadi Luo, Yuxuan Xiao, Chenchen Nie, Xiaoshuai Wu, Songqing Fan, Ling Chu, Manqiu Li, Rongfang He, Zhenyu Zhao, Ruixing Wang, Shulin Liu, Yiyi Liang, Xiang Wang, Qingchun Liang, Shaoliang Peng

发表机构 * College of Computer Science and Electronic Engineering, Hunan University(湖南大学计算机科学与电子工程学院) Department of Pathology, The Second Xiangya Hospital, Central South University(中南大学湘雅二医院病理科) Hunan Clinical Medical Research Center for Cancer Pathogenic Genes Testing and Diagnosis(湖南临床医学肿瘤基因检测与诊断研究中心) Department of Thoracic Surgery, The Second Xiangya Hospital, Central South University(中南大学湘雅二医院胸外科) Department of Pathology, Hunan Cancer Hospital, The Affiliated Cancer Hospital of Xiangya School of Medicine, Central South University(湖南肿瘤医院病理科) Department of Pathology, The Third Xiangya Hospital, Central South University(中南大学湘雅三医院病理科) Department of Pathology, First People's Hospital of Pingjiang County(平江县第一人民医院病理科) Department of Pathology, the First Affiliated Hospital, Hengyang Medical School, University of South China(南华大学衡阳医学院第一附属医院病理科) Department of Radiology, The Second Xiangya Hospital of Central South University(中南大学湘雅二医院放射科) Department of Radiology, Xiangya Hospital, Central South University(中南大学湘雅医院放射科) Oncology Department and State Key Laboratory of Systems Medicine for Cancer of Shanghai Cancer Institute, Renji Hospital, School of Medicine, Shanghai Jiaotong University(上海癌症研究院肿瘤科及上海交通大学医学院系统医学重点实验室)

AI总结 本文提出DAEM模型,通过多尺度特征学习和双分支架构提升STAS检测精度,实现对冷冻切片和石蜡切片的高AUC值检测,并利用肿瘤微环境特征实现STAS半自动定位。

Comments Accepted by Nature Communications

详情
AI中文摘要

准确的术中和术后STAS诊断对指导肺癌手术决策和术后管理至关重要。然而,组织病理学评估耗费人力且易出现漏诊或误诊。我们提出扩散注意力专家模型(DAEM)用于检测冷冻切片(FSs)和石蜡切片(PSs)中的STAS。其扩散注意力专家模块利用全注意力聚合学习多尺度特征,而双分支架构强化多尺度特征表示。在内部数据集中,DAEM在FSs和PSs上分别达到0.8946和0.9112的AUC值。在八个机构的外部多中心数据集上验证显示,模型具有强泛化性和可解释性。利用PSs中的肿瘤微环境(TME)特征,进一步实现了STAS位置及其与原发肿瘤距离的半自动测量。多个定量TME指标被识别为STAS的潜在生物标志物,包括微泡型STAS。总体而言,DAEM通过在FSs和PSs上实现准确且可解释的检测,为STAS评估提供临床可操作的框架,通过基于定量TME的分析支持术后风险分层。

英文摘要

Accurate intraoperative and postoperative diagnosis of spread through air spaces (STAS) is essential for guiding surgical decisions and postoperative management in lung cancer. However, histopathological assessment is labor-intensive and is prone to missed or incorrect diagnoses. We propose a Diffusion Attention Expert Model (DAEM) to detect STAS in frozen sections (FSs) and paraffin sections (PSs). Its diffusion attention expert module leverages full attention aggregation to learn multi-scale features from histopathological images, while a dual-branch architecture strengthens multi-scale feature representation. On an internal dataset, DAEM achieves AUCs of 0.8946 for FSs and 0.9112 for PSs. Validation on external multi-center datasets from eight institutions demonstrates strong generalizability and interpretability. Using tumor microenvironment (TME) features in PSs, we further enable semi-automatic measurement of STAS location and its distance from the primary tumor. Several quantitative TME metrics are identified as potential biomarkers for STAS, including micropapillary-type STAS. Overall, DAEM offers a clinically actionable framework for STAS assessment by enabling accurate and interpretable detection on FSs and PSs, supporting postoperative risk stratification through quantitative TME-based analysis.

2605.16440 2026-05-19 cs.CV cs.AI 版本更新

Semantic Smoothing via Novel View Synthesis for Robust SAR Image Classification

通过新颖视角合成实现语义平滑以实现稳健的SAR图像分类

Daniel Brignac, Fengwei Tian, Banafsheh Latibari, Abhijit Mahalanobis, Ravi Tandon

发表机构 * The University of Arizona(亚利桑那大学)

AI总结 本文提出语义平滑方法,通过新颖视角合成模型生成结构化随机变换,提升SAR图像分类在对抗攻击下的鲁棒性,并提高干净分类准确率。

详情
AI中文摘要

深度神经网络对对抗扰动敏感,限制了其在安全关键应用中的部署,如合成孔径雷达(SAR)自动目标识别(ATR)。随机化平滑通过在噪声输入上平均预测来提高鲁棒性,但各向同性噪声常无法保持SAR图像的语义结构。我们提出语义平滑,一种防御方法,用由新颖视角合成模型生成的结构化随机变换取代基于噪声的扰动。对于SAR,我们根据获取几何学合成多个可能的雷达视角。在生成的随机视角上进行预测并聚合,以形成鲁棒分类器。实验表明,语义平滑在标准攻击(如FGSM和PGD)以及SAR特定攻击(如OTSA和SMGAA)中提高了鲁棒性,同时提高了干净分类准确率。这些结果表明,通过保留语义的几何变换进行随机化平滑,是结构感知领域对抗防御的一种有前景的替代方案。

英文摘要

Deep neural networks are vulnerable to adversarial perturbations, limiting deployment in safety-critical applications such as synthetic aperture radar (SAR) automatic target recognition (ATR). Randomized smoothing improves robustness by averaging predictions over noisy inputs, but isotropic noise often fails to preserve the semantic structure of SAR imagery. We propose semantic smoothing, a defense that replaces noise-based perturbations with structured randomized transformations generated by a novel view synthesis model. For SAR, we condition on acquisition geometry to synthesize multiple plausible radar views. Predictions across generated randomized views are aggregated to form a robust classifier. Experiments show that semantic smoothing improves robustness against standard attacks, such as FGSM and PGD, and SAR-specific attacks, such as OTSA and SMGAA, while also increasing clean classification accuracy. These results demonstrate that randomized smoothing via semantically preserving geometric transformations is a promising alternative to isotropic noise for adversarial defense in structured sensing domains.
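The aggregation step in this abstract (predictions over randomly generated views combined into one robust decision) can be sketched with a majority vote. The abstract does not specify the aggregation rule, so majority voting is an assumption, and `classify` and `sample_view` are hypothetical stand-ins for the base classifier and the novel view synthesis model.

```python
import random
from collections import Counter


def smoothed_predict(classify, sample_view, x, num_views=32, seed=0):
    """Classify several randomized views of x and return the majority label."""
    rng = random.Random(seed)
    votes = Counter(classify(sample_view(x, rng)) for _ in range(num_views))
    return votes.most_common(1)[0][0]


# Toy stand-ins: a "view" shifts a scalar input, the classifier thresholds it.
def toy_view(x, rng):
    return x + rng.uniform(-1.0, 1.0)


def toy_classifier(x):
    return "target" if x > 0 else "clutter"


print(smoothed_predict(toy_classifier, toy_view, 5.0))
```

Replacing `toy_view` with isotropic Gaussian noise would recover the standard randomized smoothing setup that the abstract contrasts against.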

2605.16439 2026-05-19 cs.CV cs.AI 版本更新

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

KVCapsule:面向具有不对称冗余的视觉-语言模型的高效序列KV缓存压缩

Yingbing Huang, Tharun Adithya Srikrishnan, Steven K. Reinhardt, Deming Chen

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) AMD

AI总结 本文提出KVCapsule,一种针对视觉语言模型的KV缓存压缩框架,通过轻量压缩和重建组件实现内存节省,提升吞吐量并减少内存占用,同时保持精度。

详情
AI中文摘要

视觉-语言模型(VLMs)作为大型语言模型(LLMs)的重要扩展,通过文本和图像输入实现多模态推理。尽管VLMs增强了语言模型的能力,但它们也继承并放大了关键计算瓶颈:自回归解码过程中大规模键值(KV)缓存带来的内存开销。这一挑战在VLMs中尤为严重,因为与文本相比,图像会产生更长的token序列和更密集的特征表示。此外,视觉token的空间和信息丰富性引入了结构化的注意力模式,使得许多针对LLM的KV缓存压缩技术在直接应用于VLMs时效果不佳。在本文中,我们对视觉token的行为进行了详细的实证分析,突显其与纯文本模型的关键差异。基于这些见解,我们提出KVCapsule,一种新的视觉token的KV缓存压缩框架。KVCapsule保持预训练VLM骨干网络冻结,不需要修改注意力计算模块,并且可以通过轻量级压缩和重建组件集成到现有VLMs中。我们评估了KVCapsule在多个VLMs和基准任务上的性能,证明在60%的压缩率下,TPS提升达2倍,KV缓存内存减少达2.4倍,同时精度或响应质量几乎没有下降。我们的发现为在受限内存预算下扩展VLM推理提供了实用路径,并启发针对多模态模型的结构感知缓存压缩方法的进一步研究。

英文摘要

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify key computational bottlenecks: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.
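The memory pressure this abstract refers to can be made concrete with the standard KV cache size formula. The sketch below is illustrative arithmetic only: the model dimensions are made up, and it assumes that a "60% compression ratio" means 60% of the vision-token cache entries are removed while text-token entries are untouched, which the abstract does not state.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a dense KV cache: keys and values for every layer, head and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def memory_reduction(vision_fraction, removed_fraction):
    """Overall reduction factor when only vision-token entries are compressed.

    vision_fraction:  share of cached tokens that are vision tokens
    removed_fraction: share of vision-token entries removed (assumed meaning
                      of the compression ratio)
    """
    kept = (1.0 - vision_fraction) + vision_fraction * (1.0 - removed_fraction)
    return 1.0 / kept


# Hypothetical configuration: 32 layers, 8 KV heads, head dim 128, fp16, 4096 tokens.
print(kv_cache_bytes(32, 8, 128, 4096) / 2**20, "MiB")
print(memory_reduction(vision_fraction=0.9, removed_fraction=0.6))
```

Under this reading, the overall saving depends on how much of the sequence is made of vision tokens, so a reduction factor somewhat below the all-vision upper bound is what one would expect when some text tokens remain uncompressed.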

2605.16431 2026-05-19 cs.CV 版本更新

CT-DegradBench: A Physics-Informed Benchmark for CT Degradation Detection and Severity Estimation

CT-DegradBench:一种用于CT退化检测和严重程度估计的物理引导基准

Yousra Nabila Taifour, Marouane Tliba, Zuheng Ming, Marie Luong, Nour Aburaed, Aladine Chetouani, Gorkem Durak, Alessandro Bruno, Faouzi Alaya Cheikh, Habib Zaidi, Ulas Bagci, Azeddine Beghdadi

发表机构 * Université Sorbonne Paris Nord(索邦巴黎北大学) University of Dubai(迪拜大学) Northwestern University(西北大学) IULM University(IULM大学) Norwegian University of Science and Technology(挪威科学与技术大学) University of Geneva(日内瓦大学)

AI总结 本文提出CT-DegradBench,一个用于评估CT退化检测和严重程度估计的基准,结合语义先验和频域线索,提出SeSpeCT框架,在多模态嵌入空间中构建免训练的语义质量轴,实现退化类型和严重程度的联合预测。

Comments Accepted in CVPR 2026 VISION Workshop (DEXTER track)

详情
AI中文摘要

Computed tomography (CT) images are frequently degraded by acquisition artifacts, including noise, blur, streaking, aliasing, and metal artifacts. Yet CT enhancement is still largely evaluated using image quality metrics with limited perceptual and clinical validity, while existing datasets remain focused on isolated restoration tasks, hindering unified benchmarking across diverse degradation types. We present CT-DegradBench, a dataset and benchmark for CT degradation detection and severity estimation under controlled single- and mixed-artifact settings. CT-DegradBench enables systematic evaluation across multiple degradation families and severity levels within a common experimental framework. We further propose SeSpeCT (Semantic-Spectral CT degradation estimation), a framework that combines semantic priors from medical vision-language models with complementary frequency-domain cues for artifact analysis. SeSpeCT constructs a training-free semantic quality axis in the multimodal embedding space using radiology-informed text prompts, without task-specific fine-tuning, and combines it with spectral features that capture degradation-specific frequency patterns. The resulting representation enables joint prediction of artifact type and severity. Experimental results show that SeSpeCT consistently outperforms the evaluated baselines under both single- and mixed-degradation settings. The framework is available at https://github.com/yousranb/CT-DEGRADBENCH.

英文摘要

Computed tomography (CT) images are frequently degraded by acquisition artifacts, including noise, blur, streaking, aliasing, and metal artifacts. Yet CT enhancement is still largely evaluated using image quality metrics with limited perceptual and clinical validity, while existing datasets remain focused on isolated restoration tasks, hindering unified benchmarking across diverse degradation types. We present CT-DegradBench, a dataset and benchmark for CT degradation detection and severity estimation under controlled single- and mixed-artifact settings. CT-DegradBench enables systematic evaluation across multiple degradation families and severity levels within a common experimental framework. We further propose SeSpeCT (Semantic-Spectral CT degradation estimation), a framework that combines semantic priors from medical vision-language models with complementary frequency-domain cues for artifact analysis. SeSpeCT constructs a training-free semantic quality axis in the multimodal embedding space using radiology-informed text prompts, without task-specific fine-tuning, and combines it with spectral features that capture degradation-specific frequency patterns. The resulting representation enables joint prediction of artifact type and severity. Experimental results show that SeSpeCT consistently outperforms the evaluated baselines under both single- and mixed-degradation settings. The framework is available at https://github.com/yousranb/CT-DEGRADBENCH.

2605.16427 2026-05-19 cs.CV cs.AI 版本更新

EAGT: Echocardiography Augmentation for Generalisability and Transferability

EAGT:面向泛化性与可迁移性的超声心动图数据增强

Soroush Elyasi, Sara Adibzadeh, Nasim Dadashi Serej, Julie Wall, Massoud Zolgharni

发表机构 * THRIVE Centre, University of West London(西伦敦大学THRIVE中心) University of West London(西伦敦大学) School of Computing and Engineering, University of West London(西伦敦大学计算机与工程学院)

AI总结 本文研究了29种数据增强技术及其组合对左心室分割的通用性和可迁移性影响,发现几何变换优于强度增强,且最佳组合提升模型鲁棒性。

详情
AI中文摘要

深度学习模型在超声分割中常难以跨机构、设备和患者群体泛化,因收集大量一致标注数据不现实。数据增强广泛用于提升模型鲁棒性,但其在超声中的跨数据集泛化作用尚不明确。本文评估了29种数据增强技术及其配对组合,使用U-Net在Unity、CAMUS和EchoNet Dynamic数据集上进行2D左心室分割。每种增强方法在不同超参数设置下,通过Dice和IoU在域内和跨域场景下重复运行评估,统计显著性通过独立t检验量化。结果表明,解剖合理几何变换,特别是仿射、位移-缩放-旋转、透视和随机水平翻转,显著提升跨数据集性能,而激进的强度或伪影增强常降低泛化能力。配对增强组合优于单个增强,尤其以随机水平翻转与仿射组合在大多数迁移场景中表现一致。这些发现为设计增强策略提供了实证指导,以增强超声分割模型的鲁棒性和可迁移性。

英文摘要

Deep learning models for echocardiography segmentation often struggle to generalise across institutions, scanners, and patient populations, where collecting large, consistently annotated datasets is infeasible. Data augmentation is widely used to improve the robustness of deep learning models; however, its role in enhancing cross-dataset generalisability in echocardiography remains insufficiently understood. This study presents a large-scale multi-dataset evaluation of 29 data augmentation techniques and their pairwise combinations for 2D left ventricular segmentation using a U-Net trained on Unity, CAMUS, and EchoNet Dynamic datasets. Each augmentation was explored under several hyperparameter settings and assessed through repeated runs using Dice and IoU in both in-domain and cross-dataset scenarios, with statistical significance quantified via independent t-tests. Results show that anatomically plausible geometric transformations, particularly affine, shift-scale-rotate, perspective, and random horizontal flip, substantially improve cross-dataset performance, whereas aggressive intensity- or artefact-based augmentations often degrade generalisability. Pairwise augmentation combinations outperform individual augmentations and show that moderate flip-centric combinations, especially random horizontal flip with affine, yield consistent gains across most transfer scenarios. These findings provide empirically grounded guidance for designing augmentation policies that enhance the robustness and transferability of echocardiography segmentation models.
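
The abstract above scores segmentations with Dice and IoU. As a rough illustration of how these two overlap measures relate, here is a minimal sketch for binary masks stored as flat 0/1 lists. The function name `dice_and_iou` and the empty-mask convention are assumptions made for this example; they are not taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

Dice is never lower than IoU for the same pair of masks, which is one reason studies often report both.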

2605.16423 2026-05-19 cs.CV 版本更新

Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization

非线性双极补偿:处理训练后量化中的异常值

Peilin Sun, Jianxin Wu

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院)

AI总结 本文提出非线性双极补偿方法,通过引入非线性补偿减少异常值影响,设计双极对数变换压缩异常值,提升训练后量化的效率与鲁棒性。

详情
AI中文摘要

网络量化已成为最实用的模型压缩技术,通过将浮点数映射到低比特表示显著减少模型的内存和计算消耗。然而,现有量化方法通常面临速度-精度权衡和泛化能力有限的问题。为解决这些问题,最近的补偿方法通过引入额外的轻量线性层提供高效且通用的解决方案。然而,这些方法的准确性受到其补偿能力有限和对异常值敏感的限制。在本文中,我们提出非线性双极补偿(NBC),一种训练后量化方法,通过引入非线性补偿减少异常值的影响。我们进一步设计双极对数变换(BLT),通过将量化输入和量化误差映射到变换空间来压缩异常值。然后在变换空间中应用简单的线性层进行补偿,保持方法的效率。在各种任务、模型和量化方法上的广泛实验证实了我们NBC方法的有效性、效率、鲁棒性和通用性。

英文摘要

Network quantization has emerged as one of the most practical model compression techniques, which significantly reduces a model's memory and compute consumption by mapping floating-point numbers to low-bit representations. However, existing quantization methods typically suffer from the speed-accuracy tradeoff and limited generalization. To address these issues, recent compensation-based methods offer an efficient yet general solution by introducing additional lightweight linear layers into the quantized network. However, the accuracy of these methods suffers from their limited compensation capability and high sensitivity to outliers. In this paper, we propose Nonlinear Bipolar Compensation (NBC), a post-training quantization approach that introduces nonlinear compensation to reduce the effect of outliers. We further design Bipolar Logarithmic Transformation (BLT), which compresses outliers by mapping both the quantized input and the quantization error into a transformed space. A simple linear layer is then applied for compensation in the transformed space, preserving the efficiency of our method. Extensive experiments across various tasks, models, and quantization methods confirm the effectiveness, efficiency, robustness, and generality of our NBC approach.
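
The abstract does not give the formula for the Bipolar Logarithmic Transformation, so the following is only a sketch of the general idea of compressing outliers with a sign-preserving logarithm. The function names and the specific form sign(x)·log(1+|x|) are assumptions for illustration and may differ from the authors' BLT.

```python
import math


def bipolar_log(x):
    """Sign-preserving log compression: sign(x) * log(1 + |x|) (assumed form)."""
    return math.copysign(math.log1p(abs(x)), x)


def bipolar_log_inverse(y):
    """Inverse of bipolar_log: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


if __name__ == "__main__":
    for value in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
        print(value, bipolar_log(value))
```

Under this assumed form, values near zero pass through almost unchanged while large magnitudes of either sign are squeezed into a narrow range, which is why a simple linear layer applied in the transformed space is less dominated by outliers.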

2605.16420 2026-05-19 cs.CV cs.LG 版本更新

Video Reconstruction using Diffusion-based Image-to-Video Generation with Trajectory Guidance

基于扩散模型的图像到视频生成与轨迹引导的视频重建

Stelio Bompai, Ioannis Kontopoulos, Giannis Spiliopoulos, Dimitris Zissis, Konstantinos Tserpes

发表机构 * Department of Electrical and Computer Engineering, National Technical University of Athens, Greece(希腊雅典国立技术大学电气与计算机工程系) Department of Product and Systems Design Engineering, University of the Aegean, Syros, Greece(希腊爱琴大学产品与系统设计工程系,锡罗斯)

AI总结 本文提出利用预训练的图像到视频扩散模型,通过GPS轨迹引导生成无人机视频的缺失或丢失帧,无需领域特定微调,展示了在低纹理和小目标条件下视频重建的有效性。

Comments Accepted at the 1st Workshop on Multi-Sensor Trajectory Knowledge Discovery and Extraction (MuseKDE 2026), co-located with the 27th IEEE International Conference on Mobile Data Management (IEEE MDM 2026)

详情
AI中文摘要

本文研究自主水面航行器执行结构化海上机动时,俯视无人机视频中缺失或丢失帧的重建问题。我们提出了一种流程,利用预训练的图像到视频扩散模型,将原始GPS遥测数据和单个参考帧转换为轨迹引导的视频序列,无需领域特定微调。机载遥测日志中的GPS坐标通过等距柱状映射投影到图像空间,得到每艘船的运动线索,用于条件化SG-I2V扩散模型。生成的帧通过感知、时间和轨迹度量与真实视频进行比较评估,并与光流外推和RIFE插值基线进行对比。SG-I2V在所有方法中生成的帧最为自然(BRISQUE 25.52,最接近真实值23.64),运动幅度最真实(时间平滑度1.14,真实值为1.42),GPS轨迹一致性最强(9.31px,真实视频为28.70px;后者反映的是视频与GPS日志之间仅为近似的时间对齐,而非生成误差),表明轨迹引导的扩散合成在低纹理、小目标的困难条件下是一种可行的海上视频重建方法。

英文摘要

This paper addresses the problem of reconstructing missing or dropped frames in top-down drone video of autonomous surface vehicles performing structured maritime manoeuvres. We propose a pipeline that converts raw GPS telemetry and a single reference frame into a trajectory-guided video sequence using a pre-trained image-to-video diffusion model, requiring no domain-specific fine-tuning. GPS coordinates from onboard telemetry logs are projected into image space via an equirectangular mapping, producing per-vessel motion cues that condition the SG-I2V diffusion model. The generated frames are evaluated against ground-truth video using perceptual, temporal and trajectory-based metrics, and benchmarked against optical flow extrapolation and RIFE interpolation baselines. SG-I2V produces the most naturally appearing frames among all methods (BRISQUE 25.52, closest to ground-truth 23.64), the most realistic motion magnitude (temporal smoothness 1.14 vs. ground truth 1.42), and the strongest GPS trajectory adherence (9.31px vs. 28.70px for ground-truth, the latter reflecting approximate temporal alignment between footage and GPS logs rather than generation error), demonstrating that trajectory-guided diffusion synthesis is a viable approach to maritime video reconstruction under challenging low-texture, small-object conditions.
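
The abstract mentions projecting GPS coordinates into image space via an equirectangular mapping. A minimal sketch of such a mapping is shown below. It assumes a nadir-pointing, north-up camera, a known reference pixel with known GPS position, and a known ground resolution; `gps_to_pixel`, `ref_px`, and `metres_per_pixel` are hypothetical names for this example, and the paper's actual calibration may differ.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def gps_to_pixel(lat, lon, ref_lat, ref_lon, ref_px, metres_per_pixel):
    """Map a GPS fix to pixel (u, v) using a local equirectangular approximation."""
    # East offset shrinks with latitude; north offset does not.
    east = EARTH_RADIUS_M * math.radians(lon - ref_lon) * math.cos(math.radians(ref_lat))
    north = EARTH_RADIUS_M * math.radians(lat - ref_lat)
    u = ref_px[0] + east / metres_per_pixel
    v = ref_px[1] - north / metres_per_pixel  # image rows grow downward
    return u, v


if __name__ == "__main__":
    print(gps_to_pixel(37.4401, 24.9402, 37.44, 24.94, (960.0, 540.0), 0.1))
```

The approximation is adequate only over the small areas a single drone frame covers; a rotated or tilted camera would need an additional homography.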

2605.16419 2026-05-19 cs.CV cs.AI cs.RO 版本更新

Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments

面向无标定环境的自同步多视角关节角度监测智能体流水线

Juncheng Yu, Lusi A, Haoxuan Xie, Weiming Wang

发表机构 * National Engineering Research Center of Neuromodulation, School of Aerospace Engineering, Tsinghua University(神经调制国家工程研究中心,航空航天工程学院,清华大学)

AI总结 本文提出了一种基于代理的自同步多视角关节角度监控方法,利用两台摄像头在无标定环境下实现自动视频同步和自验证,通过多模态大语言模型和先进单目2D姿态估计模型提取候选姿态,并通过代理选择机制自动识别和跟踪目标个体,以在多人和遮挡情况下产生一致的2D姿态,从而估计关节角度。

Comments Accepted by EMBC 2026. 7 pages, 3 figures

详情
AI中文摘要

运动监控在长期康复中对脊髓损伤患者至关重要,其中多视角无标记运动捕捉方法已显示出显著潜力。然而,由于依赖校准和多视角同步的困难,其在患者自行部署环境中部署仍然具有挑战性。在本工作中,我们提出了一种基于代理的自同步多视角关节角度监控管道,利用两台摄像头在无标定环境中实现自动视频同步和代理驱动的自验证。最先进的单目2D姿态估计模型用于提取候选姿态,其中应用了基于代理的选择机制,以自动识别和跟踪目标个体,从而在多人和遮挡情况下产生一致的2D姿态。此类2D姿态被优化以从无标定的多视角姿态序列中估计关节角度,通过显式的几何建模确保可解释性。与Vicon系统的验证显示了该方法的强性能,达到MAE为5.97°±2.36°和Pearson相关系数为0.962±0.014。所提出的方法预计能提供一个实用的、患者可自行部署的系统,以在无标定的家庭环境中进行日常运动监控。

英文摘要

Kinematic monitoring plays a critical role in long-term rehabilitation for patients with spinal cord injury (SCI), where multi-view markerless motion capture methods have shown significant potential. However, because these methods rely on calibration and multi-view synchronization is difficult to achieve, deploying them in patient self-deployed environments remains challenging. In this work, we propose an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments using two cameras without hardware triggers. Multimodal large language models enable automatic video synchronization and agent-driven self-verification. State-of-the-art monocular 2D pose estimation models are employed to extract candidate poses, and an agent-based selection mechanism is then applied to automatically identify and track the target subject, thereby producing consistent 2D poses in the presence of multiple individuals and occlusions. These 2D poses are optimized to estimate joint angles from uncalibrated multi-view pose sequences, ensuring interpretability through explicit geometric modeling. Validation against a Vicon system demonstrated strong performance, achieving an MAE of $5.97^\circ \pm 2.36^\circ$ and a Pearson correlation coefficient of $0.962 \pm 0.014$. The proposed method is expected to provide a practical, patient self-deployable system for daily kinematic monitoring in uncalibrated home environments.
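
The validation above is reported as a mean absolute error and a Pearson correlation against a reference system. The sketch below shows how these two statistics could be computed for a pair of joint-angle sequences. The function names and sample values are illustrative assumptions and are not data from the paper.

```python
import math


def mean_absolute_error(estimated, reference):
    """Average absolute difference between two equal-length angle sequences."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)


def pearson_correlation(estimated, reference):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(estimated)
    mean_e = sum(estimated) / n
    mean_r = sum(reference) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimated, reference))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov / math.sqrt(var_e * var_r)


if __name__ == "__main__":
    est = [10.0, 20.0, 30.0]  # made-up angles in degrees
    ref = [12.0, 18.0, 33.0]
    print(mean_absolute_error(est, ref), pearson_correlation(est, ref))
```

The two numbers answer different questions: MAE measures how far the estimated angles are from the reference in degrees, while the correlation measures whether the two curves move together, so a constant offset hurts the first but not the second.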

2605.16418 2026-05-19 cs.CV cs.AI 版本更新

Neural Visual Decoding via Cognitive guided Adaptive Blurring and Information Constrained Alignment

通过认知引导的自适应模糊和信息受限对齐实现神经视觉解码

Fan Yin, Chuhang Zheng, Peiliang Gong, Donghai Guan, Qi Zhu

发表机构 * Department of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics(南京航空航天大学人工智能学院) Department of Electrical and Information Engineering, Tianjin University(天津大学电气与信息工程学院)

AI总结 本文提出CAIA框架,通过认知引导的自适应模糊和信息受限对齐,提升神经信号与视觉语义的映射精度,改进零样本脑-图像检索的Top-1和Top-5准确率。

详情
AI中文摘要

基于EEG的视觉解码旨在建立神经信号与视觉语义之间的映射。然而,它受到严重的信息粒度不匹配和EEG信号信噪比低的双重挑战。现有方法通常处理静态视觉特征,忽略了人类视觉的动态选择性和神经振荡的频率特异性。为此,我们提出了CAIA框架,通过认知引导的自适应模糊和信息受限对齐来弥合这一差距。在视觉侧,它模拟选择性注意以自适应地减少冗余。同时,在EEG侧,它利用神经振荡先验和信息瓶颈机制来增强信噪比。具体而言,我们设计了一种基于认知动态的自适应模糊机制,通过跨模态注意动态整合中心偏向和显著性引导的视觉线索。此外,我们引入了分布感知的边界校准损失,以稳健地纠正由异常样本引起的对齐偏差。此外,提出了一种认知引导的信息筛选方法,以选择任务相关的EEG振荡。大量实验表明,CAIA在零样本脑-图像检索中提高了受试者依赖和受试者无关的平均Top-1和Top-5准确率,显著优于现有方法。我们的工作验证了优化视觉信息密度以匹配神经粒度能提供更可解释和稳健的神经解码路径。

英文摘要

EEG-based visual decoding aims to establish a mapping between neural signals and visual semantics. However, it remains constrained by the dual challenges of severe information granularity mismatch and the low signal-to-noise ratio (SNR) of EEG signals. Existing approaches typically treat static visual features, ignoring the dynamic selectivity of human vision and the frequency specificity of neural oscillations. To bridge this gap, we propose CAIA, a Cognitive-guided Adaptive blurring with Information-Constrained Alignment framework for Neural-Visual decoding. On the visual side, it simulates selective attention to adaptively reduce redundancy. Meanwhile, on the EEG side, it leverages neural oscillation priors and the information bottleneck mechanism to enhance SNR. Specifically, we devise a cognitive-dynamics-based adaptive blurring mechanism that dynamically integrates center-biased and saliency-guided visual cues via cross-modal attention. Furthermore, we introduce a distribution-aware boundary calibration loss to robustly rectify alignment bias caused by outlier samples. Moreover, a cognitively-guided information-screening method is proposed to select task-relevant EEG oscillations. Extensive experiments demonstrate that CAIA improves both subject-dependent and subject-independent average Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval, significantly outperforming prior methods. Our work validates that optimizing visual information density to match neural granularity offers a more interpretable and robust pathway for neural decoding.

2605.16416 2026-05-19 cs.CV cs.AI 版本更新

CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

CAVE:一种用于碎片化视觉证据推理的结构化信用分配方法

Tengda Guo, Jie Leng, Hanlei Li, Yaoyuan Liang, Qingyue Zhang, Dian Yang, Mingyu Zhang, Yuhua Fu, Shao-Lun Huang

发表机构 * Tsinghua University(清华大学) Peking University(北京大学) Zhejiang University of Technology(浙江工业大学)

AI总结 CAVE通过结构化过程-奖励机制提升碎片化视觉推理能力,引入三个互补信号优化推理步骤,提升模型可靠性与鲁棒性。

Comments 24 pages, 6 figures. Preprint

详情
AI中文摘要

视觉-语言模型(VLMs)在通用多模态推理中表现优异,但在整合非局部视觉信息支持语义不明确的视觉推理方面面临挑战。本文提出CAVE,一种基于GRPO的结构化过程-奖励方法,通过信念更新、证据获取和自适应聚焦控制三个信号评估中间步骤贡献,引导模型优化推理动作并学习更可靠的视觉推理策略。同时构建TRACER-Bench,涵盖四个非局部且语义易混淆的推理维度,提供关键中间证据监督推理路径。实验表明,CAVE在需要整合碎片化视觉证据的任务中显著提升性能,涵盖公开基准和新引入的TRACER-Bench,同时在通用多模态评估中保持竞争力。进一步分析显示,CAVE有效提升视觉推理能力,在长距离和深层跨区域依赖下表现更稳健。

英文摘要

Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments demonstrate that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and our newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE effectively improves the visual reasoning capacity and exhibits stronger robustness under longer-range and deeper cross-region dependencies.

2605.16414 2026-05-19 cs.CV 版本更新

NERVE: A Neuromorphic Vision and Radar Ensemble for Multi-Sensor Fusion Research

NERVE:一种用于多传感器融合研究的神经形态视觉与雷达集成系统

Omar Mansour, Pietro Martinello, Ethan Milon, YingFu Xu, Manolis Sifalakis, Guangzhi Tang, Amirreza Yousefzadeh

发表机构 * University of Twente, Netherlands(荷兰特文特大学) University of Modena(摩德纳大学) University of Strasbourg, France(法国斯特拉斯堡大学) IMEC the Netherlands, Netherlands(荷兰IMEC) Innatera, Netherlands(荷兰Innatera) Maastricht University, Netherlands(荷兰马斯特里赫特大学)

AI总结 NERVE包含257分钟同步记录的多传感器数据,用于评估多模态融合,通过DVS与雷达结合提升人体检测和距离估计性能。

Comments To be published in IJCNN 2026, Maastricht

详情
AI中文摘要

我们介绍了NERVE(神经形态视觉与雷达集成),一个多传感器数据集,包含来自五个传感器的257分钟同步记录:两个动态视觉传感器(DVS)、一个RGB-D相机和两个雷达单元(24GHz和77GHz)。数据在办公室环境中历时12个测量日采集,包含约600GB未压缩的时间对齐数据、约914,000帧以及约960万条COCO格式的RGB标注,涵盖16个相关物体类别。为了评估多模态融合,我们构建了用于人体检测和距离估计的DVS+雷达子集。使用前馈和循环检测器的基线实验表明,将DVS与77GHz雷达结合能够一致地提升检测性能,循环模型的mAP最高达到47.5%,以激光雷达真值为参照的雷达平均绝对距离误差低于1.8米。

英文摘要

We present NERVE (Neuromorphic Vision and Radar Ensemble), a multi-sensor dataset comprising 257 minutes of synchronized recordings from five sensors: two Dynamic Vision Sensors (DVS), an RGB-D camera, and two Radar units (24GHz and 77GHz). Captured across 12 measurement days in office environments, NERVE contains around 600GB of uncompressed temporally aligned data with around 914,000 frames and around 9.6 million RGB COCO-formatted annotations covering 16 relevant object categories. To evaluate multi-modal fusion, we construct a DVS+Radar subset for human detection and distance estimation. Baseline experiments using feed-forward and recurrent detectors show that combining DVS with 77GHz Radar consistently improves detection, with recurrent models achieving up to 47.5% mAP and mean absolute Radar distance errors below 1.8m against LiDAR ground truth.

2605.16412 2026-05-19 cs.RO cs.CV 版本更新

SCAR: Self-Supervised Continuous Action Representation Learning

SCAR:自监督连续动作表示学习

Hongjia Liu, Fan Feng, Minghao Fu, Xinyue Wang, Haofei Lu, Biwei Huang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) KTH Royal Institute of Technology(皇家理工学院)

AI总结 本文提出SCAR框架,通过自监督学习统一动作表示,提升跨具身形态和跨任务的泛化能力。

详情
AI中文摘要

尽管动作在具身智能中起核心作用,但从视觉转换中学习可迁移的动作表示仍是一个基本挑战,尤其是当世界模型必须在数据有限的情况下跨具身形态泛化时。我们认为,动作不仅仅是辅助的条件信号,而是一个独立的表示因素,它将可控变化与特定具身形态的驱动方式解耦。我们提出SCAR,一个联合逆向-前向动力学框架,用于从视觉转换中学习跨具身形态的统一动作表示。SCAR基于预训练生成主干,使用逆向动力学模型(IDM)从潜在观测对中推断潜在动作,并使用前向动力学模型(FDM)以这些动作为条件预测未来动态。为了使潜在空间具有可迁移性而不是沦为通用视觉瓶颈,我们将潜在动作后验向标准高斯先验正则化以限制任意的视觉编码,并引入对抗不变性以抑制具身形态和环境特定的干扰因素。在Procgen和Robotwin数据集上的实验表明,作为世界建模的条件接口,学习到的统一潜在动作表示优于具身形态特定的原始动作,提升了跨具身形态的低数据适应和跨任务迁移性能。综合来看,这些结果表明动作可以作为跨具身形态的可控变化的共享表示来学习,为更具可迁移性和泛化性的世界模型提供接口。

英文摘要

Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples the controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
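
The abstract describes regularizing the latent action posterior toward a standard Gaussian prior. One common way to express such a penalty, assuming a diagonal Gaussian posterior and a KL-divergence regularizer (neither of which the abstract confirms), is the closed form sketched below. `kl_to_standard_normal` is a hypothetical name for this example.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian.

    Closed form: 0.5 * sum(mean^2 + var - 1 - log_var) over dimensions.
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var))


if __name__ == "__main__":
    # A posterior equal to the prior incurs no penalty.
    print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
    # Shifting the mean away from zero is penalised quadratically.
    print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

A penalty of this kind limits how much information the latent action can carry, which matches the stated goal of keeping it from becoming a generic visual bottleneck.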

2605.16411 2026-05-19 cs.CV cs.AI cs.CL cs.DB cs.LG 版本更新

Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

通过分布偏移下的分阶段偏好优化减少视觉-语言模型中的幻觉

Qinwu Xu

发表机构 * Meta AI

AI总结 本文提出分阶段偏好优化框架,通过构建针对幻觉问题的数据集,提升视觉-语言模型的 grounded reasoning,减少幻觉并提高响应信息量。

详情
AI中文摘要

幻觉仍然是视觉-语言模型(VLMs)中的基本挑战:在联合概率建模下,自回归生成以最大化似然为目标,可能产生语言上合理但物理上不一致或缺乏视觉依据的响应。我们提出了一种分阶段偏好优化框架,通过有针对性的多模态数据构建来减少幻觉。我们的方法不是直接在通用指令遵循数据上优化,而是在已知失败边界附近逐步构建以幻觉为中心的偏好对。该框架重点关注模糊的空间方向、物体关系、OCR不确定性以及对抗性错误前提训练。幻觉负样本由扰动极小但视觉上不一致的替代回答生成,使直接偏好优化(DPO)能够更好地区分有视觉依据的推理与看似合理的幻觉。在开源基准和真实多模态评估场景中的实验表明,该方法提高了视觉依据的一致性,减少了幻觉,并产生了信息量更大的有依据响应。跨模型定性评估进一步显示,在模糊空间推理和对抗性错误前提等设置中,所提出的多模态LLM DPO框架比若干前沿专有VLM产生更有视觉依据的响应。结果表明,幻觉可能不仅源于模型容量有限,还源于自回归概率生成在视觉依据较弱时倾向于选择语言上合理的延续这一固有倾向。未来工作可以探索物理一致性建模、不确定性感知的多模态推理,以及超越标准自回归解码的架构替代方案。

英文摘要

Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, such as in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding.
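
The abstract relies on Direct Preference Optimization to separate grounded responses from hallucinated ones. As a reminder of what the standard DPO objective computes for a single preference pair, here is a minimal sketch. The argument names and the default `beta` are assumptions for this example, and the paper's stage-wise training schedule is not represented.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference model does.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy has shifted probability toward the grounded (chosen) answer.
    print(dpo_loss(-3.0, -9.0, -5.0, -6.0))
```

When the rejected answer differs from the chosen one only by a small, visually inconsistent perturbation, the loss pushes probability mass across exactly that narrow boundary.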

2605.16410 2026-05-19 cs.CV 版本更新

Test-Time Hinting for Black-Box Vision-Language Models

测试时提示法用于黑盒视觉-语言模型

Kaihua Hou, Abhijith Varma Mudunuri, Jiaxing Qiu, Roxana Daneshjou, Thomas Hartvigsen, Ahmed Alaa

发表机构 * AlaaLab(Alaa实验室)

AI总结 本文提出测试时提示法,通过单次VLM调用提升性能,无需开放权重模型访问,适用于前沿闭源模型。方法通过轻量提示生成器预测提示,引导VLM避开常见错误模式,提升自然图像VQA基准的准确率。

详情
AI中文摘要

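测试时缩放(TTS)方法在LLM中表现出色,但在视觉-语言模型(VLMs)中的应用仍较有限。现有VLM TTS方法大多需要开放权重模型访问或昂贵的重复采样,并主要在多模态数学和科学推理基准上评估,而非通用视觉理解任务。本文提出测试时提示法,通过单次VLM调用和仅需黑盒API访问即可提升VLM性能,广泛适用于前沿闭源模型。我们的方法受观察启发,即VLM错误倾向于聚集在重复失败模式周围。因此,我们训练了一个轻量级提示生成器模型,针对给定的测试输入预测应在提示词前添加哪种“提示”,以提供有针对性的上下文或流程指导,引导VLM避开其典型失败模式。我们证明,测试时提示法提高了多个闭源VLM在自然图像VQA基准上的准确率,并且这些提升无需重新训练提示生成器即可泛化到未见过的基准和VLM。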

英文摘要

Test-time scaling (TTS) methods have proven highly effective for LLMs, yet their application to vision-language models (VLMs) remains relatively underexplored. Existing VLM TTS methods largely require open-weight model access or expensive repeated sampling, and are evaluated primarily on multimodal mathematical and scientific reasoning benchmarks rather than general visual understanding tasks. In this paper, we propose Test-Time Hinting, a method that improves VLM performance via a single VLM call and requiring only black-box API access, which makes it broadly applicable to frontier closed-weight models. Our method is motivated by the observation that VLM errors tend to cluster around recurring failure patterns. We therefore train a lightweight hint generator model to predict, for a given test input, which "hint" should be prepended to the prompt, providing targeted contextual or procedural guidance that steers the VLM away from its characteristic failure modes. We show that Test-Time Hinting improves the accuracy of multiple closed-weight VLMs on natural-image VQA benchmarks and that these gains generalize to unseen benchmarks and VLMs without retraining the hint generator.

2605.16408 2026-05-19 cs.CV 版本更新

Visual Search Patterns in 3D Pancreatic Imaging: An Eye Tracking Study

三维胰腺成像中的视觉搜索模式:一项眼动研究

Anna Anikina, Leila Khaertdinova, Trine Balschmidt, Michael B Andersen, Christoph F Müller, Erik GS Brandt, Henrik S Thomsen, Claudia Mello-Thoms, Bulat Ibragimov

发表机构 * Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系) Department of Radiology, Herlev Hospital(赫尔勒夫医院放射科) Department of Radiology, University of Iowa(爱荷华大学放射科)

AI总结 本研究通过眼动追踪分析三维胰腺CT影像中放射科医生的视觉搜索行为,揭示其在空间和时间上的注视模式,为理解诊断策略提供新的视角。

Comments Accepted at SPIE - Medical Imaging Conference 2026

Journal ref Proc. SPIE 13928, Medical Imaging 2026: Image Perception, Observer Performance, and Technology Assessment, 1392814 (2026)

详情
AI中文摘要

眼动追踪已成为研究视觉感知和搜索策略的强大工具,尤其在医学领域。尽管在2D环境中应用较为简便,但在3D医学影像中的应用仍面临挑战。此研究聚焦于放射学领域,其中体积成像如CT扫描被医生常规解读。放射科医生通常通过数百张2D切片进行解读,通常以轴向投影查看。对导航通过CT体积期间的眼动数据进行分类有助于理解放射科医生如何应对诊断任务。作为分类方法的一个示例,我们让两名放射科医生搜索胰腺腹部CT图像,并收集眼动数据,将眼动轨迹与切片导航对齐,以可视化胰腺通过体积的表示,并分析临床医生在空间和时间上的注视行为。

英文摘要

Eye tracking has emerged as a powerful tool for examining visual perception and search strategies in various domains, including medicine. While it is relatively straightforward to apply in 2D settings, its use in 3D medical imaging remains challenging and not yet well explored. This gap is particularly relevant for radiology, where volumetric images such as computed tomography (CT) scans are routinely read by medical experts. Radiologists typically interpret these images by navigating through hundreds of 2D slices, most often viewed in the axial projection. A taxonomy of eye movement data during navigation through a CT volume could be valuable to understand how radiologists approach diagnostic tasks. As an example of the derived taxonomy, we asked two radiologists to search abdominal CTs of the pancreas. We collect eye tracking data and align eye gaze movements with slice navigation to visualize the representation of the pancreas through volume and analyze clinicians' gaze behavior in both space and time.

2605.16406 2026-05-19 cs.CV 版本更新

Contrastive-SDXL: Annotation-Preserving Night-Time Augmentation for Pedestrian Detection

对比-SDXL:保留标注的夜间增强用于行人检测

Franky George, Muhammad Khalid, Adil Khan

AI总结 本文提出Contrastive-SDXL框架,利用SDXL-Turbo和LoRA微调,通过语义对比损失和对象一致性损失生成逼真夜间图像,提升行人检测性能。

详情
AI中文摘要

夜间行人检测因标注数据有限和光照差异大而具有挑战性。潜在扩散模型(LDMs)提供强大的图像到图像翻译和跨域增强基础,但其在安全关键感知中的有效性依赖于翻译过程中是否保留检测相关对象和局部语义结构。本文提出Contrastive-SDXL,基于SDXL-Turbo和LoRA微调的夜间行人检测增强框架。为保持白天输入与翻译夜间图像之间的语义对应关系,引入基于预训练DINOv2编码器的补丁级语义对比损失,而非生成器编码器特征。多级DINOv2自注意力图强制局部和全局语义一致性,而对象一致性损失显式鼓励行人保留。Contrastive-SDXL生成逼真的夜间图像,实现Fréchet Inception距离(FID)为22.5。使用合成图像训练的检测器在误检率上比仅使用白天数据的基线减少6-7%,接近使用真实夜间数据训练的性能。这些结果表明,以一致性驱动的扩散增强可以有效支持安全关键的夜间行人检测。

英文摘要

Night-time pedestrian detection remains challenging because labelled night-time data are limited and large illumination differences make daytime-only trained detectors unreliable. Latent diffusion models (LDMs) provide a powerful basis for image-to-image translation and cross-domain augmentation, but their effectiveness in safety-critical perception depends on whether detector-relevant objects and local semantic structure are preserved when translating between source and target domains. In this work, we present Contrastive-SDXL, a day-to-night augmentation framework for night-time pedestrian detection built on SDXL-Turbo and fine-tuned using Low-Rank Adaptation (LoRA). To preserve semantic correspondence between daytime inputs and translated night-time images, we introduce a patch-wise semantic contrastive loss guided by a pretrained DINOv2 encoder rather than generator encoder features. Multi-level DINOv2 self-attention maps enforce both local and global semantic consistency, while an object consistency loss explicitly encourages pedestrian preservation. Contrastive-SDXL produces realistic night-time images, achieving a Fréchet Inception Distance (FID) of 22.5. Detectors trained with our synthetic images obtain a 6-7% reduction in miss rate compared with a daytime-only baseline, approaching the performance of detectors trained on real night-time data. These results demonstrate that consistency-driven diffusion augmentation can effectively support safety-critical night-time pedestrian detection.

2605.16405 2026-05-19 cs.CV 版本更新

Concepts Worth Having: Refining VLM-Guided Concept Bottleneck Models with Minimal Annotations

值得拥有的概念:通过最小标注精炼VLM引导的概念瓶颈模型

Nicola Debole, Andrea Passerini, Stefano Teso, Andrea Pugnana, Emanuele Marconato

发表机构 * DISI, University of Trento, Italy(特伦托大学DISI研究所,意大利)

AI总结 本文提出VH-CBM模型,结合VLM和少量标注提升概念预测准确性与可解释性,验证其在概念校准和主动学习中的优势。

详情
AI中文摘要

概念瓶颈模型(CBMs)是通过从输入中提取的高层概念进行预测的神经分类器。CBMs通过从概念级标注中学习来确保利益相关者能够理解这些概念及其预测,但这些标注通常很少。最近的CBM架构通过从视觉-语言模型(VLMs)中获取标注来解决这个问题。尽管大大扩展了适用性,但这样做可能导致较低质量的概念和因此较不可解释的模型。我们采取中间路线,引入了视觉加人类引导的CBM(VH-CBM),一种结合VLM和少量密集标注的混合方法。VH-CBM在VLM的嵌入空间中使用高斯过程,捕捉目标领域有用的全局信息,将专家的监督传播到任何目标数据点。我们的实证评估显示,即使标注数据量仅为1%,VH-CBM也能比VLM引导的CBMs预测更准确的概念,同时具有更好的概念校准并支持主动学习。

英文摘要

Concept-bottleneck models (CBMs) are neural classifiers that compute predictions from high-level concepts extracted from the input. CBMs ensure stakeholders can understand the concepts, and the predictions they entail, by learning these from concept-level annotations, which are, however, seldom available. Recent CBM architectures work around this issue by obtaining annotations from Vision-Language Models (VLMs). While this greatly broadens applicability, it can yield lower-quality concepts and therefore less interpretable models. We strike a middle ground by introducing Vision-plus-Human-guided CBM (VH-CBM), a hybrid approach that exploits both VLMs and a small amount of dense annotations. VH-CBM employs a Gaussian Process in the VLM's embedding space, which captures useful global information about the target domain, to propagate the expert's supervision to any target data point. Our empirical evaluation shows that VH-CBM predicts more accurate concepts than VLM-guided CBMs even when annotating as little as 1% of the data, while offering better concept calibration and supporting active learning.

2605.16404 2026-05-19 cs.CV 版本更新

Hybrid Quantum-MambaVision: A Quantum-enhanced State Space Model for Calibrated Mixed-type Wafer Defect Detection

Hybrid Quantum-MambaVision:一种用于校准混合型晶圆缺陷检测的量子增强状态空间模型

Satwik Sai Prakash Sahoo, Jyoti Prakash Sahoo, Ting Wang, Subrota Kumar Mondal

发表机构 * Odisha University of Technology and Research(奥里萨理工大学) National Taiwan University of Science and Technology(台湾科技大学) East China Normal University(华东师范大学) Macau University of Science and Technology(澳门科技大学)

AI总结 本文提出Hybrid Quantum-MambaVision,通过结合状态空间模型与量子适配器,解决工业视觉数据中极端类别不平衡和计算复杂性问题,实现高效多标签分类。

详情
AI中文摘要

从工业视觉数据中提取可操作知识,根本上受制于极端类别不平衡和现代基础模型过高的计算复杂度。在半导体制造中,识别多标签晶圆缺陷是一项复杂的空间数据挖掘任务,其中重叠模式掩盖了关键的根因信号。尽管Vision Transformers(ViTs)在全局依赖提取方面表现出色,但其二次复杂度使其在高吞吐量、实时异常检测中效率低下。为克服这些计算障碍,本文引入了Hybrid Quantum-MambaVision,一种面向空间知识发现的高效架构。我们将线性复杂度的状态空间模型(SSM)主干与参数化量子上下文适配器(QCA)和低秩适应(LoRA)相结合。Mamba主干高效捕捉长程空间依赖性,而量子适配器将压缩的潜在特征映射到高维希尔伯特空间,以解耦复杂的重叠特征。在高度不平衡的MixedWM38数据集上,Hybrid Quantum-MambaVision实现了出色的多标签分类性能,与经典基线相比显著降低了复杂多缺陷拓扑上的错误率。量子正则化器起到不确定性校准器的作用,大幅降低最大校准误差(MCE)并最小化预期假阳性成本。本文建立了一种可扩展的量子-经典混合范式,用于工业数据挖掘中的高效表示学习。

英文摘要

Extracting actionable knowledge from industrial visual data is fundamentally bottlenecked by extreme class imbalance and the prohibitive computational complexity of modern foundation models. In semiconductor manufacturing, identifying multi-label wafer defects is a complex spatial data mining task where overlapping patterns obscure critical root-cause signals. While Vision Transformers (ViTs) excel at global dependency extraction, their quadratic scaling renders them inefficient for high-throughput, real-time anomaly detection. To overcome these computational barriers, this paper introduces Hybrid Quantum-MambaVision, a highly efficient architecture tailored for spatial knowledge discovery. We integrate a linear-complexity State-Space Model (SSM) backbone with a Parameterized Quantum Context Adapter (QCA) and Low-Rank Adaptation (LoRA). The Mamba backbone efficiently captures long-range spatial dependencies, while the quantum adapter maps compressed latent features into a high-dimensional Hilbert space to disentangle complex, overlapping signatures. On the highly imbalanced MixedWM38 dataset, Hybrid Quantum-MambaVision achieves exceptional multi-label classification performance, significantly reducing the error rate on complex multi-defect topologies compared to classical baselines. The quantum regularizer acts as a profound uncertainty calibrator, substantially reducing Maximum Calibration Error (MCE) and minimizing expected false-positive costs. This work establishes a scalable Quantum-Classical hybrid paradigm for efficient representation learning in industrial data mining.
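
The abstract reports calibration through the Maximum Calibration Error (MCE). A minimal sketch of the usual binned definition is given below: predictions are grouped by confidence, and MCE is the largest gap between average confidence and accuracy in any non-empty bin. The function name, the equal-width binning, and the default of 10 bins are assumptions for this example, and the paper's multi-label evaluation protocol may differ.

```python
def max_calibration_error(confidences, correct, n_bins=10):
    """Largest |accuracy - mean confidence| over non-empty equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    gaps = []
    for members in bins:
        if members:
            mean_conf = sum(conf for conf, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            gaps.append(abs(accuracy - mean_conf))
    return max(gaps) if gaps else 0.0


if __name__ == "__main__":
    print(max_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, False, False]))
```

Because MCE tracks the worst bin rather than the average, it is sensitive to a small group of overconfident predictions, which is the situation that drives false-positive costs in inspection settings.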

2605.16403 2026-05-19 cs.CV cs.SD 版本更新

When Vision Speaks for Sound

当视觉为声音说话

Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu, Wendi Li, Yanan Xie, Muhao Chen, Peng Qi

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 本文发现视频中MLLMs的音频理解依赖视觉线索而非实际音频流,提出Thud框架通过三种音频编辑干预研究此问题,并提出两阶段对齐方法提升模型性能。

Comments 24 pages, 10 figures

详情
AI中文摘要

尽管具备视频能力的MLLMs进展迅速,我们发现其在视频中表现出的音频理解往往由视觉驱动:模型依赖视觉线索推断或虚构声学信息,而非验证音频流。此问题既出现在最先进的开源全模态模型中,也出现在Google、OpenAI等提供商的领先闭源模型中。我们将此失败模式称为音频-视觉的Clever Hans效应:模型看似以音频为依据(实则不然),实际只是利用视觉与声学之间的相关性,而未验证音频流与视觉流是否真正对齐。为系统研究此行为,我们引入Thud,一个基于三种反事实音频编辑的干预驱动探测框架:Shift测试时间同步,Mute测试声音是否存在,Swap测试音频-视觉一致性。除诊断外,我们进一步研究一种两阶段对齐方案:由干预得到的偏好对教会模型验证音频,而事件级通用视频偏好对模型进行正则化,防止过度专门化。我们最佳的1万样本方案使三个干预维度上的平均性能提高28个百分点,同时略微提升通用视频和音频-视觉问答基准上的性能。

英文摘要

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.

2605.16402 2026-05-19 cs.CV 版本更新

WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

WinDeskGround:复杂多窗口桌面环境中的鲁棒GUI定位基准

Haoren Zhao, Tianyi Chen, Zhen Wang

发表机构 * School of Cyberspace, Hangzhou Dianzi University, Hangzhou, China(杭州电子科技大学网络空间安全学院) Microsoft(微软公司)

AI总结 本文提出WinDeskGround基准,通过参数生成复杂桌面场景,评估GUI定位鲁棒性。实验显示顶级模型在简单环境表现佳,但部分遮挡下准确率下降。

详情
AI中文摘要

多模态大语言模型(MLLMs)已革新GUI自动化,但其效果主要建立在理想化单层界面之上。本文指出,现有先进代理在真实桌面环境中面临多窗口堆叠、遮挡和视觉杂乱等鲁棒性挑战。为此,我们引入WinDeskGround,一种新型基准和合成框架,通过控制窗口遮挡、布局密度和语义相似性参数生成复杂桌面场景。我们构建了包含1356对高保真指令-目标对的多样化元数据集,并对五种领先MLLMs进行了全面评估。结果表明,顶级代理在简化设置中表现优异,但在部分遮挡下准确性下降。WinDeskGround为评估和提升现实环境中GUI代理的鲁棒性提供了有价值的基准。代码可在https://github.com/ZZZhr-1/WinDeskGround获取。

英文摘要

Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results demonstrate that while top-tier agents excel in simplified settings, their accuracy declines under partial occlusion. WinDeskGround provides a valuable benchmark to facilitate the assessment and advancement of GUI agent robustness in realistic environments. The code is available at https://github.com/ZZZhr-1/WinDeskGround.

2605.16401 2026-05-19 cs.CV cs.LG 版本更新

CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification

CADS:用于成本高效图像分类的共形自适应决策系统

Turkoglu Mikael, Bary Tim, Thielens Vincent, Dausort Manon, Macq Benoît

发表机构 * ICTEAM, UCLouvain, Belgium(ICTEAM,比利时鲁汶大学) SAFiR Lab, Univ. of Sherbrooke, Canada(SAFiR实验室,加拿大Sherbrooke大学) Univ. of Mons, Belgium(蒙斯大学,比利时)

AI总结 CADS通过动态路由样本优化资源分配,提升图像分类的效率与准确性,降低计算成本达12倍。

Comments 6 pages, 2 figures, 1 table, Accepted at ICIP 2026

详情
AI中文摘要

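尽管高容量AI模型在性能上取得突破,但其实际部署常受限于高推理成本、环境影响以及忽视样本复杂度差异的“一刀切”做法。例如在临床场景中,在常规病例上浪费计算资源是可持续AI的一大障碍。本文提出共形自适应决策系统(CADS),一种顺序多模型算法,根据估计的数据复杂度选择模型,从而优化资源分配。CADS利用共形预测在运行时量化图像的不确定性,并在从轻量级“Scout”模型到高容量“Oracle”架构的模型级联中动态路由样本,为权衡成本与准确率提供了有数学依据的框架。在两个数据集上的验证表明,CADS兼具更高的效率与准确率,其计算成本最多可比重型模型推理低12倍。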

英文摘要

While high-capacity AI models have advanced state-of-the-art performance, their practical deployment is often hindered by high inference costs, environmental impact, and a "one-size-fits-all" approach that ignores varying sample complexity. In clinical settings, for instance, the waste of computational resources on routine cases is a significant barrier to sustainable AI. In this paper, we introduce the Conformal Adaptive Decision System (CADS), a sequential multi-model algorithm designed to optimize resource allocation by efficiently sampling models based on the estimated data complexity. CADS leverages conformal prediction to quantify image uncertainty at runtime. It provides a mathematically grounded framework for the cost-accuracy trade-off, dynamically routing samples through a model cascade that ranges from lightweight "Scout" models to high-capacity "Oracle" architectures. Validated on two datasets, CADS demonstrated superior efficiency and accuracy at a computational cost that can be up to 12 times lower than heavy-model inference. By accurately routing samples based on real-time complexity, CADS ensures high diagnostic reliability while drastically reducing the economic and environmental footprint of AI.
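
The routing idea in this abstract can be illustrated with a minimal sketch, assuming split conformal prediction with the nonconformity score 1 - p(true class). The abstract does not state CADS's actual scores or stopping rule, so the function names (`conformal_threshold`, `prediction_set`, `cascade_predict`) and the "stop at the first singleton prediction set" rule are hypothetical choices made for illustration only.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores (1 - p_true)."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank with finite-sample correction
    if k > n:
        return float("inf")  # too few calibration points: keep every class
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Classes whose nonconformity score 1 - p stays within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

def cascade_predict(x, models, qhats):
    """Try models from cheapest to most expensive; stop at the first singleton set.

    `models` is a non-empty list of callables returning class probabilities.
    The singleton-set stopping rule is an assumption, not the paper's rule.
    """
    for stage, (model, qhat) in enumerate(zip(models, qhats)):
        probs = model(x)
        pset = prediction_set(probs, qhat)
        if len(pset) == 1:
            return pset[0], stage
    # No stage was confident: fall back to the last model's argmax.
    return max(range(len(probs)), key=probs.__getitem__), len(models) - 1
```

In this sketch an easy sample exits at the cheap first stage, while an ambiguous one (a prediction set with several classes) is escalated to the next, more expensive model.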

2605.16399 2026-05-19 cs.CV cs.LG 版本更新

Stable and Near-Reversible Diffusion ODE Solvers for Image Editing

稳定且近可逆的图像编辑扩散ODE求解器

Barbora Barancikova, Daniil Shmelev, Cristopher Salvi

发表机构 * Department of Computing, Imperial College London, London, United Kingdom(伦敦帝国理工学院计算机系,伦敦,英国) Department of Mathematics, Imperial College London, London, United Kingdom(伦敦帝国理工学院数学系,伦敦,英国)

AI总结 本文提出近可逆Runge-Kutta方法以提升图像编辑的稳定性与精度,平衡可逆性与数值稳定性,保留背景保真优势。

详情
AI中文摘要

扩散模型的反演在图像编辑中起核心作用。代数可逆的ODE求解器消除了基于DDIM的编辑流程中固有的反演误差,为文本引导图像编辑中的扩散反演提供了有吸引力的方法。然而,实证结果表明仅有可逆性并不足够。随着编辑所需的语义或视觉变化增大,可逆扩散求解器常表现出不稳定性,输出质量急剧下降。本文表明,精确可逆性与数值稳定性之间的权衡,在图像编辑中实际表现为背景保持与提示对齐之间的权衡。随后,我们研究了近可逆Runge-Kutta方法,将其作为精确可逆扩散格式的更稳定替代方案。与向量场平滑策略结合后,所得方法提高了编辑保真度,在大幅编辑下仍保持稳定,并在很大程度上保留了可逆求解器在背景保持方面的优势。

英文摘要

The inversion of diffusion models plays a central role in image editing. Algebraically reversible ODE solvers provide an appealing approach to diffusion inversion for text-guided image editing, by eliminating the inversion error inherent in DDIM-based editing pipelines. However, empirical results indicate that reversibility alone is insufficient. As edits require larger semantic or visual changes, reversible diffusion solvers often exhibit instabilities and suffer sharp drops in output quality. In this paper, we show that the trade-off between exact reversibility and numerical stability manifests empirically as a trade-off between background preservation and prompt alignment in image editing. We then investigate the use of near-reversible Runge-Kutta methods as a more stable alternative to exactly reversible diffusion schemes. When combined with a vector-field smoothing strategy, the resulting approach improves edit fidelity, remains stable under large edits, and largely retains the background-preservation benefits of reversible solvers.

2605.16397 2026-05-19 cs.CV cs.AI 版本更新

Trajectory-Aware Adaptive Inference in Object Detection Models

目标检测模型中的轨迹感知自适应推理

Grigorios Papanikolaou, Ioannis Kontopoulos, Giannis Spiliopoulos, Dimitris Zissis, Konstantinos Tserpes

发表机构 * Department of Electrical and Computer Engineering, National Technical University of Athens, Greece(雅典国立技术大学电气与计算机工程系,希腊) Department of Product and Systems Design Engineering, University of the Aegean, Syros, Greece(爱琴海大学产品与系统设计工程系,锡罗斯,希腊)

AI总结 本文提出利用GPS轨迹数据优化目标检测模型的推理过程,通过引入早退机制减少计算成本,提升实时感知效率。

Comments Accepted to the MuseKDE workshop of the IEEE MDM 2026 conference

详情
AI中文摘要

随着自主海上导航中传感器的日益集成,大规模多模态数据集随之出现,给高效实时感知带来挑战。在此类系统中,目标检测与附近船只的轨迹感知紧密耦合,在动态环境中尤其如此。然而,目标检测模型在推理过程中的效率常被忽视。为此,我们基于现有目标检测框架,将GPS轨迹数据纳入推理过程,实现输入自适应计算。具体来说,我们在基于YOLOv8的检测器中引入早退机制,并结合船舶间距离等运动线索。对于间距较短且高速接近的船舶,其帧使用完整模型处理;其余帧仅激活网络架构的一个子集。帧或每秒帧集的难度(即场景复杂度)通过物体间距离及该距离的缩短速率来评估。实验结果表明,与完整模型推理相比,该策略在保持令人满意的检测性能的同时显著减少了推理时间和计算成本,从而在准确性和效率之间实现了灵活的权衡。

英文摘要

The increasing integration of sensors in autonomous maritime navigation has led to large-scale multimodal datasets, raising challenges in achieving efficient real-time perception. In such systems, object detection and trajectory perception of nearby vessels are tightly coupled, particularly in dynamic environments such as maritime navigation. However, the efficiency of object detection models during inference remains an often-overlooked aspect. To this end, we build upon an existing object detection framework by incorporating GPS trajectory data into the inference process to enable input-adaptive computation. Specifically, we introduce an early-exit mechanism in a YOLOv8-based detector that incorporates motion cues such as inter-vessel distances. Frames of vessels that are separated by short distances, converging with high speed, are processed using the full model, while only a subset of the network's architecture is activated otherwise. The difficulty degree (or scene complexity) of a frame or set of frames per second is evaluated by leveraging inter-object distance and the rate at which the distance between them decreases. Experimental results demonstrate that this strategy maintains satisfactory detection performance while significantly reducing inference time and computational cost, thus enabling a flexible trade-off between accuracy and efficiency compared to full-model inference.
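
The gating rule described here (full model for close, fast-converging vessels, early exit otherwise) might look like the sketch below. It assumes vessel positions have already been projected from GPS to planar coordinates in metres; the function names and the thresholds `max_dist` and `min_speed` are hypothetical and not taken from the paper.

```python
import math

def closing_speed(prev_a, prev_b, cur_a, cur_b, dt):
    """Rate (m/s) at which the gap between two vessels shrinks; positive means converging."""
    d_prev = math.dist(prev_a, prev_b)
    d_cur = math.dist(cur_a, cur_b)
    return (d_prev - d_cur) / dt

def use_full_model(cur_a, cur_b, speed, max_dist=500.0, min_speed=5.0):
    """Run the full detector only for close, fast-converging vessel pairs.

    Any other frame would take the early exit (a subset of the network).
    """
    return math.dist(cur_a, cur_b) <= max_dist and speed >= min_speed
```

Both cues named in the abstract appear explicitly: the current inter-vessel distance and the rate at which that distance decreases.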

2605.16396 2026-05-19 cs.CV cs.LG 版本更新

Beyond MMSE: Enhancing PnP Restoration with ProxiMAP

超越MMSE:通过ProxiMAP增强PnP修复

Kenta Vert, Giacomo Meanti, Scott Pesme, Michael Arbel, Julien Mairal

发表机构 * Univ. Grenoble Alpes(格勒诺布尔阿尔卑斯大学) Inria(法国国家信息与自动化研究所) CNRS(法国国家科学研究中心) Grenoble INP(格勒诺布尔理工大学) LJK(让·昆茨曼实验室) MaLGa Centre(MaLGa中心) DIBRIS(DIBRIS研究所) Università di Genova(热那亚大学) MMS(MMS机构) Italian Institute of Technology(意大利理工学院)

AI总结 本文提出ProxiMAP,通过调整噪声调度使去噪器保持分布内,实现更稳定的图像重建,适用于去模糊、补全、超分辨率和相位恢复等任务。

详情
AI中文摘要

Plug-and-Play (PnP)方法用MMSE去噪器替代难以求解的最大后验(MAP)去噪器,已成为求解成像逆问题的标准工具。尽管这种不匹配常被视为不可避免,近期研究试图借助扩散模型分数直接逼近MAP来缩小这一差距。本文指出这在实践中存在问题:学习到的分数与真实分数并不一致,因此以MAP为目标的迭代会收敛到卡通化图像而非真实感图像,而在收敛前停止反而能获得更好的结果。我们将这一观察转化为设计原则,引入ProxiMAP,一种迭代的MAP近似方法,其噪声调度使迭代点的残差噪声与去噪器的训练噪声保持匹配。这使去噪器保持在其分数可靠的分布内,并带来隐式的提前停止,从而避免上述失败模式。ProxiMAP可作为标准PnP算法中MMSE去噪器的模块化即插即用替代,在去模糊、图像补全、超分辨率和相位恢复任务中一致地得到更清晰的重建结果。基于同一原则,本文还提出一种混合变体,仅在去噪器最可靠的PnP后期迭代中应用ProxiMAP,其效果与全替换变体相当或更优,而成本仅为其一小部分。

英文摘要

Plug-and-Play (PnP) methods have become standard tools for solving imaging inverse problems by replacing the intractable maximum a posteriori (MAP) denoiser with the MMSE one. While this mismatch has been widely treated as unavoidable, recent works have sought to close this gap by targeting the MAP with diffusion-model scores. We show this is problematic in practice: learned scores do not match the true ones, so MAP-targeting iterations converge to cartoon-like images rather than realistic ones, and better results are obtained by stopping short of convergence. We turn this observation into a design principle and introduce ProxiMAP, an iterative MAP approximation whose noise schedule keeps the iterate's residual noise matched to the denoiser's training noise. This keeps the denoiser in-distribution where its score is reliable, and yields implicit early stopping that avoids the failure mode above. ProxiMAP is a modular drop-in replacement for MMSE denoisers in standard PnP algorithms and consistently sharpens reconstructions across deblurring, inpainting, super-resolution, and phase retrieval. Building on the same principle, we propose a hybrid variant that applies ProxiMAP only in the late iterations of PnP, where the denoiser is most reliable -- matching or exceeding the full-replacement variant at a fraction of the cost.

2605.16393 2026-05-19 cs.CV cs.AI 版本更新

Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

基于 Vision Transformer 的 UNet 用于领域自适应语义分割

Joel Valdivia Ortega, Tingying Peng, Marion Jasnin

发表机构 * Helmholtz Pioneer Campus, Helmholtz Munich, Neuherberg, Germany(亥姆霍兹先锋园区,亥姆霍兹慕尼黑研究中心,诺伊尔贝格,德国) School of Computation, Information and Technology, TUM, Garching, Germany(慕尼黑工业大学计算、信息与技术学院,加兴,德国) Department of Chemistry, TUM, Garching, Germany(慕尼黑工业大学化学系,加兴,德国)

AI总结 本文提出 ViTC-UNet,通过可学习令牌和双向注意力解码器将预训练 ViT 表示条件化于 UNet,以提升生物医学语义分割的精度与适应性。

详情
AI中文摘要

语义分割对于生物医学研究中的解剖特征分析至关重要,但 Vision Transformers(ViTs)在该领域仍存在性能差距,在稀疏、精细结构和低信噪比目标上尤为明显。我们认为部分原因在于可提示 ViT 模型中常用的轻量级像素解码器可能缺乏高精度生物医学掩码所需的局部归纳偏置。为弥合这一差距,我们提出 ViTC-UNet,通过可学习令牌和双向注意力解码器,以冻结的预训练 ViT 表示对 UNet 进行条件化。该方法将 ViT 的全局视觉先验与 UNet 的局部归纳偏置和高分辨率解码能力相结合,即使在跨领域设置中也无需端到端微调 ViT。ViTC-UNet 在 MRI 和 CT 模态的语义分割任务中均优于基线结果,表明结构条件化的 UNet 解码能够高效地将大规模视觉先验适配到高复杂度的生物医学分割任务。

英文摘要

Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.

2605.16392 2026-05-19 q-bio.QM cs.CV cs.LG 版本更新

Bridging the Modality Bottleneck in Pathology MIL through Virtual Molecular Staining

弥合病理MIL中的模态瓶颈:通过虚拟分子染色

Yucheng Xing, Pei Liu, Jingying Ma, Ruping Hong, Jiangdong Qiu, Tianyu Liu, Kai He, Ling Huang, Mengling Feng

发表机构 * National University of Singapore(新加坡国立大学) Hunan University(湖南大学) Peking Union Medical College Hospital (PUMCH)(北京协和医院) Imperial College London(伦敦帝国理工学院)

AI总结 本文提出MIST方法,通过虚拟分子染色提升病理MIL中投影层性能,改进240/256配置,平均提升3.5%,在生存预测、组织分型和生物标志物预测中分别提升5.2%、3.3%和2.6%。

详情
AI中文摘要

多重实例学习(MIL)是计算病理学中全切片图像分析的主流框架,通常由冻结的补丁编码器、投影层和切片级聚合器组成。尽管编码器和聚合器已得到广泛研究,投影层在很大程度上仍是一个仅依赖形态学的瓶颈。这限制了生物标志物状态和生存等终点的预测,因为这些终点取决于H&E形态无法完全捕捉的分子状态。我们提出分子信息染色变换(MIST),一种可即插即用地替换MIL投影层的模块,仅在训练期间使用配对的空间转录组学数据来构建虚拟分子染色。MIST将基因表达谱聚类为跨模态原型,将其锚定在冻结的基础模型特征空间中,并利用它们沿分子引导的轴重新组织H&E补丁特征。该方法在推理阶段不需要转录组学数据,并可插入标准MIL聚合器之前。我们在23个下游任务和8个MIL聚合器上评估了MIST。与标准投影层相比,MIST在256种配置中的240种上取得改进,平均提升3.5%,且在各类终点上表现一致:生存预测提升5.2%,组织分型提升3.3%,生物标志物预测提升2.6%。消融实验证实基因衍生的原型是提升的主要来源,而空间、生物学和病理学分析表明,跨模态原型亲和度仅凭H&E即可捕捉空间上一致的分子程序。

英文摘要

Multiple instance learning (MIL) is the dominant framework for whole-slide image analysis in computational pathology, typically combining a frozen patch encoder, a projection layer, and a slide-level aggregator. While encoders and aggregators have been extensively studied, the projection layer remains a largely morphology-only bottleneck. This limits endpoints such as biomarker status and survival, which are governed by a molecular state that is not fully captured by H&E morphology. We introduce Molecularly Informed Staining Transform (MIST), a plug-in replacement for the MIL projection layer that uses paired spatial transcriptomics only during training to construct virtual molecular stains. MIST clusters gene expression profiles into cross-modal prototypes, anchors them in the frozen foundation model feature space, and uses them to reorganize H&E patch features along molecularly guided axes. It requires no transcriptomics at inference and can be inserted before standard MIL aggregators. We evaluate MIST across 23 downstream tasks and 8 MIL aggregators. MIST improves 240 of 256 configurations over the standard projection layer, with an average gain of +3.5%, observed consistently across endpoint types: +5.2% on survival prediction, +3.3% on tissue subtyping, and +2.6% on biomarker prediction. Ablations confirm that gene-derived prototypes are the primary source of the gains, while spatial, biological, and pathological analyses show that cross-modal prototype affinities capture spatially coherent molecular programs from H&E alone.

2605.16390 2026-05-19 cs.CV cs.LG stat.ML 版本更新

Inducing Spatial Locality in Vision Transformers through the Training Protocol

通过训练协议在视觉变换器中诱导空间局部性

Eduardo Santiago Toledo, Asael Fabian Martínez

发表机构 * Universidad Autónoma Metropolitana(墨西哥都会自治大学)

AI总结 研究通过对比不同训练协议,发现CutMix能提升视觉变换器早期层的注意力局部性,降低MAD值,表明CutMix促进局部注意力的产生。

详情
AI中文摘要

我们研究了能否仅通过训练协议,在无需大规模预训练的情况下,使从头训练的视觉变换器(ViT)的早期层产生空间局部性。在保持架构和优化过程不变的前提下,我们在CIFAR-10、CIFAR-100和Tiny-ImageNet上比较了基线协议与现代协议(AutoAugment/ColorJitter、CutMix和Label Smoothing),并通过平均注意力距离(MAD)和归一化熵来表征每个注意力头。在所有三个数据集上,现代协议都使早期层的注意力更局部、更集中;在CIFAR-100上,最小MAD从0.316(基线)降至0.008(现代)。为确定这一效应的来源,我们在CIFAR-100上进行了消融研究,逐一添加或移除各个组件。结果表明,在我们的实验中CutMix是决定性组件:所有包含CutMix的条件的MAD均为0.024,而所有不包含CutMix的条件的MAD仍保持在0.210。AutoAugment和Label Smoothing对局部性没有独立影响。总体而言,这些发现表明,CutMix带来的依据部分图像区域进行分类的压力,可以促进视觉变换器中局部注意力的出现。

英文摘要

We investigate whether the training protocol can induce spatial locality in the early layers of a Vision Transformer (ViT) trained from scratch, without large-scale pretraining. Keeping the architecture and optimization procedure fixed, we compare a Baseline protocol with a Modern protocol (AutoAugment/ColorJitter, CutMix, and Label Smoothing) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, characterizing each attention head via Mean Attention Distance (MAD) and normalized entropy. Across all three datasets, the Modern protocol produces more local and more concentrated attention in early layers; on CIFAR-100, the minimum MAD drops from 0.316 (Baseline) to 0.008 (Modern). To identify the source of this effect, we conduct an ablation study on CIFAR-100 by adding or removing each component individually. The results identify CutMix as the determining component within our experiments: all conditions with CutMix exhibit MAD 0.024, while all conditions without CutMix remain at MAD 0.210. AutoAugment and Label Smoothing show no independent effect on locality. Taken together, these findings suggest that the pressure to classify from partial image regions, induced by CutMix, can promote the emergence of local attention in Vision Transformers.
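
For readers unfamiliar with the metric, Mean Attention Distance for one head can be sketched as the attention-weighted average distance between query and key patches. The abstract does not give the paper's exact normalisation; dividing by the grid diagonal (so values fall in [0, 1]) and ignoring any class token are assumptions of this sketch, and `mean_attention_distance` is a hypothetical name.

```python
import math

def mean_attention_distance(attn, grid):
    """Attention-weighted mean patch distance for one head, normalised to [0, 1].

    attn[q][k] is the attention of query patch q on key patch k (rows sum to 1);
    patches are laid out row-major on a grid x grid lattice with grid >= 2.
    """
    coords = [(i // grid, i % grid) for i in range(grid * grid)]
    diag = math.dist((0, 0), (grid - 1, grid - 1))  # largest possible distance
    total = 0.0
    for q, row in enumerate(attn):
        total += sum(w * math.dist(coords[q], coords[k]) for k, w in enumerate(row))
    return total / (len(attn) * diag)
```

A head that attends only to its own patch scores 0, and a head with uniform attention scores much higher, which matches the reading that a low MAD indicates local attention.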

2605.16388 2026-05-19 cs.CV 版本更新

ChronoSC: Task-Oriented Semantic Communication via Temporal-to-Color Encoding

ChronoSC:通过时间到颜色编码实现面向任务的语义通信

Phuc H. Nguyen, Trung T. Nguyen, Quy N. Duong, Van-Dinh Nguyen

发表机构 * Smart Green Transformation Center (GREEN-X), VinUniversity, Vietnam(智能绿色转型中心(GREEN-X),Vin大学,越南) School of Computer Science(计算机科学学院)

AI总结 ChronoSC通过时间到颜色编码实现视频问答任务的高效语义通信,采用轻量级无损投影方案,实现极端时间压缩,提升带宽效率并保持高准确率。

Comments 6 pages, IEEE ICCE 2026

详情
AI中文摘要

语义通信(SC)旨在通过传输任务相关信息而非原始数据来减少传输开销。然而,现有视频SC方法大多专注于像素级重建或依赖复杂的时空管道,导致带宽使用过度和延迟高,不适用于低资源部署。本文提出ChronoSC,一种面向视频问答(VideoQA)的语义通信框架。ChronoSC引入Chrono-Color Stacking,一种轻量级无损投影方案,将视频的时间动态编码为单张静态图像,在传输前实现极端时间压缩。此紧凑的语义表示通过轻量级Deep Joint Source-Channel Coding(DeepJSCC)收发器传输,并在接收端显式重建。与潜在空间方法不同,显式视觉重建使预训练的视觉-语言模型得以直接复用;具体而言,预训练的BLIP模型用于从噪声重建的chrono图像中推断答案。在CLEVRER数据集上的实验表明,ChronoSC相比原始视频传输实现了高达192倍的带宽减少,同时保持高VideoQA准确率。

英文摘要

Semantic communication (SC) aims to reduce transmission overhead by conveying task-relevant information rather than raw data. However, existing SC approaches for video largely focus on pixel-level reconstruction or rely on complex spatiotemporal pipelines, leading to excessive bandwidth usage and latency that are unsuitable for low-resource deployments. In this paper, we propose ChronoSC, a task-oriented semantic communication framework for Video Question Answering (VideoQA). ChronoSC introduces Chrono-Color Stacking, a lightweight and lossless projection scheme that encodes temporal video dynamics into a single static image, enabling extreme temporal compression before transmission. This compact semantic representation is transmitted using a lightweight Deep Joint Source-Channel Coding (DeepJSCC) transceiver and explicitly reconstructed at the receiver. Unlike latent-space methods, explicit visual reconstruction enables the direct reuse of pre-trained vision-language models; specifically, a pre-trained BLIP model is employed to infer answers from noisy, reconstructed chrono-images. Experiments on the CLEVRER dataset show that ChronoSC achieves up to 192 times bandwidth reduction compared to raw video transmission while maintaining high VideoQA accuracy.

2605.16387 2026-05-19 cs.CV cs.AI 版本更新

Stabilizing Temporal Inference Dynamics for Online Surgical Phase Recognition

稳定在线手术阶段识别的时序推断动态

Yang Liu, Ning Zhu, Jingjing Peng, Xiwu Chen, Alejandro Granados, Guotai Wang, Sebastien Ourselin

发表机构 * King's College London, London, UK(伦敦国王学院) University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文提出统一框架稳定时序推断动态,通过TEC损失抑制误差传播,EGTP强制证据驱动状态转移,TFI衡量时间碎片化,提升稳定性并减少预测碎片化。

Comments Early accepted by MICCAI 2026

详情
AI中文摘要

在线手术阶段识别(SPR)模型可达到高帧级准确性,但其预测往往缺乏时序稳定性,导致工作流理解碎片化并降低下游辅助的可靠性。本文表明这种不稳定性并非随机噪声,而是源于两个机制:早期误分类会破坏时序特征状态并传播形成误差级联,而阶段转换遵循证据积累动态,但大多数在线SPR系统依赖无记忆的帧级决策,使其对短暂置信度波动敏感。我们提出一个统一的训练-推断-评估框架,通过模型无关、即插即用的组件显式稳定时序推断动态。在训练中,时序误差级联(TEC)损失通过稳定时序特征演变抑制误差起始并缓解误差传播。在推断中,证据门控转换预测器(EGTP)强制证据驱动的状态转移,仅在积累证据超过置信度边界时允许阶段变化。在评估中,我们引入时间碎片化指数(TFI),一个可靠性感知的度量标准,量化由不稳定性引起的时序分歧,超越传统帧级和基于标记的度量。在Cholec80和AutoLaparo上跨三个代表性backbone的实验表明,所提框架显著提高了时序稳定性并减少了预测碎片化,同时保持或略微提高了帧级性能。

英文摘要

Online Surgical Phase Recognition (SPR) models can reach high frame-wise accuracy, yet their predictions often lack temporal stability, fragmenting workflow understanding and reducing the reliability of downstream assistance. We show that this instability is not random noise but arises from two mechanisms: early misclassifications corrupt temporal feature states and propagate forward to form error cascades, and phase transitions follow evidence-accumulation dynamics whereas most online SPR systems rely on memoryless frame-wise decisions, making them sensitive to transient confidence fluctuations. We propose a unified Train-Inference-Evaluation framework that explicitly stabilizes temporal inference dynamics using model-agnostic, plug-and-play components. For training, the Temporal Error-Cascade (TEC) loss suppresses error onset and mitigates forward error propagation by stabilizing temporal feature evolution. For inference, the Evidence-Gated Transition Predictor (EGTP) enforces evidence-driven state transitions, allowing phase changes only when accumulated evidence exceeds a confidence boundary. For evaluation, we introduce the Temporal Fragmentation Index (TFI), a reliability-aware metric that quantifies instability-induced temporal disagreement beyond conventional frame-wise and token-based measures. Experiments on Cholec80 and AutoLaparo across three representative backbones show that the proposed framework substantially improves temporal stability and reduces prediction fragmentation, while maintaining or modestly improving frame-wise performance.
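
The evidence-gating idea can be illustrated with a small online decoder that changes phase only after accumulated evidence crosses a threshold. This is a sketch of the general principle, not the paper's EGTP: the leaky accumulator, the probability-margin evidence, and the `threshold` and `decay` values are all assumptions.

```python
def evidence_gated_decode(frame_probs, threshold=1.5, decay=0.8):
    """Online phase decoding that switches phase only after evidence accumulates.

    frame_probs: non-empty list of per-frame class-probability lists
    produced by any frame-wise model.
    """
    first = frame_probs[0]
    current = max(range(len(first)), key=first.__getitem__)
    evidence = [0.0] * len(first)
    phases = []
    for probs in frame_probs:
        for c, p in enumerate(probs):
            # Leaky accumulation of each rival's margin over the current phase.
            evidence[c] = max(0.0, decay * evidence[c] + p - probs[current])
        best = max(range(len(probs)), key=evidence.__getitem__)
        if best != current and evidence[best] > threshold:
            current = best
            evidence = [0.0] * len(probs)  # reset after a committed transition
        phases.append(current)
    return phases
```

A single-frame confidence spike leaves the phase unchanged, whereas a sustained run of frames favouring another phase eventually triggers the transition, at the cost of a short delay.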

2605.16386 2026-05-19 cs.CV 版本更新

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

对多模态大语言模型评分者的审计:临床顺序评分中的中间倾向偏差

Jiaqing Zhang, Sandeep Elluri, Bhanu Cherukuvada, Yonah Joffe, Jessica Sena, Miguel Contreras, Scott Siegel, Subhash Nerella, Catherine Price, Parisa Rashidi

发表机构 * Department of Electrical & Computer Engineering(电气与计算机工程系) Department of Computer and Information Science and Engineering(计算机与信息科学与工程系) Department of Clinical and Health Psychology(临床与健康心理学系) Department of Biomedical Engineering(生物医学工程系)

AI总结 本文研究多模态大语言模型在临床顺序评分中的中间倾向偏差,通过基准测试发现三种前沿LLM在Clock Drawing Test评分中存在系统性压缩倾向,影响临床决策。

详情
AI中文摘要

多模态大语言模型(LLM)正越来越多地被探索用作临床场景中的自动评估者,但其在顺序临床量表上的评分行为仍缺乏了解。我们在两个公开数据集上,依据Shulman评分标准对画钟测验(CDT)图像进行评分,将三个前沿LLM家族与监督深度学习模型进行基准比较。完全微调的Vision Transformers取得了最佳校准(MAE 0.52,±1分以内准确率91%),而零样本LLM尽管绝对误差更高,在基于容差的一致性上仍具竞争力(GPT-5 MAE 0.67,±1分以内准确率92%)。然而,按分数进行的分析显示,三个LLM家族均表现出显著的趋中效应(系统性端点压缩):预测被系统性地压向量表中部,低端高估(0分被评为1分),高端低估(5分被评为4分)。这一效应对临床上至关重要的两端影响尤为严重,而准确评分恰恰在这些位置对认知障碍筛查决策影响最大。针对性消融实验表明,无论是提供覆盖完整分数范围的少样本示例,还是从提示中移除临床术语,都无法消除该效应。我们的发现将LLM-as-a-judge偏差研究从NLP评估扩展到临床评估,并强调在高风险筛查流程中部署基于LLM的评分者之前,需要进行关注校准的评估和事后校准。

英文摘要

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
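
The quantities reported here (MAE, within-1 accuracy, and endpoint compression) are straightforward to compute from paired scores. The sketch below assumes integer scores on the 0 to 5 range mentioned in the abstract; `ordinal_report` and its signed-bias summary are illustrative choices, not the authors' evaluation code.

```python
def ordinal_report(y_true, y_pred, low=0, high=5):
    """MAE, within-1 accuracy, and signed bias at the two scale endpoints."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    within1 = sum(abs(e) <= 1 for e in errors) / n

    def bias_at(score):
        # Mean signed error over items whose true score equals `score`.
        sel = [e for t, e in zip(y_true, errors) if t == score]
        return sum(sel) / len(sel) if sel else None

    return {
        "mae": mae,
        "within1": within1,
        "bias_low": bias_at(low),    # positive: low scores are over-predicted
        "bias_high": bias_at(high),  # negative: high scores are under-predicted
    }
```

A rater can reach perfect within-1 accuracy while still showing a positive `bias_low` and a negative `bias_high`, which is the central tendency pattern the abstract describes and which tolerance-based agreement alone hides.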

2605.16384 2026-05-19 cs.CV cs.AI 版本更新

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

全局标记与补丁标记之间的相互增强:从理论到实践

Xiusheng Huang, Xin Jiang, Jun Zhao, Kang Liu, Yequan Wang

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China(认知与决策智能复杂系统重点实验室,自动化研究所,中国科学院,北京,中国) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 本文提出TaTok框架,通过引入全局标记和动态令牌过滤算法,解决现有方法中信息不足和冗余问题,提升图像令牌化效果和推理速度。

Comments 21 pages, 8 figures

详情
AI中文摘要

准确有效的离散图像令牌化对长图像序列处理至关重要。然而,当前方法以固定比率压缩所有内容,忽视了图像中信息密度的变化,导致冗余或信息丢失。受信息熵启发,我们提出TaTok,一种理论指导的自适应图像令牌化框架。我们严格识别现有方法的两个关键问题:仅使用补丁令牌重建图像时的信息不足,以及补丁令牌之间的信息冗余。为此,我们引入全局令牌来建模补丁令牌之间的互信息,并基于累积条件熵的动态令牌过滤(DTF)算法来消除冗余。实验证实TaTok的最先进性能,实现了1.3倍gFID提升和8.7倍推理加速。通过根据信息丰富度分配令牌,TaTok实现了更压缩但更准确的图像令牌化,为未来研究提供了有价值的见解。

英文摘要

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.

2605.16383 2026-05-19 cs.CV cs.AI stat.ML 版本更新

A neurosymbolic Approach with Epistemic Deep Learning for Hierarchical Image Classification

一种结合认知深度学习的神经符号分层图像分类方法

Ezel Kilicdere, Shireen Kudukkil Manchingal, Fabio Cuzzolin

发表机构 * Institute for AI, Data Analysis and Systems (AIDAS) School of Engineering, Computing and Mathematics, Oxford Brookes University, UK(人工智能、数据分析和系统研究所(AIDAS)工程、计算与数学学院,英国牛津布鲁克斯大学)

AI总结 本文提出一种统一的神经符号和认知建模框架,通过融合Swin Transformer、焦点集推理和可微模糊逻辑,提升分层图像分类的准确性和逻辑一致性。

Comments 36 pages

详情
AI中文摘要

深度神经网络在图像分类任务中实现了高精度,但往往产生过于自信的预测,无法表达认知不确定性,并经常违反数据中存在的逻辑或结构约束。这些局限性在分层分类中尤为明显,因为细粒度和粗粒度层级的预测必须保持一致。本文首次提出一种统一的神经符号与认知建模框架,以焦点集推理和可微模糊逻辑增强Swin Transformer。该方法不将标签视为孤立类别,而是在学习到的嵌入空间中诱导数据驱动的焦点集,有助于刻画多个可能的细粒度类别上的认知不确定性。这些焦点集构成了一个基于信念理论的层的基础,该层利用模糊隶属函数和t-范数合取来促进细粒度与粗粒度预测之间的一致性。可学习的损失进一步平衡校准、质量正则化和逻辑一致性,使模型能够自适应地权衡符号结构与数据驱动的证据。在分层图像分类实验中,本文框架的准确率与Transformer基线相当,同时提供校准更好、更可解释的预测,减少了过度自信,并在分层输出中保持高度的逻辑一致性。实验结果表明,将焦点集推理与模糊逻辑相结合,是迈向既准确又具有认知意识的深度学习模型的切实一步。

英文摘要

Deep neural networks achieve high accuracy on image classification tasks. Yet, they often produce overconfident predictions which fail to express epistemic uncertainty, and frequently violate logical or structural constraints present in the data. These limitations are particularly pronounced in hierarchical classification, where predictions across fine and coarse levels must remain coherent. We propose, for the first time, a unified neurosymbolic and epistemic modelling framework that augments Swin Transformers with focal set reasoning and differentiable fuzzy logic. Rather than treating labels as isolated categories, our method induces data-driven focal sets within the learnt embedding space, which helps capture epistemic uncertainty over multiple plausible fine-grained classes. These focal sets form the basis of a belief-theoretic layer that uses fuzzy membership functions and t-norm conjunctions to encourage consistency between fine- and coarse-grained predictions. A learnable loss further balances calibration, mass regularisation, and logical consistency, allowing the model to adaptively trade off symbolic structure with data-driven evidence. In experiments on hierarchical image classification, our framework maintains accuracy on par with transformer baselines while providing more calibrated and interpretable predictions, reducing overconfidence and enforcing high logical consistency across hierarchical outputs. Our experimental results show that combining focal set reasoning with fuzzy logic provides a practical step toward deep learning models that are both accurate and epistemically aware.

2605.16381 2026-05-19 cs.CV cs.AI 版本更新

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

StreamPro: 从反应式感知到主动决策的流视频处理

Ao Li, Zihan Xiao, Zihao Yue, Boshen Xu, Linli Yao, Jiaze Li, Pei Fu, Jianzhong Ju, Jian Luan, Qin Jin

发表机构 * AIM3 Lab, Renmin University of China(中国人民大学AIM3实验室) MiLM Plus, Xiaomi Inc.(小米公司MiLM Plus)

AI总结 StreamPro通过引入CB-Stream损失和GRPO算法,提升流视频处理的主动决策能力,在StreamPro-Bench上取得显著成效,性能优于先前最佳。

详情
AI中文摘要

主动流视频理解需要模型持续处理视频流并决定何时响应,而非仅仅确定响应内容。这自然引入了部分观察下的决策问题,模型需在早期预测与充分证据之间平衡。然而,现有基准大多遵循“看见再回答”范式,响应仅在明确证据出现后触发,将主动推理缩减为延迟感知。因此,它们无法评估模型在不完整观察下的及时性和可靠性决策能力。此外,训练主动模型本身具有挑战性,因为流轨迹中沉默与响应信号之间存在极端不平衡,且需要联合优化响应准确性和时机。为解决这些问题,我们引入StreamPro-Bench,从感知理解、时间推理和主动代理三个互补视角评估流模型。其中,主动代理衡量模型在部分观察下的早期但可靠决策能力。我们进一步提出StreamPro,一种两阶段训练框架用于主动学习。首先,我们引入CB-Stream损失以缓解监督不平衡问题。然后,我们应用基于多粒度奖励设计的分组相对策略优化(GRPO)。实验表明,StreamPro显著提升了主动性能。在StreamPro-Bench上,其达到41.5,远超先前最佳(10.4),同时在实时流基准测试中也表现优异,达到78.9分。

英文摘要

Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a "see-then-answer" paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model's ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model's ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.
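
The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses in its group. A minimal sketch of that normalisation follows; how StreamPro combines its turn-level and trajectory-level rewards into one scalar is not stated in the abstract, so the input here is simply assumed to be one scalar reward per sampled trajectory, and the use of the population standard deviation is likewise an assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat the group average receive positive advantages and the rest negative ones; when every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.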

2605.16376 2026-05-19 eess.IV cs.CV cs.DC cs.LG cs.MM 版本更新

Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG

Kelvin v1.0:一种用于H.264的神经预编码器:一种符合标准的学得预处理程序,在UVG上实现-27.62%的BD-VMAF

Marco Graziano

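Affiliations * Graziano Labs Corp.

AI summary Kelvin v1.0 is a learned pre-encoder that applies content-adaptive pixel adjustments ahead of an unmodified libx264 encoder. It reports a mean BD-VMAF of -27.62% against baseline libx264 on UVG, with a consistent result on MCL-JCV, and handles the non-differentiability of H.264 with a hybrid codec proxy.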

详情
AI中文摘要

Kelvin是一种轻量级学得预编码器,位于未修改的libx264编码器之前。它应用内容自适应的像素调整,每个通道限制在±1/255以内,使编码器将比特分配到最需要感知的区域,同时输出兼容所有现有解码器、播放器和CDN的标准H.264位流。在七序列1080p UVG基准上,Kelvin v1.0实现平均BD-VMAF为-27.62%(7/7胜),BD-VMAF-NEG为-5.18%(6/7胜)。在30序列MCL-JCV公开数据集上,相同检查点在28/30片段上胜过基准libx264,去除两个可诊断失败后,平均BD-VMAF为-27.70%,与UVG一致。核心工程挑战是H.264的非可微性:我们描述了一种混合编码器代理,结合校准的可微率估计器(与真实libx264的每像素比特数斯皮尔曼_rho=0.986)和在真实编码器输出上训练的U-Net失真代理。我们发布完整的每序列率失真数据,MCL-JCV上的命名失败模式分类(率下限违规、分布偏移、指标饱和),以及五个基准的合理性面板(hqdn3d、unsharp、-tune psnr、-tune ssim、x265 medium),并诚实定位:x265 medium在相同数据集上每项指标均胜过Kelvin。因此,Kelvin是为在H.264上保持是约束而非选择的工作负载设计的。

英文摘要

Kelvin is a lightweight learned pre-encoder that sits in front of an unmodified libx264 encoder. It applies content-adaptive pixel adjustments, bounded at +/-1/255 per channel, so that the encoder allocates bits where they matter most perceptually, while emitting a standard H.264 bitstream compatible with every existing decoder, player, and CDN. On the seven-sequence 1080p UVG benchmark, Kelvin v1.0 achieves a mean BD-VMAF of -27.62% (7 of 7 wins) and BD-VMAF-NEG of -5.18% (6 of 7 wins) relative to baseline libx264 at preset medium. On the 30-sequence MCL-JCV public set (28 unseen by training), the same checkpoint wins on 28 of 30 clips by BD-VMAF; with the two diagnosable failures removed the mean is -27.70% BD-VMAF and -5.37% BD-VMAF-NEG, consistent with UVG to within one percentage point. A central engineering challenge is the non-differentiability of H.264: we describe a hybrid codec proxy that combines a calibrated differentiable rate estimator (Spearman rho = 0.986 vs. real libx264 bits-per-pixel) with a U-Net distortion proxy trained on real encoder outputs. We publish full per-sequence rate-distortion data, a named failure-mode taxonomy on MCL-JCV (rate-floor violation, distribution shift, metric saturation), a five-baseline sanity panel (hqdn3d, unsharp, -tune psnr, -tune ssim, x265 medium), and honest positioning: x265 medium beats Kelvin on every metric on the same corpus. Kelvin is therefore designed for workloads where remaining on H.264 is a constraint rather than a choice.
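
The +/-1/255 per-channel bound can be illustrated with a small clamp. This is only a sketch of the constraint, assuming pixel values normalized to [0, 1]. How Kelvin predicts the adjustment is not described in the abstract, and `bounded_adjust` is a hypothetical name.

```python
def bounded_adjust(pixel, delta, bound=1.0 / 255.0):
    """Apply a per-channel adjustment limited to +/- bound, keeping values in [0, 1]."""
    out = []
    for value, d in zip(pixel, delta):
        d = max(-bound, min(bound, d))             # clamp the proposed adjustment
        out.append(max(0.0, min(1.0, value + d)))  # stay inside the valid pixel range
    return out

# One RGB pixel and a proposed adjustment that exceeds the bound on two channels.
adjusted = bounded_adjust([0.5, 0.0, 1.0], [0.02, -0.02, 0.001])
```

Because the change never exceeds one 8-bit code value per channel, the adjusted frame is still ordinary input for an unmodified encoder.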

2605.16373 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT

跨源监督在双模态PET-CT骨感染分割中的应用

Zonglin Yang, Xiaolei Diao, Jishizhan Chen, Xiaozhuang Man, Wei Kong, Gen Wen, Pengfei Cheng, Daqian Shi

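Affiliations * Shanghai Maritime University; University College London; Shanghai Sixth People’s Hospital; Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine; Queen Mary University of London

AI summary This paper proposes a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through early fusion, targeting bone infection segmentation under annotation discrepancy. Patient-level 3D volumetric evaluation with cross-validation is used to avoid the performance inflation that slice-level evaluation causes on small datasets.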

详情
AI中文摘要

早期和准确诊断骨感染及病变定位对临床治疗至关重要。PET-CT结合了CT的解剖信息和PET的代谢信息,是诊断骨感染的重要成像模态。然而,由于病变边界不清晰和不同专家或自动化系统生成的标注不一致,准确的病变分割仍具挑战性。本文研究了在标注不一致下的多模态分割。我们开发了一个双模态端到端分割框架,通过早融合多模态表示整合PET代谢信号和CT骨窗解剖信息。为了缓解小数据集中小切片相关性导致的性能膨胀,本研究弃用传统二维评估方法,采用严格的患者级3D体积评估和交叉验证。此外,我们提出了一种解耦的双源学习框架,其中并行模型在由高灵敏度和高特异性临床意图驱动的独立专家标注上进行训练。实验结果客观报告了患者级性能变化(均值±标准差和均值-标准差),证明了多模态PET-CT融合的有效性。交叉评估矩阵定量揭示了模型如何成功内化不同的专家诊断哲学,提供了一种稳健且保持多样性的临床AI部署范式,用于骨感染分割。

英文摘要

Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation. To mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.

2605.16372 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SwordBench: Evaluating Orthogonality of Steering Image Representations

SwordBench:评估转向图像表示的正交性

Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek

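Affiliations * Centre for Credible AI; Warsaw University of Technology; University of Warsaw

AI summary This paper introduces SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. It adds two evaluation notions, cross-concept robustness and collateral damage, and finds that a linear SVM shows superior separability and orthogonality yet fails to reach zero collateral damage, often trailing sparse autoencoders.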

详情
AI中文摘要

在推理时间对模型表示进行干预以校正预测对于AI可解释性和安全性至关重要,但现有评估协议局限于模糊的语言建模任务。为填补这一空白,我们引入SwordBench,一个用于评估视觉模型在多个backbone和概念移除任务中转向表示的基准。除了统一的基准测试套件外,我们还提出了新的评估概念,揭示了概念激活向量正交性对实用转向的二次影响。具体而言,交叉概念鲁棒性衡量在针对替代概念正交化输入上概念检测性能的稳定性,而collateral damage量化在缺乏偏见的输入上转向是否意外影响下游任务的模型性能。我们发现尽管线性支持向量机在分离性和正交性上表现优异,但无法实现零collateral damage,通常落后于稀疏自编码器。在更简单的环境中,标准基线和优化方法均无法实现完美的转向。源代码将很快在GitHub上发布。

英文摘要

Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
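
Orthogonalizing a representation against a concept activation vector is commonly done by subtracting its projection onto that vector. The sketch below shows this standard linear projection. The abstract does not spell out SwordBench's exact steering procedure, so treat the function and the toy vectors as illustrative assumptions.

```python
def orthogonalize(x, v):
    """Remove from representation x its component along the concept direction v."""
    vv = sum(c * c for c in v)
    if vv == 0.0:
        return list(x)  # a zero direction leaves x unchanged
    scale = sum(a * b for a, b in zip(x, v)) / vv
    return [a - scale * b for a, b in zip(x, v)]

x = [3.0, 4.0, 1.0]   # toy representation
v = [1.0, 0.0, 0.0]   # toy concept activation vector
steered = orthogonalize(x, v)
```

After the projection is removed, the steered vector has zero dot product with `v`. Components along other directions are untouched only when those directions are themselves orthogonal to `v`, which is why orthogonality among concept vectors matters for collateral damage.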

2605.16371 2026-05-19 cs.CV cs.AI 版本更新

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

GeoSym127K:可扩展的符号验证合成用于多模态几何推理

Jinhao Jing, Zheng Ma, Jinwei Liang, Qiannian Zhao, Shawn Chen, Jing Yang, Por Lip Yee, Prayag Tiwari, Jingjing Bai, Benyou Wang, Lewei Lu, Zhan Su

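Affiliations * School of Information Technology, Halmstad University

AI summary This paper proposes the GeoSym Engine, which uses a type-conditional grammar and an analytic SymGT Solver to derive exact symbolic ground truths. It is used to build GeoSym127K, with 51K high-resolution images, 127K questions, and 55K answer-verified CoT QA pairs, and fine-tuning on it improves performance on geometric reasoning tasks.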

详情
AI中文摘要

大型多模态模型(LMMs)在几何推理中常因视觉幻觉和缺乏数学精确的Chain-of-Thought(CoT)数据而遇到困难。为此,我们提出了GeoSym引擎,一种自动且可扩展的神经符号框架。通过利用类型条件语法和分析SymGT求解器,它能够推导出精确的符号地面真实值,并无缝整合到稳健的渲染管线中,生成高精度的几何图示。使用该引擎,我们构建了GeoSym127K,一个难度分层的数据集,包含51K高清图像、127K带有符号地面真实值的问题和55K答案验证的CoT QA对。我们还引入了GeoSym-Bench,一个由专家整理的511个复杂样本集,用于严格评估。通过广泛的监督微调(SFT),我们证明GeoSym在依赖图示和多步骤几何任务上实现了集中改进。我们的Qwen3-VL-8B模型在MathVerse Vision-Only子集上实现了绝对+22.21%的提升,并在WeMath上达到61.52%(+6.19%的改进),缓解了长距离逻辑碎片化问题,并优于先进的闭源模型如Doubao-1.8。进一步地,通过Reinforcement Learning with Verifiable Rewards(RLVR) via GRPO发现,从结构SFT检查点初始化显著提升了零样本RL的性能上限。由确定性精确匹配信号驱动,这展示了我们可验证推理合成的稳健扩展潜力。数据集和代码可在https://huggingface.co/datasets/Tomie0506/GeoSym127K和https://github.com/Tomie56/GeoSym127K获得。

英文摘要

Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.

2605.16366 2026-05-19 cs.CV cs.AI 版本更新

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

Fre-Res: 频率-残差视频令牌压缩用于高效的视频多模态大语言模型

Yigui Feng, Qinglin Wang, Yang Liu, Jie Liu

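Affiliations * College of Computer Science, National University of Defense Technology; Shien-Ming Wu School of Intelligent Engineering, South China University of Technology

AI summary Fre-Res compresses video tokens by handling spatial and temporal evidence separately: sparse high-fidelity spatial anchors plus compact residual-frequency tokens. It substantially shortens the visual-token sequence while matching or approaching full-token performance on fine-grained short-video and long-video reasoning benchmarks.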

Comments 24 pages, 5 figures

详情
AI中文摘要

视频多模态大语言模型面临空间保真度与时间覆盖度之间的矛盾:保留细粒度视觉细节需要大量空间令牌,而捕捉短暂事件需要密集的时间采样。我们提出Fre-Res,一种预算自适应的双轨视频令牌压缩框架,分别处理这两种证据形式。Fre-Res保留稀疏的高保真空间锚点,并通过紧凑的残差频域令牌表示密集的时间演变。具体而言,它对视觉潜在空间中的帧间残差轨迹应用时间1D-DCT,在其中观察到强低频集中。为对齐频域动态与原生视觉嵌入,Fre-Res引入了空间引导吸收器,将时间残差信息注入与空间锚点对应的令牌中。在细粒度短视频和长视频推理基准上,Fre-Res实现了有利的准确率-效率权衡,匹配或接近全令牌性能,同时显著减少视觉令牌长度。广泛消融实验进一步表明,时间频域残差保留因果转换线索,而空间锚点对细粒度物体和布局推理至关重要。

英文摘要

Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy-efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
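
The temporal 1D-DCT step can be sketched for a single latent channel. This is a minimal illustration, assuming an unnormalized DCT-II and a simple keep-the-first-k truncation. The abstract does not state the DCT normalization, the truncation rule, or how the tokens are formed, and both function names are hypothetical.

```python
import math

def dct_ii(signal):
    """Unnormalized DCT-II of a 1D sequence."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]

def residual_frequency_tokens(trajectory, keep):
    """DCT of the inter-frame residuals of one latent channel, truncated to low frequencies."""
    residuals = [b - a for a, b in zip(trajectory, trajectory[1:])]
    return dct_ii(residuals)[:keep]

# A latent value drifting linearly over 9 frames, so all 8 residuals are equal.
trajectory = [0.5 * t for t in range(9)]
full = dct_ii([b - a for a, b in zip(trajectory, trajectory[1:])])
tokens = residual_frequency_tokens(trajectory, keep=2)
```

For this smooth trajectory all residual energy lands in the zero-frequency coefficient, a toy version of the low-frequency concentration the abstract reports, so dropping the higher coefficients loses nothing here.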

2605.16359 2026-05-19 cs.CV cs.AI 版本更新

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

多模态语言模型需要多少视觉标记?通过F^3A进行视觉标记剪枝的扩展

YiJie Huang, Yiqun Zhang, Zhuoyue Jia, Xiaocui Yang, Junzhao Huang, Zihan Wang, Shi Feng, Daling Wang, Yifei Zhang, Yongkang Liu

Affiliations * School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China; Shanghai Artificial Intelligence Laboratory; School of Computer and Communication Engineering, Northeastern University, Qinhuangdao 066004, China

AI summary This paper proposes F^3A, a training-free router that treats visual token pruning as task-conditioned evidence search and allocates a fixed visual token budget before the language model consumes image tokens. It needs no model training and preserves the original multimodal prompting and decoding pipeline.

详情
AI中文摘要

视觉语言模型通过将越来越长的视觉标记序列输入语言骨干网络来提升感知能力,但由此产生的推理成本提出了一个基本的扩展问题:随着多模态模型的增长,实际上需要多少视觉标记,以及在固定视觉标记预算下如何分配?现有训练免费剪枝方法通常通过一shot代理如解码器注意力、视觉相似性或条件多样性来回答这个问题。我们主张将视觉标记剪枝视为任务条件证据搜索,特别是在极端压缩和跨模型规模的情况下。我们提出F^3A,一种训练免费的视觉标记剪枝路由器,在语言模型消耗图像标记之前运行。F^3A构建轻量级的问题条件线索,通过冻结的稀疏感知头将它们与视觉网格标记匹配,并通过粗略证据定位、局部细化、覆盖保持竞争和恢复未覆盖区域来分配固定视觉标记预算。它不需要模型训练,不需要额外的LLM前向传递,并保留原始多模态提示和解码流程。

英文摘要

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed visual token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training and no extra LLM forward pass, and it preserves the original multimodal prompting and decoding pipeline.

2605.16357 2026-05-19 eess.SP cs.AI cs.CV 版本更新

Learning Displacement-Aware WiFi Representations for Weakly Supervised Relative Localization

学习位移感知的Wi-Fi表示以实现弱监督的相对定位

Tzu-Ti Wei, Po-Cheng Chen, Yu-Chee Tseng, Jen-Jee Chen

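Affiliations * College of AI, National Yang Ming Chiao Tung University

AI summary This paper proposes Intersection Pathway (IP), a cross-modal learning framework that aligns fingerprint traces with displacement traces to learn displacement-aware WiFi representations. It achieves accurate relative localization and can be extended to few-shot absolute localization.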

详情
AI中文摘要

基于Wi-Fi指纹的室内定位已广泛研究,但现有方法多关注绝对定位并依赖密集坐标标注,获取成本高。本文研究相对定位问题,目标是直接估计两个Wi-Fi指纹轨迹间的位移,不预测绝对位置。为减少标注开销,采用惯性传感获取的步进运动向量作为弱监督。提出Intersection Pathway (IP)框架,通过共享潜在空间对齐指纹轨迹与位移轨迹。关键思想是使潜在空间具有加法结构,使潜在空间的加减对应物理运动组合,实现直接的相对位移推断。实验表明,所提方法在合成数据集上学习位移感知的Wi-Fi表示,实现不同位移范围的准确相对定位。此外,所学模型可扩展至少样本绝对定位。

英文摘要

WiFi fingerprint-based indoor localization has been widely studied, but most existing approaches focus on absolute positioning and rely on dense coordinate annotations, which are costly to obtain at scale. In this paper, we study a fundamentally different problem: relative localization, where the goal is to directly estimate the displacement between two WiFi fingerprint traces without predicting their absolute positions. To reduce annotation overhead, we adopt weak supervision in the form of stepwise motion vectors obtained from inertial sensing. We propose Intersection Pathway (IP), a cross-modal learning framework that aligns fingerprint traces (f-traces) and displacement traces (d-traces) in a shared latent space. The key idea is to enforce an additive structure in the latent space, such that latent addition and subtraction correspond to physical motion composition, enabling direct relative-displacement inference. Experiments on a synthesized dataset derived from real measurements demonstrate that the proposed method learns displacement-aware WiFi representations and achieves accurate relative localization across varying displacement ranges. Furthermore, the learned model can be extended to few-shot absolute localization with sparse anchors.

2605.16355 2026-05-19 cs.GR cs.CV 版本更新

Generative 3D Gaussians with Learned Density Control

具有学习密度控制的生成3D高斯

Runjie Yan, Yan-Pei Cao, Peng Wang, Ding Liang, Yuan-Chen Guo

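Affiliations * Institute for Interdisciplinary Information Sciences, Tsinghua University

AI summary This paper proposes Density-Sampled Gaussians (DeG), a 3D representation that achieves adaptive density control by learning a probability density function over Gaussian centers. Spatial density and Gaussian attributes are optimized jointly under rendering supervision, improving single-image-to-3D generation quality.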

Comments 19 pages, 16 figures, SIGGRAPH Conference Papers '26

详情
AI中文摘要

我们提出Density-Sampled Gaussians (DeG),一种新的3D表示方法,旨在弥合自适应渲染原语与可扩展生成建模之间的差距。不同于现有方法将3D高斯限制在固定体素网格或数组中,DeG将高斯中心建模为从定义在八叉树上的可学习概率密度函数中采样的值。这种形式提供了一个严谨的数学框架用于自适应密度控制:通过在渲染监督下联合优化空间密度和高斯属性,我们的模型自然地将原语集中在几何复杂度高的区域。我们通过新的渲染损失贡献梯度实现这一点,该梯度是标准高斯点绘中离散致密化和修剪启发法的全微分替代品。所得到的表示高度灵活,可通过简单调整采样预算从单一潜在代码解码出可变分辨率。为实现生成合成,我们在DeG上训练了一个潜在扩散模型。我们识别出在应用扩散模型到无序集合结构潜在变量时存在关键挑战,这会显著减慢收敛速度,并提出VecSeq,一种标准重新索引机制,将潜在token锚定到确定性的3D Sobol序列。这将模糊的集合生成问题转化为稳健的序列建模任务。大量实验表明,我们的流程在单图到3D生成中实现了最先进的质量,结合了无序原语的结构自适应性和基于网格的方法的训练稳定性。

英文摘要

We present Density-Sampled Gaussians (DeG), a novel 3D representation designed to bridge the gap between adaptive rendering primitives and scalable generative modeling. Unlike existing approaches that constrain 3D Gaussians to fixed voxel grids or arrays, DeG models Gaussian centers as samples from a learnable probability density function defined over an octree. This formulation provides a rigorous mathematical framework for adaptive density control: by jointly optimizing the spatial density and Gaussian attributes under rendering supervision, our model naturally concentrates primitives in regions of high geometric complexity. We achieve this via a new render loss contribution gradient that serves as a fully differentiable analogue to the discrete densification and pruning heuristics used in standard Gaussian Splatting. The resulting representation is highly flexible, supporting variable-resolution decoding from a single latent code by simply adjusting the sampling budget. To enable generative synthesis, we train a latent diffusion model on DeG. We identify a critical challenge in applying diffusion to unordered set-structured latents, which can significantly slow convergence, and propose VecSeq, a canonical re-indexing mechanism that anchors latent tokens to a deterministic 3D Sobol sequence. This transforms the ambiguous set-generation problem into a robust sequence modeling task. Extensive experiments demonstrate that our pipeline achieves state-of-the-art quality in single-image-to-3D generation, combining the structural adaptivity of unstructured primitives with the training stability of grid-based methods.

2605.16317 2026-05-19 cs.CV physics.optics 版本更新

Noise2Params: Unification and Parameter Determination from Noise via a Probabilistic Event Camera Model

Noise2Params: 事件相机中噪声与参数统一的建模与确定

Owen Root, Julinda Mujo, Min Xu

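Affiliations * Department of Physics and Astronomy, CUNY Hunter College

AI summary This paper develops a probabilistic event-camera model that unifies the description of static-scene noise events and step response curves, and proposes Noise2Params, which determines camera parameters from recordings of static, uniform scenes without requiring specialized dynamic light sources.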

Comments Main paper: 29 pages, 18 figures; Supplemental Information: 22 pages, 15 figures

详情
AI中文摘要

准确统一的事件相机(ECs)模型仍难以获得,阻碍了校准和算法设计。本文开发了一个基于光子统计的概率模型,统一描述静态场景噪声事件和步响应曲线(S-curves)的描述。推导了三种概率分布形式,涵盖所有强度范围:精确泊松分布、鞍点分布和高斯分布。该模型揭示了这些看似不同的EC行为之间的内在联系,并澄清了S-curves的解释,表明其比选择固定概率阈值更复杂。基于此模型,我们提出Noise2Params方法,通过最小化观测噪声事件分布的误差来确定相机特定的log-contrast阈值B、lux到光子转换因子α以及泄漏项θ(发现与强度相关)。Noise2Params仅需静态均匀场景的记录,提供了一种实验可及的替代方法,而非需要特殊动态光源的方法。我们进一步通过训练卷积神经网络(CNNs)在合成噪声图像上进行训练,并评估其从实验数据中重建静态场景的能力,进一步证明了模型的有效性。我们还通过展示CNNs结合合成数据优于仅训练于实验数据的CNNs,证明了模型的实用性。我们的框架为EC校准、噪声感知算法设计以及光子受限领域应用提供了定量基础。

英文摘要

Accurate, unified models for event cameras (ECs) remain elusive, hampering calibration and algorithm design. We develop a foundational probabilistic model for EC event detection, grounded in photon statistics, that unifies the description of static scene noise events and step response curves (S-curves) within a single analytical framework. Three formulations of the probability distributions are derived, spanning all intensity regimes: exact Poisson, saddle-point, and Gaussian. The model reveals the underlying connection between these otherwise disparate EC behaviors and clarifies the interpretation of S-curves, which we show is more nuanced than selecting a fixed probability threshold. Based on this model, we propose Noise2Params, a method for determining camera-specific values of the log-contrast threshold $B$, the lux-to-photon conversion factor $\alpha$, and the leakage term $\theta$ (found to be intensity dependent), via error minimization against observed noise-event distributions. Noise2Params requires only recordings of static, uniform scenes, offering an experimentally accessible alternative to approaches that demand specialized dynamic light sources. We further support the validity of the model by training convolutional neural networks (CNNs) on synthetic noise images generated from our distributions and evaluating their ability to reconstruct static scenes from experimental data. We further demonstrate the utility of our model by showing that CNNs incorporating synthetic data outperform those trained solely on experimental data. Our framework provides a quantitative foundation for EC calibration, noise-aware algorithm design, and applications in photon-limited regimes.

2605.16266 2026-05-19 cs.GR cs.CV cs.LG 版本更新

Patchwork: A compact representation for 3D polygonal shapes

Patchwork: 3D多边形形状的紧凑表示

Ruichen Zheng, Biao Zhang, Michael Birsak, Mikhail Skopenkov, Peter Wonka

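Affiliations * King Abdullah University of Science and Technology

AI summary Patchwork is a general-purpose shape representation that models 2D and 3D geometry with a small number of parameters, with provable complexity bounds and arbitrary-precision approximation. Efficient gradient-based fitting plus a regularization loss that prunes redundant elements yields high compactness for geometric learning and reconstruction tasks.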

详情
AI中文摘要

我们介绍了Patchwork,一种新的通用形状表示方法,能够用少量参数建模2D和3D几何。Patchwork建立在严谨的数学框架上,提供可证明的复杂度界,并能在任意维度中以任意精度近似任意形状。我们提出了一种高效的基于梯度的优化方案,用于拟合2D和3D数据,同时提出一种新的正则化损失,逐步剔除冗余元素,收敛后获得高紧凑性。我们的方法提供了快速拟合性能,参数数量比现有方法少很多,并原生支持内外分类,使其成为几何学习和重建任务的通用且紧凑的表示方法,未来可能用于3D生成。我们的实现可在:https://github.com/Ankbzpx/patchwork-experiment 获取。

英文摘要

We introduce Patchwork, a new general-purpose shape representation capable of modeling 2D and 3D geometry with a small number of parameters. Patchwork is grounded in a rigorous mathematical framework, providing provable complexity bounds and the ability to approximate arbitrary shapes with arbitrary precision in any dimension. We propose an efficient gradient-based optimization scheme to fit Patchwork representations to 2D and 3D data, along with a novel regularization loss that progressively prunes redundant elements, yielding high compactness after convergence. Our approach offers fast fitting performance, a fraction of the required parameters compared to existing alternatives, and native support for inside-outside classification, making it a versatile and compact representation for geometric learning and reconstruction tasks, with future potential for 3D generation. Our implementation is available at: https://github.com/Ankbzpx/patchwork-experiment.

2605.14787 2026-05-19 cs.CV cs.CL 版本更新

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

组合图像检索基准是否需要多模态组合?

Matteo Attimonelli, Alessandro De Bellis, Aryo Pradipta Gema, Rohit Saxena, Monica Sekoyan, Wai-Chung Kwan, Claudio Pomo, Alessandro Suglia, Dietmar Jannach, Tommaso Di Noia, Pasquale Minervini

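Affiliations * Politecnico di Bari; Sapienza University of Rome; University of Edinburgh; University of Klagenfurt

AI summary The study finds that a large fraction of queries in composed image retrieval benchmarks (32.2% to 83.6%) can be solved with a single modality, so the assumption that strong CIR performance requires multimodal composition does not always hold.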

详情
AI中文摘要

组合图像检索(CIR)是一种多模态检索任务,其中查询由参考图像和文本修改组成,目标是检索满足两者条件的目标图像。本文表明,这一假设并不总成立。在四个广泛使用的CIR基准和十一种通用多模态嵌入模型中,大量查询可通过单一模态解决(32.2%至83.6%),揭示了普遍存在的单模态捷径。为此,我们进行了两阶段审核:首先通过跨模型分析识别捷径可解查询;其次在4741个捷径不可解查询上进行人工验证,发现仅1689个查询结构合理,常见问题包括模糊编辑和目标不匹配。重新评估模型在验证子集上的表现显示,查询无法再仅通过单一模态解决,成功检索需结合两种输入。虽然准确率下降,但对多模态信息的依赖增加。整体而言,当前CIR基准将捷径可解、噪声和真正组合性查询混为一谈,导致对模型多模态能力的高估。

英文摘要

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.

2605.14068 2026-05-19 cs.CV cs.LG 版本更新

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

CurveBench:一种用于嵌套乔丹曲线精确拓扑推理的基准

Amirreza Mohseni, Mona Mohammadi, Morteza Saghafian, Naser Talebizadeh Sardari

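Affiliations * Maastricht University; Cornell University; TU Wien; Pennsylvania State University

AI summary CurveBench is a benchmark for hierarchical topological reasoning from visual input. It contains 756 images of pairwise non-intersecting Jordan curves, and the structured prediction task is to recover the rooted containment tree induced by the curves.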

详情
AI中文摘要

我们介绍了CurveBench,一种用于从视觉输入中进行层次拓扑推理的基准。CurveBench包含756张图像,这些图像中的乔丹曲线在易、多边形、地形启发、迷宫状和密集计数配置下成对不相交。每张图像都标注了一个根树,编码平面区域之间的包含关系。我们将任务定义为结构预测:给定一张图像,模型必须恢复由曲线诱导的完整根树。尽管该任务在视觉上简单,但最强的评估模型Gemini 3.1 Pro在CurveBench-Easy上仅达到71.1%的树生成准确率,在CurveBench-Hard上仅为19.1%。我们进一步通过RLVR风格的微调展示了基准的实用性。我们的训练Qwen3-VL-8B模型在CurveBench-Easy上将Qwen-3-VL-8B-Thinking的树生成准确率从2.8%提升到33.3%,超过GPT-5.4和Claude Opus 4.5。剩余的差距,尤其是在CurveBench-Hard上,表明精确的拓扑感知视觉推理仍远未解决。

英文摘要

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen-3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
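
The target structure can be made concrete with a small sketch that builds a containment tree when the curves are already available as polygons. This is an assumption for illustration only: the benchmark provides images, not vertex lists, and every function name here is hypothetical. Because the curves are pairwise non-intersecting, testing one vertex of a curve is enough to decide whether another curve encloses it.

```python
def area(poly):
    """Absolute polygon area via the shoelace formula."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def contains(poly, point):
    """Ray-casting point-in-polygon test."""
    px, py = point
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal line through the point
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def containment_tree(curves):
    """Map each curve index to its parent index; -1 stands for the unbounded outer region."""
    parents = {}
    for i, curve in enumerate(curves):
        enclosing = [j for j, other in enumerate(curves)
                     if j != i and contains(other, curve[0])]
        # The parent is the smallest curve that encloses this one.
        parents[i] = min(enclosing, key=lambda j: area(curves[j]), default=-1)
    return parents

def square(cx, cy, half):
    """Axis-aligned square used as a toy closed curve."""
    return [(cx - half, cy - half), (cx + half, cy - half),
            (cx + half, cy + half), (cx - half, cy + half)]

curves = [square(0, 0, 10), square(-4, 0, 3), square(-4, 0, 1), square(5, 0, 2)]
tree = containment_tree(curves)
```

In this toy scene the outer square is a child of the root, two squares sit directly inside it, and the smallest square nests one level deeper. The benchmark asks a model to produce the same kind of tree directly from pixels.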

2605.13161 2026-05-19 cs.CV cs.LG 版本更新

A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning

A₃B₂:一种自适应非对称适配器,用于缓解视觉-语言图像分类中的分支偏差

Yiyun Zhou, Zhonghua Jiang, Wenkang Han, Kunxi Li, Mingjing Xu, Chang Yao, Jingyuan Chen

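Affiliations * Zhejiang University; Swansea University

AI summary This paper proposes the A₃B₂ adapter, which introduces Uncertainty-Aware Adapter Dampening to alleviate Branch Bias in few-shot learning. Experiments across 11 datasets show that it consistently outperforms 11 prompt- and adapter-based baselines.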

Comments Accepted by IJCAI 2026

详情
AI中文摘要

高效的迁移学习方法为大规模视觉-语言模型(例如CLIP)提供了强大的少样本迁移能力,但现有适配方法遵循固定微调范式,隐含假设图像和文本分支的重要性是均匀的,这一假设在图像分类中未被系统研究。通过深入分析,我们揭示了视觉-语言图像分类中的分支偏差问题:在分布外设置下,适配图像编码器并不总能提高性能。受此启发,我们提出了A₃B₂,一种自适应非对称适配器,用于缓解少样本学习中的分支偏差。A₃B₂引入了不确定性感知适配器阻尼(UAAD),在预测不确定性较高时自动抑制图像分支适配,实现软且数据驱动的控制,无需手动干预。在架构上,A₃B₂采用了一种轻量级非对称设计,受混合专家启发,结合负载平衡正则化。在三个少样本图像分类任务上,对11个数据集的广泛实验表明,A₃B₂在多个数据集上一致优于11个竞争的提示和适配基线方法。

英文摘要

Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A$_3$B$_2$, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A$_3$B$_2$ introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A$_3$B$_2$ adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A$_3$B$_2$ consistently outperforms 11 competitive prompt- and adapter-based baselines.

2605.11555 2026-05-19 cs.CV 版本更新

ScribbleDose: Scribble-Guided Dose Prediction in Radiotherapy

ScribbleDose:放射治疗中的涂鸦引导剂量预测

Zhenxi Zhang, Yitao Zhuang, Yao Pu, Peixin Yu, Zirong Li, Yan Xia, Hui Li, Bin Li, Fuchen Zheng, Ge Ren

Affiliations * Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR; The Hong Kong Polytechnic University Shenzhen Research Institute, The Hong Kong Polytechnic University, China; Department of Orthodontics and Orofacial Orthopedics, Friedrich-Alexander-University Erlangen-Nuremberg, Germany; Institute of Scientific Instrumentation, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Department of Computer and Information Science, University of Macau, Macau SAR

AI summary This paper proposes a dose prediction framework driven by sparse scribbles. A Scribble Completion Module generates dense anatomical masks and a Structure-Guided Dose Generation Module conditions dose prediction on them, maintaining strong dose prediction performance while reducing annotation cost.

Comments Preprint of the submitted version before peer review. The final Version of Record will be available in the MICCAI 2026 proceedings published by Springer

详情
AI中文摘要

在放射治疗剂量预测中,解剖结构掩膜被广泛应用,因其能提供明确的几何约束以促进结构-剂量耦合。然而,传统手动标注这些掩膜需要精确标注与放射治疗相关的结构边界,这耗时且劳动密集。为解决这些限制,我们提出了一种基于涂鸦引导的剂量预测框架,仅依赖于用稀疏涂鸦标注的解剖结构。具体来说,我们设计了一个涂鸦完成模块(SCM),通过将稀疏涂鸦标签传播到语义相似的体素中,生成密集的解剖学掩膜。在传播过程中,引入基于监督体素的正则化以保持几何边界一致性,以确保解剖学的合理性。此外,我们提出了一种结构引导剂量生成模块(SGDGM),以加强稀疏结构提示与剂量分布之间的对应关系。在此基础上,从涂鸦中获得的密集掩膜用作结构引导,以条件剂量预测网络。这种涂鸦-掩膜-剂量一致性鼓励高剂量集中在目标体积内,同时有效保护周围危险器官。在公开的GDP-HMM数据集上的广泛实验表明,所提方法在保持优越的剂量预测性能的同时,显著降低了标注成本,为稀疏结构标注下的剂量预测提供了实用的范式。代码和重新标注的涂鸦在https://github.com/iCherishxixixi/ScribbleDose上公开。

英文摘要

Anatomical structure masks are widely adopted in radiotherapy dose prediction, as they provide explicit geometric constraints that facilitate structure-dose coupling. However, conventional manual delineation of these masks requires precise annotation of structure boundaries relevant to radiotherapy, which is time-consuming and labor-intensive. To address these limitations, we propose a scribble-guided dose prediction framework that relies solely on anatomical structures annotated with sparse scribbles. Specifically, we design a Scribble Completion Module (SCM) to generate dense anatomical masks by propagating sparse scribble labels to semantically similar voxels. During the propagation process, a supervoxel-based regularization is introduced to preserve geometric boundary consistency to ensure anatomical plausibility. Furthermore, we propose a Structure-Guided Dose Generation Module (SGDGM) to strengthen the correspondence between sparse structural cues and dose distribution. Herein, the completed dense masks derived from scribbles serve as structural guidance to condition the dose prediction network. This scribble-mask-dose consistency encourages high-dose concentration within target volumes while effectively sparing surrounding organs-at-risk. Extensive experiments on the open-source GDP-HMM dataset demonstrate that the proposed method maintains superior dose prediction performance while substantially reducing annotation cost, providing a practical paradigm for dose prediction under sparse structural annotation. The code and reannotated scribbles are made publicly available at https://github.com/iCherishxixixi/ScribbleDose.

2602.07272 2026-05-19 cs.CV cs.GR 版本更新

VideoNeuMat: Neural Material Extraction from Generative Video Models

VideoNeuMat:从生成式视频模型中提取神经材质

Bowen Xue, Saeed Hadadan, Zheng Zeng, Fabrice Rousselle, Zahra Montazeri, Milos Hasan

发表机构 * University of Manchester(曼彻斯特大学) NVIDIA(NVIDIA公司) University of California Santa Barbara(加州大学圣巴巴拉分校)

AI总结 VideoNeuMat通过两阶段流程从视频扩散模型中提取可重用的神经材质:先在受控的相机和光照轨迹下生成材质样本视频,再通过大型重建模型重建紧凑的神经材质,其真实感和多样性超越有限的合成训练数据。

详情
AI中文摘要

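为3D渲染创建逼真的材质需要高超的艺术技巧。材质生成模型可以提供帮助,但目前受限于高质量训练数据的缺乏。尽管近期的视频生成模型能够轻松生成逼真的材质外观,但这些知识仍与几何和光照纠缠在一起。我们提出VideoNeuMat,一个两阶段流程,从视频扩散模型中提取可重用的神经材质资产。首先,我们微调一个大型视频模型(Wan 2.1 14B),使其在受控的相机和光照轨迹下生成材质样本视频,相当于构建了一个“虚拟测角反射计”,在学习结构化测量模式的同时保留模型的材质真实感。其次,我们通过一个大型重建模型(LRM)从这些视频中重建紧凑的神经材质,该模型由较小的Wan 1.3B视频主干微调而来。基于17帧生成的视频帧,我们的LRM通过单次前向推理预测神经材质参数,并可泛化到新的视角和光照条件。所得材质的真实感和多样性远超有限的合成训练数据,表明材质知识可以成功地从互联网规模的视频模型迁移到独立、可重用的神经3D资产中。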

英文摘要

Creating photorealistic materials for 3D rendering requires exceptional artistic skill. Generative models for materials could help, but are currently limited by the lack of high-quality training data. While recent video generative models effortlessly produce realistic material appearances, this knowledge remains entangled with geometry and lighting. We present VideoNeuMat, a two-stage pipeline that extracts reusable neural material assets from video diffusion models. First, we finetune a large video model (Wan 2.1 14B) to generate material sample videos under controlled camera and lighting trajectories, effectively creating a "virtual gonioreflectometer" that preserves the model's material realism while learning a structured measurement pattern. Second, we reconstruct compact neural materials from these videos through a Large Reconstruction Model (LRM) finetuned from a smaller Wan 1.3B video backbone. From 17 generated video frames, our LRM performs single-pass inference to predict neural material parameters that generalize to novel viewing and lighting conditions. The resulting materials exhibit realism and diversity far exceeding the limited synthetic training data, demonstrating that material knowledge can be successfully transferred from internet-scale video models into standalone, reusable neural 3D assets.

2601.22537 2026-05-19 eess.IV cs.CV 版本更新

EndoCaver: Handling Fog, Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation

EndoCaver:通过联合去模糊与分割处理内窥镜图像中的雾、模糊和眩光

Zhuoyu Wu, Wenhui Ou, Pei-Sze Tan, Jiayan Yang, Wenqi Fang, Zheng Wang, Raphaël C.-W. Phan

发表机构 * 1 CyPhi AI Research Lab, School of IT, Monash University, Malaysia Campus 2 Department of Electronic & Computer Engineering, The Hong Kong University of Science and Technology 3 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

AI总结 本文提出EndoCaver,一种轻量级Transformer,通过联合去模糊与分割任务,有效处理内窥镜图像中的雾、模糊和眩光问题,提升结直肠癌筛查中自动化息肉检测的性能。

Comments Accepted for publication at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026

详情
AI中文摘要

内窥镜图像分析对结直肠癌筛查至关重要,但真实场景常受镜头起雾、运动模糊和镜面高光的影响,严重妨碍自动化息肉检测。我们提出EndoCaver,一种轻量级Transformer,采用单向引导的双解码器架构,可联合完成图像去模糊和分割两项任务,同时显著降低计算复杂度和模型参数量。具体而言,它集成了用于跨尺度聚合的全局注意力模块(GAM)、用于传递恢复线索的去模糊-分割对齐器(DSA),以及用于稳定多任务优化的基于余弦的调度器(LoCoS)。在Kvasir-SEG数据集上的实验表明,EndoCaver在干净数据上达到0.922的Dice分数,在严重图像退化下达到0.889,优于现有最先进方法,同时将模型参数减少90%。这些结果证明了其效率和鲁棒性,使其适合在设备端进行临床部署。代码可在 https://github.com/ReaganWu/EndoCaver 获取。

英文摘要

Endoscopic image analysis is vital for colorectal cancer screening, yet real-world conditions often suffer from lens fogging, motion blur, and specular highlights, which severely compromise automated polyp detection. We propose EndoCaver, a lightweight transformer with a unidirectional-guided dual-decoder architecture, enabling joint multi-task capability for image deblurring and segmentation while significantly reducing computational complexity and model parameters. Specifically, it integrates a Global Attention Module (GAM) for cross-scale aggregation, a Deblurring-Segmentation Aligner (DSA) to transfer restoration cues, and a cosine-based scheduler (LoCoS) for stable multi-task optimisation. Experiments on the Kvasir-SEG dataset show that EndoCaver achieves 0.922 Dice on clean data and 0.889 under severe image degradation, surpassing state-of-the-art methods while reducing model parameters by 90%. These results demonstrate its efficiency and robustness, making it well-suited for on-device clinical deployment. Code is available at https://github.com/ReaganWu/EndoCaver.
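
The Dice figures quoted above (0.922 and 0.889) refer to the standard overlap measure between a predicted and a reference segmentation mask. The following is a minimal sketch of that measure for binary masks, not the authors' evaluation code; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks (nested lists)."""
    flat_p = [int(bool(v)) for row in pred for v in row]
    flat_t = [int(bool(v)) for row in target for v in row]
    if len(flat_p) != len(flat_t):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(flat_p, flat_t))
    total = sum(flat_p) + sum(flat_t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

A dataset-level score is typically an average of per-image values, although the exact aggregation used in the paper is not stated in the abstract.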

2509.15357 2026-05-19 cs.CV cs.LG 版本更新

MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

MaskAttn-SDXL:可控区域级文本到图像生成

Yu Chang, Jiahao Chen, Anzhe Cheng, Paul Bogdan

发表机构 * University of Southern California(南加州大学) University of Toronto(多伦多大学)

AI总结 本文提出MaskAttn-SDXL模块,通过在softmax前注入token条件空间门控,解决扩散模型在多物体生成中的全局协调和可靠性问题,无需改变SDXL流程。

Comments Accepted to the 2026 International Joint Conference on Neural Networks (IJCNN 2026)

详情
AI中文摘要

扩散模型在文本到图像生成中取得了强劲成果,但随着提示变得更加结构化且包含多个对象,仍存在重要局限。在架构方面,U-Net主干高效稳定,但其局部性使全局协调更加困难;基于Transformer的扩散模型改善了全局交互,但计算和内存成本显著增加。同时,组合可靠性仍然较弱:模型常在对象之间混淆属性、违反空间关系或遗漏所请求的实体,而这些错误无法被FID或基于CLIP的分数等全局指标可靠反映。为在不改变SDXL流程的前提下解决这些问题,我们提出MaskAttn-SDXL,一种插件模块,在softmax之前向交叉注意力logits注入以token为条件的空间门控。该门控使token到位置的交互稀疏化,以抑制无关绑定,同时保留预训练主干和标准采样过程,无需外部监督或推理时编辑。

英文摘要

Diffusion models have achieved strong results in text-to-image generation, but important limitations remain as prompts become more structured and multi-object. On the architecture side, U-Net backbones are efficient and stable, yet their locality makes global coordination harder, while Transformer-based diffusion models improve global interactions but at substantially higher compute and memory cost. In parallel, compositional reliability remains weak: models often mix attributes across objects, violate spatial relations, or omit requested entities, and these errors are not reliably reflected by global metrics such as FID or CLIP-based scores. To address these issues without changing the SDXL pipeline, we propose MaskAttn-SDXL, a plug-in module that injects token-conditioned spatial gating into cross-attention logits before softmax. The gating sparsifies token-to-location interactions to suppress irrelevant bindings while preserving the pretrained backbone and standard sampling process, requiring no external supervision or inference-time editing.
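
To make "gating cross-attention logits before softmax" concrete, here is a minimal sketch under stated assumptions. The abstract does not specify how the gate is computed or combined with the logits; this example assumes a gate value in [0, 1] per (location, token) pair that is added in log space, which is one common way to apply a soft mask. The names `gated_cross_attention_weights`, `gate`, and `floor` are hypothetical and not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def gated_cross_attention_weights(logits, gate, floor=1e-6):
    """logits[i][j]: score of spatial location i against text token j.
    gate[i][j] in [0, 1]: how strongly token j is allowed at location i
    (assumed form; the paper's actual gate is not described in the abstract).
    Adding log(gate) before softmax scales each unnormalised weight by gate."""
    weights = []
    for logit_row, gate_row in zip(logits, gate):
        gated = [l + math.log(max(g, floor)) for l, g in zip(logit_row, gate_row)]
        weights.append(softmax(gated))
    return weights


# Two locations, two tokens. Token 1 is gated off at location 0 only.
logits = [[1.0, 1.0], [1.0, 1.0]]
gate = [[1.0, 0.0], [1.0, 1.0]]
w = gated_cross_attention_weights(logits, gate)
```

In this toy case, location 0 puts almost all of its attention on token 0, while location 1 keeps the ungated uniform split. Because the gate acts only on the logits, the rest of the attention computation is untouched, which is consistent with the abstract's claim that the pretrained backbone and sampler are preserved.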

2508.01608 2026-05-19 cs.CV 版本更新

From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models

从像素到地点:一个系统性基准,用于评估大语言模型中的图像地理定位能力

Lingyao Li, Runlong Yu, Qikai Hu, Bowei Li, Min Deng, Yang Zhou, Xiaowei Jia

发表机构 * University of South Florida, Tampa, USA; University of Alabama, Tuscaloosa, USA; University of Michigan, Ann Arbor, USA; Texas Tech University, Lubbock, USA; Texas A&M University, College Station, USA; University of Pittsburgh, Pittsburgh, USA

AI总结 本文提出IMAGEO-Bench基准,系统评估大语言模型在图像地理定位中的准确性、距离误差、地理空间偏见和推理过程,发现闭源模型总体表现更强,且各模型在高资源区域的表现优于代表性不足的区域。

详情
AI中文摘要

图像地理定位,即识别图像中所描绘的地理位置,对危机响应、数字取证和基于位置的智能应用十分重要。尽管近期大语言模型(LLMs)的进展为视觉推理带来了新机会,但其图像地理定位能力仍缺乏充分研究。本文提出IMAGEO-Bench基准,系统评估准确性、距离误差、地理空间偏见和推理过程。该基准包含三个多样化的数据集,涵盖全球街景、美国的兴趣点(POIs)以及一个由未见过的图像组成的私有集合。通过在10种最先进的LLMs(包括开源和闭源模型)上的实验,我们揭示了明显的性能差异,闭源模型总体上展现出更强的推理能力。重要的是,我们发现了地理空间偏见:LLMs往往在高资源区域(如北美、西欧和加州)表现更好,而在代表性不足的区域表现下降。回归诊断表明,成功的地理定位主要依赖于识别城市环境、户外环境、街道级影像和可辨识的地标。总体而言,IMAGEO-Bench为考察LLMs的空间推理能力提供了严谨的视角,并为构建具备地理定位意识的AI系统提供了启示。

英文摘要

Image geolocalization, the task of identifying the geographic location depicted in an image, is important for applications in crisis response, digital forensics, and location-based intelligence. While recent advances in large language models (LLMs) offer new opportunities for visual reasoning, their ability to perform image geolocalization remains underexplored. In this study, we introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process. Our benchmark includes three diverse datasets covering global street scenes, points of interest (POIs) in the United States, and a private collection of unseen images. Through experiments on 10 state-of-the-art LLMs, including both open- and closed-source models, we reveal clear performance disparities, with closed-source models generally showing stronger reasoning. Importantly, we uncover geospatial biases as LLMs tend to perform better in high-resource regions (e.g., North America, Western Europe, and California) while exhibiting degraded performance in underrepresented areas. Regression diagnostics demonstrate that successful geolocalization is primarily dependent on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks. Overall, IMAGEO-Bench provides a rigorous lens into the spatial reasoning capabilities of LLMs and offers implications for building geolocation-aware AI systems.
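
The "distance error" metric mentioned above can be illustrated with a great-circle distance between predicted and true coordinates. The sketch below uses the haversine formula, which is a common choice for this purpose; whether IMAGEO-Bench uses exactly this formula, and which distance thresholds it reports, is not stated in the abstract, so `haversine_km`, `accuracy_within`, and the 25 km threshold are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_within(errors_km, threshold_km):
    """Fraction of predictions whose distance error is at most threshold_km."""
    return sum(e <= threshold_km for e in errors_km) / len(errors_km)


# Hypothetical per-image errors in km, scored at an assumed 25 km threshold.
errors = [1.0, 30.0, 500.0]
acc_25 = accuracy_within(errors, 25.0)
half_circumference = haversine_km(0.0, 0.0, 0.0, 180.0)
```

Reporting both the raw error distribution and thresholded accuracy is useful because a handful of predictions on the wrong continent can dominate a mean error while leaving thresholded accuracy almost unchanged.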

2505.17352 2026-05-19 cs.CV 版本更新

Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey

通过强化学习和奖励建模对扩散模型的对齐与安全性:综述

Preeti Lamba, Kiran Ravish, Ankita Kushwaha, Pawan Kumar

发表机构 * International Institute of Information Technology(国际信息科技学院)

AI总结 本文综述了通过强化学习、奖励建模等方法实现文本到图像扩散模型对齐与安全的最新进展,沿反馈来源、奖励信号形式、优化机制等五个维度梳理文献,并提出了多目标对齐、反馈高效的偏好学习等开放性挑战。

Comments 3 figures, 1 table

详情
AI中文摘要

扩散模型已成为图像和多模态生成的核心范式,但其部署引发了关于对齐、安全性、偏好满足和抗滥用鲁棒性的持续疑问。本文综述了通过强化学习、奖励建模、偏好优化和面向安全的微调来对齐文本到图像扩散模型的最新进展。我们沿五个维度组织文献:反馈来源、奖励或偏好信号的形式、优化机制、对分布偏移和奖励过度优化的处理,以及安全在多大程度上被作为显式约束而非一般偏好来处理。本文涵盖了基于人类反馈的强化学习、KL正则化策略优化、直接偏好优化、二元效用优化、可微奖励微调、代理奖励学习、区域感知微调以及面向安全的DPO变体。为使综述易于理解,我们包含了对扩散采样、奖励建模和偏好优化的教程式讲解,并简要说明了图像扩散对齐与新兴的文本扩散模型和掩码语言扩散模型之间的联系。我们还从反馈需求、计算成本、可扩展性、对奖励黑客(reward hacking)的易感性以及对安全关键部署的适用性等方面比较了代表性方法。最后,我们将文献归纳为一组开放性挑战:多目标对齐、反馈高效的偏好学习、对抗鲁棒的安全对齐、规范变化下的持续对齐以及可解释的奖励建模。本文的目标是为新兴的扩散模型对齐领域提供一幅连贯的技术图谱,并识别在对齐后的生成模型能够可靠部署之前必须解决的方法学缺口。

英文摘要

Diffusion models have become a central paradigm for image and multimodal generation, yet their deployment raises persistent questions about alignment, safety, preference satisfaction, and robustness to misuse. This survey reviews recent progress on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning. We organize the literature along five axes: the source of feedback, the form of the reward or preference signal, the optimization mechanism, the treatment of distribution shift and reward overoptimization, and the extent to which safety is addressed as an explicit constraint rather than a generic preference. The review covers reinforcement learning from human feedback, KL-regularized policy optimization, direct preference optimization, binary utility optimization, differentiable reward fine-tuning, surrogate reward learning, region-aware fine-tuning, and safety-oriented DPO variants. To make the survey accessible, we include tutorial explanations of diffusion sampling, reward modeling, and preference optimization, and briefly connect image diffusion alignment to emerging text and masked language diffusion models. We also compare representative methods in terms of feedback requirements, computational cost, scalability, susceptibility to reward hacking, and suitability for safety-critical deployment. Finally, we synthesize the literature into a set of open challenges: multi-objective alignment, feedback-efficient preference learning, adversarially robust safety alignment, continual alignment under changing norms, and interpretable reward modeling. The goal of this survey is to provide a coherent technical map of the emerging area of diffusion model alignment and to identify the methodological gaps that must be addressed before aligned generative models can be reliably deployed.
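
Among the methods the survey lists, direct preference optimization (DPO) has a compact loss that is easy to show on scalars. The sketch below implements the standard DPO objective for a single preference pair, given log-probabilities of the preferred and rejected samples under the trained policy and a frozen reference model. It is an illustration of the general formula, not of any specific method in the survey: diffusion variants typically cannot evaluate exact log-likelihoods and substitute tractable surrogates, and the function name `dpo_loss` and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    loss = -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    where w is the preferred sample and l is the rejected one.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == softplus(-m), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# If the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Raising the preferred sample's log-probability relative to the reference
# (and lowering the rejected one's) reduces the loss.
loss_improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

The `beta` parameter plays the role of the KL-regularization strength mentioned in the abstract: a larger value penalises deviation from the reference model less per unit of preference margin, so it is usually tuned alongside the learning rate.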