2606.18992 2026-06-18 cs.CV New submission

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

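AI summary: Proposes CLARA, a framework that shows users a panel of visual alternatives to choose from and uses likelihood-ratio reweighted calibration to guarantee coverage across multiple turns; it disambiguates composed image retrieval effectively and outperforms text-question baselines.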

Abstract

Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.
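The abstract says that CLARA reweights calibration by the likelihood ratio induced by the user's selection. The sketch below shows one generic way such reweighting can work, in the style of weighted split conformal prediction. It is not the authors' implementation: the function name, the scores, and the weights are illustrative assumptions.

```python
import math


def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    scores      -- nonconformity scores of calibration examples (hypothetical)
    weights     -- likelihood ratio assigned to each calibration example (assumed known)
    test_weight -- likelihood ratio assigned to the test query
    alpha       -- target miscoverage rate
    """
    total = sum(weights) + test_weight
    target = (1.0 - alpha) * total
    cumulative = 0.0
    # Walk the calibration scores in ascending order until enough weight is covered.
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= target:
            return score
    # Calibration mass alone cannot reach the target: keep every candidate.
    return math.inf


# Uniform weights recover the ordinary split-conformal quantile.
uniform = weighted_conformal_threshold([1, 2, 3, 4], [1, 1, 1, 1], 0.0, 0.5)
# Upweighting the hardest calibration example raises the threshold.
reweighted = weighted_conformal_threshold([1, 2, 3, 4], [1, 1, 1, 5], 0.0, 0.5)
print(uniform, reweighted)
```

Candidates whose score falls at or below the returned threshold would stay in the prediction set; after each user selection the weights would be recomputed, which is the step the abstract attributes to the likelihood ratio.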

2606.19100 2026-06-18 cs.CV New submission

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

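Affiliations: NOVA School of Science and Technology; NOVA LINCS

AI summary: To address the lack of open-source multimodal models for European Portuguese, proposes AMALIA-VL, which uses three-stage training and a Portuguese-centric data mix to establish a strong baseline, with all resources released openly.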

Abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix that combines curated and translated public datasets with novel datasets addressing the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

2606.19277 2026-06-18 cs.CV New submission

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

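Affiliations: Engineering, North Carolina A&T State University, Greensboro, NC, USA; College of Science and Technology, North Carolina A&T State University, Greensboro, NC, USA

AI summary: Proposes RS Adapter, a parameter-efficient fine-tuning strategy that injects lightweight bottleneck adapters into three vision-language model architectures, enabling remote sensing VQA with fewer than 5% trainable parameters; the hybrid FLAVA architecture strikes the best balance between multimodal reasoning and retrieval.

Comments 4 pages, 2 figures; accepted and to be presented at the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington, D.C.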

Abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain foundation models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive cost of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the dual-encoder CLIP, the encoder-decoder BLIP, and the hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of parameters trainable. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models converge, the hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
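To make the "less than 5 percent" figure concrete, here is a minimal sketch of a residual bottleneck adapter and of how its parameter share can be estimated. The layer sizes and the backbone parameter count are hypothetical round numbers, not values from the paper, and the adapter form (down-projection, ReLU, up-projection, residual) is the generic design rather than the confirmed RS Adapter layout.

```python
def adapter_param_count(hidden_dim, bottleneck_dim):
    """Parameters of one bottleneck adapter: two projections plus their biases."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


def trainable_fraction(backbone_params, hidden_dim, bottleneck_dim,
                       num_layers, adapters_per_layer=2):
    """Share of all parameters that train when only the adapters are unfrozen."""
    adapters = num_layers * adapters_per_layer * adapter_param_count(
        hidden_dim, bottleneck_dim)
    return adapters / (backbone_params + adapters)


def adapter_forward(h, w_down, w_up):
    """Residual bottleneck: h + W_up @ relu(W_down @ h); biases omitted for brevity."""
    z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in w_down]
    delta = [sum(w * v for w, v in zip(row, z)) for row in w_up]
    return [x + d for x, d in zip(h, delta)]


# Hypothetical backbone: 12 layers, width 768, about 86M frozen parameters,
# with one adapter after attention and one after the MLP in each layer.
fraction = trainable_fraction(86_000_000, 768, 64, 12)
print(f"trainable share: {fraction:.2%}")
```

Initializing the up-projection to zero makes `adapter_forward` return its input unchanged, so the frozen backbone's behavior is preserved at the start of training.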

2606.19338 2026-06-18 cs.CV New submission

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

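Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; Zhejiang University; The Chinese University of Hong Kong

AI summary: Proposes the RNG-Bench benchmark suite, which uses two games, Matching Pairs and 3D Maze, to evaluate how well multimodal large models reconstruct past observations and act on them in non-Markov environments; it finds that errors stem mainly from forgetting rather than decision making, and that fine-tuning improves performance.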

Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

2606.19341 2026-06-18 cs.CV cs.CL cs.SD New submission

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

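Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

AI summary: Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle, which uses active perception to decouple reasoning complexity from video duration and achieves the best performance among open-source models on multiple benchmarks.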

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2606.18583 2026-06-18 cs.CV cs.RO New submission

Mem-World: A Memory-Augmented Action-Conditioned World Model for Persistent Robotic Manipulation

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

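Affiliations: Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

AI summary: Proposes Mem-World, which uses W-VMem, a 4D wrist-view surfel-indexed memory, to address scene forgetting caused by occlusion and motion during manipulation, enabling persistent world modeling and improving both policy evaluation and policy improvement.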

Abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
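The policy-evaluation claim rests on the Pearson correlation between success rates measured inside the world model and those measured on the real robot. A small sketch of that computation follows; the per-policy success rates are made-up illustrative values, not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical success rates of five policies, in simulation and on hardware.
world_model_success = [0.20, 0.35, 0.50, 0.65, 0.80]
real_world_success = [0.15, 0.40, 0.45, 0.70, 0.75]
print(pearson(world_model_success, real_world_success))
```

A value near 1 means the world model ranks and spaces policies much as real experiments do, which is what makes it usable as a stand-in for costly hardware evaluation.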

2606.19258 2026-06-18 cs.CV cs.RO New submission

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

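Affiliations: School of Computer Science and Technology, Soochow University

AI summary: To tackle crowd counting in low-light environments, builds three new benchmark datasets and proposes a Multi-Modal Hyper-Graph Fusion module and a Deformable Rectangular Sparse Attention module, which together form the low-light counting network LCNet; it achieves the best performance on all three benchmarks.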

Abstract

Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

2606.18582 2026-06-18 cs.CV cs.RO eess.IV New submission

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

Jaeil Park, Hyobin Choi, Sangjin Lee, Hyungtae Lim, Sung-Hoon Yoon

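Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST); Massachusetts Institute of Technology (MIT)

AI summary: Proposes a network design combining a self-supervised DINOv3 backbone, a ViT-Adapter, and a Mask2Former decoder, together with an inference strategy of multi-scale test-time augmentation and model ensembling; it takes first place in the 64-class fine-grained off-road semantic segmentation challenge with a composite score of 76.57%.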

Comments 5 pages, 4 figures

Abstract

The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results-tab.
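Two parts of this abstract lend themselves to a quick worked check. First, the three reported numbers are consistent with the composite score being the arithmetic mean of the fine-class and category-level mIoU; the abstract does not state the formula, so treat that as an inference. Second, horizontal-flip test-time augmentation can be sketched generically as averaging a prediction with the un-flipped prediction of the mirrored input. The `predict` callable and the toy 2D score maps are hypothetical stand-ins for a real segmentation model.

```python
fine_class_miou = 69.32
category_miou = 83.81
# Assumption: composite = mean of the two mIoU values (matches the reported 76.57).
composite = (fine_class_miou + category_miou) / 2
print(f"{composite:.3f}")


def flip_tta(predict, image):
    """Average per-pixel scores from an image and its horizontal mirror.

    predict -- hypothetical callable mapping a 2D list to a 2D list of scores
    image   -- 2D list of pixel values (rows of columns)
    """
    flipped = [row[::-1] for row in image]
    direct = predict(image)
    # Mirror the second prediction back so both maps are pixel-aligned.
    mirrored = [row[::-1] for row in predict(flipped)]
    return [[(a + b) / 2 for a, b in zip(r1, r2)]
            for r1, r2 in zip(direct, mirrored)]


# Toy "model" whose output depends on column position, so flipping matters.
def toy_predict(img):
    return [[value * col for col, value in enumerate(row)] for row in img]


print(flip_tta(toy_predict, [[1.0, 3.0]]))
```

Multi-scale augmentation and checkpoint ensembling follow the same pattern: each variant's score map is brought back to a common resolution and the maps are averaged before taking the per-pixel argmax.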

2606.18783 2026-06-18 cs.CV New submission

SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection

Yunus Sevim, Behçet Uğur Töreyin

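Affiliations: Aselsan; Istanbul Technical University

AI summary: Proposes the REEM framework, which uses the signal-to-clutter ratio as a visibility prior and differentiably modulates the soft-IoU loss to improve detection of low-visibility targets, with no additional parameters or inference overhead.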

Comments Accepted at CVPR 2026 Workshops (PBVS). Published version: https://openaccess.thecvf.com/content/CVPR2026W/PBVS/html/Sevim_SCR-Guided_Difficulty-Aware_Optimization_for_Infrared_Small_Target_Detection_CVPRW_2026_paper.html

Abstract

Infrared small target detection remains challenging due to severe background clutter, low contrast, and weak spatial responses where geometric overlap alone is insufficient to characterize detection quality. In this work, we propose REEM (Reweighted Explicit-visibility Enhanced Modulation), a lightweight SCR-guided difficulty-aware optimization framework that incorporates Signal-to-Clutter Ratio (SCR) as a physically meaningful visibility prior during training. Instead of modifying the network architecture or directly optimizing SCR, REEM computes a ground-truth local SCR from the input image and applies a differentiable modulation to the soft-IoU learning signal, emphasizing low-visibility targets while preserving stable optimization and identical inference behavior. REEM is integrated into a U-Net-based MSHNet without introducing additional parameters, architectural modifications, or inference-time overhead. Extensive experiments demonstrate consistent improvements over the baseline, achieving higher IoU and detection probability (Pd) together with substantially reduced false alarms (FA), particularly under challenging low-visibility conditions. These results suggest that SCR-guided difficulty-aware optimization provides an effective and physically grounded complement to conventional overlap-based objectives for infrared small target detection. The code is available at https://github.com/yall-in-one/Reemm.
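The idea of weighting the soft-IoU signal by target visibility can be illustrated with a short sketch. The SCR formula below is the common definition (absolute difference between target and local background means, divided by the background standard deviation). The weighting function `scr_weight` is a hypothetical choice made for illustration, since the abstract does not give REEM's actual modulation, and this plain-Python version shows only the forward computation.

```python
import math


def local_scr(target_pixels, background_pixels):
    """Signal-to-clutter ratio: |mean_target - mean_background| / std_background."""
    mean_t = sum(target_pixels) / len(target_pixels)
    mean_b = sum(background_pixels) / len(background_pixels)
    var_b = sum((p - mean_b) ** 2 for p in background_pixels) / len(background_pixels)
    return abs(mean_t - mean_b) / (math.sqrt(var_b) + 1e-6)


def soft_iou_loss(pred, mask):
    """1 - soft IoU between predicted probabilities and a binary mask (flattened)."""
    intersection = sum(p * m for p, m in zip(pred, mask))
    union = sum(pred) + sum(mask) - intersection
    return 1.0 - intersection / (union + 1e-6)


def scr_weight(scr, gamma=1.0):
    """Hypothetical modulation in (1, 2]: low-SCR (hard) targets get more weight."""
    return 1.0 + math.exp(-gamma * scr)


def modulated_loss(pred, mask, scr):
    """Soft-IoU loss scaled by the visibility-dependent weight."""
    return scr_weight(scr) * soft_iou_loss(pred, mask)


# A dim target (low SCR) contributes more loss than a bright one for the same error.
pred, mask = [0.6, 0.1, 0.2], [1.0, 0.0, 0.0]
print(modulated_loss(pred, mask, scr=0.5), modulated_loss(pred, mask, scr=5.0))
```

Because the weight is computed from the input image and ground truth only, nothing changes at inference time, which matches the abstract's claim of identical inference behavior.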

2606.18441 2026-06-18 cs.CV New submission

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

Affiliations: School of Information Science and Engineering, Lanzhou University; Beijing University of Posts and Telecommunications; Cloud and AI BU, Huawei; School of Computing, National University of Singapore

AI summary: Proposes CF-GRPO, a process-level reward framework that needs no temporal annotations; it builds a consensus frame prior from intrinsic video cues and uses a consensus frame reward to align the model's frame usage with that prior, improving video reasoning performance.

Abstract

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.18558 2026-06-18 cs.CV New submission

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

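Affiliations: Allen Institute for AI; University of Washington; UNC-Chapel Hill

AI summary: Proposes a language-instructed method for forecasting 3D point motion; by building a large-scale dataset and benchmark, it achieves class-agnostic, view-stable trajectory forecasting and validates its usefulness in robot manipulation and video generation.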

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.18586 2026-06-18 cs.CV cs.AI New submission

APT: Atomic Physical Transitions for Causal Video-Language Understanding

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

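Affiliations: Northwestern University; Dolby Laboratories

AI summary: Proposes Atomic Physical Transitions (APTs) as an explicit representation of causal state changes in video, builds a mixed-source dataset, and introduces the APT-Tune fine-tuning method so that VLMs learn physical transitions without forgetting event-level knowledge.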

Abstract

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

2606.19062 2026-06-18 cs.CV New submission

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

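Affiliations: Sejong University; Korea Advanced Institute of Science and Technology; Ulsan National Institute of Science and Technology

AI summary: Proposes the DREAM model, which combines dual-path representation enhancement and alignment with a hierarchical vision encoder and hybrid language modeling to set a new state of the art on video retrieval.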

Abstract

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM (Dual-path Representation Enhancement and Alignment Model), a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely used MSRVTT, MSVD, and LSMDC benchmark datasets, where it achieves new state-of-the-art R@1 scores of 49.4%, 49.7%, and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research on cross-modal representation learning.
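The headline numbers in this abstract are recall at rank 1 for text-to-video retrieval. The sketch below shows how that metric is typically computed from a query-by-video similarity matrix, assuming query i's ground-truth video is video i; the similarity values are invented for illustration.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item ranks in the top k.

    similarity[i][j] is the score of text query i against video j;
    the ground truth for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical 3-query, 3-video similarity matrix.
sim = [
    [0.9, 0.1, 0.0],  # query 0 retrieves video 0 first: hit
    [0.2, 0.3, 0.8],  # query 1 retrieves video 2 first: miss at k=1
    [0.1, 0.2, 0.7],  # query 2 retrieves video 2 first: hit
]
print(recall_at_k(sim, k=1), recall_at_k(sim, k=2))
```

With this definition, a reported R@1 of 49.4% means that for roughly half of the test queries the correct video is the single highest-scoring candidate.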

2606.18478 2026-06-18 cs.CV 新提交

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

数据强制蒸馏：恢复少步视频生成中的多样性和保真度

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao

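Affiliations: University of Michigan; NVIDIA; University of Illinois Urbana-Champaign

AI summary: To address the mode collapse and over-saturation that Distribution Matching Distillation (DMD) exhibits in few-step video generation, the paper proposes the Data-Forcing Distillation (DFD) framework, which uses a teacher score discrepancy to guide the student toward the real-data distribution and restores diversity and fidelity with a one-line code change.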

AI中文摘要

最近的进展表明，将多步视频扩散模型蒸馏为高效的少步学生模型具有前景。其中，分布匹配蒸馏（DMD）及其后继DMD2实现了强大的生成质量和快速收敛。然而，由于反向KL目标的性质，这些方法表现出两个持续的失败模式：样本多样性大幅下降，以及明显过饱和的输出偏离真实视频外观。在这项工作中，我们提出了数据强制蒸馏（DFD），一个简单的训练后框架，通过仅一行代码更改即可恢复DMD中的多样性和保真度。其核心是教师评分差异，用于引导学生朝向真实数据分布，将其拉向缺失的模式（缓解模式坍塌）并远离真实数据中不存在的问题模式（避免过饱和）。我们提供了框架的深入理论分析，并在文本到视频、图像到视频和自回归视频生成上验证了我们的方法。仅需100-300步微调，DFD就能有效恢复Wan2.1-1.3B和Cosmos-Predict2.5-2B模型上的多样性和保真度，解决过饱和伪影，显著改善视频动态和外观，甚至优于教师模型。

英文摘要

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback-Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy, which guides the student toward the real-data distribution, pulling it toward missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100-300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforming the teacher model.

2606.18591 2026-06-18 cs.CV 新提交

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

桥接创意意图与视觉质量：基于创作者驱动的循环视频生成与代理反馈循环

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang, Alexander Liu, Zhe Zhao

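Affiliations: University of California, Davis; The Harker School; Basis Independent Silicon Valley; Saratoga High

AI summary: Proposes CHIEF, a framework for iterative video refinement through human-AI collaboration that combines creator-driven direction with subjective feedback from agents to improve the narrative coherence and creative direction of long videos.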

Comments Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

AI中文摘要

生成式AI使内容创作日益普及，但许多AI生成的视频缺乏叙事连贯性和创意方向，尤其在较长时长时问题更为突出。与编码不同，AI生成受益于可靠的反馈和循环自我改进等技术，而视频生成需要关于情节、场景和叙事的主观反馈，这自然激发了融入人类创意方向的方法。我们提出了CHIEF，一个人类-AI协同创作视频生成框架，将创作者置于人机循环迭代视频精炼的中心，并通过提供自动主观反馈来支持他们。创作者通过驱动每次迭代来融入其创意方向，而他们的修订则由专门的精炼代理整合。反馈循环由基于角色条件的多模态LLM生成，这些LLM观看生成的视频并从观众角度产生主观批评，提供自我评估无法捕捉的反馈。为测试我们提出框架的有效性，我们与没有电影制作经验的高中生和大学生合作，创作从1分钟短视频到具有复杂情节的完整10分钟短片的视频。

英文摘要

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are integrated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete 10-minute short film with a complicated plot.

2606.18702 2026-06-18 cs.CV 新提交

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

UniTemp: 通过双向蒸馏实现任意时间顺序的视频生成

Lin Zhang, Sicheng Mo, Zefan Cai, Jinhong Lin, Zihao Lin, Jiuxiang Gu, Krishna Kumar Singh, Yuheng Li, Yin Li

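Affiliations: University of Wisconsin Madison; Adobe Research; University of California Los Angeles; University of California Davis

AI summary: Proposes UniTemp, a framework that uses bidirectional distillation to train a single autoregressive model for video generation in any temporal direction (forward, backward, and in-between), resolving the discontinuities that causal 3D VAEs cause in backward generation and improving controllability.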

AI中文摘要

自回归视频扩散模型已成为长视频生成的一种有前景的方法，在流式设置中表现出色。然而，现有方法仅限于前向时间生成，而实际视频创作通常需要灵活的生成顺序，例如，基于未来上下文进行后向扩展，或基于过去和未来上下文进行中间插值生成。我们通过训练一个支持任意时间方向生成的自回归模型来弥合这一差距。一个关键的技术挑战来自视频扩散模型中广泛使用的因果3D VAE，它编码的潜变量严格依赖于过去上下文。虽然这种因果结构适合前向生成，但在后向生成时会导致块间不连续性。为了解决这个问题，我们引入了块级锚点潜变量，这是一组辅助潜变量，用于在后向生成过程中恢复块边界处缺失的过去上下文。基于这一设计，我们提出了UniTemp，一个双向蒸馏框架，训练单个自回归学生模型用于任意方向的视频生成。在推理时，UniTemp可以基于任意过去和/或未来帧进行条件生成，提高了双向和中间插值生成的可控性。实验表明，与仅前向方法相比，UniTemp在短和长视频生成上保持了竞争性能，同时支持多种工作流程，如双向视频扩展、中间插值生成、循环视频生成、场景转换和视觉故事生成。项目网站：此 https URL

英文摘要

Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/

2606.18765 2026-06-18 cs.CV 新提交

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

SpectralDiT：流匹配DiT的时间步条件谱残差校正

Jiayu Tian

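Affiliations: Peking University

AI summary: Proposes SpectralDiT, a timestep-conditioned spectral residual correction module that improves the generation quality of flow-matching DiTs on CIFAR-10 and ImageNet-100 with very little extra compute and few extra parameters, reducing FID by 5.1% and 8.7%, respectively.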

AI中文摘要

我们提出SpectralDiT，一种对流匹配扩散变换器（Diffusion Transformers）的轻量级修改，它在MLP残差分支中添加了时间步条件谱校正。该模块将每个残差更新分解为补丁-令牌网格上的低频和高频分量，然后学习一个零初始化的加法门，使得模型最初与基线DiT匹配。在CIFAR-10像素空间生成中，SpectralDiT在补丁大小为1时将FID从20.78提升至19.71，并缩小了径向傅里叶谱差距。此外，我们将方法扩展到ImageNet-100上的潜在扩散。在额外理论FLOPs增加0.6%和参数增加1.36%的情况下，SpectralDiT改进了潜在流匹配，在无分类器引导（CFG 2.0）下实现了8.7%的相对FID降低。所有报告结果均为五个种子的平均值。在CIFAR-10上的消融实验和门控可视化揭示了稳定的块特定谱校正模式。

英文摘要

We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.
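To make the "zero-initialized additive gate" idea concrete, here is a minimal sketch on a 1D signal. It is not the paper's module: the abstract does not say how the frequency split is implemented, so the DFT mask, the `cutoff` parameter, the scalar gates, and all function names below are assumptions chosen for illustration. In the actual method the gates would presumably be predicted from the timestep and applied on the 2D patch-token grid.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def split_low_high(residual, cutoff):
    """Split a signal into low- and high-frequency parts that sum back to the input."""
    n = len(residual)
    spectrum = dft(residual)
    # Keep bins whose folded frequency index is at most `cutoff` (symmetric mask).
    low_spec = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    low = idft(low_spec)
    high = [r - l for r, l in zip(residual, low)]
    return low, high


def corrected_residual(residual, gate_low, gate_high, cutoff=1):
    """Add gated low/high-frequency corrections on top of the original residual."""
    low, high = split_low_high(residual, cutoff)
    # Additive form: with both gates at zero the output equals the input.
    return [r + gate_low * l + gate_high * h for r, l, h in zip(residual, low, high)]


residual = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 2.5, -0.5]
baseline = corrected_residual(residual, 0.0, 0.0)   # gates at initialization
boosted = corrected_residual(residual, 0.0, 0.5)    # amplify high frequencies only
print(baseline == residual)
```

Because the correction is additive and the gates start at zero, the corrected residual is identical to the original at initialization, which is the property the abstract describes as matching the baseline DiT. As a side check on the reported numbers, the CIFAR-10 change from FID 20.78 to 19.71 is a relative reduction of (20.78 - 19.71) / 20.78, or about 5.1%.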

2606.18788 2026-06-18 cs.CV cs.CL 新提交

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

HandwritingAgent: 语言驱动的可缩放矢量空间手写合成

Jaward Sesay, Yue Yu, Börje F. Karlsson

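Affiliations: Beijing Institute of Technology; Beijing Academy of Artificial Intelligence

AI summary: Proposes HandwritingAgent, which uses a large reasoning model to autoregressively generate handwriting stroke sequences in SVG format without style-specific training, controls style through natural language and reference images, and matches or surpasses existing state-of-the-art methods on imitation, recognition, multilingual synthesis, and complex mathematical expression synthesis.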

AI中文摘要

教会机器模仿自然手写风格仍然是一个开放挑战，因为它需要合成在形状、纹理、压力和字体上动态变化的笔画序列——不仅在不同个体之间，而且在同一个人的手写中也是如此。针对这一挑战的尝试主要探索了在线和离线环境下的深度学习方法。然而，这些方法通常受到风格特定架构选择、对大型数据集的严重依赖、高计算成本以及缺乏通过自然语言灵活控制书写风格的限制。为此，我们引入了HandwritingAgent，一个语言驱动的智能体，它可以直接在可缩放矢量图形（SVG）格式中合成自然手写序列，无需风格特定训练。该智能体利用大型推理模型在离散网格画布环境中对目标手写字形进行几何分析并自回归生成笔画序列。生成过程以对话或非对话模式提供的文本以及参考手写风格图像为条件。在涵盖模仿、识别、多语言手写合成以及复杂手写数学和科学表达式生成等多样化手写任务上的实验表明，性能有显著提升，HandwritingAgent匹配或超越了最先进的生成式手写模型，同时提供了一种更高效、可控且泛化能力更强的合成方法。

英文摘要

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure, and script, not only across individuals but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on text provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multilingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models while providing a more efficient, controllable, and generalizable synthesis method.

2606.18906 2026-06-18 cs.CV 新提交

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

BindEdit: 驯服注意力泄漏以实现精确的多目标图像编辑

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

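Affiliations: Sookmyung Women's University; Yonsei University; Samsung Research

AI summary: To address semantic blending and object duplication in multi-object image editing, the paper proposes BindEdit, which suppresses attention leakage within a single diffusion trajectory through joint regularization of cross- and self-attention, a cross-attention re-balancing mechanism, and a region fidelity term, enabling precise edits.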

Comments Preprint

AI中文摘要

真实图像编辑能够精确操作视觉内容，但现有方法在复杂的多目标场景中常常失败，导致语义混合、对象重复或编辑不完整。我们将这些失败归因于注意力泄漏，即在去噪过程中，跨空间区域和文本标记的信号变得纠缠。具体来说，我们识别出两种不同形式的泄漏：编辑-标记泄漏，其中模糊的标记-区域对齐导致对象混合；以及源主导泄漏，其中未改变的源对象的标记压倒了目标实体应有的注意力。为了解决这些泄漏，我们提出了BindEdit，它在单次扩散轨迹内强制执行注意力级别的约束。为了抑制编辑-标记泄漏，BindEdit联合正则化交叉注意力和自注意力，使得每个目标标记组绑定到其对应的空间区域，同时保持实例级别的分离。为了抑制源主导泄漏，一种交叉注意力重平衡机制放大目标标记的影响，并减弱可编辑区域内残留的源语义。此外，区域保真项确保每个目标概念在整个编辑掩码中连贯表达。另外，我们提出了一个全面的多目标基准，涵盖不同的对象数量和类别。大量实验表明，BindEdit在单次扩散轨迹内始终优于现有方法，在单目标和多目标编辑场景中均保持稳健性能。

英文摘要

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose BindEdit, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.

2606.19073 2026-06-18 cs.CV 新提交

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

驯服I2V模型用于图像HOI编辑：认知基准与智能体自校正框架

Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu

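Affiliations: Wangxuan Institute of Computer Technology, Peking University, Beijing, China; National Institute of Health Data Science, Peking University, Beijing, China

AI summary: Proposes the HOI-Edit benchmark and the SCPE framework, which use the temporal generation capability of I2V models for dynamic human-object interaction editing and iteratively refine prompts through self-correction, achieving performance competitive with SOTA.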

AI中文摘要

当前的图像编辑方法在静态属性上表现出色，但在复杂的人-物交互（HOI）上失败，这是一个关键挑战，现有基准将HOI与静态属性混淆，依赖无法同时评估动态交互有效性和纠缠的人-物对保留的全局指标。因此，我们首先引入HOI-Edit，一个包含三个渐进认知层次的综合基准，其特点是自动化指标HOI-Eval，通过让VLM在思考后对包含基础人-物对的图像进行问答，可靠地评估实例级交互。考虑到任务本质是重塑动态关系，我们对图像到视频（I2V）模型进行基准测试，发现它们由于其时间生成能力而天生适合动态编辑。关键的是，除了优越的性能，这种能力提供了“失败过程的重放”，为错误原因提供了独特的可诊断性。因此，我们提出SCPE（自校正过程编辑），一种新颖的智能体自校正框架，通过迭代优化的提示约束I2V模型的生成，使生成的视频更准确地呈现目标HOI。从这些视频中提取的帧是最终的编辑结果。在HOI-Edit上，SCPE在交互上达到了与最先进（SOTA）编辑模型（如Nano Banana）竞争的性能。代码可在该https URL获取。

英文摘要

Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI). Existing benchmarks leave this critical challenge unaddressed: they conflate HOI with static attributes and rely on global metrics that cannot simultaneously assess dynamic interaction validity and the preservation of entangled human-object pairs. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels. It features an automated metric, HOI-Eval, that reliably evaluates instance-level interaction by having a VLM answer questions after thinking with images that contain grounded human-object pairs. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to present the target HOI more accurately. Frames extracted from these videos serve as the final editing results. On HOI-Edit, SCPE achieves interaction performance competitive with state-of-the-art (SOTA) editing models such as Nano Banana. Code is available at https://github.com/oceanflowlab/HOI-Edit.

2606.19103 2026-06-18 cs.CV cs.AI 新提交

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

ProductConsistency：通过SFT和RL改进基于指令的图像编辑中的产品身份保持

Mukund Khanna, Raj Singh Yadav, Kunal Singh

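Affiliations: Fractal Analytics

AI summary: To address insufficient preservation of product features in instruction-based image editing, the paper proposes the ProductConsistency dataset and a Cyclic Consistency reward, combining supervised fine-tuning with reinforcement learning to substantially improve product consistency, text rendering, and visual quality.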

Comments CVPR HiGen 2026

AI中文摘要

近期基于指令的图像编辑的进展使模型能够根据自然语言指令执行复杂的视觉编辑。然而，在以产品为中心的场景中，保留产品特征、品牌和文本元素至关重要，当前的开源和闭源模型往往难以维持这种细粒度的对象身份。这一问题因缺乏具有文本保真度约束的基于指令的产品图像编辑数据集而进一步加剧，导致该能力在很大程度上被视为基于指令的图像编辑模型的隐式能力。在这项工作中，我们引入了ProductConsistency数据集，旨在改进以产品为中心的图像编辑。我们的方法包括一个用于产品编辑的包含87k样本的监督微调（SFT）数据集、一个包含869张独特产品图像的强化学习（RL）数据集，以及一个新的基准数据集ProductConsistency Benchmark，以允许对编辑模型进行严格和标准化的评估。为了指导RL训练，我们提出了一种循环一致性奖励，通过使用原始产品描述与从编辑图像生成的描述之间的字幕相似性来强制保持产品身份的语义。我们使用我们的数据集对Qwen-Image-Edit-2511和Flux.1-Kontext-dev进行了微调，并在OCR和感知指标以及基于MLLM的评估中展示了相对于基线模型的一致改进，表明更强的产品一致性、文本渲染和整体视觉质量；其中Qwen-Image-Edit-2511模型实现了字符错误率降低5倍。代码和流程可在此https URL获取。

英文摘要

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open- and closed-source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md
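The character error rate (CER) mentioned above is conventionally defined as the edit distance between the recognized text and the reference text, divided by the reference length. A minimal sketch of that standard definition follows; the product strings are made-up examples, and the paper's exact OCR pipeline and text normalization are not described in the abstract.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


reference = "ACME COLA"        # hypothetical text printed on the original product
ocr_of_edit = "ACNE C0LA"      # hypothetical OCR output from an edited image
print(cer(reference, ocr_of_edit))
```

Here two of nine characters are wrong, giving a CER of 2/9. Under this definition, a 5x reduction means the normalized error falls to one fifth of its baseline value, for example from 0.10 to 0.02.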

2606.19195 2026-06-18 cs.CV 新提交

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Moebius: 0.2B轻量级图像修复框架，性能达10B级别

Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang

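Affiliations: Huazhong University of Science and Technology; VIVO AI Lab

AI summary: Proposes Moebius, a lightweight image inpainting framework that, through the Local-λ Mix Interaction block and an adaptive multi-granularity distillation strategy, matches or exceeds the generation quality of the 10B-level model FLUX.1-Fill-Dev with 0.22B parameters while running inference more than 15 times faster.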

AI中文摘要

尽管10B级别的工业基础模型推动了图像修复的边界，但其高昂的计算成本严重阻碍了实际部署。构建高度优化的任务特定专家模型是一个有前景的解决方案，然而极端的结构压缩不可避免地引发了严重的表示瓶颈。为解决这一问题，我们提出了Moebius，一个高效的轻量级修复框架。我们通过引入局部-λ混合交互（$L\lambda MI$）模块系统地重构了扩散主干。该模块由局部-λ和交互-λ子模块组成，巧妙地将空间上下文和全局语义先验总结为固定大小的线性矩阵，在保留复杂潜在交互的同时大幅减少参数。此外，为了释放这种高度紧凑架构的全部表示能力，我们将其与自适应多粒度蒸馏策略协同配对。该策略严格在潜在空间内操作以避免昂贵的像素空间解码，动态平衡多个基于梯度的损失以实现高保真对齐。在自然和肖像基准上的大量实验表明，这种最优协同使Moebius能够媲美甚至超越10B级工业通用模型FLUX.1-Fill-Dev的生成质量。值得注意的是，Moebius仅使用不到2%的参数（0.22B vs. 11.9B）就实现了这一点，同时总推理时间加速超过15倍，为高保真修复设立了新的效率标准。项目页面见此https URL。

英文摘要

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a more than 15x acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.

2606.18429 2026-06-18 cs.CV cs.AI cs.LG 新提交

CAOA -- Completion-Assisted Object-CAD Alignment

CAOA -- 补全辅助的物体-CAD对齐

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

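Affiliations: University at Albany

AI summary: Proposes CAOA, which combines semantics-aware point cloud completion with symmetry-aware relative pose estimation, achieves a 17% accuracy improvement on Scan2CAD, and releases the S2C-Completion dataset.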

Comments GitHub: https://github.com/MinhasKamal/CAOA

DOI: 10.1109/3DV69130.2026.00047
Journal ref: Thirteenth International Conference on 3D Vision (3DV), 2026

AI中文摘要

准确地将CAD模型与室内RGB-D扫描中的对应物体对齐是3D语义重建的核心挑战。该任务需要估计9自由度（DoF）位姿——位置、旋转和三轴尺度——但受到噪声和不完整扫描以及导致几何畸变的分割误差的阻碍。我们提出补全辅助的物体-CAD对齐（CAOA），该方法将语义和上下文感知的点云补全模块与对称感知的相对位姿估计算法相结合，实现CAD模型与扫描物体的精确对齐。现有的补全方法通常在合成数据集上训练和评估，往往难以泛化到真实扫描。为弥合这一差距，我们引入了一种针对室内场景的合成数据生成策略，通过与广泛使用的补全数据集进行定量比较，验证了其显著减小合成到真实领域差距的效果。此外，我们发布了S2C-Completion，一个来自Scan2CAD的超过8500个物体-CAD对的专家标注数据集，用于真实室内单物体补全，并作为该任务的新基准。对于物体-CAD对齐，我们通过对称感知损失融入对称信息，提高了对对称模糊的鲁棒性。在Scan2CAD基准上，CAOA相比最先进方法实现了17%的精度提升。

英文摘要

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets and often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.
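One common way to make a rotation loss robust to symmetric ambiguities is to take the minimum error over all rotations that map the object onto itself. The sketch below illustrates that general idea only; the abstract does not specify CAOA's loss, so the choice of the z axis as the symmetry axis, the `n_fold` parameter, and every function name are assumptions.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix about the z axis (assumed here to be the up axis)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def geodesic_angle(r1, r2):
    """Angle in radians of the relative rotation between r1 and r2."""
    rel = matmul(transpose(r1), r2)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def symmetry_aware_rotation_loss(r_pred, r_gt, n_fold):
    """Minimum geodesic error over the n-fold rotational symmetries about z."""
    symmetries = [rot_z(2.0 * math.pi * k / n_fold) for k in range(n_fold)]
    return min(geodesic_angle(r_pred, matmul(r_gt, s)) for s in symmetries)


r_gt = rot_z(0.3)
r_pred = rot_z(0.3 + math.pi / 2)  # off by a quarter turn
print(symmetry_aware_rotation_loss(r_pred, r_gt, n_fold=4))  # square table: equivalent pose
print(symmetry_aware_rotation_loss(r_pred, r_gt, n_fold=1))  # no symmetry: full 90 degree error
```

For an object with four-fold symmetry, such as a square table, the quarter-turn prediction is treated as equivalent to the ground truth, whereas without symmetry handling it is penalized as a 90 degree error.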

2606.18439 2026-06-18 cs.CV cs.RO 新提交

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

RegimeVGGT：面向视觉几何基础Transformer的逐层空间保持冗余去除

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

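Affiliations: University of Pennsylvania; University of California, Irvine; Nanyang Technological University

AI summary: Proposes RegimeVGGT, which removes redundancy through layer-wise U-shaped compression (Saliency-Guided Banded Merging and Selectively Protected K/V Downsampling) and achieves a 6.7x speedup while preserving reconstruction quality.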

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

AI中文摘要

视觉几何基础Transformer（VGGT）通过一次前向传播从多视图图像恢复密集3D场景结构，但二次交叉帧注意力限制了其可扩展性。现有的免训练加速器沿单一轴均匀减少计算，忽略了层间异质性。我们的频谱、探测和因果分析揭示了三个区域：浅层缺乏跨视图结构，中层驱动跨视图对齐，深层对密集几何是冗余的，但其跨帧注意力对姿态仍然至关重要。RegimeVGGT沿两个轴应用逐层U形压缩：显著性引导带状合并保护几何和边缘显著性令牌，而选择性保护K/V下采样通过相移空间网格、参考帧锚点以及未压缩的相机/注册令牌来保持跨帧空间覆盖和姿态关键路径。免训练，RegimeVGGT在匹配重建质量下相比VGGT*实现了6.7倍加速。

英文摘要

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18623 2026-06-18 cs.CV eess.IV 新提交

Intrinsic 4D Gaussian Segmentation from Scene Cues

内在4D高斯分割：基于场景线索

Hasan Yazar, Mohamed Rayan Barhdadi, Erchin Serpedin, Mehmet Tuncel, Hasan Kurban

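Affiliations: Istanbul Technical University; Texas A&M University; Hamad Bin Khalifa University

AI summary: Proposes Intrinsic-GS, a training-free, mask-free method that segments 4D scenes by building an affinity graph over Gaussian primitives and applying community detection, reaching accuracy comparable to mask-supervised methods on Neu3D and HyperNeRF while running 12.5 times faster.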

Comments 15 pages, 4 figures, 7 tables. Includes supplementary material. Preprint

AI中文摘要

动态4D高斯泼溅以高保真度重建变形场景，并越来越多地被用作动态3D场景的表示。要利用此类场景进行编辑、操作或运动分析，首先需要对其进行分割：将高斯原语分组为连贯的对象。当前流程通过从基础模型（如SAM）导入2D掩码，并将其提升或蒸馏到高斯表示中来获得这种分组。在动态场景中，这些掩码必须在多个帧和视角中生成，成本高昂，并且所得分割可能强烈依赖于这些外部掩码的质量和一致性。我们探究能否从高斯本身恢复更多的对象级结构，并提出Intrinsic-GS，一种无需训练、无需掩码的方法，该方法根据外观、方向、尺度、变形轨迹和非学习渲染边界线索，在高斯原语上构建稀疏亲和图。该图通过Leiden社区检测进行划分，无需基础模型，也无需学习特征场。在标准的4D高斯分割基准Neu3D和HyperNeRF上，Intrinsic-GS在没有掩码监督的情况下恢复了大量的对象结构，在Neu3D上达到0.746 mIoU，在HyperNeRF上达到0.575；在Neu3D上，仅几何变体达到0.902 mIoU，与SAM监督的TRASE相当。在HyperNeRF上，Intrinsic-GS的运行速度比掩码监督流程中使用的掩码生成和特征渲染阶段快12.5倍。这些结果表明，大部分分割信号已经编码在高斯本身中，为3D和4D高斯分割提供了一种快速、无需掩码的方向，也可能指向在外部掩码不可靠或昂贵的情况下更可泛化、更鲁棒的分割。

英文摘要

Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.
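The pipeline described above (per-primitive cues, a sparse affinity graph, then a graph partition) can be sketched in a few lines. This is a simplified stand-in rather than Intrinsic-GS itself: the Gaussian affinity, the `sigma` and `threshold` values, and the toy feature vectors are all assumptions, and plain connected components replace Leiden community detection to keep the example dependency-free.

```python
import math


def build_sparse_affinity(features, sigma=1.0, threshold=0.5):
    """Return edges (i, j, weight) for feature pairs whose affinity passes the threshold."""
    edges = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            weight = math.exp(-d2 / (2.0 * sigma ** 2))
            if weight >= threshold:  # dropping weak edges keeps the graph sparse
                edges.append((i, j, weight))
    return edges


def connected_components(n, edges):
    """Label nodes by connected component using union-find (stand-in for Leiden)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _ in edges:
        parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy per-primitive cue vectors: two well-separated groups of three primitives each.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = connected_components(len(features), build_sparse_affinity(features))
print(labels)
```

A community detection method such as Leiden would additionally split components that are only weakly connected internally, which connected components cannot do; the sketch only shows where the partition step sits in the pipeline.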

2606.18787 2026-06-18 cs.CV 新提交

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

基于UDF的点云重建中的学习半径估计

Eito Ogawa, Hiroshi Watanabe

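Affiliations: Graduate School of FSE, Waseda University, Tokyo, Japan

AI summary: Proposes a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone; it is trained on off-grid target radii obtained by parabolic interpolation, improving the fine-grained accuracy of point cloud surface reconstruction.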

2606.18861 2026-06-18 cs.CV cs.AI 新提交

基于热核先验的流形变分学习

Jiarui Xing, Tal Zeevi, Nian Wu, Jian Wang

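Affiliations: Yale School of Medicine; University of Virginia; Harvard Medical School

AI summary: Proposes a manifold-anchored variational framework that uses a geometry-aware EM algorithm to select graph medoids on a heat-kernel-weighted latent graph as prototypes, keeping prototypes on the manifold, and applies Dirichlet energy regularization to keep the latent space geometrically smooth, achieving the highest accuracy and sharp prototypes on cardiac scar and brain MRI benchmarks.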

AI中文摘要

学习医学影像队列的无监督表示可以揭示临床上有意义的原型，而无需专家标签，这些标签通常带有噪声且无法捕捉真实的病理异质性。然而，现有的深度潜变量模型通过欧几里得平均估计高斯混合先验，产生的原型会偏离弯曲的数据流形，并随着子种群数量的增加而退化。我们提出了一种流形锚定变分框架，基于几何感知的期望最大化（EM）算法，其M步骤选择每个子种群原型作为热核加权潜图上具有最高扩散中心性的图中心点，确保每个原型保持在流形上。Dirichlet能量正则化强制潜空间的几何平滑性，每个子种群的不确定性分数实现了无标签的质量评估。流形锚定EM是一种通用几何工具，扩展了标准EM，并易于应用于其他潜变量模型。在心脏瘢痕和脑MRI基准上，我们的框架在所有比较方法中取得了最高精度，产生了迄今为止最清晰的原型，并且在所有基线退化的较大子种群数量下保持稳定。

英文摘要

Learning unsupervised representations of medical imaging cohorts can reveal clinically meaningful prototypes without expert labels, which are often noisy and fail to capture true pathological heterogeneity. However, existing deep latent-variable models estimate Gaussian mixture priors via Euclidean averaging, producing prototypes that drift off the curved data manifold and degenerate as the number of sub-populations grows. We propose a manifold-anchored variational framework built on a geometry-aware Expectation-Maximization (EM) algorithm, whose M-step selects each sub-population prototype as the graph medoid with the highest diffusion centrality on a heat-kernel-weighted latent graph, ensuring that every prototype remains on-manifold. A Dirichlet energy regularizer enforces geometric smoothness of the latent space, and a per-sub-population uncertainty score enables label-free quality assessment. The manifold-anchored EM is a general-purpose geometric tool that extends standard EM and applies readily to other latent-variable models beyond this setting. On cardiac scar and brain MRI benchmarks, our framework attains the highest accuracy among all compared methods, produces the sharpest prototypes reported to date, and remains stable at large sub-population counts where all baselines degenerate.
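The contrast between a Euclidean mean and a medoid can be seen on a toy curved manifold. The sketch below is only an illustration of that contrast: it scores each point by its summed heat-kernel weight to the others, which is a simple proxy chosen here and not the paper's diffusion centrality, and the time parameter `t`, the semicircle data, and the function names are all assumptions.

```python
import math


def heat_kernel_weight(a, b, t):
    """Gaussian (heat-kernel style) edge weight between two latent codes."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (4.0 * t))


def select_medoid(points, t=0.25):
    """Index of the point with the largest summed heat-kernel weight to the others."""
    best_index, best_score = -1, -1.0
    for i, p in enumerate(points):
        score = sum(heat_kernel_weight(p, q, t) for j, q in enumerate(points) if j != i)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def euclidean_mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


# Five latent codes on a unit semicircle, a toy curved manifold.
angles = [0.0, 45.0, 90.0, 135.0, 180.0]
points = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

medoid = points[select_medoid(points)]
mean = euclidean_mean(points)
print("medoid radius:", math.hypot(*medoid))
print("mean radius:", math.hypot(*mean))
```

The medoid is by construction one of the observed points, so it stays on the semicircle, while the Euclidean mean falls well inside the circle, away from every data point. This is the drift off the manifold that the abstract attributes to Euclidean averaging.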

2606.18675 2026-06-18 cs.CV 新提交

BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection

BrainFusionNet：一种用于理解MRI图像局部、全局和序列特征以改进脑肿瘤检测的深度学习与XAI模型

Md Taimur Ahad, Bo Song, Yan Li

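Affiliations: School of Mathematics, Physics and Computing, University of Southern Queensland; School of Engineering, University of Southern Queensland

AI summary: Proposes BrainFusionNet, a hybrid model that combines a CNN, ViT, and GRU to extract spatial, contextual, and sequential features from MRI and integrates SHAP, LIME, and GradCAM for explainability analysis, reaching 98% accuracy on public datasets and outperforming SOTA CNNs.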

详情

Journal ref: Brain Inf. 13, 21 (2026)

AI中文摘要

磁共振成像（MRI）的噪声给深度学习（DL）带来挑战，当肿瘤边界模糊、肿瘤位置和外观复杂时尤其如此。因此，我们开发了BrainFusionNet，它结合卷积神经网络（CNN）、视觉变换器（ViT）和门控循环单元（GRU），从MRI图像中提取空间、上下文和序列特征，以改进脑肿瘤分类。此外，集成了可解释AI（如SHAP、LIME和GradCAM），以可视化和突出显示有助于BrainFusionNet决策过程的图像区域。所提出的BrainFusionNet模型在两个公开MRI数据集上进行了评估，K折验证表明在两个数据集上准确率均达到98%。该模型与六种最先进的（SOTA）CNN和迁移学习进行了比较。在SOTA CNN中，DenseNet121和VGG16达到了96%的最高准确率。BrainFusionNet的新颖之处在于，该混合模型能够有效提取MRI图像的局部和全局特征，即使在小尺度肿瘤区域和肿瘤尺寸较小的情况下也是如此。该模型具有平衡的序列CNN架构，以捕获低层和深层特征；以及定制的ViT，可捕获局部特征、稳定梯度流并降低MRI图像训练期间梯度消失的风险。CNN和ViT的输出被馈送到GRU以进行最终分类。此外，我们分析像素强度以确定MRI图像质量是否影响图像分类。我们的发现在图像解释方面非常新颖，因为我们发现MRI图像中像素强度的分布会影响DL性能。

英文摘要

The noise of Magnetic Resonance Imaging (MRI) poses challenges for Deep Learning (DL) when tumor boundaries are obscured and tumor location and appearance are complex. Therefore, we develop BrainFusionNet, which combines Convolutional Neural Networks (CNNs), Vision Transformers (ViT), and Gated Recurrent Units (GRUs) to extract spatial, contextual, and sequential features from MRI images for improved brain tumor classification. Furthermore, explainable AI methods such as SHAP, LIME, and GradCAM are integrated to visualise and highlight image regions that contribute to BrainFusionNet's decision-making process. The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets; K-fold validation suggests 98% accuracy on both datasets. The model was compared with six state-of-the-art (SOTA) CNNs and transfer learning. Among the SOTA CNNs, DenseNet121 and VGG16 achieved the highest accuracy of 96%. The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images, even in small-scale tumor regions and small tumor sizes. The model has a balanced sequential CNN architecture to capture low-level and deeper-layer features, and a customized ViT that captures local features, stabilizes gradient flow, and reduces the risk of vanishing gradients during MRI image training. The CNN and ViT outputs are fed into a GRU for final classification. Furthermore, we analyze pixel intensities to determine whether MRI image quality affects image classification. Our findings are very novel in image interpretation, as we found that the distribution of pixel intensities in MRI images affects DL performance.

2606.18682 2026-06-18 cs.CV 新提交

Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

使用先进深度学习模型的多类脑肿瘤分类：一项比较研究

Asad Channa, Asghar Ali Chandio, Akhtar Hussain Jalbani, Mehwish Leghari, Shahzad Memon

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology（夸迪-艾瓦姆工程、科学与技术大学计算机科学系）； Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology（夸迪-艾瓦姆工程、科学与技术大学人工智能系）； The Faculty of Artificial Intelligence and Cyber Security, Universiti Teknikal Malaysia Melaka（马来西亚梅拉卡技术大学人工智能与网络安全学院）； Department of Data Science, Quaid-e-Awam University of Engineering, Sciences & Technology（夸迪-艾瓦姆工程、科学与技术大学数据科学系）； Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London（东伦敦大学建筑、计算与工程学院计算机科学与数字技术系）

AI总结本研究比较五种CNN架构（包括定制模型和四种预训练模型）在约10,000张MRI图像上的多类脑肿瘤分类性能，发现EfficientNetB0以95%准确率最优，尤其显著提高了脑膜瘤的召回率（89%）。

详情

AI中文摘要

尽管深度学习最近取得了进展，但从MRI图像中准确分类脑肿瘤仍然面临挑战。在本研究中，我们对五种不同的卷积神经网络（CNN）架构进行了全面评估，包括一个定制的基线模型和四个预训练模型，用于使用临床来源的约10,000张MRI图像数据集对多类脑肿瘤进行分类。我们使用了五种不同的架构：VGG16、VGG19、DenseNet121和EfficientNetB0，它们都在相同的实验框架内进行了测试和训练。性能通过总体准确率和肿瘤召回率来衡量，以评估每种架构的临床相关性能。我们发现，与其他测试的架构相比，EfficientNetB0具有最佳的整体分类准确率95%；具体来说，VGG16（94.37%）、VGG19（92.29%）、DenseNet121（90.91%）和定制CNN（78.00%）。我们研究的一个特别重要的发现是，在检测脑膜瘤方面有显著改进；具体而言，简单的CNN可以以约20%的召回率检测脑膜瘤，而EfficientNetB0能够以89%的召回率检测脑膜瘤。脑膜瘤通常难以检测，因为它们在MRI图像上可能表现得非常微妙。此外，一个有趣的发现是，更深的VGG19性能不如较浅的VGG16。这表明，在处理医学图像时，CNN模型的架构效率可能比其深度更重要。总体而言，EfficientNetB0似乎在分类准确率、模型参数数量和临床有意义性能之间提供了最佳权衡。

英文摘要

Despite recent advancements in deep learning, accurately classifying brain tumors from MRI images continues to pose challenges. In this research, we present a comprehensive evaluation of five different convolutional neural network (CNN) architectures, a customized baseline model and four pre-trained models, for classifying multi-class brain tumors using a clinically sourced dataset of approximately 10,000 MRI images. The five architectures (the customized CNN, VGG16, VGG19, DenseNet121, and EfficientNetB0) were all trained and tested within an identical experimental framework. Performance was measured by both overall accuracy and tumor-wise recall to capture the clinically relevant performance of each architecture. We found that EfficientNetB0 had the best overall classification accuracy at 95%, ahead of VGG16 (94.37%), VGG19 (92.29%), DenseNet121 (90.91%), and the customized CNN (78.00%). An especially important finding of our research was the considerable improvement in detecting meningiomas: while simple CNNs could detect meningiomas with a recall rate of approximately 20%, EfficientNetB0 detected them with a recall rate of 89%. Meningiomas are often difficult to detect because they can appear very subtly on MRI images. Additionally, an interesting finding was that the deeper VGG19 performed worse than the shallower VGG16. This indicates that in many cases the architectural efficiency of a CNN model may be more important than its depth when working with medical images. Overall, EfficientNetB0 appears to provide the optimal trade-off between classification accuracy, number of model parameters, and clinically meaningful performance.

2606.18707 2026-06-18 cs.CV 新提交

PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

PEFT-MedSAM：面向可解释皮肤病变分割的医学基础模型高效微调

Asad Channa, Abdullah Khan, Asghar Ali Chandio, Aamir Akbar, Shahzad Memon, Aqib Hussain, Ameer Hamza

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology（计算机科学系，卡迪尔-阿瓦姆工程、科学与技术大学）； Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology（人工智能系，卡迪尔-阿瓦姆工程、科学与技术大学）； Department of Computer Science, Sindh Madressatul Islam University, City Campus, Karachi（计算机科学系， Sind 阿里斯坦伊斯兰大学，卡拉奇城校区）； Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London（计算机科学与数字技术系，建筑、计算与工程学院，东伦敦大学）

AI总结提出参数高效微调方法PEFT-MedSAM，冻结预训练编码器仅训练轻量解码器，在ISIC 2018上达到0.9411 Dice系数，并通过Grad-CAM可解释性增强临床可信度。

详情

AI中文摘要

使用深度学习模型对皮肤镜图像进行皮肤病变自动分割，有助于比常规检测更早发现黑色素瘤。然而，大多数现有的深度学习方法性能不佳。本文旨在提出一种名为PEFT-MedSAM的参数高效微调方法，用于适配医学分割一切模型（MedSAM）以自动分割皮肤镜皮肤病变。PEFT-MedSAM方法仅使用轻量级掩码解码器训练模型，同时保持预训练图像编码器和提示编码器冻结。在ISIC 2018基准数据集上的实验表明，与完全训练的U-Net基线（0.8715 Dice系数）和零样本MedSAM推理（0.8997 Dice系数）相比，PEFT-MedSAM获得了0.9411的Dice系数和0.8918的交并比。使用PH2数据集进行的外部验证显示Dice系数为0.9467，标准差为±0.0310。这些主张的支持证据包括比较两个数据集的Wilcoxon符号秩检验p值小于0.0001，以及bootstrap估计的95%置信区间[0.9364, 0.9447]，该区间表示重复测试获得的平均Dice系数的估计范围。为了增加临床可信度，我们使用Grad-CAM可解释性以及基于指向游戏的评估方法，在验证集上评估CNN基线模型。结果表明，在包含519张图像的验证集上，准确率达到98.27%，并确认模型正确分类了包含皮肤病变的区域。

英文摘要

Automated segmentation of skin lesions in dermoscopic images using deep learning models can help find melanomas earlier than they would normally be detected. However, most available deep learning methods do not perform well. The aim of this paper is to present a parameter-efficient fine-tuning method called PEFT-MedSAM for adapting the Medical Segment Anything Model (MedSAM) to automatically segment dermoscopic skin lesions. PEFT-MedSAM trains only the lightweight mask decoder while keeping the pre-trained image encoder and prompt encoder frozen. Experiments on the ISIC 2018 benchmark dataset show that PEFT-MedSAM obtains a Dice coefficient of 0.9411 and an intersection over union of 0.8918, compared with a fully trained U-Net baseline (0.8715 Dice coefficient) and zero-shot MedSAM inference (0.8997 Dice coefficient). External validation on the PH2 dataset shows a Dice coefficient of 0.9467 with a standard deviation of +/- 0.0310. Supporting evidence for these claims includes a p-value below 0.0001 for Wilcoxon signed-rank tests comparing the two datasets and a bootstrap-estimated 95% confidence interval of [0.9364, 0.9447] for the average Dice coefficient. To increase clinical trustworthiness, we used Grad-CAM explainability along with a pointing-game-based evaluation methodology to evaluate the CNN baseline model on the validation set. The results showed an accuracy of 98.27% on the validation set of 519 images and confirmed that the model classified regions containing skin lesions.
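
The two overlap metrics reported in this abstract can be illustrated with a minimal sketch. It assumes binary masks flattened into lists of 0/1 values; the function names are hypothetical and this is not the authors' evaluation code. For a single mask pair, IoU equals Dice / (2 - Dice), although averages over a dataset need not satisfy this identity exactly.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A n B| / (|A| + |B|) for flat binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def iou(pred, truth):
    """Intersection over union for flat binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    union = sum(pred) + sum(truth) - intersection
    return 1.0 if union == 0 else intersection / union


# Toy example: one overlapping pixel, two foreground pixels in each mask.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
d = dice_coefficient(pred_mask, true_mask)
j = iou(pred_mask, true_mask)
```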

2606.18723 2026-06-18 cs.CV cs.LG 新提交

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

临床对齐的几何约束用于鲁棒的IVUS血管边界分割

Yunshu Chen, Litao Yang, Giuseppe Di Giovanni, Jordan Tan, Deval Mehta, Andrew Lin, Derek Chew, Masasi Fujino, Julie Butters, Stephen Nicholls, Zongyuan Ge, Kyung Hoon Cho

发表机构 * AIM For Health Lab, Monash University（莫纳什大学AIM健康实验室）； Department of Data Science and Artificial Intelligence, Faculty of IT, Monash University（莫纳什大学信息技术学院数据科学与人工智能系）； Monash University Victorian Heart Institute（莫纳什大学维多利亚心脏研究所）； School of Computing Technologies, RMIT University（皇家墨尔本理工大学计算技术学院）； National Cerebral and Cardiovascular Center（国立循环器病研究中心）； Department of Cardiology, Chonnam National University Hospital and Medical School（全南大学医院和医学院心脏病学系）

AI总结提出GeoCat网络，通过双编码器与可微几何一致性损失，在IVUS分割中降低边界漂移和拓扑错误，提升临床几何测量精度。

Comments MICCAI2026 Accepted

详情

AI中文摘要

血管内超声（IVUS）管腔和外弹性膜（EEM）分割对于定量评估冠状动脉斑块负荷至关重要。管腔或EEM勾画的误差会直接传播到斑块面积、斑块负荷和几何测量中。然而，优先考虑重叠分数的标准方法常常遭受边界漂移和拓扑错误，导致临床测量不准确。我们提出GeoCat，一个几何一致性网络，使用双笛卡尔-极坐标编码器，结合跨域注意力和时间融合，处理5帧IVUS片段。可微的几何一致性损失直接监督临床相关描述符，包括直径、方向和横截面积。该模型在来自146名患者的12,242张标注帧上训练，这些帧使用两种商用IVUS系统采集。我们使用分割准确性和斑块相关临床指标评估性能，包括Dice/IoU、边界测量（95HD（mm）、ASSD）、拓扑违规率和临床几何误差（dmax/dmin、角度和面积）。在我们的数据集上，GeoCat实现了0.93的Dice，将95HD降低到0.14 mm，并将拓扑违规率降低到1.0%。重要的是，它显著提高了几何保真度，产生0.13-0.16 mm的直径误差和约8度的角度误差，支持可靠的斑块负荷量化。

英文摘要

Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD in mm, ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.

2606.18749 2026-06-18 cs.CV 新提交

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

迈向3D医学图像的无训练零样本异常检测：基于批次的方法使用2D基础模型

Tai Le-Gia

发表机构 * Chungnam National University（忠南大学）

AI总结提出CS3F框架，利用2D基础模型对3D医学图像进行零样本异常检测，通过沿多轴分解、切片编码和跨主体相似性计算异常分数，并引入粗到细的分词策略减少信号衰减。

详情

AI中文摘要

零样本异常检测（ZSAD）在医学成像中具有吸引力，因为临床系统必须处理异构采集协议、变化的患者群体以及可能缺乏标注训练数据的病理。大多数现有的零样本异常检测方法是为2D图像设计的，它们直接扩展到3D医学体积受到大规模体积基础模型稀缺或利用体积上下文困难的限制。我们提出CS3F，一个无训练的基于批次的框架，用于3D医学图像中的ZSAD，使用2D基础模型。每个体积沿多个解剖轴分解，并由2D视觉变换器逐切片编码。然后通过池化相邻切片特征将其转换为局部体积令牌。异常分数通过跨主体互相似性获得：在其他主体中缺乏相似令牌的令牌被赋予更高的异常分数。为了减少深度池化引起的病灶信号衰减，我们引入了一种粗到细的分词策略，无需穷举匹配即可实现细分辨率体积评分。CS3F在脑部MRI上针对转移瘤、胶质瘤和中风进行评估，并在肺部CT上验证其泛化能力，超越标准图谱对齐的脑部MRI。结果表明，冻结的2D基础模型可以支持3D医学图像中的异常定位，且细分词化的益处很大程度上取决于病灶对比度和成像模态。

英文摘要

Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing zero-shot anomaly detection methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. These are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, as well as validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.
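
The cross-subject scoring idea described above (tokens without close analogues in other subjects score higher) can be sketched as follows. This is a simplified illustration, not CS3F itself: it assumes each subject is already represented as a list of non-zero feature vectors, uses cosine similarity, and averages the best match per other subject. The function names and the averaging rule are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_scores(batch, subject_idx):
    """Score each token of one subject by how poorly it matches the other subjects.

    batch: list of subjects, each a list of token feature vectors.
    Returns one score per token: 1 - mean over other subjects of the best cosine match.
    """
    query_tokens = batch[subject_idx]
    others = [tokens for i, tokens in enumerate(batch) if i != subject_idx]
    scores = []
    for q in query_tokens:
        best_per_subject = [max(cosine(q, t) for t in tokens) for tokens in others]
        scores.append(1.0 - sum(best_per_subject) / len(best_per_subject))
    return scores


# Toy batch: subject 0 carries one token that no other subject resembles.
batch = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
scores = anomaly_scores(batch, 0)
```

In this toy batch the third token of subject 0 receives the highest score because neither of the other subjects contains a similar token.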

2606.18753 2026-06-18 cs.CV 新提交

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

SMART：一种灵活、可解释且可扩展的高分辨率成像数据时空脑图谱

John Kalkhof, Boris Gutman, Emile d'Angremont, Daniel C. Alexander, Marco Lorenzi

发表机构 * Illinois Institute of Technology（伊利诺伊理工学院）； Amsterdam University Medical Center（阿姆斯特丹大学医学中心）； University College London（伦敦大学学院）

AI总结提出SMART框架，通过解耦全局疾病动态与患者特定解剖表现，学习连续疾病时间图谱，实现高分辨率3D医学图像中时空变化的灵活、可解释和可扩展建模。

详情

AI中文摘要

我们介绍了SMART，一个从纵向高分辨率3D医学图像中学习灵活、可解释且可扩展的时空脑图谱的框架。现有的时空图谱构建方法依赖于黑盒生成模型，缺乏灵活性、限制可解释性，并且难以扩展到高维数据。SMART通过学习一个连续的疾病时间图谱来解决这些挑战，该图谱将全局群体级疾病动态与患者特定的解剖表现解耦。在解剖学启发先验的指导下，SMART通过区域特异性微分方程，沿着共享的疾病时间线建模可解释的全局区域进展轨迹。全局轨迹进一步通过由灵活且可扩展的多尺度神经细胞自动机参数化的密集微分同胚位移，个性化到个体解剖结构。在阿尔茨海默病的五个纵向MRI数据集（ADNI-1/GO/2、OASIS-3、AIBL；>1300名受试者）上评估，SMART产生了解剖学上有意义的疾病进展预测，并实现了最先进的预测准确性和比对抗性和扩散基线更好的时间一致性。我们的方法为高维医学图像时间序列中时空变化的灵活、可解释和可扩展建模建立了一个新范式。

英文摘要

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

2606.18825 2026-06-18 cs.CV 新提交

DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

DreamReg：基于信念驱动的世界模型用于2D-3D超声配准

Luoyao Kang, Yuelin Zhang, Jiwei Shan, Haifan Gong, Qingpeng Ding, Shing Shin Cheng

发表机构 * T Stone Robotics Institute, The Chinese University of Hong Kong（香港中文大学T Stone机器人研究所）； Multi-scale Medical Robotics Center（多尺度医疗机器人中心）； Perelman School of Medicine, University of Pennsylvania（宾夕法尼亚大学佩雷尔曼医学院）

AI总结提出DreamReg框架，将2D-3D超声配准建模为信念更新，通过世界模型模拟探头运动并整合想象结果，在CAMUS和u-RegPro数据集上实现鲁棒且准确的实时配准。

详情

AI中文摘要

超声（US）广泛应用于手术导航，但由于部分可观测性、散斑噪声以及依赖于动作的US采集，术中2D切片与术前3D体积之间的实时配准仍然具有挑战性。现有方法是一次性的或短视的，难以随时间收集证据或捕捉外科医生如何根据屏幕反馈调整探头运动。我们提出DreamReg，一个基于信念驱动的世界模型框架，将2D-3D配准形式化为对刚性变换的信念更新。DreamReg维护一个潜在信念状态，总结过去的观测和位姿信息，并在新切片到达时通过学习到的动态不断细化变换。在训练期间，DreamReg暴露于模拟临床扫描行为的探头运动轨迹，并通过将位姿细化条件于当前US观测来学习更新其信念。在推理期间，DreamReg通过内部想象来细化配准：它展开学习到的世界模型以模拟候选探头运动及其预测的观测，并整合这些想象的结果以收敛到准确的刚性变换。在CAMUS和u-RegPro数据集上的实验表明，与最先进方法相比，DreamReg在实时引导中具有改进的鲁棒性和有竞争力的配准精度。

英文摘要

Ultrasound (US) is widely used for surgical navigation, yet real-time registration between intraoperative 2D slices and preoperative 3D volumes remains challenging due to partial observability, speckle noise, and action-dependent US acquisition. Existing methods are one-shot or short-horizon, making it hard for them to gather evidence over time or capture how surgeons adjust probe motion based on on-screen feedback. We propose DreamReg, a belief-driven world-model framework that formulates 2D-3D registration as belief updating over rigid transformations. DreamReg maintains a latent belief state that summarizes past observations and pose information, and continuously refines the transformation through learned dynamics as new slices arrive. During training, DreamReg is exposed to probe-motion trajectories that mimic clinical scanning behavior and learns to update its belief by conditioning pose refinement on the current US observation. During inference, DreamReg refines registration via internal imagination: it rolls out the learned world model to simulate candidate probe motions and their predicted observations, and integrates these imagined outcomes to converge to an accurate rigid transformation. Experiments on the CAMUS and u-RegPro datasets demonstrate improved robustness and competitive registration accuracy for real-time guidance compared with state-of-the-art methods.

2606.18860 2026-06-18 cs.CV cs.LG 新提交

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

医学图像分割中对抗模型的不确定性量化

Hana Jebril, Thomas Pinetz, Günter Klambauer, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria（人工智能研究所、医学数据科学中心、维也纳医学大学，奥地利）； Comprehensive Center for AI in Medicine, Medical University of Vienna, Austria（医学人工智能综合中心、维也纳医学大学，奥地利）； ELLIS Unit Linz, LIT AI Lab and Institute for Machine Learning, Johannes Kepler University Linz, Austria（林茨ELLIS单位、LIT人工智能实验室和机器学习研究所、林茨约翰内斯·开普勒大学，奥地利）； Institute for Machine Learning, Johannes Kepler University Linz, Austria（机器学习研究所、林茨约翰内斯·开普勒大学，奥地利）； Clinical Research Center for Medical AI, Johannes Kepler University Linz, Austria（医学人工智能临床研究中心、林茨约翰内斯·开普勒大学，奥地利）

AI总结提出QUAM-SM后处理框架，通过针对性对抗搜索识别脆弱像素，量化不确定性并分离认知与偶然不确定性，在公开数据集上优于现有方法。

Comments Accepted at MICCAI 2026

详情

AI中文摘要

可靠的像素级不确定性量化具有通过实现高保真纵向监测和区分真实病理变化与伪影来改变临床工作流程的潜力。理想情况下，这些模型提供关键治疗计划和手术干预所需的稳定性。然而，标准深度学习模型常常遭受校准不良，产生过度自信的预测，掩盖了微妙病理边界处的潜在脆弱性。为了解决这个问题，我们提出了QUAM-SM，一种使用针对性对抗搜索来识别“对抗脆弱”像素的后处理框架。通过主动寻找暴露预测不稳定性的扰动，我们的方法突出了决策最容易被翻转的区域。重要的是，该框架将认知不确定性与偶然不确定性分离。在两个具有多个专家标注的公开数据集上的实验表明，QUAM-SM在可靠性和边界敏感性方面优于标准和最新的不确定性估计方法。代码可在以下网址获取：https://github.com/HanaJebril/quam_sm

英文摘要

Reliable pixel-level uncertainty quantification holds the potential to transform clinical workflows by enabling high-fidelity longitudinal monitoring and distinguishing true pathological changes from artifacts. Ideally, these models provide the stability required for critical treatment planning and surgical intervention. However, standard deep learning models often suffer from miscalibration, yielding overconfident predictions that mask underlying vulnerabilities at subtle pathological boundaries. To address this, we propose QUAM-SM, a post-hoc framework using targeted adversarial search to identify "adversarially fragile" pixels. By actively seeking perturbations that expose predictive instability, our method highlights regions where decisions are most vulnerable to being flipped. Importantly, the framework disentangles epistemic uncertainty from aleatoric uncertainty. Experiments on two public datasets with multiple expert annotations demonstrate that QUAM-SM outperforms both standard and recent uncertainty estimation approaches in terms of reliability and boundary sensitivity. Code is available at https://github.com/HanaJebril/quam_sm

2606.18869 2026-06-18 cs.CV 新提交

Learning to Distort: Weakly-Supervised Image Quality Transfer for Prostate DWI Correction

学习扭曲：用于前列腺DWI校正的弱监督图像质量迁移

YuCheng Tang, Wen Yan, Alexander Ng, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, David Atkinson, Shonit Punwani, Daniel Alexander, Shaheer Ullah Saeed, Veeru Kasivisvanathan, Yipeng Hu

发表机构 * UCL Hawkes Institute（UCL哈维斯研究所）； Department of Medical Physics and Biomedical Engineering（医学物理与生物医学工程系）； University College London（伦敦大学学院）； Division of Surgery and Interventional Science（外科与介入科学分会）； Centre for Medical Imaging（医学成像中心）； British Urology Researchers in Surgical Training (BURST)（英国泌尿外科手术培训研究人员（BURST））； Department of Radiology（放射科）； University College London Hospitals NHS Foundation Trust（伦敦大学学院医院国家健康服务信托基金）； Centre for Medical Image Computing（医学图像计算中心）； Department of Computer Science（计算机科学系）； Department of Urology（泌尿科）

AI总结提出弱监督图像质量迁移框架，利用图像质量评估信号从无失真图像学习生成真实失真，并训练校正模型，在PI-RADS和Gleason评分分类任务中优于现有无配对方法。

详情

AI中文摘要

单次激发平面回波前列腺弥散加权成像（DWI）常因几何失真而复杂化，影响从这些图像中获得可靠诊断的能力。开发自动化校正方法面临缺乏配对的失真和未失真临床扫描的挑战。本文首先提出一种新颖的弱监督图像质量迁移（IQT）框架，从无失真图像到失真图像，利用图像质量评估（IQA）信号监督迁移过程。与传统方法需要昂贵的体素级配对数据或采用无配对算法不同，我们的方法利用图像级质量标签（此处为失真与无失真）在预训练特征空间中建立潜在质量原型。认识到模拟真实失真比直接无配对校正更可靠，我们描述了一种弱监督原型流匹配算法，显式正则化生成轨迹朝向失真原型，产生模拟临床退化的真实磁敏感伪影。通过合成这些真实配对，我们能够训练第二个IQT模型进行正向失真校正。实验结果表明，我们生成的图像成功模拟了真实伪影的诊断干扰，从而产生更强大的失真校正IQT模型。除定性比较外，我们还通过评估临床下游任务性能（PI-RADS和Gleason评分分类），使用分布内和外部数据集，将我们的方法与现有无配对方法（如CycleGAN、UNIT-DDPM和OT-FM）作为正向或反向替代方案进行详尽的定量评估。

英文摘要

Single-shot echo-planar prostate diffusion-weighted imaging (DWI) is frequently complicated by geometric distortions, which impact the ability to derive reliable diagnoses from such images. Developing automated correction methods is challenged by the absence of paired distorted and undistorted clinical scans. In this paper, we first propose a novel weakly-supervised image quality transfer (IQT) framework from undistorted to distorted images that utilizes image quality assessment (IQA) signals to supervise the transfer process. Unlike traditional methods that require expensive, voxel-wise paired data or resort to developing unpaired algorithms, our approach utilizes image-level quality labels (here, distorted vs. undistorted) to establish latent quality prototypes within a pre-trained feature space. Recognizing that simulating realistic distortions is more reliable than direct unpaired correction, we describe a weakly-supervised prototype flow matching algorithm to explicitly regularize generative trajectories towards distorted prototypes, producing realistic susceptibility artifacts that mimic clinical degradations. By synthesizing these realistic pairs, we enable a second IQT model to be trained in the forward direction for distortion correction. Experimental results demonstrate that our generated images successfully mimic the diagnostic interference of real-world artifacts, which leads to more capable distortion correction IQT models. In addition to qualitative comparisons, we also conduct exhaustive quantitative evaluations that compare our approach with existing unpaired approaches (e.g., CycleGAN, UNIT-DDPM, and OT-FM) - as either forward or reverse alternatives - by assessing clinical downstream task performance in PI-RADS and Gleason score classification, using both in-distribution and external data sets.

2606.18872 2026-06-18 cs.CV 新提交

Bridging Single Distortion Artifacts and Multifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

桥接单一失真伪影与多因素临床质量：基于失真训练的原型网络的少样本双参数MRI质量评估

Yuheng Tang, Alexander Ng, Wen Yan, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, Shonit Punwani, Daniel Alexander, Veeru Kasivisvanathan, Yipeng Hu

发表机构 * UCL Hawkes Institute（UCL Hawkes研究所）； Department of Medical Physics and Biomedical Engineering（医学物理与生物医学工程系）； University College London（伦敦大学学院）； Division of Surgery and Interventional Science（外科与介入科学分会）； Centre for Medical Imaging（医学成像中心）； British Urology Researchers in Surgical Training (BURST)（英国泌尿外科手术培训研究人员（BURST））； Department of Radiology（放射科）； University College London Hospitals NHS Foundation Trust（伦敦大学学院医院国家健康服务信托基金）； Centre of Medical Imaging, Division of Medicine（医学成像中心，医学分会）； Centre for Medical Image Computing（医学图像计算中心）； Department of Computer Science（计算机科学系）； Department of Urology（泌尿科）

AI总结提出一种少样本双参数原型网络，利用失真标签元训练，通过特征融合和域对齐，仅用5个样本即可预测PI-QUAL临床质量评分，解决临床数据稀缺问题。

详情

AI中文摘要

临床前列腺多参数MRI高度依赖高质量扩散加权成像（DWI），但DWI读图常因几何失真（通常由直肠气体引起）而受损。通过PI-QUAL评分系统评估质量是新兴的临床标准，但该方法主观、耗时，且存在类别不平衡问题，其中低质量病例多样且相对稀少。以PRIME临床试验为例，6%的图像PI-QUAL评分低于4，87%的DWI问题源于失真，许多其他临床质量问题代表性不足。为解决这种标注临床数据的双重稀缺性，我们提出了一种用于自动图像质量评估（IQA）的少样本双参数原型网络。我们的框架利用双分支3D ResNet融合T2加权和DWI特征，提供解剖背景以区分真实形态与失真。为处理现实异质性，我们引入特征级线性调制（FiLM）和梯度反转层（GRL），以对齐基于不同b值的特征分布，同时抑制采集相关偏差。我们证明，仅基于相对客观、易于获取的失真标签进行元训练的模型，能够仅使用五个代表性样本有效适应预测复杂的多因素临床质量评分（如PI-QUAL）。在两个数据集上的实验结果表明，我们的方法在此具有挑战性的IQA任务中显著优于少样本学习基线，为临床工作流程中标准化前列腺MRI质量控制提供了实际可行且数据高效的解决方案。

英文摘要

Clinical prostate multi-parametric MRI relies heavily on high-quality diffusion-weighted imaging (DWI), yet reading DWI is frequently compromised by geometric distortion, often caused by rectal air. Assessing quality via the PI-QUAL scoring system is an emerging clinical standard, but it is subjective, time-consuming, and suffers from a class imbalance in which low-quality cases are diverse and relatively scarce. Using the PRIME clinical trial as an example, 6% of images have PI-QUAL scores lower than 4, and 87% of DWI issues are due to distortion, while many of the other clinical quality issues are under-represented. To address this common dual scarcity of annotated clinical data, we propose a few-shot biparametric prototypical network for automated image quality assessment (IQA). Our framework utilizes a dual-branch 3D ResNet to fuse T2-weighted and DWI features, providing anatomical context to distinguish true morphology from distortion. To handle real-world heterogeneity, we introduce feature-wise linear modulation (FiLM) and a gradient reversal layer (GRL) to align feature distributions conditioned on varying b-values while suppressing acquisition-related biases. We demonstrate that a model meta-trained solely on comparatively objective, readily obtainable distortion labels can effectively adapt to predicting complex, multi-factorial clinical quality scores such as PI-QUAL using only five representative samples. Experimental results on two datasets show that our method significantly outperforms few-shot learning baselines for this challenging IQA task, offering a practically feasible and data-efficient solution for standardizing prostate MRI quality control in clinical workflows.
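
The few-shot step of a prototypical network can be sketched in a few lines: each class prototype is the mean of its support embeddings, and a query takes the label of the nearest prototype. The sketch below assumes embeddings are already computed (the dual-branch 3D ResNet, FiLM, and GRL components are not modeled), and the class labels and vectors are made-up illustrations.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Hypothetical 2-D embeddings, five support samples per quality class.
support = {
    "adequate": [[0.9, 0.1], [1.0, 0.0], [1.1, 0.2], [0.8, 0.0], [1.2, 0.2]],
    "inadequate": [[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1], [0.2, 1.0], [-0.2, 1.0]],
}
prototypes = build_prototypes(support)
label = classify([0.9, 0.3], prototypes)
```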

2606.18876 2026-06-18 cs.CV cs.LG 新提交

Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow

光学相干断层扫描中基于轨迹对齐的时间无关流的测试时自适应

Veit Hucke, Thomas Pinetz, Gregor Reiter, Ursula Schmidt-Erfurth, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria（人工智能研究所、医学数据科学中心、维也纳医学大学，奥地利）； Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna, Austria（医学人工智能综合中心、维也纳医学大学，奥地利）； Department of Ophthalmology and Optometry, Medical University of Vienna, Austria（眼科与视光学部、维也纳医学大学，奥地利）； Laboratory for Ophthalmic Image Analysis, Medical University of Vienna, Austria（眼科图像分析实验室、维也纳医学大学，奥地利）

AI总结提出一种基于流匹配的测试时自适应方法，通过直方图匹配和去除时间条件，生成高质量替代图像，在AMD分割中达到最优性能。

Comments Accepted in MICCAI

2606.18886 2026-06-18 cs.CV 新提交

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

DINO-Med3D：通过渐进式适应弥合体分割中的维度与领域差距

Haoyu Hu, Xiyao Ma, Shiqi Liu, Linsen Zhang, Xiaoliang Xie, Xiaohu Zhou, Zeng-Guang Hou

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）

AI总结提出两阶段渐进框架DINO-Med3D，通过多切片嵌入模块、3D适配器和并行细节恢复流，将DINOv3适配到3D医学分割，在五个数据集上超越现有方法。

Comments Accepted at MICCAI 2026. The camera-ready version and link will be made publicly available upon publication

详情

AI中文摘要

尽管DINOv3在自然图像中展现了显著的语义判别能力，但其直接应用于体医学分割受到固有的维度和领域差异的阻碍。为解决这些问题，我们提出DINO-Med3D，一个两阶段渐进框架，将预训练的DINOv3编码器重新用于3D医学任务。在第一阶段，我们通过引入融合伪3D上下文的多切片嵌入模块来弥合维度差距，同时采用分割代理任务将从自然场景学到的表示适应到医学领域。随后，我们通过在冻结的主干中添加轻量级3D适配器来增强体理解，以强制执行全局切片间连续性。最后，为补偿嵌入过程中固有的空间信息损失，我们设计了一个并行细节恢复流，以显式保留高频边界线索。在五个公共数据集上的大量实验表明，我们的方法成功地将DINOv3适应到医学领域，并显著优于最先进的基线方法。

英文摘要

Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurposes the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

2606.18894 2026-06-18 cs.CV 新提交

重新思考表格结构识别中的指针损失：面向空间局部性的几何感知指针损失

Hong-Jun Choi, Jongho Lee, Jaeyoung Kim

发表机构 * Teamreboott Inc.（Teamreboott公司）

AI总结针对指针网络在表格结构识别中相邻单元格错误占79.6%的问题，提出几何感知指针损失，通过反距离加权重写交叉熵目标，聚焦邻近单元格梯度，在不增加推理成本下提升性能。

详情

AI中文摘要

使用指针网络的表格结构识别（TSR）通过预测HTML序列同时将标签与检测到的文本（或单元格）区域对齐，取得了令人印象深刻的结果。然而，我们的分析揭示，当指针网络失败时，79.6%的错误发生在空间相邻的单元格之间（曼哈顿距离<=2）。尽管如此，标准交叉熵损失对所有负候选样本赋予相同权重。在这项工作中，我们提出了几何感知指针（GAP）损失，它根据与真实值的空间邻近性重新加权交叉熵目标。通过应用反距离加权，GAP将梯度流集中在模型最困难的区域：相邻单元格比远处单元格获得更强的梯度。我们的方法仅需对损失计算进行简单修改，保持相同的模型架构且零额外推理成本。在PubTabNet和SynthTabNet上的大量实验表明，GAP持续减少相邻单元格错误，达到了新的最先进性能。我们的发现表明，在损失层面融入几何归纳偏置为鲁棒TSR提供了一种简单而有效的方法。我们的代码可在以下网址获取：https://github.com/teamreboott/GAP

英文摘要

Table Structure Recognition (TSR) using a pointer network achieves impressive results by predicting HTML sequences while aligning tags to detected text (or cell) regions. However, our analysis reveals that when pointer networks fail, 79.6% of errors occur between spatially adjacent cells (Manhattan distance <= 2). Despite this, standard cross-entropy loss weights all negative candidates equally. In this work, we propose Geometry-Aware Pointer (GAP) Loss, which reweights the cross-entropy objective based on spatial proximity to ground truth. By applying inverse distance weighting, GAP focuses gradient flow where the model struggles most: immediate neighbors receive stronger gradients than distant cells. Our approach requires only a straightforward modification to the loss computation, maintaining the same model architecture with zero additional inference cost. Extensive experiments on PubTabNet and SynthTabNet demonstrate that GAP consistently reduces adjacent-cell errors, achieving new state-of-the-art performance. Our findings suggest that incorporating geometric inductive biases at the loss level provides a simple yet effective approach to robust TSR. Our code is available at https://github.com/teamreboott/GAP
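
One way to picture an inverse-distance-weighted pointer loss is sketched below. The abstract does not give the exact formula, so the weighting `1 + alpha / distance` applied to negative candidates in the softmax denominator is an assumption made for illustration, as are the function names and the `alpha` parameter. With `alpha = 0` the sketch reduces to standard cross-entropy.

```python
import math


def manhattan(a, b):
    """Manhattan distance between two (row, col) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def standard_ce(logits, target):
    """Plain softmax cross-entropy for a single pointer decision."""
    m = max(logits)
    denom = sum(math.exp(z - m) for z in logits)
    return -math.log(math.exp(logits[target] - m) / denom)


def gap_style_loss(logits, cells, target, alpha=1.0):
    """Cross-entropy with negatives weighted by inverse distance to the target cell.

    Assumes all candidate cells occupy distinct grid positions (distance >= 1).
    The weight 1 + alpha / distance is a hypothetical choice, not the paper's formula.
    """
    m = max(logits)  # subtract the max logit for numerical stability
    positive = math.exp(logits[target] - m)
    denom = positive
    for j, (logit, cell) in enumerate(zip(logits, cells)):
        if j == target:
            continue
        weight = 1.0 + alpha / manhattan(cell, cells[target])
        denom += weight * math.exp(logit - m)
    return -math.log(positive / denom)


cells = [(0, 0), (0, 1), (5, 5)]  # target, adjacent cell, distant cell
near_confusion = gap_style_loss([2.0, 3.0, 0.0], cells, target=0)
far_confusion = gap_style_loss([2.0, 0.0, 3.0], cells, target=0)
```

Under this weighting, a confident wrong prediction on the adjacent cell is penalized more than the same mistake on a distant cell, which matches the qualitative behavior the abstract describes.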

2606.18793 2026-06-18 cs.CV 新提交

Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

模糊几何分支点建模用于结构感知的手写汉字增强

Dongbin Jiao, Yibo Lyu, Qiulu Wei, Fuxiang Lu, Shengcai Liu, Shi Yan

发表机构 * School of Information Science and Engineering, Lanzhou University（兰州大学信息科学与工程学院）； Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology（南方科技大学计算机科学与工程系广东省类脑智能计算重点实验室）

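AI summary: To address data scarcity and structural distortion in handwritten Chinese character augmentation, the paper proposes a fuzzy-geometry-based structure-aware augmentation framework that models branch points as fuzzy sets and optimizes them, then generates samples through Bézier reconstruction and multi-strategy perturbation, significantly reducing the word error rate.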

AI中文摘要

数据稀缺和结构失真严重限制了高安全性认证中的手写识别。现有的增强方法常导致拓扑和形态损伤，尤其在处理复杂汉字时，笔画交叉、连笔和急转弯使传统分支点检测不可靠。为此，本文提出一种模糊几何驱动的结构感知（FGSA）增强框架。我们将分支点建模为骨架空间中的模糊集，通过整合拓扑邻域证据和方向场散度，构建连续的分支点隶属度场。该隶属度场通过无监督代理目标自适应优化，实现无需人工标注的鲁棒笔画解耦。最后，通过参数化三次贝塞尔重建和多策略扰动合成运动学对齐样本，确保结构保真度与样本多样性之间的平衡。此外，我们建立了LZUSig，一个专门针对中文手写签名细粒度结构退化的大规模高挑战性数据集。在CASIA-HWDB1.1、ChiSig和LZUSig上的大量实验表明，FGSA显著降低了字错误率（ΔWER），在对比基线中取得了最优识别增益。更重要的是，它在任务增益、结构保真度和判别特征保留之间实现了稳健的权衡，为手写增强提供了一种高度可控的解决方案。

英文摘要

Data scarcity and structural distortion significantly limit handwriting recognition in high-security authentication. Existing augmentation methods often cause topological and morphological damage, particularly when processing complex Chinese characters where stroke intersections, ligatures, and sharp turns render traditional branch-point detection unreliable. To address this, this paper proposes a fuzzy geometry-driven structure-aware (FGSA) augmentation framework. We model branch points as fuzzy sets within the skeleton space, constructing a continuous branch-point membership field by integrating topological neighborhood evidence with direction field divergence. This membership field is adaptively optimized via an unsupervised surrogate objective, enabling robust stroke decoupling without manual annotation. Finally, kinematically-aligned samples are synthesized through parameterized cubic Bézier reconstruction and multi-strategy perturbations, ensuring a balance between structural fidelity and sample diversity. Moreover, we establish LZUSig, a large-scale, highly challenging dataset specifically dedicated to fine-grained structural degradation in Chinese handwritten signatures. Extensive experiments on CASIA-HWDB1.1, ChiSig, and LZUSig demonstrate that FGSA significantly reduces the word-level error rate ($Δ$WER), achieving optimal recognition gains over the compared baselines. More importantly, it strikes a robust trade-off among task gain, structural fidelity, and discriminative feature preservation, offering a highly controllable solution for handwriting augmentation.

2606.18884 2026-06-18 cs.CV 新提交

Performance Gap Analysis between Latin and Arabic Scripts HTR

拉丁文与阿拉伯文手写文本识别之间的性能差距分析

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

发表机构 * Luleå University of Technology Department of Computer Science, Electrical

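AI summary: Using a unified CRNN model across multiple datasets, the study compares handwritten text recognition performance on Arabic and Latin scripts. It finds that the gap is pronounced in low-resource settings and narrows but persists as data increases, and it analyzes factors such as annotation quality, visual variability, and character distribution.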

Comments Accepted at the TIPS workshop, ICPR 2026

AI中文摘要

Spiking Pyramid Wavelet Transform for Efficient, Low-Energy Image Restoration

Chen Zhao, Xiantao Hu, Song Wu, Qian Wang, Chen Wu, Rui Xie, Jian Yang, Ying Tai

发表机构 * Nanjing University（南京大学）； Nanjing University of Science and Technology（南京理工大学）； University of Science and Technology of China（中国科学技术大学）； China Mobile Institute（中国移动研究院）

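AI summary: Proposes SPWM, a model built on spiking neural networks and a pyramid wavelet transform, whose SDPW block models long-range dependencies and exploits degradation properties in the wavelet domain, substantially reducing computation and energy consumption while maintaining image quality.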

Comments Accepted by Pattern Recognition

AI中文摘要

尖峰神经网络（SNNs）因其高效性和生物启发的潜力在计算机视觉领域引起了广泛兴趣。虽然基于尖峰CNN的方法在图像恢复（IR）任务中显示出前景，但其性能受到CNN操作固有感受野限制的约束。在本文中，我们探索了离散小波变换的优势，并提出了一种基于尖峰金字塔小波模型（SPWM）以实现高效低能耗目标。具体来说，我们开发了一个尖峰双金字塔小波（SDPW）块来建模长程依赖并利用小波域中的退化特性。在多个基准上的实验结果表明，SPWM在保持图像质量的同时显著降低了计算成本和能耗。我们的方法展示了SNNs在IR领域的潜力，为资源受限设备的未来应用提供了新的见解。

英文摘要

Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In this paper, we explore the benefits of the discrete wavelet transform and propose a spiking pyramid wavelet-based model (SPWM) that targets high efficiency and low energy consumption. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependencies and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications on resource-limited devices.

2606.19046 2026-06-18 cs.CV 新提交

Low-Rank Tensor Completion Based on Fractional Regularization with Ky Fan p-k Norm

基于Ky Fan p-k范数分数阶正则化的低秩张量补全

Shan Fan, Feng Zhang, Jianjun Wang, Xi-Le Zhao, Tingwen Huang

发表机构 * School of Mathematics and Statistics, Southwest University（西南大学数学与统计学学院）； School of Mathematical Sciences/Research Center for Image and Vision Computing, University of Electronic Science and Technology of China（电子科技大学数学科学学院/图像与视觉计算研究中心）； Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology（深圳先进技术大学计算机科学与控制工程学院）

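AI summary: Proposes the ratio of the tensor nuclear norm to the Ky Fan p-k norm (TNPK) as a nonconvex surrogate for the tensor tubal rank, builds a low-rank tensor completion model, proves that low-rank tensors are local minimizers, and designs an ADMM algorithm; experiments show it outperforms existing methods.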

AI中文摘要

本文通过提出一种新颖的非凸替代，即张量核范数与张量Ky Fan p-k范数（TNPK）之比，来精确逼近张量管秩，从而解决低秩张量补全（LRTC）问题。TNPK具有吸引人的性质，包括尺度不变性、参数灵活性以及在特定p和k选择下存在闭式解。在特定的p和k参数设置下，它退化为张量核范数与张量Ky Fan k范数（TNK）之比或张量核范数与张量Frobenius范数（TNF）之比。我们构建了一个LRTC模型，并在张量零空间性质（NSP）下，证明了低秩张量是所提模型的局部极小点。此外，我们推导了Ky Fan p-k逆范数的近端算子，并进一步开发了一种高效的交替方向乘子法（ADMM）算法，在温和条件下保证子序列收敛。在合成和真实世界数据集上的大量实验验证了我们的方法相对于最先进竞争者的优越性能。

英文摘要

This paper addresses low-rank tensor completion (LRTC) by proposing a novel nonconvex surrogate, namely the ratio of the tensor nuclear norm to the tensor Ky Fan p-k norm (TNPK), to accurately approximate the tensor tubal rank. The TNPK possesses appealing properties, including scale invariance, parameter flexibility, and the existence of closed-form solutions under specific choices of p and k. With specific parameter settings of p and k, it reduces to the ratio of the tensor nuclear norm to the tensor Ky Fan k norm (TNK) or the ratio of the tensor nuclear norm to the tensor Frobenius norm (TNF). We construct an LRTC model and, under the tensor null space property (NSP), prove that low-rank tensors are local minimizers of the proposed model. Moreover, we derive the proximal operator of the Ky Fan p-k inverse-norm and further develop an efficient alternating direction method of multipliers (ADMM) algorithm with guaranteed subsequential convergence under mild conditions. Extensive experiments on synthetic and real-world datasets validate the superior performance of our method against state-of-the-art competitors.
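
To make the ratio concrete, here is a minimal sketch of its matrix analogue, computed from a list of singular values. It assumes the standard Ky Fan p-k norm, the p-th root of the sum of p-th powers of the k largest singular values; the tensor version in the paper is defined through the tubal-rank framework and is not reproduced here. The function names are hypothetical.

```python
def ky_fan_pk_norm(singular_values, p, k):
    """(sum of the p-th powers of the k largest singular values) ** (1/p)."""
    top_k = sorted(singular_values, reverse=True)[:k]
    return sum(s ** p for s in top_k) ** (1.0 / p)


def nuclear_to_ky_fan_ratio(singular_values, p, k):
    """Nuclear norm (sum of singular values) divided by the Ky Fan p-k norm."""
    nuclear = sum(singular_values)
    return nuclear / ky_fan_pk_norm(singular_values, p, k)
```

Scaling every singular value by a constant leaves the ratio unchanged, which is the scale invariance mentioned in the abstract. Setting p = 1 gives the nuclear-to-Ky-Fan-k ratio, and setting p = 2 with k equal to the number of singular values gives the nuclear-to-Frobenius ratio.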

2606.19097 2026-06-18 cs.CV 新提交

DVANet: Degradation-aware Visual-prior Alignment Network for Image Restoration

DVANet: 面向图像复原的退化感知视觉先验对齐网络

Yanjie Tu, Qingsen Yan, Axi Niu, Tao Hu, Haokui Zhang, Jiantao Zhou

发表机构 * School of Computer Science, Northwestern Polytechnical University（西北工业大学计算机学院）； Shenzhen Research Institute of Northwestern Polytechnical University（西北工业大学深圳研究院）； State Key Laboratory of Internet of Things for Smart City, University of Macau（澳门大学智慧城市物联网国家重点实验室）

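AI summary: Proposes DVANet, a deep unfolding network based on half-quadratic splitting optimization that jointly unfolds degradation-aware observation consistency and visual-prior-guided reconstruction to achieve unified image restoration under complex degradations, with strong results across multiple degradation scenarios and cross-domain tasks.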

Comments All-in-One Image Restoration; Deep Unfolding; Degradation Representation; Visual Prior

AI中文摘要

全能图像复原旨在开发一个统一的复原框架来处理多种退化类型。现有的端到端方法通常将复原过程视为黑盒映射，缺乏明确的优化解释。尽管深度展开为图像复原提供了可解释的迭代建模范式，但现有方法大多依赖于固定的退化假设或预定义的退化信息，难以适应复杂退化和局部内容受损下的统一复原需求。这一限制制约了它们在退化抑制和结构细节恢复方面的性能。为解决这些问题，本文提出DVANet，一种受半二次分裂优化算法启发的深度展开网络，将复杂退化下的统一图像复原公式化为退化感知观测一致性与视觉先验引导重建之间的协同展开过程。具体而言，在退化感知观测一致性分支中，采用退化表示模块提取全局退化属性和局部退化线索，并利用退化条件映射增强模型对不同退化类型的适应性。在视觉先验引导重建分支中，引入DINOv3提供结构和语义信息作为层次化视觉先验，从而补充受损区域缺失的结构信息并改善细节恢复。大量实验表明，DVANet在多场景退化和跨域图像复原任务上取得了优越或具有竞争力的性能，展现出良好的退化适应性和泛化能力。

英文摘要

All-in-One image restoration aims to develop a unified restoration framework for handling diverse degradation types. Existing end-to-end methods usually regard the restoration process as a black-box mapping, lacking an explicit optimization interpretation. Although deep unfolding provides an interpretable iterative modeling paradigm for image restoration, existing methods mostly rely on fixed degradation assumptions or predefined degradation information, making them difficult to adapt to unified restoration requirements under complex degradations and locally damaged content. This limitation restricts their performance in degradation suppression and structural detail recovery. To address these issues, this paper proposes DVANet, a deep unfolding network inspired by the half-quadratic splitting optimization algorithm, which formulates unified image restoration under complex degradations as a collaborative unfolding process between degradation-aware observation consistency and visual-prior-guided reconstruction. Specifically, in the degradation-aware observation consistency branch, a degradation representation module is employed to extract global degradation attributes and local degradation cues, and degradation-conditioned mapping is used to enhance the model's adaptability to different degradation types. In the visual-prior-guided reconstruction branch, DINOv3 is introduced to provide structural and semantic information as hierarchical visual priors, thereby complementing the missing structural information in damaged regions and improving detail recovery. Extensive experiments demonstrate that DVANet achieves superior or competitive performance on multi-scenario degradation and cross-domain image restoration tasks, showing favorable degradation adaptability and generalization ability.

2606.18318 2026-06-18 cs.CV cs.CR 新提交

Budget-Aware Adaptive Adversarial Patches for Black-Box Object Detection

预算感知的自适应对抗补丁用于黑盒目标检测

Pedram MohajerAnsari, Amir Salarpour, David Fernandez, Mert D. Pesé

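AI summary: Proposes a query-efficient, budget-adaptive black-box attack that combines contextual Thompson-sampling placement with NES pixel updates. Under a strict plain-image suppression test, it achieves strong suppression on CNN-based detectors and substantial suppression on a transformer-based detector, and it exposes a trade-off between query count and visual footprint.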

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026)

AI中文摘要

对抗补丁对现代目标检测器构成实际威胁。先前工作揭示了脆弱性，但三个差距限制了可操作的见解：(i) 很少有基于分数的黑盒攻击在严格查询预算下联合优化补丁的位置、纹理和大小；(ii) 成功很少与补丁的视觉足迹相关联；(iii) 评估常常混淆EOT鲁棒性与纯视图抑制。我们提出\method{}，一种查询高效、预算自适应的黑盒攻击，它结合了轻量级的上下文汤普森采样放置器与NES风格的像素更新，仅在进展停滞时增大补丁。报告基于严格的纯图像抑制测试；EOT被审计但从不作为成功的替代，可选的外观/可打印性权重揭示了强度-可见性权衡。在YOLOv5、Faster R-CNN和YOLOS上，\method{}在基于CNN的检测器上实现了强抑制，在基于Transformer的检测器上实现了显著抑制，使用紧凑的补丁，并相对于固定大小和启发式基线暴露了清晰的查询-足迹权衡。打印-捕获实验进一步展示了跨未见物理对象和视角的迁移。

英文摘要

Adversarial patches pose a practical threat to modern object detectors. Prior work shows vulnerability, but three gaps limit actionable insight: (i) few score-based black-box attacks jointly optimize patch location, texture, and size under tight query budgets; (ii) success is rarely tied to the patch's visual footprint; and (iii) evaluations often conflate EOT robustness with plain-view suppression. We present \method{}, a query-efficient, budget-adaptive black-box attack that couples a lightweight Contextual Thompson-Sampling placer with NES-style pixel updates, growing the patch only when progress stalls. Reporting is anchored by a strict plain-image suppression test; EOT is audited but never used as a substitute for success, and optional appearance/printability weights expose strength-visibility trade-offs. Across YOLOv5, Faster R-CNN, and YOLOS, \method{} achieves strong suppression on CNN-based detectors and substantial suppression on the transformer-based detector, using compact patches and exposing clear query-footprint trade-offs relative to fixed-size and heuristic baselines. A print-capture pilot further shows transfer across unseen physical objects and viewpoints.

2606.18510 2026-06-18 cs.CV cs.CR 新提交

Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

人脸呈现攻击检测中的架构偏差：视觉Transformer与卷积神经网络的比较研究

Ngela Landon Ntung, Floride Tuyisenge, Jema David Ndibwile

发表机构 * College of Engineering, Carnegie Mellon University（卡内基梅隆大学工程学院）

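AI summary: Comparing ViTs and CNNs for face presentation attack detection, the study finds that a pretrained ViT (DeiT-S) outperforms the CNN in accuracy, fairness, and cross-ethnicity generalization, reducing the inter-ethnic ACER gap by 83%.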

Comments 8 Pages, 4 Figures, 5 Tables

AI中文摘要

人脸呈现攻击检测（PAD）系统构成生物特征认证中的关键安全层；然而，现有方法在不同人口群体间表现出系统性性能差异，对深肤色个体影响尤为严重。本文通过实证比较研究，探究视觉Transformer架构相对于卷积基线是否能够减少人脸PAD系统中的人口统计偏差。实验在CASIA-SURF跨种族人脸反欺骗（CeFA）数据集上进行。评估了三种架构：从头训练的多模态ViT-Tiny、ResNet18 CNN基线，以及在CeFA上微调的预训练DeiT-S，覆盖非洲、东亚和零样本中亚人口群体。DeiT-S实现了最高总体准确率97.27%和最低等错误率0.86%，优于准确率90.15%的ResNet18。在公平性方面，DeiT-S将非洲与东亚受试者之间的种族间ACER差距降至0.13%，而基于LBP的工作[6]报告为0.75%，降低了83%。最值得注意的是，ResNet18在零样本中亚受试者上的BPCER为10.44%，而DeiT-S在相同未见群体上保持2.89%，展现出3.6倍的泛化优势。这些结果表明，预训练视觉Transformer在PAD中实现了更高的准确率，产生了更小的人口统计性能差距，并在未见人口群体上更公平地泛化，表明PAD中的跨人口公平性可能部分受架构设计影响。

英文摘要

Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.

2606.19184 2026-06-18 cs.CV cs.LG 新提交

When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

当AUC误导：域偏移下深度伪造检测器的极化感知评估

Dat Nguyen, Cosmin Radoi, Romain Hermary, Marcella Astrid, Nesryne Mejri, Enjie Ghorbel, Djamila Aouada

发表机构 * Cristal Laboratory, National School of Computer Sciences, University of Manouba（马努巴大学国家计算机科学学院Cristal实验室）

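AI summary: Because existing AUC evaluation does not reflect real-world settings with mixed data sources and varied artifact types, the paper proposes Cross-dataset AUC (Cross-AUC), which averages per-domain AUCs and adds a prediction-polarization measure (Wasserstein distance) to assess robustness to domain shift; experiments support its usefulness.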

AI中文摘要

生成式AI的最新进展，如扩散模型和换脸工具，使得创建高度逼真的深度伪造成为可能，导致了包括金融欺诈和非自愿色情内容在内的现实危害。为此，深度伪造检测成为一个活跃的研究领域，近期方法越来越关注提高对未见操作的泛化能力。这通常通过跨多个数据集分别测量的ROC曲线下面积（AUC）来评估。然而，这种评估未能反映检测器面对混合数据源和不同伪影类型的真实场景。为解决这一局限，我们引入一种新指标——跨数据集AUC（Cross-AUC），该指标平均每域AUC并加入预测极化度量，以考虑对域偏移的鲁棒性。极化程度通过类别分数分布之间的Wasserstein距离量化。Cross-AUC不仅更真实地评估深度伪造检测器在域偏移下的泛化能力，而且具有可解释性，因为它能更好地解释性能下降的原因。在七个基准数据集上的实验证明了其实用性。

英文摘要

Recent advances in generative AI, such as diffusion models and face-swapping tools, have enabled the creation of highly realistic deepfakes, leading to real-world harms including financial fraud and non-consensual explicit content. In response, deepfake detection has become an active research area, with recent methods increasingly focusing on improving generalization to unseen manipulations. This is typically evaluated using the Area Under the ROC Curve (AUC) measured separately across multiple datasets. However, such an evaluation fails to reflect real-world scenarios where detectors face a mixture of data sources and varying artifact types. To address this limitation, we introduce a novel metric, Cross-dataset AUC (Cross-AUC), which combines the average of per-domain AUCs with a measure of prediction polarization to account for robustness to domain shift. The extent of polarization is quantified by the Wasserstein distance between class score distributions. Cross-AUC not only assesses the generalization capabilities of deepfake detectors under domain shifts more realistically, but is also interpretable, as it better explains the reason behind a drop in performance. Experiments performed on seven benchmark datasets demonstrate its practical relevance.
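
The two ingredients named in the abstract can be sketched as follows: a per-domain AUC and a 1-D Wasserstein distance between the score distributions of the two classes. The abstract does not say how Cross-AUC combines them, so this sketch only reports them side by side. The function names, the label convention (1 = fake, 0 = real), and the dictionary input format are assumptions.

```python
import bisect


def auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def wasserstein_1d(u, v):
    """1-D Wasserstein-1 distance: area between the two empirical CDFs."""
    u, v = sorted(u), sorted(v)
    points = sorted(u + v)
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect.bisect_right(u, left) / len(u)
        cdf_v = bisect.bisect_right(v, left) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def per_domain_report(domains):
    """domains maps a name to (scores, labels); returns per-domain stats and the mean AUC."""
    report = {}
    for name, (scores, labels) in domains.items():
        fake = [s for s, y in zip(scores, labels) if y == 1]
        real = [s for s, y in zip(scores, labels) if y == 0]
        report[name] = (auc(scores, labels), wasserstein_1d(fake, real))
    mean_auc = sum(a for a, _ in report.values()) / len(report)
    return report, mean_auc
```

Two detectors can reach the same per-domain AUC while separating the classes by very different margins; the Wasserstein term is what exposes that difference.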

Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

Zhoupeng Guo, Yunqi Zhu, Zhihe Fan, Xinjie Yao, Ruipu Zhao, Boan Tao, Yiming Sun, Zhen Wang, Pengfei Zhu

发表机构 * School of Automation, Southeast University（东南大学自动化学院）； School of Computer Science and Engineering, University of New South Wales（新南威尔士大学计算机科学与工程学院）； School of Sports Training, Tianjin University of Sport（天津体育学院运动训练学院）； Faculty of Information Engineering and Automation, Kunming University of Science and Technology（昆明理工大学信息工程与自动化学院）； School of Artificial Intelligence, Tianjin University（天津大学人工智能学院）； School of Artificial Intelligence, Hebei University of Technology（河北工业大学人工智能学院）

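AI summary: Proposes the Air-Ground Progressive Collaboration (AGPC) benchmark and the Socialized Co-Perception (SCP) framework, whose Dual-Layer Router enables selective cross-view and cross-task interaction, improving average downstream performance by 7.86% in heterogeneous air-ground perception.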

AI中文摘要

空地协同感知对于真实世界动态环境中的鲁棒视觉理解至关重要。然而，现有研究通常将协作建模为单任务跨视角融合，忽视了定位、目标关联和细粒度解析之间的功能依赖关系。此外，空中和地面视角的异构性引入了显著的几何、尺度和遮挡差异，使得统一特征共享容易受到负迁移的影响。为解决这些问题，我们将空地感知建模为渐进式跨任务协作任务，并构建了空地渐进协作（AGPC）基准，这是一个包含超过745K原始视频帧的时空对齐基准。基于该基准，我们提出了社会化协同感知（SCP），一个从空中全局定位到地面目标关联和身份感知解析的渐进式协作框架。其核心模块——双层级路由器（DLR），将输入侧的多尺度专家选择与输出侧的任务条件调制解耦，实现了选择性的跨视角和跨任务交互，同时抑制有害干扰。大量实验证明了SCP的有效性。它实现了3.73%的协同进化增益和7.86%的平均下游性能提升。这些结果表明，对于异构空地感知，任务条件协作比统一融合更有效。代码可在该网址获取。

英文摘要

Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.

2606.18943 2026-06-18 cs.CV 新提交

Physics-IQ Verified

物理智力验证

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

发表机构 * Anates Labs（Anates实验室）； Technical University of Munich（慕尼黑技术大学）； University of Technology Nuremberg（纽伦堡技术大学）； Tuebingen AI Center, University of Tuebingen（图宾根大学人工智能中心）； Helmholtz AI, Munich（慕尼黑海德堡人工智能研究所）； Google DeepMind research（谷歌DeepMind研究）

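AI summary: Presents the Physics-IQ Verified benchmark, which improves prompt and ground-truth quality and introduces a sample-level scoring system to better evaluate how well video generative models understand physical reality; the revision refines 57.6% of samples and improves over 34.8% of prompts.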

AI中文摘要

视频生成模型（VGMs）已成为新的前沿，不仅用于视频生成，还用于多种下游任务，包括世界建模。为推进这些任务，一个良好的视频模型必须理解世界的物理现实。评估这种理解成为新兴领域，催生了Physics-IQ基准，通过将模型生成的视频与真实物理实验视频进行比较来量化。本文系统审计了Physics-IQ基准，揭示不足并提出三种解决方案，改进如何衡量VGMs的物理理解。具体而言，我们提高了提示和地面真实质量以减少混淆因素影响，并进一步引入样本级评分系统，使每个样本和指标权重相等。我们的基准Physics-IQ Verified优化了57.6%的所有样本并改进了超过34.8%的提示。在使用六个图像到视频生成模型的比较研究中，我们观察到中等但有意义的排名变化（Kendall's τ=0.46）。我们希望Physics-IQ Verified通过提供更可靠的信号推动社区发展，向物理准确的VGMs迈进。该基准的代码可通过此https URL访问。

英文摘要

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings, and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
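
For readers unfamiliar with the ranking statistic quoted above, the following minimal sketch computes Kendall's tau for two rankings of the same items. It implements the tau-a variant and assumes no tied ranks; which variant the authors used is not stated in the abstract, and the function name and dictionary input format are illustrative.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings given as {item: rank}, assuming no ties."""
    items = list(rank_a)
    concordant = 0
    discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant when both rankings order x and y the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value of 1 means the two benchmark versions rank every pair of models identically, and a value of -1 means the order is fully reversed, so an intermediate value indicates that some model pairs swap places.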

2606.18952 2026-06-18 cs.CV 新提交

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

SP-TransientBench: 一个真实捕获的单光子感知基准

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

发表机构 * Shanghai University（上海大学）； Southern University of Science and Technology（南方科技大学）； The University of Sydney（悉尼大学）

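AI summary: To address the perception challenges that noise and multi-return transient phenomena pose for single-photon LiDAR in real scenes, the paper proposes STB, a real-captured multi-task benchmark with 10 scenes and 10,297 views that supports evaluation of depth estimation, multi-view reconstruction, and 3D semantic understanding.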

AI中文摘要

基于单光子雪崩二极管（SPAD）传感的单光子LiDAR（SPL）能够以极高灵敏度进行时间分辨光子测量，为光子匮乏环境下的主动3D感知提供了独特潜力。然而，由于独特的测量噪声和复杂的多回波瞬态现象，真实世界的单光子感知仍然面临根本性挑战，这些因素共同使几何重建和语义场景理解变得复杂。尽管对基于SPAD的传感兴趣日益增长，现有研究大多局限于模拟数据或小规模受控捕获。因此，在深度估计、多视图重建和3D语义理解方面，对真实世界单光子感知的系统评估仍未得到充分探索。为弥补这一空白，我们引入了SP-TransientBench（STB），一个真实捕获的多任务单光子感知基准。STB包含10个多样化场景和10297个视图，使用固态单光子LiDAR以256×192分辨率捕获。每个视图提供具有多回波行为的完整飞行时间直方图、标准化元数据和用于多视图评估的校准相机位姿。我们还为选定场景提供了13类3D语义标注。通过为每个任务提供专用数据划分和评估协议，STB能够在多个3D视觉问题上实现真实世界单光子感知的一致且可重复的基准测试。数据集和代码将在接收后发布。

英文摘要

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single-photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single-photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single-photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at $256\times192$ resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single-photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

2606.19053 2026-06-18 cs.CV 新提交

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

大规模视觉-语言模型在细粒度图像任务上的基准测试：从评估到诊断

Hong-Tao Yu, Chen-Wei Xie, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

发表机构 * School of Computer Science and Engineering, Southeast University, China（东南大学计算机科学与工程学院，中国）； Alibaba Group（阿里巴巴集团）； School of Computer Science and Engineering, School of Intelligence Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, China（东南大学计算机科学与工程学院、智能科学与工程学院以及新一代人工智能技术及其交叉应用关键实验室，中国）； Wangxuan Institute of Computer Technology, National Key Laboratory for Multimedia Information Processing, Peking University, China（北京大学王轩计算机技术研究所、多媒体信息处理国家重点实验室，中国）； University of Copenhagen, Denmark（丹麦哥本哈根大学）

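AI summary: Proposes the FG-BMK benchmark, with 1.01 million questions and 0.28 million images, which evaluates the fine-grained semantic recognition and visual discriminability of LVLMs through human-oriented and machine-oriented paradigms, diagnoses the causes of failure, and identifies bottlenecks in visual representation, semantic grounding, and related areas.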

AI中文摘要

近期大规模视觉-语言模型（LVLMs）展示了显著的多模态感知和推理能力。尽管众多基准从整体或任务特定角度评估了LVLMs，但它们在细粒度图像任务（计算机视觉的基础）上的能力仍未得到充分理解。为填补这一空白，我们引入FG-BMK，一个全面的细粒度评估基准，包含101万问题和28万图像，覆盖从常见物体中心领域到专业领域的多样化场景。FG-BMK通过面向人类和面向机器的范式，联合评估对话级细粒度语义识别和特征级视觉判别能力，从而诊断分析LVLM的失败是否源于视觉表示不足、视觉-语义对齐薄弱或细粒度知识有限。通过对一系列代表性LVLM/VLM的大量实验，我们发现当前LVLMs仍是不充分的细粒度识别器，失败源于视觉表示、语义对齐、模态对齐和类别级知识中相互交织的瓶颈。我们进一步分析了提升细粒度能力的训练设计因素，并考察了视觉和语言扰动如何影响LVLM预测。这些发现为当前LVLMs的局限性提供了诊断性见解，并为未来数据构建和模型设计提供了指导，以开发更可靠的细粒度视觉任务LVLMs。我们的代码已开源，可从此https URL获取。

英文摘要

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.

2606.18661 2026-06-18 cs.CV cs.AI 新提交

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

LandslideAgent与多模态LandslideBench：一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

发表机构 * Central South University（中南大学）

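AI summary: Proposes an instruction-driven agentic framework comprising the multimodal dataset LandslideBench, the landslide-specific vision-language model LandslideVLM, and the domain-rule-enhanced agent LandslideAgent, enabling autonomous landslide identification and analysis.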

AI中文摘要

智能滑坡灾害解译对于防灾减灾至关重要，然而当前范式难以同时提取视觉特征和高层次地球科学语义，而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战，我们提出一个指令驱动的智能体框架，包含三个组成部分。首先，通过多VLM交叉验证和交互式标注构建LandslideBench，这是一个多模态细粒度数据集，包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后，通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM，以增强地质语义理解。最后，以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent，采用双规则控制器，结合结构化报告元数据约束和交叉验证识别约束，来调控自动化工具调用。实验表明，LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理，实现了滑坡识别与分析的全流程智能化。

英文摘要

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.19249 2026-06-18 cs.CV cs.LG 新提交

Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory

Transformer几何观测站TGO-I：谱几何观测站

Kaustubh Kapil, Kishor P. Upla

发表机构 * Sardar Vallabhai National Institute of Technology (SVNIT), Surat, India（印度苏拉特萨达尔·瓦拉巴伊国家理工学院（SVNIT））

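AI summary: Proposes the TGO framework, which analyzes the spectral geometry of ViT representations (effective rank, stable rank, participation ratio, spectral entropy, spectral flatness, spectral anisotropy, and others). It finds that during training dimensional utilization increases, anisotropy decreases, and spectral entropy and participation ratio rise, with the final CLS token representation showing the highest effective dimensionality and lowest anisotropy.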

AI中文摘要

尽管Vision Transformers（ViTs）被广泛采用并在众多计算机视觉应用中取得成功，对其维度和表示几何的基本理解仍然相对未被充分探索。为了弥补这一差距，我们引入了Transformer几何观测站（TGO），这是一个系统的实验和分析流程框架，旨在研究Vision Transformers的表示几何和动态。TGO-I是该框架的第一部分，专注于ViT表示的谱几何。使用在ImageNet-100上训练的ViT-Small/16模型，我们分析了训练过程中的有效秩、稳定秩、参与比、谱熵、谱平坦度、谱各向异性、协方差结构、特征谱和奇异值谱。我们的结果揭示了维度利用的一致增加，伴随着各向异性降低、谱熵增加、参与比增加以及逐渐平坦的特征谱。与常见的直觉（即训练应将信息集中到少数主导方向）相反，我们观察到方差在表示维度上的逐渐重新分布。这一现象在最终的CLS标记表示中尤为明显，该表示在网络中表现出最高的有效维度和最低的各向异性。

英文摘要

Despite the widespread adoption of Vision Transformers (ViTs) and their success across numerous computer vision applications, the fundamental understanding of their dimensional and representational geometry remains relatively underexplored. To address this gap, we introduce Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines designed to investigate the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, we analyze Effective Rank, Stable Rank, Participation Ratio, Spectral Entropy, Spectral Flatness, Spectral Anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Our results reveal a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a small number of dominant directions, we observe a progressive redistribution of variance across representational dimensions. This phenomenon is particularly pronounced in the final CLS token representation, which exhibits the highest effective dimensionality and lowest anisotropy within the network.
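
Several of the spectral quantities listed above have common textbook definitions, sketched below for a list of covariance eigenvalues. The paper's exact definitions may differ; in particular, applying the entropy-based effective rank to covariance eigenvalues rather than to singular values is an assumption here, as is the function name `spectral_summary`.

```python
import math


def spectral_summary(eigenvalues):
    """Common spectral statistics of a covariance spectrum (non-negative eigenvalues)."""
    lam = [v for v in eigenvalues if v > 0]  # drop zeros so log() is defined
    total = sum(lam)
    probs = [v / total for v in lam]  # spectrum normalised to sum to 1
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "spectral_entropy": entropy,
        # Effective rank: exponential of the Shannon entropy of the normalised spectrum.
        "effective_rank": math.exp(entropy),
        # Participation ratio: (sum of eigenvalues)^2 / (sum of squared eigenvalues).
        "participation_ratio": total ** 2 / sum(v * v for v in lam),
        # Stable rank of the underlying feature matrix: trace / largest eigenvalue.
        "stable_rank": total / max(lam),
    }
```

A perfectly flat spectrum over d dimensions gives a value of d for all three rank-like measures, while a spectrum dominated by one direction gives values near 1, so the rising values reported in the abstract correspond to variance spreading across more dimensions.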

1. Multimodal and Vision-Language Models (11 papers)

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Native Active Perception as Reasoning for Omni-Modal Understanding

2. Embodied Intelligence, Robotics, and Autonomous Driving (6 papers)

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

3. Image Recognition, Retrieval, and Classification (3 papers)

A Prototypical Signature Approach for Writer-Independent Offline Signature Verification

LARE: Low-Attention Region Encoding for Text-Image Retrieval

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

4. Object Detection, Segmentation, and Localization (3 papers)

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection

5. Video Understanding and Temporal Vision (4 papers)

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

APT: Atomic Physical Transitions for Causal Video-Language Understanding

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

6. Generative Vision and World Models (9 papers)

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

7. 3D Vision, Point Clouds, and Spatial Intelligence (9 papers)

CAOA -- Completion-Assisted Object-CAD Alignment

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Intrinsic 4D Gaussian Segmentation from Scene Cues

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

8. Medical Imaging and Biological Vision (18 papers)

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

On-Manifold Variational Learning with Heat-Kernel Priors

BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection

Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

Learning to Distort: Weakly-Supervised Image Quality Transfer for Prostate DWI Correction

Bridging Single Distortion Artifacts and Multifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

Automatic ply-specific analyses of CFRP micrographs using shortest-path-based ply distinction

GUMP-Net: An interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation

Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation

Parameter-Efficient Adaptation of SAM 3 for Automated ITV Generation from 4DCT Images

9. Document Images, OCR, and Chart Understanding (5 papers)

Rethinking the Pointer Loss in Table Structure Recognition: Geometry-Aware Pointer Loss for Spatial Locality

Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

Performance Gap Analysis between Latin and Arabic Scripts HTR

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

10. Low-Level Vision, Computational Imaging, and Image Enhancement (4 papers)

Neural Phase Correlation

Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration