arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26115 2026-05-26 cs.CV (updated version)

TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Weijie Wang, Zimu Li, Jinchuan Shi, Zeyu Zhang, Botao Ye, Marc Pollefeys, Donny Y. Chen, Bohan Zhuang

Affiliations Zhejiang University; ETH Zurich; ETH AI Center; Microsoft; Monash University

AI summary Proposes TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and predicts and exports simulation-ready mesh scenes directly from sparse-view images.

Comments Project Page: https://lhmd.top/trisplat, Code: https://github.com/ziplab/TriSplat

Abstract

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.
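
To make the step from predicted point maps to triangle frames more concrete, here is a minimal Python sketch. It is not TriSplat's implementation: the function names, the forward-difference scheme, and the helper-axis rule are assumptions chosen for illustration, and the image-conditioned refinement head described in the abstract is omitted.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    # Degenerate (zero-length) vectors are not handled in this sketch.
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)

def geometry_normal(point_map, row, col):
    """Normal at a pixel from forward differences of a point map.

    point_map[row][col] is a 3D point (x, y, z). The pixel must have a right
    and a lower neighbor. The sign of the normal depends on the (assumed)
    image-axis convention.
    """
    p = point_map[row][col]
    du = sub(point_map[row][col + 1], p)  # step along image columns
    dv = sub(point_map[row + 1][col], p)  # step along image rows
    return normalize(cross(du, dv))

def local_frame(normal):
    """Build an orthonormal, right-handed (tangent, bitangent, normal) frame."""
    # Pick the helper axis that is not nearly parallel to the normal.
    helper = (1.0, 0.0, 0.0) if abs(normal[0]) < 0.9 else (0.0, 1.0, 0.0)
    tangent = normalize(cross(helper, normal))
    bitangent = cross(normal, tangent)
    return tangent, bitangent, normal

# A flat 2x2 point map lying in the z = 0 plane.
flat = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]]
normal = geometry_normal(flat, 0, 0)
tangent, bitangent, _ = local_frame(normal)
```

Deriving the frame from a normal, rather than regressing a free rotation, keeps the tangent and bitangent orthogonal to the surface by construction, which is the kind of stability the abstract attributes to its parameterization.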

2605.26113 2026-05-26 cs.RO cs.CV (updated version)

AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond

Haiming Zhang, Junfei Zhou, Feng Jiang, Jingzhong Li, Zhenglong Guo, Penglin Dai, Jifeng Dai, Yan Xie, Benjin Zhu

Affiliations Li Auto; Southwest Jiaotong University; Tsinghua University

AI summary Proposes the AnyScene framework, which uses a spatial-temporal occupancy diffusion Transformer and a geometry-grounded view expansion module to generate semantic occupancy sequences from BEV layouts and reference-free multi-view driving videos, supporting precise controllability and long-horizon generation.

Comments Work in progress. Project page: https://mind-omni.github.io/

Abstract

Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.

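2605.26111 2026-05-26 cs.CV cs.AI cs.GR cs.LG cs.MM (updated version)

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski

Affiliations University of Toronto & Vector Institute; Adobe; Google

AI summary Proposes a method that combines multimodal large language model conditioning with VAE-based identity conditioning and, through a Dual Layer Aggregation module and a multi-stage denoising strategy, balances multimodal understanding with identity preservation in subject-driven image generation, outperforming existing methods.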

Comments 33 pages, 18 figures, Project Page: https://zsh2000.github.io/squeeze-mllm-subject-gen/

Abstract

Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately, which limits cross-modal reasoning and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment this conditioning with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from the MLLM and the fine-detail identity from the VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.

2605.26110 2026-05-26 cs.LG cs.CL cs.CV (updated version)

Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie, Da-Wei Zhou

Affiliations School of Artificial Intelligence, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China

AI summary Addresses engineering bottlenecks in multimodal continual instruction tuning with Prism, a plug-in codebase that separates algorithm development from the backbone implementation through a lightweight plugin registration mechanism and supports large-scale training pipelines for reproducible, scalable experiments.

Comments Code is available at https://github.com/LAMDA-CL/Prism

Abstract

Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT). Despite its growing importance, current MCIT research is hindered by severe engineering bottlenecks. Existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison. To address this, we introduce Prism, a plug-in reproducible codebase specifically designed for scalable MCIT research. It separates algorithmic development from the backbone implementation via a lightweight plugin registration mechanism, enabling new strategies to be integrated as independent plugins without modifying the underlying MLLM codebase, thereby eliminating structural fragmentation and accelerating method development. Prism natively supports widely used large-scale training pipelines, thereby enabling reproducible and scalable MCIT experimentation. Code is available at https://github.com/LAMDA-CL/Prism.
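
A plugin registration mechanism of the general kind the abstract describes can be sketched with a decorator-based registry. The snippet below is a generic Python pattern, not Prism's actual API: `PLUGIN_REGISTRY`, `register_plugin`, `ContinualPlugin`, `L2Penalty`, and `build_plugin` are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict

# Hypothetical global registry mapping plugin names to plugin classes.
PLUGIN_REGISTRY: Dict[str, type] = {}

def register_plugin(name: str) -> Callable[[type], type]:
    """Class decorator that records a plugin class under a string name."""
    def decorator(cls: type) -> type:
        if name in PLUGIN_REGISTRY:
            raise ValueError(f"plugin {name!r} is already registered")
        PLUGIN_REGISTRY[name] = cls
        return cls
    return decorator

class ContinualPlugin:
    """Hypothetical base class: hooks a training loop could call."""
    def before_task(self, task_id: int) -> None:
        pass

    def modify_loss(self, loss: float) -> float:
        return loss

@register_plugin("l2_penalty")
class L2Penalty(ContinualPlugin):
    """Toy strategy: add a weighted drift penalty to the task loss."""
    def __init__(self, weight: float = 0.1, drift: float = 0.0):
        self.weight = weight
        self.drift = drift

    def modify_loss(self, loss: float) -> float:
        return loss + self.weight * self.drift

def build_plugin(name: str, **kwargs) -> ContinualPlugin:
    """Instantiate a registered plugin by name, e.g. from a config file."""
    return PLUGIN_REGISTRY[name](**kwargs)

plugin = build_plugin("l2_penalty", weight=0.5, drift=2.0)
adjusted = plugin.modify_loss(1.0)
```

With this pattern the backbone training loop only calls the base-class hooks, so a new strategy is added by defining and registering one class rather than by editing the backbone code.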

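2605.26109 2026-05-26 cs.CV (updated version)

Helix4D: Complex 4D Mesh Generation

Jiraphon Yenphraphai, Jianqi Chen, Jian Wang, Gordon Qian, Sergey Tulyakov, Rameen Abdal, Raymond A. Yeh, Peter Wonka, Chaoyang Wang

Affiliations Snap; Purdue University; KAUST (King Abdullah University of Science and Technology)

AI summary Proposes the Helix4D framework, which uses sliding-window cross-frame attention and a 4D temporal encoding to extend Trellis2 from image-to-3D to video-conditioned 4D dynamic mesh generation, addressing complex topology changes, transparent materials, thin structures, and inner surfaces.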

Comments Project page: https://snap-research.github.io/helix4d/

Abstract

Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D to video-conditioned 4D generation. Our design arises from two key questions: (a) how to enable Trellis2's frame-local attention to share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, letting it inherit Trellis2's quality in rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D to 4D with no additional parameters. Extensive experiments show the effectiveness of Helix4D for high-quality dynamic mesh generation on ActionBench and our own challenging complex dynamics set.

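2605.26105 2026-05-26 cs.CV (updated version)

On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Yang Luo, Shengju Qian, Xiaohang Tang, Zirui Zhu, Yong Liu, Xin Wang, Yang You

Affiliations LIGHTSPEED; University College London

AI summary Proposes the on-policy Adversarial Flow Distillation (AFD) framework, which uses on-policy adversarial feedback and forward-process flow matching to distill heterogeneous black-box teacher models efficiently into autoregressive student models.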

Abstract

Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ in architecture, capacity, temporal design, and sampling schedule. This interface makes supervised fine-tuning off-policy, score-based distillation inapplicable, and direct adversarial imitation too sparse for denoising-time credit assignment. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate clean-sample teacher-student discrepancy, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student's own noised states. Thus, AFD provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal AR student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
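
The prompt-paired Bradley-Terry discriminator mentioned above can be illustrated with the standard Bradley-Terry objective, under which the probability that the teacher sample beats the student sample is the sigmoid of their score difference. The Python sketch below shows only that loss on scalar scores; the function names are hypothetical, the scores stand in for the output of a learned discriminator, and the conversion of the resulting advantage into flow-matching updates is not shown.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bradley_terry_loss(teacher_scores, student_scores) -> float:
    """Mean negative log-likelihood that each teacher clip is preferred
    over the student clip generated for the same prompt.

    Both lists are aligned by prompt: index i holds the discriminator
    score of the teacher clip and of the student clip for prompt i.
    """
    pairs = list(zip(teacher_scores, student_scores))
    return -sum(log_sigmoid(t - s) for t, s in pairs) / len(pairs)

# When the discriminator cannot tell the two apart, the loss equals log(2).
undecided = bradley_terry_loss([0.0, 1.5], [0.0, 1.5])
# A larger teacher-minus-student margin gives a smaller loss.
separated = bradley_terry_loss([2.0, 3.5], [0.0, 1.5])
```

Pairing by prompt means the score difference compares two clips for the same request, so prompt difficulty cancels out of the comparison.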

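2605.26104 2026-05-26 cs.CV (updated version)

EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding

Geo Ahn, Jiwook Han, Youngrae Kim, Joonseok Lee, Jinwoo Choi

Affiliations Kyung Hee University; University of Southern California; Seoul National University

AI summary To counter the performance drop that domain shift causes in video temporal grounding, proposes the EVIDENT framework, which uses an entity bottleneck adapter, an entity-binding distillation loss, and an entity-to-evidence gating mechanism to exploit the entity attention of pre-trained MLLMs for parameter-efficient temporal grounding that is robust across domains.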

Abstract

Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that this failure is primarily driven not just by unseen query concepts, but by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability. To address this, we introduce EVIDENT, a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence. EVIDENT consists of three components: (i) an Entity Bottleneck Adapter that transforms dense visual tokens into compact entity-level slots, (ii) an Entity-Binding Distillation loss that instills objectness priors into the semantically unstructured MLLM visual space, guiding each slot to bind to a coherent entity, and (iii) an Entity-to-eVidence gating mechanism that leverages the captured entities as evidence, steering the model to localize moments containing query-relevant entities. Together, these components enable VTG fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts. Experiments on cross-domain VTG benchmarks show that EVIDENT consistently improves out-of-domain robustness while preserving competitive in-domain performance with modest parameter overhead. These results suggest that entity-level grounding is an effective inductive bias for generalizable temporal localization.

2605.26095 2026-05-26 cs.CV (updated version)

Pixel-Level Pavement Distress Assessment Using Instance Segmentation

Logan Dewick, Bibesh Pyakurel, Kong Pheng Yang, Nazim Choudhury, M. G. Sarwar Murshed

Affiliations Computer Science Department, University of Wisconsin - Green Bay, Green Bay, WI, USA

AI summary Proposes a pavement distress analysis system based on Mask R-CNN instance segmentation that segments cracks and potholes accurately on a custom dataset and validates its effectiveness on real-world pavement images.

Comments 7 pages, 6 figures

Abstract

Automated pavement distress assessment requires more than image-level classification or coarse bounding box detection, demanding precise localization of thin, branching, and irregular cracks to achieve the geometric precision necessary for maintenance-relevant quantification. This paper presents a vision-based pavement distress analysis system based on Mask R-CNN instance segmentation and evaluates it on UWGB-StreetCrack, a custom field-collected roadway image dataset acquired with a vehicle-mounted smartphone and manually annotated with polygon labels for longitudinal cracks, transverse cracks, alligator cracks, and potholes. Five Detectron2-based Mask R-CNN backbone variants were considered under a consistent fine-tuning protocol. The best-performing model, Mask R-CNN with a ResNet-101 FPN backbone, achieved 84.23% precision, 90.04% recall, and an F1 score of 87.04% under the project-specific bounding-box matching protocol. The same model produced an aggregate predicted crack-area fraction of 2.164%, closely matching the 2.170% ground-truth crack-area fraction. To contextualize the segmentation system against a detector-oriented alternative, a CSPDarknet53-based YOLO detector was also adapted and retrained on the dataset, reaching 27.5% precision and 20.7% recall on the validation protocol. The results show that instance segmentation is a practical direction for field pavement imagery and aggregate crack-area estimation, while also exposing open challenges in annotation consistency, class imbalance, confounder rejection, and mask-level benchmarking.
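
The reported F1 score follows from the stated precision and recall through the usual harmonic-mean formula, and the gap between predicted and ground-truth crack-area fractions can be expressed as a relative error. The short Python check below uses only the numbers quoted in the abstract; the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2.0 * precision * recall / (precision + recall)

def relative_error(predicted: float, reference: float) -> float:
    """Absolute difference as a fraction of the reference value."""
    return abs(predicted - reference) / reference

# Mask R-CNN with ResNet-101 FPN, values in percent as quoted above.
f1 = f1_score(84.23, 90.04)
print(round(f1, 2))  # → 87.04

# Aggregate crack-area fraction: 2.164% predicted vs 2.170% ground truth.
area_error = relative_error(2.164, 2.170)
```

The relative error of the area estimate comes out below one percent of the ground-truth value, which is what "closely matching" amounts to numerically.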

2605.26062 2026-05-26 cs.GR cs.CV (updated version)

Look Both Ways Before You Cross: Lifting Cross Fields From 2D Visual Priors

Dale Decatur, Jacob Serfaty, Oded Stein, Amir Vaxman, Rana Hanocka

Affiliations University of Chicago; University of Southern California; University of Edinburgh

AI summary Proposes CrossLift, which uses text-to-image priors to extract directional signals from 2D images and back-projects them onto the mesh surface through two smooth interpolations, producing semantically aligned cross fields and quad meshes.

Comments Project page at: https://crosslift.github.io/

Abstract

We present CrossLift, a technique for computing cross fields on meshes guided by visual features in images. We leverage powerful text-to-image priors that are capable of synthesizing images of feature-aligned quad meshes in 2D. We extract this signal as explicit per-pixel directions in the 2D images, which we then back-project to the mesh surface. We aggregate these candidate surface directions by performing two smooth interpolations on the mesh surface (first within each view and second across multiple views). We propose custom confidence-based weights for the candidate directions in each interpolation that allow us to resolve conflicts between candidates on the same face and smoothly interpolate our field to occluded faces. Our method is modular and can be used with many different 2D visual priors. We show additional applications to texture-aligned quad meshing as well as interactive cross-field design using coarse, user-drawn lines as signal. We demonstrate the effectiveness of CrossLift on a diverse set of both organic and mechanical shapes and produce quad meshes that exhibit superior semantic alignment as compared to existing methods. Project page at: https://crosslift.github.io/
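
One reason candidate directions need careful aggregation is that a cross is unchanged by 90-degree rotations, so naively averaging angles gives wrong answers. The Python sketch below shows the standard 4-fold symmetric trick (average the angles multiplied by four, then divide by four) with confidence weights. It is a simplified stand-in, not CrossLift's smooth interpolation on the mesh; the function name, the per-face 2D angle parameterization, and the weights are assumptions for illustration.

```python
import math

def average_cross_direction(angles, weights):
    """Confidence-weighted average of cross directions on one face.

    Each angle (radians, in the face's tangent plane) represents a cross,
    so theta and theta + pi/2 are the same candidate. Multiplying by 4
    maps all four equivalent angles to the same point on the unit circle.
    Returns a representative angle in (-pi/4, pi/4].
    """
    x = sum(w * math.cos(4.0 * a) for a, w in zip(angles, weights))
    y = sum(w * math.sin(4.0 * a) for a, w in zip(angles, weights))
    return math.atan2(y, x) / 4.0

# Two views report the same cross, one rotated by 90 degrees.
same_cross = average_cross_direction([0.1, 0.1 + math.pi / 2], [1.0, 1.0])

# Conflicting candidates: the higher-confidence one dominates.
weighted = average_cross_direction([0.0, 0.2], [3.0, 1.0])
```

In the first call both candidates describe the same cross, so the average stays at that cross instead of drifting to the midpoint of the raw angles. In the second, the result lies closer to the candidate with the larger weight.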

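2605.26038 2026-05-26 cs.CV cs.AI (updated version)

DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang

Affiliations Shanghai Jiao Tong University

AI summary To address the unreliable reasoning chains that lightweight vision-language models produce in dense-scene reasoning for lack of explicit visual grounding, proposes DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages and enforces grounded reasoning without architectural modification, substantially improving dense-scene reasoning performance.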

Abstract

Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs demonstrate substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, demonstrating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at https://github.com/irene-shi/DRScaffold.

2605.26032 2026-05-26 cs.CV cond-mat.stat-mech cs.AI cs.LG (updated version)

Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore, William T. Freeman, Congyue Deng, Marin Soljačić

Affiliations Department of Physics, Massachusetts Institute of Technology; Department of EECS, Massachusetts Institute of Technology; NSF AI Institute for Artificial Intelligence and Fundamental Interactions; Institute for Data, Systems and Society, Massachusetts Institute of Technology

AI summary Proposes SKILD, a scale-invariant diffusion model that unifies image generation and continuous super-resolution, switching between the two tasks by changing only the starting timestep.

Comments 29 pages, 17 figures

Abstract

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce $\textbf{SKILD}$, a $\textbf{S}$cale-invariant $\textbf{K}$-Space $\textbf{I}$mage $\textbf{L}$earning $\textbf{D}$iffusion model that unifies generation and continuous super-resolution within a single unconditional framework. Both natural images and critical physical systems exhibit scale invariance, and we leverage it to design a forward process that attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, making scale an explicit coordinate of the diffusion dynamics. The same trained reverse process performs generation and continuous super-resolution by varying only the starting timestep: $\textit{no task-specific architecture, no conditioning branch, no classifier-free guidance, no retraining per scale factor}$. Empirically, SKILD reaches FID $2.65$ and Inception Score $9.63$ on unconditional CIFAR-10, performs $2\times$--$8\times$ super-resolution on ImageNet from a single unconditional checkpoint while outperforming conditional models across perceptual metrics, and reconstructs critical Ising models whose connected four-point correlations closely track the ground truth.

2605.26026 2026-05-26 cs.CV cs.AI cs.LG (updated version)

A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring

Adina Scheinfeld, Haotan Zhang, Shang Mu, Rudolf L. M. van Herten, Lucas Stoffl, Ali Erturk, Zhuhao Wu, Johannes C. Paetzold

Affiliations Tri-Institutional Program in Computational Biology & Medicine, Weill Cornell Medicine, New York, NY, USA; Department of Radiology, Weill Cornell Medicine, New York, NY, USA; Helen & Robert Appel Alzheimer's Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, New York, NY, USA; Graduate Program in Physiology, Biophysics & Systems Biology, Weill Cornell Medicine, New York, NY, USA; Cornell Tech, New York, NY, USA; Institute for Intelligent Biotechnologies (iBIO), Helmholtz Center Munich, Neuherberg, Germany; Institute for Stroke and Dementia Research, Klinikum der Universität München, Ludwig-Maximilians University Munich, Munich, Germany

AI summary Proposes a 3D foundation model pretrained on light sheet fluorescence microscopy data by jointly optimizing masked reconstruction and image-text alignment; few-shot adaptation substantially reduces annotation cost and improves segmentation, classification, and deblurring performance.

Comments 11 pages, 3 figures

Abstract

Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the size, dimensionality, and annotation burden of LSM data make supervised deep learning approaches costly and difficult to scale. Additionally, despite the abundance of unannotated LSM volumes, foundation models for this modality remain underexplored due to computational challenges and the complexity of volumetric representation learning. In this work, we introduce a 3D foundation model for LSM data, pretrained on a large curated collection of 3D images spanning multiple organisms, stains, and imaging protocols. We learn transferable volumetric representations by jointly optimizing for masked reconstruction and image-text alignment. The pretrained backbone drastically reduces the annotation burden, enabling efficient, few-shot adaptation for varied downstream tasks. We evaluate this approach on downstream segmentation, classification, and deblurring. Our results demonstrate consistent improvements over baselines, (1) when measured using standard evaluation metrics and (2) when rigorously assessed by domain experts. This highlights the potential of foundation model pretraining to reduce annotation requirements while improving performance across diverse LSM analysis tasks. Pretrained model weights and code for pretraining and finetuning are publicly available: https://github.com/AdinaScheinfeld/lsm_fm_public_repo.git.

2605.26014 2026-05-26 cs.CV cs.CL (updated version)

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

Yiming Liang, Yixiao Chen, Yiyang Zhou, Yixuan Wang, Shoubin Yu, Andong Deng, Fuxiao Liu, Qin Zhang, Chen Chen, Mohit Bansal, Huaxiu Yao

Affiliations Purdue; Harvard; UNC; UCF; NVIDIA; Physion Labs

AI summary Proposes the STORM framework, which internalizes the reasoning process through bounded continuous latent trajectories, with no explicit textual chain-of-thought or external tools, improving video reasoning accuracy while reducing inference overhead.

Abstract

Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated frame reinsertion, or external tool use. While effective, such pipelines increase inference-time latency and engineering complexity, and they force temporal-visual evidence to be serialized into text or repeatedly re-encoded from frames. Inspired by the intuition that visual reasoning can occur implicitly before verbalization, we propose STORM (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework that teaches LVLMs to reason through bounded continuous latent trajectories instead of explicit textual CoT. In Stage I, STORM aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. In Stage II, the model is further trained with answer-only supervision, encouraging the reasoning process to be internalized without step-by-step annotations. Generated thought videos are used only during training; at inference, STORM performs a bounded latent rollout without regenerating videos, reinserting frames, or invoking external visual tools. Experiments on VideoMME, MVBench, TempCompass, and MMVU show that STORM improves video reasoning accuracy while substantially reducing inference overhead compared with tool-based or video-generation-based reasoning pipelines.

2605.26013 2026-05-26 cs.LG cs.AI cs.CV 版本更新

AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

AdvantageFlow: 流模型中基于优势加权的强化学习最小二乘法

Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Krishna Kumar Singh, Viet Dac Lai

发表机构 * Adobe Research(Adobe研究)

AI总结 提出AdvantageFlow算法,通过优势加权前向过程预测损失和 rollout 策略正则化,在图像生成任务中优于Flow-GRPO和负感知微调基线。

详情
AI中文摘要

我们引入了AdvantageFlow,一种用于修正流模型的前向过程强化学习算法。与优化反向过程的Flow-GRPO不同,我们优化了一个优势加权的前向过程预测损失。当优势为负且损失变为非凸时,该优化问题不稳定。我们通过rollout策略正则化来稳定它,这降低了方差,并源于拟合局部奖励改进的目标分布。我们在Stable Diffusion 3.5 Medium上评估了AdvantageFlow在图像生成任务中的表现。它优于Flow-GRPO和基于负感知微调的最先进前向过程强化学习基线。

英文摘要

We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimize an advantage-weighted forward-process prediction loss. This optimization problem is unstable when advantages are negative and the loss becomes non-convex. We stabilize it by rollout policy regularization, which reduces variance and arises from fitting a local reward-improving target distribution. We evaluate AdvantageFlow on image generation tasks with Stable Diffusion 3.5 Medium. It outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.
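The instability described in the abstract can be illustrated with a scalar toy problem. The sketch below is not the paper's loss: it assumes a single scalar prediction `p`, made-up targets and advantages, and a quadratic penalty toward a hypothetical rollout prediction as a stand-in for rollout policy regularization. It only shows why negative weights can make a weighted least-squares objective non-convex and how a regularizer restores a unique minimizer.

```python
def weighted_ls_minimizer(targets, advantages, rollout_pred, reg):
    """Minimize sum_i A_i * (p - y_i)^2 + reg * (p - rollout_pred)^2 over a scalar p.

    The second derivative is 2 * (sum(A_i) + reg). If it is not positive,
    the objective has no unique minimizer (it is flat or unbounded below).
    """
    curvature = sum(advantages) + reg
    if curvature <= 0:
        raise ValueError("objective is not strictly convex; increase reg")
    weighted_targets = sum(a * y for a, y in zip(advantages, targets))
    # Setting the derivative to zero gives a weighted average of the
    # targets and the rollout prediction.
    return (weighted_targets + reg * rollout_pred) / curvature


# Made-up example: one positive and one negative advantage.
p_star = weighted_ls_minimizer(
    targets=[1.0, 3.0], advantages=[1.0, -0.5], rollout_pred=2.0, reg=1.0
)
```

With `reg=0` and advantages that sum to a negative value, the function raises instead of returning a value, which mirrors the unregularized failure mode in this toy setting.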

2605.26004 2026-05-26 cs.CV cs.CL 版本更新

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

MAGIC: 面向视觉语言模型的多模态对齐与接地感知指令核心集

Shristi Das Biswas, Kaushik Roy

发表机构 * Purdue University(普渡大学)

AI总结 提出MAGIC方法,利用预训练VLM中的多模态增益、桥接相关性和技能神经元签名三种内在信号,通过无训练、前向传播的核心集选择,构建紧凑且行为保真的子集用于多模态指令微调,在20%预算下达到甚至超越全微调性能。

详情
AI中文摘要

大型视觉语言模型(LVLMs)的指令微调越来越依赖于大规模多模态语料库,然而这些数据集包含大量冗余、低视觉依赖性以及多模态推理行为覆盖极不平衡的样本。因此,均匀子采样或基于分数的朴素选择往往产生次优的训练子集。我们提出MAGIC,一种无需训练、仅前向传播的核心集选择方法,旨在为多模态指令微调构建紧凑且行为保真的子集。MAGIC基于从预训练VLM中提取的三个内在信号:多模态增益,衡量从视觉输入获得的似然改进;桥接相关性,捕捉答案令牌在视觉令牌上的接地锐度;以及技能神经元签名,通过顶部激活的前馈神经元表征每个样本引发的功能计算。MAGIC通过三阶段流程组合这些信号:过滤低增益样本,通过归一化质量目标对候选样本排序,并在离散神经元签名上执行桶式预算分配以保留潜在的多模态技能覆盖。该公式避免了反向传播、辅助选择器训练以及连续激活空间中的昂贵聚类,同时保持高效且易于部署在现有VLM中。在LLaVA-665K和Vision-Flan数据集上,以及向大型目标模型LLaVA-1.5-7B和-13B的迁移设置中,MAGIC在匹配的20%预算下持续优于强基线:在LLaVA-665K上达到全微调相对性能的100.3%,在Vision-Flan-186K上达到101.6%,同时减少了73.7%的挂钟运行时间。

英文摘要

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.
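The three-stage pipeline in the abstract (filter, rank, bucket-wise allocation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names `gain`, `relevance`, and `signature`, the additive min-max-normalized quality score, and the round-robin allocation across buckets are all hypothetical choices, since the abstract does not specify them.

```python
from collections import defaultdict


def select_coreset(samples, budget, min_gain=0.0):
    """Pick up to `budget` sample ids while covering distinct signatures."""
    # Stage 1: drop samples whose (assumed) multimodal gain is too low.
    kept = [s for s in samples if s["gain"] > min_gain]
    if not kept:
        return []

    # Stage 2: rank by an assumed quality score, the sum of two
    # min-max-normalized signals.
    def normalized(key):
        values = [s[key] for s in kept]
        low, high = min(values), max(values)
        span = (high - low) or 1.0
        return {s["id"]: (s[key] - low) / span for s in kept}

    gain, relevance = normalized("gain"), normalized("relevance")
    quality = {s["id"]: gain[s["id"]] + relevance[s["id"]] for s in kept}

    # Stage 3: bucket by discrete signature, then take the best remaining
    # sample from each bucket in turn until the budget is used up.
    buckets = defaultdict(list)
    for s in kept:
        buckets[s["signature"]].append(s["id"])
    for ids in buckets.values():
        ids.sort(key=lambda i: quality[i], reverse=True)

    chosen, rank = [], 0
    while len(chosen) < budget and any(rank < len(ids) for ids in buckets.values()):
        for signature in sorted(buckets):
            if rank < len(buckets[signature]) and len(chosen) < budget:
                chosen.append(buckets[signature][rank])
        rank += 1
    return chosen


# Made-up samples; signatures are tuples of top-activated neuron indices.
samples = [
    {"id": "a", "gain": 0.9, "relevance": 0.8, "signature": (1, 2)},
    {"id": "b", "gain": 0.8, "relevance": 0.9, "signature": (1, 2)},
    {"id": "c", "gain": 0.5, "relevance": 0.1, "signature": (3, 4)},
    {"id": "d", "gain": -0.1, "relevance": 0.9, "signature": (3, 4)},
    {"id": "e", "gain": 0.7, "relevance": 0.7, "signature": (1, 2)},
]
picked = select_coreset(samples, budget=2)
```

In this toy run a pure top-2 ranking would take `a` and `b` from the same bucket, whereas the bucket-wise pass keeps one sample from each signature, which is the coverage-preserving behavior the abstract motivates.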

2605.26003 2026-05-26 cs.CV 版本更新

Towards 3D heart mesh generation using contactless radar imaging and physics-informed neural network

基于非接触式雷达成像和物理信息神经网络的3D心脏网格生成

Jinye Li, Chenxi Fu, Minghang Zheng, Yang Liu, Xiahai Zhuang, Qingchao Chen

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Fudan University(复旦大学) Peking University(北京大学)

AI总结 提出SAR2Mesh框架,通过粗到细的网格变形过程,结合几何感知特征投影和物理信息雷达损失,从合成孔径雷达图像重建高保真3D心脏几何结构。

详情
AI中文摘要

心脏功能评估需要连续、无创的监测,而MRI在这方面的能力有限。毫米波雷达及其合成孔径雷达模式提供了一种保护隐私且便携的即时临床检测应用。然而,从SAR重建高保真3D心脏几何结构仍然是一个开放挑战。传统雷达方法生成稀疏点云,缺乏连续表面拓扑。同时,由于SAR图像中严重的散斑噪声和模糊边界,直接应用光学重建网络效果不佳。为了弥合这一差距,我们提出了SAR2Mesh,一种将任务重新表述为粗到细网格变形过程的新框架。通过用拓扑模板初始化,我们的方法通过渐进网格变形明确保留解剖连通性。我们引入了几何感知特征投影模块,通过3D到2D采样提取多视图特征,以及物理信息雷达损失,以强制预测几何与原始雷达回波之间的一致性。此外,我们提出了Cardiac Mesh-SAR,第一个大规模配对SAR-网格数据集。大量实验表明,SAR2Mesh显著优于现有的基于图像的基线,实现了准确且物理一致的心脏重建。

英文摘要

Cardiac function evaluation necessitates continuous, non-invasive monitoring, a capability limited in MRI. Millimeter-wave (mmWave) radar and its Synthetic Aperture Radar (SAR) mode offer a privacy-preserving and portable option for point-of-care clinical applications. However, reconstructing high-fidelity 3D cardiac geometry from SAR remains an open challenge. Traditional radar methods generate sparse point clouds that lack continuous surface topology. Meanwhile, direct application of optical reconstruction networks performs poorly due to the severe speckle noise and ambiguous boundaries inherent in SAR images. To bridge this gap, we propose SAR2Mesh, a novel framework that reformulates the task as a coarse-to-fine mesh deformation process. By initializing with a topological template, our approach explicitly preserves anatomical connectivity through progressive mesh deformation. We introduce a geometry-aware feature projection module to extract multi-view features via 3D-to-2D sampling, and a physics-informed radar loss to enforce consistency between the predicted geometry and raw radar echoes. Furthermore, we present Cardiac Mesh-SAR, the first large-scale paired SAR-mesh dataset. Extensive experiments demonstrate that SAR2Mesh significantly outperforms existing image-based baselines, achieving accurate and physically consistent cardiac reconstructions.

2605.25168 2026-05-26 eess.IV cs.AI cs.CV 版本更新

Methodology for Creating a Clinically Verified Dermoscopic Image Dataset

创建临床验证的皮肤镜图像数据集的方法论

Kozachok Elena Sergeevna

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences(伊万诺夫系统编程研究所,俄罗斯科学院)

AI总结 提出一种结合移动皮肤镜图像采集标准操作程序、结构化元数据信息模型和多阶段专家验证的方法,构建临床验证的皮肤镜图像数据集,用于医学信息学研究。

Comments 22 pages, 5 figures, 5 tables

详情
AI中文摘要

本研究提出了一种构建临床验证的皮肤镜图像数据集的方法,用于医学信息学研究。该工作的相关性在于,自动化诊断支持系统的性能不仅取决于图像数量,还取决于图像采集过程的可重复性、结构化元数据的完整性以及诊断标签的可靠性。国际数据集主要是在与俄罗斯常规门诊实践和移动皮肤镜显著不同的条件下创建的。所提出的方法整合了三个相互关联的组成部分:(1)通过移动皮肤镜采集图像的标准操作程序(SOP),(2)一个信息模型,包含16个结构化元数据字段,组织成六个临床导向的块,采用ISIC兼容的符号表示,以及(3)多阶段专家验证诊断标签(初始临床注释、三位专家的共识审查以及所有恶性肿瘤的组织学确认)。使用该方法,在2025年6月至2026年5月期间,收集了来自443名患者的1026张独特的皮肤镜图像数据集。从1044条初始记录中排除了18个重复项。该数据集包括九个疾病类别;所有39个恶性病变(18个黑色素瘤、15个基底细胞癌和6个鳞状细胞癌)均经过组织学验证。患者年龄范围为2至90岁(中位年龄38岁),其中女性279人(63%),男性164人(37%)。每张图像都附有专家注释的皮肤镜结构和明确的verification_stage字段,指示诊断确认的水平。所得数据集作为临床验证的试点资源,适用于独立模型评估、域偏移分析、可解释性研究和进一步扩展。

英文摘要

This study presents a methodology for constructing a clinically verified dataset of dermatoscopic images for medical informatics research. The relevance of the work is driven by the fact that the performance of automated diagnostic support systems depends not only on the volume of images, but also on the reproducibility of the image acquisition procedure, the completeness of structured metadata, and the reliability of diagnostic labels. International collections were primarily created under conditions that differ substantially from routine Russian outpatient practice and mobile dermatoscopy. The proposed methodology integrates three interconnected components: (1) a standard operating procedure (SOP) for acquiring images via mobile dermatoscopy, (2) an information model comprising 16 structured metadata fields organized into six clinically oriented blocks in ISIC-compatible notation, and (3) a multi-stage expert verification of diagnostic labels (initial clinical annotation, consensus review by three specialists, and histological confirmation of all malignant neoplasms). Using this methodology, a dataset of 1,026 unique dermatoscopic images from 443 patients was collected between June 2025 and May 2026. From 1,044 initial records, 18 duplicates were excluded. The dataset includes nine nosological categories; all 39 malignant lesions (18 melanomas, 15 basal cell carcinomas, and 6 squamous cell carcinomas) were histologically verified. Patient age ranged from 2 to 90 years (median 38), with 279 females (63%) and 164 males (37%). Each image is accompanied by expert-annotated dermatoscopic structures and an explicit verification_stage field indicating the level of diagnostic confirmation. The resulting dataset serves as a pilot clinically verified resource suitable for independent model evaluation, domain shift analysis, interpretability studies, and further expansion.

2605.25979 2026-05-26 cs.CV 版本更新

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

LLaVA-OneVision-2:迈向下一代感知智能

Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

发表机构 * Glint Lab(Glint实验室) AIM for Health Lab(健康AI实验室) MVP Lab(MVP实验室)

AI总结 提出LLaVA-OV-2模型,通过编解码流令牌化、窗口注意力和3D RoPE实现统一视频理解与时空定位,在多项基准上超越Qwen3-VL-8B。

详情
AI中文摘要

我们介绍LLaVA-OneVision-2(LLaVA-OV-2),这是LLaVA-OneVision系列中迄今为止能力最强的视觉语言模型,在广泛的多模态基准测试中均取得了卓越性能。该模型基于原生OneVision编码器,并引入窗口注意力机制以实现高效的局部计算,同时保持原生分辨率。其关键进展是编解码流令牌化:它将压缩视频视为连续的比特成本流,其中比特成本动态决定自适应时间分组,运动残差线索选择显著空间证据到紧凑的视觉画布中。这种分配将有限的令牌预算集中在包含事件的内容上,相比固定图片组,实现了更稳定的长视频令牌压缩。共享的3D RoPE进一步将编解码画布、采样帧和图像置于统一的时空坐标系中。此外,我们围绕大规模开放监督构建了LLaVA-OV-2数据和训练栈:约800万重新标注的视频样本用于预训练,400万样本的空间语料库用于微调。我们还引入了JumpScore,这是一个针对高频、密集重复运动中的细粒度定位的时空定位基准,填补了现有视频评估的空白。LLaVA-OV-2的一项突出能力是其在视频理解、时空定位、空间定位和操作轨迹推理上的统一感知。在JumpScore上,LLaVA-OneVision-2-8B达到74.9 JumpScore mAP,比Qwen3-VL-8B(30.1)高出44.8分;在同一基准的匹配视觉令牌预算下,编解码流输入相比帧采样在时空定位上提升9.7分。在标准基准上,LLaVA-OneVision-2-8B在视频任务上平均比Qwen3-VL-8B高出4.3分,在空间任务上高出5.3分,在跟踪任务上平均J&F高出15.6分。

英文摘要

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
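The idea of letting bit cost drive adaptive temporal groups can be illustrated with a greedy sketch. The rule below is an assumption for illustration only (the abstract does not give the grouping criterion): frames are accumulated into a group until adding the next one would exceed a hypothetical per-group bit budget, so high-cost (event-bearing) frames end up in shorter groups than low-cost ones.

```python
def group_by_bit_cost(bit_costs, group_budget):
    """Greedily split frame indices into groups with bounded total bit cost.

    A frame whose own cost exceeds the budget still forms a group by itself.
    """
    groups, current, total = [], [], 0
    for index, cost in enumerate(bit_costs):
        # Close the running group if this frame would overflow the budget.
        if current and total + cost > group_budget:
            groups.append(current)
            current, total = [], 0
        current.append(index)
        total += cost
    if current:
        groups.append(current)
    return groups


# Made-up per-frame bit costs: a quiet start, a burst of motion, a quiet tail.
groups = group_by_bit_cost([10, 10, 10, 90, 80, 5, 5, 5], group_budget=100)
```

A fixed group-of-pictures scheme would instead cut every N frames regardless of content, which is the baseline the abstract contrasts against.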

2605.25968 2026-05-26 cs.CV 版本更新

Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data

基于上下文驱动的缺失模态学习用于图像-表格数据的鲁棒医学诊断

Tianling Liu, Lequan Yu, Tong Han, Liang Wan

发表机构 * College of Intelligence and Computing, Tianjin University, Tianjin 300350, China.(智能与计算学院,天津大学,天津300350,中国) Department of Statistics and Actuarial Science, School of Computing and Data Science, The University of Hong Kong, Hong Kong.(统计与精算系,计算与数据科学学院,香港大学,香港) Department of Radiology, Tianjin Huanhu Hospital, Tianjin 300350, China.(放射科,天津环湖医院,天津300350,中国) Tianjin Key Laboratory of Cerebral Vascular and Neurodegenerative Diseases, Tianjin 300350, China.(天津脑血管与神经退行性疾病重点实验室,天津300350,中国) Medical School of Tianjin University, Tianjin 300072, China.(天津大学医学院,天津300072,中国)

AI总结 提出CMML框架,通过级联残差变换器自编码器合成缺失模态并利用上下文令牌进行语义对齐,在三种医学数据集上超越现有方法。

Comments 12 pages, 8 figures

详情
AI中文摘要

虽然整合多种成像和临床表格记录的多模态数据对于准确医学诊断至关重要,但临床实践中特定模态的任意缺失普遍存在,严重降低了多模态模型的性能。现有方法要么丢弃缺失模态导致信息丢失,要么在未捕获复杂模态间依赖关系的情况下难以合成它们。为解决这些限制,我们提出了一种新颖的上下文驱动缺失模态学习(CMML)框架,该框架顺序执行模态合成和语义对齐,以在任意缺失条件下实现鲁棒诊断。具体来说,我们设计了一个基于级联残差变换器的自编码器(CRTA),利用可学习的上下文令牌作为数据集级语义先验来捕获模态间依赖关系并合成关键的缺失表示。这些表示进一步通过模态特定的记忆库得到丰富。为解决原始可用表示与合成表示之间的差异,我们通过注入来自CRTA输出的多模态表示,将学习到的上下文令牌转化为实例自适应的语义参考。该参考引导异构模态表示对齐到统一空间,最后应用类别感知对比细化来探索判别性诊断线索。在皮肤病变(Derm7pt)、眼病(ODIR)和脑膜瘤(MEN)数据集上的广泛评估表明,CMML显著优于最先进(SOTA)方法,平均AUC分别提升1.26%、0.97%和1.32%。

英文摘要

While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice, severely degrading the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them without capturing complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens acting as dataset-level semantic priors to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between original available and synthesized representations, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA's outputs. These references guide the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to explore discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding average AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.

2605.25952 2026-05-26 cs.CV cs.AI 版本更新

VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

VEN-VL: 一种用于高效多模态理解的视觉集成MoE框架

Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang

发表机构 * Tsinghua University(清华大学)

AI总结 提出VEN-VL框架,通过先丰富后压缩的策略,利用视觉集成MoE和自适应路由增强视觉令牌的信息容量与密度,在少量压缩令牌下实现复杂视觉任务的性能与效率平衡。

详情
AI中文摘要

尽管近期高效方法在加速多模态理解方面取得了显著进展,但它们仍然存在明显的性能下降。这些方法强调单一视觉线索的高压缩比,并依赖基于启发式剪枝策略的粗略注意力对齐,导致视觉令牌的信息容量和密度出现瓶颈。针对这一局限,我们提出了VEN-VL,一种遵循“先丰富后压缩”原则的视觉集成MoE框架,用于高效感知。具体来说,我们首先通过统一不同视角的视觉表示来丰富信息容量,然后通过专门视觉专家中的自适应路由器逐步压缩信息以增强信息密度。此外,我们通过显式视觉监督融入原始结构的重建能力,促进关键信息的保留。实验结果表明,我们在使用少量信息压缩令牌的复杂视觉任务中具有优越性,有效弥合了性能与效率之间的差距。

英文摘要

Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on a high compression ratio for a single visual clue and their reliance on heuristic pruning strategies with coarse attention alignment create a bottleneck in the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows the enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying the visual representations of different perspectives, and then progressively compact it with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of the vanilla structure via explicit visual supervision, facilitating crucial information preservation. Experimental results demonstrate our superiority in complex visual tasks with few information-condensed tokens, which effectively bridges the gap between performance and efficiency.

2605.25947 2026-05-26 cs.CV 版本更新

A Pedestrian-Vehicle Interaction Benchmark and Annotation Framework for Unstructured Scenes via Uncalibrated Cameras

非标定相机下的非结构化场景行人-车辆交互基准与标注框架

Haoyang Peng, Qian Hu, Songan Zhang, Ming Yang

发表机构 * School of Automation and Intelligent Sensing(自动化与智能感知学院) Global Institute of Future Technology(未来技术全球研究院)

AI总结 针对非结构化场景中行人-车辆交互数据稀缺的问题,提出基于非标定监控视频的标注框架,并构建PINNS数据集,包含多国多场景的密集交互轨迹与场景信息,以促进复杂混合交通中的轨迹预测研究。

Comments 10 pages, 8 figures; project page available at https://github.com/Songan-Lab

详情
AI中文摘要

预测行人与车辆之间的交互对于非结构化和半结构化场景中的自动驾驶安全至关重要;然而,由于缺乏具有密集行人-车辆交互的公共数据集,这一任务受到严重阻碍。当前大多数研究依赖于结构化道路数据,导致非结构化环境中复杂的异质交互未能得到充分表示和研究。本文提出一种基于非标定监控摄像头视频数据的数据集标注框架,并推出PINNS(非结构化场景中非标定摄像头的行人-车辆交互数据集)。该数据集涵盖多个国家和地区,包含多样化的典型交通场景,并考虑了季节、光照条件和天气的变化。它聚焦于具有密集行人-车辆交互的复杂场景,并设计为易于扩展。数据集根据中国自动化学会发布的标准进行构建和标注,提供轨迹数据和相应的场景级信息。此外,本文分析了异质智能体轨迹预测的当前挑战和研究方向,展示了所提出数据集的必要性和实用性。我们希望我们的框架和数据集能够促进复杂混合交通场景中轨迹预测和自动驾驶的研究。PINNS数据集公开于https://github.com/Songan-Lab。

英文摘要

Predicting the interactions between pedestrians and vehicles is essential for autonomous driving safety in unstructured and semi-structured scenarios; however, this task is severely hindered by the scarcity of public datasets that feature dense pedestrian-vehicle interactions. Most current studies rely on structured road data, leaving the complex, heterogeneous interactions found in unstructured environments insufficiently represented and researched. In this paper, we propose a dataset annotation framework based on video data from uncalibrated surveillance cameras and present PINNS (Pedestrian-vehicle Interaction dataset from uNcalibrated cameras in uNstructured Scenes). The dataset covers multiple countries and regions, includes diverse typical traffic scenarios, and considers variations in seasons, lighting conditions, and weather. It focuses on complex scenes with dense pedestrian-vehicle interactions and is designed to be easily extensible. The dataset is constructed and annotated according to the standard issued by the Chinese Association of Automation, providing both trajectory data and corresponding scene-level information. Furthermore, this paper analyzes current challenges and research directions in heterogeneous agent trajectory prediction and shows the necessity and usefulness of the proposed dataset. We hope our framework and dataset will facilitate research on trajectory prediction and autonomous driving in complex mixed traffic scenarios. PINNS is publicly available at https://github.com/Songan-Lab.

2605.25944 2026-05-26 cs.CV cs.AI 版本更新

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

EchoPilot: 通过尺度空间语义提示和可靠性门控记忆实现无训练超声视频分割

Ruiqiang Xiao, Zhaohu Xing, Yijun Yang, Zhenyan Han, Weiming Wang, Kaishun Wu, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Third Affiliated Hospital of Sun Yat-Sen University(中山大学第三附属医院) Hong Kong Metropolitan University(香港都会大学)

AI总结 提出EchoPilot,一种无需训练、仅需单点点击和类别名称的超声视频分割框架,通过尺度空间语义提示解决初始化歧义,并引入可靠性门控记忆减少传播漂移,在多个数据集上达到最优性能。

Comments Early accepted to MICCAI 2026. Project page: https://keeplearning-again.github.io/EchoPilot/

详情
AI中文摘要

超声视频分割在临床上具有重要价值,但由于散斑噪声、弱边界和快速解剖变形而困难。最近的可提示基础模型实现了点引导分割,但它们在超声中的直接部署仍然不可靠:单个点提供的空间上下文不足以解决尺度模糊性,贪婪的记忆更新会将早期错误放大为严重的时间漂移。我们提出了EchoPilot,一个在稀疏第一帧交互下进行超声视频分割的无训练框架,仅需单点点击和解剖类别名称。EchoPilot协调一个冻结的医学视觉语言模型(VLM)进行语义定位,一个视觉基础模型(VFM)进行密集几何特征提取,以及一个可提示视频分割器进行掩码预测和传播。为了解决初始化歧义,我们提出了尺度空间语义提示,首先通过无参数的S.E.E.D.(语义能量-熵密度)准则选择最佳上下文视图,然后从密集基础特征中合成几何精确的辅助点提示,无需额外用户交互。为了减少传播漂移,进一步引入了可靠性门控记忆更新,在不确定预测下选择性冻结分割器的记忆库,防止错误累积。我们还贡献了第一个动态胎儿胎盘超声视频分割数据集,包含671个标注帧。在三个超声视频数据集上,EchoPilot在稀疏交互设置下实现了最先进的性能,持续优于无训练基线和微调专家。

英文摘要

Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.
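The reliability-gated memory update can be sketched as a simple gate in front of a bounded memory bank. This is only an illustration under assumptions: the scalar per-frame `confidence`, the fixed `threshold`, and the first-in-first-out bank of size `capacity` are hypothetical, since the abstract does not state how reliability is measured or how the bank is managed.

```python
from collections import deque


def propagate_with_gated_memory(confidences, threshold=0.7, capacity=3):
    """Return the frame indices left in memory after processing all frames.

    A frame is written to memory only when its prediction confidence meets
    the threshold; otherwise the memory bank is left frozen for that step.
    """
    memory = deque(maxlen=capacity)  # oldest entries fall out automatically
    for frame_index, confidence in enumerate(confidences):
        if confidence >= threshold:
            memory.append(frame_index)
        # Uncertain frames are still segmented in a real system, but they
        # are never used as references for later frames.
    return list(memory)


# Made-up confidences for six frames; frames 1 and 4 are uncertain.
kept_frames = propagate_with_gated_memory([0.9, 0.4, 0.8, 0.95, 0.3, 0.85])
```

A greedy update would instead append every frame, so an early poor mask could stay in memory and bias later predictions, which is the drift mechanism the abstract describes.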

2605.25942 2026-05-26 cs.CV cs.RO 版本更新

LRDDv3: High-Resolution Long-Range Drone Detection Dataset with Range Information and Thermal Data

LRDDv3:具有距离信息和热数据的高分辨率远程无人机检测数据集

Knut Peterson, Zaid Mayers, Azmain Yousuf, Priontu Chowdhury, Asher Zaczepinski, Solmaz Arezoomandan, Reihaneh Maarefdoust, David Han

发表机构 * iMaPLe Research Lab, Drexel University(Drexel大学iMaPLe研究实验室) University of Maine(缅因大学)

AI总结 提出LRDDv3数据集,包含102,532张高分辨率远程RGB图像和29,630张配对IR图像,支持远程无人机检测,提供距离信息。

Comments 8 pages, 5 figures. Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA)

详情
AI中文摘要

无人机已迅速成为各种空域中的常见设备,涵盖从娱乐飞行到商业摄影和包裹递送等多种应用。随着无人机日益普及,有人和无人飞行器能够远程检测无人机及其他飞行物体以有效跟踪运动并确保共享空域安全运行变得至关重要。尽管已有多个用于无人机检测的数据集,但对高质量数据的需求仍然存在,特别是在高分辨率远程无人机数据领域。为解决这一问题,我们引入了一个高分辨率数据集,包含102,532张远程无人机RGB图像,这些图像从128个不同的视频片段中以5 FPS采样,这些片段在17个不同的数据采集日(跨越8个月)的飞行中拍摄,以确保光照场景、飞行位置和背景元素的多样性。该数据集拥有全面的无人机距离信息,以及29,630张IR图像,所有这些图像都与基础数据集中的RGB图像配对。作为首批利用4K图像分辨率和配对640x512 IR图像的无人机检测数据集之一,我们的工作代表了在远程检测无人机方面的重要进展。如需获取完整数据集,请访问https://research.coe.drexel.edu/ece/imaple/lrddv3/

英文摘要

Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, representing a wide range of applications from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it becomes critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for expanded high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight during 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset provides comprehensive drone range information, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement to enable the detection of drones at long range. For access to the complete dataset, please visit https://research.coe.drexel.edu/ece/imaple/lrddv3/

2605.25941 2026-05-26 cs.CV 版本更新

Where Concept Erasure Should Occur: Concept-Layer Alignment in Text-to-Video Diffusion Models

概念擦除应发生在何处:文本到视频扩散模型中的概念-层对齐

Yiwei Xie, Ping Liu, Zheng Zhang

发表机构 * The School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China(人工智能与自动化学院,华中科技大学,武汉430074,中国) Department of Computer Science and Engineering, University of Nevada, Reno, NV, USA(计算机科学与工程系,内华达大学里诺分校,内华达州,美国)

AI总结 本文通过识别概念-层拓扑对齐瓶颈,提出基于可分离性优化的CLEAR框架,在文本到视频扩散模型中实现精确的概念擦除并保持生成质量。

Comments Accepted by ICML 2026

详情
AI中文摘要

文本到视频扩散变换器在模型深度上不均匀地编码语义信息,这限制了有效概念擦除。我们识别出一个表示瓶颈,称为概念-层拓扑对齐,在该对齐下目标概念在特定表示深度表现出更高的可分离性。在这些深度之外,概念和非目标信号仍然强烈纠缠,限制了深度特定擦除的有效性。这一观察将概念擦除重新定义为识别概念-非目标分离自然出现的表示深度的问题。受此结构约束的启发,我们引入了CLEAR,一个用于概念擦除的可分离性驱动优化框架,明确强制概念-层对齐。CLEAR通过将层选择公式化为概念-非目标可分离性的优化问题(而非依赖层无关或启发式选择)来实现这一原则。为此,我们引入了一个可分离性感知目标,偏好表现出更强概念-非目标分离的层。在大规模文本到视频扩散模型上的实验表明,强制概念-层对齐导致更精确的概念抑制,同时保持整体生成质量。

英文摘要

Text-to-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept-layer topological alignment, under which target concepts exhibit higher separability at certain representational depths. Outside these depths, concept and non-target signals remain strongly entangled, limiting the effectiveness of depth-specific erasure. This observation reframes concept erasure as the problem of identifying representational depths where concept-non-target separation naturally emerges. Motivated by this structural constraint, we introduce CLEAR, a separability-driven optimization framework for concept erasure that explicitly enforces concept-layer alignment. CLEAR operationalizes this principle by formulating layer selection as an optimization problem over concept-non-target separability, rather than relying on layer-agnostic or heuristic choices. To enable this, we introduce a separability-aware objective that favors layers exhibiting stronger concept-non-target separation. Experiments on large-scale text-to-video diffusion models demonstrate that enforcing concept-layer alignment leads to more precise concept suppression while preserving overall generative quality.
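Selecting a layer by concept versus non-target separability can be illustrated with a Fisher-style score. The sketch below is an assumption-laden toy, not the CLEAR objective: it uses scalar activations per layer, a hypothetical score (squared mean gap divided by summed variances), and a plain argmax in place of the optimization the abstract mentions.

```python
from statistics import mean, pvariance


def separability(concept_acts, other_acts, eps=1e-8):
    """Fisher-style ratio: squared gap between means over summed variances."""
    gap = mean(concept_acts) - mean(other_acts)
    return gap * gap / (pvariance(concept_acts) + pvariance(other_acts) + eps)


def select_layer(acts_by_layer):
    """Return the layer key whose activations separate the two groups best."""
    return max(acts_by_layer, key=lambda layer: separability(*acts_by_layer[layer]))


# Made-up scalar activations: (concept prompts, non-target prompts) per layer.
acts_by_layer = {
    0: ([0.1, 0.2, 0.3], [0.15, 0.25, 0.2]),
    1: ([1.0, 1.1, 0.9], [0.0, 0.1, -0.1]),
    2: ([0.5, 0.6, 0.4], [0.4, 0.5, 0.6]),
}
best_layer = select_layer(acts_by_layer)
```

In this toy data only layer 1 has a clear gap between the two groups, so an erasure edit applied there would touch the concept with the least overlap with non-target signals.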

2605.25940 2026-05-26 eess.IV cs.CV 版本更新

How Accurate are Video Quality Models for Diffusion-Based Video Super-Resolution?

扩散模型视频超分辨率中的视频质量模型有多准确?

Benjamin Herb, Steve Göring, Alexander Raake, Rakesh Rao Ramachandra Rao

发表机构 * Institute for Communications Engineering, RWTH Aachen University, Germany(通讯工程研究所,亚琛工业大学,德国)

AI总结 本研究通过主观测试比较了六种上采样方法(包括基于扩散的视频超分辨率方法),评估现有视频质量模型(尤其是全参考和无参考模型)在扩散VSR上的准确性,发现基于CNN的全参考模型相关性较高但均不足以替代主观测试。

Comments Accepted for the 18th International Conference on Quality of Multimedia Experience (QoMEX 2026)

详情
AI中文摘要

最近的视频超分辨率(VSR)方法使用深度神经网络来增强低质量输入视频并恢复视觉细节,其中基于扩散的方法尤其显示出有希望的结果。在本文中,我们通过将模型预测与主观测试结果进行比较,研究现有视频质量模型是否可用于评估这些基于扩散的VSR方法的性能。该研究比较了六种上采样方法(Lanczos、Rhea、SCST、DOVE、SeedVR2、Starlight Mini),应用于压缩(AV1和DCVC-RT)和未压缩的低分辨率视频,考虑在UHD-1/4K屏幕上播放。使用一系列全参考和无参考质量模型来评估它们对这种新型质量退化的适用性,重点关注序列内性能。结果强调,基于CNN的全参考模型,如LPIPS、DISTS和CVQA-FR,显示出比传统全参考模型以及测试的无参考模型显著更高的相关系数。大多数模型高估了SCST过度锐利的结果,VMAF主要由于Starlight Mini引入的空间不一致而失败。测试的视频质量模型均未达到足够的准确性以替代互补的主观测试。参考、降质和上采样的视频,以及用户评分和模型分数,随论文在https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-VSR作为开放数据提供。

英文摘要

Recent video super-resolution (VSR) approaches use deep neural networks to enhance low-quality input videos and recover visual detail, with diffusion-based methods in particular showing promising results. In this paper, we investigate whether existing video quality models can be used to assess the performance of these diffusion-based VSR methods, by comparing model predictions with results from a subjective test. The study compares six upscaling methods (Lanczos, Rhea, SCST, DOVE, SeedVR2, Starlight Mini) applied to both compressed (AV1 and DCVC-RT) and uncompressed low-resolution videos considering the play-out on a UHD-1/4K screen. A range of full- and no-reference quality models are used to assess their applicability to this new type of quality degradation, focusing on within-sequence performance. The results highlight that CNN-based full-reference models, such as LPIPS, DISTS, and CVQA-FR show significantly higher correlation coefficients than both conventional full- as well as the tested no-reference models. Most overestimate the overly sharp results of SCST, with VMAF mainly failing due to spatial inconsistencies introduced by Starlight Mini. None of the tested video quality models reach sufficient accuracy so as to replace complementary subjective testing. The reference, degraded and upscaled videos, as well as the user ratings and model scores are made available with the paper at https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-VSR as open data.
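Within-sequence performance can be read as correlating model scores with subjective ratings separately for each source sequence instead of pooling all videos. The sketch below assumes this reading and uses Pearson correlation; the tuple layout `(sequence, model_score, mos)` and the sample numbers are made up for illustration and are not taken from the paper's data.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def within_sequence_correlation(rows):
    """Map each source sequence to the correlation of model score vs. MOS."""
    groups = defaultdict(lambda: ([], []))
    for sequence, model_score, mos in rows:
        groups[sequence][0].append(model_score)
        groups[sequence][1].append(mos)
    return {seq: pearson(scores, ratings) for seq, (scores, ratings) in groups.items()}


# Made-up rows: the model ranks sequence "A" correctly and "B" in reverse.
rows = [
    ("A", 1.0, 2.0), ("A", 2.0, 4.0), ("A", 3.0, 6.0),
    ("B", 1.0, 3.0), ("B", 2.0, 2.0), ("B", 3.0, 1.0),
]
per_sequence = within_sequence_correlation(rows)
```

Grouping by sequence removes content-dependent offsets between sources, so the result reflects how well a model orders the upscaling methods for the same content.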

2605.25922 2026-05-26 cs.CV 版本更新

Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models

闭环双向提示用于视觉语言模型的对抗鲁棒性

Xiao Liu, Jiaxiang Liu, Boci Peng, Boren Hu, Yusong Wang, Xiwen Chen, Prayag Tiwari, Liming Zhang, Mingkun Xu

发表机构 * University of Macau(澳门大学) Guangdong Institute of Intelligence Science and Technology(广东智能科学与技术研究院) Peking University(北京大学) Independent Researcher(独立研究员) Institute of Science Tokyo(东京科学大学) Morgan Stanley(摩根士丹利) Halmstad University(哈尔姆斯塔德大学)

AI总结 针对视觉语言模型在对抗扰动下跨模态语义对齐脆弱的问题,提出闭环双向提示方法,通过动态反馈循环恢复跨模态一致性,并引入语义锚点约束循环更新,实现实例自适应保护,在11个数据集上达到最先进的鲁棒性和泛化性能。

Comments 24 pages, 8 figures

详情
AI中文摘要

视觉语言模型能很好地适应下游任务,但对破坏跨模态语义对齐的对抗扰动高度脆弱。现有的防御方法大多是单向或结构性的,未能利用双向跨模态互补性和实例自适应的保护。为了克服对抗设置中单向和静态防御的局限性,我们提出了闭环双向提示,通过冻结编码器上的动态反馈循环将鲁棒适应视为跨模态一致性恢复。引入语义锚点作为稳定先验以约束循环更新并减轻扰动引起的特征损坏。通过基于锚点的自举,文本语义去噪视觉表示,而精炼的视觉使实例自适应提示更新成为可能,从而产生修正且鲁棒的共识。在11个数据集上的广泛评估验证了最先进的鲁棒性和强的基础到新类泛化能力,同时在计算成本和准确性之间保持了良好的平衡。

英文摘要

Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection. To overcome the limitations of unidirectional and static defenses in adversarial settings, we propose Closed-Loop Bidirectional Prompting, casting robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders. A Semantic Anchor is introduced as a stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption. Through anchor-based bootstrapping, textual semantics denoise visual representations, while the refined visuals enable instance-adaptive prompt updating, yielding a rectified and robust consensus. Extensive evaluations across 11 datasets validate state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.

2605.25921 2026-05-26 cs.GR cs.CV 版本更新

Curve Skeletonization in Continuous domain for Meshes and Point Clouds

网格与点云的连续域曲线骨架化

Jai Bardhan, Ramya Hebbalaguppe, Aravind Udupa

发表机构 * TCS Research(TCS研究院) IIT Delhi(印度理工学院德里分校)

AI总结 提出CSCD框架,将基于局部分隔符的骨架化方法推广到连续域,通过CSCD-M(网格)和CSCD-PC(点云)两种实现,提升了骨架提取的鲁棒性和拓扑保持能力。

Comments 31 pages, 26 figures, 7 tables, 4 algorithms. Published at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

详情
AI中文摘要

3D曲线骨架化的进展正在加速广泛的应用。然而,开发能够捕捉复杂物体细节的鲁棒骨架化算法仍然具有挑战性。基于局部分隔符(LS)的骨架化提供了一种高效的基于图的方法,但由于其离散性质,存在表示不准确的问题。为了解决这个问题,我们引入了CSCD,一个新颖的连续域曲线骨架化框架,将LS推广到流形上。具体来说,我们提出了两种实现:用于网格的CSCD-M和用于点云的CSCD-PC。CSCD-M利用网格的内在三角剖分来抵抗噪声并改善拓扑保持,而CSCD-PC采用簇状拉普拉斯算子以增强鲁棒性。据我们所知,CSCD-M是第一个用于曲线骨架化的内在方法。我们的结果表明,CSCD-M在各种网格上匹配LS的性能,并在Thingi10k数据集等基准测试上优于LS(TOG'21)。CSCD-PC在质量上优于CoverageAxis++(Eurographics'24)和EPCS(CAG'23)。最后,我们展示了CSCD在几个下游任务中的有效性:物体分类、形状分割、识别物体中的手柄、隧道和收缩。项目网站:https://cscd-skel.pages.dev

英文摘要

Advancements in 3D curve skeletonization are accelerating progress across a wide range of applications. However, developing robust skeletonization algorithms that capture intricate object details remains challenging. Skeletonization via Local Separators (LS) offers an efficient graph-based approach but suffers from representation inaccuracies due to its discrete nature. To address this, we introduce CSCD, a novel framework for Curve Skeletonization in the Continuous Domain, generalizing LS to manifolds. Specifically, we present two realizations: CSCD-M for meshes and CSCD-PC for point clouds. CSCD-M leverages the intrinsic triangulation of a mesh for resilience to noise and improved topological preservation, while CSCD-PC employs tufted Laplacians for enhanced robustness. To our knowledge, CSCD-M is the first intrinsic method for curve skeletonization. Our results show CSCD-M matches LS performance across diverse meshes and outperforms LS (TOG'21) on benchmarks such as the Thingi10k dataset. CSCD-PC qualitatively outperforms CoverageAxis++ (Eurographics'24) and EPCS (CAG'23). Finally, we demonstrate the efficacy of CSCD in a few downstream tasks: object classification, shape segmentation, and identifying handles, tunnels, and constrictions in objects. Project Website: https://cscd-skel.pages.dev

2605.25909 2026-05-26 cs.CV 版本更新

R5DGS: Semantic-Aware 4D Gaussian Splatting with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction

R5DGS:基于刚体约束的语义感知4D高斯泼溅用于高效动态场景重建

Denis Gridusov, Maxim Popov, Sergey Kolyubin

发表机构 * Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University(生物机电学与节能机器人实验室,ITMO大学)

AI总结 提出R5DGS框架,通过紧凑身份编码和CLIP对象查找表实现语义感知的4D高斯表示,并利用刚体推理约束仅预测对象质心动力学,从而在保持轨迹合理性的同时实现11 FPS的加速。

Comments Code: https://github.com/be2rlab/r5dgs

详情
AI中文摘要

从多视角视频中重建和预测动态3D场景是机器人、AR/VR和数字孪生的基础任务。最近基于物理信息的高斯泼溅方法在未来帧外推上取得了令人印象深刻的结果,但缺乏语义感知且计算开销大。我们引入了R5DGS,一个通过紧凑的身份编码向量增强物理驱动的4D高斯表示的框架,实现了精确的高斯到对象关联。通过构建离线的基于CLIP的对象查找表,我们支持开放词汇的文本提示,以检索和渲染任意时间戳和视角下的特定对象高斯。此外,我们提出了一个刚体推理约束,仅对对象质心预测和积分物理动力学,通过相对变换将运动传播到关联的高斯。这一优化在外推过程中实现了11 FPS的加速,而不损害轨迹的合理性。

英文摘要

Reconstructing and predicting dynamic 3D scenes from multi-view videos is a foundational task for robotics, AR/VR, and digital twins. Recent physics-informed Gaussian Splatting methods achieve impressive future frame extrapolation but lack semantic awareness and suffer from large computational overhead. We introduce R5DGS, a framework that augments a physics-driven 4D Gaussian representation with compact Identity Encoding vectors, enabling precise Gaussian-to-object association. By constructing an offline CLIP-based object lookup table, we support open-vocabulary text prompting to retrieve and render object-specific Gaussians across arbitrary timestamps and viewpoints. Furthermore, we propose a rigid-body inference constraint that predicts and integrates physical dynamics exclusively for object centroids, propagating motion to associated Gaussians via relative transformations. This optimization yields an 11 FPS speedup during extrapolation without compromising trajectory plausibility.
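
The rigid-body constraint described above can be illustrated with a minimal sketch. It is not the authors' implementation: the gravity-only dynamics, the rotation restricted to the z-axis, and the names `step_centroid` and `propagate` are assumptions made for illustration. The idea shown is that only the object centroid is integrated through the dynamics, while each associated Gaussian follows through a fixed body-frame offset.

```python
import math

def rotate_z(v, angle):
    """Rotate a 3D vector about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def step_centroid(pos, vel, dt, gravity=(0.0, 0.0, -9.81)):
    """One semi-implicit Euler step for the centroid (hypothetical dynamics)."""
    new_vel = tuple(v + g * dt for v, g in zip(vel, gravity))
    new_pos = tuple(p + v * dt for p, v in zip(pos, new_vel))
    return new_pos, new_vel

def propagate(centroid, yaw, offsets):
    """Place each Gaussian at centroid + R_z(yaw) * its fixed body-frame offset."""
    return [
        tuple(c + r for c, r in zip(centroid, rotate_z(o, yaw)))
        for o in offsets
    ]

# Two Gaussians attached to one object; only the centroid is simulated.
offsets = [(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0)]
pos, vel = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)
pos, vel = step_centroid(pos, vel, dt=0.1)
gaussians = propagate(pos, yaw=math.pi / 2, offsets=offsets)
```

The cost of the dynamics step here is independent of the number of Gaussians per object, which is one plausible reading of where the reported speedup comes from.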

2605.25901 2026-05-26 cs.CV cs.RO 版本更新

AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

AgentGrounder:使用多模态语言模型的零样本3D视觉点云定位

Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

发表机构 * Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University(生物机电学与节能机器人实验室,ITMO大学)

AI总结 提出AgentGrounder框架,通过两阶段设计(离线构建对象查找表和在线工具驱动代理)实现零样本3D视觉定位,在ScanRefer和Nr3D上分别提升2.5%和6.3%的准确率。

Comments Code: https://github.com/be2rlab/AgentGrounder

详情
AI中文摘要

3D视觉定位(3DVG)是具身AI的基本能力,要求智能体根据自然语言描述在3D场景中定位物体。最近的零样本方法利用2D视觉语言模型(LVLMs),但它们通常依赖于现有的多视图图像集,并且难以处理标准3D分割工具提供的有限语义和空间细节。我们提出了AgentGrounder,一个零样本3D视觉定位框架,直接对彩色点云进行操作,无需特定任务的3D训练。我们的方法采用两阶段设计:(1)离线阶段,应用3D模型构建对象查找表(OLT),包含实例ID、语义标签、3D边界框;(2)在线工具驱动代理,分解每个查询,仅从OLT中检索相关候选对象,进行几何评分,并在需要额外视觉证据(如颜色、材质或视角敏感线索)时按需触发图像渲染。与固定的锚点-目标匹配流水线相比,这种设计减少了级联匹配错误,并通过避免提示过载无关对象来提高上下文窗口效率。我们在零样本设置下对ScanRefer和Nr3D进行了评估,观察到在我们的设置中比SeeGround有持续改进,包括ScanRefer上+2.5%的Acc@0.5和Nr3D上+6.3%,在Nr3D视图无关查询上显著提升+6.3%。这些结果表明,结合选择性检索、几何推理和自适应视觉检查为开放词汇3D定位提供了实用且稳健的基础。我们的代码可在https://github.com/be2rlab/AgentGrounder获取。

英文摘要

3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
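
The two-stage design can be sketched as a small lookup-and-score routine. This is a hypothetical illustration, not code from the AgentGrounder repository: the `OLTEntry` fields follow the abstract (instance ID, semantic label, 3D bounding box), while `retrieve`, `nearest_to`, and the centre-distance score are assumptions standing in for the agent's tools.

```python
import math
from dataclasses import dataclass

@dataclass
class OLTEntry:
    """One row of a (hypothetical) Object Lookup Table."""
    instance_id: int
    label: str
    bbox_min: tuple
    bbox_max: tuple

    @property
    def center(self):
        # Midpoint of the axis-aligned 3D bounding box.
        return tuple((lo + hi) / 2 for lo, hi in zip(self.bbox_min, self.bbox_max))

def retrieve(olt, label):
    """Keep only candidates whose semantic label matches the query term."""
    return [e for e in olt if e.label == label]

def nearest_to(candidates, anchor):
    """Geometric scoring: pick the candidate closest to the anchor object."""
    return min(candidates, key=lambda e: math.dist(e.center, anchor.center))

olt = [
    OLTEntry(0, "table", (0, 0, 0), (1, 1, 1)),
    OLTEntry(1, "chair", (3, 0, 0), (4, 1, 1)),
    OLTEntry(2, "chair", (1.2, 0, 0), (1.8, 1, 1)),
]

# Query: "the chair next to the table".
anchor = retrieve(olt, "table")[0]
target = nearest_to(retrieve(olt, "chair"), anchor)
```

Only the retrieved candidates would need to be placed in the language model's prompt, which is the context-window saving the abstract refers to.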

2605.25892 2026-05-26 cs.CV 版本更新

SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution

SP-MoMamba:基于超像素驱动的状态空间专家混合模型用于高效图像超分辨率

Wenbin Zou, Yawen Cui, Yi Wang, Lap-Pui Chau, Liang Chen, Jinshan Pan, Huiping Zhuang, Guanbin Li

发表机构 * Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, China(华南理工大学吴贤铭智能工程学院) Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR(香港理工大学电子与电气工程系) College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou, China(福建师范大学光电工程学院) School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China(南京理工大学计算机科学与工程学院) School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China(中山大学计算机科学与工程学院)

AI总结 提出SP-MoMamba,通过超像素驱动将刚性扫描转化为语义级交互,结合多尺度超像素状态空间专家混合与局部空间调制专家,实现高效且保真的图像超分辨率。

Comments 16 pages, 15 figures

详情
AI中文摘要

状态空间模型(SSM)因其线性复杂度和长程建模能力,已成为高效单图像超分辨率(SR)的强大范式。然而,现有的基于Mamba的方法通常依赖于与数据无关的刚性扫描,将2D图像重塑为固定网格上的1D序列,这不可避免地破坏了空间语义拓扑并引入伪影。受格式塔知觉分组理论的启发,我们提出了SP-MoMamba,一种用于内容感知SR的超像素驱动状态空间专家混合模型。我们的核心思想是通过将超像素视为基本单元,将传统的刚性扫描转化为语义级交互。具体来说,我们引入了超像素驱动状态空间模型(SP-SSM),它将语义同质区域压缩为高阶令牌,以保持全局拓扑一致性。为了解决固定扫描尺度与多样语义粒度之间的冲突,我们开发了多尺度超像素状态空间专家混合(MSS-MoE)。该模块利用动态路由机制自适应地分配尺度特定专家,有效捕捉多尺度纹理,同时减少计算冗余。此外,为了防止全局抽象过程中高频细节的丢失,我们引入了局部空间调制专家(LSME)来补充全局建模,确保锐利边缘和精细结构的精确重建。在标准基准上的大量实验表明,与最先进的高效SR方法相比,SP-MoMamba实现了更优的重建保真度和更有利的效率-性能权衡。

英文摘要

State space models (SSMs) have emerged as a powerful paradigm for efficient single-image super-resolution (SR) due to their linear complexity and long-range modeling capabilities. However, existing Mamba-based methods typically rely on data-agnostic rigid scanning, which reshapes 2D images into 1D sequences over a fixed grid, inevitably disrupting spatial-semantic topology and introducing artifacts. Inspired by the Gestalt perceptual grouping theory, we propose SP-MoMamba, a superpixel-driven mixture of state space experts designed for content-aware SR. Our core idea is to transform the traditional rigid scanning into a semantic-level interaction by treating superpixels as fundamental units. Specifically, we introduce the Superpixel-driven State Space Model (SP-SSM), which compresses semantically homogeneous regions into high-order tokens to preserve global topological consistency. To address the conflict between fixed scanning scales and diverse semantic granularities, we develop the Multi-Scale Superpixel Mixture of State Space Experts (MSS-MoE). This module utilizes a dynamic routing mechanism to adaptively assign scale-specific experts, effectively capturing multi-scale textures while reducing computational redundancy. Furthermore, to prevent the loss of high-frequency details during global abstraction, we introduce a Local Spatial Modulation Expert (LSME) to complement the global modeling, ensuring a precise reconstruction of sharp edges and fine structures. Extensive experiments on standard benchmarks demonstrate that SP-MoMamba achieves superior reconstruction fidelity and a more favorable efficiency-performance trade-off compared to state-of-the-art efficient SR methods.

2605.25878 2026-05-26 eess.IV cs.CV 版本更新

A Clinically Validated Foundation Model for Comprehensive Lung Pathology Interpretation

临床验证的基础模型用于全面肺部病理解读

Zhengrui Guo, Zhengyu Zhang, Jiabo Ma, Yihui Wang, Fengtao Zhou, Yingxue Xu, Ling Liang, Chenglong Zhao, Qi Xie, Jinbang Li, Shujing Guo, Fangyi Han, Zhijian Cen, Ziyi Liu, Cheng Jin, Junlin Hou, Zhixuan Chen, Yu Cai, Lijuan Qu, Shifu Chen, Yueping Liu, Zhe Wang, Xiuming Zhang, Muyan Cai, Li Liang, Hao Chen

发表机构 * Department of Pathology, Nanfang Hospital, Southern Medical University, Guangzhou, China(南方医科大学南方医院病理科,广州,中国) Department of Pathology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China(南方医科大学基础医学院病理学系,广州,中国) Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China(香港科技大学计算机科学与工程系,香港,中国) Guangdong Provincial Key Laboratory of Molecular Tumor Pathology, Guangzhou, China(广东省分子肿瘤病理重点实验室,广州,中国) Department of Pathology, Shandong Provincial Qianfoshan Hospital, Jinan, Shandong, China(山东省千佛山医院病理科,济南,山东,中国)

AI总结 提出PulmoFoundation,一种基于Virchow2和约4万张H&E染色全切片图像进行亚专科预训练的肺部病理基础模型,通过32项临床任务和前瞻性随机对照试验验证,在诊断准确性、效率和一致性上显著提升。

详情
AI中文摘要

病理评估指导肺癌诊断、治疗选择和预后评估,但当前的CPath方法依赖于针对孤立目标的任务特定模型。尽管泛癌基础模型提供了多功能性,但它们缺乏亚专科深度,且未在临床工作流程中评估或在真实世界环境中进行前瞻性验证。我们介绍了PulmoFoundation,这是一个多中心、前瞻性验证、随机对照试验(RCT)评估的基础模型,用于术前、术中和术后护理的全面肺部病理评估。PulmoFoundation基于Virchow2,通过使用约40,000张诊断性H&E染色全切片图像(WSI)进行亚专科特定预训练构建,并在约26,000张WSI上系统评估了32项临床相关任务。除了准确预测分子标记和患者生存率外,我们的模型在活检、冰冻切片和手术切除切片的核心诊断任务中达到了临床级性能。在一项针对1,357名患者、涵盖11项诊断任务的注册前瞻性研究中,我们的模型实现了平均AUC 92.3%。使用预设的分诊阈值,PulmoFoundation可以减少68.8%的活检和83.0%的冰冻切片的额外二次复核负担,并推迟44.5%的IHC染色申请,阳性预测值分别为1.0、0.991和0.966。除了前瞻性验证,我们还进行了一项交叉RCT,涉及八名病理学家,AI辅助在4,928个病例-阅片者对中提高了诊断准确性(有AI为91.7%,无AI为83.8%)。AI辅助还使中位诊断时间减少了19.6%,诊断信心提高了8.7%,并将阅片者间一致性从中等(kappa=0.56)提高到高度一致(kappa=0.76)。这些评估共同支持PulmoFoundation作为临床验证的肺部病理决策支持系统。

英文摘要

Pathological assessment guides lung cancer diagnosis, treatment selection, and prognostic evaluation, yet current CPath approaches rely on task-specific models for isolated objectives. Although pan-cancer foundation models offer versatility, they lack subspecialty-level depth and have not been evaluated across clinical workflows or prospectively validated in real-world settings. We introduce PulmoFoundation, a multi-center, prospectively validated, randomized controlled trial (RCT)-evaluated foundation model for comprehensive lung pathology assessment across pre-operative, intra-operative, and post-operative care. Built upon Virchow2 via subspecialty-specific pretraining using ~40,000 diagnostic H&E-stained whole-slide images (WSIs), PulmoFoundation was systematically evaluated on ~26,000 WSIs across 32 clinically relevant tasks. In addition to accurately predicting molecular markers and patient survival, our model achieves clinical-grade performance in core diagnostic tasks across biopsy, frozen section, and surgical resection slides. In a registered prospective study of 1,357 patients across 11 diagnostic tasks, our model achieved an average AUC of 92.3%. Using pre-specified triage thresholds, PulmoFoundation could reduce additional second-review burden for 68.8% of biopsies and 83.0% of frozen sections, and defer 44.5% of IHC stain orders, with PPVs of 1.0, 0.991, and 0.966. Beyond prospective validation, we conducted a crossover RCT with eight pathologists, in which AI assistance improved diagnostic accuracy across 4,928 case-reader pairs (91.7% w/ AI vs. 83.8% w/o AI). AI assistance also reduced median diagnostic time by 19.6%, increased diagnostic confidence by 8.7%, and improved inter-rater agreement from moderate (kappa = 0.56) to substantial (kappa = 0.76). Together, these evaluations support PulmoFoundation as a clinically validated decision-support system for lung pathology.
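
The inter-rater agreement figures above are kappa statistics, which correct raw agreement for the agreement expected by chance. The abstract does not say which kappa variant was used for the eight readers, so the following is only a worked sketch of the simplest case, Cohen's kappa for two raters, on made-up labels.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    # Observed agreement: fraction of cases with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(
        (count_a[label] / n) * (count_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    return (p_observed - p_chance) / (1 - p_chance)

# Illustrative labels only (1 = malignant, 0 = benign); not data from the study.
reader_1 = [1, 1, 0, 0]
reader_2 = [1, 0, 0, 0]
kappa = cohen_kappa(reader_1, reader_2)
```

Here the readers agree on 3 of 4 cases (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.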

2605.25876 2026-05-26 cs.CV 版本更新

DyCoRM: Dynamic Criterion-Aware Reward Modeling for Text-to-Image Generation

DyCoRM: 面向文本到图像生成的动态准则感知奖励建模

Jiaying Qian, Ziheng Jia, Qian Zhang, Zicheng Zhang, Jiayi Guo, Junqi Zhang, Guangtao Zhai, Xiongkuo Min

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对用户对文本到图像生成中动态、细粒度评价准则的需求,提出DyCoRM动态准则感知奖励模型,并构建数据集DyCoDataset-20K和基准DyCoBench-1K,通过准则感知偏好比较和DyCoPick选择方法,实现首个动态细粒度奖励建模框架。

详情
AI中文摘要

随着文本到图像(T2I)生成的持续进步,生成高质量图像变得越来越容易;因此,用户需求转向更符合其特定要求的图像。由于奖励模型在评估生成图像是否符合用户偏好方面扮演着越来越重要的角色,这一趋势为奖励建模带来了一个重要挑战:奖励模型不应仅依赖静态和通用的评价维度,而应考虑用户评估生成图像是否满足其特定要求时所用的任务相关且细粒度的准则。为应对这一挑战,我们提出了DyCoRM,一种动态的、准则感知的奖励模型,它能够基于任务相关准则并进行准则感知的偏好比较。为支持这一设定,我们构建了DyCoDataset-20K,提供动态准则及准则级标注,并进一步推导出DyCoBench-1K,一个用于在动态准则下系统评估奖励模型的基准。我们还引入了DyCoPick,它将准则感知奖励建模应用于T2I图像选择。我们的贡献建立了首个用于动态和细粒度评估以及在T2I生成中实际应用的奖励建模框架。

英文摘要

With the continued advancement of text-to-image (T2I) generation, producing high-quality images is becoming increasingly attainable; consequently, user demands are shifting toward images that better satisfy their specific requirements. As reward models play an increasingly important role in assessing whether generated images align with user preference, this trend introduces an important challenge for reward modeling: rather than relying solely on static and general evaluation dimensions, reward models should account for the task-relevant and fine-grained criteria through which users assess whether generated images meet their specific requirements. To address this challenge, we propose DyCoRM, a dynamic, criterion-aware reward model that grounds task-relevant criteria and performs criterion-aware preference comparison. To support this setting, we construct DyCoDataset-20K, which provides dynamic criteria together with criterion-level annotations, and further derive DyCoBench-1K, a benchmark for systematically evaluating reward models under dynamic criteria. We further introduce DyCoPick, which applies criterion-aware reward modeling to selecting T2I images. Our contributions establish the first reward modeling framework for dynamic and fine-grained evaluation and practical application in T2I generation.

2605.25874 2026-05-26 cs.CV 版本更新

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

WBench:面向交互式视频世界模型评估的综合多轮基准

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding

发表机构 * Fudan University(复旦大学) Meituan Longcat Team(美团Longcat团队)

AI总结 提出WBench,一个包含五个维度、289个测试用例和1058轮交互的综合多轮基准,用于系统评估交互式世界模型,并发现现有模型在不同维度上表现不一。

Comments Technical report of WBench. Homepage: https://meituan-longcat.github.io/WBench/

详情
AI中文摘要

交互式世界模型正在快速发展,但现有基准仅覆盖部分所需能力,缺乏统一标准进行系统评估。为填补这一空白,我们引入了WBench,一个全面的多轮基准,用于沿五个维度(视频质量、设置遵循、交互遵循、一致性和物理合规性)评估交互式世界模型。WBench包含289个测试用例和1,058个交互轮次,每个用例指定一个世界设置和多轮交互序列,涵盖多样场景、风格、主体以及第一人称和第三人称视角,同时包括四种交互类型:导航、主体动作、事件编辑和视角切换。对于导航,WBench统一了文本、6自由度姿态和离散动作控制,使得具有不同原生输入接口的模型都能被评估。评估使用22个自动子指标,结合了专业视觉模型和大规模多模态模型,所有指标均通过人类判断进行验证。在20个最先进的模型上,我们发现没有单个模型在所有维度上都表现良好。我们提供了关于每个模型特征性优势、劣势和开放挑战的详细诊断见解。代码和数据可在 https://github.com/meituan-longcat/WBench 获取。

英文摘要

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.

2605.25860 2026-05-26 cs.CV 版本更新

SAM3-Assisted Training of Lightweight YOLO Models for Precision Pig Farming

SAM3辅助训练的轻量级YOLO模型用于精准养猪

Marcos Vinicius Mendes Faria, Thiago Borges Pereira, Isabella C. F. S. Condotta, Thiago Meireles Paixão, Francisco de Assis Boldt

发表机构 * Department of Animal Sciences, University of Illinois at Urbana-Champaign, USA(动物科学系,伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出利用SAM 3自动生成伪标签训练YOLOv8检测器,无需人工标注,在PigLife数据集上达到79.4% mAP,推理速度比教师模型快约200倍。

Comments Accepted for publication at the IEEE Sensors Applications Symposium (SAS 2026)

详情
AI中文摘要

基于深度学习的物体检测彻底改变了精准畜牧业(PLF),但仍存在一个关键障碍:高性能基础模型(如SAM 3)计算量过大,无法在边缘部署,而轻量级模型(如YOLO)需要大量人工标注。本文提出了一种全自动知识蒸馏流程,利用Segment Anything Model 3(SAM 3)生成零样本伪标签,用于训练高效的YOLOv8检测器。通过将SAM 3视为离线自动标注器,消除了手动标注瓶颈,生成的模型能够在资源受限的硬件上实现实时推理。我们在PigLife数据集上系统评估了该方法,将SAM 3监督模型与人工标注基线进行了比较。结果表明,无需人工干预,SAM 3训练的YOLOv8m平均精度(mAP)达到79.4%,同时推理延迟比教师模型降低约200倍。此外,分层分析显示,在低遮挡场景下,自动流程的检测率与人工基准相当(AP50 > 99%)。这些发现表明,基础模型可以作为有效的零标注成本监督器,为智慧农业提供可扩展的边缘计算解决方案。

英文摘要

Deep learning-based object detection has revolutionized Precision Livestock Farming (PLF), yet a critical barrier remains: high-performance Foundation Models (such as SAM 3) are too computationally intensive for edge deployment, while lightweight models (like YOLO) require prohibitive manual annotation efforts. This work proposes a fully automated knowledge distillation pipeline that leverages the Segment Anything Model 3 (SAM 3) to generate zero-shot pseudo-labels for training efficient YOLOv8 detectors. By treating SAM 3 as an offline auto-annotator, we eliminate the manual labeling bottleneck, producing models capable of real-time inference on resource-constrained hardware. We systematically evaluate this approach on the PigLife dataset, comparing SAM 3-supervised models against human-annotated baselines. Results demonstrate that a SAM 3-trained YOLOv8m achieves a mean Average Precision (mAP) of 79.4% without human intervention, while reducing inference latency by approximately 200× compared to the teacher model. Furthermore, stratified analysis reveals that in low-occlusion scenarios, the automated pipeline achieves detection rates comparable to human benchmarks (AP50 > 99%). These findings indicate that foundation models can serve as effective, zero-annotation-cost supervisors, enabling scalable edge computing solutions for smart agriculture.
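
One step such a pipeline needs is turning a segmentation mask into a detection label. The sketch below assumes the teacher's output is available as a binary mask (a nested list of 0/1 values) and writes it in the common YOLO text-label layout of `class x_center y_center width height`, normalized by image size. The function name and the single `pig` class with ID 0 are assumptions, and nothing here calls SAM 3 or a YOLO library.

```python
def mask_to_yolo_line(mask, class_id=0):
    """Convert one binary mask into a normalized YOLO-style label line."""
    height, width = len(mask), len(mask[0])
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(mask[r][c] for r in range(height))]
    if not rows:
        return None  # empty mask: no pseudo-label for this instance
    # Tight pixel box; the +1 makes the max edge exclusive.
    x0, x1 = cols[0], cols[-1] + 1
    y0, y1 = rows[0], rows[-1] + 1
    x_center = (x0 + x1) / 2 / width
    y_center = (y0 + y1) / 2 / height
    box_w = (x1 - x0) / width
    box_h = (y1 - y0) / height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
line = mask_to_yolo_line(mask)
```

For this 4×4 mask the object occupies the central 2×2 pixels, so the label line is `0 0.500000 0.500000 0.500000 0.500000`.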

2605.25832 2026-05-26 cs.RO cs.AI cs.CL cs.CV 版本更新

When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills

当搜索成为记忆:将机器人设计试验转化为可迁移技能

Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang

发表机构 * University of Michigan(密歇根大学)

AI总结 提出Auto-Robotist,一种自进化LLM代理,通过将形态搜索轨迹提炼为自然语言技能库,实现可迁移的机器人设计知识,在EvoGym任务中提升冷启动搜索并跨设计空间迁移技能。

Comments 20 pages, 8 figures

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作进化机器人设计的提案生成器,但大多数循环仍然是无记忆的:模拟结果塑造下一代种群,但并未作为可复用的设计知识保留。我们提出Auto-Robotist,一种自进化的LLM代理,它将形态搜索轨迹提炼为显式的自然语言技能库。每个技能存储结构原型、基于证据的正负规则以及支持它们的评估设计,使设计记忆可检查而非隐含在种群中。在搜索过程中,代理检索技能以调节LLM对精英主体的编辑,同时保留遗传算法(GA)突变路径以进行探索;评估后,通过添加、诊断和合并更新库。在涵盖运动、穿越和物体交互的七个EvoGym任务中,Auto-Robotist改善了冷启动5x5搜索,并将学到的技能迁移到10x10设计空间,其中参考条件迁移在每个任务上都优于GA。这些结果表明,LLM代理可以将昂贵的物理评估转化为可复用、可审计的设计原则。我们的代码将在接收后发布。

英文摘要

Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.

2605.25821 2026-05-26 cs.CV 版本更新

[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation

[CLS] 还不够:基于补丁级推理与自适应聚合的多标签识别

Akang Wang, Xili Deng, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

发表机构 * Yunnan Normal University, Kunming, China(云南师范大学,昆明,中国) Kunming University of Science and Technology, Kunming, China(昆明理工大学,昆明,中国)

AI总结 针对CLIP等视觉语言模型在多标签识别中因[CLS]全局表征不足的问题,提出PIAA框架,通过补丁级推理和自适应聚合实现无训练的多标签识别,在NUS-WIDE上mAP提升超6%。

详情
AI中文摘要

视觉语言模型(如CLIP)通过将图像与文本概念对齐展现出强大的零样本识别能力,但在多标签识别(多个目标共存)中表现不佳。一个关键瓶颈是[CLS]标记作为单一的全局视觉表征,不足以忠实编码具有不同尺度、上下文和共现模式的多样目标。为解决这一局限,我们提出一个新的多标签图像识别框架PIAA,将预测公式化为补丁级推理后接自适应聚合。具体来说,我们首先从两个互补角度增强补丁级预测:(i)缓解视觉编码器中的语义纠缠以获得更具判别性的补丁表征,(ii)学习无监督视觉分类器以缩小视觉-语言模态差距。然后我们引入一个自适应聚合模块,将补丁级分数整合为最终的多标签预测。值得注意的是,整个流程完全无需训练,不需要梯度更新或参数微调。实验表明,我们的方法以最小的额外计算实现了显著改进,在具有挑战性的NUS-WIDE基准上相比代表性基线mAP提升超过6%。代码可在https://github.com/akang-wang/PIAA获取。

英文摘要

Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision-language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines. Code is available at https://github.com/akang-wang/PIAA.
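
The motivation for patch-level inference can be shown with a toy example. This is not the PIAA method: the two-dimensional embeddings are invented, and a top-k mean is used as a simple stand-in for the paper's adaptive aggregation module, whose details the abstract does not give.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def patch_scores(patches, class_embs):
    """Score every patch embedding against every class text embedding."""
    return [[cosine(p, c) for c in class_embs] for p in patches]

def topk_mean(scores, k):
    """Aggregate per class by averaging its k highest patch scores (stand-in)."""
    n_classes = len(scores[0])
    result = []
    for j in range(n_classes):
        top = sorted((row[j] for row in scores), reverse=True)[:k]
        result.append(sum(top) / len(top))
    return result

def mean_pool(patches):
    """A single global vector, playing the role of one [CLS]-like summary."""
    n = len(patches)
    return [sum(p[i] for p in patches) / n for i in range(len(patches[0]))]

# Two patches show class 0, one patch shows class 1 (toy embeddings).
patches = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
class_embs = [(1.0, 0.0), (0.0, 1.0)]

per_patch = topk_mean(patch_scores(patches, class_embs), k=1)
global_scores = [cosine(mean_pool(patches), c) for c in class_embs]
```

With patch-level scoring both classes reach a score of 1.0, while the pooled global vector scores the minority class at only about 0.45, which mirrors the dilution effect the abstract attributes to a single global representation.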

2605.25810 2026-05-26 cs.CV 版本更新

Data-driven Head Motion Generation through Natural Gaze-Head Coordination

基于自然注视-头部协调的数据驱动头部运动生成

Xiaohan Liu, Yilin Wen, Yusuke Sugano

发表机构 * Institute of Industrial Science, The University of Tokyo(东京大学工业科学研究所)

AI总结 提出首个数据驱动方法,通过自动提取自然注视和头部运动,利用条件变分自编码器生成与注视相关的头部运动,并应用于注视控制的视频生成。

详情
AI中文摘要

我们提出了首个数据驱动的方法,从大规模野外面部视频中建模时间上的注视-头部协调。为了获得可泛化学习的训练数据,我们提出了一种自动流水线,利用现成的基于外观的注视估计器提取自然且多样化的注视和头部运动。为了捕捉注视-头部协调的概率相关性和时间动态,我们将模型建立在生成性条件变分自编码器上,以生成合理且多样化的注视条件头部运动。我们进一步将框架应用于注视控制的面部视频生成,其中我们实现了与输入注视相关的自然逼真头部运动的视频生成——这一方面此前未被强调。人类评估和定量比较证明了我们方法的有效性并验证了我们的设计选择,评估者对我们的方法表现出统计学上显著的偏好,优于基线方法。

英文摘要

We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder for plausible yet diverse gaze-conditioned head motion generations. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated to the input gaze - an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method's effectiveness and validate our design choices, with evaluators showing statistically significant preference for our approach over baseline methods.

2605.25804 2026-05-26 cs.CV 版本更新

Event-to-Video Reconstruction using Spatio-Temporal and Frequency-Enhanced Deep Neural Networks

基于时空与频率增强深度神经网络的事件到视频重建

Ramna Maqsood, Paulo Nunes, Luís Ducla Soares, Caroline Conti

发表机构 * Instituto de Telecomunicações, Instituto Universitário de Lisboa (ISCTE-IUL)(电信研究所,里斯本大学学院(ISCTE-IUL))

AI总结 提出MSFET-E2V模型,通过跨域注意力模块融合时空特征与离散小波变换的频率表示,并设计轻量级小波增强跳跃块,实现高质量事件到视频重建,在多个数据集上超越现有方法。

详情
AI中文摘要

事件相机相比传统基于帧的相机具有显著优势,包括高时间分辨率、低延迟和能量效率。这些特性使其适用于高速和高动态范围场景采集;然而,缺乏密集强度帧限制了传统计算机视觉方法在场景理解中的直接应用。事件到视频(E2V)重建旨在通过将异步事件流转换为同步视频帧序列来弥合这一差距。现有的基于卷积神经网络和Transformer的E2V重建方法主要在空间域操作,往往难以恢复精细结构细节并抑制严重重建伪影。为解决这些问题,我们提出MSFET-E2V,一种新颖的多尺度频率增强Transformer模型。其核心是跨域注意力模块,该模块将时空特征与来自离散小波变换的频率感知表示相融合。与仅依赖空间注意力的先前方法不同,我们的方法通过考虑低频和高频分量有效捕捉局部和全局结构,增强细节保留和跨各种运动场景的鲁棒性。此外,我们提出一个轻量级小波增强跳跃块作为跳跃连接,通过联合空间-频率域处理促进伪影抑制和结构细节细化。大量实验表明,MSFET-E2V在多个真实世界事件数据集上取得了优于最先进方法的性能,在重建质量上提供了显著提升。此外,与现有基于Transformer的方法相比,我们提出的模型显著减少了参数数量、GPU内存使用和推理时间。

英文摘要

Event cameras offer significant advantages over conventional frame-based counterparts, including high temporal resolution, low latency, and energy efficiency. These characteristics make them suitable for high-speed and high-dynamic range scene acquisition scenarios; however, the lack of dense intensity frames limits the direct applicability of conventional computer vision methods for scene understanding. Event-to-video (E2V) reconstruction seeks to bridge this gap by converting asynchronous event streams into a sequence of synchronous video frames. Existing E2V reconstruction methods based on convolutional neural networks and transformers operate primarily in the spatial domain and often struggle to recover fine structural details while suppressing severe reconstruction artifacts. To address these issues, we propose MSFET-E2V, a novel multiscale frequency-enhanced transformer model. At its core lies a cross-domain attention module, which fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform. Unlike prior methods relying solely on spatial attention, our approach effectively captures both local and global structures by taking into account low- and high-frequency components, enhancing detail preservation and robustness across various motion scenarios. Furthermore, we propose a lightweight wavelet-enhanced skip block that serves as a skip connection, facilitating artifact suppression and structural detail refinement through joint spatial-frequency domain processing. Extensive experiments demonstrate that MSFET-E2V achieves superior performance over state-of-the-art methods on multiple real-world event datasets, offering significant gains in reconstruction quality. Moreover, compared to the existing transformer-based method, our proposed model significantly reduces the number of parameters, the GPU memory usage, and inference time.
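
The frequency-aware branch relies on the discrete wavelet transform, which splits an image into one low-frequency band and three high-frequency detail bands. The sketch below is a single-level 2D Haar transform in plain Python, given as a minimal illustration of that decomposition. The abstract does not state which wavelet MSFET-E2V uses, so Haar is an assumption, and the band names are this example's own.

```python
def haar2d(img):
    """Single-level 2D Haar transform of an image with even height and width.

    Returns (approx, horizontal, vertical, diagonal) sub-bands, each half-size.
    """
    approx, horizontal, vertical, diagonal = [], [], [], []
    for r in range(0, len(img), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]          # top-left, top-right
            e, d = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            row_a.append((a + b + e + d) / 2)  # low-frequency average
            row_h.append((a + b - e - d) / 2)  # top vs bottom: horizontal edges
            row_v.append((a - b + e - d) / 2)  # left vs right: vertical edges
            row_d.append((a - b - e + d) / 2)  # diagonal detail
        approx.append(row_a)
        horizontal.append(row_h)
        vertical.append(row_v)
        diagonal.append(row_d)
    return approx, horizontal, vertical, diagonal

# A 2x2 block containing a vertical edge (dark left column, bright right column).
approx, horizontal, vertical, diagonal = haar2d([[0, 1], [0, 1]])
```

For this block only the approximation and the vertical-edge band are non-zero (`approx == [[1.0]]`, `vertical == [[-1.0]]`), showing how edges are isolated in the high-frequency bands that a frequency-aware attention module could attend to.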

2605.25802 2026-05-26 cs.CV 版本更新

Rethinking VLM Representation for VLA Initialization

重新思考用于VLA初始化的VLM表示

Weifeng Lin, Siyuan Huang, Hao Li, Tingwei Chen, Ruichuan An, Xinyu Wei, Jianbo Liu, Hongsheng Li

发表机构 * CUHK(香港中文大学) PolyU(香港理工大学) Peking University(北京大学) ACE Robotics(ACE机器人)

AI总结 本文通过控制表示设计问题,沿能力级具身VQA监督、参数更新策略和机器人数据预训练三个轴,研究VLA初始化,发现保留预训练VLM表示对动作性能至关重要,而LoRA比全微调提供更可靠的初始化,分阶段基于LoRA的训练获得最强变体。

Comments 9 main-text pages, 5 appendix pages, 4 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型广泛采用预训练的视觉-语言模型(VLM)作为策略骨干,但目前尚不清楚何种预训练VLM表示对VLA初始化有用。在本文中,我们将VLA初始化作为一个受控的表示设计问题,沿三个轴进行研究:能力级具身VQA监督、参数更新策略和机器人数据预训练。我们的实验表明,原始预训练VLM表示是动作性能的关键来源。然而,具身VQA适应并不产生一致的收益:其收益取决于下游瓶颈,且来自不同能力域的收益并非简单相加。对于更新策略,LoRA提供了比全微调更可靠的初始化,表明过度重塑预训练表示会削弱VLA初始化。机器人数据预训练进一步改善了VLA初始化,通过分阶段基于LoRA的训练获得了最强变体。这些发现共同表明,有效的VLM到VLA适应应在保留对动作学习有用的预训练VLM表示的同时,注入与动作相关的具身和机器人轨迹信号。

英文摘要

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.
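
The contrast between LoRA and full finetuning can be made concrete with the standard LoRA update, in which a frozen weight matrix W is augmented by a scaled low-rank product, y = Wx + (alpha / r) · B(Ax). The sketch below is a generic pure-Python illustration of that formula, not the paper's training code; the matrix sizes and values are arbitrary.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    r = len(A)  # LoRA rank = number of rows in A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (2 x 2)
A = [[1.0, 1.0]]              # down-projection (r x d_in), r = 1
x = [2.0, 3.0]

# B is conventionally initialized to zero, so the adapted layer starts out
# identical to the pretrained one.
y_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0)

# After training, B is non-zero and adds a rank-1 correction.
y_tuned = lora_forward(W, A, [[1.0], [0.0]], x, alpha=2.0)
```

Because the update is confined to a low-rank subspace and starts from zero, the pretrained mapping is perturbed gradually, which is consistent with the abstract's observation that overly reshaping the pretrained representation weakens VLA initialization.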

2605.25801 2026-05-26 cs.CV 版本更新

PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution

PixelWizard: 迈向高效高保真的超大空间分辨率视频生成

Wenxue Li, Jingjing Ren, Peng Zhang, Tian Ye, Daiguo Zhou, Jian Luan, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) MiLM Plus, Xiaomi Inc(小米公司MiLM Plus部门) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出PixelWizard框架,通过分层解耦全局结构建模与细节合成,并引入噪声跨度对齐捷径训练,实现超大分辨率视频的高效高保真生成,加速超过10倍。

详情
AI中文摘要

高分辨率视频生成面临优化不稳定和计算成本高昂的双重瓶颈。令牌序列的大规模扩展不仅使优化偏向局部纹理而牺牲全局一致性,导致结构崩溃,还带来了高昂的训练成本和严重的推理延迟。为了解决这个问题,我们提出了PixelWizard,一个将全局结构建模与细粒度细节合成分层解耦的框架。PixelWizard首先建立一个紧凑的时空锚点以集中密集的结构先验,然后指导高分辨率下的细粒度生成。这减轻了局部优化偏差,确保结构稳定性而不损害高频细节。利用这种结构稳定性,我们引入了噪声跨度对齐捷径训练来打破推理瓶颈。通过显式建模步长,该机制允许模型以大步长遍历生成轨迹。关键的是,我们结合了指数索引偏置采样和自适应噪声跨度校准,以对齐优化与高分辨率网格的偏移噪声调度,确保鲁棒的少步推理而不产生蒸馏的沉重开销。大量实验表明,PixelWizard在实现卓越视觉质量的同时,将原生2K/4K视频的生成采样加速超过10倍。

英文摘要

High-resolution video generation faces a coupled bottleneck of optimization instability and prohibitive computational costs. The massive expansion of the token sequence not only biases optimization toward local textures at the expense of global coherence, leading to structural collapse, but also imposes prohibitive training costs and severe inference latency. To address this, we propose PixelWizard, a framework that hierarchically decouples global structure modeling from fine-grained detail synthesis. PixelWizard first establishes a compact spatiotemporal anchor to concentrate dense structural priors, which then guides fine-grained generation at high resolution. This mitigates the local optimization bias to ensure structural stability without compromising high-frequency details. Leveraging this structural stability, we introduce Noise-Span Aligned Shortcut Training to break the inference bottleneck. By explicitly modeling the step size, this mechanism allows the model to traverse the generation trajectory with large steps. Crucially, we incorporate Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to align optimization with the shifted noise schedules of high-resolution grids, ensuring robust few-step inference without incurring the heavy overhead of distillation. Extensive experiments demonstrate that PixelWizard achieves superior visual quality while accelerating the generative sampling of native 2K/4K videos by over 10x.

2605.25799 2026-05-26 cs.CV Version update

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

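Affiliations: Huazhong University of Science and Technology

AI summary: Standard fine-tuning in cross-domain few-shot learning exacerbates the attention sink and reduces discriminability; the paper proposes dynamic token re-weighting that suppresses reliance on simple tokens and strengthens learning of hard tokens, achieving new state-of-the-art performance.

Comments: Accepted by CVPR 2026

Details
AI abstract (translated from Chinese)

Vision-language models such as CLIP show impressive generalization, but their potential for Cross-Domain Few-Shot Learning (CDFSL), which requires transferring source-domain information to target domains with scarce training data, has not been fully explored. Although the attention sink phenomenon has been observed in vision-language models on some tasks, its role in CDFSL has not been studied. This paper reveals a key issue overlooked by prior work: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink, leading to poor discriminability between classes. Through extensive experiments, we interpret this as shortcut learning for domain adaptation: to overcome the large gap between source and target domains, the model tends to push tokens that are initially closer to target-domain classes (simple tokens) even closer, which exacerbates the attention sink and wastes the capacity to learn other discriminative but initially distant tokens (hard tokens). To address this, we propose a new method that dynamically re-weights tokens during target-domain fine-tuning according to their relevance to target-domain classes, explicitly suppressing the model's reliance on simple tokens and strengthening the learning of hard tokens, which reduces sink tokens and improves discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method and show new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.

English abstract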

Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL), where the model needs to transfer source-domain information to target domains with scarce training data, remains underexplored. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior work: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. Through extensive experiments, we interpret this phenomenon as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a strong tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) even closer to these classes, exacerbating the attention sink and wasting the capacity to learn other discriminative but initially farther tokens (i.e., hard tokens). To address this, we propose a novel approach that dynamically re-weights tokens according to their relevance to target-domain classes during target-domain fine-tuning, which explicitly suppresses the model's reliance on simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.

2605.25784 2026-05-26 cs.CV cs.MM Version update

VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing Natural Scenes

Jing Huang, Duanchu Wang, Junjie Yang, Zihang Cheng, Cheng Li, Lin Cui, Zhouyi Wu, Di Wang

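Affiliations: Xi’an Jiaotong University; Xidian University; University of Chinese Academy of Sciences

AI summary: Introduces VertiCue-Bench, a benchmark of 1,534 instances across 17 tasks that diagnoses whether MLLMs genuinely use vertical cues from Canopy Height Models (CHMs) to resolve semantic ambiguity in natural remote sensing scenes, and finds a marked disconnect between perceiving height cues and semantic reasoning.

Details
AI abstract (translated from Chinese)

Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric and evaluate models mainly on optical appearance. In natural environments this paradigm fails because of severe spectral confusion: ecologically distinct regions share similar textures yet differ fundamentally in vertical structure. In such cases, explicit 3D structural data such as Canopy Height Models (CHMs) become essential geometric evidence for semantic disambiguation. It remains unclear, however, whether current MLLMs can genuinely use vertical cues to resolve appearance-level ambiguity. To fill this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-based geospatial reasoning. VertiCue-Bench contains 1,534 carefully curated instances covering 17 tasks and explicitly separates low-level height perception from ambiguity-aware semantic reasoning. An evaluation of 14 state-of-the-art general-purpose and remote-sensing-specific MLLMs, combined with counterfactual modality testing, reveals a striking dissociation between perception and reasoning. Although models show an emerging ability to read raw CHM height cues, they mostly fail to turn geometric perception into reliable semantic reasoning and often underperform RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a key geometry-to-semantics gap in natural scene understanding and offers actionable insights for advancing geospatial MLLMs.

English abstract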

Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.

2605.25778 2026-05-26 cs.CV Version update

OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance

Zitong Xiao, Yuda Qiu, Zisheng Ye, Xiaoguang Han

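Affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Guangdong Provincial Key Laboratory of Future Networks of Intelligence; FNii-Shenzhen

AI summary: Proposes OMGTex, an end-to-end diffusion framework that reconstructs high-quality, editable UV textures directly from multi-style facial images without 3D geometry priors, achieving robust reconstruction and editing through gradient-guided inference and semantic-aware training.

Comments: CVPR 2026 (Poster)

Details
AI abstract (translated from Chinese)

We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality, editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two main limitations: (1) they rely on 3D geometry priors that are hard to estimate accurately, especially under facial occlusion or in stylized domains, which makes them fragile; and (2) they lack semantic disentanglement, which hinders region-specific texture editing and style transfer. Our work addresses both challenges. Our core innovation is a geometry-free pipeline that maps a 2D facial image directly to its corresponding editable UV texture. We introduce two key techniques. First, to address the UV misalignment common in diffusion generation, we introduce an inference-time gradient-guided refinement strategy that explicitly corrects structural consistency. Second, we exploit the inherent semantic distribution capability of diffusion models and design a new training paradigm that strengthens this tendency, enabling semantic-aware editing of facial textures. In addition, to address data scarcity in multi-style texture reconstruction, we build CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains. To our knowledge, OMGTex is the first geometry-free inference framework to achieve robust, style-consistent, and editable facial texture reconstruction across domains. Our method achieves state-of-the-art performance on multiple facial texture benchmarks.

English abstract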

We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously. Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Second, we leverage the inherent semantic distribution capability of diffusion models and design a novel training paradigm to enhance this tendency, enabling semantic-aware editing of facial texture. Furthermore, to address the data scarcity in multi-style texture reconstruction, we construct CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains. To the best of our knowledge, OMGTex is the first geometry-free inference framework that achieves robust, style-consistent, and editable facial texture reconstruction across diverse domains. Our method achieves state-of-the-art performance on multiple facial texture benchmarks.

2605.25775 2026-05-26 cs.CV Version update

DRFusion: Drift-Resilient Temporally Consistent Infrared-Visible Video Fusion

Xingyuan Li, Haoyuan Xu, Shulin Li, Xiang Chen, Zhiying Jiang, Jinyuan Liu

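Affiliations: College of Computer Science Technology, Zhejiang University, Hangzhou, China; School of Software Technology & DUT-RU International School of ISE, Dalian University of Technology, Dalian, China; College of Information Science Technology, Dalian Maritime University, Dalian, China

AI summary: Proposes a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation, achieves temporal consistency through Stabilized History Guidance and Soft Temporal Anchoring, and adopts a Decoupled Structure-Motion Adaptation strategy, reaching state-of-the-art fusion quality and temporal stability.

Comments: 11 pages, 7 figures, 4 tables

Details
AI abstract (translated from Chinese)

Infrared and visible video fusion is essential for comprehensive perception in dynamic scenes, but maintaining temporal consistency remains a hard challenge. Conventional methods that rely on optical flow often suffer from geometric rigidity and ghosting artifacts. In addition, standard diffusion-based fusion models usually run frame by frame; when extended to an autoregressive setting they lack intrinsic temporal constraints and are prone to severe error accumulation and drift, in which small artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring, which recast temporal consistency as spectral filtering and implicitly aggregate motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints through two-stage training and latent refinement. Extensive experiments show that our method achieves state-of-the-art performance in both fusion quality and temporal stability.

English abstract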

Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusion-based fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.

2605.25765 2026-05-26 cs.CV cs.AI cs.LG Version update

Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim

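Affiliations: CSE, POSTECH; GSAI, POSTECH

AI summary: Proposes PURE, which builds forget and retain bases in the cross-attention activation space and edits weights with a linear projection, effectively erasing target concepts while preserving retained concepts.

Details
AI abstract (translated from Chinese)

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, so paraphrased prompts that evoke the concept without naming it can bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, whereas cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Based on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE substantially reduces target leakage under paraphrased and adversarial prompts while keeping retained concepts close to the unedited model, achieving the best overall forget-retain trade-off among the evaluated methods.

English abstract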

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, and paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.
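
The idea of editing weights with a single linear projector can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the authors' procedure: it assumes the forget basis comes from orthonormalizing a few captured activation vectors and that the edit removes those directions from the output of a key or value projection. The retain basis is not modeled because the abstract does not say how it enters the projector, and the names `orthonormal_basis` and `project_out` are hypothetical.

```python
import math


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return orthonormal vectors spanning the input vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            dot = sum(a * b for a, b in zip(w, u))
            w = [a - dot * b for a, b in zip(w, u)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([a / norm for a in w])
    return basis


def project_out(weight, forget_basis):
    """Return P @ weight with P = I - sum(u u^T) over the orthonormal basis.

    `weight` is a d_out x d_in matrix (list of rows); each basis vector has
    length d_out, so the edit removes forget directions from the outputs.
    """
    d_out, d_in = len(weight), len(weight[0])
    edited = [row[:] for row in weight]
    for u in forget_basis:
        # coeff[j] is the component of output column j along direction u
        coeff = [sum(u[i] * edited[i][j] for i in range(d_out)) for j in range(d_in)]
        for i in range(d_out):
            for j in range(d_in):
                edited[i][j] -= u[i] * coeff[j]
    return edited


if __name__ == "__main__":
    # Toy 3x2 projection whose third output dimension plays the "forget" role.
    toy_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    toy_basis = orthonormal_basis([[0.0, 0.0, 2.0]])
    print(project_out(toy_weight, toy_basis))
```

Because the edit is a fixed matrix product applied once to the weights, it adds no inference-time cost, which matches the motivation given for closed-form methods.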

2605.25764 2026-05-26 cs.CV cs.AI Version update

Benchmarking Pathology Foundation Models for Spatial Domain Understanding

Bokai Zhao, Yiyang Zhang, Yuanchi Zhu, Hanqing Chao, Long Bai, Tai Ma, Minfeng Xu, Ming Song, Tianzi Jiang

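Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Brainnetome Center, Institute of Automation, Chinese Academy of Sciences; Beijing Key Laboratory of Brainnetome and Brain-Computer Interface, Institute of Automation, Chinese Academy of Sciences; DAMO Academy, Alibaba Group; ShanghaiTech University

AI summary: Proposes SpaPath-Bench, a benchmark that uses spatial domain identification to evaluate how well pathology foundation model representations distinguish tissue regions and capture spatial relationships.

Comments: MICCAI 2026

Details
AI abstract (translated from Chinese)

Pathology foundation models (PFMs) have become a core approach for learning transferable representations from whole slide images (WSIs) and are usually benchmarked on downstream clinical endpoints. Such task-level evaluation is indispensable, but it offers limited insight into what the representations themselves encode, in particular whether PFM embeddings can distinguish meaningful tissue regions and capture their spatial relationships. We propose SpaPath-Bench, a representation-level benchmark designed to diagnose spatial representation capability in PFMs. SpaPath-Bench formulates spatial domain identification (SDI) on paired WSI and spatial transcriptomics (ST) data as a diagnostic task. It curates 42 public paired WSI and ST slides, supports large-scale evaluation across 19 encoders and 7 SDI methods, and measures partition quality with three complementary criteria: unsupervised spatial coherence, agreement with a transcriptomics reference, and agreement with an expert reference. Across 83K runs, SpaPath-Bench shows that different pretraining paradigms capture different aspects of tissue spatial architecture and provides practical guidance for building the next generation of spatially aware computational pathology models. Code and data pipelines are publicly available at https://bokai-zhao.github.io/SpaPath-benchboard/.

English abstract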

Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole slide images (WSIs), and they are typically benchmarked through downstream clinical endpoints. While such task-level evaluations are indispensable, they offer limited insight into what the representations themselves encode, particularly whether PFM embeddings can distinguish meaningful tissue regions and capture their spatial relationships. We present SpaPath-Bench, a representation-level benchmark designed to diagnose spatial representation capability in PFMs. SpaPath-Bench formulates spatial domain identification (SDI) on paired whole slide image and spatial transcriptomics (ST) data as a diagnostic task. It curates 42 public paired WSI and ST slides, enables large-scale evaluation across 19 encoders and seven SDI methods, and measures partition quality using three complementary criteria: unsupervised spatial coherence, transcriptomics-referenced agreement, and expert-referenced agreement. Across 83K runs, SpaPath-Bench reveals that different pretraining paradigms capture distinct aspects of tissue spatial architecture, and it provides practical guidance for building the next generation of spatially aware computational pathology models. Code and data pipelines are publicly available at https://bokai-zhao.github.io/SpaPath-benchboard/.

2605.25759 2026-05-26 cs.CV Version update

Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

Bao Li, Yuliang Xiu, Zhen Liu

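Affiliations: Westlake University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the ASAP framework, which builds controlled preference pairs with a localized degradation mechanism and combines them with a localized, margin-bounded DPO variant to reduce anatomical errors while preserving overall image quality.

Details
AI abstract (translated from Chinese)

Large-scale text-to-image foundation models have achieved remarkable visual realism, but generating human images with correct anatomy remains challenging. Existing methods enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and, because of confounding factors such as lighting, pose, and background, often provide ambiguous optimization signals. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose Alignment via Synthetic Anatomical Preference (ASAP), a framework that builds controlled preference pairs by applying a localized degradation mechanism to high-fidelity human images. The mechanism acts as a controlled experiment on an image, introducing explicit anatomical errors in targeted regions while preserving the rest of the content. Using it, we create the Human Anatomical Preference (HAP) dataset, with more than 10K curated pairs, for effective anatomical alignment of text-to-image human image generation models. To better exploit the locality of these controlled pairs, we introduce a localized, margin-bounded variant of DPO that prioritizes optimization in the targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments show that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.

English abstract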

Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose the framework of Alignment via Synthetic Anatomical Preference (ASAP), which constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism performs a controlled experiment on images by introducing explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset with over 10K curated pairs for effective anatomical alignment of text-to-image human image generative models. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.

2605.25751 2026-05-26 cs.CV Version update

SplitAvatar: One-shot Head Avatar with Autoregressive Gaussian Splitting

Hongzhe Liao, Chuhua Xian, Hongmin Cai, Haiyang Liu, Fa-Ting Hong

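Affiliations: South China University of Technology; University of Tokyo; The Hong Kong University of Science and Technology

AI summary: Proposes a method based on autoregressive Gaussian splitting for reconstructing an animatable head avatar from a single image; a graph splitting network progressively generates Gaussians, addressing the mismatch in Gaussian counts and the lack of fine-grained detail.

Details
AI abstract (translated from Chinese)

3D Gaussian Splatting (3DGS) uses anisotropic Gaussians to provide an efficient method for high-quality scene reconstruction. Recently, 3DGS-based methods have significantly improved the rendering quality of human avatars while achieving real-time performance. However, existing methods suffer from a mismatch in the number of Gaussians generated by image-based and 3DMM-based approaches, and this discrepancy leaves reconstructed expressions without fine-grained detail. This paper proposes a new method for reconstructing an animatable head avatar from a single image. We propose a graph splitting network that uses an autoregressive architecture to generate Gaussians progressively from coarse to fine. To resolve the graph inconsistency caused by split Gaussians, we use a mesh topology extension method that aligns the GNN's connectivity with the increased Gaussian count. In addition, we introduce a novel density control method that includes a gating mechanism, which generates soft masks for the Gaussians and prevents over-densification after the splitting operation. This allows dynamic control of Gaussian density across different facial regions. For smooth and fast training, we adopt a delayed filtering strategy that avoids recomputing the graph topology during training. Experimental results show that our autoregressive structure effectively improves expression representation by progressively splitting Gaussians. This process, enabled by GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.

English abstract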

3D Gaussian Splatting (3DGS) provides an efficient method for high-quality scene reconstruction using anisotropic Gaussians. Recently, 3DGS-based methods have significantly improved the rendering quality of human avatars while enabling real-time performance. However, existing methods suffer from a magnitude mismatch in the number of Gaussians generated by image-based and 3DMM-based approaches. This discrepancy results in reconstructed expressions that lack fine-grained detail. In this paper, we introduce a novel method for reconstructing an animatable head avatar from a single image. We propose a Graph splitting network to progressively generate Gaussians from coarse to fine using an autoregressive architecture. To address the graph inconsistency caused by split Gaussians, we employ a mesh topology extension method to align the GNN's connectivity with the increased Gaussian count. Furthermore, we introduce a novel density control method that includes a gating mechanism that generates soft masks for Gaussians, preventing over-densification after the splitting operation. This allows for dynamic control over Gaussian density across different facial regions. For smooth and rapid training, we employ a delayed filtering strategy to avoid re-computing the graph topology during training. Experimental results demonstrate that our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians. This process, enabled by the GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.

2605.25737 2026-05-26 cs.CV Version update

SFR-Net: Learning Scale-Frustum Representations for Ultra-Wide Area Remote Sensing Image Segmentation

Chuyu Zhong, Keyan Chen, Qinzhe Yang, Bowen Chen, Zhengxia Zou, Zhenwei Shi

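Affiliations: Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University; Key Laboratory of Spacecraft Design Optimization and Dynamic Simulation Technologies, Ministry of Education, Beihang University; Shen Yuan Honors College, Beihang University; College of Computing and Data Science, Nanyang Technological University

AI summary: To handle large scale differences among ground objects and long-range contextual semantic continuity in ultra-wide area remote sensing images, proposes the Scale-Frustum Representation Network (SFR-Net), which builds scale-frustum representations and a cascaded cross-scale fusion mechanism and improves mIoU by 1.72% and 4.29% on GID and FBPS, respectively.

Details
AI abstract (translated from Chinese)

Pixel count and geographical coverage are two key characteristics of remote sensing images. Existing remote sensing image segmentation methods usually focus on images with either a small pixel count or a large pixel count but limited geographical coverage. This paper introduces a new segmentation task for ultra-wide area (UWA) remote sensing images, which combine a large pixel count with extremely wide geographical coverage. The core challenge of UWA segmentation is to handle ground objects whose scales vary significantly while maintaining long-range contextual semantic continuity. To address these challenges, we propose the Scale-Frustum Representation Network (SFR-Net). Inspired by the viewing frustums of remote sensing images captured at different altitudes, we build scale-frustum representations that model ground objects and contextual features at different scales in a unified way. In addition, we design a cascaded cross-scale fusion mechanism that integrates these representations effectively, strengthening local semantic understanding while ensuring long-range contextual continuity. Experimental results on GID and FBPS show that SFR-Net achieves state-of-the-art performance, improving mIoU by 1.72% and 4.29%, respectively, over the strongest competing methods. Moreover, the proposed scale-frustum representations can be integrated into generic segmentation networks to improve both segmentation accuracy and convergence speed. The implementation code will be made public at https://github.com/ChuyuZhong/SFR-Net.

English abstract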

Pixel count and geographical coverage are two key characteristics of remote sensing images. Existing remote sensing image segmentation methods typically focus on images with either a small pixel count or a large pixel count but limited geographical coverage. In this paper, we introduce a novel segmentation task targeting ultra-wide area (UWA) remote sensing images, characterized by both a large pixel count and extremely wide geographical coverage. The core challenges of UWA segmentation lie in simultaneously handling ground objects with significantly varying scales and maintaining long-range contextual semantic continuity. To address these challenges, we propose the Scale-Frustum Representation Network (SFR-Net). Inspired by the viewing frustums of remote sensing images captured from different altitudes, we construct scale-frustum representations, enabling unified modeling of ground objects and contextual features at different scales. Furthermore, we design a cascaded cross-scale fusion mechanism to effectively integrate these representations, enhancing local semantic understanding while ensuring long-range contextual continuity. Experimental results on GID and FBPS demonstrate that SFR-Net achieves state-of-the-art performance, improving mIoU by 1.72% and 4.29%, respectively, over the strongest competing methods. In addition, the proposed scale-frustum representations can be integrated into generic segmentation networks to improve both segmentation accuracy and convergence speed. The implementation code will be publicly available at https://github.com/ChuyuZhong/SFR-Net.

2605.25730 2026-05-26 cs.CV Version update

DeCoDrift: Stabilizing Decoder Coupling in Closed-Loop Foundation Segmentation

H. M. Shadman Tabib, Md. Shamsuzzoha Bayzid, M Sohel Rahman

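Affiliations: Department of Computer Science and Engineering

AI summary: To counter the error accumulation caused by decoder coupling drift in closed-loop iterative segmentation, proposes DeCoDrift, an inference-time stabilization framework that needs no training or ground-truth supervision and improves attention stability, temporal coherence, and segmentation quality by constraining prompt updates and preserving decoder coupling.

Comments: 18 pages, 5 figures

Details
AI abstract (translated from Chinese)

Foundation segmentation models such as the Segment Anything Model (SAM) are now often used in iterative pipelines in which each predicted mask is fed back as the next prompt. This practice turns segmentation into a closed-loop dynamical process, yet the decoder-level behavior of such systems remains largely unstudied. We show that this feedback loop can trigger a previously overlooked failure mode, decoder coupling drift, in which the mask decoder's cross-attention gradually loses alignment with the target object and errors accumulate over iterations. We study the phenomenon by instrumenting SAM's mask decoder and deriving ground-truth-free measures of prompt-image coupling, attention stability, and temporal consistency. On volumetric electron microscopy data, these decoder-internal signals show that standard iterative prompting systematically degrades attention alignment and temporal coherence compared with oracle-anchored feedback. We then formalize iterative prompting as a discrete-time dynamical system and show how proximal anchoring reduces error amplification in the feedback loop. Based on this analysis, we introduce DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates and preserves decoder coupling across iterations. In extensive experiments, DeCoDrift consistently outperforms standard iterative prompting in attention stability, temporal coherence, and segmentation quality, without retraining or ground-truth supervision. More broadly, our results show that decoder-internal dynamics are not merely diagnostic: they provide actionable signals for stabilizing foundation segmentation models in closed-loop use.

English abstract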

Foundation segmentation models such as Segment Anything Model (SAM) are now routinely used in iterative pipelines, where each predicted mask is fed back as the next prompt. This practice turns segmentation into a closed-loop dynamical process, yet the decoder-level behavior of these systems remains largely unexamined. We show that this feedback loop can induce a previously overlooked failure mode, decoder coupling drift, in which the mask decoder's cross-attention progressively loses alignment with the target object, causing errors to accumulate across iterations. We study this phenomenon by instrumenting SAM's mask decoder and deriving ground-truth-free measures of prompt-image coupling, attention stability, and temporal consistency. On volumetric electron microscopy data, these decoder-internal signals reveal that standard iterative prompting systematically degrades attention alignment and temporal coherence relative to oracle-anchored feedback. We then formalize iterative prompting as a discrete-time dynamical system and show how proximal anchoring reduces error amplification in the feedback loop. Building on this analysis, we introduce DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates and preserves decoder coupling across iterations. Across extensive experiments, DeCoDrift consistently improves attention stability, temporal coherence, and segmentation quality over standard iterative prompting, without retraining or ground-truth supervision. More broadly, our results show that decoder-internal dynamics are not merely diagnostic: they provide actionable signals for stabilizing foundation segmentation models in closed-loop use.
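
One simple way to picture constrained prompt updates is a proximal step that pulls each new prompt toward an anchor. The Python sketch below is an illustration under stated assumptions rather than the paper's algorithm: the abstract does not give the objective, so the sketch uses the standard closed form of the minimizer of `||p - candidate||^2 + lam * ||p - anchor||^2`, and it anchors on the previous prompt. The names `proximal_update`, `run_loop`, `propose`, and `lam` are hypothetical.

```python
def proximal_update(candidate, anchor, lam):
    """Closed-form minimizer of ||p - candidate||^2 + lam * ||p - anchor||^2."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return [(c + lam * a) / (1.0 + lam) for c, a in zip(candidate, anchor)]


def run_loop(initial_prompt, propose, steps, lam):
    """Iterate prompt -> candidate -> anchored prompt for a fixed number of steps.

    `propose` stands in for "run the segmenter and derive the next prompt
    from the predicted mask"; here it is any function from prompt to prompt.
    """
    prompt = list(initial_prompt)
    history = [prompt]
    for _ in range(steps):
        candidate = propose(prompt)
        # Assumption: the anchor is the previous prompt; lam = 0 recovers
        # plain iterative prompting with no constraint.
        prompt = proximal_update(candidate, prompt, lam)
        history.append(prompt)
    return history


if __name__ == "__main__":
    # Toy drift model: every iteration the candidate slides one unit away.
    def drift(p):
        return [x + 1.0 for x in p]

    print(run_loop([0.0], drift, steps=2, lam=0.0))  # unconstrained feedback
    print(run_loop([0.0], drift, steps=2, lam=1.0))  # anchored feedback
```

In this toy model the anchored loop moves half as far per iteration as the unconstrained one, which illustrates how a larger `lam` damps drift at the price of slower adaptation to legitimate changes in the target.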

2605.25725 2026-05-26 cs.CV Version update

TriDP-PTM: a three-stage distortion-perception tradeoff guides the pre-training model for radar cardiac sensing

Jinye Li, Aidong Men, Yang Liu, Qingchao Chen

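Affiliations: National Institute of Health Data Science, Peking University; Institute of Medical Technology, Peking University; Beijing University of Posts and Telecommunications; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Wangxuan Institute of Computer Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University

AI summary: Proposes a Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM) that uses an indirect radar-to-ECG-to-task path and a composite loss function, with the best downstream clinical accuracy obtained in the coopetitive stage.

Details
AI abstract (translated from Chinese)

Cardiovascular diseases (CVDs) remain a leading cause of death worldwide and call for continuous, accurate, non-invasive cardiac monitoring. Non-contact radar methods show great potential, but they usually adopt a single "distortion-driven" or "perception-driven" paradigm and often face a trade-off between "low distortion but weak semantic information" and "high perceptual fidelity but poor interpretability." To address this, we propose a Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM), a radar-based multi-scale fusion dual-path framework that systematically compares a "direct radar-to-task" path with an "indirect radar-to-ECG-to-task" path. By integrating an ECG generator with a feature discriminator to form a composite loss function, our method effectively brings medical priors such as ECG morphology and rhythm into downstream tasks. Through empirical analysis, we show that the trade-off appears in three distinct stages (positive-sum, coopetitive, and negative-sum) and that the best downstream clinical accuracy usually arises in the coopetitive stage. Extensive experiments on a dataset of 30 subjects in 5 physiological states show that the indirect path consistently outperforms the direct path across tasks, achieving a mean IoU of 0.80 in waveform segmentation, an average classification accuracy of 98.3% over four tasks, and a 56% reduction in MAE for blood pressure regression compared with the strongest baseline. These findings validate our framework and indicate that, in the indirect radar-to-ECG path, weighting the distortion and perception losses appropriately so as to operate in the coopetitive regime is critical for achieving clinically interpretable ECG morphology and strong downstream accuracy in non-contact cardiac monitoring.

English abstract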

Cardiovascular diseases (CVDs) remain a leading cause of death globally, necessitating continuous, accurate non-invasive cardiac monitoring. While non-contact radar-based approaches show great promise, they often employ a single "distortion-driven" or "perception-driven" paradigm, frequently facing a trade-off between "low distortion but weak semantic information" and "high perceptual fidelity but poor interpretability." To address this, we propose a Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM), a radar-based multi-scale fusion dual-path framework that systematically compares the "direct radar-to-task" path against an "indirect radar-to-ECG-to-task" path. By integrating an ECG generator with a feature discriminator to form a composite loss function, our approach effectively incorporates medical priors, such as ECG morphology and rhythm, into downstream tasks. Through empirical analysis, we reveal that this trade-off manifests in three distinct phases (Positive-Sum, Coopetitive, and Negative-Sum), showing that optimal downstream clinical accuracy typically emerges in the coopetitive stage. Extensive experiments on a dataset involving 30 subjects across 5 physiological states reveal that the indirect path consistently outperforms the direct path in diverse tasks, achieving 0.80 mean IoU in waveform segmentation, 98.3% average classification accuracy across four tasks, and a 56% MAE reduction in blood pressure regression compared to the strongest baselines. These findings validate our framework and indicate that, within the indirect radar-to-ECG pathway, appropriately weighting distortion and perception losses to operate in the coopetitive regime is critical for achieving both clinically interpretable ECG morphology and strong downstream accuracy in non-contact cardiac monitoring.
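
The weighting of distortion against perception described above can be sketched as a two-term loss. This Python example is a hedged illustration only: the abstract does not give the loss definitions, so mean squared error on the waveform stands in for the distortion term, mean squared error between feature vectors stands in for the discriminator-based perception term, and `composite_loss` and `perception_weight` are hypothetical names.

```python
def mean_squared_error(pred, target):
    """Average squared difference between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def composite_loss(pred_ecg, ref_ecg, pred_features, ref_features, perception_weight):
    """Return (total, distortion, perception) for one generated ECG segment.

    distortion: sample-wise error between generated and reference ECG.
    perception: stand-in for a feature-discriminator term, computed here as
                the distance between feature vectors (an assumption).
    """
    distortion = mean_squared_error(pred_ecg, ref_ecg)
    perception = mean_squared_error(pred_features, ref_features)
    total = distortion + perception_weight * perception
    return total, distortion, perception


if __name__ == "__main__":
    # Sweep the weight to see how the balance between the two terms shifts.
    for weight in (0.0, 0.5, 2.0):
        print(weight, composite_loss([1.0, 2.0], [1.0, 4.0], [0.0], [1.0], weight))
```

Sweeping `perception_weight` while tracking downstream accuracy is one way to locate the regime in which the two terms cooperate rather than compete.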

2605.25708 2026-05-26 cs.CV cs.CL cs.ET Version update

CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

Sriram Mandalika

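Affiliations: Hasso Plattner Institute

AI summary: For multi-domain task-incremental learning, proposes a cross-modal adaptive prompting method that uses CLIP's text embedding space for task routing, confidence estimation, and encoder adaptation, surpassing prior work on the MTIL benchmark.

Details
AI abstract (translated from Chinese)

Multi-domain task-incremental learning requires a model to acquire knowledge sequentially across visually diverse domains without forgetting earlier tasks and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, but existing methods rely entirely on visual features for task routing, confidence estimation, and encoder adaptation and leave CLIP's cross-modal text embedding space unused. We fill this gap with three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing that is robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends the per-layer Gumbel gates to the text encoder, conditioned on batch image features, to preserve cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark, which covers 11 datasets and 1201 classes, our method reaches 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I with only 2.5M trainable parameters and no external data, surpassing the previous best method by 5.0, 3.7, and 3.0 percentage points, respectively.

English abstract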

Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
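
Text-space task routing, as described in the abstract, scores an image feature against frozen text prototypes by cosine similarity. The Python sketch below shows that idea on toy vectors. It makes assumptions the abstract does not confirm: each task is scored by its best-matching class prototype, the embeddings are plain lists rather than real CLIP outputs, and the task names and the function `route_task` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route_task(image_feature, task_text_prototypes):
    """Pick the task whose text prototypes best match the image feature.

    task_text_prototypes maps a task name to a list of class text embeddings.
    Assumption: a task's score is its maximum class similarity.
    """
    scores = {
        task: max(cosine(image_feature, proto) for proto in prototypes)
        for task, prototypes in task_text_prototypes.items()
    }
    best_task = max(scores, key=scores.get)
    return best_task, scores


if __name__ == "__main__":
    prototypes = {
        "aircraft": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
        "flowers": [[0.0, 1.0, 0.0]],
    }
    print(route_task([0.8, 0.2, 0.0], prototypes))
```

Because the prototypes come from a frozen text encoder and no routing parameters are learned, adding a task only adds entries to the dictionary, which is consistent with the order-independence and zero parameter cost claimed for the routing step.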

2605.25706 2026-05-26 cs.CV 版本更新

Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker

迈向开放世界的指代表达理解:基准与无需训练的多任务一致性检查器

Zongjian Wu, Lei Zhang

发表机构 * School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, China(微电子与通信工程学院,重庆大学,重庆,中国)

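AI summary: To address existing referring expression comprehension (REC) benchmarks being limited to simple scenarios and a single-target assumption, proposes the OpenRef benchmark, covering diverse visual scenarios, variable target counts, and rich vocabulary types, and introduces a training-free Multi-task Consistency Checker (MCC) to improve model performance in open-world settings.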

Comments 17 pages, 7 figures. Project Page: https://zongjianwu.github.io/openref

AI中文摘要

指代表达理解(REC)旨在根据给定表达在图像中定位目标对象。尽管视觉语言模型的最新进展已使REC任务取得显著改进,但当前的REC基准通常局限于简单场景,并假设每个表达映射到唯一对象。这些限制阻碍了REC模型在开放世界环境中的部署。为填补这一空白,我们引入了OpenRef,一个针对复杂视觉和语言场景的新REC基准。OpenRef具有三个关键进展:1)多样化的视觉场景:涵盖多种视觉领域,包括地面视角、无人机视角、黑暗场景和恶劣天气条件;2)可变目标数量:通过多目标和零目标样本打破单目标限制;3)丰富的词汇类型:包含专有名词、多义词和序数词,以适应更广泛的表达需求。此外,由于传统指标不足以应对开放世界设置,我们利用F1衡量定位准确性,并提出N3R(负相对拒绝可靠性)来评估对否定表达的相对拒绝可靠性。最后,我们引入了多任务一致性检查器(MCC),这是一种无需训练但即插即用的策略,通过强制执行一致性自我验证,一键提升模型性能。大量实验表明,本工作显著提升了现有REC模型在复杂场景中的性能,为开放世界REC铺平了道路。项目页面:https://zongjianwu.github.io/openref

英文摘要

Referring expression comprehension (REC) aims to localize a target object within an image based on a given expression. Although recent advances in vision-language models have led to substantial improvements in REC tasks, current REC benchmarks are often limited to simple scenarios and assume that each expression maps to a unique object. These limitations hinder the deployment of REC models in open-world environments. To fill this gap, we introduce OpenRef, a new benchmark for REC in complex visual and linguistic scenarios. OpenRef features three key advancements: 1) Diverse visual scenarios: spanning multiple visual domains, including ground views, drone views, dark scenes and adverse weather conditions; 2) Variable target counts: breaking the single-target limitation with multi-target and none-target samples; 3) Rich vocabulary types: incorporating proper nouns, polysemous words and ordinal terms to fit a wider range of expression needs. Furthermore, as traditional metrics are insufficient for open-world settings, we leverage F1 to measure grounding accuracy and propose N3R (Negative Relative Rejection Reliability) to assess relative rejection reliability against negative expressions. Finally, we introduce the Multi-task Consistency Checker (MCC), a training-free, plug-and-play strategy that enhances model performance with one click by enforcing consistency self-verification. Extensive experiments demonstrate that this work significantly advances the performance of existing REC models in complex scenarios, paving the way for open-world REC. Project page: https://zongjianwu.github.io/openref
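
To make the role of F1 under variable target counts concrete, here is a rough sketch of a box-level F1 for one expression. The abstract does not spell out the matching rule, so the greedy IoU matching, the 0.5 threshold, and the convention of scoring a correct empty prediction as 1.0 are all assumptions made for illustration; `grounding_f1` is a hypothetical helper name.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_f1(pred_boxes, gt_boxes, iou_threshold=0.5):
    """F1 for one expression with zero, one, or many ground-truth targets."""
    if not pred_boxes and not gt_boxes:
        return 1.0  # assumed convention: correctly predicting "no target"
    unmatched = list(gt_boxes)
    true_positives = 0
    for pred in pred_boxes:
        # Greedily match each prediction to its best remaining ground truth.
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_boxes)
    recall = true_positives / len(gt_boxes)
    return 2 * precision * recall / (precision + recall)

preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
score = grounding_f1(preds, gts)
print(score)
```

Unlike single-box accuracy, this score penalizes both spurious boxes (precision) and missed targets (recall), which is why an F1-style metric suits multi-target and no-target samples.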

2605.25663 2026-05-26 cs.LG cs.CV 版本更新

Opportunistic Target Selection: Early Directional Commitment for Query-Efficient Black-Box Adversarial Attacks

机会目标选择:面向查询高效黑盒对抗攻击的早期定向承诺

Florent Tariolle, Florian Yger

发表机构 * INSA Rouen Normandy(鲁昂-诺曼底国立应用科学学院) LITIS

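AI summary: Proposes OTS, a lightweight method that switches an untargeted attack to a targeted one early on and locks onto the currently leading non-true class, reducing query counts and raising success rates.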

Comments 13 pages, 10 figures, 3 tables; code available at https://github.com/Tariolle/opportunistic-target-selection

AI中文摘要

仅最小化真实置信度的黑盒对抗攻击存在类别漂移问题:扰动在特征空间中游荡而不承诺特定对抗类别,浪费查询在分散、无方向的进展上。我们引入机会目标选择(OTS),一种轻量级包装器,在攻击轨迹早期将无目标攻击切换为有目标目标,锁定当前领先的非真实类别。OTS不需要对底层攻击进行架构修改,不需要梯度访问,也不需要先验的目标类别知识。我们在五个标准ImageNet分类器(4500次运行)上对三种基于分数的攻击(SimBA、使用交叉熵损失的Square Attack和Bandits)验证了OTS。在随机搜索攻击上,OTS紧密跟踪oracle性能,在ResNet-50上成功率提升高达27个百分点,审查均值迭代次数相对减少43%。在梯度估计攻击(Bandits)和边际损失攻击上,OTS是冗余的,这一负面结果强化了我们将OTS解释为边际损失替代的观点。在对抗训练模型上,双峰难度分布消除了目标帮助的机制。

英文摘要

Black-box adversarial attacks that minimize only the ground-truth confidence suffer from class drift: perturbations wander through the feature space without committing to a specific adversarial class, wasting queries on diffuse, undirected progress. We introduce Opportunistic Target Selection (OTS), a lightweight wrapper that switches an untargeted attack to a targeted objective early in its trajectory, locking onto whichever non-true class currently leads. OTS requires no architectural modification to the underlying attack, no gradient access, and no a priori target-class knowledge. We validate OTS on three score-based attacks (SimBA, Square Attack with cross-entropy loss, and Bandits) across five standard ImageNet classifiers (4,500 runs). On random-search attacks, OTS closely tracks oracle performance, with gains up to +27 pp in success rate and 43% relative reduction in censored-mean iterations on ResNet-50. On gradient-estimation attacks (Bandits) and attacks with margin loss, OTS is redundant, a negative result that reinforces our interpretation of OTS as a margin-loss surrogate. On adversarially-trained models, a bimodal difficulty distribution eliminates the regime where targeting helps.
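
The wrapper idea can be illustrated with a small sketch. Everything here is an assumption for illustration: the class name `OpportunisticTarget`, the fixed `switch_step` criterion, and the use of negative target probability as the targeted loss are hypothetical stand-ins, since the abstract only states that the switch happens early and locks onto the leading non-true class.

```python
def pick_target(probs, true_class):
    """Index of the highest-scoring class other than the true one."""
    candidates = [c for c in range(len(probs)) if c != true_class]
    return max(candidates, key=lambda c: probs[c])

class OpportunisticTarget:
    """Wraps a score-based attack loss and commits to a target class once."""

    def __init__(self, true_class, switch_step):
        self.true_class = true_class
        self.switch_step = switch_step  # hypothetical fixed switch point
        self.target = None

    def loss(self, step, probs):
        """Loss the underlying attack should minimize at this step."""
        if self.target is None and step >= self.switch_step:
            # Lock onto whichever non-true class currently leads.
            self.target = pick_target(probs, self.true_class)
        if self.target is None:
            return probs[self.true_class]  # untargeted: lower true confidence
        return -probs[self.target]  # targeted: raise the locked class

ots = OpportunisticTarget(true_class=0, switch_step=2)
early = ots.loss(0, [0.7, 0.2, 0.1])
switched = ots.loss(2, [0.5, 0.1, 0.4])
later = ots.loss(3, [0.3, 0.5, 0.2])
print(early, switched, later, ots.target)
```

Note that once the target is locked it does not change, even if another class overtakes it later (as class 1 does in the last call); this commitment is what removes the class drift described in the abstract.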

2605.25661 2026-05-26 cs.CV 版本更新

DRM: Diffusion-based Reward Model With Step-wise Guidance

DRM: 基于扩散的奖励模型与逐步引导

Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu

发表机构 * Peking University(北京大学) WeChat Vision, Tencent Inc.(腾讯微信视觉实验室)

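AI summary: Proposes a Diffusion-based Reward Model (DRM) that uses a pre-trained diffusion model as the evaluative backbone and applies its step-wise evaluation ability to improve reinforcement-learning alignment and inference-time sampling, raising image generation quality.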

AI中文摘要

当前主流将扩散模型与人类偏好对齐的方法通常采用基于VLM的奖励模型。然而,这些为语义对齐预训练的奖励模型难以捕捉关键的感知质量,如美学、构图和视觉和谐。在这项工作中,我们认为一个能够高保真生成的模型必须对这些视觉属性有深刻理解。基于这一见解,我们引入了基于扩散的奖励模型(DRM),这是一种新颖的范式,使用预训练的扩散模型作为强大的评估骨干。DRM的一个关键优势是其独特的能力,不仅可以评估最终图像,还可以评估生成过程中任何阶段的噪声中间潜变量。我们以两种方式利用这种逐步评估能力。首先,我们提出了逐步GRPO,一种强化学习算法,提供密集的每步奖励,以解决GRPO算法中不精确的信用分配问题,从而实现更稳定和有效的对齐。其次,我们引入了逐步采样,一种新颖的推理策略,使用DRM作为动态引导,在每一步评估多个生成路径,引导过程朝向更高质量的结果。大量实验证实,我们的方法显著提升了生成图像的最终质量。代码:https://github.com/jjaxonx/DRM。

英文摘要

Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture essential perceptual qualities such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that uses the pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in the GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.
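
The step-wise sampling strategy amounts to a greedy search over candidate paths guided by a per-step score. The sketch below shows only that control flow, under loud assumptions: `stepwise_sample`, `propose_all`, and `score` are hypothetical names, and the integer "latents" and distance-based score are toy stand-ins for a diffusion sampler and the DRM.

```python
def stepwise_sample(initial, propose_all, score, num_steps):
    """Greedy step-wise selection: keep the best-scored candidate at each step.

    propose_all(state, step) returns candidate next states (a stand-in for
    running several denoising branches); score(candidate, step) plays the role
    of a reward model that can rate intermediate states.
    """
    state = initial
    for step in range(num_steps):
        candidates = propose_all(state, step)
        state = max(candidates, key=lambda c: score(c, step))
    return state

# Toy problem: states are integers and the "reward" prefers values near 3.
final_state = stepwise_sample(
    initial=0,
    propose_all=lambda s, step: [s - 1, s, s + 1],
    score=lambda c, step: -abs(c - 3),
    num_steps=5,
)
print(final_state)
```

The key requirement this illustrates is that the scorer must accept intermediate states, not just final outputs, which is the capability the abstract attributes to a diffusion-based reward model.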

2605.25659 2026-05-26 cs.CV 版本更新

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

StreamChar: 基于解耦编排的长时程流式角色音频-视频生成

Linrui Tian, Qi Wang, Bang Zhang

发表机构 * Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团)

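AI summary: Proposes StreamChar, a streaming framework that decouples long-horizon orchestration from short-window denoising via an LLM orchestrator and a joint audio-video DiT, achieving real-time, stable, high-quality character animation generation.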

AI中文摘要

实时流式联合音频-视频生成用于角色动画需要生成器说出请求的文本、跨块保持视觉身份并在严格的播放预算内运行。这些要求难以同时满足:逐块自回归生成会累积文本-音频错位和视觉漂移,而低延迟所需的少步蒸馏通常会降低空间多样性和时间质量。我们提出StreamChar,一种将长时程编排与短窗音频-视频去噪分离的流式框架。基于LLM的编排器使用文本和历史上下文生成帧对齐的音频条件,联合音频-视频DiT在参考和运动帧条件下执行局部双向去噪。为高效部署,我们使用两阶段蒸馏流程,首先压缩采样器,然后在在线块展开下微调学生模型。进度感知指针在展开训练期间将部分文本与生成的音频对齐,而汇块记忆提供持久视觉锚点以减少长时程漂移。在短片段和长时程协议上的实验表明,StreamChar在单个H100 GPU上实时运行,与最近的联合和音频驱动基线相比,在文本保真度、音视频同步、视觉质量和流式稳定性方面提供了有利的系统级权衡。

英文摘要

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.

2605.25657 2026-05-26 cs.CV 版本更新

ARMA-C3: A Contrastive ARMA Convolutional Framework for Unsupervised and Semi-supervised Classification

ARMA-C3: 一种用于无监督和半监督分类的对比ARMA卷积框架

VSS Tejaswi Abburi, Saurabh J. Shigwan, Nitin Kumar

发表机构 * VSS Tejaswi Abburi Saurabh J. Shigwan Nitin Kumar

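AI summary: Proposes the ARMA-C3 framework, which uses contrastive learning and graph-cut regularization to learn discriminative representations of graph nodes in unsupervised and semi-supervised settings, performing strongly on multiple medical imaging datasets.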

AI中文摘要

在生物医学和神经退行性疾病中,由于标记数据的稀缺和成像模式的复杂性,准确和早期疾病识别仍然具有挑战性。为了解决这些问题,我们引入了ARMA-C3,一个统一的无监督和半监督图学习框架,用于基于对比学习和图割正则化的节点分类,以学习结构上有意义且具有判别性的表示。通过将样本或图像建模为图节点并利用样本间关系,所提出的框架捕获了传统机器学习方法通常忽略的受试者级别依赖关系。我们在五个临床相关数据集上进行了广泛的二分类实验:阿尔茨海默病神经影像学倡议(ADNI)、额颞叶痴呆神经影像学(NIFD)数据集以及三个医学影像基准(BreastMNIST、PneumoniaMNIST和一个肝脏超声数据集)。实验结果表明,ARMA-C3在多个评估设置中,特别是在有限监督和严重类别不平衡下,与经典聚类技术、最先进的机器学习模型以及现有的基于图的深度学习方法相比,取得了具有竞争力且通常更优越的性能。所提出的框架进一步展示了在多样化生物医学成像模态中的鲁棒表示学习和强跨模态泛化能力。

英文摘要

In biomedical and neurodegenerative disorders, accurate and early disease identification remains challenging due to the scarcity of labeled data and the complexity of imaging patterns. To address these challenges, we introduce ARMA-C3, a unified unsupervised and semi-supervised graph learning framework for node classification based on contrastive learning and graph-cut regularization to learn structurally meaningful and discriminative representations. By modeling samples or images as graph nodes and exploiting inter-sample relationships, the proposed framework captures subject-level dependencies that conventional machine learning methods typically overlook. We conduct extensive binary classification experiments across five clinically relevant datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Neuroimaging in Frontotemporal Dementia (NIFD) dataset, and three medical imaging benchmarks (BreastMNIST, PneumoniaMNIST, and a liver ultrasound dataset). Experimental results demonstrate that ARMA-C3 achieves competitive and frequently superior performance compared to classical clustering techniques, state-of-the-art machine learning models, and existing graph-based deep learning approaches across multiple evaluation settings, particularly under limited supervision and severe class imbalance. The proposed framework further demonstrates robust representation learning and strong cross-modal generalization across diverse biomedical imaging modalities.

2605.25656 2026-05-26 cs.CV 版本更新

Event-based Batting Impact Estimation

基于事件的击球冲击估计

Ryotaro Ishida, Wataru Ikeda, Ryosei Hara, Akemi Kobayashi, Toshitaka Kimura, Mariko Isogawa

发表机构 * Keio University(庆应大学) NTT Communication Science Laboratories(NTT通信科学实验室)

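AI summary: Exploits the high temporal resolution and high dynamic range of event cameras to estimate the moment of batting impact from the weighted centroid distance between the detected ball and bat, and introduces a mask refinement network to address the domain gap between event frames and RGB images, reducing mean absolute error by approximately 63% under low-light and severe-occlusion conditions.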

Comments Accepted to IEEE International Conference on Image Processing (ICIP) 2026. (c) 2026 IEEE. Personal use of this material is permitted

AI中文摘要

精确估计击球冲击时刻对于理解快速感觉运动控制至关重要。然而,由于时间分辨率不足和运动模糊,RGB相机难以完成此任务。同样,惯性测量单元(IMU)由于传感器侵入性和有限的时间精度,在实际比赛中不实用。为克服这些限制,我们提出了一种新颖框架,利用事件相机(具有微秒级分辨率和高动态范围)基于检测到的球与球棒之间的加权质心距离来估计冲击时刻。为解决事件帧与RGB图像之间的域差异(这会降低分割精度),我们生成高密度事件帧。然后,我们引入一个掩膜细化网络,利用这些帧和双向掩膜信息,并通过一种新颖的损失函数进行优化。在真实数据集上的实验表明,我们的方法在具有挑战性的条件下(包括低光环境和严重遮挡)实现了卓越的准确性,将平均绝对误差降低了约63%,优于基线方法。

英文摘要

Estimating the precise timing of batting impact is crucial for understanding rapid sensorimotor control. However, this task is challenging for RGB cameras due to insufficient temporal resolution and motion blur. Similarly, Inertial Measurement Units (IMUs) are impractical for actual matches due to sensor intrusiveness and their limited temporal precision. To overcome these limitations, we propose a novel framework leveraging event-based cameras, which offer microsecond resolution and high dynamic range, to estimate impact timing based on the weighted centroid distance between the detected ball and bat. To address the domain gap between event frames and RGB images that degrades segmentation accuracy, we generate high-density event frames. We then introduce a mask refinement network that leverages these frames and bidirectional mask information, optimized using a novel loss function. Experiments on real-world datasets demonstrate that our method achieves superior accuracy under challenging conditions, including low-light environments and severe occlusions, outperforming baselines by reducing the Mean Absolute Error by approximately 63%.
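
A simplified reading of the centroid-distance cue can be sketched as follows. The abstract does not give the exact decision rule, so taking the impact as the timestamp of minimum ball-to-bat centroid distance is an assumption, as are the helper names and the use of per-pixel weights (for example mask confidence or event counts).

```python
import math

def weighted_centroid(weights):
    """Centroid (row, col) of a 2D grid of non-negative weights with a positive sum."""
    total = sum(sum(row) for row in weights)
    row_c = sum(i * w for i, row in enumerate(weights) for w in row) / total
    col_c = sum(j * w for row in weights for j, w in enumerate(row)) / total
    return row_c, col_c

def estimate_impact(timestamps, ball_maps, bat_maps):
    """Return (impact_time, distances) using the minimum centroid distance.

    Assumption: impact is approximated by the frame where the ball and bat
    centroids are closest; the paper's actual rule may differ.
    """
    distances = [
        math.dist(weighted_centroid(ball), weighted_centroid(bat))
        for ball, bat in zip(ball_maps, bat_maps)
    ]
    k = min(range(len(distances)), key=distances.__getitem__)
    return timestamps[k], distances

# Toy 1x5 frames: the ball moves right while the bat stays at column 3.
bat = [[0, 0, 0, 2, 0]]
balls = [[[1, 0, 0, 0, 0]], [[0, 0, 0, 1, 0]], [[0, 0, 0, 0, 1]]]
impact_time, dists = estimate_impact([0.0, 0.5, 1.0], balls, [bat, bat, bat])
print(impact_time, dists)
```

With an event camera the frames can be spaced far more finely than RGB video, so the same argmin resolves the impact moment at a much higher temporal resolution.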

2605.25621 2026-05-26 cs.CV 版本更新

StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

StreamOV: 通过证据引导记忆与响应触发的流式全视频理解

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu

发表机构 * Shanghai Innovation Institute(上海创新研究院) Fudan University(复旦大学) Nanjing University(南京大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Huazhong University of Science and Technology(华中科技大学)

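AI summary: Proposes the StreamOV framework, which uses multimodal evidence-guided long-short term memory and a hidden-state-driven trigger to enable online reasoning and proactive responses in streaming omni-video understanding, and achieves the best performance on the new SOVBench benchmark.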

AI中文摘要

虽然流式全视频理解需要持续感知和主动的实时交互,但这一关键领域仍未被充分探索。当前的全模态方法本质上是为离线场景设计的,由于两个根本缺陷限制了其在流式场景中的适用性。首先,它们缺乏稳健的机制来管理长时间跨度下持续增长的音视频上下文,并且无法在适当时机自主发起响应。其次,现有基准主要局限于离线、单轮问答,无法捕捉连续的多轮流式交互。为弥补这些差距,我们提出了StreamOV,一种新颖的流式全视频理解框架,用于具有有限记忆和主动响应触发的高效在线音视频推理。具体来说,StreamOV引入了多模态证据引导的长短期记忆,在固定预算下将历史音视频上下文压缩为紧凑的信息性证据。它还采用隐状态驱动的触发器来决定何时响应,避免了显式的静音令牌生成和外部路由器。我们还整理了SOVBench,这是首个用于在线、多轮全模态评估的综合基准。大量实验表明,StreamOV在各种流式和全视频基准上取得了最先进的性能,证明了其在在线和离线视频理解中的有效性。

英文摘要

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.

2605.25615 2026-05-26 cs.CV 版本更新

UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition

UAV-OVO:无人机动作识别中的视点外泛化

Yu Xia, Zhengbo Zhang, Shuaihu Zhang, Zhigang Tu

发表机构 * Wuhan University(武汉大学) Singapore University of Technology and Design(新加坡科技与设计大学)

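AI summary: To address the performance drop caused by mismatched training and test viewpoints in UAV action recognition, proposes the UAV-OVO benchmark and the LATER method, achieving viewpoint-robust generalization through view isolation and LoRA-anchored feature re-centering.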

AI中文摘要

无人机动作识别面临标准基准测试常掩盖的部署偏移:从低俯视角拍摄的无人机视频训练的模型可能需要识别来自高俯视角的相同动作类别。虽然动作标签保持不变,但这种偏移改变了身体可见性、运动投影和场景上下文,促使模型依赖视点特定的捷径。我们引入UAV-OVO,一个用于无人机动作识别的视点外泛化基准。UAV-OVO从未校准视频中导出视点分数,使用视点隔离带将低俯视角视频分配给训练和分布内测试集,同时保留高俯视角视频用于分布外测试,并构建按类别分布匹配的ID/OOD测试集,使得性能差异反映视点偏移而非标签不平衡。在代表性视频识别器上,UAV-OVO揭示了显著的ID/OOD差距:拟合低俯视角训练分布良好的模型往往无法迁移到保留的高俯视角,暴露了被整体准确性隐藏的视点捷径。我们进一步提出LATER,即LoRA锚定的测试时重中心化,首先通过低秩适配(LoRA)适配识别器,然后利用学习到的LoRA子空间作为在线特征重中心化的语义锚点。具体来说,LATER在重中心化特征之前将目标域位移投影到LoRA子空间的正交补上,减少视点引起的漂移同时保留任务相关语义。UAV-OVO和LATER共同为视点鲁棒的无人机视频理解提供了一个受控测试床和一种实用的适配方法。

英文摘要

UAV action recognition faces a deployment shift that standard benchmarks often obscure: a model trained on UAV footage captured from low-depression viewpoints may be required to recognize the same action classes from high-depression viewpoints. While the action labels remain unchanged, this shift alters body visibility, motion projection, and scene context, encouraging models to rely on viewpoint-specific shortcuts. We introduce UAV-OVO, an Out-of-Viewpoint generalization benchmark for UAV action recognition. UAV-OVO derives view scores from uncalibrated videos, uses a view-isolation band to assign low-depression videos to the training and in-distribution test splits while reserving high-depression videos for out-of-distribution testing, and constructs ID/OOD test sets matched by class distribution so that performance differences reflect viewpoint shift rather than label imbalance. Across representative video recognizers, UAV-OVO reveals a substantial ID/OOD gap: models that fit the low-depression training distribution well often fail to transfer to held-out high-depression views, exposing viewpoint shortcuts hidden by aggregate accuracy. We further propose LATER, LoRA-Anchored Test-time Re-centering, which first adapts the recognizer with Low-Rank Adaptation (LoRA) and then uses the learned LoRA subspace as a semantic anchor for online feature re-centering. Specifically, LATER projects target-domain displacement onto the orthogonal complement of the LoRA subspace before re-centering features, reducing viewpoint-induced drift while preserving task-relevant semantics. Together, UAV-OVO and LATER provide a controlled testbed and a practical adaptation method for viewpoint-robust UAV video understanding.
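
The projection step in LATER can be illustrated with plain linear algebra. This sketch assumes the LoRA subspace is given as a list of spanning vectors and that re-centering subtracts the projected mean shift from each feature; `project_out` and `recenter` are hypothetical helper names, and the online (running) estimation of the target mean is omitted.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt orthonormalization; near-zero residuals are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis

def project_out(displacement, subspace_vectors):
    """Component of displacement in the orthogonal complement of the subspace."""
    residual = list(displacement)
    for q in orthonormal_basis(subspace_vectors):
        coeff = dot(residual, q)
        residual = [r - coeff * qi for r, qi in zip(residual, q)]
    return residual

def recenter(feature, source_mean, target_mean, subspace_vectors):
    """Remove only the part of the domain shift that lies outside the subspace."""
    shift = [t - s for t, s in zip(target_mean, source_mean)]
    drift = project_out(shift, subspace_vectors)
    return [f - d for f, d in zip(feature, drift)]

lora_subspace = [[1.0, 0.0, 0.0]]  # toy stand-in for learned LoRA directions
drift = project_out([2.0, 3.0, 4.0], lora_subspace)
fixed = recenter([5.0, 5.0, 5.0], [0.0, 0.0, 0.0], [2.0, 3.0, 4.0], lora_subspace)
print(drift, fixed)
```

Directions inside the subspace are left untouched (the first coordinate stays at 5.0), which mirrors the stated goal of reducing viewpoint-induced drift while preserving task-relevant semantics.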

2605.25599 2026-05-26 cs.LG cs.CV 版本更新

Generalized Evidential Deep Learning: From a Bayesian Perspective

广义证据深度学习:从贝叶斯视角

Yuanye Liu, Yibo Gao, Yuanyang Chen, Xiahai Zhuang

发表机构 * School of Data Science, Fudan University, Shanghai, China(复旦大学数据科学学院,上海,中国)

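AI summary: Starting from a generalized Bayesian framework, establishes a theoretical foundation for evidential deep learning and proposes a unified, extensible Generalized Evidential Deep Learning framework that achieves comparable results on classification, uncertainty estimation, and OOD detection.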

Comments Submitted to ICML2026

AI中文摘要

证据深度学习(EDL)已成为一种高效、无需采样的不确定性估计策略。一系列EDL变体被提出以解决原始框架的特定局限性,并取得了显著成功。然而,EDL的基本理论结构以及这些变体之间的关系尚未得到系统研究。在这项工作中,我们通过在广义贝叶斯框架内解释EDL,包括先验规范、后验更新和训练目标,为其建立了原则性的理论基础。我们进一步从贝叶斯分布不确定性角度刻画了证据不确定性,并通过渐近分析建立。基于这一视角,我们进一步提出了广义证据深度学习(GEDL),这是一个统一且可扩展的框架,明确解耦了各个组件的作用,并将GEDL与现有变体系统地联系起来。大量实验表明,GEDL在分类、不确定性估计和OOD检测上取得了可比的结果,并具有理论依据。

英文摘要

Evidential Deep Learning (EDL) has emerged as an efficient, sampling-free strategy for uncertainty estimation. A series of EDL variants have been proposed to address specific limitations of the original framework, achieving notable success. However, the underlying theoretical structure of EDL and the relationships among these variants have received limited systematic investigation. In this work, we establish a principled theoretical foundation for EDL by interpreting it within a generalized Bayesian framework that includes prior specification, posterior update, and training objective. We further characterize evidential uncertainty from a Bayesian distributional uncertainty viewpoint, established via asymptotic analysis. Building on this perspective, we propose Generalized Evidential Deep Learning (GEDL), a unified and extensible framework that explicitly disentangles the roles of individual components and systematically relates GEDL to existing variants. Extensive experiments demonstrate that GEDL yields comparable results on classification, uncertainty estimation, and OOD detection, with theoretical grounding.
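
For readers unfamiliar with EDL, the sampling-free uncertainty estimate in the commonly used Dirichlet formulation can be written in a few lines. This sketch shows that standard baseline formulation (uniform prior, evidence added to the concentration parameters), not the generalized framework proposed in the paper; `edl_outputs` is a hypothetical helper name.

```python
def edl_outputs(evidence):
    """Expected class probabilities and vacuity-style uncertainty.

    Standard EDL convention (assumed here): alpha_k = evidence_k + 1,
    S = sum(alpha), p_k = alpha_k / S, and uncertainty u = K / S.
    """
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength
    return probs, uncertainty

no_evidence = edl_outputs([0.0, 0.0, 0.0])
strong_evidence = edl_outputs([9.0, 0.0, 0.0])
print(no_evidence, strong_evidence)
```

With no evidence the uncertainty is 1 and the prediction is uniform; as evidence accumulates the uncertainty shrinks, all from a single forward pass. The `+ 1` term corresponds to a prior specification, which is one of the components the abstract says the generalized framework makes explicit.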

2605.25598 2026-05-26 cs.CV 版本更新

SurfSurg6D: Geometry Consistent Dense Correspondence for Textureless Surgical Instrument Pose Estimation

SurfSurg6D:面向无纹理手术器械位姿估计的几何一致密集对应

Daiyun Shen, Shuojue Yang, Chang Han Low, Qian Li, Mengya Xu, Qi Dou, Yueming Jin

发表机构 * National University of Singapore(国立新加坡大学) Chinese University of Hong Kong(香港中文大学)

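AI summary: To address data scarcity and geometric consistency challenges in textureless surgical instrument pose estimation, builds the SynSurg6D dataset and proposes the SurfSurg6D dense-correspondence framework, achieving RGB-only pose estimation that outperforms existing methods on multiple datasets.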

AI中文摘要

手术器械位姿估计为自主机器人手术、技能评估和手术工作流程标准化等有前景的应用提供了关键信息。然而,由于高精度要求、频繁遮挡、无纹理器械、深度信息稀缺以及标注数据非常有限,该任务仍然极具挑战性。这些限制导致在将通用物体位姿估计方法应用于手术场景时性能往往不理想。为解决这些问题,我们首先构建了一个新数据集SynSurg6D,以缓解该任务中的数据短缺问题。我们进一步提出了SurfSurg6D,一个专为手术器械位姿估计设计的密集对应框架。在SurgRIPE、EndoVis2018和SurgPose数据集上的实验结果表明,我们生成的SynSurg6D数据集能够多样化位姿分布,从而提升现有方法的性能。此外,SurfSurg6D优于现有方法,为精确高效的RGB-only位姿估计提供了鲁棒解决方案。

英文摘要

Surgical instrument pose estimation provides crucial information for promising applications, including autonomous robotic surgery, skill assessment, and standardization of surgical workflow. However, this task remains highly challenging due to high precision requirements, frequent occlusions, textureless instruments, scarcity of depth information, and very limited annotated data. These constraints often lead to unsatisfactory performance when applying general object pose estimation approaches to surgical scenarios. To address these issues, we first construct a new dataset, SynSurg6D, to alleviate the data shortage in this task. We further propose SurfSurg6D, a dense-correspondence framework tailored for surgical instrument pose estimation. Experimental results on the SurgRIPE, EndoVis2018 and SurgPose datasets demonstrate that our generated dataset SynSurg6D diversifies the pose distributions, thus enhancing the performance of existing approaches. Furthermore, SurfSurg6D outperforms existing methods, providing a robust solution for precise and efficient RGB-only pose estimation.

2605.25595 2026-05-26 cs.CV 版本更新

How Far Has AI Come in Liver Fibrosis Staging? A Large-Scale Real-World Dataset and Benchmark

AI在肝纤维化分期中取得了多大进展?大规模真实世界数据集与基准

Yuanye Liu, Nannan Shi, Zhejia Zhang, Hanxiao Zhang, Boya Wang, Derong Yu, Nao Wang, Yuxin Jin, Yang Zhou, Kunhao Yuan, Siqi Wang, Lida Yang, Xu Qiao, Wentao Liu, Xuelei He, Xin Hong, Guoyan Zheng, Xin Chen, Guang-Zhong Yang, Le Zhang, Lei Li, Yuxin Shi, Xiahai Zhuang

发表机构 * School of Data Science, Fudan University, Shanghai, China(复旦大学数据科学学院) Department of Radiology, Shanghai Public Health Clinical Center, Fudan University, Shanghai, China(复旦大学上海公共卫生临床中心放射科) Department of Electrical and Computer Engineering, Northwestern University, Evanston, USA(西北大学电气与计算机工程系) Shanghai Key Laboratory of Flexible Medical Robotics, Tongren Hospital, Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China(上海柔性医疗机器人重点实验室) School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China(上海交通大学生物医学工程学院) School of Computer Science, University of Nottingham, Nottingham, UK(诺丁汉大学计算机科学学院) Institute of Medical Robotics, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China(上海交通大学生物医学工程学院医疗机器人研究所) College of Computer Science and Technology, Huaqiao University, Xiamen, China(华侨大学计算机科学与技术学院) School of Electronic Information (School of Artificial Intelligence), Northwest University, Xi'an, China(西北大学电子信息学院(人工智能学院)) Department of Mechanical Engineering, University College London, London, UK(伦敦大学学院机械工程系) Institute of Neuroscience and Cardiovascular Research, University of Edinburgh, Edinburgh, UK(爱丁堡大学神经科学与心血管研究学院) CAS Center for Excellence in Nanoscience, National Center for Nanoscience and Technology, Beijing, China(中国科学院纳米科学卓越中心) School of Control Science and Engineering, Shandong University, Jinan, China(山东大学控制科学与工程学院) School of Engineering, College of Engineering and Physical Sciences, University of Birmingham, Birmingham, UK(伯明翰大学工程学院)

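AI summary: Using LiFS, a large-scale real-world dataset of multi-center, multi-sequence MRI, systematically evaluates 9 AI methods for liver fibrosis staging and finds that the best AI is comparable to a senior radiologist, while cross-center heterogeneity and label imbalance remain the main challenges.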

Comments Submitted to Medical Image Analysis

AI中文摘要

尽管方法学上取得了多年进展,但AI在肝纤维化分期中的进展从未在定义临床实践的异质性、多中心条件下进行系统评估。为填补这一空白,我们引入了LiFS,这是一个来自MICCAI 2025 CARE-Liver挑战的大规模数据集和基准,包含来自多个中心和扫描仪的610名患者的多序列MRI。据我们所知,LiFS是第一个提供完整钆塞酸增强序列并具有来自不同真实世界扫描仪的病理学确认注释的基准。通过对从96个注册团队中选出的9种独立开发方法进行系统评估,并与队列内放射科医生参考结果进行比较,我们的发现从三个互补角度回答了当前AI在临床级肝纤维化分期方面的进展。首先,与放射科医生相比,最佳AI方法总体上与资深放射科医生相当,并在特定设置下显著超过初级放射科医生,而中位AI性能通常接近初级放射科医生水平。其次,从数据角度来看,跨中心异质性、标签不平衡和对比增强序列变异性成为AI方法的主要挑战。第三,从技术角度来看,方法设计选择,包括空间配准、输入维度、多模态融合策略和骨干架构,似乎调节了跨中心鲁棒性,尽管没有单一选择能完全缩小差距。总体而言,LiFS为定位AI在肝纤维化分期中的当前状态以及促进对限制临床可靠部署的关键挑战的未来研究提供了严格的真实世界基准。

英文摘要

Despite years of methodological progress, how far AI has come in liver fibrosis staging has never been systematically evaluated under the heterogeneous, multi-center conditions that define clinical practice. To address this gap, we introduce LiFS, a large-scale dataset and benchmark derived from the MICCAI 2025 CARE-Liver challenge, comprising 610 patients across multiple centers and scanners with multi-sequence MRI. To the best of our knowledge, LiFS is the first benchmark providing complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations from diverse real-world scanners. Through systematic evaluation of 9 independently developed methods selected from 96 registered teams against in-cohort radiologist reference results, our findings address how far current AI has progressed toward clinical-level liver fibrosis staging from three complementary perspectives. First, against radiologists, the best AI methods were broadly comparable to the senior radiologist and significantly exceeded the junior radiologist in selected settings, while median AI performance generally approached junior-radiologist levels. Second, from a data perspective, cross-center heterogeneity, label imbalance, and contrast-enhanced sequence variability emerge as the dominant challenges for AI methods. Third, from a technical perspective, methodological design choices, including spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture, appear to modulate cross-center robustness, although no single choice alone closes the gap. Overall, LiFS provides a rigorous real-world benchmark for positioning the current state of AI in liver fibrosis staging and for enabling future research on the key challenges that limit clinically reliable deployment.

2605.25589 2026-05-26 cs.CV 版本更新

Artifact Correction for Echo-Planar Imaging at Low-Field and Ultra-Low-Field MRI

低场和超低场MRI中回波平面成像的伪影校正

Sisi Qiao, Yilin Yu, Tiecheng Lin, Yuhao Liu, Jiajia Sun, Xiaoling Li

发表机构 * School of Mechanical Engineering, Xi'an Jiaotong University(西安交通大学机械工程学院)

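AI summary: To address Nyquist ghosting in echo-planar imaging at low-field and ultra-low-field MRI, proposes a reference-scan-free correction pipeline combining peak alignment with interpolation and resampling, which effectively suppresses ghosts and improves image quality.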

Comments 19 pages, 10 figures, 2 tables

AI中文摘要

目的:低场和超低场MRI中的回波平面成像因奇偶k空间错位而遭受严重的奈奎斯特鬼影伪影。本研究开发了一种无参考扫描的伪影校正流程,减少对传统参考扫描的依赖,同时实现更好的鬼影抑制。方法:从传统的基于参考扫描的鬼影校正方法出发,我们首先引入一种基于峰值对齐的鬼影校正方法,无需参考数据即可校正奇偶行位移。为进一步减少残余伪影,采用了插值与重采样策略。该组合方法在低场和超低场下的EPI和扩散加权EPI数据上进行了评估。结果:所提出的流程有效减轻了奈奎斯特鬼影,改善了结构连续性,并增强了信号均匀性。仅基于峰值对齐的鬼影校正方法提供了与基于参考扫描的鬼影校正方法相当的伪影抑制效果,而插值与重采样进一步抑制了残余伪影,使得在超低场条件下能够可靠地可视化脑结构。结论:为低场和超低场EPI提出了一种实用的无参考校正流程,结合了基于峰值对齐的鬼影校正方法和插值重采样,实现了高效的鬼影抑制,扩展了低场MRI系统的临床适用性,为基于超低场EPI的DWI成像提供了理论指导和实践经验。

英文摘要

Purpose: Echo-planar imaging (EPI) in low-field (LF) and ultra-low-field (ULF) MRI suffers from severe Nyquist ghost artifacts due to odd-even k-space misalignment. This study develops a reference-free artifact correction pipeline that reduces reliance on conventional reference scans while achieving improved ghost suppression. Methods: Starting from the traditional reference-scan-based ghost correction method, we first introduce a peak-alignment-based ghost correction method that corrects odd-even line displacement without reference data. To further reduce residual artifacts, an interpolation-and-resampling strategy is applied. The combined method was evaluated using EPI and diffusion-weighted EPI data at LF and ULF. Results: The proposed pipeline effectively mitigated Nyquist ghosts, improved structural continuity, and enhanced signal uniformity. Peak-alignment-based correction alone provided artifact suppression comparable to reference-scan-based correction, while interpolation and resampling further suppressed residual artifacts, enabling reliable visualization of brain structures under ULF conditions. Conclusion: A practical, reference-free correction pipeline is presented for LF and ULF EPI, combining peak-alignment-based ghost correction with interpolation and resampling to achieve efficient ghost suppression and expand the clinical applicability of low-field MRI systems, providing both theoretical guidance and practical experience for ULF EPI-based diffusion-weighted imaging (DWI).
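
The peak-alignment idea can be sketched on toy data. This is a deliberately simplified illustration, not the paper's pipeline: it assumes integer-sample circular shifts, uses the median even-line peak as the reference, and omits the readout reversal of odd lines as well as the interpolation-and-resampling stage; `align_odd_lines` is a hypothetical helper name.

```python
def peak_index(line):
    """Index of the largest-magnitude sample in one k-space readout line."""
    return max(range(len(line)), key=lambda i: abs(line[i]))

def align_odd_lines(kspace):
    """Circularly shift odd lines so their echo peaks match the even-line peak.

    kspace is a list of readout lines (lists of real or complex samples).
    """
    even_peaks = sorted(peak_index(line) for line in kspace[0::2])
    reference = even_peaks[len(even_peaks) // 2]  # median even-line peak
    corrected = []
    for n, line in enumerate(kspace):
        line = list(line)
        if n % 2 == 1:
            shift = reference - peak_index(line)
            # Positive shift moves samples right, negative moves them left.
            line = line[-shift:] + line[:-shift]
        corrected.append(line)
    return corrected

toy_kspace = [
    [0, 0, 5, 0, 0],
    [0, 0, 0, 4, 0],  # odd line displaced by +1 sample
    [0, 1, 6, 0, 0],
    [0, 3, 0, 0, 0],  # odd line displaced by -1 sample
]
aligned = align_odd_lines(toy_kspace)
print(aligned)
```

Because the reference is derived from the imaging data itself, no separate reference scan is needed; sub-sample misalignment is what the later interpolation step would have to address.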

2605.25574 2026-05-26 cs.CV cs.AI 版本更新

Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

Mosaic: 通过向量场混合的组合式多概念擦除

Junseok Ko, Jungwoo Kim, Jong-Seok Lee

发表机构 * Department of Artificial Intelligence, Yonsei University(延世大学人工智能系) School of Integrated Technology, Yonsei University(延世大学整合技术学院)

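AI summary: For the task of erasing multiple target concepts simultaneously in flow-based text-to-image models, proposes the Mosaic framework, which dynamically constructs concept-specific masks and selectively blends vector fields to remove multiple concepts from complex scenes without additional optimization.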

AI中文摘要

概念擦除已成为确保文本到图像(T2I)模型安全与伦理图像合成的关键研究方向。现有研究虽探索了多概念擦除,但通常假设每张图像仅有一个目标概念,这一限制被现代基于流的T2I模型日益暴露,此类模型可同时生成包含多个概念的复杂场景。为弥补这一空白,我们引入组合式多概念擦除这一新任务,旨在同时移除单个场景中的多个目标概念。我们提出CoME-Bench,一个用于评估组合式多概念擦除的基准,涵盖类别内和跨类别场景。我们进一步提出Mosaic,一个用于基于流的T2I模型中多概念擦除的新框架,该框架通过动态构建概念特定掩码并选择性混合它们,利用向量场中目标概念的空间局部性,无需额外优化。大量实验表明,Mosaic能有效移除复杂组合场景中的多个目标概念,同时保留非目标上下文。

英文摘要

Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.
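
Mask-based blending of vector fields can be pictured with a toy example. The blending rule below is an assumption made for illustration (mask-weighted averaging where masks overlap, the unmodified field elsewhere); the abstract does not specify how Mosaic builds its masks or combines the fields, and `blend_fields` is a hypothetical helper name.

```python
def blend_fields(base_field, concept_fields, masks):
    """Blend per-location velocities from several concept-erased fields.

    base_field: velocities of the unmodified model (flat list).
    concept_fields: one velocity list per erased concept.
    masks: one list of weights in [0, 1] per concept, marking where it appears.
    """
    blended = []
    for i, base in enumerate(base_field):
        weights = [mask[i] for mask in masks]
        total = sum(weights)
        if total == 0:
            blended.append(base)  # no target concept here: keep the original
            continue
        mixed = sum(w * f[i] for w, f in zip(weights, concept_fields)) / total
        coverage = min(1.0, total)
        blended.append(coverage * mixed + (1.0 - coverage) * base)
    return blended

base = [1.0, 1.0, 1.0, 1.0]
erased = [[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]]
concept_masks = [[1, 0, 0, 1], [0, 1, 0, 1]]
result = blend_fields(base, erased, concept_masks)
print(result)
```

Since the blending is a pointwise combination of already-computed fields, it requires no extra optimization, and locations outside every mask keep the original velocity, which is how non-target content is preserved.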

2605.25571 2026-05-26 cs.CV 版本更新

AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution

AnE: 通过锚点进化推动多模态大语言模型的推理前沿

Zehao Wang, Yihan Zeng, Zidong Gong, Yuanfan Guo, Feng Zhu, Hongzhi Zhang, Wei Zhang, Wangmeng Zuo

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Noah's Ark Lab(华为诺亚实验室) Independent Researcher(独立研究员)

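AI summary: Proposes the Anchor Evolution (AnE) paradigm, which uses truth-anchored data curation and a scaffold-stripping mechanism to address cognitive drift and hallucinated reasoning paths in multimodal large model reasoning, substantially improving reasoning performance.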

Comments 34 pages,10 figures

AI中文摘要

通过监督微调(SFT)和强化学习(RL)进行的后训练对于增强多模态大语言模型(MLLMs)的推理能力至关重要,然而现有范式由于静态数据的限制常常达到性能瓶颈。虽然当前方法利用自我反思或自我进化来突破这些界限,但它们仍然受到低质量合成数据导致的认知漂移和幻觉推理路径的影响。为了解决这些挑战,我们提出了锚点进化(AnE),一种整合了真值锚点数据策展和模型进化的新范式,在推理前沿实现了忠实且稳定的性能提升。具体来说,我们提出了真值锚点扩展,通过轨迹展开定位模型失败前沿,并利用真实数据库检索高保真锚点以进行忠实的数据策展。随后,我们引入了脚手架剥离机制来内化推理能力。该机制首先通过脚手架增强监督来锚定推理路径,以减轻直接在原始数据上进行SFT的学习复杂性和分布漂移,然后利用强化学习剥离脚手架模板,从而有效地将推理路径转化为内在模型能力。在多模态推理基准上的实验结果表明,我们的方法显著推进了模型性能前沿,在八个多模态基准上将基础模型提升了10.3%,并达到了最先进的结果。代码将公开提供。

英文摘要

Post-training via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is crucial for enhancing reasoning in Multimodal Large Language Models (MLLMs), yet existing paradigms often reach a performance bottleneck due to the limitations of static data. While current methods leverage self-reflection or self-evolution to push these boundaries, they still suffer from cognitive drift and hallucinated reasoning paths caused by low-quality synthetic data. To address these challenges, we propose Anchor Evolution (AnE), a new paradigm that integrates truth-anchored data curation and model evolution, achieving faithful and steady performance gains at the reasoning frontier. Specifically, we propose Truth Anchor Expansion, which pinpoints the model failing frontier via trajectory rollouts and leverages ground-truth databases to retrieve high-fidelity anchors for faithful data curation. Subsequently, we introduce the Scaffold-Stripping Mechanism to internalize reasoning capabilities. This mechanism first anchors reasoning paths via scaffold-augmented supervision to mitigate the learning complexity and distribution drift of direct SFT on raw data, then leverages RL to strip the scaffold template, thereby effectively transitioning the reasoning paths into intrinsic model capabilities. Experimental results on multimodal reasoning benchmarks show that our method substantially advances the model performance frontier, improving the base model by 10.3% across eight multimodal benchmarks and achieving state-of-the-art results. The code will be made publicly available.

2605.25568 2026-05-26 cs.CV 版本更新

Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking

重新思考涂鸦引导的图像编辑:泛化、指令遵循与多任务

Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng

发表机构 * Xiamen University(厦门大学) Taobao & Tmall Group of Alibaba(阿里巴巴淘宝与天猫集团)

AI总结 针对涂鸦引导图像编辑在多任务场景下性能不稳定的问题,通过实证研究揭示指令级泛化瓶颈,提出覆盖-真实课程、多任务拼接和编辑聚焦损失三种策略,在VIBE基准上实现单任务和多任务的最优结果。

详情
AI中文摘要

涂鸦引导的图像编辑允许用户将简单的涂鸦注释与文本提示相结合,以指定图像编辑的位置和方式,从而实现灵活交互和精确的空间控制。然而,现有模型在这种范式下仍表现出不稳定的性能,尤其是在多任务场景中。为了提升性能,我们使用开源编辑模型进行实证研究,并揭示了泛化中的不对称性:指令级泛化(包括跨编辑任务以及从单任务到多任务设置)比图像域泛化(例如从合成图像到真实图像,或从马赛克图像到常规图像)更具挑战性。这表明主要瓶颈在于对多样化编辑指令的学习不足,而非图像域差异。受此启发,我们提出了三种策略:(a) 覆盖-真实课程,一个两阶段流程,首先构建大规模合成、指令丰富的数据以提供广泛的任务监督,然后精选少量真实数据以细化生成的真实性;(b) 多任务拼接,通过几乎零成本地拼接单任务样本来构建多任务训练样本,同时使学习到的能力泛化到非马赛克图像;(c) 编辑聚焦损失,利用合成数据中输入和输出图像之间的变化区域,将训练聚焦于编辑区域,提高学习效率和编辑准确性。通过这些策略,我们在VIBE基准上显著提升了单任务和多任务涂鸦引导编辑的性能,取得了最先进的结果。我们将公开发布我们的数据集和模型。

英文摘要

Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existing models still exhibit unstable performance under this paradigm, especially in multi-task scenarios. To improve performance, we conduct empirical studies using an open-source editing model and reveal an asymmetry in generalization: instruction-level generalization, including across editing tasks and from single-task to multi-task settings, is more challenging than image-domain generalization, such as from synthetic to real-world images or from mosaicked to regular images. This suggests that the primary bottleneck lies in insufficient learning for diverse editing instructions rather than in the image domain gap. Motivated by this insight, we propose three strategies: (a) a Coverage-then-Realism Curriculum, a two-stage pipeline that first builds large-scale synthetic, instruction-rich data for broad task supervision, then curates a small set of real-world data to refine generation realism; (b) Multi-Task Mosaicking, which constructs multi-task training samples by concatenating single-task examples at nearly zero cost while enabling the learned capability to generalize to non-mosaicked images; and (c) an Edit-Focused Loss, which leverages the changed regions between input and output images in synthetic data to focus training on edited regions, improving both learning efficiency and editing accuracy. With these strategies, we substantially improve both single-task and multi-task scribble-guided editing on the VIBE benchmark, achieving state-of-the-art results. We will publicly release our dataset and model.

2605.25563 2026-05-26 cs.CV 版本更新

CodecSplat: Ultra-Compact Latent Coding for Feed-Forward 3D Gaussian Splatting

CodecSplat: 用于前馈式3D高斯泼溅的超紧凑潜在编码

Pengpeng Yu, Runqing Jiang, Qi Zhang, Dingquan Li, Jing Wang, Yulan Guo

发表机构 * Sun Yat-sen University(中山大学) Peking University(北京大学) Pengcheng Laboratory(鹏城实验室)

AI总结 提出CodecSplat框架,通过将压缩集成到前馈式高斯生成流水线中,利用结构化中间特征表示实现超紧凑场景编码,显著降低存储和传输开销。

详情
AI中文摘要

尽管前馈式3D高斯泼溅无需逐场景优化即可从稀疏上下文视图重建可渲染的高斯基元,但现有流水线并未提供紧凑的场景表示用于存储或传输。一种自然的解决方案是将现有的3DGS压缩方法应用于生成的高斯基元。然而,这种方法作用于最终的不规则3D表示,且与内部特征到高斯的生成过程解耦,限制了压缩效率。为解决此问题,我们引入了CodecSplat,一种用于前馈式3D高斯泼溅的超紧凑潜在编码框架。CodecSplat首先将中间2D高斯生成特征编码为熵编码的场景比特流。在解码器端,潜在特征被重建并用于预测深度和高斯参数,然后映射到3D高斯基元。注意,通过将压缩集成到前馈式高斯生成流水线中,CodecSplat避免了对不规则3D高斯基元的低效压缩,并允许编解码器利用结构化的中间特征表示。我们在前馈式高斯泼溅骨干网络上实例化了CodecSplat,该网络具有深度引导的多视图特征细化和分层学习特征编解码器。在DL3DV和RealEstate10K数据集上,CodecSplat分别实现了23.56-26.36 dB和24.76-27.05 dB的PSNR,每场景仅需20.00-107.77 KiB和3.37-12.51 KiB。这比压缩前馈式生成的高斯基元大约小一个数量级,同时保持了可控的率失真行为。

英文摘要

While feed-forward 3D Gaussian splatting reconstructs renderable Gaussian primitives from sparse context views without per-scene optimization, existing pipelines do not provide a compact scene representation for storage or transmission. A natural solution is to apply existing 3DGS compression methods to the generated Gaussian primitives. However, this approach operates on the final irregular 3D representation and is decoupled from the internal feature-to-Gaussian generation process, which limits compression efficiency. To address this, we introduce CodecSplat, an ultra-compact latent coding framework for feed-forward 3D Gaussian splatting. CodecSplat first encodes an intermediate 2D Gaussian-generation feature into an entropy-coded scene bitstream. At the decoder, the latent feature is reconstructed and used to predict depth and Gaussian parameters, which are then mapped to 3D Gaussian primitives. Note that, by integrating compression into the feed-forward Gaussian generation pipeline, CodecSplat avoids inefficient compression over irregular 3D Gaussian primitives and allows the codec to exploit the structured intermediate feature representation. We instantiate CodecSplat on a feed-forward Gaussian splatting backbone with depth-guided multi-view feature refinement and a hierarchical learned feature codec. On DL3DV and RealEstate10K datasets, CodecSplat achieves 23.56-26.36 dB and 24.76-27.05 dB PSNR with only 20.00-107.77 KiB and 3.37-12.51 KiB per scene, respectively. This is roughly one order of magnitude smaller than compressing feed-forward generated Gaussian primitives, while preserving controllable rate-distortion behavior.

2605.25561 2026-05-26 cs.CV 版本更新

Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation?

我们在半监督3D医学图像分割的模型和结果上是否过于自信?

Jun Li, Ziwei Qin

发表机构 * Institute of Systems Science and Technology, School of Electrical Engineering, Southwest Jiaotong University, China(系统科学与技术研究院,电气工程学院,西南交通大学,中国)

AI总结 针对半监督医学图像分割中伪标签框架的确认偏差和基准测试集使用不当导致的性能高估问题,提出一种基于双轴可靠性评估的三空间校准分割框架(TCSeg),以解耦置信度与不确定性并协同校正偏差。

Comments Accepted by ICML 2026

详情
AI中文摘要

半监督学习已成为减少标注成本的主流范式。然而,我们认为当前的进展被双重过度自信问题所掩盖。在算法层面,主流的伪标签框架常常将预测置信度与不确定性混为一谈,导致严重的确认偏差。在策略层面,由于多个基准数据集缺乏专用的验证集,一些研究也使用测试集进行验证,导致性能估计膨胀。后续方法为了超越已报告的最先进水平而被迫采用相同策略,引发了过拟合的军备竞赛。这引发了担忧,即社区中令人印象深刻的数值提升可能反映的是过拟合而非真正的进步。因此,我们提出了一种基于原则性双轴可靠性评估引擎的三空间校准分割框架。它明确地将置信度与不确定性解耦,并利用这一信号在特征空间、概率空间和图像空间中以协作方式检测和纠正确认偏差。在三个基准数据集上,TCSeg在现有评估协议下始终提供强大的性能。更重要的是,我们主张社区在多次运行协议下报告最终检查点结果,从而以更现实的视角建立更严格的基准。代码将公开:github.com/DirkLiii/TCSeg。

英文摘要

Semi-supervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that the current progress is clouded by a twofold overconfidence problem. Algorithmically, mainstream pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias. Strategically, since multiple benchmark datasets lack dedicated validation sets, some studies use the test set for validation as well, leading to inflated performance estimates. Subsequent methods, compelled to employ the same strategy to surpass reported SOTA, trigger an arms race of overfitting. This raises concerns that the impressive numerical gains in the community may reflect overfitting rather than genuine progress. Thus, we propose TCSeg, a tri-space calibrated segmentation framework founded on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses this signal to detect and correct confirmation bias across feature, probability, and image spaces in a collaborative manner. Across three benchmark datasets, TCSeg consistently delivers strong performance under existing evaluation protocols. More importantly, we advocate that the community report final-checkpoint results under multiple-run protocols, thereby establishing more rigorous benchmarks with a more realistic perspective. Code will be available: github.com/DirkLiii/TCSeg.

2605.25553 2026-05-26 cs.CV cs.RO 版本更新

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

ComPose:用于鲁棒类别级物体姿态估计的统一补全-姿态框架

Huan Ren, Yihan Chen, Chuxin Wang, Nailong Liu, Wenfei Yang, Tianzhu Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory(深空探测全国重点实验室,深空探测实验室) Beijing Institute of Control Engineering(北京控制工程研究所)

AI总结 提出ComPose框架,通过关键点渐进补全模块和几何关系一致性损失,将形状补全与姿态估计紧密集成,在不依赖类别级形状先验的情况下提升点云不完整场景下的姿态估计精度和效率。

Comments Accepted by CVPR 2026 (Oral, Best Paper Award Candidate). Project page is available at renhuan1999.github.io/ComPose

详情
AI中文摘要

类别级物体姿态估计旨在预测特定类别中任意物体的姿态和尺寸。现有方法难以处理观测点云固有的不完整性,这限制了它们捕捉完整物体形状以实现鲁棒姿态推理的能力。虽然点云补全提供了一种有前景的解决方案,但将其作为部分观测的独立预处理步骤会引入复合误差和额外计算开销,最终阻碍准确性和效率。为解决这些挑战,我们提出了ComPose,一种新颖的统一框架,紧密集成形状补全以提供完整的几何线索,从而增强姿态估计。ComPose的核心是一个基于关键点的渐进补全模块,通过逐步预测稀疏关键点及其周围的密集点集来恢复完整形状表示,使关键点能够捕捉整体物体几何结构。几何关系编码模块进一步用局部和全局几何上下文丰富关键点特征。此外,我们引入了一种新颖的几何关系一致性损失,以强制观测关键点与其预测的NOCS坐标之间的结构对齐,确保全局一致的坐标变换。在标准基准上的大量实验表明,我们的方法在不依赖类别级形状先验的情况下优于现有最先进方法。

英文摘要

Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors.

2605.25547 2026-05-26 cs.RO cs.CV 版本更新

TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation

TapSampling:基于任务进度理解验证器的推理时采样方法用于机器人操作

Sizhe Zhao, Shengping Zhang, Shuo Yang, Weiyu Zhao, Shuigen Wang, Xiangyang Ji

发表机构 * Harbin Institute of Technology, China(哈尔滨工业大学,中国) Harbin Institute of Technology (Weihai) Qingdao Research Institute, China(哈尔滨工业大学(威海)青岛研究院,中国) Iray Technology co., Ltd., Shandong, China(Iray科技有限公司,山东,中国) Tsinghua University, Beijing, China(清华大学,北京,中国)

AI总结 提出TapSampling框架,通过Action-VAE在低维潜空间采样候选动作,并利用任务进度预测验证器选择最优动作,无需微调即可提升多种通用策略的性能。

Comments ICML 2026. Project Page: https://aipixel.github.io/TapSampling/

详情
AI中文摘要

现有的具身控制研究通过扩展训练数据和模型规模展现了显著的性能提升。我们则探索推理时策略作为另一个维度。非确定性生成模型,如扩散模型和自回归模型,已被广泛应用于具身控制领域。然而,单次推理范式限制了它们的性能。在本文中,我们提出TapSampling,一个即插即用的推理时采样框架。首先,我们引入一个Action-VAE,通过将策略生成的初始动作映射到压缩的后验分布中,在低维潜空间中表示动作,从中可以抽取任意数量的潜样本并解码为候选动作,这些动作近似于真实动作分布。其次,我们将动作验证表述为任务进度结果预测,利用机器人数据集固有的序列结构训练一个语义基础验证器,用于可解释的动作选择。此外,TapSampling是一个策略无关的框架。在模拟和真实环境中的大量实验表明,我们的方法无需进一步微调策略即可显著提升多种通用策略的性能。代码和模型可在项目页面获取。

英文摘要

Existing embodied control research demonstrates remarkable performance improvements by scaling training data and model size. We instead explore inference-time strategy as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, have been widely adopted in the field of embodied control. However, the single-shot inference paradigm limits their performance. In this paper, we propose TapSampling, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE that represents actions in a low-dimensional latent space by mapping policy-generated initial actions into a compressed posterior distribution, from which any number of latent samples can be drawn and decoded into candidate actions that approximate the true action distribution. Second, we formulate action verification as task-progress outcome prediction, using the intrinsic sequential structure of robotic datasets to train a semantically grounded verifier for interpretable action selection. Furthermore, TapSampling is a policy-agnostic framework. Extensive experiments in both simulated and real-world environments demonstrate that our method substantially improves multiple generalist policies without further policy finetuning. Code and models are available at the project page.

2605.25530 2026-05-26 cs.CV 版本更新

Location Prior Generation via Multi-Source Urban Data Fusion for Low-Altitude Air Mobility

基于多源城市数据融合的低空空中交通位置先验生成

Xiang Xie, Xiaonan Liu

发表机构 * Politecnico di Milano(米兰理工大学) School of Natural and Computing Science, University of Aberdeen(阿伯丁大学自然科学与计算科学学院)

AI总结 提出LPGF框架,融合多源数据(哨兵2号影像、无人机遥测、车辆GPS轨迹、OSM足迹)生成结构化城市位置先验,通过三级优先级分配建筑高度,并引入质量门控的阴影估计模块,在米兰数据集上验证了约5.5米的最坏误差。

Comments 11 pages, 7 figures, submitted to IEEE Journal of Internet of Things

详情
AI中文摘要

建筑高度作为城市空间数据的第三维度,在全球地理空间数据库中超过95%的结构中缺失。对于新兴的低空经济而言,这一数据缺口迫使每个空中平台依赖实时机载感知而非预计算的3D场景几何。我们提出了位置先验生成框架(LPGF),这是一个多源数据融合管道,将哨兵2号影像、无人机遥测、车辆GPS轨迹和OpenStreetMap足迹整合为结构化、可重用的城市位置先验。LPGF通过三级优先级层次分配建筑高度:(1)可用的显式OSM高度标签,(2)楼层数乘以每层3.2米(若记录),以及(3)否则使用建筑类型默认高度,产生约5.5米的最坏情况误差。一个可选的基于阴影的高度估计模块(SHEM)仅在满足四项质量标准时才被激活;当任何标准失败时,管道转向结构化后备方案。在MiTra A50米兰数据集上,质量门正确识别了两种成像故障模式:10米GSD下的亚像素阴影和0.93米GSD下的地面阴影合并,在两种情况下均产生一致的27栋建筑先验。第三级类型默认高度与手动楼层计数(n=15)进行验证,在5.0米不确定性范围内达到MAE=3.07米。该框架表明,结构化、质量门控的通用数据流融合可以为低空城市运营启动3D场景覆盖。

英文摘要

Building height, the third dimension (3D) of urban spatial data, is absent in over 95% of structures in global geospatial databases. For the emerging low-altitude economy, this data gap forces each aerial platform to rely on real-time onboard sensing rather than pre-computed 3D scene geometry. We present the Location Prior Generation Framework (LPGF), a multi-source data fusion pipeline that integrates Sentinel-2 imagery, UAV telemetry, vehicle GPS trajectories, and OpenStreetMap footprints into structured, reusable urban location priors. LPGF assigns building heights through a three-tier priority hierarchy: (1) explicit OSM height tags where available, (2) floor count multiplied by 3.2 m per story where recorded, and (3) building-type default heights otherwise, yielding a worst-case error of approximately 5.5 m. An optional shadow-based height estimation module (SHEM) is activated only when a four-criterion quality gate is satisfied; when any criterion fails, the pipeline routes to structured fallback. On the MiTra A50 Milan dataset, the quality gate correctly identified two imaging failure modes: sub-pixel shadows at 10 m GSD and ground shadow merging at 0.93 m GSD, producing a consistent 27-building prior in both cases. Tier 3 type-default heights were validated against manual floor counts (n=15), achieving MAE=3.07 m within the 5.0 m uncertainty bound. The framework demonstrates that structured, quality-gated fusion of universally available data streams can bootstrap 3D scene coverage for low-altitude urban operations.
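The three-tier height hierarchy described in the abstract can be illustrated with a minimal Python sketch. Only the tier order and the 3.2 m per-story factor come from the abstract; the function name, the per-type default heights, and the fallback value are hypothetical placeholders, and the sketch assumes OpenStreetMap-style `height`, `building:levels`, and `building` tags holding plain numeric values.

```python
from typing import Dict, Tuple

METERS_PER_STORY = 3.2  # per-story factor stated in the abstract

# Hypothetical per-type defaults (illustrative values, not from the paper).
TYPE_DEFAULT_HEIGHT_M = {"residential": 12.0, "industrial": 8.0, "commercial": 15.0}
FALLBACK_HEIGHT_M = 10.0  # hypothetical default for unknown building types


def assign_height(tags: Dict[str, object]) -> Tuple[float, int]:
    """Return (height in meters, tier used) for one building footprint."""
    # Tier 1: an explicit height tag takes priority when present.
    if tags.get("height") is not None:
        return float(tags["height"]), 1
    # Tier 2: recorded floor count multiplied by 3.2 m per story.
    if tags.get("building:levels") is not None:
        return int(tags["building:levels"]) * METERS_PER_STORY, 2
    # Tier 3: fall back to a default height for the building type.
    building_type = tags.get("building")
    return TYPE_DEFAULT_HEIGHT_M.get(building_type, FALLBACK_HEIGHT_M), 3


if __name__ == "__main__":
    print(assign_height({"height": "21.5", "building:levels": 6}))  # tier 1 wins
    print(assign_height({"building:levels": 3}))                    # tier 2
    print(assign_height({"building": "industrial"}))                # tier 3
```

Returning the tier alongside the height keeps the provenance of each value visible, which would matter downstream if the tiers carry different uncertainty bounds.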

2605.25524 2026-05-26 cs.CV 版本更新

ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

ProSR: 面向可靠思维链的过程塑造空间推理方法

Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong

发表机构 * Xi’an Jiaotong University(西安交通大学) Amap, Alibaba Group(阿里巴巴集团高德地图) Tsinghua University(清华大学) Shenzhen University of Advanced Technology(深圳理工大学)

AI总结 针对视觉语言模型在空间推理中存在的虚假基础与尾部不稳定性问题,提出ProSR框架,通过反事实不变性惩罚和尾部漂移惩罚优化推理过程,提升答案准确率及轨迹稳定性与视觉依赖性。

Comments 19 pages, 6 figures

详情
AI中文摘要

可靠的空间推理仍然是视觉语言模型(VLM)的核心瓶颈。现有的空间推理主流训练范式主要依赖于结果对齐或过程模仿,缺乏对推理过程的显式约束,因此难以确保真正的视觉依赖和稳定的推理轨迹。在本文中,我们构建了一个覆盖多种空间现象的高质量思维链数据集,并诊断了模型的推理过程,揭示了强化学习优化过程中两种典型的过程退化类型:虚假基础(绕过视觉证据)和尾部不稳定性(推理后期不确定性异常上升)。为了解决这些问题,我们提出了ProSR,一种用于空间推理的过程塑造优化框架。通过反事实不变性惩罚和尾部漂移惩罚,ProSR将优化目标从单一的答案正确性扩展到两个过程级维度:视觉依赖性和轨迹稳定性。在多个复杂和分布外的空间推理基准上的实验表明,ProSR在提高答案准确率的同时,生成的推理轨迹更加稳定且更依赖于视觉证据。

英文摘要

Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraints on the reasoning process, and therefore struggle to ensure genuine visual dependence and stable reasoning trajectories. In this paper, we construct a high-quality CoT dataset covering diverse spatial phenomena and diagnose the model's reasoning process, revealing two typical types of process degradation during reinforcement learning optimization: Spurious Grounding, which bypasses visual evidence, and Tail Instability, where uncertainty abnormally rises in the later stage of reasoning. To address these issues, we propose ProSR, a process-shaping optimization framework for spatial reasoning. Through a Counterfactual Invariance Penalty and a Tail Drift Penalty, ProSR extends the optimization objective from single answer correctness to two process-level dimensions: visual dependence and trajectory stability. Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while generating reasoning trajectories that are more stable and more dependent on visual evidence.

2605.25518 2026-05-26 cs.CV cs.AI 版本更新

Cross-Stage Attention Multi-Expert Network for Radiologist-Inspired Breast Ultrasound Diagnosis

受放射科医生启发的乳腺超声诊断的跨阶段注意力多专家网络

Xinyang Zhai, Chong Yang, Ruizhi Zhang

发表机构 * International Agency for Research on Cancer (IARC)(国际癌症研究机构) World Health Organization(世界卫生组织)

AI总结 提出跨阶段注意力混合专家网络(CSA-MoE-Net),通过跨阶段注意力模块增强多级特征、三分支MoE块从全肿瘤图像、肿瘤核心和边界学习互补特征,并在平衡数据集上实现96.33%准确率,显著优于基线ResNet-18。

详情
AI中文摘要

乳腺超声成像是一种重要的早期乳腺癌诊断无创方法,但由于肿瘤异质性、边界模糊和数据不平衡,自动良恶性分类仍具挑战。为了提高特征表示和分类准确性,本文提出了跨阶段注意力混合专家网络(CSA-MoE-Net)。它采用跨阶段注意力增强的ResNet-18作为骨干网络,其中跨阶段注意力模块自适应地重新校准多级特征,从而增强关键肿瘤特征并抑制冗余。一个三分支混合专家(MoE)块从全肿瘤图像、肿瘤核心和边界学习互补特征,自适应门控网络融合这些特征以捕获形态、纹理和上下文信息。融合后的特征在架构中称为融合专家特征(FEF)。在包含2,129张乳腺超声图像的平衡数据集上的实验表明,在20次独立运行的平均值下,该模型实现了96.33%的准确率、94.09%的精确率、98.53%的召回率、96.25%的F1分数和99.50%的AUC。与基线ResNet-18相比,这些指标分别提高了3.01、0.70、5.37、2.98和5.42个百分点。所提出的机制无需侵入性修改,可无缝嵌入VGG-16、DenseNet-121等网络,带来稳定的性能提升,从而为计算机辅助诊断提供可靠支持。

英文摘要

Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classification remains challenging due to tumor heterogeneity, blurred boundaries, and data imbalance. To improve feature representation and classification accuracy, this paper proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net). It adopts a Cross-Stage Attention-enhanced ResNet-18 as the backbone, in which the Cross-Stage Attention module adaptively recalibrates multi-level features, thereby enhancing key tumor features and suppressing redundancy. A three-branch Mixture of Experts (MoE) Block learns complementary features from the Whole Tumor Image, Tumor Core, and Boundary, and an Adaptive Gating Network fuses them to capture morphological, textural, and contextual information. The fused features are denoted as Fused Expert Feature (FEF) in the architecture. Experiments on a balanced dataset of 2,129 breast ultrasound images show that, averaged over 20 independent runs, the model achieves an accuracy of 96.33%, precision of 94.09%, recall of 98.53%, F1-score of 96.25%, and AUC of 99.50%. Compared to the baseline ResNet-18, these metrics improve by 3.01, 0.70, 5.37, 2.98, and 5.42 percentage points, respectively. The proposed mechanism requires no invasive modification and can be seamlessly embedded into VGG-16, DenseNet-121, etc., yielding stable performance gains, thus providing reliable support for computer-aided diagnosis.
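As a quick sanity check on the figures reported in the abstract, the short Python sketch below derives the implied ResNet-18 baseline and recomputes F1 from the averaged precision and recall. All input numbers are copied from the abstract; the variable names are only illustrative.

```python
# Reported averages over 20 runs, in percent (from the abstract).
reported = {"accuracy": 96.33, "precision": 94.09, "recall": 98.53,
            "f1": 96.25, "auc": 99.50}
# Reported gains over the ResNet-18 baseline, in percentage points.
gains = {"accuracy": 3.01, "precision": 0.70, "recall": 5.37,
         "f1": 2.98, "auc": 5.42}

# Implied baseline value for each metric: reported value minus gain.
baseline = {name: round(reported[name] - gains[name], 2) for name in reported}

# F1 as the harmonic mean of the averaged precision and recall.
p, r = reported["precision"], reported["recall"]
f1_from_means = 2 * p * r / (p + r)

if __name__ == "__main__":
    print(baseline)
    print(round(f1_from_means, 2))
```

The harmonic mean of the averaged precision and recall comes out at roughly 96.26, close to the reported 96.25. A small gap is expected, because an F1-score averaged over runs need not equal the F1-score of the averaged precision and recall.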

2605.25503 2026-05-26 cs.CV 版本更新

Metric-Phase Fields: Decoupling Distance and Sign for Thin-Structure Reconstruction from Unoriented Point Clouds

度量-相位场:从无定向点云中解耦距离和符号以重建薄结构

Jiayi Kong, Xuhui Chen, Chen Zong, Fei Hou, Junhui Hou, Wenping Wang, Ying He

发表机构 * S-Lab, Nanyang Technological University, Singapore; Key Laboratory of System Software (CAS), Institute of Software, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; School of Mathematics, Nanjing University of Aeronautics; Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China; Department of Computer Science Engineering, Texas A&M University, USA

AI总结 提出度量-相位场(MPF),通过解耦度量距离和拓扑相位,结合门控度量公式和残差相位注入,实现从无定向点云中稳定重建薄结构和开放边界。

详情
AI中文摘要

神经有符号距离函数(SDF)在重建水密流形方面表现出色,但由于严格的内外约束,在薄结构和开放边界上失败。相反,无符号距离场(UDF)适应一般几何形状,但在零水平集处存在梯度奇异性,阻碍优化和提取。我们引入度量-相位场(MPF),一种解耦的隐式表示,将度量邻近性与拓扑相位分离。给定无定向点云,MPF学习(i)无符号度量场$r$和(ii)平滑相位场$θ$,我们推导出一个有界相位指示器$P=\tanh(βθ)$,在有意义的地方提供软内外线索。我们通过门控度量公式和残差相位注入耦合这两个场,以获得具有稳定近表面梯度的有符号隐函数。相位系数$β$是可学习的,允许MPF自适应控制相变锐度和软符号指示器的饱和程度。在合成和扫描的薄壳及薄板形状上的实验表明,MPF比最近的基于SDF的方法更忠实地保留薄层结构,同时比基于UDF的方法实现更稳健的训练和更可靠的表面提取。源代码和测试模型见 https://github.com/JIAYI-Scarlett/ICML2026-MPF 。

英文摘要

Neural Signed Distance Functions (SDFs) excel at reconstructing watertight manifolds but fail on thin structures and open boundaries due to strict inside-outside constraints. Conversely, Unsigned Distance Fields (UDFs) accommodate general geometries but suffer from gradient singularities at the zero-level set, hindering optimization and extraction. We introduce Metric-Phase Fields (MPFs), a decoupled implicit representation that separates metric proximity from topological phase. Given an unoriented point cloud, MPFs learn (i) an unsigned metric field $r$ and (ii) a smooth phase field $θ$, for which we derive a bounded phase indicator $P=\tanh(βθ)$ that provides soft inside-outside cues where they are meaningful. We couple the two fields via a gated-metric formulation with a residual phase injection to obtain a signed implicit function with stable near-surface gradients. The phase coefficient $β$ is learnable, allowing MPFs to adaptively control the sharpness of the phase transition and the degree of saturation of the soft sign indicator. Experiments on both synthetic and scanned thin-shell and thin-plate shapes demonstrate that MPFs preserve thin and layered structures more faithfully than recent SDF-based methods, while also enabling more robust training and more reliable surface extraction than UDF-based approaches. Check out https://github.com/JIAYI-Scarlett/ICML2026-MPF for source code and test models.
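To make the role of the phase coefficient concrete, the following minimal Python sketch evaluates the bounded phase indicator P = tanh(βθ) from the abstract for a few values of β. Only that formula is taken from the abstract; the function name and the sample values are illustrative, and the gated-metric coupling with the metric field r is deliberately not reproduced because the abstract does not specify it.

```python
import math


def phase_indicator(theta: float, beta: float) -> float:
    """Soft sign indicator P = tanh(beta * theta), bounded in [-1, 1]."""
    return math.tanh(beta * theta)


if __name__ == "__main__":
    theta = 0.1  # illustrative phase value slightly on the positive side
    for beta in (1.0, 10.0, 100.0):
        # A larger beta pushes the indicator toward +1 or -1 more quickly,
        # i.e. a sharper phase transition and stronger saturation.
        print(beta, phase_indicator(theta, beta), phase_indicator(-theta, beta))
```

The indicator is odd in θ and vanishes at θ = 0, so its sign follows the sign of the phase field while its magnitude stays bounded, which matches the "soft inside-outside cue" reading in the abstract.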

2605.25500 2026-05-26 cs.CV 版本更新

Full-4D: Generating Full-Scope 4D Scenes from a Single-View Video

Full-4D:从单视角视频生成全范围4D场景

Tingxi Chen, Ke Hao, Yabo Chen, Zhengxue Cheng, Rong Xie, Li Song, Haibin Huang, Chi Zhang, Xuelong Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Institute of Artificial Intelligence, China Telecom (TeleAI)(中国电信人工智能研究院)

AI总结 提出一种将单视角视频转换为全范围4D场景的框架,通过多视角视频合成和基于优化的4D重建,引入大规模数据集Real-MV-4D、融合时间-视图注意力的扩散模型和流匹配蒸馏损失,实现高保真度和几何一致性。

详情
AI中文摘要

从单视角视频生成4D场景本质上是不适定的:单一视角缺乏恢复完整、动态场景所需的信息。现有方法通常局限于单目视频、简单的3D效果或仅在原始视角附近的小视角扰动,未能实现真正的4D生成。同时,缺乏捕捉全范围4D场景的大规模同步多视角视频数据集进一步阻碍了这一方向的发展。我们提出了一种新颖的单视角视频到4D框架,将全范围4D生成视为多视角视频合成,然后从生成的视角进行基于优化的4D重建。为了端到端地实例化这一公式,我们做出了三个关键贡献。首先,我们引入了Real-MV-4D,一个大规模数据集,包含在多样化真实环境中捕获的同步多视角视频,以提供4D监督。其次,我们训练了一个多视角视频扩散模型,该模型由一种新颖的融合时间(T)-视图(V)注意力机制驱动,直接将几何重投影先验和显式相机条件嵌入到其视图-时间交互中。与基本的特征融合不同,这种直接绑定严格地将生成过程与物理3D先验对齐,以生成密集、同步的T×V视频网格。第三,我们不依赖非交互且不一致的2D视频插值,而是将合成的多视角视频提升为显式4D表示(即4DGS),并通过流匹配蒸馏损失进行正则化,利用多视角先验改进新视角渲染。大量实验表明,我们的方法在视觉保真度和几何一致性方面均优于现有方法,实现了从单视角视频生成全范围4D场景。

英文摘要

Generating 4D scenes from a single-view video is inherently ill-posed: a single viewpoint lacks the information needed to recover a complete, dynamic scene with full coverage. Existing methods are typically limited to monocular videos, simple 3D effects, or only small viewpoint perturbations around the original viewpoint, falling short of true 4D generation. Meanwhile, the lack of large-scale datasets capturing full-scope 4D scenes with synchronized multi-view videos further hinders progress in this direction. We propose a novel single-view video-to-4D framework that casts full-scope 4D generation as multi-view video synthesis followed by optimization-based 4D reconstruction from the generated views. To instantiate this formulation end-to-end, we make three key contributions. First, we introduce Real-MV-4D, a large-scale dataset of synchronized multi-view videos captured in diverse real-world environments to provide the 4D supervision. Second, we train a multi-view video diffusion model driven by a novel fused time(T)-view(V) attention mechanism that directly embeds geometric reprojection priors and explicit camera conditioning into its view-time interactions. Unlike basic feature fusion, this direct binding strictly aligns the generation process with physical 3D priors to produce a dense, synchronized T×V video grid. Third, rather than relying on non-interactive and inconsistent 2D video interpolations, we lift the synthesized multi-view videos into an explicit 4D representation (i.e., 4DGS), regularized by a Flow Matching Distillation loss that exploits the multi-view prior to improve novel-view rendering. Extensive experiments demonstrate that our method outperforms existing approaches in both visual fidelity and geometric consistency, enabling full-scope 4D scene generation from single-view videos.

2605.25495 2026-05-26 cs.RO cs.CV 版本更新

RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation

RepSAM: 通过表示引导的适应连接基础模型与机器人视觉

Wenhui Chu

发表机构 * Department of Computer Science and Engineering, Texas A&M University(德克萨斯农工大学计算机科学与工程系)

AI总结 针对基础模型在非结构化机器人视觉场景中性能下降的问题,提出RepSAM框架,通过CKA引导的秩分配策略和多模态融合模块实现参数高效微调,在减少158倍可训练参数的同时达到全微调97.9%的性能。

Comments Accepted to IJCAI-ECAI 2026 (Special Track on AI and Robotics). 8 pages, 4 figures, 12 tables

详情
AI中文摘要

尽管SAM等基础模型具有零样本能力,但在非结构化环境中的机器人感知仍然具有挑战性。本文将性能下降归因于Transformer层间非均匀的表示偏移:浅层表现出显著的领域差距(CKA < 0.5),而深层则有效迁移(CKA > 0.7)。基于这一观察,我们提出RepSAM,一种表示引导的参数高效微调(PEFT)框架,用于将基础模型适应到机器人视觉。RepSAM采用理论基础的CKA引导秩分配策略,结合多模态融合模块,以稳健处理具有挑战性的机器人场景,包括透明物体和杂乱场景。在六个基准和机器人操作任务上的实验评估表明,RepSAM达到了全微调性能的97.9%(89.0% vs. 90.9% mIoU),同时将可训练参数减少了158倍(从632M降至4.0M)。RepSAM在单个A100 GPU上仅需4小时训练(比全微调减少96倍,全微调需要384 GPU小时),即可比DoRA提高7.9%的mIoU。这些改进具有统计显著性(p < 0.01),并在机器人操作成功率上比LoRA(RGB)基线绝对提高了12.0%。

英文摘要

Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps (CKA < 0.5), whereas deep layers transfer effectively (CKA > 0.7). Based on this observation, we propose RepSAM, a representation-guided parameter-efficient fine-tuning (PEFT) framework for adapting foundation models to robotic vision. RepSAM employs a theoretically grounded CKA-guided rank allocation strategy combined with a multi-modal fusion module for robust handling of challenging robotic scenarios, including transparent objects and cluttered scenes. Experimental evaluation across six benchmarks and robotic manipulation tasks demonstrates that RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). RepSAM outperforms DoRA by 7.9% mIoU with just 4 hours of training on a single A100 GPU (a 96x reduction from full fine-tuning, which takes 384 GPU-hours). These improvements are statistically significant (p < 0.01) and translate to a 12.0% absolute improvement in robotic manipulation success rates over the LoRA (RGB) baseline.
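The efficiency ratios quoted in the abstract can be reproduced with a few lines of Python arithmetic. All inputs are the figures stated in the abstract; the variable names are only illustrative.

```python
# Trainable parameters in millions: full fine-tuning vs. RepSAM.
full_params_m, repsam_params_m = 632.0, 4.0
param_reduction = full_params_m / repsam_params_m      # 158x fewer parameters

# Training cost in GPU-hours: full fine-tuning vs. RepSAM on one A100.
full_gpu_hours, repsam_gpu_hours = 384.0, 4.0
time_reduction = full_gpu_hours / repsam_gpu_hours     # 96x less training time

# mIoU in percent: RepSAM vs. full fine-tuning.
repsam_miou, full_miou = 89.0, 90.9
relative_performance = 100.0 * repsam_miou / full_miou  # share of full FT mIoU

if __name__ == "__main__":
    print(param_reduction, time_reduction, round(relative_performance, 1))
```

The three derived values agree with the 158x, 96x, and 97.9% figures in the abstract, so the headline ratios are internally consistent with the raw numbers.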

2605.25488 2026-05-26 cs.CV cs.AI cs.MM 版本更新

Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

测试时自适应条件用于稳定音频驱动说话头生成

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

发表机构 * School of Business, University of New South Wales (UNSW)(新南威尔士大学商学院) School of Engineering and Built Environment, Griffith University(格里菲斯大学工程与环境学院)

AI总结 提出一种无需参数训练的测试时自适应条件框架(TT-SAC),通过反馈循环调整条件表示,提升预训练说话头生成器的身份保持、时间一致性和感知质量。

Comments Research report

详情
AI中文摘要

音频驱动的说话头生成在AniTalker、FLOAT和Sonic等最新模型中取得了显著进展。尽管取得了成功,大多数现有方法在推理阶段依赖单一静态参考图像来调节整个视频生成过程。这种静态条件范式通常导致固定身份特征与动态面部运动之间的不匹配,从而引起身份漂移、时间不一致性和感知质量下降。我们引入了测试时自适应条件(TT-SAC),这是一个无需参数的推理框架,使预训练的说话头生成器能够在推理过程中调整其条件表示,而无需重新训练、梯度更新或额外监督。TT-SAC不是将参考肖像视为不可变的,而是将生成器与其编码器组合成一个反馈循环:生成器自身的输出被重新编码,以构建一个更符合合成序列时间动态的精细条件表示。单次自适应步骤近似于生成过程的自洽平衡,稳定了跨时间的身份和运动。我们进一步提供了理论分析,表明在温和的Lipschitz假设下,测试时条件自适应减少了特征方差并提高了生成稳定性,同时表现出原则性的偏差-方差权衡,该权衡决定了自适应最优强度。在最新说话头生成器和基准数据集上的大量实验表明,在唇形同步准确性、时间一致性、身份保持和感知保真度方面均有持续改进。TT-SAC提供了一种模型无关且无需训练的策略来增强生成视频模型,将测试时条件自适应确立为稳定音频驱动肖像动画的有效机制。

英文摘要

Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference time. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that enables pretrained talking-head generators to adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator's own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion across time. We further provide theoretical analysis showing that test-time conditioning adaptation reduces feature variance and improves generative stability under mild Lipschitz assumptions, while exhibiting a principled bias-variance tradeoff that governs the optimal strength of adaptation. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic and training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.

2605.25479 2026-05-26 cs.CV 版本更新

MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

MAIL++: 视觉语言模型的多模态双向智能体层

Kaixiang Chen, Pengfei Fang, Hui Xue

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用教育部重点实验室(东南大学),中国)

AI总结 提出MAIL/MAIL++方法,通过将跨模态耦合嵌入VLM内在计算模块并引入双向桥接,实现参数高效微调,在少样本分类和跨域检索中超越现有方法。

详情
AI中文摘要

将大型视觉语言模型(如CLIP)适应下游任务仍然具有挑战性,因为全微调计算成本高且在小数据场景下容易过拟合。参数高效微调(PEFT)通过轻量级提示或适配器模块缓解了这些问题,而跨模态耦合通过增强视觉和语言之间的交互被证明特别有效。然而,现有的耦合机制主要依赖外部辅助模块,导致间接、粗粒度的交互,这些交互在结构上与原始VLM解耦,从而限制了表示的表达能力。在本文中,我们提出了多模态交互智能体层(MAIL),这是一种PEFT范式,将跨模态耦合直接嵌入VLM的内在计算模块中。MAIL冻结主干网络,并在核心模块(如LayerNorm)之后插入轻量级智能体层,以近似全微调引起的参数更新。为了在这一层面耦合视觉和文本流,我们引入了一个基于瓶颈的文本到图像桥,该桥联合优化跨模态的成对智能体层,协调相应计算模块的适应。我们进一步提出了MAIL++,它通过元智能体层、元文本桥和元图像桥实现了双向跨模态交换。在推理时,所有智能体层被重参数化到冻结的主干网络中,保持原始计算效率。在少样本图像分类和少样本通用跨域检索上的大量实验表明,MAIL和MAIL++始终优于最先进的PEFT方法。

英文摘要

Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.

2605.25461 2026-05-26 cs.CV 版本更新

MetaphorVU: Towards Metaphorical Video Understanding

MetaphorVU:迈向隐喻视频理解

Zhuoqun Li, Boxi Cao, Guiping Jiang, Fangrui Lv, Ruotong Pan, Jianan Wang, Xiangyu Wu, Hongyu Lin, Yaojie Lu, Yong Du, Ruyin Jia, Liyan, Tingting Gao, Han Li, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所中文信息处理实验室) University of Chinese Academy of Sciences(中国科学院大学) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对当前多模态大语言模型在隐喻视频理解上的不足,提出首个系统性基准MetaphorVU-Bench,并设计基于隐喻知识图谱的推理增强框架MetaphorBoost,显著提升模型性能。

Comments ICML 2026 spotlight

详情
AI中文摘要

隐喻视频在各种现实场景中广泛存在,用于传达复杂思想,理解它们通常需要高阶认知能力。对隐喻视频理解缺乏系统性研究不仅限制了多模态大语言模型(MLLMs)的现实应用,也阻碍了对其高阶认知能力的全面评估。为填补这一空白,我们提出了MetaphorVU-Bench,这是首个专门用于隐喻视频理解的系统性和综合性基准。通过实验,我们发现当前的MLLMs在准确的隐喻视频理解上存在困难,远落后于人类水平,主要原因是跨域映射存在缺陷。受此发现启发,我们构建了一个隐喻知识图谱作为映射增强,并提出了MetaphorBoost,一个推理时增强框架,实现了持续的性能提升。我们的基准、分析和方法为未来推进MLLMs的研究提供了有用的见解和基础。

英文摘要

Metaphorical videos are prevalent across various real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of MLLMs but also impedes the thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Through experiments, we find current MLLMs struggle with accurate metaphorical video understanding, lagging far behind human level, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework achieving consistent performance improvement. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.

2605.25442 2026-05-26 cs.CV 版本更新

Enhancing Single-Image Facial Demorphing using Multimodal Large Language Models

利用多模态大语言模型增强单图像面部去变形

Nitish Shukla, Arun Ross

发表机构 * IEEE

AI总结 提出一种基于多模态大语言模型引导的耦合扩散重建框架,通过提取中间层语义嵌入作为条件,实现无参考的面部去变形,恢复构成图像并保持身份一致性。

详情
AI中文摘要

人脸识别系统越来越容易受到变形攻击,其中合成图像被制作成匹配多个身份,从而实现未经授权的访问和身份欺诈。现有的检测方法可以识别变形图像,但无法恢复构成图像或身份,限制了其取证实用性。本文提出了一种新颖的无参考面部去变形框架,利用多模态大语言模型(MLLMs)引导耦合的扩散重建过程。我们的关键创新在于从MLLM中间层提取语义嵌入以调节去变形过程,提供关于面部属性和身份线索的高级推理,补充低级像素信息。我们将去变形表述为一个耦合的条件生成问题,其中两个构成人脸通过直接在RGB域中操作的去噪扩散模型联合合成,确保身份间一致性,同时保留细粒度的感知细节。与依赖于压缩潜在表示或假设训练集和测试集之间身份重叠的先前方法不同,我们的方法通过直接利用MLLM隐藏状态作为条件信号,绕过了有损的文本生成-重新编码循环,使去噪网络能够关注细微的视觉线索,如头发、背景和面部纹理。消融研究进一步揭示,MLLM中间层编码了更具身份判别性的表示,RGB域去变形在严格操作点上的性能优于潜在空间方法30-40%,并且完整的MLLM嵌入通过多模态预训练的增强语义结构,比原始ViT特征提供了显著优势。

英文摘要

Face recognition systems are increasingly vulnerable to morphing attacks, where a composite image is crafted to match multiple identities, enabling unauthorized access and identity fraud. Existing detection methods identify morphed images but cannot recover constituent images or identities, limiting their forensic utility. This paper presents a novel reference-free facial demorphing framework that leverages Multimodal Large Language Models (MLLMs) to guide a coupled diffusion-based reconstruction process. Our key innovation lies in extracting semantic embeddings from intermediate MLLM layers to condition the demorphing, providing high-level reasoning about facial attributes and identity cues that complement low-level pixel information. We formulate demorphing as a coupled conditional generation problem, where both constituent faces are synthesized jointly through a denoising diffusion model operating directly in the RGB domain, ensuring inter-identity consistency while preserving fine-grained perceptual details. Unlike prior approaches that rely on compressed latent representations or assume identity overlap between training and testing sets, our method bypasses lossy text generation-reencoding cycles by directly utilizing MLLM hidden states as conditioning signals, enabling the denoising network to attend to subtle visual cues such as hair, background, and facial textures. Ablation studies further reveal that middle MLLM layers encode more identity-discriminative representations, RGB-domain demorphing outperforms latent-space approaches by 30-40% at strict operating points, and full MLLM embeddings provide substantial advantages over raw ViT features through enhanced semantic structuring from multimodal pretraining.

2605.25437 2026-05-26 cs.CV 版本更新

Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

看见更多意味着知道更多吗?基于单锚优势归一化的多源视觉推理

Fanhu Zeng, Zhicong Luo, Zefan Wang, You Li, Chi Chen, Maosong Sun

发表机构 * Tsinghua University(清华大学) Northwest Polytechnical University(西北工业大学) Beijing Jiaotong University(北京交通大学)

AI总结 针对多源视觉推理中现有方法无法区分信息增益与干扰的问题,提出MARS框架,通过单源奖励作为动态锚点,将多源融合的信息增益显式纳入优势归一化,在强化学习中自适应增强源间互促并抑制噪声,在GRPO和DAPO上分别提升3.2%和4.9%。

Comments preprint

详情
AI中文摘要

通过可验证奖励的强化学习(RLVR)进行视觉推理已取得显著进展。然而,在处理多源输入时,现有方法倾向于将其视为信息的简单累积,缺乏明确机制来区分整合额外源是否带来信息增益或引入干扰。因此,它们在整合多个源时难以有效建模动态交互,特别是当这些源在物理属性和语义上差异显著时(例如红外和深度),导致当某个源包含主导信号时,性能甚至低于单源推理。为解决此问题,我们提出MARS,一种新颖的基于单锚的多源推理框架,将每个视觉模态建模为独立信息源。具体而言,通过将单源奖励视为动态锚点,我们的方法将多源融合引入的信息增益显式纳入优势归一化,并在RLVR中自适应地强调源间的相互促进,同时抑制潜在噪声或冲突。从理论分析来看,我们的方法有效量化了梯度估计中多源整合引入的信息增益,实现了模态的一致调节。实验结果也表明,在GRPO和DAPO上,跨不同数据集分别取得了3.2%和4.9%的性能提升,证实了方法的有效性。

英文摘要

Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of information, lacking explicit mechanisms to distinguish whether integrating additional sources yields information gain or introduces interference. Therefore, they struggle to effectively model dynamic interaction when integrating multiple sources, particularly when the sources differ significantly in physical properties and semantics, e.g., infrared and depth, leading to performance inferior to mono-source reasoning when a certain source holds the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization and adaptively emphasizes mutual promotion between sources while suppressing potential noise or conflicts during RLVR. From theoretical analysis, our method effectively quantifies the information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirical results also show performance gains of 3.2% and 4.9% on GRPO and DAPO, respectively, across diverse datasets, confirming the effectiveness of our method.

2605.25427 2026-05-26 cs.CV cs.AI 版本更新

Binding Visual Features Point by Point

逐点绑定视觉特征

Udith Haputhanthri, Declan Campbell, Rim Assouel, Jonathan D. Cohen, Taylor W. Webb

发表机构 * Princeton University(普林斯顿大学) Mila – Quebec AI Institute(魁北克AI研究所) Université de Montréal(蒙特利尔大学)

AI总结 研究通过文本引导的“指向”机制解决视觉语言模型在多目标场景中的绑定问题,发现该机制诱导内部视觉搜索程序,消除绑定错误并实现组合泛化。

详情
AI中文摘要

尽管在标准基准测试中取得了成功,但视觉语言模型在处理涉及多目标场景的任务时仍表现出持续的失败,包括许多对人类来说相对容易的任务。最近的研究发现,这些失败可能源于在上下文中准确绑定对象特征的基本能力缺失,这在认知科学和神经科学中被称为“绑定问题”。人类视觉系统被认为通过串行处理来解决这一绑定问题,即一次只关注一个对象,以避免来自其他对象的干扰。最近的研究提出了“指向”——使用显式空间坐标来指代对象——作为视觉语言模型的类似解决方案,并发现它提高了具有挑战性的多目标任务的性能。然而,目前尚不清楚这种方法为何(即在机制或表征层面)能提高性能,以及这与人类视觉中的串行处理有何直接关系。本文研究了这一问题。我们发现,通过文本学习指向会诱导内部视觉搜索程序,并描述了支持这一过程的机制。我们还发现,指向行为可以通过微调泛化到新任务,并且这样做可以消除绑定错误并实现组合泛化。这些结果提供了一个原理证明,即串行处理可以像解决生物视觉中的绑定问题一样,解决视觉语言模型中的绑定问题。

英文摘要

Despite success on standard benchmarks, vision language models display persistent failures on tasks involving processing of multi-object scenes, including many tasks that are relatively easy for humans. Recent work has found that these failures may stem from a basic inability to accurately bind object features in-context, a challenge that is referred to as the "binding problem" in cognitive science and neuroscience. The human visual system is thought to solve this binding problem via serial processing, attending to individual objects one at a time so as to avoid interference from other objects. Recent work has proposed "pointing" -- the use of explicit spatial coordinates to refer to objects -- as an analogous solution for vision language models, and found that it improves performance on challenging multi-object tasks. However, it is unclear why (i.e., on a mechanistic or representational level) this approach improves performance, and how directly this relates to serial processing in human vision. Here, we investigate this question. We find that learning to point-via-text induces an internal visual search routine, and we characterize the mechanisms that support this procedure. We also find that pointing behavior can be generalized to new tasks via fine-tuning, and that doing so eliminates binding errors and enables compositional generalization. These results provide a proof-of-principle that serial processing can solve the binding problem for vision language models just as it does for biological vision.

2605.25426 2026-05-26 cs.GR cs.CV 版本更新

Learning View-Dependent Splatting Kernels

学习视图相关的溅射核

Huakeng Ding, Zhanpeng Liu, Fan Pei, Kun Zhou, Hongzhi Wu

发表机构 * State Key Lab of CAD and CG, Zhejiang University(浙江大学计算机辅助设计与图形学国家重点实验室) Hangzhou Research Institute of Holographic and AI Technology(杭州全息与人工智能技术研究院)

AI总结 提出一种可微框架,通过自动学习视图相关的2D核,在基于溅射的管线中提升新视角合成质量与表示效率。

Comments Accepted to SIGGRAPH 2026. 10 pages, 8 figures

详情
AI中文摘要

我们提出一种可微框架,在基于溅射的管线中自动学习视图相关的2D核,以提升新3D视角合成的重建质量和表示效率。我们的体积基元定义为边界椭球体和3D核潜向量。首先学习一个投影网络,以椭球体属性和3D核潜向量为输入,输出2D核潜向量。接着,结果送入解码器,生成关于马氏距离的径向对称2D核,受投影椭球体约束。神经网络与每个基元的属性联合优化。在标准基准上展示了我们方法的有效性,与最先进的分析和学习的核技术相比具有优势。最后,我们将该思想扩展到学习用于2D溅射以及图像表示的通用2D核。

英文摘要

We present a differentiable framework to automatically learn view-dependent 2D kernels in a splatting-based pipeline to improve reconstruction quality and representation efficiency for novel 3D view synthesis. Our volumetric primitive is defined as a bounding ellipsoid and a 3D-kernel latent vector. We first learn a projection network to output a 2D-kernel latent, taking the attributes of the ellipsoid and the 3D-kernel latent as input. Next, the result is sent to a decoder to produce a radially symmetric 2D kernel in terms of Mahalanobis distance, bounded by the projected ellipsoid. The neural networks along with per-primitive attributes are jointly optimized. The effectiveness of our approach is demonstrated on standard benchmarks, comparing favorably against state-of-the-art techniques on both analytical and learned kernels. Finally, we extend the idea to learn general 2D kernels for 2D splatting as well as image representation.

2605.25418 2026-05-26 cs.CV cs.GR cs.LG 版本更新

Generating 3D models from sketches of human faces using a combined approach of Convolutional Neural Networks, Procedural Modeling, and Contour Mapping

利用卷积神经网络、程序化建模和轮廓映射的联合方法从人脸素描生成3D模型

Nancy Iskander

发表机构 * Behaviour Digital

AI总结 提出一种结合卷积神经网络、参数化3D人脸模型和主动蛇形轮廓的新方法,首次通过训练CNN检测素描中的表情并生成对应3D模型。

Comments A thesis submitted in conformity with the requirements for the degree of Master of Science in Computer Science Graduate Department of Computer Science University of Toronto

详情
AI中文摘要

从人脸素描生成3D模型是计算机图形学中的一个活跃研究课题,因为它有潜力极大地促进专业3D艺术家和新手的建模工作。受面部表情显著改变和塑造面部轮廓这一观察的启发,我们的方法结合了表情检测和3D模型生成。结果是一种从素描生成3D模型的新方法,它依赖于三个组成部分:卷积神经网络、参数化3D人脸模型(Valley Girl)和主动蛇形轮廓。在文献中首次,CNN(使用我们自己生成的数据集)被训练通过检测活跃的FACS动作单元来识别给定素描中的表情。然后,该表情被复制到Valley Girl上以获得具有相似表情的3D模型。接着,使用主动蛇形轮廓来找到所需的变换,以缩小该模型与给定素描之间的差距。

英文摘要

Generating 3D models from face sketches is an active topic of research in Computer Graphics due to its potential to tremendously facilitate the modeling of faces for both professional 3D artists and novices. Motivated by the observation that facial expressions are responsible for significantly altering and shaping the contours in our faces, we combine both expression detection and 3D model generation in our approach. The result is a novel approach to generating 3D models from sketches which relies on three components: Convolutional Neural Networks, a parametric 3D face model (Valley Girl), and Active Snake Contours. For the first time in the literature, CNNs are trained (using our own generated dataset) to detect the expression in the given sketch through detecting the active FACS Action Units. The expression is then duplicated on Valley Girl to obtain a 3D model with a similar expression. Active Snake Contours are then used to find the transforms needed to close the gaps between that model and the given sketch.

2605.25409 2026-05-26 cs.CV 版本更新

MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model

MTLLFM: 多模态时间笑声定位——UR-FUNNY-Temporal和SMILE-Temporal基准数据集与自适应多模态融合模型

Eyal Hanania, Nadav Kirsch, Daniel Arkushin, Jonathan Benvenisti, Amos Bercovich, Elie Zemmour, Sahar Froim

发表机构 * WSC-Sports(WSC-体育)

AI总结 针对现有方法无法精确捕捉短暂笑声事件时间边界的问题,本文提出两个完全标注的时间笑声数据集(UR-FUNNY-Temporal和SMILE-Temporal)和一个轻量级弱监督框架,通过固定HuBERT和MAE编码器结合时间softmax池化与自适应模态门控,实现从片段级标签到帧级时间定位,在体育广播数据上达到99% F1和68.1%定位精度,并将下游笑声推理CIDEr提升227%。

Comments Accepted to the Workshop on Affective & Behavior Analysis in-the-wild, CVPR 2026

详情
AI中文摘要

在视频中检测笑声对于情感计算和叙事理解至关重要,但现有方法将其视为粗粒度的片段级分类,无法捕捉短暂、瞬态笑声事件的精确时间边界。我们通过两个互补的贡献填补了这一空白。首先,我们引入了UR-FUNNY-Temporal和SMILE-Temporal,这是两个完全标注的时间笑声数据集,扩展了广泛使用的幽默基准。我们的标注覆盖超过11,053个视频(78.8小时),并为每个笑声事件提供精确的起始/结束边界,以及区分说话者与观众笑声、模态主导性(声学、视觉或两者)和强度级别的丰富元数据。其次,我们提出了一个轻量级弱监督框架用于时间笑声定位。我们的架构将固定的HuBERT和MAE编码器与时间softmax池化和自适应模态门控相结合,从片段级标签学习细粒度的时间定位,而无需在训练期间使用帧级标注。在三个数据集上的实验表明,我们的方法显著优于包括Gemini 3 Flash在内的多模态基础模型,在体育广播数据上达到99%的F1和68.1%的定位精度。消融实验验证了每个架构组件。此外,我们的精确时间标签将下游笑声推理的CIDEr提升了227%,使GPT-3.5能够超越GPT-4o。代码、UR-FUNNY-Temporal和SMILE-Temporal数据集已在https://github.com/WSCSports/MTLLFM-temporal-laughter-localization公开。

英文摘要

Detecting laughter in video is essential for affective computing and narrative understanding, yet existing approaches treat it as coarse clip-level classification, failing to capture precise temporal boundaries of brief, transient laughter events. We address this gap with two complementary contributions. First, we introduce UR-FUNNY-Temporal and SMILE-Temporal, fully annotated temporal laughter datasets extending two widely-used humor benchmarks. Our annotations cover over 11,053 videos (78.8 hours) and provide precise onset/offset boundaries for each laughter event, along with rich metadata distinguishing speaker vs. audience laughter, modality dominance (acoustic, visual, or both), and intensity levels. Second, we propose a lightweight weakly-supervised framework for temporal laughter localization. Our architecture combines fixed HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating, learning fine-grained temporal grounding from clip-level labels without requiring frame-level annotations during training. Experiments across three datasets demonstrate that our approach substantially outperforms multimodal foundation models including Gemini 3 Flash, achieving 99% F1 and 68.1% localization precision on sports broadcast data. Ablations validate each architectural component. Furthermore, our precise temporal tags improve downstream laughter reasoning by 227% on CIDEr, enabling GPT-3.5 to outperform GPT-4o. The code, UR-FUNNY-Temporal and SMILE-Temporal datasets are publicly available at https://github.com/WSCSports/MTLLFM-temporal-laughter-localization.
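The abstract above names temporal softmax pooling as the step that turns frame-level evidence into a clip-level prediction trainable from clip labels alone. The sketch below shows one common form of that pooling, not necessarily the authors' formulation: the function name, the `temperature` parameter, and the use of plain Python lists are all assumptions made for illustration.

```python
import math


def temporal_softmax_pool(frame_scores, temperature=1.0):
    """Pool per-frame scores into one clip-level score.

    Each frame is weighted by softmax(score / temperature), so frames with
    high scores dominate the clip-level value. This is a generic sketch and
    is not taken from the paper's code.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(frame_scores)
    weights = [math.exp((s - peak) / temperature) for s in frame_scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, frame_scores)) / total
```

With a small temperature the pooled value approaches the maximum frame score, and with a large temperature it approaches the mean. Because the clip-level score is a weighted sum of frame scores, a loss on clip labels can still send gradient to the individual frames that carry the event.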

2605.25407 2026-05-26 cs.CV 版本更新

Towards Active Real-to-Twin Inspection: A New Paradigm for Zero-Shot Anomaly Detection

迈向主动实景到数字孪生检测:零样本异常检测的新范式

Jiaxuan Liu, Yunkang Cao, Yufeng Chen, Chunyang Li, Yuhuan Du, Hui Zhang

发表机构 * National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University(机器人视觉感知与控制技术国家工程研究中心,湖南大学)

AI总结 提出Real-to-Twin异常检测任务,通过AVATAR框架学习实景与CAD数字孪生之间的语义对齐,实现零样本异常定位。

Comments 6 pages, 4 figures, accepted to IEEE-CYBER 2026, Florence, Italy

详情
AI中文摘要

零样本异常检测(AD)在具身工业检测中的部署受到其依赖被动、固定视角2D图像的严重制约。这种固有形式无法适应真实环境中所需的主动、动态观测。为突破这一限制,我们引入了实景到数字孪生异常检测(Real-to-Twin Anomaly Detection),这是一项新颖的任务,直接针对几何匹配的CAD数字孪生评估物理观测。为应对这一新任务,我们提出了AVATAR框架,旨在学习实景与数字孪生之间的鲁棒语义对齐。通过仅使用无缺陷对来弥合良性的Sim2Real领域差距,AVATAR有效地将CAD先验转化为动态、无异常的参考。这种优雅的公式使模型能够以零样本方式将各种异常定位为不可对齐的偏差,消除了对缺陷标注的需求。大量实验表明,AVATAR显著优于改编的最先进基线,对严重的视角变化表现出卓越的鲁棒性。代码和数据集将公开提供。

英文摘要

The deployment of zero-shot anomaly detection (AD) in embodied industrial inspection is severely bottlenecked by its reliance on passive, fixed-viewpoint 2D imagery. Such formulations inherently fail to accommodate the active, dynamic observations required in real-world environments. To break this limitation, we introduce Real-to-Twin Anomaly Detection, a novel task that evaluates physical observations directly against geometrically matched CAD Digital Twins. To tackle this new task, we propose AVATAR, a framework designed to learn robust semantic alignment between Real and Digital Twins. By bridging benign Sim2Real domain gaps using only defect-free pairs, AVATAR effectively transforms CAD priors into dynamic, anomaly-free references. This elegant formulation enables the model to localize diverse anomalies in a zero-shot manner as unalignable deviations, eliminating the need for defect annotations. Extensive experiments demonstrate that AVATAR substantially outperforms adapted state-of-the-art baselines, exhibiting exceptional robustness to severe viewpoint variations. The code and dataset will be made publicly available.

2605.25396 2026-05-26 cs.CV cs.AI 版本更新

Subspace-Guided Semantic and Topological Invariant Registration for Annotation-Free Ultrasound Plane Quality Control

子空间引导的语义与拓扑不变配准用于无标注超声平面质量控制

Chunzheng Zhu, Jianxin Lin, Feng Wang, Cheng Jiang, Guanghua Tan, Zhenyu Zhou, Shengli Li, Kenli Li

发表机构 * Hunan University(湖南大学) Shenzhen Maternity and Child Healthcare Hospital(深圳妇幼保健医院)

AI总结 提出STRIQ框架,通过子空间引导的配准一致性度量,实现无标注超声平面质量控制,达到与临床质量评分的最优相关性。

Comments MICCAI 2026 Accepted Paper; Subspace-Guided Registration for Ultrasound Quality Control

详情
AI中文摘要

超声图像的可靠质量控制对于实时采集指导和回顾性临床审计至关重要,然而现有方法严重依赖逐平面标注,或采用在临床采集固有空间变形下易产生系统性偏差的伪标签。我们提出STRIQ,一种基于配准的框架,将无标注超声平面质量控制重新定义为子空间引导的一致性度量问题。具体而言,STRIQ引入潜在配准对齐器(LRA)以建立查询图像与方差驱动锚点之间的层次特征空间对应,这些锚点通过方差谱准则从无标签数据中自主提炼,作为结构稳定的原型。为进一步区分解剖平面并减轻负知识迁移,我们提出正交知识子空间(OKS)模块。OKS将平面特定表示分解为相互正交的子空间,实现细粒度专家协作同时防止平面间干扰,确保质量度量基于原则性的子空间邻近性。在内部US4QA和公开CAMUS数据集上的大量实验表明,STRIQ实现了与临床质量评分的最优相关性,为无标注、实时可靠的超声质量控制建立了新范式。我们的代码可在https://github.com/zhcz328/STRIQ获取。

英文摘要

Reliable quality control (QC) of ultrasound images is essential for both real-time acquisition guidance and retrospective clinical audit, yet existing approaches rely heavily on per-plane annotations, or employ pseudo-labeling prone to systematic bias under spatial deformations inherent in clinical acquisition. We present STRIQ, a registration-driven framework that recasts annotation-free US plane quality control as a subspace-guided consistency measurement problem. Specifically, STRIQ introduces a Latent Registration Aligner (LRA) to establish hierarchical feature space correspondences between query images and variance-driven anchors, which are autonomously distilled from unlabeled data via a variance spectrum criterion to serve as structurally stable prototypes. To further disambiguate anatomical planes and mitigate negative knowledge transfer, we propose an Orthogonal Knowledge Subspace (OKS) module. The OKS decomposes plane-specific representations into mutually orthogonal subspaces, enabling fine-grained expert collaboration while preventing inter-plane interference, ensuring that the quality metric is grounded in principled subspace proximity. Extensive experiments on the in-house US4QA and public CAMUS datasets demonstrate that STRIQ achieves state-of-the-art correlation with clinical quality scores, establishing a new paradigm for annotation-free, real-time reliable ultrasound quality control. Our code is available at https://github.com/zhcz328/STRIQ.

2605.25385 2026-05-26 cs.CV cs.AI 版本更新

Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance

基于SAM模型和掩码引导的弱监督伪装目标检测

Xia Li, Xinran Liu, Lin Qi, Junyu Dong

发表机构 * School of Computer Science and Technology, Ocean University of China, Qingdao 266100, China(中国海洋大学计算机科学与技术学院,青岛266100,中国)

AI总结 提出MGNet网络,利用SAM模型生成伪标签,通过级联掩码解码器、上下文增强模块和掩码引导特征聚合模块,实现弱监督伪装目标检测,性能与全监督方法相当。

Comments 18 pages

详情
AI中文摘要

伪装目标检测(COD)由于目标与背景高度相似,是一项具有挑战性的任务。现有的全监督方法需要耗费大量人力进行像素级标注,因此弱监督方法成为平衡精度与标注效率的可行折中方案。然而,由于使用粗标注,弱监督方法常出现性能下降。本文提出一种新的弱监督伪装目标检测方法以克服这些限制。具体地,我们设计了一个新颖的网络MGNet,通过利用自定义级联掩码解码器(CMD)生成的初始掩码来引导分割过程并增强边缘预测,从而解决边缘模糊和漏检问题。我们引入上下文增强模块(CEM)以减少漏检,以及掩码引导特征聚合模块(MFAM)进行有效的特征聚合。针对弱监督挑战,我们提出BoxSAM,利用带有边界框提示的Segment Anything Model(SAM)生成伪标签。通过采用冗余处理策略,为训练MGNet提供高质量的像素级伪标签。大量实验表明,我们的方法在性能上与当前最先进方法具有竞争力。

英文摘要

Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high-quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.

2605.25377 2026-05-26 cs.CV cs.AI 版本更新

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

对抗正交解缠用于LVLM幻觉缓解

Ruoxi Cheng, Haoxuan Ma, Zhengfei Hai, Yiyan Huang, Ranjie Duan, Tianle Zhang, Xu Yang, Ziyi Ye, Xingjun Ma

发表机构 * Fudan University(复旦大学) Tencent(腾讯) Nanjing University(南京大学) Southeast University(东南大学) Great Bay University(大湾区大学) TeleAI, China Telecom(TeleAI,中国电信)

AI总结 提出对抗正交解缠(AOD)框架,通过最小最大目标学习幻觉相关方向,并利用双前向对比解码策略,在不需额外训练的情况下缓解大型视觉语言模型(LVLM)的幻觉问题。

详情
AI中文摘要

大型视觉语言模型(LVLM)推进了多模态理解,但其可靠性受到幻觉的限制,即生成内容与视觉事实冲突。现有缓解方法要么依赖昂贵的外部干预(如指令调优和检索),要么使用受限于有缺陷的注意力权重和纠缠的隐藏表示的内部机制。我们提出对抗正交解缠(AOD),一种用于缓解LVLM幻觉的潜在几何框架。AOD通过最小最大目标学习幻觉相关方向:分类器将幻觉信号集中到投影分量中,而对抗器通过梯度反转层将其从正交残差空间中移除。学习到的方向使得一种无需训练的双前向对比解码策略能够抑制幻觉同时保持通用能力。在三个LVLM上进行的四个幻觉和四个效用基准实验表明,AOD一致优于强基线。它在POPE上平均提高超过6%的准确率,将AMBER提升6%,并在MMMU等效用任务上保持强劲性能。进一步分析显示跨数据集的鲁棒迁移,表明AOD捕获了通用的幻觉相关偏差而非数据集特定伪影。我们的源代码和数据集可在https://github.com/Hunter-Wrynn/AOD获取。

英文摘要

Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at https://github.com/Hunter-Wrynn/AOD.
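The abstract above separates a hidden representation into a component projected onto a learned direction and an orthogonal residual. The sketch below shows only that linear-algebra step, under the assumption that the direction is a single vector; the function name and the list-based vectors are hypothetical, and the minimax training and contrastive decoding are not reproduced here.

```python
import math


def split_along_direction(hidden, direction):
    """Split `hidden` into its projection onto `direction` and the residual.

    The two returned vectors sum to `hidden`, and the residual is orthogonal
    to `direction`. Illustrative only; not the authors' implementation.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar coefficient of `hidden` along the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    projected = [coeff * u for u in unit]
    residual = [h - p for h, p in zip(hidden, projected)]
    return projected, residual
```

In the setup the abstract describes, a classifier would read the projected component while an adversary is trained against the residual, so that hallucination-related signal concentrates along the learned direction.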

2605.25373 2026-05-26 cs.CV 版本更新

Physics-Aware 3D Gaussian Editing for Driving Scene Generation

物理感知的三维高斯编辑用于驾驶场景生成

Feng Zhou, Jian Zhang, Yuhang Sun, He Wang, Qiong Wen, Debao Kong, Tieru Wu, Rui Ma

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University(吉林大学汽车底盘集成与仿生全国重点实验室) China FAW Group Co., Ltd.(中国第一汽车集团有限公司)

AI总结 提出RoVES系统,通过单图像驱动的道路几何插入和4-DOF半车动力学模型,实现物理感知的驾驶场景编辑与车辆姿态校正。

详情
AI中文摘要

三维高斯泼溅(3DGS)在自动驾驶仿真和数据生成中展现出巨大潜力,能够实现逼真的重建和灵活的场景操作。然而,现有的3DGS场景编辑方法对道路几何编辑(例如插入减速带或凹陷路面)支持有限,并且通常不将此类编辑与合理的车辆-道路交互动力学耦合。这种编辑对于在极端驾驶场景下生成训练数据或评估系统在这些道路不规则情况下的可靠性至关重要。此外,许多基于优化的方法需要每次编辑进行数分钟的细化,而现有的高效替代方案主要关注外观级别或对象级别的操作,而非物理感知的道路不规则编辑。为了解决这些限制,我们提出了RoVES,一个用于驾驶场景中物理感知三维高斯编辑的道路和车辆编辑系统。RoVES实现了单图像驱动的道路几何插入,并将编辑后的道路轮廓与4-DOF半车动力学模型耦合,以实现垂直位移和俯仰方向上的物理感知车辆姿态校正。RoVES以一次性、无优化的流水线(1.84秒)插入道路元素,完整流水线(包括颜色转移和基于车辆动力学的姿态校正)在6.24秒内完成;它通过姿态编辑编辑动态车辆,并逐帧校正姿态以近似动力学一致的垂直位移和俯仰响应。在Waymo数据集上的实验表明,RoVES为物理感知的驾驶场景生成提供了实用的效率和具有竞争力的视觉一致性。

英文摘要

3D Gaussian Splatting (3DGS) has shown great potential in autonomous driving simulation and data generation, enabling photorealistic reconstruction and flexible scene manipulation. However, existing 3DGS scene editing methods have limited support for road geometry editing (e.g., inserting speed humps or sunken roads), and generally do not couple such edits with plausible vehicle-road interaction dynamics. Such editing is essential for generating training data under extreme driving scenarios or evaluating system reliability under these road irregularities. Moreover, many optimization-based methods require minutes of per-edit refinement, while existing efficient alternatives mainly focus on appearance-level or object-level manipulation rather than physics-aware road irregularity editing. To address these limitations, we propose RoVES, a Road-and-Vehicle Editing System for physics-aware 3D Gaussian editing in driving scenes. RoVES enables single-image-driven road geometry insertion and couples the edited road profile with a 4-DOF half-car vehicle dynamics model to achieve physics-aware vehicle pose correction in vertical displacement and pitch. RoVES inserts road elements in a one-shot, optimization-free pipeline (1.84s), and the full pipeline (including color transfer and vehicle-dynamics-based pose correction) completes in 6.24s; it edits dynamic vehicles via pose editing and corrects poses frame-by-frame to approximate dynamics-consistent vertical displacement and pitch responses. Experiments on the Waymo dataset show that RoVES provides practical efficiency and competitive visual consistency for physics-aware driving scene generation.

2605.25364 2026-05-26 cs.CV 版本更新

Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

MLLMs 能否超越语言进行推理?VisReason:一个面向视觉中心推理的综合基准

Longteng Guo, Yifan Wang, Pengkang Huo, Tailai Chen, Yuze Wu, Jing Liu, Xinxin Zhu

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出 VisReason 基准,包含 1505 个日常场景问题,评估多模态大模型在视觉中心推理上的表现,揭示人类与模型间的显著差距。

Comments Accepted by ACL 2026 Findings, resources released at https://github.com/CASIA-IVA-Lab/VisReason

详情
AI中文摘要

近期多模态大语言模型(MLLMs)在视觉推理基准上取得了强劲性能,但尚不清楚这种性能在多大程度上反映了直接基于视觉证据的推理。我们引入了 VisReason,一个面向日常场景中视觉中心推理的基准,其中感知与推理紧密耦合。VisReason 包含 1505 个问题,涵盖感知、结构和概念推理等 10 个类别。我们的评估表明,VisReason 对现有基准提出了性质不同的挑战,暴露了人类与当前 MLLMs 之间的巨大差距,并揭示了测试时推理策略带来的有限收益。VisReason 为评估超越语言的视觉中心推理提供了一个聚焦的诊断工具。

英文摘要

Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.

2605.25363 2026-05-26 cs.CV 版本更新

MARVEL: Universal Murray's Law-informed Vessel Tree Segmentation and Topology Estimation

MARVEL:基于Murray定律的通用血管树分割与拓扑估计

Yi Zhou, Thiara Sana Ahmed, Jacqueline Chua, Meng Wang, Qinrong Zhang, Alejandro F. Frangi, Huazhu Fu, Jun Cheng, Leopold Schmetterer, Bingyao Tan

发表机构 * Singapore Eye Research Institute(新加坡眼科学研究院) Singapore National Eye Centre(新加坡国家眼科中心) Ophthalmology & Visual Sciences Academic(眼科与视觉科学学术)

AI总结 提出一种与骨干网络无关的框架MARVEL,通过可微分的Murray定律约束正则化训练,提升血管分割的生理合理性、拓扑一致性,并在高血压分类任务中显著优于基线模型。

Comments 10 pages, 18 figures

详情
AI中文摘要

血管循环遵循优化质量传输和代谢能量消耗的基本生物物理原理,这些原理可以通过Murray定律有效建模。然而,当代深度学习方法用于血管分割时往往忽略这些生物物理约束,导致生理上不合理的分支和血管树误分类,使得这些自动分割结果对于下游临床任务(如血流模拟或疾病量化)不可靠。在本文中,我们引入MARVEL(基于Murray定律的通用血管分割与拓扑估计),一个与骨干网络无关的框架,将生物物理先验整合到血管树提取中。MARVEL结合逐像素监督与显式半径预测,以强制执行从经验宽度-指数映射导出的局部分叉约束。我们在训练期间将这些约束实现为可微正则化器,以引导模型朝向生理一致的重建。我们在八个公开数据集上评估MARVEL,涵盖多种血管模态和分割骨干网络。结果表明MARVEL在分割准确性、拓扑一致性和生理合理性方面具有优越性能。通过将分割掩膜转换为基于图的血流动力学模拟,我们证明MARVEL保留了区分高血压眼和正常眼所需的细微病理狭窄和拓扑连接。结果显示,MARVEL通过眼内动静脉压力差显著改善了高血压的分类(p < 0.001),在拓扑一致性和临床预测价值方面均优于基线模型。

英文摘要

Vascular circulation follows fundamental biophysical principles that optimize mass transport and metabolic energy expenditure, which can be effectively modeled by Murray's law. However, contemporary deep learning methods for vascular segmentation often neglect these biophysical constraints. This leads to physiologically implausible branching and misclassification of vascular trees, rendering these automated segmentation results unreliable for downstream clinical tasks such as blood flow simulation or disease quantification. In this paper, we introduce MARVEL (Universal MurrAy's law-infoRmed Vessel sEgmentation and topoLogy estimation), a backbone-agnostic framework that integrates biophysical priors into vascular tree extraction. MARVEL combines per-pixel supervision with explicit radius predictions to enforce local bifurcation constraints derived from an empirical width-exponent mapping. We implement these constraints as differentiable regularizers during training to guide models toward physiologically consistent reconstructions. We evaluate MARVEL on eight public datasets across multiple vascular modalities and segmentation backbones. Results demonstrate MARVEL's superior performance in segmentation accuracy, topological consistency, and physiological plausibility. By converting segmented masks into graph-based hemodynamic simulations, we demonstrate that MARVEL preserves the subtle pathological narrowing and topological connectivity required to distinguish hypertensive from normotensive eyes. Results show that MARVEL significantly improves the classification of hypertension via arteriovenous pressure differences in the eye (p < 0.001), outperforming baseline models in both topological consistency and clinical predictive value.
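For readers unfamiliar with Murray's law, the bifurcation constraint mentioned in the abstract can be illustrated with a small sketch. It assumes the classic cubic form (parent radius cubed equals the sum of the child radii cubed). The abstract states that MARVEL derives its exponent from an empirical width-exponent mapping, which is not reproduced here, and the function name `murray_residual` is hypothetical.

```python
def murray_residual(parent_radius, child_radii, exponent=3.0):
    """Relative deviation from Murray's law at a single bifurcation.

    Assumption: the classic exponent of 3 is used as the default. The paper's
    learned or empirical exponent mapping is not modeled in this sketch.
    """
    parent_term = parent_radius ** exponent
    child_term = sum(r ** exponent for r in child_radii)
    return abs(parent_term - child_term) / parent_term


# A symmetric bifurcation that satisfies the cubic law exactly:
# each child radius is parent * 2^(-1/3).
parent = 1.0
child = parent * 2 ** (-1.0 / 3.0)
print("law-consistent residual:", murray_residual(parent, [child, child]))

# An implausible bifurcation where both children are as wide as the parent.
print("implausible residual:", murray_residual(parent, [parent, parent]))
```

A residual of this kind could serve as a penalty term once written in a differentiable framework; how MARVEL actually weights and aggregates it is not stated in the abstract.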

2605.25357 2026-05-26 cs.CV cs.MA 版本更新

Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration

面向可靠胎儿超声解读的多智能体协作

Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen, Yiming Huang, Xuguang Bai, Zihan Li, Hongjia Yang, Yingqi Hao, Hong Xu, Yu Jiang, Tian Tian, Yi Liao, Haibo Qu, Qiyuan Tian

发表机构 * Tsinghua University(清华大学) University of California San Diego(加州大学圣地亚哥分校) West China Second University Hospital, Sichuan University(四川大学华西第二医院)

AI总结 提出FetUSAgents多智能体系统,通过协作LLM代理和双路径证据仲裁(DPEA)整合视觉工具与临床推理,在胎儿超声VQA、报告生成等任务上超越最强基线25%以上。

详情
AI中文摘要

自动化胎儿超声解读需要从视觉感知(包括平面识别和解剖分割)到临床理解(包括生物测量和诊断报告)的工作流程。然而,当前“一任务一模型”的范式限制了跨多步骤过程的系统性证据整合。尽管多模态大语言模型(MLLM)展现出有前景的视觉理解能力,但其有限的领域特定基础和幻觉风险限制了在胎儿超声分析中的可靠性。为解决这些限制,我们提出了FetUSAgents,一个工具增强的多智能体系统,用于全面的胎儿超声解读,支持视觉问答(VQA)、报告生成、图像描述和视频总结。FetUSAgents通过协作的LLM代理协调任务特定的视觉工具,并将临床查询分解为从解剖识别到定量测量的子任务。我们进一步引入了双路径证据仲裁(DPEA),它将基于LLM的审慎推理与来自专业视觉工具的结构化计算证据相结合。一个检索增强的证据库整合中间发现,以支持可追溯且临床可靠的结论。此外,我们构建了FetUS-VQA,一个专门用于胎儿超声的VQA基准,包含1,892张图像和3,205个问答对,涵盖10个临床任务。广泛的分布外实验表明,FetUSAgents优于通用和医学MLLM,在VQA准确率上超过最强基线25%以上。这些结果表明了一条通往产前成像的基于证据的临床助手的可扩展路径。代码已公开。

英文摘要

Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing "one-task, one-model" paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.

2605.25348 2026-05-26 eess.IV cs.AI cs.CV cs.LG cs.SC 版本更新

Parameter-Efficient CT Reconstruction via Deep Graph Laplacian Regularization

基于深度图拉普拉斯正则化的参数高效CT重建

Veera Varuni Radhakrishnan, Chinthaka Dinesh, Qurat-ul-Ain Azim

发表机构 * Mechanical and Industrial Engineering Department(机械与工业工程系)

AI总结 提出深度图拉普拉斯正则化(Deep GLR)方法,通过将二次图正则化集成到近端前向-后向分裂优化框架中,仅用少量参数和数据即可实现低剂量CT重建的噪声抑制,在参数效率和数据效率上显著优于现有方法。

Comments 7 pages, 3 figures, conference

详情
AI中文摘要

低剂量计算机断层扫描(LDCT)重建面临重建质量与资源需求之间的关键权衡。虽然最近的深度学习方法达到了最先进的性能,但它们通常依赖超过50万个参数,并在超过35,000次扫描的大规模数据集上训练。本文研究在严格资源约束下,基于图的正则化是否能提供有意义的噪声抑制。我们提出了深度图拉普拉斯正则化(Deep GLR),将二次图正则化集成到近端前向-后向分裂优化框架中,并包含三个轻量级CNN模块。在LoDoPaB-CT基准上评估,Deep GLR达到了30.70 dB的PSNR,相比滤波反投影提高了6.33 dB,同时仅使用了91,848个参数,在1000个样本上训练(标准训练集的2.8%)。与基准方法相比,这代表了每dB改进5.8倍的参数效率和30倍的数据效率。学习到的图带宽参数(ε=1.25)收敛到可解释的值,表明该方法捕捉了有意义的图像先验而非过拟合。尽管与最先进方法相比仍有13 dB的差距,但结果表明基于图的正则化为资源受限的医学成像场景提供了有利的效率-质量权衡。

英文摘要

Low-dose computed tomography (LDCT) reconstruction faces a critical tradeoff between reconstruction quality and resource requirements. While recent deep learning methods achieve state-of-the-art performance, they typically rely on over 500,000 parameters trained on large-scale datasets exceeding 35,000 scans. This work investigates whether graph-based regularization can provide meaningful noise reduction under strict resource constraints. We propose Deep Graph Laplacian Regularization (Deep GLR), integrating quadratic graph regularization into a Proximal Forward-Backward Splitting optimization framework with three lightweight CNN modules. Evaluated on the LoDoPaB-CT benchmark, Deep GLR achieves 30.70 dB PSNR, representing a 6.33 dB improvement over filtered backprojection, while using only 91,848 parameters trained on 1,000 samples (2.8% of the standard training set). Compared to benchmark methods, this represents 5.8 times better parameter efficiency and 30 times better data efficiency per dB improvement. The learned graph bandwidth parameter (ε = 1.25) converges to interpretable values, suggesting the method captures meaningful image priors rather than overfitting. While a 13 dB gap remains versus state-of-the-art methods, results demonstrate that graph-based regularization provides a favorable efficiency-quality tradeoff for resource-constrained medical imaging scenarios.
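The quadratic graph regularizer named in the abstract can be sketched on a toy 1D signal. This is only an illustration under stated assumptions: the graph is a simple chain, edge weights use a Gaussian kernel whose bandwidth `epsilon` borrows the reported value 1.25 purely as a placeholder, and plain gradient descent stands in for the paper's proximal forward-backward splitting with CNN modules. All function names are hypothetical.

```python
import math


def edge_weights(signal, epsilon):
    """Gaussian edge weights between neighbouring samples of a 1D signal."""
    return [
        math.exp(-((signal[i] - signal[i + 1]) ** 2) / (2 * epsilon ** 2))
        for i in range(len(signal) - 1)
    ]


def glr(x, weights):
    """Quadratic graph Laplacian regularizer x^T L x on a chain graph."""
    return sum(w * (x[i] - x[i + 1]) ** 2 for i, w in enumerate(weights))


def denoise(y, epsilon=1.25, mu=0.5, step=0.1, iters=200):
    """Gradient descent on ||x - y||^2 + mu * x^T L x.

    Assumption: the edge weights are computed once from the noisy input y
    and kept fixed, which is a simplification of a learned graph.
    """
    w = edge_weights(y, epsilon)
    x = list(y)
    for _ in range(iters):
        grad = [2 * (x[i] - y[i]) for i in range(len(x))]
        for i, wi in enumerate(w):
            d = 2 * mu * wi * (x[i] - x[i + 1])
            grad[i] += d
            grad[i + 1] -= d
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x


noisy = [0.0, 1.0, 0.2, 0.9, 0.1]
weights = edge_weights(noisy, 1.25)
smoothed = denoise(noisy)
print("GLR before:", glr(noisy, weights))
print("GLR after: ", glr(smoothed, weights))
```

The data term keeps the estimate close to the observation while the regularizer penalizes differences across strongly connected samples, which is the tradeoff the parameter `mu` controls in this sketch.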

2605.25347 2026-05-26 cs.CV cs.LG 版本更新

ERNIE-Image Technical Report

ERNIE-Image 技术报告

Jiaxiang Liu, Zhida Feng, Pengyu Zou, Zhenyu Qian, Tianrui Zhu, Jun Xia, Yuehu Dong, Yanzheng Lin, Honglin Xiong, Anqi Chen, Yunpeng Ding, Jinghui Duan, Lin Gao, Chao Han, Tiechao He, Jiakang Hu, Ranjun Hua, Xueming Jiang, Qingli Kong, Yuting Lei, Tianyu Li, Yunlin Liu, Changling Liu, Yaxin Liu, Yi Liu, Xuguang Liu, Xiaolong Ma, Yan Pan, Yiran Ren, Nan Sheng, Yu Sun, Siyang Sun, Yixiang Tu, Yang Wan, Huanai Wang, Siqi Wang, Yang Wu, Youzhi Yang, Xiaowen Yang, Jianwen Yang, Yehua Yang, Quanwen Zhang, Xinmin Zhang, Haoxin Zhang, Xiang Zhang, Jun Zhang, Qian Zhang, Qiao Zhao, Qi Zhou

发表机构 * ERNIE Team, Baidu(百度ERNIE团队)

AI总结 提出基于8B单流DiT架构的开源文本到图像生成模型ERNIE-Image,通过自底向上的预训练数据构建和自顶向下的后训练数据构建,结合稳定DPO策略和MT-DMD蒸馏方法,在指令遵循、文本渲染和美学质量上接近顶级商业模型。

详情
AI中文摘要

我们介绍了ERNIE-Image,一个基于8B单流DiT架构构建的开源文本到图像生成模型。ERNIE-Image旨在通过更有效地挖掘大规模预训练数据并在整个训练过程中提高监督质量,来弥合当前开源模型与领先闭源系统之间的差距。在预训练阶段,我们采用自底向上的数据构建流程,结合细粒度图像分类、丰富的标题注释、美学评估和分层采样。该策略在保留长尾概念和详细真实世界知识的同时减少数据噪声,为复杂生成任务提供了更坚实的基础。在后训练阶段,我们针对高需求场景使用自顶向下的数据构建流程,多样化提示注释以更好地匹配真实用户输入,并应用稳定的DPO策略使模型与人类美学偏好对齐。我们进一步训练ERNIE-Image-Turbo以实现高效的8-NFE生成,并提出MT-DMD以减轻蒸馏过程中的能力漂移。为了使模型在实际场景中更易于使用,我们为其配备了一个轻量级的提示增强器,将简洁的用户意图扩展为结构化的视觉描述。此外,我们开发了工业级美学模型ERNIE-Image-Aes,以及用于真实美学评估的人工标注基准ERNIE-Image-Aes-1K。大量的定性和定量实验表明,ERNIE-Image在开源模型中实现了领先性能,并在指令遵循、文本渲染和美学质量方面接近顶级商业模型。我们发布训练好的模型和美学资源,以促进AIGC社区的进一步学术研究和技术进步。

英文摘要

We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.

2605.25345 2026-05-26 cs.GR cs.CV 版本更新

Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering

用于高保真高斯增强面元渲染的深度剥离

Keyang Ye, Hongzhi Wu, Kun Zhou

发表机构 * State Key Lab of CAD&CG, Zhejiang University(计算机辅助设计与图形学国家重点实验室,浙江大学) Hangzhou Research Institute of Holographic and AI Technology(杭州全息与人工智能技术研究院)

AI总结 提出DP-GES,通过半透明边界增强不透明面元并利用深度剥离建立逐像素排序,实现无排序高斯溅射和正确透射率调制,消除锯齿和弹出伪影,提升重建质量。

详情
AI中文摘要

新视角合成已被NeRF和3D高斯溅射(3DGS)显著推进,这些方法需要对体积样本或基元进行排序以实现正确的颜色混合。虽然最近的高斯增强面元(GES)实现了高性能、无排序渲染,但它们存在锯齿伪影和次优重建的问题。为解决这些限制,我们提出DP-GES,一种新颖的表示方法,通过半透明边界增强不透明面元,并利用深度剥离建立准确的逐像素排序。该设计实现了具有正确透射率调制的无排序高斯溅射,有效消除了锯齿和弹出伪影,同时促进了完全可微的联合优化。大量实验表明,我们的方法在广泛场景中实现了优越的重建质量,并与最先进技术相比具有竞争力。

英文摘要

Novel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. This design enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. Extensive experiments demonstrate that our method achieves superior reconstruction quality and compares favorably against state-of-the-art techniques across a wide range of scenes.
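As background for the per-pixel ordering mentioned above, classic depth peeling can be sketched for a single pixel. This is the textbook idea only, not the DP-GES renderer: each pass keeps the nearest fragment that lies strictly behind the previously peeled depth, and the peeled layers are then blended front to back using accumulated transmittance. The tuple layout and function names are assumptions made for illustration.

```python
def peel_layers(fragments, max_layers=4):
    """Peel fragments of one pixel front to back without a global sort.

    fragments: unordered list of (depth, color, alpha) tuples.
    Note: fragments sharing exactly the same depth are peeled only once
    in this simplified version.
    """
    layers = []
    last_depth = float("-inf")
    for _ in range(max_layers):
        candidates = [f for f in fragments if f[0] > last_depth]
        if not candidates:
            break
        nearest = min(candidates, key=lambda f: f[0])
        layers.append(nearest)
        last_depth = nearest[0]
    return layers


def composite(layers):
    """Front-to-back alpha blending with transmittance modulation."""
    color, transmittance = 0.0, 1.0
    for _, c, a in layers:
        color += transmittance * a * c
        transmittance *= 1.0 - a
    return color, transmittance


pixel_fragments = [(3.0, 1.0, 0.5), (1.0, 0.2, 0.5), (2.0, 0.6, 1.0)]
layers = peel_layers(pixel_fragments)
print("peeled depths:", [depth for depth, _, _ in layers])
print("blended color and remaining transmittance:", composite(layers))
```

In this example the opaque fragment at depth 2.0 drives the transmittance to zero, so the fragment behind it contributes nothing to the final color.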

2605.25343 2026-05-26 cs.CV 版本更新

Toward Native Multimodal Modeling: A Roadmap

迈向原生多模态建模:路线图

Siyu An, Junru Lu, Junnan Dong, Qiufeng Wang, Yinghui Li, Weizhi Fei, Zichao Yu, Zheng Yuan, Biao Liu, Haopeng Wang, Renzhao Liang, Yixuan Yang, Yunhang Shen, Bo Ke, Keyu Chen, Linhao Luo, Difan Zou, Xiao Huang, Di Yin, Ruizhi Qiao, Xing Sun

发表机构 * Tencent Youtu Lab(腾讯优图实验室) Tsinghua University(清华大学) The University of Hong Kong(香港大学) University of Warwick(华威大学) Monash University(莫纳什大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文提出从非原生多模态范式向原生多模态建模(NMM)过渡的正式路线图,通过输入-输出二元性分类现有模型,并系统探讨架构协调、数据整理、训练推理及评估的全栈工业级方案。

Comments 52 pages, 5 figures, 3 tables, ~300 references

详情
AI中文摘要

多模态建模是从模态无关推理迈向世界建模的关键一步。早期方法主要依赖后期融合,即组装编码器、冻结语言骨干网络和输出头;而近期研究已将范式转向原生多模态建模(NMM),通过模态的内在集成实现卓越的多模态性能。尽管潜力巨大,原生架构的设计空间仍缺乏明确定义。本文向社区呈现了这一过渡的正式路线图。具体而言,我们正式定义了架构原生性,将中期融合和早期融合与非原生范式区分开来。我们进一步通过输入-输出二元性的视角将现有原生模型组织为三类:(i) 多到文本,用于仅输出文本的跨模态理解;(ii) 多到目标,用于面向场景的生成,例如图像、音频和视频生成;(iii) 多到多,用于对称输入-输出的统一建模。我们对迈向最终NMM框架的过渡进行了全面且工业级的调查,在该框架中,理解和生成在统一的Transformer范式中无缝共存。我们从工业视角系统地拆解了端到端流水线,包括架构协调、大规模数据整理、全栈训练配方、推理与部署,以及真正原生建模的综合评估。

英文摘要

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late fusion, which assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM), which integrates modalities intrinsically for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio, and video generation; and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. We systematically unpack the end-to-end pipeline from an industrial perspective, covering architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.

2605.25334 2026-05-26 cs.CV 版本更新

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

双路径几何感知多模态大语言模型用于空间智能

Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

发表机构 * University of Science and Technology of China(中国科学技术大学) Li Auto Inc.(理想汽车) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出GAMSI,一种仅以RGB图像为输入、通过双路径查询和专家引导视觉对齐实现3D结构与度量尺度联合感知的多模态大语言模型,在七个空间智能基准上达到最优性能。

详情
AI中文摘要

从2D视觉输入理解物理世界的空间能力依赖于两种互补的几何知识:整体3D结构感知和细粒度度量尺度估计。现有的多模态大语言模型通常只处理其中一个方面,将深度图或点云作为额外模型输入,这带来了大量计算开销并继承了上游预测模型的泛化局限性。我们提出GAMSI,一种双路径几何感知多模态大语言模型用于空间智能,仅以RGB图像为输入,同时在统一的自回归骨干网络内内化两种几何先验。具体地,我们引入度量-结构解耦查询,使用两组可学习查询分别从共享视觉上下文中提取密集度量信号和稀疏结构线索,并通过任务解耦注意力掩码防止两条路径相互污染。在此基础上,专家引导视觉定位模块将聚合的线索投影回帧级视觉特征,并与视觉基础模型对齐,这些模型仅作为训练时的监督,而非模型输入。我们进一步构建了一个多任务空间指令微调数据集,包含152,776个样本,涵盖13种任务类型和三种视觉模态,整合自六个公共数据集。通过两阶段课程训练,GAMSI在七个空间智能基准上达到了最先进的性能。

英文摘要

Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ), which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152,776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.

2605.25333 2026-05-26 cs.CV 版本更新

Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution

教会视频生成器记忆:为不可见状态演化引出动态记忆

Tianshuo Xu, Yichen Xie, Depu Meng, Chensheng Peng, Quentin Herau, Bo Jiang, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition(应用直觉) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对视频生成模型在观测中断时状态冻结的问题,提出ReMind框架,通过面向记忆的数据构建、事件感知训练和缓存适配,利用KV缓存机制实现动态记忆,在STEVO-Bench和恢复任务上取得最佳成绩。

详情
AI中文摘要

视频世界模型应在证据未被观测时维持演化状态,但当前生成器在中断时往往冻结隐藏状态。这不仅仅是容量问题:预训练的视频扩散Transformer已经具备能够进行非局部检索的KV缓存机制,但很少被训练用作动态记忆。我们引入ReMind,一个通过面向记忆的数据、事件感知训练和缓存适配来引出动态记忆行为的框架。围绕100多种动态事件的分类,我们构建了一个带相机标注的训练混合集,结合了VLM过滤的真实视频、生成的硬动态、合成相机循环和记忆中断增强。每个片段被转换为带有保护锚点、退化区间和显式时间间隙的帧图。节点结构化的课程,包括节点丢弃、噪声记忆、前沿延续和参考缓存训练,迫使模型在中断时检索相关的过去状态,而不是仅依赖局部连续性。PM-RoPE,一种优雅的相机相位RoPE扩展,以单注意力成本解锁了时空检索,同时保留了预训练路径。ReMind在STEVO-Bench和恢复任务上取得了最佳总体分数。此外,通用图像到视频评估证实该课程避免了灾难性遗忘。我们将开源代码、数据和模型。

英文摘要

Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.

2605.25328 2026-05-26 cs.CV cs.MM 版本更新

DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement

DIVA: 利用统一多模态模型中的表示差异实现相互增强

Renjie Lu, Xulong Zhang, Xiaoyang Qu, Shangfei Wang, Jianzong Wang

发表机构 * Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China(平安科技(深圳)有限公司,深圳,中国) University of Science and Technology of China(中国科学技术大学)

AI总结 针对统一多模态模型中理解与生成任务因监督信号差异导致相互干扰的问题,提出DIVA框架,通过分解视觉表示为共享和独有成分并利用互信息估计实现内部协同,在理解与生成任务上分别提升7.82%和8.46%。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基于单一架构构建的统一多模态模型(UMMs)在理解和生成任务中均展现出令人印象深刻的表现。我们识别出一个基本挑战,即由不同监督信号引起的归纳偏差:生成分支偏好能够重建的高保真、细粒度表示,而理解分支则偏好对任务无关因素保持不变的语义判别性嵌入。因此,在单一骨干网络中优化这些互补但不等价的目标会导致相互损害而非增强。在本文中,我们首先分析了统一骨干网络中这种干扰的根本原因,并揭示了其内部表示中的互补结构。受此观察启发,我们提出了DIVA,一个自我改进的训练后框架,将表示差异转化为内部协同。通过基于两条互补信息流将视觉表示显式分解为共享和独有成分,DIVA使得理解和生成分支都能实现有益的迁移,同时通过互信息估计保护独有信息免受跨流干扰的完整性。尽管具有通用性,我们的方法在视觉理解(+7.82%)和生成(+8.46%)任务上均取得了一致的改进。官方代码见:https://github.com/Jayyy-H/DIVA。

英文摘要

Unified multimodal models (UMMs) built on a single architecture have shown impressive performance in both understanding and generation. We identify a fundamental challenge that lies in the inductive biases induced by distinct supervision signals: the generation branch prefers high-fidelity, fine-grained representations capable of reconstruction, while the understanding branch favors semantically discriminative embeddings that remain invariant to task-irrelevant factors. Consequently, optimizing these complementary but non-equivalent objectives within a monolithic backbone leads to mutual impairment instead of enhancement. In this paper, we first analyze the root cause of this interference in unified backbones and reveal a complementary structure in their internal representations. Motivated by this observation, we propose DIVA, a self-improved post-training framework that transforms the representation divergence into interior synergy. By explicitly factorizing the visual representation into shared and unique components based on two complementary information flows, DIVA enables both the understanding and generation branches to benefit from transfer while using mutual information estimation to protect the integrity of unique information from cross-flow interference. Despite its generality, our method consistently achieves improvements across visual understanding (+7.82%) and generation (+8.46%). The official code is available at: https://github.com/Jayyy-H/DIVA.

2605.25326 2026-05-26 cs.CV 版本更新

Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation

感知-然后-规划:以布局为策略的单目3D场景布局估计

Junwei Zhou, Yu-Wing Tai

发表机构 * Department of Computer Science(计算机科学系) Dartmouth College(达特茅斯学院)

AI总结 提出Perceive-then-Plan框架,通过视觉语言模型将单目3D布局估计转化为感知与迭代规划问题,以布局为策略(LaP)学习动作序列逐步优化场景假设,生成更物理一致且与观测对齐的3D布局。

Comments 21 pages

详情
AI中文摘要

从单张图像构建结构化的3D场景布局需要协调视觉观察与物理和空间约束,这一挑战难以仅通过直接预测来解决。在这项工作中,我们将单目3D布局估计形式化为一个带有视觉语言模型的感知-然后-规划问题,其中感知器首先定位3D对象,然后规划器通过动作迭代优化场景假设,这些动作在保持与输入图像一致性的同时提高物理合理性。我们提出布局为策略(LaP),将规划阶段视为策略学习问题:3D布局表示为结构化状态,并通过离散动作(如平移、旋转和缩放)进行优化。从几何增强感知器的观测对齐初始化开始,LaP规划器被训练生成逐步解决几何不一致性并强制实现现实空间关系的动作序列。为了实现有效学习,我们将监督轨迹初始化与基于偏好的优化相结合,使模型能够在无需显式奖励工程的情况下学习纠正行为。这种公式将布局估计从一次性预测任务转变为迭代优化过程,从而更好地处理全局约束和复杂的对象交互。实验表明,我们的方法生成的布局在物理上更连贯,与视觉观察更一致,同时自然支持场景编辑和操作等下游任务。

英文摘要

Building structured 3D scene layouts from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that is difficult to address with direct prediction alone. In this work, we formulate monocular 3D layout estimation as a perceive-then-plan problem with vision-language models, where a Perceiver first grounds the 3D objects and then a Planner iteratively refines the scene hypothesis through actions that improve physical plausibility while preserving consistency with the input image. We propose Layout-as-Policy (LaP), which casts the planning stage as a policy learning problem: 3D layouts are represented as structured states, and refined via discrete actions such as translation, rotation, and rescaling. Starting from an observation-aligned initialization with the geometry-enhanced Perceiver, the LaP Planner is trained to produce action sequences that progressively resolve geometric inconsistencies and enforce realistic spatial relations. To enable effective learning, we combine supervised trajectory initialization with preference-based optimization, allowing the model to learn corrective behaviors without requiring explicit reward engineering. This formulation transforms layout estimation from a one-shot prediction task into an iterative refinement process, enabling better handling of global constraints and complex object interactions. Experiments demonstrate that our approach produces layouts that are more physically coherent and better aligned with visual observations, while naturally supporting downstream tasks such as scene editing and manipulation.

2605.25308 2026-05-26 cs.CV 版本更新

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

通过动态特征归一化稳定流视频几何

Xiaoyang Lyu, Muxin Liu, Xiaoshan Wu, Ruicheng Wang, Yi-Hua Huang, Yang-Tian Sun, Shaoshuai Shi, Xiaojuan Qi

发表机构 * The University of Hong Kong(香港大学) USTC(中国科学技术大学) Voyager Research, Didi Chuxing(滴滴出行 Voyager 研究)

AI总结 针对流式RGB输入中单目几何模型的时间不一致问题(主要表现为尺度-偏移漂移),提出轻量级因果循环模块DyFN,通过动态调制特征统计量实现稳定几何估计,仅微调2%参数即可达到SOTA时间稳定性。

Comments 16 pages, 9 Figures, page: https://shawlyu.github.io/DyFN

详情
AI中文摘要

从流式RGB输入中一致地估计3D几何对于自动驾驶、具身AI和大规模重建等实际应用至关重要。虽然现代单目几何基础模型在单张图像上取得了很高的精度,但在连续输入上表现出严重的时间不一致性,主要表现为尺度-偏移漂移。通过有针对性的实证分析,我们将这种不稳定性追溯到其根本原因:潜在特征统计量的波动,其均值和方差直接决定了预测深度的尺度和偏移。基于这一洞察,我们引入了动态特征归一化(DyFN),这是一种轻量级的因果循环模块,能够动态且鲁棒地调制特征统计量,以随时间保持稳定的几何。我们通过仅微调DyFN(仅占2%的额外参数)来适配强大的预训练单目几何模型用于流式处理,同时保持骨干网络冻结,从而在保持单张图像精度的同时实现时间一致性。在四个基准上的大量实验表明,DyFN有效消除了时间伪影,如不连续的分层和位置抖动,并实现了最先进的时间稳定性,相比先前的流式方法提升了高达14%,甚至优于更重的非因果视频基线。项目页面:https://shawlyu.github.io/DyFN

英文摘要

Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale-shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth's scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN, a mere 2% additional parameters, while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines. Project Page: https://shawlyu.github.io/DyFN
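The link between feature statistics and scale-shift drift can be illustrated with a toy causal normalizer. This is a sketch under strong assumptions: an exponential moving average of the per-frame mean and standard deviation stands in for the learned recurrent module described in the abstract, and the names `stabilize_stream` and `momentum` are hypothetical.

```python
import math


def mean_std(values):
    """Mean and standard deviation of one frame's feature values."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var + 1e-8)


def stabilize_stream(frames, momentum=0.9):
    """Re-normalize each frame to slowly varying running statistics.

    Causal: frame t only uses statistics from frames 0..t.
    Assumption: a fixed EMA replaces DyFN's learned modulation.
    """
    run_mean, run_std = None, None
    out = []
    for feats in frames:
        m, s = mean_std(feats)
        if run_mean is None:
            run_mean, run_std = m, s
        else:
            run_mean = momentum * run_mean + (1 - momentum) * m
            run_std = momentum * run_std + (1 - momentum) * s
        out.append([(v - m) / s * run_std + run_mean for v in feats])
    return out


# The second frame is the first one with a sudden scale and shift change.
frames = [[0.0, 1.0, 2.0, 3.0], [5.0, 7.0, 9.0, 11.0]]
stabilized = stabilize_stream(frames)
print("raw mean of frame 1:       ", sum(frames[1]) / 4)
print("stabilized mean of frame 1:", sum(stabilized[1]) / 4)
```

Because the running statistics change slowly, an abrupt jump in a frame's mean or variance is damped rather than passed straight through to whatever head consumes the features.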

2605.25307 2026-05-26 cs.CV 版本更新

Recursive Class Connectivity Classification (R3C) Applied to Binary Image Segmentation for Improved Infant Fingerprint Enhancement

递归类连接分类(R3C)应用于二值图像分割以改进婴儿指纹增强

Joao Leonardo Harres Dall Agnol, Luiz Fernando Puttow Southier, Jefferson Tales Oliva, Marcelo Teixeira, Rodrigo Mineto, Marcelo Filipa, Dalcimar Casanova, Erick Oliveira Rodrigues

发表机构 * Infant.ID Ltda(Infant.ID公司) Graduate Program in Production and Systems Engineering (PPGEPS), Federal University of Technology-Paraná (UTFPR)(生产与系统工程研究生项目,巴拉那联邦技术大学(UTFPR))

AI总结 提出递归类连接分类(R3C)框架,通过迭代扩展脊线结构改进现有增强方法的二值分割输出,无需训练数据即可提升婴儿指纹识别率。

详情
Journal ref
IEEE Access 2025
AI中文摘要

图像增强在婴儿指纹匹配中至关重要,因为儿童特有的特征(如较小的手指尺寸和较薄的脊线结构)通常会在采集过程中降低图像质量。为解决这些限制,注册通常依赖于专门的高分辨率扫描仪,而大多数现有增强方法并非为此设计。因此,儿童的识别率仍显著低于成人指纹。本研究引入递归类连接分类(R3C),一种通过扩展脊线结构迭代细化现有增强方法二值分割输出的新颖框架。R3C不需要修改底层分类器,且无需训练数据(目前婴儿指纹尚无此类数据)。相反,该方法通过将分类后的图像反复反馈到分类过程中,同时将每个中间分割与原始输入图像结合,从而改进分割。在三个指纹数据集上使用四种不同增强分类器进行的实验表明,与单独使用增强方法相比,R3C可将儿童的真接受率(TAR)提高最多4%,新生儿提高超过40%。定性分析进一步表明,R3C重新连接了断裂的脊线模式,改善了分割的视觉质量。由于独立于所使用的增强方法,R3C为改进二值分割提供了灵活且广泛适用的解决方案。

英文摘要

Image enhancement plays a crucial role in infant fingerprint matching, as child-specific characteristics such as smaller finger dimensions and thinner ridge structures often degrade image quality during acquisition. To address these limitations, enrollment typically depends on specialized high-resolution scanners, which most existing enhancement methods are not designed to support. Consequently, identification rates for children remain significantly lower than those achieved with adult fingerprints. This study introduces Recursive Class Connectivity Classification (R3C), a novel framework that iteratively refines binary segmentation outputs from existing enhancement methods by extending ridge structures. R3C does not require modifications to the underlying classifier and operates without training data, which is not currently available for infant fingerprints. Instead, the method improves segmentation by repeatedly feeding the classified image back into the classification process, while combining each intermediate segmentation with the original input image. Experiments conducted on three fingerprint datasets using four different enhancement classifiers show that R3C can increase the True Acceptance Rate (TAR) by up to 4% for children and over 40% for newborns, compared to using the enhancement methods alone. A qualitative analysis further demonstrates that R3C reconnects fragmented ridge patterns, improving the visual quality of segmentation. Because it functions independently of the enhancement method used, R3C provides a flexible and broadly applicable solution for improving binary segmentation.
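The feedback loop described in the abstract can be sketched on a 1D slice along a faded ridge. Everything concrete here is an assumption for illustration: the toy classifier (a thresholded 3-sample mean), the fusion rule (element-wise maximum of the original image and the previous mask), and the names `classify` and `r3c`. The abstract does not specify how R3C combines the intermediate segmentation with the input.

```python
def classify(image, threshold=0.55):
    """Toy binary classifier: ridge if the local 3-sample mean is high enough."""
    mask = []
    for i in range(len(image)):
        window = image[max(0, i - 1): i + 2]
        mask.append(1.0 if sum(window) / len(window) >= threshold else 0.0)
    return mask


def r3c(image, classifier, passes=3):
    """Feed the segmentation back into an unchanged classifier.

    Assumption: each pass fuses the previous mask with the ORIGINAL image
    via an element-wise maximum before classifying again.
    """
    mask = classifier(image)
    for _ in range(passes - 1):
        fused = [max(o, m) for o, m in zip(image, mask)]
        mask = classifier(fused)
    return mask


# Background (0.0) on both sides, a ridge (0.8) with a faded gap in the middle.
ridge = [0.0, 0.0, 0.8, 0.8, 0.5, 0.3, 0.5, 0.8, 0.8, 0.0, 0.0]
print("single pass: ", r3c(ridge, classify, passes=1))
print("three passes:", r3c(ridge, classify, passes=3))
```

In this toy case a single pass leaves the ridge fragmented, while repeated passes extend the detected segments until the gap closes; the background samples at both ends stay unlabeled because their local mean never reaches the threshold.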

2605.25304 2026-05-26 cs.LG cs.CR cs.CV 版本更新

When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers

当可解释性成为负担:针对CBM概念层的对抗攻击

Aditya Sridhar

发表机构 * Independent Researcher(独立研究者)

AI总结 本文系统研究了概念瓶颈模型(CBM)中概念层的对抗性脆弱性,提出了一种基于语义扰动的稳定性正则化防御方法SPECTRA,显著提高了攻击所需的最小扰动范数,同时保持了分类精度。

Comments Accepted to CVPR 2026 (Findings). 9 pages, 6 figures

详情
AI中文摘要

概念瓶颈模型(CBM)已成为可解释机器学习的基础方法,通过显式的概念激活提供人类可理解的中间表示。然而,这种可解释性从根本上引入了一个关键且先前未被探索的攻击面:概念瓶颈层本身。我们提出了对CBM中概念级对抗性脆弱性的全面、系统性研究,揭示了对输入像素进行有针对性的最小扰动可以通过操纵语义表示导致灾难性的错误分类。我们开发了一个严格的理论框架来量化概念空间的鲁棒性,建立了揭示这些架构脆弱性景观的新指标。我们在CUB-200-2011数据集上的广泛分析表明,标准CBM对概念级操纵表现出严重的敏感性。为了解决这一关键弱点,我们引入了SPECTRA(基于语义扰动的概念训练以增强对抗鲁棒性),一种原则性的稳定性正则化防御。SPECTRA有效地强化了语义表示空间,将成功攻击所需的最小扰动范数从0.46提高到超过4,200,使得有针对性的概念操纵在计算上变得不可行。此外,SPECTRA将基线分类精度保持在2.2%以内。通过将概念级攻击确立为一种根本不同的威胁模型,这项工作在可解释机器学习与对抗鲁棒性的交叉领域开辟了一个新的研究前沿。

英文摘要

Concept Bottleneck Models (CBMs) have emerged as a cornerstone approach for interpretable machine learning, providing human-understandable intermediate representations through explicit concept activations. However, this interpretability fundamentally introduces a critical, previously unexplored attack surface: the concept bottleneck layer itself. We present a comprehensive, systematic study of concept-level adversarial vulnerabilities in CBMs, revealing that targeted, minimal perturbations operating on input pixels can induce catastrophic misclassification by manipulating semantic representations. We develop a rigorous theoretical framework to quantify concept-space robustness, establishing novel metrics that expose the vulnerability landscape of these architectures. Our extensive analysis on the CUB-200-2011 dataset demonstrates that standard CBMs exhibit severe susceptibility to concept-level manipulation. To address this critical weakness, we introduce SPECTRA (Semantic Perturbation-based Concept Training for Robustness against Attacks), a principled stability regularization defense. SPECTRA effectively hardens the semantic representation space, increasing the minimal perturbation norm required for a successful attack from 0.46 to over 4,200, rendering targeted concept manipulation computationally prohibitive. Furthermore, SPECTRA preserves baseline classification accuracy to within 2.2%. By establishing concept-level attacks as a fundamentally distinct threat model, this work opens a new research frontier at the intersection of interpretable machine learning and adversarial robustness.

2605.25294 2026-05-26 cs.CV 版本更新

Geometry-Aware Image Flow Matching

几何感知图像流匹配

Junho Lee, Kwanseok Kim, Joonseok Lee

发表机构 * Seoul National University, Seoul, Korea(首尔国立大学)

AI总结 本文通过发现自然图像语义信息主要编码在方向分量上,提出球面最优传输流匹配(SOT-CFM)和球面流匹配(SFM)两种几何感知方法,在超球面上建模图像,相比欧几里得基线取得更优性能。

详情
AI中文摘要

生成模型的最新进展突显了几何感知建模在流形约束环境中的强大能力。然而,对于自然图像,该领域仍局限于欧几里得假设,未能利用数据内在的几何结构。在本文中,我们研究了自然图像的几何结构,观察到语义信息主要编码在方向分量中,而范数分量可以通过全局平均值近似。这一性质在RGB空间和潜在空间中都成立,表明自然图像可以在超球面上有效建模。基于这一发现,我们引入了球面最优传输流匹配(SOT-CFM),它利用角距离,以及球面流匹配(SFM),它直接在流形上约束动力学。我们的实验表明,这些几何感知方法相比欧几里得基线取得了更优的性能。最终,这项工作提供了一种新颖的视角,弥合了基于黎曼流形的建模与自然图像生成之间的差距。

英文摘要

Recent advances in generative models highlight the power of geometry-aware modeling in manifold-constrained settings. Yet, for natural images, the field remains confined to Euclidean assumptions, failing to exploit the potential of intrinsic geometric structures within the data. In this work, we investigate the geometry of natural images and observe that semantic information is predominantly encoded in directional components, while norm components can be approximated by the global average. This property holds across both RGB and latent spaces, suggesting that natural images can be effectively modeled on a hypersphere. Building on this finding, we introduce Spherical Optimal Transport Flow Matching (SOT-CFM), which utilizes angular distance, and Spherical Flow Matching (SFM), which constrains dynamics directly on the manifold. Our experiments demonstrate that these geometry-aware methods achieve superior performance against Euclidean baselines. Ultimately, this work provides a novel perspective that bridges the gap between Riemannian manifold-based modeling and natural image generation.
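The abstract's two geometric ideas, splitting an image vector into direction and norm and measuring distance by angle, can be illustrated with a small sketch. It is not the authors' SOT-CFM or SFM implementation, which the abstract does not spell out. The helper names `split_direction_norm` and `slerp` are hypothetical, and spherical linear interpolation is used here only as a standard example of a path that stays on the hypersphere.

```python
import math


def split_direction_norm(x):
    """Split a vector into its unit direction and its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("the zero vector has no direction")
    return [v / norm for v in x], norm


def slerp(u, v, t):
    """Geodesic interpolation between unit vectors u and v for t in [0, 1]."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)  # angular distance between u and v
    if omega < 1e-8:
        return list(u)  # endpoints coincide, nothing to interpolate
    if math.pi - omega < 1e-8:
        raise ValueError("antipodal points have no unique geodesic")
    s = math.sin(omega)
    w_u = math.sin((1.0 - t) * omega) / s
    w_v = math.sin(t * omega) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]


direction, norm = split_direction_norm([3.0, 4.0])
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Every point returned by `slerp` has unit norm, unlike the straight-line interpolant between the same endpoints, which cuts through the interior of the sphere.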

2605.25293 2026-05-26 cs.CV cs.AI cs.RO 版本更新

Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks

基于神经形态激光雷达的鸟瞰图目标检测:使用节能脉冲神经网络

Sambit Mohapatra, Senthil Yogamani, Heinrich Gotzig, Patrick Mader

发表机构 * Valeo, Germany(法雷奥,德国) Valeo, Ireland(法雷奥,爱尔兰) TU Ilmenau, Germany(德国伊尔梅瑙工业大学)

AI总结 提出一种端到端脉冲编码器-解码器网络,用于激光雷达点云鸟瞰图表示中的目标检测,通过代理梯度反向传播训练,在KITTI基准上达到高精度,并实现3.33倍突触操作能耗降低。

详情
AI中文摘要

自动驾驶感知需要在严格的功耗约束下对三维传感器数据进行准确高效的处理。传统卷积神经网络实现了强大的检测精度,但计算密集,限制了其在资源受限的神经形态平台上的部署。脉冲神经网络通过事件驱动的稀疏计算提供了一种引人注目的替代方案,但其在复杂真实世界感知任务(如三维目标检测)中的应用仍然有限。在这项工作中,我们提出了一种端到端脉冲编码器-解码器网络,用于激光雷达点云鸟瞰图表示中的目标检测,并使用代理梯度反向传播进行训练。我们训练了两个变体:一个膜电位变体,在输出阶段读取连续神经元状态以获得最大精度,在$\mathrm{IoU}\!=\!0.5$(简单/中等/困难)下达到$92.05$/$87.04$/$86.51$ AP;以及一个全二进制脉冲变体,每一层仅操作脉冲序列,用于直接神经形态部署。我们评估了四种输入脉冲编码策略,并证明在KITTI基准上,允许网络直接从数据学习脉冲表示优于手工设计的泊松、延迟和z轴编码方案;该基准不提供连续帧,因此BEV输入在各时间步重复呈现,作为时间流的替代。分块能量分析表明,在保守的基于循环的操作下,与等效CNN相比,突触操作能量降低了$3.33\times$。这些结果共同证明了脉冲神经网络在自动驾驶中实现准确且节能的神经形态感知的可行性。

英文摘要

Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Traditional convolutional neural networks achieve strong detection accuracy but are computationally intensive, limiting their suitability for deployment on resource-constrained neuromorphic platforms. Spiking neural networks offer a compelling alternative through event-driven sparse computation, yet their application to complex real-world perception tasks such as three-dimensional object detection remains limited. In this work, we propose an end-to-end spiking encoder-decoder network for object detection in bird's eye view representations of LiDAR point clouds, trained using surrogate gradient backpropagation. We train two variants: a membrane potential variant that reads continuous neuron state at the output stage for maximum accuracy, achieving $92.05$/$87.04$/$86.51$ AP at $\mathrm{IoU}\!=\!0.5$ (Easy/Moderate/Hard), and a fully binary spiking variant that operates exclusively on spike trains at every layer for direct neuromorphic deployment. We evaluate four input spike encoding strategies and demonstrate that allowing the network to learn spike representations directly from data outperforms hand-crafted Poisson, latency, and z-axis encoding schemes on the KITTI benchmark, where sequential frames are unavailable and the BEV input is presented repeatedly across timesteps as a proxy for temporal streaming. A block-wise energy analysis demonstrates a $3.33\times$ reduction in synaptic operation energy over an equivalent CNN under conservative loop-based operation. Together, these results demonstrate the viability of spiking neural networks for accurate and energy-efficient neuromorphic perception in autonomous driving.
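To make the distinction between a spike-train output and a membrane-potential readout concrete, here is a minimal sketch of a single leaky integrate-and-fire neuron driven by the same input at every timestep, mirroring the repeated-BEV setup the abstract describes. The neuron model, the soft-reset rule, the constants `beta` and `threshold`, and the function name `lif_run` are all assumptions made for illustration; the abstract does not specify the paper's neuron dynamics, and surrogate-gradient training is not shown.

```python
def lif_run(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron.

    inputs: input current per timestep.
    Returns (spikes, membrane): the binary spike train and the
    continuous membrane potential left after the last timestep.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the previous potential, then integrate the new input.
        membrane = beta * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold  # soft reset (assumed, not from the paper)
        else:
            spikes.append(0)
    return spikes, membrane


# The same input value repeated over four timesteps.
spikes, membrane = lif_run([0.5, 0.5, 0.5, 0.5])
```

A fully binary variant would pass only `spikes` downstream, while a membrane-potential readout would use the continuous `membrane` value at the output stage.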

2605.25266 2026-05-26 cs.CV 版本更新

DeltaCam: Differential Intrinsic Camera Modeling for Video Generation

DeltaCam: 用于视频生成的差分内参相机建模

Debabrata Mandal, Zhihan Peng, Yujie Wang, Praneeth Chakravarthula

发表机构 * UNC, Chapel Hill USA(北卡罗来纳大学教堂山分校)

AI总结 提出DeltaCam视频扩散框架,通过差分参数化神经相机适配器学习相对变化,实现焦距、光圈、ISO等内参的平滑可控视频生成,并扩展到真实场景。

详情
AI中文摘要

将相机内参纳入视频生成模型为控制场景动态和影响视觉外观的成像过程提供了原则性方法。先前工作主要关注外参控制(如相机姿态和运动),而将内参视为隐式或固定。关键瓶颈在于缺乏具有准确且多样化的时变相机元数据的大规模视频数据集,这使得学习绝对相机参数化变得困难。因此,当前模型难以以可控且时间一致的方式融入摄影相机行为,包括景深转换、曝光变化、镜头畸变和色彩处理。我们引入DeltaCam,一种视频扩散框架,通过Δ参数化的神经相机适配器对相机行为进行建模,该适配器基于相机运动和内参的相对变化而非绝对状态进行操作。通过从合成视频数据中学习这种差分公式,我们减轻了对精确真实世界相机标签的依赖,并实现了对焦距、光圈、ISO、色温和镜头畸变成像因子的平滑一致控制。我们将此框架扩展到真实世界视频,通过两种机制:在真实图像-元数据对上微调控制以实现精确镜头匹配,以及提取解耦嵌入用于隐式视频到视频风格迁移,无需显式相机参数。通过有效分离场景内容与内生成像行为,DeltaCam实现了现有模型难以实现的相机一致视频生成和编辑操作。最终,我们的结果为连接合成控制与真实世界摄影仿真建立了一种实用且可扩展的方法。

英文摘要

Incorporating camera intrinsics into video generation models offers a principled way to control not only scene dynamics but also the imaging process that governs visual appearance. Prior work has primarily focused on extrinsic control, such as camera pose and motion, while treating intrinsic camera parameters as implicit or fixed. A key bottleneck is the lack of large-scale video datasets with accurate and diverse temporally varying camera metadata, which makes learning absolute camera parameterizations difficult. As a result, current models struggle to incorporate photographic camera behavior, including depth-of-field transitions, exposure variations, lens distortions, and color processing, in a controllable and temporally consistent manner. We introduce DeltaCam, a video diffusion framework that models camera behavior through $\Delta$-parameterized neural camera adaptors, operating on relative changes in camera motion and intrinsics instead of absolute states. By learning this differential formulation from synthetic video data, we mitigate reliance on precise real-world camera labels and enable smooth, consistent control over imaging factors such as focal length, aperture, ISO, color temperature, and lens distortion. We extend this framework to real-world footage through two mechanisms: finetuning the controls on real image-metadata pairs for precise shot matching, and extracting disentangled embeddings for implicit video-to-video style transfer without requiring explicit camera parameters. By effectively separating scene content from intrinsic imaging behavior, DeltaCam enables camera-consistent video generation and editing operations that are difficult to achieve with existing models. Ultimately, our results establish a practical and scalable approach for bridging synthetic control and real-world photographic emulation.

2605.25262 2026-05-26 cs.CV 版本更新

Semantics-Guided Multimodal Masked Autoencoder Pretraining for 3D BEV Object Detection

语义引导的多模态掩码自编码器预训练用于3D BEV目标检测

Prabuddhi Wariyapperuma, Rajitha de Silva, Marc Hanheide, Thomas Bohné, Leonardo Guevara

发表机构 * University of Lincoln, Lincoln Centre for Autonomous Systems(林肯大学,林肯自主系统中心) University of Cambridge, Institute for Manufacturing, Department of Engineering(剑桥大学,制造研究所,工程系)

AI总结 提出语义引导的多模态掩码自编码器框架,通过语义引导的LiDAR体素掩码和辅助点语义解码分支,在预训练中注入语义信息,提升3D BEV目标检测性能。

Comments Accepted at the ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy (SRRA) as a lightning talk and poster

详情
AI中文摘要

准确的3D鸟瞰图(BEV)目标检测对于自动驾驶至关重要,并且强烈依赖于来自互补传感器(如摄像头和LiDAR)的有效多模态表示。多模态掩码自编码器已显示出学习此类表示以用于下游3D BEV目标检测的强大潜力。然而,现有方法通常对摄像头和LiDAR输入应用均匀随机掩码,平等对待所有区域,并且仅通过掩码重建学习表示。我们提出了一种语义引导的多模态掩码自编码器框架,该框架在预训练期间通过两个独立组件引入语义信息:(i)语义引导的LiDAR体素掩码,它更强烈地保留语义重要的LiDAR区域,以及(ii)一个辅助的点级LiDAR语义解码分支,在重建之外注入语义引导。在BEVFusion 3D目标检测上,与标准UniM2AE基线相比,我们的语义引导预训练策略在nuScenes mini验证集上提升了性能:语义引导的LiDAR体素掩码在基线上实现了+1.49%的平均精度(mAP)和+1.66%的nuScenes检测分数(NDS),而解码器侧的点语义监督实现了+1.39%的mAP和+3.22%的NDS。

英文摘要

Accurate 3D bird's-eye view (BEV) object detection is essential for autonomous driving, and depends strongly on effective multimodal representations from complementary sensors such as cameras and LiDAR. Multimodal masked autoencoders have shown strong potential for learning such representations for downstream 3D BEV object detection. However, existing methods typically apply uniform random masking to camera and LiDAR inputs, treating all regions equally, and learn representations only through masked reconstruction. We propose a semantics-guided multimodal masked autoencoder framework that introduces semantic information during pretraining through two separate components: (i) semantics-guided LiDAR voxel masking, which preserves semantically important LiDAR regions more strongly, and (ii) an auxiliary point-wise LiDAR semantic decoder branch that injects semantic guidance in addition to reconstruction. On BEVFusion 3D object detection, our semantics-guided pretraining strategy improves performance on the nuScenes mini validation set compared to the standard UniM2AE baseline: semantics-guided LiDAR voxel masking yields +1.49% mean Average Precision (mAP) and +1.66% nuScenes Detection Score (NDS), while decoder-side point semantic supervision yields +1.39% mAP and +3.22% NDS over the baseline.

2605.25254 2026-05-26 cs.CV cs.AI 版本更新

Guess the Unified Model: How Much Can We Recover from Generated Images?

猜猜统一模型:从生成的图像中我们能恢复多少?

Jasin Cekinmez, Ryo Mitsuhashi, Addison J. Wu, Yida Yin

发表机构 * Princeton University(普林斯顿大学)

AI总结 本文研究统一模型生成图像的可分离性,通过七个模型的大量图像实验,发现模型归因高度可行,且语义内容对可分离性有贡献但非主导信号。

详情
AI中文摘要

随着统一模型生成的图像现在在线广泛传播,追溯其来源模型为透明度和深入理解单个模型的特征行为提供了一条途径。先前的工作已经探索了LLM生成文本、扩散模型图像和数据集的来源,但统一模型生成图像的可分离性仍然是一个未充分探索的领域。我们通过使用七个统一模型生成的图像,检查在损坏、领域和提示语言上的可分离性来填补这一空白。我们表明模型归因高度可行,因为我们的模型在每个模型约20K图像的情况下达到了近乎完美的准确率。损坏和结构扰动对归因性能的影响较小,跨领域泛化表明语义内容对可分离性有贡献,但并非主导信号。最后,我们观察到对于大多数模型,提示语言归因接近随机水平,表明语言特定的视觉特征极少。这些发现突显了统一模型输出中一致的模型特定视觉特征,并为追踪和审计生成图像流水线开辟了新方向。

英文摘要

With unified model-generated images now widespread online, attributing their model of origin offers a path toward transparency and deeper insight into the characteristic behaviors of individual models. Prior work has explored provenance in LLM-generated text, diffusion model images, and datasets, but the separability of unified model-generated images remains an underexplored area. We address this gap by examining separability across corruption, domains, and prompt languages using images generated by seven unified models. We show that model attribution is highly feasible as our model achieves near-perfect accuracy with around 20K images per model. Corruptions and structural perturbations have only a modest effect on attribution performance, and cross-domain generalization reveals that semantic content contributes to separability but is not the dominant signal. Finally, we observe that for most models, prompt language attribution is around chance levels, suggesting minimal language-specific visual signatures. These findings highlight consistent model-specific visual characteristics in unified model outputs and open new directions for tracing and auditing generative image pipelines.

2605.20787 2026-05-26 cs.CV 版本更新

Findings of the Counter Turing Test: AI-Generated Image Detection

反图灵测试结果:AI生成图像检测

Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

发表机构 * Kalyani Government Engineering College(卡利亚尼政府工程学院) University of South Carolina(南卡罗来纳大学) IIIT Delhi(德里英德拉普拉斯塔信息技术学院) BITS Pilani Hyderabad Campus(比尔拉理工学院皮拉尼分校海得拉巴校区) IIIT Guwahati(印度信息技术学院古瓦哈提分校) NIT Silchar(锡尔杰尔国立理工学院) San José State University(圣何塞州立大学) UCLA(加州大学洛杉矶分校) Washington State University(华盛顿州立大学) Vishwakarma Institute of Information Technology(维什瓦卡玛信息技术学院) Meta AI Amazon AI(亚马逊AI) BITS Pilani Goa(比尔拉理工学院皮拉尼分校果阿校区)

AI总结 本文通过Defactify 4.0工作坊的反图灵测试竞赛,评估了多种检测方法在区分AI生成图像与真实图像及识别具体生成模型上的性能,发现检测准确率较高但模型识别仍具挑战。

Comments Defactify4 @AAAI 2025

详情
AI中文摘要

生成式AI技术(如Stable Diffusion、DALL-E和Midjourney)的快速发展显著改变了合成视觉内容的创建方式。虽然这些模型推动了各行各业的创新,但也带来了严重挑战,包括错误信息、虚假信息和有偏内容生成。AI生成图像日益逼真,使其检测成为研究人员、政策制定者和行业利益相关者关注的紧迫问题。 在本文中,我们介绍了Defactify 4.0工作坊的成果,该工作坊推出了用于AI生成图像检测的反图灵测试(CT2)。竞赛包含两个关键任务:(1)将图像二分类为AI生成或真实;(2)识别生成AI图像的具体生成模型。为支持这两个任务,我们采用了MS COCOAI数据集,该基准包含由五个最先进模型生成的96000张真实和合成图像,以及来自MS COCO的真实图像。 参与者采用了多种检测策略,包括卷积神经网络(CNN)、视觉Transformer(ViT)、基于频率的分析、对比学习和多模态技术。结果表明,虽然AI生成图像可以被高精度检测(F1分数>0.83),但准确识别具体模型仍然更具挑战性(最高F1分数:0.4986)。这些发现凸显了改进模型指纹识别、对抗鲁棒性和实时检测机制的必要性。

英文摘要

The rapid advancements in generative AI technologies, such as Stable Diffusion, DALL-E, and Midjourney, have significantly transformed the creation of synthetic visual content. While these models enable innovation across industries, they also pose serious challenges, including misinformation, disinformation, and biased content generation. The increasing realism of AI-generated images makes their detection a pressing concern for researchers, policymakers, and industry stakeholders. In this paper, we present the findings of the Defactify 4.0 workshop, which introduced the Counter Turing Test (CT2) for AI-Generated Image Detection. The competition consisted of two key tasks: (1) binary classification of images as either AI-generated or real and (2) identification of the specific generative model responsible for an AI-generated image. To support both tasks, we employed the MS COCOAI dataset, a benchmark of 96000 real and synthetic images generated by five state-of-the-art models alongside real images from MS COCO. Participants employed diverse detection strategies, including convolutional neural networks (CNNs), Vision Transformers (ViTs), frequency-based analysis, contrastive learning, and multimodal techniques. The results demonstrated that while AI-generated images can be detected with high accuracy (F1-score > 0.83), identifying the exact model used remains significantly more challenging (highest F1-score: 0.4986). These findings highlight the need for improved model fingerprinting, adversarial robustness, and real-time detection mechanisms.
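Both tasks are scored by F1, so a short sketch of the metric may help when reading the gap between 0.83 and 0.4986. This is a plain-Python illustration, not the workshop's evaluation script. The abstract does not say how per-class scores are averaged for the model-identification task, so the macro average below is an assumption, as are the function names and the toy labels.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_score(y_true, y_pred, label) for label in labels) / len(labels)


# Toy binary example: 1 = AI-generated, 0 = real.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
binary_f1 = f1_score(y_true, y_pred)
averaged_f1 = macro_f1(y_true, y_pred)
```

With more classes, as in the model-identification task, the macro average penalizes any class that the detector consistently confuses with another.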

2605.20772 2026-05-26 cs.CV 版本更新

VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering

VIHD: 基于视觉干预的医学视觉问答幻觉检测

Jiayi Chen, Benteng Ma, Zehui Liao, Winston Chong, Yasmeen George, Jianfei Cai

发表机构 * Department of Data Science & AI, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia; Alfred Health Radiology, Alfred Health, Melbourne, VIC 3004, Australia; School of Translational Medicine, Faculty of Medicine, Nursing and Health Sciences, Monash University, Melbourne, VIC 3800, Australia; Hong Kong Polytechnic University, Hong Kong SAR, China

AI总结 提出VIHD方法,通过视觉依赖探测和视觉干预解码校准语义熵,有效检测医学多模态大语言模型中的幻觉响应。

Comments Early accepted by MICCAI 2026. This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections

详情
AI中文摘要

尽管医学多模态大语言模型(MLLMs)在辅助诊断方面展现出潜力,但它们仍然频繁生成在语言上看似合理但缺乏视觉证据的幻觉响应。这种幻觉对临床决策构成风险,因此需要有效的检测方法。现有的内省检测方法主要通过分析模型在原始或扰动输入条件下的响应来进行不确定性估计或逻辑验证。然而,这种外部扰动通常是启发式的且与上下文无关,忽略了解码过程中生成令牌与相关视觉令牌之间的内部跨模态依赖。为解决这一问题,我们提出了VIHD,一种基于视觉干预的幻觉检测方法,通过针对性的视觉令牌掩码校准语义熵,以实现更有效的幻觉检测。VIHD通过视觉依赖探测(VDP)定位视觉主导的解码器层,通过令牌掩码执行视觉干预解码(VID)以校准语义分布,并将得到的校准语义熵(CSE)量化为可靠的幻觉信号。在三个医学VQA基准测试和两个医学MLLM上的大量实验表明,VIHD始终优于最先进的方法,强调了细粒度视觉依赖对于幻觉检测的重要性。代码将发布在https://github.com/Jiayi-Chen-AU/VIHD。

英文摘要

While medical Multimodal Large Language Models (MLLMs) have shown promise in assisting diagnosis, they still frequently generate hallucinated responses that appear linguistically plausible but lack visual evidence. Such hallucinations pose risks to clinical decision-making and necessitate effective detection. Existing introspective detection methods primarily perform uncertainty estimation or logical verification by analyzing model responses conditioned on original or perturbed inputs. However, such external perturbations are often heuristic and context-agnostic, which overlooks the internal cross-modal dependency between generated tokens and related visual tokens during decoding. To address this issue, we propose VIHD, a Visual Intervention-based Hallucination Detection method that leverages targeted visual token masking to calibrate semantic entropy for more effective hallucination detection. VIHD locates visually dominant decoder layers via Visual Dependency Probing (VDP), executes Visual Intervention Decoding (VID) via token masking to calibrate the semantic distribution, and quantifies the resulting Calibrated Semantic Entropy (CSE) as a reliable hallucination signal. Extensive experiments on three medical VQA benchmarks with two medical MLLMs demonstrate that VIHD consistently outperforms state-of-the-art methods, underscoring the importance of fine-grained visual dependency for hallucination detection. The code will be available at https://github.com/Jiayi-Chen-AU/VIHD
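The quantity VIHD calibrates is semantic entropy, which in its basic form is the entropy of the distribution over meaning clusters among several sampled answers. The sketch below shows only that basic quantity. It does not reproduce the paper's Visual Dependency Probing or Visual Intervention Decoding, and it assumes the sampled answers have already been grouped into meaning clusters by some external equivalence check. The function name and the cluster labels are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids: one cluster label per sampled answer.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# All sampled answers agree in meaning: low entropy.
consistent = semantic_entropy(["pneumonia"] * 4)
# Answers split evenly between two meanings: higher entropy.
split = semantic_entropy(["pneumonia", "pneumonia", "effusion", "effusion"])
```

Higher entropy means the sampled answers disagree in meaning, which is the signal an entropy-based detector reads as a possible hallucination.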

2605.19739 2026-05-26 cs.CV 版本更新

FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models

FlowErase-RL:将概念擦除重新思考为流匹配模型中的奖励优化

Yi Sun, Zhiqi Zhang, Xinhao Zhong, Yimin Zhou, Shuoyang Sun, Bin Chen, Shu-Tao Xia, Ke Xu

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Jilin University(吉林大学) Peng Cheng Laboratory(鹏城实验室) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 提出FlowErase-RL,首个基于GRPO的框架,通过动态双路径奖励机制将概念擦除转化为奖励优化问题,在抑制目标概念的同时保持生成保真度,实现最先进的擦除性能与鲁棒性。

详情
AI中文摘要

近期流匹配模型的进展显著提升了文本到图像生成的质量,但也因生成有害或不良内容而引入了日益增长的安全风险。现有的概念擦除方法要么是推理时干预,效果有限;要么依赖监督微调(SFT),后者需要精确对齐的数据,且在可扩展性和多概念场景中面临挑战。本文提出\emph{FlowErase-RL},首个基于GRPO的流匹配模型概念擦除框架。我们将概念擦除重新表述为奖励优化问题,并引入\textbf{动态双路径奖励机制},联合优化(i)概念擦除(CE)奖励以抑制目标概念,以及(ii)非目标空间(NS)奖励以保持生成保真度。通过性能驱动的切换策略,在训练过程中自适应平衡两条奖励路径,无需显式监督即可实现稳定优化。在裸体、物体和艺术风格擦除上的大量实验表明,我们的方法在保持强大图像质量和语义对齐的同时,实现了最先进的擦除性能。此外,它对对抗攻击表现出鲁棒抵抗性,并能有效扩展到多概念场景。我们的结果为流匹配模型中的安全可控生成建立了新范式。

英文摘要

Recent advances in flow matching models have significantly improved text-to-image generation quality, but also introduce growing safety risks due to the generation of harmful or undesirable content. Existing concept erasure methods are either inference-time interventions with limited effectiveness or rely on supervised fine-tuning (SFT), which requires precisely aligned data and struggles with scalability and multi-concept settings. In this paper, we propose \emph{FlowErase-RL}, the first GRPO-based framework for concept erasure in flow matching models. We reformulate concept erasure as a reward optimization problem and introduce a \textbf{dynamic dual-path reward mechanism} that jointly optimizes (i) a Concept Erasure (CE) reward to suppress target concepts and (ii) a Non-target Space (NS) reward to preserve generative fidelity. The two reward paths are adaptively balanced during training via a performance-driven switching strategy, enabling stable optimization without explicit supervision. Extensive experiments on nudity, object, and artistic style erasure demonstrate that our method achieves state-of-the-art erasure performance while maintaining strong image quality and semantic alignment. Moreover, it exhibits robust resistance to adversarial attacks and scales effectively to multi-concept scenarios. Our results establish a new paradigm for safe and controllable generation in flow matching models.
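GRPO-style methods score each generated sample relative to the other samples drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative normalization follows. It takes one scalar reward per sample and does not model how the paper combines or switches between the CE and NS reward paths, which the abstract does not detail. The function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sample group to zero mean and unit scale.

    rewards: scalar rewards for samples generated from the same prompt.
    eps: small constant that keeps the division finite when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumed choice)
    return [(r - mean) / (std + eps) for r in rewards]


# Three samples for one prompt with increasing reward.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples that beat the group average get a positive advantage and are reinforced, while a group with identical rewards contributes no learning signal.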

2605.18916 2026-05-26 cs.MM cs.AI cs.CV cs.SD eess.AS 版本更新

CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

CounterFlow: 一种用于反事实视频拟音生成的两阶段推理时采样方法

Gyubin Lee, Junwon Lee, Juhan Nam

发表机构 * Kim Jaechul Graduate School of AI, KAIST(韩国科学技术院金在哲人工智能研究生院)

AI总结 提出CounterFlow,一种两阶段推理时采样方案,用于预训练的流匹配VT2A模型,以生成与视觉证据矛盾但时间同步的反事实视频拟音,并通过新指标评估替换质量。

Comments accepted to CVPR 2026 Workshop on Sight and Sound

详情
AI中文摘要

我们研究反事实视频拟音生成,旨在采用与视觉证据矛盾的声源身份,同时保持与无声视频的时间同步。现有的视频与文本到音频(VT2A)模型难以处理此问题,当视频和文本内容不一致时,它们往往仍锚定于视觉隐含的声源。我们提出CounterFlow,一种用于预训练流匹配VT2A模型的推理时双阶段采样方案。第一阶段构建视频衍生的时间结构,同时抑制视觉隐含的声源;第二阶段放弃视频条件,完全专注于塑造朝向目标提示的音频音色。与朴素的负提示和最新基线相比,CounterFlow显著改进了反事实视频拟音生成。为了评估替换质量,我们提出一个利用文本-音频共嵌入空间的度量,同时衡量目标提示证据和残留的视觉隐含声源泄漏。视频演示和代码可在https://gyubin-lee.github.io/counterflow-demo/获取。

英文摘要

We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/

2605.18746 2026-05-26 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

ESI-Bench: 迈向闭环感知-动作的具身空间智能

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas, Li Fei-Fei, Jiajun Wu, Yejin Choi

发表机构 * Stanford University(斯坦福大学) UCLA(加州大学洛杉矶分校) Northwestern University(西北大学)

AI总结 提出ESI-BENCH基准,通过主动探索(感知、移动、操作)在OmniGibson环境中评估具身空间智能,发现主动探索显著优于被动方法,失败主因是动作盲视而非感知弱,且模型存在元认知差距。

Comments https://esi-bench.github.io/

详情
AI中文摘要

空间智能通过感知-动作循环展开:智能体通过行动获取观察,并推理观察如何随动作变化。它们不是被动处理所见,而是主动揭示未见——遮挡结构、动态、包含关系和功能,这些无法仅通过被动感知解决。我们超越先前假设神谕观察的空间智能表述,将观察者重新定义为行动者。我们引入ESI-BENCH,一个基于OmniGibson、扎根于Spelke核心知识系统的全面具身空间智能基准,涵盖10个任务类别和29个子类别。智能体必须决定部署哪些能力——感知、移动和操作——以及如何排序以主动积累任务相关证据。我们对最先进的MLLM进行大量实验,发现主动探索显著优于被动对应物,智能体自发发现涌现的空间策略而无需明确指令,而随机多视角往往增加噪声而非信号,尽管消耗更多图像。大多数失败并非源于感知弱,而是动作盲视:糟糕的动作选择导致糟糕的观察,进而引发级联错误。虽然显式3D基础稳定了深度敏感任务的推理,但不完美的3D表示通过扭曲空间关系证明比2D基线更有害。人类研究进一步揭示,与寻求证伪视角并在矛盾下修正信念的人类不同,模型无论证据质量如何都过早且高置信度地承诺,暴露了一个既不能通过更好感知也不能通过更多具身互动单独闭合的元认知差距。

英文摘要

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

2605.17287 2026-05-26 cs.CV 版本更新

LISA: Language-guided Interference-aware Spatial-Frequency Attention for Driver Gaze Estimation

LISA: 语言引导的干扰感知空间-频率注意力用于驾驶员视线估计

Jun Ma, Zhenye Yang, Ruichen Zhou, Pei Zhang, Huan Li, Jinpeng Chen

发表机构 * School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications(北京邮电大学计算机学院(国家示范性软件学院)) Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education(可信分布式计算与服务教育部重点实验室(北京邮电大学)) School of Electrical Engineering, Guangxi University(广西大学电气工程学院) Zhejiang University(浙江大学)

AI总结 提出LISA框架,结合频域先验与视觉语言知识,通过双域融合机制和训练时解耦策略,实现鲁棒的驾驶员视线估计,在遮挡和光照变化下达到最优性能。

Comments 9 pages, 5 figures, 3 tables

详情
AI中文摘要

驾驶员视线估计是现代监控系统中评估驾驶员注意力的一项基本指标。除了易受突然光照变化和传感器噪声影响外,空间域模型难以将真实的视线线索与无关的视觉属性分离。在本文中,我们提出了LISA,一个语言引导的干扰感知空间-频率注意力框架,结合了频域先验与视觉语言知识。观察到即使在空间扰动下幅度谱仍保持相对稳定,我们设计了一种双域融合机制。它将稳定的低频语义集成到高频细节中,利用空间注意力精确定位眼部区域。为减少语义模糊性,我们还引入了一种训练时解耦策略。使用冻结的CLIP编码器和正交正则化,我们将视线特征与外观干扰明确分离。在两个基准上的实验表明,LISA达到了最先进的性能,在遮挡和光照变化下具有显著增强的鲁棒性。代码仓库可在 https://github.com/Mason-bupt/LISA 获取。

英文摘要

Driver gaze estimation serves as a fundamental metric for evaluating driver attentiveness in modern monitoring systems. Beyond being vulnerable to sudden lighting changes and sensor noise, spatial-domain models struggle to disentangle authentic gaze cues from irrelevant visual attributes. In this paper, we propose LISA, a \textbf{L}anguage-guided \textbf{I}nterference-aware \textbf{S}patial-Frequency \textbf{A}ttention framework that combines frequency-domain priors with vision-language knowledge. Observing that the amplitude spectrum remains relatively stable even under spatial perturbations, we design a dual-domain fusion mechanism. It integrates stable low-frequency semantics into high-frequency details, employing spatial attention to precisely target ocular regions. To reduce semantic ambiguity, we also introduce a training-time disentanglement strategy. Using a frozen CLIP encoder and orthogonal regularization, we explicitly separate gaze features from appearance interference. Experiments on two benchmarks show that LISA achieves state-of-the-art performance, with significantly improved robustness against occlusions and lighting variations. The code repository is available at https://github.com/Mason-bupt/LISA.

2605.12964 2026-05-26 cs.CV 版本更新

Asymmetric Flow Models

非对称流模型

Hansheng Chen, Jan Ackermann, Minseo Kim, Gordon Wetzstein, Leonidas Guibas

发表机构 * Stanford University(斯坦福大学)

AI总结 提出非对称流建模(AsymFlow),通过秩非对称速度参数化将噪声预测限制在低秩子空间,同时保持数据预测全维,从而在高维空间中实现高效的流生成,在ImageNet 256×256上取得领先的1.57 FID,并首次提供将预训练潜在流模型微调为像素空间模型的途径。

Comments Code: https://github.com/Lakonik/LakonLab Webpage: https://hanshengchen.com/asymflow

详情
AI中文摘要

高维空间中的基于流的生成是困难的,因为即使数据具有强低秩结构,速度预测也需要建模高维噪声。我们提出非对称流建模(AsymFlow),一种秩非对称速度参数化,将噪声预测限制在低秩子空间,同时保持数据预测全维。通过这种非对称预测,AsymFlow在不改变网络架构或训练/采样过程的情况下,解析地恢复全维速度。在ImageNet 256×256上,AsymFlow取得了领先的1.57 FID,大幅优于先前的DiT/JiT类像素扩散模型。AsymFlow还首次提供了将预训练潜在流模型微调为像素空间模型的途径:将低秩像素子空间与潜在空间对齐,得到无缝初始化,保留潜在模型的高级语义和结构,因此微调主要改善低级不匹配,而非重新学习像素生成。我们展示了从FLUX.2 klein 9B微调得到的像素AsymFlow模型在像素空间文本到图像生成中建立了新的最先进水平,在HPSv3、DPG-Bench和GenEval上击败了其潜在基础模型,并在定性上显示出显著改善的视觉真实感。

英文摘要

Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256$\times$256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.

2605.12649 2026-05-26 cs.CV 版本更新

DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery

DIVER: 通过表达性语义恢复深入挖掘蒸馏数据

Qianxin Xia, Zhiyong Shu, Wenbo Jiang, Jiawei Du, Jielei Wang, Guoming Lu

发表机构 * University of Electronic Science and Technology of China, Chengdu, China(电子科技大学,成都,中国) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore(高性能计算研究所(IHPC),科技研究局(A*STAR),新加坡) Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore(前沿人工智能研究中心(CFAR),科技研究局(A*STAR),新加坡)

AI总结 提出双阶段蒸馏框架DIVER,利用预训练扩散模型通过语义继承、引导和融合恢复蒸馏数据的表达性语义,提升跨架构泛化能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

数据集蒸馏旨在从原始数据集中合成一个紧凑的代理数据集,该数据集不可读或非原始,以保护隐私并实现高效学习。然而,先前的方法通常采用单阶段蒸馏范式,该范式会学习过度适应先验架构的特定模式,从而抑制语义表达并导致跨异构架构的性能下降。为了解决这个问题,我们提出了一种新颖的双阶段蒸馏框架,称为${\textbf{DIVER}}$,它利用预训练的扩散模型通过表达性语义恢复深入挖掘蒸馏数据,整个过程包括语义继承、引导和融合。语义继承将抽象蒸馏图像的高级语义蒸馏到潜在空间中,以过滤掉架构特定的“噪声”并保留内在语义。此外,语义引导通过指导反向过程来改善原始语义的保留。最后,语义融合被设计为仅在反向过程的具体阶段提供语义引导,防止语义模糊和伪影,同时保持引导信息。大量实验验证了DIVER在改进经典蒸馏技术和显著提升跨架构泛化方面的有效性和效率,在ImageNet(256×256)上仅需与原始DiT相当的处理时间,且仅使用4 GB GPU内存。

英文摘要

Dataset distillation aims to synthesize a compact proxy dataset that is unreadable or non-raw from the original dataset for privacy protection and highly efficient learning. However, previous approaches typically adopt a single-stage distillation paradigm, which suffers from learning specific patterns that overfit on a prior architecture, consequently suppressing the expression of semantics and leading to performance degradation across heterogeneous architectures. To address this issue, we propose a novel dual-stage distillation framework called ${\textbf{DIVER}}$, which leverages the pre-trained diffusion model to dive deeper into $\textbf{DI}$stilled data $\textbf{V}$ia $\textbf{E}$xpressive semantic $\textbf{R}$ecovery, an entire process of semantic inheritance, guidance, and fusion. Semantic inheritance distills high-level semantics of abstract distilled images into the latent space to filter out architecture-specific ``noise'' and retain the intrinsic semantics. Furthermore, semantic guidance improves the preservation of the original semantics by directing the reverse procedure. Finally, semantic fusion is designed to provide semantic guidance only during the concrete phase of the reverse process, preventing semantic ambiguity and artifacts while maintaining the guidance information. Extensive experiments validate the effectiveness and efficiency of DIVER in improving classical distillation techniques and significantly improving cross-architecture generalization, requiring processing time comparable to raw DiT on ImageNet (256$\times$256) with only 4 GB of GPU memory usage.

2605.12374 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

填补GAP:多模态大语言模型中视觉推理的粒度对齐范式

Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang

发表机构 * Qwen Large Model Application Team, Alibaba(阿里巴巴Qwen大模型应用团队) Alibaba(阿里巴巴) University of Waterloo(滑铁卢大学) Vector Institute(向量研究所) Zhejiang University(浙江大学)

AI总结 提出GAP(粒度对齐范式),通过特征级、上下文级和能力引导级对齐,解决多模态大语言模型中视觉潜在推理的特征空间不匹配问题,提升感知与推理性能。

详情
AI中文摘要

视觉潜在推理让多模态大语言模型(MLLM)以连续令牌形式创建中间视觉证据,避免外部工具或图像生成器。然而,现有方法通常遵循输出即输入的潜在范式,产生不稳定的收益。我们发现了特征空间不匹配的证据,这种不匹配可能是导致上述不稳定的原因之一:主流的视觉潜在模型建立在预归一化MLLM上,重用解码器隐藏状态作为预测的潜在输入,尽管这些状态与模型训练时消耗的输入嵌入处于截然不同的范数范围(Xie et al., 2025; Li et al., 2026; Team et al., 2026)。这种不匹配可能使直接潜在反馈不可靠。受此诊断启发,我们提出GAP,一种用于视觉潜在建模的粒度对齐范式。GAP在三个层面对齐视觉潜在推理:特征级对齐通过轻量级PCA对齐潜在头将解码器输出映射为输入兼容的视觉潜在;上下文级对齐通过可检查的辅助视觉监督锚定潜在目标;能力引导对齐选择性地将潜在监督分配给基础MLLM难以处理的示例。在Qwen2.5-VL 7B上,所得模型在我们的监督变体中实现了最佳平均聚合感知和推理性能。推理时干预探测进一步表明,生成的潜在提供了任务相关的视觉信号,而不仅仅是增加令牌槽位。

英文摘要

Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume (Xie et al., 2025; Li et al., 2026; Team et al., 2026). This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

2605.09223 2026-05-26 cs.CV 版本更新

CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding

CREST: 曲率调节的事件中心采样用于高效长视频理解

Mehrajul Abadin Miraj, Abdul Mohaimen Al Radi, Shariful Islam Rayhan, Md. Tanvir Alam, Ismat Rahman, Yu Tian, Md Mosaddek Khan

发表机构 * Dept. of CSE, University of Dhaka(达卡大学计算机科学与工程系) Dept. of CSE, University of Central Florida(中央佛罗里达大学计算机科学与工程系)

AI总结 提出一种无训练帧选择方法CREST,利用查询-帧相关性的时间几何(局部曲率)来指导采样,在固定预算下实现高效长视频理解。

详情
AI中文摘要

从长视频中选择信息帧是一个组合问题,现有方法要么通过高效启发式方法处理,但未显式建模查询条件的时间结构,要么通过多阶段检索流水线处理,但预处理成本高。我们提出\textbf{CREST},一种基于查询-帧相关性的时间几何的无训练帧选择方法。CREST基于观察:相关性随时间表现出结构化的局部变化——显著事件周围曲率陡峭,冗余段区域平坦。通过使用局部曲率指导选择,CREST在短暂决定性事件和缓慢演变的证据之间更有效地分配固定帧预算。在固定主干网络和帧预算下,CREST在LongVideoBench和VideoMME上比轻量级相关性-覆盖基线AKS获得更高准确率,同时保留了更强多阶段检索流水线MIRA的93-95%准确率,而预处理成本仅为后者的3-4%。\footnote{代码和实现细节包含在补充材料中,将在接收后公开发布。}在时间帧选择的诊断基准TempRel上,CREST比AKS相对提高6.88%。成对LLM-as-a-judge评估进一步表明,CREST选择的帧产生更连贯的帧条件描述,在两个基准上胜率分别为60.58%和54.50%。这些结果表明,局部时间几何为长视频帧选择提供了简单高效的基础。

英文摘要

Selecting informative frames from long videos is a combinatorial problem that existing methods address either through efficient heuristics without explicit modeling of query-conditioned temporal structure, or through multi-stage retrieval pipelines with substantial preprocessing cost. We propose \textbf{CREST}, a training-free frame selection method grounded in the temporal geometry of query--frame relevance. CREST is based on the observation that relevance over time exhibits structured local variation: sharp curvature around salient events and flatter regions in redundant segments. By using local curvature to guide selection, CREST allocates a fixed frame budget more effectively across brief decisive events and slowly evolving evidence. Under a fixed backbone and frame budget, CREST achieves higher accuracy than AKS, a lightweight relevance--coverage baseline, on LongVideoBench and VideoMME, while retaining 93--95\% of the accuracy of MIRA, a stronger multi-stage retrieval pipeline, at only 3--4\% of its preprocessing cost.\footnote{Code and implementation details are included in the supplementary material and will be released publicly upon acceptance.} On TempRel, our diagnostic benchmark for temporal frame selection, CREST achieves a 6.88\% relative improvement over AKS. Pairwise LLM-as-a-judge evaluation further shows that CREST-selected frames yield more coherent frame-conditioned descriptions, with win rates of 60.58\% and 54.50\% on the two benchmarks. These results show that local temporal geometry provides a simple and efficient basis for long-video frame selection.
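
To make the curvature idea in this abstract concrete, here is a minimal sketch of curvature-guided frame selection under a fixed budget. It is not the authors' algorithm: the abstract does not specify how curvature is computed or combined with relevance, so the discrete second difference, the mixing weight `alpha`, and the function names `local_curvature` and `select_frames` are all illustrative assumptions.

```python
def local_curvature(scores):
    """Absolute discrete second difference of a relevance curve.

    Assumption: curvature is approximated by |r[t-1] - 2 r[t] + r[t+1]|;
    the endpoints have no two-sided neighbourhood and are set to 0.
    """
    n = len(scores)
    curv = [0.0] * n
    for t in range(1, n - 1):
        curv[t] = abs(scores[t - 1] - 2.0 * scores[t] + scores[t + 1])
    return curv


def select_frames(scores, budget, alpha=0.5):
    """Pick `budget` frame indices by mixing relevance and curvature.

    `alpha` is a hypothetical mixing weight: 0 means relevance only,
    1 means curvature only.
    """
    curv = local_curvature(scores)
    priority = [(1.0 - alpha) * s + alpha * c for s, c in zip(scores, curv)]
    ranked = sorted(range(len(scores)), key=lambda t: priority[t], reverse=True)
    return sorted(ranked[:budget])  # return indices in temporal order


if __name__ == "__main__":
    relevance = [0.1, 0.1, 0.9, 0.1, 0.1, 0.1]  # one brief, sharp event at t=2
    print(select_frames(relevance, budget=3))
```

In this toy curve the budget concentrates around the sharp event rather than spreading evenly over the flat, redundant region, which is the behaviour the abstract attributes to curvature guidance.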

2605.07607 2026-05-26 cs.CV 版本更新

FS-I2P: A Hierarchical Focus-Sweep Registration Network with Dynamically Allocated Depth

FS-I2P:一种具有动态分配深度的分层聚焦扫描配准网络

Zhixin Cheng, Yujia Chen, Xujing Tao, Bohao Liao, Xiaotian Yin, Baoqun Yin, Tianzhu Zhang

发表机构 * School of Information Science and Technology, University of Science and Technology of China(信息科学与技术学院,中国科学技术大学) School of Computer Science and Information Engineering, Hefei University of Technology(计算机科学与信息工程学院,合肥工业大学) National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory(深空探测国家实验室,深空探测实验室) Institute of Advanced Technology, University of Science and Technology of China(先进技术研究院,中国科学技术大学)

AI总结 提出一种基于聚焦-扫描范式的分层交互模块和动态层分配策略,用于解决图像到点云配准中的尺度模糊和注意力漂移问题,在RGB-D Scenes V2和7-Scenes数据集上达到最优性能。

详情
AI中文摘要

图像到点云的配准常常受到视角变化、跨模态差异和重复纹理的挑战,这些因素会导致尺度模糊,进而产生错误的对应关系。最近的无检测方法通过利用多尺度特征和基于Transformer的交互来缓解这一问题。然而,它们仍然存在跨层的注意力漂移和尺度内不一致性,阻碍了精确配准。受人类行为启发,我们提出了一种“聚焦-扫描”范式,并在基于SSM的框架内开发了分层聚焦-扫描交互模块,以增强多层次跨模态特征关联。此外,我们引入了一种动态层分配策略,自适应地确定迭代深度,以更好地利用几何约束并提高匹配鲁棒性。在两个基准数据集RGB-D Scenes V2和7-Scenes上的大量实验和消融研究表明,我们的方法达到了最先进的性能。

英文摘要

Image-to-point cloud registration is often challenged by viewpoint changes, cross-modal discrepancies, and repetitive textures, which induce scale ambiguity and consequently lead to erroneous correspondences. Recent detection-free methods alleviate this issue by leveraging multi-scale features and transformer-based interactions. However, they still suffer from attention drift across layers and intra-scale inconsistencies, hindering precise registration. Inspired by human behavior, we propose a ``Focus--Sweep'' paradigm and develop a Hierarchical Focus--Sweep Interaction Module within an SSM-based framework to enhance multi-level cross-modal feature association. In addition, we introduce a Dynamic Layer Allocation Strategy that adaptively determines the iteration depth to better exploit geometric constraints and improve matching robustness. Extensive experiments and ablations on two benchmarks, RGB-D Scenes V2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance.

2605.02764 2026-05-26 cs.CV 版本更新

FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation

FoR-Net:学习聚焦困难区域以实现高效语义分割

Sheng-Wei Chan, Hsin-Jui Pan, Chun-Po Shen, Yung-Che Wang, Meng-Qian Li, Chia-Min Lin, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University(淡江大学电机与计算机工程学系)

AI总结 提出FoR-Net框架,通过可学习的重要性图和Top-K激活机制聚焦困难区域(如细长结构和物体边界),在有限计算资源下实现高效语义分割。

Comments 9 pages, 2 figures, 2 tables. Efficient semantic segmentation under resource-constrained settings. Code will be released

详情
AI中文摘要

我们提出FoR-Net,一种高效的语义分割框架,专注于识别和增强困难区域。FoR-Net不依赖沉重的全局建模,而是采用一种高效策略,通过可学习的重要性图和Top-K激活机制选择性强调信息丰富的区域。具体来说,选择器模块预测区域重要性,使模型能够聚焦于挑战性区域,如细长结构和物体边界。使用不同感受野的卷积分支实现多尺度推理,允许多样化的空间上下文聚合。我们在有限计算资源下对Cityscapes基准评估FoR-Net。尽管其设计高效且训练配置标准,FoR-Net仍取得了有竞争力的性能,并表现出对困难区域的改进关注。这些结果表明,选择性区域聚焦推理可以作为语义分割的一种实用且高效的替代方案。本工作探索了资源受限环境下的区域聚焦推理,并为开发高效且区域感知的分割模型提供了见解。

英文摘要

We present FoR-Net, an efficient semantic segmentation framework that focuses on identifying and enhancing hard regions. Instead of relying on heavy global modeling, FoR-Net adopts an efficient strategy that selectively emphasizes informative regions through a learned importance map and a Top-K activation mechanism. Specifically, a selector module predicts region-wise importance, enabling the model to focus on challenging areas such as thin structures and object boundaries. Multi-scale reasoning is achieved using convolutional branches with different receptive fields, allowing diverse spatial context aggregation. We evaluate FoR-Net on the Cityscapes benchmark under limited computational resources. Despite its efficient design and standard training configuration, FoR-Net achieves competitive performance and exhibits improved attention to difficult regions. These results suggest that selective region-focused reasoning can serve as a practical and efficient alternative for semantic segmentation. This work explores region-focused reasoning under resource-constrained settings and provides insights for developing efficient and region-aware segmentation models.
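
The Top-K activation described in this abstract can be illustrated with a small sketch. It assumes the importance map is a plain 2D grid of scores and that "activation" means amplifying features at the K highest-scoring positions; the names `topk_mask` and `emphasize` and the `gain` parameter are hypothetical, and the real FoR-Net selector may work quite differently.

```python
def topk_mask(importance, k):
    """Binary mask marking the k highest-scoring cells of a 2D importance map."""
    cells = [(v, r, c) for r, row in enumerate(importance) for c, v in enumerate(row)]
    cells.sort(key=lambda cell: cell[0], reverse=True)
    keep = {(r, c) for _, r, c in cells[:k]}
    return [[1 if (r, c) in keep else 0 for c in range(len(row))]
            for r, row in enumerate(importance)]


def emphasize(features, importance, k, gain=2.0):
    """Scale features at the selected "hard" positions by a hypothetical gain."""
    mask = topk_mask(importance, k)
    return [[f * (gain if m else 1.0) for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(features, mask)]


if __name__ == "__main__":
    importance = [[0.1, 0.9], [0.8, 0.2]]  # e.g. high scores near object boundaries
    features = [[1.0, 1.0], [1.0, 1.0]]
    print(topk_mask(importance, k=2))
    print(emphasize(features, importance, k=2))
```

Restricting the extra emphasis to K positions keeps the added cost proportional to K rather than to the full image, which is one plausible reading of the efficiency claim.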

2604.23728 2026-05-26 cs.CV cs.AI 版本更新

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

ESIA:基于能量的时空交互感知框架用于行人意图预测

Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen, Chongfeng Wei

发表机构 * James Watt School of Engineering, University of Glasgow(格拉斯哥大学詹姆斯·瓦特工程学院)

AI总结 提出ESIA框架,利用条件随机场和能量函数建模时空交互,通过结构一致性约束和模拟退火算法实现行人意图预测,在标准基准上达到最先进性能并提升可解释性。

Comments 13 pages, 6 figures, 3 tables

详情
AI中文摘要

自动驾驶的最新进展推动了行人意图预测的研究,该研究旨在通过建模时间动态、社交互动和环境背景来推断未来的过街决策和行动。然而,现有研究仍受限于过度简化的多智能体交互模式、不透明的推理逻辑以及行为预测中缺乏全局一致性,这损害了鲁棒性和可解释性。在这项工作中,我们提出了ESIA(基于能量的时空交互感知框架),一种新颖的基于条件随机场(CRF)的范式。我们将意图预测任务视为一个基于统一图表示的结构化预测问题,将行人和环境视为时空节点。为了表征它们的不同角色,我们为节点分配一元势能以捕捉个体意图,为边分配成对势能以编码社交和环境交互。这些势能被整合到一个统一的全局能量函数中,以确保行为预测的场景级一致性。为了在没有真实标签监督的情况下进一步约束推理,我们引入了结构一致性项来惩罚逻辑矛盾。该优化通过一种新颖的一元种子模拟退火(U-SSA)算法高效求解,该算法利用高置信度的一元先验快速收敛到高质量解。在标准基准上的大量实验表明,ESIA在现有方法中实现了最先进的性能,并具有更好的可解释性。

英文摘要

Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.
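
The energy formulation in this abstract (unary potentials on nodes, pairwise potentials on edges, minimized by simulated annealing seeded from the unary terms) can be sketched on a toy graph. This is an assumption-laden illustration, not ESIA or U-SSA itself: the pairwise term is a simple Potts-style disagreement penalty `w`, the cooling schedule is linear, and all names are hypothetical.

```python
import math
import random


def energy(labels, unary, edges, w):
    """Global energy: sum of unary costs plus a penalty w per disagreeing edge."""
    total = sum(unary[i][label] for i, label in enumerate(labels))
    total += sum(w for i, j in edges if labels[i] != labels[j])
    return total


def unary_seeded_annealing(unary, edges, w=1.0, steps=500, t0=1.0, seed=0):
    """Simulated annealing started from the per-node unary minimizers."""
    rng = random.Random(seed)
    # Seed: each node takes its lowest-cost label, ignoring interactions.
    labels = [min(range(len(u)), key=lambda l: u[l]) for u in unary]
    cur = energy(labels, unary, edges, w)
    best, best_e = labels[:], cur
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-6  # linear cooling (assumption)
        i = rng.randrange(len(labels))
        old = labels[i]
        labels[i] = rng.randrange(len(unary[i]))
        new = energy(labels, unary, edges, w)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
            if cur < best_e:
                best, best_e = labels[:], cur
        else:
            labels[i] = old
    return best, best_e


if __name__ == "__main__":
    # Three pedestrians in a chain; label 0 = "not crossing", 1 = "crossing".
    unary = [[0.0, 1.0], [0.6, 0.4], [0.0, 1.0]]
    edges = [(0, 1), (1, 2)]
    print(unary_seeded_annealing(unary, edges))
```

In the toy example the middle node weakly prefers "crossing" on its own, but the pairwise terms make a scene-level consistent labeling cheaper, which is the kind of global consistency the abstract refers to.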

2604.08213 2026-05-26 cs.CV cs.AI 版本更新

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

EditCaption: 用于图像编辑指令合成的人工精炼SFT与HAE-DPO

Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

发表机构 * Peking University(北京大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tsinghua University(清华大学) Beihang University(北京航空航天大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出EditCaption两阶段后训练流程,通过人工精炼SFT和基于难度自适应错误感知DPO(HAE-DPO)提升图像编辑指令合成质量,显著降低关键错误率并超越现有模型。

详情
AI中文摘要

高质量的源-目标图像对及精确的编辑指令对于指令引导的图像编辑至关重要,但大规模构建此类训练三元组成本高昂。最近的流程通常依赖视觉语言模型自动合成编辑指令,但我们发现强大的VLM仍难以描述图像对之间的视觉变换。具体而言,它们表现出三种反复出现的失败模式:方向不一致、视角模糊和缺少细粒度属性。在400个图像对的人工评估中,多个开源VLM基线产生超过47%的关键错误率,使得许多合成指令不适合下游训练。为解决此问题,我们提出EditCaption,一种用于图像编辑指令合成的两阶段后训练流程。首先,通过基于GLM的自动字幕生成、EditScore过滤和人工精炼构建100K监督微调数据集。其次,收集10K人工标注的偏好对,其中每个被拒绝的指令都标注了其主要错误类型和严重程度。基于此数据集,我们提出难度自适应错误感知DPO(HAE-DPO),一种任务适配的DPO目标,它引入了基于人工标注的严重程度、失败模式类型和参考模型难度的自适应边界。在三个基准上的实验表明,我们的235B模型经过SFT+HAE-DPO后在开源和闭源模型中达到最先进性能,在Eval-400、HQ-Edit和ByteMorph-Bench上分别获得4.720、4.672和4.651分——在所有三个基准上均超越Gemini-3-Pro。人工评估证实关键错误率从47.75%降至17.50%,正确率从41.75%提升至70.25%,超越Gemini-3-Pro(66.00%)。

英文摘要

High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on vision-language models to synthesize editing instructions automatically, but we find that strong VLMs still struggle to describe visual transformations between image pairs. In particular, they exhibit three recurring failure modes: orientation inconsistency, viewpoint ambiguity, and missing fine-grained attributes. In a human evaluation on 400 image pairs, several open-source VLM baselines produce critical-error rates above 47\%, making many synthesized instructions unsuitable for downstream training. To address this, we propose EditCaption, a two-stage post-training pipeline for image editing instruction synthesis. First, we construct a 100K supervised fine-tuning dataset through GLM-based auto-captioning, EditScore filtering, and human refinement. Second, we collect 10K human-annotated preference pairs, where each rejected instruction is labeled with its primary error type and severity. Based on this dataset, we propose Hardness-Adaptive Error-Aware DPO (HAE-DPO), a task-adapted DPO objective that introduces an adaptive margin based on human-labeled severity, failure-mode type, and reference-model hardness. Experiments across three benchmarks demonstrate that our 235B model with SFT+HAE-DPO achieves state-of-the-art performance among open-source and closed models, scoring 4.720 on Eval-400, 4.672 on HQ-Edit, and 4.651 on ByteMorph-Bench -- surpassing Gemini-3-Pro on all three. Human evaluation confirms critical error rates drop from 47.75\% to 17.50\%, with correct rates improving from 41.75\% to 70.25\%, surpassing Gemini-3-Pro (66.00\%).

2604.04707 2026-05-26 cs.CV 版本更新

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

OpenWorldLib: 高级世界模型的统一代码库与定义

DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, Jianbin Zhao, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Junbo Niu, Zimo Meng, Tianyi Bai, Meiyi Qiang, Huanyao Zhang, Zhiyou Xiao, Tianyu Guo, Qinhan Yu, Runhao Zhao, Zhengpin Li, Xinyi Huang, Yisheng Pan, Yiwen Tang, Juanxi Tian, Yang Shi, Yue Ding, Xinlong Chen, Hongcheng Gao, Minglei Shi, Jialong Wu, Zekun Wang, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Yiren Song, Mike Zheng Shou, Wentao Zhang

发表机构 * Peking University(北京大学) Zhongguancun Academy(中关村学院) Tsinghua University(清华大学) National University of Singapore(新加坡国立大学) Shanghai Jiao Tong University(上海交通大学) Sun Yat-sen University(中山大学) Beijing Key Laboratory of Data Intelligence and Security(北京数据智能与安全重点实验室) Nanyang Technological University(南洋理工大学)

AI总结 本文提出OpenWorldLib框架,基于对世界模型演化的分析给出清晰定义,并系统分类其核心能力,实现多任务模型的统一集成与高效推理。

Comments 28 pages, 6 figures

详情
AI中文摘要

世界模型作为人工智能中一个前景广阔的研究方向已引起广泛关注,但仍缺乏清晰统一的定义。本文中,我们介绍了OpenWorldLib,一个针对高级世界模型的全面且标准化的推理框架。借鉴世界模型的演化,我们提出一个明确的定义:世界模型是以感知为中心、具备交互和长期记忆能力、用于理解和预测复杂世界的模型或框架。我们进一步系统性地分类了世界模型的基本能力。基于这一定义,OpenWorldLib将不同任务的模型集成在统一框架内,实现高效复用和协同推理。最后,我们对世界模型研究的潜在未来方向提出了额外的思考和分析。代码链接:https://github.com/OpenDCAI/OpenWorldLib

英文摘要

World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib

2604.03318 2026-05-26 cs.CV 版本更新

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

EgoMind: 通过多模态大语言模型中的语言推理激活空间认知

Zhenghao Chen, Huiqun Wang, Di Huang

发表机构 * State Key Laboratory of Complex and Critical Software Environment(复杂与关键软件环境国家重点实验室) School of Computer Science and Engineering(计算机科学与工程学院)

AI总结 提出EgoMind框架,通过角色扮演描述和渐进空间分析的无几何空间推理方法,仅用少量数据即可在多基准测试中提升MLLMs的空间推理能力。

Comments Accepted by CVPR 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地应用于空间认知任务,期望它们能够理解并与复杂环境交互。现有工作大多通过引入3D先验或几何监督来改进空间推理,这虽然提升了性能,但带来了大量的数据准备和对齐成本。相比之下,纯2D方法由于捕获跨帧空间关系的能力有限,往往在多帧空间推理中表现不佳。为了解决这些限制,我们提出了EgoMind,一个思维链框架,通过角色扮演描述(联合构建跨帧的一致语言场景图)和渐进空间分析(逐步推理任务特定问题)实现无几何空间推理。仅使用5K自动生成的SFT样本和20K RL样本,EgoMind在VSI-Bench、SPAR-Bench、SITE-Bench和SPBench上取得了有竞争力的结果,证明了其在增强MLLMs空间推理能力方面的有效性,并突出了语言推理在空间认知中的潜力。代码和数据已发布在https://github.com/Hyggge/EgoMind。

英文摘要

Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.

2603.06687 2026-05-26 cs.CV cs.CL cs.ET cs.MM cs.RO 版本更新

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

TimeSpot: 在真实世界场景中评估视觉语言模型的地理时间理解能力

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

发表机构 * Computational Intelligence and Operations Laboratory (CIOL), Bangladesh(计算智能与运筹实验室(CIOL),孟加拉国) Shahjalal University of Science and Technology (SUST), Sylhet, Bangladesh(沙贾拉尔科技大学(SUST),锡尔赫特,孟加拉国) North South University (NSU), Dhaka, Bangladesh(南北大学(NSU),达卡,孟加拉国) Qatar Computing Research Institute (QCRI), Doha, Qatar(卡塔尔计算研究所(QCRI),多哈,卡塔尔)

AI总结 提出TimeSpot基准,通过1,455张全球图像评估视觉语言模型在时间属性(季节、月份、时段、日光相位)和地理属性(大洲、国家、气候带、环境类型、经纬度)上的推理能力,发现现有模型性能低下,尤其时间推理不足。

Comments Accepted to ICML 2026

详情
AI中文摘要

地理时间理解,即仅从视觉输入推断位置、时间和上下文属性的能力,支撑着灾害管理、交通规划、具身导航、世界建模和地理教育等应用。尽管最近的视觉语言模型(VLM)利用地标和路标等线索在图像地理定位方面取得了进展,但它们推理时间信号和物理基础空间线索的能力仍然有限。为弥补这一差距,我们引入了TimeSpot,一个用于评估VLM在真实世界中进行地理时间推理的基准。TimeSpot包含来自80个国家的1,455张地面图像,要求直接从视觉证据中结构化预测时间属性(季节、月份、时段、日光相位)和地理属性(大洲、国家、气候带、环境类型、经纬度)。它还包括时空推理任务,测试在真实世界不确定性下的物理合理性。对最先进的开源和闭源VLM的评估显示性能低下,尤其是时间推理。虽然监督微调带来了改进,但结果仍不充分,凸显了需要新方法来实现稳健的、基于物理的地理时间理解。TimeSpot可在 https://TimeSpot-GT.github.io 获取。

英文摘要

Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at https://TimeSpot-GT.github.io.

2603.04114 2026-05-26 cs.CV 版本更新

Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

Any2Any: 统一任意模态遥感翻译

Haoyang Chen, Jing Zhang, Hebaixu Wang, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haonan Guo, Di Wang, Zheng Wang, Bo Du

发表机构 * National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University(多媒体软件国家工程研究中心、人工智能研究院、计算机科学学院、武汉大学) Hubei Key Laboratory of Multimedia(湖北省多媒体重点实验室) Zhongguancun Academy, Beijing, China. 100094(中关村学院,北京,中国。100094) School of Electronic Information, Wuhan University, Wuhan, China(电子信息学院,武汉大学,武汉,中国) School of Automation, Beijing Institute of Technology(自动化学院,北京理工大学) State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China(测绘、制图与遥感信息工程国家重点实验室,武汉大学,武汉,中国)

AI总结 提出统一潜扩散框架Any2Any,通过共享潜空间和轻量残差适配器实现任意模态间的高效翻译,并在新数据集RST-1M上验证了其优于成对方法且具备零样本泛化能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

多模态遥感图像提供同一地理场景的互补观测,但在实际中这些观测往往不完整。现有的跨模态翻译方法将每个模态对视为独立任务,导致二次复杂度且对未见模态组合的泛化能力有限。我们将任意到任意翻译建模为场景共享潜表示上的推理,其中不同模态对应同一底层语义的部分观测。基于此公式,我们提出Any2Any,一个统一的潜扩散框架,将异构输入投影到几何对齐的潜空间。该结构通过共享骨干网络执行锚定潜回归,解耦模态特定表示学习与语义映射。此外,使用轻量级目标特定残差适配器来纠正系统性潜失配,而不增加推理复杂度。为了支持稀疏但连接监督下的学习,我们引入了RST-1M,首个百万级遥感数据集,包含五种感知模态的配对观测,为任意到任意翻译提供监督锚点。在14个翻译任务上的实验表明,Any2Any始终优于成对翻译方法,并对未见模态对展现出强大的零样本泛化能力。代码和模型可在https://github.com/MiliLab/Any2Any获取。

英文摘要

Multi-modal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. Such structure performs anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. Moreover, lightweight target-specific residual adapters are used to correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs. Code and models are available at https://github.com/MiliLab/Any2Any.

2602.21198 2026-05-26 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

从试错中学习:具身大语言模型的反思式测试时规划

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Leonidas Guibas, Jiajun Wu, Yejin Choi

发表机构 * Stanford University(斯坦福大学) Northwestern University(西北大学)

AI总结 提出反思式测试时规划方法,通过行动中反思和行动后反思两种模式,结合回溯性反思,使具身智能体在测试时进行自我纠正和经验积累,显著提升长程任务性能。

详情
AI中文摘要

具身大语言模型赋予机器人高级任务推理能力,但它们无法反思错误原因,导致部署成为一系列独立尝试,错误重复而非积累经验。借鉴人类反思实践,我们引入反思式测试时规划,整合两种反思模式:\textit{行动中反思},代理在行动前利用测试时扩展生成并评分多个候选行动,基于内部反思;以及\textit{行动后反思},利用测试时训练,根据执行后的外部反思更新内部反思模型和行动策略。我们还包含回溯性反思,允许代理重新评估早期决策,并利用后见之明进行模型更新,实现适当的长程信用分配。在我们新设计的Long-Horizon Household基准和MuJoCo Cupboard Fitting基准上的实验表明,与基线模型相比有显著提升,并能零样本泛化到逼真的HM3D环境以及在Franka Panda机械臂上的真实机器人实验。消融实验证实,行动中反思和行动后反思相互依赖,且回溯性反思在较低计算开销下比逐步外部反馈实现更好的信用分配。定性分析进一步突出了通过反思进行的行为纠正。

英文摘要

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and \textit{reflection-on-action}, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.

2602.15811 2026-05-26 cs.CV cs.AI 版本更新

CARL-CXR: Continual Adapter-Based Routing for Task-Unknown Chest Radiograph Classification

CARL-CXR:基于连续适配器路由的任务未知胸部X光片分类

Muthu Subash Kavitha, Anas Zafar, Amgad Muneer, Jia Wu

发表机构 * Department of Imaging Physics, The University of Texas MD Anderson Cancer Center(影像物理系,德克萨斯大学MD安德森癌症中心)

AI总结 提出CARL-CXR框架,通过固定高容量骨干网络、增量添加轻量级任务特定适配器和分类头,以及潜在任务选择器,解决任务未知推理下的胸部X光片增量分类问题,显著减少灾难性遗忘并提升路由准确性。

Comments 9 pages, 4 figures

详情
AI中文摘要

胸部X光片分类器的临床部署需要模型能够在新数据集可用时进行更新,而无需对先前观察到的数据进行重新训练或降低已验证的性能。我们研究了任务未知推理下的任务增量连续学习设置,其中异质的胸部X光数据集顺序到达,且在部署时任务身份不可用。我们提出了CARL-CXR,一个基于连续适配器的路由框架,该框架保持固定的高容量骨干网络,同时增量引入轻量级任务特定适配器和分类头。一个潜在任务选择器基于适配器条件特征进行操作,将每个输入动态路由到最相关的任务路径,利用紧凑的任务原型和特征级经验回放来在顺序更新中保留任务身份,而无需存储原始图像。在MIMIC-CXR和CheXpert两个具有不同患者群体、成像设备和注释流程的大规模数据集上的实验表明,CARL-CXR实现了最小的灾难性遗忘(AUROC下降0.012),比已建立的连续学习基线LwF和EWC分别减少了6倍和11倍,同时保持了具有竞争力的诊断性能(AUROC 0.74)。在任务未知部署下,CARL-CXR在路由准确性上比联合训练高出12.5个百分点(75.0% vs. 62.5%):与LwF和EWC不同,后者在推理时需要明确的任务标识符且不提供路由机制。

英文摘要

Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study a task-incremental continual learning setting for chest radiograph classification under task-unknown inference, where heterogeneous chest X-ray datasets arrive sequentially and task identity is unavailable at deployment time. We propose CARL-CXR, a continual adapter-based routing framework that maintains a fixed high-capacity backbone while incrementally introducing lightweight task-specific adapters and classifier heads. A latent task selector operates on adapter-conditioned features to dynamically route each input to the most relevant task pathway, leveraging compact task prototypes and feature-level experience replay to preserve task identity across sequential updates without storing raw images. Experiments on MIMIC-CXR and CheXpert, two large-scale datasets with distinct patient populations, imaging devices, and annotation pipelines, demonstrate that CARL-CXR achieves minimal catastrophic forgetting (0.012 AUROC drop), representing a 6X and 11X reduction over established continual learning baselines LwF and EWC respectively, while maintaining competitive diagnostic performance (AUROC 0.74). Under task-unknown deployment, CARL-CXR outperforms joint training by 12.5 points in routing accuracy (75.0% vs. 62.5%). By contrast, LwF and EWC require explicit task identifiers at inference and provide no routing mechanism.
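
One simple way to picture prototype-based routing under task-unknown inference is nearest-prototype matching. The sketch below assumes each task is summarized by a single prototype vector and that routing picks the prototype with the highest cosine similarity; the abstract does not say how the latent task selector actually scores tasks, so `cosine`, `route`, and the toy prototypes are illustrative assumptions only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(feature, prototypes):
    """Return the task name whose prototype is most similar to `feature`."""
    return max(prototypes, key=lambda task: cosine(feature, prototypes[task]))


if __name__ == "__main__":
    # Hypothetical 2D prototypes; real features would be high-dimensional.
    prototypes = {"mimic_cxr": [1.0, 0.0], "chexpert": [0.0, 1.0]}
    print(route([0.9, 0.2], prototypes))
    print(route([0.1, 0.8], prototypes))
```

Once a task is chosen, the input would be passed through that task's adapter and classifier head; storing only prototypes rather than raw images matches the privacy-friendly setup the abstract describes.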

2602.08426 2026-05-26 cs.CL cs.AI cs.CV 版本更新

Prism: Spectral-Aware Block-Sparse Attention

Prism: 频谱感知的块稀疏注意力

Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) ByteDance Inc.(字节跳动公司) OpenMOSS Team(OpenMOSS团队)

AI总结 针对长上下文LLM预填充中块稀疏注意力的块选择效率瓶颈,提出无训练频谱感知方法Prism,通过高低频分支分解和能量温度校准恢复位置信号,实现纯块级重要性估计,在保持精度同时实现高达5.1倍加速。

Comments ICML 2026

详情
AI中文摘要

块稀疏注意力有望加速长上下文LLM的预填充,但高效识别相关块仍是瓶颈。现有方法通常采用粗粒度注意力作为块重要性估计的代理,但往往诉诸昂贵的令牌级搜索或评分,导致显著的选择开销。在本工作中,我们将通过均值池化的标准粗粒度注意力的不准确性追溯到一个理论根源:均值池化与旋转位置嵌入(RoPE)之间的交互。我们证明均值池化充当低通滤波器,在高频维度上引起破坏性干扰,有效造成局部位置信息(如斜线模式)的“盲点”。为解决此问题,我们引入Prism,一种无训练的频谱感知方法,将块选择分解为高频和低频分支。通过应用基于能量的温度校准,Prism直接从池化表示中恢复衰减的位置信号,使得仅使用块级操作即可进行块重要性估计,从而提高效率。大量评估证实,Prism在保持与全注意力精度相当的同时,实现了高达$\mathbf{5.1\times}$的加速。

英文摘要

Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
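
The low-pass claim in this abstract can be checked numerically in a simplified setting. Assume a single RoPE frequency pair is modeled as a unit complex number rotated by `theta` per position; averaging it over a block of `B` positions then has magnitude |sin(B·theta/2) / (B·sin(theta/2))|, which stays near 1 for slow rotations and collapses for fast ones. The sketch below illustrates only this attenuation, not Prism's branch decomposition or its energy-based calibration, and the function names are hypothetical.

```python
import cmath
import math


def pooled_magnitude(theta, block_size):
    """Magnitude of the mean of exp(i * theta * p) over p = 0..block_size-1."""
    total = sum(cmath.exp(1j * theta * p) for p in range(block_size))
    return abs(total / block_size)


def closed_form(theta, block_size):
    """Closed-form magnitude of the same geometric-series average."""
    return abs(math.sin(block_size * theta / 2.0)
               / (block_size * math.sin(theta / 2.0)))


if __name__ == "__main__":
    block = 64
    for theta in (0.001, 0.01, 0.1, 1.0):  # low to high rotation frequency
        print(theta, pooled_magnitude(theta, block), closed_form(theta, block))
```

High-frequency dimensions are the ones that carry fine-grained relative position, so their near cancellation under pooling is consistent with the "blind spot" for local patterns that the abstract describes.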

2602.01183 2026-05-26 cs.CV cs.LG 版本更新

Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion

通过课程选择与反课程促进优化上下文纠缠内容分割

Chunming He, Rihan Zhang, Fengyang Xiao, Dingming Zhang, Zhiwen Cao, Sina Farsiu

发表机构 * Duke University(杜克大学) Adobe(Adobe公司)

AI总结 提出CurriSeg双阶段学习框架,结合课程学习与反课程学习原理,通过动态数据选择与频谱盲性微调提升上下文纠缠内容分割的鲁棒性和泛化能力。

Comments ICML 2026, 8 figures, 11 tables

详情
AI中文摘要

生物学习从简单到困难的任务逐步进行,逐渐增强感知和鲁棒性。受此原理启发,我们解决上下文纠缠内容分割(CECS)这一具有挑战性的场景,其中对象与周围环境共享内在视觉模式,如伪装目标检测。传统分割网络主要依赖架构增强,但往往忽略了在纠缠数据分布下控制鲁棒性的学习动态。我们引入CurriSeg,一个双阶段学习框架,统一了课程和反课程原则以提高表示可靠性。在课程选择阶段,CurriSeg基于样本损失的时间统计动态选择训练数据,区分困难但有信息的样本与噪声或模糊样本,从而实现稳定的能力增强。在反课程促进阶段,我们设计了频谱盲性微调,抑制高频成分以强制依赖低频结构和上下文线索,从而增强泛化能力。大量实验表明,CurriSeg在多种CECS基准上取得了一致的改进,无需增加参数或增加总训练时间,为进展与挑战如何相互作用以促进鲁棒且上下文感知的分割提供了原则性视角。代码将发布。

英文摘要

Biological learning proceeds from easy to difficult tasks, gradually reinforcing perception and robustness. Inspired by this principle, we address Context-Entangled Content Segmentation (CECS), a challenging setting where objects share intrinsic visual patterns with their surroundings, as in camouflaged object detection. Conventional segmentation networks predominantly rely on architectural enhancements but often ignore the learning dynamics that govern robustness under entangled data distributions. We introduce CurriSeg, a dual-phase learning framework that unifies curriculum and anti-curriculum principles to improve representation reliability. In the Curriculum Selection phase, CurriSeg dynamically selects training data based on the temporal statistics of sample losses, distinguishing hard-but-informative samples from noisy or ambiguous ones, thus enabling stable capability enhancement. In the Anti-Curriculum Promotion phase, we design Spectral-Blindness Fine-Tuning, which suppresses high-frequency components to enforce dependence on low-frequency structural and contextual cues and thus strengthens generalization. Extensive experiments demonstrate that CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time, offering a principled view of how progression and challenge interplay to foster robust and context-aware segmentation. Code will be released.
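The Curriculum Selection idea, choosing samples from the temporal statistics of their losses, can be sketched as follows. This is only an illustration under stated assumptions: the thresholds, the "improving" test, and the function name are hypothetical and are not taken from CurriSeg.

```python
from statistics import mean, pstdev


def select_samples(loss_history, hard_threshold=0.5, noise_std=0.2):
    """Return the sample ids to keep.

    loss_history maps a sample id to its list of per-epoch losses.
    Both thresholds are hypothetical values chosen for illustration.
    """
    keep = []
    for sample_id, losses in loss_history.items():
        avg, spread = mean(losses), pstdev(losses)
        improving = losses[-1] < losses[0]
        if avg < hard_threshold:
            keep.append(sample_id)  # easy sample: always keep
        elif improving and spread < noise_std:
            keep.append(sample_id)  # hard, but the loss falls steadily
        # otherwise the loss is high and erratic: treated as noisy or ambiguous
    return keep


history = {
    "easy": [0.30, 0.22, 0.15],
    "hard_informative": [0.90, 0.80, 0.70],
    "noisy": [0.90, 0.30, 1.20],
}
selected = select_samples(history)
print(selected)
```

The design choice here is to separate "hard" from "noisy" by the stability of the loss curve rather than by its level, since both kinds of sample have a high average loss.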

2601.21406 2026-05-26 cs.CV cs.LG 版本更新

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

通过多表示生成增强统一多模态模型的理解能力

Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, Xiangxiang Chu

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) AMAP, Alibaba Group(高德,阿里巴巴集团) Shanghai Jiao Tong University(上海交通大学) Southern University of Science and Technology(南方科技大学)

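AI summary: Proposes UniMRG, which strengthens the understanding ability of unified multimodal models through auxiliary generation of multiple representations (pixel, depth, and segmentation), reducing hallucinations and improving spatial understanding.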

Comments Code: https://github.com/Sugewud/UniMRG

详情
AI中文摘要

统一多模态模型(UMMs)在单一框架内整合了视觉理解和生成。其最终目标是创建一个理解和生成相互促进的循环。虽然最近的后训练方法成功利用理解来增强生成,但利用生成来改善理解的逆向方向仍基本未被探索。在这项工作中,我们提出了UniMRG(统一多表示生成),一种简单而有效的架构无关的后训练方法。UniMRG通过引入辅助生成任务来增强UMMs的理解能力。具体来说,我们训练UMMs生成输入图像的多种内在表示,即像素(重建)、深度(几何)和分割(结构),同时进行标准的视觉理解目标。通过综合这些多样化的表示,UMMs捕获关于外观、空间关系和结构布局的互补信息。因此,UMMs对视觉输入形成了更深入和全面的理解。跨多种UMM架构的大量实验表明,我们的方法显著增强了细粒度感知,减少了幻觉,并改善了空间理解,同时提升了生成能力。

英文摘要

Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.

2601.19743 2026-05-26 eess.IV cs.CV cs.LG 版本更新

Interpretable and backpropagation-free Green Learning for efficient multi-task echocardiographic segmentation and classification

可解释且无需反向传播的绿色学习用于高效多任务超声心动图分割与分类

Jyun-Ping Kao, Jiaxin Yang, C. -C. Jay Kuo, Jonghye Woo

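AI summary: Proposes a backpropagation-free multi-task Green Learning framework that combines an unsupervised VoxelHop encoder, a multi-level regression decoder, and an XG-Boost classifier to perform left-ventricle segmentation and ejection-fraction classification on the EchoNet-Dynamic dataset, reaching high accuracy with very few parameters.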

Comments Accepted for publication in APSIPA Transactions on Signal and Information Processing. Jyun-Ping Kao and Jiaxing Yang contributed equally to this work. C.-C. Jay Kuo and Jonghye Woo are the senior authors

详情
AI中文摘要

超声心动图是管理心力衰竭(HF)的基石,左心室射血分数(LVEF)是指导治疗的关键指标。然而,手动LVEF评估存在较高的观察者间变异性,而现有的深度学习(DL)模型通常是计算密集且数据饥饿的“黑箱”,阻碍了临床信任和采用。在此,我们提出了一种无需反向传播的多任务绿色学习(MTGL)框架,可同时进行左心室(LV)分割和LVEF分类。我们的框架将用于分层时空特征提取的无监督VoxelHop编码器与多级回归解码器和XG-Boost分类器相结合。在EchoNet-Dynamic数据集上,我们的MTGL模型实现了最先进的分类和分割性能,分类准确率达到94.3%,Dice相似系数(DSC)达到0.912,显著优于多个先进的3D DL模型。关键的是,我们的模型在参数数量少一个数量级的情况下实现了这一性能,展现了卓越的计算效率。这项工作表明,GL范式可以为复杂的医学图像分析提供高度准确、高效且可解释的解决方案,为临床实践中更可持续和可信的人工智能铺平道路。

英文摘要

Echocardiography is a cornerstone for managing heart failure (HF), with Left Ventricular Ejection Fraction (LVEF) being a critical metric for guiding therapy. However, manual LVEF assessment suffers from high inter-observer variability, while existing Deep Learning (DL) models are often computationally intensive and data-hungry "black boxes" that impede clinical trust and adoption. Here, we propose a backpropagation-free multi-task Green Learning (MTGL) framework that performs simultaneous Left Ventricle (LV) segmentation and LVEF classification. Our framework integrates an unsupervised VoxelHop encoder for hierarchical spatio-temporal feature extraction with a multi-level regression decoder and an XG-Boost classifier. On the EchoNet-Dynamic dataset, our MTGL model achieves state-of-the-art classification and segmentation performance, attaining a classification accuracy of 94.3% and a Dice Similarity Coefficient (DSC) of 0.912, significantly outperforming several advanced 3D DL models. Crucially, our model achieves this with over an order of magnitude fewer parameters, demonstrating exceptional computational efficiency. This work demonstrates that the GL paradigm can deliver highly accurate, efficient, and interpretable solutions for complex medical image analysis, paving the way for more sustainable and trustworthy artificial intelligence in clinical practice.
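For readers unfamiliar with the two quantities reported above, the sketch below computes a Dice Similarity Coefficient for binary masks and an ejection fraction from end-diastolic and end-systolic volumes. Both are the standard textbook definitions; the function names and the toy masks are illustrative and unrelated to the MTGL code.

```python
def dice_coefficient(pred, target):
    """Dice = 2*|P ∩ T| / (|P| + |T|) for binary masks given as nested 0/1 lists."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ejection_fraction(edv_ml, esv_ml):
    """LVEF in percent from end-diastolic and end-systolic volumes (same units)."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml


pred_mask = [[0, 1, 1], [0, 1, 0]]
true_mask = [[0, 1, 0], [0, 1, 1]]
dsc = dice_coefficient(pred_mask, true_mask)
lvef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)
print(f"DSC={dsc:.3f}, LVEF={lvef:.1f}%")
```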

2601.18597 2026-05-26 cs.CV 版本更新

EFSI-DETR: Efficient Frequency-Semantic Integration for Real-Time Small Object Detection in UAV Imagery

EFSI-DETR:面向无人机图像实时小目标检测的高效频率-语义集成

Yu Xia, Chang Liu, Tianqi Xiang, Zhigang Tu

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing(测绘遥感信息工程国家重点实验室) Wuhan University(武汉大学) Wuhan University Shenzhen Research Institute(武汉大学深圳研究院) School of Computer Science(计算机学院) School of Automation Science and Engineering(自动化科学与工程学院) South China University of Technology(华南理工大学)

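AI summary: Proposes the EFSI-DETR framework, which uses a Dynamic Frequency-Spatial Unified Synergy Network and an Efficient Semantic Feature Concentrator to achieve state-of-the-art real-time small object detection in UAV imagery.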

详情
AI中文摘要

由于有限的特征表示和无效的多尺度融合,无人机图像中的实时小目标检测仍然具有挑战性。现有方法未充分利用频率信息并依赖静态卷积操作,限制了获取丰富特征表示的能力,并阻碍了深层语义特征的有效利用。为解决这些问题,我们提出EFSI-DETR,一种新颖的检测框架,将高效语义特征增强与动态频率-空间引导相结合。EFSI-DETR包含两个主要组件:(1) 动态频率-空间统一协同网络(DyFusNet),联合利用频率和空间线索进行鲁棒的多尺度特征融合;(2) 高效语义特征集中器(ESFC),以最小计算成本实现深层语义提取。此外,采用细粒度特征保留(FFR)策略,在融合过程中纳入空间丰富的浅层特征,以保留对无人机图像中小目标检测至关重要的细粒度细节。在VisDrone和CODrone基准上的大量实验表明,我们的EFSI-DETR以实时效率实现了最先进的性能,在VisDrone上AP和AP$_{s}$分别提升了1.6%和5.8%,同时在单个RTX 4090 GPU上获得188 FPS的推理速度。

英文摘要

Real-time small object detection in Unmanned Aerial Vehicle (UAV) imagery remains challenging due to limited feature representation and ineffective multi-scale fusion. Existing methods underutilize frequency information and rely on static convolutional operations, which constrain the capacity to obtain rich feature representations and hinder the effective exploitation of deep semantic features. To address these issues, we propose EFSI-DETR, a novel detection framework that integrates efficient semantic feature enhancement with dynamic frequency-spatial guidance. EFSI-DETR comprises two main components: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion, and (2) an Efficient Semantic Feature Concentrator (ESFC) that enables deep semantic extraction with minimal computational cost. Furthermore, a Fine-grained Feature Retention (FFR) strategy is adopted to incorporate spatially rich shallow features during fusion to preserve fine-grained details, which are crucial for small object detection in UAV imagery. Extensive experiments on the VisDrone and CODrone benchmarks demonstrate that EFSI-DETR achieves state-of-the-art performance with real-time efficiency, yielding improvements of 1.6% and 5.8% in AP and AP$_{s}$ on VisDrone, while reaching an inference speed of 188 FPS on a single RTX 4090 GPU.

2601.18135 2026-05-26 cs.CV 版本更新

Forward Consistency Learning with Gated Context Aggregation for Video Anomaly Detection

基于门控上下文聚合的前向一致性学习用于视频异常检测

Jiahao Lyu, Minghua Zhao, Xuewen Huang, Yifei Chen, Shuangli Du, Jing Hu, Cheng Shi, Zhiyong Lv

发表机构 * Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi’an University of Technology(陕西网络计算与安全技术重点实验室,西安理工大学计算机科学与工程学院) School of Cyber Science and Engineering, Xi’an Jiaotong University(网络安全与工程学院,西安交通大学)

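AI summary: Proposes FoGA, a lightweight model that uses forward consistency learning and gated context aggregation for efficient video anomaly detection on resource-constrained devices, outperforming existing methods while running at up to 155 FPS.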

Comments It has been submitted to the KBS journal

详情
Journal ref
Knowledge-Based Systems 2026
AI中文摘要

作为公共安全的关键要素,视频异常检测(VAD)旨在实时监控系统中衡量各种事件与正常模式的偏差。然而,现有大多数VAD方法依赖大规模模型追求极端精度,限制了其在资源受限边缘设备上的可行性。此外,主流基于预测的VAD仅利用单帧未来预测误差检测异常,忽略了更长时域前向信息的更丰富约束。本文提出FoGA,一种轻量级VAD模型,执行基于门控上下文聚合的前向一致性学习,包含约2M参数,专为潜在边缘设备设计。具体而言,我们提出一种基于Unet的方法,对连续帧进行特征提取以生成即时预测和前向预测。然后,我们在跳跃连接中引入门控上下文聚合模块,动态融合相同空间尺度下的编码器和解码器特征。最后,模型通过新颖的前向一致性损失联合优化,并采用混合异常测量策略整合即时帧和前向帧的误差以实现更准确检测。大量实验证明了所提方法的有效性,其显著优于最先进的竞争方法,运行速度高达155 FPS。因此,我们的FoGA在性能与效率指标之间实现了出色的权衡。

英文摘要

As a crucial element of public security, video anomaly detection (VAD) aims to measure deviations from normal patterns for various events in real-time surveillance systems. However, most existing VAD methods rely on large-scale models to pursue extreme accuracy, limiting their feasibility on resource-limited edge devices. Moreover, mainstream prediction-based VAD detects anomalies using only single-frame future prediction errors, overlooking the richer constraints from longer-term temporal forward information. In this paper, we introduce FoGA, a lightweight VAD model that performs Forward consistency learning with Gated context Aggregation, containing about 2M parameters and tailored for potential edge devices. Specifically, we propose a U-Net-based method that performs feature extraction on consecutive frames to generate both immediate and forward predictions. Then, we introduce a gated context aggregation module into the skip connections to dynamically fuse encoder and decoder features at the same spatial scale. Finally, the model is jointly optimized with a novel forward consistency loss, and a hybrid anomaly measurement strategy is adopted to integrate errors from both immediate and forward frames for more accurate detection. Extensive experiments demonstrate the effectiveness of the proposed method, which substantially outperforms state-of-the-art competing methods while running at up to 155 FPS. Hence, FoGA achieves an excellent trade-off between detection performance and efficiency.
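One plausible reading of a hybrid anomaly measure that integrates immediate-frame and forward-frame errors is sketched below. The abstract does not give the formula, so the equal weighting, the per-video min-max normalization, and the function name are all assumptions made for illustration.

```python
def hybrid_anomaly_scores(immediate_errors, forward_errors, weight=0.5):
    """Blend per-frame prediction errors, then min-max normalise within one video.

    weight is a hypothetical mixing coefficient between the two error streams.
    """
    blended = [
        weight * e_now + (1.0 - weight) * e_fwd
        for e_now, e_fwd in zip(immediate_errors, forward_errors)
    ]
    low, high = min(blended), max(blended)
    if high == low:
        # A flat error curve carries no evidence of an anomaly.
        return [0.0 for _ in blended]
    return [(b - low) / (high - low) for b in blended]


# Toy per-frame errors; the third frame is the one both predictions miss.
scores = hybrid_anomaly_scores([0.01, 0.02, 0.20, 0.03], [0.02, 0.02, 0.30, 0.05])
print([round(s, 3) for s in scores])
```

Frames whose normalized score exceeds a chosen threshold would then be flagged; the threshold itself is application-specific and is not given in the abstract.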

2601.00553 2026-05-26 cs.CV cs.AI 版本更新

A Comprehensive Dataset for Human vs. AI Generated Image Detection

人类与AI生成图像检测的综合数据集

Rajarshi Roy, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Gaytri Jena, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

发表机构 * 1 Kalyani Government Engineering College, India. 2 IIIT Delhi, India. 3 BITS Pilani Hyderabad Campus, India. 4 University of South Carolina, USA. 5 IIIT Guwahati, India. 6 NIT Silchar, India. 7 San José State University, USA. 8 UCLA, USA. 9 Washington State University, USA. 10 Vishwakarma Institute of Information Technology, India. 11 Gandhi Institute for Technological Advancement, India. 12 Meta AI, USA. 13 Amazon AI, USA. 14 BITS Pilani Goa, India.

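AI summary: To address AI-generated image detection, builds the MS COCOAI dataset of 96,000 real and synthetic datapoints and proposes two tasks: real-versus-generated image classification and generator identification.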

详情
AI中文摘要

像Stable Diffusion、DALL-E和MidJourney这样的多模态生成式AI系统从根本上改变了合成图像的创建方式。这些工具推动了创新,但也促进了误导性内容、虚假信息和被操纵媒体的传播。随着生成的图像越来越难以与照片区分,检测它们已成为当务之急。为了应对这一挑战,我们发布了MS COCOAI,这是一个用于AI生成图像检测的新数据集,包含96000个真实和合成数据点,基于MS COCO数据集构建。为了生成合成图像,我们使用了五个生成器:Stable Diffusion 3、Stable Diffusion 2.1、SDXL、DALL-E 3和MidJourney v6。基于该数据集,我们提出了两个任务:(1)将图像分类为真实或生成;(2)识别哪个模型生成了给定的合成图像。该数据集可在https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset获取。

英文摘要

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

2512.24331 2026-05-26 cs.CV 版本更新

Spatial-aware Vision Language Model for Autonomous Driving

面向自动驾驶的空间感知视觉语言模型

Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong

发表机构 * Motional(Motional公司) University of Amsterdam(阿姆斯特丹大学)

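AI summary: Proposes the LVLDrive framework, which fuses LiDAR point clouds with vision-language models and uses a Gradual Fusion Q-Former and a spatial-aware question-answering dataset to overcome the 3D metric spatial reasoning bottleneck, improving scene understanding and decision reliability in autonomous driving.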

Comments Accepted to CVPR AutoPilot Workshop 2026

详情
AI中文摘要

尽管视觉语言模型(VLM)通过利用语言模型中的常识在端到端自动驾驶中展现出显著前景,但它们依赖2D图像线索进行复杂场景理解和决策,这成为安全性和可靠性的关键瓶颈。当前基于图像的方法难以进行精确的度量空间推理和几何推断,导致不可靠的驾驶策略。为弥补这一差距,我们提出LVLDrive(LiDAR-视觉-语言),一种新颖框架,通过引入LiDAR点云作为额外输入模态,专门设计用于增强现有VLM的鲁棒3D度量空间理解能力。一个关键挑战在于如何减轻不同3D数据对预训练VLM带来的灾难性干扰。为此,我们引入渐进融合Q-Former,逐步注入LiDAR特征,确保VLM现有知识库的稳定性和保留。此外,我们开发了空间感知问答(SA-QA)数据集,明确教导模型高级3D感知和推理能力。在驾驶基准上的大量实验表明,与仅视觉的对应模型相比,LVLDrive在场景理解、度量空间感知和可靠驾驶决策方面均实现了优越性能。我们的工作强调了显式3D度量数据对于构建可信赖的基于VLM的自主系统的重要性。

英文摘要

While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

2512.18735 2026-05-26 cs.CV cs.AI 版本更新

$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

$M^3-Verse$: 大型多模态模型的“找不同”挑战

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang

发表机构 * Zhejiang University, China(浙江大学) Shanghai AI Lab, China(上海人工智能实验室) Hangzhou Normal University, China(杭州师范大学)

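AI summary: Proposes the $M^3-Verse$ benchmark, which uses multi-perspective video pairs to evaluate how well LMMs understand dynamic object changes within a consistent space, and confirms the limitations of existing models.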

详情
AI中文摘要

现代大型多模态模型(LMMs)在静态图像和单状态时空理解方面表现出非凡的能力。然而,它们在两个不同视频观测中理解共享空间上下文内物体动态变化的能力仍未被充分探索。这种在一致环境中推理变换的能力对于空间智能领域的进步尤为关键。在本文中,我们引入了 $M^3-Verse$,一个多模态、多状态、多维度的基准,以正式评估这一能力。它基于成对视频,这些视频提供了室内场景在状态变化前后的多视角观察。该基准包含总共 270 个场景和 2,932 个问题,分为 50 多个子任务,探究 4 种核心能力。我们评估了 16 个最先进的 LMMs,并观察到它们在跟踪状态转换方面的局限性。为了解决这些挑战,我们进一步提出了一个简单而有效的基线,在多状态感知中实现了显著的性能提升。因此,$M^3-Verse$ 提供了一个具有挑战性的新测试平台,以促进对动态视觉世界有更全面理解的下一代模型的发展。您可以从 https://github.com/Wal-K-aWay/M3-Verse_pipeline 获取构建流程,并从 https://www.modelscope.cn/datasets/WalKaWay/M3-Verse 获取完整的基准数据。

英文摘要

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.

2512.13597 2026-05-26 cs.CV 版本更新

Lighting in Motion: Spatiotemporal HDR Lighting Estimation

运动中的光照:时空高动态范围光照估计

Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec, Jean-François Lalonde

发表机构 * Université Laval(拉瓦尔大学) Eyeline Labs(Eyeline实验室)

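AI summary: Proposes LiMo, a diffusion-based spatiotemporal lighting estimation method that generates mirrored and diffuse spheres at different exposures, conditioned on depth and geometry, to achieve accurate high-frequency detail prediction and illuminance estimation.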

详情
AI中文摘要

我们提出LiMo(运动中的光照),一种基于扩散的时空光照估计方法。LiMo旨在同时实现逼真的高频细节预测和准确的照度估计。为此,我们提出根据输入中3D位置生成一组不同曝光下的镜面与漫反射球体。利用扩散先验,我们在大规模定制的室内外场景数据集上微调强大的现有扩散模型,并配以时空光照探针。为了实现准确的空间条件,我们证明仅靠深度是不够的,并引入一种新的几何条件来提供场景相对于目标3D位置的相对位置。最后,我们利用可微渲染将不同曝光下的漫反射和镜面预测合并为单个HDRI图。我们彻底评估了我们的方法和设计选择,使LiMo在空间控制和预测精度方面均达到最先进水平。

英文摘要

We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.

2512.12425 2026-05-26 cs.CV 版本更新

Boosting Monocular Metric Depth Estimation via Bokeh Rendering

通过散景渲染提升单目度量深度估计

Hangwei Zhang, Armando Fortes, Tianyi Wei, Xingang Pan

发表机构 * S-Lab, Nanyang Technological University(南洋理工大学S实验室) Beihang University(北京航空航天大学)

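AI summary: Proposes BokehDepth, a two-stage framework in which a physically grounded generative model produces calibrated bokeh stacks as a supervision-free geometric signal, and a defocus-aware aggregation module improves the metric accuracy of monocular depth estimation.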

Comments Project Page: https://fogradio.github.io/BokehDepth_Project/

详情
Journal ref
ICML 2026
AI中文摘要

散景渲染和深度估计共享基本的光学联系,但现有方法未能充分利用这种互惠性。传统的散景管线严重依赖有噪声的深度图,不可避免地引入视觉伪影。相反,现有的单目深度模型通常遵循两种有缺陷的范式。基于生成扩散的框架往往缺乏一致的度量尺度。同时,前馈度量深度模型在纹理缺失或远处区域经常失败,而散焦模糊可以提供几何信息。我们提出BokehDepth,一个两阶段框架,将合成散焦视为无监督的几何信号。在第一阶段,一个物理基础的生成模型从单个清晰输入产生校准的散景堆栈,无需先验深度输入。随后,一个轻量级的散景感知聚合模块将这些堆栈集成到深度估计框架的编码器中。这种机制允许模型从散焦维度提取一致的几何特征,同时保持解码器架构不变。实验表明,与依赖深度的渲染基线相比,BokehDepth实现了优越的视觉散景保真度,并持续提升了最先进单目深度模型的度量精度。

英文摘要

Bokeh rendering and depth estimation share a fundamental optical connection, yet existing methods fail to fully exploit this reciprocity. Conventional bokeh pipelines rely heavily on noisy depth maps that inevitably introduce visual artifacts. Conversely, existing monocular depth models typically follow two flawed paradigms. Generative diffusion-based frameworks often lack consistent metric scale. Meanwhile, feed-forward metric depth models frequently fail in textureless or distant regions where defocus blur can provide geometric information. We propose BokehDepth, a two-stage framework that treats synthetic defocus as a supervision-free geometric signal. In the first stage, a physically grounded generative model produces calibrated bokeh stacks from a single sharp input without requiring prior depth input. Subsequently, a lightweight defocus-aware aggregation module integrates these stacks into the encoder of a depth estimation framework. This mechanism allows the model to extract consistent geometric features from the defocus dimension while keeping the decoder architecture unchanged. Experiments demonstrate that BokehDepth achieves superior visual bokeh fidelity compared to depth-dependent rendering baselines and consistently enhances the metric accuracy of state-of-the-art monocular depth models.
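The optical connection between defocus and depth mentioned above can be made concrete with the standard thin-lens circle-of-confusion relation. The sketch below is textbook optics only; it is not BokehDepth's generative model, and the function name and the 50 mm f/2 lens are illustrative assumptions.

```python
def circle_of_confusion(depth_mm, focus_mm, focal_length_mm, f_number):
    """Blur-circle diameter on the sensor (mm) under the thin-lens model.

    c = A * f * |d - d_f| / (d * (d_f - f)), with aperture diameter A = f / N.
    All distances share one unit; focus_mm must exceed focal_length_mm.
    """
    aperture = focal_length_mm / f_number
    return (
        aperture * focal_length_mm * abs(depth_mm - focus_mm)
        / (depth_mm * (focus_mm - focal_length_mm))
    )


# A 50 mm lens at f/2 focused at 2 m, evaluated at three object distances.
for depth in (1000.0, 2000.0, 4000.0):
    blur = circle_of_confusion(depth, focus_mm=2000.0, focal_length_mm=50.0, f_number=2.0)
    print(f"depth={depth / 1000:.0f} m -> blur diameter={blur:.3f} mm")
```

Because the blur diameter changes monotonically with distance on each side of the focal plane, a stack rendered at several focus settings constrains depth, which is the intuition behind treating synthetic defocus as a geometric signal.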

2512.08125 2026-05-26 eess.IV cs.CV 版本更新

FlowSteer: Conditioning Flow Field for Consistent Image Restoration

FlowSteer: 条件化流场以实现一致图像恢复

Tharindu Wickremasinghe, Chenyang Qi, Harshana Weligampola, Zhengzhong Tu, Stanley H. Chan

发表机构 * Purdue University(普渡大学) HKUST(香港科技大学) Texas A&M University(德克萨斯农工大学)

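AI summary: Proposes FlowSteer, an operator-aware conditioning scheme that injects measurement priors along the sampling path, coupling a frozen flow's implicit guidance with explicit measurement constraints to achieve consistent zero-shot image restoration across super-resolution, deblurring, denoising, and colorization.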

Comments Accepted by CVPRF 2026. Camera-ready version. Project page: https://tharindu-nirmal.github.io/FlowSteer/

详情
AI中文摘要

基于流的文本到图像(T2I)模型在提示驱动图像生成方面表现出色,但在图像恢复(IR)中常常“偏离”对测量的忠实。先前的工作通过数据特定流或任务特定适配器来缓解这种漂移,但这些方法计算量大且不可跨任务扩展。这引出了一个问题:“难道我们不能高效地操纵流模型现有的生成能力吗?”为此,我们引入了FlowSteer(FS),一种算子感知的条件化方案,它在采样路径中注入测量先验,将冻结流的隐式引导与显式测量约束耦合。在超分辨率、去模糊、去噪和着色任务中,FS在严格的零样本设置下(无需重新训练模型,无需适配器)提高了测量一致性和身份保持。我们展示了流模型的性质及其对噪声的敏感性如何指导这种调度器的设计。FlowSteer虽然简单,但在利用流模型丰富的生成先验的同时,实现了更高保真度的重建图像。所有数据和代码将在 https://tharindu-nirmal.github.io/FlowSteer/ 公开。

英文摘要

Flow-based text-to-image (T2I) models excel at prompt-driven image generation, but falter on Image Restoration (IR), often "drifting away" from being faithful to the measurement. Prior work mitigates this drift with data-specific flows or task-specific adapters that are computationally heavy and do not scale across tasks. This raises the question: "Can't we efficiently manipulate the existing generative capabilities of a flow model?" To this end, we introduce FlowSteer (FS), an operator-aware conditioning scheme that injects measurement priors along the sampling path, coupling a frozen flow's implicit guidance with explicit measurement constraints. Across super-resolution, deblurring, denoising, and colorization, FS improves measurement consistency and identity preservation in a strictly zero-shot setting: no retrained models, no adapters. We show how the nature of flow models and their sensitivities to noise inform the design of such a scheduler. FlowSteer, although simple, achieves a higher fidelity of reconstructed images, while leveraging the rich generative priors of flow models. All data and code will be publicly available at https://tharindu-nirmal.github.io/FlowSteer/.

2511.19065 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Understanding, Accelerating, and Improving MeanFlow Training

理解、加速和改进MeanFlow训练

Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer

发表机构 * Yonsei University(延世大学) ETH Zurich(苏黎世联邦理工学院) University of Zurich(苏黎世大学) Max Planck ETH CLS(马克斯·普朗克ETH CLS) Google(谷歌)

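AI summary: By analyzing the interaction between instantaneous and average velocity, proposes an effective training scheme that accelerates the formation of instantaneous velocity and then gradually shifts the training emphasis, yielding faster convergence and better few-step generation.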

详情
AI中文摘要

MeanFlow通过联合学习瞬时速度场和平均速度场,有望在少步内实现高质量生成建模。然而,其底层训练动态仍不清楚。我们分析两种速度之间的相互作用,发现:(i) 建立良好的瞬时速度是学习平均速度的前提;(ii) 当时间间隔较小时,瞬时速度的学习受益于平均速度,但随着间隔增大而退化;(iii) 任务亲和性分析表明,对于一步生成至关重要的大间隔平均速度的平滑学习,依赖于先形成准确的瞬时速度和小间隔平均速度。在这些观察的指导下,我们设计了一种有效的训练方案,加速瞬时速度的形成,然后将重点从短间隔平均速度转移到长间隔平均速度。我们改进的MeanFlow训练实现了更快的收敛和显著更好的少步生成:使用相同的DiT-XL骨干网络,我们的方法在1-NFE ImageNet 256x256上达到了令人印象深刻的FID 2.87,而传统的MeanFlow基线为3.43。或者,我们的方法以2.5倍更短的训练时间或使用更小的DiT-L骨干网络,匹配MeanFlow基线的性能。

英文摘要

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
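For context, the relation between the two velocity fields can be written out explicitly. The abstract does not state these formulas; the notation below ($u$ for average velocity, $v$ for instantaneous velocity, $z_t$ for the state at time $t$, interval $[r, t]$) is assumed, and the second line follows from the first by differentiating with respect to $t$.

```latex
% Average velocity over [r, t] as the time-average of the instantaneous velocity
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, \mathrm{d}\tau

% Differentiating (t - r)\,u(z_t, r, t) with respect to t gives
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\, \frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t)
```

Read this way, the target for $u$ depends on $v$ directly, and the correction term grows with the gap $t - r$. This is consistent with findings (i) and (iii) above: an accurate instantaneous velocity comes first, and large-gap average velocities are the hardest to learn.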

2511.12046 2026-05-26 cs.CR cs.AI cs.CV cs.LG 版本更新

BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

BackWeak: 使用弱触发器和微调简单后门知识蒸馏

Shanmin Wang, Dongdong Zhao

发表机构 * School of Computer Science and Artificial Intelligence(计算机科学与人工智能学院) Wuhan University of Technology(武汉理工大学)

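AI summary: Proposes BackWeak, which implants a backdoor by fine-tuning the teacher model with a weak trigger, without surrogate student models or simulated distillation, and transfers reliably to different student architectures during standard distillation.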

详情
AI中文摘要

知识蒸馏对于压缩大型模型至关重要,但依赖从第三方仓库下载的预训练“教师”模型引入了严重的安全风险——最显著的是后门攻击。现有的知识蒸馏后门方法通常复杂且计算密集:它们使用替代学生模型和模拟蒸馏来保证可转移性,并构建类似于通用对抗扰动(UAP)的触发器,这些触发器在幅度上不隐蔽,本质上表现出强烈的对抗行为。本文质疑这种复杂性是否必要,并构建了隐蔽的“弱”触发器——具有可忽略对抗效应的不可察觉扰动。我们提出了BackWeak,一种简单、无替代的攻击范式。BackWeak表明,通过使用非常小的学习率对良性教师模型进行微调并嵌入弱触发器,即可植入强大的后门。我们证明,这种精细的微调足以嵌入后门,在受害者的标准蒸馏过程中可靠地转移到不同的学生架构,从而实现高攻击成功率。在多个数据集、模型架构和知识蒸馏方法上的广泛实证评估表明,BackWeak比以往复杂的方法更高效、更简单,且通常更隐蔽。本文呼吁研究知识蒸馏后门攻击的学者特别关注触发器的潜在对抗特性。

英文摘要

Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-party repositories introduces serious security risks, most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and construct triggers similar to universal adversarial perturbations (UAPs), which are not stealthy in magnitude and inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and constructs stealthy "weak" triggers: imperceptible perturbations that have negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted by simply fine-tuning a benign teacher with a weak trigger using a very small learning rate. We demonstrate that this delicate fine-tuning is sufficient to embed a backdoor that reliably transfers to diverse student architectures during a victim's standard distillation process, yielding high attack success rates. Extensive empirical evaluations on multiple datasets, model architectures, and KD methods show that BackWeak is more efficient, simpler, and often stealthier than previous elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to the trigger's potential adversarial characteristics.

2510.22827 2026-05-26 cs.CV cs.LG 版本更新

FairJudge: Abstention-Aware Multimodal Judges for Fairness and Alignment Evaluation in Text-to-Image Models

FairJudge: 文本到图像模型中公平性与对齐评估的弃权感知多模态裁判

Zahraa Al Sahili, Maimuna Nowaz, Maryam Fetanat, Ioannis Patras, Matthew Purver

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) Institut Jožef Stefan(乔泽夫·斯蒂芬研究所) Imperial College London(帝国理工学院)

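AI summary: Proposes the FairJudge protocol, which uses multimodal large language models as structured judges with closed label sets, an abstention mechanism, and evidence reporting to evaluate social-attribute prediction, profession grounding, and prompt-image alignment for text-to-image models.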

详情
AI中文摘要

评估文本到图像(T2I)系统不仅需要判断图像是否匹配提示,还需要判断社会显著属性是否被忠实表示且没有无根据的推断。现有的自动评估器通常依赖于以面部为中心的识别器或对比图像-文本相似度,这些方法提供的诊断反馈有限,并且通常在视觉证据模糊或缺失时强制进行预测。对于宗教和残疾等公平敏感属性,其中线索可能是上下文相关的、间接的或故意未指定的,这些评估器可能会遗漏细心的人类评审员会注意到的失败模式。我们引入了FairJudge,一种弃权感知的评估协议,该协议使用遵循指令的多模态LLM作为社会属性预测、职业定位和提示-图像对齐的结构化裁判。该协议将输出限制为封闭标签集,要求可见证据的理由,在线索不足时支持明确的unspecified决策,并将基于量规的对齐判断映射到$[-1,1]$。这些约束将MLLM裁判从开放式评估转变为可解析、可审计的评估程序。在四个属性预测基准和三个职业/对齐基准上,FairJudge优于或补充了CLIP、DeepFace、VIEScore和VQAScore。消融实验表明,封闭标签、弃权和证据报告对可靠性至关重要。我们进一步引入了DIVERSIFY和DIVERSIFY-Professions,这两个上下文丰富的资源用于评估超越面部可见或图标线索的社会表示和职业定位。我们发布了代码、提示、数据集、解析器日志和每张图像的裁判输出,以支持可重复的审计。

英文摘要

Evaluating text-to-image (T2I) systems requires judging not only whether an image matches a prompt, but also whether socially salient attributes are represented faithfully and without unsupported inference. Existing automated evaluators typically rely on face-centric recognizers or contrastive image-text similarity, which provide limited diagnostic feedback and often force predictions even when visual evidence is ambiguous or absent. For fairness-sensitive attributes such as religion and disability, where cues may be contextual, indirect, or intentionally unspecified, these evaluators can therefore miss failure modes that careful human reviewers would notice. We introduce FairJudge, an abstention-aware evaluation protocol that uses instruction-following multimodal LLMs as structured judges for social-attribute prediction, profession grounding, and prompt-image alignment. The protocol constrains outputs to closed label sets, requires visible-evidence rationales, supports an explicit "unspecified" decision when cues are insufficient, and maps rubric-based alignment judgments to $[-1,1]$. These constraints turn MLLM judging from open-ended assessment into a parseable, auditable evaluation procedure. Across four attribute-prediction benchmarks and three profession/alignment benchmarks, FairJudge outperforms or complements CLIP, DeepFace, VIEScore, and VQAScore. Ablations show that closed labels, abstention, and evidence reporting are central to reliability. We further introduce DIVERSIFY and DIVERSIFY-Professions, two context-rich resources for evaluating social representation and profession grounding beyond face-visible or iconic cues. We release code, prompts, datasets, parser logs, and per-image judge outputs to support reproducible auditing.
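The protocol constraints listed above (closed labels, an evidence rationale, explicit abstention, and a rubric mapped to $[-1,1]$) suggest a parser along the following lines. This is a hypothetical sketch: the JSON field names, the label set, the 1-to-5 rubric, and the rule that a label without evidence becomes "unspecified" are assumptions, not the released FairJudge code.

```python
import json

# Hypothetical closed label set for a profession-grounding task.
ALLOWED_LABELS = {"doctor", "teacher", "engineer", "unspecified"}


def parse_judgment(raw_json, rubric_min=1, rubric_max=5):
    """Validate one judge response and map its rubric score linearly to [-1, 1]."""
    data = json.loads(raw_json)
    label = str(data.get("label", "")).strip().lower()
    evidence = str(data.get("evidence", "")).strip()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label {label!r} is outside the closed set")
    if label != "unspecified" and not evidence:
        label = "unspecified"  # assumed rule: no visible evidence means abstain
    rubric = int(data["rubric"])
    if not rubric_min <= rubric <= rubric_max:
        raise ValueError("rubric score out of range")
    midpoint = (rubric_min + rubric_max) / 2
    half_range = (rubric_max - rubric_min) / 2
    return {"label": label, "evidence": evidence, "alignment": (rubric - midpoint) / half_range}


result = parse_judgment(
    '{"label": "Doctor", "evidence": "white coat and stethoscope", "rubric": 5}'
)
print(result)
```

Rejecting out-of-set labels with an exception, rather than silently coercing them, keeps every accepted record parseable and leaves a trace of malformed judge outputs for auditing.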

2510.15264 2026-05-26 cs.CV 版本更新

DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

DriveGen3D: 通过高效视频扩散提升前馈驾驶场景生成

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Yicheng Xiao, Donny Y. Chen, Jiwen Lu

发表机构 * Zhejiang University(浙江大学) GigaAI Tsinghua University(清华大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Peking University(北京大学) Monash University(莫纳什大学)

AI总结 提出DriveGen3D框架,结合快速视频扩散Transformer(FastDrive-DiT)和前馈3D重建模块(FastRecon3D),实现高质量、可控的动态3D驾驶场景生成,在长视频和3D一致性上达到最优。

Comments ICME 2026 Oral, Project Page: https://lhmd.top/drivegen3d

详情
AI中文摘要

我们提出了DriveGen3D,一个用于生成高质量、高可控性动态3D驾驶场景的新框架,解决了现有方法的关键局限性。当前的驾驶场景合成方法要么因扩展时间生成而面临高昂的计算需求,要么专注于没有3D表示的长时间视频合成,或者局限于静态单场景重建。我们的工作通过多模态条件控制,将加速的长期视频生成与大规模动态场景重建相结合,弥合了这一方法论差距。DriveGen3D引入了一个由两个专门组件组成的统一流程:FastDrive-DiT,一个高效的视频扩散Transformer,用于在文本和鸟瞰图(BEV)布局引导下进行高分辨率、时间连贯的视频合成;以及FastRecon3D,一个前馈模块,可快速构建跨时间的3D高斯表示,确保时空一致性。DriveGen3D能够生成长驾驶视频(分辨率最高$800\times424$、帧率$12$ FPS)及相应的3D场景,在保持效率的同时取得了最先进的结果。

英文摘要

We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enables the generation of long driving videos (up to $800\times424$ at $12$ FPS) and corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.

2510.14862 2026-05-26 cs.CV cs.DC 版本更新

Multi-modal video data-pipelines for machine learning with minimal human supervision

最小人工监督的机器学习多模态视频数据管道

Mihai-Cristian Pîrvu, Marius Leordeanu

发表机构 * Institute of Mathematics of the Romanian Academy "Simion Stoilow"(罗马尼亚科学院 "Simion Stoilow" 数学研究所) Faculty of Automatic Control and Computer Science, National University of Science and Technology POLITEHNICA(国立科学技术大学POLITEHNICA自动控制与计算机学院)

AI总结 提出一种全自动数据管道,利用预训练专家模型和程序化组合,在无需人工监督下融合多种视觉模态,并基于PHG-MAE模型实现高效蒸馏,以低参数(<1M)达到与300M参数模型竞争的性能,部署于实时语义分割和深度估计任务。

详情
AI中文摘要

现实世界本质上是多模态的。我们的工具以数字形式(如视频或声音)观察和拍摄其快照,但大部分信息丢失。同样,对于人类之间的动作和信息传递,语言被用作书面交流形式。传统上,机器学习模型是单模态的(例如,rgb -> 语义或文本 -> 情感分类)。最近的趋势走向双模态,其中图像和文本一起学习,然而,为了真正理解世界,我们需要整合所有这些独立的模态。在这项工作中,我们尝试使用很少或没有人工监督来结合尽可能多的视觉模态。为此,我们使用预训练专家模型和它们之间的程序化组合,在原始视频之上构建一个完全自主的数据管道,我们也将其开源。然后,我们利用PHG-MAE,一个专门设计用于利用多模态数据的模型。我们展示了这个模型被高效蒸馏成低参数(<1M)后,可以与约300M参数的模型竞争。我们将该模型部署在商品硬件上的手持设备或网络摄像头上,分析实时语义分割的用例。最后,我们使用相同的框架部署其他现成模型,如用于近实时深度估计的DPT。

英文摘要

The real world is inherently multi-modal. Our tools observe it and take snapshots of it in digital form, such as videos or sounds, but much of the information is lost. Similarly, for actions and information passing between humans, languages are used as a written form of communication. Traditionally, machine learning models have been unimodal (e.g., rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together; however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. To do this, we use pre-trained experts and procedural combinations between them on top of raw videos, using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, once efficiently distilled into a low-parameter (<1M) version, achieves results competitive with models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.

2510.11296 2026-05-26 cs.CV cs.LG 版本更新

$Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

$Δ\mathrm{Energy}$: 优化视觉-语言对齐过程中的能量变化提升OOD检测与OOD泛化

Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 本文提出ΔEnergy分数,利用重新对齐视觉-语言模态时的能量变化来改进分布外检测,并通过最大化其下界(称为EBM)构建统一微调框架,同时提升分布外检测和分布外泛化性能。

Comments Accepted by NeurIPS2025

详情
AI中文摘要

近期针对视觉-语言模型(VLM)的方法在下游任务快速适应中取得了显著成功。当应用于真实世界下游任务时,VLM不可避免地会遇到分布内(ID)数据和分布外(OOD)数据。OOD数据集通常包括协变量偏移(例如,已知类别但图像风格变化)和语义偏移(例如,测试时未见类别)。这凸显了提升VLM对协变量偏移OOD数据的泛化能力,同时有效检测开放集语义偏移OOD类别的重要性。本文受重新对齐视觉-语言模态时(具体通过将最大余弦相似度直接降低到低值)观察到的闭集数据中显著能量变化的启发,提出了一种新的OOD分数,命名为ΔEnergy。ΔEnergy显著优于基于能量的原始OOD分数,为OOD检测提供了更可靠的方法。此外,ΔEnergy还能同时提升协变量偏移下的OOD泛化,这是通过ΔEnergy的下界最大化(称为EBM)实现的。理论上证明EBM不仅能增强OOD检测,还能产生领域一致的Hessian矩阵,这作为OOD泛化的强指标。基于这一发现,我们开发了一个统一的微调框架,能够提升VLM在OOD泛化和OOD检测两方面的鲁棒性。在具有挑战性的OOD检测和泛化基准上的大量实验证明了我们方法的优越性,在AUROC上比近期方法提升了10%到25%。

英文摘要

Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named ΔEnergy. ΔEnergy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, ΔEnergy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for ΔEnergy (termed EBM). EBM is theoretically proven not only to enhance OOD detection but also to yield a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we develop a unified fine-tuning framework that improves VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
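
As a rough illustration of the idea in this abstract, the sketch below computes the standard free-energy score over image-to-class cosine similarities and then the change in that energy after the largest similarity is pushed down to a low value. This is a simplified reading of the abstract, not the paper's exact definition: the temperature, the choice of `low_value = 0.0`, the sign convention, and the function names are all assumptions.

```python
import numpy as np


def energy(sims, temperature=0.01):
    """Free energy E = -T * logsumexp(sims / T), computed in a numerically stable way."""
    z = np.asarray(sims, dtype=float) / temperature
    m = z.max()
    return -temperature * (m + np.log(np.exp(z - m).sum()))


def delta_energy(sims, low_value=0.0, temperature=0.01):
    """Energy change when the maximum similarity is replaced by `low_value`.

    Assumed convention: energy after masking minus energy before masking,
    so a sample dominated by one confident class gives a large positive value.
    """
    sims = np.asarray(sims, dtype=float)
    masked = sims.copy()
    masked[np.argmax(sims)] = low_value
    return energy(masked, temperature) - energy(sims, temperature)
```

Under these assumptions, a sample with one clearly dominant class similarity, such as `[0.30, 0.10, 0.10]`, yields a much larger change than a flat profile such as `[0.12, 0.10, 0.10]`, which matches the intuition that closed-set data shows a substantial energy change.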

2510.10921 2026-05-26 cs.CV cs.AI cs.LG 版本更新

FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

FG-CLIP 2: 一种双语细粒度视觉-语言对齐模型

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin

发表机构 * AI Research(360人工智能研究院)

AI总结 提出FG-CLIP 2双语视觉语言模型,通过区域-文本匹配、长描述建模和文本内模态对比损失等细粒度监督,在英中双语上实现细粒度对齐,在29个数据集上取得最优结果。

Comments Accepted in ICML2026

详情
AI中文摘要

细粒度视觉-语言理解需要视觉内容与语言描述之间的精确对齐,这一能力在当前模型中仍然有限,尤其是在非英语环境下。虽然CLIP等模型在全局对齐上表现良好,但它们往往难以捕捉对象属性、空间关系和语言表达中的细粒度细节,且对双语理解的支持有限。为应对这些挑战,我们提出了FG-CLIP 2,一个旨在推进英语和中文细粒度对齐的双语视觉语言模型。我们的方法利用了丰富的细粒度监督,包括区域-文本匹配和长描述建模,以及多个判别性目标。我们进一步引入了文本内模态对比损失,以更好地区分语义相似的描述。在精心策划的大规模英语和中文数据混合上训练,包括新发布的1200万中文区域-文本数据集,FG-CLIP 2实现了强大的双语性能。为进行严格评估,我们提出了一个新的中文多模态理解基准,包括长描述检索和边界框分类。在8个任务的29个数据集上的大量实验表明,FG-CLIP 2优于现有方法,在两种语言上均达到了最先进的结果。我们发布了模型、代码和基准,以促进双语细粒度视觉-语言对齐的未来研究。

英文摘要

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, including a newly released 12M Chinese region-text dataset, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained vision-language alignment.

2510.03827 2026-05-26 cs.CV cs.RO 版本更新

LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

LIBERO-PRO:超越记忆的视觉-语言-动作模型鲁棒与公平评估

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun

发表机构 * Huazhong University of Science and Technology(华中科技大学) College of AI, Tsinghua University(清华大学人工智能学院) Wuhan University of Technology(武汉理工大学) Lehigh University(里海大学)

AI总结 针对LIBERO基准评估中的记忆偏差问题,提出LIBERO-PRO扩展基准,通过在操作对象、初始状态、任务指令和环境四个维度施加合理扰动,揭示现有VLA模型性能从90%以上骤降至0.0%的严重缺陷,并呼吁采用鲁棒评估方法。

Comments 10 pages, 7 figures, 0 tables

详情
AI中文摘要

LIBERO已成为评估视觉-语言-动作(VLA)模型的广泛采用的基准;然而,其当前的训练和评估设置存在问题,常常导致性能估计膨胀,并阻碍公平的模型比较。为了解决这些问题,我们引入了LIBERO-PRO,一个扩展的LIBERO基准,系统性地评估模型在四个维度(操作对象、初始状态、任务指令和环境)的合理扰动下的性能。实验结果表明,尽管现有模型在标准LIBERO评估下达到90%以上的准确率,但在我们的泛化设置下,其性能骤降至0.0%。关键的是,这种差异暴露了模型依赖于对训练集中动作序列和环境布局的死记硬背,而非真正的任务理解或环境感知。例如,当目标对象被替换为无关物品时,模型仍持续执行抓取动作;即使给出被破坏的指令甚至混乱的令牌,其输出也保持不变。这些发现揭示了当前评估实践中的严重缺陷,我们呼吁社区放弃误导性方法,转而采用对模型泛化和理解能力的鲁棒评估。我们的代码可在 https://github.com/Zxy-MLlab/LIBERO-PRO 获取。

英文摘要

LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged when given corrupted instructions or even messy tokens. These findings expose severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.

2509.24621 2026-05-26 cs.CV 版本更新

FreeRet: MLLMs as Training-Free Retrievers

FreeRet: 无需训练的多模态大语言模型检索器

Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Chunxu Liu, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang

发表机构 * Nanjing University(南京大学) Shanghai AI Lab(上海人工智能实验室) Shanghai Jiaotong University(上海交通大学) Institute of Science Tokyo(东京科学大学) Zhejiang University(浙江大学)

AI总结 提出FreeRet框架,将现成的多模态大语言模型转化为无需额外训练的两阶段检索器,通过语义嵌入和重排序提升检索性能。

Comments ICML 2026

详情
AI中文摘要

多模态大语言模型正成为混合模态检索的通用基础。然而,它们通常需要大量的后期训练才能转化为用于检索的对比编码器。本文提出:现成的多模态大语言模型能否在无需额外训练的情况下作为强大的检索器?我们提出了FreeRet,一个即插即用的框架,可将任何多模态大语言模型转化为两阶段检索器。FreeRet首先直接从模型中导出语义嵌入以进行快速候选搜索,然后利用其推理能力进行精确重排序。该框架贡献了三个进步:绕过词汇对齐层以获得语义保真的嵌入、通过显式先验条件化表示生成、以及通过中性选择框架减轻重排序中的框架效应。在涵盖46个数据集的MMEB和MMEB-V2基准测试中,FreeRet显著优于在数百万个对上训练的模型。除基准测试外,FreeRet与模型无关,可无缝扩展至不同多模态大语言模型系列和规模,保留其生成能力,支持任意模态组合,并将检索、重排序和生成统一到单个模型内的端到端RAG中。我们的发现表明,经过精心利用的预训练多模态大语言模型可以在无需训练的情况下作为强大的检索引擎,弥补了其作为通才角色的关键差距。

英文摘要

Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.

2509.21592 2026-05-26 cs.CV cs.AI cs.LG 版本更新

What Happens Next? Anticipating Future Motion by Generating Point Trajectories

接下来会发生什么?通过生成点轨迹预测未来运动

Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi

发表机构 * Visual Geometry Group, University of Oxford(牛津大学视觉几何组)

AI总结 提出一种基于单张图像预测未来运动的方法,通过生成密集轨迹网格来捕捉场景动态和不确定性,相比现有方法更准确多样,并验证其在机器人等下游任务中的有效性。

详情
Journal ref
ICLR 2026
AI中文摘要

我们考虑从单张图像预测运动的问题,即预测世界中物体可能如何移动,而无法观察其他参数如物体速度或施加的力。我们将此任务表述为密集轨迹网格的条件生成,模型紧密遵循现代视频生成器的架构,但输出运动轨迹而非像素。这种方法捕捉了场景范围的动态和不确定性,比先前的回归器和生成器产生更准确和多样化的预测。我们在模拟数据上广泛评估了我们的方法,展示了其在机器人等下游应用中的有效性,并在真实世界的直觉物理数据集上显示出有希望的准确性。尽管最近最先进的视频生成器常被视为世界模型,但我们表明它们在从单张图像预测运动方面存在困难,即使在简单的物理场景如落块或机械物体交互中,尽管对这些数据进行了微调。我们表明这一局限性源于生成像素的开销,而非直接建模运动。

英文摘要

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

2509.09658 2026-05-26 cs.CV 版本更新

Measuring Epistemic Humility in Multimodal Large Language Models

测量多模态大语言模型中的认知谦逊

Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Hong Kong Baptist University(香港浸会大学)

AI总结 提出HumbleBench基准,通过在强制选择的多项选择题中引入“以上皆非”选项,评估多模态大语言模型拒绝错误选项的谦逊行为。

详情
AI中文摘要

多模态大语言模型(MLLMs)中的幻觉——即模型生成与输入图像不一致的内容——在现实应用中带来显著风险,从视觉问答中的错误信息到决策中的不安全错误。现有基准主要测试识别准确性,即评估模型能否在干扰项中选择正确答案。这忽略了可信AI的另一个重要能力:当没有提供的选项得到图像支持时,能够识别并避免做出错误选择,这是一种与谦逊相关的行为。我们提出了HumbleBench,这是一个新的幻觉基准,旨在评估MLLMs在强制选择多项选择设置中拒绝错误选项的能力,其中包含“以上皆非”选项。基于全景场景图数据集,我们利用对象和关系的细粒度场景图注释,使用候选属性线索,并提示GPT-4-Turbo生成多项选择问题,随后进行严格的人工筛选。每个问题都包含一个“以上皆非”选项,要求模型不仅识别正确的视觉信息,还要识别何时没有提供的答案有效。我们在HumbleBench上评估了各种最先进的MLLMs——包括通用型、专门推理型和专有模型——并为社区报告了实证结果。通过纳入明确的错误选项拒绝,HumbleBench填补了当前评估套件中的一个关键空白,评估了一种较窄但重要的、与可信多模态推理相关的弃权行为。我们的代码和数据集已公开发布,可在https://github.com/maifoundations/HumbleBench获取。

英文摘要

Hallucinations in multimodal large language models (MLLMs), where the model generates content inconsistent with the input image, pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks another important capability for trustworthy AI: recognizing when none of the provided options is supported by the image and abstaining from committing to a false choice, a humility-related behavior. We present HumbleBench, a new hallucination benchmark designed to evaluate false-option rejection in MLLMs under a forced-choice multiple-choice setting with a "None of the above" option. Building on a panoptic scene graph dataset, we leverage fine-grained scene graph annotations for objects and relations, use candidate attribute cues, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs, including general-purpose, specialized reasoning, and proprietary models, on HumbleBench and report empirical findings for the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites by assessing a narrower but important abstention-oriented behavior that is relevant to trustworthy multimodal reasoning. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.

2509.00056 2026-05-26 cs.CV 版本更新

Apex-Centered Spatio-Temporal Rank Pooling and Gradient Attention for Micro-Expression Recognition

以顶点帧为中心的时空秩池化与梯度注意力用于微表情识别

Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo

发表机构 * Faculty of Information Technology, VNU University of Engineering and Technology(越南国家大学工程技术大学信息技术学院)

AI总结 提出微表情时空图像(MESTI)和微表情梯度注意力网络(MEGANet),通过改进输入模态和注意力机制提升微表情识别性能。

详情
AI中文摘要

微表情识别(MER)由于微表情的细微和短暂性是一项具有挑战性的任务。传统的输入模态,如顶点帧、光流和动态图像,往往无法充分捕捉这些短暂的面部运动,导致性能次优。在本研究中,我们引入了微表情时空图像(MESTI),这是一种针对微表情的动态秩池化的重新表述,将视频序列转换为单张图像,同时强调微表情的起始-顶点-结束时间模式。此外,我们提出了微表情梯度注意力网络(MEGANet),该网络包含一个提出的梯度注意力块,以增强从微表情中提取细粒度运动特征。通过结合MESTI和MEGANet,我们旨在建立一种更有效的MER方法。进行了大量实验以评估MESTI的有效性,将其与现有输入模态在常规架构上进行比较。此外,我们证明将先前发表的MER网络的输入替换为MESTI会导致一致的性能提升。还评估了MEGANet的性能,显示我们提出的网络在SMIC-HS、SAMM数据集上达到了最先进的结果,在CASMEII数据集上具有竞争力的性能,并且在报告的跨数据集评估设置中也取得了领先性能。MESTI和MEGANet的组合始终优于比较方法。这些发现强调了MESTI作为优越输入模态和MEGANet作为先进识别网络的潜力,旨在在各种应用中实现更有效的MER系统。

英文摘要

Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a micro-expression-specific reformulation of dynamic rank pooling that transforms a video sequence into a single image while emphasizing the onset-apex-offset temporal pattern of micro-expressions. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a proposed Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the effectiveness of MESTI, comparing it with existing input modalities across regular architectures. Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. The performance of MEGANet is also evaluated, showing that our proposed network achieves state-of-the-art results on the SMIC-HS and SAMM datasets and competitive performance on the CASMEII dataset; it also achieves leading performance in the reported cross-dataset evaluation settings. The combination of MESTI and MEGANet consistently outperforms the compared methods. These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, paving the way for more effective MER systems in a variety of applications.

2508.13309 2026-05-26 cs.CV cs.LG 版本更新

DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples

DASH:一种用于合成有效且隐蔽的对抗样本的元攻击框架

Abdullah Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

发表机构 * University of Maine(缅因大学) University of Florida(佛罗里达大学) University of Tennessee, Knoxville(田纳西大学诺克斯维尔分校)

AI总结 提出DASH元攻击框架,通过多阶段自适应组合Lp约束攻击方法,生成有效且感知对齐的对抗样本,在多个数据集上优于现有方法。

Comments Accepted to CVPR 2026

详情
AI中文摘要

在白盒设置下,已有大量技术被提出用于在严格的Lp范数约束下生成对抗样本。然而,这类范数受限的样本往往与人类感知不一致,只有少数方法专门探索感知对齐的对抗样本。此外,尚不清楚能否有效利用Lp约束攻击的见解来提升感知效能。本文介绍DASH,一个完全可微的元攻击框架,通过策略性地组合现有基于Lp的攻击方法,生成有效且感知对齐的对抗样本。DASH以多阶段方式运行:在每个阶段,它使用学习到的自适应权重聚合来自多个基础攻击的候选对抗样本,并将结果传播到下一阶段。一种新颖的元损失函数通过联合最小化误分类损失和感知失真来指导这一过程,使框架能够动态调整每个基础攻击在各阶段的贡献。我们在CIFAR-10、CIFAR-100和ImageNet上对对抗训练模型评估DASH。尽管仅依赖基于Lp约束的方法,DASH显著优于最先进的感知攻击如AdvAD,实现了更高的攻击成功率(例如提升20.63%)和更优的视觉质量(以SSIM、LPIPS和FID衡量,分别提升约11、0.015和5.7)。此外,DASH对未见过的防御具有良好的泛化能力,使其成为评估鲁棒性的实用且强大的基线,无需为每种新防御手工设计自适应攻击。

英文摘要

Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only a few methods specifically explore perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained methods, DASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of approximately 11, 0.015, and 5.7, respectively). Furthermore, DASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
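
The per-stage aggregation step described in this abstract can be pictured as a weighted blend of candidate adversarial images. The sketch below is only an illustration under stated assumptions: it uses a softmax over hypothetical learnable logits to obtain the adaptive weights, and clips the result to a valid `[0, 1]` pixel range. The actual weighting scheme, the meta-loss, and the base attacks used by DASH are not specified in the abstract and are not reproduced here; `softmax` and `aggregate_stage` are illustrative names.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def aggregate_stage(candidates, weight_logits):
    """Blend K candidate adversarial images with softmax-normalized weights.

    candidates: list of K arrays with identical shape (one per base attack).
    weight_logits: K hypothetical learnable parameters for this stage.
    Returns a convex combination of the candidates, clipped to [0, 1].
    """
    weights = softmax(weight_logits)             # shape (K,)
    stacked = np.stack(candidates)               # shape (K, ...)
    mixed = np.tensordot(weights, stacked, axes=1)
    return np.clip(mixed, 0.0, 1.0)
```

In a multi-stage setup, the output of one call would serve as the starting point for the base attacks of the next stage. Because the blend is a differentiable function of `weight_logits`, such weights could in principle be trained against a loss that combines misclassification and perceptual distortion terms.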

2508.12628 2026-05-26 cs.CV 版本更新

Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning

Creative4U: 基于MLLMs的广告创意图像选择器与比较推理

Yukang Lin, Xiang Zhang, Shichang Jia, Bowen Wan, Chenghan Fu, Xudong Ren, Yueran Liu, Wanxian Guan, Pengji Wang, Jian Xu, Bo Zheng, Baolin Liu

发表机构 * Tsinghua University(清华大学) Alibaba Group(阿里巴巴集团)

AI总结 提出基于多模态大语言模型的创意图像评估与选择范式,通过构建比较推理数据集CreativePair和强化学习方法Creative4U,实现可解释的创意选择。

详情
AI中文摘要

广告中的创意图像是电子商务平台的核心和灵魂。引人注目的创意图像可以提升用户的购物体验,增加广告主的收入以及平台的广告收入。随着AIGC技术的出现,广告主能够以极低的成本生产大量创意图像。然而,他们难以评估创意质量以进行选择。现有方法主要关注创意排序,无法满足可解释的创意选择需求。在这项工作中,我们提出了首个可解释的创意评估与选择范式。借助多模态大语言模型(MLLMs),我们的方法将创意图像的评估与选择整合到自然语言生成任务中。为了促进这项研究,我们构建了CreativePair,这是首个比较推理驱动的创意数据集,包含8k个带标注的图像对,每个样本包含一个标签,指示哪张图像更优。此外,我们引入了Creative4U(读作Creative for You),一种基于MLLMs的创意选择器,它考虑了用户的兴趣。通过Reason-to-Select RFT,其中包括基于思维链的监督微调(CoT-SFT)和基于组相对策略优化(GRPO)的强化学习,Creative4U能够准确评估和选择创意图像。离线和在线实验均证明了我们方法的有效性。我们的代码和数据集将公开,以推动研究和工业应用。

英文摘要

Creative images in advertising are the heart and soul of e-commerce platforms. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess creative quality when selecting among them. Existing methods primarily focus on creative ranking, which fails to address the need for explainable creative selection. In this work, we propose the first paradigm for explainable creative assessment and selection. Powered by multimodal large language models (MLLMs), our approach integrates the assessment and selection of creative images into a natural language generation task. To facilitate this research, we construct CreativePair, the first comparative reasoning-induced creative dataset featuring 8k annotated image pairs, with each sample including a label indicating which image is superior. Additionally, we introduce Creative4U (pronounced Creative for You), an MLLM-based creative selector that takes into account users' interests. Through Reason-to-Select RFT, which includes supervised fine-tuning with Chain-of-Thought (CoT-SFT) and Group Relative Policy Optimization (GRPO) based reinforcement learning, Creative4U is able to evaluate and select creative images accurately. Both offline and online experiments demonstrate the effectiveness of our approach. Our code and dataset will be made public to advance research and industrial applications.

2506.23700 2026-05-26 eess.IV cs.CV 版本更新

MedSAM-CA: A CNN-Augmented ViT with Attention-Enhanced Multi-Scale Fusion for Medical Image Segmentation

MedSAM-CA:一种用于医学图像分割的CNN增强型ViT与注意力增强多尺度融合方法

Peiting Tian, Xi Chen, Haixia Bi, Fan Li

AI总结 提出MedSAM-CA,通过卷积注意力增强边界细化网络和注意力增强特征融合块,在低资源条件下微调预训练MedSAM模型,实现高精度医学图像分割。

Comments Withdrawn by the authors because the current version requires substantial revision in the description of the experimental settings and data preprocessing procedures. The manuscript should not be cited in its current form

详情
AI中文摘要

医学图像分割在临床诊断和治疗规划中起着关键作用,其中精确的边界勾画对于准确的病灶定位、器官识别和定量评估至关重要。近年来,基于深度学习的方法显著提高了分割精度。然而,仍存在两个主要挑战。首先,这些方法的性能严重依赖于大规模标注数据集,而在医学场景中,由于隐私问题和高昂的标注成本,这些数据集往往难以获得。其次,临床挑战性场景,例如某些成像模态的低对比度以及恶性肿瘤引起的模糊病灶边界,仍然对精确分割构成障碍。为了解决这些挑战,我们提出了MedSAM-CA,一种架构级别的微调方法,通过适应预训练的基础模型Medical Segment Anything (MedSAM)来减轻对大量手动标注的依赖。MedSAM-CA引入了两个关键组件:卷积注意力增强边界细化网络(CBR-Net)和注意力增强特征融合块(Atte-FFB)。CBR-Net与MedSAM编码器并行运行,利用分层卷积处理恢复长距离注意力机制可能忽略的边界信息。嵌入在MedSAM解码器中的Atte-FFB将来自CBR-Net跳跃连接的多级细粒度特征与解码器内上采样的全局表示融合,以增强边界勾画精度。在涵盖皮肤镜、CT和MRI成像模态的公开数据集上的实验验证了MedSAM-CA的有效性。在皮肤镜数据集上,MedSAM-CA仅使用完整训练数据的2%就达到了94.43%的Dice系数,达到了完整数据训练性能的97.25%,展示了在低资源临床场景中的强大有效性。

英文摘要

Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning, where accurate boundary delineation is essential for precise lesion localization, organ identification, and quantitative assessment. In recent years, deep learning-based methods have significantly advanced segmentation accuracy. However, two major challenges remain. First, the performance of these methods heavily relies on large-scale annotated datasets, which are often difficult to obtain in medical scenarios due to privacy concerns and high annotation costs. Second, clinically challenging scenarios, such as low contrast in certain imaging modalities and blurry lesion boundaries caused by malignancy, still pose obstacles to precise segmentation. To address these challenges, we propose MedSAM-CA, an architecture-level fine-tuning approach that mitigates reliance on extensive manual annotations by adapting the pretrained foundation model, Medical Segment Anything (MedSAM). MedSAM-CA introduces two key components: the Convolutional Attention-Enhanced Boundary Refinement Network (CBR-Net) and the Attention-Enhanced Feature Fusion Block (Atte-FFB). CBR-Net operates in parallel with the MedSAM encoder to recover boundary information potentially overlooked by long-range attention mechanisms, leveraging hierarchical convolutional processing. Atte-FFB, embedded in the MedSAM decoder, fuses multi-level fine-grained features from skip connections in CBR-Net with global representations upsampled within the decoder to enhance boundary delineation accuracy. Experiments on publicly available datasets covering dermoscopy, CT, and MRI imaging modalities validate the effectiveness of MedSAM-CA. On the dermoscopy dataset, MedSAM-CA achieves 94.43% Dice with only 2% of the full training data, reaching 97.25% of full-data training performance and demonstrating strong effectiveness in low-resource clinical settings.

2506.17629 2026-05-26 cs.CV cs.AI cs.CL 版本更新

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

CLiViS: 通过语言-视觉协同释放认知地图用于具身视觉推理

Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang

发表机构 * School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) Fudan University(复旦大学)

AI总结 提出CLiViS框架,通过LLM进行高层任务规划并协调VLM驱动的开放世界视觉感知,构建动态认知地图以迭代更新场景上下文,实现无需训练的具身视觉推理。

详情
AI中文摘要

具身视觉推理(EVR)旨在基于自我中心视频遵循复杂、自由形式的指令,从而在动态环境中实现语义理解和时空推理。尽管具有潜力,EVR面临复杂指令多样性和长期自我中心视频中复杂时空动态的挑战。现有解决方案要么在静态视频描述上使用大型语言模型(LLM),这通常会遗漏关键视觉细节,要么依赖端到端视觉语言模型(VLM),后者在逐步组合推理上存在困难。考虑到LLM在推理和VLM在感知方面的互补优势,我们提出了CLiViS。这是一个新颖的无训练框架,利用LLM进行高层任务规划,并协调VLM驱动的开放世界视觉感知,以迭代更新场景上下文。基于这种协同,CLiViS的核心是一个动态认知地图,它在推理过程中不断演化。该地图构建了具身场景的结构化表示,连接了低层感知和高层推理。跨多个基准的大量实验证明了CLiViS的有效性和通用性,特别是在处理长期视觉依赖方面。代码可在 https://github.com/Teacher-Tom/CLiViS 获取。

英文摘要

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

2506.10689 2026-05-26 cs.CV 版本更新

Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery

通过多任务和多年龄方法在无约束图像中筛查未成年人的未成年人检测

Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral

发表机构 * Department of Electrical, Systems and Automation Engineering(电气、系统与自动化工程系)

AI总结 提出一种基于冻结FaRL视觉语言骨干和紧凑两层MLP的多任务架构,结合α重加权焦点损失和年龄平衡采样,在无约束图像中准确检测未成年人,并在新基准上显著提升性能。

详情
AI中文摘要

在无约束图像中准确自动筛查未成年人需要模型对分布偏移具有鲁棒性,并能应对公共数据集中儿童代表性不足的问题。为解决这些问题,我们提出了一种多任务架构,基于冻结的FaRL视觉语言骨干,结合一个紧凑的两层MLP,该MLP在一个年龄回归头和四个二元未成年人头(12、15、18和21岁)之间共享特征,并包含专门的超/低龄判别任务。该设计聚焦于法律关键年龄范围,同时保持骨干冻结。通过$α$重加权焦点损失和年龄平衡小批量采样缓解类别不平衡,同时通过年龄间隔移除阈值附近的模糊样本。评估在我们的新总体未成年人基准(303k清洗训练图像,110k测试图像)上进行,定义了“ASORES-39k”受限总体测试(去除噪声最大的域)和年龄估计野移测试“ASWIFT-20k”(20k图像,强调极端姿态(>45°)、表情和低图像质量以模拟现实世界偏移)。在清洗总体集上使用重采样和年龄间隔训练后,我们的多年龄模型“F”将ASORES-39k上的平均绝对误差从4.175岁(仅年龄基线)降至4.068岁,并在1%虚假成人率下将18岁以下检测的F2分数从0.801提升至0.857。在ASWIFT-20k上,相同配置几乎保持0.99的召回率,同时F2从0.742提升至0.833,展示了域偏移的鲁棒性。

英文摘要

Accurate automatic screening of minors in unconstrained images requires models robust to distribution shift and resilient to the under-representation of children in public datasets. To address these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks, based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary underage heads (12, 15, 18, and 21 years). This design focuses on the legally critical age range while keeping the backbone frozen. Class imbalance is mitigated through an $α$-reweighted focal loss and age-balanced mini-batch sampling, while an age gap removes ambiguous samples near thresholds. Evaluation is conducted on our new Overall Underage Benchmark (303k cleaned training images, 110k test images), which defines both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild-shifts test "ASWIFT-20k" of 20k images, stressing extreme poses ($>$45°), expressions, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model "F" reduces the mean absolute error on ASORES-39k from 4.175 y (age-only baseline) to 4.068 y and improves the under-18 detection F2 score from 0.801 to 0.857 at a 1% false-adult rate. On ASWIFT-20k, the same configuration nearly sustains 0.99 recall while F2 rises from 0.742 to 0.833, demonstrating robustness to domain shift.
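
The abstract names an α-reweighted focal loss but does not give its formula. As a rough illustration only, the sketch below uses the commonly cited binary focal loss form with a class weight α; the function name `alpha_focal_loss` and the default values of `alpha` and `gamma` are assumptions for this example, and the paper's exact variant may differ.

```python
import math

def alpha_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single sample (illustrative sketch).

    p: predicted probability of the positive class (e.g. "under 18").
    y: ground-truth label, 1 for positive and 0 for negative.
    alpha: weight on the positive class; (1 - alpha) is used for negatives.
    gamma: focusing exponent that down-weights easy samples.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)          # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=0.5` this reduces to half of the ordinary cross-entropy, which is a quick sanity check on the form used here.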

2505.11758 2026-05-26 cs.CV cs.AI cs.GR cs.RO 版本更新

Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

具有预测性提示和负学习的可泛化视觉语言少样本适应

Sriram Mandalika

发表机构 * Hasso Plattner Institute, University of Potsdam(哈索·普拉特纳研究所,波茨坦大学)

AI总结 提出SCAN框架,通过查询自适应负路由、LLM引导对比提示和自适应融合权重,解决视觉语言模型少样本适应中负类信号处理问题,在11个基准上平均提升4.61%。

详情
AI中文摘要

视觉语言模型的少样本适应在推理时如何处理负类信号方面仍然存在根本性限制。现有方法对所有查询应用统一的负抑制,忽略了最具破坏性的混淆是查询特定的,并且随支持集几何形状而变化。我们提出SCAN(选择性混淆感知负样本),一个通过三个针对性贡献解决这一问题的框架。在推理中,查询自适应负路由将抑制限制在每个查询最易混淆的前K个类别,无需额外参数。通用负文本模板被替换为LLM引导的对比提示,描述易混淆类别对之间的区分属性,在关键处锐化文本决策边界。基于支持集Fisher可判别性估计的无参数自适应融合权重消除了手动调整视觉语言权衡的需要。在11个标准基准上评估,SCAN在16-shot设置下平均优于先前的基于提示和基于适配器的方法4.61%,在类间混淆最严重的细粒度数据集上提升高达7.70%。SCAN在分布偏移下也表现出强泛化性,在四个ImageNet OOD变体上平均提升2.95%,并在显著标签噪声下保持稳健性能,在50%标签损坏下的准确率仍超过最强竞争方法的干净基线。

英文摘要

Few-shot adaptation of vision-language models remains fundamentally limited by how negative class signals are handled at inference. Existing methods apply uniform negative suppression across all queries, ignoring that the most damaging confusions are query-specific and shift with support-set geometry. We introduce SCAN (Selective Confusion-Aware Negatives), a framework that addresses this gap through three targeted contributions. In inference, query-adaptive negative routing restricts suppression to the top-K most confusable classes per query, requiring zero additional parameters. Generic negative text templates are replaced with LLM-bootstrapped contrastive prompts that describe discriminative attributes between confusable class pairs, sharpening the textual decision boundary where it matters most. A parameter-free adaptive fusion weight estimated from support-set Fisher discriminability removes the need for manual tuning of the vision-language trade-off. Evaluated across 11 standard benchmarks, SCAN consistently outperforms prior prompt-based and adapter-based methods by an average of 4.61% at 16-shot, with gains of up to 7.70% on fine-grained datasets where inter-class confusion is most severe. SCAN also generalizes strongly under distribution shift, improving by 2.95% on average across four ImageNet OOD variants, and maintains robust performance under significant label noise, with accuracy under 50% label corruption still exceeding the clean baseline of the strongest competing method.
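
The abstract describes query-adaptive negative routing only at a high level. The sketch below shows one possible reading, under stated assumptions: each class has a positive score and a negative score for the current query, and the negative penalty is applied only to the K classes with the highest positive scores. The function `route_negatives`, its `weight` parameter, and the choice of "highest positive score" as the confusability criterion are hypothetical and not taken from the paper.

```python
def route_negatives(pos_scores, neg_scores, k=3, weight=1.0):
    """Apply negative suppression to the top-k candidate classes only.

    pos_scores: per-class positive similarity for one query.
    neg_scores: per-class negative similarity for the same query.
    Returns a new list of scores; classes outside the top-k are unchanged.
    """
    # Rank classes by positive score and keep the k strongest candidates.
    top_k = sorted(range(len(pos_scores)),
                   key=lambda c: pos_scores[c], reverse=True)[:k]
    routed = list(pos_scores)
    for c in top_k:
        routed[c] -= weight * neg_scores[c]
    return routed
```

Because only indices are selected and no weights are learned, this step adds no parameters, which matches the "zero additional parameters" claim in spirit.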

2503.01122 2026-05-26 cs.CV 版本更新

ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization

ACCORD: 通过依赖正则化缓解文本到图像扩散个性化中的概念耦合

Shizhan Liu, Hao Zheng, Hang Yu, Jianguo Li

发表机构 * Ant Group(蚂蚁集团)

AI总结 提出两种即插即用损失函数(去噪解耦损失和先验解耦损失)直接最小化两种依赖差异,以缓解概念耦合问题,实现文本控制与个性化保真度的更好平衡。

详情
AI中文摘要

图像个性化因其能够仅使用少量参考图像定制文本到图像生成而受到关注。然而,图像个性化的一个关键挑战是概念耦合问题,即有限的参考图像导致模型在个性化目标与其他概念之间形成不希望的关联。当前方法试图间接解决这个问题,导致文本控制与个性化保真度之间的次优平衡。本文通过统计分析直接处理概念耦合问题,揭示其源于两种不同的依赖差异来源。因此,我们提出了两种互补的即插即用损失函数:去噪解耦损失和先验解耦损失,每种损失旨在最小化一种依赖差异。大量实验表明,我们的方法在文本控制与个性化保真度之间实现了更优的权衡。

英文摘要

Image personalization has garnered attention for its ability to customize Text-to-Image generation using only a few reference images. However, a key challenge in image personalization is the issue of concept coupling, where the limited number of reference images leads the model to form unwanted associations between the personalization target and other concepts. Current methods attempt to tackle this issue indirectly, leading to a suboptimal balance between text control and personalization fidelity. In this paper, we take a direct approach to the concept coupling problem through statistical analysis, revealing that it stems from two distinct sources of dependence discrepancies. We therefore propose two complementary plug-and-play loss functions: Denoising Decouple Loss and Prior Decouple Loss, each designed to minimize one type of dependence discrepancy. Extensive experiments demonstrate that our approach achieves a superior trade-off between text control and personalization fidelity.

2412.15668 2026-05-26 cs.CV 版本更新

Adaptive Hierarchical Graph Cut for Multi-granularity Out-of-distribution Detection

自适应层次图割用于多粒度分布外检测

Xiang Fang, Arvind Easwaran, Blaise Genest, Ponnuthurai Nagaratnam Suganthan

发表机构 * Interdisciplinary Graduate Programme, Nanyang Technological University(南洋理工大学跨学科研究生项目) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) KINDI Computing Research Center, College of Engineering, Qatar University(卡塔尔大学工程学院KINDI计算研究中心)

AI总结 提出自适应层次图割网络(AHGC),通过构建层次KNN图并基于图连接和密度信息进行子图划分,以处理不同标签粒度下的分布外检测问题,在CIFAR-10和CIFAR-100上FPR95指标分别降低40.47%和81.24%。

Comments Published in IEEE Transactions on Artificial Intelligence

详情
AI中文摘要

本文聚焦于一项重要且具有挑战性的任务:分布外检测(OOD检测),旨在区分并拒绝具有语义偏移的测试样本,以防止在分布内(ID)数据上训练的模型产生不可靠的预测。尽管先前的工作已取得一定成功,但它们对于现实世界中具有挑战性的应用效果不佳,因为这些方法简单地将所有未标记数据视为OOD数据,忽略了不同数据集具有不同标签粒度的情况。例如,CIFAR-10中的“猫”和Tiny-ImageNet中的“虎斑猫”具有相同语义,但由于标签粒度不同而具有不同标签。为此,本文提出了一种新颖的自适应层次图割网络(AHGC),以深入探索不同图像之间的语义关系。具体地,我们构建一个层次KNN图,基于余弦相似度评估不同图像之间的相似性。基于图的连接和密度信息,我们将图切割成多个子图以整合这些语义相似的样本。如果子图中标记样本的百分比大于阈值,我们将百分比最高的标签分配给未标记图像。为进一步提高模型泛化能力,我们将每张图像增强为两个增强版本,并最大化这两个版本之间的相似性。最后,我们利用相似度分数进行OOD检测。在两个具有挑战性的基准(CIFAR-10和CIFAR-100)上进行的大量实验表明,在典型情况下,AHGC在“FPR95”指标上分别比最先进的OOD检测方法在CIFAR-100上降低81.24%,在CIFAR-10上降低40.47%,这显示了我们的AHGC的有效性。

英文摘要

This paper focuses on a significant yet challenging task: out-of-distribution detection (OOD detection), which aims to distinguish and reject test samples with semantic shifts, so as to prevent models trained on in-distribution (ID) data from producing unreliable predictions. Although previous works have achieved decent success, they are ineffective for challenging real-world applications since these methods simply regard all unlabeled data as OOD data and ignore the case that different datasets have different label granularity. For example, "cat" on CIFAR-10 and "tabby cat" on Tiny-ImageNet share the same semantics but have different labels due to differing label granularity. To this end, we propose a novel Adaptive Hierarchical Graph Cut network (AHGC) to deeply explore the semantic relationship between different images. Specifically, we construct a hierarchical KNN graph to evaluate the similarities between different images based on the cosine similarity. Based on the linkage and density information of the graph, we cut the graph into multiple subgraphs to integrate these semantics-similar samples. If the labeled percentage in a subgraph is larger than a threshold, we assign the label with the highest percentage to unlabeled images. To further improve the model generalization, we augment each image into two augmented versions, and maximize the similarity between the two versions. Finally, we leverage the similarity score for OOD detection. Extensive experiments on two challenging benchmarks (CIFAR-10 and CIFAR-100) illustrate that in representative cases, AHGC outperforms state-of-the-art OOD detection methods by 81.24% on CIFAR-100 and by 40.47% on CIFAR-10 in terms of "FPR95", which shows the effectiveness of our AHGC.
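
The subgraph labeling rule in the abstract (assign the most frequent label when the labeled fraction exceeds a threshold) can be sketched as follows. This is a minimal illustration, assuming a subgraph is given as a list of labels with `None` marking unlabeled images; the function name `propagate_labels` and the default `threshold` are hypothetical, and graph construction and cutting are not shown.

```python
from collections import Counter

def propagate_labels(subgraph_labels, threshold=0.5):
    """Fill in unlabeled nodes of one subgraph with its majority label.

    subgraph_labels: list of labels, with None for unlabeled images.
    threshold: minimum labeled fraction required before propagating.
    Returns a new list; the input is left unchanged.
    """
    labeled = [label for label in subgraph_labels if label is not None]
    if not subgraph_labels or len(labeled) / len(subgraph_labels) <= threshold:
        return list(subgraph_labels)  # too few labels: leave nodes unlabeled
    majority = Counter(labeled).most_common(1)[0][0]
    return [majority if label is None else label for label in subgraph_labels]
```

For example, a subgraph with labels `["cat", "cat", None, "dog"]` is 75% labeled, so the unlabeled image would receive "cat" under the default threshold.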

2409.17608 2026-05-26 cs.CV 版本更新

Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection

外观模糊驱动的自编码器和运动引导的记忆模块用于视频异常检测

Jiahao Lyu, Minghua Zhao, Jing Hu, Xuewen Huang, Shuangli Du, Cheng Shi, Zhiyong Lv

发表机构 * School of Computer Science and Engineering, Xi’an University of Technology(西安理工大学计算机科学与工程学院)

AI总结 提出一种基于外观模糊和运动引导记忆模块的零样本跨数据集视频异常检测方法,通过构建全局伪异常并利用运动记忆项扩大正常与异常运动差异。

Comments 13 pages, 11 figures

详情
Journal ref
Knowledge-Based Systems 2026
AI中文摘要

视频异常检测(VAD)通常学习正常样本的分布并通过测量显著偏差来检测异常,但不期望的泛化可能会重构一些异常从而抑制偏差。同时,大多数VAD无法应对新目标域的跨数据集验证,而少样本方法必须费力地依赖目标域的模型调优来完成域适应。为解决这些问题,我们提出一种新颖的VAD方法,带有运动引导记忆模块,实现零样本跨数据集验证。首先,我们对原始外观图像添加高斯模糊,从而构建全局伪异常,作为网络输入。然后,我们提出多尺度残差通道注意力来去模糊正常样本中的伪异常。接下来,通过记录训练阶段的运动特征获得记忆项,用于在测试阶段从原始信息中检索运动特征。最后,我们的方法可以通过注意力忽略模糊的真实异常,并依赖运动记忆项来增加正常与异常运动之间的正常性差距。在三个基准数据集上的大量实验证明了所提方法的有效性。与跨域方法相比,我们的方法在测试时无需适应即可实现有竞争力的性能。

英文摘要

Video anomaly detection (VAD) often learns the distribution of normal samples and detects anomalies by measuring significant deviations, but undesired generalization may reconstruct some anomalies and thus suppress the deviations. Meanwhile, most VAD methods cannot cope with cross-dataset validation for new target domains, and few-shot methods must laboriously rely on model tuning on the target domain to complete domain adaptation. To address these problems, we propose a novel VAD method with a motion-guided memory module to achieve zero-shot cross-dataset validation. First, we add Gaussian blur to the raw appearance images, thereby constructing the global pseudo-anomaly, which serves as the input to the network. Then, we propose multi-scale residual channel attention to deblur the pseudo-anomaly in normal samples. Next, memory items are obtained by recording the motion features in the training phase, which are used to retrieve the motion features from the raw information in the testing phase. Lastly, our method can ignore the blurred real anomaly through attention and rely on motion memory items to increase the normality gap between normal and abnormal motion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. Compared with cross-domain methods, our method achieves competitive performance without adaptation during testing.

2409.09953 2026-05-26 cs.CV 版本更新

Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection

不确定性引导的外观-运动关联网络用于分布外动作检测

Xiang Fang, Arvind Easwaran, Blaise Genest

发表机构 * College of Computing and Data Science(计算与数据科学学院) Nanyang Technological University(南洋理工大学)

AI总结 针对分布外动作检测任务,提出不确定性引导的外观-运动关联网络(UAAN),通过融合外观与运动特征并推理时空物体交互,显著优于现有方法。

Comments Accepted by MIPR 2024

详情
AI中文摘要

分布外(OOD)检测旨在检测并拒绝具有语义偏移的测试样本,以防止在分布内(ID)数据集上训练的模型产生不可靠的预测。现有工作仅在图像数据集上提取外观特征,无法处理包含大量运动信息的动态多媒体场景。因此,我们针对一个更现实且更具挑战性的OOD检测任务:OOD动作检测(ODAD)。给定一个未裁剪的视频,ODAD首先对ID动作进行分类并识别OOD动作,然后定位ID和OOD动作。为此,本文提出了一种新颖的不确定性引导的外观-运动关联网络(UAAN),该网络同时探索外观特征和运动上下文,以推理用于ODAD的时空物体间交互。首先,我们设计独立的外观和运动分支,以提取相应的面向外观和面向运动的物体表示。在每个分支中,我们构建一个时空图来推理外观引导和运动驱动的物体间交互。然后,我们设计一个外观-运动注意力模块,融合外观和运动特征以进行最终的动作检测。在两个具有挑战性的数据集上的实验结果表明,UAAN显著优于最先进的方法,证明了其有效性。

英文摘要

Out-of-distribution (OOD) detection aims to detect and reject test samples with semantic shifts, to prevent models trained on in-distribution (ID) datasets from producing unreliable predictions. Existing works only extract appearance features on image datasets, and cannot handle dynamic multimedia scenarios with much motion information. Therefore, we target a more realistic and challenging OOD detection task: OOD action detection (ODAD). Given an untrimmed video, ODAD first classifies the ID actions and recognizes the OOD actions, and then localizes ID and OOD actions. To this end, we propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN), which explores both appearance features and motion contexts to reason about spatial-temporal inter-object interaction for ODAD. First, we design separate appearance and motion branches to extract corresponding appearance-oriented and motion-oriented object representations. In each branch, we construct a spatial-temporal graph to reason about appearance-guided and motion-driven inter-object interaction. Then, we design an appearance-motion attention module to fuse the appearance and motion features for final action detection. Experimental results on two challenging datasets show that UAAN beats state-of-the-art methods by a significant margin, illustrating its effectiveness.

2404.10947 2026-05-26 cs.CV 版本更新

Residual Connections Harm Generative Representation Learning

残差连接损害生成式表示学习

Xiao Zhang, Ruoxi Jiang, William Gao, Rebecca Willett, Michael Maire

发表机构 * University of Chicago(芝加哥大学) Fudan University(复旦大学) Tencent(腾讯) Shanghai Academy of AI for Science(上海人工智能科学研究院)

AI总结 通过减少残差网络中恒等捷径的权重,显著提升掩码自编码器和扩散模型等生成式表示学习框架中的语义特征学习质量。

Comments accepted to CVPR 2026

详情
AI中文摘要

我们表明,在残差网络中引入一个加权因子以减少恒等捷径的影响,可以显著增强生成式表示学习框架(如掩码自编码器(MAE)和扩散模型)中的语义特征学习。我们的修改显著提高了特征质量,对于使用ViT-B/16骨干网络的MAE,将ImageNet-1K K近邻准确率从27.4%提升至63.9%,线性探测准确率从67.8%提升至72.7%,同时增强了扩散模型的生成质量。这一显著差距表明,虽然残差连接结构在促进梯度传播方面起着重要作用,但它可能通过将浅层表示的“回声”注入深层,从而降低抽象学习能力,产生有害副作用。我们通过一个固定公式来改善这一缺点,该公式随着层深度增加而单调减少恒等连接的贡献。我们的设计促进了特征抽象的逐步发展,且不影响网络的可训练性。分析我们修改后的残差网络学到的表示,我们发现低有效特征秩与下游任务性能之间存在相关性。

英文摘要

We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning in generative representation learning frameworks, such as masked autoencoders (MAEs) and diffusion models. Our modification notably improves feature quality, raising ImageNet-1K K-Nearest Neighbor accuracy from 27.4% to 63.9% and linear probing accuracy from 67.8% to 72.7% for MAEs with a ViT-B/16 backbone, while also enhancing generation quality in diffusion models. This significant gap suggests that, while residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing capacity for abstract learning by virtue of injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula for monotonically decreasing the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions, without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find correlation between low effective feature rank and downstream task performance.
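
The abstract says the identity shortcut is scaled by a factor that decreases monotonically with depth, but it does not state the formula. The sketch below is one way such a schedule could look, assuming a linear decay from 1.0 to a floor `alpha_min`; the names `identity_weights` and `forward`, the linear form, and the value 0.6 are all hypothetical choices for illustration, not the authors' formula.

```python
def identity_weights(num_layers, alpha_min=0.6):
    """Linearly decay the shortcut weight from 1.0 (first layer) to alpha_min (last)."""
    if num_layers == 1:
        return [1.0]
    step = (1.0 - alpha_min) / (num_layers - 1)
    return [1.0 - step * layer for layer in range(num_layers)]

def forward(x, blocks, alphas):
    """Run x through residual blocks with a down-weighted identity path.

    Each layer computes alpha * x + block(x) instead of x + block(x).
    x is a list of floats; each block maps a list to a list of equal length.
    """
    for block, alpha in zip(blocks, alphas):
        fx = block(x)
        x = [alpha * xi + fi for xi, fi in zip(x, fx)]
    return x
```

Setting every weight to 1.0 recovers the standard residual update, so the schedule can be compared directly against an unmodified network.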

2309.07778 2026-05-26 eess.IV cs.CV cs.LG q-bio.TO 版本更新

Virchow: A Million-Slide Digital Pathology Foundation Model

Virchow:百万级数字病理学基础模型

Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Siqi Liu, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D. Kunz, Matthew C. H. Lee, Jan Bernhard, Ran A. Godrich, Gerard Oakley, Ewan Millar, Matthew Hanna, Juan Retamero, William A. Moye, Razik Yousfi, Christopher Kanan, David Klimstra, Brandon Rothrock, Thomas J. Fuchs

发表机构 * Paige Microsoft Research(微软研究院) NSW Health Pathology(新南威尔士州卫生病理学) St George Hospital(圣乔治医院) Memorial Sloan Kettering Cancer Center(纪念斯隆凯特琳癌症中心) University of Rochester(罗切斯特大学)

AI总结 提出Virchow,一个基于DINOv2自监督学习、在150万张H&E染色全切片图像上训练的6.32亿参数视觉Transformer模型,用于计算病理学,在泛癌检测和生物标志物预测任务上达到最先进性能。

详情
AI中文摘要

通过分析病理图像实现精准医疗和决策支持系统的人工智能应用,有潜力彻底改变癌症的诊断和治疗。这类应用将依赖于模型捕捉病理图像中观察到的多样化模式的能力。为应对这一挑战,我们提出了Virchow,一个用于计算病理学的基础模型。利用DINOv2算法支持的自监督学习,Virchow是一个拥有6.32亿参数的视觉Transformer模型,在来自不同组织和标本类型的150万张苏木精-伊红染色全切片图像上训练,数据量比以往工作高出数个数量级。Virchow模型使得开发一个泛癌检测系统成为可能,该系统在17种不同癌症类型上的整体标本级AUC达到0.949,同时在7种罕见癌症类型上达到0.937的AUC。Virchow模型在内部和外部图像块级基准测试以及切片级生物标志物预测任务上均达到了最先进水平。性能的提升凸显了在大型病理图像数据集上训练的重要性,表明扩展数据和网络架构可以提高许多高影响计算病理学应用的准确性,尤其是在训练数据有限的情况下。

英文摘要

The use of artificial intelligence to enable precision medicine and decision support systems through the analysis of pathology images has the potential to revolutionize the diagnosis and treatment of cancer. Such applications will depend on models' abilities to capture the diverse patterns observed in pathology images. To address this challenge, we present Virchow, a foundation model for computational pathology. Using self-supervised learning empowered by the DINOv2 algorithm, Virchow is a vision transformer model with 632 million parameters trained on 1.5 million hematoxylin and eosin stained whole slide images from diverse tissue and specimen types, which is orders of magnitude more data than previous works. The Virchow model enables the development of a pan-cancer detection system with 0.949 overall specimen-level AUC across 17 different cancer types, while also achieving 0.937 AUC on 7 rare cancer types. The Virchow model sets the state-of-the-art on the internal and external image tile level benchmarks and slide level biomarker prediction tasks. The gains in performance highlight the importance of training on massive pathology image datasets, suggesting scaling up the data and network architecture can improve the accuracy for many high-impact computational pathology applications where limited amounts of training data are available.

2303.07863 2026-05-26 cs.CV cs.AI cs.MM 版本更新

You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

你可以比看见更早定位:一种用于压缩视频中时序句子定位的高效流程

Xiang Fang, Daizong Liu, Pan Zhou, Guoshun Nan

发表机构 * The Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology(大数据安全湖北工程研究中心,网络安全科学与工程学院,华中科技大学) Peking University(北京大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出一种三分支压缩域时空融合框架(TCSF),直接从压缩视频中提取I帧、运动向量和残差特征,实现高效准确的时序句子定位。

Comments Accepted by CVPR 2023

详情
AI中文摘要

给定一个未剪辑视频,时序句子定位(TSG)旨在根据句子查询语义上定位目标时刻。尽管先前的工作取得了不错的成功,但它们仅关注从连续解码帧中提取的高级视觉特征,未能处理压缩视频的查询建模,导致训练和测试期间表示能力不足且计算复杂度高。本文提出了一种新的设置——压缩域TSG,直接利用压缩视频而非完全解压的帧作为视觉输入。为了处理原始视频比特流输入,我们提出了一种新颖的三分支压缩域时空融合(TCSF)框架,该框架提取并聚合三种低级视觉特征(I帧、运动向量和残差特征)以实现高效准确的定位。特别地,不像先前工作那样编码整个解码帧,我们仅通过学习I帧特征来捕获外观表示,以减少延迟。此外,我们不仅通过学习运动向量特征来探索运动信息,还通过残差特征探索相邻帧的关系。通过这种方式,进一步设计了一个带有自适应运动-外观融合模块的三分支时空注意力层,以提取和聚合外观和运动信息用于最终定位。在三个具有挑战性的数据集上的实验表明,我们的TCSF以更低的复杂度实现了比现有最先进方法更好的性能。

英文摘要

Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Although previous works have achieved decent success, they only focus on high-level visual features extracted from consecutive decoded frames and fail to handle compressed videos for query modelling, suffering from insufficient representation capability and significant computational complexity during training and testing. In this paper, we pose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input. To handle the raw video bit-stream input, we propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework, which extracts and aggregates three kinds of low-level visual features (I-frame, motion vector and residual features) for effective and efficient grounding. Particularly, instead of encoding the whole decoded frames like previous works, we capture the appearance representation by only learning the I-frame feature to reduce latency. Besides, we explore the motion information not only by learning the motion vector feature, but also by exploring the relations of neighboring frames via the residual feature. In this way, a three-branch spatial-temporal attention layer with an adaptive motion-appearance fusion module is further designed to extract and aggregate both appearance and motion information for the final grounding. Experiments on three challenging datasets show that our TCSF achieves better performance than other state-of-the-art methods with lower complexity.

2209.11572 2026-05-26 cs.CV cs.AI cs.IR cs.MM 版本更新

Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

多模态跨域对齐网络用于视频时刻检索

Xiang Fang, Daizong Liu, Pan Zhou, Yuchong Hu

发表机构 * Hubei Key Laboratory of Distributed System Security(湖北分布式系统安全重点实验室) Hubei Engineering Research Center on Big Data Security(湖北大数据安全工程研究中心) School of Cyber Science and Engineering(网络安全学院) Huazhong University of Science and Technology(华中科技大学) Wangxuan Institute of Computer Technology(王选计算机研究所) Peking University(北京大学) School of Computer Science and Technology(计算机科学与技术学院) Key Laboratory of Information Storage System Ministry of Education of China(信息存储系统教育部重点实验室)

AI总结 提出多模态跨域对齐网络,通过域对齐、跨模态对齐和特定对齐三个模块,解决跨域视频时刻检索中域差异和语义鸿沟问题。

Comments Accepted by IEEE Transactions on Multimedia

详情
AI中文摘要

作为多媒体信息检索中日益流行的任务,视频时刻检索(VMR)旨在根据给定的语言查询从未修剪的视频中定位目标时刻。大多数先前的方法严重依赖于大量手动标注(即时刻边界),这在实践中获取成本极高。此外,由于不同数据集之间的域差异,直接将预训练模型应用于未见过的域会导致性能显著下降。本文聚焦于一项新任务:跨域VMR,其中在一个域(“源域”)中有完全标注的数据集,但目标域(“目标域”)仅包含未标注的数据集。据我们所知,我们提出了关于跨域VMR的首项研究。为了解决这一新任务,我们提出了一种新颖的多模态跨域对齐(MMCDA)网络,将标注知识从源域迁移到目标域。然而,由于源域和目标域之间的域差异以及视频和查询之间的语义鸿沟,直接将训练好的模型应用于目标域通常会导致性能下降。为解决此问题,我们开发了三个新颖的模块:(i)域对齐模块,用于对齐每个模态在不同域之间的特征分布;(ii)跨模态对齐模块,旨在将视频和查询特征映射到联合嵌入空间,并对齐目标域中不同模态之间的特征分布;(iii)特定对齐模块,试图获取特定帧与给定查询之间的细粒度相似性以实现最优定位。通过联合训练这三个模块,我们的MMCDA能够学习域不变且语义对齐的跨模态表示。

英文摘要

As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, due to the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully-annotated datasets are available in one domain ("source domain"), but the domain of interest ("target domain") only contains unannotated datasets. As far as we know, we present the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation knowledge from the source domain to the target domain. However, due to the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module is designed to align the feature distributions between different domains of each modality; (ii) a cross-modal alignment module aims to map both video and query features into a joint embedding space and to align the feature distributions between different modalities in the target domain; (iii) a specific alignment module tries to obtain the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, our MMCDA can learn domain-invariant and semantic-aligned cross-modal representations.

2208.14882 2026-05-26 cs.MM cs.CL cs.CV cs.IR 版本更新

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

层次化局部-全局Transformer用于时间语句定位

Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu, Ruixuan Li

发表机构 * Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology(大数据安全湖北工程研究中心,华中科技大学网络安全科学与工程学院) Wangxuan Institute of Computer Technology, Peking University(王选计算机研究所,北京大学) School of Software, Dalian University of Technology(软件学院,大连理工大学) School of Computer Science and Technology, Huazhong University of Science and Technology(计算机科学与技术学院,华中科技大学)

AI总结 提出层次化局部-全局Transformer(HLGT),通过建模视频和查询的不同粒度层次及跨模态交互,实现更细粒度的多模态表示,并在三个数据集上取得最先进性能。

Comments Published in IEEE Transactions on Multimedia

详情
AI中文摘要

本文研究多媒体问题中的时间语句定位(TSG),旨在根据给定的句子查询准确确定未修剪视频中的特定视频片段。传统的TSG方法主要遵循自上而下或自下而上的框架,且不是端到端的,严重依赖耗时的后处理来优化定位结果。最近,一些基于Transformer的方法被提出,以高效有效地建模视频和查询之间的细粒度语义对齐。尽管这些方法在一定程度上取得了显著性能,但它们将视频帧和查询词等同视为Transformer输入进行关联,未能捕捉它们不同粒度的不同语义。为解决这一问题,本文提出了一种新颖的层次化局部-全局Transformer(HLGT),利用这种层次信息并建模不同粒度层次和不同模态之间的交互,以学习更细粒度的多模态表示。具体来说,我们首先将视频和查询分割成单独的片段和短语,通过时间Transformer学习它们的局部上下文(相邻依赖)和全局相关性(长距离依赖)。然后,引入全局-局部Transformer来学习局部级和全局级语义之间的交互,以实现更好的多模态推理。此外,我们开发了一种新的跨模态循环一致性损失,以强制两个模态之间的交互并鼓励它们之间的语义对齐。最后,我们设计了一种全新的跨模态并行Transformer解码器,用于整合编码的视觉和文本特征以进行最终定位。在三个具有挑战性的数据集上进行的大量实验表明,我们提出的HLGT实现了新的最先进性能。

英文摘要

This paper studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query. Traditional TSG methods mainly follow the top-down or bottom-up framework and are not end-to-end. They rely heavily on time-consuming post-processing to refine the grounding results. Recently, some transformer-based approaches have been proposed to efficiently and effectively model the fine-grained semantic alignment between video and query. Although these methods achieve significant performance to some extent, they treat video frames and query words equally as transformer input for correlation, failing to capture their different levels of granularity with distinct semantics. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities for learning more fine-grained multi-modal representations. Specifically, we first split the video and query into individual clips and phrases to learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between the local-level and global-level semantics for better multi-modal reasoning. Besides, we develop a new cross-modal cycle-consistency loss to enforce interaction between two modalities and encourage the semantic alignment between them. Finally, we design a brand-new cross-modal parallel transformer decoder to integrate the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets show that our proposed HLGT achieves a new state-of-the-art performance.

2011.11194 2026-05-26 cs.LG cs.CV cs.NE 版本更新

V3H: View Variation and View Heredity for Incomplete Multi-view Clustering

V3H: 面向不完整多视图聚类的视图变异与视图遗传

Xiang Fang, Yuchong Hu, Pan Zhou, Dapeng Oliver Wu

发表机构 * School of Computer Science and Technology, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology(华中科技大学网络空间安全学院大数据安全湖北工程研究中心) Department of Electrical and Computer Engineering, University of Florida(佛罗里达大学电气与计算机工程系)

AI总结 提出一种受遗传学启发的视图变异与视图遗传方法(V3H),通过分解子空间为变异矩阵和遗传矩阵分别学习各视图的独特信息和所有视图的一致信息,并利用可调低秩表示恢复底层数据结构,在不完整多视图聚类中同时捕获一致与独特信息,在15个基准数据集上超越现有方法。

Comments Published in IEEE Transactions on Artificial Intelligence

详情
Journal ref
IEEE Transactions on Artificial Intelligence 2020
AI中文摘要

真实数据常以多个不完整视图的形式出现。不完整多视图聚类是集成这些不完整视图的有效方法。以往的方法仅学习不同视图之间的一致信息,而忽略了每个视图的独特信息,这限制了它们的聚类性能和泛化能力。为克服这一局限,我们提出了一种新颖的视图变异与视图遗传方法(V3H)。受遗传学中变异与遗传的启发,V3H首先将每个子空间分解为对应视图的变异矩阵和所有视图的遗传矩阵,分别表示独特信息和一致信息。然后,通过基于聚类指示矩阵对齐不同视图,V3H集成来自不同视图的独特信息以提高聚类性能。最后,借助基于遗传矩阵的可调低秩表示,V3H恢复潜在的真正数据结构以减少大不完整性的影响。更重要的是,V3H可能是首个将遗传学引入聚类算法以从不完整多视图数据中同时学习一致信息和独特信息的工作。在15个基准数据集上的大量实验结果验证了其相对于其他最先进方法的优越性。

英文摘要

Real data often appear in the form of multiple incomplete views. Incomplete multi-view clustering is an effective method to integrate these incomplete views. Previous methods only learn the consistent information between different views and ignore the unique information of each view, which limits their clustering performance and generalization. To overcome this limitation, we propose a novel View Variation and View Heredity approach (V3H). Inspired by variation and heredity in genetics, V3H first decomposes each subspace into a variation matrix for the corresponding view and a heredity matrix for all the views to represent the unique information and the consistent information respectively. Then, by aligning different views based on their cluster indicator matrices, V3H integrates the unique information from different views to improve the clustering performance. Finally, with the help of the adjustable low-rank representation based on the heredity matrix, V3H recovers the underlying true data structure to reduce the influence of large incompleteness. More importantly, V3H is possibly the first work to introduce genetics to clustering algorithms for simultaneously learning the consistent information and the unique information from incomplete multi-view data. Extensive experimental results on fifteen benchmark datasets validate its superiority over other state-of-the-art methods.

2011.10331 2026-05-26 cs.CV cs.LG 版本更新

ANIMC: A Soft Framework for Auto-weighted Noisy and Incomplete Multi-view Clustering

ANIMC: 一种自动加权噪声与不完整多视图聚类的软框架

Xiang Fang, Yuchong Hu, Pan Zhou, Dapeng Oliver Wu

发表机构 * Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology(大数据安全湖北工程研究中心,网络空间安全学院,华中科技大学) School of Computer Science and Technology, Huazhong University of Science and Technology(计算机科学与技术学院,华中科技大学) Key Laboratory of Information Storage System Ministry of Education of China, Huazhong University of Science and Technology(信息存储系统教育部重点实验室,华中科技大学) Department of Electrical and Computer Engineering, University of Florida(电气与计算机工程系,佛罗里达大学)

AI总结 提出ANIMC框架,通过软自动加权策略和双软正则回归模型,处理多视图聚类中的缺失实例和噪声问题。

Comments Published in IEEE Transactions on Artificial Intelligence

详情
Journal ref
IEEE Transactions on Artificial Intelligence 2021
AI中文摘要

多视图聚类在许多图像处理场景中有广泛应用。在这些场景中,原始图像数据通常包含缺失实例和噪声,而大多数多视图聚类方法忽略了这一点。然而,缺失实例可能使这些方法难以直接使用,噪声则会导致不可靠的聚类结果。本文通过软自动加权策略和双软正则回归模型,提出了一种新颖的自动加权噪声与不完整多视图聚类框架(ANIMC)。首先,通过设计自适应半正则化非负矩阵分解(adaptive semi-RNMF),软自动加权策略为每个视图分配适当的权重,并添加软边界以平衡噪声和不完整性的影响。其次,通过提出θ-范数,双软正则回归模型通过选择不同的θ来调整模型的稀疏性。与现有方法相比,ANIMC具有三个独特优势:1)它是一种软算法,可以在不同场景下调整我们的框架,从而提高其泛化能力;2)它自动学习每个视图的适当权重,从而减少噪声的影响;3)它执行双软正则回归,对齐不同视图中的相同实例,从而减少缺失实例的影响。大量实验结果表明,它优于其他最先进的方法。

英文摘要

Multi-view clustering has wide applications in many image processing scenarios. In these scenarios, original image data often contain missing instances and noise, which most multi-view clustering methods ignore. However, missing instances may make these methods difficult to use directly, and noise will lead to unreliable clustering results. In this paper, we propose a novel Auto-weighted Noisy and Incomplete Multi-view Clustering framework (ANIMC) via a soft auto-weighted strategy and a doubly soft regularized regression model. Firstly, by designing adaptive semi-regularized nonnegative matrix factorization (adaptive semi-RNMF), the soft auto-weighted strategy assigns a proper weight to each view and adds a soft boundary to balance the influence of noise and incompleteness. Secondly, by proposing a θ-norm, the doubly soft regularized regression model adjusts the sparsity of our model by choosing different values of θ. Compared with existing methods, ANIMC has three unique advantages: 1) it is a soft algorithm that can adjust our framework in different scenarios, thereby improving its generalization ability; 2) it automatically learns a proper weight for each view, thereby reducing the influence of noise; 3) it performs doubly soft regularized regression that aligns the same instances in different views, thereby decreasing the impact of missing instances. Extensive experimental results demonstrate its advantages over other state-of-the-art methods.

2605.25220 2026-05-26 cs.CV cs.GR cs.RO 版本更新

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

无需多视图生成的多视图一致3D高斯头部头像

Aviral Chharia, Fernando De la Torre

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出MVCHead,一种直接从随机采样的2D图像学习3D高斯头部模型的方法,通过层次状态空间块和SE(3)多视图评判器实现多视图一致性,无需多视图数据或3D监督。

Comments CVPR 2026; Project Website: https://humansensinglab.github.io/MVCHead/

详情
Journal ref
CVPR, Denver, CO, USA, 2026, pp. 40163-40174
AI中文摘要

高保真3D高斯头部头像生成对于AR/VR、远程呈现和数字人类等应用至关重要。现有方法依赖于多视图数据集、3D捕获或中间2D视图合成。相比之下,我们仅从随机采样的2D图像中学习条件和非条件3D头部模型,而不使用多视图数据、3D监督或中间视图生成。我们引入MVCHead,一种单次状态空间模型,直接在3D表示中强制执行多视图一致性(MVC),同时在这些约束下回归3D高斯。其核心是,我们提出层次状态空间(HiSS)块,从粗到细逐步细化高斯,同时捕获长距离依赖。在每个HiSS块中,我们修改Mamba的标准单向扫描,提出层次双向状态扫描(HiBiSS),将递归与多视图不一致性最强的轴对齐。最后,我们设计了一个SE(3)多视图评判器,判断一组自渲染是否来自单个底层3D配置,奖励跨视图像素对齐而不观察真实的多视图对。MVCHead实现了最先进的感知质量,在纹理和几何一致性上超越了先前方法,并保持了可比的形状一致性。为了展示可扩展性,我们发布了FaceGS-10K,这是第一个用于训练和评估3D头部模型的大规模即用型3D高斯头部资产数据集。项目页面和代码:https://humansensinglab.github.io/MVCHead/

英文摘要

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. Project Page and code: https://humansensinglab.github.io/MVCHead/

2605.25191 2026-05-26 cs.CV 版本更新

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

在推理时将图像引导注入文本条件扩散模型

Agata Żywot, Iason Skylitsis, Thijmen Nijdam, Zoe Tzifa-Kratira, Derck Prinzhorn, Konrad Szewczyk, Aritra Bhowmik

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 提出视觉概念融合(VCF),一种无需重新训练即可在推理时同时以图像和文本为条件进行双重引导的方法,通过对齐CLIP图像特征与文本嵌入空间实现视觉概念注入。

详情
AI中文摘要

像Stable Diffusion这样的文本到图像扩散模型可以从文本生成高质量图像,但缺乏在推理时无需重新训练即可注入视觉引导(例如草图、风格)的方法。现有方法要么需要计算昂贵的微调,要么依赖于可能造成与文本提示语义不对齐的风格迁移技术。我们引入了视觉概念融合(VCF),这是第一种在推理时无需任何概念特定训练即可同时对图像和文本提示进行双重条件化的方法。VCF通过将CLIP图像特征与文本嵌入空间对齐,实现了将视觉概念注入Stable Diffusion。VCF由三个组件组成:(1)一个轻量级对齐器,使用InfoNCE和交叉注意力重建损失将图像标记映射到文本嵌入流形;(2)一种保留文本和视觉语义的融合策略;(3)一个可选的提示-噪声优化(PNO)模块,用于测试时细化。我们的实验表明,VCF成功地从参考图像中转移了包括风格、构图和调色板在内的视觉属性,同时保持了对提示的遵循。定量结果显示文本对齐(CLIP分数)和视觉对应(LPIPS)之间存在权衡,VCF在参考保真度方面优于基线。

英文摘要

Text-to-image diffusion models like Stable Diffusion generate high-quality images from text, but lack a way to inject visual guidance (e.g. sketches, styles) at inference without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with textual prompts. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes including style, composition, and color palette from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
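The aligner described above is trained in part with an InfoNCE loss. As a rough illustration only, the sketch below computes a one-directional InfoNCE loss over paired image and text vectors in plain Python. The function name `info_nce`, the temperature value, and the toy vectors are assumptions made for this example; the abstract does not specify the authors' exact formulation.

```python
import math

def info_nce(image_vecs, text_vecs, temperature=0.07):
    """Mean InfoNCE loss where image_vecs[i] and text_vecs[i] form the positive pair."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    total = 0.0
    for i, img in enumerate(images):
        # Cosine similarity of image i against every text, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching text as the target.
        total += log_norm - logits[i]
    return total / len(images)

# Toy check: correctly paired vectors should score a lower loss than swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image vector is closest to its own text vector and grows when the pairing is wrong, which is the property an image-to-text aligner would exploit.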

2605.25175 2026-05-26 cs.CV 版本更新

Discrepancy Minimization Improves Cross-Hospital Robustness in Digital Pathology

差异最小化提升数字病理学中的跨医院鲁棒性

Ben Vardi, Dana Schonberger, Yuval Friedmann, Zohar Yakhini, Iris Barshack, Alexander Loebel, Ariel Shamir

发表机构 * Reichman University, Herzliya, Israel(以色列赖希曼大学,荷兹利亚) Institute of Pathology, Sheba Medical Center, Ramat-Gan, Israel(以色列沙巴医疗中心病理研究所) Technion - Israel Institute of Technology, Haifa, Israel(以色列理工学院,海法)

AI总结 通过局部最大均值差异(LMMD)微调病理基础模型,在域适应和域泛化设置下提升跨医院鲁棒性。

详情
AI中文摘要

病理基础模型(PFMs)近年来快速发展,支持为多种组织病理学任务训练分类器。然而,它们在医院间的鲁棒性仍然有限:当在一个医院的数据上训练分类器并在另一个目标医院评估时,性能通常会下降。我们通过使用局部最大均值差异(LMMD)目标微调PFMs来解决这一挑战,该目标适用于两种设置:域适应(有未标记的目标医院数据可用)和域泛化(目标医院数据完全不可用)。在补丁和切片级别的实验表明,在多个PFMs和任务上均有一致的改进。

英文摘要

Pathology foundation models (PFMs) have advanced rapidly in recent years and support training classifiers for a range of histopathology tasks. However, their robustness across hospitals remains limited: performance often degrades when a classifier is trained on data from one hospital and evaluated on another target hospital. We address this challenge by fine-tuning PFMs with a local maximum mean discrepancy (LMMD) objective that applies to two settings: domain adaptation, where unlabeled target-hospital data is available, and domain generalization, where target-hospital data is entirely unavailable. Experiments at both the patch and slide levels show consistent improvements across multiple PFMs and tasks.
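For readers unfamiliar with the discrepancy being minimized, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) between two small feature sets with a Gaussian kernel. It is plain MMD, not the class-weighted local variant (LMMD) the abstract refers to; the function names, the bandwidth, and the toy "hospital" features are all assumptions for illustration.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate: mean k(s, s') + mean k(t, t') - 2 * mean k(s, t)."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Hypothetical 2-D features from two hospitals; the second set is shifted.
hospital_a = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
hospital_b_shifted = [[2.0, 2.1], [2.2, 2.0], [2.1, 2.2]]

same = mmd_squared(hospital_a, hospital_a)
shifted = mmd_squared(hospital_a, hospital_b_shifted)
```

A value near zero indicates that the two feature distributions look alike under the kernel, while a distribution shift pushes the value up. Fine-tuning with such a term as a penalty encourages features that are harder to tell apart across sites.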

2605.25163 2026-05-26 cs.CV cs.AI 版本更新

K-U-KAN: Koopman-Enhanced U-KAN for 3D Dental Reconstruction from a Single Panoramic X-ray Radiograph

K-U-KAN: 基于Koopman增强的U-KAN用于单张全景X射线片的三维牙齿重建

Bikram Keshari Parida, Abhijit Sen, Wonsang You

发表机构 * Artificial Intelligence & Image Processing Lab., Department of Information & Communication Engineering, Sun Moon University, Asan-Si, South Korea Department of Physics and Engineering Physics, Tulane University, New Orleans, LA, USA

AI总结 提出K-U-KAN三阶段流水线,结合Kolmogorov-Arnold网络、Koopman算子与U-KAN,从单张全景X射线高效重建三维牙齿结构,提升感知质量并缩短训练时间。

Comments 24 pages, 9 figures

详情
AI中文摘要

全景X射线将三维颌骨压缩为二维条带;我们的目标是干净且快速地恢复缺失的深度。现有的隐式神经表示能渲染逼真的体积,但训练缓慢,对采样和位置编码敏感,且实际成本高。纯CNN基线效率高,但难以处理牙弓的长程几何,模糊了精细的釉质-牙本质边界,且可解释性差。我们提出K-U-KAN,一个三阶段流水线:(i) 使用Kolmogorov-Arnold网络将二维特征提升为深度感知的可观测变量,(ii) 通过Koopman令牌块以稳定的、相位感知的线性演化推进这些可观测变量,(iii) 将预测的深度区间放置在焦槽射线上,然后由轻量级3D注意力U-KAN细化体积。这种物理(Beer-Lambert图像形成)、几何(马蹄形焦槽)和学习线性动力学的结合,在批量大小为1的原生射线强度上产生了清晰的解剖结构、更少的伪影和鲁棒的行为。在保留数据上,K-U-KAN在信号和结构指标上与Transformer/隐式基线相当,显著提高了感知质量,并且训练时间大约减半——使单视图全景X射线到锥形束CT重建在临床流程中更加实用。

英文摘要

A panoramic X-ray compresses a 3D jaw into a 2D strip; we aim to recover the missing depth cleanly and fast. Existing implicit neural representations render realistic volumes but are slow to train, sensitive to sampling and positional encodings, and costly in practice. Pure CNN baselines are efficient yet struggle with the dental arch's long-range geometry, blur fine enamel-dentin boundaries, and offer little interpretability. We present K-U-KAN, a three-stage pipeline that (i) lifts 2D features into depth-aware observables with Kolmogorov-Arnold Networks, (ii) advances these observables by a stable, phase-aware linear evolution via a Koopman token block, and (iii) places the predicted depth bins onto focal-trough rays before a lightweight 3D attention U-KAN refines the volume. This marriage of physics (Beer-Lambert image formation), geometry (horseshoe focal trough), and learned linear dynamics yields sharp anatomy, fewer artifacts, and robust behavior on native radiographic intensities with batch size one. On held-out data, K-U-KAN matches transformer/implicit baselines on signal and structure metrics, clearly improves perceptual quality, and trains in roughly half the time, making single-view PX → CBCT reconstruction more practical for clinical pipelines.

2605.25127 2026-05-26 cs.CV cs.LG 版本更新

PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration

PQDT: 伪查询双Transformer用于鲁棒点云修复

Haoqing Wu, Alexa Nawotki, Jochen Garcke

发表机构 * Mercedes-Benz AG(梅赛德斯-奔驰集团) University of Bonn(波恩大学) Fraunhofer SCAI(弗劳恩霍夫SCAI研究所)

AI总结 提出一种基于伪查询模块和Transformer主干网络的统一3D修复网络,通过两阶段几何变换增强结构清晰度和局部细节,在多种退化场景下超越现有方法。

Comments To be published in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

详情
AI中文摘要

点云是计算机视觉中一种基本的3D表示,支持广泛的感知任务。然而,由于传感器限制或遮挡,真实世界的点云常常遭受不完整、噪声、离群点和密度不规则等退化。从这种退化数据中恢复干净且详细的形状对于下游应用至关重要。尽管现有的基于学习方法在完成或去噪等单个任务上取得了进展,但它们通常依赖于全局瓶颈特征,这会丢失细粒度几何信息,并且对变化的输入质量敏感。我们提出一个统一的3D修复网络,直接以点云作为输入,并在多种退化场景下自适应地重建高质量几何。我们方法的核心是一个伪查询模块,在Transformer主干网络中实现,它将几何变换重新表述为两个协作阶段,以增强结构清晰度、鲁棒性和局部细节保留。在精心设计的基准测试上的大量实验表明,我们的方法在通用3D修复中超越了最先进的性能。它有效处理了完成、变形和去噪退化的复杂组合。通过这项工作,我们提供了一个新颖的、统一的、仅基于点的主干网络,用于鲁棒的3D修复,从而实现更通用的3D感知。

英文摘要

Point clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which reformulates geometric translation into two cooperative stages to enhance structural clarity, robustness, and local detail preservation. Extensive experiments on curated benchmarks demonstrate that our approach surpasses state-of-the-art performance in general 3D restoration. It effectively handles complex combinations of completion, deformation, and denoising degradations. With this work, we provide a novel unified, point-only backbone for robust 3D restoration, enabling more versatile 3D perception.

2605.25123 2026-05-26 cs.LG cs.AI cs.CL cs.CV stat.ML 版本更新

Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo

扩散模型的推理时对齐:基于信任区域迭代扭曲序贯蒙特卡洛方法

Weixin Wang, Yu Yang, Wei Deng, Pan Xu

发表机构 * Duke University(杜克大学) Morgan Stanley(摩根士丹利)

AI总结 提出信任区域迭代扭曲序贯蒙特卡洛(TRI-TSMC)框架,通过迭代学习扭曲函数来改进扩散模型推理时的对齐,在文本生成和文本到图像生成任务上优于现有方法。

Comments 34 pages, 6 figures, and 7 tables

详情
AI中文摘要

我们研究基于扩散的生成模型的推理时对齐,旨在引导基础模型产生高奖励输出而不更新其权重。最近的基于序贯蒙特卡洛(SMC)的引导方法以原则性的方式近似奖励倾斜的目标分布,但其提议仍主要依赖于基础采样器。由于奖励信息主要通过粒子重加权和重采样在传播后使用,这些方法可能需要大量粒子预算,并遭受权重退化和高方差估计的问题。降低方差和提高粒子效率的一种方法是迭代学习提供前瞻指导的扭曲函数,如扭曲SMC。然而,现有的可学习扭曲方法主要针对经典序贯推理开发,当应用于具有高维状态空间和终端、噪声或黑盒奖励的扩散对齐时可能不稳定。我们提出信任区域迭代扭曲序贯蒙特卡洛(TRI-TSMC),一种用于在基于SMC的推理时对齐中学习扭曲函数的信任区域框架。每次迭代在路径空间中计算精确的KL约束更新,通过温度重要性重加权得到闭式解,并通过加权最大似然将该目标投影回参数化扭曲族。理论上,我们形式化了最优扭曲函数的值函数解释,并表明它产生零方差采样器。我们证明信任区域更新沿着护航路径朝向目标分布,加权最大似然更新是前向KL投影,并且该路径降低了残差重要性权重方差。实验上,在匹配的推理时预算下,TRI-TSMC在离散扩散文本生成和文本到图像生成上改进了主要对齐目标。

英文摘要

We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without updating its weights. Recent Sequential Monte Carlo (SMC)-based steering methods approximate reward-tilted target distributions in a principled way, but their proposals remain largely tied to the base sampler. Since reward information is mainly used after propagation through particle reweighting and resampling, these methods can require large particle budgets and suffer from weight degeneracy and high-variance estimates. One way to reduce variance and improve particle efficiency is to iteratively learn twisting functions that provide look-ahead guidance, as in twisted SMC. However, existing learnable twisting methods are developed mainly for classical sequential inference and can be unstable when applied to diffusion-based alignment with high-dimensional state spaces and terminal, noisy, or black-box rewards. We propose Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC), a trust-region framework for learning twisting functions in SMC-based inference-time alignment. Each iteration computes an exact KL-constrained update in path space, which admits a closed-form solution by tempered importance reweighting, and projects this target back to the parameterized twisted family by weighted maximum likelihood. Theoretically, we formalize the value-function interpretation of the optimal twisting function and show that it yields a zero-variance sampler. We prove that the trust-region update follows an escort path toward the target distribution, that the weighted maximum-likelihood update is a forward-KL projection, and that the path reduces residual importance-weight variance. Empirically, TRI-TSMC improves primary alignment objectives on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.
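The closed-form step mentioned above, a KL-constrained update solved by tempered importance reweighting, can be illustrated on a single set of particle weights. The sketch below is an assumption-laden toy, not the authors' algorithm: it tempers log-weights by an exponent `beta`, picks the largest `beta` in [0, 1] whose reweighted distribution stays within a KL budget `epsilon` of the uniform particle distribution (found by bisection), and reports the effective sample size. All function names and the example log-weights are hypothetical.

```python
import math

def tempered_weights(log_weights, beta):
    """Normalized weights proportional to exp(beta * log_weight)."""
    scaled = [beta * lw for lw in log_weights]
    m = max(scaled)
    unnorm = [math.exp(s - m) for s in scaled]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def kl_from_uniform(weights):
    """KL divergence from the uniform distribution over particles to `weights`."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0.0)

def trust_region_beta(log_weights, epsilon, iters=50):
    """Largest beta in [0, 1] with KL(tempered || uniform) <= epsilon, by bisection."""
    if kl_from_uniform(tempered_weights(log_weights, 1.0)) <= epsilon:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_from_uniform(tempered_weights(log_weights, mid)) <= epsilon:
            lo = mid  # still inside the trust region, try a larger step
        else:
            hi = mid  # too far from the current distribution, shrink
    return lo

def effective_sample_size(weights):
    return 1.0 / sum(w * w for w in weights)

# One particle has a much higher reward-tilted log-weight than the others.
log_w = [0.0, 1.0, 2.0, 8.0]
beta = trust_region_beta(log_w, epsilon=0.1)
ess_full = effective_sample_size(tempered_weights(log_w, 1.0))
ess_tempered = effective_sample_size(tempered_weights(log_w, beta))
```

Bisection is valid here because the KL divergence grows monotonically with `beta`. Taking only the tempered step keeps more particles alive than applying the full weights at once, which is the weight-degeneracy problem the abstract describes.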

2605.25119 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Trust-Aware Joint Feature-Prediction Discrepancy for Robust Domain Adaptation

信任感知的联合特征-预测差异用于鲁棒域适应

Xi Ding, Lei Wang, Syuan-Hao Li, Yongsheng Gao

发表机构 * School of Engineering and Built Environment, Griffith University, Australia(工程与建筑环境学院,格里菲斯大学,澳大利亚)

AI总结 提出信任感知域适应框架,通过联合特征-预测差异(JFPD)结合不确定性信任和语义对齐信任,实现可靠性感知的域差异估计,提升域适应性能。

Comments Research report

详情
AI中文摘要

域适应旨在减轻标记源域与未标记或稀疏标记目标域之间分布偏移导致的性能下降。大多数现有方法在特征空间或预测空间中估计域差异。然而,这些单一视角策略忽略了域偏移下的一个关键问题:用于对齐的信号可靠性。实际上,学习到的表示和语义预测都可能变得不可靠,平等对待所有目标样本可能导致误导性对齐和次优迁移。我们引入了信任感知域适应,这是一个原则性框架,通过特征和预测信号的可靠性来建模域差异。我们方法的核心是联合特征-预测差异(JFPD),这是一个统一公式,联合捕捉表示散度和预测散度,并通过样本特定信任加权它们的贡献。信任通过两种互补机制量化:不确定性信任,从预测熵导出以抑制不可靠预测;语义对齐信任,从特征空间中的原型相似性计算以强调良好对齐的表示。通过优先考虑自信且语义一致的样本,同时降低噪声或模糊样本的权重,JFPD提供了域差异的可靠性感知估计。我们进一步将JFPD集成到训练目标中,引导适应朝向目标域的可靠区域。在标准基准上的实验表明,所提出的框架始终实现优越的适应性能,并产生与目标域误差相关的差异估计。这项工作首次解决了在域适应中建模特征与预测之间交互信任的重要性。

英文摘要

Domain adaptation aims to mitigate performance degradation caused by distribution shifts between a labeled source domain and an unlabeled or sparsely labeled target domain. Most existing approaches estimate domain discrepancy either in feature space or in prediction space. However, these single-perspective strategies overlook a critical problem under domain shift: the reliability of the signals used for alignment. In practice, both learned representations and semantic predictions may become unreliable, and treating all target samples equally can lead to misleading alignment and suboptimal transfer. We introduce trust-aware domain adaptation, a principled framework that models domain discrepancy through the reliability of feature and prediction signals. Central to our approach is the Joint Feature-Prediction Discrepancy (JFPD), a unified formulation that jointly captures representation divergence and prediction divergence while weighting their contributions by sample-specific trust. Trust is quantified via two complementary mechanisms: uncertainty-aware trust, derived from prediction entropy to suppress unreliable predictions, and semantic-alignment trust, computed from prototype similarity in feature space to emphasize well-aligned representations. By prioritizing confident and semantically consistent samples while down-weighting noisy or ambiguous ones, JFPD provides a reliability-aware estimate of domain discrepancy. We further integrate JFPD into a training objective that guides adaptation toward trustworthy regions of the target domain. Experiments on standard benchmarks demonstrate that the proposed framework consistently achieves superior adaptation performance and yields discrepancy estimates that correlate with target-domain error. This work addresses, for the first time, the importance of modeling trust in the interaction between features and predictions for domain adaptation.

2605.25110 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Uncertainty-DTW for Sequences and Visual Tokens

Uncertainty-DTW 用于序列和视觉标记

Lei Wang, Syuan-Hao Li, Yongsheng Gao, Piotr Koniusz

发表机构 * School of Engineering and Built Environment, Electrical and Electronic Engineering, Griffith University(工程与建筑环境学院,电气与电子工程学院,格里菲斯大学) School of Computer Science and Engineering, University of New South Wales(计算机科学与工程学院,新南威尔士大学)

AI总结 提出不确定性感知的动态时间规整(uDTW)框架,通过异方差不确定性建模和最大似然估计实现鲁棒对齐,并推广到视觉标记集,在多个领域取得优于现有方法的结果。

Comments Research report

详情
AI中文摘要

对齐结构化数据是计算机视觉和机器学习中的一个基本问题,支撑着时间序列分析、人类动作识别和视觉表示学习等任务。现有的对齐方法,包括动态时间规整(DTW)及其可微变体,依赖于确定性相似度度量,因此对异质和噪声特征敏感。在这项工作中,我们引入了不确定性感知对齐,这是一个概率框架,用异方差不确定性建模成对对应关系,并沿对齐路径执行结构化匹配。我们的公式,不确定性-DTW(uDTW),为每个对应分配一个正态分布,并通过最大似然估计目标参数化每条对齐路径,该目标包括(i)一个精度加权匹配项,抑制不可靠特征,以及(ii)一个对数方差正则化,防止退化解。这产生了一个概率对齐机制,对噪声具有鲁棒性且可解释,因为不确定性直接反映了匹配的可靠性。我们进一步将该框架从时间序列推广到标记化的视觉表示,从而能够对视觉标记集进行结构化匹配。学习到的不确定性可以解释为反向注意力:语义相关区域表现出低不确定性并主导对齐,而模糊/噪声区域具有高不确定性。这提供了对齐、注意力和不确定性建模之间的联系。我们在不同领域评估了所提出的框架。结果表明,与最先进的方法相比,该方法持续改进,并且学习到的不确定性与语义重要性相关。这些发现将不确定性感知对齐确立为一个通用、鲁棒且可解释的框架,用于从结构化数据中学习。

英文摘要

Aligning structured data is a fundamental problem in computer vision and machine learning, underlying tasks such as time series analysis, human action recognition, and visual representation learning. Existing alignment methods, including Dynamic Time Warping (DTW) and its differentiable variants, rely on deterministic similarity measures and are therefore sensitive to heterogeneous and noisy features. In this work, we introduce uncertainty-aware alignment, a probabilistic framework that models pairwise correspondences with heteroscedastic uncertainty and performs structured matching along alignment paths. Our formulation, uncertainty-DTW (uDTW), assigns each correspondence a Normal distribution and parametrizes each alignment path by a Maximum Likelihood Estimate objective consisting of (i) a precision-weighted matching term that suppresses unreliable features, and (ii) a log-variance regularization that prevents degenerate solutions. This yields a probabilistic alignment mechanism that is robust to noise and interpretable, as uncertainty directly reflects the reliability of matches. We further generalize this framework from temporal sequences to tokenized visual representations, enabling structured matching over sets of visual tokens. The learned uncertainty can be interpreted as a reverse-attention: semantically relevant regions exhibit low uncertainty and dominate the alignment, while ambiguous/noisy regions have high uncertainty. This provides a connection between alignment, attention, and uncertainty modeling. We evaluate the proposed framework across diverse domains. The results demonstrate consistent improvements over state-of-the-art methods and show that learned uncertainty correlates with semantic importance. These findings establish uncertainty-aware alignment as a general, robust, and interpretable framework for learning from structured data.
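The two-term objective described above can be sketched as a dynamic program. The toy below uses a hard minimum over alignment paths (the abstract builds on differentiable DTW variants, so this is a simplification) and takes the per-pair variances `sigma2` as given inputs rather than learned quantities. The function name `udtw` and the example sequences are assumptions for illustration.

```python
import math

def udtw(x, y, sigma2):
    """Hard-min DTW over costs (x_i - y_j)^2 / sigma2[i][j] + log(sigma2[i][j])."""
    n, m = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            var = sigma2[i - 1][j - 1]
            # Precision-weighted matching term plus log-variance regularizer.
            cost = (x[i - 1] - y[j - 1]) ** 2 / var + math.log(var)
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

x = [0.0, 1.0, 5.0, 1.0]  # the third sample is an outlier
y = [0.0, 1.0, 1.0]

# Unit variance everywhere reduces the cost to ordinary squared-error DTW.
unit = [[1.0] * len(y) for _ in x]
plain = udtw(x, y, unit)

# Marking the outlier row as highly uncertain shrinks its influence.
noisy = [row[:] for row in unit]
noisy[2] = [16.0] * len(y)
robust = udtw(x, y, noisy)
```

The log-variance term is what stops the variance from growing without bound: raising `sigma2` discounts a mismatch but pays a penalty, so high uncertainty is only worthwhile where the match is genuinely poor.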

2605.25077 2026-05-26 cs.CV 版本更新

WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

WorldCraft: 从相机导航到交互式视频世界模型中的物体操控

Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) AI Technology Center, Tencent Video, Tencent(腾讯视频AI技术中心,腾讯) Wuhan University(武汉大学) Peking University(北京大学)

AI总结 提出WorldCraft框架,通过轨迹控制管道(NWT、SP-LoRA、TASP)将交互式视频世界模型从相机导航扩展到物体级轨迹操控,实现用户指定路径下的物体运动与相机导航共存。

Comments Project page: https://nevsdev.github.io/WorldCraft/

详情
AI中文摘要

最近的基于视频的世界模型使像素空间环境在相机层面具有交互性:用户可以导航视角,同时模型生成连贯的视觉延续。然而,它们的动作空间仍然不完整:用户可以移动相机,但不能对单个物体进行操作。由于现实世界的交互本质上是物体中心的,这样的模型更接近被动的场景观察者,而非真正可操控的环境。我们提出WorldCraft,一个将交互式视频世界模型从相机导航扩展到物体级轨迹动作的框架。给定用户点击和手绘路径,WorldCraft生成未来帧,其中所选物体遵循指定轨迹运动,同时相机继续导航场景。WorldCraft通过一个轨迹中心控制管道实现这一点:首先,归一化世界轨迹(NWT)在相机不变的世界坐标系中表示用户绘制的运动,并在当前相机姿态下动态重投影,将物体运动与相机引起的屏幕空间位移分离;然后,空间路径LoRA(SP-LoRA)通过模型的空间控制路径注入这个世界空间信号,在保留预训练相机控制器的同时增加物体操控能力;最后,轨迹锚定状态持久化(TASP)将世界轨迹视为持久空间状态,并在轨迹条件生成后刷新自回归记忆,使移动物体在离开相机视野后能够在其更新位置重新出现。实验表明,WorldCraft实现了精确的物体控制,在仅相机评估下保持了基于视频的世界模型的相机保真度,并在包含离屏移动的长自回归展开中维持了物体状态。

英文摘要

Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real-world interaction is inherently object-centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: First, Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system and dynamically re-projects it under the current camera pose, separating object motion from camera-induced screen-space displacement; Spatial-Pathway LoRA (SP-LoRA) then injects this world-space signal through the model's spatial-control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video-based world model's camera fidelity under camera-only evaluation, and maintains object state across long autoregressive rollouts with off-camera excursions.

2605.25042 2026-05-26 cs.CV 版本更新

Unbiased Diffusion Variational Inversion via Principled Posterior Matching

无偏扩散变分反演:基于原则性后验匹配

Weimin Bai, Yuxuan Gu, Yifei Wang, Weijian Luo, He Sun

发表机构 * Peking University(北京大学)

AI总结 提出原则性后验匹配(PPM)框架,通过精确优化KL散度(利用Fisher散度积分)解决逆问题中模式坍塌和不确定性量化不可靠的问题,统一变分推理和摊销推理,在图像修复、超分辨荧光显微和射电干涉成像中实现高保真重建和校准的不确定性估计。

详情
AI中文摘要

现有的基于分数的逆问题方法通常采用KL散度在反演分布与贝叶斯后验之间的近似最小化。这种近似导致严重的模式坍塌和不可靠的不确定性量化。在本文中,我们提出原则性后验匹配(PPM),一个回归变分推理基础而非使用技巧性近似的框架。我们不依赖启发式近似,而是通过整合Fisher散度严格公式化KL散度的精确优化。我们推导出该积分的可处理等价梯度形式,使得无需先前近似引入的偏差即可进行精确优化。我们的分析清楚地揭示了先前方法中的模式坍塌直接源于这种近似差距。在我们的理论解决方案支持下,PPM统一了两个互补范式:(1)在变分推理中,PPM采用覆盖质量的散度,显著提高了反演多样性和不确定性量化;(2)在摊销推理中,它使得能够训练高效的重建网络以进行快速的单步重建。此外,我们的公式通过推广Fisher散度的积分,自然地扩展到更广泛的散度度量族。我们在具有挑战性的计算成像任务中验证了PPM,包括图像修复、超分辨荧光显微镜和射电干涉黑洞成像。在所有实验中,PPM实现了卓越的重建保真度、忠实的多模态后验恢复以及良好校准的不确定性估计,为科学成像建立了一个稳健的框架。

英文摘要

Existing score-based methods for inverse problems often resort to approximate minimization of the KL divergence between the inversion distribution and the Bayesian posterior. Such an approximation leads to severe mode collapse and unreliable uncertainty quantification. In this paper, we propose Principled Posterior Matching (PPM), a framework that returns to the fundamentals of variational inference, rather than using tricky approximations. Instead of relying on heuristic approximations, we rigorously formulate the exact optimization of the KL divergence via the integration of Fisher divergence. We derive a tractable, equivalent gradient form of this integral, enabling precise optimization without the biases introduced by prior approximations. Our analysis clearly reveals that the mode collapse in previous methods stems directly from this approximation gap. Supported by our theoretical solution, PPM unifies two complementary paradigms: (1) In variational inference, PPM adopts mass-covering divergences that significantly improve the inversion diversity and uncertainty quantification; (2) In amortized inference, it enables the training of an efficient reconstruction network for rapid, single-step reconstruction. Furthermore, our formulation naturally extends to a broader family of divergence measures by generalizing the integral of the Fisher divergence. We validate PPM across challenging computational imaging tasks, including inpainting, super-resolution fluorescent microscopy, and radio interferometric black-hole imaging. In all experiments, PPM achieves superior reconstruction fidelity, faithful multimodal posterior recovery, and well-calibrated uncertainty estimates, establishing a robust framework for scientific imaging.

2605.25039 2026-05-26 cs.CV 版本更新

AstroRAG -- A Pagerank-Based Retrieval-Augmented Generation Pipeline for Question Answering in Astronomy

AstroRAG -- 一种基于PageRank的检索增强生成管道用于天文学问答

Zhifeng Wang, Jason Jingshi Li, Kaihao Zhang, Ramesh Sankaranarayana

发表机构 * Australian National University(澳大利亚国立大学) Learning Machines Pty Ltd

AI总结 提出AstroRAG,一种基于PageRank的检索增强生成管道,通过两阶段检索(MMR和PR重排序)在严格token预算下选择紧凑互支持的上下文,无需训练且保护隐私,在天文学QA基准上使Mistral-7B准确率和F1分数达到79.49%,性能近乎翻倍。

Comments Accepted to IEEE CAI 2026

详情
AI中文摘要

大型语言模型(LLMs)在自然语言处理中表现出强大的性能,但仅依赖参数化知识时常常产生事实性错误。检索增强生成(RAG)通过将响应基于外部证据来减轻这些错误,然而传统的检索-转储方法经常引入无关上下文,从而降低答案质量。在这项工作中,我们提出了AstroRAG——一种基于PageRank的检索增强生成(RAG)管道,适用于天文学中的问答。该系统在Elasticsearch中执行token感知的分块和每个实例的临时索引,然后执行两阶段检索:(i)最大边际相关性(MMR)以获得一个小的、多样化的候选集,以及(ii)在相似性图上进行读者驱动的PageRank(PR)重排序,以在严格的token预算下识别紧凑、互支持的上下文。我们的设计无需训练、保护隐私且可重复,因为每个实例通过临时索引处理以防止跨任务泄漏。我们在用于天文学QA的AstroQA基准上评估了该管道,并在所有难度级别上展示了有竞争力的性能。特别是,RAG增强的Mistral-7B实现了79.49%的准确率和79.49%的F1分数,几乎是非RAG对应版本性能的两倍。这些结果突显了严格检索和精炼在提升领域特定推理方面的有效性,为将RAG扩展到其他科学领域奠定了坚实基础。

英文摘要

Large language models (LLMs) demonstrate strong performance in natural language processing but often generate factual errors when relying solely on parametric knowledge. Retrieval-Augmented Generation (RAG) mitigates these errors by grounding responses in external evidence, yet conventional retrieve-and-dump approaches frequently introduce irrelevant context that degrades answer quality. In this work, we present AstroRAG, a PageRank-based RAG pipeline adapted for question answering in astronomy. The system performs token-aware chunking and per-instance, ephemeral indexing in Elasticsearch, then executes a two-stage retrieval: (i) Maximal Marginal Relevance (MMR) to obtain a small, diverse candidate set and (ii) a reader-driven PageRank (PR) re-ranking on a similarity graph to identify a compact, mutually supportive context under a strict token budget. Our design is training-free, privacy-preserving, and reproducible, as each instance is processed through transient indexing to prevent cross-task leakage. We evaluate the pipeline on the AstroQA benchmark for astronomy QA, and demonstrate competitive performance across all difficulty levels. In particular, the RAG-enhanced Mistral-7B achieves 79.49% accuracy and 79.49% F1-score, nearly doubling the performance of its non-RAG counterpart. These results highlight the effectiveness of disciplined retrieval and refinement in boosting domain-specific reasoning, establishing a robust foundation for extending RAG to other scientific fields.
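The two-stage retrieval described above can be sketched on toy embedding vectors. The code below is a generic illustration under stated assumptions: it uses standard greedy MMR, a plain power-iteration PageRank on a cosine-similarity graph (not the "reader-driven" variant, whose details the abstract does not give), and a simple greedy fill of the token budget. All function names, parameters, and data are hypothetical, and no Elasticsearch calls are involved.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr_select(query, chunks, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: trade query relevance against redundancy."""
    selected = []
    remaining = list(range(len(chunks)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((cosine(chunks[i], chunks[j]) for j in selected), default=0.0)
            return lam * cosine(query, chunks[i]) - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def pagerank(sim, damping=0.85, iters=100):
    """Power iteration on a row-normalized similarity graph without self-loops."""
    n = len(sim)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            out = sum(sim[i][j] for j in range(n) if j != i)
            for j in range(n):
                if j == i:
                    continue
                # Nodes with no outgoing weight spread their rank uniformly.
                share = sim[i][j] / out if out > 0 else 1.0 / (n - 1)
                new[j] += damping * rank[i] * share
        rank = new
    return rank

def retrieve(query, chunks, token_counts, k=4, budget=300):
    """Stage 1: MMR candidates. Stage 2: PageRank re-ranking, then fill the token budget."""
    candidates = mmr_select(query, chunks, k)
    sim = [[max(cosine(chunks[a], chunks[b]), 0.0) for b in candidates] for a in candidates]
    scores = pagerank(sim)
    ordered = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
    context, used = [], 0
    for c in ordered:
        if used + token_counts[c] <= budget:
            context.append(c)
            used += token_counts[c]
    return context

query = [1.0, 0.0, 0.0]
chunks = [[0.9, 0.1, 0.0], [0.88, 0.12, 0.0], [0.1, 0.9, 0.0], [0.7, 0.0, 0.3], [0.0, 0.0, 1.0]]
token_counts = [120, 120, 120, 120, 120]
context = retrieve(query, chunks, token_counts, k=3, budget=250)
```

PageRank on the candidate graph favors chunks that are similar to many other candidates, which is one way to read "mutually supportive context"; the budget loop then keeps only as many top-ranked chunks as fit.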

2605.25024 2026-05-26 cs.CV 版本更新

DA-UCT: Self-Supervised Domain-Adaptive Ultrasound Computed Tomography for Rapid Musculoskeletal Sound Speed Reconstruction

DA-UCT:用于快速肌肉骨骼声速重建的自监督域自适应超声计算机断层扫描

Tianyu Liu, Heyu Ma, Aiduo Wang, Peiwen Li, Boyi Li, Ying Li, Dan Li, Chengcheng Liu, Dean Ta

发表机构 * College of Biomedical Engineering, Fudan University(复旦大学生物医学工程学院)

AI总结 提出SDA-UCT框架,通过自监督域自适应和注意力增强网络,实现快速高分辨率肌肉骨骼超声计算机断层扫描重建,显著提升速度并保持高质量。

详情
AI中文摘要

通过全波形反演的超声计算机断层扫描(UCT)能够实现高分辨率定量成像,用于组织表征和疾病诊断。然而,由于高度非线性的优化,UCT存在计算负担大和收敛问题严重等缺点。深度学习可以加速UCT重建,但监督训练需要大规模标记数据集,这在体内难以获得。为了解决这些限制,我们提出了SDA-UCT,一个两阶段自监督域自适应框架,用于快速准确的肌肉骨骼组织UCT成像。SDA-UCT采用在模拟数据集上预训练的注意力增强网络(AttUCT),并通过物理信息自监督学习迁移到体内数据,有效弥合了模拟到真实的域差距。集成了低秩自适应(LoRA)机制,以实现跨不同临床场景的高效自适应。结果表明,AttUCT在模拟人前臂上实现了高质量声速重建,PSNR为29.23 dB,SSIM为0.928,优于传统FWI和现有深度学习方法。在体内数据上验证,SDA-UCT成功重建了揭示人前臂复杂解剖结构(皮肤、脂肪、肌肉、肌腱、骨骼和骨髓)的声速图像,与MRI参考高度一致。仅调整3%参数的LoRA机制实现了与全微调相当的性能。快速重建(每帧5毫秒)实现了实时3D可视化,比传统FWI提高了五个数量级。这项工作代表了首个用于快速、高分辨率体内UCT成像的自监督域自适应深度学习,显示了在肌肉骨骼疾病诊断中的潜力。

英文摘要

Ultrasound computed tomography (UCT) via full waveform inversion (FWI) enables high-resolution quantitative imaging for tissue characterization and disease diagnosis. However, UCT suffers from a large computational burden and severe convergence issues due to highly nonlinear optimization. Deep learning can accelerate UCT reconstruction, but supervised training requires large-scale labeled datasets that are difficult to obtain in vivo. To address these limitations, we propose SDA-UCT, a two-stage self-supervised domain-adaptive framework for rapid and accurate UCT imaging of musculoskeletal tissues. SDA-UCT employs an attention-enhanced network (AttUCT) pre-trained on simulation datasets and transfers to in-vivo data via physics-informed self-supervised learning, effectively bridging the simulation-to-real domain gap. A Low-Rank Adaptation (LoRA) mechanism is integrated to enable efficient adaptation across diverse clinical scenarios. Results showed that AttUCT achieved high-quality speed-of-sound (SOS) reconstruction for a simulated human forearm with a PSNR of 29.23 dB and SSIM of 0.928, outperforming conventional FWI and existing deep learning methods. Validated on in-vivo data, SDA-UCT successfully reconstructed SOS images revealing complex anatomical structures (skin, fat, muscle, tendon, bone and bone marrow) of the human forearm, in high concordance with MRI references. The LoRA mechanism, adjusting only 3% of parameters, achieved performance comparable to full fine-tuning. The rapid reconstruction (5 ms per frame) enables real-time 3D visualization, a five-orders-of-magnitude improvement over traditional FWI. This work represents the first self-supervised domain-adaptive deep learning framework for rapid, high-resolution in-vivo UCT imaging, showing potential for musculoskeletal disease diagnosis.

2605.25022 2026-05-26 cs.CV cs.AI 版本更新

D3S2: Diffusion-Guided Dataset Distillation for Semantic Segmentation

D3S2: 扩散引导的语义分割数据集蒸馏

Wenjie Zheng, Haoji Hu, Jiali Lu, Xingze Zou, Jing Wang

发表机构 * Zhejiang University(浙江大学)

AI总结 针对语义分割数据集蒸馏中的长尾类别不平衡、像素级对齐和高计算成本问题,提出两阶段框架D3S2,通过类别平衡掩码选择和扩散引导图像合成生成紧凑训练集,在极低压缩率下显著提升分割性能。

详情
AI中文摘要

数据集蒸馏旨在将大规模数据集压缩为紧凑的合成集,同时保持训练效果。然而,现有研究主要关注图像分类,而语义分割等密集预测任务尚未充分探索。本文识别了分割数据集蒸馏的三个关键挑战:(i) 长尾类别不平衡,(ii) 图像与密集标签之间严格的像素级对齐需求,以及(iii) 使用复杂模型优化高分辨率数据的高计算成本。为应对这些挑战,我们提出D3S2,一种扩散引导的语义分割数据集蒸馏框架。我们的方法采用两阶段设计。在类别平衡掩码选择中,我们通过优先考虑低表示类别的贪婪策略构建代表性掩码集。在扩散引导图像合成中,我们使用预训练的布局到图像扩散模型生成以所选掩码为条件的图像,自然确保空间对齐。为进一步增强合成数据的训练效用,我们引入具有两个互补目标的引导扩散采样:用于像素级对齐的分割一致性损失,以及用于对齐跨层每类特征统计的类级特征匹配损失。大量实验证明了D3S2的优越性。值得注意的是,在1%的极低压缩率下,我们的方法在ADE20K和COCO-Stuff上使用Mask2Former (Swin-S)分别达到24.99%和35.49%的mIoU,比随机选择分别高出9.34%和5.70%。

英文摘要

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic sets while preserving training efficacy. However, existing studies mainly focus on image classification, leaving dense prediction tasks such as semantic segmentation largely underexplored. In this work, we identify three key challenges for segmentation DD: (i) long-tailed class imbalance, (ii) the need for strict pixel-wise alignment between images and dense labels, and (iii) the high computational cost of optimizing high-resolution data with complex models. To address these challenges, we propose D3S2, a Diffusion-guided Dataset Distillation framework for Semantic Segmentation. Our method adopts a two-stage design. In Class-Balanced Mask Selection, we construct a representative mask set via a greedy strategy that prioritizes underrepresented classes. In Diffusion-Guided Image Synthesis, we employ a pretrained layout-to-image diffusion model to generate images conditioned on the selected masks, naturally ensuring spatial alignment. To further enhance the training utility of synthesized data, we introduce guided diffusion sampling with two complementary objectives: a segmentation-consistency loss for pixel-level alignment, and a class-wise feature matching loss for aligning per-class feature statistics across layers. Extensive experiments demonstrate the superiority of D3S2. Notably, at an extreme compression rate of 1%, our method achieves 24.99% and 35.49% mIoU on ADE20K and COCO-Stuff with Mask2Former (Swin-S), outperforming random selection by 9.34% and 5.70%, respectively.
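The first stage, a greedy selection that prioritizes underrepresented classes, could look roughly like the sketch below. The gain function (each class contributes 1 / (1 + times already covered)) is an assumption chosen for illustration; the abstract does not state the authors' actual criterion. Masks are reduced to the set of class names they contain, and the function name `select_masks` is hypothetical.

```python
def select_masks(mask_classes, budget):
    """Greedily pick `budget` masks, rewarding classes that are still rarely covered."""
    coverage = {}
    chosen = []
    remaining = set(range(len(mask_classes)))
    while remaining and len(chosen) < budget:
        def gain(i):
            # A class seen fewer times so far contributes more to the gain.
            return sum(1.0 / (1 + coverage.get(c, 0)) for c in mask_classes[i])
        # Iterating in sorted order breaks ties toward the lower index.
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        remaining.remove(best)
        for c in mask_classes[best]:
            coverage[c] = coverage.get(c, 0) + 1
    return chosen, coverage

# "wall" and "floor" dominate; "lamp" and "rug" are tail classes.
masks = [
    {"wall", "floor"},
    {"wall", "floor", "ceiling"},
    {"wall", "lamp"},
    {"wall", "floor"},
    {"floor", "rug"},
]
chosen, coverage = select_masks(masks, budget=3)
```

With a budget of three, this toy run picks the masks that bring in the tail classes and skips the two redundant wall-and-floor masks, which is the behavior a long-tailed segmentation dataset calls for.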

2605.25014 2026-05-26 cs.CV Version update

Stop Denoising Your Blurs

停止去噪你的模糊

Sasidhar Parvathireddy, Vamsidhar Saraswathula, Rama Krishna Gorthi

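Affiliations: Indian Institute of Technology Tirupati, India

AI summary: Proposes ConvDiff, a framework that replaces additive noise with convolution to build a blur degradation trajectory, enabling diffusion-based image deblurring and bridging the gap between the mathematical principles of blur and the design of diffusion algorithms.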

Comments Accepted at IEEE International Conference on Image Processing (ICIP) 2026. 7 pages, 3 figures

Details
AI Chinese abstract

近年来,扩散模型在图像恢复任务中取得了显著性能。其核心机制依赖于一个受限的假设,即退化过程是加性噪声操作。然而,模糊模型作为最广泛研究的退化形式之一,违反了这一假设,因为它本质上基于卷积而非加法。在本文中,我们引入了ConvDiff,一种新颖的基于扩散的框架,该框架用卷积替代加法操作,用于图像去模糊任务。在前向过程中,我们利用卷积的频域特性,从清晰图像到其模糊对应物构建有意义的轨迹,而不是用加性噪声逐步破坏图像。虽然当前工作针对高斯模糊实例化了该框架(其中频域分解产生闭式且物理有效的中间状态),但从模糊算子构建退化轨迹的基本原则自然扩展到其他模糊族。该公式弥合了模糊的数学原理与基于扩散的恢复算法的迭代设计之间的差距,从而实现了更具物理依据且更有效的图像恢复模型。

English abstract

In recent times, diffusion models have achieved remarkable performance in image restoration tasks. Their core mechanism relies on the restrictive assumption that the degradation is an additive noise operation. However, the blur model, one of the most widely studied degradation formulations, violates this assumption, as it is inherently based on convolution rather than addition. In this paper, we introduce ConvDiff, a novel diffusion-based framework that substitutes the additive operation with convolution for the task of image deblurring. In the forward process, we construct a meaningful trajectory from the clean image to its blurred counterpart by exploiting the frequency-domain characteristics of convolution, rather than progressively corrupting the image with additive noise. While the current work instantiates this framework for Gaussian blur, where frequency-domain decomposition yields closed-form and physically valid intermediate states, the underlying principle of constructing degradation trajectories from the blur operator extends naturally to other blur families. This formulation bridges the gap between the mathematical principles of blurring and the iterative design of diffusion-based restoration algorithms, enabling more physically grounded and effective image restoration models.
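
The closed-form intermediate states mentioned for Gaussian blur can be illustrated with the frequency response of a Gaussian kernel, exp(-2π²σ²f²). The sketch below assumes a schedule in which the blur variance grows linearly with the step index, so that every step multiplies the spectrum by the same factor. This schedule and the helper names are assumptions for illustration, not the authors' formulation.

```python
import math

def gaussian_transfer(freq, sigma):
    """Frequency response of a Gaussian blur with standard deviation `sigma` (pixels)
    at spatial frequency `freq` (cycles per pixel), continuous approximation."""
    return math.exp(-2.0 * math.pi ** 2 * sigma ** 2 * freq ** 2)

def intermediate_sigma(sigma_final, t, num_steps):
    """Blur level at step t, chosen so that variances grow linearly (an assumed schedule)."""
    return sigma_final * math.sqrt(t / num_steps)

sigma_final, num_steps, freq = 2.0, 4, 0.1

# Attenuation of one frequency component along the trajectory, from clean (t = 0)
# to fully blurred (t = num_steps).
responses = [
    gaussian_transfer(freq, intermediate_sigma(sigma_final, t, num_steps))
    for t in range(num_steps + 1)
]

# Because Gaussian variances add under convolution, each step applies the same factor.
step_factor = gaussian_transfer(freq, sigma_final / math.sqrt(num_steps))
```

Under this assumed schedule, `responses[t]` equals `step_factor ** t`, which is what makes every intermediate state available in closed form.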

2605.25012 2026-05-26 cs.CV Version update

Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation

从语义字典中学习:面向统一视觉表示与生成的判别式码本对比学习

Imanol G. Estepa, Jesús M Rodríguez-de-Vera, Bhalaji Nagarajan, Petia Radeva

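Affiliations: Universitat de Barcelona; Barcelona Supercomputing Center (BSC)

AI summary: Proposes the LEASE framework, which uses a paired generative-discriminative codebook design to jointly optimize a masked reconstruction loss and a codebook contrastive loss in a discrete token space, unifying visual representation and generation and reaching state-of-the-art performance on ImageNet-1K.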

Comments Accepted at CVPR'26

Details
AI Chinese abstract

判别式和生成式视觉模型在各自领域表现出色,但在语义上存在错位,阻碍了统一视觉学习的进展。我们提出LEASE(从语义字典中学习),一种自监督框架,通过配对生成-判别码本设计弥合这一差距。LEASE完全在通过一次性预计算步骤产生的离散标记空间中运行,无需数据增强、教师模型或在线分词器即可高效训练。LEASE整合了两个互补目标:捕获细粒度生成细节的掩码标记重建损失,以及通过自适应质心加权将编码器特征与判别语义对齐的码本对比损失。这种双重监督产生了一个统一潜在空间,同时支持高质量生成和强大的表示学习。在ImageNet-1K上,LEASE实现了最先进的统一性能,在线性探测(相比MAGE和Sorcen提升高达+1.7%)、无条件生成(相比MAGE FID降低1.26,IS提升10.19)、少样本学习(相比Sorcen平均提升+0.56%)、迁移学习(相比MAGE和Sorcen平均提升+0.75%)以及鲁棒性基准(相比MAGE和Sorcen平均提升+5.86%和+4.25%)上均优于先前的VQGAN方法如MAGE和Sorcen。它还能与领域专用的对比和生成模型竞争,同时超越先前的MIM方法。无监督的LEASE模型还可以通过在其学习表示基础上构建扩展到条件生成,与专用基线相比具有竞争力。总体而言,LEASE为联合理解和生成视觉内容的通用视觉模型提供了高效且有效的一步。

English abstract

Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative-discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrast loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning. On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS relative to MAGE), few-shot learning (+0.56% on average against Sorcen), transfer (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines. Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content.

2605.25009 2026-05-26 cs.CV Version update

ClueAegis: Heuristic-to-Reasoning Cognitive-skill Learning for Unified Evidence-based Synthetic Image Detection

ClueAegis:面向统一基于证据的合成图像检测的启发式到推理认知技能学习

Huangsen Cao, Hongkang Chu, Yuxi Li, Ying Zhang, Chen Li, Jing Lyu, Yongwei Wang, Yu Zhao, Fei Wu

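Affiliations: Zhejiang University; WeChat Vision, Tencent Inc; University of the Chinese Academy of Sciences

AI summary: To address the lack of structured forensic reasoning in existing synthetic image detectors, proposes ClueAegis, a heuristic-to-reasoning cognitive-skill learning framework whose two-stage agentic pipeline performs skill selection followed by evidence-guided reasoning, achieving state-of-the-art performance with improved cross-domain generalization and robustness.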

Details
AI Chinese abstract

生成模型的快速发展使合成图像越来越逼真,挑战了可靠的检测。现有方法通常局限于端到端分类或单一推理,因此无法建模结构化的取证推理和异构视觉证据。我们从认知角度重新审视合成图像检测,提出了一种启发式到推理的认知技能学习框架,用于基于证据的取证分析。给定输入图像,我们的框架首先提取启发式感知线索,选择最优取证技能,然后执行技能条件推理以进行证据提取和决策。为支持这一范式,我们引入了ClueAegis-Bench,它将合成图像检测分解为显式标注的取证认知技能,以实现超越二分类的结构化评估。基于该基准,我们提出了ClueAegis(面向统一基于证据的合成图像检测的认知技能学习),一个两阶段智能体框架,执行启发式技能选择,然后通过技能条件工具链进行证据引导推理。该设计将合成图像检测重新表述为一个可配置的多技能推理过程,桥接了感知、技能选择和取证推理。大量实验表明,ClueAegis在提升跨域泛化和鲁棒性的同时实现了最先进的性能。它还提供了透明的推理轨迹和结构化的取证证据,为传统的端到端检测器提供了更可解释的替代方案。

English abstract

The rapid advancement of generative models has made synthetic images increasingly realistic, challenging reliable detection. Existing methods are often limited to end-to-end classification or monolithic reasoning, and thus fail to model structured forensic reasoning and heterogeneous visual evidence. We revisit synthetic image detection from a cognitive perspective and propose a Heuristic-to-Reasoning cognitive skill learning framework for evidence-based forensic analysis. Given an input image, our framework first extracts heuristic perceptual clues, selects the optimal forensic skill, and then performs skill-conditioned reasoning for evidence extraction and decision making. To support this paradigm, we introduce ClueAegis-Bench, which decomposes synthetic image detection into explicitly annotated forensic cognitive skills for structured evaluation beyond binary classification. Based on this benchmark, we propose ClueAegis (Cognitive-skill Learning for Unified Evidence-based Synthetic Image Detection), a two-stage agentic framework that conducts heuristic skill selection followed by evidence-guided reasoning through skill-conditioned toolchains. This design reformulates synthetic image detection as a configurable multi-skill reasoning process that bridges perception, skill selection, and forensic reasoning. Extensive experiments show that ClueAegis achieves state-of-the-art performance while improving cross-domain generalization and robustness. It also provides transparent reasoning trajectories and structured forensic evidence, offering a more explainable alternative to conventional end-to-end detectors.

2605.24993 2026-05-26 cs.AI cs.CV Version update

NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

NeurIPS: 基于球面的脑解码的神经解剖学归纳先验

Sijin Yu, Zijiao Chen, Zhenyu Yang, Zihao Tan, Jiakun Xu, Zhongliang Liu, Shengxian Chen, Wenxuan Wu, Xiangmin Xu, Xin Zhang

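Affiliations: South China University of Technology; Stanford University; King's College London; Foshan University; Pazhou Lab

AI summary: Proposes the NeurIPS framework, which turns anatomical variation into an inductive prior through a Selective ROI Spherical Tokenizer and a Structure-Guided Mixture of Experts, achieving state-of-the-art performance among surface-based decoders on the Natural Scenes Dataset and markedly improving training efficiency.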

Comments International Conference on Machine Learning (ICML) 2026

Details
AI Chinese abstract

当前的fMRI解码器面临性能-保真度权衡,其中高效的1D编码器优于几何保真的表面模型。我们认为这部分是由于低效的表面分词化以及未能将解剖学用作预测信号。我们提出NeurIPS,一个通过将解剖变异从干扰因素重新定义为强大的归纳先验来改进表面解码的框架。NeurIPS结合了两项创新:用于高效几何编码的选择性ROI球形分词器(SRST),以及使用皮层特征显式建模个体解剖的结构引导专家混合模型(SG-MoE)。在自然场景数据集上,NeurIPS为表面解码器建立了新的最先进水平,并实现了与强1D基线相当的性能。这是以空前的效率实现的,因为模型收敛速度显著加快(10个epoch对比600个epoch)。这种效率使得仅使用20%的数据即可快速适应新受试者,并确保随着训练队列扩大而稳健扩展。消融实验提供了因果证据,表明这些收益源于模型使用皮层特征,而非记忆受试者ID。通过利用解剖先验,NeurIPS为稳健、可泛化的脑解码提供了一条有原则且可扩展的路径。

English abstract

Current fMRI decoders face a performance-fidelity trade-off where efficient 1D encoders outperform geometrically faithful surface-based models. We argue this is partly driven by inefficient surface tokenization and the failure to use anatomy as a predictive signal. We present NeurIPS, a framework that improves surface-based decoding by reframing anatomical variation from a nuisance to a powerful inductive prior. NeurIPS unites two innovations: a Selective ROI Spherical Tokenizer (SRST) for efficient geometric encoding, and a Structure-Guided Mixture of Experts (SG-MoE) that explicitly models individual anatomy using cortical features. On the Natural Scenes Dataset, NeurIPS establishes a new state-of-the-art for surface decoders and achieves performance comparable to strong 1D baselines. This is achieved with unprecedented efficiency, as the model converges dramatically faster (10 vs. 600 epochs). This efficiency enables rapid adaptation to new subjects using only 20% of data and ensures robust scalability as the training cohort is expanded. Ablations provide causal evidence that these gains are driven by the model's use of cortical features, not by memorizing subject IDs. By leveraging anatomical priors, NeurIPS provides a principled and scalable path toward robust, generalizable brain decoding.

2605.24977 2026-05-26 cs.CV cs.CL Version update

Universal Boosts, Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models

通用增强,特定抑制:基于稀疏自编码器引导的医学视觉语言模型

Farhad Nooralahzadeh, Benjamin Gundersen, Nicolas Deperrois, Hidetoshi Matsuom, Mizuho Nishio, Thomas Frauenfelder, Ahmed Allam, Christian Blüthgen, Michael Moor, Michael Krauthammer

Affiliations: University of Zurich and University Hospital of Zurich; Kobe University; ETH AI Center; ETH Zurich; Stanford University; Zurich University of Applied Sciences

AI summary: Proposes a decoding-time residual steering method that requires no weight updates: per-token sparse autoencoders (SAEs) are used to intervene on medical vision-language models, suppressing hallucinations and improving report quality, with significant gains on several models.

Details
AI Chinese abstract

医学视觉语言模型(VLM)在生成胸部X光报告时经常出现幻觉:它们编造图像中不存在的发现,遗漏重要发现,或定位错误。我们通过解码时残差引导,基于每token稀疏自编码器(SAE)来缓解这一问题,无需权重更新:在后期层使用Top-$K$ SAE,针对临床错误进行因果引导,然后在推理时结合抑制/增强干预。在MIMIC-CXR测试集上,我们的纯推理方法提高了三个放射学VLM(RadVLM、LLaVA-Rad和CheXOne)生成报告的质量,临床复合指标的相对改进分别为+5.4%、+7.2%和+17.0%,并且所有骨干网络的GREEN得分均具有统计显著性。跨模型特征对齐表明,质量促进(增强)方向在不同架构间高度重叠,而与幻觉相关的(抑制)方向则是模型特定的。因此,可迁移的引导必须针对每个骨干网络进行抑制处理,而不是共享一个通用的抑制列表。相同的配方无需重新训练即可零样本迁移到IU-Xray(GREEN相对提升+7.7%),确认了所识别的特征是模型属性,而非训练语料库的属性。我们发布了因果特征集和一个交互式特征仪表板:https://cxr-sparse-feature-dashboard.netlify.app/。

English abstract

Medical vision-language models (VLMs) often hallucinate findings when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or locate them incorrectly. We mitigate this without weight updates by decoding-time residual steering on a per-token sparse autoencoder (SAE) basis: Top-K SAEs on late layers, causal steering against clinical errors, then combined suppress/boost intervention at inference time. On the MIMIC-CXR test split, our inference-only method improves the quality of generated reports for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric, and statistically significant GREEN gains on all backbones. A cross-model feature alignment shows that the quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific. Therefore, transferable steering must treat suppression per-backbone, rather than sharing a universal suppress list. The same recipe transfers zero-shot to IU-Xray (GREEN +7.7% relative) without retraining, confirming that the identified features are properties of the model, not of the training corpus. We release causal feature sets and an interactive feature dashboard: https://cxr-sparse-feature-dashboard.netlify.app/.
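
A minimal sketch of what a combined suppress/boost intervention on Top-K SAE features could look like is given below. It assumes a linear encoder and decoder, a fixed boost scale, and that only the change in the SAE reconstruction is added back to the residual stream; all names, the toy dictionary, and these design choices are hypothetical and not taken from the paper.

```python
def topk_encode(x, enc, k):
    """Top-K SAE encoder: keep the k largest pre-activations, zero the rest."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in enc]
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    return [max(pre[i], 0.0) if i in keep else 0.0 for i in range(len(pre))]

def decode(z, dec):
    """Linear decoder: a weighted sum of one dictionary direction per feature."""
    dim = len(dec[0])
    return [sum(z[i] * dec[i][d] for i in range(len(z))) for d in range(dim)]

def steer(x, enc, dec, k, suppress, boost, boost_scale=2.0):
    """Edit chosen SAE features and add only the resulting change to the residual."""
    z = topk_encode(x, enc, k)
    z_new = [0.0 if i in suppress else v * boost_scale if i in boost else v
             for i, v in enumerate(z)]
    delta = [a - b for a, b in zip(decode(z_new, dec), decode(z, dec))]
    return [xi + di for xi, di in zip(x, delta)]

# Toy dictionary with three features over a 3-d residual stream (hypothetical values).
enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
residual = [0.5, 2.0, 1.0]
steered = steer(residual, enc, dec, k=2, suppress={1}, boost={2})
```

Adding only the difference between the edited and unedited reconstructions leaves the part of the residual that the SAE does not explain untouched, which is one common way to limit side effects of steering.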

2605.24973 2026-05-26 cs.CV cs.AI cs.CL Version update

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

MinerU-Popo:结构化文档解析的通用后处理模型

Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He

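Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory, OpenDataLab; University of California, Berkeley

AI summary: Proposes MinerU-Popo, a lightweight, universal post-processing framework that decomposes the problem into four subtasks (text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association) and uses dynamic chunking with overlap-based synchronization to rebuild page-level OCR results into document-level logical structure, substantially improving title-hierarchy TEDS and RAG accuracy.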

Comments The code is available at https://github.com/opendatalab/MinerU-Popo

Details
AI Chinese abstract

基于VLM的OCR模型已成为文档解析的事实标准,因为它们可以准确提取页面级元素(例如单个页面内的段落)及其边界框和文本内容。然而,下游应用(如RAG)需要连贯的文档级信息,而这些模型常常破坏跨页连续性,并且无法恢复被页面边界截断的结构(如段落和表格)。这种关系不局限于单个页面;相反,它们需要对跨多个页面的标题、段落、表格和图像进行联合分析。因此,一个自然的解决方案是重用现有的OCR输出,并通过后处理重建文档级逻辑结构。为此,我们提出了MinerU-Popo,一个轻量级且通用的OCR输出后处理框架,它将来自不同解析器的页面级结果转换为连贯的文档级结构。MinerU-Popo将问题分解为四个聚焦的子任务:文本截断恢复、表格截断恢复、标题层级重建和图文关联。为了有效解决这些问题,我们构建了一个面向任务的数据引擎,具有任务特定的输入过滤,并使用生成的数据(30K)微调了一个轻量级后处理模型(Qwen3-VL-4B)。为了支持长文档,我们引入了基于重叠同步的动态分块,对齐微调模型的分块级输出并保持全局一致性。最后,我们将对齐后的输出组装成树状文档表示,并通过节点分块和摘要进一步丰富,以支持下游检索和分析。实验结果表明,MinerU-Popo在所有五个测试的OCR模型上,标题层级TEDS至少提高了20%,提高了RAG准确性并降低了每次查询的延迟。

English abstract

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show that MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, while also improving RAG accuracy and reducing per-query latency.
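
The idea of overlap-based synchronization can be sketched as follows, under strong simplifying assumptions: chunks have a fixed size rather than a dynamic one, each chunk numbers its heading levels independently, and a single shared item is enough to recover the offset between neighboring chunks. The function names and the toy model outputs are hypothetical.

```python
def make_chunks(num_items, size, overlap):
    """Index ranges of consecutive chunks that share `overlap` items with the previous one."""
    step = size - overlap
    starts = range(0, max(num_items - overlap, 1), step)
    return [(s, min(s + size, num_items)) for s in starts]

def merge_levels(chunks, chunk_levels):
    """Stitch per-chunk heading levels into one list.

    Each chunk numbers its levels independently; the first shared item gives
    the offset between a chunk and what has been merged so far.
    """
    merged = {}
    for (start, end), levels in zip(chunks, chunk_levels):
        shared = [i for i in range(start, end) if i in merged]
        offset = merged[shared[0]] - levels[shared[0] - start] if shared else 0
        for i in range(start, end):
            # Items already merged keep the value assigned by the earlier chunk.
            merged.setdefault(i, levels[i - start] + offset)
    return [merged[i] for i in sorted(merged)]

chunks = make_chunks(num_items=7, size=4, overlap=1)
# Hypothetical model output: chunk-local heading levels for the items of each chunk.
chunk_levels = [[1, 2, 2, 3], [1, 1, 0, 1]]
levels = merge_levels(chunks, chunk_levels)
```

Here item 3 appears in both chunks, so its two local levels (3 and 1) fix an offset of 2 that brings the second chunk into the numbering of the first.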

2605.24965 2026-05-26 cs.CV cs.AI cs.LG Version update

Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection

视觉基础模型在面部深度伪造检测中的跨域泛化极限

Ibrahim Delibasoglu

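Affiliations: Department of Software Engineering, Faculty of Computer and Information Sciences

AI summary: Systematically evaluates the linear-probing performance of three vision foundation models (RoPE-ViT, DINOv3, NVIDIA C-RADIOv4-H) on the DF40 benchmark, revealing the limits of their cross-domain generalization in facial deepfake detection: the models remain highly discriminative for entire-face synthesis but hit fundamental boundaries on localized editing techniques.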

Details
AI Chinese abstract

生成模型的快速进化使得超逼真面部深度伪造的创建成为可能,暴露了现代数字取证中的一个关键弱点:检测器无法泛化到未见过的操作技术。传统网络遭受表示崩溃,过度拟合特定训练生成器的局部伪影指纹。本研究探讨了现代视觉基础模型是否可以作为可泛化的、开箱即用的特征提取器,能够在完全未见过的生成流形上追踪取证异常。我们进行了系统的跨域评估,比较了三种基础学习范式:全监督宏观语义特征(RoPE-ViT)、纯自监督几何特征(DINOv3)和多教师聚合表示(NVIDIA C-RADIOv4-H)。通过部署冻结的骨干网络并进行下游线性探测,我们映射了这些架构在具有挑战性的DF40基准上的性能极限。我们的实证结果揭示了预训练范式和参数规模之间的内在权衡,证明虽然基础模型对全脸合成保持高判别能力,但局部面部编辑技术在线性探测评估结构中暴露了基本边界。源代码和模型权重可在 http://github.com/mribrahim/deepfake 获取。

English abstract

The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out-of-the-box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross-domain evaluation comparing three foundational learning paradigms: fully supervised macro-semantic features (RoPE-ViT), pure self-supervised geometric features (DINOv3), and multi-teacher agglomerative representations (NVIDIA C-RADIOv4-H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade-offs between pre-training paradigms and parameter scale, proving that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available at http://github.com/mribrahim/deepfake.

2605.24964 2026-05-26 cs.CV Version update

ConFi-GS: Confidence-Guided High-Frequency Injection for 3D Gaussian Splatting Super-Resolution

ConFi-GS:置信度引导的高频注入用于3D高斯泼溅超分辨率

Jiaxiang Li, Zongtan Zhou, Zhen Tan, Yadong Liu, Dewen Hu

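AI summary: Proposes a reliability-aware frequency modeling framework in which a geometry-guided detail-demand prior and a frequency-aware reliability map guide the injection of high-frequency detail into low-resolution 3DGS reconstruction, improving fidelity and perceptual quality.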

Details
AI Chinese abstract

从低分辨率多视图图像重建高质量3D场景对3D高斯泼溅(3DGS)仍具挑战,因为高频观测不足常导致纹理模糊、边界弱化和视图不一致细节。现有方法要么统一应用超分辨率引导,要么主要基于几何采样定位增强区域。然而,它们通常不区分两个根本不同的问题:哪里需要额外细节,以及相应的候选高频内容是否足够可靠以融入多视图一致的3D表示。本文提出一种用于低分辨率3DGS重建的可靠性感知频率建模框架。该框架首先估计几何引导的细节需求先验,以定位在低分辨率监督下可能细节不足的区域。然后计算频率感知的可靠性图,以确定候选高频细节是否结构上受支持、频谱上未解决且跨视图稳定。结合这些信号得到细节注入图,指导优化过程中超分辨率细节的引入位置。基于该图,我们设计了一个统一的优化方案,包括空间选择性监督、从粗到细的频率正则化和可靠性感知的高斯稠密化。该方案控制可靠细节的注入位置、高频监督的激活时机以及未解决但可靠的细节如何融入高斯表示。多个基准上的实验表明,在抑制不稳定或视图不一致细节的同时,保真度和感知质量得到提升。

English abstract

Reconstructing high-quality 3D scenes from low-resolution multi-view images remains challenging for 3D Gaussian Splatting (3DGS), because insufficient high-frequency observations often lead to blurred textures, weak boundaries, and view-inconsistent details. Existing approaches either apply super-resolution guidance uniformly or localize enhancement regions based mainly on geometric sampling. However, they typically do not distinguish between two fundamentally different questions: where additional detail is needed, and whether the corresponding candidate high-frequency content is reliable enough to be internalized into a multi-view consistent 3D representation. In this paper, we propose a reliability-aware frequency modeling framework for low-resolution 3DGS reconstruction. The framework first estimates a geometry-guided detail-demand prior to locate regions that are likely under-detailed under low-resolution supervision. It then computes a frequency-aware reliability map to determine whether candidate high-frequency details are structurally supported, spectrally unresolved, and cross-view stable. Combining these signals yields a detail-injection map that guides where super-resolved details should be introduced during optimization. Based on this map, we design a unified optimization scheme comprising spatially selective supervision, coarse-to-fine frequency regularization, and reliability-aware Gaussian densification. This scheme controls where reliable details are injected, when high-frequency supervision is activated, and how unresolved yet reliable details are internalized into the Gaussian representation. Experiments on multiple benchmarks show improved fidelity and perceptual quality while suppressing unstable or view-inconsistent details.

2605.24962 2026-05-26 cs.CV Version update

Tempered Self-Similarity Alignment for Physically Plausible Video Generation

Manjin Kim, Suha Kwak, Minsu Cho

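Affiliations: Pohang University of Science and Technology (POSTECH)

AI summary: Proposes the Tempered Self-Similarity Alignment (TSA) loss, which transfers the relational knowledge in spatio-temporal self-similarity from visual foundation models into video generative models to improve the physical plausibility of generated videos.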

Comments Accepted to the CVPR 2026 Workshop on Video Generative Models: Benchmarks and Evaluation (VGBE)

Details
AI Chinese abstract

尽管视频生成模型取得了显著进展,但它们仍然难以生成物理上逼真的视频,经常出现外观漂移、不合理的运动和时间不一致性。在这项工作中,我们通过将视觉基础模型中编码的时空自相似性(STSS)关系知识迁移到视频生成模型中来解决这一局限性。STSS表示特征在空间和时间上的成对相似性,揭示了视频中物体如何与其他实体相互作用的关系结构,有效捕捉了真实世界的动态,包括物体运动和语义变换。为了迁移这种关系知识,我们提出了Tempered Self-similarity Alignment (TSA)损失,它将STSS转换为概率对应分布,并训练视频生成模型使其在动态变化区域上的对应分布与视觉基础模型的对应分布对齐。在VideoPhy和VideoPhy2基准测试上的评估表明,我们的方法在不同交互场景中显著提升了物理合理性,验证了迁移关系知识对于生成物理逼真视频的有效性。

English abstract

Despite remarkable advances in video generative models, they still struggle to generate physically realistic videos, frequently exhibiting appearance drift, implausible motion, and temporal inconsistencies. In this work, we address this limitation by transferring relational knowledge encoded in spatio-temporal self-similarity (STSS) from visual foundation models into video generative models. STSS represents pairwise similarities among features across space and time, revealing the relational structure of how objects interact with other entities throughout a video, effectively capturing real-world dynamics, including object motion and semantic transformations. To transfer this relational knowledge, we propose Tempered Self-similarity Alignment (TSA) loss, which transforms STSS into probabilistic correspondence distributions and trains the video generative model to align its correspondence distributions with those of the visual foundation model on dynamically changing regions. Evaluated on VideoPhy and VideoPhy2 benchmarks, our method demonstrates substantial improvements in physical plausibility across diverse interaction scenarios, validating the effectiveness of transferring relational knowledge for physically realistic video generation.
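
One way to read the TSA description is sketched below: pairwise cosine similarities among space-time tokens are turned into row-wise distributions with a temperature-scaled softmax, and the student's distributions are pulled toward the teacher's with a KL divergence. The use of cosine similarity, the KL direction, the temperature value, and the omission of the restriction to dynamically changing regions are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def correspondence(features, temperature):
    """Row-wise tempered softmax over pairwise cosine similarities (the STSS matrix)."""
    rows = []
    for f in features:
        logits = [cosine(f, g) / temperature for g in features]
        peak = max(logits)                       # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def alignment_loss(teacher_rows, student_rows):
    """Mean KL(teacher || student) across tokens."""
    kl = 0.0
    for p_row, q_row in zip(teacher_rows, student_rows):
        kl += sum(p * math.log(p / q) for p, q in zip(p_row, q_row))
    return kl / len(teacher_rows)

# Three space-time tokens with 2-d features (hypothetical values).
teacher = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
student = [[1.0, 0.2], [0.1, 1.0], [0.0, 1.0]]
p = correspondence(teacher, temperature=0.1)
q = correspondence(student, temperature=0.1)
loss = alignment_loss(p, q)
```

In this toy case the teacher links the second token to the first, while the student links it to the third, so the loss is positive; it vanishes when both models induce the same correspondences.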

2605.24959 2026-05-26 cs.CV Version update

Three-Step Conditional Diffusion 3D Reconstruction for Light-Field Microscopy

三步条件扩散光场显微三维重建

Qihong Zhao, Shaokang Yan, Zhimin Qiao, Jinjia Wang, Bo Xiong

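Affiliations: Yanshan University; Peking University

AI summary: To address the low resolution, heavy artifacts, and high computational cost of traditional algorithms in light-field microscopy, as well as the limited reconstruction accuracy and generalization of existing learning-based methods, proposes a high-fidelity 3D reconstruction method based on three-step conditional diffusion, which achieves fast and accurate reconstruction through deterministic three-step sampling and a lightweight conditional U-Net and adds an Inter-Class Detection module for stability.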

Comments 10 pages, 6 figures. Accepted to CVPR 2026 Findings

Details
AI Chinese abstract

光场显微镜(LFM)能够单次捕获生物样本的多角度信息,支持实时体积成像。然而,传统的基于物理的算法通常受限于有限的空间分辨率、严重的伪影和高计算成本。现有的基于学习的方法提高了推理效率,但在重建精度和泛化能力方面仍面临限制。为了解决这些挑战,本文提出了一种用于LFM的高保真三步条件扩散(TCD)三维重建方法。尽管传统扩散模型在生成建模中取得了显著成功,但其缓慢的采样过程以及质量与效率之间的固有权衡阻碍了其在实时三维成像中的应用。我们通过确定性三步采样策略结合轻量条件U-Net重新设计了扩散过程,为快速准确的体积重建建立了新范式。此外,还引入了类间检测(ICD)模块,以在推理过程中识别分布外或异常输入,从而增强模型的稳定性和可靠性。大量实验和跨数据集评估表明,TCD在重建保真度和泛化能力方面均显著优于最先进的方法,为光场显微镜提供了一种高效实用的三维重建解决方案。

English abstract

Light-field microscopy (LFM) enables single-shot capture of multi-angular information from biological samples, supporting real-time volumetric imaging. However, traditional physics-based algorithms often suffer from limited spatial resolution, severe artifacts, and high computational costs. Existing learning-based methods improve inference efficiency but still face limitations in reconstruction accuracy and generalization capability. To address these challenges, this paper proposes a high-fidelity Three-Step Conditional Diffusion (TCD) 3D reconstruction method for LFM. Although conventional diffusion models have achieved remarkable success in generative modeling, their slow sampling process and the inherent trade-off between quality and efficiency hinder their application in real-time 3D imaging. We redesign the diffusion process through a deterministic three-step sampling strategy coupled with a lightweight conditional U-Net, establishing a new paradigm for fast and accurate volumetric reconstruction. Furthermore, an Inter-Class Detection (ICD) module is incorporated to identify out-of-distribution or anomalous inputs during inference, thereby enhancing model stability and reliability. Extensive experiments and cross-dataset evaluations demonstrate that TCD significantly outperforms state-of-the-art methods in both reconstruction fidelity and generalization, providing an efficient and practical 3D reconstruction solution for light-field microscopy.

2605.24957 2026-05-26 cs.AI cs.CV cs.LG Version update

Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

通过区域感知注意力重校准减轻视觉语言模型中的对象幻觉

Yuanzhi Xu, Qian Gao, Jun Fan, Guohui Ding, Zhenyu Yang, Sixue Lin, Yuteng Xiao

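Affiliations: Qilu University of Technology (Shandong Academy of Sciences); China Telecom Digital Intelligence Technology Co, Ltd; Shenyang Aerospace University; Qilu Institute of Technology

AI summary: Proposes a training-free, region-aware adaptive weighting mechanism that computes a robust statistical midpoint across attention heads and uses inter-head disagreement to set intervention budgets dynamically, suppressing hallucination-inducing paths through continuous penalty modulation; this corrects visual-semantic misalignment while preserving generative fluency.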

Details
AI Chinese abstract

生成事实上不正确的对象(通常称为对象幻觉)仍然是大型视觉语言模型(LVLMs)中的一个持久挑战。当前解决该问题的方法——从昂贵的数据驱动微调和延迟较高的对比解码到刚性的注意力头截断——常常在计算效率或模型特征空间的连续性上做出妥协。为克服这些限制,我们引入了一种新颖的、无需训练的推理策略,该策略作为一种区域感知的自适应加权机制,动态纠正语义漂移,而不依赖于突然的启发式截断。通过计算各注意力头上的离群值稳健统计中点,我们为可靠的视觉表示建立了一个稳定锚点。然后,我们利用跨区域映射的跨头分歧来动态确定干预预算,通过连续惩罚调制温和地抑制引起幻觉的注意力路径。这种重校准过程有效纠正了视觉语义错位,同时完全保留了生成流畅性和语言先验。在包括CHAIR、POPE和MME在内的标准多模态基准上的全面评估表明,我们的策略显著减少了实例级和句子级幻觉。结果展示了与当代基线相比的最先进性能,证实了我们方法的效率和算法鲁棒性。我们的代码将公开。

English abstract

The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to address this issue, ranging from expensive data-driven fine-tuning and high-latency contrastive decoding to rigid attention head truncation, frequently compromise either computational efficiency or the continuity of the model's feature space. To overcome these limitations, we introduce a novel, training-free inference strategy that operates as a region-aware adaptive weighting mechanism to dynamically correct semantic drift without relying on abrupt heuristic truncations. By computing an outlier-resistant statistical midpoint across various attention heads, we establish a stable anchor for reliable visual representations. We then utilize the inter-head disagreement mapped across regions to dynamically determine intervention budgets, gently suppressing hallucination-inducing attention paths through a continuous penalty modulation. This recalibration process effectively rectifies visual-semantic misalignments while fully preserving generative fluency and language priors. Comprehensive evaluations on standard multimodal benchmarks, including CHAIR, POPE, and MME, reveal that our strategy substantially curtails both instance- and sentence-level hallucinations. The results demonstrate state-of-the-art performance against contemporary baselines, confirming our method's efficiency and algorithmic robustness. Our code will be made public.

2605.24946 2026-05-26 cs.CV Version update

Interpretability Transfer from Language to Vision via Sparse Autoencoders

通过稀疏自编码器实现从语言到视觉的可解释性迁移

Alexey Kravets, Da Li, Chuan Li, Da Chen, Vinay P. Namboodiri

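Affiliations: University of Bath, UK; Lambda, Inc.; Samsung AI Centre Cambridge

AI summary: Proposes the VISTA framework, which constrains the visual projector to map visual tokens into the LLM's textual SAE space, enabling visual interpretability without a dedicated vision SAE and improving object removal and object replacement by 35% and 47%, respectively.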

Details
Journal ref
ICML 2026
AI Chinese abstract

最近使用稀疏自编码器(SAE)在语言模型可解释性方面取得的进展尚未有效迁移到视觉领域,主要原因是标记视觉概念的困难和模糊性。在本文中,我们引入了通过SAE迁移对齐的视觉可解释性(VISTA),这是一个在LLaVA风格的视觉-语言模型中通过约束视觉投影器将视觉token映射到LLM预先存在的、已标记的文本SAE空间,从而将可解释性从语言迁移到视觉的框架。该方法无需训练专用的视觉SAE即可实现视觉可解释性。通过使用LLM的SAE重建损失对投影器进行正则化,VISTA将匹配率(衡量SAE空间中激活最强的文本概念与图像中语义元素对应准确度的指标)提高了三倍。利用该框架,我们进一步分析了不同视觉编码器的空间定位特性,并表明DINOv2特征比其他编码器具有更强的定位能力。利用这种精确性,我们通过细粒度的局部概念干预验证了VISTA的跨模态对齐,其中特定对象在模型感知中被移除或替换,同时保留周围场景。与纯视觉基线相比,对象移除任务提升了35%,对象替换任务提升了47%,为视觉token存在于文本SAE流形中提供了因果证据。这些贡献在多种LLM架构上得到了验证。

English abstract

Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA's cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model's perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.

2605.24938 2026-05-26 cs.IR cs.AI cs.CV Version update

Your Embedding Model is SMARTer Than You Think

你的嵌入模型比你想象的更聪明

Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim, Yong Jae Lee

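Affiliations: UW-Madison; Korea University; NetApp, Inc.

AI summary: Proposes the SMART framework, which exploits the implicit multi-vector capability of standard single-vector models by applying late interaction at inference time, improving multimodal retrieval performance without additional training.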

Details
AI Chinese abstract

多模态检索严重依赖单向量检索器,它将丰富的顺序令牌序列压缩为单个全局表示。虽然高效,但它们丢弃了密集检索任务所需的关键细粒度局部证据。多向量方法作为解决方案被引入,但严格需要训练,且许多忽略了全局总结表示的必要性。为解决这一问题,我们引入SMART,一个释放标准单向量模型潜在多向量能力的框架。我们首先证明,在池化嵌入上的标准对比训练通过梯度流隐式塑造了前序隐藏状态的检索几何结构。通过在推理时对这些冻结的隐藏状态应用直接后期交互,SMART作为一种即插即用的升级,持续提升跨多种模态的性能,甚至在MMEB-V2上进一步改进了最先进的模型。我们还揭示了SMART的优越性能,简单的轻量级后训练不仅节省时间和计算,还在视觉文档检索上带来进一步改进,使单向量模型能够超越最先进的多向量对应模型。最终,SMART为多模态检索提供了高效的推理增强和强大的微调技术。我们在https://github.com/HanSolo9682/SMART开源了代码和权重。

English abstract

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich token sequences into a single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training, and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. We also show that simple, lightweight post-training with SMART not only saves time and compute but also brings further improvement on visual document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.
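
To make the contrast between a pooled single-vector score and late interaction concrete, the sketch below scores two toy documents both ways. It assumes the common MaxSim form of late interaction (each query token keeps its best-matching document token, and the maxima are summed) and mean pooling for the single-vector score; the abstract does not say which exact scoring or pooling SMART uses, and the token vectors are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def late_interaction(query_tokens, doc_tokens):
    """MaxSim: each query token keeps its best-matching document token; scores are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def mean_pool(tokens):
    """Collapse a token sequence into one vector by averaging."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]      # matches both query tokens, one each
doc_b = [[0.5, 0.5], [0.5, 0.5]]      # same mean vector as doc_a, but no sharp match

pooled = [dot(mean_pool(query), mean_pool(d)) for d in (doc_a, doc_b)]
multi = [late_interaction(query, d) for d in (doc_a, doc_b)]
```

The pooled scores cannot separate the two documents, whereas the token-level scores prefer `doc_a`, which is the kind of fine-grained local evidence that a single global vector discards.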

2605.24932 2026-05-26 cs.CV 版本更新

X-Edit: Exact, Explicit, and Explainable Null-Space Editing for Medical Vision Transformers

X-Edit: 面向医学视觉Transformer的精确、显式且可解释的零空间编辑

Yuanye Liu, Siyuan Zhou, Ke Zhang, Lei Li, Wei Chen, Xiahai Zhuang

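Affiliations: Fudan University; Johns Hopkins University; National University of Singapore; University of Sydney

AI summary: Proposes X-Edit, which combines causal localization with null-space projection to correct errors of ViT models in medical image classification precisely while avoiding catastrophic forgetting.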

Comments Early accepted by MICCAI 2026

详情
AI中文摘要

预训练的视觉Transformer(ViT)越来越多地用于医学图像分类。然而,在动态临床场景中纠正其不可避免的失败案例是一个关键挑战。传统的微调方法固有地遭受灾难性遗忘,严重降低先前获得的诊断能力。这种不稳定性从根本上危及临床安全。解决这一脆弱性需要一种主动、可控且可靠的干预机制,该机制既有理论依据又具有内在可解释性。为此,我们提出X-Edit(精确、显式且可解释的编辑),一种高效的零空间模型编辑框架。X-Edit将编辑过程从基于梯度的迭代优化转变为有理论依据的闭式解。具体来说,我们首先通过因果追踪显式定位导致错误预测的影响层。然后,从精心挑选的锚点集中构建正交零空间投影矩阵。通过将精确的参数更新几何约束在该零空间内,我们提供了数学保证,即干预能够纠正目标错误而不干扰已建立的诊断表示。在六个医学影像基准上的广泛评估表明,X-Edit全面抑制了灾难性遗忘,同时实现了卓越的编辑成功率。我们的代码可在https://github.com/HenryLau7/X-Edit获取。

英文摘要

Pre-trained Vision Transformers (ViTs) are increasingly deployed for medical image classification. However, correcting their inevitable failure cases in dynamic clinical scenarios poses a critical challenge. Conventional fine-tuning approaches inherently suffer from catastrophic forgetting, severely degrading previously acquired diagnostic capabilities. Such instability fundamentally compromises clinical safety. Addressing this vulnerability requires an active, controllable, and reliable intervention mechanism that is both theoretically grounded and inherently interpretable. To this end, we propose X-Edit (eXact, eXplicit, and eXplainable Editing), an efficient null-space model editing framework. X-Edit transitions the editing process from iterative gradient-based optimization to a theoretically grounded, closed-form solution. Specifically, we first use causal tracing to explicitly localize the influential layers governing the erroneous prediction. Subsequently, we construct an orthogonal null-space projection matrix from a curated anchor set. By geometrically constraining the exact parameter update strictly within this null space, we provide mathematical guarantees that the intervention rectifies targeted errors without perturbing established diagnostic representations. Extensive evaluations on six medical imaging benchmarks demonstrate that X-Edit comprehensively suppresses catastrophic forgetting while achieving superior edit success rates. Our code is available at https://github.com/HenryLau7/X-Edit.
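
The null-space constraint can be illustrated on a single linear layer. The sketch below is a generic rank-one construction, not the formula from the paper: it assumes the layer computes `W @ k`, treats the anchor set as a list of input vectors whose outputs must not change, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    return [dot(row, x) for row in W]


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt basis for the span of the anchor inputs."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(x, anchor_basis):
    """Remove from x every component that lies in the anchor span."""
    p = list(x)
    for q in anchor_basis:
        c = dot(p, q)
        p = [pi - c * qi for pi, qi in zip(p, q)]
    return p


def null_space_edit(W, anchors, k_target, v_target):
    """Closed-form update: W @ k_target becomes v_target, anchors are untouched."""
    p = project_to_null_space(k_target, orthonormal_basis(anchors))
    pp = dot(p, p)
    if pp < 1e-12:
        raise ValueError("target input lies in the span of the anchors")
    residual = [v - y for v, y in zip(v_target, matvec(W, k_target))]
    # Rank-one update residual * p^T / (p^T p); p is orthogonal to every anchor.
    return [[w + r * pj / pp for w, pj in zip(row, p)]
            for row, r in zip(W, residual)]
```

Because `p` is orthogonal to every anchor, the update contributes nothing to anchor outputs, and because `p · k_target = p · p`, the edited layer maps the target input exactly to `v_target`. The edit fails by design when the target input is a linear combination of the anchors, since no update can then change one without the other.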

2605.24928 2026-05-26 cs.CV 版本更新

MambaDSF: Multi-Scale SSM with Dilated Feature Fusion for Sonar Small Target Detection

MambaDSF:基于膨胀特征融合的多尺度SSM用于声纳小目标检测

Hui Lin, Jiayi Li, Jing Wang, Shenghui Rong

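Affiliations: School of Information Science and Engineering, Ocean University of China; School of Information and Communication Technology, Griffith University

AI summary: Addresses insufficient pixel coverage, noise interference, and scale ambiguity in sonar small-target detection with MambaDSF, a hybrid framework that combines a Mamba-enhanced feature pyramid, a dilated-fusion encoder, and scale-adaptive losses, reaching 91.5% mAP50 on the UATD dataset with 28.7M parameters.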

Comments 8 pages, 4 figures, under review at IEEE Geoscience and Remote Sensing Letters (GRSL)

详情
AI中文摘要

声纳成像是水下目标检测的主要方式,但由于像素覆盖不足、声学对比度低以及不同成像距离下的尺度模糊,小目标仍然难以检测。基于CNN的检测器能高效提取局部特征,但缺乏全局声学上下文,无法抑制噪声引起的虚警。基于Transformer的方法以二次计算代价捕捉长距离依赖。现有的基于Mamba的视觉模型提供高效的线性代价扫描,但缺乏跨金字塔层级的多尺度语义对齐、多感受野融合以及可靠声纳检测所需的小目标感知训练监督。本文提出Mamba膨胀尺度融合(MambaDSF),一个混合框架,通过三个贡献解决这些局限:Mamba增强特征金字塔(MambaEFP)骨干网络,以线性复杂度联合捕捉局部回波线索和全局声学上下文;膨胀融合Mamba(DFMamba)编码器,强制跨金字塔层级的多尺度特征对齐;以及尺度自适应加权IoU(SA-WIoU)和跨尺度一致性(CSC)损失,稳定小目标训练。MambaDSF在UATD前视声纳基准上达到91.5% mAP50,参数为2870万,超越所有对比检测器。在小目标子集上,增益达到+2.2个百分点,在FLS和MD-FLS上的跨域评估证实了所提出架构的泛化能力。代码公开于https://github.com/IDontKnowAAA/MambaDSF。

英文摘要

Sonar imaging is the primary modality for underwater target detection, yet small targets remain difficult to detect due to insufficient pixel coverage, low acoustic contrast, and scale ambiguity across imaging ranges. CNN-based detectors extract local features efficiently but cannot suppress noise-induced false alarms without global acoustic context. Transformer-based methods capture long-range dependencies at quadratic computational cost. Existing Mamba-based vision models offer efficient linear-cost scanning but lack the multi-scale semantic alignment across pyramid levels, multi-receptive-field fusion, and small-target-aware training supervision needed for reliable sonar detection. This letter proposes Mamba Dilated-Scale Fusion (MambaDSF), a hybrid framework addressing these limitations through three contributions: a Mamba Enhanced Feature Pyramid (MambaEFP) backbone that jointly captures local echo cues and global acoustic context at linear complexity; a Dilate Fusion Mamba (DFMamba) encoder that enforces multi-scale feature alignment across pyramid levels; and Scale-Adaptive Weighted IoU (SA-WIoU) and Cross-Scale Coherence (CSC) losses that stabilize small-target training. MambaDSF achieves 91.5% mAP50 on the UATD forward-looking sonar benchmark with 28.7 million parameters, surpassing all compared detectors. On a small-target subset the gain reaches +2.2 percentage points, and cross-domain evaluation on FLS and MD-FLS confirms the generalization of the proposed architecture. The code is publicly available at https://github.com/IDontKnowAAA/MambaDSF.

2605.24915 2026-05-26 cs.GR cs.CV 版本更新

Snapshot Polarimetric Display Inverse Rendering

快照偏振显示逆渲染

Seokjun Choi, Yunseong Moon, Kaizhang Kang, Hoon-Gyu Chung, Jin-Nyeong Kim, Giljoo Nam, Seung-Hwan Baek

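Affiliations: POSTECH

AI summary: Proposes a snapshot polarimetric display inverse rendering method in which an LCD projects linearly polarized RGB patterns and a polarization camera captures spectro-polarimetric measurements; a feed-forward transformer predicts per-pixel normal, albedo, roughness, and metallicity, outperforming existing methods on real desktop scenes.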

详情
AI中文摘要

逆渲染仍然是图形学和视觉领域的核心挑战,尤其是在轻量级桌面工作流程所需的快照配置中,每帧信息预算高度受限。以往的逆渲染工作探索了各种可用的维度来丰富每次拍摄的信息,包括时间调制、光谱编码和偏振。在这项工作中,我们引入了偏振显示逆渲染,使用LCD投影线性偏振RGB二值图案,并配备四分之一波片的RGB偏振相机在单次拍摄中获取光谱偏振测量。一个前馈Transformer将这些测量映射到每像素法线、反照率、粗糙度和金属度。为了克服训练数据稀缺,我们通过生成流形扩展了一组有限的实测偏振双向反射分布函数。在真实桌面设置上的评估表明,该方法在多种场景中实现了准确的逆渲染,优于现有方法。

英文摘要

Inverse rendering remains a core challenge in graphics and vision, especially in the snapshot configurations required for lightweight desktop workflows, where the per-frame information budget is highly constrained. Previous inverse rendering work explores various available dimensions for enriching the per-shot information, including temporal modulation, spectral encoding, and polarization. In this work, we introduce polarimetric display inverse rendering, using an LCD to project a linearly polarized RGB binary pattern and an RGB polarization camera augmented with a quarter-wave plate to acquire spectro-polarimetric measurements in a single shot. A feed-forward transformer maps these measurements to per-pixel normal, albedo, roughness, and metallicity. To overcome training data scarcity, we expand a limited set of measured polarimetric bidirectional reflectance distribution functions via a generative manifold. Evaluations on a real desktop setup demonstrate accurate inverse rendering across diverse scenes, outperforming existing approaches.
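
As background on what a polarization camera measures, the sketch below computes the linear Stokes components from intensities behind 0°, 45°, 90°, and 135° polarizers, using the standard textbook relations. It is a simplification and an assumption on our part: it ignores the quarter-wave plate that the abstract says is added to the camera, and it is not the authors' measurement model.

```python
import math


def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes components from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs vertical
    s2 = i45 - i135                     # diagonal vs anti-diagonal
    return s0, s1, s2


def dolp_aolp(s0, s1, s2):
    """Degree and angle (radians) of linear polarization."""
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp
```

For fully horizontally polarized light of unit intensity the four readings are 1, 0.5, 0, and 0.5, which gives a degree of linear polarization of 1 and an angle of 0; four equal readings give a degree of 0.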

2605.24894 2026-05-26 cs.CV 版本更新

BFS: Back-to-Front Layered Image Synthesis via Knowledge Transfer

BFS: 通过知识转移的前后分层图像合成

Kyoungkook Kang, Gyujin Sim, Sunghyun Cho

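Affiliations: SAMSUNG; POSTECH

AI summary: Proposes BFS, a framework that uses a dual-branch diffusion model and two-stage training to transfer knowledge from unlayered image synthesis, producing high-quality foreground layers that blend harmoniously with the background.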

Comments SIGGRAPH 2026

详情
AI中文摘要

随着生成模型扩展了视觉内容创作的可能性,分层图像合成已成为可控和创意编辑的一个有前景的方向。然而,现有方法难以充分发挥这一潜力。基于分解的方法通常难以实现干净分离,而基于生成的方法则面临训练数据获取困难的问题,降低了质量和场景多样性。在本文中,我们提出了BFS,一种新颖的基于生成的分层图像合成框架。具体来说,给定背景图像和用户指导,BFS合成一个前景层,该层不仅包含前景对象,还包括其相关的视觉效果(如阴影和反射),同时与背景无缝协调以产生连贯的合成图像。为了实现多样且高质量的前景层合成,同时克服数据稀缺问题,我们利用相对易于学习的非分层图像合成知识来指导前景合成。为此,我们采用双分支扩散框架,其中两个相互连接的分支分别生成合成图像和前景层,实现双向知识转移。基于该框架,我们提出了一种两阶段训练方案,利用高质量的非分层合成图像数据集有效提升前景质量。大量实验(包括用户研究)表明,BFS生成了高质量的分层图像,始终优于先前方法。

英文摘要

As generative models expand the possibilities of visual content creation, layered image synthesis has emerged as a promising direction for controllable and creative editing. However, existing methods struggle to fully realize this potential. Decomposition-based methods often fail to achieve clean separation, while generation-based methods are limited by the difficulty of acquiring training data, which reduces quality and scene diversity. In this paper, we propose BFS, a novel generation-based framework for layered image synthesis. Specifically, given a background image and user guidance, BFS synthesizes a foreground layer that incorporates not only a foreground object but also its associated visual effects, such as shadows and reflections, while seamlessly harmonizing with the background to produce a coherent composite. To enable diverse and high-quality foreground layer synthesis while overcoming data scarcity, we leverage the comparatively easy-to-learn knowledge of unlayered image synthesis for the foreground synthesis. To this end, we adopt a dual-branch diffusion framework in which two interconnected branches generate a composite image and a foreground layer, respectively, enabling bidirectional knowledge transfer. Based on this framework, we propose a two-stage training scheme that utilizes a high-quality unlayered composite image dataset to effectively enhance foreground quality. Extensive experiments, including a user study, show that BFS produces high-quality layered images, consistently outperforming prior methods.
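
The relation between a foreground layer and the composite can be sketched with standard "over" alpha compositing. This assumes the foreground layer is an RGBA image with straight (non-premultiplied) alpha in [0, 1]; the abstract does not specify the layer format, so treat the representation and the function name as illustrative.

```python
def composite_over(foreground_rgba, background_rgb):
    """Blend an RGBA foreground layer over an RGB background, pixel by pixel."""
    composite = []
    for fg_row, bg_row in zip(foreground_rgba, background_rgb):
        out_row = []
        for (fr, fgreen, fb, alpha), (br, bgreen, bb) in zip(fg_row, bg_row):
            out_row.append((
                alpha * fr + (1.0 - alpha) * br,
                alpha * fgreen + (1.0 - alpha) * bgreen,
                alpha * fb + (1.0 - alpha) * bb,
            ))
        composite.append(out_row)
    return composite
```

Under this convention a soft shadow is simply a dark foreground pixel with partial alpha: black at alpha 0.5 over a 0.8 gray background yields 0.4 gray, while alpha 0 leaves the background unchanged.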

2605.24893 2026-05-26 cs.CV 版本更新

BED-SAM2: Boundary-Enhanced-Depth SAM2 via Monocular Geometric Priors

BED-SAM2: 通过单目几何先验增强边界的深度SAM2

Tyler Rust, Dara McNally, Kyle O'Donnell, Colin Kelly, Chandra Kambhamettu

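Affiliations: University of Delaware; University of South Florida; DEVCOM Army Research Laboratory

AI summary: Modifies the SAM2 encoder to encode monocular depth information directly, yielding BED-SAM2, which achieves competitive performance on salient and camouflaged object detection after only a few training epochs.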

Comments 9 pages, 5 figures, 5 tables. Presented as a poster at the CVPR 2026 Workshop on Computer Vision in the Wild (CVinW). Code available at https://github.com/TylerRust-1/BED-SAM2

详情
AI中文摘要

基于SAM2视觉基础模型进行下游分割,本研究引入了边界增强深度(BED)-SAM2。修改了SAM2 Hiera编码器架构,以直接从RGB图像编码单目深度信息,从而提供几何线索,增强物体边界描绘并促进伪装物体形状的提取。BED-SAM2在多个显著和伪装物体检测任务中,仅需五个训练周期即可展现出具有竞争力的最先进性能。

英文摘要

Building upon the SAM2 vision foundation model for downstream segmentation, this study introduces Boundary Enhanced Depth (BED)-SAM2. The SAM2 Hiera encoder architecture is modified to directly encode monocular depth information from RGB images, thereby providing geometric cues that enhance object boundary delineation and facilitate the extraction of camouflaged object shapes. BED-SAM2 demonstrates competitive state-of-the-art performance across multiple salient and camouflaged object detection tasks with as few as five training epochs.

2605.24870 2026-05-26 cs.CV 版本更新

Trajectory-Consistent Calibration for Cache-Accelerated Diffusion Models

轨迹一致校准用于缓存加速扩散模型

Mingyu Liang, Dingkun Xu, Jingwei Xu

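Affiliations: Laboratory for Novel Software Technology, Nanjing University, China

AI summary: To counter the quality loss caused by representation deviations in cache-accelerated diffusion models, proposes a training-free trajectory-consistent calibration method that calibrates cached representations through an offline iterative procedure and consistently improves FID on PixArt-alpha and DiT-XL/2.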

Comments 23 pages, 8 figures, 8 tables. Code is available at https://github.com/NJUDeepEngine/TCC

详情
AI中文摘要

扩散Transformer在迭代采样过程中需要重复进行去噪器评估,导致推理计算成本高昂。基于缓存的加速方法通过跨去噪步骤重用中间表示来降低这一成本,但可能引入表示偏差并降低生成质量。本文分析了这些偏差,并表明有效的校准应考虑重用导致的直接不匹配以及先前校正引起的后续轨迹偏移。为解决这一挑战,我们提出了轨迹一致校准(TCC),一种无训练的方法,将缓存表示校准为其全计算对应物。具体而言,TCC并非从单个未校正的缓存轨迹中估计所有校准先验,而是使用离线迭代过程,使得每个先验都考虑先前校准引起的轨迹偏移。在PixArt-alpha和DiT-XL/2上的实验表明,TCC在保持底层重用策略的同时,持续改善了代表性缓存加速方法的FID。值得注意的是,在基于FORA的典型PixArt-alpha缓存加速设置中,TCC将FID从29.83降至27.35,略微超过了全计算基线。

英文摘要

Diffusion Transformers require repeated denoiser evaluations during iterative sampling, making inference computationally expensive. Cache-based acceleration reduces this cost by reusing intermediate representations across denoising steps, but can introduce representation deviations and degrade generation quality. In this paper, we analyze these deviations and show that effective calibration should consider both the direct mismatch caused by reuse and the subsequent trajectory shift induced by earlier corrections. To address this challenge, we propose Trajectory-Consistent Calibration (TCC), a training-free method that calibrates cached representations toward their full-computation counterparts. Specifically, rather than estimating all calibration priors from a single uncorrected cache trajectory, TCC uses an offline iterative procedure so that each prior accounts for the trajectory shift induced by preceding calibrations. Experiments on PixArt-alpha and DiT-XL/2 show that TCC consistently improves FID across representative cache-based acceleration methods while preserving their underlying reuse policies. Notably, in a representative PixArt-alpha cache-acceleration setting based on FORA, TCC reduces FID from 29.83 to 27.35, slightly surpassing the full-computation baseline.
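
The difference between estimating all calibration priors from one uncorrected cached trajectory and estimating them sequentially can be shown on a toy scalar sampler. Everything here is an assumption made for illustration: the "feature" is the identity function, the cache is refreshed every `reuse_every` steps, and the prior is a simple additive offset. The abstract does not describe the actual form of TCC's priors.

```python
def full_feature(x):
    # Stand-in for an expensive network block (hypothetical).
    return x


def run(x0, num_steps, step_size, reuse_every, offsets):
    """Toy sampler. Returns the final state and the full-vs-cached gap per reuse step."""
    x, cached, gaps = x0, None, {}
    for t in range(num_steps):
        if t % reuse_every == 0:
            cached = full_feature(x)            # recompute and refresh the cache
            h = cached
        else:
            gaps[t] = full_feature(x) - cached  # measured offline only
            h = cached + offsets.get(t, 0.0)    # reuse plus calibration offset
        x = x - step_size * h
    return x, gaps


def one_shot_offsets(x0, num_steps, step_size, reuse_every):
    """All offsets taken from a single uncorrected cached trajectory."""
    _, gaps = run(x0, num_steps, step_size, reuse_every, {})
    return gaps


def trajectory_consistent_offsets(x0, num_steps, step_size, reuse_every):
    """Each offset is measured with all earlier offsets already applied."""
    offsets = {}
    for t in range(num_steps):
        if t % reuse_every != 0:
            _, gaps = run(x0, num_steps, step_size, reuse_every, offsets)
            offsets[t] = gaps[t]
    return offsets
```

With `x0 = 1.0`, four steps, step size 0.5, and reuse on every second step, the sequential offsets reproduce the full-computation result, whereas the one-shot offsets do not, because the offset for the last step was measured on a trajectory that the first correction has since shifted. The exact match is an artifact of calibrating on the same single sample; a realistic procedure would average over a calibration set.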

2605.24843 2026-05-26 cs.CV cs.AI 版本更新

Adversarial Error Correction for Visual Autoregressive Generation

视觉自回归生成的对抗性纠错

Ligong Bi, Tao Huang, Jianyuan Guo, Chang Xu

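Affiliations: Shanghai Jiao Tong University; City University of Hong Kong; The University of Sydney

AI summary: Proposes AID-VAR, a framework that uses an adversarially injected diagnosis mechanism to correct cascading errors in visual autoregressive models and improve generation quality.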

详情
AI中文摘要

视觉自回归(VAR)模型通过执行层次化的下一尺度预测,已成为图像合成的强大范式。然而,VAR模型天生容易产生级联误差传播,其中细微的粗尺度误预测会在层次结构中放大,最终扭曲最终合成。为了缓解这一问题,我们提出了AID-VAR,一个即插即用的框架,通过对抗性注入诊断增强预训练的VAR。与标准的被动生成不同,AID-VAR引入了一种主动纠错机制,灵感来自GAN中的对抗性反馈。我们部署了一个判别器来诊断每个尺度转换处的保真度差距,并配有一个轻量级的引导注入器。该模块作为一个非侵入式适配器,优化冻结的VAR骨干网络的特征流形,有效引导生成朝向真实图像的分布,同时不破坏预训练潜在空间的稳定性。此外,为了严格评估这种跨尺度进展,我们引入了跨尺度一致性得分(ISCS),这是一个新的度量标准,用于量化连续分辨率尺度之间的保真度和结构对齐。在各种骨干网络上的实验结果表明,AID-VAR以可忽略的开销提供了更清晰的纹理细节和更少的结构失真。例如,AID-VAR-d20在参数仅增加3%的情况下,FID提升了16%。这些结果确立了AID-VAR作为升级大规模VAR生成器的高效且可扩展的途径,在不改变训练数据、基础架构或采样调度的情况下,增强了全局连贯性和局部细节。代码可在https://github.com/bijiw515/AID-VAR获取。

英文摘要

Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next-scale prediction. However, VAR models are inherently prone to cascading error propagation, where subtle coarse-scale mispredictions are amplified across the hierarchy, ultimately distorting the final synthesis. To mitigate this, we propose AID-VAR, a plug-and-play framework that enhances pre-trained VARs through Adversarially Injected Diagnosis. Instead of a standard passive generation, AID-VAR introduces a proactive error-correction mechanism inspired by the adversarial feedback in GANs. We deploy a discriminator to diagnose fidelity gaps at each scale transition, coupled with a lightweight guidance injector. This module operates as a non-invasive adapter that refines the feature manifold of a frozen VAR backbone, effectively steering the generation toward the distribution of real images without destabilizing the pre-trained latent space. Furthermore, to rigorously evaluate this cross-scale progression, we introduce the Inter-Scale Consistency Score (ISCS), a novel metric that quantifies the fidelity and structural alignment between consecutive resolution scales. Experimental results across various backbones demonstrate that AID-VAR delivers sharper textural details and fewer structural distortions with negligible overhead. For instance, AID-VAR-d20 achieves a 16% improvement in FID with only a 3% increase in parameters. These results establish AID-VAR as a highly efficient and scalable pathway for upgrading large-scale VAR generators, enhancing global coherence and local detail without altering training data, base architectures, or sampling schedules. Code is available at https://github.com/bijiw515/AID-VAR.

2605.24831 2026-05-26 cs.CV cs.AI 版本更新

Multiscale Real-Time Object Detection in the NMS-Free Era: A Comparative Performance Evaluation of YOLOv8 and YOLO26

无NMS时代的实时多尺度目标检测:YOLOv8与YOLO26的对比性能评估

Chidera G. Oguine, Kanyifeechukwu J. Oguine, Obiozor M. Oguine, Ozioma C. Oguine

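Affiliations: University of Abuja; Vanderbilt University; University of Notre Dame

AI summary: Systematically compares the NMS-based YOLOv8 with the NMS-free YOLO26 across multiple model scales on Pascal VOC and VisDrone in terms of accuracy, localization, model size, computation, and latency. YOLO26 detects better with lower model complexity at most scales, but its advantage narrows in dense small-object scenes, and YOLOv8 remains competitive in GPU latency.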

Comments 11 pages, 6 tables, 9 figures

详情
AI中文摘要

非极大值抑制(NMS)仍然是许多实时目标检测流程中的关键后处理步骤,但在资源受限的环境中可能引入延迟变化和部署复杂性。最近的无NMS设计(如YOLO26)旨在通过端到端检测减少这种依赖,然而与基于NMS的成熟模型(如YOLOv8)相比,其性能在标准基准之外尚未得到充分探索。本文在Pascal VOC和VisDrone上比较了YOLOv8和YOLO26,这两个数据集分别代表通用目标检测和密集空中小目标检测。两个模型家族在五个尺度上使用准确率、定位、模型大小、GFLOPs以及CPU/GPU延迟进行评估。结果表明,YOLO26在Pascal VOC上的大多数尺度上实现了更强的检测性能和更低的模型复杂度,而在VisDrone上性能差距缩小,两个模型在处理密集小目标时均表现困难。YOLOv8在GPU延迟上仍具有竞争力,表明无NMS设计并不能保证普遍的部署优势。总体而言,研究表明检测器的选择取决于数据集特征、目标尺度、模型容量和硬件约束。

英文摘要

Non-Maximum Suppression (NMS) remains a key post-processing step in many real-time object detection pipelines, but it can introduce latency variation and deployment complexity in resource-constrained settings. Recent NMS-free designs such as YOLO26 aim to reduce this dependence through end-to-end detection, yet their performance relative to established NMS-based models such as YOLOv8 remains underexplored beyond standard benchmarks. This paper compares YOLOv8 and YOLO26 on Pascal VOC and VisDrone, representing general object detection and dense aerial small-object detection, respectively. Both model families are evaluated across five scales using accuracy, localization, model size, GFLOPs, and CPU/GPU latency. Results show that YOLO26 achieves stronger detection performance and lower model complexity on Pascal VOC across most scales, while the performance gap narrows on VisDrone, where both models struggle with dense small targets. YOLOv8 remains competitive in GPU latency, showing that NMS-free design does not guarantee universal deployment superiority. Overall, the study shows that detector selection depends on dataset characteristics, object scale, model capacity, and hardware constraints.
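
For readers unfamiliar with the post-processing step under comparison, here is a minimal sketch of classic greedy NMS in plain Python. It is the textbook algorithm, not the exact implementation used by either YOLO release, and the box format `(x1, y1, x2, y2)` is an assumption.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The inner check compares each candidate against every box kept so far, so the amount of work depends on how many detections survive. That data-dependent cost is one source of the latency variation the abstract mentions, and it is the step that end-to-end NMS-free detectors aim to drop.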

2605.24816 2026-05-26 cs.CV 版本更新

AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning

AOEPT:打破模态缺失提示调优中的隐式模态缩减瓶颈

Jian Lang, Rongpei Hong, Ting Zhong, Fan Zhou

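Affiliations: University of Electronic Science and Technology of China; Intelligent Digital Media Technology Key Laboratory of Sichuan Province

AI summary: Proposes AOEPT, which uses Modal-Contextualized Prompts (MCPs) to distill global modality priors that serve as latent information sources for missing modalities, restoring the reasoning scope of multimodal Transformers and removing the implicit modality-reduction bottleneck in modality-missing scenarios.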

Comments 20 pages, Accepted by ICML 2026, Code is available from https://github.com/Jian-Lang/AOEPT

详情
AI中文摘要

在现实环境中部署多模态系统通常需要处理模态缺失场景,即一个或多个模态不可用。虽然最近的研究通过提示调优解决了通用多模态Transformer(MT)架构的这一挑战,但我们发现了这些方法的一个基本限制:隐式模态缩减瓶颈。通过仅将提示条件限制在观察到的模态上,它们无意中将MT的推理范围限制在模态缩减子空间内,切断了缺失模态潜在信息源的访问。为克服这一限制,我们提出AOEPT,开创了一种新颖的模态上下文提示方式。具体来说,我们引入了轻量级的模态上下文提示(MCPs),从训练数据中蒸馏全局模态先验,作为缺失模态信息源的潜在存储库。基于剩余模态,这些MCPs被实例化为实例感知提示,为每个样本选择性地增强缺失模态信息,从而将MT的推理范围恢复到仅观察模态子空间之外。在各种多模态基准和骨干网络上的实验证实了AOEPT的强大性能,且计算开销极小。

英文摘要

Deploying multimodal systems in real-world environments often entails handling modality-missing scenarios, where one or more modalities are unavailable. While recent studies address this challenge for the general Multimodal Transformer (MT) architecture via prompt tuning, we identify a fundamental limitation in these methods: the Implicit Modality-Reduction bottleneck. By conditioning prompts solely on the observed modalities, they inadvertently restrict the reasoning scope of MTs to the modality-reduced subspace, cutting off access to the latent information sources of the missing modalities. To overcome this limitation, we propose AOEPT, which pioneers a novel modal-contextualized prompting fashion. Specifically, we introduce lightweight Modal-Contextualized Prompts (MCPs) that distill global modality-wise priors from training data, serving as latent repositories of the information sources for missing modalities. Conditioned on the remaining modalities, these MCPs are instantiated into instance-aware prompts that selectively augment missing-modality information for each sample, thereby restoring the reasoning scope of MTs beyond the observed-modality-only subspace. Experiments across various multimodal benchmarks and backbones confirm the strong performance of AOEPT, with minimal computational overhead.

2605.24807 2026-05-26 cs.CV 版本更新

CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

CLIP引导的SAM:用于可提示分割的参数高效语义条件

Shayan Jalilian, Abdul Bais

发表机构 * University of Regina, Regina, SK, Canada(里贾纳大学)

AI总结 提出CLIP-Guided SAM框架,通过轻量级多模态语义适配器将CLIP特征注入SAM图像编码器,实现内部语义条件化,在低标注数据下提升分割性能并支持手动和半自动两种模式。

详情
AI中文摘要

可提示基础模型如分割一切模型(SAM)能生成高质量掩码,但语义上仍存在盲区,依赖外部提示来指定类别。现有的视觉-语言方法通过外部提示耦合来解决这一限制,即视觉-语言模型作为独立阶段为SAM生成空间提示。我们提出CLIP引导的SAM,一种基于内部语义条件的参数高效分割框架。我们不是仅使用语义信号来生成提示,而是通过轻量级多模态语义适配器将CLIP派生的文本、视觉和相似性特征直接注入SAM的图像编码器。这些适配器调节SAM的内部特征表示,使得语义信息能够影响掩码预测,同时保留SAM原有的可提示接口。我们的框架专为低标注数据场景设计,适用于通用领域基准和专门的下游任务。它支持两种操作模式:手动模式(用于同时使用文本和空间提示的交互式分割)和半自动纯文本模式(用于仅需文本输入的概念特定分割应用)。我们表明,鲁棒性取决于训练与推理时使用的提示类型是否一致,使得训练-测试提示一致性成为重要的设计原则。通过大量实验和消融研究,我们评估了我们的方法,与无语义条件的SAM+PEFT基线、视觉-语言+SAM流水线、SAM 3以及依赖大量无标注数据的强半监督分割方法进行比较。在这些设置中,CLIP引导的SAM在训练和部署中均保持参数高效的同时,始终取得优越或具有竞争力的性能。

英文摘要

Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.

2605.24805 2026-05-26 cs.CV 版本更新

Fishbone: From One 3D Asset to a Million Controllable Edits

Fishbone: 从一个3D资产到百万可控编辑

Yumeng He, Xiaoying Wang, Peihao Li, Yanjia Huang, Joe Masterjohn, Jiajun Wu, Leonidas Guibas, Yin Yang, Ying Jiang, Chenfanfu Jiang

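Affiliations: UCLA; USC; UC Berkeley; TRI; Stanford; Utah

AI summary: Proposes Fishbone, a unified rib-spine representation that supports controllable parametric deformation, reduced-space dynamics, and animation for general meshes, and builds the Fishbone-136K dataset, with applications to controllable 3D generation and data augmentation for robot learning.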

Comments 20 pages, 19 figures

详情
AI中文摘要

大规模可控3D资产对于计算机图形学、具身AI、机器人和交互式内容创作至关重要,但由于手动建模和绑定的高成本,创建多样化的3D资产仍然具有挑战性。形状变形提供了一种从现有网格生成变体的自然方式,但现有的数据驱动方法通常依赖稀疏的用户输入,而参数化编辑框架需要手动设计的控制结构和特定类别的配置。受自然生物启发,其中中央脊柱控制全局形状,横截面肋骨控制局部变化,我们引入了Fishbone,一种统一的脊-肋表示,适用于通用形状,支持可控参数化网格变形、降阶动力学和动画。给定输入网格,Fishbone使用自适应热方法计算测地标量场,提取等值线作为横截面肋骨,通过肋骨中心构建光滑的几何感知脊柱,并使用高斯加权蒙皮将表面顶点与附近的肋骨和脊柱结构关联。由此产生的表示支持实时和可预测的变形:肋骨控制局部轮廓,如厚度、方向和横截面变化,而脊柱控制全局弯曲、扭转和拉伸。相同的结构还支持降阶模拟和关键帧动画。我们进一步通过用脊-肋结构增强Hunyuan3D构建了Fishbone-136K,并展示了在可控3D生成、基于变形的机器人学习数据增强、交互式网格编辑和智能体生成中的应用。实验证明了所提出框架的有效性、效率和通用性。

英文摘要

Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. Shape deformation offers a natural way to generate variations from existing meshes, but existing data-driven methods often rely on sparse user inputs, while parametric editing frameworks require manually designed control structures and category-specific configurations. Inspired by natural creatures, where a central spine governs global shape and cross-sectional ribs control local variation, we introduce Fishbone, a unified rib-spine representation for general shapes that supports controllable parametric mesh deformation, reduced-space dynamics, and animation. Given an input mesh, Fishbone computes a geodesic scalar field with an adaptive heat method, extracts iso-contours as cross-sectional ribs, constructs a smooth geometry-aware spine through rib centers, and associates surface vertices with nearby rib and spine structures using Gaussian-weighted skinning. The resulting representation enables real-time and predictable deformation: ribs control local profiles such as thickness, orientation, and cross-sectional variation, while the spine controls global bending, twisting, and stretching. The same structure also supports reduced-space simulation and keyframe animation. We further construct Fishbone-136K by augmenting Hunyuan3D with rib-spine structures, and demonstrate applications in controllable 3D generation, deformation-based data augmentation for robot learning, interactive mesh editing, and agentic generation. Experiments demonstrate the effectiveness, efficiency, and versatility of the proposed framework.
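
The idea of Gaussian-weighted skinning can be sketched as follows. This is a simplified guess at the mechanism, under stated assumptions: distances are Euclidean distances to rib centers (the paper may well use distances along its geodesic field), each rib carries only a translation, and the function names are hypothetical.

```python
import math


def gaussian_skinning_weights(vertex, rib_centers, sigma):
    """Normalized Gaussian weights of one vertex with respect to each rib center."""
    sq_dists = [sum((v - c) ** 2 for v, c in zip(vertex, center))
                for center in rib_centers]
    nearest = min(sq_dists)
    # Subtract the smallest distance before exponentiating to avoid underflow.
    raw = [math.exp(-(d - nearest) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(raw)
    return [w / total for w in raw]


def deform_vertex(vertex, rib_centers, rib_offsets, sigma):
    """Move a vertex by the weighted blend of per-rib translations."""
    weights = gaussian_skinning_weights(vertex, rib_centers, sigma)
    return [v + sum(w * offset[axis] for w, offset in zip(weights, rib_offsets))
            for axis, v in enumerate(vertex)]
```

A vertex halfway between two ribs receives weight 0.5 from each and therefore moves by half of either rib's translation, while a small `sigma` makes each vertex follow its nearest rib almost exclusively. That is the trade-off between smooth and local control.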

2605.24799 2026-05-26 cs.CV cs.AI 版本更新

Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

面向大规模视觉识别的多模态大语言模型分治推理

Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao

Affiliations: Taizhou Institute of Science and Technology, Nanjing University of Science and Technology; Department of Intelligence Science, Xi'an Jiaotong-Liverpool University; School of Computer Science and Technology, Soochow University; Department of Statistical Sciences, University of Toronto

AI summary: To address the performance collapse of multimodal large language models in long-sequence recognition, proposes Divide-and-Conquer Inference (DCI), which recursively decomposes the task and prunes dynamically to improve signal-to-noise ratio and classification accuracy.

详情
AI中文摘要

多模态大语言模型(MLLMs)在广泛的视觉语言任务中展现了强大的能力。然而,当应用于大规模图像分类时,随着标签空间的扩大,其性能显著下降——我们将这一现象定义为长序列识别中的性能崩溃。通过信息论分析,我们揭示了这种崩溃源于不断增长的信息熵与注意力机制中显著的注意力稀释和衰减之间的根本冲突,这损害了模型在处理极长提示时维持足够信噪比的能力。为缓解这一问题,我们提出了分治推理(DCI),一种用于MLLMs视觉识别的新型测试时扩展策略。DCI递归地将复杂的全局分类任务分解为多个更简单的局部子问题,并采用动态剪枝机制压缩搜索空间。该方法通过缓解长序列推理中固有的权重稀释问题,有效提高了局部信噪比和模型精度。此外,传统自注意力具有难以承受的二次计算复杂度,而DCI在大规模分类场景中实现了更有利的扩展行为并显著加速推理。在ImageNet-1K和ImageNet-21K等基准上的大量实验表明,DCI持续提高了分类精度。这使得轻量级开源模型无需任何额外训练或微调即可与甚至超越前沿闭源巨头。作为一种模型无关、即插即用的范式,DCI为在大规模场景中扩展MLLMs的推理精度提供了一种高效方法。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model's ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
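
The recursive decomposition can be sketched as a tournament over label chunks. This is a schematic reading of the abstract, not the authors' algorithm: `score_chunk` is a hypothetical callable standing in for one MLLM query over a short candidate list, and the fixed `chunk_size` and `keep_per_chunk` replace whatever dynamic pruning rule the paper uses.

```python
def dci_classify(labels, score_chunk, chunk_size=4, keep_per_chunk=1):
    """Pick one label by scoring small chunks and recursing on the survivors.

    score_chunk(candidates) must return one score per candidate.
    """
    candidates = list(labels)
    if not candidates:
        raise ValueError("labels must not be empty")
    if not 1 <= keep_per_chunk < chunk_size:
        raise ValueError("need 1 <= keep_per_chunk < chunk_size")
    if len(candidates) == 1:
        return candidates[0]
    # Once everything fits in one chunk, keep a single winner so recursion ends.
    keep = 1 if len(candidates) <= chunk_size else keep_per_chunk
    survivors = []
    for start in range(0, len(candidates), chunk_size):
        chunk = candidates[start:start + chunk_size]
        scores = score_chunk(chunk)
        ranked = sorted(zip(scores, chunk), key=lambda pair: pair[0], reverse=True)
        survivors.extend(label for _, label in ranked[:keep])
    return dci_classify(survivors, score_chunk, chunk_size, keep_per_chunk)
```

The point of the structure is that the scorer never sees more than `chunk_size` candidates at once, so each prompt stays short regardless of the size of the label space. The price is extra calls, and a correct label that loses its chunk early cannot be recovered, which is why keeping more than one survivor per chunk can matter.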

2605.24797 2026-05-26 cs.CV 版本更新

HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm

HCL-FF:用于前向-前向算法的分层对比学习

Jie-En Yao, Hong-En Chen, C. -C. Jay Kuo

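Affiliations: University of Southern California

AI summary: To address the Forward-Forward algorithm's lack of hierarchical coordination and its semantically ambiguous features, proposes HCL-FF, which combines a coarse-to-fine hierarchical learning strategy with a supervised contrastive objective and achieves the best performance among FF-based methods on CIFAR-10 and other datasets.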

Comments Accepted by CVPR 2026. Code: https://github.com/JNNNNYao/HCL-FF

详情
AI中文摘要

使用反向传播训练的深度神经网络在视觉任务中取得了显著性能,但仍存在生物学上不合理、计算要求高和难以解释的问题。前向-前向(FF)算法通过局部 goodness 目标函数独立训练每一层,提供了一种有前景的替代方案。然而,其纯局部优化缺乏跨层的分层协调,且将 goodness 与特征解耦导致表示无约束且语义模糊。我们提出分层对比学习FF框架(HCL-FF)来解决这些限制。HCL-FF引入了(1)一种从粗到细的分层学习策略,引导表示从低级线索到高级语义,以及(2)一种监督对比目标,在 goodness 解耦后强制类别判别性对齐。在CIFAR-10、CIFAR-100和Tiny-ImageNet上的实验表明,HCL-FF在基于FF的方法中取得了新的最佳性能,准确率分别提升了+5.46%、+17.00%和+12.51%。

英文摘要

Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of +5.46%, +17.00%, and +12.51%, respectively.
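
As background, the local "goodness" objective of the original Forward-Forward algorithm can be written in a few lines. The sketch follows the commonly used formulation (goodness as the sum of squared activations, compared against a threshold); it is not the HCL-FF objective, which according to the abstract decouples goodness from features and adds a contrastive term.

```python
import math


def goodness(activations):
    """Sum of squared activations of one layer for one input."""
    return sum(a * a for a in activations)


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ff_layer_loss(pos_activations, neg_activations, threshold):
    """Push goodness above the threshold for positive data and below it for negative data."""
    return (softplus(threshold - goodness(pos_activations))
            + softplus(goodness(neg_activations) - threshold))
```

Each layer minimizes this loss on its own activations, with no gradient passed to earlier layers. That layer-by-layer independence is what leaves cross-layer coordination unaddressed, the first gap the abstract sets out to close.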

2605.24794 2026-05-26 cs.CV cs.CL 版本更新

DUEL: Adversarial Self-Play for Multimodal Reasoning

DUEL: 用于多模态推理的对抗性自我对弈

Lin Qiu, Hanqing Zeng, Yao Liu, Bingjun Sun, Guangdeng Liao, Ji Liu

发表机构 * Meta AI

AI总结 提出DUEL框架,通过对抗性自我对弈从预训练VLM生成监督信号,结合长度归一化对数似然奖励,无需人工标注即可提升视觉推理与判别能力。

详情
AI中文摘要

强化学习已成为提升视觉语言模型推理能力的有效范式。然而,基于RL的优化通常依赖于昂贵且难以扩展的高质量标注。现有的无监督替代方案可能因弱视觉基础和缺乏可靠验证信号而偏向有偏解。我们提出一个自我进化的训练后框架DUEL,其中监督信号源于从同一预训练VLM初始化的两个策略之间的对抗性交互。挑战者生成一个基于图像的真实声明及其最小扰动的难负样本,而求解者验证两个声明与图像的一致性,从而在近邻语义下鼓励细粒度视觉判别。为了稳定优化,我们引入长度归一化的对数似然奖励,在二元结果监督之外保留信息性优化信号,并在稀疏反馈下提高学习稳定性。实验表明,DUEL在无需额外人工标注、外部奖励模型或图像编辑工具的情况下,持续提升视觉推理和鲁棒判别能力。

英文摘要

Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.
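
The length normalization mentioned above amounts to averaging token log-probabilities instead of summing them. The sketch shows only that arithmetic; how DUEL turns the quantity into Challenger and Solver rewards is not described in the abstract, and the function names are hypothetical.

```python
def summed_log_likelihood(token_logprobs):
    """Sequence log-likelihood; becomes more negative as the sequence grows."""
    return sum(token_logprobs)


def length_normalized_log_likelihood(token_logprobs):
    """Average per-token log-probability of a sequence."""
    if not token_logprobs:
        raise ValueError("sequence must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)
```

Two responses whose tokens are all assigned log-probability -0.5 score -2.0 and -5.0 under the plain sum when they are 4 and 10 tokens long, but both score -0.5 after normalization, so the signal reflects per-token confidence, not response length.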

2605.24792 2026-05-26 cs.CV cs.AI 版本更新

Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

用于胃肠内窥镜的参数高效视觉语言模型:医学图像生成与临床视觉问答

Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman

发表机构 * Computer Science Department, Morgan State University(摩根州立大学计算机科学系) International Organization for Migration (IOM)(国际移民组织) Electrical & Computer Engineering Department, Morgan State University(摩根州立大学电气与计算机工程系)

AI总结 提出双流水线参数高效微调模型,结合Florence-2和LoRA Stable Diffusion,分别解决临床视觉问答和隐私保护合成数据生成问题,在Kvasir-VQA数据集上取得高ROUGE和BLEU分数,并显著降低计算成本。

详情
AI中文摘要

胃肠内窥镜AI系统的主要局限性源于标注数据短缺、严格的隐私政策以及传统模型微调中的显著瓶颈。这些限制阻碍了复杂AI模型在临床实践中的成功应用,尤其影响了诊断的可靠性和可扩展性。在本文中,我们提出了一种双流水线PEFT模型,解决了两个基本问题:医学视觉问答(VQA)和隐私保护合成数据的生成。对于临床VQA,我们采用Florence-2视觉语言模型。利用PEFT增强了模型的可解释性,同时大幅降低了训练的计算成本。同时,我们使用低秩适应(LoRA)与Stable Diffusion 2.1生成高质量的胃肠图像,在不违反患者隐私的情况下增强训练数据库。本研究使用了Kvasir-VQA数据集。我们的Florence-2 VQA模型实现了ROUGE-1为0.92,ROUGE-L为0.91,BLEU分数从0.08提升到0.24。在私有数据集上的微调始终优于在公共数据集上的微调。秩为4的LoRA合成达到了最优性能,保真度得分为0.290,一致性得分为0.730,Frechet BiomedCLIP距离(FBD)为1450,计算成本降低了近90%。该框架提高了AI在胃肠内窥镜中的临床潜力。与FLUX、MSDM和Kandinsky 2.2相比,我们的模型表现出更优的FBD和强语义对齐。虽然其他模型在保真度或一致性上领先,但我们更低的FBD表明更好的图像-文本一致性。这些结果确立了我们的方法作为增强临床AI中VQA和合成数据生成的稳健解决方案。

英文摘要

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.

2605.24789 2026-05-26 cs.CV eess.IV 版本更新

Self-Supervised Contrastive Learning for Cardiac MR Sequence Classification

自监督对比学习用于心脏磁共振序列分类

Yuli Wang, Hyewon Jung, Dongshen Peng, Yuwei Dai, Jing Wu, Haoyue Guan, Yoko Kato, Zhicheng Jiao, Yu Sun, Ihab Kamel, Joao Lima, Cheng Ting Lin, Harrison Bai

发表机构 * Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine(放射科与放射科学系,约翰霍普金斯大学医学院) Department of Electrical and Computer Engineering, Johns Hopkins University(电气与计算机工程系,约翰霍普金斯大学) Department of Computer Science, University of North Carolina at Chapel Hill(计算机科学系,北卡罗来纳大学教堂山分校) Department of Radiology, University of Colorado Denver Anschutz Medical Campus(放射科,科罗拉多大学丹佛分校安舒茨医学中心) Department of Radiology, Second Xiangya Hospital, Central South University(放射科,中南大学湘雅二医院) Department of Cardiology, Johns Hopkins University School of Medicine(心血管科,约翰霍普金斯大学医学院) Department of Diagnostic Imaging, Brown University Health(诊断影像科,布朗大学健康中心)

AI总结 针对预训练ViT在心脏MR领域迁移效果差的问题,提出基于图像的自监督对比学习适应策略,在内部数据集上优于监督训练,并泛化到外部MR数据集,四个常见序列分类AUC超过0.75。

详情
AI中文摘要

利用自注意力机制的视觉Transformer(ViT)模型在各种视觉任务(包括图像分类)中展现出强大的泛化能力。然而,这些通常在通用公共数据集上预训练的模型往往缺乏医学成像应用所需的专门领域知识。在本研究中,我们使用内部数据集调查了ViT模型对心脏磁共振(MR)图像的适应情况。我们发现预训练的ViT特征不能有效地迁移到心脏MR领域。为了克服这一限制,我们引入了一种利用基于图像的自监督对比学习的适应策略,与传统的监督训练方法相比,表现出优越的性能。此外,我们适应的ViT模型对外部MR数据集(如BraTS和ADNI)表现出强大的泛化能力。通过消融研究,我们进一步研究了批次大小和数据集规模对性能的影响。最终,我们的适应模型在四种最常见的心脏MR序列上实现了超过0.75的分类AUC。

英文摘要

Vision Transformer (ViT) models, utilizing self-attention mechanisms, have demonstrated robust generalization capabilities across various vision tasks, including image classification. However, these models, typically pretrained on general public datasets, often lack the specialized domain knowledge necessary for medical imaging applications. In this study, we investigate the adaptation of ViT models, specifically for cardiac magnetic resonance (MR) images, using an in-house dataset. We found that pretrained ViT features do not effectively transfer to the cardiac MR domain. To overcome this limitation, we introduce an adaptation strategy that utilizes image-based self-supervised contrastive learning, demonstrating superior performance compared to traditional supervised training approaches. Moreover, our adapted ViT model exhibits strong generalization to external MR datasets such as BraTS and ADNI. Through ablation studies, we further investigate the impact of batch size and dataset scale on performance. Ultimately, our adapted model achieves classification AUC exceeding 0.75 across the four most common cardiac MR sequences.

2605.24776 2026-05-26 cs.CV 版本更新

How Noisy Poses Break Inverse Dynamics: Analysis and Mitigation for Video-Based Joint Torque Estimation

噪声姿态如何破坏逆动力学:基于视频的关节力矩估计的分析与缓解

Donghyun Kim, Chanyoung Kim, Eunseo Jeong, Youngjoong Kwon, Seong Jae Hwang

发表机构 * Emory University(埃默里大学) Yonsei University(延世大学)

AI总结 本文系统分析了3D人体姿态估计噪声通过逆动力学放大关节力矩误差的问题,提出SMPL-Dynamics模块并通过可微姿态优化将力矩误差降低93%。

详情
AI中文摘要

单目3D人体姿态估计的最新进展使得从视频中实现精确的身体跟踪成为可能。然而,由于逆动力学中的噪声放大,将这些运动学估计转化为物理量(如关节力矩)仍然具有挑战性。在这项工作中,我们系统分析了姿态估计噪声如何通过逆动力学管道传播。我们提出了三个关键发现:(1)通过数值微分计算关节力矩时,姿态噪声被放大约1000倍;(2)近端关节(脊柱、髋部)对噪声的敏感度最高可达远端关节(手腕、手)的10倍;(3)在微分之前进行低通滤波可显著减少这种放大。为了支持这一分析,我们开发了SMPL-Dynamics,这是一个用于SMPL人体模型的完全可微逆动力学模块,无需外部物理模拟器。我们的模块支持端到端梯度计算,并通过可微姿态优化证明了这一点,该优化将力矩误差降低了93%,而姿态变化可忽略不计。

英文摘要

Recent advances in monocular 3D human pose estimation enable accurate body tracking from video. However, translating these kinematic estimates into physical quantities, such as joint torques, remains challenging due to noise amplification through inverse dynamics. In this work, we provide a systematic analysis of how pose estimation noise propagates through the inverse dynamics pipeline. We present three key findings: (1) pose noise is amplified by approximately 1,000x when computing joint torques via numerical differentiation, (2) proximal joints (spine, hips) are up to 10x more sensitive to noise than distal joints (wrists, hands), and (3) low-pass filtering before differentiation substantially reduces this amplification. To enable this analysis, we develop SMPL-Dynamics, a fully differentiable inverse dynamics module for the SMPL body model that requires no external physics simulators. Our module supports end-to-end gradient computation, and we demonstrate this through differentiable pose refinement, which reduces torque error by 93% with negligible change in pose.
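
The amplification mechanism in findings (1) and (3) can be illustrated with a small, self-contained experiment. This is a sketch under stated assumptions, not the paper's pipeline: it uses white Gaussian noise only, a central second difference as the numerical differentiator, a 5-tap moving average as a stand-in low-pass filter, and a 30 fps frame rate. For independent noise with standard deviation sigma, the second difference (n[i+1] - 2 n[i] + n[i-1]) / dt^2 has standard deviation sqrt(6) * sigma / dt^2, so the gain grows with the square of the frame rate.

```python
import math
import random


def second_difference(x, dt):
    """Central finite-difference estimate of the second derivative."""
    return [(x[i + 1] - 2.0 * x[i] + x[i - 1]) / dt ** 2
            for i in range(1, len(x) - 1)]


def moving_average(x, window=5):
    """Simple low-pass filter (stand-in for whatever filter is used in practice)."""
    half = window // 2
    return [sum(x[i - half:i + half + 1]) / window
            for i in range(half, len(x) - half)]


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


rng = random.Random(0)      # fixed seed so the run is repeatable
dt = 1.0 / 30.0             # assumed 30 fps video
sigma = 0.01                # assumed pose noise level (arbitrary units)
noise = [rng.gauss(0.0, sigma) for _ in range(2000)]

theory_gain = math.sqrt(6.0) / dt ** 2
raw_gain = rms(second_difference(noise, dt)) / sigma
filtered_gain = rms(second_difference(moving_average(noise), dt)) / sigma
```

The specific gain depends on the frame rate and differentiation scheme assumed here, so it should not be read as a reproduction of the roughly 1,000x figure reported in the abstract; the point is only that filtering before differentiation lowers the acceleration-noise gain substantially.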

2605.24771 2026-05-26 cs.CV cs.AI cs.LG 版本更新

From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks

从理论到决策规则:校准视觉-语言模型弱监督的噪声标签交叉点——基于三个医学影像基准

Bruce Changlong Xu, Jose James, Alexander Ryu

发表机构 * Department of Computer Science, Stanford University(计算机科学系,斯坦福大学)

AI总结 通过三个医学影像基准校准理论预测的噪声标签交叉点,提出基于少量金标标签的决策规则。

Comments 5 pages, 2 figures, 4 tables

详情
AI中文摘要

经典的噪声标签理论预测,弱监督下的下游性能上限是标注者的准确率,这意味着一个尖锐的交叉点:一旦金标训练的分类器达到标注者的水平,弱标签就会从帮助变为伤害。该预测是理论性的;缺少的是将其转化为现代基础模型标注者的实例级陈述的基准校准。我们针对BiomedCLIP生成的弱标签,在三个医学影像基准(PCAM、ISIC、NIH-CXR)和六个跨越11倍参数范围的下游架构上提供了这样的校准。理论预测的交叉点出现在PCAM上约100个样本,ISIC上20-50个,NIH-CXR上250-500个;交叉点以上的弱标签使AUC降低高达-0.10。对于五个预训练架构中的四个,交叉点位置与架构无关,而一个家族内的DenseNet扫描(2.5倍参数,相同预训练)支持了标注者(而非学生)是主要约束的观点。该校准进而产生一个可在10-20个金标标签下操作的决策规则:比较仅金标AUC与用户金标集上的VLM准确率。NIH-CXR上的结构化与随机噪声符号翻转表明,该界限的仅速率形式是不完整的,并确定了一个具体的改进(标签空间投影),未来的基准可以设计来测试它。

英文摘要

Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop helping and start hurting. The prediction is theoretical; what is missing is a benchmark calibration that turns it into an instance-level statement for modern foundation-model labelers. We provide such a calibration for BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) and six downstream architectures spanning an 11x parameter range. The crossover predicted by theory appears at n_g ≈ 100 on PCAM, 20-50 on ISIC, and 250-500 on NIH-CXR; weak labels above the crossover degrade AUC by up to 0.10. The location is architecture-invariant for four of five pretrained architectures, and a within-family DenseNet sweep (2.5x parameters, identical pretraining) supports the view that the labeler, not the student, is the dominant constraint. The calibration in turn produces a decision rule operable from 10-20 gold labels: compare gold-only AUC to VLM accuracy on the user's gold set. A structured-vs-random noise sign flip on NIH-CXR shows that the rate-only formulation of the bound is incomplete and identifies a concrete refinement (label-space projection) that future benchmarks can be designed to test.
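
The decision rule stated in the abstract (compare gold-only AUC to VLM accuracy on the user's gold set) is simple enough to sketch directly. The function names, the strict inequality, and the example labels are assumptions for illustration; the abstract does not say how ties or estimation noise at 10-20 labels are handled.

```python
def labeler_accuracy(gold_labels, vlm_labels):
    """Fraction of the user's gold labels that the VLM labeler reproduces."""
    if not gold_labels or len(gold_labels) != len(vlm_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(g == v for g, v in zip(gold_labels, vlm_labels))
    return matches / len(gold_labels)


def should_add_weak_labels(gold_only_auc, vlm_accuracy):
    """Assumed rule: weak labels are worth adding only while the gold-only
    classifier is still below the labeler's accuracy (pre-crossover)."""
    return gold_only_auc < vlm_accuracy


# Hypothetical 10-label gold set where the VLM agrees on 8 labels.
gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
vlm = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
acc = labeler_accuracy(gold, vlm)
```

With so few gold labels both quantities are noisy estimates, so in practice one would likely want a margin or confidence interval around the comparison; that refinement is not described in the abstract.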

2605.24770 2026-05-26 cs.LG cs.CV 版本更新

Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra

Muon在视觉Transformer中的应用:优化器-训练方案交互与梯度谱

Ben S. Southworth, Shuai Jiang, Daniel McBride, Eric C. Cyr, Stephen Thomas

发表机构 * Los Alamos National Laboratories(洛斯阿拉莫斯国家实验室) Sandia National Laboratories(桑迪亚国家实验室) Lehigh University(理海大学)

AI总结 研究Muon优化器在视觉Transformer训练中的表现,发现其优于AdamW,且增益依赖于数据增强,通过梯度奇异值分析揭示Muon与AdamW在注意力投影和深层前馈块中的谱差异。

Comments 25 pages, 15 figures

详情
AI中文摘要

Muon是一种最近开发的矩阵感知优化器,在Transformer训练中表现出色,但其在视觉Transformer(ViT)中的行为尚不明确。我们研究Muon在ViT训练中的应用,主要在ImageNet-100和Pl@ntNet-300K上,与AdamW在涉及mixup、cutmix、平滑以及随机增强和擦除的标准视觉方案下进行比较。Muon始终优于AdamW,在长尾Pl@ntNet宏观top-1上尤其显著。这些增益也依赖于数据增强方案,Muon从高级和显著的数据增强技术中获益远大于AdamW。为了理解这种交互,我们分析了整个ViT中矩阵梯度的奇异值结构。在Muon训练中,去除重度数据增强会导致训练后期梯度矩阵的谱集中和模式坍塌,主要发生在深层MLP-down块中。在固定的“完整”增强方案下,Muon与AdamW最明显的对比出现在QKV梯度中,其中AdamW梯度能量集中在更窄的基上,而Muon将能量分散到更多的奇异模式上。因此,ViT中的Muon最好理解为一种优化器-数据增强交互。在固定方案下,Muon与AdamW最明显的区别在于注意力投影,其梯度由更宽的谱基组成。在Muon内部,完整的训练方案对于防止深层前馈块中的后期谱集中和模式坍塌很重要。我们进一步展示了在图像分割和掩码自编码器模型上训练ViT的效果,Muon在所有考虑的设置中均优于AdamW。

英文摘要

Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, smoothing, and random augmentation and erasing. Muon consistently outperforms AdamW, with especially large gains on long-tailed Pl@ntNet macro top-1. These gains are also recipe-dependent, where Muon benefits much more than AdamW from advanced and significant data augmentation techniques. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed "full" augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction. Under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients consist of a broader spectral basis. Within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs on image segmentation and masked autoencoder models, where Muon outperforms AdamW in all settings considered.
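
The abstract contrasts gradients whose energy is "concentrated in a much narrower basis" with gradients that spread energy "across substantially more singular modes", but it does not name the statistic used. One common way to quantify this, shown here purely as an assumed example, is the entropy-based effective rank of the singular-value energy distribution. The singular values are taken as given (for instance from an SVD routine of the reader's choice).

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the normalized energy p_i = s_i^2 / sum_j s_j^2.

    Returns 1.0 when all energy sits in one mode and k when energy is
    spread evenly over k modes. This metric is an assumption, not
    necessarily the one used in the paper.
    """
    energy = [s * s for s in singular_values]
    total = sum(energy)
    if total == 0.0:
        raise ValueError("all singular values are zero")
    entropy = 0.0
    for e in energy:
        p = e / total
        if p > 0.0:          # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return math.exp(entropy)


concentrated = effective_rank([10.0, 1.0, 1.0, 1.0])   # one dominant mode
spread = effective_rank([3.0, 3.0, 2.0, 2.0])          # energy shared by all modes
```

Tracking such a number per layer over training would expose both effects described above: a late drop in deep MLP blocks when augmentation is removed, and a persistent gap between optimizers in the QKV projections.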

2605.24769 2026-05-26 cs.CV cs.AI eess.IV 版本更新

Leveraging pretrained RGB denoisers for hyperspectral image restoration

利用预训练RGB去噪器进行高光谱图像恢复

Daniele Picone, Mohamad Jouni, Mauro Dalla-Mura

发表机构 * Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab(格勒诺布尔阿尔卑斯大学、法国国家科学研究中心、格勒诺布尔INP、GIPSA实验室)

AI总结 提出一种轻量级适配器,通过投影映射重用冻结的预训练RGB去噪器,实现高光谱图像的去噪、去模糊和超分辨率恢复,实验表明RGB先验具有良好的迁移性。

详情
AI中文摘要

高光谱图像恢复面临若干挑战,包括训练数据有限、传感器特异性强以及光谱维度高。这些限制阻碍了鲁棒高光谱先验的学习,促使我们重用从大规模RGB数据中学到的先验。在这项工作中,我们提出了一种最小训练的轻量级适配器,通过投影映射将冻结的预训练RGB去噪器重新用于高光谱恢复。该方法对低维光谱投影进行去噪,并通过约束线性聚合重建高光谱立方体,同时保持即插即用的兼容性和底层RGB去噪器的稳定性。在多个数据集上的去噪、去模糊和超分辨率实验表明,该方法持续优于高光谱专用基线,显示了大规模RGB先验的强迁移性。

英文摘要

Hyperspectral image restoration faces several challenges, including limited training data, strong sensor specificity, and high spectral dimensionality. These limitations hinder the learning of robust hyperspectral priors, motivating the reuse of priors learned from large-scale RGB data. In this work, we propose a minimally trained, lightweight adapter that repurposes frozen pretrained RGB denoisers for hyperspectral restoration through a projection mapping. The method denoises low-dimensional spectral projections and reconstructs the hyperspectral cube through constrained linear aggregation, while preserving plug-and-play compatibility and the stability properties of the underlying RGB denoiser. Experiments on denoising, deblurring, and super-resolution across multiple datasets demonstrate consistent improvements over hyperspectral-specific baselines, showing the strong transferability of large-scale RGB priors.

2605.24762 2026-05-26 cs.CV 版本更新

4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation

4KLSDB:用于4K图像恢复与生成的大规模数据集

Zihao Zhu, Kuan-Ru Huang, Zhaoming Xu, Renjie Li, Bo Wu, Ruizheng Bai, Mingyang Wu, Sayak Paul, Zhengzhong Tu

发表机构 * Texas A&M University(德克萨斯A&M大学) Hugging Face

AI总结 为解决现有数据集缺乏原生4K分辨率和规模的问题,提出包含129,484张4K图像的大规模数据集4KLSDB,并通过多阶段自动过滤和标注确保质量,实验证明其在超分辨率和扩散模型训练中能显著提升4K基准性能。

Comments Accepted to the DataCV Workshop at CVPR 2026; 10 pages, 4 figures, 7 tables; Our project page is available at: https://4klsdb.github.io/

详情
AI中文摘要

高分辨率数据集对于推进超分辨率(SR)和文本到图像(T2I)扩散研究至关重要。然而,当前公开可用的数据集既缺乏原生4K分辨率,也缺乏训练最先进模型所需的大规模。为解决这一差距,我们引入了一个4K大规模数据集与基准(4KLSDB),这是一个大规模、多样化的数据集,包含129,484张精心策划的4K分辨率图像,涵盖自然、城市景观、人物、食物、艺术品和CGI等多个类别,以及分别包含2,000和1,984张图像的独立验证集和测试集。图像来源于已建立的开放数据集,包括Photo Concept Bucket、Laion2B和PD12M。4KLSDB经历了严格的多阶段自动过滤和标注流程,涉及人工标注员和大规模多模态模型(LMMs),以确保高美学质量和数据集一致性。我们通过训练代表性的超分辨率和扩散模型来证明4KLSDB的有效性,观察到在原生4K基准上性能的显著提升。综合实验表明,在真实4K分辨率数据上训练与图像恢复任务中保真度的提高之间存在正相关,尤其是在4K分辨率下。我们通过提供4KLSDB,为研究社区提供宝贵资源,以推动真正高保真图像合成与恢复的进展。我们的项目页面位于:https://4klsdb.github.io/。

英文摘要

High-resolution datasets are essential for advancing super-resolution (SR) and text-to-image (T2I) diffusion research. However, current publicly available datasets lack both the native 4K resolution and the extensive scale necessary for training state-of-the-art models. To address this gap, we introduce a 4K Large Scale Dataset and Benchmark (4KLSDB), a large-scale, diverse dataset consisting of 129,484 carefully curated 4K resolution images spanning multiple categories such as nature, urban scenes, people, food, artwork, and CGI, alongside distinct validation and test sets containing 2,000 and 1,984 images respectively. Images were sourced from established open datasets including Photo Concept Bucket, Laion2B, and PD12M. 4KLSDB underwent rigorous multi-stage automated filtering and annotation pipelines involving both human annotators and Large Multimodal Models (LMMs) to ensure high aesthetic quality and dataset consistency. We demonstrate 4KLSDB's effectiveness by training representative super-resolution and diffusion models, observing significant improvements in performance on native 4K benchmarks. Comprehensive experiments illustrate a positive correlation between training on true 4K resolution data and improved fidelity in image restoration tasks, especially at 4K resolution. With 4KLSDB, we provide the research community with a valuable resource to drive progress toward genuinely high-fidelity image synthesis and restoration. Our project page is available at: https://4klsdb.github.io/.

2605.24761 2026-05-26 cs.CV cs.RO 版本更新

Drift-Resistant Navigation World Model with Anchored Epipolar Guidance

抗漂移导航世界模型与锚定对极引导

Po-Chien Luan, Zimin Xia, Wuyang Li, Yang Gao, Alexandre Alahi

发表机构 * EPFL(瑞士联邦理工学院)

AI总结 提出一种抗漂移导航世界模型,通过锚定引导滚动和双向对极几何约束,同时减轻感知漂移和几何漂移,提升长期视觉质量、几何一致性和多视图连贯性。

详情
AI中文摘要

我们提出抗漂移导航世界模型,这是一种生成模型,可减轻传统基于滚动的导航世界模型中的感知漂移和几何漂移。现有方法递归地将生成内容馈送到后续步骤,导致噪声累积和预测退化,即感知漂移。同时,它们的预测通常偏离智能体的运动,导致几何漂移。我们通过将世界模型预测重新设计为锚定引导滚动来解决这两种漂移。我们不顺序滚动每一帧,而是首先预测稀疏的未来锚点,作为稳定的长期目标,然后生成每个块内的中间帧,这些帧以过去上下文和未来锚点为条件。重要的是,这些稀疏锚点还提供几何约束,由双向对极几何支持,以定位中间帧中相应内容应出现的位置。在四个基准上的实验表明,在长期视觉质量、几何一致性和多视图连贯性方面,相对于强基线有一致的改进。这些提升进一步转化为相同规划器下下游规划性能的提高,突显了抗漂移、几何感知预测对于可靠导航世界模型的重要性。

英文摘要

We propose Drift-Resistant Navigation World Model, a generative model that mitigates both perceptual drift and geometric drift in conventional rollout-based navigation world models. Existing methods recursively feed generated content into subsequent steps, causing noise accumulation and degraded predictions, i.e., perceptual drift. Meanwhile, their predictions often deviate from the agent's motion, resulting in geometric drift. We address both types of drift by redesigning world-model prediction as an anchor-guided rollout. Instead of rolling out every frame sequentially, we first predict sparse future anchors that serve as stable long-range targets, and then generate intermediate frames within each chunk conditioned on both past context and future anchors. Importantly, these sparse anchors also provide geometric constraints, supported by bidirectional epipolar geometry, to localize where corresponding content should appear in the intermediate frames. Experiments on four benchmarks demonstrate consistent improvements over strong baselines in long-horizon visual quality, geometric consistency, and multi-view coherence. These gains further translate into improved downstream planning performance under the same planners, highlighting the importance of drift-resistant, geometry-aware prediction for reliable navigation world models.
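
As background for the epipolar constraint mentioned above, here is a textbook sketch (not the paper's guidance module, whose details the abstract does not give). Given a fundamental matrix F between two views, a pixel x in the first view must appear on the line F x in the second view, and a pixel x' in the second view constrains the first view through the line F^T x'; using both directions is one natural reading of "bidirectional". The example F corresponds to a pure sideways camera translation and is an illustrative assumption.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]


def epipolar_line(f, point_h):
    """Line (a, b, c) with a*x + b*y + c = 0 in the other view for a
    homogeneous pixel point_h = (x, y, 1)."""
    return mat_vec(f, point_h)


def point_line_distance(line, point_xy):
    """Pixel distance from a 2D point to a line (a, b, c)."""
    a, b, c = line
    x, y = point_xy
    return abs(a * x + b * y + c) / math.hypot(a, b)


# Fundamental matrix for pure translation along the x axis (assumed example):
# corresponding points must share the same image row.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

forward_line = epipolar_line(F, [10.0, 20.0, 1.0])              # view 1 -> view 2
backward_line = epipolar_line(transpose(F), [55.0, 20.0, 1.0])  # view 2 -> view 1
```

A small point-to-line distance indicates geometrically consistent content placement, so a quantity like this could serve as a localization cue or a consistency check between an anchor frame and an intermediate frame.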

2605.24754 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Motion-Compensated Weight Compression

运动补偿权重压缩

Ismail Lamaakal

发表机构 * Multidisciplinary Faculty of Nador Mohammed Premier University(穆罕默德一世大学纳祖尔多学科学院)

AI总结 提出运动补偿权重压缩(MCWC)方法,通过对齐置换对称块并利用层序预测和熵编码,有效压缩神经网络权重,在Transformer语言建模和视觉分类任务中提升率-精度帕累托前沿。

Comments 54 pages, 17 tables, 6 Figures

详情
AI中文摘要

神经网络权重日益成为部署的瓶颈,然而大多数压缩流水线独立处理各层,忽略了由函数保持对称性引起的跨层冗余。我们提出运动补偿权重压缩(MCWC),一种仅权重的编解码器,它对齐置换对称块(例如隐藏单元和注意力头)以最大化跨层对应,将深度转化为可预测序列。在对齐的坐标系中,MCWC使用带有周期性关键帧的轻量级层序预测器,并仅编码在率失真目标下训练的学习熵模型预测残差。一个简单的解码器通过熵解码、反量化、预测驱动重建和逆对齐来重建可部署的权重,从而实现快速权重物化以进行推理。在Transformer语言建模和视觉分类中,MCWC在强量化和学习权重编解码基线之上改善了率-精度帕累托前沿,同时保持有竞争力的解码时间。消融实验证实,对齐、预测、熵建模和关键帧调度对于获得全部增益都是必要的。我们的代码可通过 https://github.com/Ism-ail11/MCWC 获取。

英文摘要

Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook cross-layer redundancy induced by function-preserving symmetries. We propose Motion-Compensated Weight Compression (MCWC), a weight-only codec that aligns permutation-symmetric blocks (e.g., hidden units and attention heads) to maximize cross-layer correspondence, turning depth into a predictable sequence. In the aligned coordinate system, MCWC uses a lightweight layer-sequential predictor with periodic keyframes and encodes only quantized prediction residuals using a learned entropy model trained under a rate-distortion objective. A simple decoder reconstructs deployable weights by entropy decoding, dequantization, predictor-driven reconstruction, and inverse alignment, enabling fast weight materialization for inference. Across Transformer language modeling and vision classification, MCWC improves the rate-accuracy Pareto frontier over strong quantization and learned weight-codec baselines, while maintaining competitive decode time. Ablations confirm that alignment, prediction, entropy modeling, and keyframe scheduling are each necessary for the full gains. Our code is available via https://github.com/Ism-ail11/MCWC.
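
The keyframe-plus-residual structure described above mirrors video coding, and a toy version may help make it concrete. The sketch below is a deliberately simplified stand-in, not the released MCWC code: it assumes layers are already aligned, uses "previous reconstructed layer" as the predictor, applies uniform scalar quantization, and omits the learned entropy model entirely. All names and default values are hypothetical.

```python
def encode(layers, step=0.05, keyframe_every=4):
    """Turn a list of weight vectors into (is_keyframe, quantized_symbols)."""
    symbols, prev_recon = [], None
    for i, weights in enumerate(layers):
        is_key = (i % keyframe_every == 0)
        # Keyframes are coded directly (prediction = 0); other layers are
        # predicted from the previous *reconstructed* layer to avoid drift.
        pred = [0.0] * len(weights) if is_key else prev_recon
        q = [round((w - p) / step) for w, p in zip(weights, pred)]
        symbols.append((is_key, q))
        prev_recon = [p + qi * step for p, qi in zip(pred, q)]
    return symbols


def decode(symbols, step=0.05):
    """Dequantize and apply predictor-driven reconstruction."""
    layers, prev_recon = [], None
    for is_key, q in symbols:
        pred = [0.0] * len(q) if is_key else prev_recon
        prev_recon = [p + qi * step for p, qi in zip(pred, q)]
        layers.append(prev_recon)
    return layers


# Hypothetical aligned layers that change slowly with depth.
layers = [
    [0.50, -0.30, 0.10],
    [0.52, -0.28, 0.11],
    [0.55, -0.31, 0.09],
    [0.53, -0.29, 0.12],
    [0.10, 0.40, -0.20],
]
symbols = encode(layers)
recon = decode(symbols)
```

When adjacent layers are similar, the residual symbols cluster near zero, which is what makes them cheap to entropy-code; periodic keyframes bound how far a poor prediction can propagate.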

2605.24753 2026-05-26 cs.CV 版本更新

Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain

点云中的鬼影:瞬态域中的LiDAR去眩光

Avery Gump, Connor Henley, Sungjin Cheong, Akarsh Prabhakara, Mohit Gupta

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 针对固态LiDAR内部多径眩光导致的伪影问题,提出基于瞬态眩光扩散函数(TGSF)的物理模型和无训练算法,在点云形成前抑制眩光,保留真实场景结构。

Comments CVPR 2026

详情
AI中文摘要

现代LiDAR正迅速从笨重的机械扫描系统过渡到超紧凑、低成本、固态阵列。这种微型化在实现可扩展性、经济性和类似相机的数据结构的同时,引入了一种新的严重故障模式:内部多径眩光。当来自明亮或高反射表面的光在LiDAR内部反射和散射时,本应到达单个像素的光会扩散到像素阵列上。由此产生的伪影会创建幻影物体、遮挡真实物体,并产生安全关键的“点云中的鬼影”。本文介绍了一种基于物理的传感模型和算法技术来解决这一效应。我们表明,内部眩光可以表示为作用于瞬态测量的线性、场景无关算子——瞬态眩光扩散函数(TGSF)。基于此模型,我们开发了一种无训练方法,在点云形成之前对低级LiDAR检测(或回波)进行操作,利用眩光扩散函数的知识来推理每个检测来自眩光的可能性。该方法与现有LiDAR信号处理流水线兼容,可在未经修改的商业传感器上部署。通过使用真实单光子LiDAR硬件的实验,我们证明了在保留真实场景结构的同时,显著抑制了严重眩光伪影。

英文摘要

Modern LiDARs are rapidly transitioning from bulky, mechanically scanned systems to ultra-compact, low-cost, solid-state arrays. This miniaturization, while enabling scalability, affordability, and camera-like data structures, introduces a new and severe failure mode: internal-multipath glare. When light from a bright or retroreflective surface reflects and scatters within the LiDAR, light that should reach a single pixel spreads across the pixel array. The resulting artifacts create phantom objects, obscure real ones, and produce safety-critical "ghosts in the point clouds." This paper introduces a physically grounded sensing model and algorithmic techniques for addressing this effect. We show that internal glare can be represented as a linear, scene-independent operator, the Transient Glare Spread Function (TGSF), acting on the transient measurements. Building on this model, we develop a training-free approach that operates on low-level LiDAR detections (or echoes) prior to point-cloud formation, leveraging knowledge of the glare spread function to reason about the likelihood of each detection arising from glare. The resulting approach is compatible with existing LiDAR signal-processing pipelines, and deployable on unmodified commercial sensors. Using experiments with real single-photon LiDAR hardware, we demonstrate substantial suppression of severe glare artifacts while preserving true scene structure.

2605.24726 2026-05-26 cs.CV 版本更新

From Full Boards to Tiny Defects: Scale-Aware Tile Inference with Topology-Aware Merging for High-Resolution PCB Defect Detection

从整板到微小缺陷:面向高分辨率PCB缺陷检测的尺度感知瓦片推理与拓扑感知合并

Mohammad Alijanpour Shalmani, Alale Rezvani Boroujeni, Ali Amini, Jiann Shiun Yuan

发表机构 * Dept. of Electrical and Computer Engineering(电气与计算机工程系) Dept. of Marketing(市场营销系) Centre of Real Time Computer Systems, Faculty of Informatics, Kaunas University of Technology(实时计算机系统中心,信息学院,考纳斯理工大学)

AI总结 针对高分辨率PCB图像缩放导致微小缺陷丢失的问题,提出基于瓦片推理的尺度一致训练策略和拓扑感知合并方法,无需重新训练即可显著提升缺陷检测精度。

详情
AI中文摘要

高分辨率印刷电路板(PCB)检测在将整板图像缩放到标准检测器输入时存在分辨率崩溃问题:微尺度缺陷缩小到几个像素而被遗漏。基于瓦片的推理保留了局部细节,但在瓦片边缘引入边界伪影,导致分割检测和假阴性。我们提出了五种推理策略的系统比较,在两个高分辨率PCB缺陷数据集PCB-Defect(230张图像,1704个标注)和HRIPCB(693张图像,2953个标注)上评估,涵盖六类缺陷。我们表明训练-推理尺度一致性至关重要:在全图像上训练的检测器在瓦片推理下mAP@50崩溃至0.01,而同一架构在640×640瓦片裁剪上训练时在两个数据集上分别达到0.72和0.94。我们进一步利用拓扑感知瓦片合并(TA-TM),一种无需训练的后处理方法,构建瓦片邻接图,并在全局NMS之前使用邻瓦片一致性调整边界敏感检测分数。在两个数据集中,添加128像素瓦片重叠将边界区域召回率从约26-63%提升至约70-100%,TA-TM在两个基准上均达到最佳mAP@50,且瓦片推理恢复了全图像方法完全遗漏的46-100%的小缺陷。结果在不同数据集上一致,证实了所提出策略的泛化性。TA-TM无需重新训练且架构无关,可直接应用于现有PCB检测流水线。

英文摘要

High-resolution printed circuit board (PCB) inspection suffers from resolution collapse when full-board images are resized to standard detector inputs: micro-scale defects shrink to a few pixels and are missed. Tile-based inference preserves local detail but introduces boundary artefacts at tile edges, causing split detections and false negatives. We present a systematic comparison of five inference strategies evaluated on two high-resolution PCB defect datasets, PCB-Defect (230 images, 1,704 annotations) and HRIPCB (693 images, 2,953 annotations), spanning six defect classes. We show that training-inference scale consistency is critical: a detector trained on full images collapses to mAP@50 = 0.01 under tile inference, while the same architecture trained on 640×640 tile crops achieves 0.72 and 0.94 on the two datasets respectively. We further exploit Topology-Aware Tile Merging (TA-TM), a training-free post-processing method that builds a tile-adjacency graph and adjusts boundary-sensitive detection scores using neighbour-tile agreement before global NMS. Across both datasets, adding 128 px tile overlap raises boundary-zone recall from ~26-63% to ~70-100%, TA-TM achieves the best mAP@50 on both benchmarks, and tile inference recovers 46-100% of small defects missed entirely by full-image methods. Results are consistent across datasets, confirming the generalizability of the proposed strategy. TA-TM requires no retraining and is architecture-agnostic, making it directly applicable to existing PCB inspection pipelines.
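
To make the tiling setup concrete, the following sketch generates overlapping tile windows using the tile size (640) and overlap (128 px) quoted in the abstract. The layout policy, in particular snapping the last tile flush to the image border, is an assumption; the abstract does not describe how edge tiles are placed, and TA-TM itself is not reproduced here.

```python
def tile_starts(length, tile=640, overlap=128):
    """Start offsets along one axis so that tiles cover [0, length)."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)   # last tile sits flush with the border
    return starts


def tile_boxes(width, height, tile=640, overlap=128):
    """All tile windows as (x0, y0, x1, y1), clipped to the image size."""
    boxes = []
    for y0 in tile_starts(height, tile, overlap):
        for x0 in tile_starts(width, tile, overlap):
            boxes.append((x0, y0, min(x0 + tile, width), min(y0 + tile, height)))
    return boxes
```

For a 2000-pixel axis this yields starts at 0, 512, 1024, and 1360, so the final pair of tiles overlaps by more than 128 px. Detections from each window would then be shifted by (x0, y0) back into full-image coordinates before any merging step.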

2605.24722 2026-05-26 cs.CV 版本更新

Calibrating Probabilistic Object Detectors with Annotator Disagreement

校准具有标注者分歧的概率目标检测器

Zhi Qin Tan, Owen Addison, Yunpeng Li

发表机构 * Faculty of Dentistry, Oral & Craniofacial Sciences, King's College London, London, United Kingdom

AI总结 针对目标检测中因物体模糊性导致标注者分歧的问题,提出一种无需真实标注即可校准概率目标检测器的方法,通过设计分类和定位校准误差指标及训练时/事后校准器,使模型预测不确定性匹配标注分布。

详情
AI中文摘要

对于模糊物体(例如医学图像),标注者之间可能存在高度分歧,这凸显了在目标检测任务中建立真实标注的挑战。尽管如此,所有现有的目标检测器都隐式地需要访问真实标注以进行训练或评估。我们针对的基本问题是:如何利用多个标注者的标注(但缺乏因物体模糊性导致的客观真实标注)来学习目标检测器,以及如何使学习到的检测器在检测模糊物体时表达有意义的模型预测不确定性?为了回答这些问题,我们提出了一种可解释的方法来校准概率目标检测器,其校准目标是将类别置信度和边界框方差估计与标注者的标注分布对齐。我们引入了一个高效且有效的框架来校准概率目标检测器,通过设计四个评估指标来衡量分类和定位的校准误差,并提出了一种训练时校准和后处理校准器,所有这些都无需访问任何真实标注。该框架可推广到许多现有的概率目标检测器,例如YOLO系列和两阶段检测器。在医学和自然图像的真实世界和合成数据集上的实验结果表明,所提出的框架与三种流行的目标检测器相结合具有优越的性能。

英文摘要

High degrees of disagreement among annotators can exist for ambiguous objects, e.g. in medical images, underscoring the challenges of establishing ground truth annotations in object detection tasks. Despite this, all existing object detectors implicitly require access to ground truth annotations for either training or evaluation. The fundamental questions we target are: How can we learn an object detector with multiple annotators' annotations but without objective ground truth annotations due to object ambiguity, and how can we enable the learned detector to express meaningful model predictive uncertainties in detecting ambiguous objects? To answer these questions, we present an interpretable approach to calibrate probabilistic object detectors, where the calibration goal is to align the class confidence and bounding box variance estimates to the annotators' annotation distribution. We introduce an efficient yet effective framework to calibrate probabilistic object detectors by designing four evaluation metrics to measure calibration errors regarding classification and localization, and proposing a train-time calibration and post-hoc calibrator, all without the need to access any ground truth. This framework is generalizable to many existing probabilistic object detectors, such as the YOLO families and two-stage detectors. Empirical results with real-world and synthetic datasets of medical and natural images demonstrate the superior performance of the proposed framework with three popular object detectors.

2605.24702 2026-05-26 cs.CV 版本更新

Do Image-Text Metrics Respect Semantic Invariances?

图像-文本度量是否尊重语义不变性?

Amit Agarwal, Hitesh Laxmichand Patel, Meizhu Liu, Jyotika Singh, Karan Dua, Hansa Meghwani, Matthew Rowe, Michael Avendi, Yassi Abbasi, Tao Sheng, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结 通过空间、物体和社会语言框架三个维度的语义保持扰动,系统评估了五种流行图像-文本评估器(CLIPScore、PAC-S、UMIC、FLEUR和确定性LLM评判)的语义不变性,发现它们对非语义变化敏感,并提出了不变性校准评分作为后处理调整方法。

详情
AI中文摘要

无参考图像到文本评估器现在已成为评分图像-标题对齐的标准工具,但尚不清楚它们是否尊重语义不变性。我们对五种流行评估器(CLIPScore、PAC-S、UMIC、FLEUR和确定性LLM评判)进行了不变性探测,在三个轴向上施加语义保持扰动:空间(翻转、上下文保持的重定位、轻微旋转)、物体(尺度、类别)和社会语言框架(带有中性及长度匹配对照的文化/经济形容词)。在三个检测数据集和三个标题评估套件的精心策划切片上,我们发现了一致的非语义敏感性,其中良性的空间编辑和简单的措辞变化平均使分数变化约6-9%,而对于仅相差0.7%的系统,这些变化可能导致高达约37%的情况下的排名翻转,尤其是在空间变化下。一项小型人类研究也支持这一发现,并确认标注者通常认为扰动对同样正确,因此这些变化反映了度量行为而非语义变化。我们进一步提出了不变性校准评分,这是一种后处理调整方法,大致将中位数绝对敏感性减半,同时保持与学习型标题评估器的相关性。

英文摘要

Reference-free image-to-text evaluators are now standard for scoring image-caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe on five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities, where benign spatial edits and simple phrasing changes shift scores by ≈6-9% on average, and for systems separated by just 0.7%, these shifts can cause ranking flips in up to ~37% of cases, particularly under spatial changes. A small human study also supports this finding and confirms that annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.

2605.24691 2026-05-26 cs.CV 版本更新

AdaFuse-Det: Adaptive Cross-Modal Fusion of Event Cameras for Robust Object Detection in Low-Light RGB Imagery

AdaFuse-Det: 自适应跨模态融合事件相机用于低光照RGB图像中的鲁棒目标检测

Raju Imandi, Chethana B, Bharatesh Chakravarthi, Yong-Guk Kim, Manipriya S, Pavan Kumar B N

发表机构 * SRM University AP, India(印度SRM大学AP分校) Aptiv, Bengaluru, India(印度Aptiv公司,班加罗尔) Arizona State University, USA(美国亚利桑那州立大学) Sejong University, South Korea(韩国世宗大学) Indian Institute of Information Technology Sri City, India(印度信息技术学院斯里城分校)

AI总结 提出AdaFuse-Det双流框架,通过基于最小方差线性估计的自适应跨模态融合模块融合CLAHE增强RGB与事件数据,在低光照下实现鲁棒目标检测,在LLE-VOS基准上召回率65.54%、精确率53.85%、F1分数59.12%。

详情
AI中文摘要

在极端低光照条件下可靠地检测目标是计算机视觉中的一个开放性问题,在从夜间监控到搜索救援机器人等应用中具有实际紧迫性。传统RGB相机在低光子通量下性能急剧下降,而事件相机以微秒分辨率和宽动态范围记录异步逐像素亮度变化,提供了很大程度上与光照无关的互补结构线索。我们提出AdaFuse-Det,一个双流框架,通过基于最小方差线性估计理论的自适应跨模态融合模块,将CLAHE增强的RGB帧与体素化事件张量融合。我们形式化地证明学习到的注意力图渐近地恢复了高斯-马尔可夫最优融合权重,并为体素化阶段建立了事件守恒和时间分辨率界限。在LLE-VOS基准上,AdaFuse-Det在严重光照退化下实现了召回率65.54%、精确率53.85%和F1分数59.12%,在召回率上优于单模态检测器,其差距反映了理论上预测的光照适应行为。

英文摘要

Detecting objects reliably under extreme low-light conditions is an open problem in computer vision, with practical urgency in applications ranging from nighttime surveillance to search-and-rescue robotics. Conventional RGB cameras degrade sharply at low photon flux, while event cameras, which record asynchronous per-pixel brightness changes at microsecond resolution and high dynamic range, provide complementary structural cues that are largely illumination-invariant. We present AdaFuse-Det, a dual-stream framework that fuses CLAHE-enhanced RGB frames with voxelized event tensors through an Adaptive Cross-Modal Fusion (ACMF) module grounded in minimum-variance linear estimation theory. We formally show that the learned attention map asymptotically recovers the Gauss-Markov optimal fusion weights, and establish event conservation and temporal resolution bounds for the voxelization stage. On the LLE-VOS benchmark, AdaFuse-Det achieves a Recall of 65.54%, Precision of 53.85%, and F1-Score of 59.12% under severe illumination degradation, outperforming single-modality detectors in recall by a margin that reflects the theoretically predicted illumination-adaptation behavior.
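
The "Gauss-Markov optimal fusion weights" referred to above have a simple closed form in the scalar case, sketched here as background. For two unbiased, independent estimates with variances var_rgb and var_event, the minimum-variance linear combination weights each estimate by the other's share of the total variance (inverse-variance weighting). This is the classical result only; how the ACMF module parameterizes or learns its attention map is not described in the abstract, and the function and argument names are hypothetical.

```python
def min_variance_fusion(rgb_value, event_value, var_rgb, var_event):
    """Fuse two independent, unbiased estimates with minimum variance.

    Returns (fused_value, fused_variance, weight_on_rgb).
    """
    if var_rgb <= 0.0 or var_event <= 0.0:
        raise ValueError("variances must be positive")
    w_rgb = var_event / (var_rgb + var_event)
    w_event = 1.0 - w_rgb
    fused = w_rgb * rgb_value + w_event * event_value
    fused_var = (var_rgb * var_event) / (var_rgb + var_event)
    return fused, fused_var, w_rgb


# Well-lit case (equal noise) versus low-light case (RGB much noisier).
balanced = min_variance_fusion(1.0, 3.0, var_rgb=1.0, var_event=1.0)
low_light = min_variance_fusion(1.0, 3.0, var_rgb=9.0, var_event=1.0)
```

As RGB noise grows under low light, the weight on the RGB stream shrinks and the fused estimate leans toward the event stream, which matches the illumination-adaptive behavior the abstract describes qualitatively. The fused variance is never larger than the smaller of the two input variances.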

2605.24687 2026-05-26 cs.CV cs.AI 版本更新

HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing

HoloFair: 统一的T2I公平性评估与Fair-GRPO去偏

Ruyi Chen, Lu Zhou, Xiaogang Xu, Chiyu Zhang, Jiafei Wu, Liming Fang

发表机构 * Nanjing University of Aeronautics and Astronautics(南京航空航天大学) School of Software Technology, Zhejiang University, Ningbo, China(浙江大学宁波校区软件学院) Ningbo Global Innovation Center, Zhejiang University, Ningbo, China(浙江大学宁波全球创新中心) Collaborative Innovation Center of Novel Software Technology and Industrialization(新型软件技术与产业化协同创新中心)

AI总结 提出HoloFair基准框架,通过多属性组间偏差指数(MGBI)评估文本到图像模型的公平性,并引入基于强化学习的Fair-GRPO方法进行去偏,在SD3.5-Medium模型上显著提升多维公平性且保持图像质量。

Comments Accepted to ICML 2026. Code and dataset are available at https://github.com/1059684669/HoloFair

详情
AI中文摘要

文本到图像(T2I)模型在视觉真实感和语义一致性方面取得了显著进展,但它们常常延续并放大社会偏见。现有的评估方法通常只处理单维偏见,缺乏从社会相关深层语义层面揭示模型偏见的视角。我们引入了HoloFair,一个用于多维人口统计偏见分析的综合基准框架。该框架基于我们大规模面向公平性的数据集和SpaFreq(空间-频率)属性分类器,提出了多属性组间偏差指数(MGBI)指标,旨在评估内在多样性和条件偏见。除评估外,我们还进一步引入了Fair-GRPO,一种基于强化学习的去偏方法,通过设计的多目标奖励函数改变生成模型的分布。例如,在SD3.5-Medium模型上的实验表明,Fair-GRPO在保持高图像质量的同时显著改善了多维公平性。我们还分析了潜在的奖励黑客现象,并提供了相应的缓解策略。代码和数据集可在https://github.com/1059684669/HoloFair获取。

英文摘要

Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single-dimensional biases and lack the perspective needed to uncover model biases at deeper, socially relevant semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large-scale fairness-oriented dataset and the SpaFreq (Spatial-Frequency) attribute classifier, this framework proposes the Multi-attribute, Group-wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair-GRPO, a reinforcement-learning-based debiasing method that alters the distribution of generative models through a designed multi-objective reward function. For example, experiments on the SD3.5-Medium model demonstrate that Fair-GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies. Code and dataset are available at https://github.com/1059684669/HoloFair

2605.24679 2026-05-26 cs.CV 版本更新

MindAdapter: Few-Shot Parameter-Efficient Residual Calibration of Cross-Subject Brain-to-Visual Decoding Models

MindAdapter: 跨被试脑到视觉解码模型的少样本参数高效残差校准

Jiaxiang Liu, Jiawei Du, Xupeng Chen, Guoqi Li, Jiang Cai, Simon Fong, Mingkun Xu

发表机构 * Guangdong Institute of Intelligence Science and Technology(广东智能科学与技术研究院) Agency for Science, Technology and Research(科技研究局) New York University(纽约大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系)

AI总结 提出MindAdapter框架,通过解耦的线性-残差级联对齐和拓扑锚定双流流形约束,实现跨被试脑到视觉解码的少样本参数高效校准。

Comments Accepted to KDD 2026 (AI4Sciences Track). 15 pages, 7 figures

详情
AI中文摘要

跨被试脑到视觉解码由于严重的个体间变异性导致系统性的被试特异性功能错位,仍然是脑机接口的核心挑战。为了解决这个问题,我们提出了MindAdapter,一个针对预训练脑到视觉解码模型的参数高效少样本校准框架。MindAdapter采用解耦的线性-残差级联对齐范式,冻结预训练的显式脑功能对齐主干(粗粒度),并引入轻量级非线性残差适配器(细粒度),从而将全局跨被试对应关系与被试特异性残差校正分离,实现细粒度的空间和语义校准。为了进一步保持全局表征稳定性,我们设计了一个拓扑锚定的双流流形约束,其中一小部分共享刺激作为拓扑锚点,提供体素级配对监督,而语义流通过冻结的视觉-语言解码器在未配对的脑数据上强制执行一致性。总之,MindAdapter在保持预训练期间学习的全局表征几何结构的同时,高效地注入被试特异性校正。在自然场景数据集(NSD)上的实验表明,MindAdapter仅使用少量共享刺激就能显著提高跨被试视觉重建和检索精度,为个性化脑到视觉解码提供了一种实用且数据高效的解决方案。

英文摘要

Cross-subject brain-to-visual decoding remains a core challenge in brain-computer interfaces due to severe inter-individual variability that induces systematic subject-specific functional misalignment. To address this issue, we propose MindAdapter, a parameter-efficient few-shot calibration framework for pretrained brain-to-visual decoding models. MindAdapter adopts a decoupled linear-residual cascade alignment paradigm by freezing a pretrained explicit brain functional alignment backbone (coarse) and introducing a lightweight nonlinear residual adapter (fine), thereby disentangling global cross-subject correspondence from subject-specific residual corrections for fine-grained spatial and semantic calibration. To further preserve global representational stability, we design a topology-anchored dual-stream manifold constraint, where a small set of shared stimuli serves as topological pins with voxel-level paired supervision, while a semantic stream enforces consistency through a frozen vision-language decoder on unpaired brain data. Together, MindAdapter efficiently injects subject-specific corrections while maintaining the global representational geometry learned during pretraining. Experiments on the Natural Scenes Dataset (NSD) demonstrate that MindAdapter substantially improves cross-subject visual reconstruction and retrieval accuracy using only a few shared stimuli, offering a practical and data-efficient solution for personalized brain-to-visual decoding.

2605.24675 2026-05-26 cs.CV cs.AI 版本更新

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

VaaWIT: 面向多语言网页图像翻译的大语言模型视觉感知适配

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tianjin University(天津大学) Tsinghua University(清华大学)

AI总结 针对网页图像翻译中视觉表示差距问题,提出VaaWIT框架,通过双流注意力模块和视觉感知适配器,实现大语言模型对细粒度视觉特征的动态融合,在多个基准上超越开源模型并接近闭源模型性能。

Comments Accepted by KDD 2026

详情
AI中文摘要

翻译网页图像中的文本对于改善内容可访问性和跨语言信息检索至关重要,尤其是在社交媒体和电子商务领域。尽管大型视觉语言模型(LVLMs)已经推进了多模态理解,但由于视觉表示差距,将它们应用于网页图像翻译仍然具有挑战性:标准编码器通常优先考虑高级语义,而忽略了识别多样字符形态所需的细粒度视觉细节。为了解决这一挑战,我们提出了VaaWIT,一个端到端框架,用于适配大语言模型进行多语言网页图像翻译。该框架引入了两项关键技术贡献:(1)双流注意力模块(DSAM),促进多语言语义特征与详细视觉表示之间的双向交互,从而合成对文本变化鲁棒的统一特征;(2)视觉感知适配器(VAA),一种参数高效的微调策略,将这些融合的视觉线索动态注入冻结的LLM主干。这种设计使模型能够有效地将视觉上下文与语言推理对齐,同时最小化计算成本。在三个公共基准上的八个任务上的大量实验表明,VaaWIT显著优于最先进(SOTA)的开源基线,并达到了与专有模型相竞争的性能。这些结果验证了将细粒度视觉感知集成到LLM中用于复杂网页内容分析的有效性。

英文摘要

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.

2605.24674 2026-05-26 cs.CV 版本更新

Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

推理对齐:扩散Transformer在视频编辑中的隐式推理

Yan Li, Lin Liu, Xiaopeng Zhang, Qi Tian

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Huawei Inc.(华为公司)

AI总结 针对指令式视频编辑中条件信号未分化及交叉注意力监督不足的问题,提出RVEDiT框架,通过粒度路由令牌条件和参考锚定注意力对齐实现粗到细编辑与内部推理正则化。

详情
AI中文摘要

基于指令的视频编辑需要根据自然语言指令转换源视频,同时保留无关内容并保持时间连贯性。我们认为现有的扩散Transformer(DiT)编辑器由于两个结构原因难以完成此任务。首先,条件信号未分化地输入所有Transformer块,迫使单个令牌流同时编码全局编辑意图和细粒度视觉证据。其次,控制编辑的交叉注意力模式仅通过像素级重建间接监督,使得模型内部推理过程约束不足。为了解决这两个限制,我们提出了RVEDiT,一个隐式推理视频编辑DiT框架,围绕两个互补组件构建。第一个组件,粒度路由令牌条件,引入从多模态大语言模型蒸馏的可学习编辑令牌,并将其路由到浅层块,同时将原生视觉和文本令牌保留给深层块,从而在骨干网络内部诱导出从粗到细的编辑过程。第二个组件,参考锚定注意力对齐,在训练期间采用参数共享的参考分支,并最大化编辑分支和参考分支注意力特征之间的互信息,正则化模型的内部推理而不产生任何额外的推理成本。在标准基于指令的视频编辑基准上的实验表明,RVEDiT始终优于最先进的基线,特别是在局部和组合编辑方面取得了显著提升。

英文摘要

Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model's internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model's internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.

2605.24652 2026-05-26 cs.AI cs.CV cs.MM cs.SD 版本更新

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

AVBench:面向音视频生成模型的人类对齐与自动化评估基准

Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia, Zhun Mou, Tianyang Zhong, Yiting Zhao, Wenming Yang

发表机构 * Tsinghua University(清华大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出AVBench,通过细粒度人类中心指标和偏好学习训练的专业评估器,实现音视频生成的自动化、准确评估。

详情
AI中文摘要

音视频(AV)生成的快速进步使得能够生成具有同步声音的高保真合成内容,特别是涉及语音和交互的人类相关场景。然而,AV生成的评估仍处于早期阶段,只有少数针对人类相关场景的粗粒度基准,并且依赖于有限的预设评估和通用多模态大语言模型,导致对模型能力的不准确评估。为了解决这些问题,我们引入了AVBench,一个专为人类中心AV生成设计的全自动化基准。AVBench基于两个关键设计以实现全面准确的评估:(i)人类中心和细粒度指标。AVBench整合了十个评估维度,专为以人为中心的现实场景设计,涵盖视觉质量、音频质量以及跨模态的多层次一致性。这些实用指标捕捉了现有基准经常忽略的人类相关细节。(ii)通过偏好学习训练的专业评估器。为了解决缺乏专门训练数据的问题,我们通过将真实视频转化为具有受控扰动的多样化训练对来构建大规模监督。在该高质量数据集上微调后,评估器学会可靠地检测细微的跨模态不一致性。关键的是,AVBench不输出离散的文本判断,而是从模型对二元决策的预测置信度中推导出连续评估分数。这种概率评分机制比传统的VQA风格评估更可靠,并且与人类判断高度一致。综合来看,AVBench为AV生成提供了自动化评估,展示了数据过滤的强大潜力,并可作为来自人类反馈的强化学习(RLHF)的可微分奖励信号。

英文摘要

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation of AV generation remains at an early stage: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgments, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
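One common way to turn a binary decision into a continuous score, as the abstract describes at a high level, is to normalize the evaluator's logits for the two answers. The sketch below assumes the evaluator exposes one logit for "yes" and one for "no"; the function name `confidence_score` and that interface are assumptions for illustration, not AVBench's documented API.

```python
import math

def confidence_score(logit_yes, logit_no):
    """Probability of the "yes" answer under a two-way softmax.

    Subtracting the maximum logit first keeps the exponentials numerically stable.
    """
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

# An undecided evaluator scores 0.5; a confident "yes" approaches 1.
undecided = confidence_score(0.0, 0.0)
confident = confidence_score(2.0, 0.0)
```

Because the score is a smooth function of the logits, it varies gradually between samples that a hard yes/no answer would treat as identical.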

2605.24642 2026-05-26 cs.CV cs.RO 版本更新

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

理解几何基础模型对视觉-语言-动作模型的影响

Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka, Cheng-Hao Kuo, Luca Carlone

发表机构 * Amazon Personal Robotics Group(亚马逊个人机器人小组) University of Texas at Austin(德克萨斯大学奥斯汀分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过线性探测分析量化了视觉-语言-动作模型(VLA)与几何基础模型(GFM)之间的“几何差距”,比较了三种注入几何信息的架构,并研究了非架构因素对几何VLA性能的影响。

详情
AI中文摘要

近期工作探索了视觉-语言-动作模型(VLA)与用于3D重建的几何基础模型(GFM)(如VGGT)交叉领域的新机遇。虽然由此产生的几何VLA通常表现出改进的性能,但仍不清楚:(i) 现代VLA是否已经具备足够的几何理解能力,(ii) 将几何理解注入VLA的最佳架构是什么,以及(iii) 其他影响几何VLA的设计选择的效果。在本文中,我们针对特定的VLA(GR00T-N1.5)和GFM(VGGT)进行了严格的实验分析,以阐明这些问题。我们的第一个贡献是通过基于线性探测的严格分析,形式化了先前工作中关于当前VLA缺乏几何理解的直觉。该分析首次量化了VLA与GFM之间的“几何差距”。我们的第二个贡献是识别并比较了将GFM与VLA桥接的不同策略。我们实现了三种不同的架构,它们在将几何信息注入VLA的方式上有所不同,同时尽可能保持低级实现细节相似,以确保公平比较。最后,我们分析了非架构选择(例如,训练数据、相机数量、重建质量)对几何VLA性能的影响。

英文摘要

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

2605.24639 2026-05-26 cs.CV cs.AI 版本更新

DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

DisDop: 基于领域先验蒸馏的开放词汇航空目标检测

Ruihao Xu, Yong Liu, Yansong Tang, Sule Bai, Xubing Ye, Bingyao Yu, Yutao Guo, Jiwen Lu, Jie Zhou

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Tsinghua University(清华大学)

AI总结 提出DisDop框架,通过从遥感基础模型(RemoteCLIP和DINOv3)中系统蒸馏多级领域先验知识到轻量级检测器,实现开放词汇航空目标检测的最新性能。

详情
AI中文摘要

近年来,随着无人机的广泛应用,航空图像的目标检测引起了越来越多的关注,尤其是不受预定义类别限制的开放词汇航空检测。由于无人机视角图像的稀缺性及其与自然图像的显著差异,直接应用为自然场景设计的普通开放词汇检测方法难以取得令人满意的结果。一些研究提出通过使用轻量级网络或生成伪标签来从预训练模型迁移知识,但它们往往依赖于在自然图像上训练的模型,忽略了专门为遥感和航空图像定制的基础模型的潜力。为了解决这一局限性,我们提出了DisDop,一个统一的框架,系统地将来自遥感基础模型(例如RemoteCLIP和DINOv3)的多级领域先验知识蒸馏到轻量级检测器中。具体来说,我们首先通过教师融合策略蒸馏视觉先验,该策略结合了RemoteCLIP的跨模态对齐能力和DINOv3的细粒度局部特征提取能力,将其互补优势迁移到检测器的骨干网络中。其次,我们通过显式建模类别间语义关系来蒸馏嵌入在RemoteCLIP文本编码器中的文本先验,同时结合全局上下文先验以增强小目标的局部特征表示。通过这种多级先验蒸馏框架,我们的DisDop在开放词汇航空检测基准上取得了新的最先进性能。大量的消融分析也证明了我们提出模块的合理性和有效性。

英文摘要

With the widespread application of drones in recent years, object detection in aerial images has attracted increasing attention, especially open-vocabulary aerial detection, which is not restricted to predefined categories. Due to the scarcity of drone-view images and their significant differences from natural images, directly applying vanilla open-vocabulary detection methods designed for natural scenes rarely yields satisfactory results. Some studies propose to transfer knowledge from pre-trained models by using lightweight networks or generating pseudo labels, but they tend to rely on models trained on natural images, neglecting the potential of foundation models specifically tailored for remote sensing and aerial imagery. To address this limitation, we propose DisDop, a unified framework that systematically distills multi-level domain priors from remote sensing foundation models (e.g., RemoteCLIP and DINOv3) into a lightweight detector. Specifically, we first distill visual priors through a teacher fusion strategy that combines RemoteCLIP's cross-modal alignment capability with DINOv3's fine-grained local feature extraction ability, transferring their complementary strengths to the detector's backbone. Second, we distill textual priors embedded in RemoteCLIP's text encoder by explicitly modeling inter-category semantic relationships, while incorporating global contextual priors to enhance local feature representation for small objects. Through this multi-level prior distillation framework, DisDop achieves new state-of-the-art performance on open-vocabulary aerial detection benchmarks. Extensive ablation analysis also supports the design rationale and effectiveness of the proposed modules.

2605.24631 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion

超越生成先验:JEPA引导扩散的少数采样

Sol Park, Soobin Um

发表机构 * Department of Artificial Intelligence, Kookmin University, Seoul, South Korea(韩国首尔国民大学人工智能系)

AI总结 提出一种基于世界模型JEPA引导的扩散采样框架,通过近似策略实现高效计算,在无条件、类别条件和文本到图像生成中提升少数样本的保真度和语义有效性。

Comments ICML 2026, 21 pages, 9 figures

详情
AI中文摘要

少数采样旨在数据流形上生成低密度实例,在医学诊断、异常检测和创意AI等应用中具有核心重要性。然而,现有方法相对于从训练数据中学习的生成先验来定义少数样本,将稀有性限制在可能无法很好反映现实世界语义的模型特定概念中。在这项工作中,我们提出了一种以世界为中心的少数采样视角,该视角相对于现实世界先验而非生成器诱导的密度来定义稀有性。为此,我们引入了JEPA引导,一种由联合嵌入预测架构(JEPA)引导的扩散采样框架——JEPA是一类编码广泛、语义丰富表示的世界模型。JEPA引导将扩散轨迹导向JEPA隐含密度下的低密度区域,从而使生成的少数样本与现实世界的语义稀有性对齐。为了使JEPA引导在计算上实用,我们开发了带有理论误差界限的原则性近似策略,显著降低了引导计算的开销。在无条件、类别条件和文本到图像生成上的大量实验表明,JEPA引导持续提高了少数样本的保真度和语义有效性,在捕捉现实世界的稀有性概念方面优于以生成器为中心的基线。代码可在https://github.com/soobin-um/jepa-guidance获取。

英文摘要

Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model-specific notions that may poorly reflect real-world semantics. In this work, we propose a world-centric perspective on minority sampling, which defines rarity with respect to real-world priors rather than generator-induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint-Embedding Predictive Architecture (JEPA) -- a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low-density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real-world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class-conditional, and text-to-image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator-centric baselines in capturing real-world notions of rarity. Code is available at https://github.com/soobin-um/jepa-guidance.

2605.24630 2026-05-26 cs.CV 版本更新

DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion

DexSIM: 具有统一因果视频扩散的实时灵巧仿真

Adam Lee

发表机构 * UC Berkeley(加州大学伯克利分校)

AI总结 提出DexSIM框架,通过两阶段训练(双向视频扩散和自回归滚动训练)实现实时、长时一致的灵巧操作仿真,在像素相似度、运动保真度和手部投影精度上超越基线。

Comments World Model @ ICLR 2026

详情
AI中文摘要

视频扩散模型的最新进展已实现对物理世界的大规模仿真,但手部物体交互的仿真研究较少。我们提出DexSIM,一个用于实时灵巧操作仿真的灵巧仿真框架。以往利用视频扩散和3D重建的工作侧重于导航,而灵巧操作虽在创建交互式仿真体验和为机器人生成合成数据方面有广泛应用,但进展有限。现有方法缺乏实时交互性、长期空间一致性和记忆。我们为DexSIM提出两阶段训练框架。首先,通过在手部动作轨迹和视频的统一特征空间中进行联合嵌入,训练一个双向视频扩散模型。我们利用高斯热图手部编码实现更准确的手部表示。然后,我们进行基于滚动的自回归训练,将更新的空间缓存作为注意力汇点用于空间记忆,从而提高了长期一致性和3D感知灵巧操作仿真。DexSIM在像素和语义相似度、运动保真度和手部投影精度上优于基线。它还支持手部运动迁移等新应用,并以15.24 FPS的帧率实现实时交互。

英文摘要

Recent progress in video diffusion models has enabled extensive simulation of the physical world, yet simulation of hand-object interaction remains less explored. We propose DexSIM, a framework for simulating dexterous manipulation in real time. Previous works that combine video diffusion and 3D reconstruction focus on navigation; dexterous manipulation has received limited attention despite its broad applications in creating interactive experiences with the simulated world and in generating synthetic data for robotics. Existing methods lack real-time interactivity, long-term spatial consistency, and memory. We propose a two-stage training framework for DexSIM. First, we train a bidirectional video diffusion model by jointly embedding the hand action trajectory and the video in a unified feature space, using a Gaussian heatmap hand encoding for more accurate hand representation. Second, we conduct rollout-based autoregressive training with an updated spatial cache serving as an attention sink for spatial memory, which improves long-term consistency and 3D-aware dexterous manipulation simulation. DexSIM outperforms the baseline on pixel and semantic similarity, motion fidelity, and hand projection accuracy. It also enables new applications such as hand motion transfer and runs interactively in real time at 15.24 FPS.
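A Gaussian heatmap encoding of hand keypoints can be sketched as follows. This is a generic construction, assuming 2D keypoints already projected into pixel coordinates and one heatmap channel per keypoint; the function name `gaussian_heatmaps`, the `sigma` default, and the channel layout are illustrative assumptions rather than details taken from DexSIM.

```python
import numpy as np

def gaussian_heatmaps(keypoints_xy, height, width, sigma=2.0):
    """Render K keypoints as K heatmaps of shape (K, height, width).

    keypoints_xy: array-like of shape (K, 2) holding (x, y) pixel coordinates.
    Each channel is an unnormalized Gaussian that peaks at 1.0 on its keypoint.
    """
    kp = np.asarray(keypoints_xy, dtype=float).reshape(-1, 2)
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None, :, :] - kp[:, 0, None, None]  # horizontal offset to each keypoint
    dy = ys[None, :, :] - kp[:, 1, None, None]  # vertical offset to each keypoint
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))

# One keypoint at x=3, y=5 on an 8x8 grid.
heat = gaussian_heatmaps([[3, 5]], height=8, width=8)
peak = np.unravel_index(np.argmax(heat), heat.shape)
```

Compared with feeding raw coordinates, a dense heatmap places the hand signal on the same spatial grid as the video frames, which is one plausible reason such encodings pair well with image-space models.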

2605.24625 2026-05-26 cs.CV 版本更新

ULF-Synth: Physics-Guided Ultra-Low-Field MRI Enhancement for Pediatric Neuroimaging

ULF-Synth:用于儿科神经影像的物理引导超低场MRI增强

Toufiq Musah, Salvatore Calcagno, Federica Proietto Salanitri, Xiaomeng Li, Maruf Adewole, Marawan Elbatel

发表机构 * Kwame Nkrumah University of Science and Technology(夸梅·恩克鲁玛科技大学) University of Catania(卡塔尼亚大学) The Hong Kong University of Science and Technology(香港科技大学) Medical Artificial Intelligence Lab(医学人工智能实验室)

AI总结 提出ULF-Synth框架,通过从高场MRI合成逼真的超低场图像并采用空间-频率域目标,实现无需真实配对数据的超低场MRI增强,提升结构相似性和诊断可接受性。

Comments 10 pages, 2 figures, 3 tables

详情
AI中文摘要

超低场(ULF)MRI提供了便携且可及的神经影像,但与高场(HF)系统相比,存在信噪比降低和空间分辨率有限的问题。获取配对的ULF-HF数据进行监督增强通常很困难,尤其是在资源有限的环境中。我们提出了ULF-Synth框架,它结合了:(i)基于采集的从HF体积合成逼真ULF图像的方法,以创建大规模配对训练数据;(ii)优先恢复高频解剖细节的空间-频率域目标。该公式与架构无关,在编码器-解码器、对抗性和基于扩散的翻译模型中一致地提高了结构相似性和感知保真度。当仅使用合成数据训练时,所得模型有效泛化到真实的64mT ULF采集,改善了下游多类脑分割,并在盲法读者研究中获得了更高的放射科医生偏好和诊断可接受性。这些发现表明,合成配对监督提供了一种实用且可扩展的途径来增强ULF MRI,而无需真实的配对采集。代码、模型和数据集:https://github.com/toufiqmusah/ULF-Synth

英文摘要

Ultra-low-field (ULF) MRI offers portable and accessible neuroimaging but suffers from reduced signal-to-noise ratio and limited spatial resolution compared to high-field (HF) systems. Acquiring paired ULF-HF data for supervised enhancement is often difficult, particularly in resource-limited settings. We introduce ULF-Synth, a framework that combines (i) acquisition-based synthesis of realistic ULF images from HF volumes to create large-scale paired training data, and (ii) a spatial-frequency domain objective that prioritizes recovery of high-frequency anatomical detail. This formulation is architecture-agnostic, consistently improving structural similarity and perceptual fidelity across encoder-decoder, adversarial, and diffusion-based translation models. When trained exclusively on synthetic data, the resulting models generalize effectively to real 64 mT ULF acquisitions, improving downstream multiclass brain segmentation and achieving higher radiologist preference and diagnostic acceptability in a blinded reader study. These findings demonstrate that synthetic paired supervision provides a practical and scalable pathway for enhancing ULF MRI without requiring real paired acquisitions. Code, Models and Dataset: https://github.com/toufiqmusah/ULF-Synth

2605.24624 2026-05-26 cs.CV 版本更新

Vision-Language Binding in In-Context Image Generation

上下文图像生成中的视觉-语言绑定

Chris Ge, Rohit Gandikota, Antonio Torralba, Tamar Rott Shaham

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Northeastern University(东北大学)

AI总结 本文通过因果干预方法揭示FLUX.2模型中文本令牌与参考图像之间的隐式跨模态绑定机制,并定位绑定发生在文本序列的填充令牌上。

Comments 35 pages, 19 figures

详情
AI中文摘要

上下文图像生成模型(如FLUX.2)接收文本提示和可选的参考图像作为输出的视觉条件。在内部,所有三个输入——文本、参考图像和噪声令牌——被连接并通过单个注意力流处理,其中所有令牌可以相互关注。这留下了参考信息如何通过模型流动以产生输出图像的问题。我们展示了文本令牌与参考图像之间出现隐式跨模态绑定:在前向传播过程中,文本令牌吸收视觉参考内容,并且这些吸收的内容因果地影响生成的输出。我们通过三种因果干预方法揭示了FLUX.2中的这种绑定:T2I Lens,通过文本到图像路径解码中间文本令牌激活;Attention Knockout,切断特定的注意力边;以及I2I-to-I2I Patching,在编辑运行之间复制文本令牌激活。在包括SUN397和DreamBench++数据集以及在线收集的图像在内的2875个编辑任务中,我们观察到一致的分工:参考图像的属性(如颜色、风格和场景设置)首先被写入文本令牌,然后由文本令牌携带到生成的图像中;像素精确的属性(如特定面孔或实例身份)绕过文本令牌,通过图像到图像注意力直接从参考图像流向生成的图像。我们进一步将参考-文本绑定定位到文本序列的填充令牌。这些结果表明,多模态DiT中的文本令牌不仅仅是提示持有者,而是参考图像内容的结构化通道。更广泛地说,它们表明即使在统一注意力的多模态生成模型中,令牌模态也决定了条件信息如何在网络中表示和路由。

英文摘要

In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs -- text, reference image, and the noise tokens -- are concatenated and processed through a single attention stream, where all tokens can attend to one another. This leaves open how reference information flows through the model to produce the output image. We show that an implicit cross-modal binding emerges between the text tokens and the reference image: the text tokens absorb visual reference content during the forward pass, and that absorbed content causally influences the generated output. We surface this binding with three causal interventions on FLUX.2: T2I Lens, which decodes intermediate text-token activations through a text-to-image path; Attention Knockout, which severs specific attention edges; and I2I-to-I2I Patching, which copies text token activations between editing runs. Across 2,875 editing tasks on various images, including SUN397 and DreamBench++ datasets and images collected online, we observe a consistent division of labor: properties of the reference image, like color, style, and scene setting, are first written into the text tokens, which carry them to the generated image; pixel-exact properties like a specific face or instance identity bypass the text tokens and flow directly from reference to image through image-to-image attention. We further localize the reference-text binding to the padding tokens of the text sequence. These results show that text tokens in a multimodal DiT are not just prompt holders, but a structured channel for reference image content. More broadly, they suggest that even in unified-attention multimodal generative models, token modality structures how conditioning information is represented and routed across the network.

2605.24622 2026-05-26 cs.RO cs.CV 版本更新

PoseRefer: Pathway-Local Parameters for Semantically Grounded Reference Resolution

PoseRefer: 用于语义基础指代消解的通路-局部参数

Anna Deichler

发表机构 * KTH Royal Institute of Technology(皇家理工学院)

AI总结 提出PoseRefer架构,通过解耦姿态和文本通路并冻结MiniLM类别嵌入,在MM-Conv数据集上实现31.9%的top-1准确率,并揭示融合准确性可能受类别表示伪影影响。

Comments ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

详情
AI中文摘要

一个机器人解析“把杯子放在那个上面”必须融合手势、语言和场景几何,然而3D基础基准测试仅部分捕获了这一情况:描述是事后编写的,手势是模板化的,或者指向是为相机摆拍的。MM-Conv从二元VR交互中捕获自然的伴随语音手势,同时包含全身动作捕捉和3D场景图。我们使用它来评估姿态-语言融合,采用解耦的后期融合架构,其中姿态和文本通路不共享任何学习参数。这两个选择共同使得通过受控消融更容易隔离类别、姿态和文本的贡献。使用冻结的MiniLM类别嵌入的融合在每种指代类型上都超过了仅姿态和最佳文本通路,达到31.9%的top-1。学习到的标量门根据文本通路是否有类别访问权限而在相反策略之间切换。这是一个可靠性诊断:除非通路在架构上解耦,否则语义基础系统的融合准确性声明与类别表示伪影无法区分。

英文摘要

A robot resolving "put the cup on that one" must fuse gesture, language, and scene geometry, yet 3D grounding benchmarks only partially capture this regime: descriptions are written post-hoc, gestures are templated, or pointing is staged for the camera. MM-Conv captures natural co-speech gesture from dyadic VR interaction alongside full-body motion capture and 3D scene graphs. We use it to evaluate pose-language fusion with a decoupled late-fusion architecture in which pose and text pathways share no learned parameters. The two choices together make category, pose, and text contributions easier to isolate through controlled ablations. Fusion with frozen MiniLM category embeddings exceeds pose alone and the best text-only pathway on every reference type, reaching 31.9% top-1. The learned scalar gate flips between opposing policies depending on whether the text pathway has category access. This is a reliability diagnostic: fusion-accuracy claims for semantic grounding systems are indistinguishable from category-representation artifacts unless pathways are architecturally decoupled.
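Late fusion with a single scalar gate can be sketched as a convex combination of per-candidate scores from two independent pathways. The abstract does not give the exact gating form, so the sigmoid-of-a-scalar gate below, along with the names `gated_late_fusion` and `top1`, is an assumption made purely for illustration.

```python
import math

def gated_late_fusion(pose_scores, text_scores, gate_logit):
    """Blend two score lists with one scalar gate g = sigmoid(gate_logit).

    g near 1 trusts the pose pathway; g near 0 trusts the text pathway.
    The pathways share nothing except this scalar.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * p + (1.0 - g) * t for p, t in zip(pose_scores, text_scores)]

def top1(scores):
    """Index of the highest-scoring candidate object."""
    return max(range(len(scores)), key=scores.__getitem__)

# Two candidate objects on which the pathways disagree.
pose = [1.0, 0.0]
text = [0.0, 1.0]
balanced = gated_late_fusion(pose, text, gate_logit=0.0)
pose_led = top1(gated_late_fusion(pose, text, gate_logit=4.0))
text_led = top1(gated_late_fusion(pose, text, gate_logit=-4.0))
```

With only one gate parameter, reading off its learned value shows which pathway dominates, which is what makes a flip between opposing policies easy to observe in ablations.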

2605.24621 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Phase-Aware Wavelet-Based-Scattering Encoder-Decoder for Dense Predictions

相位感知的基于小波散射的编解码器用于密集预测

Ghassen Marrakchi, Basarab Matei

发表机构 * Northern Paris Computer Science Lab, Sorbonne Paris Nord University, Villetaneuse, France(巴黎北部计算机科学实验室,索邦巴黎北部大学,法国维勒塔讷斯)

AI总结 提出一种相位感知散射编解码器,通过在跳跃连接中显式保留相位信息来恢复空间结构,在图像去噪和皮肤病变分割任务中验证了相位对密集预测的有效性。

Comments 21 pages, 16 figures, 10 tables

详情
AI中文摘要

散射变换实现了Lipschitz稳定性和平移不变性,但密集预测任务需要保留在全局平均中丢失的空间结构。我们提出了相位感知散射编解码器,通过在跳跃连接中显式保留相位来恢复这些信息。在图像去噪(BSD68)上,打破平移不变性使PSNR提高了+2.17 dB;相位保留额外增加了+1.03 dB。一种新颖的空间洗牌消融实验(惩罚-1.26 dB)表明相位编码了位置依赖的结构。我们在第二个密集预测任务(ISIC皮肤病变分割)上进行了初步的可扩展性研究,完整的交叉验证正在进行中。这项工作推进了原则性的小波-深度学习集成,展示了相位信息如何在像素级预测中补充散射的稳定性-表达性权衡。

英文摘要

Scattering transforms achieve Lipschitz stability and translation invariance, but dense prediction tasks require preserving spatial structure lost in global averaging. We propose Phase-Aware Scattering Encoder-Decoder, which restores this information by explicitly preserving phase in skip connections. On image denoising (BSD68), breaking translation invariance improves PSNR by +2.17 dB; phase preservation adds +1.03 dB. A novel spatial shuffling ablation (-1.26 dB penalty) demonstrates phase encodes location-dependent structure. We conduct a preliminary extensibility study on a second dense prediction task (ISIC skin lesion segmentation), with full cross-validation as ongoing work. This work advances principled wavelet-deep learning integration, showing how phase information complements scattering's stability-expressiveness trade-off in pixel-level prediction.

2605.24608 2026-05-26 cs.AI cs.CV cs.LG 版本更新

Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

基于数学形态学的深度卷积学习的格论与代数模型

Gustavo, Angulo

发表机构 * Mines Paris, PSL University, CMA-Center for Applied Mathematics, Sophia-Antipolis, France(巴黎 Mines 学院,PSL 大学,应用数学中心,法国索菲亚-安蒂波利斯)

AI总结 本文基于格论和数学形态学,为深度卷积架构(CNN、ResNet、UNet)建立了严格的代数框架,揭示了标准CNN流水线是交叉格算子,并识别出三种真正的幂等开运算层设计。

详情
AI中文摘要

我们为深度卷积架构(包括CNN、ResNet和如UNet的编码器-解码器网络)建立了一个严格的代数框架,该框架基于格论和数学形态学。核心工具是Matheron-Maragos-Banon-Barrera (MMBB) 平移不变算子通用表示理论,我们将其系统地应用于标准深度网络的每一层。主要发现是:标准CNN流水线(线性卷积 + ReLU + 平坦最大池化)是一个交叉格算子:卷积是傅里叶下半格中的腐蚀,ReLU是格并闭运算,最大池化是逐点最大加格中的膨胀,它们的组合在这两个格中都不是形态学开运算。第二个发现是:ReLU在逐点格中的上伴随是一个全局(非局部)算子,在全局非负函数上为恒等映射,否则为负无穷,因此没有局部形态学腐蚀能与ReLU构成伴随对。这两个结果共同提供了深度在标准CNN中引入真正表示能力的精确代数原因:组合层不是幂等的。我们识别并完全刻画了三种真正的幂等开运算层设计:纯最大加形态学层(逐点格)、谱维纳层(傅里叶格)和自对偶形态学层。我们建立了完整的不动点和收敛理论。该框架还将最大池化、步长卷积和拉普拉斯金字塔统一在Goutsias-Heijmans伴随金字塔理论下,并给出了激活-池化膨胀(APD)分解及其正确的伴随算子。

英文摘要

We develop a rigorous algebraic framework for deep convolutional architectures, CNNs, ResNets, and encoder-decoder networks such as UNet, grounded in lattice theory and mathematical morphology. The central tool is the Matheron-Maragos-Banon-Barrera (MMBB) universal representation theory for translation-invariant operators, which we apply systematically to every layer of a standard deep network. The principal finding is that the standard CNN pipeline (linear convolution + ReLU + flat max-pooling) is a cross-lattice operator: the convolution is an erosion in the Fourier inf-semilattice while ReLU is a lattice-join closing and max-pooling is a dilation in the pointwise max-plus lattice, and their composition is a morphological opening in neither. A second finding is that the upper adjoint of ReLU in the pointwise lattice is a global (non-local) operator, the identity on globally non-negative functions and -∞ otherwise, so no local morphological erosion can form an adjunction pair with ReLU. These two results together provide the precise algebraic reason why depth in standard CNNs introduces genuine representational power: the composed layer is not idempotent. Three layer designs that are genuine idempotent openings are identified and fully characterised: the pure max-plus morphological layer (pointwise lattice), the spectral Wiener layer (Fourier lattice), and the self-dual morphological layer. We establish a complete fixed-point and convergence theory. The framework also unifies max-pooling, strided convolution, and the Laplacian pyramid under the Goutsias-Heijmans adjoint pyramid theory, and gives the Activation-Pooling Dilation (APD) factorisation with its correct adjoint.
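The notion of an idempotent morphological opening referred to in the abstract can be illustrated with the classical 1D flat case. The sketch below uses a symmetric window of odd width `k` and is only the textbook erosion/dilation adjunction in the pointwise lattice; it does not reproduce any of the paper's three layer designs, and the names `erode`, `dilate`, and `opening` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def erode(f, k):
    """Flat erosion: running minimum over a centered window of odd width k."""
    r = k // 2
    padded = np.pad(np.asarray(f, dtype=float), r, constant_values=np.inf)
    return sliding_window_view(padded, k).min(axis=1)

def dilate(f, k):
    """Flat dilation: running maximum over the same window (the adjoint of erode)."""
    r = k // 2
    padded = np.pad(np.asarray(f, dtype=float), r, constant_values=-np.inf)
    return sliding_window_view(padded, k).max(axis=1)

def opening(f, k):
    """Erosion followed by its adjoint dilation."""
    return dilate(erode(f, k), k)

# A signal with two plateaus of width 3 and one isolated spike of width 1.
signal = np.array([0, 3, 3, 3, 0, 5, 0, 2, 2, 2], dtype=float)
opened = opening(signal, 3)
opened_twice = opening(opened, 3)
```

Applying the opening once removes the narrow spike while keeping both plateaus, and applying it again changes nothing. That fixed-point behavior is what idempotence means here, in contrast to the composed convolution, ReLU, and pooling layer that the abstract describes as not idempotent.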

2605.24604 2026-05-26 cs.CV 版本更新

LC-Flow: Learning Local Continuous Optical Flow and Confidence from events

LC-Flow: 从事件中学习局部连续光流与置信度

Gunwoo Jeon, Chaesong Park, Jongwoo Lim

发表机构 * IPAI, Seoul National University(IPAI,首尔国立大学)

AI总结 提出LC-Flow,首个基于学习的、从局部事件中估计时间连续光流的方法,通过连续局部循环网络和联合学习的置信度,解决事件稀疏性和孔径问题,在MVSEC和DSEC上达到局部方法最优,且置信度引导的聚合在MVSEC上超越基于帧的方法。

详情
AI中文摘要

事件相机以微秒分辨率异步捕捉亮度变化,但现有光流方法未能充分利用这种时间连续性。基于帧的方法引入人工累积延迟并遭受领域过拟合,而基于模型的局部方法无状态运行,丢弃预测间的时间历史,产生不准确的光流。我们提出LC-Flow,首个时间连续的、基于学习的光流估计器,完全从局部事件操作。其核心是一个连续局部循环网络,为每个空间网格维护持久隐藏状态,随着事件到达逐步累积时间上下文。与受限于固定累积窗口的基于帧的方法不同,也与每一步从头重新计算运动的无状态基于模型的方法不同,LC-Flow在任意时间戳上生成具有完整运动历史的稀疏局部光流估计。为了解决局部观测固有的歧义性,我们联合学习一个置信度分数,量化每个预测的可靠性,明确处理事件稀疏性和孔径问题。该置信度具有双重作用:为下游任务(如视觉里程计)过滤不可靠估计,并为多尺度置信度引导的聚合提供有原则的权重,从稀疏局部输出重建全局一致的光流。LC-Flow在MVSEC和DSEC上均达到局部方法的最优性能,而置信度引导的聚合在MVSEC基准上建立了新的总体最优,超越了依赖全局空间先验的重型基于帧的网络。

英文摘要

Event cameras capture brightness changes asynchronously with microsecond resolution, yet existing optical flow methods fail to fully exploit this temporal continuity. Frame-based approaches impose artificial accumulation latency and suffer from domain overfitting, while model-based local methods operate statelessly, discarding temporal history between predictions and yielding inaccurate flows. We propose LC-Flow, the first temporally continuous, learning-based optical flow estimator that operates purely from local events. At its core, a Continuous Local Recurrent Network maintains persistent hidden states per spatial grid, incrementally accumulating temporal context as events arrive. Unlike frame-based methods constrained to fixed accumulation windows, and unlike stateless model-based methods that recompute motion from scratch at each step, LC-Flow produces sparse local flow estimates at arbitrary timestamps with full motion history. To address the inherent ambiguity of local observations, we jointly learn a confidence score that quantifies the reliability of each prediction, explicitly handling event sparsity and the aperture problem. This confidence serves a dual role: filtering unreliable estimates for downstream tasks such as visual odometry, and providing principled weights for a multi-scale confidence-guided aggregation that reconstructs globally consistent flow from the sparse local outputs. LC-Flow achieves state-of-the-art performance among local methods on both MVSEC and DSEC, while the confidence-guided aggregation establishes a new overall state-of-the-art on the MVSEC benchmark, surpassing heavy frame-based networks that rely on global spatial priors.
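
The dual role of the confidence score described above (filtering, then weighting) can be sketched for a single pixel. This is a hedged illustration only: the function name `aggregate_flow`, the threshold `min_conf`, and the plain confidence-weighted mean are assumptions, and the abstract does not specify the multi-scale aggregation actually used.

```python
def aggregate_flow(estimates, min_conf=0.2):
    """Fuse local flow estimates that overlap one pixel.

    estimates: list of (u, v, confidence) tuples from local predictions.
    min_conf:  hypothetical threshold below which an estimate is discarded.
    Returns (u, v), or None when no estimate is reliable enough.
    """
    # Role 1: filter out unreliable local estimates.
    kept = [(u, v, c) for u, v, c in estimates if c >= min_conf]
    total = sum(c for _, _, c in kept)
    if total == 0.0:
        return None
    # Role 2: use the surviving confidences as aggregation weights.
    u = sum(u * c for u, _, c in kept) / total
    v = sum(v * c for _, v, c in kept) / total
    return (u, v)


# Two equally confident estimates average out; a low-confidence outlier is dropped.
print(aggregate_flow([(1.0, 0.0, 1.0), (3.0, 0.0, 1.0)]))
print(aggregate_flow([(1.0, 0.0, 0.9), (100.0, 0.0, 0.1)]))
```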

2605.24593 2026-05-26 cs.CV 版本更新

Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration

自监督动态异质退化建模用于统一零样本图像恢复

XiaoWan Hu, Jing Yang, HeNan Liu, HuaQiu Li, Mai Xu

发表机构 * Beihang University(北航) Tsinghua University(清华大学)

AI总结 提出统一物理零样本图像恢复框架,通过将异质退化重参数化为同质分布并引入动态质量细化策略,实现单/混合退化下的最优性能。

详情
AI中文摘要

零样本图像恢复提供了一种灵活的方式来处理各种退化,无需特定任务的训练。然而,现有方法通常依赖堆叠层或预训练特征来增强退化表达,同时忽略了物理一致的先验。不充分的退化提示在零样本扩散过程中带来了沉重的训练负担和高采样成本。此外,固定的推理轨迹在复杂损坏下往往收敛到次优解。我们观察到异质退化可以重参数化为一个最小物理一致参数集以实现紧凑表示。基于这一见解,我们首先提出一个统一的物理零样本图像恢复(UP-ZeroIR)框架,该框架将异质退化显式建模为同质全分布。该分布可以在潜在空间中直接优化,从而实现原则性的解探索和有效的提示适应。此外,我们引入了一种动态质量细化策略,自适应调整扩散轨迹以实现鲁棒的全局最优收敛。大量实验表明,我们的方法在单一和混合退化下均达到了最先进的性能。我们的代码可在 https://github.com/yangjinglyy/UP-ZeroIR 获取。

英文摘要

Zero-shot image restoration provides a flexible way to handle diverse degradations without task-specific training. However, existing methods typically rely on stacked layers or pre-trained features to enhance degradation expression, while overlooking physically consistent priors. Insufficient degradation prompts impose a heavy training burden and high sampling costs during zero-shot diffusion. Moreover, the fixed inference trajectory often collapses to suboptimal solutions under complex corruptions. We observe that heterogeneous degradations can be reparameterized into a minimal set of physically coherent parameters for compact representation. Based on this insight, we first propose a unified physical zero-shot image restoration (UP-ZeroIR) framework that explicitly models heterogeneous degradations as a homogeneous all-in-one distribution. The distribution can be optimized directly in the latent space, enabling principled solution exploration and effective prompt adaptation. In addition, we introduce a dynamic quality-refinement strategy that adaptively adjusts the diffusion trajectory for robust globally optimal convergence. Extensive experiments demonstrate that our method achieves state-of-the-art performance across both single and mixed degradations. Our code is available at https://github.com/yangjinglyy/UP-ZeroIR

2605.24590 2026-05-26 cs.CV cs.LG stat.ML 版本更新

Physen-Noise2Noise: Physics-Guided Self-Supervised Defocus Deblurring with Bias Correction under Low-Light Conditions

Physen-Noise2Noise: 低光条件下带偏差校正的物理引导自监督散焦去模糊

Ziyan Huang, Lang Wu, Hongji Wang, Yifei Liu, Dongliang Tang, Hongqiao Wang

发表机构 * School of Mathematics and Statistics, Central South University(数学与统计学学院,中南大学) Key Laboratory for Micro/Nano Optoelectronic Devices of Ministry of Education, Hunan Provincial Key Laboratory of Low-Dimensional Structural Physics and Devices, School of Physics and Electronics, Hunan University(教育部微/纳米光电器件重点实验室,湖南省低维结构物理与器件重点实验室,物理与电子学院,湖南大学)

AI总结 提出一种基于物理模型的自监督散焦去模糊框架Physen-Noise2Noise,通过可学习噪声偏差参数和频域约束,在无干净参考图像的情况下联合校正偏差噪声并恢复高频细节。

Comments 14 pages

详情
AI中文摘要

低光、长曝光散焦去模糊由于同时存在严重模糊和复杂有偏噪声,仍然是一个具有挑战性的问题。现有方法通常依赖于简化的噪声假设,这限制了它们在真实成像条件下的有效性。在这项工作中,我们提出了Physen-Noise2Noise,一种由散焦成像物理模型引导的自监督去模糊框架,它利用有噪声的多帧观测,无需干净参考图像。与传统的基于Noise2Noise的方法假设零均值噪声不同,我们推导了散焦成像过程固有的频域约束,并通过可学习的噪声偏差参数将其纳入学习框架。此外,引入了一种多帧有噪初始化策略,在去模糊之前抑制复杂有偏噪声,为重建提供更稳定的起点。该公式显式建模有偏噪声,并在训练过程中实现联合偏差校正和高频细节恢复。此外,我们开发了一种预训练-微调变体,以增强在挑战性噪声条件下的鲁棒性和泛化能力。在模拟和真实数据集上的大量实验表明,所提出的方法在存在复杂有偏噪声的情况下,始终优于最先进的自监督散焦去模糊方法。

英文摘要

Low-light, long-exposure defocus deblurring remains a challenging problem due to the simultaneous presence of severe blur and complex biased noise. Existing methods typically rely on simplified noise assumptions, which limits their effectiveness under realistic imaging conditions. In this work, we propose Physen-Noise2Noise, a self-supervised deblurring framework guided by the physical model of defocus imaging, which leverages noisy multi-frame observations without requiring clean reference images. Unlike conventional Noise2Noise-based approaches that assume zero-mean noise, we derive a frequency-domain constraint inherent to the defocus imaging process and incorporate it into the learning framework via a learnable noise bias parameter. In addition, a multi-frame noisy initialization strategy is introduced to suppress complex biased noise prior to deblurring, providing a more stable starting point for reconstruction. This formulation explicitly models biased noise and enables joint bias correction and high-frequency detail recovery during training. Furthermore, we develop a pretrain-finetune variant to enhance robustness and generalization under challenging noise conditions. Extensive experiments on both simulation and real-world datasets demonstrate that the proposed method consistently outperforms state-of-the-art self-supervised approaches for defocus deblurring in the presence of complex biased noise.

2605.24578 2026-05-26 cs.CV 版本更新

World Models as Group Actions

世界模型作为群作用

Zijie Wang, Wei Zhang, Weiming Zhang, Fanqi Zhang, Xiao Tan, Yipeng Qin, Guanbin Li

发表机构 * Sun Yat-sen University(中山大学) Shenzhen Loop Area Institute(深圳河套学院) Baidu Inc.(百度公司) Cardiff University(卡迪夫大学) Guangdong Key Laboratory of Big Data Analysis and Processing(广东大数据分析与处理重点实验室)

AI总结 本文提出将动作条件世界建模形式化为状态空间上的群作用,通过潜在空间正则化强制执行恒等、逆和组合一致性,并引入群作用一致性(GAC)和群作用鲁棒性(GAR)指标来评估结构正确性和展开稳定性。

Comments Under review

详情
AI中文摘要

视频世界模型已实现强大的视觉真实性,但这并不确保其动态真正由动作控制。本文认为,动作忠实性应通过动作的组合结构来理解,在许多具身设置中,这种结构遵循群结构(例如,导航中的SE(2))。基于这一见解,我们将动作条件世界建模形式化为在状态空间上实现群作用,为评估超越视觉质量的动态提供了原则性标准。为了实施这一框架,我们提出了一种统一方法,通过合成监督的潜在空间正则化强制执行恒等、逆和组合一致性,避免额外数据收集。我们进一步引入了两个指标:群作用一致性(GAC)和群作用鲁棒性(GAR),以评估结构正确性和展开稳定性。大量实验结果表明,我们的方法在不降低感知质量的情况下,一致地改进了最先进视频世界模型中的GAC和GAR。

英文摘要

Video world models have achieved strong visual realism, but this does not ensure that their dynamics are truly governed by actions. In this work, we argue that action faithfulness should be understood through the compositional structure of actions, which in many embodied settings follows a group structure (e.g., SE(2) for navigation). Based on this insight, we formalize action-conditioned world modeling as realizing a group action on the state space, providing a principled criterion for evaluating dynamics beyond visual quality. To operationalize this framework, we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality.
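
The identity, inverse, and composition conditions named in the abstract can be written down concretely for SE(2), the navigation example it mentions. The sketch below is an assumption-laden illustration: poses stand in for latent states, `step` is a hypothetical stand-in for a world model's transition, and `distance` is an arbitrary pose metric. It is not the paper's GAC or GAR metric.

```python
import math


def wrap(theta):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))


def compose(a, b):
    """SE(2) product a*b for elements (x, y, theta): apply b in a's frame."""
    ax, ay, at = a
    bx, by, bt = b
    c, s = math.cos(at), math.sin(at)
    return (ax + c * bx - s * by, ay + s * bx + c * by, wrap(at + bt))


def inverse(a):
    """SE(2) inverse: (R, t) -> (R^T, -R^T t)."""
    ax, ay, at = a
    c, s = math.cos(at), math.sin(at)
    return (-(c * ax + s * ay), -(-s * ax + c * ay), wrap(-at))


IDENTITY = (0.0, 0.0, 0.0)


def distance(p, q):
    """Illustrative pose discrepancy: translation gap plus wrapped angle gap."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) + abs(wrap(p[2] - q[2]))


def consistency_residuals(step, state, a, b):
    """Residuals of the three group-action conditions for a transition `step`."""
    identity_err = distance(step(state, IDENTITY), state)
    inverse_err = distance(step(step(state, a), inverse(a)), state)
    composition_err = distance(
        step(step(state, a), b), step(state, compose(a, b))
    )
    return identity_err, inverse_err, composition_err


state = (1.0, 2.0, 0.3)
a, b = (0.5, 0.0, math.pi / 4), (0.2, -0.1, -0.6)

# An exact group action has (numerically) zero residuals.
print(consistency_residuals(compose, state, a, b))

# A transition with a constant forward drift violates all three conditions.
drifting = lambda s, act: compose(s, (act[0] + 0.1, act[1], act[2]))
print(consistency_residuals(drifting, state, a, b))
```

In a learned model the same three residuals would be measured between latent states rather than poses, which is where a regularizer of the kind the abstract describes would act.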

2605.24570 2026-05-26 cs.LG cs.AI cs.CV 版本更新

PILOT: Policy-Informed Learned Optimization for Adaptive Deep Network Training

PILOT: 策略引导的学习优化器用于自适应深度网络训练

Sattam Altuuaim, Lama Ayash, Muhammad Mubashar, Naeemullah Khan

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) University of Strathclyde(斯特拉思克莱德大学)

AI总结 提出PILOT在线优化器,通过梯度方向一致性信号动态调整动量、归一化和符号更新的组合,在FashionMNIST和CIFAR-10上实现更高准确率。

Comments 16 pages, 5 figures

详情
AI中文摘要

尽管优化在深度学习中扮演核心角色,但大多数优化器依赖于训练开始前固定函数形式的更新结构。这种静态设计限制了它们响应损失景观中变化梯度行为的能力,其中训练可能在稳定、噪声和不一致状态之间切换。本研究提出PILOT(策略引导的学习优化器),一种在线优化器,在训练过程中自适应其更新行为。PILOT不使用动量、归一化和符号更新之间的固定平衡,而是将梯度方向一致性作为局部训练稳定性的信号。基于该一致性信号调整更新规则,使优化器能够在梯度变得稳定、噪声或不一致时调整其行为。在FashionMNIST和CIFAR-10上的实验表明,PILOT在卷积设置中始终达到评估优化器中的最高准确率。在CNN架构上,PILOT在FashionMNIST上达到94.13%,在CIFAR-10上达到81.94%。在ResNet-18上,它进一步提升了性能,在FashionMNIST上达到95.71%,在CIFAR-10上达到93.42%。这些结果表明,在训练过程中学习如何调整更新结构可以在保持简单一阶优化框架的同时,提高紧凑和更深卷积模型的性能。PILOT的实现公开于https://github.com/SattamAltwaim/PILOT.git。

英文摘要

Despite the central role of optimization in deep learning, most optimizers rely on update structures whose functional form is fixed before training begins. This static design can limit their ability to respond to changing gradient behavior across the loss landscape, where training may shift between stable, noisy, and inconsistent regimes. This study proposes PILOT (Policy-Informed Learned OpTimizer), an online optimizer that adapts its update behavior during training. Rather than using a fixed balance between momentum, normalization, and sign-based updates, PILOT uses gradient-direction agreement as a signal of local training stability. Conditioning the update rule on this agreement signal allows the optimizer to adjust its behavior when gradients become stable, noisy, or inconsistent. Experiments on FashionMNIST and CIFAR-10 show that PILOT consistently achieves the highest accuracy among the evaluated optimizers across convolutional settings. On the CNN architecture, PILOT reaches 94.13% on FashionMNIST and 81.94% on CIFAR-10. On ResNet-18, it further improves performance, reaching 95.71% on FashionMNIST and 93.42% on CIFAR-10. These results suggest that learning how to adapt the update structure during training can improve performance across both compact and deeper convolutional models while preserving a simple first-order optimization framework. The implementation of PILOT is publicly available at https://github.com/SattamAltwaim/PILOT.git
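
To make the idea of a gradient-direction agreement signal concrete, here is a minimal sketch. Everything in it is an assumption: the abstract does not define the signal or the update rule, so cosine similarity between the current gradient and the momentum buffer, the linear blend weight, and the names `agreement` and `adaptive_step` are illustrative choices, not PILOT's actual method.

```python
import math


def agreement(grad, momentum):
    """Cosine similarity between gradient and momentum, in [-1, 1]."""
    dot = sum(g * m for g, m in zip(grad, momentum))
    ng = math.sqrt(sum(g * g for g in grad))
    nm = math.sqrt(sum(m * m for m in momentum))
    if ng == 0.0 or nm == 0.0:
        return 0.0  # no directional information available
    return dot / (ng * nm)


def sign(x):
    return (x > 0) - (x < 0)


def adaptive_step(params, grad, momentum, lr=0.01, beta=0.9):
    """One hypothetical update that blends momentum and sign-based steps.

    High agreement -> rely on the momentum direction;
    low agreement  -> fall back to a bounded sign-based step.
    """
    a = agreement(grad, momentum)
    w = 0.5 * (a + 1.0)  # map [-1, 1] to a blend weight in [0, 1]
    new_m = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grad)]
    update = [w * m + (1.0 - w) * sign(g) for m, g in zip(new_m, grad)]
    new_params = [p - lr * u for p, u in zip(params, update)]
    return new_params, new_m, a


params, momentum = [1.0], [0.0]
for _ in range(3):
    grad = [2.0 * params[0]]  # gradient of f(x) = x^2
    params, momentum, a = adaptive_step(params, grad, momentum)
    print(round(params[0], 4), round(a, 3))
```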

2605.24566 2026-05-26 cs.CV cs.GR cs.LG 版本更新

EMA: Effort Metric Attention for Anatomical Effort-Guided Human Motion Diffusion

EMA: 面向解剖学努力引导的人体运动扩散的努力度量注意力

Joshua Siy, Huakun Liu, Yutaro Hirao, Monica Perusquia-Hernandez, Hideaki Uchiyama, Kiyoshi Kiyokawa

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学)

AI总结 提出基于努力度量注意力(EMA)的强度控制框架,通过数值努力信号调节运动扩散模型,实现细粒度、区域化的运动强度控制,并验证了与LMA描述符的单调对齐。

Comments Accepted at IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)

详情
AI中文摘要

人体运动扩散模型可以从文本合成动作序列,但控制运动强度仍然具有挑战性。现有方法依赖于与努力相关的副词,这些副词模糊不清,无法捕捉诸如节奏等定量方面,通常导致动态平坦且单调。我们提出了一种基于努力度量注意力(EMA)的强度控制框架,这是一个交叉注意力模块,将扩散条件建立在数值努力信号上。受拉班动作分析(LMA)启发,该框架关注时间和重量努力因素。我们使用两个运动学指标来近似这些因素:用于节奏的峰值关节位置变化和用于运动量的集体关节位置变化。EMA实现了细粒度、区域化的控制,无需昂贵的后验优化。我们引入了两个评估任务,度量到运动的一致性和身体部位级别的努力调制,以评估数值保真度和局部控制。实验和用户研究表明,指定的努力水平、生成的运动动态和已建立的LMA描述符之间具有近乎单调的对齐。这些结果表明在实践中对努力动态进行了有效且可解释的控制。

英文摘要

Human motion diffusion models can synthesize action sequences from text, but controlling motion intensity remains challenging. Existing approaches rely on effort-related adverbs, which are ambiguous and fail to capture quantitative aspects such as pacing, often resulting in flat and monotonous dynamics. We propose an intensity-control framework based on Effort Metric Attention (EMA), a cross-attention module that conditions diffusion on numerical effort signals. Inspired by Laban Movement Analysis (LMA), the framework focuses on the Time and Weight effort factors. We approximate these factors using two kinematic metrics: peak joint positional change for pacing and collective joint positional change for motion amount. EMA enables fine-grained, region-wise control without costly post-hoc optimization. We introduce two evaluation tasks, metric-to-motion consistency and body-part-level effort modulation, to assess numerical fidelity and localized control. Experiments and a user study show near-monotonic alignment between specified effort levels, generated motion dynamics, and established LMA descriptors. These results indicate effective and interpretable control of effort dynamics in practice.
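
The two kinematic metrics mentioned above can be sketched from raw joint positions. The exact definitions are not given in the abstract, so the choices here are assumptions: "peak" is taken as the largest per-frame displacement of any selected joint, "collective" as the sum of all selected joints' displacements over the clip, and the optional `joints` index list is a hypothetical way to express region-wise control.

```python
import math


def joint_displacements(motion, joints=None):
    """Per-transition, per-joint displacement magnitudes.

    motion: list of frames, each a list of (x, y, z) joint positions.
    joints: optional list of joint indices (a body region); default is all.
    """
    out = []
    for prev, cur in zip(motion, motion[1:]):
        idx = range(len(prev)) if joints is None else joints
        out.append([math.dist(prev[j], cur[j]) for j in idx])
    return out


def peak_change(motion, joints=None):
    """Assumed pacing proxy: largest single-frame joint displacement."""
    disp = joint_displacements(motion, joints)
    return max((d for frame in disp for d in frame), default=0.0)


def collective_change(motion, joints=None):
    """Assumed motion-amount proxy: total displacement over joints and frames."""
    disp = joint_displacements(motion, joints)
    return sum(d for frame in disp for d in frame)


# Joint 0 moves 1 unit, then 3 units along x; joint 1 stays still.
motion = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(4.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
print(peak_change(motion), collective_change(motion))
print(peak_change(motion, joints=[1]))  # region containing only the still joint
```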

2605.24562 2026-05-26 cs.CV cs.AI 版本更新

PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

PEDESTRIANQA: 面向行人意图与轨迹预测的视觉-语言模型基准

Naman Mishra, Shankar Gangisetty, C. V. Jawahar

发表机构 * CVIT, IIIT-Hyderabad, India(IIIT-海得拉巴计算机视觉与智能技术研究所,印度)

AI总结 提出大规模视频数据集PedestrianQA,将行人意图和轨迹预测转化为带结构化理由的问答任务,通过微调视觉-语言模型显著提升预测准确性与可解释性。

详情
AI中文摘要

行人意图和轨迹预测对于自动驾驶系统的安全部署至关重要,直接影响复杂交通环境中的导航决策。近期大型视觉-语言模型的进展通过结合高容量视觉理解与灵活的自然语言推理,为这些任务提供了强大的新范式。本文中,我们引入PedestrianQA,这是一个大规模视频数据集,将行人意图和轨迹预测公式化为带有结构化理由的问答任务。PedestrianQA以自然语言表达丰富标注的行人序列,使视觉-语言模型能够从视觉动态、上下文线索和交通智能体间的交互中学习,同时生成其预测的简洁解释,无需为每个任务定制专门的架构。在PIE、JAAD、TITAN和IDD-PeD上的实证评估表明,在PedestrianQA上微调最先进的视觉-语言模型显著提高了意图分类、轨迹预测准确性以及解释性理由的质量,展示了视觉-语言模型作为安全关键行人行为建模的统一且可解释框架的强大潜力。

英文摘要

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored to each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.

2605.24553 2026-05-26 cs.CV 版本更新

IQA-Spider: Unifying Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring

IQA-Spider:统一多粒度图像质量评估与推理、定位和指代

Xinge Peng, Yiting Lu, Xin Li, Zhibo Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出IQA-Spider框架,通过统一推理、定位和指代任务,实现多粒度图像质量评估,并采用两阶段设计解决现有方法仅支持部分感知维度的问题。

Comments Accepted by ICML 2026

详情
AI中文摘要

我们提出IQA-Spider,这是第一个将推理、定位和指代统一到单个基于LMM的框架中的图像质量评估(IQA)框架,用于多粒度质量理解。现有的基于LMM的IQA方法通常仅支持部分感知维度,例如质量描述和问答(即推理)或像素级定位。这一局限性主要源于缺乏(i)统一的任务和数据形式化,以及(ii)有效的多粒度学习优化范式。为解决这些局限性,我们形式化了一个严格的任务四元组,涵盖全局和局部质量描述、像素级定位以及区域级指代。基于这一形式化,我们通过可扩展的自动标注流水线构建了相应的IQA数据集,从而为统一的多粒度学习提供了坚实基础。为进一步实现统一感知,我们采用无冲突的两阶段设计,逐步将文本级多粒度理解扩展到像素级定位:(i)第一阶段使模型具备跨多个IQA任务的细粒度文本级推理能力;(ii)第二阶段引入无需训练的文本到点定位范式,通过将token logits映射到空间坐标来桥接文本语义和像素级感知。基于这些努力,我们实现了具有统一多粒度可解释图像质量评估的IQA-Spider。在多个基准上的大量实验展示了强大的性能,验证了所提出形式化和框架的有效性与通用性。

英文摘要

We present IQA-Spider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring into a single LMM-based framework for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only partial perception dimensions, such as quality description and question answering (i.e., reasoning) or pixel-level grounding. This limitation largely stems from the absence of (i) a unified task and data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address these limitations, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable and automatic annotation pipeline, thereby providing a solid foundation for unified multi-granularity learning. To further enable unified perception, we adopt a conflict-free two-stage design that progressively extends text-level multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained text-level reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm, which bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Based on these efforts, we achieve IQA-Spider with unified multi-granularity explainable image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework.

2605.24533 2026-05-26 cs.CV 版本更新

Learnable Shape Prototypes with Occlusion-Geometry-Guided Injection for Amodal Instance Segmentation

可学习形状原型与遮挡几何引导注入的非模态实例分割

Fufan Zhang, Jingxiang Wang, Xiangjie Ye

发表机构 * School of Mechanical Engineering and Automation, Northeastern University(机械工程与自动化学院,东北大学) School of Information Science and Engineering, Northeastern University(信息科学与工程学院,东北大学)

AI总结 提出一种门控可靠性自适应形状先验框架,通过可学习原型和交叉注意力生成实例自适应形状先验,并利用可见掩码的符号距离场调节注入强度,在多个评估设置下超越现有方法。

Comments 13 pages, 7 figures, 5 tables. Submitted to IEEE Transactions on Circuits and Systems for Video Technology

详情
AI中文摘要

非模态实例分割旨在预测完整的物体掩码,包括被遮挡区域,这些区域缺乏像素级观测,必须借助形状先验进行推断。现有方法通过固定容量编码空间或昂贵的生成模型获取形状先验,并在所有空间位置均匀注入,而不适应可见区域和遮挡区域之间不同的先验需求。本文提出一种门控可靠性自适应形状先验框架,该框架引入一个形状先验记忆模块,通过交叉注意力组合可学习原型,通过加权原型组合(而非生成)产生实例自适应形状先验。然后,一个空间自适应可靠性门利用可见掩码的符号距离场,根据每个位置的遮挡深度调节注入强度,在可见区域保留可靠特征,同时将形状补偿引导至遮挡区域。在两个主流非模态实例分割基准上的实验表明,所提方法在多个评估设置下优于现有方法,在标准设置下,其中一个基准上的遮挡区域平均交并比提高了超过11个百分点,同时总参数量约为三分之一。线性探针分析进一步揭示,可见掩码交叉注意力模块隐式地将遮挡几何编码到视觉标记表示中,解释了所提模块分解的有效性。

英文摘要

Amodal instance segmentation aims to predict the complete object mask including occluded regions that lack pixel-level observations and must be inferred with the aid of shape priors. Existing methods acquire shape priors through fixed-capacity encoding spaces or expensive generative models, and inject them uniformly across all spatial positions without adapting to the varying prior demand between visible and occluded regions. In this paper, we propose a gated reliability-adaptive shape prior framework, which introduces a shape prior memory module that combines learnable prototypes via cross-attention to produce instance-adaptive shape priors through weighted prototype combination rather than generation. A spatial adaptive reliability gate then employs the signed distance field of the visible mask to modulate injection intensity at each position according to its occlusion depth, preserving reliable features in visible regions while directing shape compensation toward occluded areas. Experiments on two mainstream amodal instance segmentation benchmarks demonstrate that the proposed method outperforms existing approaches under multiple evaluation settings, improving the mean intersection-over-union over occluded regions by over 11 percentage points on one of the two benchmarks under the standard setting, while using approximately one-third of the total parameters. Linear probing analysis further reveals that the visible-mask cross-attention module implicitly encodes occlusion geometry into visual token representations, explaining the effectiveness of the proposed module decomposition.

2605.24532 2026-05-26 cs.CV 版本更新

Image-Conditioned Instance Prompt Network for Referring Remote Sensing Image Segmentation

图像条件实例提示网络用于遥感图像指代分割

Biaoyu Ren, Qingsheng Wang, Cun Xu, Dingkang Yang, Wenxuan Wang

发表机构 * School of Computer Science, Northwestern Polytechnical University, Xi'an, China(西北工业大学计算机科学学院,西安,中国) College of Intelligent Robotics and Advanced Manufacturing, Fudan University, Shanghai, China(复旦大学智能机器人与先进制造学院,上海,中国) Shenzhen Research Institute of Northwestern Polytechnical University, Shenzhen, China(西北工业大学深圳研究院,深圳,中国)

AI总结 提出图像条件实例提示网络(ICIPNet),通过自适应视觉语义表示和双边信息融合模块,缓解跨模态特征融合瓶颈,提升遥感图像指代分割性能。

Comments 6 pages, 3 figures. Equal contribution: Biaoyu Ren and Qingsheng Wang. Corresponding authors: Dingkang Yang and Wenxuan Wang

详情
AI中文摘要

遥感图像指代分割(RRSIS)是一项与具身感知范式相关的情境化、任务驱动的跨模态任务,要求模型将视觉空间特征与语言意图对齐以实现精确的目标感知。近期研究聚焦于细化文本特征的粒度并优化图像-文本特征融合,以更好地引导目标特征表示。然而,描述粒度不足和对语义偏移的敏感性可能导致跨模态特征融合的瓶颈。为解决这些问题,我们提出带有双边信息融合的图像条件实例提示网络(ICIPNet),旨在缓解跨模态特征融合的瓶颈。ICIPNet引入图像条件实例提示(ICIP)模块,无需外部知识即可生成自适应的视觉和语义表示。双边信息融合(BIF)模块沿token和通道维度增强特征融合。实验表明,所提出的ICIPNet优于现有RRSIS模型。

英文摘要

Referring Remote Sensing Image Segmentation (RRSIS) is a situated, task-driven cross-modal task related to the embodied perception paradigm, requiring models to align visual-spatial features with linguistic intentions for precise target perception. Recent research has focused on refining the granularity of textual features and optimizing image-text feature fusion to better guide target feature representations. However, insufficient descriptive granularity and sensitivity to semantic shifts can cause bottlenecks in cross-modal feature fusion. To address these issues, we propose the Image-Conditioned Instance Prompt Network (ICIPNet) with Bilateral Information Fusion, which is designed to alleviate bottlenecks in cross-modal feature fusion. ICIPNet introduces an Image-Conditioned Instance Prompt (ICIP) module to generate self-adaptive visual and semantic representations without external knowledge. The Bilateral Information Fusion (BIF) module enhances feature fusion along the token and channel dimensions. Experiments demonstrate that the proposed ICIPNet outperforms existing RRSIS models.

2605.24531 2026-05-26 cs.CV 版本更新

NudgeVAD: Language-Nudged End-to-End Driving via FiLM Residuals

NudgeVAD: 通过FiLM残差的语言引导端到端驾驶

Chieh-Chi Yang, Yu-Hsiang Chen, Yi-Ting Chen

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学)

AI总结 提出NudgeVAD框架,利用语言作为校准的微调信号,通过恒等初始化的FiLM和零初始化残差头,在命令不可靠时显著提升驾驶轨迹预测性能。

Comments Technical report for the doScenes Instructed Driving Challenge, CVPR 2026 DriveX Workshop. 1st place in the Ablation track

详情
AI中文摘要

自然语言指令有望实现可控的端到端驾驶,但当规划器已经接收到可靠的高级命令时,其优势可能被掩盖。我们提出NudgeVAD,一个冻结规划器残差框架,利用语言作为对VAD轨迹的校准微调。通过恒等初始化的FiLM和零初始化的残差头,NudgeVAD在初始化时等价于冻结规划器,因此学习到的偏差仅来自语言条件残差。我们沿命令可靠性轴评估NudgeVAD。在可靠命令下,语言改进了初始规划器,但与VAD-FT (UNCOND)(一个计算量匹配的、无语言微调的VAD模型)相比几乎冗余。然而,在随机命令下,语言变得至关重要:去除文本使ADE6s恶化至3.166米,而带有文本的NudgeVAD恢复至2.806米,并优于VAD-FT (UNCOND) 0.312米。这些结果表明,语言并非普遍可加;当分类命令通道不可靠时,它最有价值。

英文摘要

Natural-language instructions promise controllable end-to-end driving, but their benefit can be hidden when planners already receive reliable high-level commands. We propose NudgeVAD, a frozen-planner residual framework that uses language as a calibrated nudge to a VAD trajectory. With identity-initialized FiLM and a zero-initialized residual head, NudgeVAD is equivalent to the frozen planner at initialization, so learned deviations arise only from language-conditioned residuals. We evaluate NudgeVAD along a command-reliability axis. With reliable commands, language improves the initial planner but becomes nearly redundant once compared against VAD-FT (UNCOND), a compute-matched VAD model fine-tuned without language. With random commands, however, language becomes essential: detaching text degrades ADE6s to 3.166 m, while NudgeVAD with text recovers 2.806 m and outperforms VAD-FT (UNCOND) by 0.312 m. These results show that language is not universally additive; it is most valuable when the categorical command channel is unreliable.
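
The "equivalent to the frozen planner at initialization" property follows directly from the two initializations named in the abstract, and a few lines make that visible. This is a sketch under stated assumptions: the class `LanguageNudge`, its linear heads, and all dimensions are hypothetical; only the general FiLM form (scale and shift of features) and the zero-initialized residual are taken from the text.

```python
def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


class LanguageNudge:
    """Hypothetical language-conditioned residual on top of a frozen planner."""

    def __init__(self, feat_dim, text_dim, traj_dim):
        # Identity-initialized FiLM: gamma = 1 + W_gamma @ text, beta = W_beta @ text,
        # with both weight matrices starting at zero.
        self.W_gamma = zeros(feat_dim, text_dim)
        self.W_beta = zeros(feat_dim, text_dim)
        # Zero-initialized residual head: predicts no trajectory offset at start.
        self.W_res = zeros(traj_dim, feat_dim)

    def film(self, feat, text):
        gamma = [1.0 + g for g in matvec(self.W_gamma, text)]
        beta = matvec(self.W_beta, text)
        return [g * f + b for g, f, b in zip(gamma, feat, beta)]

    def forward(self, base_traj, feat, text):
        residual = matvec(self.W_res, self.film(feat, text))
        return [t + r for t, r in zip(base_traj, residual)]


model = LanguageNudge(feat_dim=3, text_dim=2, traj_dim=4)
base_traj = [0.0, 1.0, 2.0, 3.0]  # frozen planner output (flattened waypoints)
feat, text = [0.5, -1.0, 2.0], [1.0, 0.0]

# At initialization the output equals the frozen planner's trajectory.
print(model.forward(base_traj, feat, text) == base_traj)

# After a (simulated) weight update, the text can nudge the trajectory.
model.W_gamma[0][0] = 0.2
model.W_res[0][0] = 0.5
print(model.forward(base_traj, feat, text))
```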

2605.24530 2026-05-26 cs.CL cs.CV 版本更新

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

Unveil: 统一视觉-文本集成与蒸馏的多模态文档检索

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University(北京大学通用人工智能国家重点实验室) School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院) Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院空天信息创新研究院) Key Laboratory of Target Cognition and Application Technology(目标认知与应用技术重点实验室) Beijing Institute of Technology(北京理工大学) Ucap Cloud(Ucap云)

AI总结 提出Unveil框架,通过视觉-文本嵌入和知识蒸馏实现鲁棒的文档检索,兼顾布局与语义信息。

Comments ACL 2025 Main Conference

详情
AI中文摘要

现实场景中的文档检索由于文档格式和模态的多样性面临重大挑战。传统的基于文本的方法依赖于定制的解析技术,忽略布局信息且容易出错,而最近的无解析视觉方法在文本丰富的场景中往往难以捕捉细粒度的文本语义。为了解决这些限制,我们提出了Unveil,一种新颖的视觉-文本嵌入框架,有效整合文本和视觉特征以实现鲁棒的文档表示。通过知识蒸馏,我们将视觉-文本嵌入模型的语义理解能力转移到纯视觉模型,实现高效的无解析检索同时保持语义保真度。实验结果表明,我们的视觉-文本嵌入方法超越了现有方法,而知识蒸馏成功弥合了视觉-文本方法与纯视觉方法之间的性能差距,提高了检索准确性和效率。

英文摘要

Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.

2605.22715 2026-05-26 cs.CV cs.AI cs.CL cs.HC 版本更新

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

AnyMo:野外人体运动的几何感知与设置无关建模

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

发表机构 * The University of New South Wales(新南威尔士大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出AnyMo框架,通过物理模拟生成多样化IMU信号、图编码器预训练和LLM对齐,实现跨设备/数据集的零样本活动识别、跨模态检索和运动描述,性能显著提升。

详情
AI中文摘要

随着可穿戴和移动设备日益融入日常生活,它们为持续感知野外人体运动提供了实用途径。但惯性信号高度依赖于传感设置,包括身体位置、安装方向、传感器朝向、设备硬件和采样协议。这种设置依赖性使得学习跨设备和数据集迁移的运动表示变得困难,并限制了可穿戴IMU在封闭集识别之外的广泛应用。我们提出AnyMo,一个用于设置无关人体运动建模的几何感知框架。AnyMo利用基于物理的IMU模拟在密集体表位置上生成多样且合理的合成信号,从配对的合成放置视图和掩蔽部分观测中预训练图编码器,将多位置IMU标记化为全身运动令牌,并将这些令牌与LLM对齐以进行运动-语言理解。我们在三个互补任务上评估AnyMo:跨14个未见下游数据集的零样本活动识别、跨模态检索和可穿戴IMU运动描述,其中在HAR上平均Accuracy/F1/R@2提升11.7%/11.6%/22.6%,零样本IMU到文本和文本到IMU检索MRR分别提升15.9%和28.6%,零样本描述BERT-F1提升18.8%。这些结果支持AnyMo作为野外可穿戴运动理解的通才模型。项目页面:https://baiyuchen.com/project/AnyMo。

英文摘要

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.

2605.21652 2026-05-26 cs.CV cs.AI 版本更新

Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Look-Closer-Then-Diagnose: 通过主动缩放实现置信度感知的超声VQA

Yue Zhou, Erxuan Wu, Yikang Sun, Hongjoo Lee, Yuan Bi, Huixiong Xu, Nassir Navab, Zhongliang Jiang

发表机构 * Computer Aided Medical Procedures (CAMP)(计算机辅助医疗程序) TU Munich, Germany(慕尼黑工业大学,德国) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Munich, Germany(慕尼黑,德国) Zhongshan Hospital, Fudan University, China(复旦大学中山医院,中国) The University of Hong Kong, Hong Kong, China(香港大学,香港,中国)

AI总结 提出一个模拟超声医师认知流程的框架,通过“缩放-诊断”范式和基于组相对策略优化的不确定性感知奖励,提升超声视觉问答中病灶定位和诊断性能。

详情
AI中文摘要

视觉-语言模型(VLM)显著推进了医学视觉问答,但在超声领域性能仍不理想。临床实践中,超声医师在制定报告时会明确关注病灶区域,尽管诊断解释有时因固有的主观性而存在差异。然而,现有VLM并未明确设计为在诊断前交互式地放大病灶;此外,它们通常将标注视为无偏真值,未能考虑其固有的主观性和模糊性。在本文中,我们提出了一个专门考虑超声医师认知工作流的框架。我们首先引入了一个结构化的“缩放-诊断”范式,该范式复制了交互式搜索过程以实现病灶聚焦推理。此外,在组相对策略优化(GRPO)框架内,我们引入了一个基于随机组 rollout 的不确定性感知奖励,以估计预测一致性作为模型置信度的代理。这两个组件共同鼓励模型在清晰案例上强化准确预测,同时在模糊情况下保持谨慎。在肝脏、乳腺和甲状腺数据集上的实验表明,我们的框架将病灶定位提高了39.3%,证明我们的模型学会了主动靠近观察并诊断的能力。

英文摘要

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.
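
The idea of using agreement among sampled rollouts as a confidence proxy can be sketched in a few lines. The group-normalized advantage below follows the usual GRPO form (reward minus group mean, divided by group standard deviation); the reward itself is an assumption, since the abstract does not give its formula. Here a correct answer is simply rewarded in proportion to how consistent the group is, and the names `group_consistency`, `rollout_rewards`, and `group_relative_advantages` are hypothetical.

```python
from collections import Counter
import statistics


def group_consistency(answers):
    """Fraction of rollouts that agree with the group's most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def rollout_rewards(answers, label):
    """Assumed reward: correct rollouts earn the group's consistency score.

    On a clear case (high consistency) a correct answer is rewarded strongly;
    on an ambiguous case (low consistency) the same answer earns less.
    """
    c = group_consistency(answers)
    return [c if a == label else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within the rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


answers = ["benign", "benign", "malignant", "benign"]  # four sampled rollouts
rewards = rollout_rewards(answers, label="benign")
print(group_consistency(answers), rewards)
print(group_relative_advantages(rewards))
```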

2605.21417 2026-05-26 cs.CV cs.AI 版本更新

Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

排序重要:面向混合情感识别的排名感知选择性融合

Junghyun Lee, Hyunseo Kim, Hanna Jang, Junhyug Noh

发表机构 * Department of Artificial Intelligence and Software(人工智能与软件系)

AI总结 提出一种排名感知的多编码器框架,通过注意力门控模块选择最有效的编码器进行融合,并解耦预测为存在性和显著性头,结合无监督域适应,在混合情感识别任务中取得第二名成绩。

Comments Accepted at IEEE FG 2026 Workshops. Final system ranked 2nd in the BlEmoRE Challenge. 9 pages including appendix, 8 figures

详情
AI中文摘要

混合情感识别具有挑战性,因为情感通常表现为微妙且重叠的多模态线索的混合,而非单一主导信号。我们提出了一种排名感知的多编码器框架,该框架选择性地结合来自不同预提取视频和音频编码器的互补表示。我们的方法将异构编码器特征投影到共享潜在空间,通过基于注意力的门控模块估计样本级编码器重要性,并仅融合前n个最具信息量的编码器。为了更好地建模混合情感,我们将预测解耦为存在性和显著性头,并通过概率级融合对齐它们。我们进一步引入了无需伪标签的特征级无监督域适应,以提高在分布偏移下的鲁棒性。在BlEmoRE挑战赛上的实验表明,所提出的框架优于强单个编码器和朴素的多编码器融合基线。我们的最终系统在比赛中排名第二,支持了排名感知选择性融合在细粒度混合情感识别中的有效性。

英文摘要

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
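The selection step described above (score each encoder per sample, keep only the top-n, fuse them) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the features are taken to be already projected into the shared latent space, the gating scores are given as plain logits, and renormalizing a softmax over only the selected encoders is a design choice made here, not something the abstract specifies. The name `top_n_fusion` is hypothetical.

```python
import math


def top_n_fusion(features, scores, n):
    """Fuse the n highest-scoring encoders for a single sample.

    features: list of E vectors (each of length D), already in a shared space.
    scores:   list of E gating logits for this sample.
    Returns the fused D-vector and the indices of the selected encoders.
    """
    # Rank encoders by gating score, highest first, and keep the top n.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:n]

    # Softmax restricted to the selected encoders (max-shifted for stability).
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the selected encoder features.
    dim = len(features[0])
    fused = [sum(w * features[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return fused, top


# Three toy encoders with one-hot features; the first has the lowest score.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
scores = [0.0, 2.0, 2.0]
fused, selected = top_n_fusion(features, scores, n=2)
```

Here the lowest-scoring encoder is dropped entirely and the two remaining encoders, having equal scores, contribute equally to the fused vector.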

2605.21190 2026-05-26 cs.CV 版本更新

Semantic Granularity Navigation in Image Editing

图像编辑中的语义粒度导航

Liangsi Lu, Minzhe Guo, Xuhang Chen, Yang Shi

发表机构 * Guangdong University of Technology, Guangzhou, China(广东工业大学,广州,中国) Huizhou University, Huizhou, China(惠州学院,惠州,中国)

AI总结 提出NaviEdit,一种无需训练、推理时控制的解耦方法,通过自一致性约束将编辑进度与模型尺度解耦,在保持结构保真度的同时提升语义可编辑性。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管扩散模型和流模型具有生成能力,真实图像编辑仍然受到语义可编辑性与结构保真度之间持续权衡的限制。我们将此限制的一个主要原因追溯到现有范式中编辑进度与模型尺度的隐式耦合。在这种耦合下,更强的编辑通常需要访问更嘈杂的状态,这在语义变化被良好定位之前,将计算用于破坏布局。我们引入NaviEdit,一种无需训练的推理时控制器,通过严格的自一致性契约将编辑进度与模型尺度遍历解耦。NaviEdit在rollout级别运行,不改变底层预训练模型。它将尺度视为控制输入,并将固定的步数预算重新分配给语义响应的中间尺度,而不是破坏性的高噪声区域。实验表明,在兼容的编辑器和流骨干网络上,平均增益为正,支持解耦作为一种可移植的推理时控制原则。

英文摘要

Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.

2605.20278 2026-05-26 cs.LG cs.AI cs.CV 版本更新

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

ClaimDiff-RL: 通过视觉声明比较进行细粒度描述强化学习

Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng

发表机构 * The Chinese University of Hong Kong(香港中文大学) MiniMax

AI总结 提出ClaimDiff-RL框架,利用原子声明差异作为奖励单元,通过多模态判断器枚举视觉差异并分配错误类型和严重程度,以解决长描述强化学习中事实性与覆盖度的权衡问题。

详情
AI中文摘要

长格式图像描述揭示了强化学习中的奖励粒度问题:描述被整体判断,而重要错误发生在单个视觉声明层面。一个好的密集描述应既忠实又信息丰富,避免幻觉而不遗漏显著细节。然而,成对偏好、基于参考的指标和整体标量奖励将这些局部错误压缩为单个序列级信号,模糊了事实性与覆盖度之间的权衡。我们引入ClaimDiff-RL框架,该框架使用基于参考的原子声明差异作为描述强化学习的奖励单元。给定一张图像、一个演员描述和一个参考描述,多模态判断器枚举视觉上可区分的差异,针对图像验证每个差异,分配开放词汇的错误类型和严重程度,并生成每个差异的统计信息用于奖励组合。这使得幻觉声明和遗漏的显著事实可以分别测量和调整。实验表明,整体标量奖励可以通过增加遗漏事实来减少幻觉,而ClaimDiff-RL揭示了这种忠实性与覆盖度的权衡,并实现了更平衡的操作点。在包含160张图像的人工标注诊断基准、公开描述基准和VQA基准上,ClaimDiff-RL改善了幻觉-遗漏事实平衡,保留了通用能力,甚至在多个细粒度能力维度(如物体计数、空间关系和场景识别)上超越了Gemini-3-Pro-Preview。这些结果表明,类型化、可验证的声明差异是细粒度且可诊断的描述强化学习的有效奖励单元。

英文摘要

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.

2605.19491 2026-05-26 cs.CV 版本更新

Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning

尺度思考:通过自适应连续推理加速千兆像素病理图像分析

Jiusong Ge, Yingkang Zhan, Wenjie Zhao, Di Zhang, Ke Wang, Jiashuai Liu, Chunze Yang, Chengzu Li, Jian Zhang, Yuxin Dong, Ni Zhang, Qidong Liu, Mireia Crispin-Ortuzar, Huazhu Fu, Chen Li, Zeyu Gao

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China(计算机科学与技术学院,西安交通大学,西安,中国) Department of Transmedia Art, Xi’an Academy of Fine Arts, Xi’an, China(跨媒体艺术系,西安美术学院,西安,中国) Department of Oncology, University of Cambridge, Cambridge, U.K.(肿瘤学系,剑桥大学,剑桥,英国) Language Technology Lab, University of Cambridge, Cambridge, U.K.(语言技术实验室,剑桥大学,剑桥,英国) Institute of High Performance Computing, Agency for Science, Technology(高性能计算研究所,科技研究局)

AI总结 提出PathCTM模型,通过动态尺度切换和注意力引导的区域剪枝实现高效连续推理,大幅减少计算开销并保持诊断性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

传统的全切片图像(WSI)分析方法通常依赖于多实例学习(MIL)范式,该范式在高倍率下提取补丁级特征并进行聚合以进行切片级预测。然而,这种详尽的补丁级处理计算成本高,严重限制了WSI分析的效率和可扩展性。为应对这一挑战,我们提出了PathCTM(面向病理学的连续思维模型),该模型能够对千兆像素WSI进行令牌高效的尺度空间连续推理。PathCTM将诊断推理表述为动态的序列信息追踪。它逐步从低倍率全局检查过渡到高倍率局部检查,并在收集到足够证据以有效限制决策不确定性时自适应终止推理。具体而言,它使用条件计算进行动态尺度切换,并采用注意力引导的区域剪枝,结合置信度感知的早期停止。大量实验表明,与基于标准MIL的方法相比,PathCTM将所需图像补丁数量减少了95.95%,推理时间缩短了约95.62%,同时AUC没有下降。代码可在https://github.com/JSGe-AI/PathCTM获取。

英文摘要

Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patch-level features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a Pathology-oriented Continuous Thought Model) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that, compared with standard MIL-based methods, PathCTM reduces the number of required image patches by 95.95% and shortens inference time by approximately 95.62%, while maintaining AUC without degradation. Code is available at https://github.com/JSGe-AI/PathCTM.

2605.17531 2026-05-26 cs.CV 版本更新

Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

不要猜测,只需询问:通过多轮澄清解决指代分割中的歧义

Yuting Yang, Haichao Jiang, Tianming Liang, Quan Zhang, Jian-Fang Hu

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) Guangdong Province Key Laboratory of Information Security Technology(广东省信息安全技术重点实验室) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education(教育部机器智能与先进计算重点实验室)

AI总结 提出IC-Seg框架,通过多轮对话主动澄清用户意图,并引入Hi-GRPO分层优化策略,有效解决指代分割中用户查询歧义问题。

详情
AI中文摘要

指代分割旨在根据文本查询分割图像或视频中的目标对象。尽管过去几年取得了显著进展,现有工作总是假设用户提供的查询已经精确且清晰。然而,这种假设不切实际。在现实场景中,期望所有用户仔细审查其视觉内容并确保查询唯一且无歧义是不现实的。遇到此类情况时,现有分割模型倾向于任意猜测用户偏好,常常导致不理想的结果。为解决这一限制,我们提出IC-Seg,一种新颖的智能体框架,在分割前通过多轮对话主动澄清用户意图。为有效激励这种能力,我们进一步引入Hi-GRPO,一种新的分层优化策略,在轨迹、轮次和步骤层面注入密集且信息丰富的监督信号。该策略鼓励高效的意图澄清,有效消除冗余交互并提高整体对话质量。为评估,我们建立了Ambi-RVOS,一个带有模糊用户查询的指代视频对象分割基准。大量实验表明,IC-Seg不仅在解决模糊查询方面大幅优于现有方法,而且在标准推理分割基准上保持最先进性能。代码和数据将在https://github.com/iSEE-Laboratory/IC-Seg发布。

英文摘要

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.

2605.17268 2026-05-26 cs.AI cs.CV cs.RO 版本更新

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

VLA 推理是否忠实?自动驾驶模型中因果链的安全性探究

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) Central South University(中南大学) School of Computer Science(计算机科学学院) University of Wollongong in Dubai(卧龙岗大学迪拜分校)

AI总结 通过分析300次VLA推理,发现输出推理与轨迹的忠实度仅42.5%,存在大量漏检行人、轨迹脆弱及推理-动作不一致问题,并提出了信息论忠实度形式化定义与安全架构。

Comments Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track

详情
AI中文摘要

我们首次系统研究了视觉-语言-动作(VLA)驾驶模型的忠实度,分析了100个多样化PhysicalAI-AV场景中300次Alpamayo-R1-10B推理。主要发现是,输出带有轨迹的自然语言推理可能显著不忠实:(i) 整体推理保真度仅为42.5%,因果链与场景现实匹配不到一半;(ii) 在三分之一涉及行人的场景中漏检了94个行人;(iii) 在轻微视觉扰动下轨迹脆弱性达97.7%;(iv) 平均推理-动作一致性仅为48.3%,53.3%的推理表现出一致性低,其中37.9%声称停止但模型继续前行。我们从信息论角度形式化定义了忠实度,定义了实体和动作保真度及验证标准,并概述了与这些结果一致的四组件安全架构。

英文摘要

We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that output natural-language rationales with trajectories may be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 missed pedestrians in one-third of pedestrian-relevant scenes; (iii) 97.7% trajectory fragility under mild visual perturbations; and (iv) only 48.3% mean reasoning-action consistency, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.

2605.16409 2026-05-26 cs.CV cs.CL cs.LG 版本更新

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

多语言OCR感知微调和提示引导的链式思维推理用于多模态大语言模型

Qinwu Xu, Yifan Jiang, Haoyu Ren

发表机构 * Meta AI UT Austin(德克萨斯大学奥斯汀分校)

AI总结 提出一种多语言OCR感知的多模态训练框架,通过合成数据生成、OCR感知微调和结构化视觉链式思维提示,提升多模态大语言模型在复杂视觉条件下的OCR完整性和多语言翻译准确性。

详情
AI中文摘要

光学字符识别(OCR)和多语言文本理解仍然是多模态大语言模型(MLLMs)的主要失败模式,尤其是在包含杂乱布局、小字体、模糊、遮挡和复杂排版的真实世界图像中。我们提出了一种OCR感知的多语言多模态训练框架,该框架结合了(i)大规模合成OCR到翻译数据生成,(ii)使用LoRA适配的OCR感知监督微调(SFT),以及(iii)在不确定视觉条件下进行推理的结构化视觉链式思维(CoT)提示。使用基于LLaMA的多模态架构,所提出的框架在OCR完整性、多语言翻译准确性和退化视觉条件下的鲁棒性方面有了显著提升。在多语言收据、菜单、海报、标志、手写文本和文档图像上的实验结果表明,与基线模型相比,视觉-文本对齐显著改善。特别是,所提出的OCR感知后训练框架提高了对小、模糊、空间分散和部分遮挡文本的提取,同时减少了对不确定OCR条件下语言先验的依赖。与前沿多模态系统(包括GPT-5类和Gemini系列模型)的定性比较进一步表明,在噪声和视觉模糊的OCR场景下,OCR对齐得到改善,幻觉减少。总体而言,结果表明,以数据为中心的OCR感知多模态后训练为改进多语言OCR和基于OCR的视觉问答系统提供了一种有效且可扩展的方向。

英文摘要

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.

2605.14552 2026-05-26 cs.CV 版本更新

LiWi: Layering in the Wild

LiWi: 野外分层

Yu He, Fang Li, Haoyang Tong, Lichen Ma, Xinyuan Shan, Jingling Fu, Dong Chen, Luohang Liu, Junshi Huang, Yan Li

发表机构 * MAIS & NLPR, CASIA(中国科学院自动化研究所多模态人工智能系统全国重点实验室及模式识别国家重点实验室)

AI总结 提出基于代理驱动数据分解和联合优化光度保真度与alpha边界的方法,实现野外自然图像的高保真分层分解,构建了LiWi-100k数据集并达到SOTA性能。

Comments Project Page https://rassetmusty.github.io/LiWi

详情
AI中文摘要

生成模型的最新进展使得令人印象深刻的分层图像生成成为可能,但其成功主要局限于图形设计领域。野外图像的分层仍然是一个未充分探索的问题,限制了细粒度编辑和图像在真实场景中的应用。具体而言,可扩展的分层数据和自然图像中对象交互(如光照效果和结构边界)的建模仍面临挑战。为解决这些瓶颈,我们提出了一种用于高保真自然图像分解的新框架。首先,我们引入了一种代理驱动数据分解(ADD)流水线,该流水线协调代理和工具以合成分层数据,无需人工干预。利用该流水线,我们构建了一个大规模数据集LiWi-100k,包含超过10万张高质量的分层野外图像。其次,我们提出了一个新框架,联合改进光度保真度和alpha边界精度。具体而言,阴影引导学习显式建模光照效果,退化-恢复目标通过从退化图像恢复干净前景图像提供边界校正监督。大量实验表明,我们的框架在自然图像分解中达到了最先进的性能,在RGB L1和Alpha IoU指标上优于现有模型。我们将很快发布代码和数据集。

英文摘要

Recent advances in generative models have empowered impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in scalable layered data and the modeling of object interaction in natural images, such as illumination effects and structural boundaries. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and a degradation-restoration objective provides boundary-correction supervision by recovering a clean foreground image from a degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models in RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.

2605.08063 2026-05-26 cs.CV cs.AI 版本更新

Flow-OPD: On-Policy Distillation for Flow Matching Models

Flow-OPD:面向流匹配模型的在线策略蒸馏

Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) University of California, Los Angeles(加州大学洛杉矶分校) The Chinese University of Hong Kong(香港中文大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出Flow-OPD框架,通过两阶段对齐策略(单奖励GRPO微调专家+流式冷启动与在线策略蒸馏)解决流匹配模型在多任务对齐中的奖励稀疏和梯度干扰问题,并引入流形锚点正则化抑制美学退化,在GenEval和OCR指标上显著提升。

Comments Project Page: https://costaliya.github.io/Flow-OPD/ , Code: https://github.com/CostaliyA/Flow-OPD

详情
AI中文摘要

现有的流匹配(FM)文本到图像模型在多任务对齐下存在两个关键瓶颈:标量奖励导致的奖励稀疏性,以及联合优化异构目标引起的梯度干扰,这共同导致了竞争指标的“跷跷板效应”和普遍的奖励破解。受大型语言模型社区中在线策略蒸馏(OPD)成功的启发,我们提出了Flow-OPD,这是第一个将在线策略蒸馏集成到流匹配模型中的统一后训练框架。Flow-OPD采用两阶段对齐策略:首先通过单奖励GRPO微调培养领域专精的教师模型,使每个专家在隔离环境中达到其性能上限;然后通过基于流的冷启动方案建立稳健的初始策略,并通过在线策略采样、任务路由标记和密集轨迹级监督的三步编排,将异构专业知识无缝整合到单个学生模型中。我们进一步引入了流形锚点正则化(MAR),它利用任务无关的教师提供全数据监督,将生成锚定到高质量流形,有效缓解了纯强化学习对齐中常见的美学退化。基于Stable Diffusion 3.5 Medium,Flow-OPD将GenEval分数从63提升至92,OCR准确率从59提升至94,相比原始GRPO总体提升约10个百分点,同时保持了图像保真度和人类偏好对齐,并展现出“超越教师”的涌现效应。这些结果确立了Flow-OPD作为构建通用文本到图像模型的可扩展对齐范式。代码和权重将在 https://github.com/CostaliyA/Flow-OPD 发布。

英文摘要

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. Code and weights will be released at https://github.com/CostaliyA/Flow-OPD.

2605.08025 2026-05-26 cs.CV 版本更新

TRAS: An Interactive Software for Tracing Tree Ring Cross Sections

TRAS:一种用于追踪树木年轮横截面的交互式软件

Henry Marichal, Diego Passarella, Gregory Randall

发表机构 * Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República(电气工程研究所,工程学院,乌拉圭共和国大学) Procesos Industriales de la Madera, CENUR Noreste, Universidad de la República(木材工业过程,CENUR东北,乌拉圭共和国大学)

AI总结 提出TRAS开源图形软件,集成三种检测算法(CS-TRD、DeepCS-TRD、INBD),实现树木年轮自动勾画、手动校正和测量,在松木横截面图像上DeepCS-TRD达到81.0% F值,显著减少手动校正工作量。

Comments This manuscript has been accepted for publication in Forestry: An International Journal of Forest Research, published by Oxford University Press. This is an author-produced version and may differ from the final Version of Record. The final published version will be available through the journal website

详情
AI中文摘要

树木年轮标记仍然是树木测量学和树木年代学中的关键步骤,但通常手动进行,使得过程耗时、主观且难以扩展到大型图像数据集。我们提出了树木年轮分析套件(TRAS),一个用于木材横截面图像中树木年轮自动勾画、手动校正和测量的开源图形软件。TRAS集成了三种互补的检测算法:经典图像处理方法CS-TRD和两种深度学习方法DeepCS-TRD与INBD。界面允许用户细化自动检测、去除假阳性并手动添加缺失的年轮。它还计算树木年代学指标,如早材和晚材面积、年轮周长、等效年轮宽度以及基于自定义路径的年轮宽度测量。TRAS在18张专家标注的Pinus taeda L.横截面图像上进行了评估。DeepCS-TRD取得了最佳自动检测性能,F值为81.0%,精确率为86.4%。自动检测将所需的手动校正工作减少到大约20%的年轮边界。对于一维年轮宽度测量,TRAS与CooRecorder显示出极好的一致性(r > 0.99)。常见的检测错误,如跳跃传播或靠近节疤的假阳性,可以通过后处理界面轻松校正。TRAS在Windows、macOS和Linux上为树木年轮分析提供了灵活且可重复的解决方案。代码可在https://hmarichal93.github.io/tras获取。

英文摘要

Tree ring marking remains a key step in dendrometry and dendrochronology, but it is often performed manually, making the process time-consuming, subjective, and difficult to scale to large image datasets. We present the Tree Ring Analyzer Suite (TRAS), an open-source graphical software for automatic delineation, manual correction, and measurement of tree rings in wood cross-sectional images. TRAS integrates three complementary detection algorithms: the classical image-processing method CS-TRD and two deep-learning approaches, DeepCS-TRD and INBD. The interface allows users to refine automatic detections, remove false positives, and manually add missing rings. It also computes dendrochronological metrics such as earlywood and latewood areas, ring perimeter, equivalent ring width, and custom path-based ring-width measurements. TRAS was evaluated on 18 expertly annotated Pinus taeda L. cross-section images. DeepCS-TRD achieved the best automatic detection performance, with an F-score of 81.0% and precision of 86.4%. Automatic detection reduced the required manual correction effort to approximately 20% of ring boundaries. For one-dimensional ring-width measurements, TRAS showed excellent agreement with CooRecorder (r > 0.99). Common detection errors, such as jump propagation or false positives near knots, were easily corrected through the postprocessing interface. TRAS provides a flexible and reproducible solution for tree-ring analysis on Windows, macOS, and Linux. Code is available at https://hmarichal93.github.io/tras.
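The abstract reports an F-score and a precision for DeepCS-TRD but no recall. If the F-score is the balanced F1 (the harmonic mean of precision and recall, which is an assumption here since the abstract does not say), the recall can be backed out by rearranging F = 2PR / (P + R) into R = FP / (2P - F). The short sketch below does that arithmetic; the resulting recall is an implied value, not a figure reported by the authors.

```python
def f1_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_recall(f1, precision):
    """Solve F = 2PR / (P + R) for R, giving R = F*P / (2P - F)."""
    return f1 * precision / (2 * precision - f1)


# Values reported in the abstract for DeepCS-TRD.
precision = 0.864
f1 = 0.810

recall = implied_recall(f1, precision)
print(f"implied recall: {recall:.3f}")
```

Under the F1 assumption the implied recall is about 0.76, meaning the detector misses roughly a quarter of the annotated rings, which fits the reported need to correct about 20% of ring boundaries by hand.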

2605.02900 2026-05-26 cs.CR cs.AI cs.CV cs.RO 版本更新

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

具身人工智能的安全性:风险、攻击与防御综述

Xiao Li, Xiang Zheng, Yifeng Gao, Xinyu Xia, Yixu Wang, Xin Wang, Ye Sun, Yunhan Zhao, Ming Wen, Jiayu Li, Zixing Chen, Xun Gong, Yi Liu, Yige Li, Yutao Wu, Cong Wang, Jun Sun, Yixin Cao, Zhineng Chen, Jingjing Chen, Tao Gui, Qi Zhang, Zuxuan Wu, Xipeng Qiu, Xuanjing Huang, Tiehua Zhang, Zhipeng Wei, Kun Wang, Xinfeng Li, Hanxun Huang, Sarah Erfani, James Bailey, Jianping Wang, Chaowei Xiao, Ran He, Bo Li, Xingjun Ma, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) City University of Hong Kong(香港城市大学) Jilin University(吉林大学) Singapore Management University(新加坡管理大学) Deakin University(迪肯大学) Tongji University(同济大学) Nanyang Technological University(南洋理工大学) Chinese Academy of Sciences(中国科学院) The University of Melbourne(墨尔本大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文综述了具身AI在感知、认知、规划、行动及交互全流程中的安全风险、攻击与防御方法,提出了多层次分类体系,并指出了多模态感知融合脆弱性、规划不稳定及人机交互可信度等关键挑战。

Comments Survey paper; 75 pages, 4 figures, 18 tables; v2 expands embodied-specific coverage of agentic threats, World Action Model threats, and contextual risk mitigation, with over 100 new references added. Project page: https://x-zheng16.github.io/Awesome-Embodied-AI-Safety/

详情
AI中文摘要

具身人工智能将感知、认知、规划与交互集成到在开放、安全关键环境中运行的智能体中。随着这些系统获得自主性并进入交通、医疗、工业或辅助机器人等领域,确保其安全性在技术上具有挑战性,在社会上也变得不可或缺。与数字AI系统不同,具身智能体必须在不确定的感知、不完整的知识和动态的人机交互下行动,故障可能直接导致物理伤害。本综述对具身AI中的安全研究进行了全面且结构化的回顾,考察了从感知、认知到规划、行动与交互以及智能体系统的完整具身流程中的攻击与防御。我们引入了一个多层次分类体系,统一了分散的研究工作,并将具身特定的安全发现与视觉、语言和多模态基础模型的更广泛进展联系起来。我们的综述综合了来自500多篇论文的见解,涵盖对抗性攻击、后门攻击、越狱攻击和硬件级攻击;攻击检测、安全训练和鲁棒推理;以及风险感知的人机交互。这一分析揭示了几个被忽视的挑战,包括多模态感知融合的脆弱性、越狱攻击下规划的不稳定性,以及开放场景中人机交互的可信度。通过将领域组织成连贯的框架并识别关键研究空白,本综述为构建不仅具备能力和自主性,而且在现实部署中安全、鲁棒和可靠的具身智能体提供了路线图。

英文摘要

Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open-world, safety-critical environments. As these systems gain autonomy and enter domains such as transportation, healthcare, and industrial or assistive robotics, ensuring their safety becomes both technically challenging and socially indispensable. Unlike digital AI systems, embodied agents must act under uncertain sensing, incomplete knowledge, and dynamic human-robot interactions, where failures can directly lead to physical harm. This survey provides a comprehensive and structured review of safety research in embodied AI, examining attacks and defenses across the full embodied pipeline, from perception and cognition to planning, action and interaction, and agentic systems. We introduce a multi-level taxonomy that unifies fragmented lines of work and connects embodied-specific safety findings with broader advances in vision, language, and multimodal foundation models. Our review synthesizes insights from over 500 papers spanning adversarial, backdoor, jailbreak, and hardware-level attacks; attack detection, safe training, and robust inference; and risk-aware human-agent interaction. This analysis reveals several overlooked challenges, including the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human-agent interaction in open-ended scenarios. By organizing the field into a coherent framework and identifying critical research gaps, this survey provides a roadmap for building embodied agents that are not only capable and autonomous but also safe, robust, and reliable in real-world deployment.

2605.01512 2026-05-26 cs.CV 版本更新

Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video

监控视频中罕见交通事件的两阶段零样本时空定位

Jiantang Huang

AI总结 提出一种无需微调的管道,通过粗到细的两遍分解和专家角色分配,利用冻结视觉语言模型实现罕见交通事件在时间、空间和碰撞类型上的联合定位。

Comments Accepted at CVPR 2026 AUTOPILOT Workshop (Non-Archival Track). 7 pages (4 main + references + appendix), 3 figures, 5 tables

详情
AI中文摘要

在真实闭路电视画面中定位交通事故是一个罕见事件问题,通常禁止使用标注事故视频进行训练,但需要精确的时空和碰撞类型联合定位。我们提出一种无需微调的管道,通过两个想法从冻结的视觉语言模型中引出这种联合输出。首先,粗到细的两遍分解:第一遍以1 fps处理全视频,产生粗粒度(t, x, y, c)元组;然后第二遍在±3秒窗口内以5 fps细化时间和位置,并设置两个确定性置信门,在边界犹豫或边缘夹紧坐标时回退到粗估计。其次,专家角色分配:Qwen3-VL-Plus负责定位,Gemini 3.1 Flash-Lite负责在居中视频片段上分类。在ACCIDENT@CVPR 2026基准测试(2,027个真实闭路电视视频)上,我们达到ACC^S = 0.539(95%置信区间[0.525, 0.553]):比基准论文的最佳基线预言机(0.412)高0.127,比最强单VLM基线(Molmo-7B, 0.396)高0.143,比朴素基线(0.289)高0.250。VLM路径每个视频最多调用三次API(17%在API失败时回退到物理方法);完整运行成本约20美元。

英文摘要

Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a ±3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]): +0.127 over the benchmark paper's best-of-baselines oracle (0.412), +0.143 over the strongest single-VLM baseline (Molmo-7B, 0.396), and +0.250 over the naive baseline (0.289). The VLM path uses up to three API calls per video (17% fall back to physics on API failures); the full run costs ~$20.
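The two confidence gates are described only briefly, so the sketch below is a guess at their shape rather than the author's code. It assumes that a "boundary hedge" means the refined timestamp lands on the edge of the ±3 s window, that "edge-clamped" means a refined coordinate sits at the border of a normalized [0, 1] frame, and that each gate reverts only the quantity it guards. The function name, the dictionary layout, and the tolerance `eps` are all hypothetical.

```python
def refine_with_gates(coarse, fine, window=3.0, eps=1e-3):
    """Merge a coarse and a refined estimate using two deterministic gates.

    coarse, fine: dicts with keys "t" (seconds) and "x", "y" (normalized to [0, 1]).
    Gate 1 keeps the coarse time if the refined time sits on the window boundary.
    Gate 2 keeps the coarse location if a refined coordinate is clamped to an edge.
    """
    t_lo, t_hi = coarse["t"] - window, coarse["t"] + window
    on_boundary = fine["t"] <= t_lo + eps or fine["t"] >= t_hi - eps
    edge_clamped = any(v <= eps or v >= 1.0 - eps for v in (fine["x"], fine["y"]))

    result = dict(coarse)  # start from the coarse estimate
    if not on_boundary:
        result["t"] = fine["t"]
    if not edge_clamped:
        result["x"], result["y"] = fine["x"], fine["y"]
    return result


coarse = {"t": 10.0, "x": 0.40, "y": 0.50}

# Refined time hits the window edge: time reverts, location is accepted.
hedged = refine_with_gates(coarse, {"t": 13.0, "x": 0.45, "y": 0.55})

# Refined x is clamped to the frame edge: location reverts, time is accepted.
clamped = refine_with_gates(coarse, {"t": 11.2, "x": 0.0, "y": 0.50})
```

Because both gates are plain threshold checks, the fallback behaviour is reproducible and adds no further model calls.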

2605.01284 2026-05-26 cs.CV cs.AI cs.CL cs.IR 版本更新

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

证据链:面向迭代检索增强生成的像素级视觉归因

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

发表机构 * National Engineering Research Center for Software Engineering, Peking University(软件工程国家级工程研究中心,北京大学) City University of Hong Kong(香港城市大学) Peking University(北京大学) Tencent Technology(腾讯科技)

AI总结 提出Chain of Evidence (CoE)框架,利用视觉语言模型直接对检索到的文档截图进行推理,输出精确边界框以可视化完整推理链,解决迭代检索增强生成中的粗粒度归因和视觉语义丢失问题。

详情
AI中文摘要

迭代检索增强生成(iRAG)已成为通过逐步检索和推理外部文档来回答复杂多跳问题的强大范式。然而,当前系统主要基于解析文本运行,这造成了两个关键瓶颈:(1)粗粒度归因,用户需要根据模糊的文本级引用在冗长文档中手动定位证据;(2)视觉语义丢失,将视觉丰富的文档(如幻灯片、带有图表的PDF)转换为文本会丢弃对推理至关重要的空间逻辑和布局线索。为弥合这一差距,我们提出了证据链(CoE),这是一个与检索器无关的视觉归因框架,利用视觉语言模型直接对检索到的文档候选截图进行推理。CoE消除了特定格式的解析,输出精确的边界框,可视化检索候选集中的完整推理链。我们在两个不同的基准上评估CoE:Wiki-CoE,一个源自2WikiMultiHopQA的大规模结构化网页数据集;以及SlideVQA,一个具有挑战性的演示幻灯片数据集,包含复杂图表和自由形式布局。实验表明,微调后的Qwen3-VL-8B-Instruct取得了稳健的性能,在需要视觉布局理解的场景中显著优于基于文本的基线,同时为像素级可解释的iRAG建立了与检索器无关的解决方案。我们的代码可在https://github.com/PeiYangLiu/CoE.git获取。

英文摘要

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.

2603.16100 2026-05-26 cs.CV 版本更新

Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

重新评估CLIP中的模态内错位假设

Jonas Herzog, Yue Wang

发表机构 * Zhejiang University(浙江大学)

AI总结 本文质疑CLIP的模态内错位假设,通过理论分析和实验证明图像嵌入距离不存在所谓的自由度,且模态内任务性能差异主要源于任务歧义而非错位。

Comments Accepted for CVPR'26. Project Page: https://vision-kek.github.io/Is-CLIP-Really-Misaligned/

详情
AI中文摘要

最近的研究表明,CLIP类对比语言-图像训练产生的嵌入对于纯图像任务并非最优。主要理论是跨模态(语言-图像)对齐损失忽略了模态内(图像-图像)对齐,导致图像间距离校准不良。在本研究中,我们质疑这一模态内错位假设。我们重新审视其基础理论论证、支持该假设的指标以及受影响的性能指标。对于理论论证,我们证明图像嵌入距离不存在所谓的自由度。对于经验度量,我们的发现表明,它们在语言-图像训练模型(CLIP、SigLIP)和图像-图像训练模型(DINO、SigLIP2)上产生相似结果。这表明观察到的现象并非源于前者特有的错位。对常见模态内任务(检索和少样本分类)的实验证实,解决任务歧义(而非所谓的错位)才是获得最佳结果的关键。

英文摘要

Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key for best results.

2603.00777 2026-05-26 cs.CV 版本更新

DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents

DUCX:分解使用工具的胸部X光代理中的不公平性

Zikang Xu, Ruinan Jin, Xiaoxiao Li

发表机构 * Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Anhui, China(人工智能研究所,合肥国家科学中心,安徽,中国) The University of British Columbia, Vancouver, BC V6Z 1Z4, Canada(不列颠哥伦比亚大学,温哥华,BC V6Z 1Z4,加拿大) Vector Institute, Toronto, ON M5G 1M1, Canada(向量研究所,多伦多,ON M5G 1M1,加拿大)

AI总结 提出DUCK框架,通过阶段式公平性分解方法,系统审计使用工具的胸部X光代理中的工具暴露偏差、工具转换偏差和模型推理偏差,揭示端到端评估无法预测的群体差异。

Comments Early accepted by MICCAI 2026

详情
AI中文摘要

随着使用工具的临床AI系统协调专门的视觉和语言模块执行胸部X光问答等任务,医疗代理中的公平性变得至关重要。虽然这些医疗AI代理可以提高灵活性,但其增加的流水线复杂性也为人口统计偏差创造了新的途径,超出了独立模型。我们提出了DUCK,即分解胸部X光代理中的不公平性,这是一个对使用MedRAX实例化的工具型胸部X光代理的公平性进行系统审计的方法。为了定位差异产生的位置,我们引入了一种阶段式公平性分解,将端到端偏差与三个代理特定来源分开:工具暴露偏差,即基于工具存在的效用差距;工具转换偏差,即工具路由模式中的子组差异;以及模型推理偏差,即合成行为中的子组差异。在五个驱动骨干网络上对使用工具的代理框架进行的大量实验表明,端到端性能中存在人口统计差距,均等几率高达20.79%,最低公平-效用权衡降至28.65%。中间行为,包括工具使用、转换模式和推理轨迹,表现出明显的子组差异,这些差异无法仅从端到端评估中预测。例如,在分割工具可用的情况下,子组效用差距高达50%。我们的研究结果强调了过程级公平性审计和去偏的必要性,以确保临床代理系统的公平部署。代码:https://github.com/Nanboy-Ronan/DUCK。

英文摘要

Fairness in medical agents is becoming critical as tool-using clinical AI systems orchestrate specialized vision and language modules for tasks such as chest X-ray question answering. While these medical AI agents can improve flexibility, their added pipeline complexity also creates new pathways for demographic bias beyond standalone models. We present DUCK, Decomposing Unfairness in Chest X-ray agents, a systematic audit of fairness in tool-using chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias, or utility gaps conditioned on tool presence; tool transition bias, or subgroup differences in tool-routing patterns; and model reasoning bias, or subgroup differences in synthesis behaviors. Extensive experiments on tool-using agentic frameworks across five driver backbones reveal that demographic gaps persist in end-to-end performance, with equalized odds up to 20.79% and the lowest fairness-utility tradeoff down to 28.65%. Intermediate behaviors, including tool usage, transition patterns, and reasoning traces, exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone. For example, conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%. Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code: https://github.com/Nanboy-Ronan/DUCK.
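
The equalized-odds figure quoted above can be illustrated with a minimal sketch. It assumes binary labels and the common definition of the equalized-odds gap as the larger of the true-positive-rate gap and the false-positive-rate gap across subgroups; the abstract does not state the exact formula the authors use, and the function names and toy data below are hypothetical.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def equalized_odds_gap(y_true, y_pred, groups):
    """Largest between-group gap in TPR or FPR (assumed definition)."""
    per_group = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = rates([y_true[i] for i in idx],
                             [y_pred[i] for i in idx])
    tprs = [r[0] for r in per_group.values()]
    fprs = [r[1] for r in per_group.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Toy data (hypothetical): group A is predicted perfectly, group B is not.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = equalized_odds_gap(y_true, y_pred, groups)
print(gap)
```

A stage-wise audit in the spirit of the abstract would compute such a gap separately on subsets of cases, for example only those where a given tool was available, rather than once on the end-to-end output.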

2603.00191 2026-05-26 cs.LG cs.CV 版本更新

Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning

基于LoRA的持续学习中任务驱动的子空间分解用于知识共享与隔离

Lingfeng He, De Cheng, Huaijie Wang, Xi Yang, Nannan Wang, Xinbo Gao

发表机构 * State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi'an, China(信息服务网络国家重点实验室,电信工程学院,西安电子科技大学,西安,中国) School of Electronic Engineering, Xidian University, Xi'an, China(电子工程学院,西安电子科技大学,西安,中国)

AI总结 提出LoDA方法,通过任务驱动分解构建通用和任务特定LoRA子空间,结合梯度对齐优化和闭式重校准,实现知识共享与隔离,提升持续学习性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

持续学习要求模型在不遗忘旧知识的情况下顺序适应新任务。最近,低秩适应(LoRA)作为一种代表性的参数高效微调方法,在持续学习中受到越来越多的关注。几种基于LoRA的持续学习方法通过分离更新空间来减少任务间的干扰,通常从过去任务的估计零空间中构建新空间。然而,它们(i)忽略了任务共享方向,抑制了知识迁移;(ii)未能捕获真正有效的任务特定方向,因为旧任务的这些“零基”在相关任务下对新任务几乎保持不活跃。为了解决这个问题,我们从投影能量的角度研究LoRA的学习能力,并提出了低秩分解与适应(LoDA)。它通过解决两个基于能量的目标,执行任务驱动分解以构建通用和真正的任务特定LoRA子空间,解耦知识共享和隔离的方向。LoDA固定两个子空间上的LoRA下投影,并通过梯度对齐优化方法学习鲁棒的上投影。在每个任务之后,在将LoRA更新集成到主干之前,LoDA为通用更新推导出一个闭式重校准,沿着这个任务共享方向近似特征级联合最优。实验表明,LoDA优于现有的持续学习方法。我们的代码可在https://github.com/HHHLF/LoDA_ICML2026获取。

英文摘要

Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions, since these "null bases" of old tasks can remain nearly inactive for a new task when tasks are correlated. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods. Our code is available at https://github.com/HHHLF/LoDA_ICML2026.

2602.23916 2026-05-26 cs.CV cs.AI 版本更新

Topology-Driven Transferability Estimation of Medical Foundation Models for Segmentation

基于拓扑驱动的医学基础模型分割迁移性估计

Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang, Jiaying Zhou, Yang Liu, Qingchao Chen

发表机构 * Peking University(北京大学) Hohai University(河海大学) Beijing Normal University-Hong Kong Baptist University United International College(北京师范大学-香港浸会大学联合国际学院) National Institute of Health Data Science, Peking University(健康数据科学国家研究院,北京大学) Institute of Medical Technology, Peking University(北京大学医学技术研究院) State Key Laboratory of General Artificial Intelligence, Peking University(通用人工智能国家重点实验室,北京大学)

AI总结 提出拓扑驱动迁移性估计框架,通过全局表示拓扑散度、局部边界感知拓扑一致性和任务自适应融合,无需微调即可高效选择医学基础模型,在OpenMind基准上加权Kendall指标相对提升约31%。

详情
AI中文摘要

大规模自监督学习(SSL)的出现产生了大量的医学基础模型。然而,为特定分割任务选择最优的医学基础模型仍然是一个计算瓶颈。现有的迁移性估计(TE)指标主要针对分类任务设计,依赖于全局统计假设,无法捕捉密集预测所需的拓扑复杂性。我们提出了一种新颖的拓扑驱动迁移性估计框架,评估流形可处理性而非统计重叠。我们的方法引入了三个组成部分:(1)全局表示拓扑散度(GRTD),利用最小生成树量化特征-标签结构同构性;(2)局部边界感知拓扑一致性(LBTC),专门在关键解剖边界评估流形可分离性;(3)任务自适应融合,根据目标任务的语义基数动态整合全局和局部指标。在跨不同解剖目标和SSL基础模型的大规模OpenMind基准上验证,我们的方法在加权Kendall指标上显著优于最先进的基线,相对提升约31%,提供了一种鲁棒的、无需训练的代理,用于高效模型选择而无需微调成本。代码将在接收后公开。

英文摘要

The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in the weighted Kendall metric, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.

2602.23872 2026-05-26 cs.CV cs.RO 版本更新

Altitude-Adaptive Vision-Only Geo-Localization for UAVs in GPS-Denied Environments

GPS拒止环境下无人机的高度自适应纯视觉地理定位

Xingyu Shao, Mengfan He, Chunyu Li, Liangzheng Sun, Ziyang Meng

发表机构 * Department of Precision Instrument, Tsinghua University(清华大学精密仪器系) School of Aerospace Engineering, Beijing Institute of Technology(北京理工大学航天工程学院) School of Instrumentation Science and Opto-electronics Engineering, Beijing Information Science and Technology University(北京信息科技大学仪器科学与光电工程学院)

AI总结 针对无人机视觉位置识别中高度变化导致的尺度不匹配问题,提出一种基于单目视觉的高度自适应地理定位框架,通过频域变换估计相对高度并用于图像尺度归一化,结合分类-检索视觉位置识别模块实现粗定位,引入质量自适应边缘分类器提升检索鲁棒性。

详情
AI中文摘要

为了解决无人机视觉位置识别中由高度大幅变化引起的尺度不匹配问题,我们提出了一种仅依赖单目视觉的高度自适应地理定位框架。该方法首先通过将输入图像转换到频域,并将高度估计建模为回归作为分类问题,从单张下视图像中估计相对高度。然后利用估计的高度将查询图像裁剪到规范尺度,之后通过分类-检索视觉位置识别模块进行粗定位。为了在图像质量变化的情况下提高检索鲁棒性,我们进一步引入了质量自适应边缘分类器,并通过加权坐标估计对最终位置进行精化,该估计基于前k个检索候选。在两个合成数据集和两个真实飞行数据集上的实验表明,相对高度估计模块在显著高度变化下,下游检索性能有显著提升。与使用相同检索流程但未进行高度归一化相比,我们的视觉位置识别模块通过高度自适应使平均R@1和R@5分别提高了41.50和56.83个百分点,完整系统在报告的工作站硬件上以13.3帧/秒运行。这些结果表明,相对高度估计为跨高度无人机地理定位提供了有效的尺度先验,并在无需辅助距离传感器或时间输入的情况下支持GPS拒止环境下的粗初始化。

英文摘要

To address the scale mismatch caused by large altitude variations in UAV visual place recognition, we propose a monocular vision-only altitude-adaptive geo-localization framework. The method first estimates relative altitude from a single downward-looking image by transforming the input into the frequency domain and formulating altitude estimation as a regression-as-classification (RAC) problem. The estimated altitude is then used to crop the query image to a canonical scale, after which a classification-then-retrieval visual place recognition module performs coarse localization. To improve retrieval robustness under varying image quality, we further introduce a quality-adaptive margin classifier (QAMC) and refine the final location by weighted coordinate estimation over the top retrieved candidates. Experiments on two synthetic datasets and two real-flight datasets show that the relative altitude estimation (RAE) module yields clear overall improvements in downstream retrieval performance under significant altitude changes. With our visual place recognition module, altitude adaptation improves average R@1 and R@5 by 41.50 and 56.83 percentage points, respectively, compared with using the same retrieval pipeline without altitude normalization, and the full system runs at 13.3 frames/s on the reported workstation hardware. These results indicate that relative altitude estimation provides an effective scale prior for cross-altitude UAV geo-localization and supports GPS-denied coarse initialization without auxiliary range sensors or temporal inputs.

2602.23217 2026-05-26 cs.CV cs.NA math.NA 版本更新

Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks

多维任务学习:计算机视觉任务的统一张量框架

Alaa El Ichi, Khalide Jbilou

发表机构 * Université du Littoral Côte d’Opale(滨海奥帕尔海岸大学)

AI总结 提出基于广义爱因斯坦MLP的多维任务学习框架,通过张量运算统一分类、分割和检测等视觉任务,并证明其表达空间大于传统矩阵方法。

Comments This manuscript is under review at Pattern Recognition Letters

详情
AI中文摘要

本文介绍了多维任务学习(MTL),这是一个基于广义爱因斯坦MLP(GE-MLPs)的统一数学框架,通过爱因斯坦积直接在张量上操作。我们认为当前的计算机视觉任务公式本质上受限于基于矩阵的思维:标准架构依赖于矩阵值权重和向量值偏置,需要结构展平,这限制了自然可表达任务的空间。GE-MLPs通过使用张量值参数消除了这一约束,使得能够显式控制哪些维度被保留或收缩,而不会丢失信息。通过严格的数学推导,我们证明了分类、分割和检测是MTL的特例,仅在正式定义的任务空间中的维度配置上有所不同。我们进一步证明,这个任务空间严格大于基于矩阵的公式所能原生表达的空间,从而能够实现原则性的任务配置,例如时空或跨模态预测,这些在传统方法下需要破坏性展平。这项工作为通过张量代数的视角理解、比较和设计计算机视觉任务提供了数学基础。

英文摘要

This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
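
To make the idea of a tensor-valued layer concrete, the following sketch applies a fourth-order weight tensor to a 2D input by contracting the two input modes, so the output stays 2D and no flattening is needed. It is an illustration of an Einstein-product style contraction under assumed shapes, not the GE-MLP definition from the paper; `einstein_layer` and all values are hypothetical.

```python
def einstein_layer(W, X, B):
    """Y[o1][o2] = sum over (i1, i2) of W[o1][o2][i1][i2] * X[i1][i2] + B[o1][o2].

    W: nested list of shape (O1, O2, I1, I2)  -- tensor-valued weight
    X: nested list of shape (I1, I2)          -- 2D input, never flattened
    B: nested list of shape (O1, O2)          -- tensor-valued bias
    """
    n_i1, n_i2 = len(X), len(X[0])
    return [
        [
            sum(W[o1][o2][i1][i2] * X[i1][i2]
                for i1 in range(n_i1) for i2 in range(n_i2))
            + B[o1][o2]
            for o2 in range(len(W[0]))
        ]
        for o1 in range(len(W))
    ]


X = [[1.0, 2.0], [3.0, 4.0]]

# Identity-like weight: output position (o1, o2) copies input position (o1, o2).
W_id = [[[[1.0 if (o1, o2) == (i1, i2) else 0.0 for i2 in range(2)]
          for i1 in range(2)] for o2 in range(2)] for o1 in range(2)]
B_zero = [[0.0, 0.0], [0.0, 0.0]]
Y_id = einstein_layer(W_id, X, B_zero)

# All-ones weight with a 1x1 output: contracts the whole input to one value.
W_sum = [[[[1.0, 1.0], [1.0, 1.0]]]]
Y_sum = einstein_layer(W_sum, X, [[0.0]])
print(Y_id, Y_sum)
```

Choosing the output shape (here 2x2 versus 1x1) is what decides which dimensions are preserved and which are contracted, which is the kind of dimensional configuration the abstract refers to.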

2602.09431 2026-05-26 cs.CR cs.CV 版本更新

Grounding-Driven Attack: Improving Encoder-based Adversarial Transferability against Large Vision-Language Models

接地驱动攻击:提升基于编码器的对抗迁移性以攻击大型视觉-语言模型

Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye, Yingnan Zhao, Ruochen Du, Haibo Hu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Nanyang Technological University(南洋理工大学) Harbin Engineering University(哈尔滨工程大学)

AI总结 提出接地驱动攻击(GDA),通过将扰动优化与文本接地证据对齐,并采用接地感知扰动分配和接地中心证据破坏策略,显著提升编码器攻击在黑盒大型视觉-语言模型上的迁移性。

Comments Under review

详情
AI中文摘要

大型视觉-语言模型(LVLMs)在多模态任务中取得了令人印象深刻的性能,但它们对视觉输入的依赖使其面临对抗性威胁。编码器攻击通过仅通过视觉编码器生成扰动,为端到端优化提供了一种高效的替代方案。然而,现有的编码器攻击通常假设替代编码器与受害LVLM的视觉编码器相同或相似。在这项工作中,我们系统研究了它们在具有异构LVLM架构的更现实的黑盒部署中的迁移性。我们发现,模型特定的视觉证据在不同模型间不一致,而文本条件接地区域与标题相关证据更紧密相关,并提供了更稳定的迁移目标。然而,现有攻击与这些区域的对齐较弱且不足以破坏它们。受这些发现启发,我们提出了接地驱动攻击(GDA),它将扰动优化与文本接地证据对齐。GDA结合了接地感知扰动分配(将扰动预算集中在接地证据区域)和接地中心证据破坏(增强其全局和局部破坏)。在多种受害模型和任务上的实验表明,GDA在黑盒迁移中始终优于现有的编码器攻击。这些结果突显了文本接地证据在对抗迁移性中的核心作用,并激励了接地感知的鲁棒性评估和防御设计。

英文摘要

Large vision-language models (LVLMs) have achieved impressive performance across multimodal tasks, but their reliance on visual inputs exposes them to adversarial threats. Encoder-based attacks provide an efficient alternative to end-to-end optimization by crafting perturbations through the vision encoder alone. However, existing encoder-based attacks often assume that the surrogate encoder is identical or similar to the victim LVLM's vision encoder. In this work, we present a systematic study of their transferability in more realistic black-box deployments with heterogeneous LVLM architectures. We find that model-specific visual evidence is inconsistent across models, whereas text-conditioned grounding regions are more closely tied to caption-relevant evidence and provide a more stable transfer target. However, existing attacks remain weakly aligned with and insufficiently disrupt these regions. Motivated by these findings, we propose Grounding-Driven Attack (GDA), which aligns perturbation optimization with text-grounded evidence. GDA combines Grounding-Aware Perturbation Allocation to concentrate perturbation budget on grounded evidence regions with Grounding-Centric Evidence Disruption to intensify their global and local disruption. Experiments across diverse victim models and tasks show that GDA consistently outperforms existing encoder-based attacks in black-box transfer. These results highlight the central role of text-grounded evidence in adversarial transferability and motivate grounding-aware robustness evaluation and defense design.

2602.08615 2026-05-26 cs.CV 版本更新

Inspiration Seeds: Learning Non-Literal Visual Combinations for Generative Exploration

灵感种子:学习用于生成式探索的非字面视觉组合

Kfir Goldberg, Elad Richardson, Yael Vinker

发表机构 * MIT(麻省理工学院)

AI总结 提出Inspiration Seeds框架,通过CLIP稀疏自编码器提取编辑方向并隔离概念对,实现无需文本提示的两张输入图像的视觉组合生成,支持早期创意阶段的探索性构思。

Comments Project page available at https://kfirgoldberg.github.io/InspirationSeeds/

详情
AI中文摘要

虽然生成模型已成为图像合成的强大工具,但它们通常针对执行精心设计的文本提示进行优化,对于想法形成之前常见的开放式视觉探索支持有限。相比之下,设计师经常从松散连接的视觉参考中汲取灵感,寻找能激发新想法的涌现连接。我们提出了Inspiration Seeds,这是一个将图像生成从最终执行转变为探索性构思的生成框架。给定两张输入图像,我们的模型生成多样且视觉连贯的组合,揭示输入之间的潜在关系,而无需依赖用户指定的文本提示。我们的方法是前馈式的,在完全通过视觉手段分解的合成三元组上训练:我们使用CLIP稀疏自编码器在CLIP潜在空间中提取编辑方向并隔离概念对。通过消除对语言的依赖并支持快速、直观的重组,我们的方法支持在创意工作的早期和模糊阶段进行视觉构思。

英文摘要

While generative models have become powerful tools for image synthesis, they are typically optimized for executing carefully crafted textual prompts, offering limited support for the open-ended visual exploration that often precedes idea formation. In contrast, designers frequently draw inspiration from loosely connected visual references, seeking emergent connections that spark new ideas. We propose Inspiration Seeds, a generative framework that shifts image generation from final execution to exploratory ideation. Given two input images, our model produces diverse, visually coherent compositions that reveal latent relationships between inputs, without relying on user-specified text prompts. Our approach is feed-forward, trained on synthetic triplets of decomposed visual aspects derived entirely through visual means: we use CLIP Sparse Autoencoders to extract editing directions in CLIP latent space and isolate concept pairs. By removing the reliance on language and enabling fast, intuitive recombination, our method supports visual ideation at the early and ambiguous stages of creative work.

2602.03983 2026-05-26 cs.RO cs.CV 版本更新

Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

通过静态-动态解耦实现高效长程视觉-语言-动作模型

Weikang Qiu, Huashuo Lei, Tinglin Huang, Rex Ying

发表机构 * Yale University(耶鲁大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出DySta框架,通过将视觉输入解耦为多级静态和动态令牌,减少上下文长度并复用KV缓存,实现高效多帧集成和推理,在基准测试和真实任务中显著提升性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型最近成为通用机器人控制的一种有前景的范式。基于视觉-语言模型(VLM)架构,VLA模型根据视觉观察和语言指令预测动作,在任务中实现了强大的性能和泛化能力。然而,VLA模型面临两个主要挑战:输入帧的有限上下文窗口,以及由于二次注意力复杂性和大参数数量导致的低效推理。为此,我们提出了DySta,一个将视觉输入解耦为多级静态和动态令牌的框架,使得(1)在帧间保留静态令牌的单一副本以显著减少上下文长度,以及(2)通过轻量级重缓存门(仅在必要时更新)重用静态令牌的键值(KV)缓存。这种设计实现了高效的多帧集成和高效推理。此外,我们引入了一个新的基准测试,更有效地评估VLA模型的多帧集成能力。实验表明,DySta在我们的基准测试中各项指标上将多帧集成能力提高了24.5%,在真实世界记忆依赖任务中将绝对成功率提高了23.3%,同时在模拟基准测试中推理速度提升2.0倍(成功率+2.3%),在真实世界通用任务中推理速度提升2.2倍(成功率+10.6%)。

英文摘要

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: a limited context window for input frames and inefficient inference due to the quadratic attention complexity and large parameter counts. To this end, we propose DySta, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and efficient inference. In addition, we introduce a new benchmark that more effectively evaluates the multi-frame integration ability of VLAs. Experiments show that DySta improves multi-frame integration by 24.5% across metrics on our benchmark and 23.3% in absolute success rate on real-world memory-dependent tasks, while accelerating inference by 2.0x (with +2.3% success rate) on simulation benchmarks and 2.2x (with +10.6% success rate) on real-world general tasks.

2601.22709 2026-05-26 cs.CV cs.AI 版本更新

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

基于置信度蒸馏的门控关系对齐用于高效视觉语言模型

Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li

发表机构 * Department of Information Technology(信息科技系) Electrical Engineering, ETH Zurich, Zurich, Switzerland(电气工程,苏黎世联邦理工学院,苏黎世,瑞士) Qualcomm AI Research, Amsterdam, the Netherlands(高通人工智能研究,阿姆斯特丹,荷兰) Department of Electrical, Electronic and Information Engineering(电气、电子与信息工程系) University of Bologna, Bologna, Italy(博洛尼亚大学,博洛尼亚,意大利) School of Electrical and Electronic Engineering(电气与电子工程学院)

AI总结 提出GRACE框架,通过信息瓶颈原理统一知识蒸馏与量化感知训练,使用置信度门控解耦蒸馏、关系中心核对齐和自适应控制器,在INT4量化下实现性能超越FP16基线并接近教师模型,同时显著降低内存和提升吞吐量。

Comments Accepted to the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

视觉语言模型(VLM)具有强大的多模态性能,但部署成本高,且训练后量化通常会导致显著的精度损失。尽管有潜力,但针对VLM的量化感知训练仍未得到充分探索。我们提出GRACE,一个在信息瓶颈原则下统一知识蒸馏和量化感知训练的框架:量化约束信息容量,而蒸馏指导在此预算内保留什么。将教师视为任务相关信息的代理,我们引入置信度门控解耦蒸馏以过滤不可靠的监督,关系中心核对齐以传递视觉标记结构,以及通过拉格朗日松弛实现的自适应控制器以平衡保真度与容量约束。在LLaVA和Qwen系列的大量基准测试中,我们的INT4模型始终优于FP16基线(例如,LLaVA-1.5-7B:SQA上70.1 vs. 66.8;Qwen2-VL-2B:MMBench上76.9 vs. 72.6),几乎匹配教师性能。使用真实的INT4内核,我们实现了3倍的吞吐量,内存减少54%。这一原则性框架显著优于现有量化方法,使GRACE成为资源受限部署的有力解决方案。代码和数据可在https://github.com/ForeverBlue816/GRACE获取。

英文摘要

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernels, we achieve 3x throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment. Code and data are available at: https://github.com/ForeverBlue816/GRACE.
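
Centered kernel alignment, which the abstract names as the basis of its relational alignment term, can be sketched in its plain linear form. This is the standard linear CKA between two sets of token features, shown as background only; it is not GRACE's relational variant, whose details the abstract does not give, and the toy teacher and student features are hypothetical.

```python
import math


def gram(X):
    """Linear Gram matrix K[i][j] = <x_i, x_j> for row vectors in X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def center(K):
    """Double-center a square matrix: K - row means - column means + grand mean."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)]
            for i in range(n)]


def frob_inner(A, B):
    """Frobenius inner product of two equally shaped matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def linear_cka(X, Y):
    """Linear CKA between feature sets X (n x d1) and Y (n x d2).

    Undefined (division by zero) if either set has identical rows.
    """
    Kc, Lc = center(gram(X)), center(gram(Y))
    return frob_inner(Kc, Lc) / math.sqrt(
        frob_inner(Kc, Kc) * frob_inner(Lc, Lc))


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three tokens, two dims
scaled = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]    # same structure, scaled
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]  # same structure, rotated 90 degrees
print(linear_cka(teacher, scaled), linear_cka(teacher, rotated))
```

Because the score depends only on pairwise token similarities, a student whose features are a scaled or rotated copy of the teacher's scores 1, which is why CKA-style terms are used to match structure rather than raw activations.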

2601.16763 2026-05-26 cs.CV 版本更新

Flow Matching for Probabilistic Monocular 3D Human Pose Estimation

基于流匹配的概率单目3D人体姿态估计

Cuong Le, Pavlo Melnyk, Bastian Wandt, Mårten Wadenbäck

发表机构 * Department of Electrical Engineering(电气工程系) Linköping University(林雪平大学) Independent researcher(独立研究者)

AI总结 提出FMPose方法,利用流匹配生成模型从2D关键点学习3D姿态分布,通过图卷积网络建模2D提升条件,在保持精度的同时显著提升推理速度。

Comments 12 pages, 2 figures, 8 tables, accepted to TMLR

详情
AI中文摘要

从单目相机视角恢复3D人体姿态是一个高度病态的问题,因为存在深度模糊。早期从2D提升3D姿态的研究常常包含错误但过度自信的3D估计。为了缓解这一问题,新兴的概率方法将3D估计视为分布,考虑姿态的不确定性度量。属于类似范畴,我们提出了FMPose,一种基于流匹配生成方法的概率3D人体姿态估计方法。以2D线索为条件,流匹配方案通过连续归一化流学习从简单源分布到合理3D人体姿态分布的最优传输。2D提升条件通过图卷积网络建模,利用人体关节之间的可学习连接作为图结构进行特征聚合。尽管处理时间和精度之间存在权衡,但在等精度比较中,FMPose的处理时间显著快于扩散模型,并且还提供了另一种更快且更准确的配置。实验结果表明,我们的FMPose在3D人体姿态估计的两个常见基准(Human3.6M、MPI-INF-3DHP)上相比当前最先进方法有显著改进。此外,FMPose在更具挑战性的3DPW数据集上表现出竞争性能。代码实现见https://github.com/cuongle1206/FMPose。

英文摘要

Recovering 3D human poses from a monocular camera view is a highly ill-posed problem due to the depth ambiguity. Earlier studies on 3D human pose lifting from 2D often contain incorrect-yet-overconfident 3D estimations. To mitigate the problem, emerging probabilistic approaches treat the 3D estimations as a distribution, taking into account the uncertainty measurement of the poses. Falling in a similar category, we propose FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach. Conditioned on the 2D cues, the flow matching scheme learns the optimal transport from a simple source distribution to the plausible 3D human pose distribution via continuous normalizing flows. The 2D lifting condition is modeled via graph convolutional networks, leveraging the learnable connections between human body joints as the graph structure for feature aggregation. While trade-offs between processing time and precision exist, already in the equal-accuracy comparison, FMPose exhibits significantly faster processing time than the diffusion model, and also offers another faster and more accurate configuration. Experimental results show major improvements of our FMPose over current state-of-the-art methods on two common benchmarks for 3D human pose estimation, Human3.6M and MPI-INF-3DHP. Additionally, FMPose shows competitive performance on the more challenging 3DPW dataset. The code implementation is available at https://github.com/cuongle1206/FMPose
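
The flow matching scheme mentioned above can be sketched in its simplest common form. The example assumes a straight-line probability path between a source sample and a target pose, with the constant velocity as the regression target, plus Euler integration for sampling; the abstract does not state which path or solver FMPose uses, the 2D conditioning is omitted, and all function names and values are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from source x0 to target x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def fm_loss(pred_v, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    u = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred_v, u)) / len(u)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


source = [0.0, 0.0, 0.0]   # stand-in for a noise sample
pose = [1.0, 2.0, 3.0]     # stand-in for one 3D joint position

# An oracle velocity field replaces the learned network in this sketch.
oracle = lambda x, t: target_velocity(source, pose)
sampled = euler_sample(oracle, source, steps=10)
midpoint = interpolate(source, pose, 0.5)
print(sampled, midpoint)
```

In training, a network would take the interpolated point, the time, and the 2D condition as input and be fitted with `fm_loss`; drawing several source samples and integrating each one yields multiple pose hypotheses for the same 2D input.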

2601.08205 2026-05-26 cs.CV cs.LG 版本更新

FUME: Fused Unified Multi-Gas Emission Network for Livestock Rumen Acidosis Detection

FUME: 用于牲畜瘤胃酸中毒检测的融合统一多气体排放网络

Taminul Islam, Toqi Tahamid Sarker, Mohamed Embaby, Khaled R Ahmed, Amer AbuGhazaleh

发表机构 * Southern Illinois University, Carbondale(南方伊利诺伊大学,卡本达勒分校) University of California, Davis(加州大学戴维斯分校)

AI总结 提出FUME网络,利用双气体(CO2和CH4)光学成像,通过轻量双流架构和通道注意力融合,实现瘤胃酸中毒的高精度分割与分类。

Comments 10 pages, 5 figures

详情
Journal ref
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2026, pp. 510-519
AI中文摘要

瘤胃酸中毒是奶牛中常见的代谢紊乱,导致重大经济损失和动物福利问题。当前的诊断方法依赖于侵入性pH测量,限制了持续监测的可扩展性。我们提出了FUME(融合统一多气体排放网络),这是首个在体外条件下通过双气体光学成像进行瘤胃酸中毒检测的深度学习方法。我们的方法利用红外相机捕获的互补二氧化碳(CO2)和甲烷(CH4)排放模式,将瘤胃健康状态分类为健康、过渡和酸中毒。FUME采用轻量双流架构,包含权重共享编码器、模态特定自注意力和通道注意力融合,联合优化气体羽流分割和奶牛健康分类。我们引入了首个双气体OGI数据集,包含8967个标注帧,覆盖六个pH水平,并带有像素级分割掩码。实验表明,FUME在仅使用1.28M参数和1.97G MACs的情况下,实现了80.99%的mIoU和98.82%的分类准确率——在分割质量上优于最先进方法,且计算成本降低10倍。消融研究揭示,CO2提供主要的判别信号,而双任务学习对于最优性能至关重要。我们的工作确立了基于气体排放的牲畜健康监测的可行性,为实用的体外酸中毒检测系统铺平了道路。代码可在 https://github.com/taminulislam/fume 获取。

英文摘要

Ruminal acidosis is a prevalent metabolic disorder in dairy cattle causing significant economic losses and animal welfare concerns. Current diagnostic methods rely on invasive pH measurement, limiting scalability for continuous monitoring. We present FUME (Fused Unified Multi-gas Emission Network), the first deep learning approach for rumen acidosis detection from dual-gas optical imaging under in vitro conditions. Our method leverages complementary carbon dioxide (CO2) and methane (CH4) emission patterns captured by infrared cameras to classify rumen health into Healthy, Transitional, and Acidotic states. FUME employs a lightweight dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion, jointly optimizing gas plume segmentation and classification of dairy cattle health. We introduce the first dual-gas OGI dataset comprising 8,967 annotated frames across six pH levels with pixel-level segmentation masks. Experiments demonstrate that FUME achieves 80.99% mIoU and 98.82% classification accuracy while using only 1.28M parameters and 1.97G MACs--outperforming state-of-the-art methods in segmentation quality with 10x lower computational cost. Ablation studies reveal that CO2 provides the primary discriminative signal and dual-task learning is essential for optimal performance. Our work establishes the feasibility of gas emission-based livestock health monitoring, paving the way for practical, in vitro acidosis detection systems. Codes are available at https://github.com/taminulislam/fume.

2512.16710 2026-05-26 cs.CV 版本更新

A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry

基于标志点的全面胎儿生物测量多中心、多设备基准数据集

Chiara Di Vece, Zhehua Mao, Netanell Avisdris, Brian Dromey, Raffaele Napolitano, Dafna Ben Bashat, Francisco Vasconcelos, Danail Stoyanov, Leo Joskowicz, Sophia Bano

发表机构 * Department of Computer Science and UCL Hawkes Institute, University College London, London WC1E 6BT, UK(计算机科学系和UCL Hawkes研究所,伦敦大学学院,伦敦WC1E 6BT,英国) Sagol Brain Institute, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel(萨戈尔脑研究所,特拉维夫 Sourasky 医疗中心,以色列特拉维夫) School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel(计算机科学与工程学院,耶路撒冷希伯来大学,耶路撒冷,以色列) UCLH NHS Foundation Trust and the Elizabeth Garrett Anderson Institute for Women’s Health, UCL, London, UK(UCLH NHS基金会信托和Elizabeth Garrett Anderson妇女健康研究所,UCL,伦敦,英国) Sagol School of Neuroscience and Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel(萨戈尔神经科学学院和Sackler医学院,特拉维夫大学,以色列特拉维夫)

AI总结 为解决胎儿超声生物测量中手动标注耗时且依赖操作者的问题,构建了包含4513张图像、来自3个临床中心7种设备的公开基准数据集,提供标准化评估流程和基线结果,验证了单中心训练会高估性能,为多中心泛化研究提供基准。

Comments 11 pages, 5 figures, 3 tables

详情
Journal ref
Scientific Reports (2026)
AI中文摘要

准确的胎儿生长评估依赖于通过手动识别标准平面中的解剖标志点进行精确生物测量。手动标志点标注耗时、依赖操作者,且易受扫描仪和站点间差异影响,限制了自动化方法的可重复性。需要多源标注数据集来开发人工智能辅助的胎儿生长评估方法。为解决这一瓶颈,我们提出了一个开放的、多中心、多设备的胎儿超声图像基准数据集,包含用于临床胎儿生物测量的专家解剖标志点标注。这些测量包括头双顶径和枕额径、腹横径和前后径以及股骨长度。该数据集包含来自1904名受试者的4513张去标识超声图像,这些图像在三个临床站点使用七种不同的超声设备采集。我们提供标准化的、受试者不重叠的训练/测试划分、评估代码和基线结果,以实现方法的公平和可重复比较。使用自动生物测量模型,我们量化了域偏移,并证明局限于单个中心的训练和评估相对于多中心测试会显著高估性能。据我们所知,这是第一个公开可用的多中心、多设备、标志点标注数据集,覆盖所有主要胎儿生物测量指标,为胎儿生物测量中的域适应和多中心泛化提供了稳健的基准,并有助于跨中心实现更可靠的AI辅助胎儿生长评估。所有数据、标注、训练代码和评估流程均已公开。

英文摘要

Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset comprises 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.

2512.10548 2026-05-26 cs.CV 版本更新

Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

Blink: 动态视觉令牌分辨率增强多模态理解

Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) Baidu Inc(百度公司)

AI总结 提出Blink框架,通过注意力引导的令牌超分辨率和动态丢弃机制,在单次前向传播中模拟人类眨眼式扫描,提升多模态大语言模型的视觉感知能力。

Comments CVPR 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在各种视觉-语言任务上取得了显著进展,但其视觉感知仍然有限。相比之下,人类通过动态扫描并顺序地以“眨眼式”过程聚焦于显著区域,高效地感知复杂场景。受此策略启发,我们首先研究MLLMs是否表现出类似行为。我们的初步分析表明,MLLMs自然地关注不同层的视觉区域,并且选择性地将更多计算分配给显著令牌可以增强视觉感知。基于这一见解,我们提出Blink,一种动态视觉令牌分辨率框架,在单次前向传播中模拟人类启发的过程。具体来说,Blink包括两个模块:显著性引导扫描和动态令牌分辨率。它首先基于注意力图估计每层视觉令牌的显著性,并通过即插即用的令牌超分辨率(TokenSR)模块扩展重要令牌。在下一层,当扩展令牌失去焦点时,它会丢弃它们。这种动态机制平衡了广泛探索和细粒度聚焦,从而自适应且高效地增强视觉感知。大量实验验证了Blink在增强视觉感知和多模态理解方面的有效性。

英文摘要

Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
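
The saliency-guided scanning step described above can be illustrated with a minimal Python sketch. It assumes saliency is the average attention a visual token receives across query rows and that the top-k tokens are the ones kept for extension; both the averaging rule and the function names `token_saliency` and `select_salient` are illustrative assumptions, not the Blink implementation.

```python
def token_saliency(attention):
    """Average attention each visual token receives across all query rows."""
    n_queries = len(attention)
    n_tokens = len(attention[0])
    return [sum(row[j] for row in attention) / n_queries for j in range(n_tokens)]


def select_salient(attention, k):
    """Indices of the k most-attended visual tokens, highest saliency first."""
    scores = token_saliency(attention)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy attention map: 3 query rows attending over 4 visual tokens (made-up values).
attention = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.50, 0.40, 0.05],
    [0.20, 0.30, 0.45, 0.05],
]
print(select_salient(attention, 2))  # [1, 2]
```

In this toy case tokens 1 and 2 would be candidates for extension, and a later layer could drop them again if their saliency falls.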

2512.08254 2026-05-26 cs.CV 版本更新

Real-World Scene Recovery for Scattering-Degraded Images Using Spatial and Frequency Priors

使用空间和频率先验的散射退化图像真实场景恢复

Yun Liu, Tao Li, Guanghui Yue, Wenqi Ren, Cosmin Ancuti, Weisi Lin

发表机构 * College of Artificial Intelligence, Southwest University(西南大学人工智能学院) School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University(深圳大学医学院生物医学工程学院) School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University(中山大学深圳校区网络空间安全学院) ETcTI, University Politehnica Timisoara(蒂米什瓦拉理工大学ETcTI学院) College of Computing and Data Science, Nanyang Technological University (NTU)(南洋理工大学计算机与数据科学学院)

AI总结 提出空间和频率先验(SFP)方法,通过空间域传输图估计和频率域自适应增强策略,实现散射退化图像的真实场景恢复,在多种真实场景中优于现有方法。

Comments 18 pages, 22 figures, submitted to IEEE T-PAMI

详情
AI中文摘要

从受散射效应(如雾、沙尘暴、水下和遥感条件)退化的真实图像中恢复场景,仍然是计算机视觉中一个基本但具有挑战性的问题。现有方法要么依赖单一先验(本质上不足以表征多样的散射退化),要么使用在合成数据上训练的深度网络(通常对真实场景的泛化能力有限)。在本文中,我们提出空间和频率先验(SFP)用于散射诱导退化下的真实场景恢复。在空间域,我们观察到散射退化图像的逆在其光谱方向上揭示了一个与底层场景传输相关的投影。基于这一观察,我们制定了一个空间先验来估计传输图,从而能够在散射效应下有效恢复场景辐射。在频率域,我们设计了一种由两个新先验引导的自适应频率增强策略。第一个先验假设退化图像中跨通道的直流(DC)分量的平均强度近似于对应清晰图像的平均强度。第二个先验基于观察:在清晰图像中,窄带内的低径向频率仅占整个频谱的一小部分。这些先验能够针对不同频带的散射诱导衰减进行补偿。最后,对空间域和频率域的结果进行加权融合,得到最终的恢复图像。在多种真实世界散射退化场景上的大量实验验证,与最先进方法相比,我们的SFP实现了优越的性能和强大的泛化能力。

英文摘要

Scene recovery from real-world images degraded by scattering effects, such as haze, sandstorm, underwater, and remote sensing conditions, remains a fundamental yet challenging problem in computer vision. Existing methods either rely on a single prior, which is inherently insufficient to characterize diverse scattering degradations, or employ deep networks trained on synthetic data, which often suffer from limited generalization to real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery under scattering-induced degradations. In the spatial domain, we observe that the inverse of a scattering-degraded image reveals a projection along its spectral direction that correlates with the underlying scene transmission. Based on this observation, a spatial prior is formulated to estimate the transmission map, enabling effective recovery of scene radiance under scattering effects. In the frequency domain, we design an adaptive frequency enhancement strategy guided by two novel priors. The first prior assumes that the mean intensity of the direct current (DC) components across channels in degraded images approximates that of the corresponding clear images. The second prior is based on the observation that, in clear images, low radial frequencies within a narrow band contribute only a small proportion of the overall spectrum. These priors enable targeted compensation for scattering-induced attenuation across different frequency bands. Finally, a weighted fusion of the spatial and frequency domain results is performed to obtain the final recovered image. Extensive experiments on diverse real-world scattering-degraded scenarios verify that our SFP achieves superior performance and strong generalization capability compared to state-of-the-art methods.

2512.05791 2026-05-26 physics.med-ph cs.CV cs.LG math.PR 版本更新

Fast and Robust Diffusion Posterior Sampling for MR Image Reconstruction Using the Preconditioned Unadjusted Langevin Algorithm

使用预条件未调整朗之万算法实现快速且鲁棒的MR图像重建扩散后验采样

Moritz Blumenthal, Tina Holliber, Jonathan I. Tamir, Martin Uecker

发表机构 * Institute of Biomedical Imaging, Graz University of Technology, Graz, Austria Department of Radiology, Boston Children's Hospital, Harvard Medical School, Boston, USA Chandra Family Department of Electrical Engineering, University of Texas at Austin, USA Department of Diagnostic Medicine, Dell Medical School, University of Texas at Austin, USA

AI总结 针对MR图像重建中扩散后验采样速度慢和参数调优问题,提出基于预条件未调整朗之万算法的精确似然方法,实现快速收敛且无需调参的鲁棒采样。

Comments Submitted to Magnetic Resonance in Medicine

详情
AI中文摘要

目的:结合未调整朗之万算法(ULA)与扩散模型,可以从高度欠采样的k空间数据生成高质量MRI重建结果并附带不确定性估计。然而,扩散后验采样(DPS)或似然退火等采样方法存在重建时间长和需要参数调优的问题。本文旨在开发一种具有快速收敛性的鲁棒采样算法。 理论与方法:在用于后验采样的反向扩散过程中,精确似然与所有噪声尺度下的扩散先验相乘。为克服收敛缓慢的问题,采用了预条件技术。该方法在fastMRI数据上训练,并在健康志愿者的回顾性欠采样脑部数据上测试。 结果:对于笛卡尔和非笛卡尔加速MRI的后验采样,新方法在重建速度和样本质量上均优于退火采样和DPS。 结论:所提出的预条件精确似然方法能够在各种MRI重建任务中实现快速可靠的后验采样,无需参数调优。

英文摘要

Purpose: The Unadjusted Langevin Algorithm (ULA) in combination with diffusion models can generate high quality MRI reconstructions with uncertainty estimation from highly undersampled k-space data. However, sampling methods such as diffusion posterior sampling (DPS) or likelihood annealing suffer from long reconstruction times and the need for parameter tuning. The purpose of this work is to develop a robust sampling algorithm with fast convergence. Theory and Methods: In the reverse diffusion process used for sampling the posterior, the exact likelihood is multiplied with the diffused prior at all noise scales. To overcome the issue of slow convergence, preconditioning is used. The method is trained on fastMRI data and tested on retrospectively undersampled brain data of a healthy volunteer. Results: For posterior sampling in Cartesian and non-Cartesian accelerated MRI the new approach outperforms annealed sampling and DPS in terms of reconstruction speed and sample quality. Conclusion: The proposed exact likelihood with preconditioning enables rapid and reliable posterior sampling across various MRI reconstruction tasks without the need for parameter tuning.
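
To illustrate why preconditioning speeds up Langevin sampling, here is a minimal Python sketch of a generic preconditioned unadjusted Langevin step on a toy, badly scaled Gaussian target. It is not the paper's MRI reconstruction method: the diagonal preconditioner, the toy target, the step size, and all function names are assumptions chosen only to show the update rule.

```python
import math
import random


def grad_log_density(x, variances):
    """Score of a zero-mean Gaussian with independent components."""
    return [-xi / v for xi, v in zip(x, variances)]


def ula_step(x, variances, step, precond, rng):
    """One unadjusted Langevin step with a diagonal preconditioner M:

    x_next = x + step * M * score(x) + sqrt(2 * step * M) * noise
    """
    score = grad_log_density(x, variances)
    return [
        xi + step * m * s + math.sqrt(2.0 * step * m) * rng.gauss(0.0, 1.0)
        for xi, s, m in zip(x, score, precond)
    ]


def run_chain(variances, precond, step=0.1, n_steps=20000, burn_in=2000, seed=0):
    """Run the chain from the origin and return the samples after burn-in."""
    rng = random.Random(seed)
    x = [0.0] * len(variances)
    samples = []
    for k in range(n_steps):
        x = ula_step(x, variances, step, precond, rng)
        if k >= burn_in:
            samples.append(x)
    return samples


def sample_variance(samples, dim):
    """Empirical variance of one coordinate of the chain."""
    values = [s[dim] for s in samples]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


variances = [1.0, 100.0]  # ill-conditioned toy target
plain = run_chain(variances, precond=[1.0, 1.0])
preconditioned = run_chain(variances, precond=variances)
print(sample_variance(plain, 1), sample_variance(preconditioned, 1))
```

With the preconditioner matched to the target scales, every coordinate contracts at the same rate per step, so the wide coordinate mixes as fast as the narrow one. Because the chain is unadjusted, its stationary variance carries a small step-size bias (a factor of 1/(1 - step/2) in this toy setting).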

2512.01382 2026-05-26 cs.CV 版本更新

Reversible Inversion for Training-Free Exemplar-guided Image Editing

可逆反演用于免训练示例引导图像编辑

Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院) School of Computing and Artificial Intelligence, Southwest Jiaotong University, China(西南交通大学计算机科学与人工智能学院) School of Computer Science and Technology, Tongji University, China(同济大学计算机科学与技术学院) Department of Computer Science, University of Warwick, United Kingdom(英国沃里克大学计算机科学系)

AI总结 提出可逆反演(ReInversion)方法,通过两阶段去噪和掩码引导选择性去噪策略,实现免训练的高效示例引导图像编辑,达到最优性能且计算开销最低。

详情
AI中文摘要

示例引导图像编辑(EIE)旨在根据视觉参考修改源图像。现有方法通常需要大规模预训练来学习源图像和参考图像之间的关系,计算成本高。作为一种免训练的替代方案,反演技术可用于将源图像映射到潜在空间进行操作。然而,我们的实证研究表明,标准反演对于EIE是次优的,导致质量差和效率低。为了解决这一挑战,我们引入了可逆反演(ReInversion),用于有效且高效的EIE。具体来说,ReInversion作为一个两阶段去噪过程运行,首先以源图像为条件,然后以参考图像为条件。此外,我们引入了一种掩码引导选择性去噪(MSD)策略,将编辑限制在目标区域,保持背景的结构一致性。定性和定量比较都表明,我们的ReInversion方法以最低的计算开销实现了最先进的EIE性能。

英文摘要

Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce Reversible Inversion (ReInversion) for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. Besides, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.

2512.00125 2026-05-26 cs.CV cs.LG 版本更新

Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance

混合合成数据生成与域随机化实现极端类别不平衡下基于视觉的零样本零件检测

Ruo-Syuan Mei, Sixian Jia, Guangze Li, Soo Yeon Lee, Brian Musser, William Keller, Sreten Zakula, Jorge Arinez, Chenhui Shao

发表机构 * Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48109, USA Materials & Manufacturing Systems Research Lab, General Motors, Warren, MI 48092, USA

AI总结 提出一种结合仿真渲染、域随机化和真实背景合成的混合合成数据生成框架,仅用合成数据训练YOLOv8n和MobileNetV3-small模型,在极端类别不平衡下实现零样本工业零件检测,检测mAP@0.5达0.995,分类准确率96%,平衡准确率90.1%。

Comments Submitted to the NAMRC 54

详情
AI中文摘要

机器学习,特别是深度学习,正在改变工业质量检测。然而,训练鲁棒的机器学习模型通常需要大量高质量标注数据,这在制造业中获取成本高昂、耗时且劳动密集。此外,缺陷样本本身稀少,导致严重的类别不平衡,降低模型性能。这些数据约束阻碍了基于机器学习的质量检测方法在实际生产环境中的广泛采用。合成数据生成(SDG)通过高效、经济且可扩展的方式创建大规模、平衡且完全标注的数据集,提供了一种有前景的解决方案。本文提出一种混合SDG框架,集成了基于仿真的渲染、域随机化和真实背景合成,无需人工标注即可实现基于计算机视觉的工业零件检测的零样本学习。该SDG流水线通过改变零件几何、光照和表面属性,并将合成零件合成到真实图像背景上,在一小时内生成12,960张标注图像。利用YOLOv8n骨干网络进行目标检测、MobileNetV3-small进行质量分类的两阶段架构,仅使用合成数据训练,并在300个真实工业零件上评估。所提方法在检测上达到mAP@0.5为0.995,分类准确率96%,平衡准确率90.1%。与基于少量真实数据的基线方法相比,性能显著提升。在严重类别不平衡下,所提基于SDG的方法达到90-91%的平衡准确率,而基线仅达到50%准确率。这些结果表明,所提方法能够为真实制造应用实现免标注、可扩展且鲁棒的质量检测。

英文摘要

Machine learning, particularly deep learning, is transforming industrial quality inspection. Yet, training robust machine learning models typically requires large volumes of high-quality labeled data, which are expensive, time-consuming, and labor-intensive to obtain in manufacturing. Moreover, defective samples are intrinsically rare, leading to severe class imbalance that degrades model performance. These data constraints hinder the widespread adoption of machine learning-based quality inspection methods in real production environments. Synthetic data generation (SDG) offers a promising solution by enabling the creation of large, balanced, and fully annotated datasets in an efficient, cost-effective, and scalable manner. This paper presents a hybrid SDG framework that integrates simulation-based rendering, domain randomization, and real background compositing to enable zero-shot learning for computer vision-based industrial part inspection without manual annotation. The SDG pipeline generates 12,960 labeled images in one hour by varying part geometry, lighting, and surface properties, and then compositing synthetic parts onto real image backgrounds. A two-stage architecture utilizing a YOLOv8n backbone for object detection and MobileNetV3-small for quality classification is trained exclusively on synthetic data and evaluated on 300 real industrial parts. The proposed approach achieves an mAP@0.5 of 0.995 for detection, 96% classification accuracy, and 90.1% balanced accuracy. Comparative evaluation against few-shot real-data baseline approaches demonstrates significant improvement. The proposed SDG-based approach achieves 90-91% balanced accuracy under severe class imbalance, while the baselines reach only 50% accuracy. These results demonstrate that the proposed method enables annotation-free, scalable, and robust quality inspection for real-world manufacturing applications.
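
The abstract reports both classification accuracy and balanced accuracy, and the gap between the two matters under class imbalance. The following minimal Python sketch shows why, using made-up labels: the counts (285 good parts, 15 defective) and the function names are illustrative assumptions, not figures from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(y_pred[i] == c for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced test set: 0 = good part, 1 = defective part.
y_true = [0] * 285 + [1] * 15
# A degenerate classifier that always predicts the majority class.
y_majority = [0] * 300

print(accuracy(y_true, y_majority))           # 0.95
print(balanced_accuracy(y_true, y_majority))  # 0.5
```

A classifier that never flags a defect still scores 95% plain accuracy here, while its balanced accuracy sits at chance level (50%), which is why balanced accuracy is the more informative number when defects are rare.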

2511.18794 2026-05-26 cs.GR cs.CV 版本更新

ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes

ChronoGS:多时期场景中不变性与变化的解耦

Zhongtao Wang, Jiaqi Dai, Qingtian Zhu, Yilong Li, Mai Su, Fei Zhu, Meng Gai, Shaorong Wang, Chengwei Pan, Yisong Chen, Guoping Wang

发表机构 * Peking University(北京大学) Beijing Forestry University(北京林业大学) The University of Tokyo(东京大学) Beihang University(北航)

AI总结 提出ChronoGS,一种时间调制的高斯表示方法,通过统一锚点支架重建多时期场景,并解耦稳定与演化组件,实现时间一致的重建,同时发布ChronoScene基准数据集。

Comments CVPR26 Highlight

详情
AI中文摘要

多时期图像集合在现实应用中很常见。城市为测绘而重新扫描,建筑工地为进度跟踪而再次访问,自然区域为环境变化而监测。这些数据形成多时期场景,其中几何和外观会演变。重建此类场景是一个重要但尚未充分探索的问题。现有管线依赖于不兼容的假设:静态和野外方法强制单一几何,而动态方法假设平滑运动,两者在长期、不连续变化下均失败。为解决此问题,我们引入ChronoGS,一种时间调制的高斯表示,它在统一锚点支架内重建所有时期。它还被设计为解耦稳定和演化组件,实现多时期场景的时间一致重建。为促进相关研究,我们发布ChronoScene数据集,一个真实和合成多时期场景的基准,捕捉几何和外观变化。实验表明,ChronoGS在重建质量和时间一致性上始终优于基线。我们的代码和ChronoScene数据集公开于https://github.com/ZhongtaoWang/ChronoGS。

英文摘要

Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It is also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release the ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset are publicly available at https://github.com/ZhongtaoWang/ChronoGS.

2511.15407 2026-05-26 cs.AI cs.CV cs.LG 版本更新

IPR-1: Interactive Physical Reasoner

IPR-1:交互式物理推理器

Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出IPR模型,通过世界模型滚动评分和强化VLM策略,结合物理中心动作代码PhysCode,在1000+异构游戏基准上实现鲁棒的物理推理,性能超越GPT-5并零样本迁移至未见游戏。

Comments Accepted by CVPR 2026. 13 pages of main text and 20 pages of appendices. Project page: https://mybearyzhang.github.io/ipr-1

详情
AI中文摘要

人类通过观察、与环境交互以及内化物理和因果关系来学习。在这里,我们旨在探究一个智能体是否能够通过交互类似地获得类人推理能力,并随着更多经验不断改进。为此,我们引入了一个包含1000+异构游戏的Game-to-Unseen (G2U)基准,这些游戏展现出显著的视觉领域差异。现有方法(包括VLM和世界模型)难以捕捉底层物理和因果关系,因为它们不关注核心机制且过度拟合视觉细节。VLM/VLA智能体能够推理,但在交互设置中缺乏前瞻性,而世界模型进行想象但模仿视觉模式而非分析物理和因果关系。因此,我们提出IPR(交互式物理推理器),利用世界模型滚动来评分和强化VLM的策略,并引入PhysCode,一种以物理为中心的动作代码,将语义意图与动力学对齐,为预测和推理提供共享动作空间。在1000+游戏上预训练后,我们的IPR在从原始直觉到目标驱动推理的各个层次上表现稳健,甚至在总体上超越了GPT-5。我们发现,性能随着训练游戏和交互步骤的增加而提升,并且模型还能零样本迁移到未见过的游戏。这些结果支持以物理为中心的交互作为稳步提升物理推理的路径。更多演示和项目详情请见https://mybearyzhang.github.io/ipr-1。

英文摘要

Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.

2510.02730 2026-05-26 cs.LG cs.CV 版本更新

Dale meets Langevin: A Multiplicative Denoising Diffusion Model

Dale meets Langevin: 乘法去噪扩散模型

Nishanth Shetty, Madhava Prasath, Chandra Sekhar Seelamantula

发表机构 * Department of Electrical Engineering(电子工程系) Indian Institute of Science(印度科学研究所)

AI总结 提出以几何布朗运动为前向噪声过程的乘法分数生成模型,推导反向时间SDE并设计两种乘法采样器,引入Hyvärinen分数和乘法去噪分数匹配目标,在图像数据集上验证生成能力。

详情
AI中文摘要

指数梯度下降(EGD)是一种受生物学启发的优化算法,遵循Dale定律,在收敛时产生对数正态分布的突触权重,与神经科学的实验观察一致。由于几何布朗运动(GBM)在任何固定时间的边际分布是对数正态的,这种收敛性质揭示了EGD与基于GBM的随机过程之间的自然联系。我们提出了一种基于分数的乘法生成模型,以GBM作为前向噪声过程,并推导了其在环境空间和对数变换空间中的相应反向时间SDE。通过离散化相应的反向时间SDE,我们推导出两种乘法采样器:直接从环境空间反向时间SDE得到的符号无关采样器,以及通过Lamperti变换得到的符号保持采样器,我们称之为Dale-Langevin采样器。我们将该框架与镜像Langevin动力学联系起来,表明优化中驱动EGD的凸函数精确地控制着Dale-Langevin采样器。虽然标准Stein分数(定义为随机向量X在x处的∇log p_X(x))在基于加性噪声的扩散模型中自然出现,但在乘法设置中,我们遇到了一种用于采样的修改版Stein分数,我们称之为Hyvärinen分数:x∘∇log p_X(x)。为了估计该分数,我们提出了一种新的乘法去噪分数匹配目标(M-DSM),证明了其与乘法显式分数匹配损失的等价性,并表明它包含了非负分数匹配损失。在MNIST、Fashion-MNIST、Kuzushiji-MNIST和CIFAR-10上的实验结果验证了所提框架的生成能力。

英文摘要

Exponentiated gradient descent (EGD), a biologically motivated optimisation algorithm that respects Dale's law, produces log-normally distributed synaptic weights at convergence, in alignment with experimental observations in neuroscience. Since the marginal distribution of geometric Brownian motion (GBM) at any fixed time is log-normal, this convergence property reveals a natural connection between EGD and GBM-based stochastic processes. We propose a multiplicative score-based generative model with GBM as a forward noising process and derive its corresponding reverse-time SDE in both the ambient space and in the $\log$-transformed space. We derive two multiplicative samplers by discretising the corresponding reverse-time SDEs: a sign-agnostic sampler obtained directly from the ambient-space reverse-time SDE, and a sign-preserving sampler, which we refer to as the Dale-Langevin sampler, obtained via the Lamperti transform. We connect the framework to Mirrored Langevin Dynamics, showing that the convex function driving EGD in optimisation precisely governs the Dale-Langevin sampler. While the standard Stein score, defined as $\nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$ for a random vector $\boldsymbol{X}$ evaluated at $\boldsymbol{x}$, comes up naturally in the additive noise based diffusion models, in the multiplicative setting, we encounter a modified version of the Stein score for sampling, which we refer to as the Hyvärinen score: $\boldsymbol{x} \circ \nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$. To estimate the score, we propose a new multiplicative denoising score-matching objective (M-DSM), prove its equivalence to the multiplicative explicit score-matching loss and show that it subsumes the non-negative score matching loss. Experimental results on MNIST, Fashion-MNIST, Kuzushiji-MNIST, and CIFAR-10 validate the generative capability of the proposed framework.
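
The link between the log-normal marginal of GBM, the log (Lamperti) transform, and the Hyvärinen score can be made explicit with a short worked derivation. This is a sketch under stated assumptions: a scalar GBM with constant drift $\mu$ and volatility $\sigma$ (symbols introduced here for illustration; the paper's parameterisation may differ), and a strictly positive random vector $\boldsymbol{X}$ with $\boldsymbol{Y} = \log \boldsymbol{X}$ taken elementwise.

```latex
% GBM and its log-normal marginal (constant \mu and \sigma are assumptions)
\begin{align}
  dX_t &= \mu X_t\,dt + \sigma X_t\,dW_t, \\
  Y_t = \log X_t \;\Rightarrow\; dY_t &= \left(\mu - \tfrac{1}{2}\sigma^2\right)dt + \sigma\,dW_t, \\
  \log X_t \mid X_0 &\sim \mathcal{N}\!\left(\log X_0 + \left(\mu - \tfrac{1}{2}\sigma^2\right)t,\; \sigma^2 t\right).
\end{align}

% Change of variables y = \log x (elementwise, x > 0)
\begin{align}
  p_{\boldsymbol{Y}}(\boldsymbol{y}) &= p_{\boldsymbol{X}}(e^{\boldsymbol{y}}) \prod_i e^{y_i}, \\
  \nabla_{\boldsymbol{y}} \log p_{\boldsymbol{Y}}(\boldsymbol{y})
    &= \boldsymbol{x} \circ \nabla_{\boldsymbol{x}} \log p_{\boldsymbol{X}}(\boldsymbol{x}) + \mathbf{1}.
\end{align}
```

Under these assumptions, the ordinary Stein score in log-space equals the Hyvärinen score in the ambient space shifted by a constant vector of ones, which suggests why the quantity $\boldsymbol{x} \circ \nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$ arises naturally once the noise is multiplicative.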

2509.25339 2026-05-26 cs.CV cs.AI cs.LG eess.IV 版本更新

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

VisualOverload: 在真正密集场景中探测VLM的视觉理解

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne

发表机构 * Independent Researcher(独立研究者) JKU Linz(林茨JKU) MIT CSAIL Tübingen AI Center(图宾根人工智能中心) Stanford(斯坦福) MIT-IBM Watson AI Lab(MIT-IBM沃森人工智能实验室)

AI总结 提出VisualOverload基准,通过密集场景中的简单视觉任务测试VLM,发现最佳模型仅达69.5%准确率,揭示计数、OCR和逻辑一致性等关键缺陷。

Comments Accepted at CVPR 2026

详情
AI中文摘要

最先进的VLM是否真正解决了基本视觉理解?我们提出VisualOverload,一个略有不同的视觉问答(VQA)基准,包含2,720个问答对,并持有私有真实答案。与以往通常关注近全局图像理解的VQA数据集不同,VisualOverload挑战模型在密集(或过载)场景中执行简单的、无需知识的视觉任务。我们的数据集由公共领域绘画的高分辨率扫描图组成,这些绘画包含多个人物、动作和展开的子情节,背景细节丰富。我们手动为这些图像标注了六个任务类别的问题,以探测对场景的彻底理解。我们假设当前基准高估了VLM的性能,编码和推理细节对它们来说仍然是一项具有挑战性的任务,尤其是当面对密集场景时。实际上,我们观察到在37个测试模型中,即使是最好的模型(o3)在我们最难的测试子集上也仅达到19.6%的准确率,在所有问题上总体准确率为69.5%。除了全面评估外,我们还通过错误分析补充了基准,揭示了多种失败模式,包括缺乏计数能力、OCR失败以及复杂任务下惊人的逻辑不一致。总之,VisualOverload暴露了当前视觉模型中的关键差距,并为社区开发更好的模型提供了重要资源。基准:http://paulgavrikov.github.io/visualoverload

英文摘要

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload

2509.12194 2026-05-26 cs.AI cs.CV 版本更新

Teaching large language models to reason like expert diagnosticians

教会大型语言模型像专家诊断医生一样推理

Thomas A. Buckley, Riccardo Conci, Peter G. Brodeur, Jason Gusdorf, Sourik Beltrán, Bita Behrouzi, Byron Crowe, Jacob Dockterman, Muzzammil Muhammad, Sarah Ohnigian, Andrew Sanchez, James A. Diao, Aashna P. Shah, Daniel Restrepo, Eric S. Rosenberg, Andrew S. Lea, Emily Glanton, Kimberly LeBlanc, Undiagnosed Diseases Network, Marinka Zitnik, Scott H. Podolsky, Zahir Kanjee, Raja-Elie E. Abdulnour, Jacob M. Koshy, Adam Rodman, Arjun K. Manrai

发表机构 * Department of Biomedical Informatics, Harvard Medical School(哈佛医学院生物医学信息学系) Department of Medicine, Beth Israel Deaconess Medical Center(贝斯以色列女执事医疗中心内科部) The Mongan Institute, Massachusetts General Hospital(麻省总医院蒙根研究所) Division of Gastroenterology, Brigham and Women’s Hospital(布莱根妇女医院胃肠病科) Department of Medicine, Brigham and Women’s Hospital(布莱根妇女医院内科部) Department of Medicine, Massachusetts General Hospital(麻省总医院内科部) Department of Pathology, Massachusetts General Hospital(麻省总医院病理学部) Department of Health Humanities and Bioethics, University of Rochester School of Medicine and Dentistry(罗切斯特大学医学院和牙科学院健康人文与生物伦理学部) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University(哈佛大学凯普纳人工智能研究所) Center for the History of Medicine, Countway Library of Medicine, Harvard Medical School(哈佛医学院医学史中心,考特维图书馆) Department of Global Health and Social Medicine, Harvard Medical School(哈佛医学院全球健康与社会医学部) Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital(布莱根妇女医院呼吸科和重症医学科)

AI总结 提出 Dr. CaBot 代理 AI 系统,通过生成基于初始病例描述的幻灯片演示来模拟专家诊断推理,并在 NEJM CPC 和 NIH 未诊断疾病网络病例上取得优于前沿模型的表现,同时发布 CPC-Bench 基准以促进临床 AI 发展。

详情
AI中文摘要

鉴别诊断是一个迭代过程,将患者信息与更广泛的医学知识相结合。自1923年以来持续发表的临床病例系列,如NEJM临床病理会议(CPCs),展示了专家医生向同行演示诊断推理,并已被用于评估AI数十年。然而,先前的AI评估主要关注最终诊断准确性,而非细微的临床推理。在此,我们介绍Dr. CaBot,一个代理AI系统,通过仅从初始病例描述生成带有书面和旁白的幻灯片演示,来模拟专家诊断医生。CaBot最近生成了NEJM CPC 100多年历史上首个发表的AI诊断。在盲评中,医生在46/62(74%)的试验中错误分类了鉴别诊断的来源(CaBot vs. 医生撰写),并在各个质量维度上给予其好评。当被要求解决来自NIH未诊断疾病网络的72名未诊断疾病患者的病例时,CaBot仅从转诊记录中就识别出了50/72(69%)病例的工作诊断。为了促进透明度和研究,我们还开发了CPC-Bench,一个基于7,102个CPC和47,648个问题(涵盖10个任务)的经医生验证的基准。我们证明CaBot在CPC-Bench上优于前沿模型,并公开发布CaBot和CPC-Bench,以促进临床AI的进步。

英文摘要

Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series such as the NEJM Clinicopathologic Conferences (CPCs), published continuously since 1923, feature expert physicians who demonstrate diagnostic reasoning to peers, and have been used for decades to evaluate AI. However, prior AI evaluations have largely focused on final diagnostic accuracy rather than nuanced clinical reasoning. Here, we introduce Dr. CaBot, an agentic AI system that emulates an expert diagnostician by generating written and narrated slide-based presentations from an initial case description alone. CaBot recently generated the first AI diagnosis published in the 100+ year history of the NEJM CPCs. In blinded evaluations, physicians misclassified the source of the differential (CaBot vs. physician-written) in 46/62 (74%) of trials and rated them favorably across quality dimensions. When tasked with solving cases for 72 patients with undiagnosed disease from the NIH Undiagnosed Diseases Network, CaBot identified the working diagnosis in 50/72 (69%) of cases from referral notes alone. To promote transparency and research, we also developed CPC-Bench, a physician-validated benchmark based on 7,102 CPCs and 47,648 questions across 10 tasks. We show that CaBot outperforms frontier models on CPC-Bench, and release both CaBot and CPC-Bench publicly to foster progress in clinical AI.

2509.05614 2026-05-26 cs.CV cs.AI cs.RO 版本更新

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

SpecPrune-VLA: 通过动作感知的自推测剪枝加速视觉-语言-动作模型

Hanzhen Wang, Jiaming Xu, Yushun Xiang, Jiayi Pan, Yongkang Zhou, Yong-Lu Li, Guohao Dai

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对视觉-语言-动作模型推理加速,提出结合全局上下文与局部信息的无训练两层剪枝方法,实现高达1.57倍加速且成功率几乎无下降。

Comments Accepted to ICML 2026

详情
AI中文摘要

剪枝是一种通过移除不重要值的计算来加速计算密集型模型的典型技术。最近,它被应用于加速视觉-语言-动作(VLA)模型推理。然而,现有的加速方法仅关注当前动作步骤的局部信息,忽略了全局上下文,导致在某些场景下成功率下降超过20%且加速效果有限。本文指出VLA任务中的时空一致性:连续步骤中的输入图像表现出高度相似性,并提出关键见解:令牌选择应结合局部信息与模型的全局上下文。基于此,我们提出SpecPrune-VLA,一种无需训练、具有启发式控制的两级剪枝方法。(1) 动作级静态剪枝:利用全局历史和局部注意力,在每个动作中静态减少视觉令牌。(2) 层级动态剪枝:根据逐层重要性自适应地剪枝每层的令牌。(3) 轻量级动作感知控制器:根据末端执行器的速度将动作分为粗粒度或细粒度,并相应调整剪枝激进程度。大量实验表明,SpecPrune-VLA在LIBERO模拟中实现高达1.57倍加速,在真实世界任务中实现1.70倍加速,且成功率下降可忽略不计。

英文摘要

Pruning is a typical acceleration technique for compute-bound models by removing computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to >20% success rate drop and limited speedup in some scenarios. In this paper, we point out spatial-temporal consistency in VLA tasks: input images in consecutive steps exhibit high similarity, and propose the key insight that token selection should combine local information with global context of the model. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning. We leverage global history and local attention to statically reduce visual tokens per action. (2) Layer-level dynamic pruning. We prune tokens adaptively per layer based on layer-wise importance. (3) Lightweight action-aware controller: We classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to 1.57$\times$ speedup in LIBERO simulation and 1.70$\times$ on real-world tasks, with negligible success rate degradation.

2509.01557 2026-05-26 cs.CV 版本更新

Real-Time Hardware-Free HIFU Interference Suppression via Teacher-Student Diffusion Framework

基于教师-学生扩散框架的实时无硬件HIFU干扰抑制

Dejia Cai, Ali Abdollahi, Xi Wang, Kun Yang, Zhaohui Guo, Xiaowei Zhou, Hao Chen

发表机构 * Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China(香港科技大学计算机科学与工程系) State Key Laboratory of Ultrasound Engineering in Medicine, Chongqing Medical University, Chongqing 400016, China(重庆医科大学超声医学工程国家重点实验室) School of Microelectronics, Tianjin University, Tianjin 300072, China(天津大学微电子学院) Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China(香港科技大学化学与生物工程系) Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China(香港科技大学生命科学系) HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, The Hong Kong University of Science and Technology, Futian, Shenzhen, China(香港科技大学深圳-香港协同创新研究院) State Key Laboratory of Nervous System Disorders, The Hong Kong University of Science and Technology, Hong Kong SAR, China(香港科技大学神经系统疾病国家重点实验室)

AI总结 提出一种无需专用硬件同步的图像域扩散框架mHC-Diff,通过教师-学生蒸馏实现实时高保真HIFU干扰抑制,在临床数据集上达到26.65 dB PSNR和~20 FPS。

详情
AI中文摘要

高强度聚焦超声(HIFU)是一种非侵入性疗法,但其安全性常因连续超声引导期间的严重声学干扰而降低。传统的HIFU干扰抑制方法严重依赖专有的原始射频(RF)数据或复杂的硬件同步,限制了其临床实用性并阻碍了实时实现。为解决这一限制,我们提出了流形约束超连接扩散(mHC-Diff),一种图像域扩散框架,用于无需专用硬件同步的实时干扰抑制,将复杂干扰与解剖结构分离,同时确保高重建保真度。为实现临床实时应用,我们的方法采用两阶段策略:(i)解剖感知先验获取,其中扩散模型使用多步UNet作为高保真教师进行训练;以及(ii)效率蒸馏,其中通过知识蒸馏将该先验蒸馏为单步学生以实现实时吞吐量。在涵盖多种治疗场景的临床代表性数据集上的广泛验证表明,mHC-Diff实现了卓越的恢复(26.65 dB PSNR),同时在单个NVIDIA RTX 4090上实现实时推理(~20 FPS),比迭代扩散基线(例如HIFU-Diff)加速约6.8倍。通过消除对专用硬件同步和专有RF访问的需求,该图像域框架确保了兼容性,并促进了超声引导HIFU干预期间的实时干扰抑制。

英文摘要

High-Intensity Focused Ultrasound (HIFU) is a non-invasive therapy, yet its safety is often degraded by severe acoustic interference during continuous ultrasound guidance. Conventional HIFU interference suppression methods heavily rely on proprietary raw Radio-Frequency (RF) data or complex hardware synchronization, limiting their clinical utility and preventing real-time implementation. To address this limitation, we propose Manifold-Constrained Hyper-Connections Diffusion (mHC-Diff), an image-domain diffusion framework for real-time interference suppression without specialized hardware synchronization, disentangling complex interference from anatomical structures while ensuring high reconstruction fidelity. To achieve clinical real-time application, our approach employs a two-stage strategy: (i) anatomy-aware prior acquisition, where a diffusion model is trained with a multi-step UNet as a high-fidelity Teacher; and (ii) efficiency distillation, where this prior is distilled into a one-step Student via knowledge distillation to achieve real-time throughput. Extensive validation on a clinically representative dataset across diverse therapeutic scenarios shows that mHC-Diff achieves superior restoration (26.65 dB PSNR), while enabling real-time inference (~20 FPS) on a single NVIDIA RTX 4090, providing a ~6.8x speedup over iterative diffusion baselines (e.g., HIFU-Diff). By eliminating the requirement for specialized hardware synchronization and proprietary RF access, this image-domain framework ensures compatibility and facilitates real-time interference suppression during ultrasound-guided HIFU interventions.
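
For readers unfamiliar with the restoration metric quoted above, PSNR is a logarithmic function of the mean squared error between a reference image and a restored one. The following minimal Python sketch computes it for images given as nested lists; the 8-bit peak value of 255 and the toy pixel values are assumptions for illustration and say nothing about how the paper's evaluation was configured.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    ref = [v for row in reference for v in row]
    out = [v for row in restored for v in row]
    if len(ref) != len(out):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4x4 images: every restored pixel is off by 10 grey levels (MSE = 100).
reference = [[100.0] * 4 for _ in range(4)]
restored = [[110.0] * 4 for _ in range(4)]
print(round(psnr(reference, restored), 2))  # 28.13
```

Because the scale is logarithmic, each halving of the mean squared error raises PSNR by about 3 dB.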

2508.07624 2026-05-26 cs.CV Updated version

Enhancing Egocentric Object Detection in Static Environments using Graph-based Spatial Anomaly Detection and Correction

基于图的空间异常检测与校正增强静态环境中的自我中心目标检测

Vishakha Lall, Yisi Liu

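Affiliations: Centre of Excellence in Maritime Safety; Singapore Polytechnic; Singapore

AI summary: Proposes a graph neural network-based post-processing pipeline that models the spatial relationships between objects in static environments to correct detection anomalies in egocentric frames, significantly improving detection performance.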

Details
AI-generated Chinese abstract

在涉及静态环境的许多实际应用中,物体的空间布局在实例之间保持一致。然而,最先进的目标检测模型通常无法利用这种空间先验,导致预测不一致、漏检或误分类,尤其是在杂乱或遮挡的场景中。在这项工作中,我们提出了一种基于图的后处理管道,显式建模物体之间的空间关系,以校正自我中心帧中的检测异常。使用在手动标注数据上训练的图神经网络(GNN),我们的模型识别无效的物体类别标签,并根据其邻域上下文预测校正后的类别标签。我们评估了我们的方法,既作为独立的异常检测与校正框架,也作为标准目标检测器(如YOLOv7和RT-DETR)的后处理模块。实验表明,融入这种空间推理显著提升了检测性能,mAP@50提升高达4%。该方法凸显了利用环境空间结构来提高目标检测系统可靠性的潜力。

English abstract

In many real-world applications involving static environments, the spatial layout of objects remains consistent across instances. However, state-of-the-art object detection models often fail to leverage this spatial prior, resulting in inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. In this work, we propose a graph-based post-processing pipeline that explicitly models the spatial relationships between objects to correct detection anomalies in egocentric frames. Using a graph neural network (GNN) trained on manually annotated data, our model identifies invalid object class labels and predicts corrected class labels based on their neighbourhood context. We evaluate our approach both as a standalone anomaly detection and correction framework and as a post-processing module for standard object detectors such as YOLOv7 and RT-DETR. Experiments demonstrate that incorporating this spatial reasoning significantly improves detection performance, with mAP@50 gains of up to 4%. This method highlights the potential of leveraging the environment's spatial structure to improve reliability in object detection systems.
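
One way to picture the "spatial relationships between objects" that the GNN consumes is a nearest-neighbour graph over detection centres. The sketch below is only an assumed construction: the abstract does not say how edges are built, and `box_center`, `knn_graph`, and the choice of k are hypothetical.

```python
import math


def box_center(box):
    """Centre (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def knn_graph(boxes, k=2):
    """Directed edges (i, j) where j is one of the k detections nearest to i.

    Assumption: Euclidean distance between box centres defines the
    neighbourhood; the paper may use a different rule.
    """
    centers = [box_center(b) for b in boxes]
    edges = []
    for i, ci in enumerate(centers):
        # Rank every other detection by distance to detection i.
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: math.dist(ci, centers[j]),
        )
        edges.extend((i, j) for j in others[:k])
    return edges
```

Each node would then carry the detector's class label and box geometry as features, and the GNN would classify whether a node's label is consistent with its neighbours.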

2507.14760 2026-05-26 eess.IV cs.AI cs.CV cs.LG Updated version

QUTCC: Quantile Uncertainty Training and Conformal Calibration for Imaging Inverse Problems

QUTCC: 成像逆问题的分位数不确定性训练与保形校准

Cassandra Tong Ye, Shamus Li, Tyler King, Kristina Monakhova

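AI summary: Proposes QUTCC, which combines quantile regression with a U-Net for spatially adaptive conformal calibration, producing tighter uncertainty intervals across several imaging inverse problems and helping to locate model hallucinations.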

Details
AI-generated Chinese abstract

尽管深度学习为科学和医学成像带来了巨大前景,但任何失败和幻觉(与事实不符的预测)都难以定位,并可能产生严重的下游后果。不确定性估计技术,如保形预测,可以通过预测模型预测的统计有效误差条来提供帮助。然而,流行的保形预测方法并非为高维图像值问题设计,且在保形校准过程中未考虑图像内的空间相关性,导致不确定性区间过大。我们提出了一种实用的同时分位数回归方法,能够在保形校准期间实现非线性、空间自适应缩放。我们的方法QUTCC使用带有分位数嵌入的U-Net架构,在训练期间学习完整的条件分位数分布,然后利用这个非线性学习函数进行空间自适应保形校准。在测试时,我们的方法能够高效地估计具有像素边际覆盖保证的不确定性区间。此外,QUTCC还可以在没有内置分布假设的情况下预测逐像素条件概率密度估计。我们在多个去噪问题、加速磁共振成像和定量相位显微镜上评估了我们的方法。与先前的保形方法相比,我们的方法在相同覆盖水平下始终产生更紧的不确定性区间,能够预测不同任务的合理条件分布,并且在某些情况下,高不确定性区域可以帮助我们定位模型预测中的幻觉。

English abstract

While deep learning offers tremendous promise for scientific and medical imaging, any failures and hallucinations (predictions that do not coincide with reality) are hard to pinpoint and can have serious downstream consequences. Uncertainty estimation techniques, such as conformal prediction, can help by predicting statistically valid error bars for a model's prediction. However, popular conformal prediction methods were not designed for high-dimensional image-valued problems and do not take into account spatial correlations within an image during conformal calibration, resulting in larger-than-necessary uncertainty intervals. We propose a practical simultaneous quantile regression method that enables non-linear, spatially-adaptive scaling during conformal calibration. Our method, QUTCC, uses a U-Net architecture with a quantile embedding to learn a full conditional quantile distribution during training, and then leverages this non-linear, learned function for spatially-adaptive conformal calibration. At test time, our method can efficiently estimate uncertainty intervals with pixel-marginal coverage guarantees. In addition, QUTCC can also predict pixel-wise conditional probability density estimates without any built-in distributional assumptions. We evaluate our method on several denoising problems, accelerated magnetic resonance imaging, and quantitative phase microscopy. Our method consistently produces tighter uncertainty intervals than prior conformal methods at the same coverage level, can predict plausible conditional distributions for different tasks, and in some cases, high-uncertainty regions can help us locate hallucinations in a model's prediction.
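
The two ingredients named in the abstract, quantile regression and conformal calibration, can be sketched in their textbook forms. This is not QUTCC's procedure: the calibration below is the generic split-conformal (CQR-style) scalar offset, whereas the abstract describes a non-linear, spatially adaptive scheme. The function names are illustrative.

```python
import math


def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for one prediction at quantile level q."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def conformal_offset(lower, upper, truth, alpha=0.1):
    """Scalar widening for intervals [lower, upper] on a calibration set.

    The score is how far the truth falls outside the interval (negative
    when inside). Adding the returned offset to both ends of new
    intervals targets 1 - alpha marginal coverage.
    """
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, truth))
    n = len(scores)
    # Finite-sample rank; clamped to n here for simplicity, although the
    # exact method would return +infinity when the rank exceeds n.
    rank = min(n, math.ceil((n + 1) * (1.0 - alpha)))
    return scores[rank - 1]
```

In a pixel-wise setting the same scores would be computed per pixel; how QUTCC adapts the correction across the image is not captured by this scalar version.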

2506.19117 2026-05-26 cs.CV Updated version

PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

PrITTI: 基于基元的可控可编辑3D语义城市场景生成

Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger

Affiliations: University of Tübingen; Tübingen AI Center; Zhejiang University; Noah’s Ark Lab, Huawei; KE:SAI - Kyutai ELLIS Scalable Autonomous Intelligence

AI summary: Proposes PrITTI, a latent diffusion model that uses a hybrid representation of vectorized object primitives and rasterized ground surfaces to generate high-quality, controllable, and editable 3D semantic urban scenes.

Comments Accepted to CVPR 2026

Details
AI-generated Chinese abstract

现有的3D语义城市场景生成方法主要依赖于基于体素的表示,这些方法受限于固定分辨率、难以编辑且密集形式下内存消耗大。相比之下,我们倡导一种基于基元的范式,其中城市场景使用紧凑、语义上有意义的3D元素表示,这些元素易于操作和组合。为此,我们引入了PrITTI,一种潜在扩散模型,利用矢量化对象基元和栅格化地面表面生成多样化、可控且可编辑的3D语义城市场景。这种混合表示产生了一个结构化的潜在空间,便于对象和地面级别的操作。在KITTI-360上的实验表明,基于基元的表示释放了扩散变压器的全部能力,实现了最先进的3D场景生成质量,同时内存需求更低、推理速度更快、可编辑性优于基于体素的方法。除了生成,PrITTI还支持一系列下游应用,包括场景编辑、修复、外推和照片级真实感的街景合成。源代码和更多结果可在https://raniatze.github.io/pritti/找到。

English abstract

Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis. The source code and more results can be found at https://raniatze.github.io/pritti/.

2506.05360 2026-05-26 cs.CV Updated version

CarboFormer: A Lightweight Semantic Segmentation Architecture for Efficient Carbon Dioxide Detection Using Optical Gas Imaging

CarboFormer: 一种用于光学气体成像的轻量级语义分割架构,实现高效二氧化碳检测

Taminul Islam, Toqi Tahamid Sarker, Mohamed G Embaby, Khaled R Ahmed, Amer AbuGhazaleh

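Affiliations: Southern Illinois University, Carbondale, USA

AI summary: Proposes CarboFormer, a lightweight semantic segmentation framework that combines an optimized encoder-decoder, multi-scale feature fusion, and auxiliary supervision to detect CO2 emissions in real time with high accuracy in resource-constrained settings, and contributes two new datasets.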

Details
Journal ref
Advances in Visual Computing. ISVC 2025. Lecture Notes in Computer Science, vol 16397, pp. 3-15, Springer, Cham, 2026
AI-generated Chinese abstract

二氧化碳(CO$_2$)排放是环境影响和多种工业过程(包括畜牧业管理)的关键指标。我们提出了CarboFormer,一种用于光学气体成像(OGI)的轻量级语义分割框架,旨在检测和量化不同应用中的CO$_2$排放。我们的方法集成了优化的编码器-解码器架构与专门的多尺度特征融合和辅助监督策略,以有效建模气体羽流图像中的局部细节和全局关系,同时在资源受限环境中以最小的计算开销实现有竞争力的精度。我们贡献了两个新数据集:(1)受控二氧化碳释放(CCR)数据集,模拟了系统变化流速(10-100 SCCM)的气体泄漏;(2)实时Ankom(RTA)数据集,专注于奶牛瘤胃液体外实验的排放。大量评估表明,CarboFormer在CCR上达到84.88% mIoU,在RTA上达到92.98% mIoU,同时保持计算效率,仅5.07M参数,运行速度为84.68 FPS。该模型在具有挑战性的低流量场景中特别有效,显著优于其他轻量级方法,如SegFormer-B0(CCR上83.36% mIoU)和SegNeXt(CCR上82.55% mIoU),使其适用于资源受限平台(如可编程无人机)上的实时监测。我们的工作通过提供稳健高效的CO$_2$排放分析工具,推进了环境传感和精准畜牧业管理。

English abstract

Carbon dioxide (CO$_2$) emissions are critical indicators of both environmental impact and various industrial processes, including livestock management. We introduce CarboFormer, a lightweight semantic segmentation framework for Optical Gas Imaging (OGI), designed to detect and quantify CO$_2$ emissions across diverse applications. Our approach integrates an optimized encoder-decoder architecture with specialized multi-scale feature fusion and auxiliary supervision strategies to effectively model both local details and global relationships in gas plume imagery while achieving competitive accuracy with minimal computational overhead for resource-constrained environments. We contribute two novel datasets: (1) the Controlled Carbon Dioxide Release (CCR) dataset, which simulates gas leaks with systematically varied flow rates (10-100 SCCM), and (2) the Real Time Ankom (RTA) dataset, focusing on emissions from dairy cow rumen fluid in vitro experiments. Extensive evaluations demonstrate that CarboFormer achieves competitive performance with 84.88% mIoU on CCR and 92.98% mIoU on RTA, while maintaining computational efficiency with only 5.07M parameters and operating at 84.68 FPS. The model shows particular effectiveness in challenging low-flow scenarios and significantly outperforms other lightweight methods like SegFormer-B0 (83.36% mIoU on CCR) and SegNeXt (82.55% mIoU on CCR), making it suitable for real-time monitoring on resource-constrained platforms such as programmable drones. Our work advances both environmental sensing and precision livestock management by providing robust and efficient tools for CO$_2$ emission analysis.
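
The mIoU scores reported above are per-class intersection-over-union values averaged across classes. A minimal sketch of that metric on flattened label maps; skipping classes absent from both prediction and ground truth is one common convention and an assumption here, since the abstract does not state which convention was used:

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two flat lists of class labels."""
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never occur
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For a two-class plume/background task, `num_classes` would be 2 and the score is the average of the plume IoU and the background IoU.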

2505.24876 2026-05-26 cs.CV cs.CL Updated version

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

Agent-X:评估视觉中心智能体任务中的深度多模态推理

Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan

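Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University of Central Florida; University of Oxford

AI summary: Introduces the Agent-X benchmark, which uses 828 real-world visual tasks and a fine-grained step-level evaluation framework to show that current models achieve less than 50% full-chain success on multi-step visual reasoning.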

Comments Accepted at the International Conference on Learning Representations (ICLR 2026)

Details
AI-generated Chinese abstract

深度推理对于解决复杂任务至关重要,尤其是在需要顺序多模态理解的视觉中心场景中。然而,现有基准通常使用完全合成的单轮查询、有限的视觉模态进行评估,并且缺乏在真实世界环境中多步推理质量的评估框架。为了解决这一问题,我们引入了Agent-X,这是一个大规模基准,用于评估视觉中心智能体在真实多模态环境中的多步和深度推理能力。Agent-X包含828个具有真实视觉上下文的智能体任务,包括图像、多图像比较、视频和指令文本。这些任务涵盖六大智能体环境:通用视觉推理、网页浏览、安全与监控、自动驾驶、体育和数学推理。我们的基准要求智能体在这些多样化环境中将工具使用与明确的逐步决策相结合。此外,我们提出了一个细粒度的步骤级评估框架,用于评估每个推理步骤的正确性和逻辑连贯性以及整个任务中工具使用的有效性。我们的结果表明,即使是最佳性能模型,包括GPT、Gemini和Qwen系列,也难以解决多步视觉任务,全链成功率低于50%。这些发现突显了当前LMM推理和工具使用能力的关键瓶颈,并指出了视觉中心智能体推理模型的未来研究方向。我们的数据和代码公开在https://github.com/mbzuai-oryx/Agent-X。

English abstract

Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X.
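
To see why full-chain success can sit well below per-step accuracy, consider the sketch below. It assumes that a task counts as a full-chain success only when every reasoning step is judged correct; the benchmark's actual scoring rules may differ, and both function names are hypothetical.

```python
def full_chain_success_rate(step_results):
    """Fraction of tasks whose every step was judged correct.

    step_results holds one list of booleans per task, one boolean per
    reasoning step (an assumed input format).
    """
    if not step_results:
        return 0.0
    successes = sum(1 for steps in step_results if steps and all(steps))
    return successes / len(step_results)


def step_accuracy(step_results):
    """Fraction of individual steps judged correct across all tasks."""
    total = sum(len(steps) for steps in step_results)
    correct = sum(sum(steps) for steps in step_results)
    return correct / total if total else 0.0
```

A single wrong step fails the whole chain, so a model can be right on most steps and still complete only a minority of tasks.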

2505.03631 2026-05-26 cs.CV Updated version

Generalizable Video Quality Assessment via Weak-to-Strong Learning

通过弱到强学习实现可泛化的视频质量评估

Linhan Cao, Wei Sun, Xiangyang Zhu, Kaiwei Zhang, Jun Jia, Yicong Peng, Dandan Zhu, Guangtao Zhai, Xiongkuo Min

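Affiliations: Shanghai Jiao Tong University; East China Normal University; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes a weak-to-strong learning framework that combines homogeneous and heterogeneous supervision signals with iterative training to improve the generalization of video quality assessment without human annotations.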

Comments Accepted by CVPR 2026

Details
AI-generated Chinese abstract

视频质量评估(VQA)旨在预测与人类视觉感知一致的视频感知质量,是量化视频处理流程中质量退化的基本工具。主流的VQA范式依赖于人工标注数据集的监督训练,尽管取得了显著进展,但在未见视频内容上仍存在泛化能力差的问题。本文探索弱到强(W2S)学习作为一种无需依赖人工标注数据集的新范式来推进VQA。我们首先提供经验证据,表明直接的W2S策略使强学生模型不仅能在域内基准上匹配其弱教师,还能在分布外(OOD)基准上超越教师,揭示了VQA中独特的弱到强效应。基于这一洞察,我们提出一个新颖框架,从两个方面增强W2S学习:(1)通过可学习排序公式整合来自不同VQA教师(包括现成VQA模型和合成失真模拟器)的同质和异质监督信号;(2)迭代W2S训练,其中每个强学生被回收作为后续循环的教师,逐步聚焦于困难案例。大量实验表明,我们的方法在域内和OOD基准上均达到最先进结果,尤其在OOD场景中表现突出。我们的发现强调W2S学习是打破标注障碍、实现视频质量评估可扩展泛化的原则性途径。我们的数据和代码将在https://github.com/clh124/W2S-VQA提供。

English abstract

Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception, serving as a fundamental tool for quantifying quality degradation across video processing workflows. The dominant VQA paradigm relies on supervised training with human-labeled datasets, which, despite substantial progress, still suffers from poor generalization to unseen video content. In this work, we explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on human-labeled datasets. We first provide empirical evidence that a straightforward W2S strategy allows a strong student model to not only match its weak teacher on in-domain benchmarks but also surpass it on out-of-distribution (OOD) benchmarks, revealing a distinct weak-to-strong effect in VQA. Building on this insight, we propose a novel framework that enhances W2S learning from two aspects: (1) integrating homogeneous and heterogeneous supervision signals from diverse VQA teachers -- including off-the-shelf VQA models and synthetic distortion simulators -- via a learn-to-rank formulation, and (2) iterative W2S training, where each strong student is recycled as the teacher in subsequent cycles, progressively focusing on challenging cases. Extensive experiments show that our method achieves state-of-the-art results across both in-domain and OOD benchmarks, with especially strong gains in OOD scenarios. Our findings highlight W2S learning as a principled route to break annotation barriers and achieve scalable generalization in video quality assessment. Our data and code will be available at https://github.com/clh124/W2S-VQA.
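
The abstract says the teachers' signals are combined "via a learn-to-rank formulation" without giving the loss. As a stand-in, the sketch below uses a RankNet-style pairwise logistic loss: the student is penalised when its ordering of two videos disagrees with a teacher's ordering. The loss choice and the function name are assumptions, not the paper's method.

```python
import math


def pairwise_rank_loss(student_a, student_b, teacher_a, teacher_b):
    """Logistic pairwise ranking loss for one pair of videos (a, b).

    Only the teacher's ordering is used, not its absolute scores, which
    is what lets teachers with different score scales be mixed.
    """
    if teacher_a == teacher_b:
        return 0.0  # a tie carries no ordering information
    sign = 1.0 if teacher_a > teacher_b else -1.0
    margin = sign * (student_a - student_b)
    # log(1 + exp(-margin)): small when the student agrees with the teacher.
    return math.log1p(math.exp(-margin))
```

Because a synthetic distortion simulator also yields an ordering (the less distorted clip ranks higher), the same loss could consume both kinds of supervision.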

2504.15404 2026-05-26 cs.CV Updated version

Context Aware Grounded Teacher for Source Free Object Detection

上下文感知的接地教师用于无源目标检测

Tajamul Ashraf, Rajes Manna, Partha Sarathi Purkayastha, Tavaheed Tariq, Janibul Bashir

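Affiliations: Department of Computer Vision; MBZUAI; Microsoft Research India; GAASH Research Lab; NIT Srinagar

AI summary: To address the context bias and noisy pseudo-labels caused by class imbalance in source-free object detection, proposes Grounded Teacher, a bias-aware framework built on a relational context module and semantic augmentation that uses relational and semantic regularization to improve minority-class detection.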

Comments Accepted in International Journal of Computer Vision (IJCV); Project Webpage: https://tajamul21.github.io/Grounded_Teacher/

Details
AI-generated Chinese abstract

无源目标检测(SFOD)面临持续挑战,原因在于类别不平衡驱动的上下文偏差以及噪声伪标签下教师-学生训练的不稳定性。现有技术往往忽略上下文偏差和类别不平衡偏移,尤其是在医疗数据中。为解决此问题,我们提出Grounded Teacher(GT),一种偏差感知的无源框架,通过关系正则化和语义正则化来接地教师模型。为了显式建模类别间的方向性混淆,GT引入关系上下文模块(RCM),维护跨域上下文偏差的指数移动平均(EMA)估计。在此基础上,语义增强(SA)策略通过在源相似和源不相似的目标区域中进行自适应MixUp,选择性地增强少数类和易混淆类,从而提高少数类召回率而不过度拟合主导类别。为了在偏差伪标签下稳定学习,我们设计了语义感知损失(SAL),应用对角归一化权重,防止梯度爆炸,同时强调少数-多数类别的修正。此外,从大型视觉基础模型(LVFMs)导出的冻结专家分支在训练期间作为监督参考,在不增加推理开销的情况下改善伪标签质量。GT的行为驱动偏差量化使其能够跨领域广泛应用,无需依赖数据集先验。在Cityscapes-to-Foggy(50.8 mAP)和医学迁移(DDSM-to-INBreast上+5.9 AP50)上的评估显示出一致的增益和改进的少数类检测,且额外训练成本低于12%。代码和模型可在https://github.com/Tajamul21/Grounded-Teacher获取。

English abstract

Source-free object detection (SFOD) faces persistent challenges due to class imbalance-driven context bias and instability in teacher-student training under noisy pseudo-labels. Existing techniques tend to ignore context bias and class-imbalance shifts, especially in medical data. To tackle this, we propose Grounded Teacher (GT), a bias-aware source-free framework that grounds the teacher model through relational and semantic regularization. To explicitly model directional confusion between classes, GT introduces a Relational Context Module (RCM) that maintains an exponential moving average (EMA) estimate of cross-domain contextual bias. Building upon this, a Semantic Augmentation (SA) strategy selectively augments minority and confusable classes through adaptive MixUp in both source-similar and source-dissimilar target regions, improving minority recall without overfitting dominant categories. To stabilize learning under biased pseudo-labels, we design a Semantic-Aware Loss (SAL) that applies diagonally normalized weights, preventing gradient explosion while emphasizing minority-majority corrections. Additionally, a frozen Expert branch derived from large vision foundation models (LVFMs) serves as a supervisory reference during training, refining pseudo-label quality without adding inference overhead. GT's behavior-driven bias quantification makes it broadly applicable across domains without relying on dataset priors. Evaluations on Cityscapes-to-Foggy (50.8 mAP) and medical transfers (+5.9 AP50 on DDSM-to-INBreast) show consistent gains and improved minority-class detection, with less than 12% additional training cost. Code and model are available at https://github.com/Tajamul21/Grounded-Teacher.
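
The EMA estimate maintained by the Relational Context Module can be illustrated with the standard update rule. What the matrix holds (here, a class-by-class confusion estimate) and the momentum value are assumptions; the abstract only states that an EMA of contextual bias is kept.

```python
def ema_update(estimate, observation, momentum=0.99):
    """Element-wise exponential moving average over a matrix.

    Both arguments are lists of rows; entry [i][j] is assumed to measure
    how often class i is confused with class j in the current batch.
    """
    return [
        [momentum * e + (1.0 - momentum) * o for e, o in zip(e_row, o_row)]
        for e_row, o_row in zip(estimate, observation)
    ]
```

Keeping the matrix non-symmetric is what would let it capture directional confusion, i.e. class i mistaken for class j more often than the reverse.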

2504.00816 2026-05-26 cs.CV physics.med-ph Updated version

Two-stage deep learning framework for the restoration of incomplete-ring PET images

用于修复不完整环PET图像的两阶段深度学习框架

Yeqi Fang, Rong Zhou

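Affiliations: College of Physics, Sichuan University

AI summary: Proposes a two-stage deep learning framework that, without time-of-flight information, restores high-quality PET images from incomplete-ring data with about 50% of coincidence events missing, using a projection-domain Attention U-Net to predict the missing sinogram sections and a cascaded U-Net with a warm-start diffusion model for image refinement.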

Comments 17 pages, 5 figures

Details
AI-generated Chinese abstract

正电子发射断层扫描(PET)是一种重要的分子成像工具,广泛应用于医学。传统的PET系统依赖完整的探测器环来实现全角度覆盖和可靠的数据收集。然而,由于硬件故障、成本限制或特定临床需求,出现了不完整环PET扫描仪。标准重建算法由于数据完整性的降低和几何不一致性,在这些系统中往往性能下降。我们提出了一种两阶段深度学习框架,无需任何飞行时间(TOF)信息,即可从约50%缺失符合事件的数据中恢复高质量图像——这是之前基于CNN方法处理损失水平的两倍。该流程分两个阶段运行:投影域注意力U-Net首先通过利用相邻切片的空间上下文预测正弦图的缺失部分,然后使用OSEM算法重建完整数据,并将其传递给级联U-Net和热启动扩散模型进行图像细化。该模块从U-Net粗预测而非纯高斯噪声开始反向扩散过程。使用来自真实扫描的613个模拟脑体积(196个健康脑样本、217个阿尔茨海默病样本和200个轻度认知障碍样本),结果表明我们的模型成功保留了大部分解剖结构和示踪剂分布特征,PSNR为38.18至38.59 dB,SSIM为0.9904至0.9925。我们的两阶段深度学习框架有效地从超过50%的不完整环数据中恢复高质量PET图像,实现了接近完整的解剖保真度和鲁棒性能,无需TOF信息。

English abstract

Positron Emission Tomography (PET) is an important molecular imaging tool widely used in medicine. Traditional PET systems rely on complete detector rings for full angular coverage and reliable data collection. However, incomplete-ring PET scanners have emerged due to hardware failures, cost constraints, or specific clinical needs. Standard reconstruction algorithms often suffer from performance degradation with these systems because of reduced data completeness and geometric inconsistencies. We present a two-stage deep-learning framework that, without incorporating any time-of-flight (TOF) information, restores high-quality images from data with about 50% missing coincidences, double the loss level previously addressed by CNN-based methods. The pipeline operates in two stages: a projection-domain Attention U-Net first predicts the missing sections of the sinogram by leveraging spatial context from neighbouring slices, after which the completed data are reconstructed with the OSEM algorithm and passed to a cascaded U-Net and warm-start diffusion model for image refinement. This module starts the reverse diffusion process from the U-Net coarse prediction rather than pure Gaussian noise. Using 613 simulated brain volumes from real scans (196 healthy brain samples, 217 Alzheimer's disease samples, and 200 Mild Cognitive Impairment samples), the results show that our model successfully preserves most anatomical structures and tracer distribution features, with PSNR of 38.18 to 38.59 dB and SSIM of 0.9904 to 0.9925. Our two-stage deep-learning framework effectively restores high-quality PET images from over 50% incomplete-ring data, achieving near-complete anatomical fidelity and robust performance without requiring TOF information.
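
A warm start of this kind is commonly implemented by noising the coarse prediction to an intermediate diffusion step and running the reverse process from there. The sketch below uses the standard DDPM forward marginal for that noising step; whether the paper re-noises the U-Net output in this way, and at which step, is an assumption, and `warm_start` is a hypothetical helper.

```python
import math
import random


def warm_start(coarse, alpha_bar_t, rng=random):
    """Noise a coarse image (flat list of floats) to diffusion step t.

    Uses x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    with eps drawn from a standard normal distribution. alpha_bar_t is
    the cumulative noise-schedule product at step t, in [0, 1].
    """
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in coarse]
```

Starting from an alpha_bar_t close to 1 keeps most of the coarse structure and leaves only a short reverse trajectory, which is the usual motivation for warm-starting instead of sampling from pure noise.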

2503.23670 2026-05-26 cs.CV Updated version

Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation

学习双射曲面参数化以通过网格变形从稀疏点云推断符号距离函数

Takeshi Noda, Chao Chen, Junsheng Zhou, Weiqi Zhang, Yu-Shen Liu, Zhizhong Han

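Affiliations: School of Software, Tsinghua University, Beijing, China; Department of Computer Science, Wayne State University, Detroit, USA

AI summary: Proposes a dynamic deformation network that combines bijective surface parameterization with grid deformation optimization to predict signed distance functions end-to-end from sparse point clouds, significantly outperforming existing methods.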

Comments Accepted by Conference on Computer Vision and Pattern Recognition (CVPR) 2025. Project page: https://takeshie.github.io/Bijective-SDF

Details
AI-generated Chinese abstract

从稀疏点云推断符号距离函数(SDF)仍然是曲面重建中的一个挑战。关键在于稀疏点云缺乏学习连续场所需的详细几何信息。为解决此问题,我们提出了一种新颖的方法,学习一个动态变形网络以端到端方式预测SDF。为了从稀疏点参数化连续曲面,我们提出了双射曲面参数化(BSP),从局部块学习全局形状。具体来说,我们为从参数域到3D局部块的稀疏点构建双射映射,将块整合到全局曲面中。同时,我们将网格变形优化(GDO)引入曲面逼近,以优化网格点的变形并进一步细化参数曲面。在合成和真实扫描数据集上的实验结果表明,我们的方法显著优于当前最先进的方法。项目页面:https://takeshie.github.io/Bijective-SDF

English abstract

Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods. Project page: https://takeshie.github.io/Bijective-SDF

2412.07333 2026-05-26 cs.CV cs.AI Updated version

Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model

基于扩散模型的姿态引导人物图像合成的融合嵌入

Donghwna Lee, Kirok Kim, Jisu Lee, Kyungha Min, Wooju Kim

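Affiliations: Department of Industrial Engineering

AI summary: Proposes the FPDM framework, which uses contrastive learning to explicitly align fused source-pose embeddings with target image embeddings and uses the learned embedding as a conditioning signal for generation, addressing texture fidelity and consistency in pose-guided person image synthesis.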

Details
AI-generated Chinese abstract

姿态引导人物图像合成(PGPIS)旨在生成指定姿态下的人物图像,同时保留源图像的身份和外观。该技术促进了多种应用,包括虚拟试穿、数字化身、动画和手语生成。尽管最近基于扩散的PGPIS取得了高质量结果,但这些模型通常依赖于去噪过程中的隐式特征聚合。因此,细粒度纹理保持有限,即使对于相同身份,也难以确保在姿态和源外观变化下生成一致性。为解决这些限制,我们提出了基于扩散模型的融合嵌入PGPIS(FPDM),这是第一个通过对比学习显式对齐融合源-姿态嵌入与目标图像嵌入,并随后使用学习到的融合嵌入作为生成条件信号的框架。FPDM将图像-姿态融合(IPF)模块集成到我们提出的源增强姿态融合方法中,以学习与目标图像对齐的融合嵌入。然后,我们采用由源外观、目标姿态和学习到的融合嵌入引导的条件扩散模型。在DeepFashion基准和RWTH-PHOENIX-Weather 2014T数据集上的实验表明,在定量和定性评估中,与现有方法相比具有竞争力的性能,消融研究证实显式融合嵌入对齐显著提高了纹理保真度以及跨姿态和源外观变化的一致性。

English abstract

Pose-Guided Person Image Synthesis (PGPIS) aims to generate human images in specified poses while preserving the identity and appearance of a source image. This technology facilitates diverse applications, including virtual try-on, digital avatars, animation, and sign language generation. Despite the high-quality results of recent diffusion-based PGPIS, these models typically depend on implicit feature aggregation within the denoising process. As a result, fine-grained texture preservation is limited, and even for the same identity, it is difficult to ensure consistent generation under variations in pose and source appearance. To address these limitations, we propose Fusion Embedding for PGPIS using a Diffusion Model (FPDM), the first framework that explicitly aligns fused source-pose embeddings with target image embeddings via contrastive learning, and subsequently employs the learned fusion embedding as a conditioning signal for generation. FPDM integrates an Image-Pose Fusion (IPF) module into our proposed Source-Enhanced Pose Fusion approach to learn a fusion embedding aligned with the target image. We then employ a conditional diffusion model guided by source appearance, target pose, and the learned fusion embedding. Experiments on the DeepFashion benchmark and the RWTH-PHOENIX-Weather 2014T dataset demonstrate competitive performance compared to existing methods in both quantitative and qualitative evaluations, with ablation studies confirming that explicit fusion embedding alignment substantially improves texture fidelity and consistency across pose and source appearance variations.

2410.12673 2026-05-26 cs.CV Updated version

MambaBEV: A BEV-based 3D detection model with Mamba2

MambaBEV:基于Mamba2的BEV三维检测模型

Zihan You, Ni Wang, Hao Wang, Qichao Zhao, Jinxiang Wang

Affiliations: School of Instrument Science and Engineering, Southeast University, China; Amazon Development Center Germany GmbH, Germany; T3CAIC Technology, China; School of Mechanical Engineering, Southeast University, China

AI summary: Proposes MambaBEV, which uses the Mamba2 state-space model, through a TemporalMamba temporal fusion module and a Mamba-based DETR head, to strengthen global context modeling and improve 3D detection accuracy for large objects in autonomous driving.

Comments ICPR2026

Details
AI-generated Chinese abstract

自动驾驶中的精确3D物体检测依赖于鸟瞰图(BEV)感知和有效的时序融合。然而,现有基于卷积层或可变形自注意力的融合策略难以建模BEV空间中的全局上下文,导致大型物体的检测精度降低。为解决这一限制,我们提出了MambaBEV,一种新颖的基于BEV的3D物体检测模型,利用Mamba2——一种针对长序列处理优化的先进状态空间模型(SSM)。我们的关键贡献是TemporalMamba,一种时序融合模块,通过专为序列处理设计的BEV特征离散重排机制增强全局上下文建模。此外,我们引入了一个基于Mamba的DETR头以改进多物体表示。在nuScenes数据集上的评估表明,MambaBEV-base达到了51.7%的NDS和42.7%的mAP。此外,在端到端自动驾驶范式中的评估验证了其在运动预测和规划中的有效性。这些结果突显了状态空间模型在提升自动驾驶感知系统中全局上下文理解和大型物体检测方面的潜力。

English abstract

Accurate 3D object detection in autonomous driving relies on Bird's Eye View (BEV) perception and effective temporal fusion. However, existing fusion strategies based on convolutional layers or deformable self-attention struggle to model global context in BEV space, leading to reduced accuracy for large objects. To address this limitation, we propose MambaBEV, a novel BEV-based 3D object detection model that leverages Mamba2, an advanced state-space model (SSM) optimized for long-sequence processing. Our key contribution is TemporalMamba, a temporal fusion module that enhances global context modeling through a BEV feature discrete rearrangement mechanism tailored for sequential processing. In addition, we introduce a Mamba-based DETR head to improve multi-object representation. Evaluations on the nuScenes dataset demonstrate that MambaBEV-base achieves 51.7% NDS and 42.7% mAP. Furthermore, evaluation within an end-to-end autonomous driving paradigm validates its effectiveness in motion forecasting and planning. These results highlight the potential of state-space models for improving global context understanding and large-object detection in autonomous driving perception systems.
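
For readers unfamiliar with state-space models, the recurrence underneath them can be shown with a toy scalar example. This is the plain time-invariant linear SSM, not Mamba2: Mamba-style models make the parameters depend on the input and use vector-valued states, and nothing below reflects the BEV rearrangement described in the abstract.

```python
def ssm_scan(inputs, a, b, c, h0=0.0):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t.

    a, b, c are fixed scalars here (an illustrative simplification).
    """
    h = h0
    outputs = []
    for x in inputs:
        h = a * h + b * x  # state carries a decaying memory of past inputs
        outputs.append(c * h)
    return outputs
```

Because every output depends on the whole prefix of the sequence through the state, flattening BEV features into a sequence gives each position access to context from all earlier positions at linear cost.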

2409.19727 2026-05-26 cs.LG cs.CV Updated version

Investigating the Effect of Network Pruning on Performance and Interpretability

探究网络剪枝对性能与可解释性的影响

Jonathan von Rad, Florian Seuffert

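Affiliations: AI Center, Neural Information Processing Group, University of Tübingen

AI summary: Systematically applies unstructured pruning, structured pruning, and connection sparsity to study how different pruning techniques affect the classification performance and interpretability of GoogLeNet on the ImageNet validation set, finding that with sufficient retraining performance can approach or even surpass the original network, and that the interpretability score shows no significant relationship with pruning rate.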

Comments 4 pages, 6 figures

Details
AI-generated Chinese abstract

深度神经网络(DNN)通常对其任务而言是过参数化的,可以通过移除权重进行大幅压缩,这一过程称为剪枝。我们研究了不同剪枝技术对GoogLeNet的分类性能和可解释性的影响。我们系统地应用非结构化剪枝、结构化剪枝以及连接稀疏性(输入权重剪枝)方法,并分析这些方法对网络在ImageNet验证集上性能的影响。我们还比较了不同的重训练策略,如迭代剪枝和一次性剪枝。我们发现,通过足够的重训练轮次,网络的性能可以接近默认GoogLeNet的性能——甚至在某些情况下超越它。为了评估可解释性,我们采用了Zimmermann等人开发的机制可解释性评分(MIS)。我们的实验表明,当使用MIS作为度量时,可解释性与剪枝率之间没有显著关系。此外,我们观察到,准确率极低的网络仍然可以获得高MIS分数,这表明MIS可能并不总是与可解释性的直观概念(例如理解正确决策的基础)一致。

English abstract

Deep Neural Networks (DNNs) are often over-parameterized for their tasks and can be compressed quite drastically by removing weights, a process called pruning. We investigate the impact of different pruning techniques on the classification performance and interpretability of GoogLeNet. We systematically apply unstructured and structured pruning, as well as connection sparsity (pruning of input weights) methods, to the network and analyze the outcomes regarding the network's performance on the validation set of ImageNet. We also compare different retraining strategies, such as iterative pruning and one-shot pruning. We find that with sufficient retraining epochs, the performance of the networks can approximate the performance of the default GoogLeNet - and even surpass it in some cases. To assess interpretability, we employ the Mechanistic Interpretability Score (MIS) developed by Zimmermann et al. Our experiments reveal that there is no significant relationship between interpretability and pruning rate when using MIS as a measure. Additionally, we observe that networks with extremely low accuracy can still achieve high MIS scores, suggesting that the MIS may not always align with intuitive notions of interpretability, such as understanding the basis of correct decisions.
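
Unstructured pruning removes individual weights wherever they sit in the network. A minimal sketch using the common smallest-magnitude criterion; the abstract does not state which criterion the authors used, so magnitude ranking is an assumption, as is the function name.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    weights is a flat list of floats; a new list is returned and the
    input is left unchanged.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

One-shot pruning would call this once at the target sparsity and then retrain; iterative pruning would alternate smaller pruning steps with retraining. Structured pruning instead removes whole filters or channels, which this per-weight version does not model.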

2409.00346 2026-05-26 cs.CV Updated version

SMAFormer: Synergistic Multi-Attention Transformer for Medical Image Segmentation

SMAFormer: 协同多注意力Transformer用于医学图像分割

Fuchen Zheng, Xuhang Chen, Weihuang Liu, Haolun Li, Yingtie Lei, Jiahui He, Chi-Man Pun, Shounjun Zhou

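Affiliations: University of Macau

AI summary: Proposes SMAFormer, a Transformer architecture that fuses pixel, channel, and spatial attention, using a synergistic multi-attention block and a feature fusion modulator to improve segmentation of small tumors and organs.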

Comments Accepted by IEEE BIBM 2024

Details
AI-generated Chinese abstract

在医学图像分割中,专门的计算机视觉技术,特别是基于注意力机制的Transformer和采用跳跃连接的残差网络,在提升性能方面发挥了重要作用。然而,先前的模型在分割小且形状不规则的肿瘤时常常表现不佳。为此,我们引入了SMAFormer,一种高效的基于Transformer的架构,它融合了多种注意力机制以增强对小肿瘤和器官的分割。SMAFormer能够捕获医学图像分割的局部和全局特征。该架构包含两个关键组件。首先,提出了协同多注意力(SMA)Transformer块,它结合了像素注意力、通道注意力和空间注意力的优势以丰富特征。其次,针对注意力机制转换和特征融合过程中产生的信息丢失问题,我们设计了一个特征融合调制器。该模块通过减轻重塑引起的信息损失来增强通道注意力和空间注意力之间的整合。为了评估我们的方法,我们在各种医学图像分割任务上进行了广泛实验,包括多器官、肝脏肿瘤和膀胱肿瘤分割,取得了最先进的结果。代码和模型可在 https://github.com/lzeeorno/SMAFormer 获取。

English abstract

In medical image segmentation, specialized computer vision techniques, notably transformers grounded in attention mechanisms and residual networks employing skip connections, have been instrumental in advancing performance. Nonetheless, previous models often falter when segmenting small, irregularly shaped tumors. To this end, we introduce SMAFormer, an efficient, Transformer-based architecture that fuses multiple attention mechanisms for enhanced segmentation of small tumors and organs. SMAFormer can capture both local and global features for medical image segmentation. The architecture comprises two pivotal components. First, a Synergistic Multi-Attention (SMA) Transformer block is proposed, which has the benefits of Pixel Attention, Channel Attention, and Spatial Attention for feature enrichment. Second, addressing the challenge of information loss incurred during attention mechanism transitions and feature fusion, we design a Feature Fusion Modulator. This module bolsters the integration between the channel and spatial attention by mitigating reshaping-induced information attrition. To evaluate our method, we conduct extensive experiments on various medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, achieving state-of-the-art results. Code and models are available at: https://github.com/lzeeorno/SMAFormer.

2406.12179 2026-05-26 cs.CV Updated version

The Wisdom of a Crowd of Brains: A Universal Brain Encoder

一群大脑的智慧:通用大脑编码器

Roman Beliy, Navve Wasserman, Amit Zalcher, Michal Irani

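Affiliations: Weizmann Institute of Science

AI summary: Proposes a universal brain encoder built on a voxel-centric architecture that uses cross-attention to train jointly on fMRI data from multiple subjects, datasets, and machines, improving individual encoding performance and enabling fast transfer learning.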

Details
AI-generated Chinese abstract

图像到fMRI编码对于神经科学研究和实际应用都很重要。然而,这种“大脑编码器”通常针对每个受试者和每个fMRI数据集进行训练,因此局限于非常有限的训练数据。在本文中,我们提出了一种通用大脑编码器,它可以联合训练来自许多不同受试者/数据集/机器的数据。实现这一点的关键是我们新的以体素为中心的编码器架构,该架构为每个大脑体素学习一个独特的“体素嵌入”。我们的编码器通过直接计算大脑体素嵌入与多级深度图像特征之间的交叉注意力,来训练预测每个大脑体素对每张图像的响应。这种以体素为中心的架构使得每个大脑体素的功能角色能够从体素-图像交叉注意力中自然涌现。我们展示了这种方法的能力:(i) 结合来自多个不同受试者(“一群大脑”)的数据以改善每个个体的大脑编码,(ii) 在受试者、数据集和机器(例如3特斯拉、7特斯拉)之间进行快速有效的迁移学习,仅需少量训练样本,(iii) 使用学习到的体素嵌入作为探索大脑功能(例如,大脑中编码了什么以及在哪里编码)的强大工具。

英文摘要

Image-to-fMRI encoding is important for both neuroscience research and practical applications. However, such "Brain-Encoders" have typically been trained per subject and per fMRI dataset, and are thus restricted to very limited training data. In this paper we propose a Universal Brain-Encoder, which can be trained jointly on data from many different subjects, datasets, and machines. What makes this possible is our new voxel-centric Encoder architecture, which learns a unique "voxel-embedding" per brain-voxel. Our Encoder is trained to predict the response of each brain-voxel to every image by directly computing the cross-attention between the brain-voxel embedding and multi-level deep image features. This voxel-centric architecture allows the functional role of each brain-voxel to emerge naturally from the voxel-image cross-attention. We show that this approach can (i) combine data from multiple different subjects (a "Crowd of Brains") to improve each individual brain-encoding, (ii) perform quick and effective transfer learning across subjects, datasets, and machines (e.g., 3-Tesla, 7-Tesla) with few training examples, and (iii) use the learned voxel-embeddings as a powerful tool to explore brain functionality (e.g., what is encoded where in the brain).
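
The voxel-image cross-attention described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `predict_voxel_response`, the reuse of the same feature vectors as both keys and values, and the linear `readout` vector are all hypothetical choices made here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_voxel_response(voxel_embedding, image_features, readout):
    """Predict one voxel's scalar response to one image.

    voxel_embedding: learned per-voxel vector, used as the attention query.
    image_features:  list of image feature vectors (assumed keys and values).
    readout:         hypothetical linear map from pooled features to a scalar.
    """
    d = len(voxel_embedding)
    # Scaled dot-product score between the voxel query and each image feature.
    scores = [
        sum(q * k for q, k in zip(voxel_embedding, feat)) / math.sqrt(d)
        for feat in image_features
    ]
    weights = softmax(scores)
    # Attention-weighted average of the image features.
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, image_features))
        for i in range(d)
    ]
    return sum(p * r for p, r in zip(pooled, readout))
```

In this sketch, only `voxel_embedding` is specific to a voxel, which is one way to see why data from different subjects could share the rest of the model.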

2306.02216 2026-05-26 cs.LG cs.CV 版本更新

Forgettable Federated Linear Learning with Certified Data Unlearning

具有认证数据遗忘的可遗忘联邦线性学习

Ruinan Jin, Minghui Chen, Qiong Zhang, Xiaoxiao Li

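Affiliations: University of British Columbia; Vector Institute; Renmin University of China

AI summary: Proposes a federated unlearning framework based on linear approximation with pre-trained models, using federated linear training to achieve efficient, secure, and certified unlearning of client data.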

Comments IEEE Transactions on Neural Networks and Learning Systems

详情
Journal ref
IEEE Transactions on Neural Networks and Learning Systems, Early Access, pp. 1-10, 2026
AI中文摘要

联邦学习(FL)能够在分布式客户端之间进行协作模型训练,同时保护用户隐私。最近,联邦遗忘(FU)的出现旨在解决“被遗忘权”问题,并在无需重新训练整个FL系统的情况下移除中毒或目标客户端的影响。然而,许多FU方法需要与保留或目标客户端通信,引入额外的安全风险,或存储历史模型,限制了其效率和实用性。此外,由于非线性模型及其训练动态的复杂性,大多数用于深度神经网络(DNN)的FU方法缺乏理论认证。在这项工作中,我们引入了可遗忘联邦线性学习,这是一个用于DNN的训练和遗忘框架。我们的方法使用预训练模型线性近似DNN,并通过联邦线性训练实现与原始网络相当的性能。我们进一步提出了一种经过认证、高效且安全的遗忘策略,使服务器能够在不进行额外客户端通信或存储的情况下移除目标客户端的影响。在从小型到大型数据集上使用卷积神经网络和现代基础模型进行的广泛实验表明,我们的方法在模型准确性和有效的目标客户端遗忘之间取得了平衡。这项工作为高效且可信的FU提供了一个实用的流程。代码:https://github.com/Nanboy-Ronan/2F2L-Federated-Unlearning

英文摘要

Federated Learning (FL) enables collaborative model training across distributed clients while preserving user privacy. Recently, Federated Unlearning (FU) has emerged to address the "right to be forgotten" and to remove the influence of poisoned or target clients without retraining the entire FL system. However, many FU methods require communication with retained or target clients, introduce additional security risks, or store historical models, limiting their efficiency and practicality. Moreover, most FU methods for deep neural networks (DNNs) lack theoretical certification due to the complexity of nonlinear models and their training dynamics. In this work, we introduce Forgettable Federated Linear Learning, a training and unlearning framework for DNNs. Our approach uses pre-trained models to linearly approximate DNNs and achieve performance comparable to the original networks through Federated Linear Training. We further present a certified, efficient, and secure unlearning strategy that enables the server to remove a target client's influence without additional client communication or storage. Extensive experiments on small- to large-scale datasets, using both convolutional neural networks and modern foundation models, show that our method balances model accuracy with effective target-client unlearning. This work provides a practical pipeline for efficient and trustworthy FU. Code: https://github.com/Nanboy-Ronan/2F2L-Federated-Unlearning
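
One common reading of "linearly approximate DNNs" is a first-order Taylor expansion of the network around its pre-trained weights. The expression below writes that out. The symbols $f$ (network), $w_0$ (pre-trained weights), and $w$ (trained weights) are notation assumed here for illustration, and the abstract does not confirm that the paper uses exactly this form.

```latex
f_{\text{lin}}(x; w) = f(x; w_0) + \nabla_w f(x; w_0)^{\top} (w - w_0)
```

Because $f_{\text{lin}}$ is affine in $w$, any convex loss (for example, squared error) yields a convex training objective. That kind of structure is what usually makes it tractable to reason about removing a single client's contribution.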

2605.24509 2026-05-26 cs.CV cs.AI cs.GR cs.LG 版本更新

Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation

Φ-Noise:基于相位噪声操作的无训练时间视频条件生成

Ofir Abramovich, Nadav Z. Cohen, Adi Rosenthal, Ariel Shamir

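Affiliations: Canvas-Lab

AI summary: Proposes a training-free method that injects low-frequency phase information from a reference video into the diffusion noise latents to achieve motion-conditioned video generation, without modifying the model architecture or inference pipeline.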

Comments Under Review; 26 pages, 21 figures

详情
AI中文摘要

潜在视频扩散模型通过逐步将高斯噪声转换为基于文本或视觉输入的真实样本来生成视频。然而,现有的条件方法通常需要额外的训练和计算开销。受最近关于频率分量在生成模型中重要性的发现启发,我们提出了一种简单、无需训练的运动条件视频生成方法,通过将参考视频的低频相位信息直接注入扩散噪声潜变量。我们的方法在不修改模型架构或推理流程的情况下传递运动线索。通过多个应用,我们展示了在生成视频中对外观和动态的有效控制,同时与更复杂的条件方法相比取得了具有竞争力或更优的结果。

英文摘要

Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach for motion-conditioned video generation by injecting low-frequency phase information from a reference video directly into the diffusion noise latents. Our method transfers motion cues without modifying the model architecture or inference pipeline. Using several applications, we demonstrate effective control over both appearance and dynamics in generated videos, while achieving competitive or superior results compared to more complex conditioning approaches.
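
The idea of injecting low-frequency phase into noise can be illustrated on a 1D signal. The sketch below is a simplified stand-in, not the paper's method: it works on a list of floats rather than video latents, uses a hand-written DFT to stay dependency-free, leaves the DC bin untouched, and the `cutoff` parameter and function names are hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def inject_low_freq_phase(noise, reference, cutoff):
    """Keep the noise magnitudes, but copy the reference phase for bins 1..cutoff."""
    n = len(noise)
    noise_hat, ref_hat = dft(noise), dft(reference)
    mixed = list(noise_hat)
    for k in range(1, min(cutoff, (n - 1) // 2) + 1):
        coeff = cmath.rect(abs(noise_hat[k]), cmath.phase(ref_hat[k]))
        mixed[k] = coeff
        # Mirror bin keeps conjugate symmetry so the output stays real.
        mixed[n - k] = coeff.conjugate()
    return [v.real for v in idft(mixed)]
```

Since only phases change, the per-frequency magnitudes (and hence the total energy) of the noise are preserved in this sketch, while the low-frequency structure follows the reference.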

2605.24508 2026-05-26 cs.CV 版本更新

FDDet: Achieving Data-Efficient Food Defect Detection Under Real-World Scenarios

FDDet: 实现真实场景下的数据高效食品缺陷检测

Ruihao Xu, Yong Liu, Yansong Tang

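Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University

AI summary: To address data scarcity and the lack of a unified benchmark in food defect detection, introduces the FDD-48 dataset with 48 defect categories and the semi-supervised framework FDDet, which uses BBoxMixUp data augmentation and CGPC pseudo-label calibration to significantly outperform mainstream detectors in data-limited settings.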

详情
AI中文摘要

食品缺陷检测对于自动化质量控制至关重要,然而现有研究缺乏统一基准且面临数据稀缺问题。我们引入了FDD-48,一个在多种真实世界条件下涵盖13种食品类型和48种缺陷类别的细粒度标注综合数据集。为了在有限标注数据下提高检测性能,我们提出了FDDet,一个半监督框架,包含两个关键组件:(1) BBoxMixUp,一种数据增强技术,通过混合同类别缺陷区域来减少虚假特征关联;(2) CGPC(一致性引导的伪标签校准),基于样本内一致性过滤伪标签。实验表明,FDDet在FDD-48上显著优于主流检测器,证明了其在数据有限场景下进行食品缺陷检测的有效性。

英文摘要

Food defect detection is critical for automated quality control, yet existing studies lack unified benchmarks and suffer from data scarcity. We introduce FDD-48, a comprehensive dataset with fine-grained annotations across 13 food types and 48 defect categories under diverse real-world conditions. To improve detection with limited labeled data, we propose FDDet, a semi-supervised framework featuring two key components: (1) BBoxMixUp, a data augmentation technique that mixes same-category defect regions to reduce spurious feature associations, and (2) CGPC (Consistency-Guided Pseudo-Label Calibration), which filters pseudo-labels based on intra-sample consistency. Experiments show FDDet significantly outperforms mainstream detectors on FDD-48, demonstrating its effectiveness for food defect detection under data-limited scenarios.
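
The abstract describes BBoxMixUp only as mixing same-category defect regions. One plausible reading is a pixel-wise convex blend of two boxes of the same class, sketched below. Everything here is an assumption made for illustration: the function name `bbox_mixup`, the `(x, y, width, height)` box format, the requirement that both boxes have equal size, and the mixing coefficient `lam`.

```python
def bbox_mixup(target_img, target_box, source_img, source_box, lam=0.5):
    """Blend a source defect region into a target defect region of the same category.

    Images are 2D lists of floats (grayscale); boxes are (x, y, width, height).
    Returns a new image; pixels outside the target box are unchanged.
    """
    tx, ty, w, h = target_box
    sx, sy, sw, sh = source_box
    if (w, h) != (sw, sh):
        raise ValueError("this sketch assumes equally sized boxes")
    out = [row[:] for row in target_img]
    for dy in range(h):
        for dx in range(w):
            t = target_img[ty + dy][tx + dx]
            s = source_img[sy + dy][sx + dx]
            # Convex combination: lam weights the target, (1 - lam) the source.
            out[ty + dy][tx + dx] = lam * t + (1 - lam) * s
    return out
```

Because both regions carry the same label, the box annotation of the target image can be kept as-is after blending, which is what makes this kind of augmentation cheap to apply.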

2605.24503 2026-05-26 cs.CV cs.AI 版本更新

FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis

FoodMonitor:用于可解释合规性分析的多模态大语言模型基准测试

Ruihao Xu, Xingming Shui, Jingxuan Niu, Yiqin Wang, Jilin Yu, Haoji Zhang, Yansong Tang

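Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University

AI summary: To address the lack of rule-driven explainability in existing video anomaly detection, introduces the FoodMonitor benchmark, with dual-channel violation annotations and a two-stage matching evaluation protocol, revealing that spatial localization and fine-grained rule understanding are the bottlenecks for current multimodal large language models.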

详情
AI中文摘要

随着基于AI的合规性监测在公共治理和工业安全中日益重要,提供可验证证据和可追溯问责信号的能力至关重要。然而,现有的视频异常检测数据集侧重于事件级二元分类,缺乏真实世界合规场景所需的规则驱动、可解释分析。我们引入了FoodMonitor,一个用于商业厨房监控中可解释合规性分析的基准。FoodMonitor包含477个视频片段,具有3307个违规标注,采用双通道设计覆盖人员级和环境级违规。每个标注指定了违反哪条规则、发生了何种不合规行为以及由谁实施,并附有帧级边界框。我们建立了一个统一的评估协议,包含两阶段匹配机制,分别评估空间定位和语义理解,以及一个复合指标($C_{\text{score}}$),平衡环境和人员检测性能。对几种最先进的多模态大语言模型的系统评估显示,表现最佳的模型仅达到0.360 $C_{\text{score}}$,空间定位和细粒度规则理解成为主要瓶颈。我们的分析识别出两种不同的失败模式:定位主导的错误和语义主导的错误,为未来模型开发提供了诊断性见解。

英文摘要

As AI-powered compliance monitoring becomes increasingly important in public governance and industrial safety, the ability to provide verifiable evidence and traceable accountability signals is essential. However, existing video anomaly detection datasets focus on event-level binary classification, lacking the rule-driven, explainable analysis required for real-world compliance scenarios. We introduce FoodMonitor, a benchmark for explainable compliance analysis in commercial kitchen surveillance. FoodMonitor comprises 477 video clips with 3,307 violation annotations across a dual-channel design covering both person-level and environment-level violations. Each annotation specifies which rule was violated, what non-compliant behavior occurred, and who committed it with frame-level bounding boxes. We establish a unified evaluation protocol with a two-stage matching mechanism that separately assesses spatial localization and semantic understanding, along with a composite metric ($C_{\text{score}}$) that balances environment and person detection performance. Systematic evaluation of several state-of-the-art multimodal large language models reveals that the best-performing model achieves only 0.360 $C_{\text{score}}$, with spatial localization and fine-grained rule understanding emerging as the primary bottlenecks. Our analysis identifies two distinct failure modes: localization-dominated errors and semantics-dominated errors, providing diagnostic insights for future model development.

2605.24492 2026-05-26 cs.CV 版本更新

Med-R2: An Adversarial Benchmark for Evidence-Grounded Reasoning in Medical VLMs

Med-R2: 面向医学视觉语言模型中基于证据推理的对抗性基准

Wen Ma, Fucheng Niu, Zhiting Fan, Zikai Xiao, Jiaxiang Liu, Zuozhu Liu

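Affiliations: Zhejiang University; Guangdong Institute of Intelligence Science and Technology

AI summary: Proposes Med-R2 Bench, a hierarchical adversarial benchmark that uses stepwise QA tasks and adversarial perturbations to evaluate how robustly medical VLMs reason from visual evidence across the clinical workflow.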

详情
AI中文摘要

视觉语言模型在通用医学视觉问答中展现出令人印象深刻的能力,但由于可解释性有限,尚不清楚其预测是反映了基于证据的临床推理还是依赖于虚假先验。我们引入 Med-R2 Bench,一个与临床工作流对齐的分层基准,用于评估视觉定位的对抗鲁棒性。我们设计逐步QA任务,以评估推理链是否严格基于四个临床阶段的视觉证据,并采用对抗性扰动测试对误导线索的鲁棒性。Med-R2 包含 42,432 张图像、31 个任务类别和 110,406 个 QA 对。在 14 个 VLM 上的评估揭示了沿四阶段临床工作流的顺序性能下降。对抗实验表明,模型严重依赖正确的提示来猜测答案。即使提供了明确的视觉线索,模型也难以准确对齐文本描述。最后,我们证明使用我们的分层数据进行逐步微调显著提高了推理鲁棒性,突显了其在推动基于证据的医学AI未来发展方面的潜力。

英文摘要

Vision-language models have demonstrated impressive capabilities in general medical visual question answering, yet due to limited interpretability, it remains unclear whether their predictions reflect evidence-grounded clinical reasoning or reliance on spurious priors. We introduce Med-R2 Bench, a hierarchical benchmark aligned with the clinical workflow to evaluate adversarial robustness with visual grounding. We design stepwise QA tasks to assess whether reasoning chains are strictly grounded in visual evidence across the four clinical stages, and employ adversarial perturbations to test robustness against misleading cues. Med-R2 comprises 42,432 images, 31 task categories, and 110,406 QA pairs. Evaluation across 14 VLMs reveals a sequential performance degradation along the four-stage clinical workflow. Adversarial experiments show that models rely heavily on correct prompts to guess answers. Even when provided with explicit visual cues, the models struggle to accurately align textual descriptions. Finally, we demonstrate that stepwise fine-tuning on our hierarchical data significantly improves reasoning robustness, highlighting its potential to drive future improvements in evidence-based medical AI.

2605.24475 2026-05-26 cs.CV cs.AI cs.MM 版本更新

Robust Fuzzy Multi-view Learning under View Conflict

视角冲突下的鲁棒模糊多视角学习

Siyuan Duan, Yuan Sun, Dezhong Peng, Yingke Chen, Xi Peng, Peng Hu

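Affiliations: College of Computer Science, Sichuan University; Tianfu Jincheng Laboratory; School of Artificial Intelligence, Sichuan University

AI summary: To address view conflict in multi-view classification, proposes the Robust Fuzzy Multi-View Learning (R-FUML) framework grounded in fuzzy set theory, which improves robustness and uncertainty estimation by quantifying category credibility with fuzzy memberships, fusing views with an entropy-based method, and penalizing conflicting samples.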

详情
AI中文摘要

可信多视角分类旨在提供可靠的融合以实现准确预测,近年来在学术界和工业界引起了广泛关注。然而,现有的TMVC方法通常假设训练和测试阶段不同视角之间严格对齐,这在现实场景中往往不切实际。这一局限性促使我们重新审视TMVC并将其扩展到更具挑战性的设置:如何在训练和推理过程中减轻视角冲突(VC)的影响。针对这一设置,现有的TMVC方法存在三个关键缺陷:低估不确定性、误导性决策以及对VC的过拟合。为解决这些问题,本文提出了一种基于模糊集理论的新型鲁棒模糊多视角学习(R-FUML)框架。具体而言,R-FUML将网络输出建模为模糊隶属度以量化类别可信度,并使用基于熵的方法进行可靠的多视角融合。为此,我们提出了一种鲁棒多视角融合(RMF)策略,该策略同时考虑了视角特定的不确定性和视角间的冲突,从而减轻VC对决策的不利影响。为了在训练过程中识别并克服VC,我们进一步设计了一种针对VC的鲁棒学习(RLVC)框架。RLVC通过利用神经网络的记忆效应隔离冲突样本,然后通过对这些冲突视角施加惩罚来重新训练模型。在八个公开数据集上的大量实验表明,R-FUML在鲁棒性和不确定性估计方面始终优于15个最先进的基线方法。代码将在论文被接收后发布。

英文摘要

Trusted multi-view classification (TMVC) aims to deliver reliable fusion for accurate predictions and has recently attracted substantial attention in both academia and industry. However, existing TMVC methods typically assume strict alignment across different views during both training and testing phases, which is often impractical in real-world scenarios. This limitation motivates us to revisit TMVC and extend it to a more challenging setting: how to mitigate the impact of view conflict (VC) during both training and inference. In this setting, existing TMVC methods suffer from three critical limitations: underestimated uncertainty, misleading decisions, and overfitting to VC. To address these issues, this paper proposes a novel Robust Fuzzy Multi-View Learning (R-FUML) framework grounded in Fuzzy Set Theory. Specifically, R-FUML models network outputs as fuzzy memberships to quantify category credibility and uses an entropy-based method for reliable multi-view fusion. To this end, we present a Robust Multi-view Fusion (RMF) strategy that accounts for both view-specific uncertainty and inter-view conflicts, thereby alleviating the adverse impacts of VC on decision-making. To identify and conquer VC during training, we further design a Robust Learning Against VC (RLVC) framework. RLVC isolates conflicting samples by leveraging neural networks' memory effects and then retrains the model by applying a penalty to these conflicting views. Extensive experiments across eight public datasets demonstrate that R-FUML consistently outperforms 15 state-of-the-art baselines in robustness and uncertainty estimation. The code will be released upon acceptance.
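
Entropy-based fusion of fuzzy memberships can be sketched as follows. The entropy used is the standard De Luca-Termini fuzzy entropy. Weighting each view by one minus its normalized entropy is an assumption made here for illustration, and the function names are hypothetical. The paper's RMF strategy also accounts for inter-view conflict, which this sketch does not model.

```python
import math


def fuzzy_entropy(memberships):
    """De Luca-Termini fuzzy entropy, normalized to [0, 1].

    memberships: per-class membership degrees in [0, 1] (they need not sum to 1).
    """
    def term(m):
        if m <= 0.0 or m >= 1.0:
            return 0.0  # crisp memberships carry no fuzziness
        return -(m * math.log(m) + (1.0 - m) * math.log(1.0 - m))

    return sum(term(m) for m in memberships) / (len(memberships) * math.log(2))


def fuse_views(views):
    """Fuse per-view membership vectors, trusting low-entropy views more."""
    weights = [1.0 - fuzzy_entropy(v) for v in views]
    total = sum(weights)
    if total <= 0.0:
        # Every view is maximally uncertain: fall back to a plain average.
        weights = [1.0] * len(views)
        total = float(len(views))
    n_classes = len(views[0])
    return [
        sum(w * v[c] for w, v in zip(weights, views)) / total
        for c in range(n_classes)
    ]
```

A view whose memberships all sit at 0.5 has entropy 1 and therefore receives zero weight, so a confident view dominates the fused result.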

2605.24448 2026-05-26 cs.CV 版本更新

SILSM: A Sustainable Interactive Level Set Method for Progressive Refinement

SILSM:一种可持续交互式水平集方法用于渐进式细化

Jiachen Song, Dazhi Zhang, Fanghui Song, Zhichang Guo, Shengzhu Shi

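Affiliations: School of Mathematics, Harbin Institute of Technology

AI summary: Proposes the Sustainable Interactive Level Set Method (SILSM), which decouples user guidance into an independent interaction term and adopts high-order regularization to achieve stable, progressively refined interactive segmentation.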

详情
AI中文摘要

交互式分割旨在利用稀疏的用户引导精确分离目标对象。然而,传统方法通常面临交互负担重和参数敏感的问题,而深度学习方法则受限于数据依赖和迭代不稳定性。受这些限制的启发,我们提出了可持续交互式水平集方法(SILSM)。所提出的水平集演化方程包含交互项、正则化项和分割项。具体来说,采用高阶正则化以保持数值稳定性,并且与传统方法不同,我们将用户引导解耦为一个独立的交互项,从而能够直接手动控制零水平集的演化。此外,我们开发了一种适用于多次交互的数值算法,通过基于顺序用户输入有效更新分割结果,促进动态细化。我们从理论上证明,高阶项比传统长度项提供更强的正则化约束,而交互项确保分割严格在用户选择的区域内。实验结果进一步表明,所提出的方法对交互输入具有鲁棒性,在首次交互时即达到有竞争力的性能,并支持稳定的多轮交互,分割质量逐步提高。

英文摘要

Interactive segmentation aims to precisely isolate target objects using sparse user guidance. However, traditional methods often suffer from heavy interaction burdens and parameter sensitivity, while deep learning approaches struggle with data dependency and iterative instability. Motivated by these limitations, we propose the Sustainable Interactive Level Set Method (SILSM). The proposed level set evolution equation incorporates interaction, regularization, and segmentation terms. Specifically, high-order regularization is employed to maintain numerical stability, and unlike traditional methods, we decouple user guidance into an independent interaction term to enable direct manual control over the zero-level set evolution. Furthermore, we develop a numerical algorithm tailored for multiple interactions, which facilitates dynamic refinement by effectively updating the segmentation results based on sequential user inputs. We theoretically demonstrate that the high-order term provides stronger regularization constraints than the conventional length term, while the interaction term ensures segmentation strictly within the user-selected region. Experimental results further demonstrate that the proposed method is robust to interactive inputs, achieves competitive performance at the first interaction, and supports stable multi-round interactions with progressively improved segmentation quality.

2605.24442 2026-05-26 cs.CV 版本更新

Benchmarking Composed Image Retrieval for Applied Earth Observation

面向应用地球观测的组合图像检索基准测试

Bill Psomas, Dionysis Christopoulos, Thanasis Petropoulos, Nikos Efthymiadis, Ioannis Kakogeorgiou, Ondřej Chum, Yannis Avrithis, Giorgos Tolias, Konstantinos Karantzalos

Affiliations: Visual Recognition Group, Department of Cybernetics, Czech Technical University in Prague, Czechia; Remote Sensing Laboratory, School of Rural, Surveying and Geoinformatics Engineering, National Technical University of Athens, Greece; Institute of Informatics & Telecommunications, National Centre for Scientific Research "Demokritos", Greece; Department of Informatics, Kapodistrian University of Athens, Greece

AI summary: For remote sensing composed image retrieval (RSCIR), this work uses a unified benchmark and an application-oriented study to systematically evaluate how well modern composition methods transfer to Earth observation imagery, and introduces xView2-CIR, a change-centric dataset for disaster monitoring, revealing the strengths of training-free composition methods and the distinct challenges of change-centric retrieval.

详情
AI中文摘要

遥感组合图像检索(RSCIR)能够使用结合参考图像和文本修饰符的组合查询在大型卫星图像档案中进行搜索。尽管RSCIR为表达目标检索意图提供了灵活的接口,但现代组合方法在地球观测(EO)图像上的可迁移性及其与操作化EO工作流的相关性仍未得到充分探索。我们通过统一的基准测试和面向应用的研究来填补这一空白。首先,我们在PatternCom上使用标准化协议,系统地调整并评估了具有六个视觉-语言骨干网络的代表性组合图像检索方法,分析了它们在不同骨干网络、组合策略和查询类型上的行为。其次,我们引入了xView2-CIR,这是一个面向灾害和损害监测的变化中心数据集,其中检索以场景身份和目标灾后状态为条件。我们的结果表明,无训练组合方法为EO检索提供了强大且可扩展的基线,而变化中心检索则呈现出与基于属性的检索不同的挑战,特别是由于需要保持场景身份。总体而言,本研究为RSCIR建立了一个实用的基准测试,并将组合检索定位为遥感图像检索、档案探索和变化分析的补充工具。数据集和代码可在https://github.com/billpsomas/rscir获取。

英文摘要

Remote sensing composed image retrieval (RSCIR) enables search in large satellite image archives using composed queries that combine a reference image with a textual modifier. Although RSCIR offers a flexible interface for expressing targeted retrieval intent, the transferability of modern composition methods to Earth observation (EO) imagery and their relevance to operational EO workflows remain underexplored. We address this gap through a unified benchmark and an application-oriented study. First, we systematically adapt and evaluate representative composed image retrieval methods with six vision-language backbones on PatternCom under a standardized protocol, analyzing their behavior across backbones, composition strategies, and query types. Second, we introduce xView2-CIR, a change-centric dataset for disaster and damage monitoring, where retrieval is conditioned on scene identity and a target post-event state. Our results show that training-free composition methods provide strong and scalable baselines for EO retrieval, while change-centric retrieval presents different challenges from attribute-based retrieval, particularly due to the need to preserve scene identity. Overall, this study establishes a practical benchmark for RSCIR and positions composed retrieval as a complementary tool for remote sensing image retrieval, archive exploration, and change analysis. The dataset and code are available at https://github.com/billpsomas/rscir.

2605.24403 2026-05-26 cs.CV 版本更新

Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects

Artiverse:一个多样且物理基础扎实的铰接物体数据集

Denys Iliash, Jiayi Liu, Egor Fokin, Qirui Wu, Ali Mahdavi-Amiri, Manolis Savva, Angel X. Chang

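Affiliations: Simon Fraser University; Canada-CIFAR AI Chair, Amii

AI summary: Introduces the Artiverse dataset of 5.4K high-quality articulated 3D objects, annotated efficiently through a semi-automated pipeline that combines few-shot segmentation, geometric reasoning, and multi-stage human verification, and demonstrates its value for part mobility analysis, articulated object generation, and physics-based interaction.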

Comments CVPR camera-ready version

详情
AI中文摘要

我们提出了Artiverse,一个多样且物理基础扎实的高质量铰接3D物体数据集,旨在用于真实的功能建模和仿真。Artiverse包含来自多个3D静态仓库的5.4K个人工制作的物体,涵盖88个广泛类别。物体被标注有功能部件、内部结构、真实的运动学关系和铰接关节(包括多自由度关节),以及物理属性如公制尺度、材料和质量。我们开发了一个半自动标注管道,结合少样本分割、几何推理和多阶段人工验证,以实现高质量和高效的标注,将人工标注时间减少了30%以上。我们展示了Artiverse在部件运动分析、铰接物体生成和基于物理的交互任务中的价值。Artiverse为推进铰接物体的功能理解提供了数据资源。

英文摘要

We present Artiverse, a diverse and physically grounded dataset of high-quality articulated 3D objects designed for realistic functional modeling and simulation. Artiverse contains 5.4K human-authored objects across a broad range of 88 categories, aggregated from multiple 3D static repositories. Objects are annotated with functional parts, interior structures, realistic kinematic relationships and articulated joints including multi-DoF joints, and physical attributes such as metric scale, material, and mass. We develop a semi-automated annotation pipeline that combines few-shot segmentation, geometric reasoning, and multi-stage human verification to achieve high-quality and efficient annotation, reducing manual annotation time by over 30%. We demonstrate the value of Artiverse on tasks of part mobility analysis, articulated object generation, and physics-based interaction. Artiverse provides a data resource to advance functional understanding for articulated objects.

2605.24402 2026-05-26 cs.CV 版本更新

Dual Prototype-Conditioned Diffusion Model for Scalable Multi-Class Unsupervised Anomaly Detection in Large Category Spaces

面向大规模类别空间的可扩展多类无监督异常检测的双原型条件扩散模型

Yaoxuan Feng, Yuxin Li, Weijiang Lv, Zixuan Zhao, Yubiao Wang, Wenchao Chen, Bo Chen, Hongwei Liu

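Affiliations: National Key Laboratory of Radar Signal Processing

AI summary: Proposes DPDiff-AD, which models heterogeneous normal distributions with local and global prototypes and uses diffusion-based reconstruction to achieve scalable multi-class anomaly detection.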

详情
AI中文摘要

多类异常检测旨在跨不同产品类别构建统一模型。然而,随着类别数量的增加,由于正态分布日益复杂和异质,其性能通常会下降。为应对这一挑战,我们提出DPDiff-AD,一种用于大规模多类异常检测的双原型条件扩散模型。DPDiff-AD通过互补的局部和全局原型对异构正态分布进行建模。局部原型通过最近原型聚合捕获代表性的细粒度结构模式,而全局原型通过最优传输正则化调节整体特征几何。这些双尺度表示共同定义了一个结构化的正态空间。通过基于原型感知注意力的局部和全局原型条件扩散重建,该空间得到细化。在生成过程中联合利用双原型,DPDiff-AD实现了精确的正态建模,随着类别基数的增长保持了结构化的可分离性,并实现了可扩展的异常判别。在五个基准上的大量实验证明了DPDiff-AD的有效性和可扩展性。在160类大规模数据集上,它相比之前最先进的方法Dinomaly+,图像级和像素级AUROC分别提升了5.3和2.9个百分点,同时随着类别基数的增加保持了稳定的性能。

英文摘要

Multi-class anomaly detection aims to build unified models across diverse product categories. However, as the number of categories grows, its performance often degrades due to increasingly complex and heterogeneous normal distributions. To address this challenge, we propose DPDiff-AD, a Dual Prototype-conditioned Diffusion model for large-scale multi-class Anomaly Detection. DPDiff-AD models heterogeneous normal distributions through complementary local and global prototypes. Local prototypes capture representative fine-grained structural patterns via nearest-prototype aggregation, while global prototypes regulate holistic feature geometry through optimal transport regularization. Together, these dual-scale representations define a structured normality space. This space is refined through diffusion-based reconstruction conditioned on both local and global prototypes via prototype-aware attention. By jointly leveraging dual prototypes during generation, DPDiff-AD achieves precise normality modeling, preserves structured separability as category cardinality grows, and enables scalable anomaly discrimination. Extensive experiments across five benchmarks demonstrate the effectiveness and scalability of DPDiff-AD. On the 160-category large-scale dataset, it improves image- and pixel-level AUROC by 5.3 and 2.9 points over the previous state-of-the-art method Dinomaly+, while maintaining stable performance as category cardinality increases.

2605.24398 2026-05-26 cs.CV cs.AI cs.GR 版本更新

VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation

VectorArk: 学习基于圆角多边形表示的实际图像矢量化

Tarun Gehlaut, Difan Liu, Charu Bansal, Krutik Malani, Souymodip Chakraborty, Ankit Phogat, Matthew Fisher, Vineet Batra

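Affiliations: Adobe

AI summary: Proposes the VectorArk model, which uses a rounded polygon representation and a degradation model to achieve robust, practical image vectorization, with superior geometric completeness and artifact suppression across multiple datasets.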

Comments CVPR 2026. Project page: https://vectorark.github.io/

详情
AI中文摘要

近期基于视觉-语言模型(VLM)的方法在图像矢量化任务上取得了令人印象深刻的结果。然而,它们通常在合成基准上进行评估,其中干净的SVG以高分辨率光栅化,然后重新矢量化。因此,这些方法在真实场景中泛化能力较差,例如图像具有未知的光栅化方法或由文本到图像模型生成。我们引入了VectorArk,一种新的基于VLM的模型,旨在实现鲁棒且实用的图像矢量化。VectorArk采用了一种新颖的圆角多边形表示,简化了学习过程,同时自然地生成平滑、视觉上吸引人的基元。我们还提出了一种退化模型,增强了在多样且不完美输入下的鲁棒性。我们的实验表明,与先前方法相比,VectorArk在多个数据集上实现了优越的几何完整性和伪影抑制,全面的消融实验验证了每个组件的贡献。

英文摘要

Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models. We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs. Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.

2605.24371 2026-05-26 cs.CV cs.CL 版本更新

SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation

SliceWorld: 一种用于CT报告生成的预测性和可控世界状态模型

Yuanhe Tian, Yan Song

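Affiliations: Zhongguancun Academy; University of Science and Technology of China

AI summary: Proposes SliceWorld, a world-state framework that encodes CT slice sequences into factor-aware latent states to support future-slice prediction, lesion-factor intervention, and LLM-based report generation, improving NLG metrics and clinical evaluation on M3D-Cap and CT-RATE.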

Comments 18 pages, 5 figures

详情
AI中文摘要

CT报告生成(CTRG)要求模型从数百个轴向切片中总结三维解剖背景和病理发现。现有方法通常学习直接的图像到文本映射,缺乏对CT证据如何跨切片演变或报告如何响应潜在病变相关因素受控变化的建模机制。我们提出SliceWorld,一个CT特定的世界状态框架,将轴向CT扫描视为沿z轴的有序序列。SliceWorld将前缀CT证据编码为包含解剖、病变和不确定性成分的因子感知潜在状态,并将这些状态投影到用于多步未来切片特征预测、病变因子干预和基于LLM的报告生成的世界令牌中。该模型首先在CT切片序列上使用预测性、因子感知和反事实目标进行预训练,然后在配对的CT报告数据上进行微调。在M3D-Cap和CT-RATE上的实验表明,SliceWorld改善了自然语言生成指标和临床导向的自动评估。进一步分析展示了多视野未来切片预测、可测量的因子对齐、减少切片的鲁棒性以及选择性病变敏感的报告调制。

英文摘要

CT report generation (CTRG) requires models to summarize three-dimensional anatomical context and pathological findings from hundreds of axial slices. Existing methods typically learn a direct image-to-text mapping, providing limited mechanisms for modeling how CT evidence evolves across slices or how reports respond to controlled changes in latent lesion-related factors. We propose SliceWorld, a CT-specific world-state framework that treats an axial CT scan as an ordered sequence along the z-axis. SliceWorld encodes prefix CT evidence into factor-aware latent states containing anatomy, lesion, and uncertainty components, and projects these states into world tokens used for multi-step future-slice feature prediction, lesion-factor intervention, and LLM-based report generation. The model is first pretrained on CT slice sequences with predictive, factor-aware, and counterfactual objectives, and is then fine-tuned on paired CT-report data. Experiments on M3D-Cap and CT-RATE show that SliceWorld improves natural language generation metrics and clinically oriented automatic evaluation. Further analyses demonstrate multi-horizon future-slice prediction, measurable factor alignment, reduced-slice robustness, and selective lesion-sensitive report modulation.

2605.24367 2026-05-26 cs.CV cs.LG 版本更新

Gaussian Rank-Based Neighborhood Degree for Graph Neural Networks in Image Classification

基于高斯排序邻域度的图神经网络图像分类方法

Rafael Mendonça Duarte, Jean Roberto Ponciano, Lucas Pascotti Valem

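Affiliations: Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP), São Carlos, SP, Brazil

AI summary: Proposes GRaNDe (Gaussian Rank-based Neighborhood Degree), which improves degree normalization in graph neural networks by combining neighborhood ranking with Gaussian distance weighting, yielding consistent accuracy gains on five public image classification datasets.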

详情
AI中文摘要

数据的指数级增长加剧了未标注数据的可用性与人工标注的高成本之间的差距。图神经网络(GNN)作为一种有前景的解决方案出现,因为它们利用关系结构并从标注和未标注数据中学习,执行半监督学习。这些模型的一个关键组成部分是基于度的归一化,它影响消息传播,但通常假设邻域节点具有均匀重要性。在图像分类中,图通常根据特征相似性构建,将所有邻居平等对待可能会忽略相关性的重要变化。受此差距启发,我们提出GRaNDe(高斯排序邻域度)。这种新颖的度度量将邻域排序与高斯距离加权相结合,以更好地捕捉节点重要性。在五个公开图像分类数据集上的实验表明,与最先进方法相比,该方法具有一致的准确率提升和竞争性或更优的结果。

英文摘要

The exponential growth of data has intensified the gap between the availability of unlabeled data and the high cost of manual annotation. Graph Neural Networks (GNNs) have emerged as a promising solution, as they exploit relational structures and learn from both labeled and unlabeled data, performing semi-supervised learning. A crucial component of many of these models is degree-based normalization, which influences message propagation but typically assumes uniform importance among neighboring nodes. In image classification, graphs are usually constructed from feature similarity, where treating all neighbors equally may overlook important variations in relevance. Motivated by this gap, we propose GRaNDe (Gaussian Rank-based Neighborhood Degree). This novel degree measure integrates neighborhood ranking with Gaussian distance weighting to better capture node importance. Experiments on five public image classification datasets show consistent accuracy improvements and competitive or superior results compared to state-of-the-art methods.
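
To make the contrast with a plain neighbor count concrete, the sketch below computes a degree in which each of the k nearest neighbors contributes a Gaussian weight of its distance, discounted by its rank. This particular combination (Gaussian weight divided by rank) is an illustrative assumption, not the paper's definition of GRaNDe. The function name and the parameters `k` and `sigma` are hypothetical.

```python
import math


def gaussian_rank_degree(distances, k, sigma):
    """Rank-discounted, Gaussian-weighted degree of one node.

    distances: distances from the node to its candidate neighbors.
    k:         number of nearest neighbors to keep.
    sigma:     bandwidth of the Gaussian distance kernel.
    """
    nearest = sorted(distances)[:k]
    return sum(
        math.exp(-(d * d) / (2.0 * sigma * sigma)) / rank
        for rank, d in enumerate(nearest, start=1)
    )
```

With a plain degree, every node in a k-nearest-neighbor graph has degree k. Under this sketch, a node whose neighbors are close in feature space gets a larger degree than a node whose neighbors are far away, so a degree-based normalization would treat the two differently.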

2605.24354 2026-05-26 cs.CV 版本更新

SparseWorld: Enhancing End-to-End Autonomous Driving via World Models with Sparse Scene Representation

SparseWorld: 通过具有稀疏场景表示的世界模型增强端到端自动驾驶

Ruoyu Wang, Jingke Wang, Yukai Ma, Yuehao Huang, Shuangming Lei, Guanglin Xu, Aixue Ye, Yong Liu

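Affiliations: Institute of Cyber-Systems and Control, Zhejiang University; 2012 Labs, Huawei; State Key Laboratory of Industrial Control Technology

AI summary: Proposes SparseWorld, a lightweight world model based on sparse scene representation that autoregressively forecasts future map elements and surrounding agents and uses these forecasts to refine downstream motion prediction and trajectory planning, achieving a 0.05% collision rate on nuScenes and state-of-the-art open-loop planning performance.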

详情
AI中文摘要

最近,世界模型通过未来情况预测和改进场景理解,在增强端到端驾驶系统方面取得了显著进展。然而,现有的驾驶世界模型通常基于密集场景表示,导致高计算成本和冗余信息。在本文中,我们提出了SparseWorld,一种轻量级世界模型,专注于仅预测场景的关键布局,从而为端到端驾驶系统实现高效的未来预测。SparseWorld首先执行自回归展开以预测未来的地图元素和周围智能体,使模型能够学习驾驶场景随时间如何演变。然后,它利用这些预测的未来来优化下游运动预测和轨迹规划。具体来说,我们提出了一种稀疏梦想家(Sparse Dreamer),通过联合时间和空间注意力在潜在空间中预测未来实例。通过与预测的未来实例交互,运动规划器捕获更准确的运动模式,并生成更明智且安全感知的轨迹。大量实验表明,SparseWorld显著降低了碰撞风险,并在nuScenes数据集的开放循环规划指标上实现了最先进的性能,碰撞率为0.05%。此外,在Bench2Drive基准测试的闭环规划指标上,它大幅优于基线方法。补充材料可在项目页面获取:https://wryzju.github.io/SparseWorld/。

英文摘要

Recently, world models have made significant progress in enhancing end-to-end driving systems through both future situation forecasting and improved scene understanding. However, existing driving world models are typically built upon dense scene representations, causing high computational costs and redundant information. In this paper, we present SparseWorld, a lightweight world model that focuses on predicting only the critical layout of the scene, enabling efficient future forecasting for end-to-end driving systems. SparseWorld first performs autoregressive rollout to forecast future map elements and surrounding agents, enabling the model to learn how driving scenarios evolve over time. It then leverages these predicted futures to refine downstream motion prediction and trajectory planning. Specifically, we propose a Sparse Dreamer that anticipates future instances in the latent space through joint temporal and spatial attention. By interacting with predicted future instances, the motion planner captures more accurate motion patterns and generates more informed and safety-aware trajectories. Extensive experiments demonstrate that SparseWorld significantly reduces collision risk and achieves state-of-the-art performance on the open-loop planning metrics of the nuScenes dataset with a collision rate of 0.05%. Moreover, it substantially outperforms the baseline method in closed-loop planning metrics on the Bench2Drive benchmark. Supplementary material is available at the project page: https://wryzju.github.io/SparseWorld/.

2605.24353 2026-05-26 cs.CV q-bio.OT 版本更新

ViViD-5K: Vineyard vision dataset for field-based berry detection and segmentation and grape cluster closure estimation

ViViD-5K:用于田间浆果检测与分割以及葡萄串闭合度估计的葡萄园视觉数据集

Xiangzhi Tong, Chengrui Zhang, Mac Flaherty, Andre Matteo Garcia, Dominic Gorman, Jonathan Jaramillo, Justine E. Vanden Heuvel, Yu Jiang

发表机构 * Horticulture Section, School of Integrative Plant Science, Cornell University(康奈尔大学整合植物科学学院园艺系) School of Electrical and Computer Engineering, Cornell University(康奈尔大学电气与计算机工程学院)

AI总结 提出ViViD-5K大规模葡萄园图像数据集和GrapeSAM两阶段视觉流水线,实现葡萄串闭合度的自动、客观估计。

详情
AI中文摘要

簇闭合度,定义为葡萄串中浆果之间间隙逐渐填充的程度,是葡萄园管理中的一个关键性状,影响病害风险。然而,传统的视觉评分方法劳动强度大、主观性强,且缺乏时间分辨率。现有数据集很少支持细粒度的浆果级分析,限制了稳健深度学习模型的发展。在这项工作中,我们提出了ViViD-5k,一个大规模田间葡萄园视觉数据集,包含5,000张带有密集标注的图像,包括超过648,000个浆果质心和覆盖13个葡萄品种的簇分割掩码。基于该数据集,我们引入了GrapeSAM,一个两阶段视觉流水线,结合了点状浆果定位和基于提示的分割(使用Segment Anything),随后是基于Transformer的簇分割。该流水线实现了在最小监督下对簇闭合度的自动化田间估计。定量结果表明,在多种条件下具有强大的分割和计数准确性,而可视化结果证实了在域内和域外样本上的鲁棒性。这项工作为手动紧凑度评分提供了一种可扩展且客观的替代方案,并支持具有增强空间细节的高通量葡萄表型分析。

英文摘要

Cluster closure, defined as the progressive filling of gaps between the berries in a grape bunch, is a key trait in vineyard management, impacting disease risk. However, traditional visual scoring methods are labor-intensive, subjective, and lack temporal resolution. Existing datasets rarely support fine-grained berry-level analysis, limiting the development of robust deep learning models. In this work, we present ViViD-5k, a large-scale in-field Vineyard Vision Dataset containing 5,000 images with dense annotations, including over 648,000 berry centroids and cluster segmentation masks spanning 13 grape varieties. Building on this dataset, we introduce GrapeSAM, a two-stage visual pipeline that combines point-based berry localization with prompt-based segmentation using Segment Anything, followed by transformer-based cluster segmentation. The pipeline enables automated, in-field estimation of cluster closure with minimal supervision. Quantitative results demonstrate strong segmentation and counting accuracy across diverse conditions, while visualizations confirm robustness on both in-domain and out-of-domain samples. This work provides a scalable and objective alternative to manual compactness scoring and supports high-throughput grape phenotyping with enhanced spatial detail.

2605.24322 2026-05-26 cs.CV 版本更新

Causal Physics Steering in Video World Models via Concept Activation Vectors

通过概念激活向量在视频世界模型中进行因果物理引导

Nahid Alam

发表机构 * Oreon Labs(Oreon实验室) Cohere Labs Community(Cohere实验室社区)

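AI Summary: Proposes a training-free method that uses Concept Activation Vectors (CAVs) from the Physics Emergence Zone (PEZ) to steer a video model's physical expectations at inference time without modifying model weights.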

Comments In proceedings of CVPR 2026 workshop on Video World Model

详情
AI中文摘要

视频世界模型学习物理动态的表示,但在推理时控制其物理期望仍然是一个开放问题。最近的可解释性工作识别出一个物理涌现区(PEZ),即VideoMAE中一组中间Transformer层,其中物理合理性与其他视觉特征分开表示。然而,尚不清楚这种结构是否可用于直接控制模型的物理推理。我们提出物理引导,一种无需训练的方法,使用PEZ层线性探测器的权重向量作为概念激活向量(CAV),并在推理时将其注入隐藏状态。这在不改变任何模型权重的情况下改变了模型的物理期望。在IntPhys基准上,这种干预可靠地将模型的合理性判断向任一方向移动,具体取决于引导符号。只有当干预应用于物理涌现区内时,效果才会出现,表明相关的物理表示位于该区域。我们进一步发现,物理与运动方向分开编码,不同的直觉物理原理在该表示空间中占据不同的方向。这些结果表明,VideoMAE中的物理推理不仅可读,而且可直接引导。

英文摘要

Video world models learn representations of physical dynamics, but controlling their physical expectations at inference time remains an open problem. Recent interpretability work identified a Physics Emergence Zone (PEZ), a group of middle transformer layers in VideoMAE where physical plausibility is represented separately from other visual features. However, it remained unclear whether this structure could be used to directly control the model's physics reasoning. We present physics steering, a training-free method that uses the weight vector of a linear probe at a PEZ layer as a Concept Activation Vector (CAV) and injects it into hidden states during inference. This shifts the model's physical expectations without changing any model weights. On the IntPhys benchmark, this intervention reliably shifts the model's plausibility judgment in either direction, depending on the steering sign. The effect appears only when the intervention is applied within the Physics Emergence Zone, suggesting that the relevant physics representation is localized there. We further find that physics is encoded separately from motion direction, and that different intuitive physics principles occupy distinct directions within this representation space. Together, these results show that physical reasoning in VideoMAE is not only readable, but also directly steerable.
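The steering step described in this abstract can be pictured with a small sketch. It assumes the injection is a plain addition of a scaled, unit-normalised probe direction to one hidden state; the abstract does not specify the scaling, normalisation, or which tokens are modified, so `steer`, `probe_logit`, `alpha`, and all numeric values below are hypothetical illustrations rather than the paper's implementation.

```python
import math

def probe_logit(hidden, w, b=0.0):
    """Linear probe: plausibility logit for a single hidden state."""
    return sum(h * wi for h, wi in zip(hidden, w)) + b

def steer(hidden, w, alpha):
    """Add alpha times the unit-norm probe direction (used as the CAV)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [h + alpha * wi / norm for h, wi in zip(hidden, w)]

# Hypothetical 4-dimensional probe weights and hidden state (illustrative only).
w = [0.5, -1.0, 0.0, 2.0]
hidden = [0.2, 0.1, -0.3, 0.05]

base = probe_logit(hidden, w)
up = probe_logit(steer(hidden, w, +1.5), w)    # positive steering sign
down = probe_logit(steer(hidden, w, -1.5), w)  # negative steering sign
```

Under this additive assumption the probe logit moves by exactly `alpha * ||w||`, so the sign of `alpha` decides the direction of the shift, which mirrors the sign dependence reported on IntPhys.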

2605.24321 2026-05-26 cs.CV 版本更新

Unified 3D Scene Understanding Through Physical World Modeling

统一3D场景理解:通过物理世界建模

Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Honglin Chen, Khai Loong Aw, Daniel L. K. Yamins

发表机构 * Stanford University(斯坦福大学) OpenAI

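AI Summary: Proposes 3WM, a probabilistic graphical model that unifies 3D vision tasks such as depth estimation, novel view synthesis, and object manipulation in a single model, executing tasks zero-shot through different inference pathways and reaching state-of-the-art performance without fine-tuning.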

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

理解3D场景需要灵活组合视觉推理任务,包括深度估计、新视角合成和物体操作,这些对于感知和交互都至关重要。现有方法通常孤立地处理这些任务,阻止它们共享共同表示或跨任务迁移知识。一个概念上更简单但实践中非平凡的选择是将这些多样任务统一到单一模型中,将不同任务从独立的训练目标简化为仅仅是不同的提示,并允许跨所有数据集进行联合训练。在这项工作中,我们提出了一个用于统一3D理解和交互的物理世界模型(3WM),它被构建为一个概率图模型,其中节点表示多模态场景元素,如RGB、光流和相机位姿。多样任务通过图中的不同推理路径产生:从RGB和密集流提示进行新视角合成,从RGB和稀疏流提示进行物体操作,以及从RGB和相机条件进行深度估计,所有这些都在零样本下完成,无需特定任务训练。3WM在无需微调的情况下优于专门的基线,通过提供精确的可控性、强几何一致性和在真实场景中的鲁棒性,在新视角合成和3D物体操作上实现了最先进的性能。除了预定义任务外,该模型支持可组合的推理路径,例如在导航3D环境时将物体移开,从而实现复杂的几何推理。这表明统一模型可以作为碎片化任务特定系统的实用替代方案,朝着通用视觉世界模型迈出了一步。

英文摘要

Understanding 3D scenes requires flexible combinations of visual reasoning tasks, including depth estimation, novel view synthesis, and object manipulation, all of which are essential for perception and interaction. Existing approaches have typically addressed these tasks in isolation, preventing them from sharing a common representation or transferring knowledge across tasks. A conceptually simpler but practically non-trivial alternative is to unify these diverse tasks into a single model, reducing different tasks from separate training objectives to merely different prompts and allowing for joint training across all datasets. In this work, we present a physical world model for unified 3D understanding and interaction (3WM), formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse tasks emerge from different inference pathways through the graph: novel view synthesis from RGB and dense flow prompts, object manipulation from RGB and sparse flow prompts, and depth estimation from RGB and camera conditioning, all zero-shot without task-specific training. 3WM outperforms specialized baselines without the need for finetuning by offering precise controllability, strong geometric consistency, and robustness in real-world scenarios, achieving state-of-the-art performance on NVS and 3D object manipulation. Beyond predefined tasks, the model supports composable inference pathways, such as moving objects aside while navigating a 3D environment, enabling complex geometric reasoning. This demonstrates that a unified model can serve as a practical alternative to fragmented task-specific systems, taking a step towards a general-purpose visual world model.

2605.24306 2026-05-26 cs.CV 版本更新

CoDA: Color Distribution Probing for Efficient and Generalizable AI-Generated Image Detection

CoDA: 面向高效且可泛化的AI生成图像检测的颜色分布探测

Zexi Jia, Zhiqiang Yuan, Xiaoyue Duan, Jinchao Zhang, Jie Zhou, Anil K. Jain

发表机构 * Tencent WeChat AI(腾讯微信AI) Department of Computer Science and Engineering, Michigan State University(密歇根州立大学计算机科学与工程系)

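AI Summary: Proposes CoDA, a lightweight detector (only 1.48M parameters) based on color-distribution probing, which uses a Noise-Quantization Probe to capture color non-uniformity in synthetic images; it achieves the best results on cross-domain evaluation while remaining highly competitive in cross-model settings.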

详情
AI中文摘要

AI生成图像检测面临泛化性与效率之间的持续权衡:基于轻量级伪影的方法在未见过的生成器或域上常常性能下降,而更鲁棒的大规模模型则计算成本高昂。同时,现有基准主要关注逼真场景下的跨模型评估,跨域鲁棒性尚未充分探索。为填补这一空白,我们引入了FakeForm,一个大规模基准,包含约37万张图像,覆盖62个不同域,用于跨模型和跨域评估。受此更广泛设置的启发,我们重新审视颜色分布探测作为AI生成图像检测的一种高效互补线索。我们观察到,特别是对于摄影内容,真实照片往往呈现更平滑、更稳定的颜色模式,而合成图像则常表现出神经生成引入的特征性颜色不平衡。基于这一观察,我们提出了CoDA,一个紧凑的1.48M参数检测器,基于噪声量化探针,并提供了将探针响应与颜色非均匀性联系起来的理论分析。实验表明,CoDA在标准基准上达到最先进性能,在FakeForm具有挑战性的跨域评估中取得最佳结果,同时在跨模型逼真设置中保持高度竞争力。这些结果表明,持续的生成伪影可以为高效且鲁棒的AI生成图像检测提供实用基础。模型和FakeForm基准将公开发布。

英文摘要

AI-generated image detection faces a persistent trade-off between generalization and efficiency: lightweight artifact-based methods often degrade on unseen generators or domains, whereas more robust large-scale models are computationally expensive. Meanwhile, existing benchmarks mainly focus on cross-model evaluation in photorealistic settings, leaving cross-domain robustness underexplored. To address this gap, we introduce FakeForm, a large-scale benchmark with approximately 370,000 images across 62 diverse domains for both cross-model and cross-domain evaluation. Motivated by this broader setting, we revisit color-distribution probing as an efficient complementary cue for AI-generated image detection. We observe that, especially for photographic content, real photographs tend to exhibit smoother and more stable color patterns, whereas synthetic images often show characteristic color imbalances introduced by neural generation. Based on this observation, we propose CoDA, a compact 1.48M-parameter detector built on a Noise-Quantization Probe, together with a theoretical analysis linking probe responses to color non-uniformity. Experiments show that CoDA achieves state-of-the-art performance on standard benchmarks and the best results on the challenging cross-domain evaluation of FakeForm, while remaining highly competitive in cross-model photorealistic settings. These results suggest that persistent generative artifacts can provide a practical foundation for efficient and robust AI-generated image detection. The models and FakeForm benchmark will be made publicly available.

2605.24304 2026-05-26 cs.CV cs.AI 版本更新

ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views

ArtSplat: 基于前馈的关节式3D高斯泼溅从稀疏多状态未标定视图

Inseo Lee, Yoonji Kim, Eugene Sohn, Jiwoong Lee, Jungmin You, Joonseok Lee, Jin-Hwa Kim

发表机构 * Seoul National University(首尔国立大学) Sogang University(西江大学) NAVER AI Lab(NAVER AI实验室)

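AI Summary: Proposes ArtSplat, the first feed-forward framework that reconstructs geometry and joint parameters in a single pass from sparse multi-view images across multiple articulation states, introducing a per-pixel joint map representation and a Cross-State Attention mechanism, and running over 400 times faster than baselines on PartNet-Mobility.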

详情
AI中文摘要

从稀疏视图图像重建关节物体是一个病态问题,需要同时推断几何和底层关节结构。现有基于NeRF和3D高斯泼溅(3DGS)的关节物体重建方法通常依赖密集视图或强先验(例如深度图、关节类型、预定义关节数量),并且需要昂贵的逐对象优化。在本文中,我们提出了ArtSplat,这是第一个用于关节式3D高斯泼溅的前馈框架。它通过单个前向传递,从跨多个关节状态的稀疏多视图图像中重建几何和关节参数。为了解决单次前向关节重建的挑战,我们引入了一种逐像素关节图表示,使得关节参数估计能够集成到前馈流水线中。我们进一步提出了一种带有状态令牌的跨状态注意力(CSA)机制,该机制有效捕获输入状态间的离散运动。在来自PartNet-Mobility的68个关节物体(包括单关节和多关节配置)上的实验表明,ArtSplat在几何和关节估计方面均达到了有竞争力的性能,同时比基线方法快400倍以上。

英文摘要

Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and underlying articulation structure. Existing methods for articulated object reconstruction based on NeRF and 3D Gaussian Splatting (3DGS) typically rely on dense views or strong priors (e.g., depth maps, joint types, predefined number of joints) and require costly per-object optimization. In this paper, we propose ArtSplat, the first feed-forward framework for articulated 3D Gaussian Splatting. It reconstructs both geometry and joint parameters from sparse multi-view images across multiple articulation states in a single forward pass. To address the challenges of single-pass articulated reconstruction, we introduce a per-pixel joint map representation that enables the integration of joint parameter estimation into the feed-forward pipeline. We further propose a Cross-State Attention (CSA) mechanism with state tokens, which effectively captures discrete motion across input states. Experiments on 68 articulated objects from PartNet-Mobility, including both single- and multi-joint configurations, demonstrate that ArtSplat achieves competitive performance in both geometry and joint estimation, while being over 400 times faster than baselines.

2605.24273 2026-05-26 cs.CV physics.ao-ph 版本更新

Plume Segmentation from MethaneSAT with Cross-Sensor Transfer Learning and Physics-Informed Postprocessing

基于跨传感器迁移学习和物理信息后处理的MethaneSAT羽流分割

Manuel Pérez-Carrasco, Maya Nasr, Zhan Zhang, Apisada Chulakadabba, Javier Roger, Raia Ottenheimer, Sébastien Roche, Maryann Sargent, Chris Chan Miller, Daniel Varon, Jack Warren, Luis Guanter, Kang Sun, Jonathan Franklin, Jia Chen, Cecilia Garraffo, Xiong Liu, Ritesh Gautam, Steven Wofsy

发表机构 * Center for Astrophysics | Harvard & Smithsonian(哈佛-史密松天体物理中心) Environmental Defense Fund(环境防御基金) Department of Earth and Planetary Sciences, Harvard University(哈佛大学地球与行星科学系) Institute of Environmental Physics (IUP), University of Bremen(不莱梅大学环境物理研究所) John A. Paulson School of Engineering and Applied Sciences, Harvard University(哈佛大学约翰·A·保罗森工程与应用科学学院)

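AI Summary: Proposes a machine learning framework combining Mask R-CNN, cross-sensor transfer learning, and physics-informed post-processing to address label scarcity and inference reliability in MethaneSAT methane plume detection, offering a high-sensitivity and a high-precision operating mode.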

Comments 35 pages, 20 figures, 9 tables

详情
AI中文摘要

从卫星图像中自动检测和掩膜单个甲烷羽流对于操作性的排放归因和量化至关重要。我们提出了一个机器学习框架,用于从MethaneSAT反演的柱平均干空气甲烷摩尔分数中检测羽流。我们解决了两个核心挑战:标记的MethaneSAT数据稀缺以及跨不同大气和地表条件的推理可靠性需求。我们首先证明,带有ResNet-50骨干网络的Mask R-CNN在MethaneAIR(MethaneSAT的机载版本)和MethaneSAT数据上均优于U-Net语义分割,像素级F1分数分别提升10.49和5.48。为解决MethaneSAT数据稀缺问题,我们评估了三种利用MethaneAIR飞行数据和合成羽流的跨传感器迁移策略。从MethaneAIR预训练权重微调的Mask R-CNN(ResNet-50)是最有效的策略,在基线操作点实现了0.60的实例级精度和接近完美的0.98召回率。一个物理信息后处理管道将检测结果转换为两种操作模式。第一种是高灵敏度模式,应用形态学滤波和基于邻近度的合并进行综合排放筛查,达到0.71的精度和0.94的召回率。第二种是高精度模式,额外应用基于分布的分类器进行可信源归因,达到0.92的精度和0.70的召回率。对基于小波的真实标签中被分类为假阳性的检测结果进行人工审查发现,相当一部分案例对应的是因保守标注标准而被排除的真实甲烷增强,表明报告的精度值是真实检测性能的下界。我们的数据和代码可在 https://doi.org/10.7910/DVN/FR959H 获取。

英文摘要

Automated detection and masking of individual methane plumes from satellite imagery is important for operational emission attribution and quantification. We present a machine learning framework for plume detection from MethaneSAT retrieved column-averaged dry-air mole fractions of methane. We address two core challenges: the scarcity of labeled MethaneSAT data and the need for inference reliability across diverse atmospheric and surface conditions. We first demonstrate that Mask R-CNN with a ResNet-50 backbone outperforms U-Net semantic segmentation on both MethaneAIR (an airborne version of MethaneSAT) and MethaneSAT data, with pixel-level F1 score gains of 10.49 and 5.48 respectively. To address MethaneSAT data scarcity, we evaluate three cross-sensor transfer strategies leveraging MethaneAIR flights and synthetic plumes. Mask R-CNN with ResNet-50 fine-tuned from MethaneAIR pre-trained weights is the most effective strategy, achieving instance-level precision of 0.60 and a near-perfect recall of 0.98 at the baseline operating point. A physics-informed post-processing pipeline converts detections into two operationally distinct modes. The first is a high-sensitivity mode that applies morphological filtering and proximity-based merging for comprehensive emission screening, achieving precision of 0.71 and recall of 0.94. The second is a high-precision mode that additionally applies a distribution-based classifier for confident source attribution, achieving precision of 0.92 and recall of 0.70. Manual review of detections classified as false positives against our wavelet-based ground truth labels reveals that a meaningful fraction of cases correspond to real methane enhancements excluded by conservative labeling criteria, indicating that precision values reported are lower bounds on true detection performance... Our data and code are available at: https://doi.org/10.7910/DVN/FR959H
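To compare the three instance-level operating points on one scale, the reported precision and recall pairs can be combined into F1 scores. The F1 values below are derived here from the numbers in the abstract; they are not reported by the authors, and they are distinct from the pixel-level F1 gains quoted for Mask R-CNN versus U-Net.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as stated in the abstract.
operating_points = {
    "baseline": (0.60, 0.98),
    "high-sensitivity": (0.71, 0.94),
    "high-precision": (0.92, 0.70),
}

f1_scores = {name: f1(p, r) for name, (p, r) in operating_points.items()}
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

On these figures, both post-processed modes land near an F1 of 0.80, up from roughly 0.74 at the baseline operating point, while trading precision against recall in opposite directions.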

2605.24251 2026-05-26 cs.LG cs.CV 版本更新

Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions

重新思考边缘上的持续异常检测:在现实工业条件下进行基准测试

Chad Weatherly, Sen Lin

发表机构 * University of Houston(休斯敦大学)

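AI Summary: To address shortcomings of existing continual anomaly detection methods in evaluation, comparison, and edge deployment constraints, proposes a unified benchmark and the training-free method DINOSaur, which outperforms all evaluated methods across the protocols and achieves fast inference and adaptation on edge devices.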

详情
AI中文摘要

持续异常检测(CAD)解决了工业检测系统适应不断变化的生产条件的需求,但现有方法存在三个关键差距:不现实的评估、缺乏系统比较以及未考虑边缘部署约束。我们引入了一个统一的基准,结合了结构和逻辑异常的离散任务评估、一种新颖的连续漂移协议、对所有已发布CAD方法的首次头对头比较,以及在边缘硬件上的计算效率分析。我们的结果表明,现有的CAD方法并不一致地优于带有简单经验重放的传统方法。受此启发,我们提出了DINOSaur,一种无需训练的方法,结合了冻结的DINOv3骨干网络、空间索引的coreset记忆和邻域限制的异常评分。DINOSaur通过构造实现了零遗忘,在所有五种协议上优于所有评估的方法,并在NVIDIA Jetson Orin Nano上以低于100毫秒的推理速度运行,在设备上适应新任务的时间不到30秒。

英文摘要

Continual anomaly detection (CAD) addresses the need for industrial inspection systems to adapt to evolving production conditions, yet existing methods share three critical gaps: unrealistic evaluation, no systematic comparison, and no consideration of edge deployment constraints. We introduce a unified benchmark combining discrete-task evaluation on structural and logical anomalies, a novel continuous drift protocol, the first head-to-head comparison of all published CAD methods, and computational efficiency profiling on edge hardware. Our results reveal that existing CAD methods do not consistently outperform traditional approaches with simple experience replay. Thus motivated, we propose DINOSaur, a training-free method combining a frozen DINOv3 backbone with spatially-indexed coreset memory and neighborhood-restricted anomaly scoring. DINOSaur achieves zero forgetting by construction, outperforms all evaluated methods across all five protocols, and runs at sub-100 ms inference on an NVIDIA Jetson Orin Nano, with on-device adaptation to new tasks in under 30 seconds.
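One possible reading of "spatially-indexed memory with neighborhood-restricted scoring" is sketched below. It assumes normal-patch features are stored per grid position and that a test patch is compared only against entries within a small spatial radius; the abstract does not give these details, so the data layout, the `radius` parameter, and the toy 2-D features (standing in for frozen-backbone features) are all hypothetical, and coreset selection is not shown.

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def anomaly_score(feat, pos, memory, radius=1):
    """Distance to the nearest stored normal feature, searching only
    memory entries whose grid position lies within `radius` of `pos`."""
    r, c = pos
    candidates = [
        f
        for (mr, mc), feats in memory.items()
        if abs(mr - r) <= radius and abs(mc - c) <= radius
        for f in feats
    ]
    if not candidates:
        return float("inf")
    return min(l2(feat, f) for f in candidates)

# Hypothetical memory: grid position -> list of normal features.
memory = {
    (0, 0): [[1.0, 0.0]],
    (0, 1): [[0.9, 0.1]],
    (5, 5): [[0.0, 1.0]],
}

normal_score = anomaly_score([1.0, 0.05], (0, 0), memory)
odd_score = anomaly_score([0.0, 1.0], (0, 0), memory)
```

In this toy setup the second query matches a stored feature exactly, but only at a distant grid position, so the spatial restriction still assigns it a high score. Because nothing is trained, adding a new task under this reading only appends entries to `memory`, which is consistent with the "zero forgetting by construction" claim.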

2605.24243 2026-05-26 cs.CV cs.AI stat.ML 版本更新

GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer

GIBLy: 通过架构无关的轻量级几何归纳偏置层改进3D语义分割

Diogo Lavado, Alessandra Micheletti, Clàudia Soares

发表机构 * NOVA School of Science and Technology(诺瓦科学与技术学校) Università degli Studi di Milano(米兰大学)

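AI Summary: Proposes GIBLy, a lightweight geometric inductive bias layer that improves 3D segmentation by integrating learnable geometric priors, yielding consistent gains on multiple benchmarks with only a small number of added parameters.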

详情
AI中文摘要

在3D场景理解中,深度学习模型依赖大型模型和大量训练来捕捉3D数据中存在的几何结构。然而,现有方法缺乏显式机制来融入几何信息(例如可学习的基元形状),往往需要更大的模型和更多的训练数据,这增加了成本并可能限制泛化能力。我们引入了GIBLy,一种轻量级几何归纳偏置层,将可学习的几何先验集成到3D分割流程中。GIBLy通过提供与简单几何形状(因此可解释)对齐的特征来增强现有架构——无论是基于MLP、卷积还是Transformer——以最小的计算开销提升分割性能。我们在多个3D语义分割基准上验证了我们的方法,展示了一致的性能提升,包括在TS40K上使用PTV3时mIoU提升高达+11.5%,而仅增加58K额外参数。我们的结果突显了显式编码几何结构以支持准确高效的3D场景理解的优势,且仅需一个轻量级的附加层。

英文摘要

In 3D scene understanding, deep learning approaches rely on large models and extensive training to capture basic geometric structures that are present in the 3D data. However, existing methods lack explicit mechanisms to incorporate geometric information, such as learnable primitive shapes, often necessitating large models and more training data, which in turn increases cost and can limit generalization. We introduce GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. GIBLy enhances existing architectures -- whether MLP-based, convolution-based, or transformer-based -- by providing features aligned with simple geometric shapes (and thus human-interpretable) that improve segmentation performance with minimal computational overhead. We validate our approach across multiple 3D semantic segmentation benchmarks, demonstrating consistent performance gains, including up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters. Our results highlight the benefit of explicitly encoding geometric structure to support accurate and efficient 3D scene understanding, with a lightweight add-on layer.

2605.24201 2026-05-26 cs.CV physics.med-ph 版本更新

Radiuma: A Unified Zero-Code Executable Graphical Workflow Generator for Reproducible and Shareable Medical Image Analysis and Machine Learning

Radiuma: 一个统一的无代码可执行图形工作流生成器,用于可重复和可共享的医学图像分析与机器学习

Mohammad Salmanpour, Mehrdad Oveisi, Isaac Shiri, Arman Rahmim

发表机构 * Department of Basic and Translational Research, BC Cancer Research Institute(基础与转化研究部,BC癌症研究中心) Department of Radiology, University of British Columbia(放射科,不列颠哥伦比亚大学) Technological Virtual Collaboration (TECVICO Corp.)(技术虚拟协作(TECVICO公司)) Department of Computer Science, University of British Columbia(计算机科学系,不列颠哥伦比亚大学) Department of Cardiology, Inselspital, Bern University Hospital, University of Bern(心脏病学部,Inselspital,伯恩大学医院,伯恩大学) Department of Digital Medicine, University of Bern(数字医学系,伯恩大学) Departments of Physics & Biomedical Engineering, University of British Columbia(物理学与生物医学工程系,不列颠哥伦比亚大学)

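AI Summary: Presents Radiuma, a modular platform that integrates image processing, radiomics feature extraction, and machine learning modules through a visual workflow system, allowing reproducible multi-step analysis pipelines to be built without programming.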

详情
AI中文摘要

医学图像计算软件对于识别支持诊断、预后、治疗计划和临床研究的成像生物标志物至关重要。然而,缺乏标准化、用户友好且可重复的软件环境限制了先进医学图像分析工作流的广泛采用。我们提出了Radiuma,一个免费可用的模块化平台,旨在支持跨多种模态和文件格式的可靠且可重复的医学图像分析。Radiuma集成了图像读取、可视化、配准、融合、处理、分割、放射组学特征提取以及用于分类、回归和聚类的机器学习模块。其模块化设计允许用户独立执行每个组件,或通过可视化工作流系统连接模块,其中一个步骤的输出可以图形化地传递到下一步。这使得无需大量编程专业知识即可创建自定义、可执行且可重复的多步骤流程。每个模块的结果可以直接在可视化窗口中检查,提供对处理质量和工作流准确性的即时反馈。Radiuma还支持保存和共享自定义工作流,促进协作研究中的透明度、可重用性和一致性。通过结合灵活性、易用性和标准化分析工具,Radiuma为临床和转化环境中的放射组学和机器学习研究提供了一个实用环境。该平台旨在面向具有不同专业知识的用户,包括放射科医生、物理学家、临床医生和数据科学家。

英文摘要

Medical image computing software is essential for identifying imaging biomarkers that can support diagnosis, prognosis, treatment planning, and clinical research. However, the lack of standardized, user-friendly, and reproducible software environments has limited the broader adoption of advanced medical image analysis workflows. We present Radiuma, a freely available modular platform designed to support reliable and reproducible medical image analysis across multiple modalities and file formats. Radiuma integrates image reading, visualization, registration, fusion, processing, segmentation, radiomics feature extraction, and machine learning modules for classification, regression, and clustering. Its modular design allows users to execute each component independently or connect modules through a visual workflow system, where the output of one step can be graphically passed to the next. This enables the creation of custom, executable, and reproducible multi-step pipelines without requiring extensive programming expertise. Results from each module can be inspected directly in the visualization window, providing immediate feedback on processing quality and workflow accuracy. Radiuma also supports saving and sharing customized workflows, promoting transparency, reusability, and consistency across collaborative studies. By combining flexibility, usability, and standardized analysis tools, Radiuma provides a practical environment for radiomics and machine learning research in clinical and translational settings. The platform is designed to be accessible to users with diverse expertise, including radiologists, physicists, clinicians, and data scientists.

2605.24195 2026-05-26 cs.CV cs.LG 版本更新

Single View Seafloor Recovery from Imaging Sonar via Differentiable Rendering

通过可微渲染从成像声纳进行单视图海底恢复

Sevan Brodjian, Michael Hobley, Pietro Perona

发表机构 * California Institute of Technology(加州理工学院)

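AI Summary: Proposes a training-free method that recovers seafloor terrain from a single sonar image in under 30 seconds via differentiable rendering, conditioned on a known seafloor tilt; the authors describe it as the first differentiable rendering approach for single-view height recovery in sonar.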

详情
AI中文摘要

由于光衰减和浑浊度,声纳通常是水下高分辨率成像的唯一合适模态。前视成像声纳提供距离和水平角度的测量,但将垂直结构压缩成平面图像,产生歧义,使得3D恢复具有挑战性。成像声纳的一个常见应用是水下地形测绘(测深),但目前的方法需要多个视图、昂贵的多传感器设置或大量训练数据,这限制了其使用和对新环境的适应性。我们提出了一种无需训练的方法,通过可微渲染在30秒内从单张声纳图像恢复测深,条件为已知的海底倾斜。据我们所知,这是声纳中单视图高度恢复的第一个可微渲染方法。我们的方法实现了可微声纳光线追踪,并优化显式高度场以重现目标图像。在合成数据集上,我们的方法在分布偏移下优于有监督的CNN,在粗糙地形上保持接近,而CNN在分布内获胜。通过建模声纳过程的物理基础先验,我们的方法无需训练数据即可适应不同的传感器配置和环境。

英文摘要

Sonar is often the only modality suitable for high-resolution imaging underwater due to light attenuation and turbidity. Forward-looking imaging sonar provides measurements over range and horizontal angle but collapses vertical structure into a flat image, creating ambiguities that make 3D recovery challenging. A common use case for imaging sonar is underwater terrain mapping (bathymetry), yet current methods require many views, expensive multi-sensor setups, or significant training data, which limits use and adaptability to new environments. We present a training-free method that recovers bathymetry from a single sonar image in under 30 seconds via differentiable rendering, conditioned on a known seafloor tilt. To our knowledge, this is the first differentiable rendering approach for single-view height recovery in sonar. Our method implements differentiable sonar ray tracing and optimizes an explicit height field to reproduce the target image. On synthetic datasets, our approach outperforms a supervised CNN under distribution shift and remains close on rough terrain, while the CNN wins in-distribution. By modeling physically grounded priors of the sonar process, our method adapts across sensor configurations and environments without training data.

2605.24192 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization

滤波后验均值集合:扩散泛化分析模型的统一框架

Matthew Niedoba, Berend Zwartsenberg, Frank Wood

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Inverted AI Alberta Machine Intelligence Institute(阿尔伯塔机器智能研究所)

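AI Summary: Proposes Filtered Posterior Mean Collections (FPMCs), a unified framework that models the generalization behaviour of diffusion-model denoising functions through query precision vectors, response weights, and source distributions, and improves existing methods via soft relaxations and source-distribution augmentations.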

Comments 27 Pages, 7 figures

详情
AI中文摘要

作为图像扩散模型骨干的神经网络去噪函数,在多种网络架构和训练超参数下展现出显著一致的泛化行为。最近一系列研究试图通过聚合训练数据集补丁的后验加权平均值来建模这些网络的输出。在本工作中,我们将这些方法整合为一个统一的模型类,称为滤波后验均值集合(FPMC)。我们使用查询精度向量、响应权重和源分布定义该模型类,并说明现有方法可通过这些设计轴的具体选择恢复。依次研究每个轴,我们发现FPMC性能可以通过对先前基于补丁的方法进行软松弛以及通过源分布的增强来改进。将这些发现应用于现有的FPMC,我们在三个自然图像数据集上展示了样本的一致改进。

英文摘要

The neural-network denoising functions which form the backbone of image diffusion models are remarkably consistent in their generalization behaviour across a wide variety of network architectures and training procedure hyperparameters. A recent line of research has sought to model the outputs of these networks by aggregating posterior weighted averages of training dataset patches. In this work, we consolidate these approaches into a unified model class which we call Filtered Posterior Mean Collections (FPMCs). We define this model class using query precision vectors, response weights, and source distributions, and illustrate that existing methods are recoverable with specific choices of these design axes. Investigating each axis in turn, we find that FPMC performance can be improved with soft relaxations of prior patch-based methods, and through augmentations of source distributions. Applying these findings to an existing FPMC, we demonstrate consistent sample improvement across three natural image datasets.
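The "posterior weighted average" that these models build on has a standard closed form when the data distribution is the empirical training set. The sketch below shows only that whole-sample building block, assuming additive Gaussian noise `x_t = x_0 + sigma * noise`; it does not implement the patch-based filtering, query precision vectors, or response weights of FPMCs, and the toy 2-D data points are hypothetical.

```python
import math

def posterior_mean(x_t, data, sigma):
    """E[x_0 | x_t] under an empirical data distribution, assuming
    x_t = x_0 + sigma * noise with standard Gaussian noise."""
    # Log-weight of each training point: -||x_t - x_i||^2 / (2 sigma^2).
    logw = [
        -sum((a - b) ** 2 for a, b in zip(x_t, x_i)) / (2 * sigma ** 2)
        for x_i in data
    ]
    m = max(logw)  # subtract the max before exponentiating, for stability
    w = [math.exp(lw - m) for lw in logw]
    z = sum(w)
    return [
        sum(wi * x_i[d] for wi, x_i in zip(w, data)) / z
        for d in range(len(x_t))
    ]

data = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]  # toy "training set"
x_t = [0.9, 1.2]

low_noise = posterior_mean(x_t, data, sigma=0.1)     # close to [1.0, 1.0]
high_noise = posterior_mean(x_t, data, sigma=100.0)  # close to the data mean
```

At low noise the weights collapse onto the nearest training point, and at high noise they flatten toward the dataset mean. This exact posterior mean essentially reproduces training samples, which is one motivation for the patch-wise and relaxed variants the abstract describes.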

2605.24176 2026-05-26 cs.CV 版本更新

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

Loki:基于表示而非架构的扩散肖像动画

Pouyan Navard, Sernam Lim

发表机构 * The Ohio State University(俄亥俄州立大学) University of Central Florida(中央佛罗里达大学)

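AI Summary: Proposes Loki, which uses a parametric face model to decouple identity from expression and pose and preserves identity through lightweight key-value injection, improving driver-following in diffusion-based portrait animation while reducing parameters and training data.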

详情
AI中文摘要

肖像动画将驱动片段的面部表情和头部姿态迁移到单个参考图像上,同时保留参考图像的身份。最先进的扩散系统通过依次堆叠用于表情、姿态和身份的训练模块来解决这一问题,但付出了可训练参数、专有语料库以及系统本应独立控制的轴之间的残余纠缠的代价。这种复杂性补偿了一个上游选择——从RGB中学习面部表情和头部姿态,而RGB是一种身份、姿态和表情无法分离的表示,除非分别学习。Loki在条件路径上跳出RGB。驱动表情和头部姿态由一个面部模型编码,该模型的参数轴在构造上与身份正交,然后光栅化为扩散骨干网络原生消费的空间图。身份通过轻量级键值注入,经由扩散骨干网络自身的预训练特征单独路由。由于参数化表示将身份与表情和姿态解耦,跨身份重演在推理时简化为系数替换,无需跨身份训练数据。Loki所需的推理参数比领先的扩散基线少约43%,并且训练所用的视频样本少1496倍。我们定义了两个指标,直接衡量生成的头部位姿轨迹和面部表情是否跟随驱动——这正是肖像动画所关注的问题;Loki在这两个指标上领先或并列领先。

英文摘要

Portrait animation transfers a driver clip's facial expression and head pose onto a single reference image while preserving the reference's identity. State-of-the-art diffusion systems address this by stacking trained modules for expression, pose, and identity in turn, paying for it in trainable parameters, proprietary corpora, and residual entanglement between the very axes the system is meant to control independently. This complexity compensates for an upstream choice -- learning facial expression and head pose from RGB, a representation in which identity, pose, and expression are inseparable without being learned apart. Loki steps out of RGB on the conditioning path. Driver expression and head pose are encoded by a face model whose parameter axes are identity-orthogonal by construction, then rasterised into a spatial map that the diffusion backbone consumes natively. Identity is routed separately through the diffusion backbone's own pretrained features via lightweight key-value injection. Because the parametric representation factorises identity from expression and pose, cross-ID reenactment reduces to a coefficient substitution at inference, requiring no cross-ID training data. Loki requires ~43% fewer inference parameters than leading diffusion baselines and is trained on 1496x fewer video samples. We define two metrics that directly measure whether the generated head pose trajectory and facial expression follow the driver's -- the questions portrait animation actually asks; Loki leads or co-leads on both.

2605.24159 2026-05-26 cs.CV 版本更新

EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

EchoVQA:为床旁心脏超声提供对话式辅助

Filippos Bellos, Yutong Li, Jessie N Dong, Zaiyang Guo, Emily Mackay, Yayuan Li, Yannis Avrithis, Alison Pouch, Jason J. Corso

发表机构 * University of Michigan(密歇根大学) University of Pennsylvania(宾夕法尼亚大学) Independent Scientist(独立科学家)

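AI Summary: To address the expertise required for point-of-care cardiac ultrasound acquisition and interpretation, introduces EchoVQA, the first large-scale echocardiography VQA dataset (14,299 images and 74,819 question-answer pairs), and develops a parameter-efficient method based on multimodal learnable prompts that achieves state-of-the-art performance on most benchmarks.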

详情
AI中文摘要

床旁经胸超声心动图(TTE)几乎可在任何临床环境中进行心脏评估,但其诊断效用仍受限于图像获取和解读所需的专业知识。视觉问答(VQA)为通过交互式临床辅助弥合这一专业知识差距提供了有前景的范式,但现有的超声心动图VQA数据集规模有限、仅限于高质量图像,且仅覆盖少数视图。我们提出了EchoVQA,首个大规模超声心动图VQA数据集,包含14,299张图像和74,819个问答对。该数据集整合了公共来源(EchoNet-Dynamic、CAMUS)和我们使用两种手持探头(Lumify、Clarius)自行采集的床旁图像,涵盖多种视图,包括高质量和次优图像。独特的是,EchoVQA包含采集指导问题,帮助用户优化探头位置以获得用于左心室射血分数评估的诊断性心尖四腔心视图——这对床旁环境中的新手操作者来说是一项具有挑战性的任务。我们进一步开发了一种基于多模态可学习提示的参数高效方法,在包括EchoVQA在内的大多数基准上取得了最先进的性能,且可训练参数显著少于现有最先进方法。

英文摘要

Point-of-care transthoracic echocardiography (TTE) enables cardiac assessment in virtually any clinical setting, yet its diagnostic utility remains constrained by the expertise required for image acquisition and interpretation. Visual question answering (VQA) offers a promising paradigm for bridging this expertise gap through interactive clinical assistance, but existing echocardiography VQA datasets are limited in scale, restricted to high-quality images, and only cover a few views. We introduce EchoVQA, the first large-scale VQA dataset for echocardiography, comprising 14,299 images and 74,819 question-answer pairs. The dataset integrates public sources (EchoNet-Dynamic, CAMUS) with our own point-of-care acquisitions from two handheld probes (Lumify, Clarius), spanning diverse views and including both high-quality and suboptimal images. Uniquely, EchoVQA includes acquisition guidance questions to help users optimize transducer positioning toward a diagnostic apical 4-chamber view for left ventricular ejection fraction estimation -- a challenging task for novice operators in point-of-care settings. We further develop a parameter-efficient method based on multimodal learnable prompts, achieving state-of-the-art performance on most benchmarks, including EchoVQA, with significantly fewer trainable parameters than existing state-of-the-art approaches.

2605.24128 2026-05-26 cs.CV 版本更新

ImPartial: Multi-channel Whole-Cell Segmentation using Partial Annotations

ImPartial: 使用部分注释的多通道全细胞分割

Gunjan Shrivastava, Saad Nadeem

发表机构 * Memorial Sloan Kettering Cancer Center(纪念斯隆凯特琳癌症中心)

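AI Summary: Proposes ImPartial, a framework that uses self-supervised multi-channel quantized imputation to achieve whole-cell segmentation performance on par with fully supervised models under sparse annotations and limited supervision.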

Comments MICCAI'26 Early Accept

详情
AI中文摘要

病理图像中准确的细胞分割通常需要密集的像素级注释,这既昂贵又耗时。这一挑战对于新兴的生物成像模态和具有可变通道配置的多重数据集尤其重要,因为这些数据中专家标注的数据很少。在这项工作中,我们引入了ImPartial,这是一个深度学习框架,旨在使用稀疏涂鸦和有限监督在低标注条件下实现最先进的分割性能。ImPartial通过自监督多通道量化插值增强了分割目标。该方法利用了以下观察结果:精确的像素级重建或图像去噪对于准确分割并非必需,因此引入了一个与整体分割目标更一致的自监督分类目标。我们证明,ImPartial在需要显著更少注释的情况下实现了与全监督模型相当的性能。在基准多重细胞成像和单重临床明场免疫组化数据集上的大量实验表明,仅使用部分注释,ImPartial相对于强基线有一致的改进。所有基准数据集和代码均可通过我们的GitHub获取:https://github.com/nadeemlab/ImPartial。

英文摘要

Accurate cell segmentation in pathology images typically requires dense pixel-wise annotations, which are costly and time-consuming to obtain. This challenge is especially important for emerging biological imaging modalities and multiplexed datasets with variable channel configurations, where expert-labeled data are scarce. In this work, we introduce ImPartial, a deep learning framework designed to achieve state-of-the-art segmentation performance in low-annotation regimes using sparse scribbles and limited supervision. ImPartial augments the segmentation objective via self-supervised multi-channel quantized imputation. This approach leverages the observation that perfect pixel-wise reconstruction or denoising of the image is not needed for accurate segmentation, and thus introduces a self-supervised classification objective that better aligns with the overall segmentation goal. We demonstrate that ImPartial achieves performance on par with fully supervised models while requiring substantially fewer annotations. Extensive experiments on benchmark multiplexed cellular imaging and single-plex clinical brightfield immunohistochemistry datasets show consistent improvements over strong baselines with only partial annotations. All benchmark datasets and code are available via our GitHub: https://github.com/nadeemlab/ImPartial.

2605.24114 2026-05-26 cs.CV 版本更新

COSY: Compositional 3DGS Synthesis for Disentangled Human Head Editing

COSY: 用于解耦人像编辑的组合式3DGS合成

Florian Barthel, Shalini De Mello, Koki Nagano, Wieland Morgenstern, Anna Hilsmann, Peter Eisert

发表机构 * Fraunhofer Heinrich-Hertz Institute, Germany(弗劳恩霍夫 Heinrich-Hertz 研究所,德国) Humboldt University Berlin, Germany(柏林洪堡大学,德国) NVIDIA, USA(NVIDIA,美国)

AI总结 提出一种组合式3DGS生成器架构,通过独立合成头发、皮肤、眼镜和躯干等组件实现语义属性的精确解耦编辑,无需分割掩码或几何先验。

详情
AI中文摘要

近期用于人像的3D高斯泼溅(3DGS) GAN能够实时合成和渲染逼真的3D模型,并在身份和外观上提供丰富的多样性。然而,控制特定语义属性(如发色或眼镜)仍然具有挑战性,因为纠缠的潜在空间中的编辑常常导致身份或外观的意外变化。尽管有几种方法旨在通过估计仅修改特定特征的方向来解耦训练后的潜在空间,但这些方法无法保证完全解耦,并且通常需要预训练的分类器。在我们的方法中,我们提出了一种新的生成器架构,该架构完全独立地合成组件,例如头发、皮肤、眼镜和躯干。这使得可以更改一个区域的潜在向量,同时保持其余部分固定。此外,我们仅使用稀疏信息(如头发或皮肤颜色)实现这种分离,消除了先前工作中常见的分割掩码或几何先验的需求。为了确保编辑过程中形状和光照条件的匹配,我们允许独立生成器之间通过上下文令牌共享最少的信息。这些令牌甚至使我们能够控制形状和光照,而无需任何先验注释。与现有的基于GAN的生成和编辑工作相比,我们的方法显示出更好的解耦性、更精确的编辑控制和有竞争力的视觉质量。

英文摘要

Recent 3D Gaussian Splatting (3DGS) GANs for human heads synthesize and render photorealistic 3D models in real-time and offer a vast variety in identity and appearance. However, controlling specific semantic attributes such as hair color or glasses remains challenging, as edits in the entangled latent space often induce unintended changes in identity or appearance. Although there are several methods that aim to disentangle the latent space post training by estimating directions that only modify certain features, these methods cannot guarantee complete disentanglement and often require pre-trained classifiers. In our approach, we propose a new generator architecture that synthesizes components, such as hair, skin, glasses, and torso, completely independently. This allows for changing the latent vector for one region while keeping the remaining parts fixed. Further, we achieve this separation using only sparse information such as the hair or skin color, eliminating the requirement of segmentation masks or geometric priors, often seen in prior work. To ensure matching shape and lighting conditions during editing, we allow minimal shared information via context tokens between the independent generators. These tokens even allow us to control the shape and light, without any prior annotation. Compared to existing works on GAN-based generation and editing, our method shows better disentanglement, more precise editing control, and competitive visual quality.

2605.24098 2026-05-26 cs.CV 版本更新

D2-V2X: Depth-Driven Cooperative V2X Reasoning for Autonomous Driving

D2-V2X: 面向自动驾驶的深度驱动协同V2X推理

Kevin Richard, Alphin Varghese, Colin Pham, David Oh, Srijan Das

发表机构 * University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

AI总结 针对单车辆视觉语言模型受传感器遮挡限制的问题,提出D2-V2X基准和基线模型,通过融合3D LiDAR特征与VLM潜空间,利用链式思维推理实现遮挡目标识别和空间估计,在识别遮挡危险和降低空间估计误差上取得显著提升。

Comments Accepted to the DriveX Workshop at CVPR 2026 (Non-archival)

详情
AI中文摘要

单车辆视觉语言模型(VLM)从根本上受到传感器遮挡的限制。虽然车联万物(V2X)系统缓解了这一问题,但当前基准缺乏解决复杂环境中歧义所需的协同推理。我们引入了D2-V2X,一个空间感知的问题-依据-答案(QRA)基准,包含来自多模态车辆和基础设施传感器的8,500个三元组。我们还建立了一个基线,将3D LiDAR特征与VLM的潜空间对齐。通过在结构化JSON输出之前强制使用自然语言链式思维推理,我们的模型被迫明确表达空间关系。实验表明,将VLM建立在协同LiDAR之上,在识别遮挡危险时实现了24.4%的召回率(零样本模型的召回率几乎为零),并将可见物体的空间估计误差相比零样本基线降低了77%。虽然模型达到了53.5的功能性决策F1分数,但我们识别出3D到2D投影是当前VLM架构的基本瓶颈,为未来创新建立了新基线。数据、代码和训练模型可在 https://github.com/KevinRichard1/D2-V2X 获取。

英文摘要

Single-vehicle Vision-Language Models (VLMs) are fundamentally constrained by sensor occlusions. While Vehicle-to-Everything (V2X) systems mitigate this, current benchmarks lack the cooperative reasoning required for resolving ambiguities in complex environments. We introduce D2-V2X, a spatially-aware Question-Rationale-Answer (QRA) benchmark featuring 8,500 triplets derived from multimodal vehicle and infrastructure sensors. We additionally establish a baseline that aligns 3D LiDAR features with the VLM's latent space. By enforcing natural language Chain-of-Thought rationales prior to structured JSON outputs, our model is forced to explicitly articulate spatial relations. Our experiments demonstrate that grounding VLMs in cooperative LiDAR achieves 24.4% recall in identifying occluded hazards, compared to near-zero recall for zero-shot models, and reduces spatial estimation error for visible objects by 77% compared to the zero-shot baseline. While the model achieves a functional decision-making F1-score of 53.5, we identify 3D-to-2D projection as a fundamental bottleneck in current VLM architectures, establishing a new baseline for future innovation. Data, code, and trained models are available at https://github.com/KevinRichard1/D2-V2X

2605.24074 2026-05-26 cs.CV cs.RO 版本更新

WideDepth: Millimeter-Accurate Benchmark for Fisheye Depth Estimation

WideDepth: 用于鱼眼深度估计的毫米级精度基准

Ilia Indyk, Ignat Penshin, Ivan Sosin, Maxim Monastyrny, Aleksei Valenkov, Ilya Makarov

发表机构 * Robotics Center(机器人中心) AXXX Trusted AI Research Center, RAS(可信人工智能研究中心,俄罗斯科学院)

AI总结 提出首个室内鱼眼深度估计数据集WideDepth,包含101个场景的5K高分辨率立体对和毫米级真值,并引入基于LiDAR的立体鱼眼图像生成方法,评估多种模型,微调后性能提升高达62%。

Comments Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

详情
AI中文摘要

鱼眼相机在机器人领域的近场操作、导航和沉浸式感知中应用日益广泛,但缺乏具有精确真值的室内深度基准。为此,我们引入WideDepth——首个用于鱼眼深度估计的室内数据集,包含101个场景的5K高分辨率立体对,标注了毫米级地面真值深度和视差。我们的数据集还包括在水平和垂直立体设置中,不同视场和基线下的配对针孔和鱼眼样本。我们进一步提出一种方法,将针孔训练的立体模型适配到鱼眼图像,并引入一种基于高分辨率LiDAR扫描的新型立体鱼眼图像生成流程。利用这些方法,我们在基准上全面评估了最先进的单目深度、立体匹配和深度补全模型。此外,我们提供了18K LiDAR导出的稀疏深度训练样本,在微调基于针孔的立体模型时,鱼眼数据性能提升高达62%。总之,我们基准的高精度和多功能性为推进鱼眼深度估计和机器人感知研究奠定了坚实基础。项目页面:https://ilyaind.github.io/WideDepth

英文摘要

Fisheye cameras are increasingly adopted in robotics for near-field manipulation, navigation, and immersive perception, yet indoor depth benchmarks with accurate ground truth are still missing. To address this, we introduce WideDepth, the first indoor dataset for fisheye depth estimation, featuring 101 scenes containing 5K high-resolution stereo pairs labeled with millimeter-level ground truth depth and disparity. Our dataset also includes paired pinhole and fisheye samples across varying fields of view and baselines in both horizontal and vertical stereo setups. We further propose a method to adapt pinhole-trained stereo models to fisheye images and introduce a novel stereo fisheye image generation pipeline based on high-resolution LiDAR scans. Leveraging these methods, we thoroughly evaluate state-of-the-art monocular depth, stereo matching, and depth completion models on our benchmark. Additionally, we provide 18K LiDAR-derived sparse depth training samples, achieving up to a 62% performance boost on fisheye data when fine-tuning pinhole-based stereo models. In summary, the high precision and versatility of our benchmark set a strong foundation for advancing research in fisheye depth estimation and robotics perception. Project page: https://ilyaind.github.io/WideDepth

2605.24066 2026-05-26 cs.CV 版本更新

Distance-Aware Joint Spatio-Temporal Graph Contrastive Learning for Major Depressive Disorder Diagnosis

距离感知的联合时空图对比学习用于重度抑郁症诊断

Muhammad Asif Hasan, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

发表机构 * School of Information Technology and Communication, Griffith University, Australia(信息科技与通信学院,格里菲斯大学,澳大利亚)

AI总结 针对动态功能连接在重度抑郁症诊断中的噪声、频域信息利用不足及时空分离建模问题,提出基于霍克斯过程先验的联合时空图对比学习框架HWSTCL,通过谱节点描述符、指数距离衰减边权重和核加权对比目标,实现可靠时空表示并提升诊断性能。

详情
AI中文摘要

重度抑郁症(MDD)是一种常见的神经精神疾病,其基于静息态功能磁共振成像(rs-fMRI)的准确诊断仍然困难。动态功能连接(DFC)捕捉脑区间的时变交互,提供丰富的时空信息,但当前基于DFC的方法面临三个限制:滑动窗口Pearson相关产生对窗口长度和运动伪影敏感的噪声估计;相关导出的节点特征未充分利用血氧水平依赖(BOLD)信号的频域特性;大多数时空图模型在分离阶段处理空间结构和时间动态,限制了它们表示耦合脑网络演化的能力。为克服这些问题,我们将DFC学习重新表述为在霍克斯过程启发的时间依赖性先验下的联合时空图表示学习,并提出HWSTCL,一个基于可靠性精炼联合时空图和核加权预训练目标的两阶段框架。在每个时间窗口内,BOLD信号被编码为谱节点描述符,功能边通过指数距离衰减先验进行精炼,该先验降低不可靠长程连接的权重。然后通过霍克斯启发的指数核将每个区域与未来窗口中的自身连接形成联合图,使得在消息传递过程中空间和时间信息可以一起传播。核加权对比目标进一步促进每个区域跨窗口的时间一致性,同时减少不同区域间的冗余相似性。在基准rs-fMRI数据集上的实验表明,HWSTCL优于最近的基线方法,并为MDD诊断生成连贯的时空表示。

英文摘要

Major depressive disorder (MDD) is a common neuropsychiatric condition whose accurate diagnosis from resting-state functional magnetic resonance imaging (rs-fMRI) remains difficult. Dynamic functional connectivity (DFC) captures time-varying interactions among brain regions and provides rich spatio-temporal information, yet current DFC-based methods face three limitations: sliding-window Pearson correlation yields noisy estimates sensitive to window length and motion artifacts; correlation-derived node features do not fully exploit frequency-domain properties of blood-oxygen-level-dependent (BOLD) signals; and most spatio-temporal graph models handle spatial structure and temporal dynamics in separate stages, restricting their ability to represent coupled brain network evolution. To overcome these issues, we reformulate DFC learning as joint spatio-temporal graph representation learning under a Hawkes-process-inspired temporal dependency prior and propose HWSTCL, a two-stage framework built on a reliability-refined joint spatio-temporal graph with a kernel-weighted pretraining objective. Within each temporal window, BOLD signals are encoded as spectral node descriptors and functional edges are refined by an exponential distance-decay prior that down-weights less reliable long-range connections. The joint graph is then formed by linking each region to itself across future windows through a Hawkes-inspired exponential kernel, allowing spatial and temporal information to be propagated together during message passing. A kernel-weighted contrastive objective further promotes temporal consistency for each region across windows while reducing redundant similarity between different regions. Experiments on a benchmark rs-fMRI dataset show that HWSTCL outperforms recent baselines and yields coherent spatio-temporal representations for MDD diagnosis.
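
The two exponential priors mentioned in the abstract can be illustrated with a small sketch. The functional forms, the parameter names `sigma` and `beta`, and the toy matrices below are assumptions chosen for illustration; the abstract only states that edges are refined by an exponential distance-decay prior and that each region is linked to itself in future windows through a Hawkes-inspired exponential kernel.

```python
import numpy as np

def refine_edges(corr, dist, sigma=50.0):
    """Down-weight long-range functional edges with an exponential distance decay.

    corr: (R, R) functional edge weights; dist: (R, R) inter-region distances.
    sigma is a hypothetical decay length, not a value from the paper.
    """
    return corr * np.exp(-dist / sigma)

def temporal_kernel(num_windows, beta=0.5):
    """Weight linking a region in window t to itself in a later window s > t.

    Entry [t, s] is exp(-beta * (s - t)) for s > t and 0 otherwise,
    so links only point from a window to future windows.
    """
    t = np.arange(num_windows)
    lag = t[None, :] - t[:, None]  # lag[t, s] = s - t
    return np.where(lag > 0, np.exp(-beta * lag), 0.0)

# Toy example: 3 regions with equal correlation but different distances, 4 windows.
corr = np.array([[1.0, 0.8, 0.8],
                 [0.8, 1.0, 0.8],
                 [0.8, 0.8, 1.0]])
dist = np.array([[0.0, 10.0, 100.0],
                 [10.0, 0.0, 90.0],
                 [100.0, 90.0, 0.0]])
edges = refine_edges(corr, dist)
kernel = temporal_kernel(4)
```

In this toy setup the nearby pair (0, 1) keeps a larger edge weight than the distant pair (0, 2), and temporal links shrink as the gap between windows grows.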

2605.24065 2026-05-26 cs.CV 版本更新

fMRI-Diffusion: Generating fMRI Time Series Via a Temporal Transformer Diffusion Model for Major Depressive Disorder Diagnosis

fMRI-Diffusion: 用于重度抑郁症诊断的基于时间Transformer扩散模型的fMRI时间序列生成

Muhammad Asif Hasan, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

发表机构 * School of Information and Communication Technology, Griffith University(信息与通信技术学院,格里菲斯大学)

AI总结 提出fMRI-Diffusion框架,通过时间Transformer扩散模型合成ROI级fMRI时间序列而非功能连接矩阵,以保留时间信息并提升小样本下MDD诊断准确率。

详情
AI中文摘要

使用功能连接分析从功能磁共振成像诊断重度抑郁症需要大量标记数据,而临床环境中这些数据稀缺。现有的增强方法合成FC矩阵,将fMRI记录压缩为静态成对摘要并丢弃时间信息。我们提出fMRI-Diffusion,一个合成ROI级fMRI时间序列而非FC矩阵的框架。时间Transformer作为去噪扩散概率模型中的去噪网络,将每个时间点视为一个token,通过自注意力捕获时间依赖。监督预训练策略在扩散训练前用任务相关表示初始化Transformer,并从合成时间序列导出FC矩阵用于分类。在REST-meta-MDD数据集上的实验表明,用合成时间序列增强训练数据在十个分类器、六个分区图谱和三个采集站点上一致提高了诊断准确率。该方法优于五种最新的基于FC的合成方法,比最强基线准确率提升高达3.7个百分点。消融研究证实了基于Transformer的去噪器和预训练策略的贡献。所有条件下的分布保真度指标均低于0.06,表明真实分布与合成分布高度一致。这些发现表明,在FC计算之前合成fMRI时间序列保留了矩阵级增强中丢失的时间信息,为有限数据下的MDD诊断提供了实用策略。

英文摘要

Diagnosing Major Depressive Disorder (MDD) from functional magnetic resonance imaging (fMRI) using functional connectivity (FC) analysis requires large amounts of labeled data that are scarce in clinical settings. Existing augmentation methods synthesize FC matrices, which compress fMRI recordings into static pairwise summaries and discard temporal information. We propose fMRI-Diffusion, a framework that synthesizes region-of-interest (ROI)-level fMRI time series rather than FC matrices. A Temporal Transformer serves as the denoising network within a denoising diffusion probabilistic model, treating each time point as a token to capture temporal dependencies through self-attention. A supervised pretraining strategy initializes the Transformer with task-relevant representations before diffusion training, and FC matrices are derived from the synthesized time series for classification. Experiments on the REST-meta-MDD dataset show that augmenting training data with synthetic time series consistently improves diagnostic accuracy across ten classifiers, six parcellation atlases, and three acquisition sites. The method outperforms five recent FC-based synthesis approaches, with accuracy gains of up to 3.7 percentage points over the strongest baseline. Ablation studies confirm the contributions of both the Transformer-based denoiser and the pretraining strategy. Distributional fidelity metrics remain below 0.06 across all conditions, indicating close agreement between real and synthetic distributions. These findings suggest that synthesizing fMRI time series before FC computation preserves temporal information lost in matrix-level augmentation and provides a practical strategy for MDD diagnosis under limited data.
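
The claim that FC matrices discard temporal information can be checked with a short sketch. It assumes FC is computed as the Pearson correlation between ROI time series, which is a common choice but is not spelled out in the abstract; the random data and array sizes are placeholders.

```python
import numpy as np

def functional_connectivity(ts):
    """Pearson-correlation FC matrix.

    ts: array of shape (time_points, rois). Returns an (rois, rois) matrix.
    """
    return np.corrcoef(ts, rowvar=False)

rng = np.random.default_rng(0)
ts = rng.standard_normal((120, 5))   # placeholder ROI time series
fc = functional_connectivity(ts)

# Shuffling the time points destroys temporal order but leaves FC unchanged.
perm = rng.permutation(ts.shape[0])
fc_shuffled = functional_connectivity(ts[perm])
```

Because the correlation only depends on which values co-occur at the same time point, any reordering of time points gives the same matrix, which is why synthesizing the time series itself retains information that matrix-level augmentation cannot.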

2605.24047 2026-05-26 cs.CV 版本更新

EMMA: Extracting Multiple physical parameters from Multimodal Data

EMMA: 从多模态数据中提取多个物理参数

Farhat Shaikh, Ayan Banerjee, Sandeep Gupta

发表机构 * IMPACT Lab, School of Computing & Augmented Intelligence (SCAI)(IMPACT实验室,计算与增强智能学院) Arizona State University, Tempe, AZ(亚利桑那州立大学,Tempe)

AI总结 提出EMMA框架,利用物理信息多模态融合和LTC网络,从原始视频、音频和图像时间序列中联合推断系统动力学参数,无需先验条件或专用传感器,在100+场景中优于单模态方法。

Comments Accepted at CVPR 2026 (main conference)

详情
AI中文摘要

我们引入了EMMA,一个基于物理信息的多模态框架,能够直接从原始视频、音频和基于图像的时间序列观测中恢复系统的所有可识别动力学参数。与先前仅依赖视频的方法不同,这些方法难以处理遮挡状态、隐藏驱动输入或对已知初始条件和坐标系的假设,EMMA在统一的连续时间模型中对显式参数、隐式动力学分量和校准不变量进行联合推断。EMMA利用液态时间常数(LTC)网络从异构模态中学习潜在动力学,同时物理约束损失强制与支配微分方程保持一致。统一的特征管道实现了视频轨迹、声学特征和图表测量之间的一致对齐,使得EMMA能够在受迫、隐式和多元动力学下估计参数,无需分割掩码、可微渲染或专用传感器。在100多个场景中,包括五个标准动力学基准(75个Delfys视频)、具有隐藏输入的真实世界轮式机器人和四旋翼系统,以及涵盖生物和混沌系统的模拟图表案例研究,EMMA实现了稳健的多参数恢复,并显著优于现有的单模态和方程发现基线。我们的结果确立了EMMA作为从机会性多模态数据中进行物理一致模型提取的通用、可扩展解决方案。代码和数据可在 https://github.com/ImpactLabASU/EMMA-CVPR2026 获取。

英文摘要

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data. Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026

2605.24040 2026-05-26 cs.CV 版本更新

Learning to See Like Humans: Gaze-Aligned Cycling Safety Prediction

学习像人类一样看:基于注视对齐的骑行安全预测

Luís Maria Perdigão, Miguel Costa, Carlos Santiago, Manuel Marques

发表机构 * Technical University of Denmark, Kongens Lyngby, Denmark(丹麦技术大学,孔恩斯灵比,丹麦)

AI总结 提出眼动追踪引导的感知骑行安全框架(EG-PCS),通过将注视数据集成到基于视觉Transformer的成对学习流程中,使模型注意力与人类注视模式对齐,提升预测准确性和可解释性。

Comments Accepted to be published as part of the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC), Naples, Italy, September 15-18, 2026

详情
AI中文摘要

骑行带来显著的公共健康和环境效益,但在城市中的普及常受限于感知安全性。当街道环境看起来不安全时,人们骑行的可能性降低,因此感知成为采用骑行的关键障碍。近期研究表明,街景图像的成对比较为学习主观安全判断提供了一种可扩展的方法。然而,现有方法未明确建模人类视觉注意力,而注意力在人类感知安全中起核心作用。我们提出眼动追踪引导的感知骑行安全框架(EG-PCS),该框架将注视数据集成到基于视觉Transformer的成对学习流程中。通过用眼动追踪信号监督模型的注意力机制,我们促使学习到的注意力图与人类注视模式对齐。实验表明,与最先进方法相比,注视引导模型在实现相似排序性能的同时,生成的注意力图更准确地反映人类视觉注意行为。我们的结果表明,在基于感知的城市分析中融入眼动追踪信息可提升预测准确性和可解释性。

英文摘要

Cycling delivers significant public-health and environmental benefits, yet its uptake in cities is often limited by perceived safety. When street environments appear unsafe, individuals are less likely to cycle, making perception a key barrier to adoption. Recent work has shown that pairwise comparisons of street-view images provide a scalable way to learn subjective safety judgments. However, existing approaches do not explicitly model human visual attention, which plays a central role in how humans perceive safety. We propose an Eye-Tracking-Guided Perceived Cycling Safety framework (EG-PCS) that integrates gaze data into a pairwise learning pipeline based on vision transformers. By supervising the model's attention mechanism with eye-tracking signals, we encourage alignment between learned attention maps and human fixation patterns. Experiments show that gaze-guided models achieve similar ranking performance compared to state-of-the-art approaches while producing attention maps that more accurately reflect human visual attention behavior. Our results demonstrate that incorporating eye-tracking information enhances both predictive accuracy and interpretability in perception-based urban analytics.

2605.24037 2026-05-26 cs.CV cs.AI 版本更新

Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling

模式即序列:将多模态运动预测转化为统一序列模式建模

Zikang Zhou, Haibo Hu, Xinhong Chen, Yifan Zhang, Nan Guan, Yung-Hui Li, Chun Jason Xue, Jianping Wang

发表机构 * City University of Hong Kong(香港城市大学) City University of Hong Kong (Dongguan)(香港城市大学(东莞)) Hon Hai Research Institute(鸿海研究院) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出Mode-as-Sequence框架,将无序模式集转化为有序模式序列并显式建模模式间依赖,通过ModeSeq和Parallel ModeSeq两种实例化方法解决多模态运动预测中的模式坍塌和置信度排序问题,在Waymo数据集上取得领先性能。

详情
AI中文摘要

多模态运动预测本质上是欠监督的:每个训练场景只提供一个已实现的未来,但存在多个合理的未来。这种稀疏监督通常会导致模式坍塌(冗余假设和模式覆盖不足)以及在预测少量轨迹时置信度排序不可靠。我们提出Mode-as-Sequence,一个统一的解码框架,将无序模式集转化为有序模式序列,并显式建模模式间依赖。在该框架下,我们开发了两种互补的实例化方法。ModeSeq执行循环模式解码,每个模式基于先前生成的模式生成,鼓励多样化、非冗余的假设,并具有校准的置信度排序。为了消除逐模式自回归瓶颈,我们进一步提出Parallel ModeSeq,它使用掩码模式间自注意力保留相同的因果依赖,同时在前向传播中一次性解码所有模式,从而实现高效的大K推理和可扩展的联合场景预测。为了在稀疏标签下学习代表性模式和校准的置信度,我们引入了Early-Match-Take-All (EMTA)及其联合场景扩展MA-EMTA,以及一个轻量级的排序正则化器,以减少置信度反转。在大型基准上的大量实验表明,在数据集、预测时长和对象类型上,排序导向指标和最佳K准确率均有一致提升。在Waymo开放数据集挑战中,ModeSeq在2024年无激光雷达运动预测赛道获得第一名,Parallel ModeSeq在2025年交互预测挑战赛中获得第一名,验证了Mode-as-Sequence在准确性和效率上的有效性。

英文摘要

Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.
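
The idea of decoding all modes in one pass while keeping the sequential dependency can be sketched with a causally masked self-attention over mode embeddings. This is a generic single-head attention in NumPy, not the architecture from the paper; the function name, shapes, and the absence of learned projections are simplifying assumptions.

```python
import numpy as np

def masked_mode_attention(queries, keys, values):
    """Causal self-attention over K mode embeddings, each of shape (K, d).

    Mode i may attend only to modes 0..i, mirroring an ordered mode sequence.
    """
    num_modes, dim = queries.shape
    scores = queries @ keys.T / np.sqrt(dim)
    future = np.triu(np.ones((num_modes, num_modes), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # block attention to later modes
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
modes = rng.standard_normal((4, 8))   # 4 hypothetical mode embeddings
out, weights = masked_mode_attention(modes, modes, modes)
```

With this mask, changing a later mode embedding leaves the outputs of earlier modes untouched, which is the property that lets a parallel decoder reproduce the dependency structure of mode-by-mode decoding.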

2605.24025 2026-05-26 cs.CV cs.LG 版本更新

Towards Large Model Feature Coding

面向大模型特征编码

Youwei Pang, Changsheng Gao, Dong Liu, Huchuan Lu, Weisi Lin

发表机构 * NTU(南洋理工大学) USTC(中国科学技术大学) DUT(大连理工大学)

AI总结 本文提出大模型特征编码(LaMoFC)基准与评估框架,通过构建涵盖4类16场景的特征数据集LaMoFCBench,揭示现有编码范式与大模型特征异构性之间的严重错位。

详情
AI中文摘要

大模型在广泛的感知和生成任务中取得了显著性能,但实际部署日益受到计算和内存预算以及隐私要求的限制。分割执行通过跨设备划分计算来缓解这些约束,但不可避免地引入了中间特征的密集传输和存储。与通常针对同质空间激活图的传统CNN特征编码不同,现代大模型生成具有不同统计分布和压缩容忍度的异构特征,例如多级/多模态表示和自回归上下文缓存。这些特性使得有必要将大模型特征编码(LaMoFC)视为一个基本系统组件,并需要一个系统的评估框架。在本文中,我们提出了一个全面的LaMoFC基准和评估框架。我们首先构建特征数据集LaMoFCBench,涵盖4个类别和16个场景中的多样化任务需求,同时集成广泛采用的架构和各种分割计算设置。然后,我们根据实际应用场景指定代表性的分割点以提取中间特征,建立统一的流水线以实现公平和可重复的比较。最后,我们对主流的通用特征编解码器进行基准测试,揭示了现有编码范式与大模型特征异构性之间的严重错位。这些发现表明,LaMoFC需要从根本上脱离现有范式,而LaMoFCBench提供了推动这一转变的共享实证基础。数据和代码将在 https://github.com/lartpang/LaMoFCBench 上提供。

英文摘要

Large models have delivered remarkable performance across a wide range of perception and generation tasks, yet practical deployment is increasingly constrained by computational and memory budgets, as well as privacy requirements. Split execution alleviates these constraints by partitioning computation across devices, but it inevitably introduces intensive transmission and storage of intermediate features. Unlike conventional feature coding for CNNs that typically targets homogeneous spatial activation maps, modern large models generate heterogeneous features with varying statistical distributions and compression tolerances, e.g., multi-level/multi-modal representations and autoregressive context caches. These characteristics necessitate treating large model feature coding (LaMoFC) as a fundamental system component and call for a systematic evaluation framework. In this paper, we present a comprehensive benchmark and evaluation framework for LaMoFC. We first build the feature dataset LaMoFCBench, covering diverse task requirements across 4 categories and 16 scenarios while integrating widely adopted architectures and various split-computing settings. We then specify representative split points according to practical application scenarios to extract intermediate features, establishing a unified pipeline for fair and reproducible comparisons. Finally, we benchmark mainstream universal feature codecs, exposing the profound misalignment between existing coding paradigms and the heterogeneous nature of large model features. These findings reveal that LaMoFC demands a fundamental departure from existing paradigms, and LaMoFCBench provides the shared empirical foundation to drive this transition. The data and code will be available at https://github.com/lartpang/LaMoFCBench.

2605.24024 2026-05-26 cs.CV 版本更新

Mitigating Hallucinations in Large Vision-Language Models via Causal Route Gating

通过因果路由门控减轻大型视觉语言模型中的幻觉

Zhe Cheng, Wenyu Chen, Fode Zhang, Dehuan Shen

发表机构 * Center of Statistical Research, School of Statistics and Data Science, Southwestern University of Finance and Economics, Chengdu, China.(统计研究中心,统计与数据科学学院,西南财经大学,成都,中国) Department of Biomedical Engineering, College of Design and Engineering, National University of Singapore, Singapore(生物医学工程系,设计与工程学院,新加坡国立大学,新加坡)

AI总结 针对大型视觉语言模型中因文本路径主导导致幻觉的问题,提出一种无训练、决策对齐的干预方法,通过分解注意力头为视觉和文本路由并抑制文本路由,有效减少幻觉错误。

Comments Accepted as a Spotlight Paper at ICML 2026. 33 pages, 8 figures

详情
AI中文摘要

大型视觉语言模型(LVLMs)常常产生流畅但缺乏图像支持的幻觉内容,限制了其在现实部署中的可靠性。我们表明,一个关键的失败模式源于路由竞争:即使视觉标记获得注意力,最终的标记决策也可能被文本路径主导,导致解码器遵循语言先验而非视觉证据。为了缓解这一问题,我们提出一种无训练、决策对齐的干预方法,将每个注意力头分解为视觉路由和文本路由,并使用高效的一次前向/一次梯度近似估计其标记级效应。这些估计揭示了头内的路由冲突并识别出先验主导的头,从而能够选择性地仅抑制文本路由,同时保持视觉路由完整。在涵盖判别和生成设置的五个基准测试中,我们的方法一致地减少了幻觉相关错误,对整体多模态性能影响有限,同时仅带来适度的推理时间开销。

英文摘要

Large vision-language models (LVLMs) often hallucinate content that is fluent yet unsupported by the image, limiting their reliability in real-world deployment. We show that a key failure mode arises from route competition: even when visual tokens receive attention, the final token decision can be dominated by the textual pathway, causing the decoder to follow linguistic priors over visual evidence. To mitigate this, we propose a training-free, decision-aligned intervention that decomposes each attention head into a visual route and a text route, and estimates their token-level effects using an efficient one-forward/one-gradient approximation. These estimates reveal route conflict within heads and identify prior-dominant ones, enabling selective suppression of only the text route while keeping the visual route intact. Across five benchmarks spanning discriminative and generative settings, our method consistently reduces hallucination-related errors across models with limited impact on overall multimodal performance, while incurring a modest inference-time overhead.
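
One plausible reading of the route decomposition is sketched below: the output of an attention head for a single query is split into the part contributed by visual tokens and the part contributed by text tokens, and only the text part is scaled down. The abstract does not give the exact decomposition or how the gate is chosen, so the function, the `text_gate` parameter, and the toy data are assumptions.

```python
import numpy as np

def gated_head_output(attn, values, is_visual, text_gate=1.0):
    """Split one head's output for a single query into two routes.

    attn: (n,) attention weights; values: (n, d) value vectors;
    is_visual: (n,) boolean mask marking visual tokens.
    text_gate scales only the text route (1.0 means no intervention).
    """
    visual_route = attn[is_visual] @ values[is_visual]
    text_route = attn[~is_visual] @ values[~is_visual]
    return visual_route + text_gate * text_route, visual_route, text_route

rng = np.random.default_rng(0)
attn = rng.random(6)
attn = attn / attn.sum()                      # normalized attention weights
values = rng.standard_normal((6, 4))
is_visual = np.array([True, True, True, False, False, False])

full, visual_route, text_route = gated_head_output(attn, values, is_visual)
suppressed, _, _ = gated_head_output(attn, values, is_visual, text_gate=0.0)
```

With `text_gate=1.0` the two routes sum to the ordinary head output, so the split itself changes nothing; lowering the gate removes text-route influence while the visual route is left intact.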

2605.24023 2026-05-26 cs.CV cs.DM 版本更新

Soft Tuy-Completeness for Robust Projection Selection in Cone-Beam CT

锥束CT中鲁棒投影选择的软Tuy完备性

Linda-Sophie Schneider, Andreas Maier

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg(模式识别实验室,埃尔朗根-纽伦堡弗里德里希-亚历山大大学)

AI总结 基于Tuy完备性理论,提出连续软近正交评分和分辨率感知饱和覆盖目标,通过次模贪心算法和混合整数线性规划实现投影选择,并引入有效空间分辨率作为轨迹级诊断指标。

Comments Preprint

详情
AI中文摘要

本工作引入了一个连续的软近正交评分和一个分辨率感知的饱和覆盖目标,用于感兴趣区域聚焦的锥束CT中的投影选择,基于Tuy完备性理论。将经典Tuy完备性的二元命中-未命中模型替换为分级的、可微的公式,保留了与可实现特征尺寸的直接联系,同时支持高效的近似和精确优化。我们通过从集合覆盖的多项式时间归约证明了底层离散决策问题是NP完全的,从而引出了具有可证明的$(1-1/\mathrm{e})$近似保证的次模贪心算法和提供认证最优性边界的混合整数线性规划(MILP)。MILP作为贪心解的质量证书,而不是竞争性优化器。主要实证结果证实了这种关系:在跨越六个目标区域、多个投影预算和四个受控遮挡条件的系统基准测试中,贪心与MILP目标值之比的合并中位数为0.998,且相当一部分案例被认证为全局最优。文中还包含一个二元公式作为诊断基线;它增强了硬方向完备性,但在连续覆盖尺度上较弱。我们还引入了有效空间分辨率(ESR),这是一个物理可解释的轨迹级诊断指标,将方向采样间隙映射到可实现的特征尺寸。ESR在不同投影预算和遮挡水平下与匹配的重建质量可靠相关,提供了选择阶段与图像域之间的实用桥梁,而无需重建。

英文摘要

This work introduces a continuous soft near-orthogonality score and a resolution-aware saturated coverage objective for projection selection in region-of-interest focused cone-beam CT, grounded in Tuy's completeness theory. Replacing the binary hit-or-miss model of classical Tuy completeness with a graded, differentiable formulation preserves a direct link to achievable feature sizes while enabling both efficient approximate and exact optimisation. We establish that the underlying discrete decision problems are NP-complete via polynomial-time reductions from Set Cover, motivating a submodular greedy algorithm with proven $(1-1/\mathrm{e})$ approximation guarantees and a mixed-integer linear program (MILP) that provides certified optimality bounds. The MILP serves as a quality certificate for the greedy solution rather than a competing optimiser. The primary empirical finding confirms this relationship: across a systematic benchmark spanning six target regions, multiple projection budgets, and four controlled occlusion conditions, the pooled median greedy-to-MILP objective ratio was 0.998, with a substantial fraction of cases certified globally optimal. A binary formulation is included as a diagnostic baseline; it strengthens hard directional completeness but is weaker on the continuous coverage scale. We additionally introduce Effective Spatial Resolution (ESR), a physically interpretable trajectory-level diagnostic that maps directional sampling gaps to achievable feature sizes. ESR correlates reliably with matched reconstruction quality across projection budgets and occlusion levels, providing a practical bridge between the selection stage and the image domain without requiring reconstruction.
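
The greedy selection step can be sketched as follows, assuming a generic saturated coverage objective of the form F(S) = Σ_j min(τ, Σ_{i∈S} s_ij), which is monotone and submodular for non-negative scores. The score matrix, the saturation level `tau`, and the function names are illustrative assumptions; the paper's actual soft near-orthogonality score and resolution-aware weighting are not reproduced here.

```python
import numpy as np

def saturated_coverage(scores, selected, tau=1.0):
    """Sum over targets of min(tau, accumulated soft score).

    scores[i, j]: soft score in [0, 1] of projection i for target direction j.
    """
    if not selected:
        return 0.0
    return float(np.minimum(tau, scores[selected].sum(axis=0)).sum())

def greedy_select(scores, budget, tau=1.0):
    """Pick `budget` projections, each time adding the largest marginal gain."""
    selected = []
    for _ in range(budget):
        current = saturated_coverage(scores, selected, tau)
        gains = [
            saturated_coverage(scores, selected + [i], tau) - current
            if i not in selected else -np.inf
            for i in range(scores.shape[0])
        ]
        selected.append(int(np.argmax(gains)))
    return selected

# Toy instance: 4 candidate projections, 4 target directions.
scores = np.array([[0.9, 0.9, 0.0, 0.0],
                   [0.8, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.7, 0.7],
                   [0.5, 0.5, 0.5, 0.5]])
picked = greedy_select(scores, budget=2)
greedy_value = saturated_coverage(scores, picked)
```

On small instances like this one the greedy value can be compared against an exhaustive search, which plays the role that the MILP certificate plays at realistic problem sizes.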

2605.24020 2026-05-26 cs.CV cs.AI 版本更新

Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

理解视觉与语言信息并与人类及环境交互的机器智能

Van Quang Nguyen

发表机构 * System Information Sciences(系统信息科学系)

AI总结 本文提出GRIT、LTMI和两阶段指令解释框架,分别改进图像描述、视觉对话和交互式指令跟随任务,在准确性和效率上取得领先结果。

Comments Doctoral dissertation, Tohoku University, 2022. Uploaded for archival purposes. 146 pages

详情
AI中文摘要

计算机视觉与自然语言处理交叉领域的进展对于辅助技术、多媒体查询和机器人等应用至关重要。本论文提出了新颖的架构,以改进智能体在三个关键视觉-语言任务上的表现:图像描述、视觉对话和交互式指令跟随。 首先,我们解决了图像描述中视觉表示的局限性。传统模型依赖CNN检测器提取的区域特征,缺乏全局上下文且计算开销大。我们提出GRIT(基于网格和区域的图像描述Transformer),一种纯Transformer架构。通过使用基于DETR的检测器整合网格和区域特征,GRIT实现了端到端训练,并在推理准确性和速度上均优于先前方法。 其次,我们处理视觉对话,这需要对图像进行多轮对话。挑战在于高效建模多个输入(图像、问题、历史)之间的交互。我们引入LTMI(轻量级多输入Transformer)。利用专门的注意力块,LTMI层在VisDial数据集上验证,其表示能力与标准Transformer扩展相当,但参数不到其十分之一。 最后,我们使用ALFRED数据集研究具身AI的交互式指令跟随。我们提出一个包含两阶段指令解释的框架:首先独立于视觉上下文解码语言指令以预测暂定的动作-对象序列,然后与视觉特征融合以最终执行。通过使用多个自我中心视图和分层注意力,我们的方法准确定位对象,并实现了8.37%的最新未见成功率。

英文摘要

Advancements at the intersection of computer vision and natural language processing are crucial for applications like assistive technology, multimedia querying, and robotics. This dissertation proposes novel architectures to improve intelligent agents across three key vision-language tasks: image captioning, visual dialog, and interactive instruction following. First, we address limitations in visual representation for image captioning. Traditional models rely on region-based features from CNN detectors, which lack global context and suffer from high computational overhead. We propose GRIT (Grid and Region-based Image captioning Transformer), a transformer-only architecture. By integrating grid and region features using a DETR-based detector, GRIT enables end-to-end training and outperforms prior methods in both inference accuracy and speed. Second, we tackle visual dialog, which requires multi-turn conversation about an image. The challenge lies in efficiently modeling interactions between multiple inputs (image, question, history). We introduce LTMI (Light-weight Transformer for Many Inputs). Utilizing a specialized attention block, an LTMI layer matches the representational power of a standard Transformer extension while utilizing less than one-tenth of its parameters, as validated on the VisDial dataset. Finally, we study interactive instruction following for embodied AI using the ALFRED dataset. We propose a framework featuring a two-stage instruction interpretation: it first decodes language directives independently of visual context to predict a tentative action-object sequence, which is then fused with visual features for final execution. Using multiple egocentric views and hierarchical attention, our method accurately localizes objects and achieves a state-of-the-art unseen success rate of 8.37%.

2605.24019 2026-05-26 cs.CV cs.LG 版本更新

MGVQ: Synergizing Multi-dimensional Sensitivity-Aware and Gradient-Hessian Fusion for Vector Quantization

MGVQ:协同多维敏感度感知与梯度-海森融合的向量量化

Zhong Wang, Zukang Xu, Xing Hu, Dawei Yang

发表机构 * Bauman Moscow State Technical University(鲍曼莫斯科国立技术大学)

AI总结 提出MGVQ框架,通过敏感度引导的结构化混合精度量化和梯度感知的二阶误差补偿,实现视觉-语言模型的超低位向量量化,在2-bit量化下最高提升4.9个点。

详情
AI中文摘要

视觉-语言模型(VLM)取得了卓越的性能,但其巨大的模型尺寸严重阻碍了在资源受限的边缘设备上的部署。作为一种高效的模型压缩技术,向量量化(VQ)在超低位表示方面表现出色,它将模型权重映射到紧凑码本中的离散码字,以降低内存消耗和传输开销,同时保持模型能力。直接将VQ应用于VLM仍存在两个核心限制。首先,视觉和文本输入带来的跨模态权重分布差异无法被单一的统一码本很好地拟合。其次,当前的二阶误差补偿忽略了梯度信息,导致权重偏离预训练最优状态、梯度漂移和补偿结果有偏。本文提出MGVQ,一种新颖的向量量化框架,集成了多维敏感度感知和梯度-海森融合。它包含两个核心模块:敏感度引导的结构化混合精度量化,通过结合全局和局部敏感度分析,根据通道敏感度动态分配不同位宽,实现精细的资源分配;梯度感知的二阶误差补偿,将一阶梯度嵌入误差校正,并采用Kronecker和Block-LDL分解确保低计算成本。在主流VLM(包括LLaVA-onevision、InternVL2和Qwen2-VL)上的大量实验验证了MGVQ的有效性。在2-bit量化设置下,MGVQ显著超越现有先进的后训练量化方法,在InternVL2-26B上最高提升4.9个点(71.4% vs 67.0%)。所提方法实现了稳定高效的超低位VLM量化,极大促进了多模态大模型在资源受限环境中的实际部署。

英文摘要

Vision-Language Models (VLMs) achieve outstanding performance, yet their huge model size severely hinders deployment on edge devices with limited resources. As an efficient model compression technique, vector quantization (VQ) excels in ultra-low-bit representation, which maps model weights to discrete codewords in a compact codebook to cut memory consumption and transmission overhead while preserving model capability. Direct VQ application to VLMs still has two core limitations. First, cross-modality weight distribution differences brought by visual and textual inputs cannot be well fitted by a single unified codebook. Second, current second-order error compensation ignores first-order gradient information, causing weight deviation from pre-trained optimal states, gradient drift and biased compensation results. This work proposes MGVQ, a novel vector quantization framework integrating multi-dimensional sensitivity perception and gradient-Hessian fusion. It consists of two core modules: sensitivity-guided structured mixed-precision quantization dynamically assigns different bit-widths according to channel sensitivity via combined global and local sensitivity analysis for refined resource allocation; gradient-aware second-order error compensation embeds first-order gradients into error correction, and adopts Kronecker and Block-LDL decomposition to ensure low computational cost. Extensive experiments on mainstream VLMs including LLaVA-onevision, InternVL2 and Qwen2-VL verify the effectiveness of MGVQ. In 2-bit quantization settings, MGVQ surpasses existing advanced post-training quantization methods significantly, achieving a maximum accuracy improvement of 4.9 points (71.4% vs 67.0% on InternVL2-26B). The proposed method realizes stable and efficient ultra-low-bit VLM quantization, greatly promoting the practical deployment of multimodal large models in resource-limited environments.
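
The basic vector quantization step described above, mapping weight sub-vectors to the nearest codeword, can be sketched in a few lines. This shows plain nearest-neighbor VQ only, not MGVQ's mixed-precision allocation or error compensation; the sub-vector length, codebook size, and random codebook are assumptions chosen so that the index cost works out to 2 bits per weight.

```python
import numpy as np

def vector_quantize(subvectors, codebook):
    """Assign each sub-vector to its nearest codeword (squared Euclidean distance).

    subvectors: (n, d); codebook: (m, d). Returns indices (n,) and reconstruction (n, d).
    """
    diffs = subvectors[:, None, :] - codebook[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    indices = sq_dist.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 8))                   # toy weight matrix
dim, num_codewords = 4, 256                            # illustrative sizes
subvectors = weight.reshape(-1, dim)                   # 16 sub-vectors of length 4
codebook = rng.standard_normal((num_codewords, dim))   # stand-in for a learned codebook
indices, quantized = vector_quantize(subvectors, codebook)

# Index cost per weight: log2(256) bits shared by 4 weights.
bits_per_weight = np.log2(num_codewords) / dim
```

Only the indices and the shared codebook need to be stored; the bits-per-weight figure here counts the indices alone and ignores the codebook's own storage.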

2605.24014 2026-05-26 cs.CV 版本更新

SkySeg: Collaborative Onboard Semantic Segmentation with Heterogeneous UAVs in the Wild

SkySeg: 野外异构无人机协同机载语义分割

Anqi Lu, Yun Cheng, Youbing Hu, Zhiqiang Cao, Jie Liu, Zhijun Li

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) ETH Zurich(苏黎世联邦理工学院) Huawei Cloud Algorithm Innovation Lab(华为云算法创新实验室)

AI总结 针对资源受限无人机在动态环境中实时语义分割的挑战,提出SkySeg异构多无人机空-空协作框架,结合高效信息融合推理与跨设备测试时自适应策略,实现低成本传感器下的机载分割,加速约3.6倍并提升精度5.91%。

详情
AI中文摘要

基于无人机的图像采集和分析需求激增,无人机越来越多地用于语义分割任务。为了满足无人机遥感任务的实时分析要求,进行机载计算并基于结果做出决策是一种自然的方法。然而,在资源受限的无人机平台上部署语义分割面临两个重大挑战:1)硬件限制限制了无人机执行实时语义分割的能力,2)飞行过程中的环境变化导致数据分布偏移,偏离原始训练数据。为了解决这些问题,本文介绍了SkySeg,一种异构多无人机空-空协作框架,它集成了计算机视觉和飞行模式,能够使用低成本传感器实现机载语义分割。SkySeg采用高效的信息融合推理方法,将低分辨率广域图像与高分辨率聚焦区域图像相结合。此外,它还包含一种跨设备测试时自适应策略,通过协作解决无人机间测试数据流的分布偏移,增强动态环境中的分割性能。实验结果表明,我们的SkySeg框架将推理延迟加速约3.6倍,将机载分割精度提高5.91%,并在野外环境中实现了10.91%的平均精度增益。

英文摘要

The demand for unmanned aerial vehicle (UAV)-based image acquisition and analysis has surged, with UAVs increasingly utilized for semantic segmentation tasks. To meet the real-time analysis requirements of UAV remote sensing missions, performing onboard computation and making decisions based on the results is a natural approach. However, deploying semantic segmentation on resource-constrained UAV platforms presents two significant challenges: 1) hardware constraints limit the ability of UAVs to perform real-time semantic segmentation, and 2) environmental variations during flight cause the data distribution to shift away from the original training data. To address these issues, this paper introduces SkySeg, a heterogeneous multi-UAV air-air cooperation framework that integrates computer vision and flight patterns to enable onboard semantic segmentation using low-cost sensors. SkySeg employs an efficient information fusion inference method, combining low-definition, wide-area images with high-definition, focused-area images. Additionally, it incorporates a cross-device test-time adaptation (TTA) strategy to enhance segmentation performance in dynamic environments by collaboratively addressing distribution shifts of test data streams across UAVs. Experimental results demonstrate that our SkySeg framework speeds up inference by approximately 3.6x, improves onboard segmentation accuracy by 5.91%, and achieves a 10.91% average accuracy gain in the wild.

2605.24012 2026-05-26 cs.CV 版本更新

Deep Learning-Based Automated Quantification of TIMI Myocardial Perfusion Frame Count (DL-TMPFC) from Coronary Angiography: A Novel Framework for Rapid Assessment of Microvascular Dysfunction

基于深度学习的TIMI心肌灌注帧数自动量化(DL-TMPFC):一种快速评估微血管功能障碍的新框架

Si Li, Yuanqing He, Chenkai Hu, Xiaogang Guo, Huay-Cheem Tan, Chieh Yang Koo, Xuan Zhang, Lei He, Jingyuan Zeng, Shan Xiao

发表机构 * School of Artificial Intelligence and Digital Economy Industry, Guangzhou Institute of Science and Technology(广州理工学院人工智能与数字经济产业学院) Department of Cardiology, Second Affiliated Hospital, Jiangxi Medical College, Nanchang University(南昌大学江西医学院第二附属医院心内科) Department of Cardiology, First Affiliated Hospital of Zhejiang University School of Medicine(浙江大学医学院附属第一医院心内科) Department of Cardiology, National University Heart Centre(国立大学心脏中心心内科)

AI总结 提出DL-TMPFC框架,结合狭窄检测网络和区域感知分割网络,从冠状动脉造影中自动计算TIMI心肌灌注帧数,实现微血管功能障碍的客观量化,验证显示与专家手动测量高度一致。

Comments 15 pages, 8 figures

详情
AI中文摘要

目的:冠状动脉微血管功能障碍(CMVD)影响约40%-60%的缺血和非阻塞性冠状动脉患者,但由于依赖侵入性功能测试或主观的TIMI血流分级,诊断仍具挑战性。TIMI心肌灌注帧数(TMPFC)提供了一种基于造影的客观定量测量CMVD的方法,但其临床应用受限于繁琐的手动计算和验证不足。本研究旨在开发和验证一种基于深度学习的TMPFC计算方法(DL-TMPFC),使其能够整合到临床工作流程中。方法和结果:DL-TMPFC框架包含两个组件。首先,狭窄检测网络排除阻塞性冠状动脉疾病(CAD)。然后,区域感知分割网络识别灌注区域,TMPFC计算模块自动从造影序列中确定首帧和末帧。该框架在来自三个独立机构的655名患者(445名阻塞性CAD、100名确诊CMVD、110名对照组)队列中进行了验证。DL-TMPFC与专家手动测量具有极好的一致性(偏差:-0.93帧;95%一致性界限:-5.33至+3.47;r=0.98)。DL-TMPFC通过完全自动化TMPFC并消除观察者依赖性,显著增强了临床可行性。临床上,DL-TMPFC能够准确识别全谱冠状动脉病理中的CMVD,并捕获超越二元分类的CMVD连续严重程度,实现定量风险分层。结论:DL-TMPFC实现了直接从常规造影中自动、标准化和准确地量化CMVD。通过提供自动和客观的测量,该工具为临床实践中及时识别和管理CMVD提供了即时诊断信息。

英文摘要

Aims: Coronary microvascular dysfunction (CMVD) affects approximately 40%-60% of patients with ischemia and non-obstructive coronary arteries, yet diagnosis remains challenging due to reliance on invasive functional testing or subjective Thrombolysis In Myocardial Infarction (TIMI) flow grade. The TIMI Myocardial Perfusion Frame Count (TMPFC) offers an objective, angiography-based quantitative measure of CMVD, but its clinical translation is hindered by cumbersome manual calculation and insufficient validation. This study aims to develop and validate a deep learning-powered TMPFC calculation (DL-TMPFC), enabling integration into clinical workflows. Methods and results: The DL-TMPFC framework comprised two components. A stenosis detection network first excluded obstructive coronary artery disease (CAD). A territory-aware segmentation network then identified perfusion territories, and a TMPFC calculation module automatically determined the first and last frames from angiographic sequences. The framework was validated in a cohort of 655 patients (445 with obstructive CAD, 100 with confirmed CMVD, 110 controls) from three independent institutions. DL-TMPFC showed excellent agreement with expert manual measurements (bias: -0.93 frames; 95% limits of agreement (LoA): -5.33 to +3.47; r = 0.98). DL-TMPFC markedly enhanced clinical feasibility by fully automating TMPFC and removing observer dependence. Clinically, DL-TMPFC accurately identified CMVD across a full spectrum of coronary pathologies and captured the continuous severity of CMVD beyond binary classification, enabling quantitative risk stratification. Conclusion: DL-TMPFC enabled automatic, standardized, and accurate quantification of CMVD directly from routine angiography. By providing an automatic and objective measure, this tool provided immediate diagnostic information for timely recognition and management of CMVD in clinical practice.
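The bias and 95% limits of agreement quoted in this abstract are the usual outputs of a Bland-Altman analysis: the mean of the paired differences, plus or minus 1.96 times their sample standard deviation. The reported limits are symmetric about the bias (-0.93 ± 4.40 frames), which is consistent with that construction. Whether the authors used exactly this formula is an assumption. The sketch below uses made-up frame counts purely for illustration.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(automated: Sequence[float],
                 manual: Sequence[float]) -> Tuple[float, float, float]:
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    bias = mean(automated - manual)
    LoA  = bias +/- 1.96 * sample standard deviation of the differences
    """
    diffs = [a - m for a, m in zip(automated, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical frame counts, NOT data from the study.
auto_counts = [90, 101, 78, 120, 95]
manual_counts = [92, 100, 80, 121, 96]
bias, lower, upper = bland_altman(auto_counts, manual_counts)
```

A negative bias, as in the abstract, would mean the automated count tends to be slightly lower than the expert count under this sign convention.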

2605.24008 2026-05-26 cs.LG cs.CV cs.SE 版本更新

CAFD: Concept-Aware DNN Fault Detection using VLMs

CAFD: 使用视觉语言模型的概念感知深度神经网络故障检测

Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand

发表机构 * School of EECS, University of Ottawa(渥太华大学电子工程与计算机科学学院) Research Ireland Lero centre for software, University of Limerick(利默里克大学 Research Ireland Lero 软件研究中心)

AI总结 提出概念感知故障检测(CAFD)方法,通过整合模型信号、距离特征和基于视觉语言模型的概念故障比(CFR)特征,在保持效率的同时显著提升DNN故障检测性能。

详情
AI中文摘要

近年来,深度神经网络(DNN)的故障检测受到越来越多的关注。虽然已经提出了更先进的混合方法来结合多种信息源并优于早期技术,但它们通常会产生大量的计算开销,限制了在现实环境中的可扩展性和实用性。在本文中,我们介绍了概念感知故障检测(CAFD),这是一种基于学习的方法,通过有效整合多个信息源同时保持实际效率,实现了卓越的故障检测性能。具体来说,CAFD使用一组精心挑选的信息特征进行训练,包括基于DNN输出的模型信号、基于距离的特征以及一种新颖的基于概念的特征,称为概念故障比(CFR)。CFR利用视觉语言模型(VLM)从图像中提取文本概念,并量化其存在与DNN故障相关的可能性。通过引入这一特征,CAFD受益于互补的语义信息,从而实现更有效的故障检测。我们的结果表明,CFR是DNN故障检测的有效指标。我们对CAFD进行了广泛的实证评估,将其与三个主题DNN模型和数据集(包括ImageNet)上的五个最先进基线进行了比较。在广泛的约束选择预算范围内,CAFD在故障检测率(FDR)上始终优于所有基线,在所有研究对象和预算规模上平均FDR提高了18.3%。

英文摘要

Fault detection for Deep Neural Networks (DNNs) has received increasing attention in recent years. While more advanced hybrid approaches have been proposed to combine multiple sources of information and outperform earlier techniques, they often incur substantial computational overhead, limiting scalability and practicality in real-world settings. In this paper, we introduce Concept-Aware Fault Detection (CAFD), a learning-based approach that achieves superior fault detection performance by effectively integrating multiple information sources while maintaining practical efficiency. Specifically, CAFD is trained using a carefully selected set of informative features, including model-based signals derived from the DNN's outputs, distance-based features, and a novel concept-based feature, called Concept Failure Ratio (CFR). CFR leverages Vision-Language Models (VLMs) to extract textual concepts from images and quantify the likelihood that their presence is associated with DNN failures. By incorporating this feature, CAFD benefits from complementary semantic information, enabling more effective fault detection. Our results demonstrate that CFR serves as an effective indicator for DNN fault detection. We conduct an extensive empirical evaluation of CAFD, comparing it against five state-of-the-art baselines across three subject DNN models and datasets, including ImageNet. Across a wide range of constrained selection budgets, CAFD consistently outperforms all baselines in Fault Detection Rate (FDR), achieving average FDR improvements of 18.3% across all investigated subjects and budget sizes.
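The abstract does not give a formula for the Concept Failure Ratio. One plausible reading, assumed here for illustration only, is an empirical ratio per concept: among the images in which a VLM reports a concept, the fraction that the DNN mispredicts. The sample format and function name below are hypothetical, and the concept extraction by a VLM is taken as already done.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def concept_failure_ratio(
        samples: Iterable[Tuple[Set[str], bool]]) -> Dict[str, float]:
    """For each concept, estimate P(DNN fails | concept present).

    Each sample is (concepts found in the image, whether the DNN failed).
    This is an assumed definition, not the one from the paper.
    """
    seen: Dict[str, int] = defaultdict(int)
    failed: Dict[str, int] = defaultdict(int)
    for concepts, is_failure in samples:
        for concept in set(concepts):
            seen[concept] += 1
            if is_failure:
                failed[concept] += 1
    return {c: failed[c] / seen[c] for c in seen}


# Hypothetical labelled samples: (VLM concepts, DNN misprediction flag).
samples = [
    ({"snow", "dog"}, True),
    ({"snow"}, True),
    ({"dog"}, False),
    ({"snow", "car"}, False),
]
cfr = concept_failure_ratio(samples)
```

Under this reading, a concept such as "snow" that co-occurs with failures more often than others would receive a higher ratio and could serve as one input feature to a learned fault detector.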

2605.24004 2026-05-26 cs.AI cs.CV cs.LG cs.RO 版本更新

Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

推理--想象--行动:基于世界模型的闭环LLM自动驾驶决策

Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo, Jiabin Liu

发表机构 * Department of Information Management, Peking University, Beijing 100871, China; School of Intelligence Science Technology, Peking University, Beijing 100871, China; State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing 100080, China; Yuanpei College, Peking University, Beijing 100871, China; China Agricultural University, Beijing, China; CRSC Research & Design Institute Group Co., Ltd., Beijing, China

AI总结 提出Reason--Imagine--Act (RIA)闭环框架,结合LLM推理器与动作条件世界模型进行在线安全验证,在CARLA点目标协议下实现80.05%路线完成率、51.10%到达率和0.20%碰撞率。

Comments Accepted by the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). 8 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLM)在自动驾驶中具有潜力,但仅基于语义的决策策略可能在动态交通中产生物理上不安全的行为。现有方法要么在没有显式动力学验证的情况下进行在线语言推理,要么主要在离线流程中使用世界模型,在决策时语义意图与物理可行性之间存在差距。我们提出了Reason--Imagine--Act (RIA),一个闭环框架,将LLM推理器与动作条件世界模型耦合,用于在线安全验证。在每一步,LLM提出一个动作模板和候选子动作,世界模型执行短时域展开,安全评分器选择最安全的可执行动作并反馈给下一步推理。在统一的CARLA点目标协议(1000个回合)下,RIA实现了80.05%的路线完成率、51.10%的到达率和0.20%的碰撞率。在相同的闭环接口下,RIA在核心闭环指标上始终优于无训练基线,包括CARLA TM和MADA。为便于复现,代码可在https://github.com/pku-smart-city/source_code/tree/main/RIA获取。

英文摘要

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.
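The propose, roll out, score, select cycle described in this abstract can be sketched with a toy stand-in for each part. Everything below is hypothetical: the candidates are plain accelerations rather than LLM outputs, the "world model" is a constant-speed car-following simulation, and the safety score is the smallest predicted gap to the lead vehicle. It illustrates the control flow only, not RIA's actual models or its CARLA interface.

```python
from typing import Iterable, Tuple


def rollout_min_gap(gap: float, ego_speed: float, lead_speed: float,
                    accel: float, steps: int = 5, dt: float = 0.5) -> float:
    """Toy world model: simulate a short horizon and return the smallest
    gap (in metres) to a lead vehicle moving at constant speed."""
    min_gap = gap
    for _ in range(steps):
        ego_speed = max(0.0, ego_speed + accel * dt)
        gap += (lead_speed - ego_speed) * dt
        min_gap = min(min_gap, gap)
    return min_gap


def select_safest(gap: float, ego_speed: float, lead_speed: float,
                  candidates: Iterable[float]) -> Tuple[float, float]:
    """Score every candidate action by its predicted minimum gap and
    return (chosen acceleration, predicted minimum gap)."""
    scored = [(rollout_min_gap(gap, ego_speed, lead_speed, a), a)
              for a in candidates]
    best_gap, best_action = max(scored)  # largest predicted gap wins
    return best_action, best_gap


# Candidate sub-actions would come from the reasoner; fixed here.
action, predicted_gap = select_safest(
    gap=10.0, ego_speed=10.0, lead_speed=8.0,
    candidates=(-2.0, 0.0, 1.0))
```

In a closed loop, the chosen action and its predicted outcome would be passed back as context for the next proposal step, which is the feedback path the abstract mentions.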

2605.23997 2026-05-26 cs.CV cs.AI cs.LG 版本更新

IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

IVR-R1:通过强化学习中的迭代视觉基础推理优化轨迹

Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren, Fuxiang Wu, Quan Chen, Liu Liu

发表机构 * Hangzhou International Innovation Institute, Beihang University(北京航空航天大学杭州国际创新研究院) School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院) Kuaishou Technology(快手科技) Shenzhen Institute of Advanced Integration Technology, Shenzhen(深圳先进集成技术研究院)

AI总结 提出IVR-R1框架,利用奖励驱动的筛选机制和迭代再推理循环,在强化学习中动态校正多模态推理轨迹,以解决视觉幻觉和逻辑错误问题。

详情
AI中文摘要

通过强化学习的多模态大语言模型在复杂视觉推理任务中展现出显著能力,但在长程多模态场景中仍存在局限,常出现视觉幻觉和逻辑错误。当前方法通常将高维视觉场景预编码为离散文本代理以促进下游推理。然而,随着推理链展开,文本与视觉场景之间固有的信息不对称会侵蚀视觉基础,导致推理误导和错误输出。为解决此问题,我们提出IVR-R1(迭代视觉基础推理),一种新颖的强化学习训练框架,通过动态视觉重新对齐主动校正推理轨迹以指导策略优化。具体而言,利用奖励驱动的筛选机制识别有缺陷的展开,IVR-R1在多模态上下文中执行细粒度的步骤级错误归因。通过将中间推理状态与原始视觉先验进行迭代交叉引用,再推理循环实现自动轨迹校正,有效合成专家级演示,作为策略模型的高保真推理模板。我们在多种多模态基准上的实验表明,IVR-R1持续优于现有强化学习方法,为在复杂多模态推理中保持逻辑和视觉一致性建立了优越范式。

英文摘要

Multimodal large language models trained via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucinations and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework that uses dynamic visual re-alignment to actively rectify reasoning trajectories and guide policy optimization. Specifically, by leveraging a reward-driven screening mechanism to identify flawed rollouts, IVR-R1 executes a fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.

2605.23996 2026-05-26 cs.CV eess.IV 版本更新

Brain-to-Image Retrieval and Reconstruction via Multimodal EEG Alignment

通过多模态EEG对齐实现脑到图像的检索与重建

Chi Kit Wong, Yan Liu, Haowen Yan

AI总结 提出一种脑到图像系统,通过多模态EEG对齐实现自然图像观看时的视觉刺激解码,在检索任务中达到86.30%的Top-1准确率,在重建任务中获得0.903的CLIP分数。

Comments 16 pages, 5 figures. Code available at: https://github.com/Chikit-WONG/DL_Project/

详情
AI中文摘要

我们提出一种脑到图像系统,该系统从自然图像观看期间记录的EEG信号中解码视觉刺激。我们的系统解决两个任务:(1) EEG到图像检索,给定一个EEG片段,在200个候选中对正确的刺激图像进行排序;(2) EEG到图像重建,生成与感知刺激一致的图像。对于检索,我们实现了一种多级模糊方法,该方法通过生物启发的EVNet特征进行改进,并使用InfoNCE损失进行训练。在单个受试者的10个随机种子评估中,检索模型实现了平均最终epoch Top-1准确率86.30%和Top-5准确率98.55%。对于重建,我们实现了CognitionCapturerPro,它将EEG表示对齐到多模态CLIP嵌入,包括图像、文本、深度和边缘嵌入,并通过IP-Adapter条件化使用SDXL-Turbo合成图像。在10个种子上平均,重建模型使用ViT-H-14实现了0.903的CLIP分数,使用ViT-L/14实现了0.870的CLIP分数,SSIM为0.409。这些结果证明了使用现代多模态对齐和生成建模技术从EEG信号解码丰富视觉表示的可行性。

英文摘要

We present a brain-to-image system that decodes visual stimuli from EEG signals recorded during natural image viewing. Our system addresses two tasks: (1) EEG-to-image retrieval, which ranks the correct stimulus image among 200 candidates given an EEG segment, and (2) EEG-to-image reconstruction, which generates an image consistent with the perceived stimulus. For retrieval, we implement a multi-level blurring approach improved with biologically inspired EVNet features and trained with the InfoNCE loss. Evaluated over 10 random seeds for a single subject, the retrieval model achieves a mean final-epoch Top-1 accuracy of 86.30% and Top-5 accuracy of 98.55%. For reconstruction, we implement CognitionCapturerPro, which aligns EEG representations to multi-modal CLIP embeddings, including image, text, depth, and edge embeddings, and synthesizes images with SDXL-Turbo conditioned via IP-Adapter. Averaged over 10 seeds, the reconstruction model achieves a CLIP score of 0.903 using ViT-H-14, a CLIP score of 0.870 using ViT-L/14, and an SSIM of 0.409. These results demonstrate the feasibility of decoding rich visual representations from EEG signals using modern multi-modal alignment and generative modeling techniques.
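The retrieval setup in this abstract (InfoNCE training, then Top-1/Top-5 ranking among candidates) can be illustrated with a small pure-Python sketch. The embeddings, the temperature value, and the one-directional EEG-to-image form of the loss are assumptions made for illustration. The paper's encoders and its 200-candidate protocol are not reproduced here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(eeg: Sequence[Vector], img: Sequence[Vector],
             temperature: float = 0.1) -> float:
    """EEG-to-image InfoNCE: row i of `eeg` should match row i of `img`;
    every other image in the batch acts as a negative."""
    eeg_n = [normalize(v) for v in eeg]
    img_n = [normalize(v) for v in img]
    total = 0.0
    for i, query in enumerate(eeg_n):
        logits = [dot(query, key) / temperature for key in img_n]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits))
        total += log_denominator - logits[i]
    return total / len(eeg_n)


def top_k_accuracy(eeg: Sequence[Vector], img: Sequence[Vector],
                   k: int) -> float:
    """Fraction of EEG queries whose true image ranks in the top k."""
    hits = 0
    for i, query in enumerate(eeg):
        ranked = sorted(range(len(img)),
                        key=lambda j: dot(query, img[j]), reverse=True)
        hits += i in ranked[:k]
    return hits / len(eeg)


# Hypothetical 2-D embeddings for three stimulus images.
images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = info_nce(aligned, images)
loss_shuffled = info_nce(shuffled, images)
```

Well-aligned EEG embeddings give a lower loss and a higher Top-k accuracy than misaligned ones, which is the property the training objective relies on.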

2605.23994 2026-05-26 cs.CV cs.AI 版本更新

RAW: Robust Avatar Watermarking -- Benchmarking and Baseline

RAW:鲁棒的数字人水印——基准测试与基线方法

Jack Parry, Jack Saunders, Vinay Namboodiri

发表机构 * University of Bath(巴斯大学)

AI总结 针对数字人水印面临的后处理攻击,提出基准测试RAW和基于3D人脸重建的UV纹理空间水印方法WALT,在缩放攻击和背景移除攻击下分别达到92.4%和95.6%的鲁棒性。

详情
AI中文摘要

数字人水印面临独特挑战:在部署前,数字人通常要经过背景替换、重新构图和格式转换等常规后处理。我们提出RAW(鲁棒的数字人水印),一个包含来自5个商业提供商的50个合成数字人视频和6种模拟真实数字人工作流程的攻击的基准测试。评估7种现有方法发现,数字人特定的攻击(如背景移除)会显著降低水印恢复率。我们提出WALT(通过学习纹理进行数字人水印),该方法通过3D人脸重建在UV纹理空间中嵌入水印。WALT在缩放攻击下达到最高鲁棒性(92.4%),同时在背景移除攻击下保持强劲性能(95.6%)。我们发布该基准测试以促进针对数字人水印的研究。

英文摘要

Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and format conversion before deployment. We introduce RAW (Robust Avatar Watermarking), a benchmark comprising 50 synthetic avatar videos from 5 commercial providers and 6 attacks simulating real-world avatar workflows. Evaluating 7 existing methods reveals that avatar-specific attacks such as background removal significantly degrade watermark recovery. We propose WALT (Watermarking Avatars with Learned Textures), which embeds watermarks in UV texture space via 3D face reconstruction. WALT achieves the highest robustness to zoom attacks (92.4%) while maintaining strong performance on background removal (95.6%). We release our benchmark to facilitate research into avatar-specific watermarking.

2605.23992 2026-05-26 cs.CV cs.AI 版本更新

A World Model of Radiologist Reading for Medical Image Representation Learning

放射科医生阅读的世界模型用于医学图像表示学习

Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao

发表机构 * University of Georgia(佐治亚大学) University of Texas at Arlington(德克萨斯大学阿灵顿分校) New Jersey Institute of Technology(新泽西理工学院)

AI总结 提出GazeWorld,一种将图像视为世界、放射科医生注视序列视为轨迹的医学成像世界模型,通过自回归预测注视补丁表示和空间补全未访问区域,在多个基准上实现最先进的诊断准确率和零样本性能。

详情
AI中文摘要

放射科医生的眼动追踪数据提供了专家在图像阅读过程中如何搜索、比较和积累证据的丰富记录;然而,现有方法仅部分利用这一信号,要么作为静态空间先验,要么作为与诊断脱节的辅助预测目标。我们提出GazeWorld,一种医学成像世界模型,将图像视为世界,将放射科医生的注视序列视为通过该世界的轨迹。GazeWorld自回归地从所有先前访问过的补丁预测下一个注视补丁的潜在表示,同时一个空间补全分支覆盖未访问区域。在推理时,GazeWorld仅从图像生成一系列补丁表示,无需真实注视数据。冻结的GazeWorld特征在CheXpert、RSNA肺炎和SIIM-ACR气胸的所有九个监督设置中实现了最先进的诊断准确率,并在所有三个基准上取得了最高的零样本准确率。在GazeSearch基准上,使用相同冻结特征训练的通用解码器在ScanMatch和SED上分别比专门构建的LogitGaze-Med高出16%和22%,尽管未明确训练以预测注视。GazeWorld表明,建模专家如何阅读(而不仅仅是他们得出什么结论)为医学成像AI提供了一种有前景的预训练范式。

英文摘要

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

2605.23984 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection

面向多模态在线分布式工业异常检测的参数高效多类智能调度

Heqiang Wang, Weihong Yang, Zheyuan Yang, Jia Zhou, Xiaoxiong Zhong, Fangming Liu, Weizhe Zhang

发表机构 * Pengcheng Laboratory(鹏城实验室) Shenzhen International Graduate School(深圳国际研究生院)

AI总结 针对工业异常检测中分布式、持续生成数据的特点,提出多模态在线分布式工业异常检测框架,通过多类智能调度问题和序列边际增益贪婪算法协调模型更新,并采用资源高效类级低秩适应策略降低系统开销,在MVTec 3D-AD和Eyecandies数据集上取得优越性能。

详情
AI中文摘要

工业异常检测作为工业系统的基本挑战已引起广泛关注。异构工业传感器的快速发展推动工业异常检测从单模态向多模态范式转变。然而,现有方法主要针对集中式和离线场景设计,忽视了实际工业环境中分布式和持续生成的数据特征。随着边缘智能的发展,现代边缘设备不仅能够采集数据,还能进行分布式模型训练,实现系统范围内的协作智能。工业异常检测是此背景下的关键应用。受这些挑战启发,我们提出了一种名为多模态在线分布式工业异常检测(MODIAD)的新框架。首先给出了MODIAD的完整工作流程,然后制定了多类智能调度(MIS)问题,通过平衡数据充足性和类别更新频率来协调跨类模型更新。为了高效解决该问题,我们设计了序列边际增益贪婪(SMG)算法,能够在资源约束下实现有效的多类训练。此外,为了提升训练过程中的计算和通信效率,我们提出了资源高效类级低秩适应(REC-LoRA)策略,在保持检测性能的同时显著降低系统开销。在两个代表性多模态工业异常检测数据集MVTec 3D-AD和Eyecandies上的大量实验表明,所提方法在MODIAD场景下实现了优越的性能和效率。

英文摘要

Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. The rapid advancement of heterogeneous industrial sensors has driven industrial anomaly detection from unimodal to multimodal paradigms. However, existing methods are primarily designed for centralized and offline settings, overlooking the distributed and continuously generated data characteristic of real-world industrial environments. With the advancement of edge intelligence, modern edge devices are increasingly capable of not only data acquisition but also distributed model training, enabling collaborative intelligence across the system. Industrial anomaly detection represents a critical application in this context. Motivated by these challenges, we propose a novel framework termed Multimodal Online Distributed Industrial Anomaly Detection (MODIAD). We first present a comprehensive workflow for MODIAD and then formulate a Multi-class Intelligent Scheduling (MIS) problem to coordinate cross-class model updates by balancing data sufficiency and class update frequency. To efficiently solve this problem, we design a Sequential Marginal Gain Greedy (SMG) algorithm that enables effective multi-class training under resource constraints. Furthermore, to improve computational and communication efficiency during training, we propose a Resource-Efficient Class-Wise Low-Rank Adaptation (REC-LoRA) strategy, which significantly reduces system overhead while preserving detection performance. Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, demonstrate that the proposed approach achieves superior performance and efficiency under the MODIAD scenario.
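The overhead reduction attributed to low-rank adaptation can be illustrated with generic LoRA arithmetic: a frozen weight matrix W is augmented with a trainable product B·A of rank r, so only r·(d_in + d_out) values need to be trained and communicated instead of d_in·d_out. The sketch below shows that standard construction with made-up sizes. It is not the REC-LoRA strategy itself, whose class-wise details are not given in the abstract.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def matvec(matrix: Matrix, vector: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W: Matrix, A: Matrix, B: Matrix,
                 x: Sequence[float], scale: float = 1.0) -> List[float]:
    """y = W x + scale * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one layer."""
    return d_out * d_in, rank * (d_out + d_in)


# Tiny hypothetical layer: identity weights plus a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # shape (rank, d_in)
B = [[0.5], [0.0]]      # shape (d_out, rank)
y = lora_forward(W, A, B, [2.0, 3.0])

full, low_rank = lora_param_counts(d_out=1024, d_in=1024, rank=8)
```

For the illustrative 1024 × 1024 layer at rank 8, the adapter holds 64 times fewer values than the full matrix, which is why exchanging adapters rather than full weights lowers communication cost in a distributed setting.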

2605.22809 2026-05-26 cs.CV 版本更新

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Sensor2Sensor: 自动驾驶的跨本体传感器转换

Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng, Zehao Zhu, Meng-Li Shih, Xander Masotto, Shih-Yang Su, Kanaad V Parvate, Tiancheng Ge, Linn Bieske, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang

发表机构 * Waymo Johns Hopkins University(约翰霍普金斯大学) Google DeepMind(谷歌DeepMind) University of Washington(华盛顿大学)

AI总结 提出Sensor2Sensor生成模型,将单目行车记录仪视频转换为多模态传感器数据(多视角相机图像和LiDAR点云),通过4D高斯泼溅重建和扩散架构解决无配对数据问题,为自动驾驶开发解锁外部数据源。

Comments Accepted by CVPR 2026

详情
AI中文摘要

自动驾驶系统(ADS)的鲁棒训练和验证需要大规模、多样化的数据集。自动驾驶车队收集的专有数据虽然高保真,但在规模、传感器配置多样性以及地理和长尾行为覆盖方面有限。相比之下,来自行车记录仪等来源的野外数据提供了巨大的规模和多样性,捕获了关键的长尾场景和新环境。然而,这种非结构化的野外视频数据与期望结构化多模态传感器输入进行验证和训练的ADS不兼容。为了弥合这一数据差距,我们提出了Sensor2Sensor,一种新颖的生成建模范式,将野外的单目行车记录仪视频转换为高保真的多模态传感器套件(AV日志),包括多视角相机图像和LiDAR点云。一个核心挑战是缺乏配对训练数据。我们通过4D高斯泼溅(4DGS)重建和新视角渲染将真实的AV日志转换为行车记录仪风格的视频来解决这一问题。然后,Sensor2Sensor利用扩散架构进行生成转换。我们对生成的传感器数据的保真度和真实感进行了全面的定量评估。我们通过将具有挑战性的野外互联网和行车记录仪镜头转换为逼真的多模态数据格式,展示了Sensor2Sensor的实际效用,进一步为AV开发解锁了巨大的外部数据源。

英文摘要

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.

2605.19846 2026-05-26 cs.CV cs.AI cs.CL 版本更新

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

FineBench: 细粒度人类活动理解的视觉-语言模型基准测试与增强

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

发表机构 * National Taiwan University(国立台湾大学) Google(谷歌) Independent Researcher(独立研究员)

AI总结 针对视觉-语言模型在细粒度人类活动理解上的不足,提出包含密集标注的长视频问答基准FineBench和增强框架FineAgent。

Comments CVPR'26 (Workshop on Video Large Language Models). Project Page: https://joslefaure.github.io/assets/html/finebench.html

详情
AI中文摘要

视觉-语言模型(VLM)在通用视频理解方面表现出色,但在需要细致理解人类动作和交互的真实世界应用中,它们常常难以进行细粒度理解。虽然最近一些以人为中心的基准测试评估了模型行为的公平性/伦理、情感感知等维度,但它们没有结合长视频、密集的问答覆盖以及大规模的帧级空间/时间定位。为弥补这一差距,我们引入了FineBench,一个专门设计用于评估细粒度理解的以人为中心的视频问答(VQA)基准。FineBench包含199,420个多项选择问答对,密集标注在64个长视频(每个15分钟)上,重点关注详细的人物运动、人物交互和物体操作,包括组合动作。我们的广泛评估显示,虽然像GPT-5这样的专有模型取得了不错的性能,但当前的开源VLM明显表现不佳,特别是在多人场景的空间推理以及区分人类运动和交互的细微差异方面。为了解决这些已识别的弱点,我们提出了FineAgent,一个模块化框架,通过利用定位器和描述器来增强VLM。实验表明,FineAgent在FineBench上持续提高了各种开源VLM的性能。FineBench为未来细粒度以人为中心的视频理解研究提供了严格的测试平台,而FineAgent则为增强当前VLM中的此类推理提供了一种实用方法。项目页面和代码:https://joslefaure.github.io/assets/html/finebench.html。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

2605.19027 2026-05-26 cs.CV 版本更新

MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

MedFM-Robust:医学基础模型的鲁棒性基准测试

Xiangxiang Cui, Tianjin Huang, Yifang Wang, Lijie Hu, Lu Yin

发表机构 * Beijing Normal University, China(北京师范大学) University of Exeter, United Kingdom(埃克塞特大学) University College London, United Kingdom(伦敦大学学院) Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates(穆罕默德·本·扎耶德人工智能大学) University of Surrey, United Kingdom(萨里大学)

AI总结 本文提出了一个包含40种扰动类型(12种基础、28种医学特定)的鲁棒性基准,评估了多种医学基础模型在VQA、视觉定位、图像描述和分割任务上的表现,发现微调策略主导鲁棒性、医学特定扰动对分割影响大、零样本VQA鲁棒性依赖模型等关键结论。

Comments MICCAI2026

详情
AI中文摘要

医学基础模型在临床任务中取得了显著性能,但它们在现实世界扰动下的鲁棒性仍未得到充分探索。我们提出了一个鲁棒性基准,包含8种成像模态下的40种扰动类型(12种基础、28种医学特定),评估了五个视觉语言模型(LLaVA-Med、MedGemma、MedGemma-1.5、Gemini-2.5-flash和GPT-4o-mini)在VQA、视觉定位和图像描述任务上的表现,以及两个分割模型(MedSAM、SAM-Med2D)在五种微调策略下的性能。我们的发现表明:(1)微调策略主导鲁棒性,LoRA的退化程度几乎是全微调的两倍,而SAM-Med2D的Adapter提供了有利的效率-鲁棒性权衡。(2)医学特定扰动对分割造成不成比例的损害,15个最严重扰动中有9个是领域特定的。(3)LoRA微调的视觉定位性能下降超过40个百分点,而零样本图像描述保持稳定(下降<7%)。零样本VQA表现出模型依赖的鲁棒性——医学模型下降不到20%,而Gemini-2.5-flash下降54%。通用视觉语言模型在VQA上准确率更高,但在视觉定位上失败;在医学视觉语言模型中,MedGemma表现出最佳的整体稳定性。这些结果提供了部署指南,并强调了医学AI领域特定鲁棒性评估的必要性。我们的代码可在 https://abnerai.github.io/MedFM-Robust 获取。

英文摘要

Medical foundation models have achieved remarkable clinical performance, yet their robustness under real-world perturbations remains underexplored. We present a robustness benchmark comprising 40 perturbation types (12 base, 28 medical-specific) across eight imaging modalities, evaluating five VLMs (LLaVA-Med, MedGemma, MedGemma-1.5, Gemini-2.5-flash and GPT-4o-mini) on VQA, visual grounding, and captioning, alongside two segmentation models (MedSAM, SAM-Med2D) with five fine-tuning strategies. Our findings reveal: (1) Fine-tuning strategy dominates robustness, with LoRA exhibiting nearly double the degradation of full fine-tuning, while SAM-Med2D's Adapter offers a favorable efficiency-robustness trade-off. (2) Medical-specific perturbations disproportionately damage segmentation, with 9 of the 15 most severe corruptions being domain-specific. (3) LoRA-tuned visual grounding drops over 40 points, whereas zero-shot captioning remains stable (<7% drop). Zero-shot VQA shows model-dependent robustness: medical models drop under 20% while Gemini-2.5-flash drops 54%. General-purpose VLMs achieve higher VQA accuracy but fail on grounding; among medical VLMs, MedGemma demonstrates the best overall stability. These results provide deployment guidelines and underscore the necessity of domain-specific robustness evaluation for medical AI. Our code is available at: https://abnerai.github.io/MedFM-Robust.

2605.18267 2026-05-26 cs.CV 版本更新

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

SRC-Flow:紧凑语义表示实现归一化流用于图像生成

Longtao Jiang, Jianmin Bao, Zhendong Wang, Xin Tao, Pengfei Wan, Zhihui Li, Xiaojun Chang

发表机构 * University of Science and Technology of China(中国科学技术大学) Kling Team, Kuaishou Technology(快手科技 Kling 团队)

AI总结 提出SRC-Flow,通过语义表示压缩器将高维RAE特征压缩到低维语义空间,降低归一化流建模负担,在ImageNet上实现最优生成质量,同时保持精确似然计算和确定性可逆采样。

详情
AI中文摘要

归一化流(NFs)提供精确似然和确定性可逆采样,但在大规模图像生成方面历史上落后于扩散模型。我们识别出一个关键障碍:NFs需要学习全环境空间上的单个可逆传输,使其对高维表示高度敏感。这导致现代视觉表示空间中的语义-容量不匹配,其中语义信息紧凑但编码在过完备特征中。我们提出SRC-Flow,引入语义表示压缩器(SRC),在流建模之前将高维RAE特征压缩到低维语义空间,并通过冻结的RAE解码器保持重建。这个紧凑空间减少了NFs的建模负担,并在语义表示空间中实现了有效的基于似然的生成。我们进一步采用针对流学习的固定无条件双射的常数噪声正则化。在ImageNet $256 \times 256$和$512 \times 512$上,SRC-Flow在归一化流方法中实现了最先进的生成质量,在无分类器引导下gFID分数分别为1.65和2.07,同时在紧凑语义表示空间中保留精确似然计算,并在流级别实现确定性可逆采样。代码和模型将在https://github.com/longtaojiang/SRC-Flow提供。

英文摘要

Normalizing flows (NFs) provide exact likelihoods and deterministic invertible sampling, but have historically lagged behind diffusion models for large-scale image generation. We identify a key obstacle: NFs are required to learn a single invertible transport over the full ambient space, making them highly sensitive to high-dimensional representations. This leads to a semantic-capacity mismatch in modern visual representation spaces, where semantic information is compact but encoded in overcomplete features. We propose SRC-Flow, which introduces a Semantic Representation Compressor (SRC) to compress high-dimensional RAE features into a low-dimensional semantic space before flow modeling, while preserving reconstruction through the frozen RAE decoder. This compact space reduces the modeling burden of NFs and enables effective likelihood-based generation in semantic representation space. We further adopt constant noise regularization tailored to the fixed unconditional bijection learned by flows. On ImageNet $256 \times 256$ and $512 \times 512$, SRC-Flow achieves state-of-the-art generation quality among normalizing flow methods, with gFID scores of 1.65 and 2.07 under classifier-free guidance, while retaining exact likelihood computation in the compact semantic representation space and deterministic invertible sampling at the flow level. Code and models will be available at https://github.com/longtaojiang/SRC-Flow.
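For readers unfamiliar with why a normalizing flow yields an exact likelihood, the standard change-of-variables identity is written out below for the compact code rather than for pixels. The symbols are introduced here for illustration and are not taken from the paper: $h$ is an RAE feature, $g$ the compressor, $z = g(h)$ the compact semantic code, $f_\theta$ the invertible flow, and $p_U$ a simple base density such as a standard Gaussian.

```latex
% Exact log-likelihood of the compact code z under an invertible flow f_theta
\log p_\theta(z)
  = \log p_U\!\big(f_\theta(z)\big)
  + \log \left| \det \frac{\partial f_\theta(z)}{\partial z} \right|,
\qquad z = g(h).

% Deterministic, invertible sampling: draw u from the base density and invert
u \sim p_U, \qquad z = f_\theta^{-1}(u).
```

Both the Jacobian determinant and the inverse act on the dimensionality of $z$, so a lower-dimensional $z$ directly shrinks the space the flow has to transport. Note that the likelihood is exact for $z$, which matches the abstract's wording that exactness holds in the compact semantic representation space.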

2605.17543 2026-05-26 cs.CV cs.GR 版本更新

HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

HL-OutPaint:面向高分辨率长范围视频的粗到细视频外绘

Jeongeun Park, Janghyeok Han, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, Sunghyun Cho

发表机构 * POSTECH(浦项科技大学) Visual Display Business, Samsung Electronics(三星电子视觉显示事业部)

AI总结 提出HL-OutPaint框架,采用粗到细两阶段流程,通过全局-局部帧交换机制构建全局粗引导,实现高分辨率长视频的大空间外推和时空一致生成。

Comments Supplementary material and video included. Project page: https://koyy001.github.io/Publications/hl-outpaint

详情
AI中文摘要

视频外绘生成超出视频原始空间范围的合理视觉内容,在使视频适应不同显示格式方面发挥关键作用。为支持此类应用,它必须能够对长序列进行大空间外推。然而,现有大多数方法仅解决其中一个挑战,或缺乏确保全局时空一致性的明确机制,导致明显局限性。本文提出HL-OutPaint,一种用于长序列的高分辨率视频外绘框架。我们的方法遵循粗到细策略,采用两阶段流水线。首先构建全局粗引导(GCG),这是一种低分辨率表示,捕捉视频的全局结构和主导运动。与简单下采样不同,GCG通过一种新颖的全局-局部帧交换机制构建,该机制将稀疏全局关键帧与局部时间窗口耦合,并在采样过程中交换信息。这使得GCG能够在统一表示中编码长期结构一致性和短期时间动态。在此表示引导下,HL-OutPaint随后执行高分辨率外绘,生成空间细节丰富且时间一致的内容。通过将全局结构建模与细粒度合成分离,我们的框架实现了对大空间扩展和长视频序列的稳定、连贯生成。大量实验表明,HL-OutPaint在涉及宽空间外推和长视频序列的挑战性场景中优于现有方法。

英文摘要

Video outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.

2605.17260 2026-05-26 cs.CV 版本更新

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

LiteFrame: 高效视觉编码器解锁视频大语言模型中的帧缩放

Jihwan Kim, Nikhil Parthasarathy, Danfeng Qin, Junhwa Hur, Deqing Sun, Bohyung Han, Ming-Hsuan Yang, Boqing Gong

发表机构 * Seoul National University(首尔国立大学)

AI总结 针对视频大语言模型处理长视频时视觉令牌上下文长度爆炸的问题,提出LiteFrame高效视频编码器,通过压缩令牌蒸馏(CTD)训练框架,使紧凑的学生模型直接预测教师模型的信息密集时空压缩表示,从而在降低35%端到端延迟的同时处理8倍帧数并提升视频理解精度。

Comments Project Page: https://jjihwan.github.io/projects/LiteFrame

详情
AI中文摘要

将视频大语言模型扩展到长视频的基本挑战在于管理视觉令牌上下文长度的爆炸。现有策略主要关注“事后”令牌缩减——在特征提取后减少视觉令牌以减轻LLM的计算开销。虽然这些方法有效减少了视觉令牌数量,但我们观察到主要延迟瓶颈随后从LLM转移到视觉编码器昂贵的逐帧处理。为了解决这个问题,我们引入了LiteFrame,一个强大且高效的视频编码器骨干网络,用于视频大语言模型。为了训练LiteFrame,我们提出了压缩令牌蒸馏(CTD),一种新颖的训练框架,教导紧凑的学生视觉编码器直接预测大型教师视觉模型产生的信息密集、时空压缩的表示,从而有效绕过冗余计算。当与进一步的语言模型适配(LMA)结合时,这种方法产生了一个新的延迟-精度帕累托前沿——与InternVL3-8B相比,LiteFrame在端到端延迟降低35%的同时处理8倍帧数,并在多个基准测试中提高了平均视频理解精度。我们的结果展示了在固定计算预算下解锁更长视频理解的新潜在路径。

英文摘要

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier -- compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8$\times$ more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.

2605.14889 2026-05-26 cs.CV cs.AI 版本更新

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

SurgicalMamba: 具有状态重编程的双路径SSD用于在线手术阶段识别

Sukju Oh, Sukkyu Sun

发表机构 * Department of Computer Science and Artificial Intelligence(计算机科学与人工智能系)

AI总结 提出SurgicalMamba模型,基于Mamba2的结构化状态空间对偶性(SSD),通过双路径SSD块、强度调制步进和状态重编程三个组件,实现在线手术阶段识别,在多个基准上达到最先进性能。

Comments 28 pages, 7 figures, 10 tables; Code available at https://github.com/sukjuoh/Surgical-Mamba

详情
AI中文摘要

在线手术阶段识别(SPR)是上下文感知手术室系统的基础,要求仅根据过去上下文对每一帧做出预测。手术视频提出了自然视频识别器无法共同解决的三个需求:手术过程跨越数万帧,时间流动不均匀(长时间常规片段被短暂的阶段定义转换打断),视觉领域狭窄,因此骨干特征在通道间高度相关。现有识别器要么让每帧成本随已处理长度增长,要么保持成本有界但以均匀速率和通道独立动态推进状态,无法解决后两个需求。我们提出SurgicalMamba,一种基于Mamba2的结构化状态空间对偶性(SSD)的因果SPR模型,将每帧成本保持在O(d)。它引入了三个与SSD兼容的组件,共同解决这些需求:双路径SSD块,在循环状态级别分离长期和短期模式;强度调制步进,一种连续时间时间扭曲,使慢路径的有效速率适应阶段相关信息;以及状态重编程,一种每块的Cayley旋转,在原本轴对齐的SSM循环中打开跨通道混合。学习到的旋转平面继承了阶段对齐的结构,无需任何直接监督,提供了手术工作流的可解释内部特征。在七个公开SPR基准上,SurgicalMamba在严格在线评估下达到了最先进的准确率和阶段级Jaccard指数:在Cholec80上为94.6%/82.7%(比最强先前方法高0.7 pp/2.2 pp),在AutoLaparo上为89.5%/68.9%(高1.7 pp/2.0 pp),在单个GPU上达到238.74 fps。消融实验分离了每个组件的贡献。代码公开于https://github.com/sukjuoh/Surgical-Mamba。

英文摘要

Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components that jointly address these demands: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 238.74 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
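The abstract describes a Cayley rotation that mixes channels in an otherwise axis-aligned recurrence. As a rough illustration of why the Cayley transform yields a valid rotation, the sketch below works through the two-channel case in closed form. How SurgicalMamba parameterizes and applies the rotation per chunk is not stated in the abstract, so `cayley_rotation_2d` and `mix_channels` are hypothetical helpers.

```python
from typing import List


def cayley_rotation_2d(a: float) -> List[List[float]]:
    """Cayley transform R = (I - A)(I + A)^{-1} of the skew-symmetric
    matrix A = [[0, a], [-a, 0]], written out in closed form.

    The result is always orthogonal with determinant +1 (a rotation).
    """
    d = 1.0 + a * a
    return [
        [(1.0 - a * a) / d, -2.0 * a / d],
        [2.0 * a / d, (1.0 - a * a) / d],
    ]


def mix_channels(rotation: List[List[float]], state: List[float]) -> List[float]:
    """Apply the rotation to a two-channel state vector."""
    return [sum(r * s for r, s in zip(row, state)) for row in rotation]


# With a = 1 the rotation is a quarter turn: channel 0 is moved onto channel 1.
rotated = mix_channels(cayley_rotation_2d(1.0), [1.0, 0.0])
```

Because the map is orthogonal, the norm of the state is preserved, so mixing of this kind does not by itself amplify or shrink the recurrent state.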

2605.14255 2026-05-26 cs.LG cs.CV 版本更新

Architecture-Aware Explanation Auditing for Industrial Visual Inspection

面向工业视觉检测的架构感知解释审计

Sibo Jia, Zihang Zhao, Kunrong Li

AI总结 本文提出一种基于原生读出假设的架构感知解释审计协议,通过扰动实验证明解释方法的忠实度受其与模型原生决策机制的结构距离约束,并揭示忠实度排名是(模型、解释器、扰动算子)三元组的联合属性。

Comments Format update

详情
AI中文摘要

工业视觉检测系统日益依赖深度分类器,其热力图解释可能看似合理,但未能识别真正驱动模型决策的图像区域。本文基于原生读出假设,实现了一种架构感知的解释审计协议:解释方法的基于扰动的忠实度受其与模型原生决策机制的结构距离约束。在WM-811K晶圆图(9类,172k图像)上,采用三种子零填充扰动协议,ViT-Tiny + Attention Rollout的Deletion AUC为0.211,而Swin-Tiny / ResNet18+CBAM / DenseNet121 + Grad-CAM的Deletion AUC为0.432-0.525(|Cohen's d| > 1.1),尽管其分类准确率较低。Swin-Tiny将架构家族与读出结构分离:尽管是Transformer,其空间特征图层次使其与Grad-CAM兼容,表明操作因素是读出结构而非架构家族。一个模型无关的控制方法(RISE)将所有家族的Deletion AUC压缩至约0.1,表明差距源于解释器路径;值得注意的是,RISE优于所有原生方法,因此原生读出是兼容性原则而非最优性保证。模糊填充敏感性分析表明,在不同扰动基线下的家族排序反转,强化了忠实度排名是(模型、解释器、扰动算子)三元组的联合属性。在MVTec AD(预训练模型)上的探索性边界条件研究表明,审计结果依赖于数据集/任务,并识别了需要限定的条件。该协议提供了可操作的指导:解释路径应基于读出结构与模型架构协同设计,部署的热力图应附带定量忠实度指标。

英文摘要

Industrial visual inspection systems increasingly rely on deep classifiers whose heatmap explanations may appear visually plausible while failing to identify the image regions that actually drive model decisions. This paper operationalizes an architecture-aware explanation audit protocol grounded in the native-readout hypothesis: the perturbation-based faithfulness of an explanation method is bounded by its structural distance from the model's native decision mechanism. On WM-811K wafer maps (9 classes, 172k images) under a three-seed zero-fill perturbation protocol, ViT-Tiny + Attention Rollout attains Deletion AUC 0.211 against 0.432-0.525 for Swin-Tiny / ResNet18+CBAM / DenseNet121 + Grad-CAM (abs(Cohen's d) > 1.1), despite lower classification accuracy. Swin-Tiny disentangles architecture family from readout structure: despite being a Transformer, its spatial feature-map hierarchy makes it Grad-CAM compatible, showing that the operative factor is readout structure rather than architecture family. A model-agnostic control (RISE) compresses all families to Deletion AUC about 0.1, indicating the gap arises from the explainer pathway; notably, RISE outperforms all native methods, so native readout is a compatibility principle rather than an optimality guarantee. A blur-fill sensitivity analysis shows that the family ordering reverses under a different perturbation baseline, reinforcing that faithfulness rankings are joint properties of (model, explainer, perturbation operator) triples. An exploratory boundary-condition study on MVTec AD (pretrained models) indicates that audit results are dataset/task dependent and identifies conditions requiring qualification. The protocol yields actionable guidance: explanation pathways should be co-designed with model architectures based on readout structure, and deployed heatmaps should be accompanied by quantitative faithfulness metrics.
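For readers unfamiliar with the Deletion AUC metric quoted above, the following is a minimal sketch under simplifying assumptions: the image is a flat list of pixel values, pixels are removed in order of decreasing saliency, and the area under the confidence curve is taken with the trapezoid rule. It is not the paper's audit protocol; `deletion_auc` and `toy_confidence` are hypothetical names, and lower values indicate a more faithful explanation.

```python
from typing import Callable, List, Sequence


def deletion_auc(
    pixels: Sequence[float],
    saliency: Sequence[float],
    confidence: Callable[[List[float]], float],
    steps: int,
    fill: float = 0.0,
) -> float:
    """Area under the confidence curve as the most salient pixels are deleted."""
    if steps < 1 or len(pixels) != len(saliency):
        raise ValueError("need steps >= 1 and matching pixel/saliency lengths")
    # Rank pixel indices from most to least salient.
    order = sorted(range(len(pixels)), key=lambda i: saliency[i], reverse=True)
    perturbed = list(pixels)
    curve = [confidence(perturbed)]
    deleted = 0
    for step in range(1, steps + 1):
        target = round(step * len(pixels) / steps)
        for idx in order[deleted:target]:
            perturbed[idx] = fill          # zero-fill perturbation by default
        deleted = target
        curve.append(confidence(perturbed))
    width = 1.0 / steps                    # deleted fraction runs over [0, 1]
    return sum(0.5 * (curve[k] + curve[k + 1]) * width for k in range(steps))


def toy_confidence(pixels: List[float]) -> float:
    """Hypothetical stand-in for a classifier: depends only on pixels 0 and 1."""
    return (pixels[0] + pixels[1]) / 2.0


image = [1.0, 1.0, 0.0, 0.0]
faithful = deletion_auc(image, [0.9, 0.8, 0.1, 0.0], toy_confidence, steps=4)
unfaithful = deletion_auc(image, [0.0, 0.1, 0.8, 0.9], toy_confidence, steps=4)
```

Swapping the `fill` value changes the perturbation operator, which is the knob the abstract's blur-fill sensitivity analysis varies.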

2605.12961 2026-05-26 cs.CV cs.LG 版本更新

Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering

减少偏差与方差:用于图像聚类的生成语义引导与双层集成

Feijiang Li, Zhenxiong Li, Jieting Wang, Zizheng Jiu, Saixiong Liu, Liang Du

发表机构 * Institute of Big Data Science and Industry(大数据科学与产业研究院) Key Laboratory of Evolutionary Science Intelligence of Shanxi Province(山西省进化智能科学重点实验室) School of Artificial Intelligence, Shanxi University(山西大学人工智能学院)

AI总结 提出GSEC框架,通过生成语义引导减少偏差、双层集成学习降低方差,在六个基准数据集上超越18种最新方法。

详情
AI中文摘要

图像聚类旨在将未标记的图像数据集划分为不同的组。该任务的一个核心方面是构建并利用先验知识来指导聚类过程。最近的方法引入语义描述作为先验信息,其中大多数通常依赖于基于匹配的技术和预定义词汇表。然而,有限的匹配空间限制了它们对下游聚类任务的适应性。此外,这些方法主要关注减少偏差以提高性能,经常忽视方差降低的重要性。为了解决这些局限性,我们提出了GSEC(基于生成语义引导和双层集成的图像聚类),这是一个旨在通过生成语义引导减少偏差并通过集成学习缓解方差的框架。我们的方法利用多模态大语言模型生成语义描述,并通过加权平均推导图像嵌入。此外,双层集成策略通过内层的BatchEnsemble整合跨模态信息,并通过外层的对齐机制对齐输出。对比实验表明,GSEC在六个基准数据集上优于18种最新方法,进一步分析证实了其在同时减少偏差和方差方面的有效性。代码可在https://github.com/2017LI/GSEC.git获取。

英文摘要

Image clustering aims to partition unlabeled image datasets into distinct groups. A core aspect of this task is constructing and leveraging prior knowledge to guide the clustering process. Recent approaches introduce semantic descriptions as prior information, most of which typically rely on matching-based techniques with predefined vocabularies. However, the limited matching space restricts their adaptability to downstream clustering tasks. Moreover, these methods primarily focus on reducing bias to improve performance, frequently overlooking the importance of variance reduction. To address these limitations, we propose GSEC (Image Clustering based on Generative Semantic Guidance and Bi-Layer Ensemble), a framework designed to reduce bias through generative semantic guidance and mitigate variance via ensemble learning. Our method employs Multimodal Large Language Models to generate semantic descriptions and derive image embeddings via weighted averaging. Additionally, a bi-layer ensemble strategy integrates cross-modal information through BatchEnsemble in the inner layer and aligns outputs via an alignment mechanism in the outer layer. Comparative experiments demonstrate that GSEC outperforms 18 state-of-the-art methods across six benchmark datasets, while further analysis confirms its effectiveness in simultaneously reducing both bias and variance. The code is available at https://github.com/2017LI/GSEC.git.

2605.10764 2026-05-26 cs.CV cs.AI 版本更新

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

打破刹车,而非车轮:通过熵最大化实现无目标越狱

Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni, Zhaoyuan Yang, Weikang Li, Xuesong Li, Jing Zhang

发表机构 * Australian National University(澳大利亚国立大学) The University Of Queensland(昆士兰大学) Peking University(北京大学) GE research(通用电气研究院) CSIRO(澳大利亚联邦科学与工业研究组织)

AI总结 提出UJEM-KL攻击方法,通过最大化决策令牌的熵来翻转视觉-语言模型的拒绝输出,实现高迁移性的无目标越狱。

Comments Preprint. 17 pages, 8 figures, 6 tables

详情
AI中文摘要

近期研究表明,基于梯度的通用图像越狱攻击在视觉-语言模型(VLM)上几乎没有或完全没有跨模型迁移性,这使人们对可迁移多模态越狱的可行性产生了怀疑。我们在严格的无目标威胁模型下重新审视这一结论,不强制固定前缀或响应模式。初步实验发现,在自回归解码过程中,拒绝行为集中在高熵令牌上,而攻击前非拒绝令牌在前排候选者中已占据相当大的概率质量。受此启发,我们提出通过熵最大化的无目标越狱(UJEM)-KL,这是一种轻量级攻击,通过最大化这些决策令牌的熵来翻转拒绝结果,同时稳定剩余的低熵位置以保持输出质量。在三个VLM和两个安全基准测试中,UJEM-KL实现了具有竞争力的白盒攻击成功率,并持续提高了迁移性,同时在代表性防御下仍然有效。我们的实验结果表明,有限的迁移性主要源于过度受限的优化目标。

英文摘要

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.

2605.06415 2026-05-26 cs.LG cs.AI cs.CL cs.CV 版本更新

E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

E = T*H/(O+B):混合专家生态的无量纲控制参数

Qingjun Zhang

发表机构 * School of Integrated Circuits, Wuxi Taihu University(无锡太湖学院集成电路学院)

AI总结 提出无量纲控制参数E = T*H/(O+B),通过12个控制实验证明E≥0.5可保证混合专家模型无死亡专家,并发现专家复活、正交毒性依赖数据集等六项额外结果。

Comments 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework

详情
AI中文摘要

我们引入E = T*H/(O+B),这是一个无量纲控制参数,用于预测混合专家(MoE)模型是否会发展出健康的专家生态还是陷入死亡专家。E将四个超参数——路由温度T、路由熵权重H、先知权重O和平衡权重B——组合成一个单一量。通过12个控制实验(8个视觉,4个语言),总计超过11,000个训练周期,我们确定仅E ≥ 0.5就足以保证零死亡专家,消除了手工设计负载平衡辅助损失的必要性。我们在CIFAR-10、CIFAR-100、TinyImageNet-200、WikiText-2和WikiText-103上跨模态验证了这一点。另外还发现了六项结果:(1)死亡专家可以复活——由平衡损失驱动路由器重新探索触发;(2)正交毒性依赖于数据集,并非普遍存在;(3)任务复杂性改变了临界E阈值;(4)模型过拟合与专家生态健康解耦;(5)三层MoE自发崩溃为两层功能结构;(6)生态结构在50倍温度范围内保持不变。我们提出E作为MoE训练的统一诊断指标,类似于流体力学中的雷诺数。

英文摘要

We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
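The control parameter defined in the abstract is simple enough to compute directly. The sketch below assumes the four hyperparameters are available as plain floats; the function names and the example values are hypothetical, and the 0.5 threshold is the one quoted in the abstract, which the authors also report can shift with task complexity.

```python
def ecology_parameter(
    temperature: float,
    entropy_weight: float,
    oracle_weight: float,
    balance_weight: float,
) -> float:
    """E = T * H / (O + B), as defined in the abstract."""
    denominator = oracle_weight + balance_weight
    if denominator <= 0.0:
        raise ValueError("O + B must be positive for E to be defined")
    return temperature * entropy_weight / denominator


def predicts_no_dead_experts(e_value: float, threshold: float = 0.5) -> bool:
    """Apply the E >= 0.5 criterion reported in the abstract."""
    return e_value >= threshold


# Hypothetical settings, chosen only to land on either side of the threshold.
healthy = ecology_parameter(2.0, 0.5, 1.0, 1.0)    # 2.0 * 0.5 / 2.0 = 0.5
at_risk = ecology_parameter(1.0, 0.25, 1.0, 1.0)   # 1.0 * 0.25 / 2.0 = 0.125
```

Since E is a ratio, scaling the oracle and balance weights up together lowers E unless temperature or entropy weight is raised to compensate.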

2605.03509 2026-05-26 cs.CV cs.AI 版本更新

BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement

BFORE: 蝴蝶-萤火虫优化的Retinex增强用于低光图像质量提升

Ahmed Cherif

发表机构 * Sofrecom Tunisia(Sofrecom突尼斯) Orange Innovation(Orange创新)

AI总结 提出BFORE框架,结合蝴蝶优化算法和萤火虫算法自动搜索最佳Retinex增强参数,最大化高斯自然度评分,显著提升低光图像质量。

详情
AI中文摘要

低光图像存在可见度差、噪声和颜色失真问题。现有的基于Retinex的增强方法依赖手动调整参数,无法泛化到不同光照条件。本文提出BFORE(蝴蝶-萤火虫优化的Retinex增强),一个自动为每张图像寻找最佳增强参数的框架。BFORE分两阶段工作:(1)蝴蝶优化算法(BOA)搜索最优的多尺度Retinex带颜色恢复(MSRCR)参数,然后(2)萤火虫算法(FA)微调伽马校正、去噪和颜色参数。两个阶段都最大化高斯自然度评分(GNS),一种衡量增强图像自然度的无参考指标。标准质量指标(PSNR、SSIM、NIQE)仅在优化后计算,确保零数据泄露。在30对合成图像上,BFORE达到GNS=0.971,优于次优方法MSRCR(0.894)8.6%。在来自LOL数据集的115张真实图像上,BFORE达到GNS=0.887,优于MSRCR(0.808)9.8%。与三个在相同条件下训练的深度学习基线(Zero-DCE、SCI、IAT)进行受控比较,BFORE在GNS上超过最佳深度学习方法14.7%。消融研究证实,混合BOA+FA策略显著优于单独使用每种优化器,而在三个评估预算下的可扩展性分析表明,一旦计算资源可用,结构化优化器显著优于均匀随机采样(128次评估时p=0.009,300次评估时p=0.021)。所有改进均具有统计显著性(Wilcoxon符号秩检验p<0.0001)。每张图像在CPU上的处理时间为3-6分钟,适用于离线应用。

英文摘要

Low-light images suffer from poor visibility, noise, and color distortion. Existing Retinex-based enhancement methods rely on manually tuned parameters that do not generalize across different lighting conditions. This paper proposes BFORE (Butterfly-Firefly Optimized Retinex Enhancement), a framework that automatically finds the best enhancement parameters for each image. BFORE works in two phases: (1) a Butterfly Optimization Algorithm (BOA) searches for optimal Multi-Scale Retinex with Color Restoration (MSRCR) parameters, then (2) a Firefly Algorithm (FA) fine-tunes gamma correction, denoising, and color parameters. Both phases maximize a Gaussian Naturalness Score (GNS), a no-reference metric that measures how natural the enhanced image looks. Standard quality metrics (PSNR, SSIM, NIQE) are computed only after optimization, ensuring zero data leakage. On 30 synthetic image pairs, BFORE achieves GNS = 0.971, outperforming the next-best method MSRCR (0.894) by 8.6%. On 115 real images from the LOL dataset, BFORE achieves GNS = 0.887, outperforming MSRCR (0.808) by 9.8%. A controlled comparison with three deep learning baselines (Zero-DCE, SCI, IAT) trained under identical conditions shows BFORE surpasses the best DL method by 14.7% in GNS. An ablation study confirms that the hybrid BOA+FA strategy significantly outperforms each optimizer in isolation, and a scalability analysis at three evaluation budgets shows that the structured optimizer significantly outperforms uniform random sampling once compute is available (p = 0.009 at 128 evaluations, p = 0.021 at 300 evaluations). All improvements are statistically significant (p < 0.0001, Wilcoxon signed-rank test). Processing time is 3-6 minutes per image on CPU, suitable for offline applications.
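The percentage gains quoted in the abstract are relative improvements over the MSRCR baseline, which can be checked from the reported GNS values. A small sketch (the helper name `relative_gain_percent` is hypothetical):

```python
def relative_gain_percent(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, in percent."""
    if baseline_score == 0.0:
        raise ValueError("baseline_score must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0


# GNS values as reported in the abstract.
synthetic_gain = relative_gain_percent(0.971, 0.894)   # 30 synthetic pairs
lol_gain = relative_gain_percent(0.887, 0.808)         # 115 LOL images
```

Rounded to one decimal place these come to 8.6 and 9.8, matching the figures in the abstract.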

2605.00908 2026-05-26 cs.CV 版本更新

Evaluation of Convolutional and Transformer-Based Detectors for Weed Detection in Tomato Plantations

卷积与基于Transformer的检测器在番茄种植园杂草检测中的评估

Alcides Toledo Espinosa, Gerardo Antonio Álvarez Hernández, Ángel Eduardo Zamora-Suárez, Miguel Bolaños, Juan Irving Vásquez

发表机构 * Instituto Politécnico Nacional(墨西哥国立理工学院) CIDETEC-IPN(CIDETEC-墨西哥国立理工学院) UPIBI-IPN(UPIBI-墨西哥国立理工学院)

AI总结 本文比较了基于CNN和Transformer的目标检测架构在番茄种植园早期杂草检测中的性能,揭示了效率与上下文建模之间的权衡。

Comments 7 pages, 3 figures, and 1 table

详情
AI中文摘要

本文对卷积和基于Transformer的目标检测架构在番茄种植园早期杂草检测中进行了比较评估。考虑了每种范式的代表性模型,包括YOLOv26-nano(YOLO系列的最新变体)以及作为基于Transformer架构的RT-DETR Large和RF-DETR Medium。评估在GROUNDBASED_WEED数据集上进行,考虑了六个杂草类别和一个对应于未识别植物的额外类别,从而能够使用精度、召回率、平均精度和推理速度等指标以及非参数统计检验来评估检测准确性和计算效率方面的性能。结果突出了效率与上下文建模之间的明显权衡:基于CNN的检测器以较低的计算成本实现了高性能,而基于Transformer的方法以更高的资源需求为代价提供了更好的全局上下文捕获。这些结果为精准农业应用中的模型选择提供了实用标准。

英文摘要

This paper presents a comparative evaluation of convolutional and transformer-based object detection architectures for early weed detection in tomato plantations. Representative models from each paradigm are considered, including YOLOv26-nano, a recent variant of the YOLO family, and RT-DETR Large and RF-DETR Medium as transformer-based architectures. The evaluation was conducted on the GROUNDBASED_WEED dataset, considering six weed classes and an additional category corresponding to unidentified plants, which allowed for the assessment of performance in terms of detection accuracy and computational efficiency using metrics such as precision, recall, average precision, and inference speed, as well as non-parametric statistical tests. The results highlight a clear trade-off between efficiency and contextual modeling: CNN-based detectors achieve high performance at a lower computational cost, while transformer-based approaches offer better global context capture at the expense of higher resource demands. These results provide practical criteria for model selection in precision agriculture applications.

2603.29236 2026-05-26 cs.CV 版本更新

M2H-MX: Multi-Task Semantic and Geometric Perception for Real-Time Monocular 3D Scene Graph Construction

M2H-MX:用于实时单目3D场景图构建的多任务语义与几何感知

U. V. B. L. Udugama, George Vosselman, Francesco Nex

发表机构 * Department of Earth Observation Science(地球观测科学系)

AI总结 提出M2H-MX多任务感知模型,通过注册门控全局上下文和受控跨任务交互的轻量解码器,在严格延迟约束下实现深度与语义预测相互增强,并集成到单目SLAM中,显著提升轨迹精度和地图质量。

Comments 6 pages, 5 figures, 5 tables. Preprint under review

详情
AI中文摘要

单目相机因其低成本且易于部署而对机器人感知具有吸引力,但从单一图像流实现可靠的实时空间理解仍然具有挑战性。虽然最近的多任务密集预测模型改进了逐像素深度和语义估计,但将这些进展转化为稳定的单目建图系统仍然不简单。本文提出了M2H-MX,一种用于单目空间理解的实时多任务感知模型。该模型保留多尺度特征表示,同时在轻量解码器中引入注册门控全局上下文和受控跨任务交互,使深度和语义预测在严格的延迟约束下相互增强。其输出通过紧凑的感知到建图接口直接集成到未修改的单目SLAM流水线中。我们评估了密集预测精度和系统内性能。在NYUDv2上,M2H-MX-L取得了最先进的结果,与代表性多任务基线相比,语义mIoU提高了6.6%,深度RMSE降低了9.4%。当在ScanNet上的实时单目建图系统中部署时,与强单目SLAM基线相比,M2H-MX将平均轨迹误差降低了60.7%,同时生成更清晰的度量-语义地图。这些结果表明,现代多任务密集预测可以可靠地部署于机器人系统中的实时单目空间感知。

英文摘要

Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.

2603.17044 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

理解与生成相冲突吗?统一多模态模型DPO的诊断研究

Abinav Rao, Sujan Rachuri

AI总结 通过系统实验发现,在统一多模态模型上应用DPO时,生成质量难以对齐,主要原因是理解和生成梯度近乎正交且存在11-14倍的幅度不平衡,源于VQ token数量不对称。

Comments Experiments are inconclusive: The claim that architectures such as Chameleon or Emu would exhibit stronger gradient conflict is not supported by experiments or analysis, and all experiments are conducted on Janus-Pro without evaluation on other unified multimodal architectures

详情
AI中文摘要

统一多模态模型共享一个语言模型骨干来同时进行理解和生成图像。DPO能否同时对齐这两种能力?我们首次系统研究了这一问题,在Janus-Pro的1B和7B参数上应用DPO,采用七种训练策略和两种事后方法。核心发现是负面的:在该架构下,所有测试条件下生成质量都抵制DPO对齐。在7B规模下,没有任何方法能改善生成CLIPScore(|Δ| < 0.2,每个种子n=200,3个种子,p > 0.5);在1B规模下,所有方法都降低了生成质量,并且该结果在偏好数据类型(真实vs生成和模型vs模型)以及测试的数据量(150-288对)上均成立。梯度分析揭示了原因:理解和生成梯度近乎正交(cos ~ 0),且由于VQ token数量不对称(576个生成token vs. ~30-100个文本token),幅度不平衡达到约11-14倍。这种不平衡是多任务DPO中的主要干扰机制;幅度平衡产生了方向正确的理解增量(VQA +0.01-0.04,虽然单独不显著),但生成差距仍然存在。我们识别出离散VQ tokenization是一个可能的结构瓶颈——生成DPO损失收敛到ln(2)支持了这一点——并为使用基于VQ的统一模型的从业者提供了实用指导。

英文摘要

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
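The remark that the generation DPO loss converges to ln(2) is easier to interpret with the standard DPO objective written out: when the policy's log-probability margin between chosen and rejected samples equals the reference model's, the loss is exactly -log(sigmoid(0)) = ln(2). The sketch below uses the standard per-pair DPO loss; the log-probabilities and the `beta` value are made-up illustration values, not numbers from the paper.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference signal has been absorbed.
no_separation = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved toward the chosen sample and away from the rejected one.
some_separation = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

A loss that stays at ln(2), roughly 0.693, therefore indicates that the policy is not separating chosen from rejected generations relative to the reference.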

2603.10267 2026-05-26 cs.CV 版本更新

A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR

基于YOLO和视觉语言OCR的孟加拉车牌识别鲁棒深度学习框架

Nayeb Hasin, Md. Arafath Rahman Nishat, Mainul Islam, Khandakar Shakib Al Hasan, Asif Newaz

发表机构 * Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur, Bangladesh(伊斯兰科技大学电气与电子工程系,孟加拉国加济布尔) Department of Mechanical Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh(孟加拉国工程技术大学机械工程系,孟加拉国达卡)

AI总结 提出一种结合YOLOv8两阶段自适应训练和ViT+BanglaBERT视觉语言OCR的鲁棒孟加拉车牌识别系统,在车牌定位和字符识别上分别达到97.83%准确率和0.1323字符错误率。

Comments Accepted at the 2026 IEEE International Conference on AI and Data Analytics (ICAD 2026). Final version will appear in IEEE Xplore

详情
AI中文摘要

自动车牌识别(ALPR)系统是智能交通管理系统的关键组成部分。然而,由于复杂的字符方案和不均匀的布局,孟加拉车牌检测仍然具有挑战性。本文提出了一种鲁棒的孟加拉车牌识别系统,该系统将基于深度学习的车牌定位目标检测模型与用于文本提取的光学字符识别相结合。比较了多种目标检测架构,包括U-Net和几种YOLO(You Only Look Once)变体,用于车牌定位。本研究提出了一种基于YOLOv8架构的新型两阶段自适应训练策略,以提高定位性能。所提出的方法优于现有模型,达到了97.83%的准确率和91.3%的交并比(IoU)。文本识别问题被表述为基于视觉编码器-解码器架构的序列生成问题,并评估了编码器-解码器的组合。结果表明,ViT + BanglaBERT模型在字符级别上取得了更好的结果,字符错误率为0.1323,词错误率为0.1068。所提出的系统在为此研究目的整理的外部数据集上进行测试时也表现出一致的性能。与训练样本相比,该数据集提供了完全不同的环境和光照条件,表明了所提出框架的鲁棒性。总体而言,我们提出的系统为孟加拉车牌识别提供了鲁棒且可靠的解决方案,并在各种真实场景中有效运行,包括光照、噪声和车牌样式的变化。这些优势使其非常适合部署在智能交通应用中,如自动执法和访问控制。

英文摘要

An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. Text recognition is formulated as a sequence generation problem with a VisionEncoderDecoder architecture, and several encoder-decoder combinations are evaluated. The ViT + BanglaBERT model gives the best results at the character level, with a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068. The proposed system also shows consistent performance when tested on an external dataset curated for this study. That dataset differs completely from the training samples in environment and lighting conditions, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
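The two headline metrics in this abstract, Intersection over Union for localization and Character Error Rate for recognition, can be sketched in a few lines. This is a generic illustration, not the authors' evaluation code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the error rate counts Unicode code points, which may differ from grapheme-level counting for Bangla conjuncts.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, a ten-character plate string with a single misread digit gives a Character Error Rate of 0.1.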

2603.08011 2026-05-26 cs.CV 版本更新

It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

是时候正确了:提升视觉语言模型中的模拟时钟读取和指针空间推理能力

Jaeha Choi, Jin Won Lee, Siwoo You, Jangho Lee

发表机构 * Incheon National University(仁川国立大学) McGill University(麦吉尔大学)

AI总结 针对视觉语言模型在真实环境中读取模拟时钟的挑战,提出TickTockVQA数据集和Swap-DPO微调框架,显著提升时钟读取准确性和鲁棒性。

Comments Accepted to CVPR 2026 Findings

详情
AI中文摘要

视觉语言模型(VLM)在复杂多模态推理任务上取得了显著成功,导致人们假设它们也应擅长读取模拟时钟。然而,与预期相反,我们的研究表明,在真实环境中读取模拟时钟对最先进的VLM来说仍然是一个重大挑战。现有的模拟时钟数据集大多是合成或平面的,风格多样性有限且背景上下文极少,无法捕捉真实世界场景的视觉变化。因此,在此类数据上训练的VLM表现出较弱的时空推理能力,经常混淆时针和分针,并在遮挡、光照变化和杂乱背景等常见视觉条件下挣扎。为解决此问题,我们引入了TickTockVQA,一个包含真实世界多样化场景中模拟时钟的人工标注数据集。TickTockVQA提供明确的时针和分针标注,并在可从视觉上下文推断时包含AM/PM标签。此外,我们提出了Swap-DPO,一种基于直接偏好优化的微调框架,以将模型推理对齐到准确的时间解释。实验结果表明,我们的方法在真实世界条件下显著提高了时钟读取的准确性和鲁棒性,为VLM中时空推理和视觉理解的未来研究奠定了基础。

英文摘要

Advances in vision-language models (VLMs) have achieved remarkable success on complex multimodal reasoning tasks, leading to the assumption that they should also excel at reading analog clocks. However, contrary to this expectation, our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such data exhibit weak spatiotemporal reasoning, frequently confusing the hour and minute hands and struggling under common visual conditions such as occlusion, lighting variation, and cluttered backgrounds. To address this issue, we introduce TickTockVQA, a human-annotated dataset containing analog clocks in diverse real-world scenarios. TickTockVQA provides explicit hour and minute annotations, and includes an AM/PM tag when it is inferable from the visual context. Furthermore, we propose Swap-DPO, a direct preference optimization-based fine-tuning framework to align model reasoning toward accurate time interpretation. Experimental results demonstrate that our approach substantially enhances clock reading accuracy and robustness under real-world conditions, establishing a foundation for future research on spatiotemporal reasoning and visual understanding in VLMs.
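The hour/minute confusion described in the abstract can be made concrete with the geometry of a clock face. The sketch below converts a time to hand angles and back, then shows what reading results if the two hands are mistaken for each other. It only illustrates the failure mode; it is not the Swap-DPO procedure, whose details are not given in the abstract, and both function names are hypothetical.

```python
from typing import Tuple


def hand_angles(hour: int, minute: int) -> Tuple[float, float]:
    """Angles in degrees, clockwise from 12, of the hour and minute hands."""
    hour_angle = 30.0 * (hour % 12) + 0.5 * minute   # 360 / 12 per hour, plus drift
    minute_angle = 6.0 * minute                      # 360 / 60 per minute
    return hour_angle, minute_angle


def read_time(hour_angle: float, minute_angle: float) -> Tuple[int, int]:
    """Recover (hour, minute) on a 12-hour dial from the two hand angles."""
    minute = round(minute_angle / 6.0) % 60
    hour = int(hour_angle // 30.0) % 12
    return (hour if hour != 0 else 12), minute


hour_deg, minute_deg = hand_angles(10, 10)
correct_reading = read_time(hour_deg, minute_deg)
swapped_reading = read_time(minute_deg, hour_deg)   # hands mistaken for each other
```

Here the true time 10:10 becomes 2:51 when the hands are swapped, a large error produced by a single perceptual mistake.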

2602.20725 2026-05-26 cs.CV 版本更新

Bridging Rendering and Generative Modeling with Monte Carlo Transport Scheduling

桥接渲染与生成建模:蒙特卡洛传输调度

Junwei Shu, Wenjie Liu, Hantang Liu, Changbo Wang, Yang Li

发表机构 * East China Normal University(华东师范大学)

AI总结 提出蒙特卡洛传输调度框架,将渐进式路径追踪视为连续采样驱动的传输过程,通过真实渲染端点训练实现任意步数的神经细化,并作为物理先验迁移至生成模型。

Comments preprint

详情
AI中文摘要

蒙特卡洛渲染和现代生成模型都将不确定状态转化为结构化图像,但通常被视为独立过程。我们引入蒙特卡洛传输调度,一个将渐进式路径追踪视为连续采样驱动传输过程的框架。我们的关键观察是,渲染器在此过程中已经产生物理有效状态:嵌套蒙特卡洛估计追踪一条细化轨迹,其自然时间坐标由采样方差决定。这一观点引出一个连续训练框架,从真实渲染端点而非合成插值中学习,保留蒙特卡洛估计的统计结构,同时支持任意步数的神经细化。我们在一个旨在分离传输难度与场景上下文的受控渲染基准上评估该框架,结果表明它产生稳定的渲染细化,支持渲染状态之间的连续停止,并作为冻结生成采样器的物理先验进行迁移。这些结果表明渲染和生成存在共同的连续时间基础,其中蒙特卡洛采样既提供物理状态,也提供学习图像传输的监督。

英文摘要

Monte Carlo rendering and modern generative models both transform uncertain states into structured images, yet they are usually studied as separate processes. We introduce Monte Carlo Transport Scheduling, a framework that treats progressive path tracing as a continuous sampling-driven transport process. Our key observation is that the renderer already produces physically valid states along this process: nested Monte Carlo estimates trace a refinement trajectory whose natural time coordinate follows from sampling variance. This view leads to a continuous training framework that learns from real render endpoints rather than synthetic interpolants, preserving the statistical structure of Monte Carlo estimation while enabling arbitrary-step neural refinement. We evaluate the framework on a controlled rendering benchmark designed to separate transport difficulty from scene context, and show that it yields stable render refinement, supports continuous stopping between rendering states, and transfers as a physical prior for frozen generative samplers. These results suggest a common continuous-time substrate for rendering and generation, where Monte Carlo sampling provides both the physical states and the supervision for learning image transport.
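A minimal sketch of the "nested Monte Carlo estimates" idea for a single pixel: each later estimate reuses all earlier samples, and because the variance of a mean of n i.i.d. samples scales as 1/n, the sample count induces a natural ordering of states. The specific time coordinate below (t = 1 - sqrt(n0 / n)) is an assumption chosen for illustration; the abstract does not give the authors' formula.

```python
import math
from itertools import accumulate


def nested_estimates(samples: list[float], counts: list[int]) -> list[float]:
    """Estimate k is the mean of the first counts[k] samples, so estimates are nested."""
    prefix_sums = list(accumulate(samples))
    return [prefix_sums[n - 1] / n for n in counts]


def variance_time(counts: list[int]) -> list[float]:
    """Hypothetical time coordinate: 0 at the first state, approaching 1 as n grows.

    The standard deviation of an n-sample mean is proportional to 1/sqrt(n),
    so sqrt(n0 / n) is the noise level relative to the first state.
    """
    n0 = counts[0]
    return [1.0 - math.sqrt(n0 / n) for n in counts]


radiance_samples = [float(i) for i in range(1, 9)]  # stand-in for path-traced samples
states = nested_estimates(radiance_samples, [1, 2, 4, 8])
times = variance_time([1, 4, 16, 64])
```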

2602.01576 2026-05-26 cs.LG cs.AI cs.CV 版本更新

Generative Visual Code Mobile World Models

生成式视觉代码移动世界模型

Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

发表机构 * Trillion Labs(万亿实验室)

AI总结 提出通过单一视觉语言模型预测可执行网页代码来生成移动GUI下一状态,结合文本和视觉世界模型优势,实现高保真视觉生成与精确文本渲染。

Comments ICML 2026

详情
AI中文摘要

移动图形用户界面世界模型为在训练和推理时提升移动GUI代理性能提供了有前景的路径。然而,当前方法面临关键权衡:基于文本的世界模型牺牲了视觉保真度,而视觉世界模型在精确文本渲染上的不足导致其依赖缓慢、复杂的流水线和大量外部模型。我们提出一种新范式:通过可渲染代码生成进行视觉世界建模,其中单一视觉语言模型预测下一个GUI状态为可执行网页代码,该代码渲染为像素,而非直接生成像素。这结合了两种方法的优势:视觉语言模型保留其语言先验以实现精确文本渲染,同时其在结构化网页代码上的预训练实现了高保真视觉生成。我们推出了gWorld(8B、32B),这是基于该范式的首个开源权重视觉移动GUI世界模型,以及一个自动合成基于代码的训练数据的数据生成框架(gWorld)。在4个分布内和2个分布外基准测试的广泛评估中,gWorld在准确率与模型规模之间建立了新的帕累托前沿,性能优于8个前沿开源权重模型(其规模大50.25倍以上)。进一步分析表明:(1)通过gWorld扩展训练数据带来有意义的收益;(2)我们流水线的每个组件都提高了数据质量;(3)更强的世界建模提升了下游移动GUI策略性能。

英文摘要

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while the inability of visual WMs in precise text rendering led to their reliance on slow, complex pipelines dependent on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

2601.21670 2026-05-26 cs.CV cs.LG 版本更新

Diverse via bounded Agreement: Geometric Regularization for Multimodal Fusion

通过有界一致性实现多样性:多模态融合的几何正则化

Zixuan Xia, Hao Wang, Pengcheng Weng, Yanyu Qian, Yangxin Xu, William Dan, Fei Wang

发表机构 * Department of Informatics University of Bern(伯尔尼大学信息学院) College of Computing and Data Science Nanyang Technological University(南洋理工大学计算机与数据科学学院) School of Software Engineering Xi’an Jiaotong University(西安交通大学软件工程学院)

AI总结 提出一种轻量级即插即用的几何正则化框架,通过有界一致性原则在保持模态特异多样性的同时约束跨模态漂移,提升多模态融合性能。

详情
AI中文摘要

多模态融合通常被视为一个优化平衡问题,通过调整训练信号防止一种模态主导其他模态。然而,平衡优化并不能完全决定中间表示的几何结构。有监督的多模态模型仍可能学习到低多样性的模态特定嵌入,或允许配对的跨模态观测过度分离,从而削弱单模态鲁棒性和多模态融合。我们提出了一个轻量级即插即用的多模态表示学习几何正则化框架。该框架不强制执行严格的跨模态对齐,而是遵循有界一致性原则:在仅软约束超过允许一致性带的配对跨模态漂移部分的同时,保留模态特定多样性。在操作上,该框架结合了一个分散项(减轻谱集中度)和一个一致性带锚定项(控制过度配对漂移),无需架构修改或推理时开销。在音频-视觉、图像-文本和基于RF的基准测试上的实验表明,该框架一致地提高了多模态性能,并常常增强单模态表示。这些结果表明,显式调节表示几何是优化平衡的有效补充,并提供了几何感知正则化可以改善跨不同架构和领域的多模态学习的证据。

英文摘要

Multimodal fusion is often treated as an optimization-balancing problem, where training signals are adjusted to prevent one modality from dominating the others. However, balanced optimization does not fully determine the geometry of intermediate representations. Supervised multimodal models may still learn low-diversity modality-specific embeddings or allow paired cross-modal observations to drift excessively apart, weakening both unimodal robustness and multimodal fusion. We introduce a lightweight plug-and-play geometric regularization framework for multimodal representation learning. Rather than enforcing rigid cross-modal alignment, the framework follows a bounded-agreement principle: preserve modality-specific diversity while softly constraining only the portion of paired cross-modal drift that exceeds an admissible agreement band. Operationally, it combines a dispersion term that mitigates spectral concentration with an agreement-band anchoring term that controls excessive paired drift, requiring no architectural modification or inference-time overhead. Experiments across audio-visual, image-text, and RF-based benchmarks show that the framework consistently improves multimodal performance and often strengthens unimodal representations. These results suggest that explicitly regulating representation geometry is an effective complement to optimization balancing, and provide evidence that geometry-aware regularization can improve multimodal learning across diverse architectures and domains.
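The bounded-agreement principle can be sketched as a hinge on paired distances: drift inside the band costs nothing, and only the excess is penalized. This is a minimal sketch under stated assumptions; the squared hinge, the Euclidean distance, and the name `agreement_band_penalty` are illustrative choices, not the authors' definition, and the dispersion term is not sketched because the abstract does not specify it.

```python
import math


def agreement_band_penalty(z_a: list[list[float]],
                           z_b: list[list[float]],
                           band: float) -> float:
    """Mean squared hinge on paired cross-modal drift.

    z_a[i] and z_b[i] are embeddings of the same sample from two modalities.
    Pairs whose distance is within `band` contribute zero, so modality-specific
    differences are left untouched; only drift beyond the band is pulled back.
    """
    total = 0.0
    for a, b in zip(z_a, z_b):
        drift = math.dist(a, b)          # Euclidean distance between paired embeddings
        excess = max(0.0, drift - band)  # only the part outside the band is penalized
        total += excess ** 2
    return total / len(z_a)


audio = [[0.0, 0.0], [0.0, 0.0]]
video = [[0.3, 0.4], [3.0, 4.0]]  # drifts of 0.5 and 5.0
penalty = agreement_band_penalty(audio, video, band=1.0)
```

Compared with a plain alignment loss, the first pair here incurs no penalty at all, which is the sense in which agreement is bounded rather than enforced.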

2601.20273 2026-05-26 cs.DC cs.CV 版本更新

SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs

SwiftFusion: 面向GPU上扩散Transformer分布式推理的可扩展序列并行

Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, Gennady Pekhimenko

发表机构 * University of Toronto & Institute; Amazon; University of Toronto & Vector Institute & NVIDIA

AI总结 针对扩散Transformer推理中序列并行方法的通信和同步瓶颈,提出拓扑感知的StreamFusion引擎,通过Torus Attention和单边通信实现平均1.35倍加速。

详情
AI中文摘要

扩散Transformer(DiTs)在高品质图像和视频生成中日益普及。随着对更高分辨率图像和更长视频的需求增加,单GPU推理因延迟增加和激活尺寸过大而效率低下。当前框架采用序列并行(SP)技术如Ulysses Attention和Ring Attention来扩展推理。然而,这些实现存在三个主要限制:(1)现代GPU机器网络拓扑的次优通信模式,(2)机器间通信中全到全操作导致的延迟瓶颈,以及(3)使用双边通信库带来的GPU发送-接收同步和计算开销。为解决这些问题,我们提出了StreamFusion,一种拓扑感知的高效DiT服务引擎。StreamFusion包含三项关键创新:(1)考虑机器内外带宽差异的拓扑感知序列并行技术,(2)Torus Attention,一种新颖的SP技术,可将机器间全到全操作与计算重叠,以及(3)最小化GPU发送-接收同步和计算开销的单边通信实现。实验表明,StreamFusion平均比最先进方法快1.35倍(最高达1.77倍)。

英文摘要

Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of $1.35\times$ (up to $1.77\times$).
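For readers unfamiliar with the all-to-all step that the abstract identifies as a bottleneck, the sketch below simulates the data layout change used in Ulysses-style sequence parallelism: each rank starts with a slice of the sequence for all attention heads and ends with the full sequence for a slice of the heads. It is a pure-Python model of the layout only, under the assumption that the head count divides evenly across ranks; it does not represent StreamFusion's Torus Attention or any real communication library.

```python
def ulysses_all_to_all(shards: list[list[list[object]]]) -> list[list[list[object]]]:
    """Simulate the sequence-to-head resharding on a list of per-rank buffers.

    shards[r][s][h] is head h of the s-th token in rank r's sequence chunk.
    The result out[r][s][j] holds the full sequence, but only the heads
    assigned to rank r.
    """
    world = len(shards)
    num_heads = len(shards[0][0])
    heads_per_rank = num_heads // world  # assumes num_heads % world == 0
    out = []
    for dst in range(world):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        full_seq = []
        for src in range(world):          # concatenate chunks in rank order
            for token in shards[src]:
                full_seq.append(token[lo:hi])
        out.append(full_seq)
    return out


# Two ranks, two tokens per rank, four heads; entries are (global_token, head) labels.
shards = [[[(r * 2 + s, h) for h in range(4)] for s in range(2)] for r in range(2)]
resharded = ulysses_all_to_all(shards)
```

Every rank exchanges data with every other rank in this step, which is why its cost grows with the number of machines involved.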

2601.03191 2026-05-26 cs.CV cs.AI cs.LG 版本更新

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

AnatomiX:一种解剖学感知的胸部X光解读多模态大语言模型

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert

发表机构 * Hasso Plattner Institute(哈索·普拉特纳研究所) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出AnatomiX,一种两阶段解剖学感知多模态大语言模型,通过先识别解剖结构再执行下游任务,在解剖定位、短语定位、定位诊断和定位描述任务上相比现有方法提升超过25%。

详情
AI中文摘要

多模态医学大语言模型在胸部X光解读方面取得了显著进展,但在空间推理和解剖学理解方面仍面临挑战。尽管现有的定位技术提高了整体性能,但它们往往未能建立真正的解剖对应关系,导致医学领域中的解剖理解错误。为弥补这一差距,我们引入了AnatomiX,一种用于解剖学定位的胸部X光解读的多任务多模态大语言模型。受放射学工作流程启发,AnatomiX采用两阶段方法:首先识别解剖结构并提取其特征,然后利用大语言模型执行多种下游任务,如短语定位、报告生成、视觉问答和图像理解。在多个基准上的大量实验表明,与现有方法相比,AnatomiX实现了卓越的解剖推理,并在解剖定位、短语定位、定位诊断和定位描述任务上性能提升超过25%。代码和预训练模型可在 https://aneesurhashmi.github.io/anatomix 获取。

英文摘要

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://aneesurhashmi.github.io/anatomix

2512.21815 2026-05-26 cs.CV cs.LG 版本更新

High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

高熵标记作为视觉-语言模型中的多模态失败点

Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang

发表机构 * The Australian National University(澳大利亚国立大学) The University of Queensland(昆士兰大学) GE Research(GE研究院)

AI总结 本研究揭示视觉-语言模型中约20%的高熵标记集中了不成比例的对抗性影响,并提出基于熵引导的稀疏攻击方法(EGA),实现高攻击成功率与有害率。

Comments 19 pages, 11 figures, 8 tables

详情
AI中文摘要

视觉-语言模型(VLM)取得了显著性能,但仍易受对抗攻击。熵作为模型不确定性的度量,与VLM可靠性高度相关。虽然先前的基于熵的攻击在解码步骤中最大化不确定性,隐含假设每个标记对模型不稳定性的贡献相等,但我们揭示了在评估的具有不同架构的代表性开源VLM中,一小部分(约20%)高熵标记在自回归生成过程中集中了不成比例的对抗性影响。我们证明,将这些对抗扰动集中到这些高熵位置,可以在优化更少解码位置的情况下实现与全局方法相当的语义退化。此外,在多个代表性VLM中,此类攻击不仅导致语义漂移,还在当前流程下产生大量不安全子集(20-31%)。值得注意的是,由于这种脆弱的高熵标记在不同架构的VLM中重复出现,针对它们的攻击表现出非平凡的迁移性。受这些发现启发,我们设计了一种简单的熵引导攻击(EGA),该攻击实现了稀疏高熵定位,并通过可重用的标记库扩展,在三个代表性开源VLM上取得了具有竞争力的攻击成功率(93-95%)和相当高的有害率(30.2-38.6%)。

英文摘要

Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, as a measure of model uncertainty, is highly correlated with VLM reliability. While prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token equally contributes to model instability, we reveal that a small fraction (around 20%) of high-entropy tokens, in the evaluated representative open-source VLMs with diverse architectures, concentrates a disproportionate share of adversarial influence during autoregressive generation. We demonstrate that concentrating adversarial perturbations on these high-entropy positions achieves comparable semantic degradation to global methods while optimizing fewer decoding positions. Additionally, across multiple representative VLMs, such attacks induce not only semantic drift but also a substantial unsafe subset (20-31%) under the current pipeline. Remarkably, since such vulnerable high-entropy tokens recur across architecturally diverse VLMs, attacks focused on them exhibit non-trivial transferability. Motivated by these findings, we design a simple Entropy-Guided Attack (EGA) that operationalizes sparse high-entropy targeting and extends it with a reusable token bank, yielding competitive attack success rates (93-95%) with a considerable harmful rate (30.2-38.6%) on the three representative open-source VLMs.
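The selection step described above, picking roughly the top 20% of decoding positions by predictive entropy, can be sketched as follows. This is a minimal sketch, assuming per-step logits are available as plain lists; the function names are hypothetical, and the attack optimization itself is not shown.

```python
import math


def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over one step's logits."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_logits: list[list[float]], fraction: float = 0.2) -> list[int]:
    """Indices of the decoding steps with the highest entropy, in ascending order."""
    entropies = [token_entropy(logits) for logits in step_logits]
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Five decoding steps: only step 2 is uncertain (uniform over three tokens).
peaked, uniform = [10.0, 0.0, 0.0], [0.0, 0.0, 0.0]
selected = high_entropy_positions([peaked, peaked, uniform, peaked, peaked])
```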

2512.14180 2026-05-26 cs.CV 版本更新

Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere

球面Voronoi:作为球面可微分划分的定向外观

Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi

发表机构 * University of Torino(都灵大学) Simon Fraser University(西蒙弗雷泽大学) University of British Columbia(不列颠哥伦比亚大学) University of Toronto(多伦多大学) Google DeepMind(谷歌DeepMind)

AI总结 提出球面Voronoi(SV)作为3D高斯泼溅中外观表示的统一框架,通过可学习区域划分实现视图依赖效果,在反射建模上达到最先进水平。

详情
AI中文摘要

辐射场方法(例如3D高斯泼溅)已成为新视角合成的强大范式,但其外观建模通常依赖于球谐函数(SH),这带来了根本性限制。SH难以处理高频信号,存在吉布斯振铃伪影,并且无法捕捉镜面反射——这是真实感渲染的关键组成部分。尽管球面高斯等替代方案有所改进,但它们增加了显著的优化复杂度。我们提出球面Voronoi(SV)作为3D高斯泼溅中外观表示的统一框架。SV将方向域划分为具有平滑边界的可学习区域,为视图依赖效应提供了直观且稳定的参数化。对于漫反射外观,SV在保持优化比现有替代方案更简单的同时取得了有竞争力的结果。对于反射——SH失败的地方——我们利用SV作为可学习的反射探针,遵循经典图形学原理将反射方向作为输入。该公式在合成和真实世界数据集上取得了最先进的结果,表明SV为显式3D表示中的外观建模提供了一种有原则、高效且通用的解决方案。项目页面:https://sphericalvoronoi.github.io/

英文摘要

Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections - a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections - where SH fail - we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations. Project page: https://sphericalvoronoi.github.io/
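One way to picture a partition of the sphere with smooth boundaries is a softmax over the cosine similarity between the query direction and a set of unit site directions: on the unit sphere the nearest site is the one with the largest dot product, and a finite sharpness blends neighboring cells. The sketch below assumes that form; the paper's actual parameterization is not given in the abstract, and all names are illustrative. The reflection helper uses the standard mirror formula r = 2(n·v)n - v.

```python
import math


def reflect(view_dir: list[float], normal: list[float]) -> list[float]:
    """Mirror a unit view direction (pointing away from the surface) about a unit normal."""
    d = sum(v * n for v, n in zip(view_dir, normal))
    return [2.0 * d * n - v for v, n in zip(view_dir, normal)]


def soft_voronoi_color(direction: list[float],
                       sites: list[list[float]],
                       colors: list[list[float]],
                       sharpness: float) -> list[float]:
    """Blend per-cell colors with softmax weights over similarity to each site.

    As sharpness grows, the weights approach a hard Voronoi assignment.
    """
    scores = [sharpness * sum(d * s for d, s in zip(direction, site)) for site in sites]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    channels = len(colors[0])
    return [sum(w * c[ch] for w, c in zip(weights, colors)) / total for ch in range(channels)]


sites = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]   # two learnable site directions
colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # one color per cell
at_site = soft_voronoi_color([0.0, 0.0, 1.0], sites, colors, sharpness=50.0)
on_boundary = soft_voronoi_color([1.0, 0.0, 0.0], sites, colors, sharpness=50.0)
```

For the reflection-probe use described in the abstract, the query direction would be `reflect(view_dir, normal)` rather than the view direction itself.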

2512.11941 2026-05-26 cs.CV cs.AI 版本更新

DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition

DynaPURLS: 基于骨架的零样本动作识别中部分感知表示的动态细化

Jingmin Zhu, Anqi Zhu, James Bailey, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Qiuhong Ke

发表机构 * Monash University(莫纳什大学) Lancaster University(兰卡斯特大学) University of Western Australia(西澳大学)

AI总结 提出DynaPURLS框架,通过多尺度视觉-语义对应和动态细化模块,解决骨架零样本动作识别中的领域偏移问题,在三个基准数据集上取得最优结果。

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

详情
AI中文摘要

基于骨架的零样本动作识别(ZS-SAR)从根本上受到主流方法的限制,这些方法依赖于将骨架特征与静态的类级语义对齐。这种粗粒度的对齐无法弥合可见类和未见类之间的领域偏移,从而阻碍了细粒度视觉知识的有效迁移。为了解决这些限制,我们引入了DynaPURLS,一个统一的框架,它建立稳健的多尺度视觉-语义对应,并在推理时动态细化它们以增强泛化能力。我们的框架利用大型语言模型生成层次化的文本描述,涵盖全局运动和局部身体部位动态。同时,一个自适应划分模块通过语义分组骨架关节点生成细粒度的视觉表示。为了强化这种细粒度对齐以应对训练-测试领域偏移,DynaPURLS包含一个动态细化模块。在推理时,该模块通过轻量级可学习投影将文本特征适应于输入的视觉流。该细化过程由一个置信度感知的类平衡记忆库稳定,该记忆库减轻了来自噪声伪标签的错误传播。在三个大规模基准数据集(包括NTU RGB+D 60/120和PKU-MMD)上的大量实验表明,DynaPURLS显著优于先前的方法,创造了新的最先进记录。源代码已在 https://github.com/Alchemist0754/DynaPURLS 公开。

英文摘要

Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is made publicly available at https://github.com/Alchemist0754/DynaPURLS

2512.04883 2026-05-26 cs.CV 版本更新

SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

SDG-Track: 一种用于嵌入式平台高分辨率无人机跟踪的异构观察者-跟随者框架

Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu

AI总结 提出SDG-Track框架,采用观察者-跟随者架构,通过稀疏检测引导跟踪和双空间恢复机制,在嵌入式平台上实现高分辨率无人机实时跟踪,达到35.1 FPS和97.2%检测精度。

Comments Withdrawn by the authors due to unresolved authorship and public-disclosure authorization issues

详情
AI中文摘要

在边缘设备上对小型无人机(UAV)进行实时跟踪面临根本性的分辨率-速度冲突。将高分辨率图像下采样到标准检测器输入尺寸会导致小目标特征低于可检测阈值。然而,在资源受限平台上处理原生1080p帧无法为平滑云台控制提供足够的吞吐量。我们提出SDG-Track,一种稀疏检测引导跟踪器,采用观察者-跟随者架构来解决这一冲突。观察者流在GPU上以低频率运行高容量检测器,从1920x1080帧中提供准确的位置锚点。跟随者流在CPU上通过ROI约束的稀疏光流执行高频轨迹插值。为了处理由光谱相似干扰物引起的遮挡或模型漂移导致的跟踪失败,我们引入了双空间恢复,一种无需训练的重捕获机制,结合颜色直方图匹配与几何一致性约束。在地对空跟踪站上的实验表明,SDG-Track实现了35.1 FPS的系统吞吐量,同时保留了97.2%的逐帧检测精度。该系统在NVIDIA Jetson Orin Nano上成功跟踪了实际操作条件下的敏捷FPV无人机。我们的论文代码公开在https://github.com/Jeffry-wen/SDG-Track。

英文摘要

Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at https://github.com/Jeffry-wen/SDG-Track

2510.22973 2026-05-26 cs.CV 版本更新

Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method

扩展以占据为中心的驾驶场景生成:数据集与方法

Bohan Li, Xin Jin, Hu Zhu, Hongsi Liu, Ruikai Li, Jiazhe Guo, Kaiwen Cai, Chao Ma, Yueming Jin, Hao Zhao, Xiaokang Yang, Wenjun Zeng

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology(东部技术研究院) School of Electronic Information and Electrical Engineering(电子信息与电气工程学院) Li Auto(理想汽车) National University of Singapore(新加坡国立大学) Tsinghua University(清华大学) Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative(宁波空间智能与数字衍生重点实验室) Ningbo Institute of Digital Twin(宁波数字孪生研究院)

AI总结 针对占据数据稀缺问题,构建最大语义占据数据集Nuplan-Occ,并提出统一框架联合生成高质量语义占据、多视角视频和LiDAR点云,采用时空解耦架构及高斯泼溅稀疏点图渲染和传感器感知嵌入策略,实现高保真生成。

Comments IEEE TPAMI

详情
AI中文摘要

驾驶场景生成是自动驾驶的关键领域,支持下游应用,包括感知和规划评估。以占据为中心的方法通过提供跨帧和模态的一致条件,最近取得了最先进的结果;然而,其性能严重依赖于标注的占据数据,而这类数据仍然稀缺。为克服这一限制,我们整理了Nuplan-Occ,这是迄今为止最大的语义占据数据集,基于广泛使用的Nuplan基准构建。其规模和多样性不仅促进了大规模生成建模,也促进了自动驾驶下游应用。基于该数据集,我们开发了一个统一框架,联合合成高质量语义占据、多视角视频和LiDAR点云。我们的方法采用时空解耦架构,支持4D动态占据的高保真空间扩展和时间预测。为弥合模态差距,我们进一步提出了两种新技术:基于高斯泼溅的稀疏点图渲染策略,增强多视角视频生成;以及传感器感知嵌入策略,显式建模LiDAR传感器属性以实现逼真的多LiDAR模拟。大量实验表明,与现有方法相比,我们的方法实现了更优的生成保真度和可扩展性,并验证了其在下游任务中的实用价值。仓库:https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2

英文摘要

Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which still remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also autonomous driving downstream applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validates its practical value in downstream tasks. Repo: https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2

2509.15814 2026-05-26 eess.IV cs.CV 版本更新

QWD-GAN: Quality-aware Wavelet-driven GAN for Unsupervised Medical Microscopy Images Denoising

QWD-GAN:质量感知的小波驱动GAN用于无监督医学显微镜图像去噪

Qijun Yang, Yating Huang, Lintao Xiang, Hujun Yin

发表机构 * Department of Electrical and Electronic Engineering(电气与电子工程系)

AI总结 提出一种基于GAN的无监督去噪方法,通过小波变换的多尺度自适应生成器和双分支判别器,在保持高频信息的同时实现显微镜图像去噪。

详情
AI中文摘要

图像去噪在生物医学和显微镜成像中起着关键作用,尤其是在获取宽场荧光染色图像时。该任务面临多个方面的挑战,包括图像采集条件的限制、复杂的噪声类型、算法适应性以及临床应用需求。尽管许多基于深度学习的去噪技术已显示出有希望的结果,但在保留图像细节、提高算法效率和增强临床可解释性方面仍需进一步改进。我们提出了一种基于生成对抗网络(GAN)架构的无监督图像去噪方法。该方法引入了一个基于小波变换的多尺度自适应生成器和一个将差异感知特征图与原始特征相结合的双分支判别器。在多个生物医学显微镜图像数据集上的实验结果表明,所提出的模型实现了最先进的去噪性能,特别是在高频信息的保留方面表现出色。此外,双分支判别器与各种GAN框架无缝兼容。所提出的质量感知、小波驱动的GAN去噪模型称为QWD-GAN。

英文摘要

Image denoising plays a critical role in biomedical and microscopy imaging, especially when acquiring wide-field fluorescence-stained images. This task faces challenges on multiple fronts, including limitations in image acquisition conditions, complex noise types, algorithm adaptability, and clinical application demands. Although many deep learning-based denoising techniques have demonstrated promising results, further improvements are needed in preserving image details, enhancing algorithmic efficiency, and increasing clinical interpretability. We propose an unsupervised image denoising method based on a Generative Adversarial Network (GAN) architecture. The approach introduces a multi-scale adaptive generator based on the Wavelet Transform and a dual-branch discriminator that integrates difference perception feature maps with original features. Experimental results on multiple biomedical microscopy image datasets show that the proposed model achieves state-of-the-art denoising performance, particularly excelling in the preservation of high-frequency information. Furthermore, the dual-branch discriminator is seamlessly compatible with various GAN frameworks. The proposed quality-aware, wavelet-driven GAN denoising model is termed QWD-GAN.

2508.09599 2026-05-26 cs.CV 版本更新

BridgeTA: Bridging the Representation Gap in Knowledge Distillation via Teacher Assistant for Bird's Eye View Map Segmentation

BridgeTA: 通过教师助手弥合知识蒸馏中表示差距的鸟瞰图分割

Beomjun Kim, Suhan Woo, Sejong Heo, Euntai Kim

发表机构 * Yonsei University(延世大学) Hyundai Motor Company(现代汽车公司) Korea Institute of Science and Technology(韩国科学技术研究院)

AI总结 提出BridgeTA框架,利用教师助手网络在保持学生模型架构和推理成本不变的情况下,弥合激光雷达-相机融合与纯相机模型之间的表示差距,并通过Young不等式推导蒸馏损失实现稳定优化,在nuScenes数据集上mIoU提升4.2%。

Comments Accepted at ICRA 2026 (8 pages, 6 figures)

详情
AI中文摘要

鸟瞰图(BEV)分割是自动驾驶中最重要且最具挑战性的任务之一。纯相机方法作为激光雷达的经济高效替代方案备受关注,但仍落后于基于激光雷达-相机(LC)融合的方法。知识蒸馏(KD)已被探索用于缩小这一差距,但现有方法主要通过模仿教师架构来扩大学生模型,导致推理成本增加。为解决此问题,我们引入BridgeTA,一种经济高效的蒸馏框架,通过教师助手(TA)网络弥合LC融合与纯相机模型之间的表示差距,同时保持学生架构和推理成本不变。轻量级TA网络结合教师和学生的BEV表示,创建共享潜在空间作为中间表示。为从理论上奠定框架基础,我们使用Young不等式推导蒸馏损失,将直接的师生蒸馏路径分解为教师-TA和TA-学生双路径,稳定优化并加强知识迁移。在具有挑战性的nuScenes数据集上的大量实验证明了我们方法的有效性,相比纯相机基线mIoU提升4.2%,比其他最先进KD方法的提升幅度最多高出45%。

英文摘要

Bird's-Eye-View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher's architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models through a Teacher Assistant (TA) network while keeping the student's architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young's Inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods.
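The decomposition into teacher-TA and TA-student paths can be illustrated with the standard Young's-inequality bound on a squared distance. The derivation below is a sketch under stated assumptions: the symbols T, A, and S for the teacher, TA, and student BEV features and the squared L2 distance are chosen here for illustration, and the paper's actual loss may differ in form and weighting.

```latex
% T, A, S: teacher, TA, and student BEV features (symbols assumed for illustration).
% Young's inequality: 2<u, v> <= eps * ||u||^2 + (1/eps) * ||v||^2 for any eps > 0.
\begin{align}
\|T - S\|^2 &= \|(T - A) + (A - S)\|^2 \\
            &= \|T - A\|^2 + 2\langle T - A,\, A - S\rangle + \|A - S\|^2 \\
            &\le (1 + \varepsilon)\,\|T - A\|^2
               + \left(1 + \tfrac{1}{\varepsilon}\right)\|A - S\|^2,
               \qquad \varepsilon > 0 .
\end{align}
```

Minimizing the right-hand side therefore upper-bounds the direct teacher-student gap, with the parameter trading off the two paths.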

2506.10054 2026-05-26 cs.LG cs.AI cs.CL cs.CV 版本更新

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

Uni-DPO:大语言模型动态偏好优化的统一范式

Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Xi’an Jiaotong University(西安交通大学) The Chinese University of Hong Kong(香港中文大学) University of Chinese Academy of Sciences(中国科学院大学) Tsinghua University(清华大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对现有DPO方法忽略数据质量和学习难度差异的问题,提出Uni-DPO统一框架,通过自适应重加权偏好对实现更有效的数据利用和更优性能。

Comments Accepted by ICLR 2026. Code & models: https://github.com/pspdada/Uni-DPO

详情
AI中文摘要

直接偏好优化(DPO)因其简单高效已成为从人类反馈中进行强化学习(RLHF)的基石。然而,现有的基于DPO的方法通常平等对待所有偏好对,忽略了数据质量和学习难度的显著差异,导致数据利用效率低下和性能次优。为解决这一局限,我们提出Uni-DPO,一个统一的动态偏好优化框架,该框架联合考虑(a)偏好对的内在质量和(b)模型在训练过程中的动态表现。通过基于这两个因素自适应地重新加权样本,Uni-DPO能够更有效地利用偏好数据并实现卓越性能。跨模型和基准的大量实验证明了Uni-DPO的有效性和泛化能力。在文本任务上,使用Uni-DPO微调的Gemma-2-9B-IT在Arena-Hard上超越领先的大语言模型Claude 3 Opus 6.7个百分点。在数学和多模态任务上,Uni-DPO在所有基准上持续优于基线方法,为其有效性和鲁棒性提供了强有力的实证证据。

英文摘要

Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model's evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO enables more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO. On textual tasks, Gemma-2-9B-IT fine-tuned with Uni-DPO surpasses the leading LLM, Claude 3 Opus, by 6.7 points on Arena-Hard. On mathematical and multimodal tasks, Uni-DPO consistently outperforms baseline methods across all benchmarks, providing strong empirical evidence of its effectiveness and robustness.
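As a reference point for the reweighting idea, the sketch below computes the standard DPO loss for one preference pair and multiplies it by a per-pair weight. The `weight` argument is a hypothetical placeholder: the abstract says Uni-DPO derives its weights from data quality and the model's evolving performance, but it does not give the formulas, so none are assumed here.

```python
import math


def weighted_dpo_loss(logp_chosen: float, logp_rejected: float,
                      ref_logp_chosen: float, ref_logp_rejected: float,
                      beta: float = 0.1, weight: float = 1.0) -> float:
    """Standard DPO loss for one pair, scaled by a per-pair weight.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. The loss is -log(sigmoid(margin)), where the margin is
    beta times the difference of the two log-ratio terms.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    softplus = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return weight * softplus


neutral = weighted_dpo_loss(-1.0, -1.0, -1.0, -1.0)            # margin is zero
confident = weighted_dpo_loss(-1.0, -9.0, -5.0, -5.0, beta=1)  # policy prefers the chosen answer
downweighted = weighted_dpo_loss(-1.0, -1.0, -1.0, -1.0, weight=0.5)
```

With `weight=1.0` for every pair this reduces to plain DPO, which is the "treat all pairs equally" baseline the abstract argues against.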

2505.23764 2026-05-26 cs.CV cs.CL 版本更新

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

MMSI-Bench:多图像空间智能基准

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) The Chinese University of Hong Kong(香港中文大学) Zhejiang University(浙江大学) Tsinghua University(清华大学) Shanghai Jiaotong University(上海交通大学) University of Hong Kong(香港大学) Beijing Normal University(北京师范大学)

AI总结 提出MMSI-Bench基准,通过1000道精心设计的VQA问题评估多图像空间推理能力,发现现有模型准确率远低于人类。

Comments ICLR 2026 Camera ready. 38 pages. Project page: https://runsenxu.com/projects/MMSI_Bench

详情
AI中文摘要

空间智能对于在复杂物理世界中运行的多模态大语言模型(MLLMs)至关重要。然而,现有基准仅探测单图像关系,无法评估实际部署所需的多图像空间推理。我们引入MMSI-Bench,一个专用于多图像空间智能的VQA基准。六位3D视觉研究人员花费超过300小时,从超过12万张图像中精心制作了1000个具有挑战性、无歧义的多选题,每个问题都配有精心设计的干扰项和逐步推理过程。我们进行了大量实验,评估了37个开源和专有MLLMs,观察到巨大差距:最强的开源模型准确率约30%,OpenAI的GPT-5推理模型达到40%,而人类得分为97%。这些结果凸显了MMSI-Bench的挑战性以及未来研究的巨大空间。利用注释的推理过程,我们还提供了一个自动错误分析流程,诊断出四种主要失败模式,包括(1)视觉定位(grounding)错误,(2)重叠匹配和场景重建错误,(3)情境转换推理错误,以及(4)空间逻辑错误,为推进空间智能提供了见解。项目页面:https://runsenxu.com/projects/MMSI_Bench 。

英文摘要

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a stepwise reasoning process. We conduct extensive experiments and evaluate 37 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's GPT-5 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering insights for advancing spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .

2502.07278 2026-05-26 cs.CV 版本更新

Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization

Articulate That Object Part (ATOP): 通过文本和运动个性化实现3D部件关节运动

Aditya Vora, Sauradip Nag, Kai Wang, Hao Zhang

发表机构 * Computing Science, Simon Fraser University(西蒙弗雷泽大学计算科学系)

AI总结 提出ATOP方法,利用文本提示和运动个性化,通过少样本微调扩散模型生成运动样本,并借助可微渲染优化部件关节参数,实现静态3D对象的部件关节运动。

Comments Accepted to ACM Transactions on Graphics, 2026

详情
AI中文摘要

我们提出ATOP(Articulate That Object Part),一种基于运动个性化的少样本新方法,用于根据文本提示中指定的部件及其运动来关节化静态3D对象。由于缺乏带有运动属性标注的数据集,现有方法在此任务中难以很好地泛化。在我们的工作中,文本输入使我们能够利用现代扩散模型的能力,为正确的对象类别和部件生成合理的运动样本。反过来,输入的3D对象提供“图像提示”,以将生成的运动个性化到该输入对象。我们的方法从少样本微调开始,将关节感知注入当前的扩散模型,以学习与目标对象部件相关的唯一运动标识符。我们的微调应用于预训练的扩散模型,用于可控的多视图运动生成,并使用一小部分参考运动帧(展示适当的部件运动)进行训练。得到的运动模型随后可用于从多个视角实现输入3D对象的合理运动。最后,我们通过可微渲染将个性化运动转移到对象的3D空间,通过分数蒸馏采样损失优化部件关节参数。在PartNet-Mobility和ACD数据集上的实验表明,与先前的少样本方法相比,我们的方法可以生成具有更高准确性的真实运动样本,从而产生更具泛化性的3D运动预测。

英文摘要

We present ATOP (Articulate That Object Part), a novel few-shot method based on motion personalization to articulate a static 3D object with respect to a part and its motion as prescribed in a text prompt. Given the scarcity of available datasets with motion attribute annotations, existing methods struggle to generalize well in this task. In our work, the text input allows us to tap into the power of modern-day diffusion models to generate plausible motion samples for the right object category and part. In turn, the input 3D object provides "image prompting" to personalize the generated motion to the very input object. Our method starts with a few-shot finetuning to inject articulation awareness into current diffusion models to learn a unique motion identifier associated with the target object part. Our finetuning is applied to a pre-trained diffusion model for controllable multi-view motion generation, trained with a small collection of reference motion frames demonstrating appropriate part motion. The resulting motion model can then be employed to realize plausible motion of the input 3D object from multiple views. Finally, we transfer the personalized motion to the 3D space of the object via differentiable rendering to optimize part articulation parameters by a score distillation sampling loss. Experiments on PartNet-Mobility and ACD datasets demonstrate that our method can generate realistic motion samples with higher accuracy, leading to more generalizable 3D motion predictions compared to prior approaches in the few-shot setting.

2409.03777 2026-05-26 cs.CV cs.LG 版本更新

A Greedy Hierarchical Approach to Whole-Network Filter-Pruning in CNNs

一种面向CNN全网络滤波器剪枝的贪婪层次方法

Kiran Purohit, Anurag Reddy Parvathgari, Sourangshu Bhattacharya

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Indian Institute of Technology, Kharagpur, India(印度理工学院克勒格布尔分校,印度)

AI总结 提出一种基于线性近似的两层层次化贪婪剪枝算法,通过低层滤波器选择和全局剪枝准则高效剪枝,在多个网络上优于现有方法。

Comments Accepted in TMLR 2024

详情
AI中文摘要

深度卷积神经网络(CNN)在许多计算机视觉任务中取得了令人印象深刻的表现。然而,它们的大模型尺寸需要大量计算资源,因此从预训练的CNN中剪枝冗余滤波器是开发资源受限设备高效模型的关键任务。全网络滤波器剪枝算法从每层剪枝不同比例的滤波器,从而提供更大的灵活性。当前的全网络剪枝方法要么因需要使用训练数据集计算每个剪枝滤波器的损失而计算成本高昂,要么使用各种启发式/学习标准来确定每层的剪枝比例。本文提出了一种高效的两级层次化全网络滤波器剪枝方法,该方法使用分类损失作为最终标准。低级算法(称为滤波器剪枝)使用基于滤波器权重线性近似的稀疏近似公式。我们探索了两种算法:基于正交匹配追踪的贪婪选择和贪婪反向剪枝方法。反向剪枝算法使用一种新颖的闭式误差标准,在每个阶段高效选择最优滤波器,从而使整个算法更快。高级算法(称为层选择)使用全局剪枝准则贪婪地选择最佳剪枝层(使用滤波器选择算法进行剪枝)。我们针对两种不同的全局剪枝准则提出了算法:(1)逐层相对误差(HBGS),和(2)最终分类误差(HBGTS)。我们的算法套件在ResNet18、ResNet32、ResNet56、VGG16和ResNext101上优于最先进的剪枝方法。我们的方法将ResNext101的RAM需求从7.6 GB降低到1.5 GB,并在CIFAR-10上实现了94%的FLOPS减少而不损失精度。

英文摘要

Deep convolutional neural networks (CNNs) have achieved impressive performance in many computer vision tasks. However, their large model sizes require heavy computational resources, making pruning redundant filters from existing pre-trained CNNs an essential task in developing efficient models for resource-constrained devices. Whole-network filter pruning algorithms prune varying fractions of filters from each layer, hence providing greater flexibility. Current whole-network pruning methods are either computationally expensive due to the need to calculate the loss for each pruned filter using a training dataset, or use various heuristic / learned criteria for determining the pruning fractions for each layer. This paper proposes a two-level hierarchical approach for whole-network filter pruning which is efficient and uses the classification loss as the final criterion. The lower-level algorithm (called filter-pruning) uses a sparse-approximation formulation based on linear approximation of filter weights. We explore two algorithms: orthogonal matching pursuit-based greedy selection and a greedy backward pruning approach. The backward pruning algorithm uses a novel closed-form error criterion for efficiently selecting the optimal filter at each stage, thus making the whole algorithm much faster. The higher-level algorithm (called layer-selection) greedily selects the best-pruned layer (pruning using the filter-selection algorithm) using a global pruning criterion. We propose algorithms for two different global-pruning criteria: (1) layer-wise relative error (HBGS), and (2) final classification error (HBGTS). Our suite of algorithms outperforms state-of-the-art pruning methods on ResNet18, ResNet32, ResNet56, VGG16, and ResNext101. Our method reduces the RAM requirement for ResNext101 from 7.6 GB to 1.5 GB and achieves a 94% reduction in FLOPS without losing accuracy on CIFAR-10.
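The lower-level filter-selection step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each filter is flattened into one column of a matrix, and the function name `greedy_omp_select` and the scoring rule are hypothetical choices made for illustration. It shows an orthogonal-matching-pursuit-style loop that keeps `k` filters and approximates all filters as linear combinations of the kept ones.

```python
import numpy as np

def greedy_omp_select(filters: np.ndarray, k: int):
    """Greedily keep k columns of `filters` (shape d x n, one flattened
    filter per column) so that every column is approximated by a linear
    combination of the kept ones. Illustrative sketch only."""
    n = filters.shape[1]
    selected = []
    residual = filters.copy()
    coeffs = np.zeros((0, n))
    norms = np.linalg.norm(filters, axis=0) + 1e-12
    for _ in range(k):
        # Score each candidate by its normalized correlation with the residual.
        scores = np.linalg.norm(filters.T @ residual, axis=1) / norms
        if selected:
            scores[selected] = -np.inf  # never pick the same filter twice
        selected.append(int(np.argmax(scores)))
        # Refit all filters on the kept ones by least squares (the "orthogonal" step).
        basis = filters[:, selected]
        coeffs, *_ = np.linalg.lstsq(basis, filters, rcond=None)
        residual = filters - basis @ coeffs
    return selected, coeffs, residual
```

The returned `coeffs` play the role of the linear approximation of filter weights mentioned in the abstract; how the paper folds them back into the next layer is not stated here and is left out of the sketch.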

2407.01328 2026-05-26 cs.CV 版本更新

CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes

CSFNet: 用于驾驶场景实时RGB-X语义分割的余弦相似度融合网络

Danial Qashqai, Emad Mousavian, Shahriar Baradaran Shokouhi, Sattar Mirzakuchaki

发表机构 * Department of Electrical Engineering, Iran University of Science and Technology(伊朗科学技术大学电气工程系)

AI总结 提出CSFNet,通过余弦相似度注意力融合模块(CS-AFM)高效融合双模态特征,实现实时且高精度的RGB-X语义分割。

详情
Journal ref
Engineering Applications of Artificial Intelligence, 174, 114362 (2026)
AI中文摘要

语义分割作为复杂视觉解释的关键组成部分,在自动驾驶视觉系统中起着基础作用。最近的研究通过利用互补信息和开发多模态方法显著提高了语义分割的准确性。尽管准确性有所提高,但多模态语义分割方法存在计算复杂度高和推理速度慢的问题。因此,在驾驶应用中实现多模态方法是一项具有挑战性的任务。为了解决这个问题,我们提出了余弦相似度融合网络(CSFNet)作为实时RGB-X语义分割模型。具体来说,我们设计了一个余弦相似度注意力融合模块(CS-AFM),该模块有效地校正和融合两种模态的特征。CS-AFM模块利用跨模态相似性实现高泛化能力。通过增强低层跨模态特征的融合,CS-AFM为在高层使用单分支网络铺平了道路。因此,我们在编码器中使用双分支和单分支架构,并结合高效的上下文模块和轻量级解码器以实现快速准确的预测。为了验证CSFNet的有效性,我们使用Cityscapes、MFNet和ZJU数据集进行RGB-D/T/P语义分割。结果表明,CSFNet的准确性与最先进方法相比具有竞争力,同时在多模态语义分割模型中速度达到最先进水平。由于其低参数数量和计算复杂度,它还实现了高效率。CSFNet的源代码将在https://github.com/Danial-Qashqai/CSFNet提供。

英文摘要

Semantic segmentation, as a crucial component of complex visual interpretation, plays a fundamental role in autonomous vehicle vision systems. Recent studies have significantly improved the accuracy of semantic segmentation by exploiting complementary information and developing multimodal methods. Despite the gains in accuracy, multimodal semantic segmentation methods suffer from high computational complexity and low inference speed. Therefore, it is a challenging task to implement multimodal methods in driving applications. To address this problem, we propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model. Specifically, we design a Cosine Similarity Attention Fusion Module (CS-AFM) that effectively rectifies and fuses features of two modalities. The CS-AFM module leverages cross-modal similarity to achieve high generalization ability. By enhancing the fusion of cross-modal features at lower levels, CS-AFM paves the way for the use of a single-branch network at higher levels. Therefore, we use dual and single-branch architectures in an encoder, along with an efficient context module and a lightweight decoder for fast and accurate predictions. To verify the effectiveness of CSFNet, we use the Cityscapes, MFNet, and ZJU datasets for the RGB-D/T/P semantic segmentation. According to the results, CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed among multimodal semantic segmentation models. It also achieves high efficiency due to its low parameter count and computational complexity. The source code for CSFNet will be available at https://github.com/Danial-Qashqai/CSFNet.

2402.10665 2026-05-26 cs.LG cs.CV 版本更新

Soft Dice Confidence: A Near-Optimal Confidence Estimator for Selective Prediction in Semantic Segmentation

Soft Dice Confidence: 语义分割中选择性预测的近似最优置信度估计器

Bruno Laboissiere Camargos Borges, Bruno Machado Pacheco, Danilo Silva

发表机构 * Department of Automation and Systems Engineering, Federal University of Santa Catarina(圣卡塔琳娜联邦大学自动化与系统工程系)

AI总结 针对语义分割中的选择性预测问题,提出一种基于Dice系数的近似最优置信度估计器SDC,在已知或估计边际后验概率下均优于现有方法。

Comments 48 pages, 11 figures

详情
AI中文摘要

在语义分割中,即使是最先进的深度学习模型在某些高风险应用(如医学图像分析)中也达不到所需的性能。在这些情况下,可以通过允许模型在置信度低时放弃预测来提高性能,这种方法称为选择性预测。虽然在分类文献中广为人知,但选择性预测在语义分割的背景下尚未得到充分探索。本文通过关注图像级弃权来解决这个问题,即对整个图像产生单个置信度估计,而先前的方法则关注像素级不确定性。假设Dice系数作为分割的评估指标,本文提供了两个主要贡献:(i)在已知边际后验概率的情况下,我们推导出最优置信度估计器,但观察到对于典型图像大小难以处理。然后,提出了一种线性时间可计算的近似方法,称为Soft Dice Confidence(SDC),并证明它与最优估计器紧密有界。(ii)当仅知道边际后验概率的估计时,我们提出了SDC的插件版本,并证明它优于所有先前的方法,包括那些需要额外调优数据的方法。这些发现得到了合成数据和来自六项医学成像任务(包括分布外场景)的真实世界数据的实验结果的支持,将SDC定位为语义分割中选择性预测的可靠且高效的工具。

英文摘要

In semantic segmentation, even state-of-the-art deep learning models fall short of the performance required in certain high-stakes applications such as medical image analysis. In these cases, performance can be improved by allowing a model to abstain from making predictions when confidence is low, an approach known as selective prediction. While well-known in the classification literature, selective prediction has been underexplored in the context of semantic segmentation. This paper tackles the problem by focusing on image-level abstention, which involves producing a single confidence estimate for the entire image, in contrast to previous approaches that focus on pixel-level uncertainty. Assuming the Dice coefficient as the evaluation metric for segmentation, two main contributions are provided in this paper: (i) In the case of known marginal posterior probabilities, we derive the optimal confidence estimator, which is observed to be intractable for typical image sizes. Then, an approximation computable in linear time, named Soft Dice Confidence (SDC), is proposed and proven to be tightly bounded to the optimal estimator. (ii) When only an estimate of the marginal posterior probabilities is known, we propose a plug-in version of the SDC and show it outperforms all previous methods, including those requiring additional tuning data. These findings are supported by experimental results on both synthetic data and real-world data from six medical imaging tasks, including out-of-distribution scenarios, positioning the SDC as a reliable and efficient tool for selective prediction in semantic segmentation.
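The abstract does not give the SDC formula, so the following is only a sketch under a stated assumption: that the confidence is a soft Dice score between the per-pixel foreground probabilities and the hard prediction obtained by thresholding them. The function name, the 0.5 threshold, and the empty-mask convention are illustrative choices, not taken from the paper. The point is that such a score is a single image-level number computable in one linear pass.

```python
import numpy as np

def soft_dice_confidence(probs, threshold: float = 0.5) -> float:
    """Image-level confidence for a binary segmentation (assumed form).

    probs: per-pixel foreground probabilities, any shape.
    Returns the soft Dice between probs and the thresholded prediction.
    """
    p = np.asarray(probs, dtype=float).ravel()
    hard = (p >= threshold).astype(float)   # predicted mask
    denom = p.sum() + hard.sum()
    if denom == 0.0:
        return 1.0  # assumed convention: empty mask, zero probabilities
    return float(2.0 * (p * hard).sum() / denom)
```

In a selective-prediction setup, images whose score falls below a chosen cutoff would be referred to a human instead of being segmented automatically.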

2310.04981 2026-05-26 cs.CV cs.LG 版本更新

Compositional Semantics for Open Vocabulary Spatio-semantic Representations

开放词汇空间语义表示的组合语义

Robin Karlsson, Francisco Lepe-Salazar, Kazuya Takeda

发表机构 * Graduate School of Informatics, Nagoya University(名古屋大学信息学研究科) Ludolab TIER IV

AI总结 提出潜在组合语义嵌入z*作为可查询空间语义记忆的知识表示,证明其存在性、最优性及可发现性,并引入充分相似性推理方法提升重叠语义推理性能。

Comments Preprint

详情
AI中文摘要

视觉语言模型(VLM)将环境感知转换为LLM可解释的视觉语言语义。然而,完成复杂任务通常需要对当前感知之外的信息进行推理。我们提出潜在组合语义嵌入z*作为可查询空间语义记忆的基于学习的原则性知识表示。我们在数学上证明z*总是可以找到,并且最优z*是任何集合Z的质心。我们推导了估计相关和不相关语义可分离性的概率界限。我们证明z*可以通过迭代梯度下降从视觉外观和单一描述中发现。我们在包括CLIP和SBERT的四个嵌入空间上实验验证了我们的发现。结果表明,z*可以表示由SBERT编码的多达10个语义,以及理想均匀分布的高维嵌入的多达100个语义。我们引入了三个具有重叠语义的新数据集,以表明在常规非重叠注释上训练的常见VLM能够发现z*。我们提出的充分相似性推理方法克服了传统推理的根本局限性,并将更高层次的重叠语义推理性能平均提高了19.63 mIoU。

英文摘要

Vision-language models (VLMs) transform environment percepts into vision-language semantics interpretable by LLMs. However, completing complex tasks often requires reasoning about information beyond what is currently perceived. We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories. We mathematically prove that z* can always be found, and that the optimal z* is the centroid for any set Z. We derive a probabilistic bound for estimating separability of related and unrelated semantics. We prove that z* is discoverable from visual appearance and singular descriptions by iterative gradient descent. We experimentally verify our findings on four embedding spaces including CLIP and SBERT. Our results show that z* can represent up to 10 semantics encoded by SBERT, and up to 100 semantics for ideal uniformly distributed high-dimensional embeddings. We introduce three new datasets with overlapping semantics to show that common VLMs trained on conventional nonoverlapping annotations discover z*. Our novel sufficient similarity inference method overcomes fundamental limitations of conventional inference, and improves higher-level overlapping semantic inference performance by 19.63 mIoU on average.
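The centroid claim can be checked numerically in a simplified setting. The abstract does not state the objective, so this sketch assumes it is the mean cosine similarity between z* and the embeddings in Z; the function names are hypothetical. Under that assumption the maximizer is the normalized mean of the unit-normalized embeddings.

```python
import numpy as np

def compositional_embedding(Z: np.ndarray) -> np.ndarray:
    """Normalized centroid of the unit-normalized rows of Z (shape m x d).
    Assumes the rows do not cancel to the zero vector."""
    unit = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    return centroid / np.linalg.norm(centroid)

def mean_cosine(z: np.ndarray, Z: np.ndarray) -> float:
    """Mean cosine similarity between z and every row of Z."""
    unit = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return float(np.mean(unit @ (z / np.linalg.norm(z))))
```

Comparing `mean_cosine(compositional_embedding(Z), Z)` against the same score for arbitrary candidate vectors shows that no candidate does better, which is the sense of "optimal" assumed in this sketch.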

2605.10543 2026-05-26 cs.CV 版本更新

TIE: Time Interval Encoding for Video Generation over Events

TIE:面向事件视频生成的时间区间编码

Zhilei Shu, Shangwen Zhu, Zihang Liang, Xiaofan Li, Qianyu Peng, Xinyu Cui, Bo Ye, Yiming Li, Fan Cheng, Jian Zhao, Yang Cao, Zheng-Jun Zha, Ruili Feng

发表机构 * University of Science and Technology of China(中国科学技术大学) Matrix Team(Matrix团队) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) University of Waterloo(滑铁卢大学) The Pennsylvania State University(宾夕法尼亚州立大学) Zhongguancun Academy(中关村学院) The University of Hong Kong(香港大学)

AI总结 提出时间区间编码(TIE),将旋转位置嵌入推广为区间感知形式,解决扩散变换器(DiT)在重叠事件视频生成中时间区间无法表示的问题,显著提升时间可控性。

详情
AI中文摘要

导演式提示、机器人动作预测和交互式视频代理需要对并发事件进行时间定位——在68%的通用片段和超过99%的机器人/游戏片段包含重叠事件的场景中,现有的多事件生成器却基于单一活动提示假设。然而,现代视频生成器(如扩散变换器DiT)通过逐点位置编码将时间表示为离散点。这种表述造成了根本性的维度不匹配:时间上延展的区间和重叠事件在数学上无法被注意力机制表示。在本文中,我们提出时间区间编码(TIE),这是一种原则性的、即插即用的区间感知旋转嵌入推广,将时间区间提升为DiT交叉注意力中的一等公民。我们没有引入另一种启发式区间嵌入,而是证明,在兼容RoPE的双线性注意力中,TIE由两个基本原则刻画:时间可积性(要求事件在其整个持续时间内聚合位置证据)和持续时间不变性(消除对较长区间的平凡偏差)。在均匀核下,这种刻画产生了一个高效的闭式sinc解,该解保留了标准注意力接口,并通过区间积分自然地衰减边界噪声。实验上,TIE在保持基础DiT模型视觉质量的同时,显著提高了时间可控性。在OmniEvents数据集上的实验中,它将人工验证的时间约束满足率从77.34%提升至96.03%,将时间边界误差从0.261秒降低至0.073秒,同时改进了轨迹级时间对齐指标。代码和数据集可在https://github.com/MatrixTeam-AI/TIE获取。

英文摘要

Director-style prompting, robotic action prediction, and interactive video agents demand temporal grounding over concurrent events -- a regime in which 68% of general clips and over 99% of robotics/gameplay clips contain overlapping events, yet existing multi-event generators rest on a single-active-prompt assumption. However, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events are mathematically unrepresentable to the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another heuristic interval embedding, we show that, within RoPE-compatible bilinear attention, TIE is characterized by two basic principles: Temporal Integrability, which requires an event to aggregate positional evidence over its full duration, and Duration Invariance, which removes the trivial bias toward longer intervals. Under a uniform kernel, this characterization yields an efficient closed-form sinc-based solution that preserves the standard attention interface and naturally attenuates boundary noise through interval integration. Empirically, TIE preserves the visual quality of the base DiT model while substantially improving temporal controllability. In our experiments on the OmniEvents dataset, it improves human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03% and reduces temporal boundary error from 0.261s to 0.073s, while also improving trajectory-level temporal alignment metrics. The code and dataset are available at https://github.com/MatrixTeam-AI/TIE.
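The "closed-form sinc-based solution" can be motivated with a standard identity. The paper's exact formulation is not given in the abstract, so this is a sketch under the assumption that the interval encoding averages the rotary phase exp(iωt) over the interval [start, end] with a uniform kernel; the function name `interval_phase` is hypothetical. The average equals the phase at the interval midpoint scaled by sin(ωh)/(ωh), where h is the half-length.

```python
import numpy as np

def interval_phase(omega: float, start: float, end: float) -> complex:
    """Uniform-kernel average of exp(1j * omega * t) over [start, end]."""
    mid = 0.5 * (start + end)
    half = 0.5 * (end - start)
    # np.sinc(x) is sin(pi x) / (pi x), so divide the argument by pi
    # to obtain the unnormalized sin(u) / u with u = omega * half.
    return complex(np.exp(1j * omega * mid) * np.sinc(omega * half / np.pi))
```

Two properties follow directly: a zero-length interval reduces to the ordinary point-wise rotary phase, and dividing by the interval length keeps the magnitude at most 1, so longer events gain no trivial advantage. High frequencies are damped more for long intervals, which is consistent with the boundary-noise attenuation the abstract mentions.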

2401.01160 2026-05-26 eess.IV cs.CG cs.CV cs.LG 版本更新

Train-Free Segmentation in MRI with Cubical Persistent Homology

基于立方体持续同调的MRI无训练分割

Anton François, Raphaël Tinarrage

发表机构 * Centre G. Borelli ENS Paris-Saclay(巴黎-萨克雷高等师范学院Borelli中心) IST Austria(奥地利科学技术研究所) EMAp, Fundação Getulio Vargas(热图利奥·瓦加斯基金会应用数学学院)

AI总结 提出一种基于拓扑数据分析的无训练MRI分割框架,通过自动阈值、提取已知拓扑子集和分解成分三步实现,利用持续同调中的近似代表循环建立拓扑特征与解剖成分的可解释联系,在胶质母细胞瘤和胎儿皮质板分割中验证有效性。

Comments Similar to the published version. 22 pages, 11 figures, 3 tables. For associated code, see https://github.com/antonfrancois/gliomaSegmentation_TDA

详情
Journal ref
Journal of Mathematical Imaging and Vision 68, 20 (2026)
AI中文摘要

我们研究了一种基于拓扑数据分析的无训练MRI分割框架。该流程分三步进行:首先通过自动阈值识别待分割的整个对象,然后检测一个拓扑结构已知的独特子集,最后推导出分割的各个组成部分。一个关键要素是从持续同调图中提取近似代表循环,这提供了持久特征与解剖成分之间的可解释联系。为了阐明该方法的应用范围,我们明确了潜在的拓扑和强度假设,量化了它们在真实数据上的成立情况,并分析了典型的失败模式。我们在胶质母细胞瘤和胎儿皮质板分割上评估了该方法,并与无监督和深度学习参考方法进行了比较。通过在没有大型标注数据集的情况下运行,该方法非常适合数据稀缺的场景,并为专家修正或基于学习的流程提供了可解释的基线和实用的初始化。

英文摘要

We investigate a framework for train-free MRI segmentation based on Topological Data Analysis. The pipeline proceeds in three steps, first identifying the whole object to segment via automatic thresholding, then detecting a distinctive subset whose topology is known in advance, and finally deducing the various components of the segmentation. A key ingredient is the extraction of approximate representative cycles from persistence diagrams, which provides an interpretable link between persistent features and anatomical components. To clarify the method's scope, we make the underlying topological and intensity assumptions explicit, quantify when they hold on real data, and analyze typical failure modes. We evaluate the approach on glioblastoma and on fetal cortical plate segmentation, with comparisons to unsupervised and deep-learning references. By operating without large annotated datasets, the method is well suited to scarce-data settings and provides an interpretable baseline and practical initialization for expert refinement or learning-based pipelines.

2506.03134 2026-05-26 eess.SP cs.CV 版本更新

Controllable Radar Simulation with Waveform Parameter Embedding

具有波形参数嵌入的可控雷达仿真

Weiqing Xiao, Hao Huang, Chonghao Zhong, Yujie Lin, Nan Wang, Xiaoxue Chen, Zhaoxi Chen, Saining Zhang, Shuocheng Yang, Pierre Merriaux, Lei Lei, Hao Zhao

发表机构 * NJU(南京大学) BJTU(北京交通大学) BIT(北京理工大学) AIR, THU(清华大学智能产业研究院) NTU(南洋理工大学) SVM, THU(清华大学车辆与运载学院) Lightwheel AI LeddarTech

AI总结 提出Ctrl-RS框架,通过环境反射张量、波形参数抽象和WARP-Net网络,实现可控的雷达立方体仿真,在2D/3D检测和语义分割任务中性能接近或超越真实雷达。

Comments CVPR 2026 Findings. Code: https://github.com/zhuxing0/SA-Radar Project page: https://zhuxing0.github.io/projects/SA-Radar

详情
AI中文摘要

自动驾驶模拟器仍然缺乏高保真雷达,尽管雷达对于恶劣天气下的鲁棒感知至关重要。一个关键障碍是原始雷达点云极其稀疏和随机,难以建模;我们认为模拟完整的距离-方位-多普勒立方体是一个更合理的目标。现有的雷达立方体模拟器要么纯粹依赖神经生成器,这些生成器不透明且对传感器属性的控制有限,要么依赖详细的电磁流水线,这些流水线速度慢、需要专有硬件规格,并且仍然难以捕捉真实世界的复杂性。我们引入了Ctrl-RS,一个可控的雷达立方体仿真框架,结合了两者的优势。首先,我们从多种传感器源(包括LiDAR、单目相机和现有雷达)构建环境反射张量。其次,我们将雷达物理抽象为一组紧凑的波形参数,这些参数表征3D点扩散函数,从而得到雷达属性(如距离分辨率、多普勒展宽和方位波束形状)的直观嵌入。第三,我们在一个大型混合数据集上训练WARP-Net,该数据集融合了真实、分析合成和模拟器生成的雷达立方体,以覆盖广泛的雷达属性分布。Ctrl-RS支持视角变化、参与者移除和属性编辑。在RADDet、Carrada和nuScenes上的实验表明,我们的模拟数据在2D检测和语义分割中可以匹配或超越真实雷达,并且在与真实数据结合时持续提升3D检测性能。项目地址:https://github.com/zhuxing0/Ctrl-RS。

英文摘要

Autonomous driving simulators still lack high-fidelity radar, even though radar is critical for robust perception in adverse weather. A key obstacle is that raw radar point clouds are extremely sparse and stochastic, making them difficult to model; we argue that simulating the full range-azimuth-Doppler cube is a more principled target. Existing radar cube simulators either rely purely on neural generators, which are opaque and offer little control over sensor attributes, or on detailed electromagnetic pipelines, which are slow, require proprietary hardware specifications, and still struggle to capture real-world complexity. We introduce Ctrl-RS, a controllable radar cube simulation framework that combines the strengths of both worlds. First, we build an environment reflection tensor from diverse sensor sources (including LiDAR, monocular cameras, and existing radar). Second, we abstract radar physics into a compact set of waveform parameters that characterize the 3D point spread function, yielding an intuitive embedding of radar attributes such as range resolution, Doppler broadening, and azimuth beam shape. Third, we train a WARP-Net on a large mixed dataset that fuses real, analytically synthesized, and simulator-generated radar cubes to cover a wide distribution of radar attributes. Ctrl-RS supports viewpoint changes, actor removal, and attribute editing. Experiments on RADDet, Carrada, and nuScenes show that our simulated data can match or surpass real radar in 2D detection and semantic segmentation, and consistently boosts performance in 3D detection when combined with real data. The project is available at https://github.com/zhuxing0/Ctrl-RS.

2412.15678 2026-05-26 cs.CV 版本更新

Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network

基于多线程知识迁移网络的多对时序句子定位

Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, Beibei Li

发表机构 * Sichuan University(四川大学) Nanyang Technological University, Singapore(南洋理工大学,新加坡) Peking University(北京大学) Guangzhou University(广州大学) Zhejiang Gongshang University(浙江工商大学)

AI总结 提出多对时序句子定位新任务,并设计多线程知识迁移网络,通过跨模态对比、原型对齐和自适应负样本选择实现多对视频-查询对的协同训练。

Comments Accepted by AAAI 2025

详情
AI中文摘要

给定一些包含未修剪视频和句子查询的视频-查询对,时序句子定位(TSG)旨在定位这些视频中与查询相关的片段。尽管先前优秀的TSG方法取得了显著成功,但它们单独训练每个视频-查询对,忽略了不同对之间的关系。我们观察到,相似的视频/查询内容不仅有助于TSG模型更好地理解和泛化跨模态表示,还能帮助模型定位一些复杂的视频-查询对。先前的方法遵循单线程框架,无法共同训练不同的对,并且通常花费大量时间重新获取冗余知识,限制了其实际应用。为此,在本文中,我们提出了一种全新的设置:多对TSG,旨在共同训练这些对。特别地,我们提出了一种新颖的视频-查询共同训练方法,即多线程知识迁移网络,以有效且高效地定位各种视频-查询对。首先,我们挖掘不同查询之间的空间和时间语义以相互协作。为了同时学习模态内和模态间表示,我们设计了一个跨模态对比模块,通过自监督策略探索语义一致性。为了充分对齐不同对之间的视觉和文本表示,我们设计了一种原型对齐策略,以1)匹配对象原型和短语原型以实现空间对齐,以及2)对齐活动原型和句子原型以实现时间对齐。最后,我们开发了一个自适应负样本选择模块,以自适应地生成跨模态匹配的阈值。大量实验表明了我们提出方法的有效性和效率。

英文摘要

Given some video-query pairs with untrimmed videos and sentence queries, temporal sentence grounding (TSG) aims to locate query-relevant segments in these videos. Although previous respectable TSG methods have achieved remarkable success, they train each video-query pair separately and ignore the relationship between different pairs. We observe that similar video/query content not only helps the TSG model better understand and generalize the cross-modal representation but also assists the model in locating some complex video-query pairs. Previous methods follow a single-thread framework that cannot co-train different pairs and usually spends much time re-obtaining redundant knowledge, limiting their real-world applications. To this end, in this paper, we pose a brand-new setting: Multi-Pair TSG, which aims to co-train these pairs. In particular, we propose a novel video-query co-training approach, Multi-Thread Knowledge Transfer Network, to locate a variety of video-query pairs effectively and efficiently. Firstly, we mine the spatial and temporal semantics across different queries to cooperate with each other. To learn intra- and inter-modal representations simultaneously, we design a cross-modal contrast module to explore the semantic consistency by a self-supervised strategy. To fully align visual and textual representations between different pairs, we design a prototype alignment strategy to 1) match object prototypes and phrase prototypes for spatial alignment, and 2) align activity prototypes and sentence prototypes for temporal alignment. Finally, we develop an adaptive negative selection module to adaptively generate a threshold for cross-modal matching. Extensive experiments show the effectiveness and efficiency of our proposed method.

2412.06284 2026-05-26 cs.CV 版本更新

Your Data Is Not Perfect: Towards Cross-Domain Out-of-Distribution Detection in Class-Imbalanced Data

你的数据并不完美:面向类别不平衡数据中的跨域分布外检测

Xiang Fang, Arvind Easwaran, Blaise Genest, Ponnuthurai Nagaratnam Suganthan

发表机构 * College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) KINDI Computing Research Center, College of Engineering, Qatar University, Doha(卡塔尔大学工程学院KINDI计算研究中心,多哈)

AI总结 针对跨域类别不平衡的分布外检测问题,提出基于原型对齐的不确定性感知自适应语义对齐网络(UASA),通过标签驱动原型、自适应阈值和不确定性感知聚类缩小域间隙、语义间隙和类别不平衡间隙。

Comments Accepted by Expert Systems with Applications

详情
AI中文摘要

以往的OOD检测系统只关注ID和OOD样本之间的语义差距。除了语义差距,我们还面临两个额外的差距:源域和目标域之间的域差距,以及不同类别之间的类别不平衡差距。事实上,来自不同域的相似对象应该属于同一类别。在本文中,我们引入了一个现实且具有挑战性的设置:类别不平衡的跨域OOD检测(CCOD),该设置包含一个标注良好(但通常较小)的源集用于训练,并在一个未标注(但通常较大)的目标集上进行OOD检测。我们不假设目标域仅包含OOD类别或类别平衡:目标数据集的类别分布不必与源数据集相同。为了应对这一具有挑战性的设置,我们提出了一种基于原型对齐策略的新型不确定性感知自适应语义对齐网络(UASA)。具体来说,我们首先在源域中构建标签驱动的原型,并利用这些原型进行目标分类以缩小域差距。我们不是使用固定阈值进行OOD检测,而是生成自适应样本级阈值来处理语义差距。最后,我们进行不确定性感知聚类,将语义相似的目标样本分组,以缓解类别不平衡差距。在三个具有挑战性的基准上的大量实验表明,我们提出的UASA以较大优势优于最先进的方法。

英文摘要

Previous OOD detection systems only focus on the semantic gap between ID and OOD samples. Besides the semantic gap, we are faced with two additional gaps: the domain gap between source and target domains, and the class-imbalance gap between different classes. In fact, similar objects from different domains should belong to the same class. In this paper, we introduce a realistic yet challenging setting: class-imbalanced cross-domain OOD detection (CCOD), which contains a well-labeled (but usually small) source set for training and conducts OOD detection on an unlabeled (but usually larger) target set for testing. We do not assume that the target domain contains only OOD classes or that it is class-balanced: the distribution among classes of the target dataset need not be the same as the source dataset. To tackle this challenging setting with an OOD detection system, we propose a novel uncertainty-aware adaptive semantic alignment (UASA) network based on a prototype-based alignment strategy. Specifically, we first build label-driven prototypes in the source domain and utilize these prototypes for target classification to close the domain gap. Rather than utilizing fixed thresholds for OOD detection, we generate adaptive sample-wise thresholds to handle the semantic gap. Finally, we conduct uncertainty-aware clustering to group semantically similar target samples to relieve the class-imbalance gap. Extensive experiments on three challenging benchmarks demonstrate that our proposed UASA outperforms state-of-the-art methods by a large margin.