arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12501 2026-05-13 cs.CV Version update

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo

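Affiliations: Southeast University; Mohamed bin Zayed University of Artificial Intelligence; Wuhan University; Sun Yat-sen University; Microsoft

AI summary: This work targets the poor reliability of computer-use agents (CUAs) on complex, low-frequency interactions. It proposes a new benchmark, CUActSpot, covering five interaction modalities (GUI, text, table, canvas, and natural image) and a variety of action types. To address the scarcity of data for complex interactions, the authors design a renderer-based data-synthesis pipeline that automatically generates scenes along with matching instructions and action traces. Experiments show that the model trained on this data outperforms open-source models with fewer than 32B parameters.

Abstract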

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git

2605.12500 2026-05-13 cs.CV Version update

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin

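AI summary: This paper presents SenseNova-U1, a unified multimodal model that addresses the separation between understanding and generation in current vision-language models. Built on the NEO-unify architecture, it treats understanding and generation as synergistic views of a single underlying process, aiming at more native multimodal intelligence. The authors report strong performance across a range of tasks and describe the model design and training strategies in detail to support further multimodal research.

Comments: Project page: https://github.com/OpenSenseNova/SenseNova-U1

Abstract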

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

2605.12498 2026-05-13 cs.CV cs.GR Version update

EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

Christen Millerdurai, Shaoxiang Wang, Yaxu Xie, Vladislav Golyanik, Didier Stricker, Alain Pagani

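Affiliations: Max Planck Institute for Informatics (MPII)

AI summary: This paper presents EgoForce, a monocular 3D hand reconstruction framework that recovers the absolute 3D pose and position of the hands from the user's (camera-space) viewpoint, for settings such as AR/VR and telepresence where sensing must remain compact and unobtrusive. By combining a differentiable forearm representation, a unified arm-hand transformer, and a ray-space closed-form solver, the method mitigates the depth-scale ambiguity of monocular approaches and reconstructs robustly across several wide-FOV camera models. Experiments show state-of-the-art accuracy on three egocentric benchmarks, including a reduction in camera-space MPJPE of up to 28% on HOT3D.

Comments: 23 pages, 19 figures and 10 tables; project page: https://dfki-av.github.io/EgoForce (source code, data and demo available); SIGGRAPH 2026 Conference

Abstract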

Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.
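
The camera-space MPJPE figure quoted above can be made concrete with a small sketch. It is only an illustration of the standard metric, not the authors' evaluation code: the two-joint "hand" and its coordinates are hypothetical, and `root_relative` is a helper introduced here to show why a camera-space error penalizes depth mistakes that a root-aligned error hides.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def root_relative(joints, root=0):
    """Subtract the root joint, discarding the absolute position of the hand."""
    rx, ry, rz = joints[root]
    return [(x - rx, y - ry, z - rz) for x, y, z in joints]

# Hypothetical two-joint hand in millimetres: correct shape, 30 mm too far away.
gt = [(0.0, 0.0, 400.0), (0.0, 80.0, 400.0)]
pred = [(0.0, 0.0, 430.0), (0.0, 80.0, 430.0)]

# Camera-space error sees the depth offset; the root-relative error does not.
camera_space_error = mpjpe(pred, gt)
root_relative_error = mpjpe(root_relative(pred), root_relative(gt))
```

A prediction with the right articulation but the wrong depth therefore scores perfectly under root alignment and poorly in camera space, which is the regime the abstract targets.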

2605.12497 2026-05-13 cs.CV Version update

From Web to Pixels: Bringing Agentic Search into Visual Perception

Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue

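Affiliations: Deep Research

AI summary: This work studies open-world visual perception in which external information (facts, events, long-tail entities, and multi-hop relations) is needed to complete the task. The authors pose this challenge as "Perception Deep Research" and build the WebEye benchmark, which provides verifiable evidence, knowledge-intensive queries, and precisely annotated object instances. They also design Pixel-Searcher, an agentic search workflow that goes from external information to pixel-level target localization and achieves the strongest open-source performance on all three task views.

Comments: Project page: https://pixel-searcher.github.io/

Abstract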

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.

2605.12496 2026-05-13 cs.CV Version update

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yue Yu, Hanlin Wang, Haobo Li, Jiapeng Zhu, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen, Huamin Qu

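Affiliations: HKUST (Hong Kong University of Science and Technology); Ant Group; SJTU (Shanghai Jiao Tong University)

AI summary: CausalCine is a real-time autoregressive generation framework for multi-shot video narratives that targets the motion stagnation and semantic drift existing models show during long rollouts. It combines a causal base model with Content-Aware Memory Routing to generate coherently across shots, and supports dynamic prompts and context reuse. Experiments show that CausalCine outperforms autoregressive baselines, approaches the quality of bidirectional models, and supports real-time interactive generation.

Comments: Project page: https://yihao-meng.github.io/CausalCine/

Abstract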

Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
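
The contrast the abstract draws between relevance-based and proximity-based retrieval can be sketched in a few lines. This is a toy illustration under stated assumptions, not the CAMR implementation: the scaled dot product stands in for the paper's attention-based relevance score, and `route_memory`, `recency_window`, and the example vectors are hypothetical.

```python
import math

def route_memory(query, history_keys, budget):
    """Indices of the `budget` cached entries most relevant to `query`.

    Relevance is a scaled dot product (assumed stand-in for an attention score).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in history_keys]
    ranked = sorted(range(len(history_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # restore temporal order among selected entries

def recency_window(num_entries, budget):
    """Baseline: keep only the most recent `budget` entries."""
    return list(range(max(0, num_entries - budget), num_entries))

# Entries 0 and 2 belong to an earlier shot that the current query returns to.
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]]
query = [1.0, 0.0]

routed = route_memory(query, history, budget=2)
recent = recency_window(len(history), budget=2)
```

With the same memory budget, the routed selection recovers the two entries from the earlier shot, while the recency window keeps only the latest entries, which is the failure mode for cross-shot coherence.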

2605.12495 2026-05-13 cs.CV cs.AI cs.LG Version update

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao

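Affiliations: The University of Hong Kong

AI summary: This paper proposes AlphaGRPO, a framework that applies Group Relative Policy Optimization (GRPO) to unified multimodal models (UMMs) and improves multimodal generation without an additional cold-start stage. Its Decompositional Verifiable Reward (DVReward) uses an LLM to break complex user requests into verifiable semantic and quality questions, giving stable, reliable feedback that supports reasoning text-to-image generation and self-reflective refinement. Experiments show significant gains on several multimodal generation benchmarks, and strong results on editing tasks without training on them.

Comments: ICML 2026

Abstract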

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
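
A minimal sketch of how a decomposed reward could feed a group-relative advantage is given below. The group normalization (reward minus group mean, divided by group standard deviation) is the usual GRPO form; the aggregation in `dv_reward`, which averages boolean verdicts on atomic questions, is an assumption made for illustration, since the abstract does not state how the answers are combined.

```python
from statistics import mean, pstdev

def dv_reward(verdicts):
    """Fraction of atomic questions judged satisfied (assumed aggregation)."""
    return sum(1.0 for v in verdicts if v) / len(verdicts)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verdicts for three images generated from one prompt.
group = [
    [True, True, True, True],      # all atomic checks pass
    [True, False, True, False],    # half pass
    [False, False, False, False],  # none pass
]
rewards = [dv_reward(v) for v in group]
advantages = group_relative_advantages(rewards)
```

Because each reward is a count of verifiable checks rather than one holistic score, the advantage of a sample can be traced back to the specific questions it failed.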

2605.12494 2026-05-13 cs.CV Version update

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xiaohan Yu, Lin Gu, Gim Hee Lee

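Affiliations: School of Computer Science Engineering, State Key Laboratory of Complex & Critical Software Environment, Jiangxi Research Institute, Beihang University; State Key Laboratory of Virtual Reality Technology; Macquarie University; Tohoku University; School of Computing, National University of Singapore

AI summary: This paper studies the photometric ambiguity that pervades surface reconstruction with differentiable rendering and proposes AmbiSuR, a framework that improves the accuracy of Gaussian Splatting surface reconstruction under such ambiguity. Revisiting the foundations of the Gaussian Splatting representation, the authors identify its built-in photometric ambiguities and propose a photometric disambiguation step and an ambiguity indication module, which constrain the ill-posed geometry solution and guide the reconstruction. Experiments show more accurate and more robust surface reconstruction across a variety of challenging scenes.

Comments: Accepted at ICML 2026. Project page: https://fictionarry.github.io/AmbiSuR-Proj/

Abstract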

Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet pervasive photometric ambiguities remain a strict bottleneck for existing approaches. This paper presents AmbiSuR, a framework that explores an intrinsic solution built on Gaussian Splatting for high-performance 3D surface reconstruction that is robust to photometric ambiguity. Starting by revisiting the foundation, our investigation uncovers two built-in primitive-wise ambiguities in the representation, while revealing an intrinsic potential for ambiguity self-indication in Gaussian Splatting. Building on these findings, we first introduce a photometric disambiguation that constrains the ill-posed geometry solution toward definite surface formation. Then, we propose an ambiguity indication module that unleashes the self-indication potential to identify underconstrained reconstructions and guide their correction. Extensive experiments demonstrate superior surface reconstructions compared to existing methods across various challenging scenarios, along with broad compatibility. Project: https://fictionarry.github.io/AmbiSuR-Proj/

2605.12491 2026-05-13 cs.CV cs.LG Version update

Elastic Attention Cores for Scalable Vision Transformers

Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai, Muquan Yu, Hossein Adeli, Deva Ramanan, Michael J. Tarr, Andrew F. Luo

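Affiliations: Carnegie Mellon University; University of Hong Kong; Columbia University

AI summary: This paper proposes VECA, a vision transformer architecture that addresses the high computational cost of standard ViTs on high-resolution images. VECA introduces elastic core-periphery attention, in which a small set of learned core embeddings serves as a communication interface so that image patches never interact directly, reducing complexity from quadratic to linear in the number of patches. While keeping the full set of input tokens, it allows a flexible trade-off between compute and accuracy and performs competitively with recent vision foundation models on several vision tasks.

Comments: Project repository here: https://github.com/alansong1322/VECA

Abstract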

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
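
The core-periphery pattern described above can be sketched as two cross-attention passes per layer. This is a simplified illustration, not the VECA architecture: it assumes single-head attention without learned projections, residuals, or normalization, uses keys as values for brevity, and the function names are hypothetical. Its only purpose is to show that no patch-to-patch score is ever computed.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Single-head dot-product attention; keys double as values for brevity."""
    d = len(keys_values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        w = softmax(scores)
        out.append([sum(wi * kv[j] for wi, kv in zip(w, keys_values))
                    for j in range(d)])
    return out

def core_periphery_layer(patches, cores):
    """Cores read from all patches, then every patch reads from the cores.

    Each pass computes N*C scores, so the cost is O(N) for a fixed C.
    """
    new_cores = attend(cores, patches)        # C queries over N patches
    new_patches = attend(patches, new_cores)  # N queries over C cores
    return new_patches, new_cores

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # N = 4
cores = [[1.0, 0.0], [0.0, 1.0]]                            # C = 2
new_patches, new_cores = core_periphery_layer(patches, cores)
```

All N patch tokens are kept and updated, so the cores act as a relay rather than as a C-way bottleneck on the output, which matches the distinction the abstract draws from earlier cross-attention designs.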

2605.12480 2026-05-13 cs.CV cs.AI Version update

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

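Affiliations: University of Science and Technology of China; Peking University; JD Explore Academy

AI summary: OmniNFT is a reinforcement learning framework for joint audio-video generation that targets the shortcomings of existing methods in per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Its three components (modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting) address inconsistent multi-objective advantages, imbalanced multi-modal gradients, and uniform credit assignment. Experiments show improved audio and video perceptual quality and synchronization on several benchmarks.

Comments: Project page: https://zghhui.github.io/OmniNFT/

Abstract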

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
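
The first obstacle and its remedy can be illustrated with a toy group of two samples. This sketch assumes the global baseline simply sums the per-modality rewards before normalizing, which the abstract does not specify; the function names and reward values are hypothetical.

```python
from statistics import mean, pstdev

def normalize(rewards, eps=1e-6):
    """Group-relative advantage: center and scale within the sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def global_advantages(audio_rewards, video_rewards):
    """Single advantage per sample from the summed reward (assumed baseline)."""
    return normalize([a + v for a, v in zip(audio_rewards, video_rewards)])

def routed_advantages(audio_rewards, video_rewards):
    """One advantage per reward, to be applied only to its own modality branch."""
    return {"audio": normalize(audio_rewards), "video": normalize(video_rewards)}

# Sample 0 has good audio and poor video; sample 1 is the opposite.
audio = [1.0, 0.0]
video = [0.0, 1.0]

merged = global_advantages(audio, video)
routed = routed_advantages(audio, video)
```

The summed reward is identical for both samples, so the global advantage vanishes and neither branch receives a learning signal, whereas the routed advantages still tell the audio branch to prefer sample 0 and the video branch to prefer sample 1.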

2605.12449 2026-05-13 cs.CV Version update

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

Wufei Ma, Chloe Wang, Siyi Chen, Jiawei Peng, Patrick Li, Alan Yuille

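Affiliations: Johns Hopkins University

AI summary: LychSim is a controllable and interactive simulation framework for vision research built on Unreal Engine 5, intended to lower the technical barrier of simulation platforms and support closed-loop optimization and out-of-distribution evaluation. Through a streamlined Python API, a procedural data generation pipeline, and integration with the Model Context Protocol, it provides high-fidelity environment generation, semantically aligned 3D annotations, and dynamic interaction with large language models. The authors demonstrate it on downstream tasks including synthetic data generation, adversarial examination, and language-driven scene generation, and plan to release it publicly.

Comments: 3D-LLM/VLA Workshop at CVPR 2026. Project page: https://lychsim.github.io/

Abstract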

While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.

2605.12437 2026-05-13 cs.CV Version update

3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark

Yunxiao Zhang, Suryansh Kumar

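Affiliations: Visual and Spatial AI Lab; VCCM Section, College of PVFA; Department of ECEN; Department of CSCE; Texas A&M University

AI summary: This paper studies efficient retrospective novel view synthesis (NVS) of dynamic scenes in a synchronized multi-view (MV) setting and proposes a 3D Gaussian Splatting (3DGS) approach that reconstructs dynamic scenes without explicit temporal coupling. The method initializes from an SfM-derived point cloud at the start time and propagates the optimized Gaussians over time. The authors also build a Blender-based dynamic MV dataset framework that generates standardized, high-quality synchronized camera rigs and training data, supporting reproducible and systematic evaluation of dynamic NVS methods.

Comments: Accepted for publication at CVPR 2026; 4D World Models Workshop. Draft info: 14 pages, 4 figures, 8 tables

Abstract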

Retrospective novel view synthesis (NVS) of dynamic scenes is fundamental to applications such as sports. Recent dynamic 3D Gaussian Splatting (3DGS) approaches introduce temporally coupled formulations to enforce motion coherence across time. In this paper, we argue that, in a synchronized multi-view (MV) setting typical of sports, the dynamic scene at each time step is already strongly geometrically constrained. We posit that the availability of calibrated, synchronized viewpoints provides sufficient spatial consistency, and therefore explicit temporal coupling or complex multi-body constraints seem unnecessary for retrospective NVS. To this end, we propose an approach tailored for synchronized MV dynamic scenes. By initializing the SfM-derived point cloud at the start time and propagating optimized Gaussians over time, we show that efficient retrospective NVS can be achieved without imposing a temporal deformation constraint. Complementing our methodological contribution, we introduce a Dynamic MV dataset framework built on Blender for reproducible NeRF and 3DGS research. The framework generates high-quality, synchronized camera rigs and exports training-ready datasets in standard formats, eliminating inconsistencies in coordinate conventions and data pipelines. Using the framework, we construct a dynamic benchmark suite and evaluate representative NeRF and 3DGS approaches under controlled conditions. Together, we show that, under a synchronized MV setup, efficient retrospective dynamic scene NVS can be achieved using 3DGS. At the same time, the dataset-generation framework enables reproducible and principled benchmarking of dynamic NVS methods.

2605.12431 2026-05-13 cs.CV Version update

GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

Huiran Duan, Qian Zhou, Zhongliang Guo, Junhao Dong, Yuqi Li, Guoying Zhao, Yingli Tian

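Affiliations: City University of New York; Wuhan University; University of Aberdeen; Nanyang Technological University; ELLIS Institute Finland; University of Oulu

AI summary: Conventional gait de-identification methods tend either to suppress identity insufficiently or to introduce spatiotemporal distortions that hurt downstream applications. This paper proposes GaitProtector, an impersonation-driven gait de-identification framework that casts privacy protection as a unified objective with two tightly coupled components: repulsion from the source identity and attraction toward a target identity. Without retraining any model, it optimizes the latent codes of a pretrained 3D diffusion model for each input silhouette sequence to synthesize gaits that protect privacy while remaining structurally plausible. Experiments on CASIA-B report a 56.7% impersonation success rate, and downstream diagnostic accuracy on Scoliosis1K drops from 91.4% to 74.2%.

Comments: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)

Abstract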

Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.
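
The two coupled terms of the objective can be written down as a toy loss over identity embeddings. This is a sketch under assumptions, not the paper's adversarial objective: cosine similarity, the weights `w_obf` and `w_imp`, and the two-dimensional embeddings are all illustrative choices, and the diffusion inversion and latent optimization loop are omitted.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def deid_loss(protected, source, target, w_obf=1.0, w_imp=1.0):
    """Lower is better: far from the source identity, close to the target."""
    obfuscation = cosine(protected, source)      # repel from the source
    impersonation = -cosine(protected, target)   # attract toward the target
    return w_obf * obfuscation + w_imp * impersonation

# Hypothetical identity embeddings from a gait recognizer.
source = [1.0, 0.0]
target = [0.0, 1.0]

loss_unprotected = deid_loss(source, source, target)
loss_impersonating = deid_loss(target, source, target)
```

Minimizing only the obfuscation term would accept any embedding far from the source, including implausible ones; the impersonation term narrows the solution to the neighborhood of a real identity, which is the anchoring role the abstract describes.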

2605.12430 2026-05-13 cs.CV Version update

AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection

Joaquín Figueira, Rob Van Gastel, Giacomo D'Amicantonio, Zhuoran Liu, Ioan Gabriel Bucur, Faysal Boughorbel, Egor Bondarev

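Affiliations: AIMS Lab, Eindhoven University of Technology; CoC, ASMPT; iCIS, Radboud University

AI summary: This paper proposes AOI-SSL, a self-supervised framework for more efficient segmentation of wire-bonded semiconductors in optical inspection. It combines small-domain self-supervised pre-training with in-context inference, reducing the need for labeled data, particularly when data is limited. Experiments show that it improves segmentation quality over training from scratch and over ImageNet pre-trained backbones, and that retrieval-based segmentation outperforms fine-tuning when targeting single-device images.

Comments: Accepted to the AI4RWC Workshop at CVPR 2026

Abstract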

Segmentation models in automated optical inspection of wire-bonded semiconductors are typically device-specific and must be re-trained when new devices or distribution shifts appear. We introduce AOI-SSL, a training-efficient framework for semantic segmentation of wire-bonded semiconductors that combines small-domain self-supervised pre-training of vision transformers with in-context inference that minimizes the need for labeled examples. We pre-train state-of-the-art self-supervised algorithms on a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small-data setting, improving downstream segmentation while reducing the labeled fine-tuning effort. We further introduce in-context, patch-level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity-based retrieval performs on par with the more complex attention-based aggregation currently used in the literature. Furthermore, our experiments demonstrate that self-supervised pre-training significantly improves segmentation quality compared to training from scratch and to ImageNet pre-trained backbones under a fixed fine-tuning computational budget. Finally, the results reveal that retrieval-based segmentation outperforms fine-tuning when targeting single-device images, allowing for near-instant adaptation to difficult samples.
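
Similarity-based patch retrieval of the kind mentioned above can be sketched as nearest-neighbor label transfer. This is a minimal illustration, assuming cosine similarity and a single nearest neighbor; the embeddings, the class names, and `retrieve_labels` are hypothetical, and a real pipeline would take the embeddings from the pre-trained encoder.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_labels(query_patches, support_patches, support_labels):
    """Give each query patch the label of its most similar labeled patch."""
    labels = []
    for q in query_patches:
        best = max(range(len(support_patches)),
                   key=lambda i: cosine(q, support_patches[i]))
        labels.append(support_labels[best])
    return labels

# Hypothetical patch embeddings from one labeled reference image.
support = [[1.0, 0.0], [0.0, 1.0]]
support_labels = ["wire", "background"]

# Patch embeddings from a new, unlabeled image of the same device.
query = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
predicted = retrieve_labels(query, support, support_labels)
```

Nothing is trained at adaptation time: adding a labeled example only extends the support set, which is why this route can adapt almost instantly compared with fine-tuning.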

2605.12399 2026-05-13 cs.CV Version update

GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

Xiao Cao, Yuze Li, Youmin Zhang, Jiayu Song, Cheng Yan, Wen Li, Lixin Duan

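Affiliations: University of Electronic Science; Tianjin University, Tianjin, China

AI summary: This paper proposes GeoQuery, a geometry-guided diffusion framework that addresses the severe artifacts of 3D Gaussian Splatting (3DGS) reconstruction under sparse views. Its Geometry-guided Cross-view Attention (GCA) mechanism uses predicted depth maps and camera poses to build a geometry-aligned field for sampling reference features, which yields more reliable query features, and aggregates features within a local window to improve consistency. Experiments show improved sparse-view novel view synthesis and artifact removal, and the method integrates into existing diffusion-based pipelines.

Comments: Accepted to SIGGRAPH 2026 Conference Track

Abstract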

3D Gaussian Splatting (3DGS) has emerged as a prominent paradigm for 3D reconstruction and novel view synthesis. However, it remains vulnerable to severe artifacts when trained under sparse-view constraints. While recent methods attempt to rectify artifacts in rendered views using image diffusion models, they typically rely on multi-view self-attention to retrieve information from reference images. We observe that this mechanism often fails when the rendered novel views output by 3DGS are heavily corrupted: damaged query features lead to erroneous cross-view retrieval, resulting in inconsistent rendering refinement. To address this, we propose GeoQuery, a geometry-guided diffusion framework that integrates generative priors with explicit geometric cues via a novel Geometry-guided Cross-view Attention (GCA) mechanism. First, by leveraging predicted depth maps and camera poses, we construct a geometry-induced correspondence field to sample reference features, forming a geometry-aligned proxy query that replaces the corrupted rendering features. Furthermore, we design a new cross-view feature aggregation pipeline, in which we restrict the cross-view attention to a local window around each proxy query to effectively retrieve useful features while suppressing spurious matches. GeoQuery can be seamlessly integrated into existing diffusion-based pipelines, enabling robust reconstruction even under extreme view sparsity. Extensive experiments on sparse-view novel view synthesis and rendering artifact removal demonstrate the effectiveness of our approach.
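
The geometry-induced correspondence step can be illustrated with standard pinhole reprojection. This is a sketch under assumptions rather than the paper's pipeline: it assumes a pinhole camera with intrinsics shared by both views and a known relative pose (rotation and translation from the novel view to the reference view); `reproject` and the numbers are hypothetical.

```python
def reproject(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Map pixel (u, v) with known depth in the novel view to the reference view."""
    # Back-project the pixel to a 3D point in the novel camera frame.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Move the point into the reference camera frame.
    moved = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
             for r in range(3)]
    # Project onto the reference image plane.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = dict(fx=100.0, fy=100.0, cx=50.0, cy=50.0)

# Same camera: the pixel maps to itself.
same = reproject(50.0, 50.0, 2.0, rotation=identity,
                 translation=[0.0, 0.0, 0.0], **intrinsics)

# Reference camera offset by 0.1 along x: a point at depth 2 shifts by 5 pixels.
shifted = reproject(50.0, 50.0, 2.0, rotation=identity,
                    translation=[0.1, 0.0, 0.0], **intrinsics)
```

Sampling reference features at the reprojected location gives a query that depends only on geometry, so it stays usable even when the rendered novel view itself is heavily corrupted; restricting attention to a window around that location then tolerates moderate depth or pose error.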

2605.12389 2026-05-13 cs.CV cs.AI cs.LG Version update

SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation

Luke James Miller, Yugyung Lee

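Affiliations: Department of Computing, Analytics, and Mathematics, University of Missouri-Kansas City, Kansas City, United States

AI summary: This paper proposes SEMIR, a semantic minor-induced graph representation learning method that addresses the computational cost and class imbalance of segmenting small, sparse structures in large-scale images. Through parameterized edge contraction, node deletion, and edge deletion, SEMIR transforms the original grid graph into a compact, boundary-aligned graph minor while preserving an exact mapping from minor predictions back to grid labels. It improves minority-structure Dice on several tumor segmentation datasets and offers a framework for task-adapted representation learning on high-resolution structured visual data.

Comments: 20 pages, 3 figures. Accepted at ICML 2026. Includes appendices

Abstract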

Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance -- dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets -- BraTS 2021, KiTS23, and LiTS -- where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.
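
Edge contraction with an exact lifting map can be sketched with a union-find structure. This is a toy illustration under assumptions: a single threshold on an edge weight (for example, an intensity difference) stands in for the learned minor parameters, node and edge deletion are omitted, and the four-node path graph is hypothetical.

```python
def contract(num_nodes, edges, weights, threshold):
    """Contract every edge whose weight is below `threshold`.

    Returns `lift`, where lift[v] is the minor node that grid node v maps to.
    """
    parent = list(range(num_nodes))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v), w in zip(edges, weights):
        if w < threshold:
            parent[find(u)] = find(v)

    roots = {}
    return [roots.setdefault(find(v), len(roots)) for v in range(num_nodes)]

def lift_labels(minor_labels, lift):
    """Decode region-level predictions back to one label per grid node."""
    return [minor_labels[m] for m in lift]

# Path graph 0-1-2-3 with a strong boundary between nodes 1 and 2.
edges = [(0, 1), (1, 2), (2, 3)]
weights = [0.1, 0.9, 0.1]

lift = contract(4, edges, weights, threshold=0.5)
grid_labels = lift_labels(["background", "tumor"], lift)
```

Inference then runs on two minor nodes instead of four grid nodes, and because every grid node maps to exactly one minor node, decoding is exact rather than interpolated.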

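2605.12306 2026-05-13 cs.LG cs.AI cs.CV version update

KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks

Minjong Cheon

Affiliations * Sejong University Department of Computer Science and Engineering

AI summary This paper proposes KAN-CL, a continual learning framework that targets catastrophic forgetting caused by parameter interference across tasks. It exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to apply importance-weighted anchoring at the level of individual spline knots, which gives finer-grained parameter regularization. Used as a classification head with EWC on the backbone, KAN-CL reduces forgetting by 88% and 93% relative to a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T while keeping accuracy at or above all baselines, and a Neural Tangent Kernel analysis provides a supporting forgetting bound.

Details
English abstract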

Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T, respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.
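
Importance-weighted anchoring at per-knot granularity can be pictured as a quadratic penalty with one weight per spline coefficient. The sketch below assumes an EWC-style form; the function name, the `strength` parameter, and the way importance is obtained are illustrative assumptions, not the paper's definition.

```python
# Sketch only: an EWC-style quadratic penalty applied per spline knot.
# How importance is estimated is not specified here; the values below are
# made up for illustration.

def per_knot_penalty(coeffs, anchors, importance, strength=1.0):
    """Return 0.5 * strength * sum_k importance_k * (c_k - anchor_k)^2."""
    if not (len(coeffs) == len(anchors) == len(importance)):
        raise ValueError("expected one anchor and one importance value per knot")
    return 0.5 * strength * sum(
        w * (c - a) ** 2 for c, a, w in zip(coeffs, anchors, importance)
    )


anchors = [0.25, -0.5, 0.5, 0.0]        # coefficients after the old task
importance = [5.0, 5.0, 0.0, 0.0]       # only knots 0 and 1 served the old task

free_update = [0.25, -0.5, 1.5, -1.0]   # changes only unimportant knots
clash_update = [1.25, -0.5, 0.5, 0.0]   # moves an important knot by 1.0

free_cost = per_knot_penalty(free_update, anchors, importance)    # 0.0
clash_cost = per_knot_penalty(clash_update, anchors, importance)  # 2.5
```

Since each knot of a compact-support spline affects only a limited input region, a new task can move the unweighted knots freely while the knots that mattered for earlier tasks stay close to their anchors.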

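2605.12305 2026-05-13 cs.CV version update

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Yabo Zhang, Kunchang Li, Dewei Zhou, Xinyu Huang, Xun Wang

Affiliations * ByteDance Seed

AI summary Multimodal language models struggle to generate images from complex interleaved instructions. This work proposes INSET, a unified visual generation model that embeds images as native vocabulary inside the text instruction so that descriptions can be matched to their visual targets more precisely. A scalable data engine synthesizes 15M high-quality interleaved samples, and on InterleaveBench INSET outperforms existing methods in multi-image consistency and text alignment. The approach also extends to multimodal image editing.

Details
English abstract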

While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a. INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.

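2605.12303 2026-05-13 cs.HC cs.CV cs.LG version update

From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review

Moussa Kassem Sbeyti, Joshua Holstein, Philipp Spitzer, Nadja Klein, Gerhard Satzger

Affiliations * Scientific Computing Center; Institute for Information Systems

AI summary High-quality labeled data is essential for training robust machine learning models, but annotation at scale remains expensive. This paper studies how visualizing a model's spatial uncertainty can help human annotators review labels more effectively, using localization-aware visual cues that point to regions likely to contain errors. In a controlled study with 120 participants, annotators who received the cues achieved higher label quality while being faster overall, supporting localization uncertainty as a lever for human-in-the-loop annotation.

Details
English abstract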

High-quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI-assisted annotation has therefore become standard in large-scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI-assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose-built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box-level analysis confirms that the cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes. These findings establish localization uncertainty as a lever to improve human-in-the-loop annotation. Code is available at https://mos-ks.github.io/MUHA/.

2605.12297 2026-05-13 cs.CV cs.RO eess.IV version update

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

Luming Wang, Hao Shi, Jiajun Zhai, Kailun Yang, Kaiwei Wang

Affiliations * National Research Center for Optical Instrumentation, Zhejiang University; School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University; Ant Group Company Ltd.

AI summary This paper proposes EgoEV-HandPose, an end-to-end framework based on stereo event cameras for egocentric 3D bimanual pose estimation and gesture recognition. Its core module, KeypointBEV, lifts features into a unified bird's-eye-view space and applies an iterative reprojection-guided refinement loop to resolve depth uncertainty. The authors also introduce EgoEVHands, described as the first large-scale real-world stereo event-camera dataset for egocentric hand perception, and report improved performance in low-light and bimanual-occlusion scenarios.

Comments Extended version of SMC 2025 paper arXiv:2503.12419. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose

Details
English abstract

Egocentric 3D hand pose estimation and gesture recognition are essential for immersive augmented/virtual reality, human-computer interaction, and robotics. However, conventional frame-based cameras suffer from motion blur and limited dynamic range, while existing event-based methods are hindered by ego-motion interference, monocular depth ambiguity, and the lack of large-scale real-world stereo datasets. To overcome these limitations, we propose EgoEV-HandPose, an end-to-end framework for joint 3D bimanual pose estimation and gesture recognition from stereo event streams. Central to our approach is KeypointBEV, a flexible stereo fusion module that lifts features into a canonical bird's-eye-view space and employs an iterative reprojection-guided refinement loop to progressively resolve depth uncertainty and enforce kinematic consistency. In addition, we introduce EgoEVHands, the first large-scale real-world stereo event-camera dataset for egocentric hand perception, containing 5,419 annotated sequences with dense 3D/2D keypoints across 38 gesture classes under varying illumination. Extensive experiments demonstrate that EgoEV-HandPose achieves state-of-the-art performance with an MPJPE of 30.54 mm and 86.87% Top-1 gesture recognition accuracy, significantly outperforming RGB-based stereo and prior event-camera methods, particularly in low-light and bimanual occlusion scenarios, thereby setting a new benchmark for event-based egocentric perception. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose.
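
For readers unfamiliar with the reported metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints. The sketch below shows the basic form only; whether the paper applies root alignment or other normalization before computing it is not stated in the abstract, and the toy coordinates are made up.

```python
import math

# Sketch only: plain MPJPE over one hand, with joints given as (x, y, z)
# tuples in millimetres. No alignment step is applied here.

def mpjpe(predicted, ground_truth):
    """Average Euclidean distance between corresponding joints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty joint lists of equal length")
    return sum(math.dist(p, g) for p, g in zip(predicted, ground_truth)) / len(predicted)


ground_truth = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (10.0, 0.0, 12.0)]   # off by 5 mm and 12 mm
error_mm = mpjpe(predicted, ground_truth)           # (5 + 12) / 2 = 8.5
```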

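2605.12282 2026-05-13 cs.CV version update

Large-Small Model Collaboration for Farmland Semantic Change Detection

Xinjia Li, Rui Wang, Qiurong Peng, Lingfei Ye, Dengrong Zhang, Haoyu Zhang

Affiliations * College of Information Science and Technology, Hangzhou Normal University

AI summary Targeting insufficient annotations and pseudo-change interference in fine-grained farmland semantic change detection (SCD), this paper builds HZNU-FCD, a large-scale fine-grained farmland change detection benchmark, and proposes a detection framework in which a large model and a small model collaborate. The framework pairs FD-Mamba, a task-driven small visual model, with a frozen large vision-language model, and uses cross-modal logical arbitration and a hard-region co-training strategy to improve boundary preservation and small-region change detection. The method reaches 97.63% F1 on HZNU-FCD and also performs well on LEVIR-CD and WHU-CD.

Details
English abstract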

Farmland Semantic Change Detection (SCD) is essential for cultivated land protection, yet existing benchmarks and models remain insufficient for fine-grained farmland conversion monitoring. Current datasets often lack dedicated "from-to" annotations, while visual change detection models are easily disturbed by phenology-induced pseudo-changes caused by crop rotation, seasonal variation, and illumination differences. To address these challenges, we construct HZNU-FCD, a large-scale fine-grained farmland SCD benchmark with a unified five-class farmland-to-non-farmland annotation protocol. It contains 4,588 bitemporal image pairs with pixel-level labels for practical farmland protection. Based on this benchmark, we propose a large-small collaborative SCD framework that integrates a task-driven small visual model with a frozen large vision-language model. The small model, Fine-grained Difference-aware Mamba (FD-Mamba), learns dense change representations for boundary preservation and small-region localization. The large-model pathway, Cross-modal Logical Arbitration (CMLA), introduces CLIP-based textual priors for prompt-guided semantic arbitration and pseudo-change suppression. To enable effective collaboration, we design a hard-region co-training strategy that supervises the CMLA semantic score map only on low-confidence pixels. Experiments show that our method achieves 97.63% F1, 96.32% IoU, and 96.35% SCD_IoU_mean on HZNU-FCD with only 6.65M trainable parameters. Compared with the multimodal ChangeCLIP-ViT, which leverages vision-language information for change detection, our method improves F1 by 10.19 percentage points on HZNU-FCD. It also achieves 91.43% F1 and 84.21% IoU on LEVIR-CD, and 93.85% F1 and 88.41% IoU on WHU-CD, demonstrating strong robustness and generalization. The code is available at https://github.com/Lovelymili/FD-Mamba.
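
Supervising a score map only on low-confidence pixels amounts to a masked loss. The sketch below is one possible reading, not the paper's formulation: it assumes the confidence comes from the small model, uses binary cross-entropy, and picks an arbitrary threshold of 0.7. The name `hard_region_loss` is hypothetical.

```python
import math

# Sketch only: average binary cross-entropy of the large-model score over
# pixels where the small model is unsure. Confidence source, loss form, and
# threshold are all assumptions.

def hard_region_loss(small_confidence, large_score, target, threshold=0.7):
    eps = 1e-7
    total, count = 0.0, 0
    for conf, score, y in zip(small_confidence, large_score, target):
        if conf < threshold:                       # keep hard pixels only
            s = min(max(score, eps), 1.0 - eps)    # avoid log(0)
            total += -(y * math.log(s) + (1 - y) * math.log(1.0 - s))
            count += 1
    return total / count if count else 0.0


small_confidence = [0.95, 0.40, 0.99, 0.60]   # flattened per-pixel confidence
large_score = [0.10, 0.50, 0.90, 0.50]        # flattened semantic score map
target = [1, 1, 0, 0]

loss = hard_region_loss(small_confidence, large_score, target)
```

In this toy input only the second and fourth pixels contribute, so the large-model pathway receives no gradient signal from pixels the small model already handles confidently.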

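2605.12266 2026-05-13 cs.CV version update

CAD-feature enhanced machine learning for manufacturing effort estimation on sheet metal bending parts

Matteo Ballegeer, Toon Van Camp, Willem Jaspers, Alp Bayar, Aung Nyein Soe, Martin Roelfs, Dries F. Benoit, Bieke Decraemer, Joost R. Duflou

Affiliations * Data Analytics Research Group, Ghent University; Corelab CodesignS, Flanders Make; Department of Mechanical Engineering, KU Leuven/Flanders Make

AI summary This study addresses manufacturing effort estimation for sheet metal bending parts with a hybrid approach that combines CAD features and graph-based machine learning. Manufacturing features recognized by a rule-based module, such as bend characteristics, flange lengths, and surface roles, are added to the B-rep attributed adjacency graph so that the model focuses on process-relevant geometric patterns. Experiments on a synthetic benchmark and a real-world industrial dataset show improved prediction accuracy, supporting the combination of domain knowledge and graph learning for manufacturability assessment.

Details
English abstract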

Graph-based machine learning has emerged as a promising approach for manufacturability analysis by learning directly from CAD models represented as Boundary Representations (B-reps), exploiting both surface geometry and topological connectivity. However, purely geometric representations often lack the process-specific semantics required for accurate manufacturability prediction: many manufacturing factors, such as surface roles or bend intent, are not explicitly encoded in shape alone and are difficult for data-driven models to infer reliably. We propose a hybrid approach that addresses this challenge by enriching B-rep attributed adjacency graphs with manufacturing features recognized through a rule-based module. Applied to sheet metal bending, recognized features, such as bend characteristics, flange lengths, and surface roles, are integrated as node attributes, concentrating the learning signal on process-relevant geometric patterns. Experiments on both a large-scale synthetic manufacturability benchmark and a real-world industrial dataset with measured bending times, one of the first such validations on genuine production data, demonstrate that combining domain knowledge with graph-based learning improves prediction accuracy across both tasks. The results demonstrate that hybrid modeling offers a feasible and effective path toward deployable tools for manufacturability assessment and effort estimation in industrial CAD environments.

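2605.12259 2026-05-13 cs.CV version update

From Image Hashing to Scene Change Detection

Anh-Kiet Duong, Marie-Claire Iatrides, Petra Gomez-Krämer, Jean-Michel Carozza

Affiliations * L3i Laboratory; La Rochelle University; LIENSs Laboratory; Association Ferrocampus

AI summary Image hashing enables efficient image storage and retrieval, but because it compares images globally it cannot localize where changes occur, which limits its use in scene change detection. This paper revisits image hashing from a scene change detection perspective and proposes HashSCD, a patch-wise hashing framework that performs global change detection and local change localization directly in Hamming space, without repeated inference on previous images. The model is trained without supervision using contrastive learning and achieves competitive performance while reducing computational and storage cost.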

Comments 18 pages; accepted to ICPR 2026

Details
English abstract

Image hashing provides compact representations for efficient storage and retrieval but is inherently limited to global comparison and cannot reason about where changes occur. This limitation prevents hashing from being directly applicable to scene change detection, where spatial localization is essential. In this work, we revisit hashing from a scene change detection perspective and propose HashSCD, a patch-wise hashing framework that enables both efficient global change detection and localized change identification. HashSCD encodes spatially aligned patches into compact hash codes and aggregates them through an XOR-like operation, allowing change detection and localization to be performed directly in the Hamming space without repeated inference on previous images. The model is trained in an unsupervised manner using contrastive learning at both patch and global levels. Experiments demonstrate that HashSCD achieves competitive performance compared to state-of-the-art unsupervised hashing and scene change detection methods, while significantly reducing computational cost and storage requirements.
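
The general idea of comparing stored patch codes in Hamming space can be sketched with integer bit codes. This is an illustration, not the HashSCD model: the codes are made up rather than learned, plain XOR stands in for the paper's "XOR-like" aggregation, and the `min_bits` threshold is an assumption.

```python
# Sketch only: per-patch binary codes stored as integers. Change detection
# compares stored codes, so the earlier image never needs to be re-encoded.

def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def aggregate(patch_codes):
    """Fold patch codes into one global code with XOR (an assumed choice)."""
    global_code = 0
    for code in patch_codes:
        global_code ^= code
    return global_code


def changed_patches(old_codes, new_codes, min_bits=2):
    """Indices of spatially aligned patches whose codes differ enough."""
    return [
        i for i, (old, new) in enumerate(zip(old_codes, new_codes))
        if hamming(old, new) >= min_bits
    ]


old_codes = [0b10110010, 0b01100111, 0b11110000, 0b00001111]
new_codes = [0b10110010, 0b01100111, 0b00011100, 0b00001111]

global_distance = hamming(aggregate(old_codes), aggregate(new_codes))
where = changed_patches(old_codes, new_codes)   # only patch 2 differs
```

One caveat of plain XOR folding is that two patches changing by the same bit pattern cancel in the global code, which may be one reason the paper describes its operation only as XOR-like.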

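2605.12252 2026-05-13 cs.CV version update

H3D-MarNet: Wavelet-Guided Dual-Path Learning for Metal Artifact Suppression and CT Modality Transformation for Radiotherapy Workflows

Mubashara Rehman, Niki Martinel, Michele Avanzo, Riccardo Spizzo, Christian Micheloni

Affiliations * Machine Learning and Perception Lab, Università degli Studi di Udine; Centro di Riferimento Oncologico di Aviano IRCCS

AI summary This study proposes H3D-MarNet, a two-stage framework for metal artifact suppression and CT modality transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT), aimed at improving image quality in radiotherapy workflows. The first stage is a wavelet-based preprocessing module that suppresses metal artifacts while preserving anatomical structures. The second stage, Domain-TransNet, combines a CNN encoder and a transformer encoder and fuses local detail with global context through attention. The method reaches 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices, indicating potential for clinical radiotherapy use.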

Comments Accepted for publication at the 28th International Conference on Pattern Recognition, Lyon, France, August 17-22, 2026

Details
English abstract

Metal artifacts in computed tomography (CT) severely degrade image quality, compromising diagnostic accuracy and radiotherapy planning, especially in cancer patients with high-density implants. We propose H3D-MarNet, a two-stage framework for artifact-aware CT domain transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT). In the first stage, a wavelet-based preprocessing module suppresses metal-induced artifacts through frequency-aware denoising while preserving anatomical structures. In the second stage, Domain-TransNet performs kVCT-to-MVCT domain transformation using a hybrid volumetric learning architecture. Domain-TransNet integrates a CNN-based encoder to capture fine-grained local anatomical details and a transformer-based encoder to model long-range volumetric dependencies. The complementary representations are fused through an attention-based feature fusion mechanism to ensure spatial and contextual coherence across slices. A multi-stage, attention-guided decoder, supported by deep supervision, progressively reconstructs artifact-suppressed MVCT volumes. Extensive experiments demonstrate that H3D-MarNet achieves 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices from the full dataset, indicating effective metal artifact suppression and anatomical preservation and highlighting its potential for reliable CT modality transformation in clinical radiotherapy workflows.
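
As context for the reported PSNR, the metric is 10·log10(R² / MSE), where R is the intensity range. The sketch below computes it on flattened toy values; the choice of `data_range=1.0` is an assumption, and the abstract does not say how CT intensities were normalized.

```python
import math

# Sketch only: PSNR in dB between a reference and an estimate, both given as
# flat lists of intensities scaled to [0, data_range].

def psnr(reference, estimate, data_range=1.0):
    if len(reference) != len(estimate) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf            # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


reference = [0.0, 0.5, 1.0, 0.5]
estimate = [0.1, 0.5, 0.9, 0.5]    # MSE = 0.005
value_db = psnr(reference, estimate)
```

Here the mean squared error is 0.005, so the result is 10·log10(200), roughly 23 dB.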

2605.12237 2026-05-13 cs.CV version update

UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

Shuo Ni, Tong Wang, Jing Zhang, He Chen, Haonan Guo, Ning Zhang, Bo Du

Affiliations * National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing; Beijing Institute of Technology; School of Computer Science; Wuhan University; Zhongguancun Academy; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing; Hong Kong Polytechnic University

AI summary As ultra-high-resolution (UHR) Earth observation imagery becomes widely used, vision-language models (VLMs) face a "resolution illusion": higher-resolution input appears to offer richer visual detail, yet small targets are still not perceived reliably. The paper introduces UHR-Micro, a benchmark of 11,253 instructions grounded in 1,212 UHR images for evaluating VLMs on micro-scale targets, and Micro-evidence Active Perception (MAP), which actively locates and parses task-relevant micro-evidence to improve perception of small targets. Together they provide a platform for diagnosing and improving high-resolution reasoning in Earth observation VLMs.

Details
English abstract

Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.

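2605.12218 2026-05-13 cs.CV version update

Learning Ego-Centric BEV Representations from a Perspective-Privileged View: Cross-View Supervision for Online HD Map Construction

Daniel Lengerer, Mathias Pechinger, Klaus Bogenberger, Carsten Markgraf

Affiliations * Technical University of Applied Sciences Augsburg, Germany; Technical University of Munich, Germany

AI summary This paper studies how to learn ego-centric bird's-eye-view (BEV) representations from multi-camera input for online high-definition map construction. To address the inconsistent structural reasoning that results from relying only on ego-centric supervision, the authors propose Cross-View Supervision (CVS), which transfers geometric and topological priors from an overhead perspective into the camera-based BEV encoder. CVS improves mAP by 3.9 in the standard region and 9.9 in the extended region, supporting its use for long-range map construction.

Details
English abstract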

Bird's-eye-view (BEV) representations derived from multi-camera input have become a central interface for online high-definition (HD) map construction. However, most approaches rely solely on ego-centric supervision, requiring large-scale scene structure to be inferred from incomplete observations, occlusions, and diminishing information density at long range, where perspective effects and spatial sparsity hinder consistent structural reasoning. We introduce Cross-View Supervision (CVS), a representation learning paradigm that transfers geometric and topological priors from an ego-aligned overhead perspective into camera-based BEV encoders. Rather than adding auxiliary semantic losses, CVS aligns representations in a shared BEV feature space and distills globally consistent structural knowledge from a perspective-privileged teacher into the ego-centric backbone. This supervision enhances structural coherence without modifying the inference architecture or requiring overhead input at test time. Experiments on nuScenes using ego-aligned aerial imagery from the AID4AD cross-view extension demonstrate consistent improvements over StreamMapNet while maintaining identical camera-only inference. CVS yields +3.9 mAP in the standard 60×30 m region and +9.9 mAP in the extended 100×50 m setting, corresponding to a 44% relative gain at long range. These results highlight perspective-privileged structural supervision as a promising training principle for improving BEV representation learning in HD map construction.

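2605.12198 2026-05-13 cs.CV version update

Enhancing Domain Generalization in 3D Human Pose Estimation through Controllable Generative Augmentation

Xinhao Hu, Yiyi Zhang, Liqing Zhang, Jianfu Zhang

Affiliations * School of Computer Science, Shanghai Jiao Tong University

AI summary This study addresses domain generalization in 3D human pose estimation, where training and test data distributions differ, with a controllable generative augmentation framework that synthesizes diverse video data by systematically varying poses, backgrounds, and camera viewpoints. By fusing indoor/real-world and outdoor/virtual datasets, it builds enriched training data tailored to realistic deployment settings and improves model performance on unseen scenarios and datasets.

Details
English abstract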

Pedestrian motion, due to its causal nature, is strongly influenced by domain gaps arising from discrepancies between training and testing data distributions. Focusing on 3D human pose estimation, this work presents a controllable human pose generation framework that synthesizes diverse video data by systematically varying poses, backgrounds, and camera viewpoints. This generative augmentation enriches training datasets, enhances model generalization, and alleviates the limitations of existing methods in handling domain discrepancies. By leveraging both indoor/real-world and outdoor/virtual datasets, we perform cross-domain data fusion and controllable video generation to construct enriched training data, tailored to realistic deployment settings. Extensive experiments show that the augmented datasets significantly improve model performance on unseen scenarios and datasets, validating the effectiveness of the proposed approach.

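2605.12179 2026-05-13 cs.CV version update

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

Xin Cheng, Xihua Wang, Ying Ba, Yuyue Wang, Kaisi Guan, Yinbo Wang, Wenpu Li, Ruihua Song

Affiliations * Renmin University of China; Westlake University

AI summary SyncDPO is a post-training framework that uses preference learning to improve temporal synchronization in video-audio joint generation. It introduces on-the-fly, rule-based negative construction strategies that increase the model's sensitivity to temporal misalignment while avoiding the costly sampling and ranking of conventional pipelines. Experiments on four benchmarks show improved temporal alignment and better generalization on out-of-distribution data.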

Comments Preprint. Under review

Details
English abstract

Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. Post-training for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization (DPO) offers an alternative by introducing explicit misaligned counterparts to improve temporal sensitivity. In this paper, we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of video-audio (V-A) joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model's temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at https://syncdpo.github.io/syncdpo/.
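
One simple way to build a rule-based negative on the fly is to shift the audio track against the video and shrink the shift as training progresses. The sketch below illustrates that idea only: the abstract does not list the actual distortion rules, so the shift operation, silence padding, linear schedule, and all function names are assumptions.

```python
# Sketch only: construct a (chosen, rejected) preference pair by delaying the
# audio. Frames are plain Python lists; 0.0 is used as assumed silence.

def shift_audio(audio_frames, offset):
    """Delay (offset > 0) or advance (offset < 0) audio, keeping its length."""
    n = len(audio_frames)
    if offset >= 0:
        return ([0.0] * offset + audio_frames)[:n]
    return (audio_frames + [0.0] * (-offset))[-n:]


def curriculum_offset(step, total_steps, max_offset=8, min_offset=1):
    """Linearly move from coarse misalignment to subtle misalignment."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return round(max_offset - progress * (max_offset - min_offset))


def make_pair(video_frames, audio_frames, step, total_steps):
    """Aligned clip is preferred; the temporally shifted copy is rejected."""
    offset = curriculum_offset(step, total_steps)
    chosen = (video_frames, audio_frames)
    rejected = (video_frames, shift_audio(audio_frames, offset))
    return chosen, rejected


video = ["f0", "f1", "f2", "f3"]
audio = [1.0, 2.0, 3.0, 4.0]
early_pair = make_pair(video, audio, step=0, total_steps=100)    # 8-frame shift
late_pair = make_pair(video, audio, step=100, total_steps=100)   # 1-frame shift
```

Because the rejected sample is derived from the training clip itself, no extra sampling or ranking pass is needed to obtain a preference pair.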

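2605.12169 2026-05-13 cs.CV version update

UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

Sihan Chen, Xiang Zhang, Yang Zhang, Tunc Aydin, Christopher Schroers

Affiliations * ETH Zürich; DisneyResearch|Studios

AI summary Diffusion-based approaches have become mainstream for view synthesis, but they often suffer quality degradation caused by pixel-to-latent compression and diffusion hallucination. This paper analyzes diffusion degradation along spatial, temporal, and backbone-related dimensions and proposes UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts in a coarse-to-fine manner. It consists of a reference pre-alignment module, a global structure anchoring mechanism, and a local detail injection module, which together restore geometric structure and texture detail and enable zero-shot fixing across degradation types on novel view synthesis and stereo conversion.

Details
English abstract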

With the recent surge of generative models, diffusion-based approaches have become mainstream for view synthesis tasks, either in an explicit depth-warp-inpaint or in an implicit end-to-end manner. Despite their success, both paradigms often suffer from noticeable quality degradation, e.g., blurred details and distorted structures, caused by pixel-to-latent compression and diffusion hallucination. In this paper, we investigate diffusion degradation from three key dimensions (i.e., spatial, temporal, and backbone-related) and propose UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts via a coarse-to-fine strategy. Specifically, a reference pre-alignment module is first designed to perform coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions to ensure structural fidelity, followed by a local detail injection module that recovers fine-grained texture details for high-quality view synthesis. Our UniFixer serves as a plug-and-play refiner that achieves zero-shot fixing across different types of diffusion degradation, and extensive experiments verify our state-of-the-art performance on novel view synthesis and stereo conversion.

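2605.12167 2026-05-13 cs.RO cs.CV version update

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

Yajie Li, Bozhou Zhang, Chun Gu, Zipei Ma, Jiahui Zhang, Jiankang Deng, Xiatian Zhu, Li Zhang

Affiliations * School of Data Science, Fudan University; Shanghai Innovation Institute; Imperial College London; University of Surrey

AI summary This paper studies how to turn future scenes predicted by video generation models into actions a robot can execute, addressing the mismatch between visual realism and control relevance in existing methods. The authors propose MoLA (Mixture of Latent Actions), which uses a mixture of pretrained inverse dynamics models to infer a mixture of latent actions from generated videos, giving the policy a more structured, control-oriented input. Experiments on simulated benchmarks and real-world robot tasks show gains in task success and generalization.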

Comments ICML 2026

Details
English abstract

Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.

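2605.12144 2026-05-13 cs.CV version update

PoseCompass: Intelligent Synthetic Pose Selection for Visual Localization

Yanan Zhou, Zhaoyan Qian, Yanli Li, Nan Yang, Zhongliang Guo, Dong Yuan

Affiliations * The University of Sydney; University of St Andrews

AI summary In visual localization, Absolute Pose Regression (APR) infers a camera's 6-DoF pose in real time from a single image, but its performance depends heavily on the quality and coverage of the training data. Existing data augmentation based on 3D Gaussian Splatting (3DGS) view synthesis samples poses randomly, producing redundant views and noisy samples. This paper proposes PoseCompass, an intelligent pose selection method that ranks synthetic poses by localization difficulty, coverage novelty, and rendering observability, generates trajectory-constrained candidates, and synthesizes the selected views. On 7-Scenes, PoseCompass cuts adaptation time by about 3x and reduces median pose error by 53.8 percent.

Details
English abstract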

In visual localization, Absolute Pose Regression (APR) enables real-time 6-DoF camera pose inference from single images, yet critically depends on fine-tuning data quality and coverage. While recent methods leverage 3D Gaussian Splatting (3DGS) for novel view synthesis-based data augmentation, random sampling generates redundant views and noisy samples from poorly reconstructed regions. To mitigate this research gap, we propose PoseCompass, an intelligent pose selection pipeline for 3DGS-based APR. PoseCompass formulates synthetic pose selection and derives a value-based pose ranking mechanism to identify informative poses. The ranking integrates three dimensions: Localization Difficulty, favoring challenging regions; Coverage Novelty, exploring under-sampled areas; and Rendering Observability, filtering artifacts and noise. PoseCompass then generates trajectory-constrained candidates, selects the top-K ranked poses, and synthesizes views using 3DGS with lightweight diffusion-based alignment. Finally, the pose regressor is fine-tuned on mixed real and synthetic data. We evaluate PoseCompass on 7-Scenes, where it reduces adaptation time from 15.2 to 5.1 minutes, a 3x speedup, while cutting median pose errors by 53.8 percent and significantly outperforming random baselines.
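
A value-based ranking over the three named dimensions can be sketched as a filter followed by a weighted score and a top-K cut. The abstract does not give the actual scoring function, so the hard observability cutoff, the equal weights, and the names `pose_value` and `select_top_k` are all assumptions made for illustration.

```python
# Sketch only: score candidate poses from three values in [0, 1] and keep the
# best K. Scores, weights, and the cutoff are hypothetical.

def pose_value(difficulty, novelty, observability,
               weights=(1.0, 1.0, 1.0), min_observability=0.5):
    """Return a scalar value, or None if the render is judged unreliable."""
    if observability < min_observability:
        return None                      # filter artifact-prone views
    w_d, w_n, w_o = weights
    return w_d * difficulty + w_n * novelty + w_o * observability


def select_top_k(candidates, k):
    """candidates maps pose id -> (difficulty, novelty, observability)."""
    scored = []
    for pose_id, scores in candidates.items():
        value = pose_value(*scores)
        if value is not None:
            scored.append((value, pose_id))
    scored.sort(reverse=True)
    return [pose_id for _, pose_id in scored[:k]]


candidates = {
    "pose_a": (0.90, 0.20, 0.90),   # hard region, already well covered
    "pose_b": (0.50, 0.80, 0.80),   # moderately hard, under-sampled area
    "pose_c": (0.95, 0.90, 0.20),   # attractive but poorly reconstructed
    "pose_d": (0.10, 0.10, 0.95),   # easy and redundant
}
selected = select_top_k(candidates, k=2)
```

In this toy input `pose_c` is dropped despite high difficulty and novelty because its render is unreliable, which mirrors the role the abstract assigns to Rendering Observability.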

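2605.12140 2026-05-13 cs.CV version update

EchoTracker2: Enhancing Myocardial Point Tracking by Modeling Local Motion

Md Abulkalam Azad, Vegard Holmstrøm, John Nyberg, Lasse Lovstakken, Håvard Dalen, Bjørnar Grenne, Andreas Østvik

Affiliations * Norwegian University of Science and Technology; Clinic of Cardiology, St. Olavs Hospital; SINTEF Digital

AI summary This paper proposes EchoTracker2, a myocardial point tracking method aimed at more accurate myocardial motion estimation in echocardiography. By modeling local motion, it drops the coarse initialization step of conventional two-stage architectures and uses a fine-stage-only network that combines local spatiotemporal context with long-range temporal reasoning for more robust tracking. Across several datasets it improves position accuracy and reduces trajectory error relative to prior state-of-the-art models, and it agrees better with expert-derived global longitudinal strain (GLS).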

Comments Early accepted (top 9%) to MICCAI 2026

Details
English abstract

Myocardial point tracking (MPT) has recently emerged as a promising direction for motion estimation in echocardiography, driven by advances in general-purpose point tracking methods. However, myocardial motion fundamentally differs from motion encountered in natural videos, as it arises from physiologically constrained deformation that is spatially and temporally continuous throughout the cardiac cycle. Consequently, motion trajectories typically remain locally confined despite substantial tissue deformation. Motivated by these properties, we revisit the architectural design for MPT and find that coarse initialization in commonly used two-stage coarse-to-fine architectures may be unnecessary in this domain. In this work, we propose a fine-stage-only architecture, EchoTracker2, which enriches pixel-precise features with local spatiotemporal context and integrates them with long-range joint temporal reasoning for robust tracking. Experimental results across in-distribution, out-of-distribution (OOD), and public synthetic datasets show that our model improves position accuracy by 6.5% and reduces median trajectory error by 12.2% relative to a domain-specific state-of-the-art (SOTA) model. Compared to the best general-purpose point tracking method, the improvements are 2.0% and 5.3%, respectively. Moreover, EchoTracker2 shows better agreement with expert-derived global longitudinal strain (GLS) and enhances test-rest reproducibility. Source code will be available at: https://github.com/riponazad/ptecho.

2605.12138 2026-05-13 cs.CV cs.CL cs.IR Version update

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

Yexing Xu, Wei Feng, Shen Zhang, Haohan Wang, Yuxin Qin, Yaoyu Li, Ao Ma, Yuhao Luo, Lu Wang, Xudong Ren, Haoran Wang, Run Ling, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Longguang Wang, Yulan Guo

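Affiliations * Sun Yat-Sen University; Northeastern University

AI summary Generating realistic advertising content that matches user preferences is a major challenge in e-commerce. This paper proposes Uni-AdGen, a unified autoregressive generative model that produces both personalized advertising images and text. It improves realism through a foreground perception module and instruction tuning, and uses a coarse-to-fine preference understanding module to capture user interests from multimodal historical behaviors for more precise personalization. The authors also build PAd1M, the first large-scale personalized advertising image-text dataset, and introduce the Product Background Similarity (PBS) metric; experiments show the method outperforms existing approaches on both general and personalized advertisement generation.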

Comments 22 pages, 19 figures, CVPR 2026

Details
Abstract

Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at https://github.com/JD-GenX/Uni-AdGen.

2605.12134 2026-05-13 cs.CV cs.LG Version update

MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation

Sonali Godavarthy, Matthias Neuwirth-Trapp, Tim-Felix Faasch, Maarten Bieshaar, Michael Moeller, Danda Pani Paudel

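Affiliations * Bosch Research; ETH Zürich; University of Siegen

AI summary This paper proposes MULTI, a method that addresses the difficulty of precise control in text-to-image generation caused by text ambiguity. It disentangles imaging factors such as camera lens, sensor type, viewpoint, and scene domain to give finer control over generation. The method has two stages: it first learns general imaging factors and then extracts dataset-specific ones, which supports extending existing datasets and composing novel factor combinations, reduces distribution gaps, and allows modification of specific factors and image-to-image generation via ControlNets. Experiments on the newly built DF-RICO benchmark show that MULTI performs well and highlight imaging factor disentanglement as a new research direction for image generation.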

Comments Accepted at ICPR 2026

Details
Abstract

Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. Several recent works address learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in this regime. We therefore propose a new method, Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.

2605.12122 2026-05-13 cs.LG cs.AI cs.CV Version update

Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

Hyeonjin Kim, Hangyeol Jung, Heechan Yun, Sungjun Yun, Dong-Jun Han

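Affiliations * Yonsei University; Kookmin University

AI summary This paper studies how to remove specific concepts from text-to-image diffusion models and proposes a method called SAEParate. It introduces a concept-aware contrastive objective that organizes latent representations into concept-specific clusters, enabling more precise concept suppression and reducing interference during unlearning. The authors also strengthen the encoder to increase its expressive capacity under the separation objective. Experiments show state-of-the-art unlearning performance, with especially strong results on joint style-object unlearning.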

Comments 40 pages, 23 figures

Details
Abstract

Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.

2605.12112 2026-05-13 cs.CV Version update

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

Xiaofeng Tan, Jun Liu, Bin-Bin Gao, Yuanting Fan, Xi Jiang, Chengjie Wang, Hongsong Wang, Feng Zheng

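Affiliations * Southeast University; Tencent Youtu Lab; Southern University of Science and Technology

AI summary When aligning text-to-image generative models with reinforcement learning, policy entropy constraints are commonly used to preserve diversity, but this approach fails in flow models and generation diversity drops sharply. Theoretical and empirical analysis shows that in flow models policy entropy stays constant while perceptual diversity collapses, owing to the fixed noise schedule and the mode-seeking nature of policy gradients. The paper therefore introduces perceptual entropy to capture diversity in a perceptual space and designs two entropy-regularization strategies that improve both generation quality and diversity; experiments show gains over existing methods on multiple benchmarks.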

Details
Abstract

RLHF is widely used to align flow-matching text-to-image models with human preferences, but often leads to severe diversity collapse after fine-tuning. In RL, diversity is often assumed to correlate with policy entropy, motivating entropy regularization. However, we show this intuition breaks in flow models: policy entropy remains constant, even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To this end, we introduce perceptual entropy that captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight, we propose two entropy-regularized strategies, Perceptual Entropy Constraint (PEC) and Perceptual Constraints on Generation Space, to preserve perceptual diversity and improve quality. Experiments across two base models, neural and rule-based rewards, and three perceptual spaces demonstrate consistent gains in the quality-diversity trade-off; PEC achieves the best overall score of 0.734 (vs. baseline's 0.366); a complementary setting of PEC further reaches a diversity average of 0.989 (vs. baseline's 0.047). Our project page (https://xiaofeng-tan.github.io/projects/PEC) is publicly available.
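
The claim that policy entropy stays constant can be illustrated with a small sketch. It assumes, purely for illustration, that each sampling step is an isotropic Gaussian whose standard deviation is fixed by the noise schedule; the abstract does not state the exact policy form, and the function name and numbers below are hypothetical. Under that assumption the differential entropy depends only on the standard deviation and the dimension, so moving the learned mean during fine-tuning leaves it unchanged.

```python
import math


def gaussian_policy_entropy(mean, sigma):
    """Differential entropy (in nats) of an isotropic Gaussian N(mean, sigma^2 * I).

    The mean is accepted only to make the point explicit: it never enters
    the formula, which depends on sigma and the dimension alone.
    """
    dim = len(mean)
    return 0.5 * dim * math.log(2.0 * math.pi * math.e * sigma ** 2)


# Hypothetical step of a fixed noise schedule (assumed value).
sigma_t = 0.5

# Two policies that differ only in their predicted mean, e.g. before and
# after fine-tuning has pushed samples toward a high-reward region.
mean_before = [0.0, 0.0, 0.0]
mean_after = [3.0, -1.0, 7.0]

same = gaussian_policy_entropy(mean_before, sigma_t) == gaussian_policy_entropy(mean_after, sigma_t)
print("entropy unchanged by the mean shift:", same)
```

An entropy bonus computed this way gives the optimizer no signal about where the means end up, which is consistent with the abstract's argument that diversity has to be measured in a perceptual space instead.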

2605.12090 2026-05-13 cs.RO cs.CL cs.CV Version update

World Action Models: The Next Frontier in Embodied AI

Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, Yu-Gang Jiang

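Affiliations * Fudan University; Shanghai Innovation Institute; National University of Singapore

AI summary Vision-Language-Action (VLA) models show good semantic generalization in embodied policy learning, but they mainly learn reactive observation-to-action mappings and do not explicitly model how the physical world evolves under intervention. To address this, recent work integrates predictive models of environment dynamics into the action generation pipeline, forming a new paradigm, World Action Models (WAMs), which targets the joint distribution over future states and actions. This survey systematically reviews WAM research: it defines the core concepts, distinguishes WAMs from related models, categorizes methods by architecture, learning objective, and application scenario, and analyzes the data ecosystem and evaluation methods, providing a clear framework and future directions for the field.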

Details
Abstract

Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.

2605.12077 2026-05-13 cs.CV cs.AI Version update

The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments

Ofir Itzhak Shahar, Gur Elkin, Ohad Ben-Shahar

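Affiliations * Stein Faculty of Computer and Information Science

AI summary This paper studies the move from solving standard jigsaw puzzles to the harder task of handling real archaeological fragments. To tackle the reassembly of irregularly shaped, heavily eroded fragments, the authors introduce the GAP datasets and design PuzzleFlow, a new framework based on ViT and flow matching. The method performs well on reassembling pieces with complex shapes and clearly outperforms existing methods.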

Details
Abstract

Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well-established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzle datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP when compared to both classic and recent prominent works in this domain.

2605.12074 2026-05-13 cs.CV Version update

BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

Patrick Knab, Orgest Xhelili, Inis Buzi, Drago Andres Guggiana Nilo, Mohd Saquib Khan, Lorenz Kolb, Manuel Scherzer, Kerem Yildirir, Christian Bartelt, Philipp Johannes Schubert

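Affiliations * Ramblr.ai Research; Technical University of Clausthal

AI summary BARISTA is a multi-task egocentric benchmark dataset for compositional visual understanding. It contains 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. The dataset provides detailed per-frame scene graphs with object identities, attributes, relations, hand-object interactions, and process steps, from which several zero-shot language tasks are derived, including phrase grounding, activity recognition, and temporal question answering. BARISTA offers a challenging evaluation benchmark for diagnosing model weaknesses in procedural video understanding.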

Details
Abstract

Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions, relational parsing, temporal reasoning, and step-level procedural inference. Existing benchmarks usually evaluate these capabilities separately, limiting diagnosis of why models fail on procedural tasks. We introduce BARISTA, a densely annotated egocentric dataset and benchmark of 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. BARISTA provides verified per-frame scene graphs linking persistent object identities to masks, tracks, boxes, attributes, typed relations, hand-object interactions, activities, and process steps. From these graphs, we derive zero-shot language-based tasks spanning phrase grounding, hand-object interaction recognition, referring, activity recognition, relation extraction, and temporal visual question answering. Experiments reveal strong variation across task families and no consistently dominant model family, positioning BARISTA as a challenging diagnostic benchmark for procedural video understanding. Code and dataset available at https://huggingface.co/datasets/ramblr/BARISTA.

2605.12069 2026-05-13 cs.CV cs.AI cs.LG Version update

Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection

Muhammad Aqeel, Maham Nazir, Uzair Khan, Marco Cristani, Francesco Setti

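Affiliations * Dept. of Engineering for Innovation Medicine, University of Verona, Italy; School of Computer Science and Engineering, Beihang University, China; Dept. of Computer Science, Reykjavik University, Iceland

AI summary This paper studies zero-shot anomaly detection, which requires no training on target categories. Existing methods make little use of the asymmetry between normal and anomalous data distributions, so the authors propose AVA-DINO, an anomaly-aware vision-language adaptation framework. Two specialized branches handle normal and anomalous patterns, and a text-guided routing mechanism with explicit routing regularization drives branch specialization during training; at test time the branches are combined dynamically using only the input image and predefined language descriptions, giving asymmetric activation. Experiments show state-of-the-art performance on multiple industrial and medical benchmarks and good cross-domain generalization.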

Comments Accepted to ICIP 2026

Details
Abstract

Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions: compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO

2605.12064 2026-05-13 cs.CV Version update

TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images

Zhuoyu Cai, Dou Quan, Ning Huyan, Pei He, Shuang Wang, Licheng Jiao

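Affiliations * Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, School of Artificial Intelligence, Xidian University; Department of Automation, Tsinghua University

AI summary This paper proposes TAR, a text semantic-assisted cross-modal image registration framework for optical and synthetic aperture radar (SAR) images. By introducing text semantic priors on remote sensing scenes and land-cover types, the method reduces the modality gap between optical and SAR images and strengthens cross-modal feature learning. TAR comprises three modules: multi-scale visual feature learning, text-assisted feature enhancement, and coarse-to-fine dense matching. Experiments show that it achieves better registration performance than existing methods even under large deformations.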

Details
Abstract

Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.

2605.12038 2026-05-13 cs.CV Version update

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, Mike Zheng Shou

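Affiliations * Show Lab, National University of Singapore

AI summary OmniHumanoid is a streaming generation framework for cross-embodiment video generation, aimed at transferring motion from human to robot or robot to robot. By separating transferable motion learning from embodiment-specific adaptation, it avoids the factor entanglement and reliance on paired data of earlier methods and adapts to a new embodiment using only unpaired videos. The work also introduces a branch-isolated attention mechanism and builds a synthetic dataset spanning multiple embodiments and scenes. Experiments show strong motion fidelity and embodiment consistency, and the method extends to new robots without retraining the shared motion model.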

Details
Abstract

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.

2605.12031 2026-05-13 cs.LG cs.CV Version update

Resilient Vision-Tabular Multimodal Learning under Modality Missingness

Camillo Maria Caruso, Valerio Guarrasi, Paolo Soda

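Affiliations * Research Unit of Artificial Intelligence and Computer Systems, Department of Engineering, Università Campus Bio-Medico di Roma; Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University

AI summary This study addresses the modality missingness common in medical multimodal learning and proposes a joint vision-tabular learning framework that needs neither data imputation nor heuristic switching. Unimodal representations are weighted by learnable modality tokens and fused through intermediate fusion with masked self-attention, which excludes missing modalities and features. A modality-dropout regularization strategy further improves robustness. Experiments show that the method outperforms existing baselines across missingness scenarios, with more stable performance and stronger robustness.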

Details
Abstract

Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision, a tabular, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities from information aggregation and gradient propagation. To further enhance resilience, we introduce a modality-dropout regularization strategy that stochastically removes available modalities during training, encouraging the model to exploit complementary information under partial data availability. We evaluate our approach on the MIMIC-CXR dataset paired with structured clinical data from MIMIC-IV for multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel systematic stress-test protocols progressively increase training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all missingness regimes, the proposed method consistently outperforms representative baselines, showing smoother performance degradation and improved robustness. Ablation studies further demonstrate that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
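
The masking idea can be sketched for a single query in plain Python. This is a minimal illustration, not the paper's architecture: the helper names `masked_softmax` and `attend`, the toy vectors, and the boolean `present` mask are all assumptions. Tokens marked as missing receive exactly zero attention weight, so whatever placeholder value they carry cannot influence the fused output.

```python
import math


def masked_softmax(scores, present):
    """Softmax over `scores`, restricted to positions where `present` is True."""
    kept = [s for s, p in zip(scores, present) if p]
    if not kept:
        # Nothing to attend to: return all-zero weights.
        return [0.0] * len(scores)
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if p else 0.0 for s, p in zip(scores, present)]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, present):
    """Scaled dot-product attention for one query, skipping missing tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = masked_softmax(scores, present)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy example: the third token (say, a missing tabular feature) is masked out.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
present = [True, True, False]

out_a = attend(query, keys, [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]], present)
out_b = attend(query, keys, [[1.0, 0.0], [0.0, 1.0], [-5.0, 42.0]], present)
print("output independent of the masked token:", out_a == out_b)
```

In an automatic-differentiation framework a zero weight of this kind would also block gradients through the masked token, which matches the abstract's description of excluding missing inputs from both aggregation and gradient propagation.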

2605.12027 2026-05-13 cs.CV Version update

4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

Ying Zang, Xuanyi Liu, Yidong Han, Deyi Ji, Chaotao Ding, Yuanqi Hu, Qi Zhu, Xuanfu Li, Jin Ma, Lingyun Sun, Tianrun Chen, Lanyun Zhu

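Affiliations * Peking University; Zhejiang University; Huzhou University; Huawei; Tongji University

AI summary This paper proposes 4DVGGT-D, a 4D visual geometry transformer for reconstructing dynamic 4D scenes from monocular video. Its core is a training-free progressive decoupling framework that separates dynamic from static elements to improve the stability and accuracy of depth estimation. The method has three key modules: dynamic-mask-guided pose decoupling, topological subspace surgery, and information-theoretic confidence-aware fusion, which together improve the quality and robustness of 4D reconstruction.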

Details
Abstract

Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
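
Inverse-variance weighting, named in component (3), can be sketched for a single pixel as follows. This is a generic textbook form under the assumption of independent Gaussian errors per pass; the function name, the `eps` guard, and the sample numbers are hypothetical, and the paper's actual fusion may differ in how variances are estimated.

```python
def inverse_variance_fuse(depths, variances, eps=1e-12):
    """Fuse several depth estimates of one pixel by inverse-variance weighting.

    Each estimate is weighted by 1 / variance, so confident (low-variance)
    passes dominate. Returns the fused depth and the variance of the fused
    estimate under the independent-Gaussian assumption.
    """
    weights = [1.0 / (v + eps) for v in variances]  # eps guards against v == 0
    total = sum(weights)
    fused = sum(w * d for w, d in zip(weights, depths)) / total
    fused_variance = 1.0 / total
    return fused, fused_variance


# Two passes over the same pixel: the first is three times more certain.
depth, var = inverse_variance_fuse([2.0, 4.0], [1.0, 3.0])
print(depth, var)
```

With equal variances the rule reduces to a plain average; as one pass becomes more uncertain its influence shrinks smoothly rather than being switched off by a hard threshold.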

2605.12026 2026-05-13 cs.CV cs.AI eess.SP Version update

Spectral Vision Transformer for Efficient Tokenization with Limited Data

Alexandra G. Roberts, Maneesh John, Jinwei Zhang, Dominick Romano, Mert Sisman, Ki Sueng Choi, Heejong Kim, Mert R. Sabuncu, Thanh D. Nguyen, Alexey V. Dimov, Pascal Spincemaille, Brian H. Kopell, Yi Wang

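AI summary This paper proposes a new spectral vision transformer architecture for efficient image tokenization when data is limited, with a particular focus on medical imaging. The choice of spectral basis functions brings theoretical advantages such as spatial invariance and optimal signal-to-noise ratio, and the spectral projection reduces model complexity. Experiments show that, compared with a range of mainstream models, the method achieves comparable or better performance with fewer parameters and applies to several types of datasets.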

Details
Abstract

We propose a novel spectral vision transformer architecture for efficient tokenization with limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis, including spatial invariance and optimal signal-to-noise ratio. We show reduced complexity arising from the spectral projection compared to spatial vision transformers. We show comparable or superior performance with a reduced number of parameters as compared to a variety of models including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at: github.com/agr78/spectralViT.

2605.12021 2026-05-13 cs.CV Version update

What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

Ryota Yoshihashi, Masahiro Kada, Satoshi Ikehata, Rei Kawakami, Ikuro Sato

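Affiliations * Institute of Science Tokyo; DENSO IT Laboratory; National Institute of Informatics

AI summary This paper proposes the What-Where Transformer (WWT), a visual backbone that learns object appearance (what) and location (where) at the same time. Building on the what-where separation inductive bias, it uses a multi-stream architecture that processes object representations and attention maps separately, giving a decoupled representation of appearance and spatial location. Experiments show that WWT discovers multiple objects directly from raw attention maps without extra postprocessing and performs better on tasks such as zero-shot object discovery and weakly supervised semantic segmentation.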

Details
Abstract

Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.

2605.12017 2026-05-13 cs.CV Version update

FAME: Feature Activation Map Explanation on Image Classification and Face Recognition

Xinyi Zhang, Manuel Günther

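Affiliations * Department of Informatics, University of Zurich

AI summary This paper proposes FAME, a feature activation map explanation method for image classification and face recognition, aimed at improving the interpretability of deep learning models. FAME combines the strengths of gradient-based feature-map methods and perturbation methods: it manipulates the input image in a gradient-driven way rather than with fixed patches, producing more accurate pixel-level attribution maps. Experiments show that FAME outperforms conventional CAM methods on deep networks and is competitive in both qualitative and quantitative evaluations.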

Comments Accepted for CVPR Workshop 2026

Details
Abstract

Deep learning has revolutionized machine learning, reaching unprecedented levels of accuracy, but at the cost of reduced interpretability. Especially in image processing systems, deep networks transform local pixel information into more global concepts in a highly obscured manner. Explainable AI methods for image processing try to shed light on this issue by highlighting the regions of the image that are important for the prediction task. Among these, Class Activation Mapping (CAM) and its gradient-based variants compute attributions based on the feature map and upscale them to the image resolution, assuming that feature map locations are influenced only by underlying regions. Perturbation-based methods, such as CorrRISE, on the other hand, try to provide pixel-level attributions by perturbing the input with fixed patches and checking how the output of the network changes. In this work, we propose Feature Activation Map Explanation (FAME), which combines both worlds by using network gradients to compute changes to the input image, manipulating it in a gradient-driven way rather than using fixed patches. We apply this technique to two common tasks, image classification and face recognition, and show that CAM's above-mentioned assumption does not hold for deeper networks. We show qualitatively and quantitatively that FAME produces attribution maps that are competitive with state-of-the-art systems. Our code is available: https://github.com/AIML-IfI/fame.

2605.12013 2026-05-13 cs.CV cs.AI Version update

L2P: Unlocking Latent Potential for Pixel Generation

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Jiawei Chen, Zhuoqi Zeng, Wei Zhang, Chengjie Wang, Jian Yang, Ying Tai

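Affiliations * Nanjing University; Tencent Youtu Lab; Hainan-biuh; Weess Gmbh

AI summary This paper proposes L2P, an efficient pixel generation framework that addresses the high compute and data cost of training high-fidelity pixel-space models from scratch. L2P directly reuses the knowledge of a pre-trained latent diffusion model (LDM): it replaces the VAE with large-patch tokenization, freezes the LDM's intermediate layers, and trains only the shallow layers to learn the latent-to-pixel mapping. It trains solely on synthetic images generated by the LDM, with no real-data collection, and converges quickly using only 8 GPUs; removing the VAE also enables native 4K ultra-high-resolution generation. Experiments show performance close to the source model and strong results on multiple benchmarks.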

Comments project page: https://nju-pcalab.github.io/projects/L2P/

Details
Abstract

Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.

2605.12006 2026-05-13 cs.CV Version update

Robust Promptable Video Object Segmentation

Sohyun Lee, Yeho Gwon, Lukas Hoyer, Konrad Schindler, Christos Sakaridis, Suha Kwak

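Affiliations * POSTECH; Google; ETH Zürich

AI summary This paper studies the performance drop of promptable video object segmentation (PVOS) models under input corruptions and presents the first comprehensive study of robust PVOS (RobustPVOS). The authors build a comprehensive benchmark of 351 video clips and more than 2,500 object masks covering a range of real-world adverse conditions, and generate synthetic training data with diverse, temporally varying corruptions. They propose MoGA, a new RobustPVOS method that uses object-specific representations held in memory to handle per-object degradation while keeping predictions temporally consistent. Experiments show significant gains under many corruption types, providing a strong baseline for future RobustPVOS research.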

Comments Accepted to CVPR 2026

详情
英文摘要

The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun-l.github.io/RobustPVOS_project_page/.

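2605.12002 2026-05-13 cs.CV Version update

EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

Minh-Khoa Le-Phan, Minh-Hoang Le, Minh-Triet Tran, Trong-Le Do

Affiliations * University of Science - VNU-HCM; Vietnam National University

AI summary This paper proposes EDGER, an image forgery localization method that addresses the challenge posed by text-guided inpainting and improves cross-domain detection. It uses a dual-branch framework that combines frequency-based edge detection with synthetic heatmap localization, localizing forged regions at the pixel level and the patch level respectively, so that localization stays accurate at native resolution. In the MediaEval 2025 SynthIM setting, EDGER shows strong cross-domain generalization and scales to high-resolution images.

Comments Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)

Details
English abstract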

Text-guided inpainting has made image forgery increasingly realistic, challenging both synthetic image detection (SID) and image forgery localization (IFL). However, existing methods often struggle to point out suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary-resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is most informative when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the setting of the Manipulated Region Localization task of the MediaEval 2025 SynthIM challenge, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.

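2605.11977 2026-05-13 cs.CV Version update

Optimizing 4D Wires for Sparse 3D Abstraction

Dong-Yi Wu, Tong-Yee Lee

Affiliations * National Cheng Kung University

AI summary This paper proposes a unified framework for 3D geometric abstraction based on a single continuous 4D curve (a B-spline) parameterized by spatial coordinates and variable width. Unlike prior methods, whose many independent curve segments fragment the structure, the method enforces global topological coherence and yields cleaner, more structurally coherent 3D abstractions. It introduces a differentiable rendering pipeline that supports gradient-based optimization, and reports higher semantic fidelity and structural coherence on tasks such as image-to-3D abstraction and multi-view wire art generation.

Details
English abstract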

We present a unified framework for 3D geometric abstraction using a single continuous 4D wire, parameterized as a B-spline with spatial coordinates and variable width $(x,y,z,w)$. Existing approaches typically represent shapes as collections of many independent curve segments, which often leads to fragmented structures and limited physical realizability. In contrast, we show that a single continuous spline is sufficiently expressive to capture complex volumetric forms while enforcing global topological coherence. By imposing continuity, our method transforms 3D sketching from a local density-accumulation process into a global routing problem, providing a strong inductive bias toward cleaner aesthetics and improved structural coherence. To enable gradient-based optimization, we introduce a differentiable rendering pipeline that efficiently rasterizes variable-width curves with bounded projection error. This formulation supports robust optimization using modern guidance signals such as Score Distillation Sampling (SDS) or CLIP. We demonstrate applications including image-to-3D abstraction, multi-view wire art generation, and differentiable stylized surface filling. Experiments show that our unified representation produces structures with higher semantic fidelity and improved structural coherence compared to approaches based on collections of discrete curves.
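
As a rough illustration of the $(x,y,z,w)$ parameterization, the sketch below evaluates a uniform cubic B-spline whose control points carry a width as a fourth coordinate. The abstract does not state the spline degree or knot vector, so the cubic uniform basis and the helper names `cubic_bspline_point` and `sample_wire` are assumptions made for this example only.

```python
def cubic_bspline_point(ctrl, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1].

    ctrl: four control points, each an (x, y, z, w) tuple where w is the
    wire width. Returns the blended (x, y, z, w) tuple.
    """
    if len(ctrl) != 4:
        raise ValueError("a cubic segment needs exactly four control points")
    # Uniform cubic B-spline basis functions; they sum to 1 for every t.
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
    b3 = t**3 / 6
    weights = (b0, b1, b2, b3)
    return tuple(sum(wt * p[k] for wt, p in zip(weights, ctrl)) for k in range(4))


def sample_wire(ctrl_points, samples_per_segment=8):
    """Sample a single continuous wire by sliding a 4-point window."""
    points = []
    for i in range(len(ctrl_points) - 3):
        window = ctrl_points[i:i + 4]
        for s in range(samples_per_segment):
            points.append(cubic_bspline_point(window, s / samples_per_segment))
    return points
```

Because position and width share the same basis functions, the width varies as smoothly along the wire as the position does.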

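2605.11967 2026-05-13 cs.CV Version update

H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes

ByungHa Ko, Youngmin Lee, Dong Hwan Kim

Affiliations * Department of Computer Science and Engineering, Korea University; Intelligence and Interaction Research Center, Korea Institute of Science and Technology

AI summary This paper proposes H2G, a hierarchy-aware hyperbolic grouping method for multi-granularity grouping of 3D scenes without semantic labels. It converts affinity cues from 2D foundation models into hierarchical supervision and embeds that supervision in a hyperbolic feature field, which is well suited to tree-like structure. A hierarchy-aware objective jointly models fine-grained parts, object structure, and hierarchical ordering, so that multiple grouping levels are represented in a single feature space.

Details
English abstract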

Hierarchical 3D grouping aims to recover scene groups across multiple granularities, from fine object parts to complete objects, without relying on semantic labels or a fixed vocabulary. The main challenge is to transform 2D foundation-model cues into coherent hierarchy supervision and embed that hierarchy in a 3D representation. We propose H2G, a hyperbolic affinity field for hierarchical 3D grouping. Our method derives semantically organized tree supervision by interpreting foundation-model affinities through Dasgupta's objective for similarity-based hierarchical clustering. This supervision is distilled into a single Lorentz hyperbolic feature field, whose geometry is well suited for tree-like branching structures. A hierarchy-aware objective aligns the field with fine-level assignments, coarse object structure, compact feature clusters, and LCA (Lowest Common Ancestor) ordering. This formulation represents multiple grouping levels in one feature space, enabling semantic hierarchical grouping grounded in 2D foundation-model knowledge.
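
For readers unfamiliar with the Lorentz model, the following sketch shows the standard geodesic distance on the unit hyperboloid (curvature -1). It is textbook hyperbolic geometry, not the H2G objective; the curvature value and the helper names are assumptions for illustration.

```python
import math


def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the hyperboloid <x, x>_L = -1 with x0 > 0."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time component plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance between two hyperboloid points (curvature -1)."""
    # Clamp to 1.0 so rounding error never pushes acosh outside its domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
d = lorentz_distance(origin, point)  # geodesic distance from the origin
```

Distances grow roughly exponentially with the Euclidean norm of the lifted vector, which is the property that makes hyperbolic space a natural fit for tree-like branching.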

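2605.11963 2026-05-13 cs.CV Version update

What Does It Mean for a Medical AI System to Be Right?

Antony Gitau

Affiliations * University of South-Eastern Norway

AI summary This paper examines what it means for a medical AI system to be "right", using the automatic classification of plasma cells in bone marrow smears for multiple myeloma diagnosis as a case study. The author argues that correctness in medical AI is not determined by benchmark performance alone but is a multi-dimensional concept involving data labeling, model explainability, the clinical relevance of metrics, and the allocation of accountability in human-AI collaboration. Drawing on philosophy of science and research ethics, the paper highlights four issues: unstable ground-truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and automation bias under time pressure.

Comments Part of a PhD ethics course

Details
English abstract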

This paper examines what it means for a medical AI system to be right by grounding the question in a specific clinical context: the automatic classification of plasma cells in digitized bone marrow smears for the diagnosis of multiple myeloma. Drawing on philosophy of science and research ethics, the paper argues that correctness in medical AI is not a singular property reducible to benchmark performance, but a multi-dimensional concept involving the availability of expertly labeled medical datasets, the explainability and interpretability of model outputs, the clinical meaningfulness of evaluation metrics, and the distribution of accountability in human-AI workflows. As such, the paper develops this argument through four interrelated themes: the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias in time-pressured clinical settings.

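2605.11960 2026-05-13 cs.CV Version update

Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

Gengluo Li, Shangpin Peng, Xingyu Wan, Chengquan Zhang, Hao Feng, Xin Xu, Pian Wu, Bang Li, Zengmao Ding, Yongge Liu, Yipei Ye, Yang Yang, Zhan Shu, Guojun Yan, Zhe Li, Can Ma, Weiping Wang, Yu Zhou, Han Hu

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; Tencent; Anyang Normal University; The Palace Museum; Nankai University

AI summary This work introduces Chronicles-OCR, the first comprehensive benchmark for evaluating the cross-temporal perception of vision large language models (VLLMs), focused on the visual challenges posed by the evolution of Chinese characters across the Seven Chinese Scripts. The dataset contains 2,800 strictly balanced images on media ranging from tortoise shells to paper. Using a Stage-Adaptive Annotation Paradigm, it defines four tasks: cross-period character spotting, fine-grained archaic character recognition, ancient text parsing, and script classification. The benchmark aims to expose the limitations of current models in perceiving historical text and to encourage more robust, evolution-aware perception research.

Details
English abstract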

Vision Large Language Models (VLLMs) have achieved remarkable success in modern text-rich visual understanding. However, their perceptual robustness in the face of the continuous morphological evolution of historical writing systems remains largely unexplored. Existing ancient text datasets typically focus on isolated historical periods, failing to capture the systematic visual distribution shifts spanning thousands of years. To bridge this gap and empower Digital Humanities, we introduce Chronicles-OCR, the first comprehensive benchmark specifically designed to evaluate the cross-temporal visual perception capabilities of VLLMs across the complete evolutionary trajectory of Chinese characters, known as the Seven Chinese Scripts. Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images encompassing highly diverse physical media, ranging from tortoise shells to paper-based calligraphy. To accommodate the drastic morphological and topological variations across different historical stages, we propose a novel Stage-Adaptive Annotation Paradigm. Based on this, Chronicles-OCR formulates four rigorous quantitative tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, Chronicles-OCR provides an authoritative platform to expose the limitations of current VLLMs, paving the way for robust, evolution-aware historical text perception. Chronicles-OCR is publicly available at https://github.com/VirtualLUOUCAS/Chronicles-OCR.

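2605.11959 2026-05-13 cs.CV cs.CL Version update

Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models

Maham Nazir, Muhammad Aqeel, Richong Zhang, Francesco Setti

Affiliations * Beihang University, Beijing, China; University of Verona, Italy

AI summary This paper studies multimodal abstractive summarization of instructional videos with vision-language models. The authors propose ClipSum, a framework that combines frozen CLIP visual features with explicit temporal modeling and dimension-adaptive fusion. On YouCook2, ClipSum achieves a higher ROUGE-1 score than a ResNet-152 baseline, supporting the importance of semantic alignment in cross-modal tasks.

Comments Accepted to ICPR 2026

Details
English abstract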

Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. Repository: https://github.com/aqeeelmirza/clipsum

2605.11939 2026-05-13 cs.CV Version update

Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

Boyang Guo, Liang Li, Lin Peng, Yuhan Gao, Xichun Sheng, Chenggang Yan

Affiliations * Hangzhou Dianzi University; Institute of Computing Technology, Chinese Academy of Sciences; The Hong Kong Polytechnic University; Macao Polytechnic University; Zhejiang Provincial Key Laboratory of Low Altitude Ubiquitous Networking Technology, HDU

AI summary This paper proposes Cluster-Aware Neural Collapse Prompt Tuning (CPT), which improves the generalization of vision-language models on long-tailed datasets. The method builds a cluster-invariant space and adds neural-collapse-driven discriminability optimization, improving the separability of tail classes while preserving overall generalization. Experiments on 11 datasets show that CPT outperforms existing methods, especially on long-tail classes.

Details
English abstract

Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse-driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.
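
The Equiangular Tight Frame geometry referenced by the ETF separation loss can be made concrete with a small sketch. It builds the standard simplex ETF for K classes, in which every pair of unit-norm class vectors has cosine -1/(K-1). How CPT turns this target into a loss is not specified in the abstract, so the code covers only the target geometry, and the function names are illustrative.

```python
import math


def simplex_etf(num_classes):
    """Return K unit vectors in R^K forming a simplex ETF.

    Row i equals sqrt(K / (K - 1)) * (e_i - 1/K * ones), the standard
    construction used in the neural collapse literature.
    """
    k = num_classes
    if k < 2:
        raise ValueError("an ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


etf = simplex_etf(4)
pairwise = cosine(etf[0], etf[1])  # equals -1 / (K - 1) up to rounding
```

All class vectors are maximally and equally separated, so no class, head or tail, gets a smaller angular margin than another.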

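2605.11934 2026-05-13 cs.CV Version update

Interactive State Space Model with Cross-Modal Local Scanning for Depth Super-Resolution

Chen Wu, Ling Wang, Zhuoran Zheng, Xiangyu Chen, Jingyuan Xia, Weidong Jiang, Jiantao Zhou

Affiliations * National University of Defense Technology; The Hong Kong University of Science and Technology (Guangzhou); Sun Yat-sen University; Institute of Artificial Intelligence (TeleAI), China Telecom; University of Macau

AI summary This paper studies guided depth super-resolution (GDSR), which reconstructs a high-resolution depth map from a low-resolution one under high-resolution RGB guidance. To resolve the tension in existing methods between efficient cross-modal modeling and semantic interaction, the authors propose a GDSR framework built on an Interactive State Space Model. A cross-modal local scanning mechanism enables fine-grained semantic interaction between RGB and depth features, and the Mamba architecture provides global modeling with linear complexity. Experiments show competitive performance against state-of-the-art methods.

Comments ISCAS 2026

Details
English abstract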

Guided depth super-resolution (GDSR) reconstructs HR depth maps from LR inputs with HR RGB guidance. Existing methods either model each modality independently or rely on computationally expensive attention mechanisms with quadratic complexity, hindering the establishment of efficient and semantically interactive joint representations. In this paper, we observe that feature maps from different modalities exhibit semantic-level correlations during feature extraction. This motivates us to develop a more flexible approach enabling dense, semantically-aware deep interactions between modalities. To this end, we propose a novel GDSR framework centered around the Interactive State Space Model. Specifically, we design a cross-modal local scanning mechanism that enables fine-grained semantic interactions between RGB and depth features. Leveraging the Mamba architecture, our framework achieves global modeling with linear complexity. Furthermore, a cross-modal matching transform module is introduced to enhance interactive modeling quality by utilizing representative features from both modalities. Extensive experiments demonstrate competitive performance against state-of-the-art methods.

2605.11931 2026-05-13 cs.CV Version update

Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

Qihuang Zhong, Liang Ding, Wenjie Xuan, Juhua Liu, Bo Du, Dacheng Tao

Affiliations * School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; The University of Sydney, Australia; Nanyang Technological University, Singapore

AI summary This paper studies how self-improvement training can strengthen the reasoning of multimodal large language models (MLLMs). To address data imbalance and language prior bias in existing methods, it proposes VISTA, a vision-aware self-improvement training framework that uses a prefix resampling strategy and a vision-aware attention score to increase the model's attention to and use of visual information. Experiments show that VISTA improves multimodal reasoning across a range of MLLMs and tasks.

Comments Accepted by ICML 2026

Details
English abstract

Post-training with explicit reasoning traces is a common way to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.

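2605.11927 2026-05-13 cs.CV Version update

RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation

Qi Zhao, Jun Chen, Ivor Tsang, Guang Dai

Affiliations * Xi’an Jiaotong University; Zhejiang Normal University; SGIT AI Lab, State Grid Corporation of China; Centre for Frontier AI Research (CFAR), A*STAR, Singapore

AI summary RealDiffusion is a physics-informed attention framework for multi-character storybook generation. It targets the trade-off between narrative dynamism and character coherence that diffusion models face when generating image sequences. The method uses heat diffusion as a dissipative prior and combines it with a region-aware stochastic process, which suppresses attribute drift and stabilizes identity across frames; modeling feature evolution as a configurable physical system regularizes spatio-temporal relationships. Experiments show substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art methods.

Comments CVPR 2026

Details
English abstract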

While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at https://github.com/ShmilyQi-CN/RealDiffusion.
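
The idea of heat diffusion as a dissipative prior that averages neighboring features along the sequence can be sketched with one explicit finite-difference step of the 1D heat equation. This is a generic discretization, not the paper's attention mechanism; the function name `heat_step`, the zero-flux boundary handling, and the `alpha` parameter are assumptions for illustration.

```python
def heat_step(frames, alpha=0.25):
    """Apply one explicit heat-equation step along the frame axis.

    frames: list of per-frame feature vectors (lists of floats).
    alpha: diffusion strength; values in (0, 0.5] keep the explicit
           scheme stable.
    Boundary frames reuse their own value as the missing neighbor,
    which corresponds to a zero-flux boundary condition.
    """
    n = len(frames)
    smoothed = []
    for i in range(n):
        left = frames[max(i - 1, 0)]
        right = frames[min(i + 1, n - 1)]
        smoothed.append(
            [c + alpha * (l - 2 * c + r) for l, c, r in zip(left, frames[i], right)]
        )
    return smoothed


# A feature spike in the middle frame spreads to its neighbors.
spread = heat_step([[0.0], [1.0], [0.0]], alpha=0.25)
```

Repeated steps damp frame-to-frame fluctuations while leaving a constant sequence unchanged, which matches the abstract's description of removing high-frequency noise while keeping identity stable.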

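2605.11913 2026-05-13 cs.CV Version update

Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization

Jaerin Lee, Kanggeon Lee, Kyoung Mu Lee

Affiliations * Dept. of ECE & ASRI, Seoul National University, Korea; IPAI, Seoul National University, Korea

AI summary This paper proposes Vector Scaffolding, a hierarchical optimization framework that addresses topology collapse in differentiable image vectorization. Flat pixel-level optimization tends to distort structure; the method introduces Interior Gradient Aggregation, Progressive Stratification, and Rapid Inflation Scheduling to stabilize the learning of multi-scale curve mixtures. Experiments show faster optimization (2.5x) and better fidelity (up to 1.4 dB PSNR) than the previous state of the art.

Comments 22 pages, 12 figures

Details
English abstract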

Differentiable vector graphics have enabled powerful gradient-based optimization of vector primitives directly from raster images. However, existing frameworks formulate this as a flat optimization problem, forcing hundreds to thousands of randomly initialized curves to blindly compete for pixel-level error reduction. This disordered optimization leads to topology collapse, where macroscopic structures are distorted by internal high-frequency noise, resulting in a redundant and uneditable "polygon soup" that limits practical editability. To address this limitation, we propose Vector Scaffolding, a novel hierarchical optimization framework that shifts from flat pixel-matching to structured topological construction tailored for vector graphics. By identifying a key cause of topology collapse as the mathematical imbalance between area and boundary gradients, we introduce Interior Gradient Aggregation to stabilize the learning dynamics of multi-scale curve mixtures. Upon this stabilized landscape, we employ Progressive Stratification and Rapid Inflation Scheduling to progressively densify vector primitives with extremely high learning rates ($\times 50$). Experiments demonstrate that our approach accelerates optimization by $2.5\times$ while simultaneously improving PSNR by up to 1.4 dB over the previous state of the art.

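2605.11904 2026-05-13 cs.CV cs.AI Version update

Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning

Huiyu Yi, Zhiming Xu, Dunwei Tu, Zhicheng Wang, Baile Xu, Furao Shen

Affiliations * National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Computer Science, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China

AI summary This paper addresses the weakness of the Nearest Class Mean (NCM) classifier in class-incremental learning (CIL), where feature drift and non-linear structure degrade its performance, and proposes HC-SOINN, a topology-aware hierarchical classifier. HC-SOINN captures the topological structure of class manifolds through a "local-to-global" representation, and the Structure-Topology Alignment via Residuals (STAR) method adapts the learned topology to complex non-linear feature drift. Replacing the original classifiers in seven state-of-the-art methods yields consistent improvements.

Comments Accepted by ICML 2026

Details
English abstract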

The Nearest Class Mean (NCM) classifier is widely favored in Class-Incremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM's optimality by assuming features collapse into single points, non-linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single-point NCM suboptimal. To address this, we propose Hierarchical-Cluster SOINN (HC-SOINN), a novel classifier that captures the topological structure of these manifolds via a "local-to-global" representation. Furthermore, we introduce the Structure-Topology Alignment via Residuals (STAR) method, which employs a fine-grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non-linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework's resilience to manifold deformations. We integrated HC-SOINN into seven state-of-the-art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at https://github.com/yhyet/HC_SOINN.
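
For context, the baseline that HC-SOINN replaces can be written in a few lines. The sketch below is a plain Nearest Class Mean classifier with Euclidean distance; it is the single-point baseline described in the abstract, not HC-SOINN itself, and the helper names are illustrative.

```python
import math


def class_means(features, labels):
    """Compute one mean feature vector (prototype) per class label."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def ncm_predict(means, vec):
    """Return the label whose class mean is closest to vec."""
    return min(means, key=lambda lab: math.dist(means[lab], vec))


features = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = ["a", "a", "b", "b"]
means = class_means(features, labels)
prediction = ncm_predict(means, [1.0, 1.0])
```

Each class is summarized by a single point, which is exactly the assumption the abstract argues breaks down when classes form curved manifolds under feature drift.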

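2605.11900 2026-05-13 cs.CV Version update

Mobile Traffic Camera Calibration from Road Geometry for UAV-Based Traffic Surveillance

Alexey Popov, Natalia Trukhina, Vadim Vashkelis

Affiliations * Embedded Intelligence Lab

AI summary This paper studies how to calibrate UAV traffic video from road geometry to produce a bird's-eye-view (BEV) representation usable for traffic analytics. Visible road features such as lane markings and road borders are used to estimate a homography from image coordinates to ground-plane coordinates; vehicle detections are then projected into BEV to estimate trajectories, speed, heading, and 3D cuboids. The method is evaluated on the UAVDT dataset and demonstrates that interpretable traffic analytics can be derived from monocular UAV video. It also notes limitations: far-field vehicles are sensitive to homography errors, and automatic calibration is not yet reliable.

Details
English abstract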

Unmanned aerial vehicles (UAVs) can provide flexible traffic surveillance where fixed roadside cameras are unavailable, costly, or impractical. However, raw UAV video is difficult to use for traffic analytics because vehicle motion is observed in perspective image coordinates rather than in a stable metric road coordinate system. This paper presents a lightweight pipeline for converting monocular oblique UAV traffic video into a local metric bird's-eye-view (BEV) representation. Visible road geometry, including lane markings, road borders, and crosswalks, is used to estimate a road-plane homography from image coordinates to metric ground-plane coordinates. Vehicle observations from dataset annotations or detectors are then projected to BEV using estimated ground contact points. The resulting trajectories support estimation of vehicle direction, speed, heading, and dynamic 3D cuboids on the road plane. We evaluate the pipeline on UAVDT using ground-truth annotations to isolate calibration and geometric reconstruction from detector and tracker errors. For sequence M1401, 40 sampled frames from img000001-img000196 produce 632 metric cuboid instances across 23 tracks. Results show that road-geometry calibration can transform monocular UAV footage into interpretable traffic-camera-style analytics, including BEV tracks and synchronized 3D cuboid visualizations. They also reveal key limitations: far-field vehicles are sensitive to homography errors, manual validation is currently more reliable than fully automatic calibration, and the single-plane assumption limits performance in non-planar or ambiguous road regions. The proposed pipeline provides a practical foundation for deployable UAV traffic cameras and future real-time traffic digital-twin systems.
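
The projection step can be sketched as follows: a 3x3 road-plane homography maps a pixel to metric ground-plane coordinates, and speed follows from two projected positions. The matrix values, frame interval, and helper names below are made up for illustration; the paper's calibration procedure and actual parameters are not given in the abstract.

```python
import math


def apply_homography(h, u, v):
    """Project pixel (u, v) to ground-plane coordinates with a 3x3 homography."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity")
    return x / w, y / w  # divide out the homogeneous coordinate


def speed_mps(p0, p1, dt):
    """Speed in metres per second between two ground-plane points dt seconds apart."""
    return math.dist(p0, p1) / dt


# Hypothetical homography with a perspective term in the last row.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.001, 1.0]]
ground = apply_homography(H, 100.0, 1000.0)
speed = speed_mps((0.0, 0.0), (3.0, 4.0), 0.5)
```

The division by `w` shows why far-field vehicles are sensitive to calibration error: a small error in the last row of the matrix changes the projected position of distant pixels by a large amount.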

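2605.11898 2026-05-13 cs.CV Version update

Few-Shot Synthetic Data Generation with Diffusion Models for Downstream Vision Tasks

Daniil Dushenev, Nazariy Karpov, Daniil Zinovjev, Alexander Gorin, Konstantin Kulikov

Affiliations * National University of Science and Technology MISIS

AI summary This paper addresses the shortage of rare-class samples in visual recognition with a lightweight synthetic data generation method based on diffusion models. A LoRA adapter is fine-tuned on as few as 20-50 real images and used to generate synthetic training data, which improves rare-class recall and F1. Experiments cover two different domains, chest X-ray pathology classification and industrial surface defect detection, and support the method's effectiveness and scalability in data-scarce settings.

Comments 5 pages, 3 figures, 1 table. Accepted at SynData4CV Workshop @ CVPR 2026

Details
English abstract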

Class imbalance is a persistent challenge in visual recognition, particularly in safety-critical domains where collecting positive examples is expensive and rare events are inherently underrepresented. We propose a lightweight synthetic data augmentation pipeline that fine-tunes a LoRA adapter on as few as 20-50 real images of a rare class and uses a pretrained diffusion model to generate synthetic samples for training. We systematically vary the synthetic-to-real ratio and evaluate the approach across two structurally different domains: chest X-ray pathology classification (NIH ChestX-ray14) and industrial surface crack detection (Magnetic Tile Defect dataset). All evaluations are performed on held-out sets of real images only. Across both domains, synthetic augmentation consistently improves rare-class recall and F1 compared to training with real data alone. Performance improves with moderate synthetic augmentation and shows diminishing returns as the synthetic ratio increases. These results suggest that LoRA-adapted diffusion models provide a simple and scalable mechanism for augmenting rare classes, enabling effective learning in data-scarce scenarios across heterogeneous visual domains.

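2605.11869 2026-05-13 cs.CV cs.LG Version update

FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

Jian Tang, Jiawei Fan, Qingbin Liu, Zheng Wei

Affiliations * Platform and Content Group, Tencent

AI summary Model distillation can reduce the overall inference latency of video Diffusion Transformers (DiTs), but per-step latency remains a key bottleneck. Existing acceleration methods mainly exploit redundancy along the denoising trajectory and give diminishing returns in few-step regimes, where the scarcity of temporal states hinders feature reuse. This paper proposes FIS-DiT, a training-free, operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension: a Frame Interleaved Sparsity strategy operates on frame subsets across the model hierarchy. On Wan 2.2 and HunyuanVideo 1.5, FIS-DiT achieves a 2.11x to 2.41x speedup with negligible degradation on VBench-Q and CLIP metrics.

Details
English abstract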

While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.

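2605.11864 2026-05-13 cs.IR cs.AI cs.CV cs.MM Version update

Very Efficient Listwise Multimodal Reranking for Long Documents

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Affiliations * Magellan Technology Research Institute (MTRI)

AI summary This paper proposes ZipRerank, an efficient listwise multimodal reranker that targets the computational bottlenecks of vision-centric retrieval and multimodal retrieval-augmented generation over long documents. It shortens the input with a lightweight query-image early interaction mechanism and scores all candidates in a single forward pass, avoiding costly autoregressive decoding. Experiments show that ZipRerank maintains strong accuracy while substantially reducing LLM inference latency, making it suitable for latency-sensitive applications.

Comments To appear in ICML 2026

Details
English abstract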

Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.

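2605.11863 2026-05-13 cs.CV eess.IV Version update

GATA2Floor: Graph attention for floor counting in street-view facades

Ngoc Tan Le, Tzoulio Chamiti, Eirini Papagiannopoulou, Nikos Deligiannis

Affiliations * ETRO Department, Vrije Universiteit Brussel (VUB); imec

AI summary This paper studies how to infer a building's floor count automatically from street-view facade images and proposes GATA2Floor, a model based on graph attention. Each facade is modeled as a graph over window and door detections; a multi-head graph attention network predicts the floor count, and learnable cross-attention queries softly assign elements to latent floor slots, giving interpretable and robust outputs. To address the lack of labeled data, the authors also describe a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring.

Comments Accepted at IEEE ICIP 2026; 6 pages, 5 figures, 3 tables

Details
English abstract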

Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.

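2605.11856 2026-05-13 cs.CV cs.CL Updated version

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

Houcheng Jiang, Jiajun Fu, Junfeng Fang, Chen Gao, Xiang Wang, Xiangnan He, Yong Li

Affiliations * University of Science and Technology of China; Tsinghua University; National University of Singapore; Zhongguancun Academy

AI summary This paper proposes UniVLR, a unified visual latent reasoning framework that aims to improve the efficiency and performance of multimodal large language models on image reasoning tasks. The method merges textual reasoning and auxiliary visual information into a shared visual workspace: reasoning traces are rendered together with auxiliary images and compressed into compact visual latent representations, so that at inference time the model reasons only through visual latents and decodes the answer directly, avoiding explicit text reasoning and external tool calls. Experiments show that UniVLR outperforms existing methods on real-world perception and visual reasoning tasks while generating fewer reasoning tokens, pointing to a more efficient and unified paradigm for visual reasoning.

Details
Abstract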

Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.

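2605.11840 2026-05-13 cs.CV Updated version

Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

Zhangcheng Hou, Tomoaki Ohtsuki

Affiliations * School of Science and Technology

AI summary This paper studies how radar signals can improve radar-camera depth estimation and proposes Radar-Modulated Selection (RMS), a mechanism built on state space models that injects radar directly into the model's scan rather than relying on conventional feature fusion. Radar modulates the scan step size and readout parameters while the image backbone is left unchanged, so radar influences the model only where it improves accuracy, giving more efficient and accurate depth estimation. Experiments on the nuScenes dataset show significant gains together with lower latency.

Comments 16 pages, 3 figures, 9 tables

Details
Abstract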

Radar-camera depth estimation must turn an ultra-sparse, all-weather, metric radar signal into a dense per-pixel depth map. Existing methods -- concatenation, confidence-aware gating, sparse supervision, graph-based extraction -- combine radar and image features outside the backbone's sequence operator, and even cross-modal Mamba variants leave the selection mechanism itself unimodal. We argue that the selection mechanism is the right place for radar to enter. We introduce Radar-Modulated Selection (RMS), a minimal and principled way to inject radar into Mamba's selective scan: radar modulates the scan from within, adding zero-initialised perturbations to the step size $\Delta$ and readout $\mathbf{C}$ while leaving the input projection $\mathbf{B}$ and state dynamics $\mathbf{A}$ image-only. The construction is exactly equivalent to a pretrained image-only Mamba at initialisation, ensuring radar only influences the model where it improves accuracy. Two further properties follow that out-of-scan fusion cannot offer: linear-cost cross-modal coupling at every recurrence step, and a natural fallback to the image-only backbone when radar is absent. We deploy RMS in a Multi-View Scan Pyramid (MVSP) that matches the fusion operator to radar's spatial reach at each scale. SemoDepth achieves state-of-the-art performance on nuScenes, reducing MAE by 34.0%, 29.9%, and 29.9% over the previous best at 0--50, 0--70, and 0--80m, while attaining the lowest single-frame latency (26.8ms). A further ablation shows that out-of-scan feature blending adds no accuracy on top of RMS, providing empirical validation that in-scan selection can replace out-of-scan fusion.
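
The initialisation property described in this abstract can be illustrated with a toy scalar recurrence. The sketch below is not the authors' implementation: the function name `selective_scan`, the scalar state, the softplus parameterisation of the step size, and the weights `w_delta` and `w_c` are all assumptions made for illustration. It only shows that when the radar weights start at zero, the radar-modulated scan reproduces the image-only scan exactly.

```python
import math


def softplus(z):
    # Keeps the step size positive (an assumed parameterisation).
    return math.log1p(math.exp(z))


def selective_scan(x, delta_raw, B, C, A=-1.0, radar=None, w_delta=0.0, w_c=0.0):
    """Toy scalar selective scan with optional radar modulation.

    Radar perturbs only the step size (delta) and the readout (C);
    the input projection B and the state dynamics A stay image-only.
    """
    h, ys = 0.0, []
    for t in range(len(x)):
        r = radar[t] if radar is not None else 0.0  # no radar -> no perturbation
        delta = softplus(delta_raw[t] + w_delta * r)
        c = C[t] + w_c * r
        h = math.exp(delta * A) * h + delta * B[t] * x[t]  # discretised state update
        ys.append(c * h)
    return ys
```

With `w_delta = w_c = 0.0` the output is identical to the image-only scan whatever the radar values are, and passing `radar=None` gives the same fallback even after the weights have moved away from zero.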

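2605.10916 2026-05-13 cs.CV cs.AI Updated version

Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition

Md. Sultan Al Rayhan

Affiliations * Department of Computer Science and Engineering

AI summary Recognizing handwritten Bangla compound characters is challenging because of complex character structures, large intra-class variation, and limited high-quality annotated data. This paper proposes a confidence-guided diffusion augmentation framework to improve recognition of low-resolution Bangla compound characters. The method combines a class-conditional diffusion model with classifier guidance to generate high-quality synthetic samples, and introduces enhanced residual blocks and a confidence-based filtering mechanism to improve generation quality and retain only highly class-consistent samples. Experiments show gains across several mainstream models; the best model reaches 89.2% classification accuracy on the AIBangla dataset, well above the existing benchmark.

Details
Abstract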

Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model's U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.

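2605.10360 2026-05-13 cs.CV Updated version

DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions

Minje Kim, Younghyun Noh, Jaesoon Kim, Tae-Kyun Kim

Affiliations * KAIST; Sungkyunkwan University

AI summary This paper presents DySurface, a new framework for reconstructing temporally consistent 4D surfaces in dynamic scenes. The method combines explicit Gaussians with implicit signed distance functions (SDFs): it builds a dynamic sparse voxel grid that gives the implicit SDF field explicit geometric guidance, which markedly improves surface reconstruction quality with more accurate boundaries and finer detail. Experiments show that DySurface outperforms state-of-the-art methods in geometric accuracy while maintaining good rendering performance.

Details
Abstract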

While novel view synthesis (NVS) for dynamic scenes has seen significant progress, reconstructing temporally consistent geometric surfaces remains a challenge. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) offer powerful dynamic scene rendering capabilities; however, relying solely on photometric optimization often leads to geometric ambiguities. This results in discontinuous surfaces, severe artifacts, and broken surfaces over time. To address these limitations, we present DySurface, a novel framework that bridges the effectiveness of explicit Gaussians with the geometric fidelity of implicit Signed Distance Functions (SDFs) in dynamic scenes. Our approach tackles the structural discrepancy between the forward deformation of 3DGS ($\text{canonical} \rightarrow \text{dynamic}$) and the backward deformation required for volumetric SDF rendering ($\text{dynamic} \rightarrow \text{canonical}$). Specifically, we propose the VoxGS-DSDF branch that leverages deformed Gaussians to construct a dynamic sparse voxel grid, providing explicit geometric guidance to the implicit SDF field. This explicit anchoring effectively regularizes the volumetric rendering process, significantly improving surface reconstruction quality, with watertight boundaries and detailed representations. Quantitative and qualitative experiments demonstrate that DySurface significantly outperforms state-of-the-art baselines in geometric accuracy metrics while maintaining competitive rendering performance.

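2605.09965 2026-05-13 cs.CV Updated version

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

Kuan Zhang, Dongchen Liu, Qiyue Zhao, Tianyu Xin, Yue Su, Haisheng Wang, Han Yin, Hongbo Ma, Peize Li, Tianjun Gu, Xiangnan Wu, Xinran Zhang, Yongxuan Li, Zirong Chen, Yiming Li

Affiliations * College of AI, Tsinghua University; MMLab, The University of Hong Kong; University of Chinese Academy of Sciences

AI summary This study examines how foundation models can enable generalist game players, with the goal of AI that adapts and performs flexibly across a "game multiverse" of differing rules, objectives, and physics. Starting from four interdependent pillars (dataset, model, harness, and benchmark), it analyzes the full lifecycle of a generalist game player and identifies five fundamental trade-offs that constrain current systems. From this holistic view, the paper proposes a five-level roadmap, progressing from single-game mastery to an ultimate creator stage in which the agent both creates and evolves within a theoretical game multiverse, offering systematic guidance toward artificial general intelligence (AGI).

Comments 51 pages, 7 figures, github: https://github.com/THUSI-Lab/Awesome-LFMs-Play-Games

Details
Abstract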

The real world unfolds along a single set of physics laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where the agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within a theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field, and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.

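2605.08133 2026-05-13 cs.CV cs.AI Updated version

VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

Rui Zhao, Haofeng Hu, Zhenhai Gao, Jiaqiao Liu, Gao Fei

Affiliations * College of Automotive Engineering; The National Key Laboratory of Automotive Chassis Integration and Bionics; ReeFocus AI Technology

AI summary This paper proposes VLADriver-RAG, a retrieval-augmented vision-language-action model for autonomous driving. By introducing structure-aware retrieval of historical knowledge, the model addresses the limited generalization of conventional VLA models in long-tail scenarios. The work converts visual inputs into spatiotemporal semantic graphs and uses a scenario-aligned embedding model to improve retrieval relevance, achieving a new state of the art on the Bench2Drive benchmark with a Driving Score of 89.12.

Details
Abstract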

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving, yet their reliance on implicit parametric knowledge limits generalization in long-tail scenarios. While Retrieval-Augmented Generation (RAG) offers a solution by accessing external expert priors, standard visual retrieval suffers from high latency and semantic ambiguity. To address these challenges, we propose VLADriver-RAG, a framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, we abstract sensory inputs into spatiotemporal semantic graphs via a Visual-to-Scenario mechanism, effectively filtering visual noise. To ensure retrieval relevance, we employ a Scenario-Aligned Embedding Model that utilizes Graph-DTW metric alignment to prioritize intrinsic topological consistency over superficial visual similarity. These retrieved priors are then fused within a query-based VLA backbone to synthesize precise, disentangled trajectories. Extensive experiments on the Bench2Drive benchmark establish a new state-of-the-art, achieving a Driving Score of 89.12.

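2605.05680 2026-05-13 cs.CV Updated version

MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

Nanjie Yao, Junlong Ren, Wenhao Shen, Hao Wang

Affiliations * The Hong Kong University of Science; Nanyang Technological University, Singapore

AI summary This paper studies how to recover full-body 3D human motion from head-mounted device signals. Existing diffusion models rely on global distribution matching, which leads to local joint reconstruction errors; to address this, the paper proposes MotionGRPO, a new framework based on reinforcement learning post-training that introduces a hybrid reward mechanism and a noise-injection strategy to increase sample diversity and stabilize learning. Experiments show that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.

Comments Accepted by ICML 2026

Details
Abstract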

This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.
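
Why low intra-group diversity starves the policy update can be seen in the usual group-relative advantage computation. The snippet below is a generic sketch of that normalisation, assuming the common mean and standard-deviation form; the function name `group_relative_advantages` and the `eps` constant are illustrative, and the paper's exact formulation may differ.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each sample's reward against its own rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # If every sample in the group earns the same reward, all advantages are zero
    # and the group contributes no learning signal.
    return [(r - mean) / (std + eps) for r in rewards]
```

A group whose samples all receive the same reward yields zero advantages throughout, whereas a more varied group yields signed advantages that sum to roughly zero. Under this reading, injecting noise to diversify the samples is what restores a usable gradient.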

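2605.05077 2026-05-13 cs.CV Updated version

FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

Andranik Sargsyan, Shant Navasardyan

Affiliations * Picsart AI Research (PAIR)

AI summary This paper proposes FlowDIS, a language-guided dichotomous image segmentation method built on the flow matching framework. It learns a time-dependent vector field that transports the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. The method introduces the Position-Aware Instance Pairing (PAIP) training strategy, which markedly improves pixel-level segmentation accuracy under text-prompt control. Experiments show that FlowDIS outperforms the best existing methods both with and without language guidance, achieving a 5.5% higher $F_\beta^\omega$ measure and 43% lower MAE ($\mathcal{M}$) on the DIS-TE test set.

Comments Accepted to CVPR 2026

Details
Abstract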

Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical image analysis. In recent years, Dichotomous Image Segmentation (DIS) has become a standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built on the flow matching framework, which learns a time-dependent vector field to transport the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. Moreover, with our Position-Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through text prompts, enabling precise, pixel-level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared with the best prior DIS method, FlowDIS achieves a 5.5% higher $F_\beta^\omega$ measure and 43% lower MAE ($\mathcal{M}$) on the DIS-TE test set. The code is available at: https://github.com/Picsart-AI-Research/FlowDIS
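
As a rough illustration of what "transporting" a sample along a time-dependent vector field means at inference time, the sketch below integrates dx/dt = v(x, t) from t = 0 to t = 1 with plain Euler steps. Everything here is an assumption made for illustration: `euler_transport` and `make_toy_velocity` are hypothetical names, the toy closed-form field stands in for a learned network, and the four-element lists stand in for image and mask tensors.

```python
def euler_transport(x0, velocity, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Closed-form field for a straight path ending at `target` (stand-in for a network)."""
    def velocity(x, t):
        # Points from the current state to the target, scaled by the remaining time.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity
```

Calling `euler_transport([0.3, 0.7, 0.2, 0.9], make_toy_velocity([0.0, 1.0, 1.0, 0.0]), num_steps=4)` moves the starting values onto the binary target. In a trained model the velocity would instead come from a network, optionally conditioned on a text prompt.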

2604.26752 2026-05-13 cs.CV Updated version

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji, Jinjiang Wang, Jing Chen, Jiazheng Xu, Jiale Zhu, Jiale Cheng, Ji Qi, Guobing Gan, Guo Wang, Cong Yao, Zijun Dou, Zihao Zhou, Zihan Wang, Zhiqi Ge, Zhijie Li, Zhenyu Hou, Zhao Xue, Zehui Wang, Zehan Qi, Zehai He, Yutao Zhang, Yusen Liu, Yukuo Cen, Yuchen Li, Yuan Wang, Yu Yang, Yongbin Liu, Yijian Lu, Yifan Xu, Yanzi Wang, Yanxiao Zhao, Yanfeng Wang, Yadong Xue, Yabo Xu, Xinyu Zhang, Xinyu Liu, Xiao Liu, Wenyi Zhao, Wenkai Li, Tianyu Tong, Tianshu Zhang, Shudan Zhang, Shengdong Yan, Qinkai Zheng, Mingde Xu, Licheng Bao, lat Long long, Jiaxing Xu, Jiaxin Fan, Jiawen Qian, Jiali Chen, Jiahui Lin, Jiadai Sun, Haozhi Zheng, Haoran Wang, Haochen Li, Hanyu Lai, Han Xu, Fan Yang, Dan Zhang, Da Yin, Chuangxin Zhao, Chengcheng Wu, Boyan Shi, Bowen Lv, Bowei Jia, Bo Li, Bin Chen, Baoxu Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

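Affiliations * Z.ai & Tsinghua University

AI summary This paper introduces GLM-5V-Turbo, a native foundation model for multimodal agents. The model integrates multimodal perception deeply into reasoning, planning, tool use, and execution rather than treating it as an auxiliary interface to a language model. Through improvements in model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks, the work substantially improves performance on multimodal coding, visual tool use, and agentic tasks while preserving strong text-only coding capability, and offers practical lessons for building multimodal agents.

Details
Abstract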

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

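2604.25432 2026-05-13 cs.CV Updated version

SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks

Zi-Yang Bo, Wei Lu, Hongruixuan Chen, Si-Bao Chen, Bin Luo

Affiliations * MOE Key Lab of ICSP, Anhui Provincial Key Lab of Multimodal Cognitive Computation, IMIS Lab of Anhui Province, School of Computer Science and Technology, Anhui University, Hefei, China; Graduate School of Frontier Sciences, The University of Tokyo, Chiba, 277-8561, Japan

AI summary Shadows in remote sensing images seriously degrade visual quality and downstream task performance, and most existing methods treat shadow detection and removal as separate cascaded tasks, which makes the pipeline cumbersome and prone to error accumulation. To address this, the paper proposes SARU, a unified shadow-aware and removal framework comprising a dual-branch detection module and a training-free physical restoration algorithm, which efficiently produces high-fidelity shadow masks and restores illumination, markedly improving both shadow detection and removal. The study also releases two new remote sensing shadow datasets; experiments show that SARU reaches state-of-the-art results on multiple benchmarks with fast and stable processing.

Comments Accepted by ISPRS

Details
Abstract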

Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to a cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose the Shadow-Aware and Removal Unified (SARU) framework, a cohesive two-stage approach. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N$^2$SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments on the AISD and RSISD datasets demonstrate that SARU achieves SOTA shadow detection performance. For shadow removal, our training-free N$^2$SGSR algorithm attains an average processing time of approximately 1.3 s, over 10 times faster than the SOTA MAOSD, while maintaining an SRI value close to 0.9 on both the AISD and SiSRB datasets, a level comparable to the advanced RS-GSSR method. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The code and datasets are publicly available at: https://github.com/AeroVILab-AHU/SARU

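2604.24990 2026-05-13 cs.CV Updated version

A New Kind of Network? Review and Reference Implementation of Neural Cellular Automata

Martin Spitznagel, Janis Keuper

Affiliations * Institute for Machine Learning and Analytics (IMLA); Offenburg University; University of Mannheim

AI summary This paper reviews research on Neural Cellular Automata (NCA), proposes a unified modular framework and notation, and provides a reference implementation in the open-source library NCAtorch. NCA combine the simple rules of cellular automata with learnable neural networks, so that complex update rules can be learned from data to model self-organizing generative systems, offering a new approach to simulating complex systems.

Details
Abstract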

Stephen Wolfram proclaimed in his 2003 seminal work "A New Kind Of Science" that simple recursive programs in the form of Cellular Automata (CA) are a promising approach to replace currently used mathematical formalizations, e.g. differential equations, to improve the modeling of complex systems. Over two decades later, while Cellular Automata have still been waiting for a substantial breakthrough in scientific applications, recent research showed new and promising approaches which combine Wolfram's ideas with learnable Artificial Neural Networks: So-called Neural Cellular Automata (NCA) are able to learn the complex update rules of CA from data samples, allowing them to model complex, self-organizing generative systems. The aim of this paper is to review the existing work on NCA and provide a unified modular framework and notation, as well as a reference implementation in the open-source library NCAtorch. Supplementary materials, videos, and code are available at the project website: https://www.neural-cellular-automata.org/
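
For readers unfamiliar with the idea, a single NCA update can be sketched as follows: each cell perceives its local neighbourhood, a small learnable function maps that perception to an increment, and the increment is added to the cell's state. This is a deliberately tiny one-dimensional illustration with scalar cell states and hand-written weights. It does not use or reflect the NCAtorch API, and the name `nca_step` and its parameters are assumptions.

```python
import math


def nca_step(state, w_in, b, w_out):
    """One synchronous update of a 1-D ring of scalar cells.

    w_in : list of 3-element weight rows (one per hidden unit) over (left, self, right)
    b    : list of hidden biases
    w_out: list of output weights mapping hidden units to a scalar increment
    """
    n = len(state)
    new_state = []
    for i in range(n):
        # Perception: the cell itself plus its two neighbours (periodic boundary).
        neighbourhood = (state[(i - 1) % n], state[i], state[(i + 1) % n])
        # Update rule: one tanh hidden layer followed by a linear readout.
        hidden = [
            math.tanh(sum(w * p for w, p in zip(row, neighbourhood)) + bj)
            for row, bj in zip(w_in, b)
        ]
        delta = sum(wo * h for wo, h in zip(w_out, hidden))
        new_state.append(state[i] + delta)  # residual update
    return new_state
```

Because the rule only reads a cell's immediate neighbours, one step can change only cells adjacent to existing activity. In a learned NCA the weights would be fitted from data rather than written by hand.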

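2604.16445 2026-05-13 eess.AS cs.AI cs.CV cs.LG Updated version

SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment

Giovanna Sannino, Ivanoe De Falco, Nadia Brancati, Laura Verde, Maria Frucci, Daniel Riccio, Vincenzo Bevilacqua, Antonio Di Marino, Lucia Aruta, Valentina Virginia Iuzzolino, Gianmaria Senerchia, Myriam Spisto, Raffaele Dubbioso

Affiliations * National Research Council of Italy (CNR), Institute for High-Performance Computing and Networking (ICAR), Naples

AI summary This paper presents the SAND challenge, which aims to use speech signals for early diagnosis and progression prediction of neurodegenerative diseases such as Amyotrophic Lateral Sclerosis (ALS). A team of clinicians and machine learning researchers built a clinically annotated speech dataset and launched a challenge based on it to advance the application and validation of AI models for speech analysis. The work provides a data foundation and research platform for disease assessment using noninvasive biomarkers.

Details
Abstract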

Recent advances in Artificial Intelligence (AI) and the exploration of noninvasive, objective biomarkers, such as speech signals, have encouraged the development of algorithms to support the early diagnosis of neurodegenerative diseases, including Amyotrophic Lateral Sclerosis (ALS). Voice changes in subjects suffering from ALS typically manifest as progressive dysarthria, which is a prominent neurodegenerative symptom because it affects patients as the disease progresses. Since voice signals are complex data, the development and use of advanced AI techniques are fundamental to extracting distinctive patterns from them. Validating AI algorithms for ALS diagnosis and monitoring using voice signals is challenging, particularly due to the lack of annotated reference datasets. In this work, we present the outcome of a collaboration between a multidisciplinary team of clinicians and Machine Learning experts to create both a clinically annotated validation dataset and the "Speech Analysis for Neurodegenerative Diseases" (SAND) challenge based on it. Specifically, by analyzing voice disorders, the SAND challenge provides an opportunity to develop, test, and evaluate AI models for the automatic early identification and prediction of ALS disease progression.

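2604.10500 2026-05-13 cs.CV Updated version

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan, Xiangxiang Chu

Affiliations * Beijing Institute of Technology; AMAP, Alibaba Group; City University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China

AI summary This study addresses two problems in multimodal latent reasoning, under-optimized visual information and poor convergence of semantically complex tokens, and proposes visual enhanced depth scaling. An analysis of token-level gradient dynamics shows that visual tokens have small gradient norms and complex tokens suffer from gradient instability; in response, the authors introduce a visual replay module and a routing depth scaling mechanism, which respectively strengthen visual perception and refine complex latents. Combined with a curriculum learning strategy, the method improves multimodal latent reasoning and achieves leading reasoning results and speedups on multiple benchmarks.

Comments 11 pages, 6 figures

Details
Abstract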

Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly smaller gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.

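2604.03061 2026-05-13 cs.CV Updated version

Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks

Weixiong Sun, Xiang Yin, Chao Dong

Affiliations * Shenzhen University of Advanced Technology; Fudan University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

AI summary This paper evaluates the general-purpose image editing model Nano Banana 2 on image restoration tasks and finds that it performs well across diverse scenes and degradations, and is especially competitive in user preference and overall visual quality. The study notes that concise prompts and explicit fidelity constraints give a better balance between reconstruction and perceptual quality, but that the model tends to over-enhance details and introduce inconsistencies, an issue existing image quality assessment metrics do not fully capture. The authors conclude that general-purpose models show promise as unified image restoration solutions from a perceptual perspective, but still need better controllability and fidelity-aware evaluation.

Comments Accepted by CVPR 2026 Workshop AAVM

Details
Abstract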

Recent advances in generative AI raise the question of whether general-purpose image editing models can serve as unified solutions for image restoration. We conduct a systematic evaluation of Nano Banana 2 across diverse scenes and degradations. Our results show that prompt design is critical, with concise prompts and explicit fidelity constraints achieving a better balance between reconstruction and perceptual quality. Nano Banana 2 achieves competitive full-reference performance and is consistently preferred in user studies, while showing strong generalization in challenging scenarios. However, we observe a gap between perceptual quality and restoration fidelity, as the model tends to produce visually rich results with over-enhanced details and inconsistencies. This issue is not well captured by existing IQA metrics or user studies. Overall, general-purpose models show promise as unified IR solvers from a perceptual perspective, but require improved controllability and fidelity-aware evaluation. Further comparisons and detailed analyses are available in our project repository: https://github.com/yxyuanxiao/NanoBanana2TestOnIR.

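2603.29057 2026-05-13 cs.CV Updated version

LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

Affiliations * School of Information Technology, Monash University; S-Lab, Nanyang Technological University

AI summary This paper proposes LA-Sign, a skeleton-based sign language recognition method built on looped transformers with geometry-aware alignment, aimed at better capturing the multi-scale detail of signing motion. A looping mechanism repeatedly refines latent representations under shared parameters, sharpening the model's sensitivity to motion detail, and a geometry-aware contrastive objective projects skeletal and textual features into an adaptive hyperbolic space to encourage multi-level semantic organization. Experiments show that LA-Sign achieves state-of-the-art results on multiple benchmark datasets with a more compact model.

Details
Abstract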

Skeleton-based isolated sign language recognition (ISLR) demands fine-grained understanding of articulated motion across multiple spatial scales, from subtle finger movements to global body dynamics. Existing approaches typically rely on deep feed-forward architectures, which increase model capacity but lack mechanisms for recurrent refinement and structured representation. We propose LA-Sign, a looped transformer framework with geometry-aware alignment for ISLR. Instead of stacking deeper layers, LA-Sign derives its depth from recurrence, repeatedly revisiting latent representations to progressively refine motion understanding under shared parameters. To further regularise this refinement process, we present a geometry-aware contrastive objective that projects skeletal and textual features into an adaptive hyperbolic space, encouraging multi-scale semantic organisation. We study three looping designs and multiple geometric manifolds, demonstrating that encoder-decoder looping combined with adaptive Poincare alignment yields the strongest performance. Extensive experiments on WLASL and MSASL benchmarks show that LA-Sign achieves state-of-the-art results while using fewer unique layers, highlighting the effectiveness of recurrent latent refinement and geometry-aware representation learning for sign language recognition.

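2603.23677 2026-05-13 cs.CV cs.AI Updated version

Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection

Shreen Gul, Mohamed Elmahallawy, Ardhendu Tripathy, Sanjay Madria

Affiliations * Missouri University of Science and Technology; Washington State University

AI summary This paper proposes a training-free multi-layer feature fusion method for detecting whether a model input is out of distribution (OOD). Unlike existing methods, which rely mainly on penultimate-layer activations, the method exploits the rich representations of intermediate layers: it aggregates features from multiple convolutional blocks and computes class-wise mean embeddings to build compact class prototypes. Experiments show strong OOD detection performance across diverse architectures, with higher AUROC and a lower false positive rate.

Details
Abstract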

Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies $L_2$ normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score: ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: https://github.com/sgchr273/cosine-layers.git.
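
The scoring pipeline in this abstract (aggregate features, average per class, L2-normalise, score by cosine similarity) can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' released code: the helper names are hypothetical, concatenation is assumed as the multi-layer aggregation rule, and the maximum similarity over prototypes is assumed as the score.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def aggregate_layers(per_layer_features):
    """Concatenate pooled feature vectors from several layers (assumed fusion rule)."""
    return [x for layer in per_layer_features for x in layer]


def build_prototypes(features, labels):
    """Class-wise mean of aggregated features, L2-normalised into one prototype per class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: l2_normalize([s / counts[y] for s in sums[y]]) for y in sums}


def id_score(feature, prototypes):
    """Maximum cosine similarity to any class prototype; low values suggest OOD."""
    f = l2_normalize(feature)
    return max(sum(a * b for a, b in zip(f, p)) for p in prototypes.values())
```

In use, a threshold on `id_score` would separate in-distribution inputs (close to at least one prototype) from OOD inputs (far from all of them). How that threshold is chosen is not specified in the abstract.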

2603.05947 2026-05-13 cs.CV Updated version

LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution

Song Fei, Tian Ye, Sixiang Chen, Zhaohu Xing, Jianyu Lai, Lei Zhu

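Affiliations * The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

AI summary This paper proposes LucidNFT, a multi-reward reinforcement learning framework for flow-matching-based real-world image super-resolution. It introduces LucidConsistency, an LR-referenced evaluator that is invariant to degradation yet sensitive to semantic hallucinations; a decoupled reward normalization strategy; and LucidLR, a large-scale collection of real degraded images. Together these address the trade-off in existing methods between staying faithful to the low-resolution input and improving visual quality. Experiments show that LucidNFT improves perceptual quality on multiple benchmarks while maintaining consistency with the real low-resolution input.

Details
English abstract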

Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence, exhibiting semantic or structural hallucinations. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidate restorations. However, effective alignment in Real-ISR is hindered by three coupled challenges: (i) the lack of an LR-referenced faithfulness signal that is robust to degradation yet sensitive to localized hallucinations, (ii) a rollout-group optimization bottleneck where scalarizing heterogeneous rewards before normalization compresses objective-wise contrasts and weakens DiffusionNFT-style reward-weighted updates, and (iii) limited coverage of real degradations, which restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-invariant and hallucination-sensitive LR-referenced evaluator trained with content-consistent degradation pools and original-inpainted hard negatives; a decoupled reward normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion; and LucidLR, a large-scale collection of real-world degraded images for robust RL fine-tuning. Extensive experiments show that LucidNFT improves perceptual quality on strong flow-based Real-ISR baselines while generally maintaining LR-referenced consistency across diverse real-world scenarios.
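
Challenge (ii) and the decoupled normalization that answers it can be illustrated with a toy comparison. This sketch assumes z-score normalization within a rollout group and a weighted sum as the fusion step; both are illustrative choices, and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Standardize a 1-D array to zero mean and unit variance."""
    return (x - x.mean()) / (x.std() + eps)

def coupled_advantage(rewards, weights):
    """Scalarize heterogeneous rewards first, then normalize within the group."""
    return zscore(rewards @ weights)

def decoupled_advantage(rewards, weights):
    """Normalize each reward within the group first, then fuse."""
    normed = np.stack(
        [zscore(rewards[:, j]) for j in range(rewards.shape[1])], axis=1
    )
    return normed @ weights

# One rollout group: 4 candidates, 2 rewards on very different scales.
rewards = np.array([[10.0, 0.90], [20.0, 0.90], [10.0, 0.99], [20.0, 0.99]])
weights = np.array([0.5, 0.5])
print(coupled_advantage(rewards, weights))
print(decoupled_advantage(rewards, weights))
```

In the coupled variant the small-scale second reward barely separates candidates 0 and 2, while the decoupled variant keeps that contrast, which is the effect the abstract attributes to normalizing before fusion.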

2602.22347 2026-05-13 cs.CV cs.AI Updated version

Enabling clinical use of foundation models for computational pathology

Audun L Henriksen, Ole-Johan Skrede, Lisa van der Schee, Enric Domingo, Karolina Cyll, Sepp de Raedt, Ilyá Kostolomov, Jennifer Hay, Wanja Kildal, Joakim Kalsnes, Robert W Williams, Manohar Pradhan, John Arne Nesheim, Hanne Askautrud, Maria Isaksen, Karmele Saez de Gordoa, Miriam Cuatrecasas, Joanne Edwards, TransSCOT group, Arild Nesbakken, Neil A Shepherd, Ian Tomlinson, Daniel-Christoph Wagner, Rachel Kerr, Tarjei Sveinsgjerd Hveem, Knut Liestøl, Yoshiaki Nakamura, Marco Novelli, Masaaki Miyo, Sebastian Försch, David N Church, Miangela M Lacle, David J Kerr, Andreas Kleppe

Affiliations * Institute for Cancer Genetics and Informatics, Oslo University Hospital; Department of Pathology, University Medical Center Utrecht; Department of Oncology, University of Oxford; CRUK Beatson Institute of Cancer Research, Garscube Estate; Glasgow Tissue Research Facility, University of Glasgow, Queen Elizabeth University Hospital; Area for Improvement and Digital Transformation, Norwegian Offshore Directorate; Pathology Department, Hospital Clínic, Barcelona, Spain; Institut d’Investigacions Biomèdiques August Pi I Sunyer (IDIBAPS), Barcelona, Spain; Department of Clinical Foundations, Universitat de Barcelona; School of Cancer Sciences, Wolfson Wohl Cancer Research Centre, University of Glasgow; Institute of Clinical Medicine, University of Oslo; Department of Gastrointestinal Surgery, Oslo University Hospital

AI summary This study examines how to make foundation models for computational pathology better suited to clinical settings, addressing the problem that current models capture scanner and pre-analytic variation, which degrades downstream task performance. It proposes introducing novel robustness losses during downstream model training to reduce sensitivity to technical variation, and validates the approach through experiments on a large set of clinical pathology images. Without retraining the foundation models, the method improves robustness and classification accuracy, supporting the development of deep learning systems better suited to real clinical environments.

Details
English abstract

Foundation models for computational pathology are expected to facilitate the development of high-performing, generalisable deep learning systems. However, in addition to biologically relevant features, current foundation models also capture pre-analytic and scanner-specific variation that biases the predictions made by downstream task-specific models trained on these features. Here we show that introducing novel robustness losses during downstream model training reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 whole-slide images from 6,155 patients is used to train thousands of models from the features of eight well-known foundation models for computational pathology. In addition to a substantial improvement in robustness, our approach improves classification accuracy by focusing on biologically relevant features. It mitigates robustness limitations of foundation models for computational pathology without retraining the foundation models themselves, enabling development of models that are more suitable for real-world clinical use.

2602.09587 2026-05-13 cs.CV cs.AI Updated version

MieDB-100k: A Comprehensive Dataset for Medical Image Editing

Yongfan Lai, Wen Qian, Bo Liu, Hongyan Li, Hao Luo, Fan Wang, Bohan Zhuang, Shenda Hong

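Affiliations * State Key Laboratory of General Artificial Intelligence, Beijing, China; School of Intelligence Science and Technology, Peking University, Beijing, China; National Institute of Health Data Science, Peking University, Beijing, China; DAMO Academy, Alibaba Group, Zhejiang, China; Hupan Lab, Zhejiang Province; Zhejiang University, Zhejiang, China

AI summary To address the scarcity of high-quality data for medical image editing, this paper proposes MieDB-100k, a large-scale, high-quality, and diverse dataset for text-guided medical image editing. The dataset categorizes editing tasks from three perspectives (Perception, Modification, and Transformation), covering both understanding and generation abilities. It is built with expert models and rule-based synthesis methods and undergoes rigorous manual review to ensure clinical accuracy. Experiments show that models trained on it outperform existing open-source and commercial models in both performance and generalization, providing an important foundation for medical image editing research.

Details
English abstract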

The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding, and an inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality, and diverse dataset for text-guided medical image editing. It categorizes editing tasks into the perspectives of Perception, Modification, and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthesis methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.

2602.04476 2026-05-13 cs.CV Updated version

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin

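Affiliations * Byungwoo Jeon Yoonwoo Jeong Hyunseok Lee Minsu Cho Jinwoo Shin

AI summary Although multi-modal large language models have advanced on many understanding tasks, they still fall short on problems that require multi-step reasoning, mainly because visual information is progressively diluted during long-context generation. This paper proposes Vision-aligned Latent Reasoning (VaLR), a reasoning framework that dynamically generates vision-aligned latent tokens before each reasoning step, guiding the model to reason from perceptual cues in the latent space. Experiments show that VaLR performs strongly on multiple benchmarks requiring long-context understanding and precise visual perception, raising performance on VSI-Bench from 33.0% to 52.9%, well above existing models.

Comments Published as conference proceeding for ICML 2026. Last two authors advised equally

Details
English abstract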

Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.

2512.24985 2026-05-13 cs.CV cs.AI cs.LG cs.RO Updated version

DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes

Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh

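Affiliations * Korea Advanced Institute of Science and Technology (KAIST); Pohang University of Science and Technology (POSTECH)

AI summary This paper presents DarkQA, an open-source benchmark for evaluating the visual-primitive question answering ability of vision-language models in low-light indoor scenes. Using multi-level illumination control, the benchmark generates 9,400 verifiable question-image pairs, simulates realistic illumination drop and sensor noise, and exposes the performance degradation of existing models in low light. The study also systematically evaluates a range of vision-language models and low-light image enhancement methods, demonstrating DarkQA's usefulness for analyzing model robustness.

Comments This work has been submitted to the IEEE for possible publication

Details
English abstract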

Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks. The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods. Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery, systematically revealing VLMs' limitations under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance. Project website: https://darkqa-benchmark.github.io
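
The idea of degrading images in linear space rather than on display-encoded pixels can be sketched as follows. This is a heavily simplified stand-in, not the DarkQA pipeline: it assumes an sRGB input, a Poisson shot-noise plus Gaussian read-noise sensor model, and a bare sRGB re-encoding in place of an ISP. The function names and the `full_well` and `read_sigma` parameters are hypothetical.

```python
import numpy as np

def srgb_to_linear(img):
    """Invert the sRGB transfer curve; img is in [0, 1]."""
    return np.where(img <= 0.04045, img / 12.92, ((img + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(lin):
    """Apply the sRGB transfer curve after clipping to [0, 1]."""
    lin = np.clip(lin, 0.0, 1.0)
    return np.where(lin <= 0.0031308, lin * 12.92, 1.055 * lin ** (1 / 2.4) - 0.055)

def simulate_low_light(img, scale, full_well=1000.0, read_sigma=2.0, seed=0):
    """Darken in linear space, add sensor noise, then re-encode for display.

    scale:      illumination factor in (0, 1]; smaller is darker.
    full_well:  photon count that maps to a linear value of 1.0 (assumed).
    read_sigma: std of Gaussian read noise, in photon units (assumed).
    """
    rng = np.random.default_rng(seed)
    photons = srgb_to_linear(img) * scale * full_well
    shot = rng.poisson(photons).astype(np.float64)        # signal-dependent noise
    noisy = shot + rng.normal(0.0, read_sigma, size=img.shape)
    return linear_to_srgb(noisy / full_well)
```

Fixing the seed keeps the degradation deterministic for a given image and level, in the spirit of the deterministically generated pairs the abstract mentions.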

2512.07150 2026-05-13 cs.LG cs.AI cs.CV Updated version

FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers

Jonghyun Park, Jong Chul Ye

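Affiliations * KAIST

AI summary This paper proposes FlowLPS, a training-free latent-flow inverse problem solver based on Langevin-Proximal Sampling, aimed at the finite-step trade-off that deep generative models face in imaging inverse problems. At each reverse step, the method perturbs the model-predicted clean estimate with a few Langevin updates to provide posterior-oriented stochastic initializations, then applies local MAP-style proximal refinement to rapidly improve measurement consistency, and uses controlled pCN-style re-noising to keep the trajectory stable. Experiments show that FlowLPS achieves a good balance between measurement fidelity and perceptual quality on multiple linear inverse problems.

Details
English abstract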

Deep generative models are powerful priors for imaging inverse problems, but training-free solvers for latent flow models face a practical finite-step trade-off. Optimization-heavy methods quickly improve measurement consistency, but in highly nonlinear latent spaces, their results can depend strongly on where local refinement is initialized, often degrading perceptual realism. In contrast, stochastic sampling methods better preserve posterior exploration, but often require many iterations to obtain sharp, measurement-consistent reconstructions. To address this trade-off, we propose FlowLPS, a training-free latent flow inverse solver based on Langevin-Proximal Sampling. At each reverse step, FlowLPS uses a few Langevin updates to perturb the model-predicted clean estimate in posterior-oriented directions, providing stochastic initializations for local refinement. It then applies local MAP-style proximal refinement to rapidly improve measurement consistency from the Langevin-updated estimate. We additionally use controlled pCN-style re-noising to stabilize the reverse trajectory while retaining trajectory coherence. Experiments on FFHQ and DIV2K across five linear inverse problems show that FlowLPS achieves a strong balance between measurement fidelity and perceptual quality, with additional experiments on pixel-space inverse problems and phase retrieval.
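
The three ingredients named above (Langevin updates, proximal refinement, pCN-style re-noising) can each be written down for a toy linear problem y = A x. The sketch below is not the FlowLPS algorithm: it works in pixel space, replaces the flow prior with a standard Gaussian so the gradients are closed-form, and uses hypothetical function names and step sizes.

```python
import numpy as np

def langevin_step(x, y, A, sigma, eta, rng):
    """One unadjusted Langevin update on log p(x | y) with a toy N(0, I) prior."""
    grad = A.T @ (y - A @ x) / sigma**2 - x
    return x + eta * grad + np.sqrt(2.0 * eta) * rng.standard_normal(x.shape)

def proximal_refine(x, y, A, lam):
    """Closed-form argmin_z 0.5*||A z - y||^2 + ||z - x||^2 / (2*lam)."""
    n = x.shape[0]
    return np.linalg.solve(A.T @ A + np.eye(n) / lam, A.T @ y + x / lam)

def pcn_renoise(x, beta, rng):
    """pCN-style mixing that leaves N(0, I) invariant.

    beta in [0, 1] controls how much fresh noise replaces the old state.
    """
    return np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
y = A @ rng.standard_normal(5)
x = langevin_step(np.zeros(5), y, A, sigma=1.0, eta=0.01, rng=rng)
z = proximal_refine(x, y, A, lam=1.0)
print(np.linalg.norm(A @ x - y), np.linalg.norm(A @ z - y))
```

The proximal step can never increase the measurement residual relative to its starting point, which is why a stochastic initialization followed by local refinement is a natural pairing.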

2512.01675 2026-05-13 cs.CV Updated version

GRASP: Guided Residual Adapters with Sample-wise Partitioning

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

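Affiliations * Friedrich-Alexander-Universität Erlangen-Nürnberg; Imperial College London

AI summary In long-tail settings, text-to-image flow-matching transformers show degraded generation quality on tail classes. This paper proposes GRASP, which combines a deterministic partition of the conditioning space with group-specific residual adapters to improve tail-class generation quality while leaving the original optimization objective and sampler unchanged. Experiments show that GRASP substantially improves the diversity of generated images and tail-class coverage on multiple datasets, and outperforms existing methods on a downstream classification task.

Comments 16 pages, 6 figures, 6 tables

Details
English abstract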

Text-to-image flow matching transformers degrade sharply in long-tail settings: tail-class outputs collapse in fidelity and diversity, limiting their value as synthetic augmentation for rare conditions. We trace this to low head-versus-tail gradient alignment during fine-tuning, an optimization-level pathology that conditioning- and sampling-side interventions do not address. We propose GRASP (Guided Residual Adapters with Sample-wise Partitioning): a deterministic partition of the conditioning space, paired with group-specific residual adapters in the transformer feedforward layers, that leaves the flow-matching objective and the sampler untouched. In conditional flow matching, condition values index distinct sets of probability paths, so partitioning along the conditioning is the structurally correct factorization, suitable as a gradient alignment proxy. Because the partition is static, every tail sample is guaranteed to update its assigned expert, which bypasses extreme long-tail failure modes. Crucially, GRASP is non-invasive and composable: on MIMIC-CXR-LT, combining GRASP with self-guided minority sampling at inference time yields the best all-labels IRS we observe, beyond either intervention alone. GRASP itself reduces overall FID by up to 80% and lifts tail-class coverage by up to 44% over full fine-tuning, learned-routing MoE, and minority guidance. Used as training data for a downstream DenseNet classifier on NIH-CXR-LT, GRASP synthetics significantly outperform every non-GRASP alternative on macro F1, match the macro F1 obtained from real training data, and yield nonzero F1 on 9 of 13 classes versus 3 of 13 from full fine-tuning. Results on ImageNet-LT confirm the mechanism is not tied to medical inductive bias.
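
The combination of a static partition and group-specific residual adapters can be sketched with a toy feedforward layer. Everything here is illustrative: the class name, the low-rank adapter shape, and the modulo rule in `assign_group` are assumptions, since the abstract does not say how the conditioning space is actually partitioned.

```python
import numpy as np

def assign_group(label_id, num_groups):
    """Static partition: a fixed function of the condition, no learned router."""
    return label_id % num_groups

class PartitionedAdapterFFN:
    """Shared feedforward block plus one low-rank residual adapter per group."""

    def __init__(self, dim, hidden, num_groups, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((dim, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, dim)) * 0.1
        self.down = rng.standard_normal((num_groups, dim, rank)) * 0.1
        # Zero-initialized up-projection: the layer starts out equal to the base FFN.
        self.up = np.zeros((num_groups, rank, dim))

    def base(self, x):
        """Shared feedforward path with a ReLU nonlinearity."""
        return np.maximum(x @ self.w1, 0.0) @ self.w2

    def __call__(self, x, group):
        """Base output plus the residual from the adapter assigned to `group`."""
        return self.base(x) + (x @ self.down[group]) @ self.up[group]
```

Because `assign_group` is deterministic, a sample from a given condition always reaches the same adapter, which mirrors the abstract's point that every tail sample is guaranteed to update its assigned expert.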

2511.22663 2026-05-13 cs.CV Updated version

AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Hongsheng Li

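Affiliations * MMLab, CUHK; Meituan; USTC; TJU

AI summary Unified multimodal models have made notable progress in image generation and understanding, but the conflicting objectives of the two tasks make it hard to establish an optimal training paradigm. To ease these conflicts, existing methods mostly adopt architecture decoupling, which can cost the model its interleaved generation ability. This paper proposes a strategy that needs no architecture decoupling: by analyzing models' cross-modal attention behavior, it shows that decoupling improves performance essentially by steering the model toward task-specific interaction patterns, and it proposes the Attention Interaction Alignment (AIA) loss, which refines cross-modal attention patterns and improves both generation and understanding performance.

Comments Project page: https://zhengdian1.github.io/AIA-project/ Code: https://github.com/zhengdian1/AIA

Details
English abstract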

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm due to inherently conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of architecture decoupling (e.g., double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling boosts performance by studying the cross-modal attention behavior of models. We observe that architecture decoupling does not solve task conflicts, but essentially drives models toward the cross-modal interaction patterns of task-specific models, as seen in Qwen3-VL and HunyuanImage-3.0, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.

2511.18152 2026-05-13 cs.CV cs.AI Updated version

UnfoldLDM: Degradation-Aware Unfolding with Iterative Latent Diffusion Priors for Blind Image Restoration

Chunming He, Rihan Zhang, Zheng Chen, Bowen Yang, Chengyu Fang, Yunlong Lin, Yulun Zhang, Fengyang Xiao, Sina Farsiu

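Affiliations * Duke University; Shanghai Jiao Tong University; Peking University; Tsinghua University; Xiamen University

AI summary This paper proposes UnfoldLDM, a blind image restoration method that addresses the shortcomings of existing deep unfolding networks in modeling unknown degradations and in over-smoothing. It combines deep unfolding networks with a latent diffusion model: a multi-granularity degradation-aware module estimates the unknown degradation, while a degradation-resistant diffusion model and an over-smoothing correction module recover high-frequency details and textures. Experiments show that UnfoldLDM performs strongly on a variety of blind image restoration tasks and can serve as a general framework compatible with existing methods.

Comments 6 figures, 11 tables

Details
English abstract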

Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) Degradation-specific dependency, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) Over-smoothing bias, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM, which integrates DUNs with a latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that UnfoldLDM achieves leading performance on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.

2510.09333 2026-05-13 cs.LG cs.CV Updated version

Efficient Bayesian Inference from Noisy Pairwise Comparisons

Till Aczel, Lucas Theis, Roger Wattenhofer

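Affiliations * ETH Zurich, Switzerland

AI summary This paper studies how to perform efficient Bayesian inference from noisy human pairwise comparisons in order to assess the quality of generative models. The authors propose BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweights or removes unreliable raters, and guarantees monotonic likelihood convergence through an Expectation-Maximization algorithm. Experiments show that BBQ yields more efficient, robust, and interpretable model rankings and uncertainty estimates with noisy or crowdsourced raters.

Details
English abstract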

Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ provides efficient inference, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
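
For context, the plain Bradley-Terry baseline that BBQ is compared against can be fitted with classic minorization-maximization (MM) updates. The sketch below is that baseline only: it has no rater-quality model and no Bayesian treatment, so it is not BBQ. The function name is hypothetical, and the code assumes every item wins at least one comparison.

```python
import numpy as np

def bradley_terry_mm(wins, iters=200):
    """Fit Bradley-Terry scores from a win-count matrix.

    wins[i, j] is the number of times item i beat item j (zero diagonal).
    Returns positive scores that sum to 1; P(i beats j) = p[i] / (p[i] + p[j]).
    """
    wins = np.asarray(wins, dtype=np.float64)
    n = wins.shape[0]
    games = wins + wins.T              # comparisons played between each pair
    total_wins = wins.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom
        p /= p.sum()
    return p

wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
print(bradley_terry_mm(wins))
```

Every comparison carries equal weight here regardless of who made it, which is the gap the abstract says BBQ closes by modeling rater quality explicitly.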

2510.03853 2026-05-13 cs.CV Updated version

UGround: Towards Unified Visual Grounding with Unrolled Transformers

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou

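Affiliations * College of Computer Science and Artificial Intelligence, Fudan University; Zhejiang University

AI summary UGround proposes a unified visual grounding framework that dynamically selects intermediate layers across unrolled Transformer layers as "mask as prompt", overcoming the conventional reliance on the fixed last hidden layer. The method introduces Policy-Prompted Masking, which comprises two core components, Stochastic Skip Connection and Mask as Prompt, to dynamically guide the vision model (e.g., SAM) and pass explicit spatial cues. Within a single framework, UGround covers a range of visual grounding tasks from an attribute perspective, from traditional referring expression segmentation to the newly proposed reasoning segmentation, improving the model's flexibility and applicability.

Comments This work has been accepted to ICML 2026; please refer to https://github.com/rui-qian/UGround

Details
English abstract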

We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt," diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt." UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic selection of the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we unify, for the first time, visual grounding within a single framework from an attribute perspective, spanning traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
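
The Mask as Prompt step reduces to a similarity map between one token embedding and a grid of image-token embeddings. The sketch below shows that computation only; the function name, the scaling by the square root of the embedding size, and the row-major token layout are assumptions, and the hand-off to SAM is omitted.

```python
import numpy as np

def similarity_mask(seg_token, image_tokens, grid_hw):
    """Soft logit mask from a <SEG> embedding and image-token embeddings.

    seg_token:    [D] hidden state of the <SEG> token at the selected layer.
    image_tokens: [H*W, D] hidden states of the image tokens, row-major.
    grid_hw:      (H, W) layout of the image tokens.
    """
    h, w = grid_hw
    logits = image_tokens @ seg_token / np.sqrt(seg_token.shape[0])
    return logits.reshape(h, w)

seg = np.array([2.0, 0.0, 0.0, 0.0])
tokens = np.zeros((4, 4))
tokens[2] = [1.0, 0.0, 0.0, 0.0]      # only this patch resembles the <SEG> token
print(similarity_mask(seg, tokens, (2, 2)))
```

High-logit regions mark where the query is likely located, which is the explicit spatial cue that a bare token embedding lacks.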

2509.22414 2026-05-13 cs.CV Updated version

LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer

Song Fei, Tian Ye, Lujia Wang, Lei Zhu

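Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

AI summary This paper proposes LucidFlux, a caption-free, high-fidelity image restoration method that adapts the large diffusion transformer Flux.1 for photo-realistic restoration. It introduces a lightweight dual-branch conditioner that injects signals from the degraded input and from a lightly restored proxy to anchor geometry and suppress artifacts, respectively, and designs a timestep- and layer-adaptive modulation schedule for coarse-to-fine, context-aware updates. With caption-free semantic alignment via SigLIP features and a scalable data curation pipeline, LucidFlux outperforms existing open-source and commercial methods on multiple benchmarks, demonstrating robust restoration in complex scenes without text prompts.

Comments Project Page: https://w2genai-lab.github.io/LucidFlux

Details
English abstract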

Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to anchor geometry and suppress artifacts, respectively. A timestep- and layer-adaptive modulation schedule then routes these cues across the backbone's hierarchy, yielding coarse-to-fine, context-aware updates that protect the global structure while recovering texture. To avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on (rather than adding parameters or relying on text prompts) is the governing lever for robust and caption-free image restoration in the wild.

2509.20899 2026-05-13 cs.CV Updated version

Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification

Patrick Knab, Sascha Marton, Philipp J. Schubert, Drago Guggiana, Christian Bartelt

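Affiliations * Technical University of Clausthal; Ramblr.ai Research

AI summary This paper proposes MoTIF, an interpretable video classification method that uses a Transformer architecture operating on temporal concept activations to address the difficulty of extracting and modeling concepts in video. Per-concept temporal self-attention captures how each concept varies over time and how it contributes to the prediction, and a concept discovery module based on a vision-language model automatically extracts object- and action-related textual concepts from training videos without manual annotation. Experiments show that the method outperforms global concept bottleneck models on multiple video benchmarks and remains competitive within the interpretable setting.

Details
English abstract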

Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, by employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is a class-conditioned VLM-based concept discovery module that extracts object- and action-centric textual concepts from training videos, yielding temporally expressive concept sets without manual concept annotation. Across multiple video benchmarks, this combination improves over global concept bottlenecks and remains competitive within the interpretable concept-bottleneck setting, while narrowing the gap to strong black-box video baselines that we report as contextual references. Code available at github.com/patrick-knab/MoTIF.

2506.00294 2026-05-13 astro-ph.IM cs.CV Updated version

Applying Vision Transformers on Spectral Analysis of Astronomical Objects

Luis Felipe Strano Moraes, Ignacio Becker, Pavlos Protopapas, Guillermo Cabrera-Vives

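Affiliations * Harvard Extension School; Harvard University; John A. Paulson School of Engineering and Applied Science; Department of Computer Science; Center for Data and Artificial Intelligence; Millennium Institute of Astrophysics (MAS); Millennium Nucleus for Galaxies (MINGAL); Heidelberg Institute for Theoretical Studies

AI summary This paper applies pre-trained Vision Transformers (ViTs) to astronomical spectral data by converting one-dimensional spectra into two-dimensional image representations, so that the ViT can capture local and global spectral features through spatial self-attention. The ViT is fine-tuned on millions of spectra from the SDSS and LAMOST surveys and performs well on tasks such as stellar object classification and redshift estimation: its classification accuracy exceeds that of Support Vector Machines and Random Forests, and it generalizes across object types at a level comparable to AstroCLIP. According to the authors, this is the first application of ViTs to large-scale real spectroscopic data that does not rely on synthetic inputs.

Comments 9 pages, 9 figures

Journal ref A&A 709, A122 (2026)

Details
English abstract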

We apply pre-trained Vision Transformers (ViTs), originally developed for image recognition, to the analysis of astronomical spectral data. By converting traditional one-dimensional spectra into two-dimensional image representations, we enable ViTs to capture both local and global spectral features through spatial self-attention. We fine-tune a ViT pretrained on ImageNet using millions of spectra from the SDSS and LAMOST surveys, represented as spectral plots. Our model is evaluated on key tasks including stellar object classification and redshift ($z$) estimation, where it demonstrates strong performance and scalability. We achieve classification accuracy higher than Support Vector Machines and Random Forests, and attain $R^2$ values comparable to AstroCLIP's spectrum encoder, even when generalizing across diverse object types. These results demonstrate the effectiveness of using pretrained vision models for spectroscopic data analysis. To our knowledge, this is the first application of ViTs to large-scale real spectroscopic data that does not rely on synthetic inputs.
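
Turning a one-dimensional spectrum into a plot-like image can be sketched as a simple rasterization. This is an illustration of the general idea only: the function name, the 64 by 64 default size, the min-max flux scaling, and the one-pixel-per-column rendering are assumptions, not the plotting procedure used in the paper.

```python
import numpy as np

def spectrum_to_image(flux, height=64, width=64):
    """Rasterize a 1-D spectrum into a [height, width] uint8 image.

    Each column holds one lit pixel whose row encodes the flux at that
    wavelength bin (row 0 is the top, so higher flux is drawn higher up).
    """
    flux = np.asarray(flux, dtype=np.float64)
    # Resample the spectrum onto `width` evenly spaced columns.
    src = np.arange(len(flux))
    cols = np.interp(np.linspace(0, len(flux) - 1, width), src, flux)
    lo, hi = cols.min(), cols.max()
    norm = (cols - lo) / (hi - lo) if hi > lo else np.full(width, 0.5)
    rows = np.round((1.0 - norm) * (height - 1)).astype(int)
    img = np.zeros((height, width), dtype=np.uint8)
    img[rows, np.arange(width)] = 255
    return img
```

An image like this would still need to be resized and replicated to three channels before being passed to an ImageNet-pretrained ViT.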

2501.08083 2026-05-13 cs.CV Updated version

Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving

Mert Keser, Halil Ibrahim Orhan, Niki Amini-Naieni, Gesina Schwalbe, Alois Knoll, Matthias Rottmann

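Affiliations * Continental AG; Technical University of Munich; University of Lübeck; University of Oxford; University of Wuppertal

AI summary This paper studies how to use Vision Foundation Models (VFMs) for input monitoring in complex open-world domains such as automated driving, in order to detect scenarios outside the training data distribution (OOD). The authors propose an unsupervised, model-agnostic approach that combines VFMs as feature extractors with density modeling techniques to detect semantic and covariate shifts in a unified way. Experiments show that the approach outperforms existing OOD classification methods under a variety of conditions and can identify high-risk inputs likely to cause errors in downstream tasks, offering a new direction for safety monitoring in complex vision tasks.

Details
English abstract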

Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised, and model-agnostic method that unifies detection of semantic and covariate shifts: find a full model of the training data's feature distribution, then use its density at new points as an in-distribution (ID) score. To implement this, we propose to combine Vision Foundation Models (VFMs) as feature extractors with density modeling techniques. Through a comprehensive benchmark of 4 VFMs with different backbone architectures and 5 density-modeling techniques against established baselines, we provide the first systematic evaluation of OOD classification capabilities of VFMs across diverse conditions. A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs. Additionally, we show that our method detects high-risk inputs likely to cause errors in downstream tasks, thereby improving overall performance. Overall, VFMs, when coupled with robust density modeling techniques, are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
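
The recipe in the abstract (model the training feature distribution, then use its density at a new point as the ID score) can be sketched with the simplest possible density model. A single full-covariance Gaussian is an assumption made here for brevity; the abstract does not name the five density-modeling techniques it benchmarks. The class name is hypothetical, and the features are assumed to come from some frozen feature extractor.

```python
import numpy as np

class GaussianDensityMonitor:
    """Fit one Gaussian to training features; score inputs by log-density."""

    def fit(self, feats, ridge=1e-3):
        """Estimate mean and covariance from features of shape [N, D]."""
        self.mean = feats.mean(axis=0)
        cov = np.cov(feats, rowvar=False) + ridge * np.eye(feats.shape[1])
        self.prec = np.linalg.inv(cov)
        _, self.logdet = np.linalg.slogdet(cov)
        return self

    def id_score(self, feats):
        """Gaussian log-density per row of `feats`; higher means more ID-like."""
        d = feats - self.mean
        maha = np.einsum("nd,de,ne->n", d, self.prec, d)
        k = feats.shape[1]
        return -0.5 * (maha + self.logdet + k * np.log(2.0 * np.pi))
```

No OOD samples are needed to fit the monitor, which matches the unsupervised setting described above; only the alarm threshold has to be chosen, for example from a quantile of ID scores on held-out training data.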

2501.02955 2026-05-13 cs.CV Version update

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang

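Affiliations * Tsinghua University; Zhipu AI

AI summary In recent years, vision language models (VLMs) have made notable progress in video understanding, but fine-grained motion understanding still lacks systematic study. This paper proposes MotionBench, a benchmark that comprehensively evaluates the fine-grained motion understanding of video models, covering six categories of motion-related questions and video data from multiple sources. Experiments show that existing VLMs perform poorly on fine-grained motion understanding. The authors analyze video feature compression architectures and propose an efficient Through-Encoder fusion method that improves motion perception, while showing that substantial room for improvement remains.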

Comments 20 pages

Details
English abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.

2412.13050 2026-05-13 cs.LG cs.AI cs.CL cs.CV cs.SD eess.AS Version update

Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Weiguo Pian, Shijian Deng, Shentong Mo, Mingrui Liu, Yunhui Guo, Yapeng Tian

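Affiliations * The University of Texas at Dallas; Carnegie Mellon University; George Mason University

AI summary This paper proposes Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for multimodal large language models in which tasks involve inconsistent modalities (image, audio, or video) and different task types (captioning or question answering). To counter the catastrophic forgetting caused by shifts in modality and task type, the authors propose MoInCL, which uses a Pseudo Targets Generation Module and Instruction-based Knowledge Distillation to mitigate the effect of these shifts on model performance. Experiments show that MoInCL outperforms existing continual learning methods across multiple tasks by a significant margin.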

Comments Accepted at Transactions on Machine Learning Research (TMLR), 2026

Details
English abstract

In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.

2212.02011 2026-05-13 cs.CV Version update

PointCaM: Cut-and-Mix for Open-Set Point Cloud Learning

Jie Hong, Shi Qiu, Weihao Li, Saeed Anwar, Mehrtash Harandi, Nick Barnes, Lars Petersson

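Affiliations * The University of Hong Kong; The Chinese University of Hong Kong; Australian National University; The University of Western Australia; Monash University

AI summary This paper studies open-set point cloud learning, where the model is trained without data from unknown classes and must identify unknown objects at inference time. The authors propose PointCaM, a new point cloud cut-and-mix mechanism with two modules, an Unknown-Point Simulator and an Unknown-Point Estimator, which simulates out-of-distribution data and uses multi-level feature context to distinguish known from unknown point clouds. Experiments show that the method significantly improves open-set recognition performance on several datasets.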

Comments Accepted in CVIU

Details
English abstract

Point cloud learning is receiving increasing attention. However, most existing point cloud models lack the practical ability to deal with the unavoidable presence of unknown objects. This paper primarily discusses point cloud learning in open-set settings, where we train the model without data from unknown classes and identify them during the inference stage. In essence, we propose a novel Point Cut-and-Mix mechanism for solving open-set point cloud learning, comprising an Unknown-Point Simulator and an Unknown-Point Estimator module. Specifically, we use the Unknown-Point Simulator to simulate out-of-distribution data in the training stage by manipulating the geometric context of partially known data. Based on this, the Unknown-Point Estimator module learns to exploit the point cloud's feature context to discriminate between known and unknown data. Unlike existing methods that only consider classifier features, our proposed solution leverages multi-level feature contexts to recognize unknown point cloud objects more effectively. We test the proposed approach on several datasets, including customized S3DIS, ModelNet40, and ScanObjectNN. The improved open-set performances over comparative baselines show the effectiveness of our PointCaM method. Our code is available at https://github.com/JHome1/pointcam.

2605.11824 2026-05-13 cs.CV cs.AI Version update

REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View

Kavin Chandrasekaran, Sorin Grigorescu, Gijs Dubbelman, Pavol Jancura

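Affiliations * ElektroBit Automotive GmbH; Eindhoven University of Technology; Transilvania University of Brasov

AI summary This paper proposes REFNet++, a multi-task, efficient method for fusing camera and radar sensor data in a bird's-eye polar view. A variational encoder-decoder architecture transforms camera images into the polar domain, while angle information is recovered from the radar range-Doppler spectrum to produce range-azimuth features, aligning the two modalities in a unified domain. According to the summary, the method improves computational efficiency while preserving fusion accuracy and outperforms existing methods on vehicle detection and free space segmentation.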

Comments IEEE Intelligent Transportation Systems Conference (ITSC) 2025

Details
English abstract

Camera sensors generally offer a realistic view of the vehicle's surroundings, which is crucial for environmental perception. Affordable radar sensors, on the other hand, are becoming invaluable due to their robustness in variable weather conditions. However, because of their noisy output and reduced classification capability, they work best when combined with other sensor data. Specifically, we address the challenge of multimodal sensor fusion by aligning radar and camera data in a unified domain, prioritizing not only accuracy, but also computational efficiency. Our work leverages the raw range-Doppler (RD) spectrum from radar and front-view camera images as inputs. To enable effective fusion, we employ a variational encoder-decoder architecture that learns the transformation of front-view camera data into the Bird's-Eye View (BEV) polar domain. Concurrently, a radar encoder-decoder learns to recover the angle information from the RD data to produce Range-Azimuth (RA) features. This alignment ensures that both modalities are represented in a compatible domain, facilitating robust and efficient sensor fusion. We evaluated our fusion strategy for vehicle detection and free space segmentation against state-of-the-art methods using the RADIal dataset.

2605.11818 2026-05-13 cs.CV Version update

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

Binhao Wang, Shihao Zhao, Bo Cheng, Qiuyu Ji, Yuhang Ma, Liebucha Wu, Shanyuan Liu, Dawei Leng, Yuhui Yin

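Affiliations * Wenzhou University; 360 AI Research

AI summary This paper proposes RevealLayer, a diffusion-based image layer decomposition method that targets the separation of hidden and visible layers in complex natural images and the recovery of occluded content. It introduces a Region-Aware Attention module, an Occlusion-Guided Adapter, and a composite loss to achieve more precise layer separation and reconstruction of occluded content. The team also builds the high-quality RevealLayer-100K dataset and the RevealLayerBench evaluation benchmark; experiments show that the method outperforms existing approaches on layer decomposition.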

Details
English abstract

Recent diffusion-based approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we propose RevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) a Region-Aware Attention module to disentangle hidden and visible layers; (2) an Occlusion-Guided Adapter to leverage contextual information to enhance overlapping regions; and (3) a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduce RevealLayer-100K, a high-quality multi-layer natural image dataset constructed through a collaboration between automated algorithms and human annotation, and further establish RevealLayerBench for benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.

2605.11808 2026-05-13 cs.CV Version update

Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

Zhenxin Qin, Qiang Li, Qingzhuo Wang, Ruiyang Qin, Zhihua Wei, Wen Shen

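Affiliations * Tongji University

AI summary This paper studies action-relation hallucinations in large vision-language models (LVLMs), where the generated text is inconsistent with the action relations in the visual input. The authors propose Relation-aware Visual Enhancement (RVE), which uses an Action-Relation Sensitivity (ARS) score to locate action-relevant regions containing key visual cues and strengthens the model's attention to those regions. Experiments show that the method outperforms existing approaches at mitigating action-relation hallucinations with almost no added inference cost, and that it generalizes well to spatial-relation and object hallucination tasks.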

Details
English abstract

Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.

2605.11804 2026-05-13 cs.LG cs.CV Version update

Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning

Patryk Krukowski, Jacek Tabor, Przemysław Spurek, Marek Śmieja, Łukasz Struski

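Affiliations * Jagiellonian University; IDEAS Research Institute

AI summary This paper studies model inversion in data-free continual learning (DFCIL), with the goal of generating high-quality pseudo-samples to mitigate catastrophic forgetting. Existing methods usually assume that feature distributions have diagonal covariance, ignoring correlations between features and yielding low-quality samples. The authors propose REMIX, which models structured covariance through a Laplace kernel parameterization, capturing feature dependencies while remaining computationally efficient and significantly improving both the fidelity of synthesized samples and DFCIL performance.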

Details
English abstract

Data-free continual learning (DFCIL) relies on model inversion to synthesize pseudo-samples and mitigate catastrophic forgetting. However, existing inversion methods are fundamentally limited by a simplifying assumption: they model feature distributions using diagonal covariance, effectively ignoring correlations that define the geometry of learned representations. As a result, synthesized samples often lack fidelity, limiting knowledge retention. In this work, we show that modeling feature dependencies is a key ingredient for effective DFCIL. We introduce REMIX, a structured covariance modeling framework that enables scalable full-covariance modeling without the prohibitive cost of dense matrix inversion and log-determinant computation. By leveraging a Laplace kernel parameterization, REMIX captures structured feature dependencies using memory that scales linearly with the feature dimensionality, while requiring only an additional logarithmic factor in computation. Modeling these correlations produces more coherent synthetic samples and consistently improves performance across standard DFCIL benchmarks. Our results demonstrate that moving beyond diagonal assumptions is essential for effective and scalable data-free continual learning. Our code is available at https://github.com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.

2605.11803 2026-05-13 cs.CV cs.AI Version update

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

Minseok Kang, Minhyeok Lee, Jungho Lee, Minjung Kim, Donghyeong Kim, Dayeon Lee, Heeseung Choi, Ig-jae Kim, Sangyoun Lee

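Affiliations * Yonsei University; LG Electronics; KIST

AI summary As video large language models (Video-LLMs) process longer and more complex videos, their inference cost rises rapidly with the number of visual tokens across frames. This paper proposes OTT-Vid, a temporal token compression method based on optimal transport. It uses spatial pruning to identify key content in each frame and an optimal transport formulation with non-uniform token mass to estimate the compressibility of adjacent frames, allocating compression budgets dynamically and protecting semantically important tokens. Experiments show that OTT-Vid preserves 95.8% of video question answering performance and 73.9% of temporal grounding performance while retaining only 10% of tokens, outperforming existing training-free compression methods.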

Comments 22 pages, 9 figures. Code available at https://github.com/minseokii/OTT-Vid

Details
English abstract

As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
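
To make the role of non-uniform token mass and transport cost concrete, here is a minimal sketch of entropic optimal transport (Sinkhorn iterations) between the tokens of two neighboring frames. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which OT solver is used, the cost matrices and masses are toy values, and `allocate_budgets` is a hypothetical proportional rule.

```python
import math

def sinkhorn(mass_a, mass_b, cost, eps=0.1, iters=200):
    """Entropic OT between two token sets with non-uniform masses.

    Returns the transport plan and its total transport cost.
    """
    n, m = len(mass_a), len(mass_b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [mass_a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [mass_b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total

def allocate_budgets(difficulties, total_tokens):
    """Hypothetical rule: harder-to-merge frame pairs keep a larger share of tokens."""
    total = sum(difficulties)
    return [total_tokens * d / total for d in difficulties]

# Token masses play the role of importance scores (toy values, each summing to 1).
mass_prev = [0.7, 0.3]
mass_next = [0.7, 0.3]
similar_cost = [[0.0, 1.0], [1.0, 0.0]]      # tokens match one-to-one
dissimilar_cost = [[1.0, 1.0], [1.0, 1.0]]   # nothing matches cheaply

plan_easy, easy = sinkhorn(mass_prev, mass_next, similar_cost)
plan_hard, hard = sinkhorn(mass_prev, mass_next, dissimilar_cost)
budgets = allocate_budgets([easy, hard], total_tokens=100)
```

In this toy setup the near-identical frame pair has a transport cost close to zero and receives almost no token budget, while the dissimilar pair keeps nearly all of it. A locality-aware cost, as described in the abstract, would replace the hand-written cost matrices.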

2605.11799 2026-05-13 cs.CV Version update

SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions

Markus Essl, Marta Moscati, Mubashir Noman, Muhammad Zaigham Zaheer, Usman Naseem, Shah Nawaz, Markus Schedl

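Affiliations * Johannes Kepler University Linz; MBZUAI; Macquarie University; Linz Institute of Technology

AI summary This paper proposes a method to make multimodal sensor fusion for 3D object detection in autonomous vehicles more robust. For cases where camera or LiDAR data are missing or corrupted, it designs a framework-agnostic fusion module that handles the failure or corruption of a single modality and is instantiated and validated in the BEVFusion framework. Experiments show strong performance across a range of sensor degradation scenarios, reaching state-of-the-art results under extreme weather and sensor failure.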

Comments Accepted at ICIP 2026

Details
English abstract

Multimodal sensor fusion has demonstrated remarkable performance improvements over unimodal approaches in 3D object detection for autonomous vehicles. Typically, existing methods transform multimodal data from independent sensors, such as camera and LiDAR, into a unified bird's-eye view (BEV) representation for fusion. Although effective in ideal conditions, this strategy suffers from substantial performance deterioration when camera or LiDAR data are missing, corrupted, or noisy. To address this vulnerability, we develop a framework-agnostic fusion module for camera and LiDAR data that allows for handling cases when one of the two modalities is missing or corrupted. To demonstrate the effectiveness of our module, we instantiate it in BEVFusion [1], a well-established framework to combine camera and LiDAR data for 3D object detection. By means of quantitative experiments on the MultiCorrupt dataset, we demonstrate that our module achieves favorable performance improvements under scenarios of missing and corrupted modalities, substantially outperforming existing unified representation approaches across a wide range of sensor deterioration scenarios and reaching state-of-the-art performance in scenarios of corrupted modality due to extreme weather conditions and sensor failure.

2605.11782 2026-05-13 cs.CV Version update

Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

Antoni Valls, Jordi Sanchez-Riera

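Affiliations * Institut de Robòtica i Informàtica Industrial, CSIC-UPC

AI summary This study addresses safe, independent navigation in urban environments for people with visual impairment. It proposes an event map framework based on visual question answering (VQA) that uses vision-language models (VLMs) to describe pedestrian scenes and identify hazards. A three-level hierarchical query structure enables fine-grained scene understanding without task-specific retraining, and model responses are aggregated into a weighted risk scoring system that produces navigable risk maps with four safety categories. The study also builds a diverse dataset covering 20 cities on six continents and shows that generative multimodal large language models perform best on this task.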

Comments 10 pages, 6 figures, submitted to IEEE T-ITS

Details
English abstract

Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
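
The aggregation step, turning per-question answers into a weighted score and then into one of four safety categories, could look roughly like the sketch below. The hazard names, weights, thresholds, and category labels are all hypothetical, since the abstract does not list them.

```python
# Hypothetical hazard weights; the abstract does not specify the actual values.
HAZARD_WEIGHTS = {"obstacle": 1.0, "construction": 2.0, "no_sidewalk": 3.0}
# Exclusive upper bounds for the first three categories; higher scores fall in the fourth.
CATEGORY_BOUNDS = [(1.0, "safe"), (3.0, "caution"), (5.0, "risky")]

def segment_risk(answers):
    """Sum the weights of the hazards the VQA model reported as present."""
    return sum(HAZARD_WEIGHTS[name] for name, present in answers.items() if present)

def safety_category(score):
    """Map a risk score to one of four discrete safety categories."""
    for bound, label in CATEGORY_BOUNDS:
        if score < bound:
            return label
    return "dangerous"

clear_street = {"obstacle": False, "construction": False, "no_sidewalk": False}
blocked_street = {"obstacle": True, "construction": True, "no_sidewalk": True}
clear_label = safety_category(segment_risk(clear_street))
blocked_label = safety_category(segment_risk(blocked_street))
```

A route planner could then use the category of each street segment as an edge cost, which is how a risk-aware event map would feed into path search.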

2605.11771 2026-05-13 cs.CV Version update

Revisiting Shadow Detection from a Vision-Language Perspective

Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li

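Affiliations * CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, Department of Electronic Engineering and Information Science, University of Science and Technology of China

AI summary This paper revisits shadow detection from a vision-language perspective. It observes that traditional methods based on visual cues can fail in visually ambiguous scenes and proposes the SVL framework, which uses language as an explicit semantic reference to distinguish shadows from similar dark regions. SVL aligns image and text embeddings through scene-level shadow ratio regression and introduces a global-to-local coupling mechanism that keeps image-level and fine-grained predictions consistent, while remaining parameter-efficient. Experiments show strong performance and robustness on multiple benchmarks.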

Details
English abstract

Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision-language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision-Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.

2605.11760 2026-05-13 cs.CV Version update

M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

Jiyuan Liu, Jia Lin, Xiaofei Zhou, Runmin Cong, Deyang Liu, Zhi Liu

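Affiliations * Hangzhou Dianzi University; Shandong University; Anqing Normal University; Shanghai University

AI summary This paper proposes M$^4$-SAM, a multi-modal mixture-of-experts model that aims to improve RGB-D video salient object detection. By introducing a modality-aware LoRA mechanism, a multi-level feature fusion module, and a pseudo-guided initialization that needs no manual prompts, M$^4$-SAM addresses SAM2's limitations in spatial modeling, multi-scale feature use, and dependence on prompt-based initialization. Experiments show that the method achieves state-of-the-art detection performance on three public datasets.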

Comments 10 pages, 3 figures

Details
English abstract

The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges: the limited spatial modeling of linear LoRA, insufficient use of SAM's multi-scale features, and the dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. First, we inject Modality-Aware MoE-LoRA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Second, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.

2605.11758 2026-05-13 eess.IV cs.CV Version update

DiffSegLung: Diffusion Radiomic Distillation for Unsupervised Lung Pathology Segmentation

Rezkellah Noureddine Khiati, Pierre-Yves Brillet, Catalin Fetita

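AI summary This paper proposes DiffSegLung, an unsupervised lung pathology segmentation framework that addresses the lack of annotated CT data and the failure of existing diffusion models to exploit the Hounsfield Unit (HU) signal. It introduces Diffusion Radiomic Distillation, in which handcrafted radiomic features act as a physics-grounded teacher that guides the bottleneck features of a 3D diffusion U-Net, extracting pathology-discriminative structure without annotations. Experiments show that the method significantly improves segmentation performance and generation quality on several heterogeneous CT datasets.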

Details
English abstract

Unsupervised segmentation of pulmonary pathologies in CT remains an open challenge due to the absence of annotated multi-pathology cohorts and the failure of existing diffusion-based methods to exploit the quantitative Hounsfield Unit (HU) signal that physically distinguishes tissue classes. To address this, we propose DiffSegLung, a framework that introduces Diffusion Radiomic Distillation, in which handcrafted radiomic descriptors serve as a physics-grounded teacher to shape the bottleneck of a 3D diffusion U-Net via a contrastive objective, transferring pathology-discriminative structure into the learned representation without any annotations. At inference, the teacher is discarded and multi-timestep bottleneck features are clustered by a Gaussian Mixture Model with HU-guided label assignment, followed by Sobel Diffusion Fusion for boundary refinement. Evaluated on 190 expert-annotated axial slices drawn from four heterogeneous CT cohorts, DiffSegLung improves segmentation across all four pathology classes over unsupervised baselines and improves generation fidelity over prior CT diffusion models.

2605.11756 2026-05-13 cs.CV cs.AI Version update

Focusable Monocular Depth Estimation

Yuxin Du, Tao Lin, Zile Zhong, Runting Li, Xiyao Chen, Jiting Liu, Chenglin Liu, Ying-Cong Chen, Yuqian Fu, Bo Zhao

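Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University; The Hong Kong University of Science and Technology (Guangzhou); King Abdullah University of Science and Technology

AI summary This paper proposes Focusable Monocular Depth Estimation (FDE), which aims to improve depth accuracy in user-specified or task-relevant regions. It introduces the prompt-based FocusDepth framework, whose Multi-Scale Spatial-Aligned Fusion (MSSA) aligns and fuses multi-scale features with target-region prompts, strengthening depth perception in the target region while preserving global scene geometry. The study also builds the FDE-Bench benchmark; experiments show that the method clearly outperforms existing baselines on target boundaries and foreground regions.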

Details
English abstract

Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA's spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.
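
AbsRel, the metric quoted in the ablation, is the mean of |prediction - ground truth| / ground truth over valid pixels. The sketch below computes it globally and restricted to a target region, which is the kind of region-aware comparison the task calls for. It assumes predictions are already aligned to the ground-truth scale and uses flat lists of toy depth values; the exact evaluation protocol of FDE-Bench is not given in the abstract.

```python
def abs_rel(pred, gt, mask=None):
    """Mean absolute relative error |pred - gt| / gt over valid (gt > 0) pixels.

    If mask is given, only pixels where mask is True are evaluated.
    """
    errors = []
    for i, (p, g) in enumerate(zip(pred, gt)):
        if g <= 0:
            continue  # skip pixels without valid ground truth
        if mask is not None and not mask[i]:
            continue  # skip pixels outside the region of interest
        errors.append(abs(p - g) / g)
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return sum(errors) / len(errors)

# Toy depth values (hypothetical); the last two pixels form the target region.
gt = [2.0, 2.0, 4.0, 4.0]
pred = [2.0, 2.2, 4.0, 5.0]
target = [False, False, True, True]
global_err = abs_rel(pred, gt)          # averaged over all four pixels
target_err = abs_rel(pred, gt, target)  # averaged over the target region only
```

Here the target-region error (0.125) is larger than the global error (0.0875), showing how a uniform average can hide errors concentrated in the region a user cares about.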

2605.11750 2026-05-13 cs.RO cs.AI cs.CL cs.CV Version update

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu, Ruihua Han, Manling Li, Hengshuang Zhao

发表机构 * HKU(香港大学) HKUST(香港理工大学) Northwestern University(西北大学)

AI总结 Vision-Language-Action(VLA)模型在精细操作任务中容易因关键阶段的微小动作错误而引发不可恢复的失败。为解决这一问题,本文提出DreamAvoid,一种在测试阶段通过“梦境”模拟来预判并规避失败的框架。该方法引入梦境触发机制、动作提案和梦境评估器,通过模拟候选动作的短期未来结果,选择最优动作以提升任务成功率。实验表明,DreamAvoid能有效减少失败情况,提高实际操作任务的完成率。

Comments 19 pages, 7 figures

详情
英文摘要

Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.

2605.11748 2026-05-13 cs.CV Version update

BronchoLumen: Analysis of recent YOLO-based architectures for real-time bronchial orifice detection in video bronchoscopy

Yongchao Li, Marian Himstedt

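AI summary This paper presents BronchoLumen, a real-time YOLO-based system for detecting bronchial orifices in video bronchoscopy images, intended to assist bronchoscopy navigation and computer-aided diagnosis systems. The study compares the detection performance of YOLOv8 and YOLOv12, which integrates attention modules, across image domains. YOLOv12 is slightly better than YOLOv8 in localization accuracy but slightly worse in overall precision, and the system is robust in most scenarios. The method offers an efficient and accurate solution for cross-domain bronchial orifice detection and has been released publicly to support related research.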

Comments 10 pages, 4 figures, IPCAI 2026

Details
English abstract

Bronchoscopy is routinely conducted in pulmonary clinics and intensive care units, but navigating the complex branching of the respiratory tract remains challenging. This paper introduces BronchoLumen, a real-time YOLO-based system for detecting bronchial orifices in video bronchoscopy, aiming to assist navigation and CAD systems. The paper investigates whether bronchial orifices can be robustly detected across image domains using state-of-the-art object detection and a limited set of public image data. The study includes the description and comparison of YOLOv8, a widely adopted architecture, and YOLOv12, a more recent architecture integrating attention-based modules to improve spatial reasoning. Both models are trained and tested solely on publicly available datasets comprising different image domains. The two models are compared using the common metrics mAP@0.5 and mAP@0.5:0.9, with the latter emphasizing localization accuracy. For YOLOv8 we obtained an mAP@0.5 of 0.91 on an in-domain and 0.68 on a cross-domain test set. YOLOv12 achieved 0.84 and 0.68, respectively, with slightly better localization accuracy: an mAP@0.5:0.9 of 0.48 and 0.26, compared to 0.45 and 0.25 for YOLOv8. Challenges like motion blur and low contrast occasionally caused uncertain detections, but the system demonstrated overall robustness in most scenarios. BronchoLumen is an open-weight, YOLO-based solution for bronchial orifice detection offering high accuracy and efficiency across multiple image domains. While the more recent YOLOv12 achieves better localization accuracy, we observed a slightly worse precision. The models have been made publicly available to foster further research in bronchoscopy navigation.
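
Both reported metrics rest on intersection over union (IoU): under mAP@0.5 a predicted box counts as a match when its IoU with a ground-truth box is at least 0.5, while the ranged variant averages over stricter thresholds and so rewards tighter localization. A minimal IoU sketch follows; the box format `(x1, y1, x2, y2)` and the function names are assumptions for illustration, not taken from the paper's code.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A detection matches the ground truth when IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold

gt_box = (0, 0, 10, 10)
shifted = (5, 0, 15, 10)  # overlaps half of the ground-truth box
```

The shifted box has an IoU of one third with the ground truth, so it would be rejected at the 0.5 threshold even though it covers half of the orifice.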

2605.11743 2026-05-13 cs.CV cs.LG Version update

WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views

SeongMin Jin, Doo Seok Jeong

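Affiliations * Department of Semiconductor Engineering, Hanyang University, Republic of Korea

AI summary This paper proposes WorldComp2D, a lightweight representation learning framework that learns spatio-semantic representations of object identity and location from local views. The method uses multiscale local receptive fields to explicitly structure the latent space according to object identity and spatial proximity, and comprises a proximity-dependent encoder and a localizer that infers object coordinates in the input. Experiments show that, compared with existing lightweight models, WorldComp2D reduces parameters and computation by up to 4.0× and 2.2×, respectively, while maintaining real-time performance on CPU, supporting its efficiency and generality for spatio-semantic reasoning.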

Comments Accepted as a regular paper at ICML 2026

Details
English abstract

Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0× and 2.2×, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at https://github.com/JinSeongmin/WorldComp2D.

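2605.11727 2026-05-13 cs.AI cs.CL cs.CV Updated version

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

Kepeng Xu, Li Xu, Gang He, Wenxin Yu

Affiliations * Xidian University; Southwest University of Science and Technology

AI summary This study examines whether vision-language models perceive better when their visual input is moved closer to the raw camera measurement. It proposes PRISM-VL, a measurement-grounded vision-language learning framework that combines RAW-derived inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation. Experiments show clear gains in challenging cases such as low light and high dynamic range, supporting the value of preserving measurement-domain information for multimodal reasoning.

Details
English abstract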

Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.

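2605.11722 2026-05-13 cs.CV cs.LG Updated version

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

Sunung Mun, Sunghyun Cho, Jungseul Ok

Affiliations * Graduate School of Artificial Intelligence, POSTECH; Department of Computer Science & Engineering, POSTECH

AI summary EPIC is a training-free inference-time refinement framework for compositional text-to-image generation with prompts involving multiple objects, counts, attributes, and relations. It parses the prompt into a fixed visual program and uses predicate-guided search to verify and correct images, judging generation successful only when all conditions are satisfied. Experiments show that EPIC substantially improves accuracy on GenEval2 while greatly reducing compute cost relative to existing methods.

Details
English abstract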

Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
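
The predicate-guided loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not EPIC's implementation: the `Predicate` type, the `Scene` dictionary standing in for visual evidence extracted from an image, the `is_global` flag, and the `generate` and `edit` callables are all hypothetical placeholders for the image generator, the editor, and the MLLM-based verifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for "visual evidence extracted from an image":
# a mapping from object name to the number of detected instances.
Scene = Dict[str, int]


@dataclass(frozen=True)
class Predicate:
    name: str
    check: Callable[[Scene], bool]
    is_global: bool  # assumption: global failures trigger resampling


def failed_predicates(program: List[Predicate], scene: Scene) -> List[Predicate]:
    """Verify a scene against the fixed visual program."""
    return [p for p in program if not p.check(scene)]


def refine(program, generate, edit, max_steps=5):
    """Search until every predicate holds or the step budget runs out."""
    scene = generate()
    for _ in range(max_steps):
        failed = failed_predicates(program, scene)
        if not failed:
            return scene, True  # success only when all predicates pass
        if any(p.is_global for p in failed):
            scene = generate()  # global failure: resample from scratch
        else:
            scene = edit(scene, failed)  # local failure: targeted edit
    return scene, not failed_predicates(program, scene)
```

The program is built once and never modified inside `refine`, which mirrors the abstract's point that the visual program stays fixed while only the image changes.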

2605.11705 2026-05-13 cs.CV Updated version

CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection

Boran Zhao, Hetian Liu, Zhenxian Hu, Yuqing Yuan, Yu Yan, Pengju Ren

Affiliations * School of Software Engineering, the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics; School of Software Engineering; XJTU-POLIMI Joint School; Faculty of Electronic and Information Engineering; School of Human Settlements and Civil Engineering; the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics

AI summary This paper proposes CAST, a multimodal coreset selection framework that addresses the high computational cost of training multimodal models on large-scale image-text datasets. CAST builds topologies for the image and text modalities and fuses them with a local-collapse-aware strategy to obtain a balanced cross-modal representation. It also introduces multi-scale distribution matching in the diffusion wavelet domain and a local soft relational coverage mechanism, improving how well the coreset preserves semantic structure and fine-grained detail while suppressing redundancy. Experiments show that CAST outperforms existing methods on multiple datasets, with stronger cross-architecture generalization and computational efficiency.

Details
English abstract

The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines, showcasing great superiority in cross-architecture generalization and energy efficiency over state-of-the-art multimodal synthesis methods.

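2605.11704 2026-05-13 cs.CV Updated version

ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

Inwoo Hwang, Hojun Jang, Bing Zhou, Jian Wang, Young Min Kim, Chuan Guo

Affiliations * Seoul National University; Snap Inc.; Meta Reality Labs

AI summary This paper presents ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. It treats motion generation as a coarse-to-fine process, autoregressively predicting discrete tokens across multiple skeletal-temporal scales to produce high-quality motion sequences. Bitwise quantization and prediction enlarge the token vocabulary and stabilize generation; experiments show that the method outperforms existing approaches on several metrics and supports training-free, text-guided motion editing.

Comments Project page: https://inwoohwang.me/ScaleMoGen

Details
English abstract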

We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.

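2605.11696 2026-05-13 cs.CV cs.AI cs.GR Updated version

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting

Lezhong Wang, Mehmet Onurcan Kaya, Siavash Bigdeli, Jeppe Revall Frisvad

Affiliations * Technical University of Denmark; Inria

AI summary WildRelight is the first in-the-wild dataset designed for single-image relighting. It contains high-resolution outdoor scenes paired with high-dynamic-range environment maps and is used to evaluate existing methods under real conditions. The dataset reveals that current state-of-the-art models trained on synthetic data suffer severe domain shift in the real world. The study proposes a physics-guided inference framework that combines Diffusion Posterior Sampling with temporal Sampling-Aware Test-Time Adaptation to align synthetic models with real scenes on the fly, offering a new approach to the sim-to-real challenge.

Comments Companion paper to the CVPR26 findings paper 'WildRelight', introducing the physics-guided adaptation method evaluated on the dataset. Project Page: https://lez-s.github.io/wildrelight_proj/

Details
English abstract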

Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.

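2605.11695 2026-05-13 cs.CV cs.AI Updated version

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

Mikako Ochiai, Masatoshi Nagano, Tadahiro Taniguchi

Affiliations * Graduate School of Informatics, Kyoto University

AI summary This paper studies emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. The agents exchange only discrete token sequences and update their own models from local perceptual evidence, without relying on a shared communicative objective. Experiments show that this form of communication produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval, and reveal how visual-encoder heterogeneity affects the content and symmetry of the resulting language.

Details
English abstract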

Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
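
The listener-side accept/reject step can be illustrated with a generic Metropolis-Hastings-style rule. This is only a sketch of the standard acceptance ratio, assuming the listener scores both captions with its own model conditioned on its own visual features; the abstract does not give the exact criterion used in MHCG, and `listener_logp` is a hypothetical scoring function.

```python
import math
import random


def mh_accept_prob(logp_proposed: float, logp_current: float) -> float:
    """Acceptance probability min(1, p(proposed) / p(current)).

    Both log-probabilities are assumed to come from the listener's own
    model, conditioned on the listener's own visual features.
    """
    return math.exp(min(0.0, logp_proposed - logp_current))


def listener_step(current, proposed, listener_logp, rng: random.Random):
    """Return the caption the listener keeps after one proposal."""
    prob = mh_accept_prob(listener_logp(proposed), listener_logp(current))
    return proposed if rng.random() < prob else current
```

A proposal the listener finds at least as plausible as its current caption is always accepted; a less plausible one is accepted only with probability equal to the likelihood ratio, so the speaker never overrides the listener's perception directly.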

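2605.11683 2026-05-13 cs.CV Updated version

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

Kaixuan He, Song Chen, Yi Kang

Affiliations * University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary Vision Transformers (ViTs) are computationally expensive because of the quadratic complexity of self-attention. To address this, the paper proposes DORA, a reinforcement-learning-based dynamic online inference framework for adaptive token merging in ViTs. DORA models token merging as a Markov decision process in which a lightweight RL agent decides the merging strategy from the current feature state and layer context, and it optimizes the agent with a non-linear distillation-based penalty to balance computational efficiency and feature fidelity. Experiments show that DORA outperforms existing methods across multiple ViT scales, achieving notable computational savings with minimal accuracy loss.

Comments Preprint. Under review

Details
English abstract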

Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (<= 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.

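2605.11680 2026-05-13 cs.CV Updated version

ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

Shivam Kumar

Affiliations * Independent Researcher

AI summary This paper introduces ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: a model must generate an executable drawing program from a rendered image, and the re-rendered result is compared with the target image. Instances are generated from a seeded random number generator, so fresh held-out test sets can be created; the released split contains 150 samples across difficulty tiers and is scored with several metrics. Experiments show that current state-of-the-art models remain limited on exact match, so the benchmark is far from saturated.

Comments 14 pages, 5 figures, 2 tables. Code, data, and artifacts: https://github.com/shivamk3r/shape-code-bench ; archival release: https://doi.org/10.5281/zenodo.20132286

Details
English abstract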

We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest multimodal configuration preserves much of the foreground structure but still misses exact match because of small parameter errors. Best overall exact match remains low, so ShapeCodeBench is far from saturated. The benchmark code, frozen dataset, run artifacts, and paper sources are released to support independent replication and extension.
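
Three of the raster-level scores named in this abstract (exact match, pixel accuracy, foreground IoU) can be illustrated on binary canvases. The sketch below uses common definitions of these metrics; the benchmark's own evaluator may differ in details, and the `Canvas` type and the empty-canvas IoU convention are assumptions made for illustration.

```python
from typing import List

Canvas = List[List[int]]  # 1 = foreground (black) pixel, 0 = background


def score(pred: Canvas, target: Canvas) -> dict:
    """Compare a re-rendered prediction with the target raster."""
    pairs = [(p, t) for prow, trow in zip(pred, target) for p, t in zip(prow, trow)]
    agree = sum(p == t for p, t in pairs)
    inter = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    return {
        "exact_match": agree == len(pairs),
        "pixel_accuracy": agree / len(pairs),
        # Convention assumed here: two empty canvases count as IoU 1.0.
        "foreground_iou": inter / union if union else 1.0,
    }
```

On a mostly white canvas, pixel accuracy stays high even when shapes are misplaced, while foreground IoU and exact match drop sharply. This is consistent with the abstract's remark that small parameter errors preserve structure but still miss exact match.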

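2605.11659 2026-05-13 cs.CV cs.AI Updated version

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

Yaze Zhao, Yicong Liu, Yixiong Zou, Yuhua Li, Ruixuan Li

Affiliations * School of Computer Science and Technology, Huazhong University of Science and Technology

AI summary This paper studies source-free cross-domain few-shot learning (CDFSL): adapting large models such as CLIP to a target domain from a few samples when source-domain data is unavailable. It finds that adapter-based methods (e.g., LoRA) outperform prompt-based methods in CDFSL, and that the advantage comes from rectifying the attention of the visual CLS token, which strengthens modality alignment and class separation. Building on this finding, the authors propose Semantic Probe, a general attention rectification framework that improves both adapter- and prompt-based methods in CDFSL and achieves state-of-the-art results on multiple benchmarks.

Details
English abstract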

Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find that adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover that LoRA's superiority stems from rectifying the collapsed attention of the visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find that the textual EOS token exhibits much better attention to visual samples, and that CLIP's standard contrastive loss weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Code will be released.

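2605.11634 2026-05-13 cs.CV cs.AI Updated version

Unlocking UML Class Diagram Understanding in Vision Language Models

Artem Naboichenko, René Peinl

Affiliations * Hof University of Applied Sciences

AI summary Although vision-language models (VLMs) have made notable progress across applications, they still fall short in understanding structured visual content such as diagrams, and UML class diagrams from the computer science domain in particular have received little study. This paper presents a visual question answering benchmark based on UML class diagrams that is both challenging and manageable, and builds a large-scale training dataset of 16,000 image-question-answer triples. Experiments show that a LoRA-based fine-tune outperforms Qwen 3.5 27B, a current mainstream model, on this task.

Details
English abstract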

Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regarding diagrams compared to photos. Although progress has been made on bar charts, line charts, and similar diagrams, there is still little research on other types of diagrams, e.g., in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams that is both challenging and manageable. We further construct a large-scale training dataset with 16,000 image-question-answer triples and show that a LoRA-based fine-tune easily outperforms Qwen 3.5 27B, a recent VLM that performs well on many other benchmarks.

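2605.11628 2026-05-13 cs.CV Updated version

Single-Shot HDR Recovery via a Video Diffusion Prior

Chinmay Talegaonkar, Jinshi He, Christopher McKenna, Nicholas Antipa

Affiliations * University of California San Diego; Creare LLC

AI summary This paper proposes a single-shot high dynamic range (HDR) image recovery method based on a video diffusion prior, addressing the trade-off between fidelity and model complexity in existing methods. It recasts HDR reconstruction as conditional video generation: an exposure sequence is generated and fused into the final HDR image, improving the accuracy and interpretability of the reconstruction. Experiments show that the method outperforms existing methods on several metrics and is preferred in human evaluation, and that the framework extends to other image reconstruction tasks.

Details
English abstract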

Recent generative methods for single-shot high dynamic range (HDR) image reconstruction show promising results, but often struggle with preserving fidelity to the input image. They require separate models to handle highlights and shadows, or sacrifice interpretability by directly predicting the final HDR image. We address these limitations by re-casting single-shot HDR reconstruction as conditional video generation and fusing the generated frames into an HDR image. We finetune a video diffusion model to generate an exposure bracket, conditioned on a low dynamic range (LDR) input. We fuse this image bracket using per-pixel weights predicted by a light-weight UNet. This formulation is simple, interpretable, and effective. Rather than directly hallucinating an HDR image, it explicitly reconstructs the intermediate exposure stack and fuses it into the final output. Our method eliminates the need for separate models across exposure regimes and produces HDR reconstructions with high input fidelity. On quantitative benchmarks, we outperform state-of-the-art generative baselines with comparable model capacity on several reconstruction metrics. Human evaluators further prefer our results in 72% of pairwise comparisons against existing methods. Finally, we show that this input-conditioned sequence generation and fusion framework extends beyond HDR to other image reconstruction tasks, such as all-in-focus image recovery from a single defocus-blurred input.
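
The fusion step can be pictured as a per-pixel weighted merge of the exposure bracket. The sketch below is a generic bracket merge, not the paper's pipeline: it assumes linear (not gamma-encoded) frames with known relative exposures, and it takes the per-pixel weights as given, whereas the paper predicts them with a lightweight UNet. The function name and argument layout are hypothetical.

```python
from typing import List, Sequence


def fuse_bracket(
    frames: Sequence[Sequence[float]],   # one flattened linear image per exposure
    exposures: Sequence[float],          # relative exposure of each frame
    weights: Sequence[Sequence[float]],  # non-negative per-pixel weights
) -> List[float]:
    """Weighted per-pixel merge of an exposure bracket into radiance.

    Assumes every pixel has at least one non-zero weight.
    """
    fused = []
    for i in range(len(frames[0])):
        total = sum(w[i] for w in weights)
        # Dividing by exposure puts every frame on a common radiance scale.
        radiance = sum(w[i] * f[i] / e for f, e, w in zip(frames, exposures, weights))
        fused.append(radiance / total)
    return fused
```

Giving a clipped pixel zero weight in the long exposure lets the short exposure alone determine its radiance, which is how a bracket can recover highlights that a single LDR frame has lost.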

2605.11622 2026-05-13 cs.CV Updated version

RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq Prediction

Yaxuan Song, Jianan Fan, Tianyi Wang, Qiuyue Hu, Hang Chang, Heng Huang, Weidong Cai

Affiliations * School of Computer Science, The University of Sydney, Australia; Engineering Division, Lawrence Berkeley National Lab, USA; Berkeley Biomedical Data Science Center, Lawrence Berkeley National Lab, USA; Department of Computer Science, University of Maryland College Park, USA

AI summary This paper proposes RNA-FM, a generative model that predicts genome-wide RNA sequencing (RNA-seq) data from histopathology whole-slide images (WSIs). It formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a morphology-conditioned velocity field that maps a simple prior to the target gene expression distribution, and so better captures biological heterogeneity and predictive uncertainty. By incorporating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation; experiments show that it outperforms existing methods in both performance and biological meaningfulness.

Comments 15 pages, 13 tables, 3 figures. Accepted by the Forty-Third International Conference on Machine Learning (ICML2026). Code is available at https://github.com/YXSong000/RNA-FM

Details
English abstract

Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at https://github.com/YXSong000/RNA-FM.
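
For readers unfamiliar with flow matching, the training target and sampling loop can be sketched in a few lines. This assumes the common straight-line probability path between a prior sample `x0` and a data sample `x1`; the abstract does not state which path RNA-FM uses. Here `model` is a hypothetical callable `(x, t, cond) -> velocity`, and `cond` stands in for a morphology embedding of the slide.

```python
from typing import List, Sequence

Vector = List[float]


def flow_matching_pair(x0: Sequence[float], x1: Sequence[float], t: float):
    """Point on the straight path from x0 to x1 and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(model, x0, x1, cond, t: float) -> float:
    """Mean squared error between predicted and target velocity for one sample."""
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = model(x_t, t, cond)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def sample(model, x0: Sequence[float], cond, steps: int = 10) -> Vector:
    """Euler integration of the velocity field from t=0 to t=1."""
    x = list(x0)
    for k in range(steps):
        v = model(x, k / steps, cond)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x
```

Because sampling starts from a random prior draw, repeated calls with the same `cond` give different expression profiles, which is what allows a generative formulation to express predictive uncertainty where a one-to-one regressor cannot.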

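2605.11616 2026-05-13 cs.CV Updated version

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Qirui Wang, Jingyi He, Yining Pan, Xulei Yang, Shijie Li

Affiliations * TUM (Technical University of Munich); A*STAR (Agency for Science, Technology and Research, Singapore)

AI summary This study addresses the grounding of 3D functional affordances, i.e., accurately localizing the specific region of an object that supports an interaction, such as a handle or a button, with vision-language models. It proposes AFFORDMEM, a framework with cross-scene and in-scene memory mechanisms that builds a reusable memory bank from source scenes to assist grounding, without model fine-tuning or target-scene annotation. Experiments on SceneFun3D show a clear improvement in grounding accuracy, supporting its effectiveness for fine-grained localization and spatial-relation understanding.

Details
English abstract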

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. A project homepage is available.

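2605.11605 2026-05-13 cs.CV cs.AI Updated version

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

Chaeyoung Jung, Kyeongha Rho, Joon Son Chung

Affiliations * Korea Advanced Institute of Science and Technology

AI summary Omnimodal large language models (Omni-LLMs) incur high computational overhead when processing multimodal inputs, so effective token reduction is needed. This paper proposes ContextGuard, an inference-time token pruning framework that preserves broad audio-visual context while removing cross-modal redundancy, reducing the number of input tokens while maintaining performance. The method predicts coarse visual semantics from audio, prunes video tokens that are recoverable from audio, retains tokens that carry localized visual details audio cannot express, and merges temporally similar video tokens for further compression. Experiments show that ContextGuard outperforms existing methods on multiple benchmarks and achieves a high pruning ratio with strong performance without fine-tuning the downstream model.

Details
English abstract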

Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
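
The final compression step, merging temporally similar video tokens, can be illustrated with a simple run-based merge. This is a generic sketch rather than ContextGuard's actual rule: it assumes the tokens are ordered in time for one spatial position, uses cosine similarity to the first token of the current run, and the `threshold` value is an arbitrary placeholder.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_temporal(tokens: Sequence[Sequence[float]], threshold: float = 0.9):
    """Average runs of consecutive tokens that stay similar to the run's first token."""
    merged: List[List[float]] = []
    run: List[Sequence[float]] = []
    for tok in tokens:
        if run and cosine(run[0], tok) < threshold:
            # Similarity dropped: close the current run and start a new one.
            merged.append([sum(c) / len(run) for c in zip(*run)])
            run = []
        run.append(tok)
    if run:
        merged.append([sum(c) / len(run) for c in zip(*run)])
    return merged
```

Static regions of a video collapse into a single averaged token, while a scene change starts a new run, so the token count shrinks without discarding the moments where the visual content actually changes.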

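2605.11594 2026-05-13 cs.CV Updated version

PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations

Cheng Chi, Xianqi Wang, Hongcheng Luo, Mingfei Tu, Gangwei Xu, Zehan Zhang, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang, Haiyang Sun

Affiliations * Xiaomi EV; Huazhong University of Science and Technology; Zhejiang University

AI summary This paper proposes PointForward, a feedforward driving-scene reconstruction framework that uses point-aligned representations to address the shortcomings of existing methods in multi-view consistency and dynamic-instance modeling. It initializes sparse 3D queries in world coordinates and aggregates multi-view image information through spatial-temporal fusion, enforcing explicit cross-view consistency in a single feedforward pass. It further introduces scene graphs to explicitly organize dynamic instances and uses 3D bounding boxes for instance-level motion propagation, yielding temporally consistent dynamic reconstruction. Experiments show that PointForward achieves state-of-the-art performance on large-scale driving datasets.

Details
English abstract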

High-fidelity reconstruction of driving scenes is crucial for autonomous driving. While recent feedforward 3D Gaussian Splatting (3DGS) methods enable fast reconstruction, their per-pixel Gaussian prediction paradigm often suffers from multi-view inconsistency and layering artifacts. Moreover, existing methods often model dynamic instances via dense flow prediction, which lacks explicit cross-view correspondence and instance-level consistency. In this paper, we propose PointForward, a feedforward driving reconstruction framework through point-aligned representations. Unlike pixel-aligned methods, we initialize sparse 3D queries in world space and aggregate multi-view image information via spatial-temporal fusion onto these queries, enforcing explicit cross-view consistency in a single feedforward pass. To handle scene dynamics, we introduce scene graphs that explicitly organize moving instances during reconstruction. By leveraging 3D bounding boxes, our method enables instance-level motion propagation and temporally consistent dynamic representations. Extensive experiments demonstrate that PointForward achieves state-of-the-art performance on large-scale driving benchmarks. The code will be available upon the publication of the paper.

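2605.11591 2026-05-13 cs.CV Updated version

Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

Mingtao Xian, Yifeng Yang, Qinying Gu, Xinbing Wang, Nanyang Ye

Affiliations * Zhiyuan College, Shanghai Jiao Tong University, Shanghai, China; Shanghai Jiao Tong University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Shanghai Innovation Institute, Shanghai, China

AI summary Multimodal large language models perform well in multi-image cross-modal retrieval but suffer from severe position bias, where predictions depend on input order rather than semantic relevance. This paper identifies a phenomenon termed Logit-Attention Divergence: output logits are biased while internal attention maps still align with the relevant visual evidence, which exposes a limitation of existing calibration methods. Building on this, the authors propose a training-free, attention-guided debiasing framework that uses the model's internal attention signals for instance-level correction at inference time, requiring only a small calibration set and negligible computational overhead. Experiments show that the method substantially improves robustness to input order and achieves state-of-the-art performance on multiple benchmarks.

Details
English abstract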

Multimodal Large Language Models (MLLMs) have shown strong performance in multi-image cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40% compared to baselines. Code is available at https://github.com/brightXian/LAD.

2605.11585 2026-05-13 cs.CV cs.LG 版本更新

A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods

Shota Saito, Yuta Nakahara, Kohei Horinouchi, Naoki Ichijo, Manabu Kobayashi, Toshiyasu Matsushima

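Affiliations * Gunma University; Waseda University

AI summary This paper studies Gaussian noise removal for grayscale images. It proposes a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and recasts MAP-estimation-based denoising as the maximization of a variational lower bound. The resulting optimization algorithm alternates variational Bayes and gradient steps, with a gradient update rule that can be computed analytically, without numerical approximation. Experiments verify the algorithm's effectiveness and point to directions for further improvement.

Details
English abstract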

This paper addresses the problem of image denoising for grayscale images. We propose a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, together with a framework that reduces MAP (maximum a posteriori)-estimation-based denoising to the maximization of a variational lower bound. To maximize this lower bound, we develop an algorithm that alternately applies variational Bayes and gradient methods. In particular, we demonstrate that the gradient-based update rule can be computed analytically without numerical computation or approximation. We carried out experiments to verify that the proposed algorithm actually removes image noise and to identify directions for future improvement.

2605.11583 2026-05-13 eess.IV cs.AI cs.CV cs.LG eess.SP Version update

NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI

Tal Oved, Efrat Shimron

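Affiliations * Department of Electrical and Computer Engineering, Technion – Israel Institute of Technology; May-Blum-Dahl Technion Human MRI Research Center, Technion - Israel Institute of Technology; Department of Biomedical Engineering, Technion – Israel Institute of Technology

AI summary This paper proposes NexOP, a deep-learning framework that addresses the low signal-to-noise ratio of low-field MRI by jointly optimizing the k-space sampling strategy for multi-repetition (NEX) acquisitions and the image reconstruction. The method optimizes sampling density probabilities over the extended k-space-NEX domain to obtain a more efficient sampling strategy under a fixed sampling budget, and designs a new deep-learning architecture that reconstructs a high-quality image from multiple low-SNR measurements. Experiments show that NexOP outperforms existing methods across acceleration factors and tissue contrasts, and that it produces non-uniform sampling schemes that exploit the NEX dimension to improve imaging efficiency and quality.

Details
English abstract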

Modern low-field magnetic resonance imaging (MRI) technology offers a compelling alternative to standard high-field MRI, with portable, low-cost systems. However, its clinical utility is limited by a low Signal-to-Noise Ratio (SNR), which hampers diagnostic image quality. A common approach to increase SNR is through repetitive signal acquisitions, known as NEX, but this results in excessively long scan durations. Although recent work has introduced methods to accelerate MRI scans through k-space sampling optimization, the NEX dimension remains unexploited; typically, a single sampling mask is used across all repetitions. Here we introduce NexOP, a deep-learning framework for joint optimization of the sampling and reconstruction in multi-NEX acquisitions, tailored for low-SNR settings. NexOP enables optimizing the sampling density probabilities across the extended k-space-NEX domain, under a fixed sampling-budget constraint, and introduces a new deep-learning architecture for reconstructing a single high-SNR image from multiple low-SNR measurements. Experiments with raw low-field (0.3T) brain data demonstrate that NexOP consistently outperforms competing methods, both quantitatively and qualitatively, across diverse acceleration factors and tissue contrasts. The results also demonstrate that NexOP yields non-uniform sampling strategies, with progressively decreasing sampling across repetitions, hence exploiting the NEX dimension efficiently. Moreover, we present a theoretical analysis supporting these numerical observations. Overall, this work proposes a sampling-reconstruction optimization framework highly suitable for low-field MRI, which can enable faster, higher-quality imaging with low-cost systems and contribute to advancing affordable and accessible healthcare.
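
The trade-off that motivates NEX can be illustrated with a small simulation. The sketch below is not from the paper: it assumes independent, zero-mean Gaussian noise per repetition and plain averaging, in which case the noise standard deviation falls as sigma / sqrt(NEX) while scan time grows linearly with NEX. The function name `averaged_noise_std` and all parameter values are illustrative.

```python
import random
import statistics


def averaged_noise_std(nex, sigma=1.0, trials=20000, seed=0):
    """Empirical standard deviation of the mean of `nex` independent noise draws."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, sigma) for _ in range(nex)) / nex
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    for nex in (1, 4, 16):
        measured = averaged_noise_std(nex)
        predicted = 1.0 / nex ** 0.5  # sigma / sqrt(NEX) with sigma = 1
        print(f"NEX={nex:2d}  measured={measured:.3f}  predicted={predicted:.3f}")
```

Under these assumptions, quadrupling the number of repetitions only halves the noise, which is why spending the sampling budget unevenly across repetitions can be attractive.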

2605.11578 2026-05-13 cs.CV Version update

The Midas Touch for Metric Depth

Yu Ma, Zizhan Guo, Zuyi Xiong, Haoran Zhang, Yi Feng, Hongbo Zhao, Hanli Wang, Rui Fan

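Affiliations * College of Electronic and Information Engineering, Tongji University; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University

AI summary This paper proposes MTD, a method for the problem that relative depth estimation is limited in practice by missing metric scale, local inconsistency, and low computational efficiency. It converts relative depth to metric depth using extremely sparse 3D data, combining a segment-wise recovery strategy with pixel-wise refinement based on a discontinuity-aware geodesic cost to remove local scale inconsistencies. MTD generalizes well and markedly improves depth completion and depth estimation accuracy, and its lightweight, modular design eases deployment and integration in a range of downstream 3D tasks.

Details
English abstract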

Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks. Project page is available at https://mias.group/MTD.
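
As a point of reference for what "relative to metric" means, here is a minimal sketch of the simplest possible alignment: a single global scale and shift fitted by least squares to a few sparse metric points. This is a deliberately simplified stand-in, not MTD's segment-wise graph optimization or geodesic refinement; the function name `fit_scale_shift` and the sample values are hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~ scale * relative + shift from sparse anchors."""
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two matched anchor points")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


if __name__ == "__main__":
    # Three sparse anchors: relative depth from a network, metric depth from a sensor.
    rel = [0.10, 0.40, 0.90]
    met = [0.70, 1.30, 2.30]
    scale, shift = fit_scale_shift(rel, met)
    print(scale, shift)
```

A single global fit like this cannot correct scale errors that vary across the image, which is the local inconsistency the abstract says the segment-wise and pixel-wise stages are meant to remove.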

2605.11563 2026-05-13 cs.CV cs.AI Version update

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

Sara Shoouri, Morteza Tavakoli Taba, Hun-Seok Kim

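Affiliations * University of Michigan

AI summary This paper proposes TCP-SSM, an efficient vision state space model that targets the difficulty existing SSMs have in controlling state-dependent memory behavior in long-range vision tasks. It introduces token-conditioned stable poles that model the recurrent dynamics explicitly, improving interpretability and controllability. TCP-SSM uses real poles for monotone decay and complex-conjugate poles for damped oscillatory responses, and combines grouped pole sharing with a lightweight input pathway, cutting computational complexity by up to 44% relative to baseline models across several vision tasks.

Details
English abstract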

State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computation cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model's state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator with 1) real poles that model monotone or sign-alternating decay, and 2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computational complexity by up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
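
The pole vocabulary in the abstract can be made concrete with a first-order recurrence. The sketch below is only an illustration of how pole location shapes memory: a positive real pole gives monotone decay, a negative real pole gives sign-alternating decay, and a complex pole gives a damped oscillation. The `token_pole` function is a hypothetical bounded modulation (sigmoid on the radius, tanh on the angle) chosen so that the pole stays inside the unit circle; the paper's actual parameterization is not given in the abstract.

```python
import cmath
import math


def impulse_response(pole, steps):
    """Real part of h_t = pole * h_{t-1} + x_t driven by a unit impulse at t = 0."""
    if abs(pole) >= 1.0:
        raise ValueError("pole must lie strictly inside the unit circle")
    state = 0j
    response = []
    for t in range(steps):
        x = 1.0 if t == 0 else 0.0
        state = pole * state + x
        response.append(state.real)
    return response


def token_pole(base_angle, radius_logit, angle_logit, max_radius=0.99, max_shift=0.5):
    """Hypothetical bounded modulation: the radius always stays below max_radius < 1."""
    radius = max_radius / (1.0 + math.exp(-radius_logit))  # sigmoid keeps it in (0, max_radius)
    angle = base_angle + max_shift * math.tanh(angle_logit)  # bounded shift around the base angle
    return cmath.rect(radius, angle)


if __name__ == "__main__":
    print(impulse_response(0.5, 4))    # positive real pole: monotone decay
    print(impulse_response(-0.5, 4))   # negative real pole: sign-alternating decay
    print(impulse_response(cmath.rect(0.9, math.pi / 4), 8))  # complex pole: damped oscillation
```

Because the radius is squashed before use, any token-dependent logits yield a stable pole, which is one simple way to read "preserving pole stability".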

2605.11559 2026-05-13 cs.CV cs.AI Version update

When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

Fanpu Cao, Xin Zou, Xuming Hu, Hui Xiong

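Affiliations * Thrust of Artificial Intelligence, HKUST (Guangzhou); Department of Computer Science and Engineering, HKUST

AI summary Multimodal large language models (MLLMs) play an important role in visual reasoning and visually grounded question answering, yet remain prone to visual hallucination, producing answers that contradict the image or mention nonexistent objects. The paper finds that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals how attention changes when the model hallucinates. Building on this, it proposes LaSCD, a training-free decoding strategy that selects layers with high Laplacian energy and remaps next-token scores, reducing hallucination while preserving general capabilities.

Details
English abstract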

Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at https://github.com/macovaseas/LaSCD.
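
One common way to quantify "high-frequency structure" on a grid is the graph Laplacian quadratic form, which sums squared differences between neighboring cells. The sketch below uses that definition on a 2D attention map with 4-neighbor connectivity. It is an assumption made for illustration: the abstract does not state which graph or normalization LaSCD uses, and `laplacian_energy` is a hypothetical name.

```python
def laplacian_energy(attn):
    """Sum of squared differences between 4-neighbors, i.e. a^T L a on a grid graph."""
    rows, cols = len(attn), len(attn[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal edge
                energy += (attn[r][c] - attn[r][c + 1]) ** 2
            if r + 1 < rows:  # vertical edge
                energy += (attn[r][c] - attn[r + 1][c]) ** 2
    return energy


if __name__ == "__main__":
    smooth = [[0.25, 0.25], [0.25, 0.25]]
    spiky = [[1.0, 0.0], [0.0, 1.0]]
    print(laplacian_energy(smooth), laplacian_energy(spiky))
```

A smooth map scores zero and a checkerboard scores high, so ranking layers by this quantity separates diffuse attention from sharply structured attention even when the total attention mass on image tokens is the same.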

2605.11551 2026-05-13 cs.LG cs.CV cs.IT math.IT Version update

VNDUQE: Information-Theoretic Novelty Detection using Deep Variational Information Bottleneck

Aryan Gondkar, Hayder Radha, Yiming Deng

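Affiliations * Nondestructive Evaluation Lab, Department of Electrical Computer Engineering, Michigan State University, East Lansing, MI; Department of Electrical

AI summary This paper proposes VNDUQE, a novelty detection and uncertainty quantification method based on the Deep Variational Information Bottleneck (VIB), for detecting out-of-distribution (OOD) samples in neural networks. It scores how anomalous a sample is with information-theoretic metrics such as KL divergence and prediction entropy, and validates the approach on MNIST. Experiments show that a parallel detection strategy combining KL divergence and prediction entropy outperforms the conventional baseline on both far-OOD and near-OOD samples, improving detection performance and the reliability of uncertainty estimates.

Comments 6 pages, 3 figures, Fall 2025 version

Details
English abstract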

Detecting out-of-distribution (OOD) samples is critical for safe deployment of neural networks in safety-critical applications. While maximum softmax probability (MSP) provides a simple baseline, it lacks theoretical grounding and suffers from miscalibration. We propose VNDUQE (VIB-based Novelty Detection and Uncertainty Quantification for Nondestructive Evaluation), which investigates novelty detection through the Deep Variational Information Bottleneck (VIB), a model that explicitly constrains information flow through learned representations. We train VIB models on MNIST with held-out digit classes and evaluate OOD detection using information-theoretic metrics: KL divergence and prediction entropy. Our results reveal complementary detection signals: KL divergence achieves perfect detection (100% AUROC on noise) on far-OOD samples (noise, domain shift), while prediction entropy excels at near-OOD detection (94.7% AUROC on novel digit classes). A parallel detection strategy combining both metrics achieves 95.3% average AUROC and 92% true positive rate at 5% false positive rate, which is a 32 percentage point improvement in TPR over baseline MSP (85.0% AUROC, 60.1% TPR). Compression via the information bottleneck principle ($\beta=10^{-3}$) reduces Expected Calibration Error by 38%, demonstrating that information-theoretic constraints produce fundamentally more reliable uncertainty estimates. These findings directly support active learning with expensive computational oracles, where well-calibrated novelty detection enables principled threshold selection for oracle queries.
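
The two scores named in the abstract have standard closed forms, sketched below under two assumptions that the abstract does not confirm: the VIB encoder outputs a diagonal Gaussian that is compared against a standard normal prior, and the "parallel" strategy flags a sample when either score exceeds its threshold. The function names and thresholds are hypothetical.

```python
import math


def kl_to_standard_normal(mean, std):
    """KL( N(mean, diag(std^2)) || N(0, I) ) for a diagonal Gaussian encoder output."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mean, std))


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_novel(mean, std, probs, kl_threshold, entropy_threshold):
    """Assumed OR rule: flag the sample if either score crosses its threshold."""
    return (
        kl_to_standard_normal(mean, std) > kl_threshold
        or prediction_entropy(probs) > entropy_threshold
    )


if __name__ == "__main__":
    # A latent far from the prior trips the KL test even with a confident prediction.
    print(is_novel([3.0, 3.0], [1.0, 1.0], [0.99, 0.01], 2.0, 0.5))
    # A latent at the prior with a confident prediction passes both tests.
    print(is_novel([0.0, 0.0], [1.0, 1.0], [0.99, 0.01], 2.0, 0.5))
```

The OR rule mirrors the complementary behavior reported in the abstract: KL reacts to inputs whose latent code leaves the prior, while entropy reacts to inputs that look in-domain but are ambiguous between classes.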

2605.11550 2026-05-13 cs.CV Version update

The DAWN of World-Action Interactive Models

Hongbo Lu, Liang Yao, Chenghao He, Haoyu Wang, Xiang Gu, Xianfei Li, Wenlong Liao, Tao He, Pai Peng

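Affiliations * COWARobot Co. Ltd; Shanghai Jiao Tong University; Hohai University

AI summary The paper proposes DAWN, a world-action interactive model that addresses the mutual dependence between world evolution and action generation in autonomous driving. DAWN couples a world predictor with a world-conditioned action denoiser in a semantic latent space, recursively refining world prediction and action generation so that it supports long-horizon trajectory generation in complex interactive scenes. Experiments show strong planning performance and safety results on multiple autonomous driving benchmarks, demonstrating the potential of interactive world-action generation for building truly actionable world models.

Details
English abstract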

A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.

2605.11541 2026-05-13 cs.CV Version update

GeoR-Bench: Evaluating Geoscience Visual Reasoning

Yushuo Zheng, Zicheng Zhang, Huiyu Duan, Chunyi Li, Zijian Chen, Ziheng Jia, Yue Shi, Ke Gu, Xiongkuo Min, Guangtao Zhai

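Affiliations * Shanghai Jiao Tong University; Beijing University of Technology

AI summary GeoR-Bench is a benchmark for evaluating geoscience visual reasoning, aimed at the limited ability of current AI systems to understand and predict earth system changes. It contains 440 curated samples spanning 6 geoscience categories and 24 task types, and uses visual editing tasks to assess a model's reasoning, consistency, and output quality. Results show that geoscience reasoning remains a significant bottleneck: the best model reaches only 42.7% overall accuracy and open-source models do worse, indicating substantial room for improvement in scientific accuracy.

Details
English abstract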

Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a Benchmark for evaluating Geoscience visual Reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results of 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7% overall strict accuracy, while the best open-source model reaches only 10.3%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.

2605.11521 2026-05-13 cs.CV Version update

XWOD: A Real-World Benchmark for Object Detection under Extreme Weather Conditions

Chih-Hsin Chen, Yu-Tung Liu, Amar Fadillah, Kuan-Ting Lai, Dong Liu

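Affiliations * Department of Electronic Engineering; National Taipei University of Technology; Adobe Inc.

AI summary This paper presents XWOD, a large real-world dataset for object detection under extreme weather, containing 10,010 images and 42,924 annotated boxes that cover six traffic-object categories across seven extreme weather conditions: rain, snow, fog, sand and dust, flooding, tornado, and wildfire. XWOD broadens the weather taxonomy and is the first to include climate-amplified hazard categories. Zero-shot tests on other weather datasets validate its data quality, with markedly higher detection performance. The dataset provides a strong benchmark for studying traffic perception in extreme weather.

Details
English abstract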

Autonomous driving and intelligent transportation systems remain vulnerable under extreme weather. The U.S. Federal Highway Administration reports that roughly 745,000 crashes and 3,800 fatalities per year are weather-related, and recent regulatory investigations have examined failures of Level-2/3 driving systems under reduced-visibility conditions. However, datasets commonly used to evaluate weather robustness remain limited in scale, diversity, and realism. In this paper, we introduce XWOD (Extreme Weather Object Detection), a large-scale real-world traffic-object detection benchmark containing 10,010 images and 42,924 bounding boxes across seven extreme weather conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. The dataset covers six traffic-object categories, including car, person, truck, motorcycle, bicycle, and bus. XWOD extends the weather taxonomy from one to seven conditions, and is the first to cover the emerging class of climate-amplified hazards, such as flooding, tornado, and wildfire. To evaluate the quality of our data, we train standard YOLO-family detectors on XWOD and test them zero-shot on external weather benchmarks, achieving mAP$_{50}$ scores of 63.00% on RTTS, 59.94% on DAWN, and 61.12% on WEDGE, compared with the corresponding published YOLO-based baselines of 40.37%, 32.75%, and 45.41%, respectively, representing relative improvements of 56%, 83%, and 35%. These cross-dataset results show that XWOD provides a strong source domain for learning weather-robust traffic perception. We release the dataset, splits, baseline weights, and reproducible evaluation code under a research-use license.
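
The relative improvements quoted in the abstract follow directly from the reported mAP50 pairs. The short check below recomputes them as 100 * (new - baseline) / baseline; only the six scores are taken from the abstract, and the function and variable names are illustrative.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (XWOD-trained zero-shot mAP50, published YOLO-based baseline mAP50), as reported
REPORTED = {
    "RTTS": (63.00, 40.37),
    "DAWN": (59.94, 32.75),
    "WEDGE": (61.12, 45.41),
}

if __name__ == "__main__":
    for name, (new, baseline) in REPORTED.items():
        print(f"{name}: {relative_improvement(new, baseline):.1f}% relative gain")
```

Note that these are relative gains; the corresponding absolute gains are 22.63, 27.19, and 15.71 mAP50 points.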

2605.11520 2026-05-13 cs.CV cs.AI Version update

PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

Yixiao Song, Qingyong Li, Wen Wang, Zhicheng Yan

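Affiliations * Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; Frontiers Science Center for Smart High-speed Railway System, Beijing Jiaotong University

AI summary This paper proposes PointGS, an unsupervised 3D point cloud segmentation method that avoids the high cost of the dense annotations required by supervised approaches. It uses 3D Gaussian Splatting as a unified intermediate representation to bridge the domain gap between discrete point clouds and continuous images, and applies multi-view reconstruction with semantic distillation to obtain consistent semantic assignments across views. Experiments show that PointGS outperforms existing unsupervised methods on multiple benchmark datasets.

Comments Accepted by Computer Vision and Pattern Recognition (CVPR) 2026

Details
English abstract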

Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.
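
The final step described in the abstract, assigning point semantics by nearest-neighbor search on labeled Gaussians, can be sketched with a brute-force search. This assumes the Gaussian centers are already registered to the point cloud's coordinate frame; a real pipeline would typically use a spatial index such as a k-d tree instead of the O(N*M) loop here. All names and coordinates are hypothetical.

```python
import math


def transfer_labels(points, gaussian_centers, gaussian_labels):
    """Give each point the label of its nearest Gaussian center (brute force)."""
    if not gaussian_centers or len(gaussian_centers) != len(gaussian_labels):
        raise ValueError("need one label per Gaussian center")
    labels = []
    for point in points:
        nearest = min(
            range(len(gaussian_centers)),
            key=lambda i: math.dist(point, gaussian_centers[i]),
        )
        labels.append(gaussian_labels[nearest])
    return labels


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    center_labels = ["floor", "chair"]
    cloud = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (0.4, 0.0, 0.0)]
    print(transfer_labels(cloud, centers, center_labels))
```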

2605.11508 2026-05-13 cs.CV Version update

LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing

Yongcong Wang, Chengchao Shen, Guangwei Gao, Wei Wang, Pengwen Dai, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng

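Affiliations * Central South University; Nanjing University of Science and Technology; Sun Yat-sen University; Shandong Normal University; Qilu University of Technology

AI summary Ultra-high-definition video dehazing currently lacks an evaluation benchmark, and existing methods struggle to process 4K video in real time on consumer-grade GPUs. This paper proposes LiBrA-Net, which recasts dehazing as a per-pixel affine transform driven by a low-frequency depth field and encodes it efficiently in bilateral grids, achieving real-time 4K dehazing at 25 FPS on a single GPU. The paper also releases UHV-4K, the first 4K video dehazing benchmark with depth, transmission, and optical-flow annotations, and reports state-of-the-art results on multiple datasets.

Comments 10 pages, 5 figures

Details
English abstract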

Currently, there is a gap in the field of ultra-high-definition (UHD) video dehazing due to the lack of a benchmark for evaluation. Furthermore, existing video dehazing methods cannot run on consumer-grade GPUs when processing continuous UHD sequences of 3-5 frames at a time. In this paper, we address both issues with a new benchmark and an efficient method. Our key observation is that atmospheric dehazing reduces to a per-pixel affine transform governed by the low-frequency depth field, which can be compactly encoded in bilateral grids whose prediction cost is decoupled from the output resolution. Building on this, we propose LiBrA-Net, which factorizes the spatiotemporal affine field into a spatial-color and a temporal bilateral sub-grid predicted at a fixed low resolution, fuses their coefficients in the $\mathfrak{gl}(3)$ Lie algebra under group-theoretic regularization, maps the result to invertible GL(3) transforms via a Cayley parameterization, and restores high-frequency detail through a lightweight input-guided branch. We further release UHV-4K, the first paired 4K video dehazing benchmark with depth, transmission, and optical-flow annotations on every frame. Across UHV-4K, REVIDE, and HazeWorld, LiBrA-Net sets a new state of the art among compared video dehazing methods while running native 4K at 25 FPS on a single GPU with only 6.12 M parameters. Code and data are available at https://anonymous.4open.science/r/LiBrA-Net-42B8.
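
The "per-pixel affine transform" observation can be checked against the standard atmospheric scattering model, I = J * t + A * (1 - t) with transmission t = exp(-beta * depth). Solving for the clear pixel gives J = (1/t) * I + A * (1 - 1/t), which is affine in the hazy pixel I with coefficients that depend only on depth and airlight. The sketch below assumes that model with a scalar per-pixel gain; the paper's full 3x3 GL(3) parameterization is richer, and the function names and values are hypothetical.

```python
import math


def transmission(depth, beta):
    """Transmission under the standard scattering model, t = exp(-beta * depth)."""
    return math.exp(-beta * depth)


def haze(clear, airlight, t):
    """Forward model per channel: I = J * t + A * (1 - t)."""
    return [j * t + a * (1.0 - t) for j, a in zip(clear, airlight)]


def dehaze_affine(t, airlight):
    """Affine coefficients (gain, offset) such that J = gain * I + offset."""
    gain = 1.0 / t
    offset = [a * (1.0 - gain) for a in airlight]
    return gain, offset


def apply_affine(pixel, gain, offset):
    return [gain * value + off for value, off in zip(pixel, offset)]


if __name__ == "__main__":
    clear = [0.2, 0.5, 0.8]
    airlight = [0.9, 0.9, 0.9]
    t = transmission(depth=2.0, beta=0.5)
    hazy = haze(clear, airlight, t)
    gain, offset = dehaze_affine(t, airlight)
    print(apply_affine(hazy, gain, offset))  # recovers `clear` up to rounding
```

Since gain and offset vary only as fast as depth does, they can be predicted on a coarse grid and upsampled, which is consistent with the abstract's claim that prediction cost is decoupled from output resolution.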

2605.11506 2026-05-13 cs.CV Version update

Principled Design of Diffusion-based Optimizers for Inverse Problems

Julio Oscanoa, Irmak Sivgin, Cagan Alkan, Daniel Ennis, John Pauly, Mert Pilanci, Shreyas Vasanawala

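Affiliations * Department of Bioengineering; Department of Radiology, Stanford University, USA

AI summary This paper studies the design of diffusion-based optimizers for inverse problems, targeting their long inference times and cumbersome hyperparameter tuning. The authors propose principled reparameterizations that let hyperparameters be reused across tasks without re-tuning. Building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, they further develop the OptDiff pipeline to accelerate inference and improve image quality. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and improved image quality.

Comments 22 pages, 8 figures, 6 tables

Details
English abstract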

Score-based diffusion models achieve state-of-the-art performance for inverse problems, but their practical deployment is hindered by long inference times and cumbersome hyperparameter tuning. While pretrained diffusion models can be reused across tasks without retraining, inference-time hyperparameters such as the noise schedule and posterior sampling weights typically require ad-hoc adjustment for each problem setup. We propose principled reparameterizations that induce invariances, allowing the same hyperparameters to be reused across multiple problems without re-tuning. In addition, building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, we further develop the OptDiff pipeline. OptDiff provides a simplified tuning framework that facilitates the integration of convex optimization tools to accelerate inference. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and improved image quality.

2605.11497 2026-05-13 cs.CV Version update

PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

Sanghyeon Lee, Jinwoo Kim, Jong Taek Lee

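Affiliations * School of Computer Science and Engineering

AI summary This paper studies semantic alignment in zero-shot skeleton-based action recognition (ZSSAR) and argues that key semantic information, such as human-object interactions and pose-relative visual cues, is already lost by the time alignment begins. It proposes PoseBridge, a framework that extracts pose-anchored semantic cues from the intermediate representations of the pose estimation process and passes them to the text alignment module through skeleton-conditioned bridging and semantic prototype adaptation, improving zero-shot recognition. Experiments show improvements on multiple datasets, most clearly on the Kinetics-200/400 PURLS benchmark.

Details
English abstract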

Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.

2605.11494 2026-05-13 cs.CV Version update

STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

Ankit Yadav, Arpit Garg, Ta Duc Huy, Lingqiao Liu

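Affiliations * Australian Institute for Machine Learning, Adelaide University, Australia

AI summary STRIDE is a training-free, optimization-free method for increasing diversity in single-step diffusion models. It injects noise into intermediate features, aligned with the principal components of the model's own activations, to obtain a controllable diversity gain. Because the perturbation follows the model's own feature structure, generated samples become more diverse while remaining high quality. Experiments on multiple datasets show that STRIDE improves image diversity while maintaining good text alignment, outperforming existing training-free baselines.

Comments 11 pages, 3 figures, 4 tables

Details
English abstract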

Distilled one-step (T=1) or few-step (T$\leq$4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model's learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model's own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model's internal representation structure.

2605.11489 2026-05-13 cs.GR cs.CV Version update

3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering

Yibo Zhao, Fan Gao, Youcheng Cai, Ligang Liu

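Affiliations * University of Science and Technology of China

AI summary 3DGS$^3$ is a unified post-rendering framework that targets the efficiency bottleneck of 3D Gaussian Splatting (3DGS) when rendering very large scenes at high resolution in real time. It jointly performs super sampling and frame interpolation through differentiable processing of low-resolution outputs to reach both high resolution and high frame rate. Its core modules are a Gradient-Aware Super Sampling network (GASS) and a Lightweight Temporal Frame Interpolation network (LTFI), which improve spatial detail and temporal coherence respectively. Experiments show better rendering efficiency and visual quality than existing methods, and compatibility with existing 3DGS acceleration techniques.

Details
English abstract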

3D Gaussian Splatting (3DGS) enables high-quality real-time 3D rendering but faces challenges in efficiently scaling to ultra-dense scenes and high resolutions due to computational bottlenecks that limit its use in latency-sensitive applications. Instead of optimizing the splatting pipeline itself, we propose 3DGS$^3$, a unified post-rendering framework that jointly performs super sampling and frame interpolation through differentiable processing of low-resolution outputs to achieve both high-resolution and high-frame-rate rendering. Our Gradient-Aware Super Sampling (GASS) module leverages the continuous differentiability of 3DGS to extract image gradients that guide a GRU-based refinement network to enable high-fidelity super sampling. Furthermore, a Lightweight Temporal Frame Interpolation (LTFI) module based on a compact U-Net-like backbone fuses temporal and differentiable spatial cues from consecutive frames to synthesize temporally coherent intermediate frames. Experiments on public datasets demonstrate that 3DGS$^3$ achieves superior rendering efficiency and visual quality when compared with state-of-the-art methods and remains compatible with existing 3DGS acceleration techniques. The code will be publicly released upon acceptance.

2605.11477 2026-05-13 cs.CV Version update

LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs

Jingfeng Chen, Jiawen Qian, Wendi Deng, Yinuo Guo, Jiaqi Yu, Sicong Leng, Raghuveer Thirukovalluru, Bhuwan Dhingra

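Affiliations * Carnegie Mellon University; Individual Researcher; National University Singapore; Nanyang Technological University; Duke University

AI summary Video understanding in multimodal large language models requires picking informative frames from long videos under a limited visual-token budget. This paper proposes LDDR, a dynamic-resolution frame sampling framework based on a linear determinantal point process (DPP) that performs query-aware frame selection in a task-conditioned feature space and runs 3x faster than standard DPP methods. LDDR introduces a Group DPP importance metric that guides frame retention and dynamic resolution allocation, improving video understanding and outperforming existing methods on multiple video benchmarks.

Comments 21 pages, 4 figures

Details
English abstract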

Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four video benchmarks spanning short-, medium-, and long-range videos, LDDR consistently outperforms the next-best baselines, achieving gains of 2.5 points under budget-constrained settings and 1.6 points in high-budget scenarios. These improvements are consistently observed across multiple MLLM backbones, including both open- and closed-source models. Qualitative analysis confirms that relevant frames are selected and allocated a higher budget, facilitating improved video understanding.
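
For readers unfamiliar with DPP-based selection, the sketch below shows the textbook baseline that methods like LDDR speed up: a kernel L[i][j] = q_i * <f_i, f_j> * q_j that combines per-frame relevance q with feature similarity, and a naive greedy search that repeatedly adds the frame giving the largest subset determinant. This is not the paper's linear-time algorithm or its Group DPP metric; it recomputes determinants from scratch, and every name and number is hypothetical.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, result = len(a), 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            result = -result
        result *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return result


def dpp_kernel(features, relevance):
    """L[i][j] = q_i * <f_i, f_j> * q_j: relevance-weighted similarity."""
    n = len(features)
    return [
        [
            relevance[i] * sum(x * y for x, y in zip(features[i], features[j])) * relevance[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


def greedy_dpp(kernel, k):
    """Naive greedy MAP: add the frame that maximizes det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            value = det([[kernel[a][b] for b in idx] for a in idx])
            if value > best_det:
                best, best_det = i, value
        if best is None:  # no remaining frame adds volume
            break
        selected.append(best)
    return selected


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # frames 0 and 1 are duplicates
    rel = [1.0, 0.9, 0.8]
    print(greedy_dpp(dpp_kernel(feats, rel), 2))
```

In the toy example the duplicate frame contributes zero determinant once its twin is chosen, so the less relevant but distinct frame is selected instead, which is the redundancy-avoiding behavior that point-wise relevance scoring lacks.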

2605.11475 2026-05-13 cs.CV Version update

Deep Probabilistic Unfolding for Quantized Compressive Sensing

Gang Qu, Ping Wang, Siming Zheng, Xin Yuan

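Affiliations * Westlake University, School of Engineering, Hangzhou, Zhejiang, China; Vivo Mobile Communication Co., Ltd., Hangzhou, Zhejiang, China

AI summary This paper proposes a deep probabilistic unfolding model for quantized compressive sensing, using an unfolding framework to improve reconstruction accuracy and efficiency. Unlike earlier methods that apply an L2 projection, it derives a closed-form, numerically stable likelihood gradient projection that respects the true quantization physics, turning the hard quantization constraint into soft probabilistic guidance. It also designs an efficient dual-domain Mamba module that dynamically captures and fuses multi-scale local and global features, strengthening interactions between distant but correlated regions. Experiments show state-of-the-art performance on multiple tasks, which may help bring quantized compressive sensing into practical use.

Details
English abstract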

We propose a deep probabilistic unfolding model for the classical quantized compressive sensing problem, leveraging an unfolding framework to enhance reconstruction accuracy and efficiency. Unlike previous unfolding methods that apply L2 projection to measurements, we derive a closed-form, numerically stable likelihood gradient projection, which allows the model to respect the true quantization physics, turning the hard quantization constraint into a soft probabilistic guidance. Furthermore, an efficient, dual-domain Mamba module is specifically designed to dynamically capture and fuse multi-scale local and global features, ensuring interactions between distant but correlated regions. Extensive experiments demonstrate the state-of-the-art performance of the proposed method over previous works, which can promote the application of quantized compressive sensing in practice.
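
To make "likelihood gradient" concrete, the sketch below works through one common model for a quantized measurement: the unquantized value is z plus Gaussian noise of standard deviation sigma, and the quantizer reports only the bin [low, high) it fell in. The bin probability is then Phi((high - z)/sigma) - Phi((low - z)/sigma), and the gradient of its negative log is available in closed form. The noise model, the simple probability clamp, and all names are assumptions for illustration; the paper's numerically stable derivation is not reproduced here.

```python
import math


def _cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def bin_nll(z, low, high, sigma):
    """Negative log-probability that z + Gaussian noise lands in [low, high)."""
    prob = _cdf((high - z) / sigma) - _cdf((low - z) / sigma)
    return -math.log(max(prob, 1e-300))  # crude clamp, not a stable formulation


def bin_nll_grad(z, low, high, sigma):
    """Closed-form derivative of bin_nll with respect to z."""
    a = (low - z) / sigma
    b = (high - z) / sigma
    prob = max(_cdf(b) - _cdf(a), 1e-300)
    return (_pdf(b) - _pdf(a)) / (sigma * prob)


if __name__ == "__main__":
    # Inside the bin the gradient is near zero; outside it pushes z back toward the bin.
    for z in (-0.2, 0.5, 1.2):
        print(z, bin_nll_grad(z, low=0.0, high=1.0, sigma=0.1))
```

Compared with an L2 projection onto the bin center, this gradient is nearly flat anywhere well inside the bin and grows only as z leaves it, which is the sense in which a hard constraint becomes soft guidance.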

2605.11463 2026-05-13 cs.CV Version update

Encore: Conditioning Trajectory Forecasting via Biased Ego Rehearsals

Conghao Wong, Ziqian Zou, Xinge You

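Affiliations * Huazhong University of Science and Technology

AI summary This paper studies how to learn and represent agents' subjectivities in trajectory prediction, a challenging but crucial problem. The authors propose Encore, which introduces a biased ego rehearsal mechanism: from short-term observations, the model generates biased rehearsal trajectories for all participants in the scene and uses them as conditions to guide the final prediction, simulating the subjective behavior of different agents more accurately. Experiments show performance gains on multiple datasets and give a clear interpretation of subjectivity in trajectories.

Details
English abstract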

Learning and representing the subjectivities of agents has become a challenging but crucial problem in the trajectory prediction task. Such subjectivities not only present specific spatial or temporal structures, but also are anisotropic for all interaction participants. Despite great efforts, it remains difficult to explicitly learn and forecast these subjectivities, let alone further modulate models' predictions through a specific ego's subjectivity. Inspired by prefactual thoughts in psychology and relevant theatrical concepts, we interpret such subjectivities in future trajectories as the continuous process from rehearsal to encore. In the rehearsal phase, the proposed ego predictor focuses on how each ego agent learns to derive and direct a set of explicitly biased rehearsal trajectories for all participants in the scene from the short-term observations. Then, these rehearsal trajectories serve as immediate controls to condition final predictions, providing direct yet distinct ego biases for the prediction network to simulate agents' various subjectivities. Experiments across datasets not only demonstrate a consistent improvement in the performance of the proposed Encore trajectory prediction model but also provide clear interpretability regarding subjectivities as biased ego rehearsals.

2605.11462 2026-05-13 cs.CV cs.AI Version update

SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

Zishan Liu, Ruoxi Zang, Yanglin Zhang, Wei Liu, Yin Zhang, Jian Yao, Jiayin Zheng, Zhengzhe Liu

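Affiliations * Lingnan University; XPENG Robotics

AI summary This work proposes SpatialForge, a scalable data synthesis method that generates supervision for 3D spatial reasoning from open-world 2D images, addressing the weakness of current large vision-language models in spatial reasoning. By decomposing spatial reasoning into perception and relation and constructing structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, the method automatically generates high-quality spatial QA data. On this basis, the authors build SpatialForge-10M, a large dataset of 10 million spatial QA pairs, and validate it on multiple spatial reasoning benchmarks, where it significantly improves the spatial reasoning ability of vision-language models.

Details
English abstract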

Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.

2605.11439 2026-05-13 cs.CV cs.LG Version update

Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment

Armin Zarbaft, Ehsan Karimi, Nhut Le, Maryam Rahnemoonfar

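AI summary This paper investigates how structured reasoning strategies can improve the reliability of pretrained multimodal large language models on post-disaster visual question answering. It proposes Instruct-ICL, in which one MLLM generates task-specific instructions that serve as Chain-of-Thought (CoT) guidance for a second MLLM during answer generation, combined with varying degrees of in-context learning (ICL). Experiments on the FloodNet dataset show higher answer accuracy, offering a more reliable approach to rapid post-disaster assessment.

Comments Accepted by the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

Details
English abstract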

Rapid and accurate situational awareness is essential for effective response during natural disasters, where delays in analysis can significantly hinder decision-making. Training task-specific models for post-disaster assessment is often time-consuming and computationally expensive, making such approaches impractical in time-critical scenarios. Consequently, pretrained multimodal large language models (MLLMs) have emerged as a promising alternative for post-disaster visual question answering (VQA), a task that aims to answer structured questions about visual scenes by jointly reasoning over images and text. While these models demonstrate strong multimodal reasoning capabilities, their responses can be sensitive to prompt formulation, which can limit their reliability in real-world disaster assessment scenarios. In this paper, we investigate whether structured reasoning strategies can improve the reliability of pretrained MLLMs for post-disaster VQA. Specifically, we explore multiple prompting paradigms in which one MLLM is used to generate task-specific instructions that serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are incorporated during answer generation with varying degrees of in-context learning (ICL), enabling the model to leverage both explicit reasoning guidance and contextual examples. We conduct our evaluation on the FloodNet dataset and compare these approaches against a zero-shot baseline. Our results demonstrate that integrating instruction-driven CoT reasoning consistently improves answer accuracy.

2605.11438 2026-05-13 cs.CV Version update

Beyond Masks: The Case for Medical Image Parsing

Siddharth Gupta, Alan L. Yuille, Zongwei Zhou

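Affiliations * Johns Hopkins University; Northwestern University; Johns Hopkins Medicine

AI summary This paper proposes medical image parsing as the central output of medical imaging research, arguing that the field should move beyond conventional per-voxel segmentation masks to structured representations containing entities, attributes, and relationships that describe image content more completely. It finds that current systems do well on entity recognition but remain severely lacking in attribute description, inter-entity relationships, and closure. The authors argue for richer output formats and training signals so that models move from measuring to explaining, closer to actual clinical needs.

Details
English abstract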

Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.

2605.11435 2026-05-13 cs.CV Version update

ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models

Hai Jiang, Zhen Liu, Yinjie Lei, Songchen Han, Bing Zeng, Shuaicheng Liu

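Affiliations * School of Aeronautics and Astronautics, Sichuan University; University of Electronic Science and Technology of China; College of Electronics and Information Engineering, Sichuan University

AI summary This paper proposes ZeroIDIR, a diffusion-based zero-reference restoration framework for illumination-degraded images. Trained only on low-quality degraded images, it decouples illumination correction from diffusion-based reconstruction and introduces an adaptive gamma correction module and a histogram-guided illumination correction loss, which improve illumination consistency and provide reliable inputs for the subsequent diffusion process. A perturbed diffusion consistency loss further improves detail recovery and stability. Experiments show that the method outperforms existing unsupervised methods on several public datasets and generalizes well across scenes.

Comments Accepted by CVPR 2026

Details
English abstract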

In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction to generate illumination-corrected only representations to mitigate exposure bias and serve as reliable inputs for subsequent diffusion processes, where a histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code is available at https://github.com/JianghaiSCU/ZeroIDIR.

2605.11430 2026-05-13 cs.CV cs.AI cs.LG Version update

Diabetic Retinopathy Classification using Downscaling Algorithms and Deep Learning

Nishi Doshi, Urvi Oza, Pankaj Kumar

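Affiliations * Dhirubhai Ambani Institute of Information and Communication Technology

AI summary This study addresses the varying image sizes in diabetic retinopathy (DR) classification by applying several downscaling algorithms to retinal images before feeding them to a deep learning network. It combines the Kaggle and Indian Diabetic Retinopathy Image datasets and runs classification experiments on a modified multi-channel Inception V3 architecture. The reported accuracy, specificity, and sensitivity outperform prior methods, offering a more effective solution for automated DR grading.

Journal ref 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN)

Details
English abstract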

Diabetic Retinopathy (DR) is an art and science of recording and classifying the retinal images of a diabetic patient. DR classification deals with classifying retinal fundus images into five stages on the basis of severity of diabetes. One of the major issues faced in DR classification is the large and varying size of images. In this paper, we propose and explore the use of several downscaling algorithms before feeding the image data to a deep learning network for classification. To improve training and testing, we amalgamate two datasets: Kaggle and the Indian Diabetic Retinopathy Image Dataset. Our experiments have been performed on a novel Multi Channel Inception V3 architecture with a unique self-crafted preprocessing phase. We report results of the proposed approach using accuracy, specificity, and sensitivity, which outperform the previous state-of-the-art methods. Index Terms: Diabetic Retinopathy, Downscaling Algorithms, Multichannel CNN Architecture, Deep Learning

2605.11427 2026-05-13 cs.CV Version update

PD-4DGS: Progressive Decomposition of 4D Gaussian Splatting for Bandwidth-Adaptive Dynamic Scene Streaming

Jiachen Li, Guangzhi Han, Jin Wan, Delong Han, Yuan Gao, Min Li, Mingle Zhou, Gang Li

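Affiliations * Qilu University of Technology

AI summary PD-4DGS is a progressive 4D Gaussian Splatting compression framework for dynamic scene streaming. It targets the long delays before rendering and the incompatibility with adaptive-bitrate delivery that existing 4DGS models suffer on bandwidth-limited devices. Through Hierarchical Deformation Decomposition (HDD), it splits the motion structure of 4DGS into three independently transmittable layers, so that any prefix of the bitstream is renderable, enabling scalable streaming. Experiments show that PD-4DGS significantly reduces transmission bandwidth and first-frame latency while preserving rendering quality, offering a practical route to real-time 4DGS streaming on mobile devices.

Details
English abstract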

4D Gaussian Splatting (4DGS) enables high-quality dynamic novel view synthesis, yet current models remain monolithic bitstreams that clients must download in full before any frame can be rendered, causing black-screen waits of tens to hundreds of seconds on mobile bandwidth and leaving 4DGS incompatible with modern adaptive-bitrate delivery. Progressive 3DGS compression alleviates this for static scenes, but it acts only on spatial anchors and cannot partition the temporal deformation networks that dominate dynamic-scene size. We present PD-4DGS, the first framework for progressive compression and on-demand transmission of 4DGS. Hierarchical Deformation Decomposition (HDD) externalises the coarse-to-fine motion hierarchy already latent in 4DGS into three independently transmittable layers -- a static scaffold, a global deformation, and a local refinement -- so that any prefix of the bitstream is already renderable, turning a single training run into a scalable, DASH/HLS-compatible bitstream. A Gaussian-entropy attribute rate-distortion loss together with a temporal mask consistency regulariser shrink the base layer while suppressing low-bitrate flicker; a capacity-weighted rollout schedule, gated online by a learnt activation rate rho, then prevents deformation-network under-training without any per-scene hyperparameter. On the Dycheck iPhone benchmark, PD-4DGS cuts the streamed bitstream by >60% at matched rendering fidelity and reduces first-frame latency from 73--930 s to ~1.7 s on a 2 Mbps link, uniquely enabling true on-demand progressive streaming for 4DGS.
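
The latency figures can be read as rough payload sizes with a back-of-the-envelope calculation. The sketch below assumes that first-frame latency is dominated by download time (ignoring round trips, decoding, and rendering), that 1 Mbps is 10^6 bits per second, and that 1 MB is 10^6 bytes. The function name is hypothetical and the resulting sizes are implied estimates, not numbers reported in the abstract.

```python
def payload_megabytes(latency_s, link_mbps):
    """Data volume (MB) that fits through a link in the given time."""
    return latency_s * link_mbps / 8.0


cases = [
    ("progressive first frame", 1.7),
    ("monolithic, low end", 73.0),
    ("monolithic, high end", 930.0),
]
for label, seconds in cases:
    print(f"{label}: ~{payload_megabytes(seconds, 2.0):.3f} MB")
```

Under these assumptions, a 1.7 s first frame on a 2 Mbps link corresponds to a base layer of roughly 0.4 MB, while waits of 73 to 930 s correspond to monolithic downloads of about 18.25 MB to 232.5 MB.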

2605.11424 2026-05-13 cs.CV Version update

VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors

Jimin Tang, Wenyuan Zhang, Junsheng Zhou, Zian Huang, Kanle Shi, Shenkun Xu, Yu-Shen Liu, Zhizhong Han

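Affiliations * School of Software, Tsinghua University; Department of Computer Science, Wayne State University

AI summary VidSplat is a Gaussian Splatting-based generative reconstruction framework that addresses missing and occluded regions in multi-view surface reconstruction from sparse views. It uses video diffusion priors to iteratively generate novel views that fill in regions the inputs do not cover, thereby reconstructing complete 3D scenes. At its core are a training-free, stage-wise denoising strategy and an iterative refinement mechanism, which improve the geometric consistency and completeness of the reconstruction.

Comments Accepted by SIGGRAPH Conference 2026. Project Page: https://tangjm24.github.io/VidSplat

Details
English abstract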

Gaussian Splatting has achieved remarkable progress in multi-view surface reconstruction, yet it exhibits notable degradation when only a few views are available. Although recent efforts alleviate this issue by enhancing multi-view consistency to produce plausible surfaces, they struggle to infer unseen, occluded, or weakly constrained regions beyond the input coverage. To address this limitation, we present VidSplat, a training-free generative reconstruction framework that leverages powerful video diffusion priors to iteratively synthesize novel views that compensate for missing input coverage, and thereby recover complete 3D scenes from sparse inputs. Specifically, we tackle two key challenges that enable the effective integration of generation and reconstruction. First, for 3D-consistent generation, we devise a training-free, stage-wise denoising strategy that adaptively guides the denoising direction toward the underlying geometry using the rendered RGB and mask images. Second, to enhance the reconstruction, we develop an iterative mechanism that samples camera trajectories, explores unobserved regions, synthesizes novel views, and supplements training through confidence-weighted refinement. VidSplat is robust to sparse input, even a single image. Extensive experiments on widely used benchmarks demonstrate our superior performance in sparse-view scene reconstruction.

2605.11385 2026-05-13 cs.CV cs.RO Version update

JACoP: Joint Alignment for Compliant Multi-Agent Prediction

Qingze Liu, Alen Mrdovic, Danrui Li, Mathew Schwartz, Sejong Yoon, Mubbasir Kapadia

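Affiliations * Rutgers University, New Brunswick; The College of New Jersey

AI summary This paper proposes JACoP, a multi-stage framework that addresses collective compliance in multi-agent trajectory prediction. Its core approach combines anchor-based filtering of individual trajectories with Markov Random Field-based joint trajectory alignment, reducing social collisions and environmental violations among trajectories. JACoP significantly improves scene-level plausibility while maintaining prediction accuracy, offering safer and more reliable predictions for practical applications.

Comments Accepted by CVPRF 2026

Details
English abstract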

Stochastic Human Trajectory Prediction (HTP) using generative modeling has emerged as a significant area of research. Although state-of-the-art models excel in optimizing the accuracy of individual agents, they often struggle to generate predictions that are collectively compliant, leading to output trajectories marred by social collisions and environmental violations, thus rendering them impractical for real-world applications. To bridge this gap, we present JACoP: Joint Alignment for Compliant Multi-Agent Prediction, an innovative multi-stage framework that ensures scene-level plausibility. JACoP incorporates an Anchor-Based Agent-Centric Profiler for effective initial compliance filtering and employs a Markov Random Field (MRF) based aligner to formalize the joint selection for scene predictions. By representing inter-agent spatial and social costs as MRF energy potentials, we successfully infer and sample from the joint trajectory distribution, achieving prediction with optimal scene compliance. Comprehensive experiments show that JACoP not only achieves competitive accuracy, but also sets a new standard in reducing both environmental violations and social collisions, thereby confirming its ability to produce collectively feasible and practically applicable trajectory predictions.
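
The joint-selection idea can be sketched as energy minimization over a small pairwise MRF: each agent contributes a unary cost for its chosen candidate trajectory, and each pair of agents contributes a penalty if their chosen trajectories collide. The sketch below is an assumption-laden toy, not JACoP's aligner: it finds the minimum-energy assignment by exhaustive search rather than sampling from the joint distribution, and the collision radius, penalty value, and function names are hypothetical.

```python
import itertools
import math


def collides(traj_a, traj_b, radius):
    """True if the two agents come closer than `radius` at any shared timestep."""
    return any(math.dist(p, q) < radius for p, q in zip(traj_a, traj_b))


def joint_select(candidates, unary, radius=0.5, penalty=10.0):
    """Pick one candidate per agent to minimize unary + pairwise energy.

    candidates[i][k] is the k-th trajectory (list of (x, y)) of agent i,
    unary[i][k] is its individual cost (lower is better).
    """
    best_energy, best_assign = math.inf, None
    choices = (range(len(c)) for c in candidates)
    for assign in itertools.product(*choices):
        energy = sum(unary[i][k] for i, k in enumerate(assign))
        for i, j in itertools.combinations(range(len(candidates)), 2):
            if collides(candidates[i][assign[i]], candidates[j][assign[j]], radius):
                energy += penalty
        if energy < best_energy:
            best_energy, best_assign = energy, assign
    return best_assign, best_energy


# Two agents walking toward each other: their individually best options collide.
agent_a = [[(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 2)]]
agent_b = [[(2, 0), (1, 0), (0, 0)], [(2, 0), (1, -1), (0, -2)]]
costs = [[0.1, 0.3], [0.1, 0.4]]
print(joint_select([agent_a, agent_b], costs))
```

Selecting each agent's lowest-cost trajectory independently would put both agents at the same point mid-way; the joint energy instead moves one agent to its second-best option. Exhaustive search grows exponentially with the number of agents, so a practical system would need approximate inference.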

2605.11383 2026-05-13 cs.CV Version update

HamBR: Active Decision Boundary Restoration Based on Hamiltonian Dynamics for Learning with Noisy Labels

Ningkang Peng, Jingyang Mao, Qianfeng Yu, Xiaoqian Peng, Peirong Ma, Yanhui Gu

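Affiliations * Nanjing Normal University; Nanjing University of Chinese Medicine

AI summary In large-scale visual recognition and data mining tasks, noisy labels severely harm the generalization of deep neural networks. This paper proposes HamBR, presented as the first active decision boundary restoration method based on Hamiltonian dynamics. It uses a Spherical Hamiltonian Monte Carlo mechanism to actively probe inter-class ambiguous regions in feature space and synthesize high-quality virtual outliers, then uses energy-based modeling to build robust barriers at the decision boundaries, restoring their discriminative sharpness. Experiments show that HamBR achieves state-of-the-art performance on multiple benchmarks and significantly improves out-of-distribution detection.

Details
English abstract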

In large-scale visual recognition and data mining tasks, the presence of noisy labels severely undermines the generalization capability of deep neural networks (DNNs). Prevalent sample selection methods rely primarily on training loss or prediction confidence for passive screening. However, within a feature space degraded by noise, decision boundaries undergo systematic boundary collapse. This phenomenon hinders the ability of the model to distinguish between hard clean samples and noisy samples at the decision margins, thereby creating a significant performance bottleneck. This study is the first to emphasize the pivotal importance of active boundary restoration for noise-robust learning. We propose HamBR, a novel paradigm based on Hamiltonian dynamics. The core approach leverages the Spherical Hamiltonian Monte Carlo (Spherical HMC) mechanism to actively probe inter-class ambiguous regions within the representation space and synthesize high-quality virtual outliers. By imposing explicit repulsion constraints via energy-based modeling, these synthesized samples establish robust energy barriers at the decision boundaries. This mechanism forces real samples to move from dispersed overlapping regions toward their respective class centers, thereby restoring the discriminative sharpness of the decision boundaries. HamBR demonstrates exceptional versatility and can be integrated as a plug-and-play defense module into existing semi-supervised noisy label learning frameworks. Empirical evaluations show that the proposed paradigm significantly enhances the discriminative accuracy of hard boundary samples, achieving state-of-the-art (SOTA) performance on CIFAR-10/100 and real-world noise benchmarks. Furthermore, it exhibits superior convergence efficiency and reliable robustness, while significantly improving the model's capability for Out-of-Distribution (OOD) detection.

2605.11369 2026-05-13 cs.CV Version update

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

Sanghyeok Nam, Byoungjun Kim, Daehyung Park, Tae-Kyun Kim

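Affiliations * KAIST

AI summary This study tackles the generation of dynamic human-object interaction motions and proposes a framework that combines pretrained motion priors with imitation agents to produce long-term dynamic interactions such as running while holding an object. In the planning stage, it augments the dataset with a pretrained human motion diffusion model and generates object trajectories to plan dynamic interaction sequences. In the execution stage, a composer network blends pretrained imitation agents specialized for either dynamic human motions or static interactions, composing their complementary skills in space and time. The method improves task success rates while maintaining interaction quality and greatly reduces training time.

Comments CVPR Findings 2026

Details
English abstract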

Generating physically plausible dynamic motions of human-object interaction (HOI) remains challenging, mainly because existing HOI datasets are limited to static interactions and pretrained agents are capable of either dynamic full-body motions without objects or static HOI motions. Recent works such as InsActor and CLoSD generate HOI motions in planning and execution stages, but are limited to either static or short-term contacts, e.g., striking. In this work, we propose a framework that achieves dynamic and long-term interaction motions, such as running while holding a table, by combining pretrained motion priors and imitation agents in planning and execution stages. In the planning stage, we augment HOI datasets with dynamic priors from a pretrained human motion diffusion model, followed by object trajectory generation. This plans dynamic HOI sequences. In the execution stage, a composer network blends actions of pretrained imitation agents specialized either for dynamic human motions or static HOI motions, enabling spatio-temporal composition of their complementary skills. Our method consistently improves success rates over relevant prior art while maintaining interaction for dynamic HOI tasks. Furthermore, blending pretrained experts with our composer achieves competitive performance in significantly reduced training time. Ablation studies validate the effectiveness of our augmentation and composer blending.

2605.11363 2026-05-13 cs.CV cs.CL Version update

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

Wei Wu, Ziyang Xu, Zeyu Zhang, Yang Zhao, Hao Tang

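Affiliations * Peking University; La Trobe University

AI summary This paper proposes PresentAgent-2, an agentic framework that generates complete presentation videos with multimodal content from user queries. It supports three independent presentation modes (single-speaker narration, multi-speaker discussion, and interactive Q&A) and, through deep research and multimodal resource integration, handles content generation, script writing, and dynamic media composition. The work extends presentation generation from document-dependent slide creation to query-driven, research-grounded, interactive video generation.

Details
English abstract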

Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.

2605.11354 2026-05-13 cs.CV Version update

Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction

Haoyu Zhang, Zeyu Zhang, Zedong Zhou, Yang Zhao, Hao Tang

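Affiliations * Peking University; La Trobe University

AI summary This paper proposes Lite3R, a model-agnostic framework for improving the efficiency of Transformer-based 3D reconstruction. It replaces dense multi-view attention with Sparse Linear Attention to cut computation, and pairs this with a parameter-efficient FP8-aware quantization-aware training strategy for stable geometry reconstruction at low precision. Experiments show that Lite3R significantly reduces latency and memory usage on representative backbones while retaining competitive reconstruction quality, providing an effective algorithm-system co-design approach for efficient 3D reconstruction in practice.

Details
English abstract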

Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.

2605.11311 2026-05-13 cs.LG cs.CV stat.CO stat.ML Version update

Couple to Control: Joint Initial Noise Design in Diffusion Models

Jing Jia, Liyue Shen, Guanyang Wang

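Affiliations * Department of Computer Science; Rutgers University; Department of EECS; University of Michigan; Department of Statistics

AI summary This paper studies initial noise design in diffusion models, arguing that the conventional assumption of mutually independent initial noises may limit generation. The authors propose designing the dependence structure among noises while keeping each noise marginally standard Gaussian, which improves the diversity and quality of multi-sample generation without changing the model's input distribution. Experiments show that the method improves generation diversity on several mainstream diffusion models while preserving image quality and prompt alignment, and outperforms existing optimization-based methods on some metrics.

Comments 26 pages

Details
English abstract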

Diffusion models typically generate image batches from independent Gaussian initial noises. We argue that this independence assumption is only one choice within a broader class of valid joint noise designs. Instead, one can specify a coupling of the initial noises: each noise remains marginally standard Gaussian, so the pretrained diffusion model receives the same single-sample input distribution, while the dependence across samples is chosen by design. This reframes initial-noise control from selecting or optimizing individual seeds to designing the dependence structure of a multi-sample gallery. This view gives a general framework for initial-noise design, covering several existing methods as special cases and leading naturally to new coupled-noise constructions. Coupled noise can improve generation on its own without adding sampling cost, and it is flexible enough to serve as a structured initialization for optimization-based pipelines when additional computation is available. Empirically, repulsive Gaussian coupling improves gallery diversity on SD1.5, SDXL, and SD3 while largely preserving prompt alignment and image quality. It matches or outperforms recent test-time noise-optimization baselines on several diversity metrics at the same sampling cost as independent generation. Subspace couplings also support fixed-object background generation, producing diverse, natural backgrounds compared with specialized inpainting baselines, with a tunable trade-off in foreground fidelity.
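
One simple way to see how noises can be dependent while each stays marginally standard Gaussian is the mean-centering construction sketched below. It is an illustrative coupling chosen for its short proof, not necessarily the repulsive or subspace couplings used in the paper, and the function name and parameters are hypothetical. For K i.i.d. draws g_1..g_K, the vectors z_i = sqrt(K/(K-1)) * (g_i - mean(g)) are jointly Gaussian with unit variance per coordinate and pairwise correlation -1/(K-1).

```python
import math
import random


def negatively_coupled_noises(num_samples, dim, seed=0):
    """Return num_samples noise vectors, each marginally N(0, I), with
    pairwise correlation -1/(num_samples - 1) in every coordinate.

    Requires num_samples >= 2.
    """
    rng = random.Random(seed)
    iid = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_samples)]
    scale = math.sqrt(num_samples / (num_samples - 1))
    coupled = [[0.0] * dim for _ in range(num_samples)]
    for d in range(dim):
        mean = sum(row[d] for row in iid) / num_samples
        for i in range(num_samples):
            # Centering removes the shared component; rescaling restores unit variance.
            coupled[i][d] = scale * (iid[i][d] - mean)
    return coupled


gallery = negatively_coupled_noises(num_samples=4, dim=20000, seed=1)
var0 = sum(v * v for v in gallery[0]) / len(gallery[0])
corr01 = sum(a * b for a, b in zip(gallery[0], gallery[1])) / len(gallery[0])
print(f"empirical variance: {var0:.3f}, empirical correlation: {corr01:.3f}")
```

Each vector can be fed to a pretrained sampler unchanged, since its marginal distribution is the same as an independent draw; only the relationship between gallery members differs. The cost is the same as independent sampling apart from the centering pass.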

2605.11307 2026-05-13 cs.CV cs.LG Version update

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

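Affiliations * Duke University

AI summary Vision2Code is a benchmark and evaluation framework for multi-domain image-to-code generation, testing whether vision-language models can turn the structure of an image into executable code. It contains 2,169 test examples from 15 datasets covering charts, geometry, scientific imagery, and other domains, and scores outputs with a VLM-based rater, separating code execution failures from reconstruction quality. Experiments show that model performance varies markedly across domains and that filtered model outputs used as training data can improve generation performance.

Comments Project page: https://image2code.github.io/vision2code/

Details
English abstract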

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at https://image2code.github.io/vision2code/.
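
The separation between execution failures and reconstruction quality can be sketched as two independent statistics: the fraction of generated programs that render at all, and the mean rater score over only those that did. The record layout (`rendered`, `score`) and the function name are hypothetical; the benchmark's actual scoring scale and aggregation are not specified in the abstract.

```python
def summarize(results):
    """Report render success separately from quality on rendered outputs.

    results: list of dicts with 'rendered' (bool) and 'score'
    (float for rendered outputs, None otherwise).
    """
    rendered = [r for r in results if r["rendered"]]
    render_rate = len(rendered) / len(results) if results else 0.0
    mean_score = (
        sum(r["score"] for r in rendered) / len(rendered) if rendered else None
    )
    return {"render_rate": render_rate, "mean_score_rendered": mean_score}


runs = [
    {"rendered": True, "score": 2.0},
    {"rendered": True, "score": 1.0},
    {"rendered": False, "score": None},
    {"rendered": True, "score": 3.0},
]
print(summarize(runs))
```

Keeping the two numbers apart makes it possible to tell a model that writes broken code from one whose code runs but reconstructs the image poorly, a distinction that a single averaged score would hide.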

2605.11304 2026-05-13 cs.CV Version update

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

Eva Prakash, Yunhe Gao, Chong Wang, Justin Xu, Neal Prakash, Arne Michalson, Seena Dehkharghani, Eun Kyoung Hong, Julie Bauml, Roger Boodoo, Jean-Benoit Delbrouck, Sophie Ostmeier, Curtis Langlotz

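Affiliations * Stanford University; University of Oxford; University of California, Berkeley; HOPPR; University Hospital Zurich

AI summary CheXTemporal is a dataset for temporal reasoning over chest X-rays, addressing the shortcomings of current models in handling longitudinal change in chest imaging. It contains paired prior and current chest X-rays with fine-grained temporal and spatial annotations and supports a five-class disease progression taxonomy. The authors also build a weakly supervised dataset of 280K image pairs to evaluate models on temporal reasoning and progression classification. Results show that existing models still have clear limitations in temporal reasoning and spatial grounding.

Details
English abstract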

Chest radiograph interpretation requires temporal reasoning over prior and current studies, yet most vision-language models are trained on static image-report pairs and lack explicit supervision for modeling longitudinal change. We introduce CheXTemporal, a dataset for temporally grounded reasoning in chest radiography consisting of paired prior-current chest X-rays (CXR) with finding-level temporal and spatial annotations. The dataset includes a five-class progression taxonomy (new, worse, stable, improved, resolved), localized spatial supervision of pathology, explicit spatial-temporal alignment across paired studies, and multi-source coverage for cross-domain evaluation. We additionally construct a 280K-pair silver dataset with automatically derived temporal and anatomical supervision for large-scale evaluation under weaker supervision. Using these resources, we evaluate multiple state-of-the-art vision-language CXR models on grounding and progression-classification tasks in a zero-shot setting. Across both gold and silver evaluations, current models exhibit consistent limitations in spatial grounding, fine-grained temporal reasoning, and robustness under distribution shift. In particular, models perform substantially better on salient progression categories such as worse than on temporally subtle states such as stable and resolved, suggesting limited modeling of longitudinal disease evolution in chest radiography.

2605.11301 2026-05-13 cs.AI cs.CL cs.CV Updated version

LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

Xueqi Cheng, Yushun Dong

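Affiliations * Department of Computer Science

AI summary This paper proposes LatentRouter, a routing method for multimodal models that selects the most suitable multimodal large language model based on the characteristics of the image-question input. The method builds multimodal routing capsules and model capability tokens, uses communication between these latent states to predict how each candidate model would perform, and improves prediction accuracy with a distributional outcome head and a bounded capsule correction. Experiments show that LatentRouter outperforms existing methods on multiple benchmarks, especially on tasks that require visual, layout, or reasoning capabilities.

Details
English abstract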

Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
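The utility-based policy with availability masking described above can be illustrated with a minimal sketch. This is not the authors' implementation: the linear utility `quality - cost_weight * cost`, the function name `route`, and the model names are all assumptions made for illustration, and the predicted qualities are hard-coded stand-ins for whatever a learned router would output.

```python
from typing import Dict, Optional, Set


def route(pred_quality: Dict[str, float],
          cost: Dict[str, float],
          available: Set[str],
          cost_weight: float = 0.0) -> Optional[str]:
    """Return the available model with the highest utility.

    Utility is assumed (hypothetically) to be quality - cost_weight * cost.
    With cost_weight = 0 this is purely performance-oriented routing.
    """
    best_name, best_utility = None, float("-inf")
    for name, quality in pred_quality.items():
        if name not in available:  # availability masking: skip absent models
            continue
        utility = quality - cost_weight * cost.get(name, 0.0)
        if utility > best_utility:
            best_name, best_utility = name, utility
    return best_name


# Hypothetical per-model predictions for one image-question query.
pred_quality = {"model_a": 0.82, "model_b": 0.78, "model_c": 0.60}
cost = {"model_a": 1.0, "model_b": 0.2, "model_c": 0.1}
everything = set(pred_quality)

print(route(pred_quality, cost, everything))                   # performance only
print(route(pred_quality, cost, everything, cost_weight=0.1))  # performance-cost
print(route(pred_quality, cost, {"model_c"}))                  # reduced pool
```

Because every candidate is scored independently and unavailable ones are simply skipped, the same scoring function can serve a changing candidate pool without retraining, which is the property the abstract attributes to shared per-model scoring.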

2605.11300 2026-05-13 cs.CV Updated version

Can Graphs Help Vision SSMs See Better?

Dhruv Parikh, Anvitha Ramachandran, Haoyang Fan, Mustafa Munir, Rajgopal Kannan, Viktor Prasanna

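Affiliations * USC; UT Austin; DEVCOM ARL Army Research Office, USA

AI summary This paper studies how graph structure can improve vision state space models (Vision SSMs) and proposes GraphScan, a graph-based dynamic scanning operator. For each visual token, the method builds a local graph, learns feature-based affinities, and produces the output token through one step of message passing over the semantic neighborhood, so that tokens are locally and semantically aligned before global state-space mixing. Experiments show that GraphScan-Mamba, which integrates GraphScan, achieves state-of-the-art performance among Vision SSMs across several vision tasks with small computational overhead, offering a semantics-oriented view of scanning for future Vision SSMs.

Comments Technical Report

Details
English abstract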

Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.

2605.11276 2026-05-13 cs.CV cs.AI Updated version

Generative AI for Visualizing Highway Construction Hazards Through Synthetic Images and Temporal Sequences

Trevor Neece, Mason Smetana, Lev Khazanovich

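Affiliations * University of Pittsburgh

AI summary This study proposes a generative-AI method that produces synthetic images and temporal sequences of highway construction hazard scenarios from OSHA Severe Injury Reports to support safety training. Two generation modes were developed, single-image generation and four-stage temporal-sequence generation, and the generated images were evaluated along several dimensions (educational value, fidelity, and alignment) using CLIP-based semantic retrieval and expert assessment. The method provides visual material for safety training without photographing real accident scenes and offers a new evaluation framework for synthetic image generation in other domains.

Details
English abstract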

Highway construction workers face a high risk of serious injury or death. Image-based training materials depicting hazardous scenarios are essential for engaging safety instruction but remain scarce due to ethical and logistical barriers. This study develops and evaluates a generative AI methodology for producing synthetic visualizations of highway construction hazards from OSHA Severe Injury Report narratives. Two modes were developed: a single-pass approach yielding one image per incident, and a temporal approach producing a four-stage sequence. A sample of 75 incident records yielded 750 images, evaluated using CLIP-based semantic retrieval and expert assessment across dimensions such as educational utility, fidelity, and alignment. Single-pass images achieved 81.1% educational acceptability with fidelity and alignment scores of 4.14/5 and 4.07/5, respectively, while temporal sequences achieved 60.9% acceptability with comparable alignment (3.94/5) but lower fidelity (3.51/5). CLIP-based retrieval revealed that both modes produce images with statistically significant retrieval capabilities. This is among the first studies to leverage modern autoregressive image generation models for visualizing construction hazards from reported severe injuries and to generate temporally sequenced hazard imagery, and a new multi-dimensional evaluation framework was developed to support future research in this domain. The work enables safety trainers to pair narrative storytelling with visual learning material without photographing real-world hazards, and the framework could be applied to datasets across diverse domains, enabling synthetic image generation tailored to new application areas.

2605.11267 2026-05-13 cs.CV Updated version

Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates

Quanyun Wu, Kyle Gao, Wentao Sun, Hongjie He, Yuhao Chen, David A. Clausi, Jonathan Li

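Affiliations * East China Normal University

AI summary This paper proposes a geometrically consistent, real-scale framework for measuring island area and coastline length based on monocular vision. Given only the geographic coordinates or name of the target area, it automatically obtains a low-altitude surrounding image sequence, restores the global physical scale with a lightweight trajectory alignment algorithm, and extracts high-precision 2D planar area and perimeter. The method does not rely on traditional GIS data, which greatly reduces mapping cost; experiments show a measurement error stable at around 10%, with high accuracy, robustness, and inference efficiency, providing a practical new paradigm for large-scale marine and coastline monitoring.

Comments Accepted for publication at IEEE OCEANS (Sanya) 2026

Details
English abstract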

Accurate measurement of island area and coastline length is crucial for coastal zone monitoring and oceanographic analysis. However, traditional measurement and mapping methods usually rely heavily on orthophotos, expensive airborne depth sensors, or dense ground control points, which are labor-intensive, time-consuming, and inefficient in vast and inaccessible open-sea environments. To overcome these challenges and remove the reliance on manual field surveys, this paper proposes a geometrically consistent, real-scale island measurement framework based on pure monocular vision. The framework significantly reduces mapping cost through a fully automated process and achieves efficient measurement without prior GIS data. In our pipeline, only the geographical coordinates or name of the target area needs to be input to obtain a low-altitude surrounding image sequence. After the point clouds are obtained, a lightweight trajectory alignment algorithm (Umeyama) restores the global physical scale, and the scaled model is orthorectified, enabling high-precision area and perimeter extraction directly on the 2D rasterized plane. We verified this pipeline on four islands with different terrain features (covering islands with natural landforms and islands with complex artificial facilities). The experimental results show that the final measurement error of the system is stable at around 10%, demonstrating excellent accuracy and robustness. Moreover, the framework has fast inference, requiring only 70 ms to process a single high-resolution image and generate point clouds, providing a highly practical new paradigm for large-scale marine and coastline

2605.11266 2026-05-13 cs.CV cs.GR cs.LG Updated version

PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives

Zachary Lee, Maxwell Jacobson, Yexiang Xue

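Affiliations * Department of Computer Science, Purdue University

AI summary This study proposes PG-3DGS, a physics-guided 3D Gaussian Splatting method for generating 3D structures that are both visually realistic and physically functional. By coupling differentiable physics simulation with 3D Gaussian representations, the method considers visual losses and physical objectives together during shape optimization, producing functional objects such as teapots that can pour and airplanes that can generate lift. Experiments show that PG-3DGS improves physical functionality while preserving visual quality, and bench-top lift tests with 3D-printed aircraft show higher measured lift for the generated structures than for an appearance-matching baseline.

Comments Submitted to Artificial Intelligence. 52 pages

Details
English abstract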

Recent advances in Gaussian Splatting have enabled fast, high-fidelity 3D scene generation, yet these methods remain purely visual and lack an understanding of how shapes behave in the physical world. We introduce Physics-Guided 3D Gaussian Splatting (PG-3DGS), a framework that couples differentiable physics simulation with 3D Gaussian representations to generate 3D structures satisfying physics functionalities. By allowing physical objectives to guide the shape optimization process alongside visual losses, our approach produces geometries that are not only photometrically accurate but also physically functional. The model learns to adjust shapes so that the generated objects exhibit physically meaningful behaviors, for example, teapots that can pour and airplanes that can generate lift, without sacrificing visual quality. Experiments on pouring and aerodynamic lift tasks show that PG-3DGS improves physical functionality while preserving visual quality. In addition to simulation gains, bench-top physical lift tests with 3D-printed aircraft (Cessna, B-2 Spirit, and paper plane) under identical airflow conditions show higher scale-measured lift for PG-3DGS-generated structures than for an appearance-matching baseline in all three cases. Our unified framework connects appearance-based reconstruction with physics-based reasoning, enabling end-to-end generation of 3D structures that both look realistic and function correctly.

2605.11265 2026-05-13 cs.CV cs.AI cs.LG Updated version

DenseTRF: Texture-Aware Unsupervised Representation Adaptation for Surgical Scene Dense Prediction

Guiqiu Liao, Matjaž Jogan, Daniel A. Hashimoto

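Affiliations * GRASP Laboratory, University of Pennsylvania; PCASO Laboratory, Department of Surgery, University of Pennsylvania; Department of Computer and Information Science, University of Pennsylvania

AI summary This paper proposes DenseTRF, a self-supervised representation adaptation framework that addresses the performance drop caused by distribution shift when dense prediction tasks in surgical scenes (such as segmentation and surgical zone prediction) are deployed across domains. Based on texture-aware attention, the method learns representations that capture invariant visual structures and adapts them to the target distribution without supervision, improving robustness to domain shift. Experiments show that DenseTRF outperforms state-of-the-art segmentation models and adaptation methods across multiple surgical procedures.

Comments Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)

Details
English abstract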

Dense prediction tasks in surgical computer vision, such as segmentation and surgical zone prediction, can provide valuable guidance for laparoscopic and robotic surgery. However, these models often suffer from distribution shifts, as training datasets rarely cover the variability encountered during deployment, leading to poor generalization. We propose DenseTRF, a self-supervised representation adaptation framework based on texture-centric attention. Our method leverages slot attention to learn texture-aware representations that capture invariant visual structures. By adapting these representations to the target distribution without supervision, DenseTRF significantly improves robustness to domain shifts. The framework is implemented through conditioning dense prediction on slot attention and model merging strategies. Experiments across multiple surgical procedures demonstrate improved cross-distribution generalization in comparison to state-of-the-art segmentation models and test-distribution adaptation methods for dense prediction tasks.

2605.11224 2026-05-13 cs.CV cs.AI Updated version

ABRA: Agent Benchmark for Radiology Applications

Bulat Maksudov, Vladislav Kurenkov, Kathleen M. Curran, Alessandra Mileo

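Affiliations * School of Computing; Dublin City University; School of Medicine; University College Dublin

AI summary ABRA is an agent benchmark for radiology applications that evaluates how well medical agents handle practical imaging tasks. Through 21 function-calling tools, the agent operates a medical image viewer and a DICOM server to perform tasks including slice navigation, windowing, annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across several difficulty tiers and task types, and automatic scorers evaluate the agent's planning, execution, and outcome, revealing that current models have a major bottleneck in perception.

Details
English abstract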

Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0-25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69-100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at https://github.com/Luab/ABRA

2605.11203 2026-05-13 cs.LG cs.CV Updated version

FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry

Elias B. Krey, Nils Neukirch, Nils Strodthoff

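Affiliations * Division AI4Health; Carl von Ossietzky Universität Oldenburg

AI summary This paper studies the geometric structure of intermediate feature representations in deep neural networks. By applying a variety of image transformations in input space, it assesses whether a mapping from the original features to the transformed features can be learned in feature space. The study designs several types of mappings, linear and non-linear, local and global, and analyzes their reconstruction quality and semantic content. The results show that even for complex semantic manipulations, a shared linear model operating on a single feature vector achieves good reconstruction, suggesting that the feature space may be linearly organized to some degree. The study offers a new perspective on how feature space is organized and shows the potential of generative image editing models in this area.

Comments 27 pages, 24 figures, 3 tables, Code is available at https://github.com/AI4HealthUOL/FeatMap

Details
English abstract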

Intermediate feature representations form the backbone of the expressivity and adaptability of deep neural networks. However, their geometric structure remains poorly understood. In this work, we provide indirect insights into this matter by applying a broad selection of manipulations in input space, ranging from geometric and photometric transformations to local masking and semantic manipulations using generative image editing models, and assess the feasibility of learning a mapping in the feature space from the original to the manipulated feature map. To this end, we devise different types of mappings, from linear to non-linear and from local to global, and assess both the reconstruction quality of the mapping and the semantic content of the mapped representations. We demonstrate the feasibility of learning such mappings for all considered transformations. While global (transformer) models that operate on the full feature map often achieve the best results, we show that the same can be achieved with a shared linear model operating on a single feature vector, typically with very little degradation in reconstruction quality, even for highly non-trivial semantic manipulations. We analyze the corresponding mappings across different feature layers and characterize them according to the dominance of weight vs. bias and the effective rank of the linear transformations. These results provide hints for the hypothesis that the feature space is, to a first degree of approximation, organized in linear structures. From a broader perspective, the study demonstrates that generative image editing models might open the door to a deeper understanding of the feature space through input manipulation.

2605.11166 2026-05-13 cs.CV Updated version

Unpacking the Eye of the Beholder: Social Location, Identity, and the Moving Target of Political Perspectives

Elena Sirotkina

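Affiliations * Center for Data Science

AI summary This paper examines how political and social identities shape people's evaluation of political information and notes that conventional computational tools often ignore such differences. The author proposes the Perspectivist Visual Political Sentiment (PVPS) classifier, which uses a large set of evaluations from U.S. adults to predict how groups with different political and social identities evaluate the same image. The approach preserves systematic disagreement between groups and shows that the meaning of political images is a moving target, so understanding what an image conveys requires taking the audience's identity into account.

Details
English abstract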

Political and social identities structure how people evaluate political information, a finding decades deep in political science and routinely discarded by computational tools that often produce single scores that treat a piece of text, an image, or a video as if it means the same thing to everyone. This paper shows that it does not, and that the difference is consequential. To address this problem, I develop the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from approximately 82,000 evaluations by 5,575 U.S. adults to predict how audiences defined by political and social identities will evaluate the same image. Unlike standard tools that average systematic disagreement away, PVPS preserves it, returning an evaluative profile that records who agrees, who diverges, and along which identity lines. Applied to several influential studies of visual sentiment, PVPS shows that perceived violence in protest imagery and the emotional mechanisms behind protest image engagement both change substantively once audience identity is taken into account. It follows that what a political image conveys is a moving target, and measuring it requires knowing whom it is moving.

2605.11131 2026-05-13 cs.CV Updated version

USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

Elisha Dayag, Nhat Thanh Tran, Jack Xin

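Affiliations * University of California Irvine

AI summary This paper proposes USEMA, which applies a scalable and efficient Mamba-like attention mechanism to medical image segmentation, addressing the inefficiency caused by the quadratic computational complexity of conventional vision transformers. It combines local window attention with theoretically consistent arithmetic averaging to balance local feature extraction and global information capture, and merges this attention with convolutional neural networks into a hybrid UNet architecture. Experiments across modalities and image sizes show better computational efficiency than transformer-based models with full self-attention and better segmentation performance than purely convolutional and Mamba-based models.

Details
English abstract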

Accurate medical image segmentation is an integral part of the medical image analysis pipeline and requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains an obstacle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and the recent development of the Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focus, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer-based models using full self-attention, and superior segmentation performance relative to purely convolutional and Mamba-based models.

2605.11115 2026-05-13 cs.CV cs.GR cs.LG Updated version

LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

Pedram Fekri, WenChen Li, William Chen, Peter Altamirano

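Affiliations * Monks AI Research Lab

AI summary This paper proposes LatentHDR, a new framework for generating high-quality high dynamic range (HDR) images. The method decouples scene generation from exposure modeling in latent space: a pretrained diffusion model produces a coherent scene representation, and a lightweight conditional latent-to-latent mapping module deterministically maps it to exposure-specific representations, yielding a structurally consistent multi-exposure stack in a single pass. The method substantially reduces computational cost and improves generation efficiency, achieving state-of-the-art dynamic range with competitive perceptual quality on benchmarks.

Details
English abstract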

High Dynamic Range (HDR) generation remains challenging for generative models, which are largely limited to low dynamic range outputs. Recent diffusion-based approaches approximate HDR by generating multiple exposure-conditioned samples, incurring high computational cost and structural inconsistencies across exposures. We propose LatentHDR, a framework that decouples scene generation from exposure modeling in latent space. A pretrained diffusion backbone produces a single coherent scene representation, while a lightweight conditional latent-to-latent head deterministically maps it to exposure-specific representations. This enables the generation of a dense, structurally consistent exposure stack in a single pass. This design eliminates multi-pass diffusion, ensures cross-exposure alignment, and enables scalable HDR synthesis. LatentHDR supports both text- and image-conditioned HDR generation for perspective and panoramic scenes. Experiments on synthetic data and the SI-HDR benchmark show that LatentHDR achieves state-of-the-art dynamic range with competitive perceptual quality, while reducing computation by an order of magnitude. Our results demonstrate that high-quality HDR generation can be achieved through structured latent modeling, challenging the need for stochastic multi-exposure generation.

2605.11109 2026-05-13 physics.geo-ph cs.AI cs.CV cs.LG Updated version

Deploying Self-Supervised Learning for Real Seismic Data Denoising

Giovanny A. M. Arboleda, Claudio D. T. de Souza, Carlos E. M. dos Anjos, Lessandro de S. S. Valente, Roosevelt de L. Sardinha, Albino Aveleda, Pablo M. Barros, André Bulcão, Alexandre G. Evsukoff

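Affiliations * COPPE Federal University of Rio de Janeiro; CENPES, Petrobras

AI summary This paper studies the feasibility of applying self-supervised learning (SSL) to real seismic data denoising, focusing on the Noisy-as-Clean (NaC) method under controlled conditions. Using four real datasets built from noisy and filtered data, the authors compare NaC with a supervised learning baseline under identical network topology and hyperparameters. They find that synthetic additive white Gaussian noise (AWGN) performs poorly within NaC and that the match between the injected noise and the actual noise characteristics strongly affects denoising performance. The study also shows that fine-tuning self-supervised models on test data improves performance, whereas supervised models see no such gain; being simple, effective, and model-independent, NaC offers a feasible solution for denoising real seismic data.

Details
English abstract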

Self-supervised learning (SSL) has emerged as a promising approach to seismic data denoising as it does not require clean reference data. In this work, the deployment of the Noisy-as-Clean (NaC) method was evaluated for real seismic data denoising under controlled conditions. Two independent seismic acquisitions, each comprising noisy and filtered data, were organized into four real datasets. The NaC SSL method was adapted to add real noise to the noisy input, controlled by a parameter. An experimental protocol with ten experiments was designed to compare different strategies for deploying the NaC SSL method with the supervised learning baseline, using identical network topology and hyperparameters. The models were evaluated in terms of denoising performance, computational cost, and generalization capability. The results show that synthetic additive white Gaussian noise (AWGN) is inadequate for the denoising of seismic data within the NaC method, and performance strongly depends on the compatibility between the injected and actual noise characteristics. Furthermore, both the characteristics of the seismic data and the noise level influence the performance of the model. Self-supervised fine-tuning on test data improved SSL performance, whereas no such gain was observed for fine-tuning of supervised models. Finally, NaC has been shown to be a simple, effective, and model-independent method that offers a feasible solution for the denoising of real seismic data.
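The Noisy-as-Clean idea mentioned above, in which extra noise is added to an already noisy input and the noisy input itself serves as the training target, can be sketched as follows. This is only an illustration of how a training pair might be built: the function name `make_nac_pair`, the scaling parameter `alpha`, and the additive form `noisy + alpha * noise` are assumptions, since the abstract says only that the injected real noise is controlled by a parameter.

```python
from typing import List, Tuple


def make_nac_pair(noisy: List[float],
                  noise: List[float],
                  alpha: float) -> Tuple[List[float], List[float]]:
    """Build one (input, target) pair for Noisy-as-Clean style training.

    noisy : an observed trace that already contains noise
    noise : a noise sample to inject (e.g. taken from real recordings)
    alpha : hypothetical scale controlling how much noise is injected
    The network input is the noisier trace; the target is the observed trace.
    """
    if len(noisy) != len(noise):
        raise ValueError("trace and noise sample must have the same length")
    noisier = [x + alpha * n for x, n in zip(noisy, noise)]
    return noisier, list(noisy)


# Toy example with a three-sample trace.
net_input, target = make_nac_pair([1.0, -0.5, 0.25], [0.2, 0.1, -0.4], alpha=0.5)
print(net_input)
print(target)
```

Under this construction no clean reference trace is needed, which is why the choice of injected noise matters so much: the abstract reports that performance depends on how well the injected noise matches the actual noise.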

2605.11107 2026-05-13 cs.CV cs.AI Updated version

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

Youssef Zaazou, Mark Thomas

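Affiliations * Independent Researcher

AI summary This study addresses the susceptibility of vision-language models (VLMs) to background interference in image classification. It proposes a method based on linear additivity in the embedding space that decomposes scene representations into foreground and background components to build background-invariant representations. Pre-trained with synthetic data, the method achieves what the authors believe is the first worst-group accuracy above 90% on the Waterbirds dataset under perfect spurious correlation, requires no real-world debiased data, and shows strong sim-to-real transfer, making it suitable for practical deployment.

Comments 36 pages, 7 figures

Details
English abstract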

Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding 90% on Waterbirds under perfect (100%) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.
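Worst-group accuracy, the metric quoted above, is the minimum per-group accuracy, where each group is typically a (class, background) combination. The sketch below shows the computation on made-up toy data; the function names and the sample values are illustrative assumptions and are unrelated to the paper's results.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def group_accuracies(labels: Sequence[int],
                     preds: Sequence[int],
                     groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Accuracy computed separately for each group identifier."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(labels: Sequence[int],
                         preds: Sequence[int],
                         groups: Sequence[Hashable]) -> float:
    """Minimum accuracy over all groups that appear in the data."""
    return min(group_accuracies(labels, preds, groups).values())


# Toy data: class 0 = landbird, class 1 = waterbird (hypothetical values).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
backgrounds = ["land", "land", "water", "water", "water", "water", "land", "land"]
preds = [0, 0, 0, 1, 1, 1, 1, 0]
groups = list(zip(labels, backgrounds))

average = sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)
print(average)                                      # overall accuracy
print(worst_group_accuracy(labels, preds, groups))  # accuracy of weakest group
```

In this toy data the overall accuracy hides the fact that the groups where class and background disagree are classified much less reliably, which is the failure mode the metric is designed to expose.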

2605.11061 2026-05-13 cs.CV cs.MM Updated version

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, Yingwei Pan, Yi Peng, Zhaofan Qiu, Kai Yu, Yiheng Zhang, Hao Ai, Siying Bai, Yang Chen, Zhihui Chen, Fengbin Gao, Ying Guo, Dong Li, Zhen Shen, Leilei Shi, Jing Wang, Siyu Wang, Yimeng Wang, Rui Zheng, Ting Yao, Tao Mei

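AI summary This paper presents HiDream-O1-Image, a natively unified image generation foundation model that uses a pixel-level diffusion transformer architecture to shift from modular structures to an end-to-end visual generation engine. The model maps raw image pixels, text tokens, and task conditions into a single shared token space without relying on a separate VAE or pre-trained text encoder, unifying multimodal inputs within a Unified Transformer (UiT) architecture. Experiments show that HiDream-O1-Image performs well across a range of generation tasks and that, with only 8B parameters, it matches models with more parameters; a version with over 200B parameters reportedly delivers further gains in generative capability and sets new state-of-the-art results.

Comments Source code and models are available at GitHub: https://github.com/HiDream-ai/HiDream-O1-Image and Hugging Face: https://huggingface.co/HiDream-ai/HiDream-O1-Image

Details
English abstract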

The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model built on a pixel-space Diffusion Transformer that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) matches or even surpasses established state-of-the-art models with significantly more parameters (e.g., 27B Qwen-Image). Crucially, to validate the scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version, HiDream-O1-Image-Pro (200B+), unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.

2605.11060 2026-05-13 eess.IV cs.CV Updated version

SplitFed-CL: A Split Federated Co-Learning Framework for Medical Image Segmentation with Inaccurate Labels

Zahra Hafezi Kafshgari, Hadi Hadizadeh, Parvaneh Saeedi

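Affiliations * School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada

AI summary SplitFed-CL is a split federated co-learning framework for medical image segmentation that addresses the performance degradation caused by inconsistent label quality across clients. A global teacher model guides local student models to identify and correct unreliable annotations, while consistency regularization and a trainable loss-weighting module improve robustness. The framework also introduces a difficulty-guided strategy that simulates the boundary errors human annotators tend to make. Experiments show that it outperforms existing state-of-the-art methods on multiple datasets, improving segmentation quality and robustness.

Details
English abstract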

Split Federated Learning (SplitFed) combines federated and split learning to preserve privacy while reducing client-side computation. However, in medical image segmentation, heterogeneous label quality across clients can significantly degrade performance. We propose SplitFed-CL, a co-learning framework where a global teacher guides local students to detect and refine unreliable annotations. Reliable labels supervise training directly, while unreliable labels are corrected via weighted student-teacher refinement. SplitFed-CL further incorporates consistency regularization for robustness to input perturbations and a trainable weighting module to balance loss terms adaptively. We also introduce a novel difficulty-guided strategy to simulate human-like, boundary-centric annotation errors, where the degree of perturbation is governed by shape complexity and the associated annotation difficulty. Experiments on two multiclass segmentation datasets with controlled synthetic noise, together with a binary segmentation dataset containing real-world annotation errors, demonstrate that SplitFed-CL consistently outperforms seven state-of-the-art baselines, yielding improved segmentation quality and robustness.

2605.11055 2026-05-13 cs.CV cs.LG Updated version

The first global agricultural field boundary map at 10m resolution

Caleb Robinson, Gedeon Muhawenayo, Subash Khanal, Zhanpei Fang, Isaac Corley, Ana M. Tárano, Lyndon Estes, Jennifer Marcus, Nathan Jacobs, Hannah Kerner, Inbal Becker-Reshef, Juan M. Lavista Ferres

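Affiliations * Microsoft AI for Good Research Lab; Arizona State University; Washington University in St. Louis; Oregon State University; Clark University; Taylor Geospatial

AI summary This paper presents the first global map of agricultural field boundaries at 10 m resolution, covering 241 countries and territories for 2024 and 2025 and containing 3.17 billion remote-sensing field polygons. The map was produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 mosaics, and its accuracy was validated against ground-truth data from multiple countries. The dataset is released openly in three forms and provides the first globally consistent field-level unit of analysis for crop monitoring, food security, and related agricultural research.

Details
English abstract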

The agricultural field is the natural unit at which crops are planted, managed, regulated, and reported, yet most global remote-sensing products for agriculture are only available at the pixel level. While some high-quality field-level data products exist, they come from parcel registries covering only parts of Europe or from ML-derived products for individual countries. No openly available, globally consistent map of agricultural field boundaries exists to date. Here we present the first global field boundary dataset at 10 m resolution for the years 2024 and 2025, comprising 3.17 billion remote-sensing field polygons (1.62 B in 2024 and 1.55 B in 2025) across 241 countries and territories, produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 mosaics. Validated against ground-truth field boundaries in 24 countries, the map achieved a mean pixel-level recall of 0.85 with 14 countries exceeding 0.90. Evaluation against full-country ground-truth datasets in Austria, Latvia, and Finland yielded F1 scores of 0.89, 0.88, and 0.74, respectively. Because reference data for global validation is inherently incomplete, we accompanied the map with a 500 m confidence layer that identifies regions where predictions are reliable. We release the dataset openly as three global maps: the confidence-thresholded default field boundary dataset, the full unfiltered dataset, and the continuous-valued confidence raster. These maps provide the first globally consistent field-level unit of analysis for crop monitoring, food security, and downstream agricultural science.
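For readers unfamiliar with the validation metrics quoted above, the sketch below computes pixel-level precision, recall, and F1 from a reference mask and a predicted mask. It is a generic illustration on a tiny made-up raster. The abstract does not state whether its F1 scores were computed per pixel or per field object, so treating F1 as pixel-level here is an assumption, as is the function name `pixel_scores`.

```python
from typing import Dict, List


def pixel_scores(reference: List[List[int]],
                 predicted: List[List[int]]) -> Dict[str, float]:
    """Pixel-level precision, recall and F1 for two binary masks (1 = field)."""
    tp = fp = fn = 0
    for ref_row, pred_row in zip(reference, predicted):
        for r, p in zip(ref_row, pred_row):
            if r and p:
                tp += 1   # field pixel correctly predicted
            elif p and not r:
                fp += 1   # predicted field where the reference has none
            elif r and not p:
                fn += 1   # reference field pixel that was missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Tiny made-up 2 x 3 raster.
reference = [[1, 1, 0],
             [1, 0, 0]]
predicted = [[1, 0, 0],
             [1, 1, 0]]
scores = pixel_scores(reference, predicted)
print(scores)
```

Recall alone can be evaluated where only some fields are labeled, because it ignores false positives; this is one plausible reason a recall figure is reported for the 24-country validation while F1 is reserved for the full-country datasets.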

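2605.10995 2026-05-13 eess.IV cs.CV cs.GR cs.MM version update

Streaming of rendered content with adaptive frame rate and resolution

Yaru Liu, Joseph G. March, Rafal K. Mantiuk

Affiliations * University of Cambridge

AI summary This paper studies how to improve the quality of streamed rendered content under limited bandwidth by adaptively adjusting frame rate and resolution. The authors propose a lightweight neural network that predicts the optimal combination of frame rate and resolution from the transmission bandwidth, scene content, and motion velocity, improving perceived quality while reducing rendering cost. The approach is codec-agnostic and requires only minimal changes to existing rendering infrastructure.

Details
English abstract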

Streaming rendered content is an attractive way to bring high-quality graphics to billions of mobile devices that do not have sufficient rendering power. Existing solutions render content on a server at a fixed frame rate, typically 30 or 60 frames per second, and reduce resolution when bandwidth is restricted. However, this strategy leads to suboptimal rendering quality under the bandwidth constraints. In this work, we exploit the spatio-temporal limits of the human visual system to improve perceived quality while reducing rendering costs by adaptively adjusting both frame rate and resolution based on scene content and motion. Our approach is codec-agnostic and requires only minimal modifications to existing rendering infrastructure. We propose a system in which a lightweight neural network predicts the optimal combination of frame rate and resolution for a given transmission bandwidth, content, and motion velocity. This prediction significantly enhances perceptual quality while minimizing computational cost under bandwidth constraints. The network is trained on a large dataset of rendered content labeled with a perceptual video quality metric. The dataset and further information can be found at the project web page: https://www.cl.cam.ac.uk/research/rainbow/projects/adaptive_streaming/.

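2605.10984 2026-05-13 cs.CV version update

Principle-Guided Supervision for Interpretable Uncertainty in Medical Image Segmentation

An Sui, Yuzhu Li, Gunter Schumann, Fuping Wu, Xiahai Zhuang

Affiliations * School of Data Science, Fudan University; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University; National Heart and Lung Institute, Imperial College London

AI summary This paper addresses interpretable uncertainty quantification in medical image segmentation, aiming to make a model's uncertainty estimates behave in a human-understandable way. The authors formulate three perception-aligned principles that require the spatial distribution of uncertainty to reflect image contrast between structures, the severity of image corruption, and the geometric complexity of anatomical structures. Based on these principles, they design a principle-guided uncertainty supervision framework (PriUS) built on evidential learning, which explicitly enforces the corresponding supervision objectives during training, and they introduce quantitative metrics that measure the consistency between predicted uncertainty and the image attributes that induce ambiguity. Experiments on ACDC, ISIC, and WHS show that PriUS produces more consistent uncertainty estimates while maintaining competitive segmentation performance.

Comments 14 pages, 8 figures

Details
English abstract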

Uncertainty quantification complements model predictions by characterizing their reliability, which is essential for high-stakes decision making such as medical image segmentation. However, most existing methods reduce uncertainty to a scalar confidence estimate, leaving its spatial distribution semantically underconstrained. In this work, we focus on uncertainty interpretability, namely, whether estimated uncertainty behaves in a human-understandable manner with respect to sources of ambiguity. We identify three perception-aligned principles requiring the spatial distribution of uncertainty to reflect: (1) image contrast between structures, (2) severity of image corruption, and (3) geometric complexity in anatomical structures. Accordingly, we develop a principle-guided uncertainty supervision framework (PriUS) based on evidential learning, in which the corresponding supervision objectives are explicitly enforced during training. We further introduce quantitative metrics to measure the consistency between predicted uncertainty and image attributes that induce ambiguity. Experiments on ACDC, ISIC, and WHS datasets showed that, compared with state-of-the-art methods, PriUS produced more consistent uncertainty estimates while maintaining competitive segmentation performance.

2605.10953 2026-05-13 physics.geo-ph cs.CV cs.LG version update

Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising

Jiahua Zhao, Umair bin Waheed, Jing Sun, Yang Cui, Nikos Savva, Eric Verschuur

Affiliations * Computation-based Science and Technology Research Center, The Cyprus Institute; Department of Geosciences, College of Petroleum Engineering and Geosciences, King Fahd University of Petroleum & Minerals; Department of Intelligent Systems, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology; Department of Mathematics and Statistics, Faculty of Pure and Applied Sciences, University of Cyprus; Department of Geoscience and Engineering, Faculty of Civil Engineering and Geosciences, Delft University of Technology

AI summary This paper studies how to efficiently adapt pre-trained vision foundation models (VFMs) to denoising active and passive seismic data. It combines parameter-efficient fine-tuning via Low-Rank Adaptation (LoRA) with a kurtosis-guided unsupervised test-time adaptation module, so that the model can adapt to site-specific noise without labeled data. Experiments on public seismic datasets show that the framework matches or outperforms domain-specific models, illustrating the potential of pre-trained VFMs for complex denoising tasks in exploration seismology.

Comments 34 pages, 8 figures, 6 tables. Submitted to Geophysics for publication consideration

Details
English abstract

The demand for high-resolution subsurface imaging and continuous Earth monitoring has driven rapid growth in active and passive seismic data from dense geophone deployments, distributed acoustic sensing (DAS) arrays, and large-scale 2D and 3D surveys. This expansion makes complex noise suppression increasingly challenging, especially when signal fidelity must be preserved. Conventional supervised deep learning methods are often task-specific, require large paired datasets, and can suffer from domain shift under new acquisition conditions. Foundation models offer a promising alternative, but pre-training seismic foundation models from scratch requires massive domain-specific data and substantial computation. We propose an efficient framework that repurposes general-purpose Vision Foundation Models (VFMs) for geophysical tasks through Parameter-Efficient Fine-Tuning. The architecture uses a pre-trained VFM, a DINOv3 encoder, adapted with Low-Rank Adaptation (LoRA) to enable effective feature adaptation with few additional parameters. To improve robustness under unseen field conditions without ground truth, we introduce a kurtosis-guided unsupervised test-time adaptation module that updates only LoRA parameters during inference. This module self-calibrates the model to site-specific noise by identifying information-rich regions via kurtosis and performing self-training without labeled data. Experiments on public exploration seismic images and DAS vertical seismic profiling data from the Utah FORGE site show that the framework matches or outperforms domain-specific models. Tests on unseen cross-site data from a land survey in China and the Groß Schönebeck geothermal site in Germany further demonstrate strong generalization and effective signal-noise separation. These results highlight the potential of adapting pre-trained VFMs to data-intensive problems in exploration seismology.
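
The idea of identifying information-rich regions via kurtosis can be illustrated with a small sketch. This is a generic example under stated assumptions, not the paper's module: it uses the plain (non-excess) sample kurtosis m4 / m2^2 and simply ranks patches by it, whereas the abstract does not give the actual selection rule. The names `kurtosis` and `rank_patches` and the toy patches are hypothetical.

```python
def kurtosis(values):
    """Plain sample kurtosis m4 / m2**2 (a Gaussian gives about 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) if m2 > 0 else 0.0


def rank_patches(patches):
    """Return patch indices sorted from highest to lowest kurtosis."""
    return sorted(range(len(patches)), key=lambda i: kurtosis(patches[i]), reverse=True)


# Hypothetical 1-D patches: evenly spread oscillation vs. sparse, spiky arrivals.
oscillation = [1, -1, 1, -1, 1, -1, 1, -1]
spiky = [0, 0, 0, 8, 0, 0, 0, -8]
print(rank_patches([oscillation, spiky]))
```

Sparse, impulsive events concentrate energy in a few samples and therefore score higher than evenly spread oscillations, which is the intuition for using kurtosis as a proxy for signal-bearing regions.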

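2605.10949 2026-05-13 stat.AP cs.AI cs.CV cs.LG version update

AlphaEarth Satellite Embeddings for Modelling Climate Sensitive Diseases Towards Global Health Resilience

Usman Nazir, I-Han Cheng, Sara Khalid

Affiliations * Planetary Health Informatics (PHI) Lab, University of Oxford

AI summary This study examines whether satellite remote-sensing representations (AlphaEarth embeddings) can predict climate-sensitive health outcomes in support of global health resilience. Focusing on malaria, childhood acute respiratory infection, and stunting, it evaluates 64-dimensional satellite embeddings as predictors across different countries and regions. The embeddings add predictive value for malaria and respiratory infection, while stunting prediction is neutral at country level because of collinearity with fixed effects and needs DHS cluster-level coordinates to go further. The work offers empirical evidence on using remote sensing to complement public health surveillance.

Comments Visualising Climate 2026

Details
English abstract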

Malaria, childhood acute respiratory infection, and child undernutrition together account for over two million deaths annually in children under five, with the burden concentrated in low and middle-income countries where climate variability modulates transmission, exposure, and nutritional outcomes. Routine health surveillance in these settings remains sparse and reactive. Satellite-derived representations of the Earth's surface offer a scalable, low-cost complement to traditional covariates, yet their utility as predictors of population health outcomes is poorly characterised. We summarise findings from three studies evaluating AlphaEarth Foundations 64-dimensional satellite embeddings as predictors of population health outcomes, focusing on vulnerable populations. The studies span infectious disease (malaria, respiratory infection) and stunting. In each study, embeddings provide predictive value at sufficient spatial granularity: (i) malaria prediction across Nigeria shows consistent per-region R^2 gains; (ii) childhood acute respiratory infection prediction across 11 DHS countries increases pooled R^2 from 0.157 to 0.206 across three tree-based estimators; (iii) stunting prediction across 35 countries is neutral at country level due to collinearity with fixed effects. The stunting case is currently limited by lack of DHS cluster-level coordinates, which is the next key experiment.

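2605.10865 2026-05-13 cs.AI cs.CV cs.SE version update

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Haozhe Zhang, Kaichen Liu, Miaomiao Chen, Lei Li, Shaojie Yang, Cheng Peng, Hanjie Chen

Affiliations * University of Virginia; University of California, San Diego; Rice University

AI summary BenchCAD is a unified benchmark for industrial CAD code generation that evaluates whether models can produce executable parametric CAD programs from visual or textual inputs. It contains 17,900 execution-verified CadQuery programs across 106 industrial part families and assesses perception, parametric abstraction, and program synthesis through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing. Experiments show that current frontier models often recover coarse outer geometry but fall well short of faithful parametric CAD programs, for example by missing fine 3D structure and misreading engineering parameters, which points to where industrial CAD automation needs to improve.

Comments 9 pages, 7 figures

Details
English abstract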

Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.

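2605.10780 2026-05-13 cs.CV cs.AI version update

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

Xuanyu Zhu, Yan Bai, Yang Shi, Yihang Lou, Yuanxing Zhang, Jing Jin, Yuan Zhou

Affiliations * Peking University; Meituan Inc; Tsinghua University; IGDL

AI summary This work proposes DRoRAE, a multi-layer representation fusion method for visual tokenizers built on frozen pretrained vision encoders. Unlike existing methods that use only last-layer features, DRoRAE aggregates features from all encoder layers via energy-constrained routing and incremental correction, recovering detail that is attenuated by successive layers of semantic abstraction. Experiments show improved image reconstruction and generation, and reveal a predictable log-linear relationship between fusion capacity and reconstruction quality that can inform the design of visual tokenizers.

Details
English abstract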

Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law (R^2 = 0.86) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.

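2605.09904 2026-05-13 cs.CV version update

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

Junzhe Chen, Siyuan Meng, Yuxi Chen, Man Zhao, Wenyao Gui, Xiaojie Guo

Affiliations * Tianjin University

AI summary TOC-Bench is a diagnostic benchmark for evaluating temporal object consistency in video large language models (Video-LLMs). It is grounded in object tracks and structured temporal event timelines, and tests whether models preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. The study finds that, even when models perform well on general video understanding, they show notable weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, indicating that temporal object consistency is a key bottleneck for current Video-LLMs.

Details
English abstract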

Video large language models (Video-LLMs) have made strong progress in general video understanding, but their ability to maintain temporal object consistency remains underexplored. Existing benchmarks often emphasize event recognition, action understanding, or coarse temporal reasoning, while rarely testing whether models can preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. We introduce TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in Video-LLMs. TOC-Bench is object-track grounded: each queried subject is linked to a per-frame trajectory and a structured temporal event timeline. To ensure that questions require temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we design a three-layer temporal-necessity filtering protocol, which removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items across 10 diagnostic dimensions. From this pool, we construct a human-verified benchmark with 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge, with notable weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, even when models perform well on general video understanding benchmarks. These results suggest that object-centric temporal coherence is a key bottleneck for current Video-LLMs, and that TOC-Bench provides a focused platform for diagnosing and improving object-aware temporal reasoning. The resource is available at https://github.com/cjzcjz666/toc_bench.git.

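2605.09598 2026-05-13 cs.CV version update

SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy

Ismael Elsharkawi, Ahmed Sait, Silvio Giancola, Bernard Ghanem, Hossam Sharara, Abdelrahman Eldesokey

Affiliations * Department of Computer Science and Engineering, The American University in Cairo; Image And Visual Understanding Lab (IVUL), KAUST

AI summary This paper introduces SoccerLens, a benchmark for evaluating visual grounding in soccer video understanding, motivated by the concern that existing models may rely on spurious correlations rather than meaningful visual evidence. The benchmark contains annotated video segments covering 13 common soccer events, with visual cues organized into three levels of semantic relevance. The authors also extend an attention attribution method and introduce metrics that measure whether model attention aligns with the annotated cues. Results show that state-of-the-art soccer vision-language models achieve limited grounding performance, revealing a substantial gap between predictive accuracy and true visual grounding.

Comments Preprint

Details
English abstract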

Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning 13 common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed 50% grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.

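2605.09430 2026-05-13 cs.CV version update

FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

Junkang Zhou, Yefei He, Feng Chen, Weijie Wang, Bohan Zhuang

Affiliations * Zhejiang University; University of Adelaide

AI summary This paper proposes FlashAR, a lightweight post-training acceleration framework for speeding up inference in autoregressive image generation models. It adds a vertical prediction head that works alongside the original horizontal head, enabling highly parallel generation based on two-way next-token prediction while changing the original training objective as little as possible. Experiments show that FlashAR adapts pre-trained models with only a small amount of training data and achieves up to a 22.9x speedup for 512x512 image generation.

Comments Post-training acceleration for autoregressive image generation, code is available at https://lxazjk.github.io/FlashAR/

Details
English abstract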

Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strict next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before being jointly fine-tuned with the backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through lightweight post-training with merely 0.05% of the original training data.

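2605.09003 2026-05-13 cs.CV version update

FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

Yixin Tang, Jiawei Guo, Junxian Li, Zhiteng Li, Jixin Zhao, Bingya Zhang, Chenbo Wang, Yulun Zhang, Shangchen Zhou

Affiliations * Shanghai Jiao Tong University; Nanyang Technological University; Honor Device Co., Ltd

AI summary This paper proposes FlashClear, an efficient image content removal method that addresses the computational cost of diffusion-based object removal. By introducing Region-aware Adversarial Distillation (RAD) and the training-free Foreground-Prioritized Asymmetric Attention and Caching (FPAC) strategy, it yields a few-step model that removes content with high quality and much faster inference. Experiments on the OBER benchmark show speedups of up to 8.26x over ObjectClear and 122x over OmniPaint while maintaining visual quality.

Comments Code: https://github.com/GuoCalix/FlashClear

Details
English abstract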

Recently, diffusion-based object removal models have achieved impressive results in eliminating objects and their associated visual effects. However, they indiscriminately denoise all tokens across all timesteps, ignoring that removal usually involves small foreground regions. This strategy introduces substantial computational overhead and prolonged inference times. To overcome this computational burden, we propose a latent discriminator to implement Region-aware Adversarial Distillation (RAD), yielding a highly efficient few-step model named FlashClear. Furthermore, tailored to few-step diffusion models, we propose FPAC (Foreground-Prioritized Asymmetric Attention and Caching), a training-free acceleration strategy. Extensive experiments demonstrate that our framework provides massive acceleration while maintaining or exceeding the performance of our base model, ObjectClear. Notably, on the OBER benchmark, our FlashClear achieves up to 8.26x and 122x speedup over ObjectClear and OmniPaint, respectively, while maintaining high visual quality and fidelity.

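2605.08806 2026-05-13 cs.CV version update

L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation

Zehua Wang, Changwang Mei, Huaijiang Sun, Pengqi Hu, Zhaoyang Yin

Affiliations * Nanjing University of Science and Technology; Lenovo

AI summary This paper proposes L2A, a framework that improves the accuracy of 3D human pose estimation by making effective use of historical pose information. Observing that existing methods reuse features poorly across layers, the authors adopt a spatial-temporal parallel Transformer backbone to keep a consistent representation space, and introduce a History Pose Accumulation (HPA) mechanism and a Layer Pose History Aggregation (LPA) module that adaptively integrate features from multiple layers, reducing redundancy and stabilizing aggregation. Experiments show state-of-the-art performance on benchmarks.

Comments 15 pages

Details
English abstract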

Existing 2D-to-3D lifting methods for human pose estimation have achieved strong performance, but they overlook the use of historical pose representations across network depth. In current pipelines, information is propagated through fixed residual connections, which restricts effective reuse of early-layer features such as fine-grained spatial structures and short-term motion cues. However, naively incorporating historical features across layers is non-trivial: we identify that maintaining a consistent representation space across layers is a prerequisite for effective cross-layer feature aggregation. To address this issue, we propose a history-aware framework that enables effective use of cross-layer history features. Specifically, we adopt a spatial-temporal parallel Transformer backbone to prevent alternating spatial-temporal transformations during sequential processing, thereby maintaining a consistent representation space. Building upon this, we introduce a History Pose Accumulation (HPA) mechanism that adaptively aggregates features from all preceding layers to enhance current representations. Furthermore, we propose a Layer Pose History Aggregation (LPA) module that transforms layer pose features into a compact and structured form, reducing redundancy and enabling more stable aggregation. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on benchmarks.

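2605.08802 2026-05-13 cs.CV version update

CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

Ziyang Ding, Linjian Meng, Yiming Wu, Yuhan Li, Yuhao Liu, Zhen Zhao

Affiliations * Shandong University; Shanghai AI Laboratory; Nanjing University; The University of Hong Kong

AI summary CoLVR uses contrastive optimization to make latent visual reasoning more exploratory, addressing the way existing models' reliance on hard alignment objectives limits the flexibility of latent reasoning. It introduces a latent contrastive training framework guided by angle-based perturbation to learn more diverse, exploratory representations, and applies a latent trajectory contrastive reward during reinforcement learning post-training to further optimize the latent reasoning process. Experiments show that CoLVR improves the exploratory capability of latent representations on several benchmarks and also performs well on out-of-domain tasks.

Details
English abstract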

Because Latent Visual Reasoning offers potential for exploratory reasoning, recent works enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory capacity of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process, thus fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, code, and models are released at https://github.com/Oscar-dzy/CoLVR.

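2605.08328 2026-05-13 cs.LG cs.CV version update

P-Flow: Proxy-gradient Flows for Linear Inverse Problems

Zehua Jiang, Fenghao Zhu, Xinquan Wang, Chongwen Huang, Zhaoyang Zhang

Affiliations * Zhejiang University; University of Notre Dame

AI summary This paper proposes P-Flow, a framework for solving linear inverse problems that updates the source point with a proxy gradient, avoiding the numerical instability and computational overhead caused by long-chain differentiation in existing approaches. Motivated by the concentration of measure phenomenon in high-dimensional spaces, it applies a Gaussian spherical projection to stay consistent with the prior distribution, and it provides a theoretical analysis based on Bayesian theory and Lipschitz continuity. Experiments show that P-Flow delivers competitive performance across diverse restoration tasks, especially under extreme degradations.

Details
English abstract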

Generative models based on flow matching have emerged as a powerful paradigm for inverse problems, offering straighter trajectories and faster sampling compared to diffusion models. However, existing approaches often necessitate differentiating through unrolled paths, leading to numerical instability and prohibitive computational overhead. To address this, we propose P-Flow, a framework that stabilizes the reconstruction process by leveraging a proxy gradient to update the source point. This approach effectively circumvents the numerical instability and memory overhead of long-chain differentiation. To ensure consistency with the prior distribution, we employ a Gaussian spherical projection motivated by the concentration of measure phenomenon in high-dimensional spaces. We further provide a theoretical analysis for P-Flow based on Bayesian theory and Lipschitz continuity. Experiments across diverse restoration tasks demonstrate that P-Flow delivers competitive performance, especially under extreme degradations such as severely ill-posed conditions and high measurement noise.
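
The concentration-of-measure argument behind a spherical projection can be sketched numerically. A standard Gaussian vector in d dimensions has norm close to sqrt(d), so one plausible form of the projection, assumed here and not confirmed by the abstract, rescales an updated source point back to radius sqrt(d). The function names `l2_norm` and `project_to_gaussian_sphere` are hypothetical.

```python
import math
import random


def l2_norm(z):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in z))


def project_to_gaussian_sphere(z):
    """Rescale z to norm sqrt(d), the typical norm of a d-dim standard Gaussian."""
    norm = l2_norm(z)
    if norm == 0.0:
        raise ValueError("cannot project the zero vector")
    scale = math.sqrt(len(z)) / norm
    return [v * scale for v in z]


rng = random.Random(0)
d = 4096
z = [rng.gauss(0.0, 1.0) for _ in range(d)]
ratio = l2_norm(z) / math.sqrt(d)  # expected to be close to 1 in high dimension

# Simulate a source point that drifted off the typical shell after an update.
drifted = [3.0 * v for v in z]
projected = project_to_gaussian_sphere(drifted)
print(f"norm/sqrt(d) for a Gaussian draw: {ratio:.3f}")
print(f"norm after projection: {l2_norm(projected):.3f} (sqrt(d) = {math.sqrt(d):.3f})")
```

The projection changes only the magnitude of the source point, not its direction, so a gradient-style update can still steer the sample while it stays on the shell where the Gaussian prior places almost all of its mass.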

2605.07552 2026-05-13 cs.CV version update

VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network

Zepeng Yang, Junxuan Bai, Hao Li, Ju Dai, Junjun Pan, Yongfeng Yin, Bin Li

Affiliations * Beihang University; Peng Cheng Laboratory; Capital University of Physical Education and Sports; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

AI summary This paper proposes VIMCAN, a hybrid architecture for visual-inertial 3D human pose estimation. It combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, addressing the quadratic complexity that makes Transformers impractical for real-time processing of long sequences. Experiments show that VIMCAN is more accurate than prior methods on TotalCapture and 3DPW and supports real-time inference at over 60 frames per second on consumer-grade hardware.

Comments Accepted in CVPR 2026

Details
English abstract

The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code is available at https://github.com/Eddieyzp/VIMCAN.
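
For readers unfamiliar with the metric, MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it for a single frame, assuming coordinates in millimetres and no root alignment; evaluation protocols vary between benchmarks and the abstract does not specify one. The function name `mpjpe` and the toy joints are hypothetical.

```python
import math


def mpjpe(pred_joints, true_joints):
    """Mean per-joint position error for one frame of 3D joints."""
    if len(pred_joints) != len(true_joints) or not pred_joints:
        raise ValueError("need the same non-zero number of joints in both poses")
    # math.dist gives the Euclidean distance between two points (Python 3.8+).
    distances = [math.dist(p, t) for p, t in zip(pred_joints, true_joints)]
    return sum(distances) / len(distances)


# Hypothetical two-joint pose in millimetres.
predicted = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(f"MPJPE = {mpjpe(predicted, reference):.1f} mm")
```

A benchmark score such as the 17.2 mm quoted above would be this quantity averaged over all frames and sequences of the test set.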

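2605.06440 2026-05-13 cs.LG cs.CV version update

Hyperbolic Concept Bottleneck Models

Daniel Uyterlinde, Swasti Shreya Mishra, Pascal Mettes

Affiliations * Informatics Institute, University of Amsterdam

AI summary This paper proposes Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework for interpretable neural networks. Whereas existing models embed concepts in flat Euclidean space, HypCBM respects the semantic hierarchy among concepts and uses the geometry of hyperbolic space, expressing concept activation as asymmetric geometric containment so that hierarchical relations between concepts are captured more naturally. The method yields sparse, hierarchy-aware activations without additional supervision or learned modules, and shows stronger hierarchical consistency and improved robustness to input corruptions while remaining human-interpretable.

Comments 24 pages, 14 figures

Details
English abstract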

Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept's entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20x more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.

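2605.05922 2026-05-13 cs.CV version update

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wan, Kuien Liu, Xiang Wang

Affiliations * University of Science and Technology of China; Kling Team, Kuaishou Technology; Institute of Software, Chinese Academy of Sciences

AI summary This paper proposes DeScore, a video reward model designed to remove the optimization bottleneck that arises when reasoning and scoring are coupled in a single autoregressive chain. Its core idea is to decouple the two: a multimodal large language model first generates an explicit reasoning chain, and a dedicated scoring module then predicts the final reward. The approach is designed to retain interpretability and generalization while improving training stability and efficiency.

Details
English abstract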

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.

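2604.21052 2026-05-13 cs.CV cs.AI updated version

StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu, Yang Xu, Hanyu Xing

Affiliations * Duke University; University of Southern California; Xidian University

AI summary StyleVAR is a controllable image style transfer method built on the Visual Autoregressive Modeling (VAR) framework. Images are decomposed into multi-scale representations and encoded as discrete codes, and a transformer performs conditional discrete sequence modeling to fuse style and content controllably. The method introduces a blended cross-attention mechanism and a scale-dependent blending coefficient to combine style and content information while preserving autoregressive continuity. Experiments show that StyleVAR outperforms an AdaIN baseline on several benchmarks, with strong perceptual similarity and structure preservation, especially for landscapes and architectural scenes.

Details
Abstract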

We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content--style--target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.
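
The reinforcement fine-tuning stage relies on GRPO, whose core step is to score each sampled output relative to the other samples drawn for the same input. A minimal sketch of that group-relative normalization follows; the reward values are made up, the function name `group_relative_advantages` is hypothetical, and the abstract's per-action normalization weighting across VAR scales is not modeled here.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical perceptual rewards for four stylized samples of one input pair.
rewards = [0.62, 0.70, 0.55, 0.81]
advantages = group_relative_advantages(rewards)
```

Samples that beat the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.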

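2604.12923 2026-05-13 cs.CV updated version

Pi-HOC: Pairwise 3D Human-Object Contact Estimation

Sravan Chittupalli, Ayush Jain, Dong Huang

Affiliations * Carnegie Mellon University, Robotics Institute; National Robotics Engineering Center

AI summary This paper proposes Pi-HOC, a single-pass, instance-aware framework for predicting dense 3D semantic contact for all human-object pairs in an image. The method detects instances, creates a dedicated token for each human-object pair, refines the tokens with an InteractionFormer, and then uses a SAM-based decoder to predict dense contact on SMPL human meshes. Experiments show that Pi-HOC significantly improves contact estimation accuracy and localization on multiple datasets while running 20x more efficiently at inference. Predicted contacts also improve image-to-mesh 3D reconstruction through a test-time optimization algorithm and support referential contact prediction from language queries.

Details
Abstract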

Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. The current state of the art leverages a powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly at inference time. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.

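2604.03701 2026-05-13 cs.CV updated version

VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning

Shaoyang Cui, Lingbei Meng

Affiliations * Department of Psychological and Cognitive Sciences, Tsinghua University; Shenzhen Loop Area Institute

AI summary VidNum-1.4K is a comprehensive benchmark for evaluating numerical reasoning in video. It contains 1,379 human-annotated video question-answer pairs covering diverse, complex scenes and tests vision-language models' understanding of temporal events, object permanence, and compositional logic. The benchmark has a three-level structure that progresses from direct visual perception to multi-step numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions. Experiments show that current state-of-the-art models still fall well short on this task, underscoring both its difficulty and the limits of existing models.

Comments 7 pages, 5 figures, under review at ACMMM 2026 Dataset Track

Details
Abstract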

Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. VidNum-1.4K is structured as a three-level hierarchy that progresses from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%-45% range. These findings demonstrate that current VLMs still lack a stable "internal world model", positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.

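2603.24577 2026-05-13 cs.CV cs.AI updated version

EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Falong Fan, Yi Xie, Arnis Lektauers, Bo Liu, Jerzy Rozenblit

AI summary This paper proposes EndoVGGT, a framework for improving the accuracy of 3D reconstruction of deformable soft tissue in surgical scenes. The method introduces a Deformation-aware Graph Attention (DeGAT) module that dynamically constructs feature-space semantic graphs to capture long-range correlations among tissue regions, so that structural cues propagate more effectively across occlusions and reconstruction becomes more robust and consistent. Experiments show that EndoVGGT significantly improves reconstruction quality on the SCARED dataset and generalizes well to unseen datasets.

Comments We withdraw this submission due to significant errors in the presentation and logical structure of the paper. We found that the current version does not accurately convey the research findings and requires a major overhaul of the manuscript's methodology description and results analysis

Details
Abstract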

Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.

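2603.10281 2026-05-13 cs.LG cs.AI cs.CV updated version

Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework

Rajesh Shrestha, Xiao Fu

Affiliations * School of EECS

AI summary This paper studies how to integrate score-based denoisers effectively into the ADMM optimization algorithm for solving inverse problems. To address two core challenges, the mismatch between the training data manifold and the geometry of ADMM iterates and the lack of convergence guarantees, it proposes a new ADMM-PnP framework with the AC-DC denoiser, which has three stages: auto-correction, directional correction, and score-based denoising. Theoretical analysis shows that, under proper parameters, the framework is weakly nonexpansive, which ensures fixed-point ball convergence, and that convergence also holds with an adaptive step size under more relaxed conditions. Experiments show that the method outperforms existing baselines on a range of inverse problems.

Details
Abstract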

While score-based generative models have emerged as powerful priors for solving inverse problems, directly integrating them into optimization algorithms such as ADMM remains nontrivial. Two central challenges arise: i) the mismatch between the noisy data manifolds used to train the score functions and the geometry of ADMM iterates, especially due to the influence of dual variables, and ii) the lack of convergence understanding when ADMM is equipped with score-based denoisers. To address the manifold mismatch issue, we propose ADMM plug-and-play (ADMM-PnP) with the AC-DC denoiser, a new framework that embeds a three-stage denoiser into ADMM: (1) auto-correction (AC) via additive Gaussian noise, (2) directional correction (DC) using conditional Langevin dynamics, and (3) score-based denoising. In terms of convergence, we establish two results: first, under proper denoiser parameters, each ADMM iteration is a weakly nonexpansive operator, ensuring high-probability fixed-point ball convergence using a constant step size; second, under more relaxed conditions, the AC-DC denoiser is a bounded denoiser, which leads to convergence under an adaptive step size schedule. Experiments on a range of inverse problems demonstrate that our method consistently improves solution quality over a variety of baselines.
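
The plug-and-play structure is easier to see in code. The sketch below runs a generic ADMM-PnP loop on a toy denoising problem with an identity forward operator; the function names are hypothetical, and a soft-thresholding operator stands in for the denoiser, so this is not the AC-DC denoiser and carries none of the paper's convergence analysis.

```python
def soft_threshold(v, t):
    """Stand-in denoiser: proximal operator of t * |v| (an assumption, not AC-DC)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def admm_pnp(y, denoiser, rho=1.0, iters=100):
    """Generic ADMM plug-and-play loop for the data term 0.5 * ||x - y||^2."""
    n = len(y)
    x, z, u = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(iters):
        # x-update: closed-form minimizer of 0.5*(x-y)^2 + (rho/2)*(x-z+u)^2
        x = [(yi + rho * (zi - ui)) / (1.0 + rho) for yi, zi, ui in zip(y, z, u)]
        # z-update: the prior's proximal step is replaced by a plug-in denoiser
        z = [denoiser(xi + ui) for xi, ui in zip(x, u)]
        # dual update: accumulate the residual between x and z
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z

lam, rho = 1.0, 1.0
y = [3.0, 0.5, -2.0]
z = admm_pnp(y, lambda v: soft_threshold(v, lam / rho), rho=rho)
```

With this particular stand-in the loop solves an L1-regularized least-squares problem whose solution is the soft-thresholded input, so `z` approaches `[2.0, 0.0, -1.0]`. Swapping in a learned denoiser changes only the z-update, which is where the manifold mismatch described in the abstract would arise.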

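2602.22507 2026-05-13 cs.LG cs.CV updated version

Space Syntax-guided Post-training for Residential Floor Plan Generation

Zhuoyang Jiang, Dongqing Zhang

Affiliations * College of Architecture and Urban Planning, Tongji University; Information Hub, The Hong Kong University of Science and Technology (Guangzhou)

AI summary This paper studies how to improve spatial configurational logic in residential floor plan generation. It proposes SSPT, a space-syntax-guided post-training framework that introduces the Space Syntax Integration Oracle (SSIO) to assess the configurational quality of generated floor plans and uses that assessment as a feedback signal for model optimization. The framework includes two strategies, SSPT-Iter, based on iterative retraining, and SSPT-PPO, based on reinforcement learning, and comes with a new evaluation benchmark, SSPT-Bench. Experiments show that the method improves public-space dominance and functional-hierarchy consistency in generated floor plans, with SSPT-PPO delivering stronger gains and higher efficiency.

Details
Abstract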

Residential floor plan generation requires not only geometric fidelity but also spatial configurational logic: shared living spaces should be integrative, while private spaces should remain segregated. Existing generators increasingly use room-relation graphs as input-side conditions, but generated layouts are rarely evaluated on the output side for configurational quality, and such evaluation is rarely fed back into model optimization. We propose Space Syntax-guided Post-training (SSPT), a framework that turns space-syntax integration from a post-hoc analysis tool into a computable feedback signal for already-trained floor plan generators. SSPT introduces the Space Syntax Integration Oracle (SSIO), which converts generated layouts into rectangle-space graphs and measures public-space dominance and functional hierarchy. SSIO is first applied to real residential data to establish empirical configurational references, then connected to two SSPT strategies: SSPT-Iter, a basic generate-filter-retrain route, and SSPT-PPO, the first RL-based post-training route for floor plan generation. We also introduce SSPT-Bench, a new evaluation system for measuring the output-side spatial configurational quality of post-trained generators under an out-of-distribution setting. Experiments show that both strategies improve public-space dominance and functional-hierarchy alignment over the unpost-trained baseline. SSPT-PPO achieves stronger gains, lower variance, and higher efficiency than iterative retraining. These results show that output-side configurational evaluation can serve as actionable post-training feedback, offering a practical path for injecting architectural theory into existing floor plan generation backbones.
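
Space-syntax integration is built on how shallow a space is relative to all others in the layout graph. The sketch below computes mean depth (MD) and the classical relative asymmetry, RA = 2(MD - 1)/(k - 2), on a toy room-adjacency graph with k nodes; a lower RA means a more integrated space. The room names and graph are invented, and this is the textbook measure rather than the SSIO implementation, which the abstract says operates on rectangle-space graphs.

```python
from collections import deque

def mean_depth(graph, root):
    """Average shortest-path depth from root to every other node (BFS)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    others = [d for n, d in depth.items() if n != root]
    return sum(others) / len(others)

def relative_asymmetry(graph, root):
    """Classical space-syntax RA = 2 * (MD - 1) / (k - 2); lower is more integrated."""
    k = len(graph)  # assumes a connected graph with more than two nodes
    return 2.0 * (mean_depth(graph, root) - 1.0) / (k - 2)

# Hypothetical layout: the living room connects every other room.
layout = {
    "living": ["kitchen", "bedroom", "bath"],
    "kitchen": ["living"],
    "bedroom": ["living"],
    "bath": ["living"],
}
ra = {room: relative_asymmetry(layout, room) for room in layout}
```

In this toy layout the shared living room gets the lowest RA and the private rooms get higher values, which is the kind of public-space dominance the abstract describes.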

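2602.14199 2026-05-13 eess.IV cs.CV eess.SP updated version

Learnable Multi-level Discrete Wavelet Transforms for 3D Gaussian Splatting Frequency Modulation

Hung Nguyen, An Le, Truong Nguyen

Affiliations * GitHub

AI summary 3D Gaussian Splatting (3DGS) is a powerful method for novel view synthesis, but the number of Gaussian primitives often grows substantially during training, raising memory and storage costs. This paper proposes a frequency modulation framework based on the multi-level Discrete Wavelet Transform (DWT). By recursively decomposing the low-frequency subband, it builds a deeper curriculum that progressively reduces the Gaussian count, and it performs the modulation with a single scaling parameter, avoiding the more complex filter learning of earlier approaches. Experiments show that the method reduces Gaussian counts on standard datasets while maintaining high-quality rendering.

Comments Accepted to EUSIPCO 2026

Details
Abstract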

3D Gaussian Splatting (3DGS) has emerged as a powerful approach for novel view synthesis. However, the number of Gaussian primitives often grows substantially during training as finer scene details are reconstructed, leading to increased memory and storage costs. Recent coarse-to-fine strategies regulate Gaussian growth by modulating the frequency content of the ground-truth images. In particular, AutoOpti3DGS employs the learnable Discrete Wavelet Transform (DWT) to enable data-adaptive frequency modulation. Nevertheless, its modulation depth is limited by the 1-level DWT, and jointly optimizing wavelet regularization with 3D reconstruction introduces gradient competition that promotes excessive Gaussian densification. In this paper, we propose a multi-level DWT-based frequency modulation framework for 3DGS. By recursively decomposing the low-frequency subband, we construct a deeper curriculum that provides progressively coarser supervision during early training, consistently reducing Gaussian counts. Furthermore, we show that the modulation can be performed using only a single scaling parameter, rather than learning the full 2-tap high-pass filter. Experimental results on standard benchmarks demonstrate that our method further reduces Gaussian counts while maintaining competitive rendering quality.
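
The recursive structure of a multi-level DWT can be shown with a 1D Haar transform. The sketch below repeatedly splits only the low-frequency (approximation) subband, so each extra level yields a coarser signal; the Haar filters, the 1D setting, and the function names are assumptions for illustration and do not reproduce the paper's learnable 2D transform or its scaling parameter.

```python
import math

def haar_step(signal):
    """One DWT level: return (low-pass approximation, high-pass detail)."""
    s = 1.0 / math.sqrt(2.0)
    low = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    high = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return low, high

def multilevel_dwt(signal, levels):
    """Recursively decompose the low-frequency subband `levels` times."""
    details = []
    low = list(signal)
    for _ in range(levels):
        low, high = haar_step(low)  # only the approximation is split again
        details.append(high)
    return low, details

# Length must stay even at every level (8 -> 4 -> 2 -> 1 here).
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0]
coarse, details = multilevel_dwt(signal, levels=3)
```

A 1-level transform stops after the first split; going deeper gives a sequence of progressively coarser approximations that could serve as early-training supervision in a coarse-to-fine curriculum.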

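2602.13267 2026-05-13 cs.CV cs.RO eess.IV updated version

SOAR: Regression-based LiDAR Relocalization for UAVs

Hengyu Mu, Jianshi Wu, Yuxin Guo, XianLian Lin, Qingyong Hu, Sheng Ao, Chenglu Wen, Cheng Wang

Affiliations * Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Department of Computer Science at the University of Oxford

AI summary This paper proposes SOAR, a regression-based LiDAR relocalization framework for UAVs that targets high-precision positioning in GNSS-denied environments. To handle the large pose variations and irregular flight paths of UAV scenarios, SOAR introduces a locality-preserving sliding window attention module with locally invariant positional encoding to improve robustness to viewpoint changes, and a coordinate-independent feature initialization module to reduce sensitivity to global transformations. The authors also build a large-scale UAV LiDAR localization dataset with 4 scenes and 13 irregular paths, providing a more realistic benchmark for UAV relocalization research. Experiments show that SOAR reaches state-of-the-art localization success rate and error.

Comments 24 pages, 14 figures

Details
Abstract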

Regression-based LiDAR relocalization has recently emerged as a promising solution for high-precision positioning in GNSS-denied environments. However, these methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in unmanned aerial vehicle (UAV) scenarios due to arbitrary pose variations and irregular flight paths. In this paper, we propose SOAR, a regression-based LiDAR relocalization framework for UAVs. Specifically, we introduce a locality-preserving sliding window attention module with locally invariant positional encoding to capture discriminative geometric structures robust to viewpoint changes. A coordinate-independent feature initialization module is further designed to eliminate sensitivity to global transformations. Furthermore, most existing UAV datasets are poorly suited to evaluating LiDAR relocalization in the real world, because they lack synchronized LiDAR scans, accurate 6-DoF poses, or multiple traversals. Thus, we construct a large-scale UAV LiDAR localization dataset with 4 scenes and 13 irregular paths exhibiting rotation and altitude variations, providing a more realistic benchmark for UAVs. Extensive experiments demonstrate that our method achieves state-of-the-art performance, improving the localization success rate by 40% and reducing mean error by over 10 m on UAVLoc. Our code and dataset will be released soon.

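2602.07668 2026-05-13 cs.CV cs.AI cs.LG cs.RO updated version

Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Ross Greer, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Giovanni Tapia Lopez, Angel Martinez-Sanchez, Parthib Roy, Jake Rattigan, Mira Sur, Alejandra Vidrio, Thomas Marcotte, Mohan Trivedi

Affiliations * Machine Intelligence, Interaction, and Imagination (Mi3) Laboratory; Laboratory for Intelligent and Safe Automobiles (LISA); Johns Hopkins University; Center for Medicinal Cannabis Research (CMCR)

AI summary This study proposes L-LIO, a multimodal framework that fuses visual and audio information to improve driver state assessment and environment understanding in intelligent vehicles. Adding audio signals strengthens perception of the state of the driver, passengers, and people outside the vehicle, providing more complete information for scenarios such as airbag deployment and takeover time prediction in autonomous driving. Experiments show that audio provides key safety-relevant information in complex or context-rich scenarios, opening new intervention paths for intelligent vehicle decision systems.

Details
Abstract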

The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.

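2602.02408 2026-05-13 cs.CV cs.AI updated version

ReasonEdit: Editing Vision-Language Models using Human Reasoning

Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen

Affiliations * University of Virginia; University of California, Berkeley; Stanford University

AI summary ReasonEdit is a new method for editing vision-language models (VLMs). It aims to correct model errors without disturbing other behaviors, and it targets visual question answering tasks that require humans and models to reason. The method lets users provide reasoning explanations during editing and uses a network-science-inspired multimodal embedding technique to retrieve relevant facts at inference time, improving edit quality. Experiments show that ReasonEdit achieves state-of-the-art editing performance on multiple datasets, confirming that human reasoning substantially improves edit generalization.

Details
Abstract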

Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.

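2602.01418 2026-05-13 cs.CV cs.LG updated version

Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis

Affiliations * Technical University of Denmark; KTH Royal Institute of Technology

AI summary This paper proposes PaPE, a parabola-based position encoding for attention-based architectures on vision modalities. Designed from the characteristics of vision, it combines the principles of translation invariance, rotation invariance, distance decay, directionality, and context awareness to encode positions more accurately in images, videos, point clouds, and other visual data. Experiments show that PaPE extrapolates well on ImageNet-1K and demonstrates broad applicability and strong performance on datasets across several modalities.

Details
Abstract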

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens, such as those from videos, event camera streams, images, or point clouds, our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but with only partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best encoding. Generality experiments on 8 datasets across 4 modalities show that PaPE is a general vision position encoding, as PaPE matches the best baseline on 5 datasets and exceeds all on 2 datasets. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.

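2601.22301 2026-05-13 cs.CV updated version

Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes

Gonzalo Gomez-Nogales, Yicong Hong, Chongjian Ge, Peiye Zhuang, Marc Comino-Trinidad, Dan Casas, Yi Zhou

Affiliations * Universidad Rey Juan Carlos Móstoles, Spain; Adobe Research San Jose, USA; Roblox San Mateo, USA

AI summary Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial compute to produce realistic images, yet they still struggle with scalability and realism for scenes with many dynamic people. This paper proposes C2R (Coarse-to-Real), a generative rendering framework that produces real-style urban crowd videos from coarse 3D simulations. Coarse 3D renderings give explicit control over scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. The method uses a two-stage synthetic-real domain alignment strategy: it first learns a generative prior from large-scale real video and then adds controllability with a small amount of paired synthetic data. This yields coarse-to-fine control, works across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input.

Comments Project website at https://gonzalognogales.github.io/coarse2real/

Details
Abstract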

Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-stage synthetic-real domain-hedging strategy that first learns a strong generative prior from large-scale real footage, and then introduces controllability by using a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at https://gonzalognogales.github.io/coarse2real/.

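2512.12165 2026-05-13 cs.CV updated version

Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

Daniel Adebi, Sagnik Majumder, Kristen Grauman

Affiliations * The University of Texas at Austin

AI summary This paper studies audio-visual camera pose estimation using passive scene sounds and in-the-wild video, addressing the difficulty of estimating camera motion under visually degraded conditions. The authors propose a simple, effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art visual pose estimation model, significantly improving accuracy and robustness. Experiments on two large-scale datasets show clear gains over vision-only methods, especially when visual information is corrupted, and point to audio as a new aid for camera pose estimation in real-world scenes.

Details
Abstract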

Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.

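2512.11883 2026-05-13 cs.CY cs.AI cs.CV updated version

Position: Universal Aesthetic Alignment Narrows Artistic Expression

Wenqi Marshall Guo, Qingyun Qian, Khalad Hasan, Shan Du

Affiliations * Department of CMPS, University of British Columbia, Kelowna, Canada

AI summary This paper examines the problems caused by over-aligning image generation models to universal aesthetic standards, arguing that such alignment can conflict with users' requests for "anti-aesthetic" outputs for artistic or critical purposes. By building a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models, the study finds that aesthetic-aligned models tend to produce conventionally "beautiful" images and struggle to follow instructions for low-quality or negative imagery, and that reward models penalize anti-aesthetic images even when users explicitly ask for them. The study confirms this systemic bias and provides code, fine-tuned models, and datasets for further research.

Details
Abstract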

Over-aligning image generation models to a generalized aesthetic preference conflicts with user intent, particularly when "anti-aesthetic" outputs are requested for artistic or critical purposes. This adherence prioritizes developer-centered values, compromising user autonomy and aesthetic pluralism. We test this bias by constructing a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models. This position paper finds that aesthetic-aligned generation models frequently default to conventionally beautiful outputs, failing to respect instructions for low-quality or negative imagery. Crucially, reward models penalize anti-aesthetic images even when they perfectly match the explicit user prompt. We confirm this systemic bias through image-to-image editing and evaluation against real abstract artworks. Our code, fine-tuned models, and datasets are available on our meta-expression intentionally anti-aesthetics webpage: https://weathon.github.io/icml2026_position/.

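2512.11321 2026-05-13 cs.CV updated version

KeyframeFace: Language-Driven Facial Animation via Semantic Keyframes

Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang

Affiliations * Westlake University; Nanjing University; Zhejiang University; Hunan University

AI summary This paper proposes KeyframeFace, a language-driven facial animation method that uses semantic keyframes for precise control of facial expressions. Unlike existing methods that generate continuous frames directly from text, it borrows the keyframe concept from animation production: animation is represented as semantic keyframes in an interpretable ARKit control space, and a large language model generates keyframes aligned with text descriptions and emotion cues. Experiments show that the method outperforms traditional approaches in expression fidelity and semantic consistency while providing a clearer semantic control structure.

Details
Abstract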

Facial animation is a core component of digital character creation in the computer graphics (CG) industry. A typical production workflow relies on sparse, semantically meaningful keyframes to precisely control facial expressions. Enabling such animation directly from natural-language descriptions could significantly improve content creation efficiency and accessibility. However, most existing methods adopt a text-to-continuous-frames paradigm, directly regressing dense facial motion trajectories from language. This formulation entangles high-level semantic intent with low-level motion, lacks explicit semantic control structure, and limits precise editing and interpretability. Inspired by the keyframe paradigm in animation production, we propose KeyframeFace, a framework for semantic facial animation from language via interpretable keyframes. Instead of predicting dense motion trajectories, our method represents animation as a sequence of semantically meaningful keyframes in an interpretable ARKit-based facial control space. A language-driven model leverages large language model (LLM) priors to generate keyframes that align with contextual text descriptions and emotion cues. To support this formulation, we construct a multimodal dataset comprising 2,100 expression scripts paired with monocular videos, per-frame ARKit coefficients, and manually annotated semantic keyframes. Experiments show that incorporating semantic keyframe supervision and language priors significantly improves expression fidelity and semantic alignment compared to methods that do not use facial action semantics.

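2512.05683 2026-05-13 cs.CV physics.optics updated version

Physics-Informed Graph Neural Networks for Frequency-Aware Optical Aberration Correction

Yong En Kok, Bowen Deng, Alexander Bentley, Andrew J. Parkes, Michael G. Somekh, Amanda J. Wright, Michael P. Pound

Affiliations * School of Computer Science, University of Nottingham; Photonics Group, Department of Electrical and Electronic Engineering, University of Nottingham; Research Center for Humanoid Sensing, Zhejiang Laboratory

AI summary This paper proposes ZRNet, a physics-informed graph neural network for frequency-aware optical aberration correction. The method combines Zernike polynomial coefficient prediction with optical image restoration, introducing a Zernike Graph module and a Frequency-Aware Alignment loss to model the physical relationships between polynomials explicitly and to improve the consistency of image and coefficient predictions in the frequency domain. Experiments show that ZRNet achieves state-of-the-art aberration correction and image restoration across multiple microscopy modalities and complex biological samples, and validation on data from a real optical system confirms its robustness and generalization.

Details
Abstract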

Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees, ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrate that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. We further validate on experimental PSF data from a physical microscope and demonstrate robustness to realistic sensor noise, confirming generalisation beyond simulated conditions. Code is available at https://github.com/janetkok/ZRNet.
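
Grouping Zernike modes by azimuthal degree can be sketched with the standard OSA/ANSI single-index convention, j = (n(n + 2) + m)/2. The code below converts indices to (radial degree n, azimuthal degree m) and links modes that share the same m; the choice of indexing convention, the same-m edge rule, and the function names are assumptions for illustration, not the paper's Zernike Graph construction.

```python
from collections import defaultdict

def osa_to_nm(j):
    """Map an OSA/ANSI Zernike index j to (radial degree n, azimuthal degree m)."""
    n = 0
    while (n + 1) * (n + 2) // 2 <= j:  # (n+1)(n+2)/2 modes have degree <= n
        n += 1
    m = 2 * j - n * (n + 2)
    return n, m

def edges_by_azimuthal_degree(num_modes):
    """Connect every pair of modes that share the same azimuthal degree m."""
    groups = defaultdict(list)
    for j in range(num_modes):
        _, m = osa_to_nm(j)
        groups[m].append(j)
    edges = []
    for members in groups.values():
        edges += [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
    return sorted(edges)

edges = edges_by_azimuthal_degree(15)  # modes up to radial degree 4
```

Under this rule defocus (j = 4) and primary spherical aberration (j = 12) are neighbors because both have m = 0, while a mode with a unique azimuthal degree in the chosen range stays isolated.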

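2511.22475 2026-05-13 cs.LG cs.CV updated version

Adversarial Flow Models

Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan

Affiliations * ByteDance Seed

AI summary This paper proposes adversarial flow models, a class of generative models that combines the strengths of adversarial learning and flow models, supports one-step or multi-step generation, and is trained with an adversarial objective. Unlike traditional GANs, the generator is encouraged to learn a deterministic noise-to-data mapping, which significantly stabilizes training. Unlike consistency-based methods, it learns one-step or few-step generation directly without learning the intermediate timesteps of the probability flow, which avoids error accumulation and preserves model capacity. Experiments show that the model achieves better generation quality than existing methods on ImageNet-256px.

Comments ICML 2026

Details
Abstract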

We present adversarial flow models, a class of generative models that belongs to both the adversarial and flow families. Our method supports native one-step and multi-step generation and is trained with an adversarial objective. Unlike traditional GANs, in which the generator learns an arbitrary transport map between the noise and data distributions, our generator is encouraged to learn a deterministic noise-to-data mapping. This significantly stabilizes adversarial training. Unlike consistency-based methods, our model directly learns one-step or few-step generation without having to learn the intermediate timesteps of the probability flow for propagation. This preserves model capacity and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model achieves a new best FID of 2.38. We additionally demonstrate end-to-end training of 56-layer and 112-layer models without any intermediate supervision, achieving FIDs of 2.08 and 1.94 with a single forward pass and surpassing the corresponding 28-layer 2NFE and 4NFE counterparts with equal compute and parameters. The code is available at https://github.com/ByteDance-Seed/Adversarial-Flow-Models

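2511.16520 2026-05-13 cs.LG cs.CV eess.IV eess.SP Updated version

Saving Foundation Flow-Matching Priors for Inverse Problems

Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun

Affiliations * Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, USA

AI summary This paper proposes FMPlug, a plug-in framework that aims to improve how foundation flow-matching models perform on inverse problems. The method combines an instance-guided, time-dependent warm-start strategy with sharp Gaussianity regularization, strengthening problem-specific guidance while preserving the Gaussian structure. Experiments show that FMPlug performs well on both image restoration and scientific inverse problems with few samples, offering an effective route to making foundation flow-matching models practical in these settings.

Comments Accepted by ICML 2026

Details
Abstract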

Foundation flow-matching (FM) models promise universal priors for solving inverse problems (IPs); yet today, they trail behind domain-specific and even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. For evaluation, we consider both simple image restoration tasks and scientific IPs with only a few similar samples, where the prohibitive cost of data collection and model training hinders the development of domain-specific generative models. Our superior experimental results confirm the effectiveness of FMPlug. Overall, FMPlug paves the way for making foundation FM models practical, reusable priors for IPs, especially scientific ones with few similar samples. More details are available at https://sun-umn.github.io/xm-plug/.

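2511.12034 2026-05-13 cs.CV cs.LG cs.MM Updated version

Calibrated Multimodal Representation Learning with Missing Modalities

Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-Kiong Ng, Tat-Seng Chua

Affiliations * National University of Singapore; University of Science and Technology of China; The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology (Shenzhen); Central South University

AI summary Multimodal representation learning aims to align information from different modalities into a unified latent space, but existing methods usually require all modalities to be present and struggle with data that has missing modalities. From an anchor shift perspective, this paper explains theoretically how missing modalities bias the alignment, and proposes CalMRL, which uses the priors and inherent connections among modalities to impute missing modalities and calibrate the alignment at the representation level. Experiments show that the method mitigates anchor shift and improves performance on data with missing modalities.

Comments Accepted by ICML 2026

Details
Abstract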

Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By pairing the calibrated alignment with an existing advanced method, we offer new flexibility to absorb data with missing modalities, which was previously unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.

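2510.05408 2026-05-13 cs.CV cs.AI Updated version

See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models

Kebin Contreras, Luis Toscano-Palomino, Mauro Dalla Mura, Jorge Bacca

Affiliations * Physics School, Universidad Industrial de Santander, Colombia; Department of Computer Science, Universidad Industrial de Santander, Colombia; GIPSA-Lab, Université Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France; Institut Universitaire de France (IUF), France

AI summary This study proposes a time-reversed reconstruction method based on thermal imaging and visual language models, aiming to recover the scene state from a few seconds earlier using current thermal traces. The method couples visual language models with a constrained diffusion process, generating scene descriptions and guiding image reconstruction to ensure semantic and structural consistency. Experiments show that, in controlled settings, the method can reconstruct plausible scene frames from up to 120 seconds earlier, providing a first implementation of time-reversed imaging from thermal traces.

Details
Abstract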

Recovering the past from present observations is an intriguing challenge with potential applications in forensics and scene analysis. Thermal imaging, operating in the infrared range, provides access to otherwise invisible information. Since humans are typically warmer (37 °C / 98.6 °F) than their surroundings, interactions such as sitting, touching, or leaning leave residual heat traces. These fading imprints serve as passive temporal codes, allowing for the inference of recent events that exceed the capabilities of RGB cameras. This work proposes a time-reversed reconstruction framework that uses paired RGB and thermal images to recover scene states from a few seconds earlier. The proposed approach couples Visual-Language Models (VLMs) with a constrained diffusion process, where one VLM generates scene descriptions and another guides image reconstruction, ensuring semantic and structural consistency. The method is evaluated in three controlled scenarios, demonstrating the feasibility of reconstructing plausible past frames up to 120 seconds earlier, providing a first step toward time-reversed imaging from thermal traces.

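2510.02043 2026-05-13 cs.CV cs.HC cs.LG Updated version

Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers

Sahil Bhandary Karnoor, Romit Roy Choudhury

Affiliations * University of Illinois at Urbana-Champaign

AI summary This paper studies zero-shot human pose estimation with a limited number of sensors. The authors formulate pose estimation as an inverse problem and propose a diffusion-based inverse solver that conditions generation on rotational measurements alone while guiding it with a likelihood term derived from location measurements. The method requires no per-user fine-tuning and achieves zero-shot generalization across users, offering a new approach to pose estimation with few sensors.

Comments Published as a Conference Paper at The Fourteenth International Conference on Learning Representations

Details
Abstract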

Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both location and rotation measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.

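2509.19207 2026-05-13 cs.CV Updated version

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva

Affiliations * Department of Computer Science, University of Copenhagen; MBZUAI

AI summary This paper studies the challenges contrastive vision-language models (VLMs) face in understanding long, compositional captions, and analyzes the relationship between compositional reasoning and long-caption understanding. Through controlled experiments across different training objectives, datasets, and architectural designs, it finds a bidirectional but sensitive relationship between the two: high-quality long-caption data with strong visual grounding helps improve both capabilities at once, while certain architectural designs may limit compositional learning. The study offers practical guidance on data selection and model design for improving VLM generalization.

Comments To be published in Findings of ACL 2026

Details
Abstract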

Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.

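2508.05269 2026-05-13 cs.CV Updated version

B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding

Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim

Affiliations * Korea Advanced Institute of Science and Technology

AI summary This study proposes B4DL, a new benchmark for training and evaluating multimodal large language models (MLLMs) on spatio-temporal understanding of 4D LiDAR. To address the underuse of 4D LiDAR data in MLLMs, the study designs a scalable data generation pipeline and proposes the first MLLM that directly processes raw 4D LiDAR data and bridges it with language understanding, providing a unified solution for spatio-temporal reasoning in dynamic outdoor environments.

Comments Accepted at ACM MM 2025

Details
Abstract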

Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite its potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, the generated dataset, and inference outputs on diverse scenarios at: https://github.com/ccho4702/B4DL

2506.17501 2026-05-13 eess.IV cs.CV Updated version

DSA-NRP: No-Reflow Prediction from Angiographic Perfusion Dynamics in Stroke EVT

Shreeram Athreya, Carlos Olivares, Ameera Ismail, Kambiz Nael, William Speier, Corey Arnold

Affiliations * Department of Electrical and Computer Engineering, UCLA; Medical Informatics, UCLA; Department of Radiological Sciences, UCLA; Department of Radiology, UC San Francisco; Department of Bioengineering, UCLA; Department of Pathology and Laboratory Medicine, UCLA

AI summary This study proposes a machine learning framework based on intra-procedural digital subtraction angiography (DSA) dynamics to predict the no-reflow complication immediately after endovascular thrombectomy for acute ischemic stroke. By analyzing perfusion features from intra-procedural DSA sequences together with clinical variables, the method significantly outperforms a baseline that relies on clinical features alone, with clear gains in both accuracy and AUC. The work offers a real-time, accurate no-reflow prediction tool that can support timely intervention for high-risk patients and improve treatment outcomes.

Comments 15 pages, 4 figures

Details
Abstract

Following successful large-vessel recanalization via endovascular thrombectomy (EVT) for acute ischemic stroke (AIS), some patients experience a complication known as no-reflow, defined by persistent microvascular hypoperfusion that undermines tissue recovery and worsens clinical outcomes. Although prompt identification is crucial, standard clinical practice relies on perfusion magnetic resonance imaging (MRI) within 24 hours post-procedure, delaying intervention. In this work, we introduce the first-ever machine learning (ML) framework to predict no-reflow immediately after EVT by leveraging previously unexplored intra-procedural digital subtraction angiography (DSA) sequences and clinical variables. Our retrospective analysis included AIS patients treated at UCLA Medical Center (2011-2024) who achieved favorable mTICI scores (2b-3) and underwent pre- and post-procedure MRI. No-reflow was defined as persistent hypoperfusion (Tmax > 6 s) on post-procedural imaging. From DSA sequences (AP and lateral views), we extracted statistical and temporal perfusion features from the target downstream territory to train ML classifiers for predicting no-reflow. Our novel method significantly outperformed a clinical-features baseline (AUC: 0.7703 ± 0.12 vs. 0.5728 ± 0.12; accuracy: 0.8125 ± 0.10 vs. 0.6331 ± 0.09), demonstrating that real-time DSA perfusion dynamics encode critical insights into microvascular integrity. This approach establishes a foundation for immediate, accurate no-reflow prediction, enabling clinicians to proactively manage high-risk patients without reliance on delayed imaging.
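
To make the AUC figures above easier to interpret, here is a minimal sketch of the metric itself: the probability that a randomly chosen positive case (here, no-reflow) receives a higher score than a randomly chosen negative one, with ties counted as half. The function name and the toy labels and scores are assumptions for illustration only; they are not data or code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values (1 = positive class).
    scores: iterable of classifier scores, higher meaning "more positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to chance-level ranking, which is why a baseline near 0.57 is only slightly better than random ordering.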

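2503.23947 2026-05-13 cs.CV Updated version

Spectral-Adaptive Modulation Networks for Visual Perception

Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Paul Hongsuck Seo, Dong Hwan Kim

Affiliations * Korea University (KU); Korea Institute of Science and Technology (KIST); Dong-A University

AI summary This paper studies how 2D convolution and self-attention differ in their spectral properties, and uses graph spectral analysis to explain their frequency-response behavior theoretically. Building on this analysis, the authors propose a spectral-adaptive modulation (SPAM) mixer that processes visual features adaptively using multi-scale convolutional kernels and a spectral re-scaling mechanism. Based on SPAM, they build a new vision backbone, SPANetV2, which outperforms existing state-of-the-art models across multiple vision tasks.

Comments Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Details
Abstract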

Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective in high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a spectral-adaptive modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.

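2502.20209 2026-05-13 cs.CV cs.AI Updated version

DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild

Luis Marquez-Carpintero, Sergio Suescun-Ferrandiz, Carolina Lorenzo Álvarez, Jorge Fernandez-Herrero, Diego Viejo, Rosabel Roig-Vila, Miguel Cazorla

Affiliations * Institute for Computer Research; University of Alicante

AI summary This paper presents DIPSER, a new dataset for assessing student attention in real classroom settings. The dataset contains multi-angle RGB camera data and smartwatch sensor data, capturing students' posture, facial expressions, and physiological indicators, and provides attention and emotion labels generated from student self-reports and evaluations by four experts. It combines facial and environmental camera data with smart wearable metrics and covers ethnic groups that are rarely represented in earlier datasets, making it the most comprehensive dataset currently available for analyzing student attention and emotion in face-to-face classroom teaching.

Details
Abstract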

In this paper, a novel dataset is introduced, designed to assess student attention within in-person classroom settings. This dataset encompasses RGB camera data, featuring multiple cameras per student to capture both posture and facial expressions, in addition to smartwatch sensor data for each individual. This dataset allows machine learning algorithms to be trained to predict attention and correlate it with emotion. A comprehensive suite of attention and emotion labels for each student is provided, generated through self-reporting as well as evaluations by four different experts. Our dataset uniquely combines facial and environmental camera data with smartwatch metrics and includes ethnicities underrepresented in similar datasets, all within in-the-wild, in-person settings, making it the most comprehensive dataset of its kind currently available. The dataset presented offers an extensive and diverse collection of data pertaining to student interactions across different educational contexts, augmented with additional metadata from other tools. This initiative addresses existing deficiencies by offering a valuable resource for the analysis of student attention and emotion in face-to-face lessons.

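2502.19716 2026-05-13 cs.CV cs.LG Updated version

Fully AI-Generated Image Detection: Definition, Recent Advances and Challenges

Qijie Xu, Can Wang, Jiawei Chen, Siwei Lyu, Defang Chen

Affiliations * Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

AI summary This paper surveys research progress on fully AI-generated image detection, discussing the field's core problems, detection methods, and challenges. It focuses on two key components, dataset construction and artifact extraction, and systematically organizes how existing methods are categorized and how they differ in using priors to extract generation artifacts. It also points out the limitations of current detection techniques and outlines future research directions for improving detector robustness and generalization.

Details
Abstract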

Recent advances in visual generative models have enabled the creation of highly realistic, fully AI-generated images without relying on real source content. While beneficial for many applications, these models also pose significant societal risks, as they can be easily exploited to produce convincing Deepfakes. Detecting them represents a foundational yet challenging problem in AI media forensics, requiring detectors to reliably extract the inherent artifacts imprinted by generative architectures. In this Review, we provide a systematic overview of fully AI-generated image detection. Following the standard detector design pipeline, we focus on two key components: dataset construction and artifact extraction. We analyze how dataset design influences the generalization and robustness of learned artifacts, and categorize existing artifact extraction methods based on the primary inductive priors leveraged to isolate artifacts. Within this framework, we systematically review existing works. Finally, we highlight open problems and envision several future directions for developing more robust and generalizable detectors. Reviewed works in this survey can be found at https://github.com/zju-pi/Awesome-Fully-AI-Generated-Image-Detection.

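2501.03717 2026-05-13 cs.CV cs.AI cs.GR Updated version

Materialist: Physically Based Editing Using Single-Image Inverse Rendering

Lezhong Wang, Duc Minh Tran, Ruiqi Cui, Thomson TG, Anders Bjorholm Dahl, Siavash Arjomand Bigdeli, Jeppe Revall Frisvad, Manmohan Chandraker

Affiliations * Technical University of Denmark; University of California San Diego

AI summary This paper proposes Materialist, a physically based editing method built on single-image inverse rendering, aiming to address the lack of physical consistency in image editing. The method combines neural networks with physically based rendering: a neural network predicts initial material properties, which are then optimized through progressive differentiable rendering, enabling high-quality material editing, relighting, and object insertion. It can edit material transparency without full scene geometry and performs well on environment map estimation; experiments show strong performance on both synthetic and real-world datasets.

Comments More Comprehensive IJCV Camera-Ready Version. Project website: https://lez-s.github.io/materialist_project/

Journal ref International Journal of Computer Vision (IJCV), 134(6), 267 (2026)

Details
Abstract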

Achieving physically consistent image editing remains a significant challenge in computer vision. Existing image editing methods typically rely on neural networks, which struggle to accurately handle shadows and refractions. Conversely, physics-based inverse rendering often requires multi-view optimization, limiting its practicality in single-image scenarios. In this paper, we propose Materialist, a neural-initialized physically based rendering pipeline for single-image inverse rendering. Unlike previous hybrid methods that use physics to guide neural generation, our method leverages neural networks to predict initial material properties, which are then rigorously optimized via progressive differentiable rendering. Our approach enables a range of applications, including material editing, object insertion, and relighting, while also introducing an effective method for editing material transparency via ray-traced refraction without requiring full scene geometry. Our envmap estimation method also achieves competitive performance, further enhancing the accuracy of image editing tasks. Experiments demonstrate strong performance across synthetic and real-world datasets, excelling even on challenging out-of-domain images.

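2411.16769 2026-05-13 cs.LG cs.CL cs.CR cs.CV Updated version

Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting

Zhi-Yi Chin, Pin-Yu Chen, Wei-Chen Chiu, Mario Fritz

Affiliations * CISPA Helmholtz Center for Information Security; IBM Research; National Yang Ming Chiao Tung University

AI summary This paper studies automated red-teaming of text-to-image models, that is, probing whether they can be led to generate harmful content, in order to assess their safety. To address the fact that existing methods depend on white-box access, generalize poorly, or produce uninterpretable adversarial examples, the authors propose the ICER framework, which uses LLM-based prompt rewriting and in-context experience replay to generate semantics-preserving natural-language adversarial prompts, and uses bandit optimization to balance exploiting and exploring attack strategies. Experiments show that ICER outperforms existing methods across multiple safety mechanisms and transfers to commercial systems such as DALL-E 3 and Midjourney.

Comments The source code is available at https://github.com/zhiyichin/ICER

Details
Abstract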

Understanding the capabilities of text-to-image (T2I) models in harmful content generation is essential to safety and compliance. However, human red-teaming is costly and inconsistent, driving the need for automatic tools that simulate realistic misuse attempts. Existing methods either require white-box access, fail to generalize across defenses, or produce uninterpretable adversarial tokens, while generating fluent prompts that preserve the original harmful intent remains underexplored despite its practical relevance. We propose ICER, a black-box framework that addresses this gap through two components: an LLM-based rewriter that produces fluent, natural-language adversarial prompts, and in-context experience replay that accumulates successful jailbreaking patterns into a reusable prior. These components are integrated via bandit optimization, enabling ICER to efficiently balance exploiting proven attack strategies with exploring new ones. Experiments across six safety mechanisms show that ICER outperforms seven baselines under both standard and semantics-preserving evaluation, with over 30% of generated prompts transferring to commercial systems like DALL-E 3 and Midjourney.

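2411.13311 2026-05-13 cs.CV cs.AI Updated version

A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data

Kavin Chandrasekaran, Sorin Grigorescu, Gijs Dubbelman, Pavol Jancura

Affiliations * ElektroBit Automotive GmbH; Eindhoven University of Technology; Transilvania University of Brasov

AI summary This study proposes an efficient fusion network for object detection in Bird's-Eye View (BEV) using camera and raw radar data. By using the radar's raw range-Doppler (RD) spectrum directly, it avoids complex radar signal processing; features are extracted from camera images through an image processing pipeline, and the camera and radar features are then fused to perform object detection. The method reduces computational complexity while maintaining detection accuracy, offering a more efficient and robust perception solution for autonomous driving systems.

Comments IEEE Intelligent Transportation Systems Conference (ITSC) 2024

Details
Abstract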

Cameras can be used to perceive the environment around the vehicle, while affordable radar sensors are popular in autonomous driving systems because, unlike cameras, they can withstand adverse weather conditions. However, radar point clouds are sparse, with low azimuth and elevation resolution, and lack semantic and structural information about the scene, resulting in generally lower radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of radar data, thus avoiding radar signal processing. We independently process camera images within the proposed comprehensive image processing pipeline. Specifically, we first transform the camera images to the Bird's-Eye View (BEV) Polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resultant feature maps are fused with Range-Azimuth (RA) features, which the radar decoder recovers from the RD spectrum input, to perform object detection. We evaluate our fusion strategy against other existing methods not only in terms of accuracy but also on computational complexity metrics on the RADIal dataset.

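2402.16860 2026-05-13 cs.CV cs.IR Updated version

Interactive Mars Image Content-Based Search with Interpretable Machine Learning

Bhavan Vasu, Steven Lu, Emily Dunkel, Kiri L. Wagstaff, Kevin Grimes, Michael McAuley

Affiliations * NASA Planetary Data System; PDS Imaging Node; PDS Cartography and Imaging Sciences Node; Wagstaff et al.

AI summary This paper studies how interpretable machine learning can enable interactive content-based search of Mars images to support scientific discovery and user interest. The authors propose a prototype-based classification architecture that lets users understand and validate the evidence a classifier relies on when processing images from the Curiosity rover. Beyond providing explanations for classifications, the work examines the diversity and correctness of the evidence used; it will be deployed on the NASA Planetary Data System Image Atlas, replacing the current non-interpretable system.

Comments Published at the Thirty-Sixth Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-24). Corrected citation: Proc. AAAI 38(21): 22976-22982 (2024)

Journal ref Proc AAAI Conference on Artificial Intelligence 2024

Details
Abstract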

The NASA Planetary Data System (PDS) hosts millions of images of planets, moons, and other bodies collected throughout many missions. The ever-expanding nature of data and user engagement demands an interpretable content classification system to support scientific discovery and individual curiosity. In this paper, we leverage a prototype-based architecture to enable users to understand and validate the evidence used by a classifier trained on images from the Mars Science Laboratory (MSL) Curiosity rover mission. In addition to providing explanations, we investigate the diversity and correctness of evidence used by the content-based classifier. The work presented in this paper will be deployed on the PDS Image Atlas, replacing its non-interpretable counterpart.

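2312.06950 2026-05-13 cs.CV cs.CL Updated version

READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling

Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Khoi Le, Zhiyuan Hu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

Affiliations * nguyentthong.github.io

AI summary For low-resource video-language modeling, this study proposes READ, a parameter-efficient fine-tuning method. It introduces a recurrent adapter (READ) with temporal modeling capability and a Partial Video-Language Alignment (PVLA) objective to capture temporal relations in video frames and text and to preserve critical task-related information. Experiments show that READ significantly outperforms existing fine-tuning strategies on multiple low-resource benchmarks, offering a new approach to parameter-efficient transfer learning for video-language models.

Comments Accepted at AAAI 2024

Details
Abstract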

Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such a full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter's low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose a Partial Video-Language Alignment (PVLA) objective that uses partial optimal transport to maintain task-related information flowing into our READ modules. We validate our READ framework through extensive experiments where READ significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks. The code, model, and data have been made available at https://nguyentthong.github.io/READ.

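2312.02549 2026-05-13 cs.CV cs.CL Updated version

DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

Affiliations * National University of Singapore; Nanyang Technological University; Carnegie Mellon University

AI summary This paper studies temporal language grounding, the task of finding the video moments that semantically correspond to a natural language query. To address the shortcomings of conventional attention in modeling the relations between video moments and text, the authors propose an energy-based model framework that explicitly learns moment-query distributions, and design DemaFormer, a new Transformer architecture that encodes its inputs more effectively using an exponential moving average with a learnable damping factor. Experiments show that the method outperforms existing state-of-the-art methods on four public datasets.

Comments Accepted at EMNLP 2023 (Findings). Code is available at https://github.com/nguyentthong/demaformer

Details
Abstract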

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes an exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
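
The abstract does not spell out the recurrence, so the sketch below shows one common damped-EMA form as an assumption: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}, where delta is the damping factor. In DemaFormer the damping factor is learnable; here `alpha` and `delta` are fixed scalars and the function name is hypothetical, chosen only to make the mechanism concrete.

```python
def damped_ema(xs, alpha, delta):
    """Damped exponential moving average over a 1-D sequence.

    Assumed recurrence (not taken from the paper):
        y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1},  y_{-1} = 0
    With delta = 1 this reduces to the standard EMA; smaller delta
    lets more of the previous state carry over.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < delta <= 1.0):
        raise ValueError("expected 0 < alpha < 1 and 0 < delta <= 1")
    ys = []
    prev = 0.0
    for x in xs:
        prev = alpha * x + (1.0 - alpha * delta) * prev
        ys.append(prev)
    return ys


# Same input, two damping settings: weaker damping keeps more history.
standard = damped_ema([1.0, 1.0], alpha=0.5, delta=1.0)
damped = damped_ema([1.0, 1.0], alpha=0.5, delta=0.5)
```

In a trained model the two scalars would be per-dimension parameters updated by gradient descent rather than constants.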

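2304.09479 2026-05-13 cs.CV cs.GR cs.LG Updated version

DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows

Puntawat Ponglertnapakorn, Nontawat Tritrong, Supasorn Suwajanakorn

Affiliations * School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology

AI summary This paper proposes DiFaReli++, a new single-view face relighting method that produces realistic lighting with temporally consistent shadows on in-the-wild images. The method does not need accurate intrinsic decomposition and is trained only on 2D images, avoiding any dependence on lighting-annotated data. By combining a conditional diffusion implicit model (DDIM) with conditioning on a rendered shading reference and a shadow map, it models the complex interaction between light and geometry efficiently; its single-shot variant outperforms the teacher model across metrics, and the method achieves state-of-the-art relighting results.

Comments Published in IEEE TPAMI (vol. 48, no. 5, May 2026). This is an extended version of the ICCV 2023 paper (DiFaReli)

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 5, pp. 5068-5082, May 2026

Details
Abstract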

We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies. Please visit our page: https://diffusion-face-relighting-pp.github.io