arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13835 2026-05-14 cs.CV Version update

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

Hao Sun, Zi-Jun Ding, Da-Wei Zhou

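Affiliations: School of Artificial Intelligence, Nanjing University; State Key Laboratory for Novel Software Technology, Nanjing University

AI summary: This paper studies CLIP-based class-incremental learning (CIL), which aims to let a model keep learning new classes without catastrophic forgetting. Existing methods focus mainly on aligning global image embeddings and overlook the rich patch-level semantic information in CLIP's encoders. The authors propose SPA, which generates class-wise semantic descriptions, uses them to guide the selection of discriminative patch-level visual features, and aligns the two modalities with optimal transport, making better use of local information to improve recognition. Task-specific projectors and a pseudo-feature sampling strategy are added to improve adaptability and stability.

Abstract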

Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by the remarkable generalization of CLIP, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current work primarily focuses on aligning global image embeddings (i.e., [CLS] token) with their corresponding text prompts (i.e., [EOS] token). Despite their good performance, we find that they discard the rich patch-level semantic information inherent in CLIP's encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on the above observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken long-neglected local representations within CLIP. Specifically, for each class, we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions are used to guide the selection of discriminative patch-level visual features. Building upon these selected patches, we further employ optimal transport to align selected patch tokens with semantic tokens from class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.
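
The optimal-transport alignment between selected patch tokens and description tokens can be illustrated with a small entropic (Sinkhorn) solver. This is a minimal sketch, not the authors' implementation: the uniform marginals, the cosine-distance cost, the `eps` value, and the random arrays standing in for CLIP features are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (plain Sinkhorn iterations)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # assumed: uniform mass over patch tokens
    b = np.full(m, 1.0 / m)  # assumed: uniform mass over text tokens
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    # Column marginals match b exactly after the last v-update;
    # row marginals match a only approximately.
    return u[:, None] * K * v[None, :]

def ot_alignment_score(patches, tokens, eps=0.1):
    """Transport-weighted cosine similarity between two token sets."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = p @ t.T
    plan = sinkhorn_plan(1.0 - sim, eps)  # cost = cosine distance
    return float((plan * sim).sum()), plan

# Random stand-ins for 6 selected patch tokens and 4 description tokens.
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 16))
tokens = rng.normal(size=(4, 16))
score, plan = ot_alignment_score(patches, tokens)
```

In a classifier built this way, one such score per class could serve as a logit alongside the usual global [CLS]/[EOS] similarity; how SPA actually combines the two is not stated in the abstract.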

2605.13833 2026-05-14 cs.LG cs.CV Version update

QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

Hoang-Quan Nguyen, Sankalp Pandey, Khoa Luu

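Affiliations: Quantum AI Lab, Dept. of EECS, University of Arkansas

AI summary: This paper proposes QLAM, a quantum long-attention memory approach to long-sequence token modeling. It combines the superposition property of quantum computing with the linear-time efficiency of state-space models (SSMs), representing the hidden state as a quantum state to strengthen the global representation of historical information. Experiments show that QLAM outperforms conventional recurrent models and Transformer-based models on several sequential image classification tasks.

Abstract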

Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide an efficient alternative with linear-time computation by evolving a latent state through recurrent updates, but their memory is typically formed via additive or linear transitions, which can limit their ability to capture complex global interactions across tokens. In this work, we introduce one of the first studies to leverage the superposition property of quantum systems to enhance state-based sequence modeling. In particular, we propose Quantum Long-Attention Memory (QLAM), a hybrid quantum-classical memory mechanism that can be viewed as a quantum extension of state-space models. Instead of maintaining a classical latent state updated through additive dynamics, QLAM represents the hidden state as a quantum state whose amplitudes encode a superposition of historical information. The state evolves through parameterized quantum circuits conditioned on the input, enabling a non-classical, global update mechanism. In this way, QLAM preserves the recurrent and linear-time structure of SSMs while fundamentally enriching the memory representation through quantum superposition. Unlike attention mechanisms that explicitly compute pairwise interactions, QLAM implicitly captures global dependencies through the evolution of the quantum state, and retrieves task-relevant information via query-dependent measurements. We evaluate QLAM on sequential variants of standard image classification benchmarks, including sMNIST, sFashion-MNIST, and sCIFAR-10, where images are flattened into token sequences. Across all tasks, QLAM consistently improves over recurrent baselines and transformer-based models.

2605.13831 2026-05-14 cs.CV Version update

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song

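Affiliations: CSE Department, HKUST

AI summary: This paper studies how to effectively train long-context large vision-language models (LVLMs) so that they generalize beyond a 128K context length. Through systematic continued pre-training experiments, the authors find that long-document VQA is more effective than OCR transcription and report three key findings: the data length distribution should be balanced, retrieval is the main bottleneck, and long-document data can preserve short-context capabilities. Based on these findings they introduce MMProLong, which, with a budget of only 5 billion tokens, substantially improves long-document VQA performance and maintains strong performance at longer context lengths without additional training.

Comments: work in progress

Abstract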

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.

2605.13825 2026-05-14 cs.AI cs.CV Version update

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Alberto G. Rodríguez Salgado

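Affiliations: Independent Researcher

AI summary: This study examines whether large language models continue to take unsafe actions when their context contains a record of prior harmful behavior. It builds HistoryAnchor-100, a test set of 100 high-stakes scenarios, to evaluate models' decision tendencies under different behavioral histories. The experiments find that when the prompt adds an instruction to "stay consistent with the strategy shown in the prior history", many well-aligned models become far more likely to choose unsafe options and sometimes even escalate, revealing a safety risk: model decisions can be strongly influenced by prior behavior.

Comments: 12 pages, 3 figures

Abstract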

Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.

2605.13815 2026-05-14 cs.CV cs.RO Version update

OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation

Youquan Liu, Weidong Yang, Ao Liang, Xiang Xu, Lingdong Kong, Yang Wu, Dekai Zhu, Xin Li, Runnan Chen, Ben Fei, Tongliang Liu, Wanli Ouyang

Affiliations: College of Computer Science and Artificial Intelligence, Fudan University; School of Computing, Department of Computer Science, National University of Singapore; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Technical University of Munich; Nanjing University of Science and Technology; Shanghai AI Laboratory; University of Sydney; The Chinese University of Hong Kong, Hong Kong SAR

AI summary: OmniLiDAR is a unified text-conditioned diffusion framework for multi-domain LiDAR point cloud generation, covering eight domains that include adverse weather, sensor-configuration changes, and cross-platform acquisition. By introducing a cross-domain training strategy and feature modeling techniques, it generates heterogeneous data with a single model and improves controllability and generalization. Experiments show strong generation quality and gains on downstream tasks such as semantic segmentation and object detection, with particularly clear advantages when data are scarce.

Comments: Preprint; 12 pages, 7 figures, 10 tables

Abstract

LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion-based LiDAR generators are developed under single-domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text-conditioned diffusion framework that generates LiDAR scans in a shared range-image representation across eight representative domains spanning three shift types: adverse weather, sensor-configuration changes (e.g., reduced beams), and cross-platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross-Domain Training Strategy (CDTS) that mixes domains within each mini-batch and leverages conditioning to steer generation. We further propose Cross-Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain-Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain-dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8-domain dataset by combining real-world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited-label regimes.

2605.13813 2026-05-14 cs.CV Version update

JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift

Lavsen Dahal, Yubraj Bhandari, Geoffrey Rubin, Joseph Y. Lo

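Affiliations: Center for Virtual Imaging Trials, RAI Labs, Department of Radiology, Duke University, Durham, NC 27708, USA; Electrical and Computer Engineering, Pratt School of Engineering, Duke University, Durham, NC 27708, USA; Department of Mathematics, Trinity College of Arts & Sciences, Duke University, Durham, NC 27708, USA; Department of Radiology

AI summary: This paper proposes JANUS, a physiology-guided dual-stream architecture for robust CT triage under distribution shift. Through an anatomy-guided gating mechanism, it conditions visual embeddings on macro-radiomic priors, improving generalization and reliability across institutions. Experiments show that JANUS outperforms existing methods on the MERLIN dataset and also performs well on an external dataset, with the largest gains on findings defined by size and attenuation.

Abstract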

Automated CT triage requires models that are simultaneously accurate across diverse pathologies and reliable under institutional shift. While Vision Transformers provide strong visual representations, many clinically significant findings are defined by quantitative imaging biomarkers rather than appearance alone. We introduce JANUS, a physiology-guided dual-stream architecture that conditions visual embeddings on macro-radiomic priors via Anatomically Guided Gating. On the MERLIN test set (N=5082), JANUS attains macro-AUROC 0.88 and AUPRC 0.74, outperforming all reproduced baselines. It generalizes to an external dataset (N=2000; AUROC 0.87), with the largest gains on findings defined by size and attenuation as well as improved calibration on both datasets. We further quantify prediction suppression using the Physiological Veto Rate (PVR), showing that under domain shift JANUS reduces high-confidence false positives substantially more often than true positives. Together, these results are consistent with physically grounded conditioning that improves both discrimination and reliability in CT triage. Code is publicly available on GitHub at https://github.com/lavsendahal/janus, and model weights are at https://huggingface.co/lavsendahal/janus.

2605.13803 2026-05-14 cs.CV Version update

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Minjoon Jung, Byoung-Tak Zhang, Lorenzo Torresani

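Affiliations: Seoul National University; Northeastern University

AI summary: This paper proposes EvoGround, a self-evolving video agent framework for video temporal grounding (VTG), the task of localizing the temporal segment of an untrimmed video that best matches a natural-language query. The method needs no human-annotated data: two cooperating agents, a proposer and a solver, learn temporal grounding automatically from raw videos. Experiments show that EvoGround performs strongly on multiple benchmarks, matching or surpassing fully supervised models, and becomes a state-of-the-art fine-grained video captioner without manual labels.

Comments: Project page: https://minjoong507.github.io/projects/EvoGround/

Abstract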

Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query-moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. The two agents are initialized from the same backbone and, through this self-reinforcing reinforcement-learning loop, mutually improve across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.

2605.13798 2026-05-14 cs.CV Version update

VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence

Guney Tombak, Ertunc Erdil, Ender Konukoglu

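Affiliations: Biomedical Image Computing Group, ETH Zurich; The LOOP Zurich – Medical Research Center

AI summary: In multimodal medical image analysis, cross-modal voxel-level representations must remain anatomically consistent across imaging modalities, scanners, and acquisition protocols. This paper proposes VoxCor, a training-free volumetric feature extraction method that produces reusable 3D voxel feature representations from frozen 2D Vision Transformer models. By combining triplanar ViT inference with a weighted partial least squares projection, it learns modality-stable anatomical directions in an offline fitting phase, so that at transform time new volumes can be mapped directly without fine-tuning or registration and voxel correspondences can be queried efficiently. Experiments show that VoxCor delivers strong registration performance and feature transfer on cross-subject, cross-modality tasks, providing a reusable feature layer for multimodal medical image analysis.

Abstract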

Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit-transform method for reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR-CT and inter-subject HCP T2w-T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at https://github.com/guneytombak/VoxCor.
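
The nearest-neighbor correspondence query mentioned in the abstract can be sketched as follows. This is an illustrative sketch only: the cosine metric, the function name, and the synthetic feature matrices are assumptions, and the features here are random stand-ins rather than projected triplanar ViT features.

```python
import numpy as np

def nearest_neighbor_correspondence(query_feats, target_feats):
    """For each query voxel feature, return the index and cosine score
    of the most similar target voxel feature (brute-force search)."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sim = q @ t.T  # (num_query, num_target) cosine similarities
    return np.argmax(sim, axis=1), np.max(sim, axis=1)

# Synthetic check: the "query volume" holds the same features as the
# "target volume", shuffled, rescaled, and slightly perturbed.
rng = np.random.default_rng(0)
target = rng.normal(size=(50, 8))
shuffle = rng.permutation(50)
query = 2.0 * target[shuffle] + 0.01 * rng.normal(size=(50, 8))
match, score = nearest_neighbor_correspondence(query, target)
```

Here `match` recovers `shuffle`, because rescaling does not change cosine similarity. For full volumes a brute-force similarity matrix would be too large, so an approximate nearest-neighbor index would likely be needed.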

2605.13794 2026-05-14 cs.GR cs.CV Version update

BlitzGS: City-Scale Gaussian Splatting at Lightning Speed

Zhongtao Wang, Huishan Au, Yilong Li, Mai Su, Haojie Jin, Yisong Chen, Meng Gai, Fei Zhu, Guoping Wang

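Affiliations: Peking University

AI summary: This paper proposes BlitzGS, a distributed 3D Gaussian splatting framework for fast reconstruction of city-scale scenes. It optimizes how Gaussians are processed at three coupled levels (system, model, and view), substantially reducing the computational workload and improving rendering efficiency. Experiments show that BlitzGS preserves rendering quality while achieving an order-of-magnitude speedup over existing methods, training city-scale scenes in tens of minutes.

Abstract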

We present BlitzGS, a distributed 3DGS framework that reduces active Gaussian workload for fast city-scale reconstruction. BlitzGS manages this workload at three coupled levels. At the system level, the framework shards Gaussians across GPUs by index parity rather than spatial blocks. This approach mitigates the cross-block visibility redundancy inherent in spatial partitioning. Furthermore, it distributes each rendering step through a single cross-GPU exchange that routes projected Gaussians to their tile owners. At the model level, scheduled importance-scoring passes shrink the global Gaussian population. During these passes, the framework generates a per-Gaussian visibility weight to bias density-control updates toward contributing primitives and a per-view importance mask for the view-level renderer. At the view level, BlitzGS trims each camera's active set with a distance-based LOD gate to exclude excessively fine primitives for the current frustum and the importance-based culling mask to skip Gaussians with negligible cross-view contribution. On large-scale benchmarks, BlitzGS matches the rendering quality of recent large-scale baselines while delivering an order-of-magnitude speedup, training city-scale scenes in tens of minutes. Our code is available at https://github.com/AkierRaee/BlitzGS.
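
One way to picture index-based sharding versus spatial blocks is the toy comparison below. It is a sketch under stated assumptions, not the BlitzGS implementation: "index parity" is read here as index modulo the number of GPUs, the scene is reduced to one coordinate, and the "camera" is a hypothetical threshold on that coordinate. It only illustrates how the visible workload spreads across shards, not the paper's redundancy analysis.

```python
import numpy as np

def shard_by_index(num_gaussians, num_gpus):
    """Assumed reading of index-parity sharding: owner = index mod num_gpus."""
    return np.arange(num_gaussians) % num_gpus

def shard_by_spatial_block(x, num_gpus):
    """Baseline: equal-width slabs along one axis, one slab per GPU."""
    edges = np.linspace(x.min(), x.max(), num_gpus + 1)
    block = np.searchsorted(edges, x, side="right") - 1
    return np.clip(block, 0, num_gpus - 1)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 100.0, size=10_000)  # 1D stand-in for Gaussian centers
visible = x < 25.0                        # hypothetical camera sees one quarter

index_owner = shard_by_index(x.size, 4)
block_owner = shard_by_spatial_block(x, 4)

# Number of visible Gaussians each GPU would have to process for this view.
index_load = np.bincount(index_owner[visible], minlength=4)
block_load = np.bincount(block_owner[visible], minlength=4)
```

With index-based ownership the visible Gaussians are spread almost evenly across the four shards, while with spatial slabs nearly all of them land on the first shard for this view.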

2605.13778 2026-05-14 cs.RO cs.CV Version update

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

Jiahui Niu, Kefan Gu, Yucheng Zhao, Shengwen Liang, Tiancai Wang, Xing Hu, Ying Wang, Huawei Li

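Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Nanjing University; Dexmal

AI summary: This paper proposes Realtime-VLA FLASH, a speculative inference framework that addresses the high latency of full inference that diffusion-based vision-language-action (dVLA) models face in real-time deployment. It introduces a lightweight draft model, verifies its output in parallel with the main model's Action Expert, and uses a phase-aware mechanism that falls back to the full inference pipeline when needed, enabling low-latency, high-frequency replanning. Experiments show that FLASH reduces inference latency on both LIBERO and a real-world conveyor-belt sorting task, markedly improving the efficiency of real-time task execution.

Abstract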

Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model's Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.
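
The latency figures quoted in the abstract can be checked with a few lines of arithmetic. The speedup follows directly from the reported numbers; the fraction of full-inference rounds is only a rough bound, derived under the assumption (not stated in the abstract) that every round is either a 58.0 ms full round or a best-case 7.8 ms speculative round.

```python
full_ms = 58.0  # reported latency of one full-inference round
spec_ms = 7.8   # reported best-case latency of a speculative round
avg_ms = 19.1   # reported task-level average latency

# Speedup relative to running full inference every round.
speedup = full_ms / avg_ms  # rounds to 3.04, matching the abstract

# Two-point mixture: avg = p * full + (1 - p) * spec, solved for p.
# Speculative rounds slower than 7.8 ms would make the true share of
# full rounds smaller, so this is an upper bound under that assumption.
full_fraction = (avg_ms - spec_ms) / (full_ms - spec_ms)
```

Under that simplification, at most roughly 23% of rounds would be full-inference rounds.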

2605.13775 2026-05-14 cs.RO cs.CV Version update

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

Harold Haodong Chen, Sirui Chen, Yingjie Xu, Wenhang Ge, Ying-Cong Chen

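Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper proposes RoboEvolve, a framework that addresses the scalability bottleneck in robotic manipulation caused by scarce physical interaction data. It couples a vision-language model (VLM) and a video generation model (VGM) into a mutually reinforcing co-evolutionary loop that relies only on unlabeled seed images for autonomous data synthesis and policy optimization. Experiments show that RoboEvolve offers significant advantages in task success rate, data efficiency, and continual learning.

Comments: On-going work

Abstract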

The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds (a 50x reduction); and (III) demonstrates robust continual learning without catastrophic forgetting.

2605.13755 2026-05-14 cs.CV Version update

Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception

Arka Bhowmick, Enes Ozeren, Ahmed Abdullah, Oliver Wasenmuller

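Affiliations: BIT Technology Solutions GmbH; Mannheim University of Applied Sciences

AI summary: This paper studies how generative AI can increase the texture diversity of 3D pedestrian models for autonomous driving perception, making models more robust in complex scenes. The authors propose a StyleGAN2-based method that starts from a single 3D base asset and generates pedestrian instances with diverse facial textures and appearance, without redesigning the geometry. Using these assets they build synthetic datasets and analyze how mixing real and synthetic data affects 2D and 3D object detection, revealing the sensitivity of 3D perception models to geometric domain gaps and showing both the potential and the limitations of generative AI for autonomous driving data generation.

Comments: Published at SAIAD 2026 Workshop at CVPR 2026

Abstract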

In recent years, autonomous driving has significantly increased the demand for high-quality data to train 2D and 3D perception models for safety-critical scenarios. Real-world datasets struggle to meet this demand as requirements continuously evolve and large-scale annotated data collection remains costly and time-consuming, making synthetic data a scalable, practical and controllable alternative. Pedestrian detection is among the most safety-critical tasks in autonomous driving. In this paper, we propose a simple yet effective method for scaling variability in 3D pedestrian assets for synthetic scene generation. Starting from a single 3D base asset, we generate multiple distinct pedestrian instances by synthesizing diverse facial textures and identity-level appearance variations using StyleGAN2 and automatically mapping them onto 3D meshes. This approach enables scalable appearance-level asset diversification without requiring the design of new geometries for each instance. Using the assets, we construct synthetic datasets and study the impact of mixing real and synthetic data for RGB-based object detection. Through complementary experiments, we analyze geometry-driven distribution shifts in point cloud perception for 3D object detection. Our findings demonstrate that controlled synthetic diversification improves robustness in 2D detection while revealing the sensitivity of 3D perception models to geometric domain gaps. Overall, this work highlights how generative AI enables scalable, simulation-ready pedestrian diversification through controlled facial texture synthesis, along with the benefits and limitations of cross-domain training strategies in autonomous driving pipelines.

2605.13753 2026-05-14 cs.LG cs.CV Version update

Min Generalized Sliced Gromov Wasserstein: A Scalable Path to Gromov Wasserstein

Ashkan Shahbazi, Xinran Liu, Ping He, Soheil Kolouri

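Affiliations: Department of Computer Science, College of Connected Computing, Vanderbilt University; Department of Electrical and Computer Engineering, Vanderbilt University

AI summary: This paper proposes min Generalized Sliced Gromov-Wasserstein (min-GSGW), a new method for solving the Gromov-Wasserstein (GW) problem efficiently. Using expressive generalized slicers, it learns coupled nonlinear slicers for the two input measures and minimizes the GW objective directly in the original spaces. min-GSGW is invariant to rigid motions, which suits geometric matching and shape analysis, and in several experiments it yields better geometric correspondences at lower computational cost than existing methods.

Abstract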

We propose min Generalized Sliced Gromov-Wasserstein (min-GSGW), a sliced formulation for the Gromov-Wasserstein (GW) problem using expressive generalized slicers. The key idea is to learn coupled nonlinear slicers that assign compatible push-forward values to both input measures, so that monotone coupling in the projected domain lifts to a transport plan evaluated against the GW objective in the original spaces. The resulting plan induces a GW objective value, and min-GSGW minimizes this cost directly in the original spaces. We further show that min-GSGW is rigid-motion invariant, a crucial property for geometric matching and shape analysis tasks. Our contributions are threefold: 1) we introduce generalized slicers into the sliced GW framework; 2) we construct an efficient slicing-based GW transport plan; and 3) we develop an amortized variant that replaces per-instance optimization with a learned slicer for unseen input pairs. We perform experiments on animal mesh matching, horse mesh interpolation, and ShapeNet part transfer. Results show that min-GSGW produces meaningful geometric correspondences and GW objective values at substantially lower computational cost than existing GW solvers.
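
The idea of lifting a monotone 1D coupling to a plan that is scored by the GW objective in the original spaces can be sketched in a simplified form. This is not the paper's method: it swaps the learned, coupled nonlinear slicers for independent random linear projections, picks the best slice by random search rather than optimization, and assumes equal-size point clouds with uniform weights. All function names are hypothetical.

```python
import numpy as np

def pairwise_dist(points):
    """Euclidean distance matrix of a point cloud."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def gw_cost_of_matching(dx, dy, perm):
    """Squared-loss GW objective of the plan that matches x_i to y_perm[i]."""
    return float(np.mean((dx - dy[np.ix_(perm, perm)]) ** 2))

def min_sliced_gw(x, y, n_slices=256, seed=0):
    """Random-search stand-in for min-GSGW with linear slicers."""
    rng = np.random.default_rng(seed)
    dx, dy = pairwise_dist(x), pairwise_dist(y)
    best_cost, best_perm = np.inf, None
    for _ in range(n_slices):
        theta_x = rng.normal(size=x.shape[1])
        theta_y = rng.normal(size=y.shape[1])
        order_x = np.argsort(x @ theta_x)
        order_y = np.argsort(y @ theta_y)
        perm = np.empty(len(x), dtype=int)
        perm[order_x] = order_y  # monotone coupling of the 1D projections
        cost = gw_cost_of_matching(dx, dy, perm)  # scored in original spaces
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

# y is a rotated and translated copy of x (a rigid motion).
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 2))
angle = 0.7
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
y = x @ rot.T + np.array([3.0, -1.0])

identity_cost = gw_cost_of_matching(pairwise_dist(x), pairwise_dist(y), np.arange(8))
best_cost, best_perm = min_sliced_gw(x, y)
```

Because the objective depends only on pairwise distances, `identity_cost` is numerically zero for the rigidly moved copy, which is the invariance the abstract refers to. The random search gives an upper bound on the GW value of the best permutation plan and is not guaranteed to find the zero-cost matching.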

2605.13746 2026-05-14 cs.CV cs.AI Version update

Weakly-Supervised Spatiotemporal Anomaly Detection

Urvi Gianchandani, Praveen Tirupattur, Mubarak Shah

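Affiliations: University of Texas at Dallas; University of Central Florida

AI summary: This paper studies weakly supervised spatiotemporal anomaly detection, training with video-level labels only and no frame-level annotation. The core method extracts features from normal and anomalous video clips and scores spatiotemporal regions with a multiple instance ranking loss (MIL), accounting for the fact that anomalies are localized in both time and space. The method is evaluated on the UCF Crime2Local dataset, which contains spatiotemporal annotations, with effective results.

Abstract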

In this paper, we explore a weakly supervised method for anomaly detection. Since annotating videos is time-consuming, we only look at weak video-level labels during training. This means that given a video, we know that it is either normal or contains an anomaly, but no further annotations are used to train the network. Features are extracted from video clips that are either normal or anomalous. These features are used to determine anomaly scores for spatiotemporal regions of the clips based on a classifier and the implementation of a multiple instance ranking loss (MIL). We represent both anomalous and normal video clips as positive and negative bags, respectively, to apply MIL. Furthermore, since anomalies are usually localized to a part of a frame rather than the whole frame, we chose to explore temporal as well as spatial anomaly detection. We show our results on the UCF Crime2Local Dataset, which contains spatiotemporal annotations for a portion of the UCF Crime Dataset.
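
A multiple instance ranking loss over bags of region scores can be sketched as below. The abstract does not give the exact formula, so this assumes the commonly used hinge form, in which the highest-scoring region of an anomalous (positive) bag should outrank the highest-scoring region of a normal (negative) bag by a margin. The score grids and the margin value are made up for illustration.

```python
import numpy as np

def mil_ranking_loss(pos_bag_scores, neg_bag_scores, margin=1.0):
    """Hinge ranking loss between the top-scoring instance of each bag.
    Only video-level labels are needed: which bag is anomalous."""
    top_pos = np.max(pos_bag_scores)
    top_neg = np.max(neg_bag_scores)
    return float(max(0.0, margin - top_pos + top_neg))

# Anomaly scores laid out as (time segments, spatial cells) for one
# anomalous clip and one normal clip; values are hypothetical.
anomalous = np.array([[0.1, 0.2],
                      [0.9, 0.1]])
normal = np.array([[0.1, 0.3],
                   [0.2, 0.1]])

loss = mil_ranking_loss(anomalous, normal)  # 1.0 - 0.9 + 0.3 = 0.4
```

Taking the maximum over a grid that spans both time and space is what lets a loss of this kind localize an anomaly to a region of a frame rather than to a whole segment.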

2605.13744 2026-05-14 cs.CV Version update

Aligning Network Equivariance with Data Symmetry: A Theoretical Framework and Adaptive Approach for Image Restoration

Feiyu Tan, Qi Xie, Zongben Xu, Deyu Meng

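Affiliations: School of Mathematics and Statistics

AI summary: Image restoration is an inherently ill-posed inverse problem, and equivariant networks that embed geometric symmetry priors can mitigate this and improve performance. However, the relationship between network equivariance and data symmetry is still understood only heuristically, and there is no systematic theoretical framework for quantifying symmetry, selecting transformation groups, or evaluating how well a model aligns with the data. Taking an optimization perspective, this paper proposes the first quantifiable, dataset-level definition of non-strict symmetry and uses it as a constraint to formulate the restoration inverse problem, revealing the intrinsic links among data symmetry, model equivariance, and generalization. It also proposes a sample-adaptive equivariant network that dynamically aligns with each sample's inherent symmetry; experiments show that it significantly outperforms conventional methods on super-resolution, denoising, and deraining.

Comments: 30 pages, 9 figures; supplementary material can be found at https://github.com/tanfy929/SA-Conv

Abstract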

Image restoration is an inherently ill-posed inverse problem. Equivariant networks that embed geometric symmetry priors can mitigate this ill-posedness and improve performance. However, current understanding of the relationship between network equivariance and data symmetry remains largely heuristic. Particularly for real-world data with imperfect symmetry, existing research lacks a systematic theoretical framework to quantify symmetry, select transformation groups, or evaluate model-data alignment. To bridge this gap, we conduct an analysis from an optimization perspective and formalize the intrinsic relationship among data symmetry priors, model equivariance, and generalization capability. Specifically, we propose for the first time a quantifiable definition of non-strict symmetry at the dataset level (rather than sample level) and use it as a constraint to formulate the restoration inverse problem. We then show that the equivariance of restoration models can be naturally derived from this inverse problem once it incorporates the proposed symmetry constraints, and that the equivariance error of the optimal restoration operator is strictly bounded by the data symmetry error and the discretization mesh size. Furthermore, by analyzing the network's empirical risk, we demonstrate that aligning equivariance with data symmetry optimizes the bias-variance trade-off, minimizing the total expected risk. Guided by these insights, we propose a Sample Adaptive Equivariant Network that uses a hypernetwork and transformation-learnable equivariant convolutions to dynamically align with each sample's inherent symmetry. Extensive experiments on super-resolution, denoising, and deraining validate our theoretical findings and show significant superiority over standard baselines and traditional equivariant models. Our code and supplementary material are available at https://github.com/tanfy929/SA-Conv.

2605.13741 2026-05-14 cs.RO cs.CV Version update

LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

Christina Kassab, Hyeonjae Gil, Matías Mattamala, Ayoung Kim, Maurice Fallon

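AI summary: This paper presents LEXI-SG, the first monocular 3D scene graph mapping system that relies on RGB camera input alone, enabling accurate and scalable dense mapping in open-vocabulary settings. It uses the semantic priors of open-vocabulary foundation models to partition the scene into rooms and runs feed-forward reconstruction once each room is fully observed, avoiding sliding-window scale inconsistencies. A room-based factor graph optimization aligns the rooms globally while preserving local map consistency, naturally builds the hierarchy of the semantic scene graph, and supports open-vocabulary object segmentation and tracking. Experiments show that LEXI-SG performs well in trajectory estimation, dense reconstruction, and open-vocabulary segmentation.

Abstract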

Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed, which enables scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from Habitat-Matterport 3D and on self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori-drs.github.io/lexisg-web/.

2605.13730 2026-05-14 cs.LG cs.AI cs.CV Updated version

Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

Christos Chrysanthos Nikolaidis, Vasileios Sachpekidis, Nikolas Moustakidis, Theofilos Moustakidis, Pavlos S. Efraimidis

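Affiliations * Department of Electrical and Computer Engineering, Democritus University of Thrace

AI summary This study aims to diagnose bicuspid aortic valve (BAV) reliably from echocardiography images, addressing the diagnostic inconsistency caused by differences in operator experience and image quality. It proposes an explainable AI model based on a video ensemble that analyzes routinely acquired parasternal long-axis cine loops to classify BAV versus tricuspid aortic valve (TAV). The model shows strong classification performance on data from 90 patients and provides explainable diagnostic evidence through Grad-CAM and SHAP values, which helps improve the transparency and auditability of clinical diagnosis.

Details
English abstract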

Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on $N{=}90$ patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of $0.907$ and recall of $0.877$. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.

2605.13729 2026-05-14 cs.CV cs.AI Updated version

Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

Deli Cai, Haoyang Ma, Changxing Ding

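Affiliations * School of Electronic and Information Engineering, South China University of Technology; Pazhou Lab

AI summary This paper studies the generation of realistic human motion conditioned on both text descriptions and spatial trajectories. Existing methods handle condition conflicts and redundant motion representations poorly, which degrades generation quality or destabilizes trajectory control. The authors propose CMC, a decoupled framework that uses a divide-and-conquer strategy to split the task into two stages, trajectory control and motion completion, which respectively ensure accurate trajectory following and generate the full motion. A selective inpainting mechanism is introduced to mitigate overfitting caused by limited data, and experiments show that CMC achieves superior control accuracy and motion quality on multiple datasets.

Details
English abstract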

Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that coordinates text and trajectory conditions through a divide-and-conquer strategy comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under trajectory guidance, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on the HumanML3D and KIT datasets show that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.

2605.13724 2026-05-14 cs.CV cs.AI Updated version

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou

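Affiliations * NVIDIA; Show Lab, National University of Singapore; MIT

AI summary This paper proposes AnyFlow, a flow-map-based distillation framework for any-step video diffusion models that addresses the performance degradation of consistency-distilled models when more sampling steps are allocated at test time. AnyFlow shifts the distillation target from endpoint consistency mapping to flow-map transition learning over arbitrary time intervals, optimizes the full ODE sampling trajectory, and introduces Flow Map Backward Simulation to improve sampling efficiency and reduce test-time errors. Experiments show that AnyFlow matches or outperforms existing methods in few-step generation while scaling flexibly to any number of steps.

Comments Project page at https://nvlabs.github.io/AnyFlow/

Details
English abstract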

Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping $(z_{t}\rightarrow z_{0})$ to flow-map transition learning $(z_{t}\rightarrow z_{r})$ over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow achieves performance that matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
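To make the difference between an Euler rollout and flow-map transitions $(z_{t}\rightarrow z_{r})$ concrete, here is a minimal sketch on a toy scalar ODE whose flow map is known in closed form. Everything in it is an assumption made for illustration: the ODE $dz/dt = A z$, the rate `A`, and the function names are hypothetical, and the exact flow map stands in for what a distilled network would approximate. It does not reproduce AnyFlow's training procedure.

```python
import math

A = 1.5  # hypothetical rate of the toy ODE dz/dt = A * z

def velocity(z, t):
    """Velocity field of the toy ODE (time-independent here)."""
    return A * z

def euler_rollout(z1, steps):
    """Integrate from t=1 down to t=0 with `steps` explicit Euler updates."""
    z, t = z1, 1.0
    dt = -1.0 / steps
    for _ in range(steps):
        z = z + dt * velocity(z, t)
        t += dt
    return z

def flow_map(z_t, t, r):
    """Exact flow map z_t -> z_r of the toy ODE, standing in for a learned one."""
    return z_t * math.exp(A * (r - t))

def flow_map_rollout(z1, steps):
    """Chain `steps` flow-map transitions from t=1 to t=0."""
    z = z1
    for i in range(steps):
        t = 1.0 - i / steps
        r = 1.0 - (i + 1) / steps
        z = flow_map(z, t, r)
    return z

z1 = 2.0
exact = flow_map(z1, 1.0, 0.0)
for n in (1, 4, 16):
    euler_err = abs(euler_rollout(z1, n) - exact)
    map_err = abs(flow_map_rollout(z1, n) - exact)
    print(f"steps={n:2d}  euler error={euler_err:.4f}  flow-map error={map_err:.2e}")
```

In this toy setting the Euler error shrinks only as the step count grows, whereas chained flow-map transitions land on the same endpoint for any step count, which is the property an any-step sampler aims for.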

2605.13713 2026-05-14 cs.CV eess.IV Updated version

Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization

Isabella Poles, Simon Arberet, Riqiang Gao, Martin Kraus, Marco D. Santambrogio, Florin C. Ghesu, Ali Kamen, Dorin Comaniciu

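Affiliations * Politecnico di Milano; Digital Technology and Innovation, Siemens Healthineers

AI summary This paper proposes an end-to-end optimization method based on a diffusion model and an LSTM for generating radiotherapy plans. The method generates clinically feasible fluence maps with a distribution-matching diffusion model and uses an LSTM module to learn gradient update dynamics, which quickly optimizes the dose distribution. Experiments show that it outperforms existing methods in planning efficiency, flexibility, and machine deliverability.

Comments Early Accept at MICCAI 2026

Details
English abstract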

Volumetric Modulated Arc Therapy (VMAT) is a cornerstone of modern radiation therapy, enabling highly conformal tumor irradiation and healthy-tissue sparing. Yet its planning requires solving inverse and nested optimization problems for multi-leaf collimators, monitor units, and dose parameters while enforcing their consistency to ensure mechanical deliverability. Moreover, this process often requires repeated re-optimization when treatment configurations change, resulting in substantial planning time per patient. To address these problems, we present a diffusion-driven Learning-to-Optimize (L2O) method for end-to-end VMAT planning. A distribution-matching distilled diffusion model learns a clinically feasible manifold of fluence maps, enabling their one-shot generation. On top of this, an LSTM-based L2O module learns gradient update dynamics to swiftly refine fluence maps toward prescribed dose objectives during inference. Experimental results on clinical and public prostate cancer cohorts demonstrate improved planning efficiency, flexibility, and machine deliverability over currently available end-to-end VMAT planners.

2605.13688 2026-05-14 cs.CV cs.LG Updated version

MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

Cenwei Zhang, Suncheng Xiang, Lei You

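Affiliations * Shanghai Jiao Tong University; Technical University of Denmark

AI summary MedCore is a structured pruning framework for MedSAM that aims to compress the model substantially while preserving boundary accuracy in medical image segmentation. It prunes efficiently by preserving two kinds of key structures: those that became important during SAM-to-MedSAM adaptation and those with high boundary leverage. Experiments show that MedCore greatly reduces parameters and computation on several polyp segmentation benchmarks while keeping high Dice and boundary metrics, supporting its effectiveness and reliability for medical image segmentation.

Comments 3 figures, 17 pages

Details
English abstract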

Medical segmentation foundation models such as SAM and MedSAM provide strong prompt-driven segmentation, but their image encoders are still too large for many clinical settings. Compression is also risky in medicine because a model can keep high Dice while losing boundary fidelity. We propose MedCore, a structured pruning framework for MedSAM. The main idea is to preserve two kinds of structures: structures that became important during SAM-to-MedSAM adaptation, and structures that have high boundary leverage. We identify the first type by a dual-intervention score that compares zeroing a group with resetting it to its original SAM weight. We identify the second type by boundary-aware Fisher estimation. We also introduce a boundary leverage principle, which shows that compression-induced boundary displacement is controlled by logit perturbation on the boundary divided by the logit spatial gradient. This principle explains why boundary metrics can degrade even when Dice remains high. On polyp segmentation benchmarks, MedCore reduces parameters by 60.0% and FLOPs by 58.4% while achieving Dice 0.9549, Boundary F1 0.6388, and HD95 5.14 after recovery fine-tuning. It also reaches 86.6% parameter reduction and 90.4G FLOPs with strong boundary quality. Our analysis further shows that MedSAM lies in a head-fragile boundary regime: head-pruning steps have 2.887 times larger 95th-percentile boundary leverage than MLP-pruning steps, and this logit-level effect is consistent with BF1 and HD95 degradation. Our code is available at https://github.com/cenweizhang/MedCore.
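The boundary leverage idea, displacement roughly equal to the logit perturbation on the boundary divided by the logit's spatial gradient, can be checked numerically in one dimension. The sketch below is a first-order reading of the sentence in the abstract, not the paper's formal statement: the tanh-shaped logit profile, the constant perturbation `delta`, and all function names are hypothetical choices for this example.

```python
import math

def logit(x, k, x0=0.0):
    """Toy 1D logit profile; its slope at the boundary x0 is 4 * k."""
    return 4.0 * math.tanh(k * (x - x0))

def find_boundary(f, lo=-5.0, hi=5.0, iters=80):
    """Bisection for the zero crossing of an increasing function f."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def boundary_shift(k, delta):
    """Return (measured, first-order estimate) of the boundary displacement."""
    before = find_boundary(lambda x: logit(x, k))
    after = find_boundary(lambda x: logit(x, k) + delta)
    # First-order estimate: |logit perturbation| / |spatial gradient at boundary|.
    estimate = abs(delta) / (4.0 * k)
    return abs(after - before), estimate

for k in (0.5, 1.0, 2.0):
    measured, estimate = boundary_shift(k, delta=0.2)
    print(f"slope={4 * k:.1f}  measured={measured:.4f}  estimate={estimate:.4f}")
```

The same perturbation moves the boundary further where the logit is flatter, which is consistent with the abstract's point that boundary metrics can degrade while region-overlap scores such as Dice stay high.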

2605.13686 2026-05-14 cs.CV cs.AI Updated version

Cross Modality Image Translation In Medical Imaging Using Generative Frameworks

Giulia Romoli, Alessia Capoccia, Filippo Ruffini, Francesco Di Feola, Luca Boldrini, Arturo Chiti, Renato Cuocolo, Tugba Akinci D'Antonoli, Fatemeh Darvizeh, Marcello Di Pumpo, Bradley J. Erickson, Liu Fang, Deborah Fazzini, Paola Feraco, Fabrizia Gelardi, Francesco Gossetti, Ana Isabel Hernáiz Ferrer, Michail E. Klontzas, Seyedmehdi Payabvash, Katrine Riklund, Sara N. Strandberg, Valerio Guarrasi, Paolo Soda

Affiliations * Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University; Unit of Artificial Intelligence and Computer Systems, Department of Engineering, Università Campus Bio-Medico di Roma; Vita-Salute San Raffaele University; Department of Medicine, Surgery and Dentistry, University of Salerno; Division of Diagnostic and Interventional Neuroradiology, Department of Radiology, University Hospital Basel; Department of Pediatric Radiology, University Children’s Hospital Basel; Department of Life Science and Public Health, Università Cattolica del Sacro Cuore; Athinoula A. Martinos Center for Biomedical Imaging; Artificial Intelligence and Translational Imaging (ATI) Lab, Department of Radiology, School of Medicine, University of Crete; Division of Radiology, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institute; Columbia University Medical Center; Department of Diagnostics and Intervention, Diagnostic Radiology, Umeå University

AI summary This paper studies cross-modality image translation in medical imaging, which generates images of a target modality from a source modality without additional acquisitions. The authors propose a reproducible, standardized evaluation framework and systematically compare seven generative models across multiple clinical tasks and datasets, finding that models based on Generative Adversarial Networks (GANs) outperform latent generative models overall, with SRGAN performing best on several tasks. The experiments also reveal weaknesses in synthesizing small lesions and a gap between quantitative metrics and clinical preference, and indicate that synthetic images are close to real images in clinical discrimination.

Details
English abstract

Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.

2605.13675 2026-05-14 cs.CV cs.LG q-bio.NC Updated version

Characterizing Universal Object Representations Across Vision Models

Florian P. Mahner, Johannes Roth, Ka Chun Lam, Michael F. Bonner, Francisco Pereira, Martin N. Hebart

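Affiliations * Vision and Computational Cognition Group; Max Planck Institute; Justus-Liebig-University Giessen; Machine Learning Core; Department of Cognitive Science; National Institute of Mental Health; Johns Hopkins University

AI summary This study examines the convergence of visual representations in deep neural networks trained with different architectures, objective functions, and datasets, aiming to reveal which visual properties models actually converge on and which factors drive this convergence. By decomposing the object similarity structure of 162 diverse vision models into a small number of non-negative dimensions and analyzing how often these dimensions recur across models, the study finds that some dimensions are universal across models, more interpretable, and more strongly driven by semantic image properties. It also shows that models with more universal dimensions better predict primate visual cortex activity and human similarity judgments, suggesting that this universality may reflect representational properties relevant to biological vision.

Details
English abstract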

Deep neural networks trained with different architectures, objectives, and datasets have been reported to converge on similar visual representations. However, what remains unknown is which visual properties models actually converge on and which factors may underlie this convergence. To address this, we decompose the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions. To determine universal versus model-specific dimensions, we then estimate how often each dimension reappears across models. In contrast to model-specific dimensions, universal dimensions are more interpretable and more strongly driven by conceptual image properties, indicating the relevance of interpretability and semantic content as implicit factors driving universality across models. Differences in architecture, objective function, training data, model size, and model performance do not explain the emergence of universal dimensions. However, models with more universal dimensions also better predict macaque IT activity and human similarity judgments, suggesting that universality reflects representations relevant to biological vision. These findings have important implications for understanding the emergent representations underlying deep neural network models and their alignment with biological vision.

2605.13670 2026-05-14 cs.CV Updated version

Pattern-Enhanced RT-DETR for Multi-Class Battery Detection

Xu Zhong, Enyuan Hu

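Affiliations * Independent Researcher; Chemistry Division, Brookhaven National Laboratory, NY, USA

AI summary For multi-class battery detection, this paper proposes PaQ-RT-DETR, a pattern-enhanced RT-DETR method that introduces pattern-based dynamic query generation to alleviate query activation imbalance while keeping computational overhead low. The study systematically compares several detection models on a public dataset of approximately 8,591 annotated images. The results show that PaQ-RT-DETR-X outperforms the baselines in overall mAP@50, with notable gains on data-scarce battery classes, providing practical guidance for selecting object detection models in battery-related industrial applications.

Comments 4 pages, 3 figures

Details
English abstract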

Accurate and efficient battery detection is increasingly important for applications in electronic waste recycling, industrial quality control, and automated sorting systems. In this paper, we present both a comprehensive benchmark and a novel method for multi-class battery detection. We systematically compare three CNN-based detectors (YOLOv8n, YOLOv8s, YOLO11n) and two transformer-based detectors (RT-DETR-L, RT-DETR-X) on a publicly available dataset of approximately 8,591 annotated images under identical experimental conditions, and further propose PaQ-RT-DETR, which introduces pattern-based dynamic query generation into RT-DETR to alleviate query activation imbalance with negligible computational overhead. Among baselines, YOLO11n achieves the best CNN-based accuracy (mAP@50: 0.779) at only 2.6M parameters, while YOLOv8n delivers the fastest inference at ~1,667 FPS. PaQ-RT-DETR-X achieves the highest overall mAP@50 of 0.782, surpassing RT-DETR-X by +2.8% with consistent per-class gains across all six battery categories including the data-scarce Bike Battery class. Our findings provide practical guidance for selecting object detection models in battery-related industrial applications.

2605.13667 2026-05-14 cs.CV Updated version

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

Vladislav Makarov, Mark Gizetdinov, Dmitry Yudin

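Affiliations * MIRAI

AI summary SceneGraphVLM is a compact method based on vision-language models for generating structured scene graphs from images and videos. It serializes graphs in the token-efficient TOON format and uses a two-stage training strategy that combines supervised fine-tuning and reinforcement learning to improve relation coverage and precision while avoiding irrelevant objects and relations. For video, the model can use the scene graph generated for the previous frame as lightweight short-term context, with no tracking or post-processing. Experiments show that SceneGraphVLM achieves a good balance between quality and generation speed on multiple datasets and markedly improves the precision of scene graph generation.

Details
English abstract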

Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: https://github.com/markus0440/SceneGraphVLM.git.

2605.13664 2026-05-14 cs.CV physics.optics Updated version

HADAR-Based Thermal Infrared Hyperspectral Image Restoration

Cheng Dai, Jiale Lin, Bingxuan Song, Yifei Chen, Jiashuo Chen, Xin Yuan, Fanglin Bao

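Affiliations * School of Science, Westlake University; School of Engineering, Westlake University

AI summary Thermal-infrared hyperspectral imagery (TIR-HSI) is valuable in many applications, but its practical use is severely limited by factors such as sensor degradation. This paper proposes HAIR, a physics-driven framework based on the HADAR rendering equation that uses a physical model of temperature, emissivity, and texture (TeX) triplets to restore ground-based TIR-HSI with high accuracy. The method ensures physical consistency and robustness to spatio-spectral noise, and enables spectral calibration and generation through an atmospheric downwelling radiation reference and the spectral smoothness of emissivity. Experiments show that it outperforms existing methods on denoising, inpainting, spectral calibration, and super-resolution.

Comments 17 pages, 18 figures

Details
English abstract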

Thermal-infrared (TIR) hyperspectral imagery (HSI) provides critical scene information for various applications. However, its practical utility is severely limited by unique sensor degradations beyond the capabilities of existing restoration methods, which are ignorant of underlying thermal physics. Here, we propose HAIR (HADAR-based Image Restoration) as a physics-driven framework for ground-based TIR-HSI restoration. HAIR utilizes the HADAR rendering equation (HRE) and combines it with the atmospheric downwelling radiative transfer equation (RTE) to model TIR-HSI using temperature, emissivity, and texture (TeX) physical triplets. This physical model leads to a TeX decompose-synthesize strategy that guarantees physical consistency and spatio-spectral noise resilience, in stark contrast to existing approaches. Moreover, our framework uses a forward-modeled atmospheric downwelling reference, along with spectral smoothness of emissivity and blackbody radiation, to enable spectral calibration and generation that would otherwise be elusive. Our extensive experiments on the outdoor DARPA Invisible Headlights dataset and in-lab FTIR measurements show that HAIR consistently outperforms state-of-the-art methods across denoising, inpainting, spectral calibration, and spectral super-resolution, establishing a benchmark in objective accuracy and visual quality.
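For readers unfamiliar with the TeX triplet, the sketch below evaluates a single-pixel radiance of the form S = e * B(T) + (1 - e) * X, where B is Planck's blackbody radiance, e the emissivity, and X the environmental (texture) radiance. This is how the HADAR rendering equation is commonly written in earlier HADAR work. It is used here as an assumption, since the abstract does not state the equation, and the atmospheric RTE coupling described above is omitted. The function names and the sample values are hypothetical.

```python
import math

H = 6.62607015e-34   # Planck constant, J*s
C = 2.99792458e8     # speed of light, m/s
KB = 1.380649e-23    # Boltzmann constant, J/K

def planck(wavelength_m, temp_k):
    """Blackbody spectral radiance B(lambda, T) in W * sr^-1 * m^-3."""
    x = H * C / (wavelength_m * KB * temp_k)
    return (2.0 * H * C ** 2) / (wavelength_m ** 5 * math.expm1(x))

def hre_radiance(emissivity, temp_k, texture, wavelength_m):
    """Assumed form S = e * B(T) + (1 - e) * X for one pixel and one band."""
    return emissivity * planck(wavelength_m, temp_k) + (1.0 - emissivity) * texture

# Hypothetical pixel: 300 K surface, emissivity 0.9, surroundings at 280 K.
for wavelength_um in (8.0, 10.0, 12.0, 14.0):
    lam = wavelength_um * 1e-6
    surroundings = planck(lam, 280.0)  # stand-in for the texture term X
    s = hre_radiance(0.9, 300.0, surroundings, lam)
    print(f"{wavelength_um:5.1f} um  S = {s:.3e} W sr^-1 m^-3")
```

A decompose-synthesize strategy, as described in the abstract, would estimate the three quantities from measured spectra and then re-render them through a model of this kind; the estimation step is not shown here.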

2605.13632 2026-05-14 cs.RO cs.CV Updated version

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

Yiran Ling, Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang, Xiaoke Jiang, Chuanxiu Liu, Jie Liu, Lei Zhang

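Affiliations * Futian Laboratory; Faculty of Computing, Harbin Institute of Technology; International Digital Economy Academy (IDEA); School of Robotics, Hunan University; South China University of Technology; Visincept; National Key Laboratory of Smart Farm Technologies and Systems

AI summary This paper proposes GTA-VLA, an interactive vision-language-action framework that enables spatially steerable embodied reasoning by letting users guide robot policies with explicit visual cues. The framework introduces an optional user-provided spatial prior guidance mechanism and combines it with internal task planning to produce a unified visual-spatial reasoning chain, improving task success rates in complex or unseen environments. Experiments show that the method performs well on standard benchmarks and is more robust and better at recovery under visual shifts and spatial ambiguities.

Details
English abstract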

In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: https://signalispupupu.github.io/GTA-VLA_ProjPage/

2605.13621 2026-05-14 cs.CV Updated version

WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning

Chunjin Yang, Xiwei Zhang, Yiming Xiao, Fanman Meng

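Affiliations * University of Electronic Science and Technology of China

AI summary WD-FQDet is a multispectral detection transformer framework based on wavelet decomposition and frequency-aware query learning. It targets the bias of modality-shared features and the insufficiency of modality-specific features in fused infrared-visible detection. Low-frequency alignment and high-frequency retention modules respectively strengthen the consistency of cross-modal features and the expression of modality-specific features, and a frequency-aware query selection mechanism dynamically regulates the contributions of the different features. Experiments show that WD-FQDet achieves leading detection performance on multiple datasets.

Details
English abstract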

Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from severe bias in modality-shared features and from insufficient modality-specific features. To address these issues, we propose a novel detection framework, WD-FQDet, that explicitly decouples modality-shared and modality-specific information from the infrared and visible modalities from the perspective of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module is proposed to align modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module is proposed to preserve modality-specific features through a multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, considering that the contributions of homogeneous and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module to dynamically regulate their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.
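The low/high-frequency split that the framework builds on can be illustrated with a one-level 2D Haar transform. This is only a sketch under assumptions: the abstract does not say which wavelet or how many levels WD-FQDet uses, and the function and band names below (`haar2d`, `ll`, `horiz`, `vert`, `diag`) are hypothetical.

```python
def haar2d(img):
    """One-level orthonormal 2D Haar transform of an image with even sides.

    Returns four half-resolution sub-bands: the low-frequency average `ll`
    and three high-frequency detail bands.
    """
    rows, cols = len(img) // 2, len(img[0]) // 2
    ll = [[0.0] * cols for _ in range(rows)]
    horiz = [[0.0] * cols for _ in range(rows)]
    vert = [[0.0] * cols for _ in range(rows)]
    diag = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2.0      # smooth, shared structure
            horiz[i][j] = (a - b + c - d) / 2.0   # change along a row
            vert[i][j] = (a + b - c - d) / 2.0    # change along a column
            diag[i][j] = (a - b - c + d) / 2.0    # diagonal change
    return ll, horiz, vert, diag

def energy(band):
    """Sum of squared coefficients in a band."""
    return sum(v * v for row in band for v in row)

# A vertical edge between columns 0 and 1 shows up in the `horiz` band only.
edge = [[0.0, 9.0, 9.0, 9.0] for _ in range(4)]
ll, horiz, vert, diag = haar2d(edge)
print("low-frequency band:", ll)
print("detail energies:", energy(horiz), energy(vert), energy(diag))
```

In a two-stream detector, a decomposition like this would be applied to each modality's features so that the low-frequency bands can be aligned and the high-frequency bands preserved, as the abstract describes.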

2605.13619 2026-05-14 physics.optics cs.CV Updated version

DeepFilters: Scattering-Aware Pupil Engineering with Learned Digital Filter Reconstruction for Extended Depth of Field Microscopy

Joseph L. Greene, Suet Ying Chan, Qilin Deng, Jeffrey Alido, Alexandra Lion, Guorong Hu, Ruipeng Guo, Tongyu Li, Kivilcim Kiliç, Ian Davison, Lei Tian

Affiliations * Boston University, Department of Electrical and Computer Engineering; Georgia Tech Research Institute, Electro-Optical Systems Lab; Boston University, Department of Biology; Harvard Medical School, Brigham and Women’s Hospital, Department of Orthopedic Surgery; Boston University, Neurophotonics Center; Boston University, Department of Biomedical Engineering

AI summary DeepFilters is a deep optics framework for extended depth of field microscopy that addresses the loss of image quality that conventional and existing deep learning methods suffer in scattering tissue. Through a differentiable forward model, it jointly optimizes a parameterized pupil filter and a digital-filter-based reconstruction network, achieving broad applicability without retraining. DeepFilters incorporates empirical scattering kernels, physics-guided regularization, and a hybrid genetic-gradient initialization strategy, markedly improving imaging depth and signal recovery in clear media and biological tissue.

Comments 38 pages (18 main text, 20 supplement), 23 Figures (7 main text, 16 supplement)

Details
English abstract

Extended depth of field microscopy encodes axial information into a single acquisition through engineered point spread functions, but conventional and deep optics approaches are subject to degradation in scattering tissue. We introduce DeepFilters, a scattering-aware deep optics framework that jointly optimizes a parameterized pupil filter and a digital-filter-based reconstruction network through a calibrated differentiable forward model to achieve broad generalization without retraining. Incorporating empirical scattering kernels, physics-guided regularization, and a hybrid genetic-gradient initialization strategy, DeepFilters extends the PSF from 16 micron to >400 micron in clear media and enables signal recovery beyond 120 micron deep in biological tissues, validated across fixed brain slices and sea urchin embryos.

2605.13604 2026-05-14 cs.CV Updated version

Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting

Chanyoung Kim, Donghyun Kim, Dong-Hyun Sim, Seong Jae Hwang, Youngjoong Kwon

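Affiliations * Emory University; Yonsei University; WHATs Lab

AI summary This paper revisits the use of graph convolutional networks for 2D-to-3D hand pose lifting and asks whether a fixed adjacency graph should be used to encode the hand skeleton. Through parameter-matched ablations on the FPHA dataset, the study finds that multi-head self-attention clearly outperforms conventional graph convolution, and further shows that graph-distance positional encoding used as a soft structural prior is more effective than a hard adjacency constraint. The results indicate that adaptive spatial attention improves hand pose estimation accuracy more effectively than fixed graph convolution.

Details
English abstract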

Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.
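One way to picture a soft structural prior is to compute hop distances on the hand skeleton and subtract a distance-dependent penalty from the attention logits, so every joint can still attend to every other joint. The sketch below makes several assumptions that the abstract does not confirm: a 21-joint layout (wrist plus five four-joint fingers), an additive bias with a fixed `slope` rather than a learned encoding, and hypothetical function names throughout.

```python
import math
from collections import deque

NUM_JOINTS = 21  # wrist + 5 fingers x 4 joints (a common hand layout)

def hand_edges():
    """Edges of the assumed skeleton: the wrist (0) starts each finger chain."""
    edges = []
    for finger in range(5):
        base = 1 + 4 * finger
        edges.append((0, base))
        edges.extend((base + k, base + k + 1) for k in range(3))
    return edges

def graph_distances(num_nodes, edges):
    """All-pairs hop distances via breadth-first search from every node."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist

def biased_attention(scores, dist, slope=0.5):
    """Row-wise softmax of scores[i][j] - slope * dist[i][j]."""
    weights = []
    for s_row, d_row in zip(scores, dist):
        logits = [s - slope * d for s, d in zip(s_row, d_row)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

dist = graph_distances(NUM_JOINTS, hand_edges())
uniform_scores = [[0.0] * NUM_JOINTS for _ in range(NUM_JOINTS)]
attn = biased_attention(uniform_scores, dist)
print("thumb tip -> index tip hops:", dist[4][8])
print("thumb tip attends to its neighbour vs index tip:", attn[4][3], attn[4][8])
```

Unlike a hard adjacency mask, the bias only down-weights distant joints, so content-dependent scores can still override the skeleton when the data call for it.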

2605.13600 2026-05-14 cs.CV Updated version

Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting

Lovre Antonio Budimir, Yushi Guan, Steve Ryhner, Sven Lončarić, Nandita Vijaykumar

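Affiliations * Faculty of Electrical Engineering and Computing; Department of Computer Science; Vector Institute

AI summary This paper proposes SCOUP, an efficient method for 3D Language Gaussian Splatting that addresses how to efficiently associate high-dimensional vision-language embeddings with large numbers of 3D Gaussians in open-vocabulary 3D scene understanding. The method decouples language representation learning from 3D Gaussian optimization, learns sparse coded representations from the features of 2D image regions, and uplifts them to 3D Gaussians through weighted sparse aggregation, enabling efficient storage and fast rendering. Experiments show that SCOUP delivers substantial gains in training speed and memory efficiency and matches or exceeds the open-vocabulary querying accuracy of existing methods on multiple benchmarks.

Comments 18 pages (9 pages main paper), 10 figures, preprint

Details
English abstract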

3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely using features associated with 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians with our weighted sparse aggregation using Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-$K$ filtering then extracts the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Our method achieves up to $400\times$ training speedup while being $3\times$ more memory efficient during training compared to the state-of-the-art in rendering speed. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
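The uplifting step described above (accumulate per-region sparse coefficients onto Gaussians across views, then keep the Top-$K$ atoms) can be sketched with plain dictionaries. The data layout, the meaning of `weight` (for example a Gaussian's contribution to the pixels of a region), the ranking by absolute value, and the function name `uplift_codes` are all assumptions for illustration. The paper's actual weighting and normalization are not given in the abstract.

```python
from collections import defaultdict

def uplift_codes(region_codes, associations, top_k):
    """Accumulate 2D sparse codes onto Gaussians, then keep the top-k atoms.

    region_codes: {region_id: {atom_id: coefficient}} sparse codes per 2D region.
    associations: iterable of (gaussian_id, region_id, weight) from all views.
    """
    acc = defaultdict(lambda: defaultdict(float))
    for gaussian_id, region_id, weight in associations:
        for atom_id, coeff in region_codes[region_id].items():
            acc[gaussian_id][atom_id] += weight * coeff
    result = {}
    for gaussian_id, atoms in acc.items():
        # Rank atoms by accumulated magnitude and keep the strongest top_k.
        ranked = sorted(atoms.items(), key=lambda kv: abs(kv[1]), reverse=True)
        result[gaussian_id] = dict(ranked[:top_k])
    return result

# Two image regions with sparse codes over a hypothetical codebook.
region_codes = {
    "view0/region0": {0: 1.0, 3: 0.5},
    "view1/region0": {3: 1.0, 7: 0.2},
}
# Gaussian 0 is seen in both views, Gaussian 1 only in the second.
associations = [
    (0, "view0/region0", 0.8),
    (0, "view1/region0", 0.4),
    (1, "view1/region0", 1.0),
]
codes = uplift_codes(region_codes, associations, top_k=2)
for gaussian_id in sorted(codes):
    print(gaussian_id, codes[gaussian_id])
```

Because each Gaussian stores only `top_k` (atom, coefficient) pairs instead of a dense embedding, storage grows with `top_k` rather than with the feature dimension.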

2605.13591 2026-05-14 cs.CV Updated version

Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

Kaicong Huang, Talha Azfar, Weisong Shi, Ruimin Ke

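Affiliations * Department of Civil and Environmental Engineering, Rensselaer Polytechnic Institute; Department of Computer and Information Sciences, University of Delaware

AI summary This paper proposes Real2Sim, a physics-driven and editable Gaussian Splatting framework for generating autonomous driving scenes. The method combines 4D Gaussian Splatting with a differentiable Material Point Method solver to reconstruct temporally continuous dynamic driving scenes, support instance-level editing, and simulate realistic object-object and object-environment interactions. The framework generates diverse, high-fidelity scenarios while remaining physically plausible, including complex cases such as collisions. Experiments show strong results in rendering, reconstruction, editing, and physics simulation, indicating potential for wide use in autonomous driving tasks such as perception and trajectory prediction.

Details
English abstract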

Reliable autonomous driving relies on large-scale, well-labeled data and robust models. However, manual data collection is resource-intensive, and traditional simulation suffers from a persistent reality gap. While recent generative frameworks and radiance-field methods improve visual fidelity, they still struggle with temporal and spatial consistency and cannot ensure physics-aware behavior, limiting their applicability to driving scenario generation. To address these challenges, we propose Real2Sim, a unified framework that combines 4D Gaussian Splatting (4DGS) with a differentiable Material Point Method (MPM) solver. Real2Sim explicitly reconstructs dynamic driving scenes as temporally continuous Gaussian primitives, supports instance-level editing, and simulates realistic object-object and object-environment interactions. This framework enables physics-aware, high-fidelity synthesis of diverse, editable scenarios, including challenging corner cases such as collisions and post-impact trajectories. Experiments on the Waymo Open Dataset validate Real2Sim's capabilities in rendering, reconstruction, editing, and physics simulation, demonstrating its potential as a scalable tool for data generation in downstream tasks such as perception, tracking, trajectory prediction, and end-to-end policy learning.

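2605.13586 2026-05-14 cs.CV cs.AI Version update

HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation

Zini Chen, Junming Huang, Rong Zhang, Jiamin Xu, Cheng Peng, Chi Wang, Weiwei Xu

AI summary This paper proposes HetScene, a heterogeneity-aware diffusion model for generating dense, physically plausible indoor scenes. By distinguishing primary objects from secondary objects, it decomposes scene generation into two stages, structural layout generation and contextual layout generation, to model complex object distributions and spatial dependencies more effectively. The framework improves the controllability and physical plausibility of generated scenes and supports the construction of simulation environments for embodied AI.

Details
Abstract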
Generating controllable and physically plausible indoor scenes is a pivotal prerequisite for constructing high-fidelity simulation environments for embodied AI. However, existing deep-learning-based methods usually treat all objects as homogeneous instances within a unified generation process. While effective for sparse and simplistic layouts, they struggle to model realistic layouts with dense object arrangements and complex spatial dependencies, leading to limited scalability and degraded physical plausibility. To deal with these challenges, we revisit indoor layout generation from the perspective of structural heterogeneity and decompose the objects into primary objects and secondary objects according to their distinct roles in shaping a scene. Based on this decomposition, we propose HetScene, a heterogeneous two-stage generation framework that decouples indoor layout synthesis into Structural Layout Generation (SLG) and Contextual Layout Generation (CLG). SLG first generates globally coherent structural layouts with only primary objects conditioned on text descriptions, top-down binary room masks, and spatial relation graphs, establishing a stable global macro-skeleton of large core furniture.

2605.13583 2026-05-14 cs.CV Version update

Phy-CoSF: Physics-Guided Continuous Spectral Fields Reconstruction and Super-Resolution for Snapshot Compressive Imaging

Wudi Chen, Zhiyuan Zha, Xin Yuan, Shigang Wang, Bihan Wen, Jiantao Zhou, Gang Yan, Zipei Fan, Ce Zhu

Affiliations * College of Communication Engineering, Jilin University, Changchun 130012, China. School of Engineering, Westlake University, Hangzhou, Zhejiang 310024, China. School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore 639798. Department of Computer Information Science, University of Macau, Macau 999078, China. College of Computer Science Technology, Jilin University, Changchun 130012, China. College of Artificial Intelligence, Jilin University, Changchun 130012, China. School of Information Communication Engineering, University of Electronic Science

AI summary This paper proposes Phy-CoSF for continuous spectral reconstruction and super-resolution of hyperspectral images in coded aperture snapshot spectral imaging (CASSI) systems. The method combines deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction that can produce high-fidelity hyperspectral images at arbitrary wavelengths. Its core module, the continuous spectral fields (CoSF) module, uses cross-domain feature fusion and a dynamic prior mechanism to improve reconstruction accuracy and spectral detail preservation; experiments show that it outperforms many state-of-the-art methods.

Comments 15 pages, 10 figures, accepted by ICML 2026!

Details
Abstract

Recent advances have demonstrated that coded aperture snapshot spectral imaging (CASSI) systems show great potential for capturing 3D hyperspectral images (HSIs) from a single 2D measurement. Despite the inherent spectral continuity of scenes captured by CASSI, most existing reconstruction methods are restricted to fixed, discrete spectral outputs, thereby precluding continuous spectral reconstruction or spectral super-resolution. To address this challenge, we propose Phy-CoSF, which synergizes deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction and super-resolution in CASSI. Specifically, we propose a two-phase architecture that bridges discrete-wavelength training with continuous spectral rendering, enabling the synthesis of high-fidelity HSIs at arbitrary target wavelengths. At the core of our framework lies the continuous spectral fields (CoSF) module, embedded within each unfolding stage as a dynamic prior, which comprises a triple-branch cross-domain feature mixer for comprehensive spatial-frequency-channel feature fusion, alongside a spectral synthesis head that generates spectral intensities by querying continuous wavelength coordinates. Extensive experimental results demonstrate that Phy-CoSF not only achieves continuous modeling at arbitrary spectral resolutions but also outperforms many state-of-the-art methods in both reconstruction fidelity and spectral detail preservation. Our code and more results are available at: https://github.com/PaiDii/Phy-CoSF.git.

2605.13581 2026-05-14 cs.CV Version update

HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

Li Pang, Heng Zhao, Yijia Zhang, Deyu Meng, Xiangyong Cao

Affiliations * School of Mathematics and Statistics, Xi’an Jiaotong University; School of Computer Science and Technology, Xi’an Jiaotong University; School of Mathematics and Statistics and the Ministry of Education Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University; Pazhou Laboratory (Huangpu), Guangzhou

AI summary Hyperspectral image (HSI) restoration faces noise, blur, and resolution loss in practice, and existing models perform poorly on target-domain data that lacks clean references. This paper proposes HIR-ALIGN, a framework that uses a diffusion model to generate synthetic data matching the target-domain distribution and thereby improve restoration. It comprises three stages: proxy generation, distribution-adaptive synthesis, and aligned supervised finetuning. The method improves restoration performance on the target domain and outperforms existing methods in experiments on denoising and super-resolution.

Details
Abstract

Hyperspectral image (HSI) restoration is crucial for reliable analysis, as real HSIs suffer from degradations like noise, blur, and resolution loss. However, existing models trained on source data often fail on target domains lacking clean references, a common occurrence in practice. To address this issue, we present HIR-ALIGN, a plug-and-play target-adaptive augmentation framework that enhances hyperspectral image restoration by augmenting limited training images with synthetic data that closely matches the target distribution using no extra data. It consists of three stages: (i) proxy generation, where off-the-shelf restoration models restore degraded target observations to produce semantics-preserving proxy HSIs that approximate target-domain clean images; (ii) distribution-adaptive synthesis, where a blur-robust unCLIP diffusion model generates target-aligned RGBs from proxy RGBs, with prompt conditioning and embedding-space noise initialization. Then, a warp-based spectral transfer module synthesizes HSIs by aligning each generated RGB with the proxy RGB, estimating soft patch-wise transport weights, and applying these weights and learnable local interpolation kernels to the proxy HSI; and (iii) aligned supervised finetuning, where restoration networks pretrained on the source distribution are finetuned using both the proxy HSIs and synthesized target-aligned HSIs, and are then deployed on degraded target images. We further provide theoretical analysis showing that augmentation-based finetuning can achieve lower target-domain restoration risk by jointly improving target distribution coverage and controlling spectral bias. Extensive experiments on simulated and real datasets across denoising and super-resolution tasks demonstrate that HIR-ALIGN consistently improves source-only supervised baselines, outperforming both source-only counterparts and representative unsupervised methods.

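2605.13565 2026-05-14 cs.CV Version update

Qwen-Image-VAE-2.0 Technical Report

Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Yuxiang Chen, Zhendong Wang, Zihao Liu, Zikai Zhou, Yiliang Gu, Yi Wang, Xiaoxiao Xu, Lin Qu

Affiliations * Qwen Team

AI summary This paper presents Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) with significant advances in reconstruction fidelity and diffusability. Global Skip Connections and expanded latent channels address the reconstruction bottleneck of high compression, while large-scale image training and a synthetic rendering engine improve performance in text-rich scenarios. The work also proposes an enhanced semantic alignment strategy to improve convergence in the high-dimensional latent space and adopts an asymmetric, attention-free encoder-decoder architecture for computational efficiency. Experiments show state-of-the-art results on multiple benchmarks, with especially strong reconstruction and diffusability at high compression ratios.

Details
Abstract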

We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT experiments reveal our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability.

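2605.13544 2026-05-14 cs.CV Version update

CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

Hanwen Zhang, Yao Liu, Die Dai, Jiaye Yang, Qiao Liu, Yutong Xie, Peng Wang

Affiliations * University of Electronic Science and Technology of China; Mohamed bin Zayed University of Artificial Intelligence

AI summary This paper proposes CA-GCL, a cross-anatomy global-local contrastive learning framework for more robust 3D medical image understanding. It introduces a global contrastive objective that increases the separation between anatomical categories in the latent space, together with a clinical-aware text augmentation strategy that handles incomplete descriptions. Experiments show that CA-GCL outperforms existing methods in zero-shot abnormality detection, generalizes well across datasets, and markedly improves stability under prompt variations.

Details
Abstract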

Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.

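2605.13530 2026-05-14 cs.CV cs.AI Version update

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li, Wei Ji, Kai Wang, Shanshan Wang, Weixin Si

Affiliations * Southern University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Northwestern University; University of Alberta; Yale University; Nanfang Hospital; Shenzhen University of Advanced Technology

AI summary This paper proposes SurgMLLM, a unified surgical scene understanding framework that combines high-level semantic reasoning with low-level visual grounding, addressing the semantic inconsistency that arises when existing methods handle these components in isolation. It fine-tunes a multimodal large language model to jointly model surgical phases, instrument-verb-target triplets, and the corresponding segmentation tokens, and achieves precise pixel-level grounding through temporal aggregation and a segmentation network. Experiments show clear gains in both triplet recognition and segmentation, supporting the effectiveness of unified reasoning and grounding for surgical assistance.

Details
Abstract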

Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, which extends the CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.

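2605.13493 2026-05-14 cs.CV Version update

PhysEditBench: A Protocol-Conditioned Benchmark for Dense Physical-Map Prediction with Image Editors

Jiaxin Yang, Yu Hou, Muxin Liu, Weixuan Liu, Ze Yuan, Zeming Chen, Zhongrui Wang, Xiaojuan Qi

Affiliations * Southern University of Science and Technology; The University of Hong Kong; East China Normal University

AI summary PhysEditBench is a protocol-conditioned benchmark for evaluating how well image editors perform dense physical-map prediction, covering five targets: depth, normal, albedo, roughness, and metallic maps. It builds target-dependent evaluation data and defines fixed input-output protocols to keep evaluation standardized and reliable. Experiments show that image editors can match specialized models on some metrics but still suffer from structural errors and sensitivity to lighting.

Comments 48 pages, 12 figures, including references, appendix, and supplementary benchmark details

Details
Abstract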

Can general-purpose image editors predict physical maps from a single RGB image? General-purpose image editors differ from standard task-specific dense-prediction models: they do not directly take an image and output a physical map. Instead, they must be guided by prompts, examples, or image-based textual cues. To this end, we introduce PhysEditBench, a novel protocol-conditioned benchmark to evaluate and standardize image editors in dense physical-map prediction that covers five targets: depth, normal, albedo, roughness, and metallic maps. For evaluation data, we build a target-dependent benchmark substrate. We use OpenRooms-FF for depth, surface normal, albedo, and roughness, InteriorVerse as an additional source for depth, normal, albedo, and a new procedurally generated source for metallic maps. We curate the data with quality checks, valid-region masks, scene-level sampling, and lighting-based stress subsets to ensure reliable and diverse evaluation. For each target, PhysEditBench defines a fixed protocol that specifies the allowed input, expected output format, and scoring procedure. Each score, therefore, reflects the performance of a model under a specified protocol, rather than its best possible performance under all prompts or interaction modes. Experimental results show that specialized models remain much stronger on depth, normal, and albedo, and stronger image editors can produce more reasonable map-like outputs. For roughness and metallic, image editors can match or outperform specialized baselines on some scalar metrics, but they still suffer from structural errors, sparsity effects, and sensitivity to lighting.

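2605.13476 2026-05-14 cs.CV Version update

Neural Video Compression with Domain Transfer

Tiange Zhang, Rongqun Lin, Xiandong Meng, Haofeng Wang, Xing Tian, Qi Zhang, Siwei Ma

Affiliations * Shenzhen Graduate School, Peking University, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China; School of Computer Science, Peking University, Beijing, China

AI summary This paper studies domain transfer in neural video coding, targeting the performance drop caused by distribution differences between training and testing data. It proposes DCVC-DT, an enhanced framework whose lightweight online domain transfer mechanism dynamically adapts the encoded latent representation during inference, narrowing the domain gap without modifying encoder or decoder parameters. A frame-level dynamic rate-distortion adjustment scheme further improves compression efficiency and reconstruction quality. Experiments show higher bitrate savings than the baseline at comparable video quality, along with better generalization to unseen test data.

Comments Accepted to ISCAS 2026 as an oral paper

Details
Abstract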

Content-adaptive compression has always been a key direction in neural video coding (NVC), aiming to mitigate the domain gap between training and testing data. Such gaps often arise from distributional discrepancies between training and inference data, which may cause noticeable performance degradation when the testing content differs from the training distribution. To tackle this challenge, we propose DCVC-DT, a domain transfer enhanced neural video compression framework. Specifically, we design a lightweight online domain transfer (DT) mechanism that dynamically adapts the encoded latent representation during inference, effectively bridging the domain gap without modifying the encoder or decoder parameters. In addition, we develop a frame-level dynamic RD (Rate and Distortion) adjustment scheme that actively regulates the ratio of R and D in the loss function based on quality fluctuation, thereby improving rate-distortion performance. Extensive experiments demonstrate that DCVC-DT achieves up to 6.21% bitrate savings over the baseline DCVC-DC, while significantly enhancing generalization to unseen testing data and alleviating error propagation. Our code is available at https://github.com/SunnyMass/DCVC-DT.

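2605.13465 2026-05-14 cs.CV Version update

Z-Order Transformer for Feed-Forward Gaussian Splatting

Can Wang, Lei Liu, Wei Jiang, Dong Xu

Affiliations * The University of Hong Kong; Futurewei Technologies Inc

AI summary This paper proposes a transformer-based feed-forward Gaussian Splatting method that addresses the limited real-time capability of conventional 3D Gaussian Splatting. A Z-order strategy organizes the unordered Gaussians into a spatially coherent sequence, and a sparse attention mechanism captures their spatial and semantic relationships, so the model can efficiently model context, compress the number of Gaussians, and predict their attributes in a single forward pass. Experiments show that the method significantly speeds up novel view synthesis while maintaining rendering quality.

Comments Accepted by CVPR 2026, Oral

Details
Abstract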

Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this work, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
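
How a Z-order (Morton) curve turns an unordered 3D point set into a spatially coherent sequence can be sketched as follows. This is a generic illustration of Z-ordering, not the authors' implementation: the function names, the bounding-box quantization, and the choice of 10 bits per axis are assumptions made for the example. Each Gaussian center is quantized to an integer grid cell, the bits of the three cell indices are interleaved into one key, and sorting by that key places spatially nearby cells close together in the sequence.

```python
def morton3(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative integers (x lowest)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def z_order(points, bits=10):
    """Return point indices sorted along a Z-order curve over the bounding box."""
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    scale = (1 << bits) - 1

    def cell(p):
        # Quantize each coordinate to an integer grid cell in [0, scale].
        return [
            int((p[d] - lo[d]) / (hi[d] - lo[d]) * scale) if hi[d] > lo[d] else 0
            for d in range(3)
        ]

    return sorted(range(len(points)), key=lambda i: morton3(*cell(points[i]), bits=bits))

centers = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
order = z_order(centers, bits=1)
```

Once the Gaussians are in this order, attention restricted to a local window of the sequence mostly covers spatial neighbors, which is the usual motivation for pairing a space-filling curve with sparse attention.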

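2605.13457 2026-05-14 cs.CV Version update

OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

Chengyan Deng, Pengbin Yu, Zhentao Chen, Wei Shen, Kai Zhang, Meng Li, Lunxi Yuan, Xue Zhou, Li Yu

Affiliations * School of Automation Engineering, University of Electronic Science and Technology of China; OPPO AI Center, OPPO Inc.; School of Intelligence Science and Technology, Nanjing University

AI summary This paper proposes OP4KSR, a one-step patch-free 4K super-resolution method that addresses the GPU memory limits faced by diffusion-based real-world image super-resolution when generating 4K images directly. Built on the Flux backbone and an extreme-compression F16 VAE, it enables efficient inference under limited GPU resources while preserving global spatial-semantic coherence. To suppress the periodic artifacts this design introduces, the authors combine RoPE base frequency rescaling with an autocorrelation-based periodicity loss, and they curate a dedicated training dataset and three benchmarks to advance 4K super-resolution research.

Details
Abstract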

Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss ($\mathcal{L}_\text{AP}$). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a $4096\times4096$ output in only 5.75 seconds on a single NVIDIA H20 GPU.
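
The abstract does not define $\mathcal{L}_\text{AP}$, so the following is only a rough 1D illustration of why autocorrelation exposes periodic artifacts. The function `periodicity_penalty` is a hypothetical stand-in, not the paper's loss: it takes the normalized autocorrelation of a signal at nonzero lags and reports the strongest positive value, which is large when the signal repeats at a fixed period.

```python
def autocorrelation(signal, max_lag):
    """Normalized autocorrelation of a 1D signal for lags 1..max_lag.

    Assumes the signal is not constant (otherwise the energy is zero).
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    energy = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy
        for lag in range(1, max_lag + 1)
    ]

def periodicity_penalty(signal, max_lag):
    """Hypothetical penalty: the strongest positive correlation at a nonzero lag."""
    return max(0.0, max(autocorrelation(signal, max_lag)))

# A stripe pattern that repeats every 4 samples, as a stand-in for a grid artifact.
striped = [1.0, 0.0, 0.0, 0.0] * 8
acf = autocorrelation(striped, 6)
dominant_lag = acf.index(max(acf)) + 1
```

For the striped signal the autocorrelation peaks at the artifact period, so a penalty of this kind would push the output away from repeating structure; a training loss would need a differentiable 2D version of the same idea.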

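2605.13403 2026-05-14 cs.RO cs.CV Version update

RotVLA: Rotational Latent Action for Vision-Language-Action Model

Qiwei Li, Xicheng Gong, Xinghang Li, Peiyan Li, Quanyun Zhou, Hangjun Ye, Jiahuan Zhou, Yadong Mu

Affiliations * Wangxuan Institute of Computer Technology, Peking University; Xiaomi Robotics; CASIA

AI summary This paper proposes RotVLA, a vision-language-action (VLA) framework built on a continuous rotational latent action representation, addressing the trivial reconstruction behavior and limited expressiveness caused by discretized action representations in existing latent action models. RotVLA models latent actions as elements of SO(n), giving them continuity, compositionality, and a structured geometry aligned with real-world action dynamics, and uses a triplet frame learning framework to reinforce temporal dynamics. Experiments show strong results on multiple benchmarks, outperforming existing VLA models.

Details
Abstract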

Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete, quantization-based encode-decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.
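
The properties the abstract attributes to SO(n), continuity and compositionality, can be checked on a toy example. The sketch below is not RotVLA's parameterization (the abstract does not give one); it builds rotations from plane (Givens) rotations, which is an assumption made for illustration, and shows that composing two rotations yields another orthogonal matrix whose inverse is its transpose. Givens rotations have determinant +1 by construction; the check below verifies orthogonality only.

```python
import math

def givens(n, i, j, theta):
    """n x n rotation by angle theta in the (i, j) coordinate plane."""
    R = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    R[i][i], R[i][j], R[j][i], R[j][j] = cos_t, -sin_t, sin_t, cos_t
    return R

def matmul(A, B):
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)] for r in range(n)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def is_orthogonal(R, tol=1e-9):
    """Check R^T R = I up to a tolerance."""
    n = len(R)
    P = matmul(transpose(R), R)
    return all(
        abs(P[r][c] - (1.0 if r == c else 0.0)) < tol
        for r in range(n) for c in range(n)
    )

# Two latent "actions" in SO(4), composed into a single action.
a01 = givens(4, 0, 1, 0.3)
a23 = givens(4, 2, 3, -1.1)
composed = matmul(a23, a01)
inverse = transpose(composed)  # undoing the composed action
```

Composition staying inside the group is what lets consecutive latent actions be chained, and a small change in angle gives a small change in the matrix, in contrast to a discrete codebook index.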

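2605.13402 2026-05-14 cs.CV cs.DS Version update

Fast and Compact Graph Cuts for the Boykov-Kolmogorov Algorithm

Christian Møller Mikkelstrup, Anders Bjorholm Dahl, Philip Bille, Vedrana Andersen Dahl, Inge Li Gørtz

AI summary This paper revisits the Boykov-Kolmogorov (BK) algorithm for computing minimum $s$-$t$ cuts. It improves the time-complexity analysis of BK to $O(mn|C|)$ and proposes a new fast and compact BK (fcBK) algorithm with time complexity $O(m|C|)$. The authors also design a compact graph representation that lets the implementation handle graphs with upwards of $10^9$ vertices and $10^{10}$ edges on a machine with 128 GB of memory. Experiments show that the implementation is the fastest available implementation of the BK algorithm, underlining the importance of memory efficiency for large-scale graph cut computation.

Comments 15 pages, 6 figures, submitted to the IEEE for possible publication

Details
Abstract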

Computing a minimum $s$-$t$ cut in a graph is a solution to a wide range of computer vision problems, and is often done using the Boykov-Kolmogorov (BK) algorithm. In this paper, we revisit the BK algorithm from both a theoretical and practical point of view. We improve the analysis of the time complexity of the BK algorithm to $O(mn|C|)$ and propose a new algorithm, the fast and compact BK (fcBK) algorithm, with a time complexity of $O(m|C|)$, where $m$, $n$, and $|C|$ are the number of edges, number of vertices, and the capacity of the cut, respectively. We additionally propose a compact graph representation that allows our implementation to find a minimum $s$-$t$ cut in a graph with upwards of $10^9$ vertices and $10^{10}$ edges on a machine with 128 GB of memory. We find our implementation of the BK algorithm to be the fastest available implementation of the BK algorithm when evaluating on a comprehensive set of benchmark datasets, highlighting the importance of memory-efficient implementations. We make our implementations publicly available for further research and implementation development within minimum $s$-$t$ cut algorithms.
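
For readers unfamiliar with the problem itself, the quantity $|C|$ in the bounds above is the capacity of a minimum $s$-$t$ cut. The sketch below computes it on a tiny graph using plain BFS augmenting paths (Edmonds-Karp). This is a different and much simpler max-flow algorithm than BK or fcBK, which are not reproduced here; the dense adjacency matrix and the function name `min_st_cut` are choices made for the example and would not scale to the graph sizes the paper targets.

```python
from collections import deque

def min_st_cut(n, edges, s, t):
    """Return (cut_capacity, source_side) for a directed graph with integer capacities."""
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c

    def bfs():
        # Breadth-first search over edges with remaining residual capacity.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if parent[t] == -1:
            # No augmenting path left: vertices still reachable form the source side.
            return flow, {v for v in range(n) if parent[v] != -1}
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= bottleneck  # consume forward capacity
            cap[v][u] += bottleneck  # add residual (reverse) capacity
            v = u
        flow += bottleneck

edges = [(0, 1, 3), (0, 2, 2), (1, 2, 1), (1, 3, 2), (2, 3, 3)]
capacity, source_side = min_st_cut(4, edges, s=0, t=3)
```

By the max-flow min-cut theorem the returned flow value equals the capacity of the cut separating `source_side` from the remaining vertices.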

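2605.13396 2026-05-14 cs.CV Version update

PreFIQs: Face Image Quality Is What Survives Pruning

Jan Niklas Kolf, Guray Ozgur, Andrea Atzori, Žiga Babnik, Vitomir Štruc, Naser Damer, Fadi Boutros

Affiliations * Fraunhofer Institute for Computer Graphics Research IGD; University of Ljubljana; Technical University of Darmstadt

AI summary This paper proposes PreFIQs, a training-free and unsupervised face image quality assessment framework grounded in the Pruning Identified Exemplar (PIE) hypothesis. It measures image quality as the Euclidean distance between embeddings from a pre-trained face recognition model and its pruned counterpart. The method is supported theoretically through a Jacobian-vector product analysis and achieves competitive or superior performance compared with existing methods on multiple benchmark datasets, validating parameter pruning as an effective signal for face image quality.

Comments Accepted at CVPR 2026 Workshops

Details
Abstract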

Face Image Quality Assessment (FIQA) evaluates the utility of a face image for automated face recognition (FR) systems. In this work, we propose PreFIQs, an unsupervised and training-free FIQA framework grounded in the Pruning Identified Exemplar (PIE) hypothesis. We hypothesize that low-utility face images rely disproportionately on fragile network parameters, resulting in larger geometric displacement of their embeddings under model sparsification. Accordingly, PreFIQs quantifies image utility as the Euclidean distance between L2-normalized embeddings extracted from a pre-trained FR model and its pruned counterpart. We provide a first-order theoretical justification via a Jacobian-vector product analysis, demonstrating that this empirical drift serves as a computationally efficient approximation of the exact geometric sensitivity of the latent embedding manifold. Extensive experiments across eight benchmarks and four FR models demonstrate that PreFIQs achieves competitive or superior performance compared to state-of-the-art FIQA methods, including establishing new state-of-the-art results on several benchmarks, without any training or supervision. These results validate parameter sparsification as a principled and practically efficient signal for face image utility, and demonstrate that quality is, in essence, what survives pruning.
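
The core measurement is simple enough to write down. The sketch below assumes the two embeddings of one image, from the dense model and from its pruned counterpart, are already available as plain lists of floats; the function name `embedding_drift` and the toy vectors are illustrative and not taken from the paper. How the distance is mapped to a final quality score (for example by negation, so that a larger drift means lower utility) is not specified in the abstract and is left out.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def embedding_drift(dense_emb, pruned_emb):
    """Euclidean distance between the L2-normalized dense and pruned embeddings."""
    a, b = l2_normalize(dense_emb), l2_normalize(pruned_emb)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors standing in for face embeddings.
stable = embedding_drift([3.0, 4.0], [6.0, 8.0])    # same direction after pruning
unstable = embedding_drift([1.0, 0.0], [0.0, 1.0])  # direction changed by pruning
```

Because both embeddings are normalized first, only a change in direction contributes to the drift, so an image whose embedding merely changes scale under pruning is treated as unaffected.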

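2605.13395 2026-05-14 cs.LG cs.CV Version update

Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation

Lilin Zhang, Yimo Guo, Yue Li, Jiancheng Shi, Xianggen Liu

Affiliations * Sichuan University; Dongfang Electric (Chengdu) Innovation Research Co., Ltd.; Southwest China Research Institute of Electronic Equipment

AI summary This paper studies adversarial training of deep neural networks on long-tail data and shows that conventional adversarial training suffers from a skewed training objective and unstable adversarial distributions when classes are imbalanced. The authors propose adaptively adjusting adversarial perturbations to improve robustness and class balance at the same time, and design RobustLT, a plug-and-play framework. Experiments show that it improves adversarial robustness and class balance on multiple long-tailed datasets.

Comments accepted by CVPR 2026

Details
Abstract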

Deep neural networks are highly vulnerable to adversarial examples, i.e., small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real-world long-tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long-tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose RobustLT, a plug-and-play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that RobustLT consistently enhances adversarial robustness and class-balance on long-tailed datasets. The code is available at https://github.com/zhang-lilin/RobustLT.

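2605.13381 2026-05-14 cs.CV cs.MM Version update

Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics

Chiara Musso, Joy Battocchio, Andrea Montibeller, Giulia Boato

Affiliations * University of Trento

AI summary As AI-generated images become increasingly realistic, Vision Transformers (ViTs) have become a core technology in modern deepfake detection. However, existing methods commonly rely on frozen pre-trained backbones, which introduces a subtle but critical vulnerability. This paper proposes the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that uses only knowledge of the target detector's ViT backbone to craft effective adversarial examples within the detector's feature space. Experiments show success rates close to white-box attacks across diverse scenarios, revealing that backbone knowledge alone can severely undermine detector reliability and highlighting the need for more robust defenses in adversarial multimedia forensics.

Details
Abstract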

As AI-generated synthetic images become increasingly realistic, Vision Transformers (ViTs) have emerged as a cornerstone of modern deepfake detection. However, the prevailing reliance on frozen, pre-trained backbones introduces a subtle yet critical vulnerability. In this work, we present the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that exploits knowledge of the detector's ViT backbone alone and operates entirely within the target detector's feature space to craft highly effective adversarial examples. Through our experiments, involving multiple ViT-based detectors and diverse gray-box scenarios, including few-shot learning, complete training misalignment and attack transferability tests, we demonstrate that this vulnerability consistently yields high attack success rates, often approaching white-box performance. By doing so, we reveal that backbone knowledge alone is sufficient to undermine detector reliability, highlighting the urgent need for more resilient defenses in adversarial multimedia forensics.

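2605.13375 2026-05-14 cs.CV cs.AI Version update

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao

Affiliations * Institute for AI Industry Research (AIR), Tsinghua University; Juhaokan Technology Co., Ltd; Nanjing University; University of Science and Technology of China

AI summary In vision-language models (VLMs), processing large numbers of visual tokens incurs high computational overhead. This paper proposes GRIP-VLM, a reinforcement-learning-based Group-Relative Importance Pruning framework that formulates pruning as a Markov Decision Process and uses Group Relative Policy Optimization (GRPO), anchored by a supervised warm-up, to explore the discrete selection space directly, avoiding the sub-optimal solutions of continuous approximations. Combined with a budget-aware scorer, it evaluates token importance dynamically and adapts to different compression ratios without retraining. Experiments show that it outperforms heuristic and supervised-learning baselines on multiple multimodal benchmarks, delivering up to a 15% inference speedup at equal accuracy.

Comments 10 pages, 11 figures

Details
Abstract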

In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
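
The "group-relative" part of GRPO can be sketched in a few lines. In the usual formulation, several candidate actions are sampled for the same input, each receives a scalar reward, and each reward is standardized against the mean and standard deviation of its own group. In the sketch below the candidates would be sampled pruning masks for one image; the reward values are made up for illustration, and how GRIP-VLM actually defines its reward is not stated in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four pruning masks sampled for the same input.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Masks that score above the group average get a positive advantage and are reinforced, and those below are discouraged, so no separate learned value function is needed as a baseline.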

2605.13366 2026-05-14 cs.CV cs.LG Version update

Neural Surrogate Forward Modelling For Electrocardiology Without Explicit Intracellular Conductivity Tensor

Shaheim Ogbomo-Harmitt, Cesare Magnetti, Jakub Grzelak, Oleg Aslanidi

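Affiliations * King’s College London; PhysicsX

AI summary This study addresses forward modelling in non-invasive cardiac electrophysiology and proposes a deep learning approach that predicts far-field ECGs directly from left atrial intracellular potentials without requiring explicit intracellular conductivity tensors as input. The model learns the mapping from potentials to ECGs, avoiding the structural errors introduced by conductivity tensors that are hard to measure in conventional physics-based models. Trained on data from only 74 subjects, the model achieved high predictive accuracy (R² of 0.949 ± 0.037), showing potential to improve non-invasive assessment of atrial fibrillation.

Comments Accepted into the 9th International Conference on Computational and Mathematical Biomedical Engineering (CMBE2026)

Details
English abstract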

Accurate forward modelling is essential for non-invasive cardiac electrophysiology, particularly in atrial fibrillation, where electrical activation is highly disorganised. Conventional physics-based forward models require explicit specification of intracellular conductivity tensors, which are not directly measurable in clinical practice and introduce structural modelling errors. This proof-of-concept study presents a deep learning approach that learns a direct mapping from left atrial intracellular electrical potentials to far-field ECGs without requiring explicit intracellular conductivity inputs at inference time. Despite training only on 74 subjects, the model achieved an R² of 0.949 ± 0.037, highlighting potential to reduce structural uncertainty and improve non-invasive AF assessment.
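
For readers unfamiliar with the reported metric, the sketch below computes the standard coefficient of determination R² between a reference signal and a prediction. The sample values are made up, and the abstract does not state how R² was aggregated (for example per subject or per ECG lead), so this shows only the basic definition.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference and predicted ECG samples (arbitrary units).
reference = [0.0, 0.4, 1.0, 0.3, -0.2]
predicted = [0.05, 0.35, 0.9, 0.35, -0.15]
print(round(r_squared(reference, predicted), 3))
```

A value of 1 means a perfect fit, while a value of 0 means the prediction is no better than always outputting the mean of the reference.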

2605.13349 2026-05-14 cs.CV Version update

Drag within Prior Distribution: Text-Conditioned Point-Based Image Editing within Distribution Constraints

Haoyang Hu, Masataka Seo, Yen-Wei Chen

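Affiliations * Ritsumeikan University, Graduate School of Information; Engineering, Osaka Institute of Technology

AI summary This paper studies how to perform text-conditioned point-based editing within a diffusion model framework while preserving semantic consistency and respecting distribution constraints. To address the ambiguous trajectories of conventional point-based editing and the unnatural artifacts caused by overly large edits, the authors introduce a CLIP-based guidance mechanism and a prior-preservation loss that keep the editing process within the diffusion prior distribution. They also propose a directionally-weighted point tracking mechanism that improves the accuracy of fine-grained edits and the generation quality.

Comments ICASSP 2026 oral

Details
English abstract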

Diffusion-based point editing methods have gained significant traction in image editing tasks due to their ability to manipulate image semantics and fine details by applying localized perturbations on the manifold of the noise latent. However, these approaches face several limitations. Traditional point-based editing relies on pairs of handle and target points to define motion trajectories, which can introduce ambiguity or unnecessary alterations. Furthermore, when the distance between the handle and target points is large, the accumulated perturbations often cause the noise latent to deviate from the inversion score trajectory, resulting in unnatural artifacts. To address these issues in global editing tasks, we introduce a CLIP-based model to evaluate and guide intermediate editing steps, ensuring that the generated results remain semantically aligned. Additionally, we propose a prior-preservation loss that constrains the optimized latent code to stay within the sampling space of the diffusion prior, which improves consistency with the original data distribution and keeps generation along a familiar score trajectory. For fine-grained tasks, we present a directionally-weighted point tracking mechanism that steers the editing process toward the target direction within similar feature regions. This improves both the tracking accuracy and generation quality, while also reducing the editing time.

2605.12088 2026-05-14 cs.CV Version update

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

Yiyan Xu, Qiulin Wang, Wenjie Wang, Yunyao Mao, Xintao Wang, Pengfei Wan, Kun Gai, Fuli Feng

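Affiliations * University of Science and Technology of China; Kling Team, Kuaishou Technology

AI summary This paper studies multi-reference image generation: synthesizing images from textual instructions while faithfully preserving the subject identities and appearance details of multiple reference images. Existing methods typically process semantic and appearance features separately, which makes it hard for the model to associate each subject with details from the correct reference image and leads to attribute leakage and cross-reference confusion. The authors propose UniCustom, a framework that fuses ViT and VAE features before VLM encoding so that the model learns subject semantics and appearance jointly, and they further improve generation quality with a two-stage training strategy and a slot-wise binding regularization. Experiments show that UniCustom significantly outperforms existing methods on multiple benchmarks.

Details
English abstract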

Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.

2605.12072 2026-05-14 cs.CV Version update

PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting

Hantang Li, Qiang Zhu, Xiandong Meng, Xingtao Wang, Debin Zhao, Xiaopeng Fan

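Affiliations * School of Computer Science, Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China; Smart Coding Institute, Pengcheng Laboratory, Shenzhen, China; School of Computer Science, Harbin Institute of Technology, Harbin, China

AI summary PairDropGS is a paired-dropout consistency regularization method that aims to improve the stability and quality of Gaussian Splatting reconstruction under sparse views. It constructs paired dropout subsets from a shared Gaussian field and applies a low-frequency consistency regularization to keep scene layout and coarse geometry stable while avoiding excessive constraints on high-frequency details. PairDropGS also adopts a progressive consistency scheduling strategy to make training more robust. Experiments show that it achieves better reconstruction than existing methods on multiple benchmark datasets.

Comments 11 pages, 8 figures

Details
English abstract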

Dropout-based sparse-view 3D Gaussian Splatting (3DGS) methods alleviate overfitting by randomly suppressing Gaussian primitives during training. Existing methods mainly focus on designing increasingly sophisticated dropout strategies, while they overlook the resulting inconsistencies among different dropped Gaussian subsets. This oversight often leads to unstable reconstruction and suboptimal Gaussian representation learning. In this paper, we revisit dropout-based sparse-view 3DGS from a consistency regularization perspective and propose PairDropGS, a Paired Dropout-induced Consistency Regularization framework for sparse-view Gaussian splatting. Specifically, PairDropGS first constructs a pair of dropped Gaussian subsets from a shared Gaussian field and designs a low-frequency consistency regularization to constrain their low-frequency rendered structures. This design encourages the shared Gaussian field to preserve stable scene layout and coarse geometry under different random dropouts, while avoiding excessive constraints on ambiguous high-frequency details. Moreover, we introduce a progressive consistency scheduling strategy to gradually strengthen the consistency regularization during training for stability and robustness of reconstruction. Extensive experiments on widely-used sparse-view benchmarks demonstrate that PairDropGS achieves superior training stability and significantly outperforms existing dropout-based 3DGS methods in reconstruction quality, while remaining simple and plug-and-play for improving dropout-based optimization.
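
The idea of constraining only low-frequency structure between two dropout renderings can be sketched as follows. This is a toy illustration under explicit assumptions, not the paper's formulation: average pooling stands in for the low-pass step, an L1 difference stands in for the consistency loss, and a linear ramp stands in for the progressive schedule. All function names are hypothetical, and the renderings are plain nested lists rather than outputs of a real 3DGS renderer.

```python
def avg_pool(image, size):
    """Crude low-pass filter: mean over non-overlapping size x size blocks."""
    height, width = len(image), len(image[0])
    pooled = []
    for r in range(0, height - size + 1, size):
        row = []
        for c in range(0, width - size + 1, size):
            block = [image[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def low_freq_consistency(render_a, render_b, size=2):
    """Mean absolute difference between the low-passed renderings."""
    low_a, low_b = avg_pool(render_a, size), avg_pool(render_b, size)
    diffs = [abs(a - b) for ra, rb in zip(low_a, low_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def consistency_weight(step, total_steps, max_weight=1.0):
    """Assumed linear ramp that strengthens the regularizer during training."""
    return max_weight * min(1.0, step / total_steps)


# Two renderings that differ only in high-frequency detail incur no penalty.
render_a = [[1.0, 0.0], [0.0, 1.0]]
render_b = [[0.0, 1.0], [1.0, 0.0]]
print(low_freq_consistency(render_a, render_b))
print(consistency_weight(step=50, total_steps=100))
```

In this toy setup the two checkerboard renderings pool to the same block mean, so the penalty is zero even though every pixel differs, which mirrors the stated goal of leaving ambiguous high-frequency detail unconstrained.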

2605.10556 2026-05-14 cs.CV cs.LG Version update

EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving

Vittorio Palladino, Gianluca Palermo, Michael E. Papka, Zhiling Lan

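Affiliations * University of Illinois Chicago

AI summary As large language model architectures diversify and serve multimodal workloads on heterogeneous accelerators, optimizing inference energy has become as critical as optimizing latency and throughput. Existing approaches either use latency as an energy proxy or rely on data-hungry black-box models, and neither adapts well to different parallelism strategies. This paper proposes EnergyLens, which uses symbolic regression over profiling data to derive a twelve-parameter closed-form energy model that describes how system properties such as degree of parallelism, batch size, and sequence length affect energy. Its predictions are physically interpretable, and it needs only a small number of profiling samples to achieve accurate configuration selection and to generalize across hardware platforms.

Comments 10 pages

Details
English abstract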

As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.

2605.09020 2026-05-14 cs.CV Version update

The Direct Integration Theorem: A Rigorous Framework for Consistent Discrete Solutions of the Inverse Radon Problem

Mikhail G. Mozerov

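Affiliations * Institute for Information Transmission Problems, Russian Academy of Sciences

AI summary This paper proposes a new Direct Integration Theorem (DIT), a non-trivial corollary of the classical Central Slice Theorem (CST), which provides a rigorous framework for a mathematically consistent transition from the continuous to the discrete domain, a fundamental difficulty in computed tomography. The method requires neither conventional ramp filtering nor frequency-domain interpolation, avoiding the zero-frequency singularity and spectral distortions, and achieves quasi-exact reconstruction whose error is determined by the sampling parameters and grid geometry. Experiments show that it outperforms conventional Filtered Back Projection (FBP) in preserving image variance, reconstruction quality, and reprojection fidelity, substantially improving recovery of the image's statistical characteristics.

Comments Submitted to IEEE TPAMI. Code and data available at https://github.com/Mozerov-iitp/radon-dit/

Details
English abstract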

This paper presents a novel Direct Integration Theorem (DIT), derived as a non-trivial corollary of the classical Central Slice Theorem (CST). The DIT provides a mathematically consistent transition from the continuous to the discrete domain - a fundamental challenge in computed tomography - thereby eliminating the need for frequency-domain interpolation without resorting to conventional ramp-filtering. The proposed approach circumvents two principal limitations inherent in traditional methods: (i) the zero-frequency singularity and spectral distortions introduced by the mandatory ramp-filtering step, and (ii) discretization inaccuracies associated with frequency-domain interpolation. Based on the DIT, we develop a rigorous framework for consistent discrete solutions of the inverse Radon problem. Mathematical modeling demonstrates that this approach achieves quasi-exact reconstruction, with errors constrained solely by sampling parameters and grid geometry. Furthermore, while Filtered Back Projection (FBP) inherently distorts the variance of the reconstructed image, the DIT-based algorithm preserves it. Comparative simulations confirm that the proposed method eliminates common artifacts, such as intensity cupping, and consistently outperforms FBP in terms of PSNR, SSIM, and reprojection fidelity, faithfully restoring the original image's statistical characteristics.

2605.07653 2026-05-14 cs.CV eess.IV Version update

Aquatic Neuromorphic Optical Flow

Pei Zhang, Yunkai Liang, Kaiqiang Wang

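Affiliations * School of Electrical Engineering, Guangxi University; Baise Artificial Intelligence Innovation and Development Center; School of Physical Science and Technology, Northwestern Polytechnical University

AI summary This paper studies optical flow estimation with neuromorphic vision in underwater environments and proposes a self-supervised framework built on spiking neural networks that efficiently estimates per-pixel optical flow from asynchronous event streams, overcoming the bottleneck of scarce underwater data. The method delivers competitive visual and quantitative performance with markedly better computational efficiency, offering a lightweight, real-time, low-cost perception solution for resource-constrained underwater edge platforms.

Comments This work is under review. Project page: https://github.com/pz-even/event_underwater_optical_flow

Details
English abstract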

Underwater environments impose severe constraints on conventional imaging systems and demand solutions that balance high-quality sensing with strict resource efficiency. While emerging event cameras offer a promising alternative, their potential in aquatic scenarios remains largely unexplored. Through the lens of neuromorphic vision, this work pioneers the investigation of motion fields that serve as key media for agile underwater perception. Built upon spiking neural networks, we introduce a self-supervised framework to estimate per-pixel optical flow from asynchronous event streams, elegantly bypassing the long-standing bottleneck of underwater data scarcity. Extensive evaluations demonstrate that our method achieves competitive visual and quantitative results against leading techniques while operating with superior computational efficiency. By bridging neuromorphic sensing and aquatic intelligence, this work opens new frontiers for lightweight, real-time, and low-cost perception on resource-constrained underwater edge platforms.

2605.05876 2026-05-14 cs.GR cs.CV Version update

3DSS: 3D Surface Splatting for Inverse Rendering

Mae Younes, Adnane Boukhayma

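Affiliations * INRIA, University of Rennes

AI summary This paper presents 3D Surface Splatting (3DSS), a differentiable surface splatting renderer for physically-based inverse rendering from multi-view images. Its core idea is to formulate the surface separation problem directly in terms of the reconstruction kernels, from which the authors derive a coverage-based compositing model that yields anti-aliased silhouettes and visibility gradients in sparsely covered regions. Combined with co-optimized HDR environment lighting and density-aware adaptive refinement, 3DSS jointly recovers shape, spatially-varying material properties, and illumination, and connects naturally to mesh-based workflows through surface reconstruction from oriented point clouds.

Details
English abstract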

We present 3D Surface Splatting (3DSS), the first differentiable surface splatting renderer for physically-based inverse rendering from multi-view images. Our central insight is that the surface separation problem at the heart of surface splatting admits a direct formulation in terms of the reconstruction kernels themselves. From this foundation we derive a coverage-based compositing model whose per-layer opacity arises directly from the accumulated Elliptical Weighted Average reconstruction weight, yielding anti-aliased silhouettes and informative visibility gradients at sparsely covered edges. Combined with forward microfacet shading under co-optimized HDR environment lighting and density-aware adaptive refinement, 3DSS jointly recovers shape, spatially-varying BRDF materials, and illumination. Because the optimized representation is a set of oriented surface samples, it bridges natively to mesh-based workflows via surface reconstruction from oriented point cloud methods. We evaluate 3DSS against mesh-based, implicit, and Gaussian-splatting baselines across geometry reconstruction, novel-view synthesis, and novel-illumination relighting.

2605.04557 2026-05-14 cs.CV cs.AI Version update

Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis

Vlad Vasilescu, Daniela Faur, Teodor Costachioiu

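Affiliations * Univ. POLITEHNICA Bucharest SIGMA Lab, CAMPUS Institute; Univ. POLITEHNICA Bucharest GEOSENSE, CAMPUS Institute

AI summary This paper studies how to efficiently synthesize geometry-controlled high-resolution satellite images, addressing the scarcity and cost of such imagery, which hampers the development and testing of models for land-cover classification, change detection, and disaster monitoring. The authors build on existing pre-trained diffusion models and add windowed cross-attention modules that control generation using only skip connection features, a simple and efficient design. Experiments show performance comparable to existing control techniques with better alignment to the geometry control map. The paper also points out limitations of current evaluation approaches and stresses the importance of consistent alignment assessment.

Comments 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)

Details
English abstract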

High-resolution satellite images are often scarce and costly, especially for remote areas or infrequent events. This shortage hampers the development and testing of machine learning models for land-cover classification, change detection, and disaster monitoring. In this paper, we tackle the problem of geometry-controlled high-resolution satellite image synthesis by adding control over existing pre-trained diffusion models. We propose a simple yet efficient method for controlling the synthesis process by leveraging only skip connection features using windowed cross-attention modules. Several previously established control techniques are compared, indicating that our method achieves comparable performance while leading to a better alignment with the geometry control map. We also discuss the limitations in current evaluation approaches, underscoring the need for a consistent alignment assessment.

2605.02752 2026-05-14 cs.CV Version update

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

Giacomo Pacini, Luca Ciampi, Nicola Messina, Nicola Tonellotto, Giuseppe Amato, Fabrizio Falchi

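Affiliations * Institute of Information Science and Technologies of the National Research Council (ISTI-CNR); University of Pisa - Department of Information Engineering

AI summary This paper studies semantic grounding in open-world text-guided class-agnostic counting (CAC), showing that current models fall short in relating text prompts to the visual scene, which makes their counts unreliable. The authors propose a new evaluation framework, PrACo++, with new protocols including a negative-label test and a distractor test, and build the MUCCA dataset with multi-category annotations. Experiments show that although existing models perform well on standard metrics, they have clear weaknesses in semantic understanding and grounding, highlighting the need for more semantically aware models.

Comments Code available at https://github.com/ciampluca/PrACo

Details
English abstract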

Open-world text-guided class-agnostic counting (CAC) has emerged as a flexible paradigm for counting arbitrary object classes by using natural language prompts. However, current evaluation protocols primarily focus on standard counting errors within single-category images, overlooking a fundamental requirement: the ability to correctly ground the textual prompt in the visual scene. In this paper, we show that several state-of-the-art CAC models often struggle to determine which object class should be counted based on the given prompt, revealing a misalignment between textual semantics and visual object representations. This limitation leads to spurious counting responses and reduced reliability in real-world scenarios. To systematically address these limitations, we propose a new evaluation framework focused on model robustness and trustworthiness. Our contribution is two-fold: (i) we introduce PrACo++ (Prompt-Aware Counting++), a novel test suite featuring two dedicated evaluation protocols -- the negative-label test and the distractor test -- paired with new specialized metrics; and (ii) we present the MUCCA (MUlti-Category Class-Agnostic counting) evaluation dataset, a new collection of real-world images featuring multiple annotated object categories per scene, unlike existing CAC benchmarks that typically include a single category per image. Our extensive experimental evaluation of 10 state-of-the-art methods shows that, despite strong performance under standard counting metrics, current models exhibit significant weaknesses in understanding and grounding object class descriptions. Finally, we provide a quantitative analysis of how semantic similarity between prompts influences these failures. Overall, our results underscore the need for more semantically grounded architectures and offer a reliable framework for future assessment in open-world text-guided CAC methods.

2605.02521 2026-05-14 cs.CV Version update

MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

Xinyi Yin, Yiduo Wang, Tingqi Hu, Meicong Si, Yunyun Shi, Shi Chen, Hao Wang, Junxiao Xue, Xuecheng Wu

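Affiliations * School of Cyber Science and Engineering, Zhengzhou University; School of Computer Science and Technology, Xi’an Jiaotong University; School of Journalism and New Media, Xi’an Jiaotong University; Research Center for Space Computing System, Zhejiang Lab

AI summary This paper proposes MooD, a perception-enhanced, efficient affective image editing framework based on a continuous Valence-Arousal (VA) model, which targets the shortcomings of existing affective image editing methods in inference efficiency and continuous emotion modelling. MooD introduces a VA-aware retrieval strategy and combines visual transfer with perception-enhanced semantic guidance to achieve fine-grained, efficient, and controllable affective editing. To compensate for the limited coverage of natural scenes in existing datasets, the authors also build AffectSet, a dataset spanning diverse scenarios, which further improves the model's performance and generalization.

Details
English abstract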

Affective Image Editing (AIE) aims to modify visual content to evoke targeted emotions. Although current approaches achieve impressive editing quality, they often overlook inference efficiency, which limits their applicability in computational social scenarios. Moreover, most methods depend on discrete emotion representations, which hinder the continuous modeling of complex human emotions and constrain expressive capabilities in interactive scenarios. To tackle these gaps, we propose MooD, the first framework that directly leverages continuous Valence-Arousal (VA) values as editing instructions for fine-grained and efficient AIE in computational social systems. Specifically, we first introduce a VA-Aware retrieval strategy to bridge vague affective values and detailed visual semantics. Building upon this, MooD integrates visual transfer and perception-enhanced semantic guidance to achieve controllable AIE. Furthermore, because existing VA-annotated datasets mainly focus on social scenarios and largely overlook natural scenes, we construct AffectSet, a comprehensive VA-annotated dataset covering diverse scenarios, to support model optimization and evaluation. Extensive qualitative and quantitative experimental results demonstrate that MooD achieves superior performance in both affective controllability and visual fidelity while maintaining high efficiency. A series of ablation studies further reveal the crucial factors of our design.

2604.28045 2026-05-14 cs.CV Version update

TAFA-GSGC: Group-wise Scalable Point Cloud Geometry Compression with Progressive Residual Refinement

Xiumei Li, Alexander Kopte, André Kaup

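AI summary This paper proposes TAFA-GSGC, a scalable point cloud geometry compression method that enables multi-quality decoding from a single bitstream and a single trained model. It combines layered residual refinement with channel-group entropy coding and introduces a Target-Aligned Feature Aggregation module to reduce cross-layer redundancy in enhancement residuals. Experiments show that TAFA-GSGC supports up to 9 decodable quality levels while maintaining good compression efficiency, with average BD-rate reductions of 4.99% and 5.92% in terms of D1-PSNR and D2-PSNR, respectively.

Comments Accepted at IEEE International Conference on Image Processing (ICIP) 2026

Details
English abstract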

Scalable compression is essential for bandwidth-adaptive transmission, yet most learned codecs are optimized for a fixed rate-distortion point, making rate adaptation costly due to re-encoding or maintaining multiple bitstreams. In this work, we propose TAFA-GSGC, a scalable learned point cloud geometry codec that enables multi-quality decoding from a single bitstream and a single trained model. TAFA-GSGC combines layered residual refinement with channel-group entropy coding, and introduces a Target-Aligned Feature Aggregation module to reduce cross-layer redundancy in enhancement residuals. Our framework supports up to 9 decodable quality levels with monotonic quality improvement as more subbitstreams are received, while maintaining strong compression efficiency. Compared with the PCGCv2 baseline, TAFA-GSGC demonstrates improved RD performance, achieving average BD-rate reductions of 4.99% and 5.92% in terms of D1-PSNR and D2-PSNR, respectively.

2604.23018 2026-05-14 cs.CV cs.AI cs.LG Version update

AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

Mohammad Sadegh Salehi, Alex Perkins, Igor Maurell, Ashkan Dabbagh, Raymond Wong

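Affiliations * Zero One Creative

AI summary This work presents AmaraSpatial-10K, a 3D dataset intended to address the difficulty of deploying existing large-scale 3D assets in spatial computing and embodied AI applications. The dataset contains over 10,000 optimized synthetic 3D assets, each with accurate metric scale, a deterministic anchor, separated physically-based material maps, and multi-sentence text metadata, ready for direct use. The work also introduces a reusable evaluation suite and reports clear gains in image retrieval, physics simulation, and cross-modal alignment.

Details
English abstract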

Web-scale 3D asset collections are abundant but rarely deployment-ready, suffering from arbitrary metric scaling, incorrect pivots, brittle geometry, and incomplete textures, defects that limit their use in embodied AI, robotics, and spatial computing. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets optimised for zero-shot deployment. Each asset ships as a metric-scaled, deterministically anchored .glb with separated PBR maps, a convex collision hull, a paired reference image, and multi-sentence text metadata. Alongside the dataset we introduce a reusable evaluation suite for 3D asset banks (a continuous Scale Plausibility Score (SPS), an LLM Concept Density metric, anchor-error auditing, and a cross-modal CLIP coherence protocol) and apply it to AmaraSpatial-10K alongside matched subsets of Objaverse, HSSD, ABO, and GSO. AmaraSpatial-10K improves CLIP Recall@5 by 3.4× over Objaverse (0.612 vs. 0.181, median rank 267 → 3), achieves a 99.1% physics-stability rate under Habitat-Sim with ~20× wall-time speed-up, and produces zero-overlap scenes when used as a drop-in asset bank for Holodeck. Controlled ablations on the same asset bank attribute the retrieval gain to description richness.
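
The Recall@5 figure above is a retrieval metric, and a minimal sketch of Recall@K may help interpret it. The sketch assumes a square similarity matrix in which query i should retrieve candidate i (for example, text description i against asset image i). The matrix values and the function name are illustrative assumptions, and the paper's exact CLIP coherence protocol is not described in the abstract.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical cosine similarities for three text queries and three assets.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(similarity, k=1))
print(recall_at_k(similarity, k=3))
```

Under this definition a Recall@5 of 0.612 would mean that about 61% of queries place their matching asset among the five highest-scoring candidates.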

2604.22686 2026-05-14 cs.CV Version update

SS3D: End2End Self-Supervised 3D from Web Videos

Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera

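Affiliations * U2IS, ENSTA – Institut Polytechnique de Paris; Pôle Recherche, Agence Ministérielle pour l’IA de Défense

AI summary This paper presents SS3D, a large-scale SfM-based self-supervised pretraining method for end-to-end 3D estimation from monocular video. The model jointly predicts depth, camera motion, and intrinsics in a single forward pass and is trained and evaluated under a unified single-checkpoint evaluation protocol. To handle the weak multi-view observability and strong heterogeneity of web video, the authors introduce a multi-view signal proxy (MVS) for filtering and curriculum sampling and distill expert training into a single student model, which substantially improves performance.

Details
English abstract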

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.

2604.21360 2026-05-14 cs.CV Version update

Prototype-Based Test-Time Adaptation of Vision-Language Models

Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji

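Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

AI summary This paper proposes Prototype-Based Test-Time Adaptation (PTA) to improve the performance of vision-language models at test time. The method builds class-specific knowledge prototypes that accumulate information from test samples, and weights the prototypes adaptively according to each sample's zero-shot classification confidence, improving adaptation to new data. Unlike cache-based adaptation methods, PTA does not need to maintain or query a cache, which markedly improves inference efficiency, and it outperforms existing methods on multiple image recognition and point cloud analysis benchmarks.

Details
English abstract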

Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
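
The abstract describes folding each test sample's visual feature into a class-specific prototype, weighted by zero-shot confidence. One plausible way to realize that is a confidence-weighted running mean, sketched below. This is an assumed update rule for illustration only, not PTA's actual formula: the hard assignment to the top predicted class, the accumulated-weight bookkeeping, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_prototypes(prototypes, weights, feature, zero_shot_logits):
    """Fold one test feature into the prototype of its predicted class.

    prototypes[c] is the running prototype of class c, and weights[c] is the
    confidence mass accumulated so far. Both are updated in place.
    """
    probs = softmax(zero_shot_logits)
    cls = max(range(len(probs)), key=lambda i: probs[i])
    conf = probs[cls]
    new_weight = weights[cls] + conf
    prototypes[cls] = [
        (weights[cls] * p + conf * f) / new_weight
        for p, f in zip(prototypes[cls], feature)
    ]
    weights[cls] = new_weight
    return cls, conf


# Two classes with 2-D features; both samples are predicted as class 0.
prototypes = [[0.0, 0.0], [0.0, 0.0]]
weights = [0.0, 0.0]
update_prototypes(prototypes, weights, [1.0, 0.0], [2.0, 0.0])
update_prototypes(prototypes, weights, [0.0, 1.0], [2.0, 0.0])
print(prototypes[0])
```

Storage here is one vector and one scalar per class regardless of how many test samples arrive, which reflects the efficiency argument the abstract makes against caches that grow with the data.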

2604.10755 2026-05-14 cs.CV Version update

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

Junzhi Ning, Jiashi Lin, Yingying Fang, Wei Li, Jiyao Liu, Cheng Tang, Chenglong Ma, Wenhao Tang, Tianbin Li, Ziyan Huang, Guang Yang, Junjun He

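Affiliations * Shanghai AI Laboratory; Imperial College London; Shanghai Jiao Tong University; Fudan University

AI summary This paper introduces MMRareBench, to the authors' knowledge the first multimodal, multi-image medical benchmark for rare diseases, designed to evaluate models across four clinical workflow tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark contains 1,756 question-answer pairs and 7,958 medical images, uses Orphanet-based ontology alignment and a strict evaluation protocol, and systematically reveals that existing large language models handle multi-image information poorly in rare-disease scenarios, with especially weak treatment planning. The results show that medical-domain models, although relatively strong on diagnosis, still trail general-purpose models substantially on multi-image tasks.

Details
English abstract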

Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.

2604.10634 2026-05-14 cs.CV Version update

NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Xin Li, Yeying Jin, Suhang Yao, Beibei Lin, Zhaoxin Fan, Wending Yan, Xin Jin, Zongwei Wu, Bingchen Li, Peishu Shi, Yufei Wang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Runzhe Li, Kui Jiang, Zhaocheng Yu, Yiang Chen, Junjun Jiang, Xianming Liu, Hongde Gu, Zeliang Li, Mache You, Jiangxin Dong, Jinshan Pan, Qiyu Rong, Bowen Shao, Hongyuan Jing, Mengmeng Zhang, Bo Ding, Hui Zhang, Yi Ren, Mohab Kishawy, Jun Chen, Anh-Kiet Duong, Petra Gomez-Kramer, Jean-Michel Carozza, Wangzhi Xing, Xin Lu, Enxuan Gu, Jingxi Zhang, Diqi Chen, Qiaosi Yi, Bingcai Wei, Wenjie Li, Bowen Tie, Heng Guo, Zhanyu Ma, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Cici Liu, Yaokun Shi, Paula Garrido Mellado, Daniel Feijoo, Alvaro Garcia Lara, Marcos V. Conde, Zhidong Zhu, Bangshu Xiong, Qiaofeng Ou, Zhibo Rao, Wei Li, Zida Zhang, Hui Geng, Qisheng Xu, Xuyao Deng, Changjian Wang, Kele Xu, Guanglu Dong, Qiyao Zhao, Tianheng Zheng, Chunlei Li, Lichao Mou, Chao Ren, Chang-De Peng, Chieh-Yu Tsai, Guan-Cheng Liu, Li-Wei Kang, Abhishek Rajak, Milan Kumar Singh, Ankit Kumar, Dimple Sonone, Kishor Upla, Kiran Raja, Huilin Zhao, Xing Xu, Chuan Chen, Yeming Lao, Wenjing Xun, Li Yang, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Hao Yang, Ruikun Zhang, Liyuan Pan

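AI summary This paper gives an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Built on the real-world Raindrop Clarity dataset, the challenge aims to establish a practical raindrop-removal benchmark across varied illumination and focus conditions. It drew 168 registered teams, 17 of which submitted final solutions that performed well on the test set, showing continued progress in the field.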

Comments Accepted by CVPR2026 Workshop; NTIRE 2026 Challenge Report

Details
English abstract

This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset (Jin et al., 2024). For this edition, we adjust the dataset to 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.

2604.02753 2026-05-14 cs.CV Version update

DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun

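Affiliations * Jiangsu University; Brown University; Nanyang Technological University; MBZUAI; University of New South Wales; USC; University of Toronto; Data61 CSIRO

AI summary This paper proposes DeCo-DETR, a decoupled-cognition DETR framework that targets the efficiency and performance problems of open-vocabulary object detection (OVOD) in practical deployment. By building a hierarchical semantic prototype space from pre-trained multimodal models, it removes the dependence on a text encoder at inference time and thereby improves detection efficiency. A training strategy that decouples semantic reasoning from localization balances detection accuracy against open-world generalization, and experiments show strong zero-shot detection performance on several benchmarks.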

Comments Accepted at ICLR 2026

Details
English abstract

Open-vocabulary object detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.

2603.24649 2026-05-14 cs.CV Version update

MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows

Weixiang Shen, Chengzhi Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Xiao Han, Zongyue Li, Jingpei Wu, Min Xu, Daguang Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan

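Affiliations * Technical University of Munich; TUM University Hospital; LMU Munich; Imperial College London; University of Oxford; Carnegie Mellon University; NVIDIA; National University of Singapore; Munich Center for Machine Learning

AI summary This study argues that current medical imaging benchmarks focus too heavily on pre-selected 2D images and fail to reflect the complex tasks of real clinical workflows. The authors introduce MedFlowBench, a full-study evaluation benchmark, and MedOpenClaw, a controlled runtime for medical imaging software, to evaluate vision-language models on complete studies. Experiments show that scoring final answers alone overestimates model performance; in realistic tasks, models must also produce auditable evidence to complete complex workflows correctly.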

Comments 33 pages

Details
English abstract

Medical imaging benchmarks often evaluate VLMs on pre-selected 2D images, slices, crops, or patches, making evaluation closer to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence, including key slices, coordinates, regions of interest, or lesion-state fields. This evidence is automatically checked against withheld masks, annotations, and labels. Across evaluated models, final answer-only scoring gives an overly optimistic picture: when answers must also be supported by correct evidence, performance drops substantially on complex workflows. We further find that adding image-analysis tools does not by itself solve the problem. Tools help when they make a complex procedure simple and reliable, but agents still struggle when they must choose inputs, manage viewer state, and verify intermediate outputs over multiple steps. MedFlowBench exposes whether medical imaging agents can produce auditable evidence from complete studies, rather than plausible answers from selected images.

2603.05093 2026-05-14 cs.LG cs.AI cs.CV Version update

From Baselines to Transport Geodesics: Axiomatic Attribution via Optimal Generative Flows

Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You

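Affiliations * Shanghai Jiao Tong University; Aalto University; Alibaba; Technical University of Denmark

AI summary This paper studies path selection in feature attribution and proposes an attribution method based on optimal generative flows. Rather than relying on hand-designed paths or the sensitivity geometry of the model, the authors select the attribution path from the data-generating process by minimizing the kinetic action of transport, which yields more stable and structured explanations. They prove that the Aumann-Shapley integral is unique for a fixed path and approximate the theory with Rectified Flow and related methods; experiments show improved attribution stability while deletion faithfulness is preserved.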

Comments 10 figures, 31 pages

Details
English abstract

Feature attributions often hide a critical modeling choice: they explain a prediction along a counterfactual path from a reference state to an input. Different baselines, interpolations, and generative trajectories define different paths and can therefore produce different explanations. We study this path ambiguity as a modeling problem. Our central question is whether the path can be chosen by the data-generating transport process, rather than by a hand-designed interpolation or by the sensitivity geometry of the model being explained. We separate attribution into fixed-path credit allocation and path selection. For a fixed path, we prove that the Aumann-Shapley line integral is the unique attribution rule under standard fixed-path axioms and explicit coordinate-trace regularity. For path selection, we minimize kinetic action over flows that transport a reference distribution to the data distribution, yielding a transport-geodesic attribution principle. We approximate this ideal with Rectified Flow and Reflow and derive stability bounds linking vector-field error to attribution error. Experiments show that lower-action, transport-consistent paths produce more stable and structured explanations, preserving competitive deletion faithfulness, without claiming data-manifold membership. Our code is available at https://github.com/cenweizhang/OTFlowSHAP.
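
The fixed-path attribution described in this abstract can be illustrated with a small numerical sketch. It is not the authors' implementation: `path_attribution`, the toy model f(x) = x0 * x1, and the two example paths are all hypothetical choices made here, and the line integral is approximated with a plain midpoint rule.

```python
from typing import Callable, List

Vector = List[float]


def path_attribution(grad_f: Callable[[Vector], Vector],
                     path: Callable[[float], Vector],
                     steps: int = 2000) -> Vector:
    """Approximate phi_i = integral over t in [0, 1] of
    (df/dx_i)(path(t)) * d(path_i)/dt dt with a midpoint rule."""
    dim = len(path(0.0))
    phi = [0.0] * dim
    for k in range(steps):
        t0, t1 = k / steps, (k + 1) / steps
        start, end = path(t0), path(t1)
        # Gradient evaluated at the midpoint of the current segment.
        grad = grad_f(path(0.5 * (t0 + t1)))
        for i in range(dim):
            phi[i] += grad[i] * (end[i] - start[i])
    return phi


def grad_product(x: Vector) -> Vector:
    """Gradient of the toy model f(x) = x0 * x1 (an assumption for illustration)."""
    return [x[1], x[0]]


def straight_path(t: float) -> Vector:
    """Linear interpolation from the reference (0, 0) to the input (2, 3)."""
    return [2.0 * t, 3.0 * t]


def curved_path(t: float) -> Vector:
    """A different path between the same two endpoints."""
    return [2.0 * t, 3.0 * t * t]
```

On this toy model the straight path yields attributions of roughly (3, 3) and the curved path roughly (2, 4). Both sum to f(2, 3) - f(0, 0) = 6, so the total credit is fixed while its split across features depends on the path, which is the ambiguity the abstract describes.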

2602.17555 2026-05-14 cs.CV Version update

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

Zixu Cheng, Da Li, Jian Hu, Yuhang Zang, Ziquan Liu, Shaogang Gong, Wei Li

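Affiliations * Queen Mary University of London; Samsung AI Centre Cambridge; Shanghai Artificial Intelligence Laboratory; Nanyang Technological University

AI summary Video reasoning requires fine-grained understanding of the temporal dependencies and event-level relations among objects and events in a video. Current multimodal large language models are prone to severe temporal hallucinations, rooted in weak visual-temporal grounding and the lack of explicit structure for modelling event relations. The paper proposes GraphThinker, a reinforcement finetuning method that builds a structured event representation and strengthens visual grounding, reducing hallucinations during reasoning. Experiments show clear gains on several benchmark datasets.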

Comments Under review

Details
English abstract

Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations between objects and events in videos. Current Multimodal Large Language Models (MLLMs) are prone to severe temporal hallucinations in video reasoning. An underlying cause of these hallucinations is weak visual-temporal grounding and the lack of explicit structure for modelling event relations. Models often rely on auxiliary text, such as dense captions, rather than explicitly anchoring their reasoning in actual visual evidence. However, these textual representations are inherently unstructured and fail to provide explicit causal constraints needed to guide the model's reasoning. In this work, we propose GraphThinker, a reinforcement finetuning method that constructs a structured event representation of a video and enforces visual grounding to jointly reduce reasoning hallucinations. Specifically, we employ an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations, guiding a structured video reasoning process. Moreover, we address the weak grounding issue by introducing a novel visual attention reward during reinforcement finetuning that encourages the model to actively attend to reliable visual cues. On the RexTime dataset, GraphThinker achieves an over 4% improvement in IoU=0.3 for moment localisation. On the VidHalluc dataset, GraphThinker achieves a 9.8% improvement in reducing temporal sequence hallucination and a 7.6% gain in Binary QA in reducing action hallucination, compared to the state-of-the-art methods.
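
The moment-localisation figure quoted at IoU=0.3 rests on temporal intersection-over-union between a predicted and a ground-truth time span. A minimal sketch of that metric follows; the function names and the recall-style aggregation are assumptions made here, since the abstract does not spell out the exact evaluation protocol.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans on the video timeline."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Span], gts: Sequence[Span],
                  threshold: float = 0.3) -> float:
    """Fraction of queries whose predicted span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, so it counts as a hit at the 0.3 threshold but would miss at 0.5.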

2602.07458 2026-05-14 cs.CV Version update

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang

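Affiliations * Harbin Institute of Technology, Shenzhen; The Hong Kong University of Science and Technology; Tsinghua Shenzhen International Graduate School, Tsinghua University

AI summary Online reinforcement learning (RL) is promising for complex image editing but is currently limited by the lack of reliable, fine-grained reward signals. This paper proposes SpatialReward, a reward model that improves evaluation accuracy through explicit spatial reasoning and addresses the "Attention Collapse" problem, in which existing evaluators neglect cross-image comparison and miss fine-grained details. The model performs pixel-level verification anchored to predicted edit regions, achieves leading results on several benchmarks, and serves as a strong signal for online RL that markedly improves an image generation model.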

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Details
English abstract

Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.

2602.07029 2026-05-14 eess.IV cs.CV Version update

Guidestar-Free Adaptive Optics with Asymmetric Apertures

Weiyun Jiang, Haiyun Guo, Christopher A. Metzler, Ashok Veeraraghavan

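Affiliations * Rice University; University of Maryland, College Park

AI summary This paper presents a closed-loop adaptive optics system that corrects optical aberrations in real time without a guidestar or a wavefront sensor. Built on asymmetric apertures and machine learning, the method combines wavefront sensing, point-spread-function estimation, and optical correction to achieve efficient wavefront correction at low computational cost. Experiments on complex natural scenes show that it outperforms existing guidestar-free wavefront shaping techniques while using roughly ten times fewer measurements and a thousand times less computation.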

Comments Accepted to ACM Transactions on Graphics (TOG)

Details
English abstract

This work introduces the first closed-loop adaptive optics (AO) system capable of optically correcting aberrations in real-time without a guidestar or a wavefront sensor. Nearly 40 years ago, Cederquist et al. demonstrated that asymmetric apertures enable phase retrieval (PR) algorithms to perform fully computational wavefront sensing, albeit at a high computational cost. More recently, Chimitt et al. extended this approach with machine learning and demonstrated real-time wavefront sensing using only a single (guidestar-based) point-spread-function (PSF) measurement. Inspired by these works, we introduce a guidestar-free AO framework built around asymmetric apertures and machine learning. Our approach combines three key elements: (1) an asymmetric aperture placed at the system's pupil plane that enables PR-based wavefront sensing, (2) a pair of machine learning algorithms that estimate the PSF from natural scene measurements and reconstruct phase aberrations, and (3) a spatial light modulator that performs optical correction. We experimentally validate this framework on dense natural scenes imaged through unknown obscurants. Our method outperforms state-of-the-art guidestar-free wavefront shaping methods, using an order of magnitude fewer measurements and three orders of magnitude less computation.

2602.02560 2026-05-14 cs.LG cs.AI cs.CV Version update

Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions

Bartlomiej Sobieski, Jakub Grzywaczewski, Karol Dobiczek, Mateusz Wójcik, Tomasz Bartczak, Patryk Szatkowski, Przemysław Bombiński, Matthew Tivnan, Przemyslaw Biecek

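Affiliations * National Lung Screening Trial Research Team

AI summary This study causally verifies the decision mechanism of Sybil, a deep learning model for lung cancer risk prediction, and proposes S(H)NAP, a model-agnostic auditing framework. The method constructs generative interventional attributions, validated by expert radiologists, to analyze causal contributions to the risk score systematically. The audit finds that although Sybil often behaves like an expert, it has critical failure modes, including oversensitivity to clinically irrelevant artifacts and a radial bias.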

Comments ICML 2026

Details
English abstract

Lung cancer remains the leading cause of cancer mortality, driving the development of automated screening tools to alleviate radiologist workload. Standing at the frontier of this effort is Sybil, a deep learning model capable of predicting future risk solely from computed tomography (CT) with high precision. However, despite extensive clinical validation, current assessments rely purely on observational metrics. This correlation-based approach overlooks the model's actual reasoning mechanism, necessitating a shift to causal verification to ensure robust decision-making before clinical deployment. We propose S(H)NAP, a model-agnostic auditing framework that constructs generative interventional attributions validated by expert radiologists. By leveraging realistic 3D diffusion bridge modeling to systematically modify anatomical features, our approach isolates object-specific causal contributions to the risk score. Providing the first interventional audit of Sybil, we demonstrate that while the model often exhibits behavior akin to an expert radiologist, differentiating malignant pulmonary nodules from benign ones, it suffers from critical failure modes, including dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.

2601.22868 2026-05-14 cs.CV cs.LG Version update

Conditional Compatibility Learning for Context-Dependent Anomaly Detection

Shashank Mishra, Didier Stricker, Jason Rambach

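Affiliations * German Research Center for Artificial Intelligence (DFKI); RPTU Kaiserslautern

AI summary This paper studies context-dependent anomaly detection, where the same object can be normal in one scene and anomalous in another. Conventional methods usually assume that abnormality is a property of the object itself; the paper argues that this assumption does not hold in real-world scenes. The authors propose conditional compatibility learning and build CC-CLIP, which separates subject and context representations and fuses them through text-conditioned attention, substantially outperforming existing methods on real-world contextual anomaly detection tasks.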

Comments Preprint. 9 pages main text, plus appendix

Details
English abstract

Anomaly detection usually assumes that abnormality is an intrinsic property of an observation. A defect is a defect, and a rare object is rare, regardless of where it appears. Many real-world anomalies do not work this way. A runner on a track is normal, but the same runner on a highway is not. The subject is unchanged; only the context makes it anomalous. This setting, long recognized as contextual anomaly detection, remains largely underexplored in modern vision-language systems. The difficulty is not merely empirical; it is formal. When anomaly labels depend on the relation between a subject and its context, any detector reasoning from a global representation that conflates subject and context is provably non-identifiable: two different subject-context configurations can map to the same embedding while requiring opposite labels, and no such detector can be correct on both. This impossibility motivates a different formulation: instead of asking whether an observation deviates from a global notion of normality, the model should ask whether subjects are compatible with their surrounding context. We define this as conditional compatibility learning. We instantiate this framework in CC-CLIP, a vision-language architecture that learns disentangled subject- and context-aware representations from a single image and fuses visual evidence through text-conditioned attention. CC-CLIP achieves state-of-the-art results on real-world contextual anomaly detection, substantially outperforming all existing CLIP-based and context-reasoning baselines. A single-branch variant of CC-CLIP also achieves competitive performance on structural anomaly benchmarks.

2512.16767 2026-05-14 cs.CV Version update

Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters

Zhiyang Guo, Ori Zhang, Jax Xiang, Alan Zhao, Zhenxun Yuan, Wengang Zhou, Houqiang Li

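Affiliations * EEIS Department, University of Science and Technology of China; Tencent PCG, Shenzhen, China; Tencent PCG, New York, USA; Tencent PCG, Beijing, China; University of Science and Technology of China; Tencent PCG

AI summary This paper proposes Make-It-Poseable, a feed-forward framework that addresses key problems in posing 3D characters, such as inaccurate skinning weights, fixed mesh topologies, and poor pose conformance. It reformulates character posing as a skinning-free latent-space transformation and operates on compact latent representations to reconstruct characters in target poses. The framework combines a latent posing transformer, a dense pose representation, and an adaptive completion module; it handles topological changes and generalizes zero-shot to characters of diverse morphologies and to 3D authoring tasks.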

Comments Project page: https://jasongzy.github.io/Make-It-Poseable/

Details
English abstract

Posing 3D characters is a fundamental task in computer graphics. However, existing paradigms, ranging from traditional auto-rigging to recent pose-conditioned generative models, frequently struggle with inaccurate skinning weights, fixed mesh topologies, and poor pose conformance. These challenges have become particularly pronounced with the recent explosion of AI-generated 3D assets, which often exhibit flawed structures and fused geometry. To address these issues, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a skinning-free latent-space transformation problem. By decoupling shape deformation from the constraints of fixed mesh connectivity, our method directly operates on compact latent representations to reconstruct characters in target poses. To achieve this, our framework integrates a latent posing transformer for shape manipulation, a dense pose representation for fine-grained control, and an adaptive completion module optimized via a bipartite-matched latent loss to robustly handle topological changes. Extensive experiments demonstrate that our method significantly outperforms existing baselines in posing quality. Furthermore, our skeleton-agnostic design exhibits remarkable zero-shot generalization to diverse morphologies including quadrupeds and seamlessly supports various 3D authoring applications such as part replacement and refinement.

2511.09771 2026-05-14 cs.CV Version update

STORM: Segment, Track, and Object Re-Localization from a Single Image

Yu Deng, Teng Cao, Hikaru Shindo, Quentin Delfosse, Jiahong Xue, Kristian Kersting

Affiliations * Department of Computer Science, Technical University of Darmstadt, Darmstadt, Hesse, Germany; Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Hesse, Germany; German Research Center for Artificial Intelligence (DFKI), Darmstadt, Hesse, Germany; Centre for Cognitive Science, Technical University of Darmstadt, Darmstadt, Hesse, Germany; Google Intrinsic AI Research, Germany. † Work done while at the AIML research lab, now working at Intrinsic, Google.

AI summary STORM is a unified framework for reference-conditioned 6D pose estimation and tracking from a single reference image, with improved robustness and little manual input. It combines Hierarchical Spatial Fusion Attention with a BCE-trained tracking verifier, which lets it recover the target pose reliably under occlusion and fast motion. Experiments show that STORM outperforms existing methods in the annotation-free setting and copes with severe occlusion and viewpoint changes.

Comments 21 pages. Accepted at the 43rd International Conference on Machine Learning (ICML 2026); camera-ready version

Details
English abstract

Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking that can operate from a single reference image, with minimal manual input and improved robustness. STORM combines: (i) Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports both single-reference and multi-reference conditioning and can optionally use vision-language semantic conditioning to resolve instance ambiguities; and (ii) a BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score to detect drift and trigger automatic re-initialization. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.

2508.09479 2026-05-14 cs.CV Version update

SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, Yongjun Zhang

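Affiliations * School of Remote Sensing and Information Engineering, Wuhan University; Technology Innovation Center for Collaborative Applications of Natural Resources Data in GBA, Ministry of Natural Resources; Department of Geography and Resource Management, The Chinese University of Hong Kong; China Railway Siyuan Survey and Design Group Co., LTD

AI summary This paper proposes SkySplat, a self-supervised framework for generalizable 3D Gaussian Splatting reconstruction from multi-temporal sparse satellite images. By integrating the rational polynomial coefficient (RPC) model into the generalizable 3DGS pipeline, it addresses the limited geometric constraints, transient objects, and radiometric inconsistencies that hamper existing methods on satellite imagery. SkySplat relies only on RGB images and radiometric-robust relative height supervision, so it reconstructs efficiently and accurately without ground-truth height maps, and it shows strong performance and cross-dataset generalization on several benchmarks.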

Comments AAAI 2026. Code is available at https://github.com/NanCheng2001/SkySplat-main

Details
English abstract

Three-dimensional scene reconstruction from sparse-view satellite images is a long-standing and challenging task. While 3D Gaussian Splatting (3DGS) and its variants have recently attracted attention for their high efficiency, existing methods remain unsuitable for satellite images due to incompatibility with rational polynomial coefficient (RPC) models and limited generalization capability. Recent advances in generalizable 3DGS approaches show potential, but they perform poorly on multi-temporal sparse satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies. To address these limitations, we propose SkySplat, a novel self-supervised framework that integrates the RPC model into the generalizable 3DGS pipeline, enabling more effective use of sparse geometric cues for improved reconstruction. SkySplat relies only on RGB images and radiometric-robust relative height supervision, thereby eliminating the need for ground-truth height maps. Key components include a Cross-Self Consistency Module (CSCM), which mitigates transient object interference via consistency-based masking, and a multi-view consistency aggregation strategy that refines reconstruction results. Compared to per-scene optimization methods, SkySplat achieves an 86 times speedup over EOGS with higher accuracy. It also outperforms generalizable 3DGS baselines, significantly reducing MAE from 13.18 m to 1.80 m on the DFC19 dataset, and demonstrates strong cross-dataset generalization on the MVS3D benchmark. The code is available at https://github.com/NanCheng2001/SkySplat-main

2505.21238 2026-05-14 cs.CV Version update

3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics Based Appearance-Medium Decoupling

Jieyu Yuan, Yujun Li, Yuanlin Zhang, Chunle Guo, Xiongxin Tang, Ruixing Wang, Chongyi Li

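Affiliations * VCIP, College of Computer Science, Nankai University; Institute of Software, Chinese Academy of Sciences; DJI

AI summary This paper proposes 3D-UIR, a physics-based 3D Gaussian Splatting method that addresses the coupling of light and medium in underwater 3D scene reconstruction. It disentangles object appearance from water-medium effects and introduces explicit medium embeddings, improving scene consistency and rendering quality. A depth-guided optimization strategy further improves geometric accuracy, yielding clear gains in novel view synthesis and scene restoration for underwater scenes.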

Comments Accepted to IEEE TIP 2026. Project webpage: https://bilityniu.github.io/3D-UIR

Details
English abstract

Novel view synthesis for underwater scene reconstruction presents unique challenges due to complex light-media interactions. Optical scattering and absorption in the water body introduce inhomogeneous medium attenuation that disrupts the conventional volume rendering assumption of a uniform propagation medium. While 3D Gaussian Splatting (3DGS) offers real-time rendering capabilities, it struggles with inhomogeneous underwater environments where scattering media introduce artifacts and inconsistent appearance. In this study, we propose a physics-based framework that disentangles object appearance from water medium effects through tailored Gaussian modeling. Our approach introduces appearance embeddings, which are explicit medium representations for backscatter and attenuation, enhancing scene consistency. In addition, we propose a depth-guided optimization strategy that leverages pseudo-depth maps as supervision with depth regularization and scale penalty terms to improve geometric fidelity. By integrating the proposed appearance and medium modeling components via an underwater imaging model, our approach achieves both high-quality novel view synthesis and physically accurate scene restoration. Experiments demonstrate significant improvements in rendering quality and restoration accuracy over existing methods. The project page is available at https://bilityniu.github.io/3D-UIR.
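
The separation of appearance from backscatter and attenuation can be made concrete with a per-pixel sketch. The abstract does not state which imaging model the paper uses, so the formula below is an assumption: a widely used simplified underwater image formation model, I = J * exp(-beta_d * z) + B_inf * (1 - exp(-beta_b * z)), applied per colour channel. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def underwater_pixel(clean: Sequence[float], depth: float,
                     beta_d: Sequence[float], beta_b: Sequence[float],
                     b_inf: Sequence[float]) -> List[float]:
    """Compose an observed RGB pixel from medium-free appearance `clean`:
    attenuated direct signal plus depth-dependent backscatter."""
    observed = []
    for j, bd, bb, binf in zip(clean, beta_d, beta_b, b_inf):
        direct = j * math.exp(-bd * depth)
        backscatter = binf * (1.0 - math.exp(-bb * depth))
        observed.append(direct + backscatter)
    return observed


def restore_pixel(observed: Sequence[float], depth: float,
                  beta_d: Sequence[float], beta_b: Sequence[float],
                  b_inf: Sequence[float]) -> List[float]:
    """Invert the same model: subtract backscatter, then undo attenuation."""
    restored = []
    for i, bd, bb, binf in zip(observed, beta_d, beta_b, b_inf):
        backscatter = binf * (1.0 - math.exp(-bb * depth))
        restored.append((i - backscatter) * math.exp(bd * depth))
    return restored
```

Under this assumed model, a pixel at zero depth is unchanged, while a distant pixel converges to the veiling colour `b_inf`. Knowing depth and medium parameters is what makes restoration possible, which is consistent with the role the abstract gives to depth supervision.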

2505.15616 2026-05-14 cs.CV Version update

LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

Ruilin Yao, Bo Zhang, Jirui Huang, Xinwei Long, Yifang Zhang, Tianyu Zou, Yufei Wu, Shichao Su, Yifan Xu, Wenxi Zeng, Zhaoyu Yang, Guoyou Li, Shilan Zhang, Zichan Li, Yaxiong Chen, Shengwu Xiong, Peng Xu, Jiajun Zhang, Bowen Zhou, David Clifton, Luc Van Gool

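Affiliations * Wuhan University of Technology; Tsinghua University; Institute of Automation, Chinese Academy of Sciences; Shanghai AI Lab; University of Oxford; INSAIT, Sofia Un. St Kliment Ohridski

AI summary This study presents LENS, a multi-level benchmark for evaluating multimodal large language models across perception, understanding, and reasoning tasks. LENS contains 3,400 contemporary images and more than 60,000 human-authored questions covering eight tasks and twelve daily scenarios, supporting evaluation from basic perception to complex reasoning. With rich annotations and high-quality images collected from social media, it reflects real-world model behavior more faithfully; experiments show that no current frontier model exceeds 60% accuracy on the reasoning tasks.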

Comments Published as a conference paper at ICLR 2026

Details
English abstract

Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. Existing benchmarks are usually constructed in a task-oriented manner without guaranteeing that samples for different tasks come from the same data distribution, so they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manually collected from social media, and 53% of them were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models, QVQ-72B-preview and Kimi-VL. These models were released later than Dec. 2024, and none of them achieves an accuracy greater than 60% in the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/

1811.12784 2026-05-14 cs.CV Version update

The GAN that Warped: Semantic Attribute Editing with Unpaired Data

Gara Dorta, Sara Vicente, Neill D. F. Campbell, Ivor J. A. Simpson

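Affiliations * University of Bath; Anthropics Technology Ltd.; University of Sussex

AI summary This study proposes a semantic image editing method based on smooth warp fields that achieves high-quality edits without paired data. Drawing on recent advances in generative adversarial networks (GANs), it trains on unpaired data, preserves the subject's identity, and edits high-resolution (e.g., 4K) images efficiently. Experiments show strong editing quality and robustness on face and bird image datasets.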

Comments CVPR 2020

Details
English abstract

Deep neural networks have recently been used to edit images with great success, in particular for faces. However, they are often limited to only being able to work at a restricted range of resolutions. Many methods are so flexible that face edits can often result in an unwanted loss of identity. This work proposes to learn how to perform semantic image edits through the application of smooth warp fields. Previous approaches that attempted to use warping for semantic edits required paired data, i.e. example images of the same subject with different semantic attributes. In contrast, we employ recent advances in Generative Adversarial Networks that allow our model to be trained with unpaired data. We demonstrate face editing at very high resolutions (4k images) with a single forward pass of a deep network at a lower resolution. We also show that our edits are substantially better at preserving the subject's identity. The robustness of our approach is demonstrated by showing plausible image editing results on the Cub200 birds dataset. To our knowledge this has not been previously accomplished, due to the challenging nature of the dataset.

1804.05261 2026-05-14 cs.CV cs.GR Version update

Physics-driven Fire Modeling from Multi-view Images

Gara Dorta, Luca Benedetti, Dmitry Kit, Yong-Liang Yang

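Affiliations * University of Bath

AI summary This study presents a method for reconstructing physically valid fire models from multi-view images, addressing the reliance of conventional fire modeling on complex physical simulation or simplifying assumptions. For the first time, it provides plausible estimates of the physical properties of a fire volume (e.g., temperature and density) from RGB cameras, which enables new effects such as global fire illumination. The method is validated on a variety of input data and applied to realistic illumination of virtual scenes.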

Details
English abstract

Fire effects are widely used in various computer graphics applications such as visual effects and video games. Modeling the shape and appearance of fire phenomena is challenging as the underlying effects are driven by complex laws of physics. State-of-the-art fire modeling techniques rely on sophisticated physical simulations which require intensive parameter tuning, or use simplifications which produce physically invalid results. In this paper, we present a novel method of reconstructing physically valid fire models from multi-view stereo images. Our method, for the first time, provides plausible estimation of physical properties (e.g., temperature, density) of a fire volume using RGB cameras. This allows for a number of novel phenomena such as global fire illumination effects. The effectiveness and usefulness of our method are tested by generating fire models from a variety of input data, and applying the reconstructed fire models for realistic illumination of virtual scenes.

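1804.01050 2026-05-14 stat.ML cs.CV cs.LG version update

Training VAEs Under Structured Residuals

Gara Dorta, Sara Vicente, Lourdes Agapito, Neill D. F. Campbell, Ivor Simpson

Affiliations * University of Bath; Anthropics Technology Ltd.; University College London

AI summary This paper studies how to better model the structured correlations in image reconstruction residuals within variational auto-encoders (VAEs). Conventional VAEs assume that per-pixel uncertainty is independent, but actual reconstruction residuals often show clear structure. The authors therefore introduce a structured Gaussian likelihood prediction network into the VAE to model the residual correlations, improving uncertainty modeling and generation quality on color images while keeping model complexity low.

Comments Simplified training methodology, added more results

Details
English abstract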

Variational auto-encoders (VAEs) are a popular and powerful deep generative model. Previous works on VAEs have assumed a factorized likelihood model, whereby the output uncertainty of each pixel is assumed to be independent. This approximation is clearly limited as demonstrated by observing a residual image from a VAE reconstruction, which often possesses a high level of structure. This paper demonstrates a novel scheme to incorporate a structured Gaussian likelihood prediction network within the VAE that allows the residual correlations to be modeled. Our novel architecture, with minimal increase in complexity, incorporates the covariance matrix prediction within the VAE. We also propose a new mechanism for allowing structured uncertainty on color images. Furthermore, we provide a scheme for effectively training this model, and include some suggestions for improving performance in terms of efficiency or modeling longer range correlations.

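2605.13335 2026-05-14 cs.AI cs.CV version update

Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Qinchuan Cheng, Zhantao Gong, Pengzhan Sun, Angela Yao, Xulei Yang, Shijie Li

Affiliations * Xi’an Jiaotong University; Nankai University; National University of Singapore; A*STAR

AI summary This paper presents Ego2World, a benchmark that compiles egocentric cooking videos into executable symbolic worlds for evaluating how embodied agents plan under partial observability. It derives reusable state-transition rules from video annotations and executes them in a hidden symbolic world graph, forcing the agent to plan and update its memory using only local observations and execution feedback. Experiments show that conventional action-overlap metrics can overestimate task success, while persistent belief memory improves task completion and reduces repeated visual exploration.

Comments Project page: https://sj-li.com/PROJ/Ego2World/

Details
English abstract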

Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration -- suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.
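
The separation between the simulator's hidden world graph and the agent's belief graph can be illustrated with a toy sketch. Everything below (the `HiddenWorld` and `BeliefAgent` classes, the `open` and `take` actions, the object and container names) is hypothetical and is not taken from the Ego2World code or from HD-EPIC; it only shows an agent that updates its belief from local observations and execution feedback, without ever reading the true state.

```python
class HiddenWorld:
    """Toy simulator: holds the true state, which the agent never reads directly."""

    def __init__(self):
        # True object locations and container states (hypothetical example).
        self._location = {"knife": "drawer", "onion": "fridge"}
        self._open = {"drawer": False, "fridge": False}

    def step(self, action, target):
        """Apply a transition rule; return (success, local observation)."""
        if action == "open" and target in self._open:
            self._open[target] = True
            # Opening a container reveals only what is inside it.
            seen = [o for o, loc in self._location.items() if loc == target]
            return True, {target: seen}
        if action == "take":
            loc = self._location.get(target)
            if loc is not None and self._open.get(loc, False):
                self._location[target] = "hand"
                return True, {}
            return False, {}  # fails while the container is still closed
        return False, {}


class BeliefAgent:
    """Keeps a partial belief about where objects are."""

    def __init__(self):
        self.belief = {}  # object -> believed location

    def update(self, action, target, success, observation):
        # Record whatever the local observation revealed.
        for container, objects in observation.items():
            for obj in objects:
                self.belief[obj] = container
        # Execution feedback: a successful take moves the object to the hand.
        if action == "take" and success:
            self.belief[target] = "hand"


world, agent = HiddenWorld(), BeliefAgent()
log = []
for action, target in [("take", "knife"), ("open", "drawer"), ("take", "knife")]:
    ok, obs = world.step(action, target)
    agent.update(action, target, ok, obs)
    log.append(ok)
```

In this toy run the first `take` fails because the drawer is closed, which is the kind of failure an agent would have to recover from by replanning over its belief.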

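2605.13333 2026-05-14 cs.CV cs.AI cs.GR cs.LG version update

Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation

Junhyuk Jeon, Seokhyeon Hong, Junyong Noh

Affiliations * Visual Media Lab, KAIST

AI summary This study addresses the limitations of text-driven motion diffusion models in generating finely stylized motion and proposes a lightweight style-conditioned generation framework. A hypernetwork generates low-rank adaptation parameters that dynamically modulate a pretrained diffusion model, giving fine-grained control over style during denoising. The method structures the style latent space with a supervised contrastive loss, improves generalization to unseen styles, and achieves leading stylized generation results on multiple datasets.

Comments Accepted to SIGGRAPH 2026. Project page: https://junhyukjeon.github.io/projects/style-salad/

Details
English abstract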

Text-driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine-level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text-driven diffusion model. Existing stylization methods, however, either require style-specific fine-tuning of existing models or rely on heavy ControlNet-based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork-generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low-rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization-based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state-of-the-art stylization results, while achieving improved stylization for unseen styles.

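2605.13328 2026-05-14 cs.RO cs.AI cs.CL cs.CV version update

What Limits Vision-and-Language Navigation?

Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu

Affiliations * HKUST(GZ); JD Explore Academy

AI summary Vision-and-Language Navigation (VLN) is an important research direction in embodied intelligence, but existing methods often degrade when transferred from simulation to the real world because of perceptual instability and ambiguous instructions. This paper proposes StereoNav, a robust framework that integrates vision, language, and action, and uses target-location priors and stereo vision to improve the stability and accuracy of cross-domain navigation. Experiments show that StereoNav achieves state-of-the-art performance on several benchmarks and substantially improves navigation reliability in complex environments in real-robot deployments.

Details
English abstract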

Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng-wang.github.io/stereonav-public.github.io.

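2605.13316 2026-05-14 cs.CV version update

Test-time Sparsity for Extreme Fast Action Diffusion

Kangye Ji, Yuan Meng, Jianbo Zhou, Ye Li, Chen Tang, Zhi Wang

Affiliations * Tsinghua University; The Chinese University of Hong Kong

AI summary This study targets the high computational cost of action diffusion models when generating high-quality action sequences and proposes a test-time sparsity method that accelerates action generation by dynamically predicting prunable residual computations in each model forward pass. To remove the efficiency bottlenecks caused by repeated encoding and pruning, it designs a highly parallelized inference pipeline and introduces an omnidirectional reuse strategy, raising both pruning sparsity and generation efficiency. Experiments show a 92% reduction in computation and a 5x speedup in generation with no loss in performance.

Details
English abstract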

Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5x, achieving lossless performance with an inference frequency of 47.5 Hz. Our code is available at https://github.com/ky-ji/Test-time-Sparsity.

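2605.13306 2026-05-14 cs.CV version update

Color Constancy in Hyperspectral Imaging via Reduced Spectral Spaces

G. Dofri Vidarsson, Liying Lu, Sabine Süsstrunk

Affiliations * École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

AI summary This paper studies how reducing spectral dimensionality can improve color-constancy estimation in hyperspectral imaging. Using the Color-by-Correlation (CbC) framework, the authors analyze how different spectral dimensionality reduction strategies affect illuminant estimation and identify the conditions under which compact spectral representations outperform conventional RGB methods. The study offers practical guidance on using hyperspectral information efficiently for illuminant estimation.

Details
English abstract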

Illuminant estimation aims to infer scene illumination from image measurements despite intrinsic ambiguities between surface reflectance and lighting. Most existing methods operate on trichromatic RGB images and are therefore fundamentally limited by the restricted spectral information available. Hyperspectral imaging provides a much richer representation of scene radiance and has the potential to alleviate these ambiguities. However, its high dimensionality poses computational and statistical challenges. In this work, we systematically study the effect of spectral dimensionality and representation choice on illuminant estimation performance using hyperspectral data. We adopt the practical and effective Color-by-Correlation (CbC) framework as the estimation backbone and analyze its behavior under different spectral dimensionality reduction strategies. Our results offer practical insights into how hyperspectral information can be efficiently exploited for illuminant estimation and identify conditions under which compact spectral representations outperform conventional RGB-based approaches. The code is available at https://github.com/IVRL/Reduced-Spectral-Color-Constancy.

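2605.13293 2026-05-14 cs.CV version update

Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion

Shiyu Tan, Zixuan Zhao, Hao Gao, Zhiheng Chen, Xiaolong Yin, Enya Shen

Affiliations * School of Software, Tsinghua University, China; Tsinghua University

AI summary This paper proposes Img2CADSeq, a multi-stage image-to-CAD generation method that aims to produce high-quality boundary representation (BRep) CAD models from single-view images. Its core idea is to encode CAD operation sequences into a three-level hierarchical codebook and, through an importance-prioritization strategy that preserves profile information first, compress long sequences into a stable discrete latent space. To bridge the modality gap between images and CAD, it introduces a point cloud intermediate representation based on contrastive learning, combined with a VQ-Diffusion model for conditional generation. The method is validated on the newly built CAD-220K and PrintCAD datasets, where it significantly outperforms existing methods, and the generated STEP files can be used directly in commercial CAD software.

Comments Accepted by SIGGRAPH 2026 Conference

Details
English abstract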

Boundary Representation (BRep) is the standard format for Computer-Aided Design (CAD), yet reconstructing high-quality BReps from single-view images remains challenging due to the complexity of topological constraints and operation sequences. We present Img2CADSeq, a multi-stage pipeline that overcomes these limitations by encoding CAD sequences into a three-level hierarchical codebook. Guided by an importance prioritization, this strategy values profiles over details, compressing long sequences into a stable discrete latent space. To bridge the modality gap, we leverage a coarse-to-fine point cloud intermediate, aligning 2D visual features with 3D CAD sequences via contrastive learning to condition a VQ-Diffusion model. Supported by newly introduced CAD-220K and PrintCAD datasets, our approach ensures robust industrial domain adaptation. Extensive experiments demonstrate that Img2CADSeq significantly outperforms state-of-the-art methods, producing standard STEP files that can be directly used in commercial CAD software.

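2605.13277 2026-05-14 cs.CL cs.AI cs.CV cs.IR cs.LG version update

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang

Affiliations * Arizona State University; Texas A&M University; Morgan Stanley

AI summary This paper studies the selection of visual evidence in multimodal retrieval-augmented generation (RAG), noting that existing methods usually rely on semantic relevance or surface similarity, which do not accurately reflect how useful the evidence is for downstream reasoning. The authors redefine evidence utility from an information-theoretic perspective, measuring the value of evidence by the information gain on the model's output distribution, and design a training-free, efficient estimation framework based on lightweight multimodal models. Experiments show that the method outperforms existing RAG methods on several benchmarks while substantially reducing computational cost.

Comments Accepted to ACL 2026

Details
English abstract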

Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
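
As a rough illustration of ranking evidence by information gain on an output distribution, the sketch below scores each candidate image by how much it reduces the entropy of a toy answer distribution. This is an assumption made for illustration: the abstract does not state the exact definition used in the paper, which works through a latent helpfulness variable and surrogate models, and the distributions and image names here are invented.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(prior, posterior):
    """Entropy reduction of the output distribution after adding evidence."""
    return entropy(prior) - entropy(posterior)


# Hypothetical answer distributions over four candidate answers.
prior = [0.25, 0.25, 0.25, 0.25]          # model output without visual evidence
with_evidence = {
    "img_a": [0.70, 0.10, 0.10, 0.10],    # sharpens the prediction
    "img_b": [0.25, 0.25, 0.25, 0.25],    # changes nothing
    "img_c": [0.40, 0.40, 0.10, 0.10],    # partially informative
}

# Rank candidate evidence from most to least useful under this measure.
ranking = sorted(
    with_evidence,
    key=lambda name: information_gain(prior, with_evidence[name]),
    reverse=True,
)
```

Note that a semantically relevant image such as the hypothetical `img_b` can still score zero here, which is the mismatch between relevance and utility that the abstract describes.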

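2605.13228 2026-05-14 cs.CV cs.AI version update

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

Xiao Liu, Nayu Liu, Junnan Zhu, Ruirui Chen, Guohui Xiang, Changjian Wang, Kaiwen Wei, Rongzhen Li, Jiang Zhong

Affiliations * Chongqing University; Tianjin University; MAIS, Institute of Automation, Chinese Academy of Sciences; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; Chongqing National Data AI Research Institute, AI Research Lab

AI summary This paper proposes ReTool-Video, a recursive tool-using video agent method that aims to improve complex reasoning and cross-modal analysis in video understanding. To address the limitations of existing video agents in tool granularity and action space, the authors build the MetaAug-Video Tool Library (MVTL) with 134 tools, supporting fine-grained operations and dual-level information access, and design a recursive tool-calling mechanism that progressively decomposes high-level video intents into executable tool chains. Experiments show strong results on several benchmarks and improved stability and effectiveness in complex video understanding.

Details
English abstract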

Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
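
The recursive grounding idea (run matched actions directly, hand unmatched intents to a resolver that decomposes them) can be sketched in a few lines. The tool names, the `DECOMPOSITIONS` table, and the `ground` function below are hypothetical stand-ins, not the MVTL tools or the ReTool-Video resolver, and only the decomposition branch of the resolver is shown (no parameter repair or tool substitution).

```python
def _merge(spans):
    """Merge overlapping (start, end) intervals."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical registry of directly executable tools.
TOOLS = {
    "detect_events": lambda video: [(0, 4), (3, 8), (20, 25)],  # canned output
    "merge_intervals": _merge,
}

# Hypothetical resolver table: abstract intent -> ordered sub-intents.
DECOMPOSITIONS = {
    "temporal_merging": ["detect_events", "merge_intervals"],
}


def ground(intent, data, depth=0, max_depth=4):
    """Execute a matched tool directly; otherwise decompose and recurse."""
    if intent in TOOLS:
        return TOOLS[intent](data)
    if depth >= max_depth or intent not in DECOMPOSITIONS:
        raise ValueError(f"cannot ground intent: {intent}")
    # Chain the sub-intents, feeding each result into the next one.
    for sub_intent in DECOMPOSITIONS[intent]:
        data = ground(sub_intent, data, depth + 1, max_depth)
    return data


result = ground("temporal_merging", "video.mp4")
```

The depth limit is a design choice added here to keep the recursion bounded; the abstract does not say how the paper handles termination.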

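2605.13223 2026-05-14 cs.CV version update

Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

Abdelrahman Eldesokey, Merey Ramazanova, Ahmad Sait, Ansar Khangeldin, Karen Sanchez, Tong Zhang, Bernard Ghanem

Affiliations * King Abdullah University of Science and Technology

AI summary With the rapid progress of text-to-image generation, reliable model evaluation has become especially important. This paper proposes skill-aligned annotation, which matches the annotation strategy to the nature of each evaluation skill and thereby improves the consistency and stability of evaluation. The study also builds an automated evaluation pipeline for scalable, fine-grained evaluation, and stresses that improving the foundations of evaluation can raise efficiency without simply increasing annotation effort.

Comments Project Page: https://abdo-eldesokey.github.io/skill-aligned-eval/

Details
English abstract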

Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation with spatially grounded feedback. Our work highlights that improving the foundations of image evaluation can increase reliability and efficiency without simply scaling annotation effort. We hope this motivates further research on refining evaluation protocols as a central component of reliable model assessment.

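2605.13202 2026-05-14 cs.CV cs.AI version update

STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

Hongli Liu, Yu Wang, Shengjie Zhao

Affiliations * School of Computer Science and Technology, Tongji University; Engineering Research Center of Key Software Technologies for Smart City Perception and Planning, Ministry of Education

AI summary This paper studies semantic-temporal alignment in few-shot action recognition (FSAR) and proposes STAR, a unified semantic-temporal adaptive representation learning framework. By introducing a Temporal Semantic Attention mechanism and a Semantic Temporal Prototype Refiner, the method addresses the alignment between textual prompts and the sparse visual cues in action sequences and strengthens the modeling of multi-scale temporal dynamics. Experiments show that STAR outperforms existing methods on multiple benchmark datasets, validating its effectiveness with limited samples.

Comments Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

Details
English abstract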

Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework, consisting of a semantic-alignment component and a temporal-aware component, effectively bridging the semantic and temporal gaps and transferring the sequence modeling capability of Mamba into the FSAR. The semantic alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves up to 8.1% and 6.7% gains on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at https://github.com/HongliLiu1/STAR-main.

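2605.13182 2026-05-14 cs.CV version update

DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

Zheng Chen, Ruofan Yang, Jin Han, Dehua Song, Zichen Zou, Chunming He, Yong Guo, Yulun Zhang

Affiliations * Shanghai Jiao Tong University; Huawei Noah’s Ark Lab; Duke University; Huawei Consumer Business Group

AI summary DiffST is an efficient spatiotemporal-aware diffusion framework for real-world space-time video super-resolution (STVSR). It introduces cross-frame context aggregation and video representation guidance modules to make better use of spatiotemporal information, and adopts one-step sampling to speed up inference. Experiments show that DiffST achieves leading performance on several real-world tasks, with inference about 17x faster than existing methods.

Comments Code is available at: https://github.com/zhengchen1999/DiffST

Details
English abstract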

Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17$\times$ faster than previous diffusion-based STVSR methods. Code is available at: https://github.com/zhengchen1999/DiffST.

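2605.13179 2026-05-14 cs.CV version update

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

Jinghao Wang, Qiyuan He, Chunbin Gu, Pheng-Ann Heng

Affiliations * The Chinese University of Hong Kong; National University of Singapore

AI summary This study examines the role of the Engram module in autoregressive image generation and finds that, although it reduces computation, it does not improve the quality of the generated images. Experimental analysis shows that the Engram module behaves more like a gated auxiliary pathway than a content-addressed recall mechanism. The study further indicates that the module's contribution to the generated results comes mainly from its structure rather than from the contents of the memory table.

Comments 9 pages

Details
English abstract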

The Engram module -- a hash-keyed, O(1) associative memory injected into Transformer layers -- was recently shown to improve large language model pretraining, with the appealing interpretation that it provides a content-addressed shortcut to recurring local token patterns. We ask whether this interpretation transfers to autoregressive (AR) image generation, or whether the observed gains, if any, come from a different mechanism. We adapt the Engram module to vision with 2D spatial $n$-gram hashing, gated fusion, and KV-cache-compatible incremental inference, and inject it into a class-conditional AR generator trained on ImageNet 256x256. Across a sweep of backbone-to-memory budget ratios $ρ{\in}[0.17, 0.90]$, every Engram-augmented variant trails the pure AR baseline in FID, indicating that the module saves backbone FLOPs but does not, by itself, improve sample quality. We then probe how the module is used. A gate-clamp sweep shows that disabling the Engram pathway entirely is catastrophic, yet a tiny constant gate (g=0.10) matches or beats the learned gate -- inconsistent with a heavily content-addressed recall mechanism. A donor-probe experiment shows that swapping the hash inputs for matched, adversarial, or random same-class exemplars produces statistically indistinguishable next-token distributions, while collapsing or randomising the table degrades them by two to three orders of magnitude. Finally, training a model from scratch with the entire memory table frozen to $\mathcal{N}(0, 1)$ noise costs only $Δ\text{FID}{=}0.10$ and actually raises Inception Score. Together, these findings indicate that the Engram in AR image generation behaves not as a content-addressed retriever but as a gated architectural side-pathway: a hash-keyed residual stream whose benefit is dominated by the pathway itself, with the learned table contributing only a small distributional refinement.
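
To make the terms concrete, the following is a minimal sketch of a hash-keyed table lookup over a 2D token neighbourhood followed by gated residual fusion. The neighbourhood (current, left, and upper token), the hash constants, and the function names are assumptions chosen for illustration; the abstract does not specify the paper's hashing scheme or fusion layer.

```python
def ngram_hash(tokens, row, col, table_size):
    """Hash the token at (row, col) together with its left and upper neighbours."""
    left = tokens[row][col - 1] if col > 0 else -1   # -1 marks a missing neighbour
    up = tokens[row - 1][col] if row > 0 else -1
    # Simple multiplicative mixing; the constants are arbitrary.
    key = (tokens[row][col] * 1000003 + left * 10007 + up * 101) & 0xFFFFFFFF
    return key % table_size


def engram_step(hidden, tokens, row, col, table, gate):
    """O(1) table lookup, then gated residual fusion: h + gate * table[idx]."""
    memory = table[ngram_hash(tokens, row, col, len(table))]
    return [h + gate * m for h, m in zip(hidden, memory)]


# Toy 2x2 grid of image token ids and an 8-slot memory table of 2-d vectors.
tokens = [[5, 7], [7, 5]]
table = [[float(i), -float(i)] for i in range(8)]
hidden = [1.0, 1.0]

idx = ngram_hash(tokens, 1, 1, len(table))
closed = engram_step(hidden, tokens, 1, 1, table, gate=0.0)    # pathway disabled
small = engram_step(hidden, tokens, 1, 1, table, gate=0.10)    # constant small gate
```

Because the neighbours used here precede the current position in raster order, the lookup only depends on already generated tokens, which is what incremental inference requires.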

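2605.13163 2026-05-14 cs.CR cs.CV cs.LG version update

LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters

Beomjin Ahn, Jungmin Kwon, Chanyong Jung, Jaewook Chung

Affiliations * Samsung Research; Samsung Electronics; Amazon Web Services; University of Michigan

AI summary This paper proposes LoREnc, a training-free framework for securing foundation models and LoRA adapters against intellectual property leakage and model recovery attacks. Its core method is based on spectral truncation and compensation: it suppresses the dominant low-rank components of the foundation model weights, compensates for the missing information in authorized adapters, and uses orthogonal reparameterization to hide the structural fingerprints of the adapter. Experiments show that LoREnc preserves model performance while effectively resisting model recovery attacks, with very low computational overhead.

Comments Accepted to ICIP 2026

Details
English abstract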

Foundation models and low-rank adapters enable efficient on-device generative AI but raise risks such as intellectual property leakage and model recovery attacks. Existing defenses are often impractical because they require retraining or access to the original dataset. We propose LoREnc, a training-free framework that secures both FMs and adapters via spectral truncation and compensation. LoREnc suppresses dominant low-rank components of FM weights, compensates for the missing information in authorized adapters, and further applies orthogonal reparameterization to obscure structural fingerprints of the protected adapter. Unauthorized users produce structurally collapsed outputs, while authorized users recover exact performance. Experiments demonstrate that LoREnc provides strong protection against model recovery with under 1% computational overhead.
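
The truncate-then-compensate idea can be sketched on a tiny matrix: remove the dominant rank-1 component from a weight matrix, then add it back on the authorized side. This is a toy under stated assumptions, not the LoREnc procedure: it removes a single component found by power iteration, uses an invented 3x2 matrix, and omits the orthogonal reparameterization step entirely.

```python
import math


def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def top_singular_triplet(w, iters=200):
    """Power iteration on W^T W for the leading (sigma, u, v).

    Assumes the all-ones start vector is not orthogonal to the leading
    right singular vector, which holds for the example below.
    """
    wt = transpose(w)
    v = [1.0] * len(w[0])
    for _ in range(iters):
        v = matvec(wt, matvec(w, v))
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    wv = matvec(w, v)
    sigma = math.sqrt(sum(x * x for x in wv))
    u = [x / sigma for x in wv]
    return sigma, u, v


def outer(sigma, u, v):
    """Rank-1 matrix sigma * u v^T."""
    return [[sigma * a * b for b in v] for a in u]


def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))


# Hypothetical 3x2 weight matrix.
W = [[3.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
sigma, u, v = top_singular_triplet(W)
low_rank = outer(sigma, u, v)  # dominant spectral component

# Released weights: dominant component removed.
W_protected = [[a - b for a, b in zip(r, l)] for r, l in zip(W, low_rank)]
# Authorized side: the compensation term restores the original weights.
W_restored = [[a + b for a, b in zip(r, l)] for r, l in zip(W_protected, low_rank)]
```

Here most of the matrix energy sits in the removed component, so `W_protected` alone is a poor approximation of `W`, while the compensated version matches it up to floating-point error.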

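2605.13158 2026-05-14 cs.CV version update

Unifying Physically-Informed Weather Priors in A Single Model for Image Restoration Across Multiple Adverse Weather Conditions

Jiaqi Xu, Xiaowei Hu, Lei Zhu, Pheng-Ann Heng

Affiliations * Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; ROAS Thrust, the Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China and The Hong Kong University of Science and Technology, Department of Electronic and Computer Engineering, Hong Kong SAR, China

AI summary This paper studies image restoration under multiple adverse weather conditions and proposes a unified, physically informed weather-prior model that handles degradations caused by different weather, such as raindrops and fog, at once. Based on an analysis of weather-related visual factors, the method builds an imaging model that combines individually visible particles with fog-like aggregate scattering, and designs a weather-prior-based network that enhances features using estimated occlusion and transmission to recover a clear scene. Experiments show that it outperforms existing state-of-the-art methods across multiple adverse weather scenarios.

Comments Accepted by TCSVT

Details
English abstract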

Image restoration under multiple adverse weather conditions aims to develop a single model to recover the underlying scene with high visibility. Weather-related artifacts vary with the particle's distance to the camera according to the established scene visibility analysis, where close and faraway regions are more affected by falling drops and fog effects, respectively. Existing methods fail to consider this weather-specific physical visual process; thus, the restoration performance is limited. In this work, we analyze the common visual factors in adverse weather conditions and present a unified imaging model that considers the individually visible particles and fog-like aggregate scattering effects. Further, we design a novel weather-prior-based network, which leverages the weather-related prior information to help recover the scene by enhancing the features using the estimated occlusion and transmission. Experimental results in multiple adverse scenarios show the superiority of our method against state-of-the-art methods.

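2605.13156 2026-05-14 cs.CV version update

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

Jiaxin Liu, Ding Zhong, Yue Wang, Zhidong Yang, Zhaolu Kang, Guangyuan Dong, Qishi Zhan, Pengcheng Fang, Aofan Liu

Affiliations * UIUC; UMich; Stanford; HKUST; PKU; NUS; Marquette; Southampton

AI summary Vision-language models (VLMs) perform well on cross-modal understanding tasks but often suffer from object hallucination, describing content that is not present in the input image, which harms their reliability and interpretability. This paper proposes a dual-pathway circuit analysis framework for identifying and analyzing hallucination-related circuits in VLMs. Using activation patching and conditional pathway analysis, the study finds a visual grounding pathway that supports correct predictions and a hallucination pathway that leads to erroneous outputs, and reveals how the two interact. Experiments show that suppressing hallucination-pathway components significantly reduces object hallucination, and that the circuit mechanism is consistent across model architectures and transfers selectively across hallucination types.

Details
English abstract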

Vision-language models (VLMs) have demonstrated remarkable capabilities in bridging visual perception and natural language understanding, enabling a wide range of multimodal reasoning tasks. However, they often produce object hallucinations, describing content absent from the input image, which limits their reliability and interpretability. To address this limitation, we propose Dual-Pathway Circuit Analysis, a framework that identifies and characterizes hallucination-related circuits in VLMs for mechanistic understanding and causal probing. We first apply activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. We then introduce Conditional Pathway Analysis (CPA) to characterize pathway-level interactions, revealing that grounding components remain strongly redundant in both correct and hallucinating samples but undergo a consistent polarity flip, shifting from supporting the ground truth on correct samples to aligning with the hallucinated answer on erroneous ones. We further perform targeted suppression of hallucination-pathway components, showing that scaling these components reduces object hallucination by up to 76% with minimal accuracy cost, and validate that the same circuit selectively transfers to relational but not attribute hallucination. Evaluations on POPE-adversarial and AMBER show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types.

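2605.13155 2026-05-14 cs.CV version update

Pareto-Guided Optimal Transport for Multi-Reward Alignment

Ying Ba, Tianyu Zhang, Mohan Zhou, Yalong Bai, Wenyi Mo, Guiwei Zhang, Bing Su, Ji-Rong Wen

Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models; Engineering Research Center of Next-Generation Intelligent Search; Rutgers University

AI summary Text-to-image generation models have made notable progress in preference optimization, but achieving robust alignment across diverse reward models remains a major challenge. This paper proposes a Pareto Frontier-Guided Optimal Transport (PG-OT) framework that builds a prompt-specific Pareto frontier and uses distribution-aware optimal transport to map dominated samples toward it, mitigating reward hacking. The authors also introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as metrics for multi-reward synergy and reward-hacking risk, and experiments show that the method outperforms existing methods on several metrics.

Comments Accepted to ICML 2026

Details
English abstract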

Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
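The prompt-specific Pareto frontier mentioned in the abstract can be illustrated with a minimal sketch. It only shows how non-dominated samples are separated from dominated ones for a single prompt; the optimal-transport mapping is not shown, and the function names and reward values below are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` on every reward and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Indices of samples that no other sample dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (reward_1, reward_2) scores for four images generated from one prompt.
scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
front = pareto_front(scores)
# Dominated samples are the ones a method like PG-OT would move toward the frontier.
dominated = [i for i in range(len(scores)) if i not in front]
```

Here sample 2 is dominated by sample 1, while samples 0, 1, and 3 trade one reward against the other and therefore all sit on the frontier.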

2605.13152 2026-05-14 cs.CV cs.AI cs.LG cs.RO Version update

EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision

Jiahao Chen, Zihui Zhang, Yafei Yang, Jinxi Li, Shenxing Wei, Zhixuan Sun, Bo Yang

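Affiliations * Shenzhen Research Institute, The Hong Kong Polytechnic University; vLAR Group, The Hong Kong Polytechnic University

AI summary This paper proposes EvObj, an unsupervised 3D instance segmentation method that addresses the geometric domain gap between synthetic data and real-world point cloud scenes. By introducing an object discerning module and an object completion module, the method dynamically refines object priors and reconstructs partial geometry, improving segmentation in real scenes. Experiments show that EvObj outperforms existing methods on multiple datasets and achieves state-of-the-art results.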

Comments CVPR 2026. Code and data are available at: https://github.com/vLAR-group/EvObj

Details
English abstract

We introduce EvObj, an unsupervised 3D instance segmentation method that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two modules: (1) an object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) an object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real-world and synthetic datasets, demonstrating state-of-the-art 3D object segmentation performance over all baselines.

2605.13151 2026-05-14 cs.CV Version update

GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

Jiyong Rao, Yu Wang, Shengjie Zhao

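Affiliations * School of Computer Science and Technology, Tongji University

AI summary GenCape is a generative framework for category-agnostic pose estimation (CAPE) that localizes keypoints in images of arbitrary categories using only a few annotated support examples. It infers relationships between keypoints automatically from image-based support inputs, without additional textual descriptions or predefined skeletons, avoiding the reliance on manual annotation and the structural inflexibility of earlier methods. GenCape comprises an iterative structure-aware variational autoencoder and a compositional graph transfer module, which capture instance-level structure and promote semantic alignment across categories; in few-shot settings it achieves substantial gains over graph-support baselines while remaining competitive with text-support methods.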

Comments Accepted at ICLR 2026

Details
English abstract

Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model's capacity to capture instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a generative framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference; these are embedded layer-wise into the Graph Transformer Decoder for progressive refinement of structural priors. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1- and 5-shot settings, while maintaining competitive performance against text-support counterparts.

2605.13148 2026-05-14 cs.LG cs.CV Version update

Understanding Generalization through Decision Pattern Shift

Huiqi Deng, Yibo Li, Quanshi Zhang, Peng Zhang, Hongbin Pei, Xia Hu

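Affiliations * Xi’an Jiaotong University; Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

AI summary This paper studies why deep neural networks fail to generalize to unseen samples and proposes a new analytical perspective, Decision Pattern Shift (DPS). The approach analyzes the stability of a model's internal decision patterns and quantifies how far they deviate between training and test, using that deviation as a measure of generalization. The study finds that decision patterns are highly structured and class-consistent, and that the magnitude of their shift correlates strongly and linearly with the generalization gap, giving a unified explanation for different generalization failure scenarios.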

Comments 14 pages, 12 figures, computer vision and pattern recognition

Details
English abstract

Understanding why deep neural networks (DNNs) fail to generalize to unseen samples remains a long-standing challenge. Existing studies mainly examine changes in externally observable factors such as data, representations, or outputs, yet offer limited insight into how a model's internal decision mechanism evolves from training to test. To address this gap, we introduce Decision Pattern Shift (DPS), a new perspective that defines generalization through the stability of internal decision patterns and quantifies failure as their deviation from those learned during training. Specifically, we represent each sample's decision pattern as a GradCAM-based channel-contribution vector, which captures how feature channels collectively support a prediction, and we propose the DPS metric to measure its discrepancy from the class-average pattern. Empirical analyses across multiple datasets and architectures show that, (i) decision patterns form a highly structured, class-consistent space with strong intra-class cohesion and low inter-class confusion, enabling direct analysis of a model's decision logic; (ii) the DPS magnitude correlates linearly with the generalization gap (nearly all Pearson r > 0.8), revealing generalization as a systematic drift in the model's internal decision mechanism; (iii) the DPS spectrum organizes diverse generalization degradation scenarios (covering ideal generalization, in-distribution degradation, domain shift, out-of-distribution, and shortcut learning) into a continuous trajectory, providing a unified explanation of their failure modes. These findings open up new possibilities for early generalization-risk detection, failure-mode diagnosis, and channel-level defect localization.
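A minimal sketch of the idea of scoring a sample by how far its decision pattern drifts from the class-average pattern. The abstract does not state which discrepancy measure DPS uses, so cosine distance is an assumption here, and the channel-contribution vectors are made-up numbers rather than real GradCAM outputs.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Channel-wise average of a list of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumed stand-in for the paper's discrepancy measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an all-zero pattern as maximally shifted
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical channel-contribution vectors for training samples of one class.
train_patterns = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
class_pattern = mean_vector(train_patterns)

# A test sample that relies on the same channels versus one that relies on a different channel.
stable_shift = cosine_distance(class_pattern, [0.9, 0.1, 0.0])
drifted_shift = cosine_distance(class_pattern, [0.0, 0.0, 1.0])
```

Under this assumed measure, a test sample that uses the same channels as the class average scores near 0, and one that relies on an unrelated channel scores near 1.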

2605.13146 2026-05-14 stat.ML cs.CV cs.LG Version update

On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods

David Iagaru, Nina M. Gottschling, Anders C. Hansen, Josselin Garnier

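Affiliations * Gauss Centre for Supercomputing e.V.; John von Neumann Institute for Computing; Deutsches Zentrum für Luft und Raumfahrt; Laboratory Directed Research and Development Program of Oak Ridge National Laboratory; UT-Battelle, LLC; Computing and Computational Sciences, Oak Ridge National Laboratory; DAMTP, University of Cambridge

AI summary This paper studies hallucinations in inverse problems: plausible-looking but incorrect details generated by AI models. The authors develop a theoretical framework showing that such hallucinations stem not only from the model but can also arise from the ill-posed nature of the inverse problem itself, and they derive necessary and sufficient conditions for hallucinations together with computable bounds that depend only on the forward model. Building on this theory, the paper proposes two algorithms, one estimating the minimum hallucination magnitude and one assessing the faithfulness of reconstructed details. Experiments show that the approach applies to a range of imaging tasks and generative models, providing a principled basis for quantifying and evaluating AI hallucinations.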

Comments 31 pages, 11 figures; code available at https://github.com/davidiagraid/hallucinations_invpb

Details
English abstract

Artificial intelligence (AI) has transformed imaging inverse problems, from medical diagnostics to Earth observation. Yet deep neural networks can produce hallucinations, realistic-looking but incorrect details, undermining their reliability, especially when ground truth data is unavailable. We develop a theoretical framework showing that such hallucinations are not merely artifacts of particular models, but can arise from the ill-posed nature of the inverse problem itself. We derive necessary and sufficient conditions for hallucinations, together with computable bounds on their magnitude that depend only on the forward model. Building on this theory, we introduce algorithms to: (1) estimate the minimum hallucination magnitude achievable by any reconstruction model for a given input; (2) assess the faithfulness of reconstructed details by a given reconstruction model. Experiments across three imaging tasks demonstrate that our approach applies broadly, including to modern generative models, and provides a principled way to quantify and evaluate AI hallucinations.

2605.13140 2026-05-14 cs.CV Version update

Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

Sangin Lee, Seokjun Kwon, Jeongmin Shin, Namil Kim, Yukyung Choi

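Affiliations * Sejong University; NAVER LABS; Artificial Intelligence and Robotics Institute (AIRI)

AI summary This paper studies object detection under multi-source domain adaptation, aiming to improve detection in a target domain whose distribution differs from the training data. To address the inability of existing methods to preserve domain-specific information while learning domain-agnostic features, the authors propose MS-DePro, which uses depth maps to guide object localization and text prompts to guide the alignment of classification features. The method achieves state-of-the-art performance on multiple benchmarks.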

Details
English abstract

General object detection (OD) struggles to detect objects in a target domain that differs from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them separately for multi-source domain adaptation (MSDA) outperforms blending them for unsupervised domain adaptation (UDA). However, existing MSDA methods learn domain-agnostic features from domain-specific RGB images while preserving domain-specific information from the domain-agnostic feature map. To resolve this tension, we propose MS-DePro: Multi-Source Detector with Depth and Prompt, composed of (1) depth-guided localization and (2) multi-modal guided prompt learning. We leverage domain-agnostic input modalities, namely depth maps and text, to encode domain-agnostic characteristics. Specifically, we utilize depth maps to generate domain-agnostic region proposals for localization and integrate multi-modal features to align learnable text embeddings for classification. MS-DePro achieves state-of-the-art performance on MSDA benchmarks, and comprehensive ablations demonstrate the effectiveness of our contributions. Our code is available at https://github.com/sejong-rcv/Multi-Modal-Guided-Multi-Source-Domain-Adaptation-for-Object-Detection.

2605.13129 2026-05-14 cs.GR cs.CV Version update

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

Nikitas Chatzis, Marios Loizou, Evangelos Kalogerakis

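Affiliations * Technical University of Crete; CYENS Center of Excellence; University of Massachusetts Amherst

AI summary Rigel3D is a generative method for animation-ready 3D assets that addresses the lack of skeletal rigs, joint hierarchies, and skinning weights in the outputs of existing 3D generative models. It jointly models geometry and rig structure through coupled surface and skeleton structured latent representations, and uses a rig-aware autoencoder to produce the mesh, skeleton topology, joint coordinates, and skinning weights. Rigel3D also introduces an open-vocabulary joint labeling module that supports correspondence between generated joints and arbitrary retargeting templates. Experiments show that it outperforms existing methods on multiple metrics and generates diverse, high-quality animation-ready 3D assets.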

Details
English abstract

Recent 3D generative models can synthesize high-quality assets, but their outputs are typically static: they lack the skeletal rigs, joint hierarchies, and skinning weights required for animation. This limits their use in games, film, simulation, virtual agents, and embodied AI, where assets must not only look plausible but also move plausibly. We introduce Rigel3D, a generative method for animation-ready 3D assets represented as rigged meshes. Unlike post-hoc auto-rigging methods that attach rigs to completed shapes, our method jointly models geometry and rig structure through coupled surface and skeleton structured latent representations. A rig-aware autoencoder decodes these representations into mesh geometry, skeleton topology, joint coordinates, and skinning weights, while a two-stage latent generative model synthesizes both surface and skeleton representations for image-conditioned generation. To support downstream animation workflows, we further introduce an open-vocabulary joint labeling module that embeds generated joints into a shared vision-language space, enabling correspondence to arbitrary retargeting templates. Experiments on large-scale rigged asset datasets demonstrate that our method generates diverse, high-quality animation-ready assets and outperforms existing rigging baselines across multiple metrics.

2605.13122 2026-05-14 cs.CV Version update

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

Jingxuan He, Xiyu Wang, Yunke Wang, Mengyu Zheng, Chang Xu

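Affiliations * The University of Sydney

AI summary This paper studies the semantic grounding ability of instruction-based image editing models for zero-shot referring image segmentation. The analysis finds that these models already produce internal representations with strong foreground-background separability at the earliest stage of denoising, implicitly performing language-conditioned semantic grounding. Building on this, the authors propose a training-free framework that uses the intermediate representations of pretrained image editing models and decomposes localization into attention-based spatial priors and feature-based semantic discrimination. It produces accurate segmentation masks without full image synthesis and outperforms existing zero-shot methods on multiple datasets.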

Details
English abstract

Instruction-based image editing (IIE) models have recently demonstrated strong capability in modifying specific image regions according to natural language instructions, which implicitly requires identifying where an edit should be applied. This indicates that such models inherently perform language-conditioned visual semantic grounding. In this work, we investigate whether this implicit grounding can be leveraged for zero-shot referring image segmentation (RIS), a task that requires pixel-level localization of objects described by natural language expressions. Through systematic analysis, we reveal that strong foreground-background separability emerges in the internal representations of these models at the earliest denoising timestep, well before any visible image transformation occurs. Building on this insight, we propose a training-free framework that repurposes pretrained image editing models for RIS by exploiting their intermediate representations. Our approach decomposes localization into two complementary components: attention-based spatial priors that estimate where to focus, and feature-based semantic discrimination that determines what to segment. By leveraging feature-space separability, the framework produces accurate segmentation masks using only a single denoising step, without requiring full image synthesis. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our method achieves superior performance over existing zero-shot baselines.

2605.13119 2026-05-14 cs.RO cs.AI cs.CV Version update

Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

Zixing Lei, Changxing Liu, Yichen Xiong, Minhao Xiong, Yuanzhuo Ding, Zhipeng Zhang, Weixin Li, Siheng Chen

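Affiliations * Shanghai Jiao Tong University; Zhongguancun Academy; Beihang University

AI summary This work addresses the limited ability of vision-language-action (VLA) models to execute long-horizon tasks and proposes a strategy that combines a high-level vision-language model with specialized tool-style VLA modules. By introducing Tool-Aligned Post-Training (TAPT) and a tool-family interface, it couples long-horizon planning with execution efficiently and improves robot task success rates and instruction-following fidelity.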

Details
English abstract

Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools for diverse local physical operations. The VLM handles scene analysis, global planning, and recovery, while each VLA tool executes a bounded subtask. To tightly couple agent planning with VLA tool execution in long-horizon tasks, we introduce a VLA tool-family interface that exposes explicit tool selection and in-execution progress feedback, enabling efficient event-triggered agent replanning without continuous agent polling. To obtain diverse specialized VLA tools that faithfully follow agent invocations, we further propose Tool-Aligned Post-Training (TAPT), which constructs invocation-aligned training units for instruction following and adopts tool-family residual adapters for efficient tool specialization. Experiments show that VLAs-as-Tools improves the success rate of $π_{0.5}$ by 4.8 points on LIBERO-Long and 23.1 points on RoboTwin, and further enhances invocation fidelity by 15.0 points as measured by Non-biased Rate. Code will be released.

2605.13111 2026-05-14 cs.CV Version update

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

Jiayu Chen, Junbei Tang, Wenbiao Zhao, Maoliang Li, Jiayi Luo, Zihao Zheng, Jiawei Yang, Guojie Luo, Xiang Chen

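Affiliations * Peking University; South China University of Technology; Xinjiang University; Beihang University; Zhongguancun Academy

AI summary This paper proposes Pyramid Forcing, a head-aware pyramidal KV cache policy for improving the quality of long video generation. By analyzing how different attention heads attend to historical frames, it identifies three head types with distinct behaviors and assigns each a tailored cache policy, mitigating the degradation caused by long-term error accumulation. Experiments show that the method significantly improves long-horizon video generation quality on multiple metrics.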

Details
English abstract

Autoregressive video generation enables streaming and open-ended long video synthesis, but still suffers from long-term degradation caused by accumulated errors. Existing KVCache strategies usually apply unified historical-frame retention, implicitly assuming homogeneous historical dependencies across attention heads. We revisit historical-frame attention and reveal three distinct head types: Anchor Heads require broad long-range context, Wave Heads exhibit periodic temporal dependencies, and Veil Heads focus on initial and adjacent frames. Based on this finding, we propose Pyramid Forcing, a head-aware pyramidal KVCache framework that identifies head types offline, assigns behavior-specific cache policies, and supports heterogeneous cache lengths via efficient ragged-cache attention. Experiments on Self Forcing and Causal Forcing show that Pyramid Forcing consistently improves long-horizon generation quality on VBench-Long, increasing the 60-second Self Forcing score from 77.87 to 81.21 while enhancing motion dynamics, visual fidelity, and semantic consistency. Project: https://if-lab-pku.github.io/Pyramid-Forcing/.
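The head-type-specific cache policies described above can be pictured with a small sketch. The abstract names the three head types but not the concrete retention rules, so the rules, the `retained_frames` function, and its `budget` and `period` parameters are all hypothetical illustrations.

```python
from typing import List


def retained_frames(head_type: str, current: int, budget: int, period: int = 4) -> List[int]:
    """Choose which past frame indices (0 .. current-1) one head keeps in its KV cache.

    All three rules are illustrative assumptions, not the paper's policies.
    """
    history = list(range(current))
    if head_type == "veil":
        # Initial frame plus the most recent frames.
        recent = history[-(budget - 1):] if budget > 1 else []
        keep = set(recent) | ({0} if current > 0 else set())
    elif head_type == "wave":
        # Frames at a fixed temporal period counted back from the current frame.
        keep = {current - k * period for k in range(1, budget + 1) if current - k * period >= 0}
    elif head_type == "anchor":
        # Frames spread evenly over the whole history for broad long-range context.
        step = max(1, current // budget)
        keep = set(history[::step][:budget])
    else:
        raise ValueError(f"unknown head type: {head_type}")
    return sorted(keep)


# Different heads end up with different cache contents and lengths (a ragged cache).
veil = retained_frames("veil", current=10, budget=3)
wave = retained_frames("wave", current=10, budget=3, period=4)
anchor = retained_frames("anchor", current=10, budget=3)
```

Because each head may keep a different number of frames, an attention kernel over such caches has to handle unequal lengths, which is presumably the role of the ragged-cache attention the abstract mentions.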

2605.13108 2026-05-14 cs.CV Version update

Flow Augmentation and Knowledge Distillation for Lightweight Face Presentation Attack Detection

Muhammad Shahid Jabbar, Muhammad Sohail Ibrahim, Taha Hasan Masood Siddique, Kejie Huang, Shujaat Khan

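Affiliations * SDAIA-KFUPM Joint Research Center for Artificial Intelligence; King Fahd University of Petroleum & Minerals; Interdisciplinary Research Center for Intelligent Secure Systems (IRC-ISS); College of Information Science & Electronic Engineering; Department of Computer Engineering, College of Computing and Mathematics

AI summary This paper studies lightweight face presentation attack detection (FacePAD) under sophisticated attack types and varying capture conditions, and proposes a method that combines optical-flow augmentation with knowledge distillation. Optical flow enriches the motion representation during training but is not computed at inference. A dual-branch teacher model fuses appearance and motion cues, and knowledge distillation transfers motion-aware knowledge to a lightweight student model, improving detection while reducing computational cost. Experiments show strong detection results on multiple benchmark datasets and real-time detection at 52 frames per second on an embedded device.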

Comments Accepted at 2026 International Conference on Automatic Face and Gesture Recognition (FG)

Details
English abstract

Face presentation attack detection (FacePAD) remains challenging under diverse spoofing representation, including 2D print and replay, 3D mask-based spoofing, makeup-induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real-time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual-branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel-encoded optical flow, enabling effective modeling of micro-motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion-aware knowledge from the flow-augmented teacher to a lightweight RGB-only student via logit distillation. As a result, the student implicitly learns motion-sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks, achieving 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU. The distilled student achieves performance comparable to or better than the teacher while significantly reducing parameters and FLOPs, achieving 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real-time and resource-constrained FacePAD deployment.
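As a rough illustration of logit distillation, the sketch below computes the standard temperature-scaled KL divergence between teacher and student output distributions. The abstract does not give the exact loss or temperature used, so the formulation and the example logits are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      temperature: float = 2.0) -> float:
    """T^2 * KL(teacher || student) on softened distributions (the classic KD form)."""
    p = softmax(teacher_logits, temperature)  # teacher: RGB + optical-flow model
    q = softmax(student_logits, temperature)  # student: RGB-only model
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical (live, spoof) logits for one video clip.
matched = distillation_loss([2.0, 0.5], [2.0, 0.5])
mismatched = distillation_loss([0.5, 2.0], [2.0, 0.5])
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, which is how a flow-free student can be pushed toward the teacher's motion-aware decisions.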

2605.13093 2026-05-14 cs.CV Version update

RoSplat: Robust Feed-Forward Pixel-wise Gaussian Splatting for Varying Input Views and High-Resolution Rendering

Hoang Chuong Nguyen, Renjie Wu, Jose M. Alvarez, Miaomiao Liu

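Affiliations * Australian National University; NVIDIA

AI summary RoSplat is a robust feed-forward pixel-wise Gaussian splatting method that addresses the over-bright renderings and hole artifacts that appear when the number of input views varies and when rendering at high resolution. It introduces an alpha normalization strategy and an auxiliary 3D sampling-based regularizer, improving Gaussian scale estimation and rendering consistency. Experiments show that RoSplat significantly improves on existing methods on multiple benchmark datasets, particularly under varying input views and high-resolution settings.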

Details
English abstract

Generalizable 3D Gaussian Splatting has recently emerged as an efficient approach for novel-view synthesis, enabling feed-forward synthesis from only a few input views. However, existing pixel-wise feed-forward methods suffer from over-bright renderings when the number of input views varies during inference, as well as insufficient supervision for accurate Gaussian scale estimation, which leads to hole artifacts, particularly in high-resolution renderings. To address these issues, we identify that the over-brightness is caused by the varying number of overlapping Gaussians and propose a simple alpha normalization strategy to maintain brightness consistency across different numbers of input views. In addition, we introduce an auxiliary 3D sampling-based regularizer to improve Gaussian scale estimation, thereby mitigating hole artifacts in high-resolution rendering. Experiments on benchmark datasets demonstrate that our method significantly improves baseline models under varying input-view and high-resolution rendering settings.

2605.13080 2026-05-14 cs.CV Version update

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

Junha Song, Byeongho Heo, Geonmo Gu, Jaegul Choo, Dongyoon Han, Sangdoo Yun

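Affiliations * NAVER AI Lab

AI summary This paper studies how multimodal large language models can attend more efficiently to key image regions in visual description tasks. The authors propose a new attention mechanism, Gaze Attention, which groups visual embeddings into compact gaze regions and dynamically selects the task-relevant regions for attention computation, reducing redundant computation and sharpening focus. To preserve global context, they also introduce learnable context tokens. Experiments show that the method performs strongly on image and video understanding tasks while substantially reducing the number of visual key-value entries used.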

Details
English abstract

When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings, stored as key-value caches, into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
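The region-selection step can be sketched as follows. This is a simplified illustration under stated assumptions: regions are contiguous runs of tokens rather than spatial blocks, each descriptor is the mean of the region's keys, and relevance is a dot product with the query. None of these choices is confirmed by the abstract, and the learnable context tokens are omitted.

```python
from typing import List, Sequence


def region_descriptors(keys: List[Sequence[float]], region_size: int) -> List[List[float]]:
    """Summarize each group of `region_size` visual keys by their mean (assumed descriptor)."""
    regions = [keys[i:i + region_size] for i in range(0, len(keys), region_size)]
    return [[sum(column) / len(region) for column in zip(*region)] for region in regions]


def select_regions(query: Sequence[float], descriptors: List[Sequence[float]], top_k: int) -> List[int]:
    """Indices of the `top_k` regions whose descriptors score highest against the query."""
    scores = [sum(q * d for q, d in zip(query, desc)) for desc in descriptors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def kept_token_indices(selected: List[int], region_size: int, num_tokens: int) -> List[int]:
    """Visual token positions that remain visible to attention after region selection."""
    return [t for r in selected
            for t in range(r * region_size, min((r + 1) * region_size, num_tokens))]


# Six hypothetical 2-D visual keys forming three regions of two tokens each.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
query = [0.0, 1.0]
descriptors = region_descriptors(keys, region_size=2)
selected = select_regions(query, descriptors, top_k=1)
visible = kept_token_indices(selected, region_size=2, num_tokens=len(keys))
```

With one of three regions selected, attention touches two of the six visual keys, which is the kind of reduction in KV entries the abstract reports at larger scale.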

2605.13062 2026-05-14 cs.CV Version update

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Xuehai Bai, Yang Shi, Yi-Fan Zhang, Xuanyu Zhu, Yuran Wang, Yifan Dai, Xinyu Liu, Yiyan Ji, Xiaoling Gu, Yuanxing Zhang

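Affiliations * HDU; PKU (Peking University); Kling Team; CASIA (Institute of Automation, Chinese Academy of Sciences)

AI summary Image editing models have recently made notable progress in instruction following, multimodal understanding, and complex visual editing, but existing benchmarks struggle to reflect human judgment, especially for frontier models, because of limited task difficulty and coarse-grained evaluation. To address this, the paper proposes Edit-Compass and EditReward-Compass, a unified evaluation benchmark for image editing and reward models. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively harder task categories and adopts a fine-grained multidimensional evaluation framework. EditReward-Compass contains 2,251 preference pairs that simulate reward modeling scenarios in practical reinforcement learning, providing a more realistic basis for model evaluation.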

Details
English abstract

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.

2605.13059 2026-05-14 cs.CV Version update

BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability

Guangqian Yang, Tong Ding, Wenlong Hou, Yue Xun, Ye Du, Qian Niu, Shujun Wang

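Affiliations * Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China; Department of Technology Management for Innovation, The University of Tokyo, Japan; Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China

AI summary This paper proposes BrainAnytime, a unified pretraining framework for brain image analysis under arbitrary modality availability. Using cross-modal distillation and atlas-guided curriculum masking within a shared 3D masked autoencoder, it learns structural-molecular correspondences between MRI and PET and prioritizes disease-vulnerable anatomical regions. Experiments show that BrainAnytime outperforms existing models on most clinical modality settings, notably improving average accuracy on Alzheimer's disease classification.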

Comments Early accepted by MICCAI 2026

Details
English abstract

Clinical diagnostic workups typically follow a modality escalation pathway: after initial clinical evaluation, clinicians begin with routine structural imaging (e.g., MRI), selectively add sequences such as FLAIR or T2 to refine the differential, and reserve molecular imaging (e.g., amyloid-PET) for cases that remain uncertain after standard evaluation. Consequently, patients are observed with heterogeneous and often incomplete modality subsets. However, most current AI models assume a fixed set of input modalities. In this paper, we present BrainAnytime, a unified framework, pretrained on 34,899 3D brain scans from five datasets, that supports brain image analysis under arbitrary modality availability spanning multi-sequence MRI and amyloid-PET. A single model accepts whatever imaging is available, from a lone T1 scan to a full multimodal workup. Pretraining learns structural-molecular correspondences between MRI and PET via cross-modal distillation (RCMD) and prioritizes disease-vulnerable anatomy via atlas-guided curriculum masking (PACM), all within a shared 3D masked autoencoder (Multi-MAE3D). Across four downstream tasks and five clinically motivated modality settings, BrainAnytime outperforms modality-specific models, missing-modality baselines, and large-scale brain MRI pretrained foundation models on most modality settings. Notably, it surpasses the strongest missing-modality baselines with relative improvements of 6.2% and 7.0% in average accuracy on CN vs. AD and CN vs. MCI classification, respectively. Code is available at https://github.com/SDH-Lab/BrainAnytime.

2605.13049 2026-05-14 cs.CV Version update

Uncertainty-aware Spatial-Frequency Registration and Fusion for Infrared and Visible Images

Xingyuan Li, Haoyuan Xu, Xingyue Zhu, Jun Ma, Yang Zou, Zhiying Jiang, Jinyuan Liu

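Affiliations * Dalian University of Technology; Northwestern Polytechnical University; Dalian Maritime University

AI summary Infrared and visible image fusion (IVIF) is widely applicable in challenging environments, but fusion under unregistered conditions suffers from inherent misalignment. Existing methods mostly predict deformation parameters coarse-to-fine or estimate multi-scale deformation fields, yet they overlook the errors that accumulate during registration and degrade fusion quality. This paper proposes SFRF, a framework that performs registration and fusion across the spatial and frequency domains. By introducing uncertainty estimation and infrared thermal radiation distribution consistency, it handles registration error accumulation in a unified way and makes fusion more robust across both domains. With multi-scale iterative registration and a dual-branch spatial-frequency fusion module, the method achieves more accurate alignment and higher-quality image reconstruction.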

Comments 10 pages, 5 figures, 4 tables

Details
English abstract

Infrared and Visible Image Fusion (IVIF) has shown promise in visual tasks under challenging environments, but fusion under unregistered conditions faces inherent misalignments. Current studies to solve them either predict the deformation parameters coarse-to-fine (i.e., coarse registration and fine registration) or estimate the deformation fields in multi-scales for registration. Though straightforward, they overlook the cumulative errors in registration, which contaminate the fusion stage and severely deteriorate the resulting images. We introduce the Spatial-Frequency Registration and Fusion (SFRF) framework, which incorporates uncertainty estimation and infrared thermal radiation distribution consistency into a unified pipeline to handle the error accumulation for robust registration and fusion across both spatial and frequency domains. Specifically, SFRF constructs a Multi-scale Iterative Registration (MIR) framework that iteratively refines the deformation field across scales, leveraging uncertainty estimation at each stage to mitigate error accumulation and enhance alignment accuracy dynamically. To ensure the accurate alignment of infrared thermal distributions during registration, thermal radiation distribution consistency is employed as a frequency-domain supervisory signal, promoting global consistency in the frequency domain. Based on the spatial-frequency alignment, SFRF further adopts a Dual-branch Spatial-Frequency Fusion (DSFF) module, which incorporates spatial geometric features and frequency distribution information to reconstruct visually appealing images. SFRF achieves impressive performance across diverse datasets.

2605.13047 2026-05-14 cs.CV cs.AI Version update

Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

Ziqi Wen, Parsa Madinei, Miguel P. Eckstein

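Affiliations * Department of Computer Science, University of California, Santa Barbara; Department of Psychological and Brain Sciences, University of California, Santa Barbara

AI summary This study examines how vision-language models (VLMs) differ from human perception in high-level semantic scene understanding. The authors propose Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic method that quantifies an object's importance by measuring the semantic shift caused by removing it from the scene. Experiments show that VLMs over-rely on large objects, objects at the image center, and highly salient objects when interpreting scenes, while relying less than humans on the people in a scene, revealing a substantial gap between model and human semantic understanding.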

Details
English abstract

Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.

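2605.13041 2026-05-14 cs.CV version update

EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing

Inwoo Hwang, Donggeun Lim, Hojun Jang, Young Min Kim

Affiliations * Seoul National University

AI summary EgoForce is a framework for online reconstruction of long-term full-body motion from noisy egocentric input. It adopts a diffusion-based model with a temporally asymmetric noise schedule to cope with the sparse and noisy observations of real-time applications. By modeling temporally evolving uncertainty and denoising incrementally, EgoForce generates stable, coherent full-body motion under strict causal constraints; experiments show it outperforms existing online and offline methods in challenging egocentric scenarios.

Comments Project page: https://inwoohwang.me/EgoForce

Details
Abstract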

With recent advances in embodied agents and AR devices, egocentric observations are readily available as input for real-world interactive online applications. However, egocentric viewpoints can only sporadically observe hands, in addition to the estimated head trajectory. We propose EgoForce, an online framework for reconstructing long-term full-body motion from noisy egocentric input. While existing generative frameworks can robustly handle noisy and sparse measurements, they assume a fixed-length observation window is available and are thus not suitable for real-time applications. Faster inference often relies on autoregressive prediction, sacrificing robustness. In contrast, we adopt a diffusion-based method with a temporally asymmetric noise schedule inspired by Diffusion Forcing. Specifically, our approach models temporally evolving uncertainty and incrementally denoises states as new streaming observations arrive. Combined with a noise-robust imputation strategy, EgoForce progressively generates stable and coherent full-body motion under strict causal constraints. Experiments demonstrate that our online framework outperforms existing online and offline methods, enabling long-horizon, full-body motion reconstruction in challenging egocentric scenarios.
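
A toy sketch of the bookkeeping behind a temporally asymmetric noise schedule may help: older frames in a sliding window carry less noise than newer ones, and each streaming step lowers every frame's noise level by one. This is only an illustration under assumed conventions (integer noise levels, a linear ramp, one level per step); the paper's actual schedule and denoiser are not given in the abstract, and `asymmetric_levels` and `advance` are hypothetical names.

```python
def asymmetric_levels(window, max_level):
    """Noise level per frame: oldest frame is cleanest, newest is noisiest."""
    return [round(max_level * (i + 1) / window) for i in range(window)]

def advance(levels, max_level):
    """One streaming step.

    Every frame is denoised by one level, frames that reach level 0 are
    emitted as final output, and a fresh fully noisy frame is appended.
    Returns the new level list and the number of frames emitted.
    """
    stepped = [max(level - 1, 0) for level in levels]
    emitted = sum(1 for level in stepped if level == 0)
    kept = [level for level in stepped if level > 0]
    kept.append(max_level)
    return kept, emitted

levels = asymmetric_levels(window=4, max_level=4)
next_levels, emitted = advance(levels, max_level=4)
```

With a window of four frames and four noise levels, the schedule reaches a steady state in which exactly one frame is finalized per incoming observation, which matches the causal, online setting the abstract describes.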

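2605.13038 2026-05-14 cs.CV cs.AI version update

CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy

Liangjing Shao, Beilei Cui, Hongliang Ren

Affiliations * Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China; Shenzhen Loop Area Institute, China

AI summary This paper presents CoGE, a monocular online geometric estimation framework for colonoscopy that targets depth estimation and scene reconstruction in real-world settings. It introduces an illumination-aware supervision module based on Retinex theory and a structure-aware perception module based on wavelet decomposition to handle illumination differences and structural feature extraction in colonoscopy scenes. Experiments show that CoGE, trained only on simulated data, achieves state-of-the-art geometric estimation on both simulated and real scenes.

Comments Early Accepted by MICCAI 2026

Details
Abstract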

Geometric estimation, including depth estimation and scene reconstruction, is a crucial technique for colonoscopy that can provide surgeons with 3D spatial perception and navigation. However, geometric ground truth in colonoscopy is difficult to obtain due to the narrow and enclosed space of the colon, and there is a large feature gap between simulated and real data caused by artifacts and illumination. In this paper, we present CoGE, a novel framework for online monocular geometric estimation during colonoscopy. First, we propose an illumination-aware supervision module based on Retinex theory to address illumination diversity across colonoscopy scenes. Second, we propose a structure-aware perception module based on wavelet decomposition to extract common structural and local features of the colon. Both quantitative and qualitative results demonstrate that the proposed model, trained solely on simulated data, achieves state-of-the-art performance in geometric estimation for both simulated and real scenes.

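2605.13034 2026-05-14 cs.CV cs.IR version update

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

Zhuofan Shi, Peilun Jia, Baoqin Sun, Haiyang Shen, Sixiong Xie, Yun Ma, Xiang Jing

Affiliations * School of Software and Microelectronics, Peking University; National Key Laboratory of Data Space Technology and System; School of Software Engineering, Beijing Jiaotong University; College of Computer Science and Electronic Engineering, Hunan University

AI summary ViDR is a multimodal deep research framework that uses source figures as evidence to produce detailed, well-grounded research reports. It treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, and combines context-aware filtering, outline-aware reranking, and VLM-based analysis to improve the accuracy and relevance of figure evidence. The authors also introduce the MMR Bench+ benchmark; experiments show that ViDR outperforms existing mainstream models in report quality, figure integration, and verifiability, underscoring the importance of source visual evidence for multimodal deep research.

Details
Abstract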

Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.

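2605.13027 2026-05-14 cs.CV version update

PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

Zihang Xu, Xiaoyang Liu, Zheng Chen, Yulun Zhang, Xiaokang Yang

Affiliations * Shanghai Jiao Tong University

AI summary This paper proposes PRISM, a diffusion-based text image super-resolution method that addresses the reliability and structural accuracy of generated text detail under severe degradation. It introduces Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE), which improve the reliability of the global text prior and the precision of local stroke boundaries, respectively. Experiments show that PRISM achieves state-of-the-art performance on both synthetic and real-world datasets, with millisecond-level inference.

Comments Code is available at https://github.com/faithxuz/PRISM

Details
Abstract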

Text image super-resolution (Text-SR) requires more than visually plausible detail synthesis: slight errors in stroke topology may alter character identity and break readability. Existing methods improve text fidelity with stronger recognition-based or generative priors, yet they still face two unresolved challenges under severe degradation: the text condition extracted from low-quality inputs can itself be unreliable, and a plausible global prior does not fully determine fine-grained stroke boundaries. We present PRISM, a single-step diffusion-based Text-SR framework that addresses these two challenges through Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE). FMPR constructs a privileged training-time prior from paired low-quality/high-quality latents and learns a flow matching that transports degraded embeddings toward this restoration-oriented prior space, yielding more accurate and reliable global text guidance. SURE further predicts uncertainty-aware structural residuals to selectively absorb reliable local boundary evidence while suppressing ambiguous stroke cues. Together, these components enable explicit global prior rectification and local structure refinement within a single diffusion restoration pass. Experiments on both synthetic and real-world benchmarks show that PRISM achieves state-of-the-art performance with millisecond-level inference. Our dataset and code will be available at https://github.com/faithxuz/PRISM.
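
The idea of learning a flow that transports degraded embeddings toward a prior can be sketched with the standard conditional flow-matching objective. The sketch assumes a straight-line (rectified-flow style) path between a degraded embedding `x0` and a prior embedding `x1`; the abstract does not say which path or loss FMPR uses, and `predict_velocity` stands in for a learned network that is not shown.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

# Toy example: an oracle velocity field for a single training pair.
degraded = [0.0, 2.0]
prior = [1.0, 0.0]
oracle = lambda xt, t: [1.0, -2.0]

loss = flow_matching_loss(oracle, degraded, prior, t=0.3)
# With a perfectly straight field, one Euler step of size 1 reaches the prior.
transported = [a + 1.0 * v for a, v in zip(degraded, oracle(degraded, 0.0))]
```

A straight path is a natural fit for a single-step pipeline, since a well-learned constant velocity lets the embedding be moved to the target in one step rather than many.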

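2605.13018 2026-05-14 cs.CV version update

OCH3R: Object-Centric Holistic 3D Reconstruction

Yi Du, Yang You, Xiang Wan, Leonidas Guibas

Affiliations * Stanford University

AI summary OCH3R is a unified object-centric 3D reconstruction framework that predicts, from a single RGB image, the 6D poses and detailed 3D reconstructions of all objects in a scene at once. Its core is a transformer architecture that predicts per-pixel category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians per object, giving end-to-end inference in a single pass. The predicted Gaussians are transformed into canonical space and aligned with pre-rendered ground truth, which avoids costly per-image annotation and markedly improves reconstruction accuracy and inference efficiency.

Details
Abstract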

Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.
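
The supervision step, moving predicted Gaussians into canonical space with the predicted pose, amounts to inverting a similarity transform. The sketch below handles Gaussian centers only and assumes the pose maps canonical to camera coordinates as `x_cam = s * R @ x_canon + t`; the scale factor `s` and the function name `to_canonical` are assumptions, and a full version would also rotate and rescale each Gaussian's covariance.

```python
def to_canonical(points, R, t, s=1.0):
    """Invert x_cam = s * R @ x_canon + t for a list of 3D points.

    R is a 3x3 rotation matrix given as nested lists, t a translation.
    Returns x_canon = R^T @ (x_cam - t) / s for each point.
    """
    canonical = []
    for p in points:
        d = [p[i] - t[i] for i in range(3)]
        # Component j of R^T @ d is the dot product of column j of R with d.
        canonical.append(
            [sum(R[i][j] * d[i] for i in range(3)) / s for j in range(3)]
        )
    return canonical

# Toy pose: 90-degree rotation about z, translation (1, 2, 3), scale 2.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]
# The canonical point (1, 0, 0) lands at (1, 4, 3) in camera space.
recovered = to_canonical([[1.0, 4.0, 3.0]], R, t, s=2.0)
```

Comparing in canonical space means one set of pre-rendered targets per object can be reused across every image in which that object appears, which is the cost saving the abstract points to.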

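2605.13015 2026-05-14 eess.IV cs.CV cs.LG version update

A General Bézier Tree Encoding Counterfactual Framework for Retinal-Vessel-Mediated Disease Analysis

Tan Su, Ethan Elio Meidinger, Lin Gu, Ruogu Fang

Affiliations * Department of Electronic and Electrical Engineering; School of Data Science; Research Institute of Electrical Communication; J. Crayton Pruitt Family Department of Biomedical Engineering

AI summary This study proposes the Bézier Tree Encoding Counterfactual Framework (BTECF) for analyzing causal relationships between retinal vessel structure and systemic disease. It abstracts the retinal vascular network into connected cubic Bézier segments, preserving vessel topology while allowing atomic interventions on geometric features such as tortuosity and caliber. Combined with a diffusion generative model, BTECF performs controllable counterfactual generation of vessel structure without disturbing background texture, and is validated on diabetic retinopathy, ischemic stroke, and Alzheimer's disease, offering a unified generative paradigm for testing causal hypotheses across diseases.

Comments 33 pages, 6 figures; preprint

Details
Abstract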

The geometry of the retinal vessel is a key biomarker of vascular diseases, yet clinical evidence remains primarily observational. Existing generative counterfactuals intervene only at the image-level disease label, failing to isolate explicit anatomical structure. To address this limitation, we propose the Bézier Tree Encoding Counterfactual Framework (BTECF). By abstracting vascular networks into interconnected cubic-Bézier segments, BTECF establishes a disease-agnostic representation in which structural topology is explicitly preserved and atomically perturbable. Coupling this encoding with a diffusion-based generator enables parameter-level do-interventions on explicit geometric axes (e.g., tortuosity, caliber) while preserving background fundus textures. We validate BTECF on diabetic retinopathy, together with independent cohorts for ischemic stroke and Alzheimer's disease. Isolated counterfactual interventions produce dose-responsive shifts in classifier predictions; a matched pixel-drop control attenuates this response by an order of magnitude or more, ruling out out-of-distribution generation artifacts. By enforcing causal isolation between vessel topology and pixel-level confounders, BTECF provides a unified generative paradigm for hypothesis verification across systemic diseases. To support reproducibility, the code will be publicly released upon acceptance.
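
To make the parameter-level intervention concrete, here is a small sketch of one cubic Bézier vessel segment, a tortuosity measure, and an edit that changes tortuosity while keeping the endpoints fixed. Tortuosity is taken as arc length divided by chord length, which is one common definition and not necessarily the paper's; `scale_bend` is a hypothetical intervention, not BTECF's actual operator.

```python
import math

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a 2D cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    return (b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
            b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1])

def arc_length(ctrl, n=200):
    """Approximate curve length with an n-segment polyline."""
    pts = [cubic_bezier(*ctrl, i / n) for i in range(n + 1)]
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))

def tortuosity(ctrl):
    """Arc length over chord length; 1.0 for a straight segment."""
    return arc_length(ctrl) / math.dist(ctrl[0], ctrl[3])

def scale_bend(ctrl, k):
    """Scale the inner control points' offset from the chord by factor k."""
    p0, p1, p2, p3 = ctrl
    def rescale(p, frac):
        base = (p0[0] + frac * (p3[0] - p0[0]), p0[1] + frac * (p3[1] - p0[1]))
        return (base[0] + k * (p[0] - base[0]), base[1] + k * (p[1] - base[1]))
    return [p0, rescale(p1, 1 / 3), rescale(p2, 2 / 3), p3]

straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
bent = [(0.0, 0.0), (1.0, 2.0), (2.0, -2.0), (3.0, 0.0)]
relaxed = scale_bend(bent, 0.5)
```

Because the edit touches only two control points and leaves the endpoints in place, the segment stays attached to its neighbors in the vessel tree, which is the sense in which topology is preserved while geometry is perturbed.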

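2605.13010 2026-05-14 cs.CV cs.AI cs.SY eess.SY math.OC version update

Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

Yilie Huang, Xun Yu Zhou

Affiliations * Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA; Department of Industrial Engineering and Operations Research & Data Science Institute, Columbia University, New York, NY 10027, USA

AI summary This paper studies image inpainting with generative diffusion models and proposes AID, which keeps the pretrained diffusion backbone fixed and trains a small, reusable guidance module offline to inpaint many masked images efficiently. The problem is formulated as a deterministic guidance problem with a supervised terminal objective; an auxiliary Gaussian formulation yields a randomized problem that is learnable in high dimensions, leading to a data-driven continuous-time actor-critic algorithm. Experiments show that AID outperforms existing fixed-backbone and amortized inpainting methods across several datasets and mask types, with a better balance between inpainting quality and speed.

Details
Abstract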

We study image inpainting with generative diffusion models. Existing methods typically either train dedicated task-specific models, or adapt a pretrained diffusion model separately for each masked image at deployment. We introduce a middle-ground model, termed Amortized Inpainting with Diffusion (AID), which keeps a pretrained diffusion backbone fixed, trains a small reusable guidance module offline, and then reuses it across masked images without per-instance optimization. We formulate it as a deterministic guidance problem with a supervised terminal objective. To make this problem learnable in high dimensions, we derive an auxiliary Gaussian formulation and prove that solving this randomized problem recovers the optimal deterministic guidance field. This bridge yields a principled continuous-time actor-critic algorithm for learning the guidance module in a fully data-driven manner. Empirically, on AFHQv2 and FFHQ under the pixel EDM pipeline and on ImageNet under the latent EDM2 pipeline, AID consistently improves the quality-speed trade-off over strong fixed-backbone and amortized inpainting baselines across multiple mask types, while adding less than one percent trainable overhead.

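2605.12967 2026-05-14 cs.CV version update

ImageAttributionBench: How Far Are We from Generalizable Attribution?

Tingshu Mou, Zhipeng Wei, Chao Gong, Jingjing Chen, Xingjun Ma

Affiliations * Fudan University; University of California, Berkeley

AI summary As generative AI advances rapidly, synthetic images are becoming more realistic and diverse, posing serious challenges for image provenance and misinformation detection. This paper presents ImageAttributionBench, a comprehensive dataset of images synthesized by a range of advanced generative models, intended to drive research on more robust and generalizable image attribution methods. Experiments show that current mainstream attribution methods perform poorly on the dataset, exposing their limitations under semantic shift and image degradation and providing a rigorous benchmark for future work.

Details
Abstract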

The rapid advancement of generative AI has enabled the creation of highly realistic and diverse synthetic images, posing critical challenges for image provenance and misinformation detection. This underscores the urgent need for effective image attribution. However, existing attribution datasets are constrained by limited scale, outdated generation methods, and insufficient semantic diversity, hindering the development of robust and generalizable attribution models. To address these limitations, we introduce ImageAttributionBench, a comprehensive dataset comprising images synthesized by a wide array of advanced generative models with state-of-the-art (SOTA) architectures. Covering multiple real-world semantic domains, the dataset offers rich diversity and scale to support and accelerate progress in image attribution research. To simulate real-world attribution scenarios, we evaluate several SOTA attribution methods on ImageAttributionBench under two challenging settings: (1) training on a standard balanced split and testing on degraded images, and (2) training and testing on semantically disjoint splits. In both cases, current methods exhibit consistently poor performance, revealing significant limitations in their robustness and generalization to unseen semantic content. Our work provides a rigorous benchmark to facilitate the development and evaluation of future image attribution methods.

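2605.12957 2026-05-14 cs.CV version update

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

Hanxin Zhu, Cong Wang, Peiyan Tu, Jiayi Luo, Tianyu He, Xin Jin, Zhibo Chen

Affiliations * College of Information Science and Electronic Engineering, Zhejiang University

AI summary This paper proposes GTA, a new image-to-3D world generation method that follows a geometry-first, appearance-second strategy to improve the structural accuracy and cross-view consistency of generated scenes. Using two video diffusion models in two stages, it first generates coarse geometric structure and then synthesizes fine appearance detail conditioned on the predicted geometry. It also introduces a random latent shuffle strategy and a test-time scaling scheme that further improve generation quality and perceptual consistency. Experiments show that GTA outperforms existing methods in fidelity, visual quality, and geometric accuracy, and can serve as a general enhancement module for existing generation pipelines.

Details
Abstract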

Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/.

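2605.12954 2026-05-14 cs.CV cs.AI version update

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

Xiao Yang, Yingzhe Ma, Haoxuan Yu, Zixin Li, Ning Qin

Affiliations * University of Electronic Science and Technology of China

AI summary AdaFocus is an efficient long-video understanding framework that addresses the difficulty conventional methods have in balancing temporal coverage, visual detail, and computational efficiency. Through adaptive relevance-diversity sampling and a zero-cache look-back mechanism, it acquires evidence from the video progressively, reducing memory and compute overhead while retaining key visual detail. Experiments show that AdaFocus achieves a better efficiency-accuracy trade-off than existing methods on several benchmarks, substantially improving performance on long-video understanding tasks.

Comments 9 pages, 4 figures. Authors Xiao Yang and Yingzhe Ma contributed equally

Details
Abstract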

Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.
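
The trade-off between relevance and diversity in frame sampling can be illustrated with greedy maximal marginal relevance (MMR). This is a standard technique swapped in for illustration, not the AdaRD sampler itself, whose objective the abstract does not give; the relevance scores, similarity matrix, and weight `lam` below are hypothetical.

```python
def select_frames(relevance, similarity, k, lam=0.7):
    """Greedy MMR: pick k frames that are relevant yet mutually dissimilar.

    relevance[i]     query relevance of frame i
    similarity[i][j] visual similarity between frames i and j
    lam              weight on relevance; (1 - lam) penalizes redundancy
    """
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the similarity to the closest frame already chosen.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Frames 0 and 1 are near-duplicates; frame 2 is less relevant but distinct.
relevance = [0.9, 0.85, 0.3]
similarity = [[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
balanced = select_frames(relevance, similarity, k=2, lam=0.5)
relevance_only = select_frames(relevance, similarity, k=2, lam=1.0)
```

With the redundancy penalty active, the second pick skips the near-duplicate frame in favor of the distinct one, which is the behavior a compact preview needs when the token budget is small.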

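2605.12953 2026-05-14 cs.CV cs.AI version update

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Chao Hao, Jun Xu, Ji Du, Shuo Ye, Ziyue Qiao, Xiaodong Cun, Guangcong Wang, Xubin Zheng, Zitong Yu

Affiliations * School of Computing and Information Technology; Great Bay University; Hangzhou International Innovation Institute; Beihang University; Department of Computing; The Hong Kong Polytechnic University

AI summary This paper proposes Seg-Agent, a new training-free language-guided segmentation framework that removes conventional methods' dependence on large amounts of training data. By building an explicit multimodal reasoning loop, it lets a multimodal large language model reason interactively in the visual domain and so directly generate and refine segmentation results. The study also introduces the Various-LangSeg benchmark for comprehensively evaluating generalization across scenarios; experiments show that Seg-Agent matches state-of-the-art training-based methods without any parameter updates.

Details
Abstract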

Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

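2605.12952 2026-05-14 cs.CV version update

Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

Yongjin Cui, Xiaohui Fan

Affiliations * Zhejiang University

AI summary This paper gives a comprehensive analysis of Grad-ECLIP, published at ICML 2024, and argues that it is not a new technical route based on intermediate features but is equivalent to the existing attention-based interpretation route, which is also simpler to compute. It further exposes flaws in Grad-ECLIP, showing that the interpretations it produces are inconsistent with the original model's actual behavior, and sets out two fundamental principles that model interpretation should follow to avoid similar errors.

Details
Abstract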

Grad-ECLIP was published at ICML 2024 and represents a new technical route for Transformer interpretation (intermediate-feature-based). First, this paper demonstrates that the intermediate-feature-based route is not novel. Building on the existing attention-based route, we develop Attention-ECLIP, which is completely equivalent to Grad-ECLIP but simpler to compute. Through both formal derivation and experimental validation, we prove that the intermediate-feature-based route represented by Grad-ECLIP is an equivalent variant of the attention-based route. Next, this paper demonstrates that the Grad-ECLIP method is flawed: the interpretation results it produces are not those of the original model, and they are misaligned with the model's performance. We analyze the causes of these flaws and propose, or rather explicitly emphasize, two fundamental principles that model interpretation should adhere to in order to avoid similar errors.

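2605.12939 2026-05-14 cs.CV version update

DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

Xianbing Sun, Jiahui Zhan, Liqing Zhang, Jianfu Zhang

Affiliations * Shanghai Jiao Tong University

AI summary This paper proposes DirectTryOn, a one-step virtual try-on method that generates efficiently through straightened conditional transport. Observing that the try-on task is tightly constrained by its conditional inputs, it makes the generation trajectory more direct through pure conditional transport, a garment preservation loss, and a self-consistency loss, enabling single-step generation. Experiments show that the method substantially lowers inference cost while maintaining generation quality, reaching state-of-the-art performance.

Details
Abstract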

Recent diffusion- and flow-based virtual try-on (VTON) methods achieve strong results with pretrained generative models, but their reliance on multi-step sampling incurs high inference cost, while existing acceleration methods largely overlook the intrinsic structure of the try-on task. In this paper, we highlight a key observation: VTON outputs are highly constrained by the conditional inputs, suggesting that the conditional sampling trajectory can be much straighter than that in general image generation, making one-step generation a natural solution. However, limited task-specific data makes training from scratch impractical, forcing existing methods to fine-tune pretrained models whose objectives do not encourage such straight conditional trajectories. Thus, the deviation from an ideal straight path mainly comes from the mismatch between pretrained base models and the conditional nature of try-on generation, rather than from the task itself. Motivated by this insight, we encourage straighter VTON sampling trajectories through three targeted modifications: pure conditional transport, a garment preservation loss, and a self-consistency loss. We further introduce a one-step distillation stage. Extensive experiments show that our method achieves state-of-the-art performance with one-step sampling, establishing a new standard for efficient and high-quality VTON.

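2605.12938 2026-05-14 cs.CV cs.AI cs.LG version update

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

Seonghyun Jin, Youngmin Kim, Sunwoo Park, Jong Chul Ye

Affiliations * Graduate School of AI

AI summary This paper proposes CRePE, a Curved Ray Expectation Positional Encoding for unified-camera-controlled video generation. To address existing methods' shortcomings with complex camera configurations such as wide-angle and fisheye lenses, CRePE introduces a depth-aware positional distribution that captures the projected-path geometry induced by wide-field cameras, improving the stability of camera control and generation quality. Combining a Geometric Attention Adapter with pseudo supervision from a monocular geometry foundation model, the method supports a variety of camera models and performs well on several geometry-aware and perceptual-quality metrics.

Comments 17 pages, 8 figures, Under review

Details
Abstract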

Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model, including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. Controlled positional-encoding ablations show a better overall average rank than a RayRoPE-style endpoint PE baseline, demonstrating the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.
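
For readers unfamiliar with the Unified Camera Model, the sketch below shows its projection in one common parameterization with a single distortion parameter `xi`, where `xi = 0` reduces to the pinhole model. This illustrates only the camera model the abstract refers to, not CRePE's positional encoding; the intrinsics and the test point are made up, and the validity check is a simplification.

```python
import math

def ucm_project(point, fx, fy, cx, cy, xi):
    """Project a 3D point with the Unified Camera Model.

    The point is first normalized by (z + xi * d), where d is its distance
    from the camera center, then mapped to pixels with the intrinsics.
    """
    x, y, z = point
    d = math.sqrt(x * x + y * y + z * z)
    denom = z + xi * d
    if denom <= 0:
        # Simplified check; the exact valid region depends on xi.
        raise ValueError("point is outside the valid projection region")
    return (fx * x / denom + cx, fy * y / denom + cy)

point = (3.0, 0.0, 4.0)
pinhole = ucm_project(point, 100.0, 100.0, 50.0, 50.0, xi=0.0)
fisheye = ucm_project(point, 100.0, 100.0, 50.0, 50.0, xi=1.0)
```

With `xi > 0` the same off-axis point lands closer to the image center than under a pinhole camera, which is why encodings built on pinhole geometry alone do not carry over to wide-angle and fisheye lenses.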

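2605.12937 2026-05-14 cs.CV cs.AI cs.HC version update

AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

Jacob Lagogiannis, William Agnew, Rosa I. Arriaga, Sauvik Das

Affiliations * Franklin and Marshall College; Carnegie Mellon University; Georgia Institute of Technology

AI summary This paper proposes AuraMask, an extensible pipeline for developing anti-facial recognition image filters that are both adversarially effective and aesthetically acceptable. By emulating popular one-click Instagram filters, it produces 40 visually pleasing filters that meet or exceed the effectiveness of prior methods against open-source facial recognition models. Experiments show that these filters also achieve significantly higher user acceptance than prior methods, providing a useful tool for further research on privacy protection.

Comments 21 pages, 10 figures

Details
Abstract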

Anti-facial recognition (AFR) image filters alter images in ways that are subtle to people but blinding to computer vision. Yet, despite widespread interest in these technologies to subvert surveillance, users rarely use them in practice, because the "subtle" alterations are visible enough to conflict with users' self-presentation goals. To address this challenge, we propose AuraMask: a novel approach to creating AFR filters that are both adversarially effective and aesthetically acceptable. Using AuraMask, we produce 40 "aesthetic" filters that emulate popular "one-click" Instagram image filters. We show that AuraMask filters meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. Moreover, in a controlled online user study (N = 630) we confirm these filters achieve significantly higher user acceptance than prior methods. Lastly, we provide our AFR pipeline to the community for accelerated research in adversarially effective and aesthetically acceptable protections.

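2605.12927 2026-05-14 cs.CR cs.CV cs.HC version update

ThermalTap: Passive Application Fingerprinting in VR Headsets via Thermal Side Channels

Mahsin Bin Akram, A H M Nazmus Sakib, OFM Riaz Rahman Aranya, Raveen Wijewickrama, Kevin Desai, Murtuza Jadliwala

Affiliations * Meta HTC

AI summary This paper presents ThermalTap, a passive, non-contact side-channel attack that remotely identifies the VR application in use from the long-wave infrared radiation emitted by the headset chassis, with no device interaction or malicious software execution. It treats the headset's thermal signature as a high-fidelity proxy for internal computational workload and uses environmental sensor data to remove noise, achieving high-accuracy identification of multiple VR applications in indoor and outdoor settings. The study shows that thermal radiation is a non-negligible privacy risk for immersive systems, exposing a security gap that existing software protections and physical access controls do not cover.

Details
Abstract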

Standalone virtual reality (VR) headsets process highly sensitive personal, professional, and health-related data, yet their susceptibility to non-contact physical side channels remains largely unexplored. Existing side-channel attacks typically require malicious software execution or physical access to peripherals, making them conspicuous and potentially patchable. This paper introduces ThermalTap, the first passive, non-contact side-channel attack that fingerprints VR applications solely from the long-wave infrared (LWIR) radiation emitted by the headset chassis. By treating a headset's thermal signature as a high-fidelity proxy for internal computational workloads, ThermalTap enables remote application inference at meter-scale distances without any device interaction. To achieve robust performance in real-world settings, the system combines a commodity thermal camera with a multi-modal sensor suite (capturing ambient temperature, humidity, and airflow) to normalize environmental noise. We evaluate ThermalTap using six applications across three commercial standalone headsets. In indoor settings, ThermalTap identifies applications with over 90% accuracy using only 10 seconds of thermal camera data. Under outdoor conditions, with longer session-level observations, several applications remain identifiable despite environmental variability, with the strongest outdoor application reaching 81% accuracy. Our findings establish thermal radiation as a fundamental and unavoidable privacy risk for immersive systems, exposing a critical security gap that bypasses current software-level protections and physical access controls.

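2605.12919 2026-05-14 cs.CV version update

GuardMarkGS: Unified Ownership Tracing and Edit Deterrence for 3D Gaussian Splatting

Utae Jeong, Jaewan Choi, Junseok Lee, Jongheon Jeong, Sang Ho Yoon, ByoungSoo Koh, Sangpil Kim

Affiliations * Korea University; KAIST; Hanshin University

AI summary This paper proposes GuardMarkGS, a unified protection framework that addresses the dual risk facing 3D Gaussian Splatting (3DGS) assets: tracing ownership and preventing unauthorized editing. It combines scene-wide watermark optimization with an adversarial edit-deterrence strategy, using latent-anchor separation, diversion of the editing trajectory, and selectively strengthened adversarial updates to make ownership traceable while deterring edits. Experiments show that the framework balances watermark accuracy and edit deterrence while preserving rendering quality.

Comments Preprint

Details
Abstract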

3D Gaussian Splatting (3DGS) is becoming a practical representation for novel view synthesis, but its growing adoption, together with rapid advances in instruction-driven 3DGS editing, also exposes a dual copyright risk: once a 3DGS-based asset is released, it can be used without permission and manipulated through 3D editing. Existing protection methods address only one side of this problem. Watermarking can trace ownership after unauthorized use, but it cannot prevent malicious editing. Adversarial edit-deterrence methods can disrupt editing, but they do not provide evidence of ownership. To the best of our knowledge, we present the first unified protection framework for 3DGS that jointly optimizes ownership tracing and unauthorized editing deterrence. Our framework combines a scene-wide watermarking objective over all Gaussians with an adversarial objective for edit deterrence. The adversarial branch combines latent-anchor separation, denoising-trajectory diversion, and cross-attention diversion to divert the editing trajectory, while an update-saliency-motivated Gaussian selection strategy assigns stronger adversarial updates to mask-selected Gaussians, improving the balance among watermark recovery, edit deterrence, and rendering fidelity. Experiments on scenes from Mip-NeRF 360 and Instruct-NeRF2NeRF demonstrate that the proposed framework achieves a favorable balance among bit accuracy, edit deterrence, and rendering quality. These results suggest that practical copyright protection of 3DGS-based assets can be more effectively addressed by integrating ownership tracing and unauthorized editing deterrence into a single optimization framework.

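2605.12917 2026-05-14 cs.CV cs.LG version update

Adaptive Conformal Prediction for Reliable and Explainable Medical Image Classification

One Octadion, Novanto Yudistira, Lailil Muflikhah

Affiliations * Faculty of Computer Science, Universitas Brawijaya

AI summary This study addresses the overconfidence of deep learning models in medical image classification and proposes an adaptive conformal prediction method to improve diagnostic reliability and explainability. It extends RAPS with an Adaptive Lambda Criterion that controls coverage violations of the prediction sets, so that coverage stays high across inputs of varying difficulty. Experiments show that the method balances high coverage with small prediction sets on several medical image datasets, generalizes across domains, and suits safety-critical medical AI applications.

Comments To appear in IEA/AIE 2026 (Springer LNAI)

Details
English abstract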

Deep learning models for medical imaging often exhibit overconfidence, creating safety risks in ambiguous diagnostic scenarios. While Conformal Prediction (CP) provides distribution-free statistical guarantees, standard methods such as Regularized Adaptive Prediction Sets (RAPS) optimize for average efficiency and can mask severe failures on difficult inputs. We propose an Adaptive Lambda Criterion for RAPS that minimizes the worst-case coverage violation across prediction set size strata. On OrganAMNIST (58,850 abdominal CT images, 11 classes), standard size-optimized RAPS converges to near-deterministic behavior with stratified undercoverage on uncertain samples, while our method achieves 95.72 percent global coverage with average set size 1.09 and at least 90 percent coverage across all strata. Cross-domain validation on PathMNIST (107,180 pathology images, 9 classes) confirms generalizability. Quantitative Grad-CAM analysis (rho = -0.30, p < 1e-22) shows that multi-label predictions correspond to focused attention on anatomically ambiguous regions. These results demonstrate that the proposed method improves reliability while maintaining efficiency, making it suitable for safety-critical medical AI applications.
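
The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a non-randomized RAPS score, a separate tuning split, and a grid of candidate lambda values, and the names `raps_scores`, `calibrate`, `worst_stratum_violation`, and `pick_lambda` are hypothetical. For each candidate lambda it calibrates a threshold, groups tuning samples by prediction-set size, and keeps the lambda whose worst stratum falls least below the target coverage.

```python
import math
from collections import defaultdict

def raps_scores(probs, lam, k_reg):
    """Non-randomized RAPS score of every class for one sample."""
    order = sorted(range(len(probs)), key=lambda c: -probs[c])
    scores, cum = {}, 0.0
    for rank, c in enumerate(order, start=1):
        cum += probs[c]
        # Cumulative probability mass plus a penalty on ranks deeper than k_reg.
        scores[c] = cum + lam * max(0, rank - k_reg)
    return scores

def calibrate(probs, labels, lam, k_reg, alpha):
    """Conformal threshold from true-label scores on a calibration split."""
    s = sorted(raps_scores(p, lam, k_reg)[y] for p, y in zip(probs, labels))
    n = len(s)
    # Finite-sample quantile index, clipped to the largest score (a simplification).
    return s[min(n, math.ceil((n + 1) * (1 - alpha))) - 1]

def predict_set(probs, qhat, lam, k_reg):
    """All classes whose score does not exceed the calibrated threshold."""
    return {c for c, sc in raps_scores(probs, lam, k_reg).items() if sc <= qhat}

def worst_stratum_violation(probs, labels, qhat, lam, k_reg, alpha):
    """Largest shortfall below 1 - alpha over strata of equal set size."""
    hits = defaultdict(list)
    for p, y in zip(probs, labels):
        s = predict_set(p, qhat, lam, k_reg)
        hits[len(s)].append(y in s)
    return max(max(0.0, (1 - alpha) - sum(h) / len(h)) for h in hits.values())

def pick_lambda(cal, tune, lambdas, k_reg=1, alpha=0.1):
    """Return (lambda, violation) with the smallest worst-stratum violation."""
    results = []
    for lam in lambdas:
        qhat = calibrate(cal[0], cal[1], lam, k_reg, alpha)
        v = worst_stratum_violation(tune[0], tune[1], qhat, lam, k_reg, alpha)
        results.append((v, lam))
    v, lam = min(results)  # ties resolve toward the smaller lambda
    return lam, v
```

Here `cal` and `tune` are `(probabilities, labels)` pairs. Selecting on the worst stratum rather than on average set size is the point of contrast with size-optimized RAPS that the abstract draws.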

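2605.12882 2026-05-14 cs.CL cs.CV version update

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He

Affiliations * Peking University; Shanghai Artificial Intelligence Laboratory

AI summary CiteVQA is a new benchmark for evaluating trustworthy document intelligence, designed to address the neglect of evidence attribution in current document question-answering systems. It requires models to supply specific cited regions along with each answer, so that answer correctness and citation accuracy are evaluated together. Using the Strict Attributed Accuracy (SAA) metric, CiteVQA reveals that existing multimodal large language models often answer correctly while citing the wrong region, and provides a new evaluation tool for improving the reliability of document understanding systems.

Details
English abstract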

Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage, a serious risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.
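
The idea behind SAA, crediting a prediction only when answer and citation are both right, can be sketched as follows. The abstract does not say how a cited region is matched to the ground truth, so the IoU test and its 0.5 threshold are assumptions, and the record fields (`answer_correct`, `pred_box`, `gold_box`) are hypothetical names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def answer_accuracy(records):
    """Answer-only accuracy: the citation is ignored."""
    return sum(1 for r in records if r["answer_correct"]) / len(records)

def strict_attributed_accuracy(records, iou_threshold=0.5):
    """Fraction of records whose answer AND cited box are both correct.

    The IoU threshold is an assumed matching rule, not the benchmark's.
    """
    hits = sum(
        1 for r in records
        if r["answer_correct"]
        and iou(r["pred_box"], r["gold_box"]) >= iou_threshold
    )
    return hits / len(records)
```

The gap between `answer_accuracy` and `strict_attributed_accuracy` on the same records is one way to quantify the attribution hallucination the abstract describes.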

2605.12855 2026-05-14 cs.CV version update

Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

Jorge Tapias Gomez, Despoina Kanata, Aneesh Rangnekar, Christina Lee, Hannah Williams, Hannah Thompson, J. Joshua Smith, Francisco Sanchez-Vega, Mert R. Sabuncu, Julio Garcia-Aguilar, Harini Veeraraghavan

Affiliations * Department of Medical Physics, Memorial Sloan Kettering Cancer Center; School of Computer Science, Cornell University and Cornell Tech; Department of Surgery, Colorectal Service, Memorial Sloan Kettering Cancer Center; Department of Radiology, Weill Cornell Medical College; School of Electrical and Computer Engineering, Cornell University and Cornell Tech

AI summary This study proposes TREX, a deep learning method based on longitudinal endoscopy images, to predict tumor regrowth in rectal cancer patients managed with watch-and-wait surveillance. TREX combines images from post-treatment restaging and follow-up, using dual cross-attention and pretrained Swin Transformers to extract and fuse features without image registration, and thereby distinguishes complete response from local regrowth. Experiments show that TREX outperforms existing methods in both regrowth detection and early warning, and a clinical validation found its diagnostic accuracy comparable to that of specialist physicians.

Comments 14 Pages, 9 figures, 2 tables

Details
English abstract

Clinical trials indicate a benefit of watch-and-wait (WW) surveillance for patients with rectal cancer who show a complete or near-complete clinical response (CR) directly after treatment (restaging). However, there are no objectively accurate methods for early detection of local tumor regrowth (LR) from follow-up exams in patients undergoing WW. Hence, we developed Temporal Rectal Endoscopy Cross-attention (TREX), a longitudinal deep learning approach that combines pairs of images acquired at restaging and follow-up to distinguish CR from LR. TREX uses pretrained Swin Transformers in a siamese setting to extract features from longitudinal images and dual cross-attention to combine the features without spatial co-registration between image pairs. TREX and Swin-based baselines were trained under two settings: (a) detecting LR or CR at the last available follow-up and (b) early detection of LR at 3-6, 6-12, and 12-24 months before clinical confirmation. TREX achieved the highest accuracy in detecting LR with a high sensitivity of 97% ± 6% and a balanced accuracy of 90% ± 3%, and outperformed all baselines in early detection at both 3-6 (74% ± 1%) and 6-12 months (62% ± 4%) prior to clinical detection. Clinical validation via a surgeon survey showed that TREX matched attending-level overall accuracy (TREX: 86.21% vs. Clinicians: 87.84% ± 1.28%). Finally, we explored TREX's ability to predict treatment response by combining pre-treatment (pre-TNT) and restaging endoscopies, achieving a balanced accuracy of 73% ± 12%. These results show that longitudinal deep learning analysis of endoscopy may improve surveillance and enable earlier identification of rectal cancer regrowth.
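
For readers less familiar with the two headline metrics, the sketch below shows how sensitivity and balanced accuracy follow from a binary confusion matrix, with LR taken as the positive class. The counts in the usage note are invented for illustration and are not from the paper.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Return (sensitivity, specificity, balanced accuracy) from counts.

    tp/fn count regrowth cases detected/missed; tn/fp count complete
    responses correctly cleared/wrongly flagged.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2
```

With the made-up counts `balanced_accuracy(9, 1, 8, 2)`, sensitivity is 0.9, specificity is 0.8, and balanced accuracy is their mean, which is why it is less affected by class imbalance than plain accuracy.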

2605.12851 2026-05-14 cs.CV cs.AI version update

PRISM: Perinuclear Ring-based Image Segmentation Method for Acute Lymphoblastic Leukemia Classification

Larissa Ferreira Rodrigues Moreira, Leonardo Gabriel Ferreira Rodrigues, Rodrigo Moreira, André Ricardo Backes

Affiliations * Institute of Exact and Technological Sciences; Federal University of Viçosa; School of Computer Science; Federal University of Uberlândia; Department of Computing; Federal University of São Carlos

AI summary This study addresses the challenges of analyzing peripheral blood smear images for Acute Lymphoblastic Leukemia (ALL) classification and proposes PRISM, a perinuclear ring-based image segmentation method. The method builds adaptive concentric zones around the nucleus in place of conventional cytoplasm contour segmentation, extracting robust cytoplasmic features without precise cell-boundary detection. Experiments show that, combined with a calibrated ensemble of traditional classifiers, the method reaches an accuracy of 98.46% and a precision-recall AUC of 0.9937.

Comments Paper accepted for publication at the XXVI Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2026), Ouro Preto, MG, Brazil

Details
English abstract

Automated analysis of peripheral blood smears for Acute Lymphoblastic Leukemia (ALL) is hindered by low contrast and substantial variability in cytoplasmic appearance, which complicate conventional membrane-based segmentation. We found that many recent approaches rely on heavy neural architectures and extensive training, but still struggle to generalize across staining and acquisition variability. To address these limitations, we propose the Perinuclear Ring-based Image Segmentation Method (PRISM), which replaces explicit cytoplasmic delineation with adaptive concentric zones constructed around the nucleus. These perinuclear regions enable the extraction of robust cytoplasmic descriptors by integrating color information with texture statistics derived from grey-level co-occurrence patterns, without requiring accurate cell-boundary detection. A calibrated stacking ensemble of traditional classifiers leverages these descriptors to achieve high performance, with an accuracy of 98.46% and a precision-recall AUC of 0.9937.
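
A rough sketch of the underlying idea, texture statistics computed inside a ring around the nucleus, is given below. It simplifies heavily and should not be read as PRISM itself: the ring is a fixed annulus around an assumed nucleus centre rather than an adaptive zone, only horizontal neighbours feed the co-occurrence matrix, and `ring_mask` and `glcm_contrast` are hypothetical names.

```python
def ring_mask(height, width, center, r_in, r_out):
    """Boolean annulus: pixels whose distance d from center has r_in < d <= r_out."""
    cy, cx = center
    return [
        [r_in ** 2 < (y - cy) ** 2 + (x - cx) ** 2 <= r_out ** 2
         for x in range(width)]
        for y in range(height)
    ]

def glcm_contrast(img, mask, levels):
    """Contrast of a symmetric horizontal-neighbour GLCM over masked pixels.

    img holds integer grey levels in range(levels); a pixel pair counts only
    when both pixels lie inside the mask.
    """
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for y, row in enumerate(img):
        for x in range(len(row) - 1):
            if mask[y][x] and mask[y][x + 1]:
                i, j = row[x], row[x + 1]
                counts[i][j] += 1
                counts[j][i] += 1  # count both directions for symmetry
                total += 2
    if total == 0:
        return 0.0
    weighted = sum(
        (i - j) ** 2 * counts[i][j]
        for i in range(levels) for j in range(levels)
    )
    return weighted / total
```

Because the ring is defined relative to the nucleus, no cell-boundary contour is needed, which is the property the abstract emphasizes.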

2605.12845 2026-05-14 cs.CV cs.AI version update

AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

Danrui Li, Jiahao Zhang, Bernhard Egger, Moitreya Chatterjee, Suhas Lohit, Tim K. Marks, Anoop Cherian

Affiliations * Rutgers, The State University of New Jersey; The Australian National University; Friedrich-Alexander-Universität Erlangen-Nürnberg; Mitsubishi Electric Research Laboratories (MERL)

AI summary This paper presents AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal assembly instructions, corresponding 3D part models, and assembly trajectories, built to address the complex shapes and assembly paths of industrial assembly. It also proposes AssemblyDyno, a transformer-based model that jointly predicts assembly order and part trajectories and outperforms existing methods in assembly pose estimation and trajectory feasibility, with feasibility evaluated by physics simulation.

Comments Accepted at CVPR 2026

Details
English abstract

Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.

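2605.12826 2026-05-14 cs.CV cs.AI version update

FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection

Kaixiang Zhao, Tianrun Yu, Aoxu Zhang, Junhao Su, Porter Jenkins, Amanda Hughes

Affiliations * Brigham Young University; Rutgers University

AI summary As image editing tools and generative AI spread, verifying the authenticity of digital images has become increasingly difficult. To address the limited robustness, fragmented evidence, and weak generalization of existing methods, this paper proposes FRAME, which organizes multiple forensic algorithms into a multi-path analysis space, adaptively selects suitable forensic paths, and fuses complementary evidence to improve detection and localization. FRAME keeps multi-source forensic cues interpretable while providing a more robust and flexible approach to image forensics, and it performs well across a range of manipulation scenarios.

Comments Accepted to CVPR 2026 SAFE Workshop

Details
English abstract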

The proliferation of sophisticated image editing tools and generative artificial intelligence models has made verifying the authenticity of digital images increasingly challenging, with important implications for journalism, forensic analysis, and public trust. Although numerous forensic algorithms, ranging from handcrafted methods to deep learning-based detectors, have been developed for manipulation detection, individual methods often suffer from limited robustness, fragmented evidence, or weak generalization across manipulation types and image conditions. To address these limitations, we present FRAME, a method for Forensic Routing and Adaptive Multi-path Evidence fusion for image manipulation detection. FRAME organizes diverse forensic algorithms into a multi-path analysis space, adaptively selects informative forensic paths for each input image, and fuses complementary evidence to improve detection and localization performance. By moving beyond single-method analysis and fixed fusion strategies, FRAME provides a more robust and flexible approach to image forensic reasoning while preserving interpretable forensic cues from multiple evidence sources. Experimental results demonstrate the effectiveness of FRAME across diverse manipulation scenarios. Code is available at https://github.com/kzhao5/FRAME.

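2605.12778 2026-05-14 cs.GR cs.CV version update

Generative Motion In-betweening by Diffusion over Continuous Implicit Representations

Shiyu Fan, Paul Henderson, Edmond S. L. Ho

Affiliations * School of Computing Science, University of Glasgow

AI summary This paper proposes a new diffusion-based method over continuous implicit representations for generating high-quality in-between motion. By establishing a mapping in latent space between implicit neural representations and sparse spatial or temporal information, the method generates smooth and diverse motion sequences from very few keyframes. Experiments show that it markedly improves motion generation quality while preserving keyframe accuracy.

Details
English abstract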

Recent advances in generative models have yielded impressive progress on motion in-betweening, allowing for more complex, varied, and realistic motion transitions. However, recent methods still exhibit noticeable limitations in preserving keyframe information and ensuring motion continuity. In this paper, we propose a novel pipeline and sampling optimization strategy for latent diffusion models (LDM) based on motion implicit neural representations (INR). By establishing a mapping between INR and sparse spatial or temporal information within latent diffusion, our model can sample the INR parameters from extremely sparse and ambiguous keyframe data and reconstruct plausible and smooth motions from the manifold. Our experiments demonstrate the superior performance of our model, which significantly improves motion generation quality in scenarios with few keyframes while ensuring both keyframe accuracy and diversity of in-between motions.

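2605.12774 2026-05-14 cs.CV version update

WildPose: A Unified Framework for Robust Pose Estimation in the Wild

Jianhao Zheng, Liyuan Zhu, Zihan Zhu, Iro Armeni

Affiliations * Stanford University; ETH Zürich

AI summary This paper proposes WildPose, a unified monocular pose estimation framework that targets the key challenge of camera pose estimation in dynamic environments. The method combines the rich perception of feedforward models with the end-to-end optimization of differentiable bundle adjustment: it builds a 3D-aware update operator on a frozen, pre-trained MASt3R feature backbone and adds a high-capacity motion mask detector, achieving robust performance in dynamic, static, and low-ego-motion scenes. Experiments show that WildPose outperforms existing methods on several benchmark datasets.

Details
English abstract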

Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume static scenes. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models may degrade on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect two powerful paradigms in modern 3D vision: the rich perceptual frontend of feedforward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this with a 3D-aware update operator built on a frozen, pre-trained MASt3R feature backbone, together with a high-capacity motion mask detector that uses multi-level 3D-aware features from the same backbone. Extensive experiments show WildPose consistently outperforms prior methods across dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) benchmarks.

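2605.12772 2026-05-14 cs.CV version update

Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs

Andreas Maier, Jeta Sopa, Gozde Gul Sahin, Paula Perez-Toro, Siming Bayer

Affiliations * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

AI summary This study finds that most frontier large language models (LLMs) tend to recommend a sponsored flight costing roughly twice as much when the system prompt contains a soft sponsorship cue. Reproducing the experiment on several open-weight and commercial models, the authors find that a 30-token user prompt asking the model to first provide a neutral comparison table sharply reduces the rate of sponsored recommendations, from an average of 46.9% to 1.0% for open-source models and from 53.0% to 0% for OpenAI models. The study also reports that models' responses to sponsored content generalize to some extent, and it exposes implementation deviations that can arise when reproducing experiments.

Comments Submitted to Workshop on Textual Information Processing & Synthesis in the Wild

Details
English abstract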

Wu et al. (2026) showed that most frontier large language models (LLMs) recommend a sponsored, roughly twice-as-expensive flight when their system prompt contains a soft sponsorship cue. We reproduce their evaluation on ten open-weight chat models plus the two of their twenty-three models that are still reachable today (gpt-3.5-turbo, gpt-4o). All reported rates in this paper are produced under the same judge the original paper used (gpt-4o); we additionally store every label under an open-weight (gpt-oss-120b) and a smaller proprietary (gpt-4o-mini) judge for an ablation. Three findings emerge. First, a prose description of an LLM evaluation pipeline is not, on its own, sufficient for accurate reproduction: we surfaced three silent implementation failures that each shifted a reported rate by tens of percentage points. Second, the central claims do generalise: the gpt-3.5-turbo logistic-regression intercept of alpha = 0.81 is within four points of the original alpha = 0.86, and 200 of 200 trials on gpt-3.5-turbo and gpt-4o promote a payday lender to a financially distressed user. Third, a thirty-token user prompt that asks the assistant for a neutral comparison table first cuts sponsored recommendation from 46.9% to 1.0% averaged across our ten open-source models, and from 53.0% to 0% averaged across the two OpenAI models. AI literacy and price-comparison portals are likely market-level mitigations; the harmful-product cell is bounded by neither. Raw data, labels and analysis scripts are at https://github.com/akmaier/Paper-LLM-Ads.

2605.12753 2026-05-14 eess.IV cs.CV cs.LG version update

Optimization in Sparse 2D to Dense 3D Weakly Supervised Learning: Application to Multi-Label Segmentation of Large ex vivo MRI Data

Paul Hoareau, Kuan Yi Wang, Brandon Bujak, Roy Sun, Govind Nair, Irene Cortese, Charidimos Tsagkas, Daniel Reich, Julien Cohen-Adad

Affiliations * NeuroPoly Lab, Institute of Biomedical Engineering, Polytechnique Montreal; École Centrale de Lyon; Mila - Quebec AI Institute; Functional Neuroimaging Unit, CRIUGM, University of Montreal; Translational Neuroradiology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health; Translational Imaging in Neurology (ThINk) Basel, Department of Biomedical Engineering, Faculty of Medicine, University Hospital Basel and University of Basel; Neurologic Clinic and Policlinic, Departments of Medicine, University Hospital Basel, Switzerland; Research Center for Clinical Neuroimmunology and Neuroscience Basel (RC2NB), University Hospital Basel and University of Basel, Switzerland; National Institute of Neurological Disorders and Stroke, National Institutes of Health; Centre de recherche du CHU Sainte-Justine, Université de Montréal, Montreal, QC, Canada; Quantitative MRI core facility, NINDS, NIH; Experimental Immunotherapeutics Unit, Division of Neuroimmunology and Neurovirology, NINDS, NIH

AI summary This study addresses multi-label segmentation of high-resolution ex vivo MRI data and examines how to optimize weakly supervised learning that produces dense 3D segmentations from sparse 2D annotations. It proposes a framework in which a 2D Teacher network generates pseudo-labels to train a 3D Student network, and systematically analyzes the effects of human-centric visual enhancement, spatial augmentation, and soft-label regularization on model performance. The results show that 2D and 3D models differ markedly in their optimization strategies and need different regularization to achieve the best segmentation.

Comments 19 pages. Submitted to Machine Learning for Biomedical Imaging (MELBA). Code and models: https://github.com/ivadomed/model_seg_sc-gm-lesion_human_ms_exvivo_t2star

Details
English abstract

INTRODUCTION | Fully supervised 3D segmentation of high-resolution ex vivo MRI is limited by the prohibitive cost of volumetric annotation, forcing reliance on sparse 2D slices. Weakly supervised Sparse-to-Dense frameworks bridge this gap, but guidelines remain ambiguous regarding human-centric visual enhancements and transferring optimization strategies across dimensions. We analyze divergent regularization needs for multi-class segmentation of high-resolution ex vivo spinal cord MRI. METHODS | We used 9.4T MRI of multiple sclerosis spinal cords (>104,000 slices) with sparse annotations (428 slices). A 2D Teacher trained on sparse slices generated dense pseudo-labels to train a 3D Student. We systematically evaluated the impact of human-centric preprocessing, spatial augmentation, and soft-label regularization on both architectures. RESULTS | We identified a critical divergence in training dynamics. The 2D Teacher required strong spatial augmentation and soft-labeling to overcome data scarcity, improving White Matter Lesion Dice scores by >11 points. However, propagating these techniques to the 3D Student degraded its performance. Furthermore, human-centric preprocessing (e.g., CLAHE) disrupted global statistical cues, dropping Gray Matter Lesion Dice scores by ~25 points. DISCUSSION | Our study highlights a perception divergence (human-centric contrast enhancement harms machine models) and a regularization conflict across dimensions. 3D architectures trained on dense pseudo-labels exhibit fundamentally different optimization landscapes than 2D counterparts and require distinct, conservative regularization. Code and models: https://github.com/ivadomed/model_seg_sc-gm-lesion_human_ms_exvivo_t2star.
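
The Dice scores quoted in the results can be read against the following minimal sketch of per-label Dice on flattened label maps. It is a generic definition, not the authors' evaluation code; returning 1.0 when a label is absent from both prediction and ground truth is an assumed convention, and the abstract's "points" presumably refer to Dice on a 0-100 scale.

```python
def dice_per_label(pred, truth, labels):
    """Dice coefficient for each label over two equal-length label sequences."""
    scores = {}
    for lab in labels:
        p = [v == lab for v in pred]
        t = [v == lab for v in truth]
        inter = sum(a and b for a, b in zip(p, t))
        denom = sum(p) + sum(t)
        # Assumed convention: a label missing from both maps scores 1.0.
        scores[lab] = 2 * inter / denom if denom else 1.0
    return scores
```

Reporting Dice per label, rather than one pooled score, is what lets the abstract separate effects on white matter lesions from effects on gray matter lesions.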

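2605.12743 2026-05-14 cs.CR cs.CV version update

Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving

Shuo Ju, Qingzhao Zhang, Huashan Chen, Xuheng Wang, Haotang Li, Wanqian Zhang, Feng Liu, Kebin Peng, Sen He

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; The University of Arizona; Beijing Jiaotong University; East Carolina University

AI summary This study proposes a new physical adversarial attack on vision-based autonomous driving systems that uses viewing-angle variation itself as the attack tool: a static camouflage patch mounted on a vehicle produces view-dependent appearance changes under relative motion, leading the system to predict an incorrect trajectory. Unlike earlier attacks that require multi-view robustness or active intervention, the method needs only simple deployment to trigger erroneous braking in autonomous vehicles across different scenes and perception models; experiments on the nuScenes dataset show a success rate of up to 87.5%.

Details
English abstract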

Existing physical adversarial attacks on vision-based autonomous driving induce time-evolving perception errors, including biased object tracking or trajectory prediction, through (i) a sophisticated physical patch that induces detection-box drift when entering the viewing distance, or (ii) dynamically changing patches that cause different perception errors at different times. In both cases, viewing-angle variation is treated as a challenge, requiring adversarial patches to remain effective across frames under varying views, leading to complex multi-view optimization. In contrast, we show that viewing-angle variation itself can be turned into an attack tool. We design a new attack paradigm where a static, passive adversarial camouflage is mounted on a vehicle whose view-dependent appearance naturally evolves with relative motion, inducing consistent feature drift across frames. This causes the system to infer a physically plausible but incorrect trajectory, such as a false cut-in, which propagates to downstream decision-making and triggers unnecessary braking. Unlike prior approaches that require multi-view robustness or active intervention, our attack emerges from normal driving dynamics and is easy to deploy: a parked vehicle with a natural camouflage can induce hard braking in passing autonomous vehicles. We demonstrate the attack on the nuScenes dataset, showing an end-to-end success rate of up to 87.5%, measured by hard-braking events, and robustness across different scene backgrounds, victim vehicle speeds, and perception models.

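2605.12725 2026-05-14 cs.CV version update

Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models

Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

Affiliations * University of South Florida; Mitsubishi Electric Research Laboratories

AI summary Video anomaly detection research has recently moved toward general models of normal behavior that work across scenes, but this trend overlooks the scene-specific and context-dependent nature of normal behavior. Existing methods often rely on pretrained representations from multimodal large language models and video-level weak supervision, which leads models to attend to semantic anomaly categories rather than deviations from normal behavior in a particular environment. Through visual analyses and empirical evaluation, this paper argues that the practice weakens spatial localization, introduces semantic bias, and reduces anomaly detection to action recognition, and that video anomaly detection should refocus on spatially aware, explainable modeling of normality within a single scene.

Details
English abstract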

Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially-aware, and explainable formulations that capture the nuanced structure of normality within individual environments.

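2605.12724 2026-05-14 cs.CV cs.AI version update

Inline Critic Steers Image Editing

Weitai Kang, Xiaohang Zhan, Yizhou Wang, Mang Tik Chiu, Jason Kuen, Kangning Liu, Yan Yan

Affiliations * University of Illinois Chicago; Adobe

AI summary This paper studies how difficulty varies across regions in instruction-based image editing and proposes a way to correct the model's output during generation. The core of the method is a learnable "Inline Critic" module that evaluates the generated result at the model's intermediate layers and steers the rest of the generation. A three-stage training strategy stabilizes learning, and the method markedly improves image editing, achieving state-of-the-art results on several benchmarks.

Comments 9 pages

Details
English abstract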

Instruction-based image editing exhibits heterogeneous difficulty not only across cases but also across regions of an image, motivating refinement approaches that allocate correction to where the model struggles. Existing refinement signals arrive late, after a fully generated image or a completed denoising step. We ask whether such a signal can act within an ongoing forward pass. To investigate this, we probe a frozen image-editing model and find that although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation ρ = 0.83 with the final-layer error map). Based on this, we introduce Inline Critic, a learnable token that critiques a frozen model's predictions at its intermediate layers and steers its hidden states to refine generation during the forward pass. A three-stage recipe is proposed to stabilize the training from learning how to critique to steering generation. As a result, we achieve state of the art on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source result on KRIS-Bench (81.92, surpassing GPT-4o). We further provide analyses showing that the critic genuinely shapes the model's attention and prediction updates at subsequent layers.
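
The rank correlation reported for the probing experiment can be computed as in the sketch below, which treats two error maps as flattened lists of per-region errors and takes the Pearson correlation of their ranks (Spearman's coefficient). How the authors flatten or pool the maps is not stated, so that part is an assumption, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation; NaN when either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)
```

A high value means the regions an early layer gets most wrong are largely the same regions the final layer gets most wrong, regardless of the absolute error magnitudes.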

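2605.12703 2026-05-14 cs.CV cs.AI version update

MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

Yifan Chen, Fei Yin, Qingyan Bai, Zicheng Lin, Yujiu Yang

Affiliations * University of Cambridge; HKUST; Tsinghua University

AI summary This paper introduces MMCL-Bench, a benchmark for multimodal context learning: learning task-relevant rules, procedures, and empirical patterns from visual or mixed-modality teaching content and applying them to new visual instances. The benchmark contains 102 tasks in three categories (rule application, procedure execution, and empirical induction). Evaluation shows that current mainstream multimodal models remain far from robust multimodal context learning under strict scoring, identifying it as an important capability bottleneck for current models.

Details
English abstract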

We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context-to-answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL-Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.

2605.12684 2026-05-14 cs.CV cs.AI cs.HC version update

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Yichen Feng, Yuetai Li, Chunjiang Liu, Yuanyuan Chen, Fengqing Jiang, Yue Huang, Hang Hua, Zhengqing Yuan, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Xiangliang Zhang, Misha Sra, Zichen Chen, Radha Poovendran, Zhangchen Xu

Affiliations * Bake AI; University of Washington; University of California, Santa Barbara; Stanford University; University of Notre Dame; Carnegie Mellon University; MIT-IBM Watson AI Lab; Western Washington University; King Abdulaziz City for Science and Technology

AI summary This study examines how well frontier multimodal large language models judge visual aesthetics and finds that current models fall well short. It introduces the Visual Aesthetic Benchmark (VAB), which evaluates models on expert-annotated comparison tasks, and finds that even the best model is far worse than human experts at identifying the best and worst images. The study also shows that fine-tuning a model on a small number of expert examples substantially improves its performance, underscoring VAB's value for advancing models of aesthetic judgment.

Comments Project page: https://vab.bakelab.ai. Code: https://github.com/BakeLab/Visual-Aesthetic-Benchmark. Dataset: https://huggingface.co/datasets/BakeLab/Visual-Aesthetic-Benchmark

Details
English abstract

Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
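
The strict criterion behind the 26.5% figure, where a task counts only if both the best and the worst image are identified under every permutation of the candidate order, can be sketched as below. The data layout is hypothetical: each task stores the consensus `best` and `worst` image ids and a list of `(predicted_best, predicted_worst)` pairs, one per permutation, already mapped back to image ids.

```python
def task_solved(gold_best, gold_worst, runs):
    """True only if every run names both the best and the worst image."""
    return all(
        best == gold_best and worst == gold_worst for best, worst in runs
    )

def strict_accuracy(tasks):
    """Fraction of tasks solved under the all-permutations criterion."""
    solved = sum(
        task_solved(t["best"], t["worst"], t["runs"]) for t in tasks
    )
    return solved / len(tasks)
```

Requiring agreement across permutations penalizes answers that depend on candidate position, so a model cannot score well through order bias alone.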

2605.12650 2026-05-14 cs.CV Updated version

CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

Yunsung Chung, Alex El Darzi, Carlo El Khoury, Han Feng, Nassir Marrouche, Jihun Hamm

Affiliations * Department of Computer Science, Tulane University; School of Medicine, Tulane University

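AI summary This study addresses the difficulty of adapting foundation diffusion models to medical image synthesis and proposes CRAFT, a clinically aligned fine-tuning method. Building on the Clinical Alignment Score (CAS), introduced as a new evaluation metric, CRAFT transfers medical knowledge from multimodal large language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization, improving the clinical relevance of generated images. Experiments show that across several medical imaging modalities CRAFT raises CAS and reduces implausible generations, outperforming existing adaptation baselines.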

Details
English abstract

Foundation diffusion models can generate photorealistic natural images, but adapting them to medical imaging remains challenging. In medical adaptation, limited labeled data can exacerbate hallucination-like and clinically implausible synthesis, while existing metrics such as FID or Inception Score do not quantify per-image alignment with pathology-relevant criteria. We introduce the Clinical Alignment Score (CAS), a foundation-model-based proxy for clinical alignment that evaluates generated images along four complementary dimensions beyond visual fidelity. Building on CAS, we propose Clinical Reward-Aligned Finetuning (CRAFT), a reward-based adaptation framework that transfers medical knowledge from multimodal large language models and vision-language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization. Across four diverse modalities, CRAFT improves CAS and downstream classification performance over strong adaptation baselines. Beyond average CAS gains, CRAFT reduces the empirical low-alignment tail below a real-image reference threshold by 5.5 to 34.7 percentage points relative to the strongest baseline, corresponding to a 20.4% average relative reduction across datasets. These results indicate fewer hallucination-like generations under CAS, and are corroborated by evaluation with out-of-family evaluators, structured checklist auditing, memorization analysis, and a blinded physician preference study on CheXpert.

2605.12325 2026-05-14 cs.CV Updated version

VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

Hao Zhu, Shuo Jin, Wenbin Liao, Jiayu Xiao, Yan Zhu, Siyue Yu, Feng Dai

Affiliations * Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Liverpool

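AI summary This study targets the efficiency and generalization problems that CLIP's spatial bias causes in training-free open-vocabulary semantic segmentation. The authors propose Visual-guided Prompt Evolution (VIP), built on the spatially aware dino.txt framework, which combines a visual-guided distillation mechanism with alias expansion to make text queries more semantically expressive, enabling more efficient and more accurate dense prediction. Experiments show that VIP outperforms existing methods on several benchmarks, generalizes well across domains, and adds little inference overhead.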

Comments Accepted by ICML 2026. Code is available at https://github.com/MiSsU-HH/VIP

Details
English abstract

Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce Visual-guided Prompt Evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP (1) surpasses leading methods by 1.4%-8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) requires marginal inference time and memory overhead.

2605.12163 2026-05-14 cs.CV Updated version

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

Chenfeng Wang, Wei He, Xuhan Zhu, Chunpeng Zhou, Qizhen Li, Song Yan, Yufei Zheng, Chengjun Yu, Fan Lu, Wei Zhai, Yang Cao, Pengfei Yu, Zheng-Jun Zha

Affiliations * University of Science and Technology of China; Li Auto Inc.

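AI summary This paper studies long latent-sequence reasoning in vision-language models and finds that existing methods degrade as the latent sequence grows, because of Information Gain Collapse and because heavily pooled image embeddings carry little useful signal. The authors propose SCOLAR, a self-consistent latent reasoning method that uses a lightweight decoder to generate auxiliary visual tokens, each independently anchored to the original visual space. Combined with multi-stage fine-tuning and reinforcement learning, it substantially extends the usable latent reasoning length and improves model performance, achieving the best results on several real-world benchmarks.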

Comments 17 pages, 6 figures

Details
English abstract

In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.

2605.12145 2026-05-14 cs.CV Updated version

Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

Souptik Sen, Raneen Younis, Zahra Ahmadi

Affiliations * Peter L. Reichertz Institute for Medical Informatics; Lower Saxony Center for AI and Causal Methods in Medicine (CAIMed)

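AI summary This study addresses the trade-off between cross-modal generalization and modality-specific structure in multimodal learning. It proposes CoDAAR, a framework that uses semantically aligned discrete representations to preserve each modality's distinctive structure while generalizing across modalities within a unified discrete space. The method combines two mechanisms, Discrete Temporal Alignment and Cascading Semantic Alignment, is trained with self-supervised reconstruction objectives, and achieves state-of-the-art performance on several cross-modal and cross-domain benchmarks.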

Comments Added missing affiliation for co-authors R. Younis and Z. Ahmadi

Details
English abstract

Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.

2605.12119 2026-05-14 cs.CV cs.GR Updated version

MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

Haofeng Liu, Yang Zhou, Ziheng Wang, Zhengbo Xu, Zhan Peng, Jie Ma, Jun Liang, Shengfeng He, Jing Li

Affiliations * Orange-3DV-Team

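AI summary This paper proposes MoCam, a unified novel view synthesis method that addresses the conflict between geometric and appearance priors in generative novel view synthesis. Using structured denoising dynamics, it progresses from geometry to appearance within the diffusion process: geometric priors first establish coarse structure, then appearance priors correct geometric errors and refine details. Experiments show that MoCam performs particularly well when point clouds contain severe holes or distortions, effectively decoupling geometry and appearance within a unified synthesis framework.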

Comments Project page: https://orange-3dv-team.github.io/MoCam

Details
English abstract

Generative novel view synthesis faces a fundamental dilemma: geometric priors provide spatial alignment but become sparse and inaccurate under view changes, while appearance priors offer visual fidelity but lack geometric correspondence. Existing methods either propagate geometric errors throughout generation or suffer from signal conflicts when fusing both statically. We introduce MoCam, which employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process. MoCam first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process. Experiments demonstrate that MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions, achieving robust geometry-appearance disentanglement.

2605.11989 2026-05-14 cs.CV cs.AI Updated version

A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

Nermeen Abou Baker, Nico Zengeler, Uwe Handmann

Affiliations * Computer Science Institute, Ruhr West University of Applied Sciences, 46236 Bottrop

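AI summary This paper investigates how to select the pre-trained model that best meets target-domain requirements for image classification, examining how well transfer learning works in deep neural networks. The authors adapt the output layers and network parameters of eleven models pre-trained on ImageNet and apply them to five different target datasets. By measuring accuracy, accuracy density, training time, and model size, they compare the models in single-episode and multi-episode training, providing guidance for model selection in transfer learning.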

Comments Published in the journal Machine Learning and Knowledge Extraction

Journal ref Machine Learning and Knowledge Extraction 4, no. 1: 22-41 (2022)

Details
English abstract

Transfer learning is a machine learning technique that uses previously acquired knowledge from a source domain to enhance learning in a target domain by reusing learned weights. This technique is ubiquitous because of its great advantages in achieving high performance while saving training time, memory, and effort in network design. In this paper, we investigate how to select the best pre-trained model that meets the target domain requirements for image classification tasks. In our study, we refined the output layers and general network parameters to apply the knowledge of eleven image processing models, pre-trained on ImageNet, to five different target domain datasets. We measured the accuracy, accuracy density, training time, and model size to evaluate the pre-trained models both in single-episode training sessions and in sessions with ten episodes.

2605.11572 2026-05-14 cs.CV Updated version

TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

Seongah Kim, Dinh Phu Tran, Hyeontaek Hwang, Saad Wazir, Duc Do Minh, Daeyoung Kim

Affiliations * AI2 Lab, KAIST

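AI summary This study proposes TB-AVA, a parameter-efficient fine-tuning framework that addresses the difficulty of semantic correspondence in audio-visual alignment. Using text as a semantic bridge, TB-AVA builds on frozen audio and visual encoders and uses a text-guided semantic modulation module to let cross-modal features interact and align. Experiments show state-of-the-art performance on several benchmarks, supporting text as an effective semantic anchor for audio-visual learning.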

Comments 12 pages, 6 figures

Details
English abstract

Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic correspondence. We propose to use text as a semantic anchor for audio-visual representation learning. To this end, we introduce a parameter-efficient adaptation framework built on frozen audio and visual encoders, centered on Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the proposed approach on multiple benchmarks, including AVE, AVS, and AVVP, where the proposed framework achieves state-of-the-art performance, demonstrating text as an effective semantic anchor for parameter-efficient fine-tuning (PEFT) in audio-visual learning.

2605.11533 2026-05-14 cs.CL cs.CV Updated version

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

Sike Xiang, Shuang Chen, Kevin Qinghong Lin, Jialin Yu, Yijia Sun, Philip Torr, Amir Atapour-Abarghouei

Affiliations * Durham University; University of Oxford

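AI summary This study presents Checkup2Action, a multimodal clinical check-up report dataset for generating patient-oriented action cards. The dataset contains 2,000 de-identified real-world check-up reports covering demographics, physical examinations, laboratory tests, cardiovascular assessments, and imaging evidence. Each action card has structured fields: the clinical issue, its priority, the recommended department, the follow-up time window, a patient-facing explanation, and questions for clinicians. The study frames checkup-to-action generation as a constrained structured generation task and introduces an evaluation protocol covering issue coverage, priority consistency, and department and time recommendation accuracy, among other dimensions, providing a new benchmark for patient-oriented reasoning over clinical reports.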

Details
English abstract

Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present Checkup2Action, a multimodal clinical check-up report dataset and benchmark for structured Action Card generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, and imaging-related evidence. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.

2605.11492 2026-05-14 cs.CV Updated version

A Mimetic Detector for Adversarial Image Perturbations

Johnny Corbino

Affiliations * Lawrence Berkeley National Laboratory

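AI summary This study proposes a single-shot detector for adversarial image perturbations that requires no training and no access to the target network. It is based on high-order Corbino-Castillo mimetic operators, which capture the high-frequency, near-random gradient energy that adversarial examples produce at the pixel level. Experiments on a standard test image show a clear separation between clean and adversarial images, and the separation grows with the operator order.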

Comments v2: extended Table 1 with results for order $k=8$; minor revisions for clarity

Details
English abstract

Adversarial attacks fool deep image classifiers by adding tiny, almost invisible noise patterns to a clean image. The standard $\ell^\infty$-bounded attacks (FGSM, PGD, and the $\ell^\infty$ variant of Carlini-Wagner) produce high-frequency, near-random sign patterns at the pixel level: nearly invisible in $\ell^2$, but carrying disproportionate gradient energy. We exploit this with a single-shot, training-free detector using the high-order Corbino-Castillo mimetic operators from the open-source MOLE library. No retraining, no surrogate classifier, no access to the network under attack: the verdict is a property of the input alone, computed in $O(HW)$ time. We validate the detector on the standard peppers test image at the canonical $\ell^\infty$ budget $\varepsilon = 16/255$ and observe a clean-vs-adversarial separation that grows monotonically from $3.55\times$ at order $k=2$ to $4.62\times$ at $k=8$.
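
The gradient-energy idea can be illustrated with a small sketch. Note the substitution: it uses plain second-order central differences rather than the Corbino-Castillo mimetic operators from MOLE, and it simulates the perturbation with random signs instead of running a real attack. The function names, the synthetic ramp image, and the choice of statistic are all assumptions made for illustration.

```python
import random


def gradient_energy(img):
    """Mean squared central-difference gradient over interior pixels.

    img: 2D list of floats in [0, 1]. Runs in O(H*W) time.
    """
    h, w = len(img), len(img[0])
    total = 0.0
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = (img[i][j + 1] - img[i][j - 1]) / 2.0
            gy = (img[i + 1][j] - img[i - 1][j]) / 2.0
            total += gx * gx + gy * gy
    return total / ((h - 2) * (w - 2))


def add_sign_noise(img, eps, seed=0):
    """Add +eps or -eps to every pixel (a stand-in for an l-infinity attack)."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, p + eps * rng.choice((-1.0, 1.0)))) for p in row]
        for row in img
    ]


# Synthetic smooth "clean" image: a diagonal ramp (illustrative only).
n = 32
clean = [[(i + j) / (2.0 * (n - 1)) for j in range(n)] for i in range(n)]
noisy = add_sign_noise(clean, eps=16 / 255)
ratio = gradient_energy(noisy) / gradient_energy(clean)
```

On this toy input the ratio is well above 1, which is the kind of clean-versus-perturbed separation a threshold could act on. The numbers are not comparable to the separations reported in the abstract.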

2605.11444 2026-05-14 cs.CV Updated version

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

Eunho Lee, Rei Kawakami, Youngbae Hwang

Affiliations * Chungbuk National University; Institute of Science Tokyo

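AI summary This study proposes a unified image restoration framework guided by a multimodal large language model (MLLM), which aims to recover clean images from inputs affected by diverse, unknown degradations. Existing methods treat degradations as discrete categories and so cannot model the continuous relationships in composite degradations. To address this, the authors use multimodal embeddings to guide restoration and design an MLLM-guided fusion block and a mixture-of-frequency-experts module, which strengthen degradation-aware representations and adaptively combine frequency experts. Experiments show strong performance on several benchmarks and a new state of the art on the CDD11 dataset.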

Details
English abstract

All-in-one image restoration seeks to recover clean images from inputs affected by diverse and unknown degradations using a unified framework. Recent methods have shown strong performance by identifying degradation characteristics to guide the restoration process. However, many of them treat degradations as discrete categories, which limits their ability to model the continuous relational structure that arises in composite degradations. To address this issue, we propose a multimodal large language model (MLLM)-guided image restoration framework that exploits multimodal embeddings as guidance for low-level restoration. Specifically, MLLM-derived features are injected into an encoder-decoder architecture through an MLLM-guided fusion block (MGFB) to enhance degradation-aware representations. In addition, we incorporate a mixture-of-frequency-experts (MoFE) module that adaptively combines frequency experts using MLLM-guided contextual cues. To further improve expert routing, we design an MLLM-guided router with a relational alignment loss that encourages routing patterns consistent with the embedding-space relationships of degraded inputs. Extensive experiments on multiple benchmarks show that the proposed method achieves strong performance across diverse restoration settings and establishes a new state of the art on the challenging CDD11 dataset, outperforming previous methods by up to 1.35 dB.

2605.11347 2026-05-14 cs.LG cs.AI cs.CV Updated version

Gradient-Free Noise Optimization for Reward Alignment in Generative Models

Jeongsol Kim, Hongeun Kim, Jian Wang, Jong Chul Ye

Affiliations * KAIST AI; Snap Inc.

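AI summary This paper proposes ZeNO, a gradient-free noise optimization method for reward alignment in generative models. It formulates noise optimization as a path-integral control problem that relies only on zeroth-order reward evaluations, avoiding the backpropagation that earlier methods require. ZeNO performs well across a range of generators and reward functions and is particularly suited to settings where backpropagation is infeasible, such as protein structure generation.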

Details
English abstract

Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth-order Noise Optimization), a gradient-free framework that formulates noise optimization as a path-integral control problem, estimable from zeroth-order reward evaluations alone. When instantiated with an Ornstein--Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward-tilted distribution. ZeNO enables effective inference-time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
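
To make the idea of zeroth-order noise optimization concrete, here is a generic sketch in the style of path-integral (softmax-weighted) updates. It is not ZeNO's actual update rule: the function name, the Gaussian perturbations, the temperature, and the toy reward are all assumptions. The only property it shares with the abstract's description is that it queries the reward as a black box and never backpropagates.

```python
import math
import random


def zeroth_order_step(z, reward, n_samples=64, sigma=0.1, temperature=0.1, rng=None):
    """One gradient-free update of a noise vector z.

    Samples Gaussian perturbations, scores each perturbed vector with the
    black-box reward, and moves z by the softmax-weighted average perturbation.
    """
    if rng is None:
        rng = random.Random(0)
    perturbations = [[rng.gauss(0.0, sigma) for _ in z] for _ in range(n_samples)]
    rewards = [reward([zi + ei for zi, ei in zip(z, eps)]) for eps in perturbations]
    r_max = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - r_max) / temperature) for r in rewards]
    total = sum(weights)
    return [
        zi + sum(w * eps[k] for w, eps in zip(weights, perturbations)) / total
        for k, zi in enumerate(z)
    ]


def toy_reward(x):
    """Hypothetical reward: highest when every coordinate equals 1."""
    return -sum((xi - 1.0) ** 2 for xi in x)
```

In a real pipeline, `reward` would wrap the generator plus the reward model, for example decoding the noise to a sample and scoring it. Neither component needs to be differentiable for this kind of update.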

2605.10983 2026-05-14 cs.LG cs.AI cs.CV Updated version

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Jiaming Li, Chenyu Zhu, Nanxi Yi, Youjun Bao, Li Sun, Quanying Lv, Xiang Fang, Daizong Liu, Jianjun Li, Kun He, Bowen Zhou, Zhiyuan Ma

Affiliations * Huazhong University of Science and Technology; Kuaishou Technology; Nanyang Technological University; Wuhan University; Tsinghua University

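AI summary This study targets the reward hacking that arises when diffusion models are aligned to downstream tasks. It proposes Trajectory Matching Policy Optimization (TMPO), which replaces conventional scalar reward maximization with trajectory-level reward distribution matching, improving generative diversity and quality. TMPO introduces a Softmax Trajectory Balance objective that aligns policy probabilities with a reward-induced Boltzmann distribution, and the authors prove that it covers multiple trajectory modes. TMPO also uses Dynamic Stochastic Tree Sampling to make training large-scale flow-matching models more efficient. Experiments show that it improves generative diversity over existing methods while remaining competitive on task performance.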

Details
English abstract

Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most existing RL-based methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation, and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
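
One way to read the matching objective described above is sketched here under stated assumptions. The target distribution over K sampled trajectories is a softmax of rewards divided by a temperature `beta`, the policy distribution is the K trajectory log-probabilities renormalized over the same set, and the loss is the forward KL divergence from target to policy. The function names and the exact loss form are illustrative; the paper's Softmax-TB objective may differ in detail.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def forward_kl_matching_loss(log_probs, rewards, beta=1.0):
    """Forward KL(target || policy) over K trajectories.

    log_probs: policy log-probabilities of the K trajectories (assumed given).
    rewards:   scalar rewards of the same K trajectories.
    beta:      temperature of the reward-induced Boltzmann target (assumed).
    """
    log_target = log_softmax([r / beta for r in rewards])
    log_policy = log_softmax(log_probs)
    return sum(math.exp(lt) * (lt - lp) for lt, lp in zip(log_target, log_policy))
```

The loss is zero when the renormalized policy probabilities are proportional to `exp(reward / beta)`. It grows when the policy puts too little mass on any trajectory the target considers likely, which is the mode-covering behavior the abstract attributes to forward KL.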

2605.10819 2026-05-14 cs.RO cs.AI cs.CV Updated version

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo, Yandan Yang, Bin Liu, Zhejia Cai, Feng Xiong, Mu Xu, Jiachen Luo, De Ma, Zhiheng Ma, Gang Pan

Affiliations * Zhejiang University; Amap, Alibaba Group; Nanjing University; Shenzhen University of Advanced Technology; Beijing University of Chemical Technology; Embodied Intelligence General Platform Laboratory, Chery Auto; Tsinghua University; Queen Mary University of London

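AI summary Vision-language-action (VLA) models are limited by the scarcity of action-labeled robot data, while action-free videos contain rich information about how the physical world changes. This paper proposes ALAM (Algebraically Consistent Latent Action Model), which learns structured latent action transitions from action-free videos to provide a consistent transition structure for policy generation. ALAM uses frame triplets to learn latent transitions that satisfy reconstruction, composition, and reversal consistency, and couples them with policy generation through a joint flow-matching objective, yielding substantial gains on several VLA benchmarks.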

Details
English abstract

Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM, an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by a factor of 25 to 85 over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
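
The composition and reversal constraints can be sketched as simple penalties on latent transition vectors. This assumes the "locally additive" structure means that, for frames a, b, c, the transition a→c should be close to the sum of a→b and b→c, and that b→a should be close to the negation of a→b. The function names and the squared-error form are assumptions, not the paper's loss.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def algebraic_consistency(z_ab, z_bc, z_ac, z_ba):
    """Return (composition_error, reversal_error) for one frame triplet.

    z_ab, z_bc, z_ac: latent transitions a->b, b->c, a->c (assumed encoder outputs).
    z_ba:             latent transition for the reversed pair b->a.
    """
    composed = [x + y for x, y in zip(z_ab, z_bc)]
    composition_error = squared_distance(composed, z_ac)
    reversal_error = squared_distance([-x for x in z_ab], z_ba)
    return composition_error, reversal_error
```

Both errors are zero for a perfectly additive transition space, so adding them to a reconstruction loss would push the encoder toward that structure without needing any action labels.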

2605.10426 2026-05-14 cs.CV cs.AI Updated version

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang, Zhi Xu, Feiyang Tan, Hangning Zhou, Mu Yang, Gong Che

Affiliations * Afari Intelligent Drive; University of Electronic Science and Technology of China; Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications; Tianjin University

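AI summary This paper proposes CoWorld-VLA, a multi-expert world model framework for autonomous driving that addresses the lack of planning-oriented intermediate representations in existing vision-language-action (VLA) models. It extracts complementary world information through multi-source supervision and encodes it into expert tokens that serve as explicit conditions for the planner, guiding action generation more effectively. Experiments show strong results in future scene generation and path planning, particularly in collision avoidance and trajectory accuracy.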

Details
English abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI-Research/CoWorld-VLA.

2605.10187 2026-05-14 cs.CV Updated version

SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

Longteng Guo, Xuanxu Lin, Dongze Hao, Tongtian Yue, Pengkang Huo, Jiatong Ma, Yuchen Liu, Jing Liu

Affiliations * Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; OPPO AI Center

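AI summary SciVQR is a multimodal scientific reasoning benchmark spanning mathematics, physics, chemistry, and other disciplines, designed to evaluate how well large language models handle complex scientific problems. It includes domain-specific visuals such as charts and equations and requires models to combine visual understanding with multi-step reasoning. Tasks range from basic factual recall to complex inference, and some include expert-authored solutions for reference. The study finds that current leading multimodal models still fall clearly short on cross-disciplinary, multi-step scientific reasoning, highlighting the need for stronger reasoning and better integration of domain knowledge.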

Details
English abstract

Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.

2605.10127 2026-05-14 cs.CV Updated version

Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

Yu He, Ting Zhu, Yichun Liu, Lichen Ma, Xinyuan Shan, Jingling Fu, Yu Shi, Junshi Huang, Yan Li

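AI summary This paper presents Fashion130K, a new e-commerce fashion dataset covering a variety of occasions, models, and garment types, intended to advance research on outfit generation. To achieve visually consistent garment generation, the authors design a Unified Multi-modal Condition (UMC) framework that fuses the embeddings of text and image prompts, introduces a Fusion Transformer to align the multimodal features, and guides the generation model to attend to the key correlations between the prompts and the noise image. The dataset and framework offer a broad and detailed exploration of multimodal prompts in generation models, and show better visual consistency than existing methods in several real-world applications and benchmarks.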

Comments Accepted to CVPR 2026 Findings

Details
English abstract

Recent research on fashion outfit generation focuses on promoting the visual consistency of garments by leveraging key information from a reference image and a text prompt. However, the potential of outfit generation remains underexplored, as it requires a comprehensive e-commerce dataset and careful use of multi-modal conditions. In this paper, we propose a new e-commerce dataset, named Fashion130K, with various occasions, models, and garment types. For consistent garment generation, we design a framework with a Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into the generation model. Specifically, we explore an embedding refiner to extract unified embeddings of the multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on the unified embeddings, the attention in the generation model is redesigned to emphasize the correlations between the prompts and the noise image, so that the noise image can select the pivotal prompt tokens for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and a benchmark demonstrate the effectiveness of UMC in visual consistency, achieving better results than SoTA methods.

2605.10040 2026-05-14 cs.CV Updated version

Only Train Once: Uncertainty-Aware One-Class Learning for Face Authenticity Detection

Qingchao Jiang, Zhenxuan Hou, Zhiying Zhu, Zhenxing Qian, Xinpeng Zhang, Zaiwang Gu

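AI summary With the rapid development of generative models, highly realistic generated images raise the risk of identity fraud and the spread of disinformation. Most existing methods treat face forgery detection as a fully supervised binary classification problem and struggle with new generative methods. This paper proposes FADNet, which recasts face authenticity detection as a one-class classification task trained only on real faces. By incorporating evidential deep learning and a pseudo-forgery image generator, it improves generalization and detection accuracy, outperforming existing methods on several benchmarks.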

Comments The sole reason for our withdrawal application is that we have identified critical areas in our manuscript that require substantial revision and improvement to meet rigorous scientific standards. Our only intention is to retract the current draft to revise and enhance it, with no plans to replace it with a different version or redirect readers to other sources at this time

Details
English abstract

The rapid evolution of generative paradigms has enabled the creation of highly realistic imagery, escalating the risks of identity fraud and the dissemination of disinformation. Most existing approaches frame face forgery detection as a fully supervised binary classification problem. Consequently, these models typically exhibit significant performance decay when tasked with detecting forgeries from previously unseen generative paradigms. Furthermore, these methods focus exclusively on either DeepFakes or fully synthesized faces, thereby failing to provide a generalized framework for universal face forgery detection. In this paper, we address this challenge by introducing FADNet (Face Authenticity Detector Net), which reformulates face forgery detection as a one-class classification (OCC) task. By training exclusively on authentic facial data to capture their intrinsic representations, FADNet flags any image whose feature embedding deviates significantly from the learned distribution of real faces as a forgery. The framework incorporates Evidential Deep Learning (EDL) to quantify predictive uncertainty and utilizes a plug-and-play pseudo-forgery image generator (PFIG) to tighten decision boundaries around authentic data. Extensive experimental evaluations on the DF40 and ASFD benchmarks demonstrate that FADNet achieves superior performance and generalization capabilities. Specifically, FADNet substantially outperforms existing state-of-the-art (SOTA) methods, yielding a remarkable average accuracy of 96.63% and an average precision of 98.83%.

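2605.09935 2026-05-14 cs.CV cs.CR Version update

Evidence-based Decision Modeling for Synthetic Face Detection with Uncertainty-driven Active Learning

Qingchao Jiang, Zhenxuan Hou, Zhiying Zhu, Zhenxing Qian, Xinpeng Zhang, Zaiwang Gu

AI summary With the rapid development of deep generative models, forged face images are widely used for illegal activities. Existing synthetic face detection methods have made progress, but their reliance on the Softmax activation function makes them overconfident, so their predictions are unreliable on images from unknown distributions. The paper proposes EMSFD, which models class evidence with a Dirichlet distribution and explicitly incorporates model uncertainty to improve detection reliability and generalization. It also uses the uncertainty to guide active learning and reduce annotation cost. Experiments show a 15% accuracy improvement over the existing state-of-the-art methods.

Comments The sole reason for our withdrawal application is that we have identified critical areas in our manuscript that require substantial revision and improvement to meet rigorous scientific standards. Our only intention is to retract the current draft to revise and enhance it, with no plans to replace it with a different version or redirect readers to other sources at this time

Details
English abstract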

With the rapid development of deep generative models, forged facial images are massively exploited for illegal activities. Although existing synthetic face detection methods have achieved significant progress, they suffer from the inherent limitation of overconfidence due to their reliance on the Softmax activation function. Thus, these methods often lead to unreliable predictions when encountering unknown Out-of-Distribution (OOD) images, and cannot ascertain the model's uncertainty in its prediction. Meanwhile, most existing methods require massive high-quality annotated data, which greatly limits their practicability across diverse scenarios. To address these limitations, we propose EMSFD (Evidence-based decision Modeling for Synthetic Face Detection with uncertainty-driven active learning), an approach designed to enhance detection reliability and generalizability. Specifically, EMSFD models class evidence using the Dirichlet distribution and explicitly incorporates model uncertainty into the prediction process. Furthermore, during training, the estimated uncertainty is exploited to prioritize more informative samples from the unlabeled pool for annotation, thereby reducing labeling cost and improving model generalization. Extensive experimental evaluations demonstrate that our method enhances the interpretability of synthetic face detection. Meanwhile, our method yields a 15% increase in accuracy compared to existing state-of-the-art (SOTA) baselines, which demonstrates the superior detection performance and generalizability of our approach. Our code is available at: https://github.com/hzx111621/EMSFD.
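
The evidence modeling and uncertainty-driven sample selection described in the abstract can be sketched as follows. This is a minimal illustration, not EMSFD's implementation: it assumes the standard evidential deep learning parameterization (Dirichlet parameter = evidence + 1, uncertainty = number of classes divided by the Dirichlet strength), and the function names and the ranking rule are hypothetical.

```python
def dirichlet_opinion(evidence):
    """Turn non-negative per-class evidence into probabilities and an uncertainty mass.

    Assumed parameterization (standard in evidential deep learning, not confirmed
    by the abstract): alpha_k = e_k + 1, S = sum(alpha), p_k = alpha_k / S, u = K / S.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    return probs, uncertainty


def select_for_annotation(pool, budget):
    """Hypothetical active-learning rule: label the most uncertain samples first.

    `pool` maps a sample id to its evidence vector.
    """
    ranked = sorted(pool, key=lambda sid: dirichlet_opinion(pool[sid])[1], reverse=True)
    return ranked[:budget]
```

With zero evidence the uncertainty is 1 and the prediction is uniform, which is the behavior a Softmax output cannot express on its own.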

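2605.09725 2026-05-14 cs.CV Version update

On-Policy Distillation with Best-of-N Teacher Rollout Selection

Ke Zhang, Yunjie Tian, Dongdi Zhao, Yijiang Li, Yuanye Liu, Vishal M Patel, Di Fu

Affiliations * Johns Hopkins University; TikTok; University of California, San Diego; Fudan University

AI summary This paper proposes BRTS, a framework that improves on-policy distillation (OPD) to raise model performance on complex reasoning tasks. BRTS selects the best auxiliary trajectory from multiple teacher rollouts, reducing the noise and variance of the supervision signal and thereby improving student learning. Experiments show that BRTS clearly outperforms standard OPD on several mathematical reasoning benchmarks, especially on harder datasets.

Comments 10 pages, 5 figures

Details
English abstract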

On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at https://github.com/BWGZK-keke/BRTS.
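
The priority rule in the abstract (correctness first, student alignment second, with a ground-truth-conditioned recovery step as fallback) can be sketched as below. This is an illustrative sketch only: the `Rollout` type, the use of a student log-probability as the alignment score, and the `recover` callback are assumptions, not names from the BRTS code base.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Rollout:
    text: str
    is_correct: bool
    # Assumed alignment proxy: mean log-probability of this teacher
    # trajectory under the current student (higher = better aligned).
    student_logprob: float


def select_teacher_rollout(
    pool: Sequence[Rollout],
    recover: Optional[Callable[[], Rollout]] = None,
) -> Optional[Rollout]:
    """Pick the auxiliary teacher trajectory from a small sampled pool."""
    correct = [r for r in pool if r.is_correct]
    if correct:
        # Among correct rollouts, prefer the one closest to student behavior.
        return max(correct, key=lambda r: r.student_logprob)
    if recover is not None:
        # No unconditioned sample was correct: fall back to a
        # ground-truth-conditioned recovery step (hypothetical callback).
        return recover()
    return None
```

Filtering on correctness before ranking by alignment means that an incorrect rollout is never selected merely because the student finds it likely.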

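2605.08320 2026-05-14 eess.IV cs.CV Version update

Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks

Marwane Hariat, Antoine Manzanera, David Filliat

Affiliations * U2IS, ENSTA, Institut Polytechnique de Paris

AI summary This paper addresses the poor performance of monocular depth estimation in low-texture regions by proposing a distance transform over pre-semantic contours, combined with self-supervised neural networks, to improve depth prediction accuracy. The method jointly estimates pre-semantic contours, depth, and camera motion, and uses the distance transform to increase discriminative power in low-texture regions, yielding more discriminative input images and more effective loss functions. Experiments show strong results on multiple datasets, surpassing existing self-supervised depth estimation methods.

Details
English abstract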

Monocular depth estimation (MDE) with self-supervised training approaches struggles in low-texture areas, where photometric losses may lead to ambiguous depth predictions. To address this, we propose a novel technique that enhances spatial information by applying a distance transform over pre-semantic contours, augmenting discriminative power in low-texture regions. Our approach jointly estimates pre-semantic contours, depth, and ego-motion. The pre-semantic contours are leveraged to produce new input images, with variance augmented by the distance transform in uniform areas. This approach results in more effective loss functions, enhancing the training process for depth and ego-motion. We demonstrate theoretically that the distance transform is the optimal variance-augmenting technique in this context. Through extensive experiments on KITTI, Cityscapes, Waymo, NYUv2, and ScanNet, our model demonstrates robust performance, surpassing competing self-supervised methods in MDE.
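
To illustrate why a distance transform adds variance in uniform areas, the sketch below computes, for every pixel, its distance to the nearest contour pixel. It is a simplified stand-in: it uses the city-block (4-connected) metric via a multi-source breadth-first search, which is an assumption, since the abstract does not state which metric or contour detector the authors use.

```python
from collections import deque


def distance_transform(contours):
    """City-block distance from each pixel to the nearest contour pixel.

    `contours` is a 2D list of 0/1 values (1 = contour). Pixels keep the
    value None if the image contains no contour at all.
    """
    height, width = len(contours), len(contours[0])
    dist = [[None] * width for _ in range(height)]
    queue = deque()
    for y in range(height):
        for x in range(width):
            if contours[y][x]:
                dist[y][x] = 0
                queue.append((y, x))
    # Multi-source BFS: each pixel is reached first from its nearest contour.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist
```

A region with constant intensity gives a photometric loss nothing to match on, whereas its distance map varies from pixel to pixel, which is the intuition behind using it as an additional input.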

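2605.08293 2026-05-14 cs.CV Version update

Distill, Diffuse, and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation

Yijing Wang, Ruonan Li, Qilin Wang, Rongqiang Zhao, Jie Liu

Affiliations * Faculty of Computing, Harbin Institute of Technology; Pengcheng Laboratory

AI summary This paper proposes DDS, a lightweight framework for annotation-free 3D scene understanding. The method combines multi-granularity knowledge distillation with graph-diffusion-based segmentation, preserving the superpoint-based structural organization while incorporating visual semantic information, to achieve region-consistent and semanticized 3D scene understanding. Experiments show that DDS outperforms existing methods on several real-world datasets, with notable gains on multiple metrics, offering a scalable and interpretable solution for annotation-free 3D scene understanding.

Details
English abstract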

3D semantic scene understanding is essential for digital twins, autonomous driving, smart agriculture, and embodied perception, yet dense point-wise annotation for point clouds remains expensive and difficult to scale. Existing annotation-free methods often face a trade-off between semantic recognition and structural efficiency: open-vocabulary and foundation-model-driven methods provide strong semantic priors, but often come with substantial computational costs, while structure-oriented methods based on superpoints, clustering, and graph reasoning are lightweight but often produce category-agnostic regions. We propose DDS, a resource-efficient structure-oriented framework for region-consistent and semanticized annotation-free 3D scene understanding. DDS preserves the lightweight superpoint-based organization paradigm while incorporating visual semantic cues from projected features and segmentation-derived masks. It first performs multi-granularity distillation to guide the 3D backbone at the point, mask-prototype, and inter-prototype levels, then applies graph diffusion over superpoints to propagate semantic information directly in 3D, producing coherent region representations without costly spectral decomposition or dense open-vocabulary 3D feature fields. Finally, DDS uses segmentation-cluster association to assign interpretable semantic names to category-agnostic 3D clusters. Experiments on real-world datasets show that DDS achieves the best performance among representative structure-oriented annotation-free baselines, improving oAcc, mAcc, and mIoU by up to 5.9%, 8.1%, and 2.4%, respectively. These results demonstrate that DDS improves region consistency and lightweight semantic recognition, providing a scalable and interpretable solution for annotation-free 3D scene understanding.
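
The idea of propagating semantic information over a superpoint graph without a spectral decomposition can be sketched with a simple iterative diffusion. The update rule below (a restart-style mix of each node's initial feature with the mean of its neighbors) is an assumed, generic operator chosen for illustration; the abstract does not specify the operator DDS actually uses.

```python
def diffuse_features(features, edges, alpha=0.5, steps=10):
    """Iteratively smooth per-superpoint features over an undirected graph.

    features: list of equal-length feature vectors, one per superpoint.
    edges:    list of (i, j) index pairs between adjacent superpoints.
    alpha:    weight of the neighbor mean; (1 - alpha) anchors the initial feature.
    """
    n = len(features)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    current = [list(f) for f in features]
    for _ in range(steps):
        updated = []
        for i in range(n):
            if neighbors[i]:
                mean = [
                    sum(current[j][d] for j in neighbors[i]) / len(neighbors[i])
                    for d in range(len(features[i]))
                ]
            else:
                mean = current[i]  # isolated superpoint keeps its own feature
            updated.append(
                [(1 - alpha) * f0 + alpha * m for f0, m in zip(features[i], mean)]
            )
        current = updated
    return current
```

Each step only touches graph neighbors, so the cost grows with the number of edges rather than requiring an eigendecomposition of the graph Laplacian.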

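2605.08078 2026-05-14 cs.CV cs.LG Version update

Normalizing Trajectory Models

Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind

Affiliations * Apple

AI summary This paper proposes Normalizing Trajectory Models (NTM), a new generative model that addresses the performance drop of diffusion models when only a few sampling steps are used. NTM models each reverse step as a conditional normalizing flow trained with exact likelihood, retaining the full likelihood framework while improving generation efficiency. The model combines shallow invertible blocks with a deep parallel predictor, can be trained from scratch or initialized from pretrained flow-matching models, and uses self-distillation to generate high-quality images in only four steps, performing strongly on text-to-image tasks.

Comments 25 pages, 10 figures; corrected typos and citations

Details
English abstract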

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
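
As a hedged illustration of what an exact trajectory likelihood can look like, assume a Markov reverse chain $x_T \to \dots \to x_0$ in which each step is an invertible map $f_\theta(\cdot\,; x_t, t)$ to a base variable with density $p_Z$. The notation is illustrative and not taken from the paper. The change-of-variables formula then gives each step's conditional log-likelihood exactly, and the steps add up along the trajectory:

```latex
\log p_\theta(x_{0:T}) = \log p(x_T) + \sum_{t=1}^{T} \log p_\theta(x_{t-1} \mid x_t),
\qquad
\log p_\theta(x_{t-1} \mid x_t)
  = \log p_Z\!\big(f_\theta(x_{t-1}; x_t, t)\big)
  + \log \left| \det \frac{\partial f_\theta(x_{t-1}; x_t, t)}{\partial x_{t-1}} \right| .
```

Unlike a Gaussian denoising step, the conditional density here is not restricted to a fixed parametric family, which is what allows a coarse transition to remain expressive while the likelihood stays tractable.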

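2605.07188 2026-05-14 cs.CV Version update

PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

Fuxin Duan, Hui Wang

Affiliations * Pico, Bytedance

AI summary This paper proposes PicoEyes, a unified gaze estimation framework that directly predicts several key gaze attributes from monocular or binocular inputs, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, and that handles calibration, gaze forecasting, and device posture changes in a single end-to-end pipeline. The work also introduces a large-scale multi-view near-eye dataset with detailed 2D and 3D annotations under diverse conditions. Experiments show that PicoEyes outperforms existing academic and industrial gaze tracking methods in the no-calibration, calibration, rewear-after-calibration, and forecasting settings, providing a practical paradigm for robust and generalizable gaze estimation in mixed reality applications.

Comments 15 pages, 10 figures, conference

Details
English abstract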

We present PicoEyes, a unified gaze estimation framework that directly predicts all key attributes of gaze, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, from either monocular or binocular inputs. The framework simultaneously addresses calibration, gaze forecasting, and varying device postures, while also supporting 3D eye reconstruction via joint estimation of eye parameters and depth maps in an end-to-end manner. In addition, we introduce a large-scale multi-view near-eye dataset containing comprehensive 2D and 3D annotations under diverse conditions, including train, test, rewear-test, and calibration sessions. Extensive experiments demonstrate that PicoEyes achieves state-of-the-art performance, consistently outperforming both academic and industrial gaze tracking methods across no-calibration, calibration, rewear-after-calibration, and forecasting settings. This work establishes a practical, end-to-end paradigm for robust and generalizable gaze estimation in mixed reality (MR) applications.

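2605.04506 2026-05-14 cs.CV cs.AI Version update

Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Binh Long Nguyen, Kien Nguyen, Sridha Sridharan, Clinton Fookes, Peyman Moghadam

Affiliations * School of Electrical Engineering and Robotics; Queensland University of Technology; CSIRO Robotics; CSIRO

AI summary Ilov3Splat is a new framework built on 3D Gaussian Splatting (3D-GS) for instance-level open-vocabulary 3D scene understanding. It augments Gaussian splats with view-consistent feature fields and jointly optimizes scene geometry and semantic representations, improving cross-view consistency and instance-level reasoning. By combining multi-resolution hash embedding with an instance feature field trained with a contrastive loss, Ilov3Splat can identify and segment arbitrary objects in a 3D scene from natural language descriptions without category supervision, outperforming existing open-vocabulary 3D understanding methods.

Comments The International Conference on Pattern Recognition (ICPR) 2026

Details
English abstract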

We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.

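2604.27389 2026-05-14 cs.CV cs.AI Version update

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Bingli Wang, Huanze Tang, Haijun Lv, Zhishan Lin, Lixin Gu, Lei Feng, Qipeng Guo, Kai Chen

Affiliations * Southeast University; Shanghai AI Laboratory

AI summary This paper proposes the COHERENCE benchmark to evaluate the ability of multimodal large language models to perform fine-grained image-text alignment in interleaved image-text contexts. Existing benchmarks mostly focus on single-image or multi-image comprehension, whereas real-world information is often presented in interleaved image-text form, requiring models not only to recognize image content but also to establish fine-grained image-text associations and reason over them. COHERENCE covers interleaved image-text content from four representative domains, contains 6,161 high-quality questions, and uses a six-type error analysis to reveal the shortcomings of current models on this task.

Details
English abstract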

In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodal contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.

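2604.09025 2026-05-14 cs.CV cs.AI Version update

Skill-Conditioned Visual Geolocation for Vision-Language Models

Chenjie Yang, Yutian Jiang, Yutong Deng, Chenyu Wu

Affiliations * Southwest Jiaotong University; The Hong Kong University of Science and Technology (Guangzhou); Zhejiang University

AI summary This study addresses the lack of structured geographic reasoning and autonomous self-evolution in vision-language models for geolocation, and proposes GeoSkill, a training-free framework. The method is built on an evolving Skill-Graph: human expert trajectories are distilled into natural-language skills, and an inference model performs reasoning guided by the graph. An autonomous evolution mechanism then continually generates and refines skills from web-scale data, improving geolocation accuracy and reasoning faithfulness and strengthening the model's understanding of, and generalization over, real-world geographic knowledge.

Details
English abstract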

Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.

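2604.08039 2026-05-14 cs.CV cs.AI cs.LG Version update

LINE: LLM-based Iterative Neuron Explanations for Vision Models

Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Paweł Gelar, Przemysław Biecek

Affiliations * Centre for Credible AI; Warsaw University of Technology; University of Warsaw, Poland

AI summary This paper proposes LINE, an iterative neuron explanation method based on large language models, for open-vocabulary concept labeling of neurons in vision models. Operating in a black-box setting, LINE uses a language model and an image generator to iteratively generate and refine concept descriptions without any model training, discovers concepts missed by conventional predefined vocabularies, and outperforms existing methods on multiple datasets. Beyond identifying each neuron's top concept, it provides a complete generation history that supports polysemanticity evaluation and visual explanations.

Details
English abstract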

Interpreting individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.11 on ImageNet and 0.05 on Places365, while discovering, on average, 27% of new concepts missed by predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, enabling polysemanticity evaluation and producing visual explanations that rival gradient-dependent activation maximization methods. The source code will be made available soon.

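2604.04692 2026-05-14 cs.CL cs.AI cs.CV Version update

Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity

Jaeyoon Jung, Yejun Yoon, Kunwoo Park

Affiliations * School of AI Convergence, Soongsil University; MAUM AI Inc.; Department of Intelligent Semiconductors, Soongsil University

AI summary This paper examines whether visual evidence should be used universally in multimodal fact-checking, challenging the assumption in prior work that visual evidence always improves performance. The authors propose AMuFC, a framework with two collaborating vision-language models, one that judges whether visual evidence is needed and one that verifies the claim based on the evidence, enabling adaptive use of visual evidence. Experiments show that the method substantially improves fact-checking accuracy on three datasets.

Comments preprint, 18 pages

Details
English abstract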

Automated fact-checking is a crucial task that supports a responsible information ecosystem. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that the indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative vision-language models with distinct roles for the adaptive use of visual evidence: an Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer's assessment. Experimental results on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance. We will release all code and datasets at https://github.com/ssu-humane/AMuFC.

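2604.04667 2026-05-14 cs.CV cs.LG cs.RO Version update

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

Selim Ahmet Iz, Francesco Nex, Norman Kerle, Henry Meissner, Ralf Berger

Affiliations * German Aerospace Center (DLR), Institute of Space Research; Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente

AI summary This paper proposes ZeD-MAP, a framework for real-time, high-accuracy depth reconstruction from UAV aerial imagery. It combines a zero-shot diffusion model with incremental cluster-based bundle adjustment (BA) to improve the metric consistency and temporal continuity of depth estimates without task-specific retraining. Experiments show sub-meter accuracy on high-resolution aerial images with per-frame processing times between 1.47 and 4.91 seconds, making it suitable for real-time 3D map generation.

Details
English abstract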

Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.
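
The figures quoted in the abstract can be related with a short back-of-the-envelope calculation. The sketch below assumes a nadir view with a uniform ground sampling distance (GSD) across the frame, which is a simplification, and the helper names are hypothetical.

```python
import math


def pixels_per_frame(coverage_m2, gsd_m):
    """Number of pixels needed to cover an area at a given GSD (meters per pixel)."""
    return coverage_m2 / (gsd_m ** 2)


def error_in_pixels(error_m, gsd_m):
    """Express a ground-space error in image pixels at the given GSD."""
    return error_m / gsd_m


gsd_m = 0.0085                # 0.85 cm/px, from the abstract
coverage_m2 = 2650.0          # ground coverage per frame, from the abstract

pixels = pixels_per_frame(coverage_m2, gsd_m)       # about 36.7 million pixels
side_m = math.sqrt(coverage_m2)                      # about 51.5 m if the footprint were square
horizontal_px = error_in_pixels(0.87, gsd_m)         # about 102 px for the 0.87 m XY error
```

Under these assumptions each frame holds roughly 36.7 megapixels, which gives a sense of the image sizes behind the reported per-image runtimes.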

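2603.29917 2026-05-14 cs.CV Version update

Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

Hiba Adil Al-kharsan, Róbert Rajkó

Affiliations * Doctoral School of Computer Science, University of Szeged; University Research and Innovation Center (EKIK), Óbuda University

AI summary This paper proposes a robust multi-class handwritten digit classification framework that combines diffusion-driven feature denoising with a hybrid feature representation. Non-negative Matrix Factorization (NNMF) converts input images into an interpretable feature representation, a convolutional neural network extracts deep features, and the two are fused into a unified hybrid representation. Noise is added progressively in feature space and a denoising network is trained to recover clean features, improving robustness to noise and adversarial attacks. Experiments show superior classification performance in both baseline and adversarial settings.

Details
English abstract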

This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in feature space to improve robustness to noise and adversarial attacks. This manuscript is submitted as an extended abstract rather than a full-length press-ready paper. First, the input images are converted into a compact, interpretable representation using Non-negative Matrix Factorization (NNMF). In parallel, deep features are extracted using a convolutional neural network (CNN). These complementary features are combined into a unified hybrid representation. The main objective of this work is to extend our previously validated two-class framework to a multi-class handwritten digit classification scenario. To improve robustness, a stepwise diffusion process is applied in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this process and rebuild clean representations from perturbed inputs. The denoised features are then used for multi-class classification. The proposed method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental results show that the diffusion-based hybrid model is both effective and robust, outperforming the CNN baseline models while maintaining strong classification performance. These results demonstrate the effectiveness of feature-level diffusion defense for reliable multi-class handwritten digit classification.
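
The step of gradually adding Gaussian noise in feature space can be sketched as below. The closed-form noising rule and the beta schedule follow the common DDPM formulation, which is an assumption here, since the abstract does not give the schedule or the exact process the authors use. All names are illustrative.

```python
import math
import random


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): the share of signal left after t steps."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_features(features, alpha_bar, rng):
    """Sample a noised feature vector: sqrt(a) * x + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in features]


# Example: a hybrid feature vector noised at the last step of a short schedule.
schedule = alpha_bars([0.1, 0.2, 0.3])
noisy = noise_features([0.4, 1.2, -0.7], schedule[-1], random.Random(0))
```

A denoiser would then be trained to map `noisy` back toward the clean vector. That network is outside the scope of this sketch.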

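2603.26839 2026-05-14 cs.LG cs.CV Version update

From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

Alberto G. Rodriguez Salgado

Affiliations * Independent Researcher

AI summary This study asks whether multimodal models solve visual spatial tasks through genuine planning or through brute-force search in text space. The author introduces MazeBench, a benchmark of 110 procedurally generated maze images, and evaluates 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. Although some models reach high accuracy on visual mazes, they mainly solve them by converting the image into a text grid and enumerating paths step by step rather than by spatial planning, showing that high accuracy does not imply human-level spatial understanding.

Comments 15 pages, 10 figures. Code and mazes available at https://github.com/alrod97/LLMs_mazes

Details
English abstract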

How do multimodal models solve visual spatial tasks -- through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710--22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2--12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
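
For reference, the second stage of the strategy the abstract describes (search over a text grid) corresponds to ordinary breadth-first search. The sketch below is a generic BFS on a character grid. The grid encoding (`S` start, `E` exit, `#` wall) is an assumption made for illustration and is not necessarily the format used in MazeBench.

```python
from collections import deque


def shortest_path_length(grid):
    """Return the number of moves on a shortest S-to-E path, or None if unreachable."""
    rows = [list(row) for row in grid]
    start = goal = None
    for y, row in enumerate(rows):
        for x, cell in enumerate(row):
            if cell == "S":
                start = (y, x)
            elif cell == "E":
                goal = (y, x)
    if start is None or goal is None:
        raise ValueError("grid needs both an 'S' and an 'E' cell")

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (y, x), dist = queue.popleft()
        if (y, x) == goal:
            return dist
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            inside = 0 <= ny < len(rows) and 0 <= nx < len(rows[ny])
            if inside and rows[ny][nx] != "#" and (ny, nx) not in seen:
                seen.add((ny, nx))
                queue.append(((ny, nx), dist + 1))
    return None
```

The search itself is mechanical once the grid is correct, which is consistent with the ablation in the abstract locating the weakness in visual extraction rather than in the search.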

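2603.22364 2026-05-14 cs.LG cs.AI cs.CV Version update

MCLR: Improving Conditional Modeling via Inter-Class Likelihood-Ratio Maximization and Unifying Classifier-Free Guidance with Alignment Objectives

Xiang Li, Yixuan Jia, Xiao Li, Jeffrey A. Fessler, Rongrong Wang, Qing Qu

Affiliations * University of Michigan; Michigan State University

AI summary This paper proposes MCLR, a new training objective that improves the conditional generation ability of diffusion models by maximizing inter-class likelihood ratios. The method addresses the insufficient inter-class separation of standard denoising score matching (DSM) by introducing an alignment objective during training, so that the model achieves better conditional generation even without inference-time guidance (CFG). Theoretical analysis shows that the CFG-guided score is the optimal solution to a sample-adaptive weighted MCLR objective, revealing the connection between CFG and alignment objectives.

Details
English abstract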

Diffusion models achieve strong performance in generative modeling, but their success often relies heavily on classifier-free guidance (CFG), an inference-time heuristic that modifies the sampling trajectory. In theory, diffusion models trained with standard denoising score matching (DSM) should recover the target data distribution, raising two fundamental questions: (i) why is inference-time guidance necessary in practice, and (ii) can its underlying effect be internalized into a principled training objective? In this work, we argue that a key limitation of standard DSM is insufficient inter-class separation. To address this issue, we propose MCLR, an alignment objective that explicitly maximizes inter-class likelihood-ratios during training. Fine-tuning diffusion models with MCLR induces CFG-like improvements under standard sampling, substantially improving guidance-free conditional generation and narrowing the gap to inference-time CFG. Beyond these empirical benefits, we show theoretically that the CFG-guided score is exactly the optimal solution to a sample-adaptive weighted MCLR objective. This result connects CFG to alignment-based objectives, providing a mechanistic interpretation of CFG as an implicit inference-time contrastive alignment procedure.
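
For readers unfamiliar with CFG, its usual form is recalled below as background. This is the standard guidance rule, not the MCLR objective, whose exact form the abstract does not give. Here $s_\theta(x_t, c)$ and $s_\theta(x_t)$ denote the conditional and unconditional score estimates and $w \ge 0$ is the guidance scale. The second equality assumes the estimates equal the true scores of the noised distributions $p_t$.

```latex
\tilde{s}_w(x_t, c)
  = s_\theta(x_t, c) + w \,\big( s_\theta(x_t, c) - s_\theta(x_t) \big)
  = \nabla_{x_t} \log p_t(x_t \mid c)
  + w \, \nabla_{x_t} \log \frac{p_t(x_t \mid c)}{p_t(x_t)} .
```

The guidance term is the gradient of a log-likelihood ratio, so CFG pushes samples toward regions where the condition is more likely than under the unconditional model. This gives some intuition for why a training objective built on likelihood ratios could reproduce CFG-like behavior.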

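2603.13054 2026-05-14 cs.CV Version update

Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Meilong Xu, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Xin Yu, Weimin Lyu, Kehan Qi, Dimitris Samaras, Chao Chen

Affiliations * Stony Brook University; Massachusetts General Hospital and Harvard Medical School; Stanford University; Penn State University

AI summary This study explores how vision-language models (VLMs) can detect topological anomalies in tubular network structures such as blood vessels, nerve fibers, and road networks, including broken connections, spurious connections, and missing or extra branches. It finds that existing VLMs have poor topological perception, performing nearly at random. The authors build a large benchmark dataset with diverse topological perturbations and propose Topo-R1, which uses a composite reward combining localization, classification, and structural fidelity to substantially improve topological anomaly detection, outperforming general-purpose VLMs and approaching supervised methods.

Comments 26 pages, 6 figures

Details
English abstract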

Topology is critical in tubular structures such as blood vessels, nerve fibers, and road networks, where connectivity and loop structure govern downstream functional analysis. Vision-Language Models (VLMs) are promising candidates for understanding such structures, given their reasoning and grounding capabilities. To probe their topological perception, we systematically evaluate leading closed- and open-source VLMs on localizing and classifying four canonical topological anomalies (broken/spurious connections, missing/extra branches) in tubular-network segmentation masks. They perform nearly at random, indicating that topology-aware perception is largely absent from current general-purpose VLMs. As no existing resource pairs segmentation masks with localized anomaly annotations, we build an automated, multi-domain data-curation pipeline that synthesizes diverse topological perturbations with verifiable Betti-number annotations across graduated difficulty levels, yielding the first systematic benchmark with a large-scale training set and held-out in-distribution (ID) and out-of-distribution (OOD) test suites. Building on this benchmark, we introduce Topo-R1, centered on a topology-aware composite reward that jointly scores localization, classification, and skeleton-level structural fidelity. Supervised fine-tuning cold-starts schema-compliant outputs, and Group Relative Policy Optimization (GRPO) then optimizes the policy against this reward, steering predictions toward topologically meaningful structure rather than superficial pixel overlap. Extensive experiments show that Topo-R1 substantially outperforms general-purpose VLMs and matches or exceeds supervised baselines across ID, OOD, and real-segmentation-output protocols, establishing a strong foundation for VLM-based topological understanding of structured visual data.
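
To make the Betti-number annotations concrete, the sketch below computes the two Betti numbers of a 2D binary mask: the number of connected components and the number of enclosed loops. It is an illustrative implementation under a common digital-topology convention (8-connectivity for foreground, 4-connectivity for background), not the paper's pipeline.

```python
N4 = ((1, 0), (-1, 0), (0, 1), (0, -1))
N8 = N4 + ((1, 1), (1, -1), (-1, 1), (-1, -1))


def count_components(cells, neighbors):
    """Count connected components of a set of (y, x) cells by flood fill."""
    remaining = set(cells)
    count = 0
    while remaining:
        stack = [remaining.pop()]
        count += 1
        while stack:
            y, x = stack.pop()
            for dy, dx in neighbors:
                nxt = (y + dy, x + dx)
                if nxt in remaining:
                    remaining.remove(nxt)
                    stack.append(nxt)
    return count


def betti_numbers(mask):
    """Return (b0, b1): foreground components and enclosed holes of a 2D 0/1 mask."""
    height, width = len(mask), len(mask[0])
    foreground = [(y, x) for y in range(height) for x in range(width) if mask[y][x]]
    # Pad by one pixel so all outer background forms a single component.
    background = [
        (y, x)
        for y in range(-1, height + 1)
        for x in range(-1, width + 1)
        if not (0 <= y < height and 0 <= x < width and mask[y][x])
    ]
    b0 = count_components(foreground, N8)
    b1 = count_components(background, N4) - 1  # subtract the outer background
    return b0, b1
```

Removing a single pixel from a closed ring leaves one component but drops the loop count from 1 to 0, which is the kind of change a broken-connection anomaly produces and which overlap-based metrics barely register.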

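2603.07433 2026-05-14 cs.LG cs.CV Version update

Data Agent: Learning to Select Data via End-to-End Dynamic Optimization

Suorong Yang, Fangjian Su, Hai Gan, Ziqi Ye, Jie Li, Baile Xu, Furao Shen, Soujanya Poria

Affiliations * National University of Singapore; Nanjing University; Nanyang Technological University

AI summary This paper proposes Data Agent, an end-to-end dynamic data selection framework that accelerates training by prioritizing informative samples during online training. It formulates data selection as a training-aware sequential decision-making problem and, guided by a composite reward combining loss and confidence signals, learns a sample selection policy that co-evolves with model optimization. Experiments show that Data Agent improves training efficiency while preserving or improving performance across multiple datasets and architectures, with good generality and robustness across practical scenarios.

Details
English abstract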

Dynamic data selection aims to accelerate training by prioritizing informative samples during online training. However, existing methods typically rely on task-specific handcrafted metrics or static/snapshot-based criteria to estimate sample importance, limiting scalability across learning paradigms and making it difficult to capture the evolving utility of data throughout training. To address this challenge, we propose Data Agent, an end-to-end dynamic data selection framework that formulates data selection as a training-aware sequential decision-making problem. The agent learns a sample-wise selection policy that co-evolves with model optimization, guided by a composite reward that integrates loss-based difficulty and confidence-based uncertainty signals. The reward signals capture complementary objectives of optimization impact and information gain, together with a tuning-free adaptive weighting mechanism that balances these signals over training. Extensive experiments across a wide range of datasets and architectures demonstrate that Data Agent consistently accelerates training while preserving or improving performance, e.g., reducing costs by over 50% on ImageNet-1k and MMLU with lossless performance. Moreover, its dataset-agnostic formulation and modular reward make it plug-and-play across tasks and scenarios, e.g., robustness to noisy datasets, highlighting its potential in real-world scenarios. Code is available at https://github.com/Jackbrocp/Data-Agent.

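2603.05582 2026-05-14 cs.LG cs.CV Updated version

Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models

Ivan Luiz De Moura Matos, Abdel Djalil Sad Saoud, Ekaterina Iakovleva, Vito Paolo Pastore, Enzo Tartaglione

Affiliations * LTCI, Télécom Paris, Institut Polytechnique de Paris, France

AI summary This paper examines how to extract unbiased subnetworks from conventionally trained deep learning models in order to reduce algorithmic bias. It proposes BISE, a method that uses pruning to identify and isolate "bias-free" subnetworks already present in the model, without additional data or retraining. The method reduces reliance on biased features while maintaining model performance, offering a structural-adaptation route to efficient bias mitigation. Experiments on multiple benchmark datasets show advantages in performance and computational efficiency.

Comments This work has been accepted for publication at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Details
English abstract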

The issue of algorithmic biases in deep learning has led to the development of various debiasing techniques, many of which perform complex training procedures or dataset manipulation. However, an intriguing question arises: is it possible to extract fair and bias-agnostic subnetworks from standard vanilla-trained models without relying on additional data, such as an unbiased training set? In this work, we introduce Bias-Invariant Subnetwork Extraction (BISE), a learning strategy that identifies and isolates "bias-free" subnetworks that already exist within conventionally trained models, without retraining or finetuning the original parameters. Our approach demonstrates that such subnetworks can be extracted via pruning and can operate without modification, effectively relying less on biased features and maintaining robust performance. Our findings contribute towards efficient bias mitigation through structural adaptation of pre-trained neural networks via parameter removal, as opposed to costly strategies that are either data-centric or involve (re)training all model parameters. Extensive experiments on common benchmarks show the advantages of our approach in terms of the performance and computational efficiency of the resulting debiased model.

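2603.02337 2026-05-14 cs.LG cs.AI cs.CV Updated version

Preconditioned Flow Matching

Shadab Ahamed, Eshed Gal, Md Shahriar Rahim Siddiqui, Simon Ghyselincks, Moshe Eliasof, Eldad Haber

Affiliations * University of British Columbia; University of Cambridge

AI summary This paper studies a geometric optimization bottleneck in training flow matching: when the covariance of the intermediate distributions is ill-conditioned, gradient descent converges at markedly different rates along different directions. The authors propose preconditioned flow matching, which transforms the target distribution into a more isotropic representation to improve the conditioning of the intermediate path, thereby improving training efficiency and generation quality. Experiments show significant gains on a variety of distributions and on high-resolution image datasets.

Comments 34 pages, 16 figures, 5 tables

Details
English abstract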

Flow matching (FM) learns vector fields by regressing stochastic velocity targets along intermediate distributions $p_t$. We identify a geometric optimization bottleneck in this regression problem: when the covariance $Σ_t$ of $p_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while making slow progress along low-variance ones. In an exactly solvable Gaussian setting, we prove that the excess risk is weighted by $Σ_t$, and that both gradient descent and stochastic gradient descent inherit condition-number-dependent convergence. We then extend the analysis to Gaussian mixtures, showing that multimodality does not average away this effect; instead, the slowest and worst-conditioned component can control optimization. Motivated by this analysis, we propose preconditioned flow matching, a precondition-then-match framework that transforms the target distribution into a more isotropic representation, trains the main flow in the transformed space, and maps generated samples back through the inverse transformation. We show theoretically that preconditioning reshapes the intermediate FM path and improves its conditioning. Across controlled Gaussian and Gaussian-mixture experiments, latent MNIST, and other high-resolution image datasets up to $512{\times}512$ resolution, preconditioning improves path-conditioning diagnostics, low-eigenvalue recovery, FID, MMD, precision, and recall. Compute-matched baselines and preconditioner-quality ablations further show that the gains are not explained merely by additional preconditioner parameters, but by improved geometry of the downstream flow matching problem.
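
The precondition-then-match idea can be illustrated with a minimal sketch. It assumes the simplest possible preconditioner, a linear ZCA whitening transform fitted to samples; the abstract does not say which transformation the authors use, so `fit_whitener` and the toy Gaussian data are hypothetical stand-ins. The sketch only shows that the transformed space is better conditioned and that samples can be mapped back through the inverse.

```python
import numpy as np

def fit_whitener(samples, eps=1e-8):
    """Fit a linear ZCA whitening map (an assumed, illustrative preconditioner)."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, eps)  # guard against zero variance
    W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    W_inv = eigvecs @ np.diag(eigvals ** 0.5) @ eigvecs.T
    return mean, W, W_inv

rng = np.random.default_rng(0)
# Toy ill-conditioned target: standard deviations 10 and 0.1 along the two axes.
x = rng.normal(size=(5000, 2)) @ np.diag([10.0, 0.1]) + np.array([3.0, -1.0])

mean, W, W_inv = fit_whitener(x)
z = (x - mean) @ W.T          # transformed space where a flow would be trained
x_back = z @ W_inv.T + mean   # inverse map applied to (here: the same) samples

cond_before = np.linalg.cond(np.cov(x, rowvar=False))
cond_after = np.linalg.cond(np.cov(z, rowvar=False))
```

In this toy setting the covariance condition number drops from roughly 10^4 to about 1, which is the kind of conditioning improvement the abstract argues makes the downstream regression easier.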

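2603.02175 2026-05-14 cs.CV cs.AI Updated version

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

Affiliations * Show Lab, National University of Singapore

AI summary This paper proposes Kiwi-Edit, a versatile video editing method that achieves more precise visual control through joint guidance from instructions and reference images. To address the performance bottleneck caused by data scarcity, the authors design a scalable data generation pipeline and build the large-scale RefVIE dataset and the RefVIE-Bench evaluation benchmark. Built on this dataset, the unified editing architecture Kiwi-Edit fuses learnable queries with latent visual features to provide reference semantic guidance, yielding significant gains in instruction following and reference fidelity and reaching the state of the art in controllable video editing.

Comments Project page: https://showlab.github.io/Kiwi-Edit Huggingface Demo: https://huggingface.co/spaces/linyq/KiwiEdit

Details
English abstract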

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.

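2602.23013 2026-05-14 cs.CV cs.LG Updated version

SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

Camile Lendering, Erkut Akdag, Egor Bondarev

Affiliations * AIMS Group, Department of Electrical Engineering, Eindhoven University of Technology

AI summary This paper proposes SubspaceAD, a training-free few-shot anomaly detection method that uses subspace modeling to identify anomalies in industrial visual inspection. The method first extracts patch-level features from a few normal samples with a frozen DINOv2 model, then fits Principal Component Analysis (PCA) to these features to estimate the low-dimensional subspace of normal variations; at inference it detects anomalies via the reconstruction residual, producing interpretable and statistically grounded anomaly scores. Experiments show that SubspaceAD achieves state-of-the-art performance on multiple datasets and is particularly strong in the one-shot setting.

Comments Accepted to CVPR 2026. Revised version with corrected AU-PRO evaluation and recomputed metrics

Details
English abstract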

Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 97.1% and 97.5% on the MVTec-AD dataset, and 93.2% and 98.2% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
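
The two stages described above (fit PCA on normal patch features, score by reconstruction residual) can be sketched in a few lines. This is an illustration under stated assumptions, not the released code: random vectors stand in for DINOv2 patch features, and the function names, feature dimension, and number of components are hypothetical.

```python
import numpy as np

def fit_subspace(features, n_components):
    """Estimate the mean and top principal directions of normal patch features."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]  # rows are orthonormal directions

def anomaly_scores(features, mean, components):
    """Score each patch by the norm of its residual outside the normal subspace."""
    centered = features - mean
    projected = centered @ components.T @ components
    return np.linalg.norm(centered - projected, axis=1)

rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 16))  # synthetic: normal patches lie near a 3-D subspace
normal = rng.normal(size=(200, 3)) @ basis + 0.01 * rng.normal(size=(200, 16))
mean, components = fit_subspace(normal, n_components=3)

test_normal = rng.normal(size=(5, 3)) @ basis
test_anomalous = test_normal + 2.0 * rng.normal(size=(5, 16))  # off-subspace deviation
scores_normal = anomaly_scores(test_normal, mean, components)
scores_anomalous = anomaly_scores(test_anomalous, mean, components)
```

On this synthetic data every perturbed patch scores higher than every normal patch. In a real pipeline the per-patch scores would be reshaped into a pixel-level anomaly map and aggregated into an image-level score; how the authors do that is not stated in the abstract.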

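2602.22455 2026-05-14 cs.CV Updated version

Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Giuseppe Lando, Rosario Forte, Antonino Furnari

Affiliations * Department of Mathematics and Computer Science, University of Catania, Italy

AI summary This paper studies the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering on edge devices. To address privacy and latency concerns, the authors design a question answering pipeline with two asynchronous threads, one for lightweight video-to-text description generation and one for text-based memory reasoning. Experiments show that on resource-constrained edge devices the approach achieves performance comparable to cloud-based solutions, demonstrating the potential of edge computing for privacy-preserving episodic memory retrieval.

Details
English abstract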

We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, so we investigate implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.

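2602.21204 2026-05-14 cs.LG cs.AI cs.CV Updated version

Test-Time Training with KV Binding Is Secretly Linear Attention

Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li

Affiliations * NVIDIA, Toronto, Ontario, Canada; University of Toronto, Toronto, Ontario, Canada; Vector Institute, Toronto, Ontario, Canada; Technion - Israel Institute of Technology, Haifa, Israel

AI summary This paper revisits the role of test-time training (TTT) with key-value binding in sequence modeling, arguing that it is not simply test-time memorization but a learned linear attention mechanism. The study explains several previously puzzling phenomena in TTT models and shows that a variety of TTT architectures can be unified as linear attention operators. This perspective not only explains model behavior but also brings practical benefits such as architectural simplification, parallel computation, and improved efficiency, giving TTT a more systematic and efficient theoretical basis.

Comments ICML 2026, Webpage: https://research.nvidia.com/labs/sil/projects/tttla/

Details
English abstract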

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity. Project page: https://research.nvidia.com/labs/sil/projects/tttla/.
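
For readers unfamiliar with the "standard linear attention form" the abstract refers to, the sketch below shows plain unnormalized causal linear attention computed two ways: as a recurrent state (fast-weight) update and as a fully parallel masked product. It is not the paper's reduction of TTT, whose details the abstract does not give; it only illustrates why a recurrent write-then-read layer of this kind admits an equivalent parallel formulation. Function names and shapes are illustrative.

```python
import numpy as np

def linear_attention_recurrent(q, k, v):
    """Token-by-token form: write v_t k_t^T into a state matrix, then read with q_t."""
    T, d_v = v.shape
    state = np.zeros((d_v, k.shape[1]))
    out = np.zeros((T, d_v))
    for t in range(T):
        state = state + np.outer(v[t], k[t])  # rank-1 write of the key-value pair
        out[t] = state @ q[t]                 # read with the current query
    return out

def linear_attention_parallel(q, k, v):
    """Parallel form: causal mask applied to Q K^T, with no softmax."""
    T = q.shape[0]
    mask = np.tril(np.ones((T, T)))
    return ((q @ k.T) * mask) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 3))
out_recurrent = linear_attention_recurrent(q, k, v)
out_parallel = linear_attention_parallel(q, k, v)
```

Both functions compute the same output, the sum over s <= t of (q_t . k_s) v_s, so the sequential loop can be replaced by matrix products over the whole sequence.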

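2602.20150 2026-05-14 cs.RO cs.CV Updated version

Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

Wei-Cheng Huang, Jiaheng Han, Xiaohan Ye, Zherong Pan, Kris Hauser

Affiliations * Meta Reality Labs

AI summary This paper studies how to estimate simulation-ready cluttered scenes from real-world observations, addressing the high computational cost and poor robustness of existing methods in scenes with multiple interacting objects. The authors propose a physics-constrained joint shape and pose optimization method that combines a differentiable contact model with an efficient solver to jointly optimize the geometry and poses of multiple rigid objects. On this basis they build the end-to-end SPARCS system, which robustly reconstructs physically valid, simulation-ready scenes; experiments show strong results on cluttered scenes with up to 5 objects and 22 convex hulls.

Comments Accepted to RSS 2026, camera-ready version; 17 pages, 15 figures

Details
English abstract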

Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Regretfully, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end Simulation-ready Physics-Aware Reconstruction for Cluttered Scenes (SPARCS) pipeline, which integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses. Project webpage: https://rory-weicheng.github.io/SPARCS/.

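2602.10326 2026-05-14 cs.CV cs.LG Updated version

Flow Matching with Uncertainty Quantification and Guidance

Juyeop Han, Lukas Lao Beyer, Sertac Karaman

Affiliations * MIT

AI summary Although sampling-based generative models such as flow matching have achieved remarkable success in image generation, the quality of generated samples can still be inconsistent or degraded. This paper proposes uncertainty-aware flow matching (UA-Flow), a lightweight method that estimates heteroscedastic uncertainty alongside the predicted velocity field and propagates that uncertainty through the flow dynamics to assess the reliability of each sample. Experiments show that UA-Flow's uncertainty signals correlate more strongly with sample fidelity, and that uncertainty-guided sampling further improves generation quality.

Details
English abstract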

Despite the remarkable success of sampling-based generative models such as flow matching, they can still produce samples of inconsistent or degraded quality. To assess sample reliability and generate higher-quality outputs, we propose uncertainty-aware flow matching (UA-Flow), a lightweight extension of flow matching that predicts the velocity field together with heteroscedastic uncertainty. UA-Flow estimates per-sample uncertainty by propagating velocity uncertainty through the flow dynamics. These uncertainty estimates act as a reliability signal for individual samples, and we further use them to steer generation via uncertainty-aware classifier guidance and classifier-free guidance. Experiments on image generation show that UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and that uncertainty-guided sampling further improves generation quality.

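2602.10032 2026-05-14 cs.CV cs.RO Updated version

Perception with Guarantees: Certified Pose Estimation via Reachability Analysis

Tobias Ladner, Yasser Shoukry, Matthias Althoff

Affiliations * Technical University of Munich, Germany; University of California, Irvine, USA

AI summary This paper studies how to obtain 3D pose estimates with rigorous guarantees from visual information in safety-critical systems. The authors propose a certified pose estimation method that relies only on a single monocular image and a known target geometry, and that formally bounds the pose using reachability analysis and formal neural network verification, so that the estimate remains safe even in the worst case. Experiments show that the method localizes efficiently and accurately in both synthetic and real-world settings, providing reliable guarantees for safety-critical applications.

Comments Accepted at Computer Aided Verification (CAV 2026)

Details
English abstract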

Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring safety of these agents often requires localizing the pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., guaranteeing safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting a certified pose estimation in 3D solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world experiments.

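2602.02977 2026-05-14 cs.CV cs.AI cs.LG Updated version

Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu

Affiliations * Agency for Defense Development; University of Michigan; POSTECH

AI summary This study addresses the difficulty vision-language models have in understanding long, detail-rich image captions, and proposes a hierarchical learning approach based on part-to-whole structure. Its core is the CAFT model, which aligns local text with image regions at intermediate representations and aligns the global image with the text at the final representation, capturing fine-grained visual information more accurately. The model achieves state-of-the-art performance on multiple long-text retrieval tasks and localizes textual semantics in image regions without explicit region-level annotation.

Comments Preprint

Details
English abstract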

Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose a hierarchical vision-language learning principle for understanding scenes as part-to-whole compositions: before forming a whole-scene representation, a model should uncover what semantic parts appear where in the image. To this end, we propose CAFT (Cross-domain Alignment of Forests and Trees), a vision-language model that jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation. Exploiting the organization of long captions, where local descriptions often correspond to scene parts, CAFT employs a fine-to-coarse image encoder and a part-whole text encoder to discover localized part semantics and progressively compose them into a global image-text representation. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that CAFT learns fine-grained representations that localize textual semantics in image regions without explicit region-level supervision.

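2601.22853 2026-05-14 cs.CV Updated version

Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification

Siyi Du, Xinzhe Luo, Declan P. O'Regan, Chen Qin

Affiliations * Department of Electrical and Electronic Engineering & I-X

AI summary This paper studies classification with incomplete modalities in multimodal deep learning and proposes DyMo, a framework that dynamically selects modalities at inference time, to address the information loss or added noise that results from discarding or recovering missing modalities in conventional methods. Using a new selection algorithm, DyMo adaptively identifies and fuses reliable recovered modalities at test time to maximize task-relevant multimodal information, and it comes with a corresponding reward function and network architecture. Experiments show that it outperforms existing methods across a variety of missing-data scenarios.

Comments 27 pages (including appendix), accepted by ICLR 2026

Details
English abstract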

Multimodal deep learning (MDL) has achieved remarkable success across various domains, yet its practical deployment is often hindered by incomplete multimodal data. Existing incomplete MDL methods either discard missing modalities, risking the loss of valuable task-relevant information, or recover them, potentially introducing irrelevant noise, leading to the discarding-imputation dilemma. To address this dilemma, in this paper, we propose DyMo, a new inference-time dynamic modality selection framework that adaptively identifies and fuses reliable recovered modalities, fully exploring task-relevant information beyond the conventional discard-or-impute paradigm. Central to DyMo is a novel selection algorithm that maximizes multimodal task-relevant information for each test sample. Since direct estimation of such information at test time is intractable due to the unknown data distribution, we theoretically establish a connection between information and the task loss, which we compute at inference time as a tractable proxy. Building on this, a novel principled reward function is proposed to guide modality selection. In addition, we design a flexible multimodal network architecture compatible with arbitrary modality combinations, alongside a tailored training strategy for robust representation learning. Extensive experiments on diverse natural and medical image datasets show that DyMo significantly outperforms state-of-the-art incomplete/dynamic MDL methods across various missing-data scenarios. Our code is available at https://github.com/siyi-wind/DyMo.

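2601.21892 2026-05-14 cs.CV cs.AI Updated version

Improving Classifier-Free Guidance of Flow Matching via Manifold Projection

Jian-Feng Cai, Haixia Liu, Zhengyi Su, Chao Wang

Affiliations * Department of Mathematics, The Hong Kong University of Science and Technology; IAS Center for AI for Scientific Discoveries, The Hong Kong University of Science and Technology; School of Mathematics and Statistics & Institute of Interdisciplinary Research for Mathematics and Applied Science & Hubei Key Laboratory of Engineering Modeling and Scientific Computing, Huazhong University of Science and Technology; Department of Statistics and Data Science, Southern University of Science and Technology

AI summary This paper studies how to improve classifier-free guidance (CFG) for flow matching models, offering a principled interpretation of CFG through the relationship between the velocity field in flow matching and the gradients of smoothed distance functions. On this basis, the authors reformulate CFG sampling as a homotopy optimization problem with a manifold constraint, implement the manifold projection via incremental gradient descent, and add Anderson acceleration to improve computational efficiency and stability. The method requires no additional training, improves generation quality, prompt alignment, and robustness to the guidance scale, and yields significant improvements on several large models.

Comments 26 pages, 14 figures

Details
English abstract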

Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
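
The "heuristic linear extrapolation" that the abstract criticizes is the standard CFG combination of conditional and unconditional predictions. The sketch below shows that baseline rule only, not the proposed manifold projection or Anderson acceleration; the function name and the toy vectors are illustrative.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    """Standard CFG: extrapolate from the unconditional prediction along the gap."""
    gap = v_cond - v_uncond  # the "prediction gap" between the two outputs
    return v_uncond + scale * gap

v_cond = np.array([1.0, 0.0])
v_uncond = np.array([0.5, 0.5])

guided_off = cfg_velocity(v_cond, v_uncond, scale=0.0)     # unconditional prediction
guided_plain = cfg_velocity(v_cond, v_uncond, scale=1.0)   # conditional prediction
guided_strong = cfg_velocity(v_cond, v_uncond, scale=3.0)  # extrapolates past v_cond
```

With a scale of 3 the guided velocity is [2.0, -1.0], which overshoots the conditional prediction by twice the gap. That linear dependence on the gap is why the result is sensitive to the guidance scale.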

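2601.18842 2026-05-14 cs.CR cs.AI cs.CV Updated version

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, Jiyan He

Affiliations * Beijing Normal University; Zhongguancun Academy; University of Science and Technology of China; A*STAR Zhongguancun Institution of Artificial Intelligence

AI summary As GUI agents increasingly rely on screenshots to perceive and operate digital environments, they may inadvertently expose sensitive information such as identities, accounts, and locations. To address the inability of existing privacy benchmarks to assess privacy risk in the context of task trajectories, this paper introduces GUIGuard-Bench, a benchmark of 241 real GUI-agent trajectories and 4,080 screenshots that supports privacy recognition, evaluation of planning fidelity under protected screenshots, and utility analysis of different protection strategies. The study finds that current models are fairly good at detecting private information but remain clearly deficient in fine-grained localization, category recognition, risk assessment, and task-necessity judgment.

Details
English abstract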

As GUI agents increasingly rely on screenshots to perceive and operate digital environments, they may inadvertently expose sensitive information such as identities, accounts, locations, and behavioral traces. While existing benchmarks primarily focus on task completion, grounding, or defenses against third-party attacks, current visual privacy datasets remain largely restricted to static natural images, limiting their ability to capture the contextual dependence and task relevance of privacy risks in GUI task trajectories. To bridge this gap, we introduce GUIGuard-Bench, a first-step benchmark for studying privacy-preserving GUI agents in trajectory-based GUI workflows. GUIGuard-Bench contains 241 real GUI-agent trajectories with 4,080 screenshots across Android and PC environments. Each screenshot is annotated at the region level with privacy bounding boxes, semantic privacy categories, risk levels, and whether the private information is necessary for completing the task. Built on these annotations, GUIGuard-Bench supports three complementary evaluations: privacy recognition, offline planning fidelity under protected screenshots, and the utility impact of different protection strategies. Our results show that current models can often detect whether a screenshot contains private information, but they struggle with fine-grained localization, category recognition, risk assessment, and task-necessity judgment. We also find that closed-source models, exemplified by Claude Sonnet 4.6, can maintain largely consistent planner semantics in Android environments after privacy protection is applied. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: https://futuresis.github.io/GUIGuard-page/

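2601.17326 2026-05-14 cs.CV cs.HC Updated version

SymbolSight: Minimizing Inter-Symbol Interference for Reading with Prosthetic Vision

Jasmine Lesner, Michael Beyeler

Affiliations * Department of Computer Science, University of California, Santa Barbara; Department of Psychological & Brain Sciences, University of California, Santa Barbara

AI summary This study addresses the difficulty of reading with vision restored by retinal prostheses and proposes SymbolSight, a computational framework that reduces inter-symbol interference by optimizing the design of visual symbols. It uses language bigram statistics to choose letter-to-symbol mappings that reduce confusion between adjacent letters. Simulations show that the approach substantially reduces predicted recognition errors in Arabic, Bulgarian, and English, demonstrating the potential of symbol design optimization for improving reading with low-bandwidth visual prostheses.

Comments Accepted to IEEE EMBC 2026. 7 pages, 6 figures, 2 tables

Details
English abstract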

Retinal prostheses restore limited visual perception, but low spatial resolution and temporal persistence make reading difficult. In sequential letter presentation, the afterimage of one symbol can interfere with perception of the next, leading to systematic recognition errors. Rather than relying on future hardware improvements, we investigate whether optimizing the visual symbols themselves can mitigate this temporal interference. We present SymbolSight, a computational framework that selects symbol-to-letter mappings to minimize confusion among frequently adjacent letters. Using simulated prosthetic vision (SPV) and a neural proxy observer, we estimate pairwise symbol confusability and optimize assignments using language-specific bigram statistics. Across simulations in Arabic, Bulgarian, and English, the resulting heterogeneous symbol sets reduced predicted confusion by a median factor of 22 relative to native alphabets. These results suggest that standard typography is poorly matched to serial, low-bandwidth prosthetic vision and demonstrate how computational modeling can narrow the design space of visual encodings, identifying high-potential candidates for future psychophysical and clinical evaluation rather than predicting present-day clinical reading performance directly.
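
The optimization described above, choosing a symbol for each letter so that frequently adjacent letters get dissimilar symbols, can be sketched as a small assignment problem. Everything below is a hypothetical toy: the confusability matrix, the bigram frequencies, and the brute-force search are invented for illustration, and the abstract does not say which optimizer or objective the authors actually use.

```python
from itertools import permutations

def expected_confusion(assignment, bigram_freq, confusability):
    """Bigram-weighted confusability between the symbols of adjacent letters."""
    return sum(freq * confusability[assignment[a]][assignment[b]]
               for (a, b), freq in bigram_freq.items())

def best_assignment(letters, symbols, bigram_freq, confusability):
    """Exhaustive search over letter-to-symbol mappings (toy sizes only)."""
    best = None
    for perm in permutations(symbols, len(letters)):
        assignment = dict(zip(letters, perm))
        cost = expected_confusion(assignment, bigram_freq, confusability)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best

# Hypothetical pairwise confusability between four candidate symbols.
confusability = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
letters = ["t", "h", "e"]
bigram_freq = {("t", "h"): 0.5, ("h", "e"): 0.4, ("e", "t"): 0.1}  # made-up values

naive = {"t": 0, "h": 1, "e": 2}
naive_cost = expected_confusion(naive, bigram_freq, confusability)
best_cost, best_map = best_assignment(letters, range(4), bigram_freq, confusability)
```

Exhaustive search is only feasible for a handful of letters; a full alphabet would need a heuristic or a dedicated assignment solver.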

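2601.14104 2026-05-14 cs.RO cs.CV Updated version

When Backdoors Meet Partial Observability: Attacking Real-World Reinforcement Learning

Tairan Huang, Qingqing Ye, Yulin Jin, Jiawei Lian, Yaxin Xiao, Yi Wang, Haibo Hu

Affiliations * Department of Electrical and Electronic Engineering

AI summary This paper studies backdoor attacks on reinforcement learning (RL) policies in partially observable real-world environments, noting that conventional attacks are limited when multimodal observations (such as vision and LiDAR) coexist. The authors propose a diffusion-guided backdoor attack framework (DGBA) that uses printable visual triggers to covertly manipulate RL policies without disturbing task performance. Experiments show that the attack outperforms existing methods on a physical robot platform and is both practical and stealthy.

Details
English abstract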

Backdoor attacks can cause reinforcement learning (RL) policies to behave normally under clean inputs while executing malicious behaviors when triggers are present. Existing RL backdoor attacks are primarily studied in simulation and often assume that attackers can reliably manipulate the observations driving policy decisions. This assumption becomes fragile in real-world deployment, where RL policies commonly rely on multimodal observations. Attackers can manipulate visual inputs through physical triggers, but auxiliary states such as LiDAR and odometry signals remain uncontrollable and vary across trajectories. We study this overlooked challenge and propose a diffusion-guided backdoor attack framework (DGBA) for real-world RL. DGBA uses small printable visual patches as triggers and learns a stochastic trigger distribution via conditional diffusion to maintain consistent attack activation under varying uncontrollable states. We further introduce an advantage-based poisoning strategy that injects triggers only at decision-critical training states. Experiments on a physical TurtleBot3 platform show that DGBA consistently outperforms prior RL backdoor attacks while preserving normal task performance. Demo videos and code are available in the supplementary material.

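2601.09636 2026-05-14 cs.AI cs.CV cs.HC cs.LG Updated version

PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records

Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie

Affiliations * Harbin Institute of Technology, Shenzhen; Shenzhen Loop Area Institute

AI summary This paper proposes PersonalAlign, a hierarchical implicit intent alignment task for personalized graphical user interface (GUI) agents, in which the agent uses long-term user records to resolve implicit preferences in vague instructions and proactively anticipate latent user actions. To support it, the authors build the AndroidIntent benchmark and design the Hierarchical Intent Memory Agent (HIM-Agent), which continuously updates and organizes users' personalized preferences and behavior patterns. Experiments show that HIM-Agent improves execution and proactive assistance by 15.7% and 7.3%, respectively.

Comments Accepted to ACL26 Main

Details
English abstract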

While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS. The results show that HIM-Agent significantly improves execution and proactive performance by 15.7% and 7.3%, respectively.

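2601.00417 2026-05-14 cs.LG cs.AI cs.CL cs.CV version update

Deep Delta Learning

Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu

Affiliations * Princeton University; University of California, Los Angeles

AI summary This paper proposes Deep Delta Learning (DDL), a residual update rule that improves the residual stream in Transformer models. Unlike conventional additive accumulation, DDL lets every layer selectively rewrite residual content: it reads the current state along a learned direction, compares it with a target value, and applies a gated correction along the same direction. Experiments on language models show that DDL manages the residual stream more effectively than conventional residual addition.

Comments Project Page: https://github.com/yifanzhang-pro/deep-delta-learning

Details
Abstract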

Transformer residual streams evolve by additive accumulation: each layer appends a feature update to a shared hidden state, but has no direct mechanism for replacing content that has become obsolete or conflicting. We introduce Deep Delta Learning (DDL), a residual update rule that preserves the identity path while giving every layer the ability to selectively rewrite residual content. DDL reads the current state along a learned direction, compares it with a learned target value, and writes back a gated correction along the same direction. When the gate is closed, the update reduces to the identity; when the gate is fully open, the selected component is overwritten, yielding a depth-wise delta-rule generalization of standard residual addition. We integrate DDL in decoder-only language models with both scalar and expanded residual states, while keeping attention and MLP sublayers at the original compute width. Controlled pretraining and downstream evaluations show that residual rewrite operations improve language modeling quality relative to pure additive accumulation introduced in ResNet, suggesting that a learned delta-rule update is an effective mechanism for managing Transformer residual streams.
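
The read, compare, and write-back steps described in the abstract can be sketched for a single vector-valued state as follows. This is a minimal illustration, not the authors' implementation: the names `x`, `k`, `v`, and `beta` are hypothetical, the direction is assumed to be normalized to unit length, and the gate is assumed to lie in [0, 1].

```python
import math


def delta_residual_update(x, k, v, beta):
    """Gated rewrite of the component of x along direction k.

    x: residual state (list of floats); k: learned direction;
    v: learned target value (scalar); beta: gate, assumed in [0, 1].
    All four names are hypothetical, not taken from the paper.
    """
    norm = math.sqrt(sum(ki * ki for ki in k))
    k_hat = [ki / norm for ki in k]                    # unit-norm direction (assumption)
    read = sum(ki * xi for ki, xi in zip(k_hat, x))    # read the state along k
    correction = beta * (v - read)                     # compare with target, then gate
    # Write back along the same direction; x itself is carried through unchanged.
    return [xi + correction * ki for xi, ki in zip(x, k_hat)]


state = [3.0, 4.0]
closed = delta_residual_update(state, [1.0, 0.0], 10.0, 0.0)  # gate closed
opened = delta_residual_update(state, [1.0, 0.0], 10.0, 1.0)  # gate fully open
```

With the gate closed the state is returned unchanged, and with the gate fully open the component along `k` is replaced by the target while the orthogonal component is untouched, matching the two limiting cases the abstract describes.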

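2512.01707 2026-05-14 cs.CV cs.AI cs.CL version update

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal

Affiliations * University of North Carolina, Chapel Hill; Adobe Research

AI summary StreamGaze is a new benchmark for evaluating how well multimodal large language models use human gaze signals for temporal reasoning and proactive understanding in streaming videos. By introducing gaze-guided past, present, and proactive reasoning tasks, the study comprehensively assesses a model's ability to process a video stream in real time and predict user intent. The authors build a question-answer generation pipeline that combines gaze trajectories with video content to produce spatio-temporally grounded QA pairs, and show that current models still fall well short in gaze-based temporal reasoning and proactive prediction.

Comments Accepted to CVPR 2026 with strong scores (5/5/5) but desk-rejected after the camera-ready due to not completing all reviewing duties

Details
Abstract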

Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions based only on past and currently observed frames. To build StreamGaze, we develop a gaze-video Question Answering (QA) generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, highlighting key limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze prompting strategies, reasoning behaviors, and task-specific failure modes, offering insights into current limitations and directions for future research. All data and code are publicly available to support continued research in gaze-guided streaming video understanding.

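2512.01242 2026-05-14 cs.CV cs.AI cs.CL version update

When Diffusion Breaks Constraints: Sequential Autoregressive Generation with RL and MCTS

Zirui Zhao, Boye Niu, Harold Soh, David Hsu, Wee Sun Lee

Affiliations * Salesforce AI Research; University of Sydney; National University of Singapore

AI summary This paper studies the limitations of diffusion models on constrained generation tasks, such as multi-robot path planning, molecular generation, and scene synthesis, which require satisfying strict geometric or physical constraints. To address this, the authors propose a sequential autoregressive generation approach based on reinforcement learning and Monte Carlo tree search, recasting constrained generation as a discrete sequence generation task so that complex constraints are satisfied more effectively. Experiments show that the approach outperforms conventional diffusion models in feasibility and task success rate, offering a new direction for this class of constrained-generation problems.

Details
Abstract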

Data-driven generative models excel in language and vision, but diffusion models often fail in constrained planning and design tasks, exhibiting severe constraint violations in engineering inverse design, molecular generation, multi-robot planning, and floorplan/scene synthesis even with projection or guidance. Such tasks combine hard-to-specify semantic goals with strict geometric or physical constraints (e.g., non-overlap, connectivity), yielding feasible solutions that lie on low-dimensional, small, and sometimes disconnected regions of the output space. This paper studies the failure mode through tangram generation from language, where seven fixed shapes must form a text-described silhouette while remaining connected and non-overlapping, and a simplified rectangle composition task with a learned bounding-box constraint. We find diffusion models struggle to satisfy constraints, consistent with difficulty generating samples near low-dimensional submanifolds. Motivated by locally feasible reparameterizations, we reformulate constrained generation as discrete autoregressive sequential generation. Reinforcement learning improves feasibility and task success, and Monte Carlo tree search quantifies the value of look-ahead when feasible regions shrink. Overall, the empirical, theoretical, and prior-work evidence points to a structural limitation of continuous density matching on this class of constrained-generation problems, and suggests sequential constraint-aware generation as a promising alternative.

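2511.17031 2026-05-14 cs.LG cs.CV cs.CY version update

Energy Scaling Laws for Diffusion Models: Quantifying Compute in Image Generation

Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki, Stefano Ermon

Affiliations * Stanford University; AXA AI Research

AI summary This paper studies energy scaling laws for diffusion models in image generation, aiming to quantify compute energy consumption across model configurations and hardware setups. The authors adapt Kaplan scaling laws to diffusion models, predicting GPU energy consumption from computational complexity (FLOPs), and verify experimentally that the denoising process is the main source of energy consumption. Extensive tests on several state-of-the-art diffusion models and GPU architectures show that the approach achieves high predictive accuracy within a single architecture and generalizes well across architectures, providing a basis for energy assessment in sustainable AI deployment.

Comments Accepted at ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2026

Details
Abstract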

The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution ($256^2$--$1024^2$), precision (fp16/fp32), step counts (10--50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures ($R^2 > 0.9$) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model--hardware combinations. These results validate the compute-bound nature of diffusion inference and establish energy consumption estimation as a necessary foundation for sustainable AI deployment planning and subsequent carbon footprint assessment.
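
One way to read "predict GPU energy consumption from FLOPs" is as a power-law fit in log-log space, scored by $R^2$. The sketch below assumes the form E = a * FLOPs^b, which is an illustrative choice and may differ from the paper's exact parameterization. The function name and the synthetic measurements are hypothetical.

```python
import math


def fit_power_law(flops, energy):
    """Least-squares fit of log(E) = log(a) + b * log(FLOPs).

    Returns (a, b, r2), where r2 is the coefficient of determination
    computed in log space. The power-law form is an assumption.
    """
    xs = [math.log(f) for f in flops]
    ys = [math.log(e) for e in energy]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope: scaling exponent
    log_a = mean_y - b * mean_x        # intercept: log of the prefactor
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic, noise-free measurements generated from a known law (hypothetical).
flops = [1e12, 1e13, 1e14, 1e15]
energy = [2e-10 * f ** 0.9 for f in flops]
a, b, r2 = fit_power_law(flops, energy)
```

On real measurements the fit would be done per GPU architecture, and the fitted exponent and $R^2$ would indicate how closely energy tracks compute.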

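2511.16868 2026-05-14 cs.CV q-bio.BM version update

The Joint Gromov Wasserstein Objective for Multiple Object Matching

Aryan Tajmir Riahi, Khanh Dao Duc

Affiliations * Department of Computer Science, University of British Columbia; Department of Mathematics, University of British Columbia

AI summary This paper proposes the Joint Gromov-Wasserstein (JGW) objective for matching multiple objects, overcoming the restriction of the conventional Gromov-Wasserstein distance to pairwise matching between single objects. By extending the original framework, the method matches collections of objects simultaneously and provides a non-negative dissimilarity measure with point-sampling convergence. Benchmarks show that it outperforms other variants in accuracy and computational efficiency, and experiments on synthetic and real-world datasets show strong results on multi-object matching tasks such as geometric shapes and biomolecular complexes.

Details
Abstract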

The Gromov-Wasserstein (GW) distance serves as a powerful tool for matching objects in metric spaces. However, its traditional formulation is constrained to pairwise matching between single objects, limiting its utility in scenarios and applications requiring multiple-to-one or multiple-to-multiple object matching. In this paper, we introduce the Joint Gromov-Wasserstein (JGW) objective and extend the original framework of GW to enable simultaneous matching between collections of objects. Our formulation provides a non-negative dissimilarity measure that identifies partially isomorphic distributions of mm-spaces, with point sampling convergence. We also show that the objective can be formulated and solved for point cloud representations by adapting traditional algorithms in Optimal Transport, including entropic regularization. Our benchmarking with other variants of GW for partial matching indicates superior performance in accuracy and computational efficiency of our method, while experiments on both synthetic and real-world datasets show its effectiveness for multiple shape matching, including geometric shapes and biomolecular complexes, suggesting promising applications for solving complex matching problems across diverse domains, including computer graphics and atomic model building for structural biology.
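
For orientation, the classical pairwise GW objective that the paper extends can be evaluated for a fixed coupling as sketched below. This shows only the standard squared-loss GW cost, not the JGW objective itself, whose exact form the abstract does not give. The function name and the toy inputs are hypothetical.

```python
def gw_cost(dist_x, dist_y, coupling):
    """Squared-loss Gromov-Wasserstein cost of a given coupling.

    dist_x: n x n pairwise distances within the first object;
    dist_y: m x m pairwise distances within the second object;
    coupling: n x m transport plan whose entries sum to 1.
    """
    n, m = len(dist_x), len(dist_y)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if coupling[i][j] == 0.0:
                continue  # zero-mass pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    diff = dist_x[i][k] - dist_y[j][l]
                    total += diff * diff * coupling[i][j] * coupling[k][l]
    return total


# Two-point objects: a unit segment and a segment of length 2 (toy data).
segment_a = [[0.0, 1.0], [1.0, 0.0]]
segment_b = [[0.0, 2.0], [2.0, 0.0]]
identity_plan = [[0.5, 0.0], [0.0, 0.5]]
same_cost = gw_cost(segment_a, segment_a, identity_plan)
scaled_cost = gw_cost(segment_a, segment_b, identity_plan)
```

The GW distance minimizes this cost over all couplings with the prescribed marginals. Solvers such as the entropic-regularized scheme mentioned in the abstract approximate that minimization rather than evaluating a single fixed plan.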

2510.14244 2026-05-14 eess.IV cs.AI cs.CV version update

Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

Arnaud Judge, Nicolas Duchateau, Thierry Judge, Roman A. Sandler, Joseph Z. Sokol, Christian Desrosiers, Olivier Bernard, Pierre-Marc Jodoin

Affiliations * Department of Computer Science, University of Sherbrooke; INSA, Universite Claude Bernard Lyon 1, CNRS UMR 5220, Inserm U1206, CREATIS; Dep. of Software and Information Technology Engineering, École de technologie supérieure; Institut Universitaire de France (IUF)

AI summary This study addresses domain adaptation in echocardiography segmentation and proposes RL4Seg3D, an unsupervised domain adaptation framework based on reinforcement learning. By introducing novel reward functions and a fusion scheme, the method improves the precision of key anatomical landmarks in its segmentations and maintains good temporal consistency while processing full-sized video inputs. Experiments show that, without any target-domain labels, the method outperforms standard domain adaptation techniques and provides a robust uncertainty estimate that can further improve segmentation performance.

Comments 13 pages, accepted for publication in IEEE TMI

Details
Abstract

Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at https://github.com/arnaudjudge/RL4Seg3D.

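2510.11303 2026-05-14 cs.CV version update

sketch2symm: Symmetry-aware sketch-to-shape generation via semantic bridging

Yan Zhou, Mingji Li, Xiantao Zeng, Jie Lin, Yuexia Zhou

Affiliations * School of Electronic Information Engineering, Foshan University, Guangdong, China; School of Computer Science and Artificial Intelligence, Foshan University, Guangdong, China

AI summary Sketch2Symm is a two-stage sketch-to-3D-shape generation method based on semantic bridging and symmetry constraints, designed to address the difficulty of 3D reconstruction from abstract, information-sparse sketch inputs. It enriches the semantic representation of sketches through sketch-to-image translation and introduces symmetry priors to exploit the structural regularity of everyday objects, producing geometrically consistent 3D shapes. Experiments show that it outperforms existing methods on mainstream sketch datasets.

Details
Abstract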

Sketch-based 3D reconstruction remains a challenging task due to the abstract and sparse nature of sketch inputs, which often lack sufficient semantic and geometric information. To address this, we propose Sketch2Symm, a two-stage generation method that produces geometrically consistent 3D shapes from sketches. Our approach introduces semantic bridging via sketch-to-image translation to enrich sparse sketch representations, and incorporates symmetry constraints as geometric priors to leverage the structural regularity commonly found in everyday objects. Experiments on mainstream sketch datasets demonstrate that our method achieves superior performance compared to existing sketch-based reconstruction methods in terms of Chamfer Distance, Earth Mover's Distance, and F-Score, verifying the effectiveness of the proposed semantic bridging and symmetry-aware design.
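
Of the metrics named above, Chamfer Distance is the simplest to state. The sketch below uses one common convention (mean squared nearest-neighbor distance, summed over both directions). Conventions vary across papers, and the one used in this work is not specified in the abstract, so treat the normalization as an assumption.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two point sets.

    Uses mean squared nearest-neighbor distance in each direction;
    this normalization is an assumed convention.
    """
    a_to_b = sum(min(squared_distance(p, q) for q in points_b)
                 for p in points_a) / len(points_a)
    b_to_a = sum(min(squared_distance(q, p) for p in points_a)
                 for q in points_b) / len(points_b)
    return a_to_b + b_to_a


# Toy point clouds (hypothetical data).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
cd = chamfer_distance(predicted, reference)
```

This brute-force version is quadratic in the number of points. Evaluations on dense shapes usually rely on a spatial index or a batched GPU implementation instead.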

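2510.03548 2026-05-14 cs.CV cs.AI version update

Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing

Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm

Affiliations * Drexel University; NVIDIA

AI summary This paper studies impersonation attacks on AI-based videoconferencing systems, in which an attacker manipulates the transmitted latent representation to hijack a user's likeness in real time. To address this, the authors propose a new defense that exploits the biometric information inherent in the latent: a pose-conditioned contrastive encoder that isolates identity features while cancelling the influence of pose and expression, detecting impersonation without relying on the reconstructed video. Experiments show that the method achieves superior detection performance across multiple generative models, runs in real time, and generalizes well.

Details
Abstract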

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.
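
The "simple cosine test" on the identity embedding might look like the sketch below. The embeddings would come from the paper's contrastive encoder, which is not reproduced here. The function names and the threshold value of 0.5 are hypothetical placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_identity_swap(enrolled, observed, threshold=0.5):
    """Flag a frame when the observed identity embedding drifts from the enrolled one.

    threshold is an assumed value; in practice it would be tuned on held-out data.
    """
    return cosine_similarity(enrolled, observed) < threshold


same_person = is_identity_swap([1.0, 0.0], [2.0, 0.0])   # same direction
other_person = is_identity_swap([1.0, 0.0], [0.0, 1.0])  # orthogonal embeddings
```

Because the test is a single dot product per frame, it adds very little cost on top of the encoder, which is consistent with the real-time operation the abstract reports.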

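2510.01502 2026-05-14 q-bio.NC cs.CV cs.LG version update

Behavioral Geometric Supervision Aligns Video Foundation Models with Human Social Perception

Kathy Garcia, Leyla Isik

Affiliations * Department of Cognitive Science; Department of Biomedical Engineering; Johns Hopkins University

AI summary Current video foundation models fall short in capturing how humans organize information in dynamic social scenes and struggle to predict human similarity judgments of social video clips. This paper proposes behavioral geometric supervision (BGS), which constrains the local and global geometry of the embedding space to match the similarity relations among videos, thereby improving model performance. Experiments show that the method substantially improves performance on human similarity judgment tasks and lets models capture social-affective attributes that caption-based language embeddings do not, yielding video understanding closer to human social perception.

Comments v2: Major revision. Retitled; expanded from TimeSformer alone to four backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, CLIP), with V-JEPA 2.1 nearly tripling pretrained performance. Adds zero-shot PHASE transfer, attention-rollout analysis, and a language-distillation control. Data (OOO sim. judgments) & core hybrid triplet+RSA LoRA method unchanged from v1. Prepared for NeurIPS 2026 submission

Details
Abstract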

Current video foundation models, including the strongest self-supervised models such as V-JEPA2, fail to capture how humans organize social information in dynamic scenes. For example, across a range of diverse vision models tested, none were able to predict human similarity judgments of social video clips as well as a sentence embedding model of the caption text (MPNet). We show this gap in vision model performance can be closed by a compact behavioral supervisory signal. We introduce behavioral geometric supervision (BGS): a hybrid objective that constrains local and global pairwise embedding geometry to match the relational similarity structure across videos. We apply this method using a new human similarity dataset, containing 49,484 odd-one-out judgments from 250 naturalistic social video clips, and low-rank adaptation across four ViT backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, and CLIP). We find that one of the best fine-tuned models, V-JEPA 2.1, nearly triples in performance compared to the pre-trained baseline and reaches close to the noise ceiling, exceeding the strongest sentence-embedding baseline. In addition, fine-tuned models (i) capture unique variance in human judgments that caption-based language embeddings do not, (ii) develop interpretable social-affective attributes (valence, arousal, and dominance) despite never being trained on any of these attributes, (iii) zero-shot transfer to a separate dataset of out-of-distribution abstract social interactions, and (iv) shift spatial attention from scene context to socially informative regions (faces, gaze, and interacting bodies). A matched language-distillation control fails to reproduce these gains, ruling out caption transfer as the mechanism. Our results show how a modest amount of human behavioral data can steer video models toward human-like social visual understanding.
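
To make the odd-one-out evaluation concrete, the sketch below scores a set of embeddings against human triplet judgments: the model's predicted odd item is the one left out of the most similar pair. Dot-product similarity, the function names, and the toy clips are assumptions for illustration, and the paper's exact scoring may differ.

```python
def dot(a, b):
    """Dot-product similarity between two embeddings (an assumed choice)."""
    return sum(x * y for x, y in zip(a, b))


def predict_odd_one_out(embeddings, triplet):
    """Return the item of the triplet that is left out of the most similar pair."""
    i, j, k = triplet
    # Similarity of each pair, keyed by the item excluded from that pair.
    left_out = {
        k: dot(embeddings[i], embeddings[j]),
        j: dot(embeddings[i], embeddings[k]),
        i: dot(embeddings[j], embeddings[k]),
    }
    return max(left_out, key=left_out.get)


def odd_one_out_accuracy(embeddings, judgments):
    """Fraction of human judgments (triplet, chosen_odd_item) the model matches."""
    hits = sum(predict_odd_one_out(embeddings, t) == odd for t, odd in judgments)
    return hits / len(judgments)


# Toy embeddings for three clips and two human judgments (hypothetical data).
clips = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
judgments = [(("a", "b", "c"), "c"), (("a", "b", "c"), "a")]
accuracy = odd_one_out_accuracy(clips, judgments)
```

Human judgments on the same triplet can disagree, as in the toy data, which is why results of this kind are reported relative to a noise ceiling rather than to 100%.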

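2510.00929 2026-05-14 cs.CV version update

Equivariant Splitting: Self-supervised learning from incomplete data

Victor Sechaud, Jérémy Scanvic, Quentin Barthélemy, Patrice Abry, Julián Tachella

Affiliations * LPENSL, CNRS, ENS de Lyon, France; Prysm, Lyon, France

AI summary This paper proposes equivariant splitting, a new self-supervised learning method for incomplete data, targeting reconstruction problems where only a single incomplete observation model is available. It introduces a notion of equivariance for reconstruction networks and combines it with self-supervised splitting losses to obtain unbiased estimates of the supervised loss. Experiments show strong performance on image inpainting, accelerated magnetic resonance imaging, sparse-view CT, and compressive sensing, particularly in settings with highly rank-deficient forward models.

Details
Abstract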

Self-supervised learning for inverse problems makes it possible to train a reconstruction network from noisy and/or incomplete data alone. These methods have the potential to enable learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in unbiased estimates of the supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, sparse-view computed tomography, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models. The code is available at https://github.com/vsechaud/Equivariant-Splitting

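2509.23056 2026-05-14 cs.CV cs.LG version update

FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection

Ben Liang, Hongguang Wei, Yuan Liu, Bingwen Qiu, Yihong Wang, Xiubao Sui, Qian Chen

Affiliations * School of Electronic Engineering and Optoelectronic Technology, Nanjing University of Science and Technology

AI summary This paper proposes FMC-DETR, a frequency-decoupled fusion framework for aerial-view object detection in remote sensing imagery, addressing the difficulty of detecting tiny objects in high-resolution images caused by weak visual cues and insufficient global context modeling. The method introduces the Wavelet Kolmogorov-Arnold Transformer (WeKat) as its backbone, combining wavelet transforms and Kolmogorov-Arnold networks to enhance global low-frequency structure perception in shallow features and nonlinear modeling of multi-scale dependencies. It also designs the Multi-Domain Feature Coordination (MDFC) module and the Compact Partial Fusion (CPF) module, which refine cross-scale feature fusion and improve small-object detection, respectively. Experiments show that FMC-DETR achieves state-of-the-art detection results on multiple remote sensing benchmarks.

Details
Abstract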

Remote sensing object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based rescue. Detecting tiny objects in high-resolution aerial imagery remains challenging due to weak visual cues and insufficient global context modeling in complex scenes. Existing methods often suffer from delayed contextual interaction and limited nonlinear reasoning, which restrict their ability to effectively refine shallow representations and ultimately lead to suboptimal performance. To address these challenges, we propose FMC-DETR, a frequency-decoupled fusion framework for aerial-view object detection. First, we propose the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which employs cascaded wavelet transforms to enhance global low-frequency structure perception in shallow features while preserving fine-grained details, and further leverages Kolmogorov-Arnold networks for adaptive nonlinear modeling of multi-scale dependencies. Second, we introduce the Multi-Domain Feature Coordination (MDFC) module, which refines cross-scale fused representations through partial-channel spatial, spectral, and structural coordination, thereby strengthening small-object-related feature responses in cluttered scenes. Finally, we design the Compact Partial Fusion (CPF) module, which performs compact multi-branch aggregation with progressive partial refinement to improve feature diversity and multi-scale interaction while preserving stable information flow and reducing redundant perturbation. Extensive experiments across multiple remote sensing benchmarks demonstrate that FMC-DETR achieves state-of-the-art performance and significantly outperforms the baseline detector. Code is available at https://github.com/bloomingvision/FMC-DETR.

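2509.15642 2026-05-14 cs.CV version update

UNIV: Unified Foundation Model for Infrared and Visible Modalities

Fangyuan Mao, Shuo Wang, Jilin Mei, Shun Lu, Chen Min, Fuyang Liu, Xiaokun Feng, Meiqi Wu, Yu Hu

Affiliations * Research Center for Intelligent Computing Systems, CAS ICT; University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

AI summary This paper proposes UNIV, a unified foundation model for infrared and visible modalities that addresses modal bias in cross-modal perception. Its core is Patch Cross-modal Contrastive Learning (PCCL), which uses self-supervised learning to build a unified cross-modal feature space, improving semantic alignment and class separability. The study also builds MVIP, described as the most comprehensive visible-infrared dataset to date, and validates UNIV's performance on multiple tasks.

Details
Abstract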

Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.

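2509.13858 2026-05-14 cs.CV version update

EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics

Qianxin Xia, Jiawei Du, Guoming Lu, Zhiyong Shu, Jielei Wang

Affiliations * University of Electronic Science and Technology of China; Centre for Frontier AI Research, Agency for Science, Technology and Research

AI summary This paper proposes EDITS, a new framework that improves dataset distillation by exploiting the implicit textual semantics in images. The method fuses external texts generated by a vision-language model with image features to build a semantically clustered buffer, selects representative samples through a local semantic awareness module to construct image and text prototypes, and finally uses a diffusion model to generate a high-quality synthetic dataset. Experiments confirm the effectiveness of the method.

Details
Abstract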

Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Finally, the Dual Prototype Guidance strategy generates the synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at https://github.com/einsteinxia/EDITS.

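2509.08461 2026-05-14 cs.LG cs.AI cs.CV hep-ex version update

Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

Affiliations * Department of Computer Science, University of California, Irvine, CA, USA; Department of Physics, University of California, Irvine, CA, USA

AI summary This paper studies the application of vision-language models (VLMs) to neutrino event classification in high-energy physics experiments, proposing an approach based on a fine-tuned LLaMA 3.2 VLM and comparing it with a convolutional neural network (CNN) and a Vision Transformer (ViT). Experiments show that transformer-based models outperform conventional CNNs in classification accuracy and robustness, while the VLM, by incorporating textual or semantic information, further improves the interpretability of predictions and supports reasoning. The study demonstrates the potential of VLMs as a general-purpose framework for physics event classification and offers new directions for multimodal reasoning in neutrino physics experiments.

Comments Accepted for publication in Communications Physics (Nature Portfolio)

Details
Abstract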

Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the application of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMA 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in major neutrino experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events, and also against a Vision Transformer (ViT-h/14), which is the same architecture used in the VLM's vision encoder. Our evaluation considers both classification performance and interpretability of the model predictions, comparing a VLM with a vision-only transformer (ViT) and a convolutional neural network (CNN) baseline. We find that transformer-based architectures outperform conventional CNNs in classification accuracy and robustness, with the VLM providing additional flexibility through the integration of auxiliary textual or semantic information and enabling more interpretable, reasoning-based predictions. These results highlight the potential of large transformer models, particularly vision-language models, as general-purpose backbones for physics event classification, combining strong performance, robustness, and interpretability, and opening new avenues for multimodal reasoning in experimental neutrino physics.

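2509.00626 2026-05-14 cs.CV cs.AI version update

Towards Methane Detection Onboard Satellites

Maggie Chen, Hala Lamdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini

Affiliations * University of Oxford; Delft University of Technology; Universitat de València; University of Surrey; European Space Agency (ESA)

AI summary This paper studies how machine learning can be used onboard satellites for rapid methane detection, supporting a timely response to climate change. It proposes a new approach that skips conventional image preprocessing steps and trains directly on unorthorectified hyperspectral data, achieving detection performance comparable to models trained on orthorectified data. The study also shows that models trained on orthorectified data outperform the conventional matched filter baseline, and releases datasets and code as resources for related research.

Details
Abstract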

Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using unorthorectified data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS, along with code at https://github.com/spaceml-org/plume-hunter.

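2508.19651 2026-05-14 cs.CV version update

Scalable Object Detection in the Car Interior With Vision Foundation Models

Sebastian Schmidt, Bálint Mészáros, Ahmet Firintepe, Stephan Günnemann

Affiliations * Technical University of Munich, School of Computation, Information and Technology; BMW Group

AI summary This paper studies how to detect and localize objects in the car interior efficiently in order to improve the response quality of personal assistants. To cope with the limited compute of on-board systems, the authors propose ODAL, a distributed detection framework based on vision foundation models that splits computation between the vehicle and the cloud. The study also introduces the ODALbench evaluation metric and fine-tunes the lightweight LLaVA 1.5 7B model, whose ODAL score improves by 71% over its baseline and surpasses that of GPT-4o.

Details
Abstract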

AI tasks in the car interior, such as identifying and localizing externally introduced objects, are crucial for the response quality of personal assistants. However, computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board and cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework's potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight model's performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL$_{SNR}$ three times higher than GPT-4o.

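2508.07642 2026-05-14 cs.AI cs.CL cs.CV Updated version

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi

Affiliations * Michigan State University; ESAT-PSI, KU Leuven

AI summary Vision-and-Language Navigation (VLN) requires an agent to interpret natural-language instructions and navigate complex 3D environments, and current methods still struggle in unseen scenarios that demand complex spatial and temporal reasoning. This paper proposes SkillNav, a framework that decomposes navigation into a set of interpretable atomic skills, each handled by a specialized agent, and thereby introduces structured skill-based reasoning. The authors also build a synthetic data generation pipeline that supports skill training without manual annotation, and design a training-free VLM-based router that dynamically selects the most suitable agent at each time step, improving generalization to novel instruction styles and unseen environments.

Comments Accepted by ACL 2026 Main Conference

Details
English abstract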

Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization to the GSA-R2R, a benchmark with novel instruction styles and unseen environments.

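2507.01908 2026-05-14 cs.CV Updated version

Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang

Affiliations * Tencent Youtu Lab; Sichuan University; University of the Chinese Academy of Sciences; Fudan University; Zhejiang University; National University of Singapore; Nanyang Technological University

AI summary This paper proposes a visual-reasoning approach to hypothetical instruction-based image editing, targeting the weakness of existing editing methods on complex implicit instructions. It introduces the Reason50K dataset and the ReasonBrain framework. Reason50K contains over 50K samples spanning four reasoning scenarios: physical, temporal, causal, and story reasoning. ReasonBrain combines a multimodal large language model with a diffusion model and uses a Fine-grained Reasoning Cue Extraction module and a Cross-Modal Enhancer to understand and execute implicit instructions. Experiments show that the method performs well on reasoning scenarios and generalizes zero-shot to conventional editing tasks.

Comments Accepted by ICML 2026

Details
English abstract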

Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.

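2507.00990 2026-05-14 cs.RO cs.AI cs.CV Updated version

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li

Affiliations * UIUC; UC Irvine; Columbia University

AI summary This paper presents RIGVid, a system that lets robots perform complex manipulation tasks such as pouring, wiping, and mixing by imitating AI-generated videos, without any physical demonstrations or robot-specific training. Given a language command and an initial scene image, the system generates candidate demonstration videos, uses a vision-language model to filter out those that do not follow the command, extracts object trajectories with 6D pose tracking, and retargets them to the robot. Experiments show that filtered generated videos are as effective as real demonstrations, that performance improves with generation quality, and that the approach outperforms more compact alternatives such as keypoint prediction.

Comments In ICLR 2026. Website: https://rigvid-robot.github.io/

Details
English abstract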

This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks--such as pouring, wiping, and mixing--purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.
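
The retargeting step described above, mapping a tracked 6D object trajectory onto the robot, can be illustrated with a minimal sketch. It assumes the object is rigidly held by the gripper after grasping, so the gripper pose expressed in the object frame stays constant; this assumption, the function name `retarget_to_gripper`, and the 4x4 homogeneous-matrix convention are illustrative choices, not details confirmed by the abstract.

```python
import numpy as np

def retarget_to_gripper(object_poses, gripper_pose_at_grasp):
    """Turn an object trajectory into a gripper trajectory (illustrative sketch).

    object_poses: sequence of 4x4 world-frame object poses, one per video frame
    gripper_pose_at_grasp: 4x4 world-frame gripper pose at the first frame
    Assumes a rigid grasp: the gripper pose in the object frame never changes.
    """
    # Gripper pose expressed in the object's frame at the moment of grasping.
    grasp_in_object = np.linalg.inv(object_poses[0]) @ gripper_pose_at_grasp
    # Carry that fixed offset along the tracked object trajectory.
    return [pose @ grasp_in_object for pose in object_poses]
```

Because the mapping depends only on the object's motion and one grasp pose, it does not refer to any particular robot arm, which is one way to read "embodiment-agnostic".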

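2506.09522 2026-05-14 cs.CV cs.AI cs.CL Updated version

Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding

Beomsik Cho, Jaehyung Kim

Affiliations * Yonsei University

AI summary This study examines the role of visual information in the decoding process of large vision-language models (LVLMs). It finds that vision tokens carry meaningful visual information even when hallucinations occur, and that their semantics can be made explicit in the text space. Building on this, the authors propose ReVisiT, a training-free decoding method that projects vision tokens into the text token distribution and dynamically selects the most relevant vision token at each decoding step to guide generation. Experiments show that ReVisiT is competitive with or better than state-of-the-art decoding baselines on five benchmarks while reducing computational cost.

Comments ACL 2026 Main Conference (Oral). 30 pages, 10 figures. Code: https://github.com/bscho333/ReVisiT

Details
English abstract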

Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that guides text generation in LVLMs by Referencing Vision Tokens. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization. Then, ReVisiT uses its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.

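2505.22445 2026-05-14 cs.CV cs.AI Updated version

NFR: Neural Feature-Guided Non-Rigid Shape Registration

Zhangquan Chen, Puhua Jiang, Mingze Sun, Ruqi Huang

Affiliations * Tsinghua Shenzhen International Graduate School

AI summary This paper proposes a neural feature-guided framework for non-rigid shape registration that handles significant non-rigid deformation and partiality between input shapes without requiring correspondence annotations. The method feeds neural features extracted by deep shape-matching networks into an iterative geometric registration pipeline, which makes correspondences more accurate and semantically meaningful; dynamically updating the correspondences and filtering them with a consistency prior adds robustness. Experiments show that, even with only a few dozen training shapes, the method achieves state-of-the-art results on several non-rigid point cloud matching and partial shape matching benchmarks and handles complex deformations that traditional methods cannot.

Comments 18 pages, 16 figures. arXiv admin note: substantial text overlap with arXiv:2311.04494

Details
English abstract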

In this paper, we propose a novel learning-based framework for 3D shape registration that overcomes the challenges of significant non-rigid deformation and partiality among input shapes and, remarkably, requires no correspondence annotation during training. Our key insight is to incorporate neural features learned by deep learning-based shape matching networks into an iterative, geometric shape registration pipeline. The advantage of our approach is two-fold. On one hand, neural features provide more accurate and semantically meaningful correspondence estimation than spatial features (e.g., coordinates), which is critical in the presence of large non-rigid deformations. On the other hand, the correspondences are dynamically updated according to the intermediate registrations and filtered by a consistency prior, which markedly robustifies the overall pipeline. Empirical results show that, with as few as dozens of training shapes of limited variability, our pipeline not only achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching and partial shape matching across varying settings, but also delivers high-quality correspondences between unseen challenging shape pairs that undergo both significant extrinsic and intrinsic deformations, in which case neither traditional registration methods nor intrinsic methods work. Our code is available at https://github.com/rqhuang88/NFR.

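2505.05376 2026-05-14 cs.CV Updated version

GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans

Rachmadio Noval Lazuardi, Artem Sevastopolsky, Egor Zakharov, Matthias Niessner, Vanessa Sklyarova

Affiliations * Technical University of Munich; ETH Zürich; Max Planck Institute for Intelligent Systems

AI summary This paper proposes a method that reconstructs hair strands directly from colorless 3D scans using multi-modal hair orientation extraction. It applies a neural line detector to renderings of the scan to pick up surface features and combines this with a diffusion prior, so that both simple and intricate hairstyles can be reconstructed from geometry alone. The authors also compile Strands400, a dataset of hair strands reconstructed from scans of 400 subjects, as a resource for training generative models and for computer graphics applications.

Comments 15 pages, 9 figures, 1 table

Details
English abstract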

We propose a novel method that reconstructs hair strands directly from colorless 3D scans by leveraging multi-modal hair orientation extraction. Hair strand reconstruction is a fundamental problem in computer vision and graphics, essential for high-fidelity digital avatar synthesis, animation, and AR/VR applications. However, accurately recovering hair strands from raw scan data remains challenging due to the complex and fine-grained structure of human hair, and none of the existing methods operate on colorless 3D geometry alone. To address this gap, our method directly identifies sharp surface features on the scan and estimates strand orientation using a neural 2D line detector applied to the renderings of scan shading. Additionally, we incorporate a diffusion prior trained on a diverse set of synthetic hair scans, refined with a noise schedule, and adapted to the reconstructed contents via a scan-specific text prompt. We demonstrate that this combination of supervision signals enables accurate reconstruction of both simple and intricate hairstyles from geometry alone. By enabling strand extraction from 3D scans, we compile Strands400, the largest publicly available dataset of hair strands with detailed surface geometry extracted from real-world data, comprising reconstructions from 400 subjects' scans. Strands400 enables training data-driven generative models for downstream tasks such as image-to-strands and text-to-strands. Moreover, our method applies to designer mesh assets, supporting a practical CG workflow where artists model hair as meshes and need strand-level representations for simulation and rendering. All code and data will be released for research purposes on https://seva100.github.io/GeomHair/.

2504.14129 2026-05-14 cs.CV Updated version

PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution

Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao

Affiliations * Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences); Shandong University of Science and Technology; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences); School of Electronics and Information Engineering, Harbin Institute of Technology (Shenzhen); School of Computer Science and Technology, Shandong University

AI summary This paper proposes PVLM, a parsing-aware vision-language model with dynamic contrastive learning, for zero-shot deepfake attribution (ZS-DFA), with the goal of tracing forged faces back to unseen advanced generators such as diffusion models. The method brings in face parsing information to capture how differently generators preserve source face attributes, which improves the granularity and generalization of attribution. The authors also build a new zero-shot deepfake attribution benchmark and design a contrastive center loss that further improves traceability to unknown generators; experiments show the method outperforms the state of the art on the benchmark.

Comments Accepted to IEEE Transactions on Dependable and Secure Computing 2026

Details
English abstract

The challenge of tracing the source of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in the vision modality, and other modalities such as text and face parsing are not fully explored. Besides, they tend to fail to assess, in a fine-grained manner, how well deepfake attributors generalize to unseen advanced generators such as diffusion models. In this paper, we propose a novel parsing-aware vision language model with a dynamic contrastive learning (PVLM) method for zero-shot deepfake attribution (ZS-DFA), which facilitates effective and fine-grained traceability to unseen advanced generators. Specifically, we construct a novel and fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors on unseen advanced generators such as diffusion models. Besides, we propose an innovative PVLM attributor based on the vision-language model to capture general and diverse attribution features. We are motivated by the observation that the preservation of source face attributes in facial images generated by GAN and diffusion models varies significantly. We propose to exploit these inherent differences in facial attribute preservation to capture face parsing-aware forgery representations. Therefore, we devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results show that our model exceeds the state-of-the-art on the ZS-DFA benchmark via various protocol evaluations.
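
The "pull relevant generators closer and push irrelevant ones away" idea can be illustrated with a generic contrastive-center-style loss. The sketch below is a common textbook form (squared distance to the own-class center divided by the summed squared distances to the other centers); it is an assumption for illustration and not the paper's exact loss, and the names `contrastive_center_loss` and `delta` are hypothetical.

```python
import numpy as np

def contrastive_center_loss(features, labels, centers, delta=1.0):
    """Generic contrastive-center-style loss (illustrative, not the paper's loss).

    features: (n, d) embeddings of forged faces
    labels:   (n,) integer generator index per sample
    centers:  (k, d) one center per generator
    delta:    constant that keeps the denominator away from zero
    """
    # Squared distance from every feature to every center: shape (n, k).
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    own = dists[np.arange(len(labels)), labels]  # distance to own center
    others = dists.sum(axis=1) - own             # summed distance to other centers
    # Small when a sample is close to its own center and far from the rest.
    return float(np.mean(0.5 * own / (others + delta)))
```

Minimizing this quantity shrinks the numerator (pull toward the matching center) while growing the denominator (push away from non-matching centers).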

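2503.19719 2026-05-14 cs.LG cs.AI cs.CV Updated version

On What Depends the Robustness of Multi-source Models to Missing Data in Earth Observation?

Francisco Mena, Diego Arenas, Miro Miranda, Andreas Dengel

Affiliations * University of Kaiserslautern-Landau (RPTU); German Research Center for Artificial Intelligence (DFKI)

AI summary This paper studies what determines the robustness of multi-source models to missing data in Earth observation. By evaluating the predictive performance of six state-of-the-art multi-source models when a single data source is missing or only a single source is available, it finds that model effectiveness is closely tied to the nature of the task, the complementarity among data sources, and the model design. It also finds that removing certain data sources can improve predictive performance, which challenges the assumption that more data is always better and prompts reflection on model complexity and the necessity of all collected sources.

Comments Accepted at IEEE International Geoscience and Remote Sensing Symposium 2025

Journal ref 2025 IEEE International Geoscience and Remote Sensing Symposium

Details
English abstract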

In recent years, robust multi-source models have emerged in the Earth Observation (EO) field. These models leverage data from diverse sources to improve predictive accuracy when some data is missing. Despite these advancements, the factors influencing the varying effectiveness of such models remain poorly understood. In this study, we evaluate the predictive performance of six state-of-the-art multi-source models in scenarios where either a single data source is missing or only a single source is available. Our analysis reveals that the efficacy of these models is intricately tied to the nature of the task, the complementarity among data sources, and the model design. Surprisingly, we observe instances where the removal of certain data sources leads to improved predictive performance, challenging the assumption that incorporating all available data is always beneficial. These findings prompt critical reflections on model complexity and the necessity of all collected data sources, potentially paving the way for more streamlined approaches in EO applications.

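2412.06341 2026-05-14 cs.CV cs.AI Updated version

Visual Accommodation: Rethinking Image Scale as a Learnable Variable for Object Detection

Daeun Seo, Hoeseok Yang, Sihyeong Park, Hyungshin Kim

Affiliations * Chungnam National University; Santa Clara University; Korea Electronics Technology Institute

AI summary This paper proposes Ciliary-DETR, a framework that treats image scale as a learnable variable to improve test-time adaptability in object detection, by analogy with accommodation in biological vision. It introduces a lightweight scale predictor that dynamically estimates the test-time scale factor across a wide range of input scales, improving flexibility and robustness. A parametric scale-optimization objective addresses the fact that the optimal input scale is unobservable under standard training setups, and the method runs as efficient single-pass inference.

Comments 23 pages, 11 figures

Details
English abstract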

We propose Ciliary-DETR (previous name: Elastic-DETR), a framework for test-time resolution adjustment analogous to biological accommodation. While multi-scale data augmentation improves robustness to scale variation, modern detectors rely on fixed inference resolutions, potentially limiting flexibility and robustness. Similar to the ciliary muscle, we introduce a lightweight scale predictor that dynamically estimates test-time scale factors across a wide range of input scales. The core challenge is that the optimal input scale is inherently unobservable under standard training setups. To address this challenge, we introduce a parametric formulation of desired scaling behavior, leading to loss-driven objectives that guide scale optimization. Overall, our method enables flexible and efficient single-pass inference, bridging the gap between training-time robustness and test-time adaptation.

2407.15512 2026-05-14 cs.LG cs.AI cs.CV Updated version

Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation

Francisco Mena, Diego Arenas, Andreas Dengel

Affiliations * University of Kaiserslautern-Landau, Kaiserslautern, Germany; German Research Center for Artificial Intelligence, Kaiserslautern, Germany

AI summary This study aims to make multi-sensor machine learning models for Earth observation more robust to missing sensors. The authors propose two methods, Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI), and validate them on three multi-sensor temporal datasets. They find that ensemble multi-sensor models are the most robust when sensors are missing, and that the sensor dropout component of ISensD also shows promising robustness.

Comments Accepted at the MACLEAN workshop in the ECML/PKDD 2024

Journal ref Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2024

Details
English abstract

Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
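
A sensor-dropout augmentation in the spirit of ISensD can be sketched as follows. The abstract does not give the exact procedure, so everything here is an assumption for illustration: whole sensors are zero-masked per sample with probability `drop_prob`, at least one sensor is always kept, and the function name `input_sensor_dropout` is hypothetical.

```python
import numpy as np

def input_sensor_dropout(sensor_inputs, drop_prob, rng):
    """Randomly mask whole sensors per sample (illustrative sketch).

    sensor_inputs: dict mapping sensor name -> array of shape (batch, features)
    drop_prob:     probability of dropping each sensor for each sample
    rng:           numpy.random.Generator
    Returns the masked inputs and the boolean keep mask of shape (batch, n_sensors).
    """
    names = list(sensor_inputs)
    batch = next(iter(sensor_inputs.values())).shape[0]
    keep = rng.random((batch, len(names))) >= drop_prob
    # Never drop every sensor of a sample: re-enable one at random.
    empty = ~keep.any(axis=1)
    keep[empty, rng.integers(len(names), size=int(empty.sum()))] = True
    masked = {name: x * keep[:, i, None]
              for i, (name, x) in enumerate(sensor_inputs.items())}
    return masked, keep
```

Training on such masked batches exposes the model to the missing-sensor conditions it will face at inference time, which is the motivation the abstract gives for dropout-style methods.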

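2403.11247 2026-05-14 cs.CV cs.RO Updated version

Compact 3D Gaussian Splatting For Dense Visual SLAM

Tianchen Deng, Chang Nie, Shuhong Liu, Wenhua Wu, Jianfei Yang, Shenghai Yuan, Jiuming Liu, Danwei Wang, Hesheng Wang

Affiliations * Shanghai Jiao Tong University; Nanyang Technological University; The University of Tokyo; Harvard University; University of Cambridge

AI summary This paper proposes a compact 3D Gaussian Splatting SLAM system that addresses the high memory consumption and slow training caused by large numbers of redundant Gaussian ellipsoids in existing methods. A sliding window-based masking strategy and a geometry codebook reduce both the number of Gaussian ellipsoids and their parameter size. Experiments show that the method trains and renders faster while maintaining scene reconstruction quality.

Comments Accepted by IJCV 2026

Details
English abstract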

Recent work has shown that 3D Gaussian-based SLAM enables high-quality reconstruction, accurate pose estimation, and real-time rendering of scenes. However, these approaches are built on a tremendous number of redundant 3D Gaussian ellipsoids, leading to high memory and storage costs and slow training. To address this limitation, we propose a compact 3D Gaussian Splatting SLAM system that reduces the number and the parameter size of Gaussian ellipsoids. A sliding window-based masking strategy is first proposed to reduce the redundant ellipsoids. We then observe that the covariance matrices (geometry) of most 3D Gaussian ellipsoids are extremely similar, which motivates a novel geometry codebook to compress 3D Gaussian geometric attributes, i.e., the parameters. Robust and accurate pose estimation is achieved by a global bundle adjustment method with reprojection loss. Extensive experiments demonstrate that our method achieves faster training and rendering speed while maintaining the state-of-the-art (SOTA) quality of the scene representation.
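
The codebook idea rests on the observation that many Gaussians share nearly the same geometry, so each one can store a small index into a shared table instead of its own parameters. The sketch below uses plain k-means vector quantization as a stand-in; the abstract does not say how the codebook is built, so the algorithm choice and the name `build_geometry_codebook` are assumptions.

```python
import numpy as np

def build_geometry_codebook(params, k, iters=10, seed=0):
    """Quantize per-Gaussian geometry vectors with k-means (illustrative sketch).

    params: (n, d) geometry parameters, e.g. scale and rotation per Gaussian
    k:      number of codebook entries
    Returns (codebook of shape (k, d), index of shape (n,)).
    """
    rng = np.random.default_rng(seed)
    codebook = params[rng.choice(len(params), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every Gaussian to its nearest codebook entry.
        dists = ((params[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        index = dists.argmin(axis=1)
        # Move each entry to the mean of the Gaussians assigned to it.
        for j in range(k):
            members = params[index == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    dists = ((params[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook, dists.argmin(axis=1)
```

Storage then drops from n × d floats to k × d floats plus n small integers, and `codebook[index]` recovers an approximation of the original parameters.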

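2308.10058 2026-05-14 cs.CV Updated version

R-C-P Method: An Autonomous Volume Calculation Method Using Image Processing and Machine Vision

MA Muktadir, Sydney Parker, Sun Yi

AI summary This paper proposes the R-C-P method, an autonomous volume calculation method based on image processing and machine vision, intended as an alternative to depth sensors such as LiDAR in environments where they are impractical. The method uses two 2D cameras to measure the dimensions of a rectangular object in real time, combining a row-column-pixel (R-C-P) strategy with edge detection to obtain surface areas and detect discontinuous edges or volumes. Experiments illustrate the method and provide equations that compute object dimensions from the camera-to-object distance.

Journal ref Communications in Computer and Information Science, vol. 2939, Springer, Cham (2026)

Details
English abstract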

Machine vision and image processing are often used with sensors for situational awareness in autonomous systems, from industrial robots to self-driving cars. 3D depth sensors such as LiDAR (Light Detection and Ranging) and radar are valuable tools for autonomous systems. However, because of the complexity of its setup, LiDAR may not be suitable for some operational environments, such as space. This study was motivated by a desire to obtain real-time volumetric and change information with multiple 2D cameras instead of a depth camera. Two cameras were used to measure the dimensions of a rectangular object in real time. The R-C-P (row-column-pixel) method was developed using image processing and edge detection. In addition to surface areas, the R-C-P method also detects discontinuous edges or volumes. Lastly, experimental work is presented to illustrate the R-C-P method and provides the equations for calculating surface area dimensions. Using these equations with the given distance between the object and the camera, the vision system provides the dimensions of actual objects.
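
The abstract does not reproduce the R-C-P equations, but the general relationship between pixel extent, camera distance, and real size can be sketched with the standard pinhole-camera model. The functions below are an illustration under that assumption, with one camera viewing the front face and a second viewing the side; the names and the focal length expressed in pixels are hypothetical and are not the paper's formulas.

```python
def pixels_to_length(pixel_extent, distance_m, focal_length_px):
    """Pinhole-camera estimate of a real length from its extent in pixels.

    pixel_extent:    number of rows or columns the edge spans in the image
    distance_m:      camera-to-object distance in metres (assumed known)
    focal_length_px: focal length in pixel units (assumed calibrated)
    """
    return pixel_extent * distance_m / focal_length_px

def box_volume(width_px, height_px, depth_px,
               front_distance_m, side_distance_m, focal_length_px):
    """Volume of a rectangular object from a front view and a side view."""
    width = pixels_to_length(width_px, front_distance_m, focal_length_px)
    height = pixels_to_length(height_px, front_distance_m, focal_length_px)
    depth = pixels_to_length(depth_px, side_distance_m, focal_length_px)
    return width * height * depth
```

For instance, an edge spanning 200 pixels at 2 m with an 800-pixel focal length corresponds to 0.5 m under this model.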

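2304.11193 2026-05-14 cs.RO cs.AI cs.CV Updated version

Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy

Willow Mandil, Amir Ghalamzan-E

Affiliations * University of Lincoln; University of Sheffield

AI summary This paper studies world-model prediction that combines visual and tactile information in physical robot interaction, with the aim of predicting the outcomes of robot actions more accurately in complex environments. Using two new robot-pushing datasets, the authors show that combining vision and touch improves prediction in physically ambiguous scenarios, while touch adds little when object dynamics can already be inferred visually. The work contributes new data and methodological insight for building more robust robot world models.

Comments This paper is accepted for publication in Robotics and Autonomous Systems

Details
English abstract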

Predicting the outcomes of robotic actions, often referred to as learning a world model, in complex environments remains a fundamental challenge in robotics. Existing approaches primarily rely on visual observations and action inputs to generate video-based predictions, frequently overlooking the critical role of tactile feedback in understanding physical interactions. In this work, we investigate the integration of tactile and visual information within predictive perception systems for physical robot interaction. We demonstrate that visuo-tactile prediction provides the greatest benefits in physically ambiguous interaction regimes, while improvements are naturally limited when object dynamics are visually inferable. Furthermore, we introduce two novel robot-pushing datasets collected using a magnetic-based tactile sensor for unsupervised learning. The first dataset comprises visually identical objects with varying physical properties, explicitly isolating physical ambiguity, while the second mirrors existing robot-pushing benchmarks involving clusters of household objects. Our results show that tactile-visual integration improves prediction accuracy and robustness under physical ambiguity, while offering limited gains in visually unambiguous settings. Code and datasets are publicly available.

1911.09301 2026-05-14 cs.CV Updated version

Image Aesthetics Assessment using Multi Channel Convolutional Neural Networks

Nishi Doshi, Gitam Shikhenawis, Suman K Mitra

Affiliations * Dhirubhai Ambani Institute of Information and Communication Technology; C R Rao Advanced Institute of Mathematics, Statistics and Computer Science

AI summary This paper addresses image aesthetics assessment, classifying images as high or low quality. The authors propose a multi-channel convolutional neural network that takes crops and saliency maps of the image as input in addition to the raw image. Experiments on the widely used AVA dataset show improved performance over existing approaches.

Journal ref Computer Vision and Image Processing. CVIP 2019

Details
English abstract

Image aesthetics assessment is an emerging research domain. It deals with classifying images into categories according to how pleasant they are to look at. In this article, the focus is on categorizing images as high quality or low quality. Deep convolutional neural networks are used to classify the images. Instead of using just the raw image as input, different crops and saliency maps of the image are also used as input to the proposed multi-channel CNN architecture. Experiments on the widely used AVA database show improved aesthetic assessment performance over existing approaches.

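2605.12619 2026-05-14 q-bio.NC cs.CV Updated version

Human face perception reflects inverse-generative and naturalistic discriminative objectives

Wenxuan Guo, Heiko H. Schütt, Kamila Maria Jozwik, Katherine R. Storrs, Nikolaus Kriegeskorte, Tal Golan

Affiliations * Department of Psychology; Department of Behavioural and Cognitive Sciences; MRC Cognition and Brain Sciences Unit; School of Psychology; Department of Neuroscience; Department of Industrial Engineering and Management; School of Brain Sciences and Cognition

AI summary This study investigates the perceptual mechanisms behind human face recognition by comparing six deep neural network models that share an architecture but are trained on different tasks. Models that emphasize high-level, invariant structure (trained via inverse rendering, face identification, or object classification) best match human face-dissimilarity judgments, and models trained on natural images typically outperform those trained on synthetic images. These results suggest that human face perception relies on mechanisms that infer the latent causes of facial appearance, discount nuisance variation, and are tuned by natural image statistics.

Comments 33 pages, 10 figures, 4 tables

Details
English abstract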

The perceptual representations supporting our ability to recognize faces remain a computational mystery. Deep neural networks offer mechanistic hypotheses for human face perception, but theoretically distinct models often make indistinguishable representational predictions for randomly sampled faces. To expose diagnostic differences among these hypotheses, we compared six neural network models sharing an architecture but trained on distinct tasks, using face pairs optimized to elicit contrasting model predictions ("controversial" pairs) alongside randomly sampled pairs. We tested model predictions against face-dissimilarity judgments from 864 human participants across stimulus sets differing in realism and pose variation. Models prioritizing high-level, invariant structures (trained via inverse rendering, face identification, or object classification) most robustly matched human judgments. Furthermore, models trained on natural images typically outperformed synthetic-trained counterparts. Together, these findings suggest that human face perception is shaped by mechanisms that infer latent causes of facial appearance, discount nuisance variation, and are tuned by natural image statistics.

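2605.12608 2026-05-14 cs.CV Updated version

A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

Mohamed Ahmed Mohamed, Xiaowei Huang

Affiliations * Waymo Open Dataset; GitHub

AI summary This paper studies data efficiency when improving object detection in adverse weather. It proposes Clear2Fog (C2F), an end-to-end, physics-based pipeline that simulates fog on clear-weather datasets while keeping camera and LiDAR consistent at the sensor level. Using monocular depth estimation and a new atmospheric light estimation method, C2F avoids the structural artifacts and chromatic biases of existing techniques. Experiments show that models trained on mixed-density fog data at 75% scale outperform those trained on fixed-density data at 100% scale, and that fine-tuning on real foggy data with a higher learning rate yields a 1.67 mAP gain over real-only baselines.

Comments Project code and experimental configs available at https://github.com/mmohamed28/Clear2Fog

Details
English abstract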

Object detection in adverse weather is critical for the safety of autonomous vehicles; however, the scarcity of labelled, real-world foggy data remains a significant bottleneck. In this paper, we propose Clear2Fog (C2F), an end-to-end, physics-based pipeline that simulates fog on clear-weather datasets while ensuring sensor-level consistency across camera and LiDAR. By using monocular depth estimation and a novel atmospheric light estimation method, C2F overcomes structural artifacts and chromatic biases common in existing techniques. A human perceptual study confirms C2F's physical realism, with the generated images being preferred 92.95% of the time over an established method. Utilising a training set of 270,000 images from the Waymo Open Dataset, we conduct an extensive data efficiency study to investigate how environmental diversity influences model robustness. Our findings reveal that models trained on mixed-density fog datasets at 75% scale outperform those trained on fixed-density datasets at 100% scale. Furthermore, we investigate the sim-to-real transfer by fine-tuning pre-trained models on real-world foggy data. We demonstrate that a tenfold increase over the default fine-tuning learning rate successfully overcomes negative transfer from synthetic biases, resulting in a 1.67 mAP improvement over real-only baselines. The C2F pipeline provides a scalable framework for enhancing the reliability of autonomous systems in adverse weather and demonstrates the potential of diverse synthetic datasets for efficient model training.
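
Physics-based fog synthesis on camera images is commonly built on the atmospheric scattering (Koschmieder) model, in which a per-pixel transmission derived from depth blends the clear image with the atmospheric light. The sketch below shows that standard model only; whether C2F uses exactly this form is an assumption, and its atmospheric light estimation and LiDAR handling are not reproduced. The function name `add_fog` is hypothetical.

```python
import numpy as np

def add_fog(image, depth_m, beta, airlight):
    """Apply the standard atmospheric scattering model (illustrative sketch).

    image:    (H, W, 3) clear image with values in [0, 1]
    depth_m:  (H, W) per-pixel depth in metres, e.g. from monocular depth estimation
    beta:     scattering coefficient in 1/m; larger values mean denser fog
    airlight: atmospheric light, a scalar or an RGB triple in [0, 1]
    """
    # Transmission: the fraction of scene radiance that survives the path.
    transmission = np.exp(-beta * depth_m)[..., None]
    # Attenuated scene radiance plus scattered atmospheric light.
    return image * transmission + np.asarray(airlight) * (1.0 - transmission)
```

Sampling `beta` from a range rather than fixing it is one simple way to obtain the mixed-density fog datasets that the abstract compares against fixed-density ones.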

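2605.12587 2026-05-14 cs.CV Updated version

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

Jisu Nam, Jahyeok Koo, Soowon Son, Jaewoo Jung, Honggyu An, Junhwa Hur, Seungryong Kim

Affiliations * KAIST AI; Google DeepMind

AI summary This paper presents TrackCraft3R, which repurposes a pre-trained video diffusion transformer (video DiT) for dense 3D tracking from monocular video. With a dual-latent representation and temporal RoPE alignment, the method converts the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation, predicting in a single forward pass a tracking pointmap that follows every pixel of the reference frame across time, along with its visibility. Experiments show state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, with faster runtime and lower peak memory than prior methods.

Comments Project page and code are available at https://cvlab-kaist.github.io/TrackCraft3r/

Details
English abstract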

Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation with LoRA fine-tuning. TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method. We further demonstrate robustness to large motions and long videos.

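2605.12586 2026-05-14 cs.CV cs.AI cs.DB Updated version

3D Primitives are a Spatial Language for VLMs

Junze Liu, Kun Qian, Florian Dubost, Kai Zhong, Arvind Srinivasan, Nan Chen, Anping Wang, Sam Zhang, Alejandro Mottini, Qingjun Cui, Tian Wang

Affiliations * Unity Technologies

AI summary This study examines the paradoxical spatial-understanding behavior of vision-language models (VLMs) and proposes 3D geometric primitives (cubes, spheres, and so on) as an intermediate representation to improve their spatial reasoning. It introduces the SpatialBabel benchmark, evaluates a range of VLMs on primitive-based 3D scene reconstruction, and proposes two new methods: the training-free Code-CoT inference strategy and the self-supervised S³-FT fine-tuning method. Both markedly improve performance on several spatial-understanding tasks, supporting the diagnostic and transfer value of geometric primitives expressed in code.

Details
English abstract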

Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7× across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own Three.js primitive-reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.

2605.12575 2026-05-14 eess.IV cs.AI cs.CV Updated version

Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL

Hyun Do Jung, Jungwon Choi, Soojung Choi, Yujin Oh, Hwiyoung Kim

Affiliations * Department of Artificial Intelligence, Yonsei University; Kim Jaechul Graduate School of AI, KAIST; Department of Integrative Medicine, College of Medicine, Yonsei University; Department of Biomedical Systems Informatics, College of Medicine, Yonsei University; H-Data Strategy Center, Hallym University Chuncheon Sacred Heart Hospital

AI summary This paper studies whether, for a frozen whole-slide image (WSI) multiple instance learning (MIL) classifier, the slide-level prediction can be recovered from a small, output-consistent subset of tiles, yielding a compact post-hoc rationale. The authors propose FOCI, a lightweight rationale-readout layer trained with sufficiency and exclusion objectives over kept and dropped tile subsets, and introduce the Selection Headroom Index (SHI) for evaluation. Experiments show that MIL models differ in how well they support compact rationales, that FOCI can reduce the number of tiles required, and that it offers a new tool for model interpretation and auditing.

Details
English abstract

Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.
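
The Minimum Sufficient K (MSK) idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the function name `minimum_sufficient_k`, the toy max-pooling classifier, and the agreement tolerance `tol` are all assumptions made for illustration. Tiles are revealed in ranked order, and the sketch reports the smallest prefix whose output matches the full-bag output.

```python
from typing import Callable, Optional, Sequence


def minimum_sufficient_k(
    tiles: Sequence[float],
    scores: Sequence[float],
    predict: Callable[[Sequence[float]], float],
    tol: float = 0.05,
) -> Optional[int]:
    """Smallest K such that the top-K tiles reproduce the full-bag output.

    `predict` stands in for a frozen MIL model (hypothetical). The output
    counts as matched when it falls on the same side of the 0.5 decision
    threshold as the full-bag prediction and lies within `tol` of it.
    """
    full = predict(tiles)
    # Rank tile indices from highest to lowest relevance score.
    order = sorted(range(len(tiles)), key=lambda i: scores[i], reverse=True)
    for k in range(1, len(tiles) + 1):
        subset = [tiles[i] for i in order[:k]]
        out = predict(subset)
        same_label = (out >= 0.5) == (full >= 0.5)
        if same_label and abs(out - full) <= tol:
            return k
    return None


def toy_max_pool(bag: Sequence[float]) -> float:
    # Toy stand-in for a frozen aggregator: max over per-tile evidence.
    return max(bag)


tiles = [0.1, 0.9, 0.2, 0.3]
k_good = minimum_sufficient_k(tiles, scores=tiles, predict=toy_max_pool)
k_bad = minimum_sufficient_k(tiles, scores=[4, 1, 3, 2], predict=toy_max_pool)
```

With this toy model, a ranking that puts the decisive tile first needs one tile, while a ranking that puts it last needs all four, which is the kind of gap a ranking comparison such as FOCI versus a CLS-proxy is meant to expose.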

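2605.12574 2026-05-14 cs.CV cs.AI Updated version

DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

Hongyi Tang, Zhihao Zhu, Yi Yang

Affiliations * The Hong Kong University of Science and Technology

AI summary This paper studies how semantic distraction can be used to mount membership inference attacks on the training data of vision-language models in a black-box setting where only the generated text is accessible. The proposed method, DistractMIA, inserts a known semantic distractor into the input image and analyzes how the generated text changes in order to judge whether a sample was part of the training data. It needs no access to model internals and relies only on outputs. Experiments show that it outperforms existing methods across multiple vision-language models and benchmark datasets and generalizes well to a medical imaging task.

Comments 23 pages, 8 figures

Details
English abstract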

Vision-language models (VLMs) are trained on large-scale image-text corpora that may contain private, copyrighted, or otherwise sensitive data, motivating membership inference as a tool for training-data auditing. This is especially challenging for deployed VLMs, where auditors typically observe only generated textual responses. Existing VLM membership inference attacks either rely on probability-level signals unavailable in such settings, or use mask-based semantic prediction tasks whose effectiveness depends on object-centric visual assumptions. To address these limitations, we propose DistractMIA, an output-only black-box framework based on semantic distraction. Rather than removing visual evidence, DistractMIA preserves the original image, inserts a known semantic distractor, and measures how generated responses change. This design is motivated by the intuition that member samples remain more anchored to the original image semantics, while non-member samples are more easily redirected toward the distractor. To make this signal reliable, DistractMIA calibrates distractor configurations on a reference set and derives membership scores from repeated textual generations, capturing response stability and distractor uptake without accessing logits, probabilities, or hidden states. Experiments across multiple VLMs and benchmarks show that DistractMIA consistently outperforms both output-only and stronger-access baselines. Its performance on a medical benchmark further demonstrates applicability beyond object-centric natural images.

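2605.12573 2026-05-14 cs.CV cs.AI cs.LG Updated version

Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration

Davide Evangelista, Elena Morotti, Francesco Pivi, Maurizio Gabbrielli

Affiliations * Dept. of Computer Science and Engineering; University of Bologna; Dept. of Political and Social Sciences

AI summary This paper studies how to improve diffusion-based posterior sampling (PS) for image restoration. The authors reinterpret PS from a dynamical perspective and propose LAMP, a method that combines a second-order discretization with a residual correction and introduces a lagged temporal correction to improve the stability and accuracy of sampling. Experiments show that LAMP outperforms existing methods on several image restoration tasks without increasing the number of denoising evaluations.

Comments 9 Figures, 9 Tables, Submitted to a conference

Details
English abstract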

Diffusion-based posterior sampling (PS) is a leading framework for imaging inverse problems, combining learned priors with measurement constraints. Yet, its standard formulations rely on instantaneous data-consistent estimates, which induce temporal variability in the reverse dynamics. We reinterpret PS from a dynamical perspective, showing that the standard PS update corresponds to a first-order discretization of the diffusion dynamics plus a residual correction capturing the mismatch between the denoised prediction and the data-consistent estimate. A second-order discretization, however, naturally introduces a temporal correction based on the variation of consecutive estimates. Building on this, we propose LAMP, combining the second-order update with the residual correction characterizing a PS technique. LAMP thus inherits a lagged temporal correction, and it can be implemented as a modular plug-in over the PS backbone. We show that LAMP preserves the structure of a posterior sampler, and we perform a one-step risk analysis to characterize when LAMP improves the reverse transition via a bias-variance trade-off. Experiments across multiple imaging tasks demonstrate consistent improvements over strong baselines such as DiffPIR and DDRM, without increasing the number of denoising evaluations.

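2605.12571 2026-05-14 cs.CV cs.AI Updated version

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

Chenhao Qiu, Yechao Zhang, Xin Luo, Shien Song, Xusheng Liu

Affiliations * Nanyang Technological University, Singapore

AI summary This paper studies evidence misalignment in long-video question answering and proposes VideoSEAL, a decoupled framework that separates planning from answer authority and improves both answer accuracy and evidence alignment. It introduces two diagnostics, temporal and semantic groundedness, to reveal the pressures that amplify misalignment at inference and training time, and mitigates the problem through pixel-level verification. Experiments show strong results on several long-video benchmarks, along with good scalability and modular backbone upgrades.

Comments Accepted to ICML 2026. 33 pages, 13 figures. Code and models are available at https://github.com/Echochef/VideoSEAL

Details
English abstract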

Long video question answering requires locating sparse, time-scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit "evidence misalignment": they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared-context saturation at inference time and reward pressure from outcome-only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long-horizon planning with answer authority. We therefore propose the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification. Across four long-video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. Code and models are available at https://github.com/Echochef/VideoSEAL.

2605.12570 2026-05-14 cs.CV Updated version

M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification

Jinyue Li, Yuzhou Yu, Jingjing Yang, Meng Fu, Yani Zhang, Shuyao He, Dianlong Ge, Xin Ning, Yannan Chu, Qiankun Li

Affiliations * Hefei Cancer Hospital of CAS, Institute of Health and Medical Technology, Hefei Institutes of Physical Science, Chinese Academy of Sciences; University of Science and Technology of China; Graduate School, Bengbu Medical College; Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC); Northeastern University; Institute of Semiconductors, Chinese Academy of Sciences; College of Computing and Data Science (CCDS), Nanyang Technological University

AI summary Classifying pulmonary nodules as benign or malignant is important for early lung cancer screening, but it is challenging because of their multi-scale and heterogeneous nature. This paper proposes M3Net, a 3D network inspired by radiologists' hierarchical diagnostic workflow that integrates multi-scale context, from fine-grained structures to global anatomical relationships, for more accurate classification. The network uses a hierarchical input structure and a cross-scale semantic consistency mechanism, which the authors report improves performance and interpretability. Experiments on a public dataset and a self-collected clinical dataset show it outperforming existing methods.

Comments Published in Information Fusion (2026), 15 pages, 5 figures

Journal ref Information Fusion, 2026

Details
English abstract

The accurate classification of benign and malignant pulmonary nodules in CT scans is critical for early lung cancer screening, yet remains challenging due to the multi-scale and heterogeneous nature of pulmonary nodules. While deep learning offers potential for auxiliary diagnosis, most existing models act as "black boxes", lacking the transparency and explainability required for trustworthy clinical integration. To address this issue, we propose M3Net, a novel 3D network for pulmonary nodule classification inspired by the hierarchical diagnostic workflow of radiologists, which integrates multi-scale contextual information from fine-grained structures to global anatomical relationships. Our framework constructs a progressive multi-scale input, from fine-grained nodule structures to local semantics and global spatial relationships. M3Net employs scale-specific encoders and ensures cross-scale semantic consistency through latent space projection and mutual information maximization. Extensive experiments on the public LIDC-IDRI dataset and a self-collected clinical dataset (USTC-FHLN) demonstrate that our method achieves state-of-the-art performance, with accuracies of 86.96% and 84.24% respectively, outperforming the best baseline by 3.26% and 2.17%. The results validate that M3Net provides a more robust and clinically relevant solution for pulmonary nodule classification. The code is available at https://github.com/jylEcho/M3-Net.

2605.12562 2026-05-14 eess.IV cs.AI cs.CV Updated version

Uncovering Latent Pathological Signatures in Pulmonary CT via Cross-Window Knowledge Distillation

Bo Peng, Wujian Xu, Kun Wang, Ximing Liao, Na Wang, Daqian Shi, Tian Li, Jing Gao, Johan Thygesen, Yingqun Ji, Honghan Wu

Affiliations * Institute of Health Informatics, University College London; Department of Pulmonary and Critical Care Medicine, Shanghai East Hospital, School of Medicine, Tongji University; Queen Mary University of London; School of Health and Wellbeing, University of Glasgow

AI summary This study addresses the failure of existing deep learning methods to effectively fuse information about structures of differing densities in multi-window pulmonary CT analysis. It proposes a cross-window knowledge distillation framework in which student encoders learn latent clinical priors from a teacher trained on the most informative window. Experiments on three datasets show marked gains in per-window AUC and an ensemble AUC of up to 0.9960, indicating strong performance and generalization for multi-window pulmonary CT analysis.

Details
English abstract

Multi-window CT imaging captures complementary pathological information across anatomical structures of differing densities, yet existing deep learning methods fuse representations only at later stages, missing cross-density interactions. We propose a cross-window knowledge distillation framework in which student encoders learn latent clinical priors from a teacher trained on the most informative window. Evaluated retrospectively on three cohorts - COPD-CT-DF (n=719), RSNA PE (n=1,433), and an in-house CTEPD dataset (n=161) - distillation improved per-window AUC by 10.1-16.5 percentage points on COPD-CT-DF (0.75-0.81 to 0.90-0.94; all P<0.001), with ensemble AUC reaching 0.9960. Similar gains were observed on RSNA PE (0.80-0.83 to 0.90-0.92) and CTEPD (AUC 0.7481 vs. 0.6264). Cross-window distillation internalises pathological signatures invisible to supervised approaches, offering a generalisable solution for multi-window pulmonary CT analysis.
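
For readers unfamiliar with the term, a CT "window" maps a range of Hounsfield units (HU) to display intensities, so different windows emphasize structures of different densities. A minimal sketch of that mapping follows. The function name and the lung and mediastinal settings are common illustrative values assumed here, not settings taken from the paper.

```python
def apply_window(hu_values, center, width):
    """Clip HU values to [center - width/2, center + width/2] and scale to [0, 1]."""
    low = center - width / 2.0
    high = center + width / 2.0
    out = []
    for hu in hu_values:
        clipped = min(max(hu, low), high)
        out.append((clipped - low) / (high - low))
    return out


# Illustrative settings (assumed typical values, not from the paper).
LUNG_WINDOW = {"center": -600, "width": 1500}
MEDIASTINAL_WINDOW = {"center": 40, "width": 400}

voxels = [-1000, -600, 40, 400]  # roughly: air, lung tissue, soft tissue, dense
lung_view = apply_window(voxels, **LUNG_WINDOW)
mediastinal_view = apply_window(voxels, **MEDIASTINAL_WINDOW)
```

In this toy example the air and lung-tissue voxels both collapse to 0 in the mediastinal view but stay distinguishable in the lung view, which is why each window carries complementary information.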

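2605.12560 2026-05-14 eess.IV cs.CV cs.LG Updated version

Brain Tumor Classification in MRI Images: A Computationally Efficient Convolutional Neural Network

Md Fahimul Kabir Chowdhury, Jannatul Ferdous

Affiliations * Department of Computer Science and Engineering, University of North Texas, USA; Department of Electrical and Electronic Engineering, International Islamic University Chittagong, Bangladesh

AI summary This paper proposes a computationally efficient convolutional neural network (CNN) for multi-class classification of brain tumors in MRI images, covering four classes: glioma, meningioma, pituitary tumor, and no tumor. Through efficient feature extraction and optimized training strategies, the model reaches classification accuracies of 99.03% and 99.28% and ROC scores of 99.88% and 99.94% on two public datasets, with far fewer parameters than mainstream pre-trained models. Compared with existing advanced models, it keeps high classification performance while substantially reducing computational overhead, suggesting potential as a practical diagnostic aid in clinical settings.

Journal ref 2025 IEEE International Conference on Biomedical Engineering, Computer and Information Technology for Health (BECITHCON), pp. 633-638, 2025

Details
English abstract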

Improving patient outcomes depends on the prompt and accurate diagnosis of brain tumors, but manual MRI scan analysis is still time-consuming and unreliable. Although deep learning has shown promise, many of the models now in use are computationally intensive and have difficulty handling the intrinsic complexity and variety of different types of brain tumors. In this work, we propose a lightweight yet high-performing Convolutional Neural Network (CNN) for multi-class brain tumor classification, employing MRI images to target gliomas, meningiomas, pituitary tumors, and healthy (no tumor) instances. The model was rigorously evaluated on two publicly accessible datasets from Figshare and Kaggle. Leveraging efficient feature extraction and optimized training strategies, our CNN achieved classification accuracies of 99.03% and 99.28%, along with ROC scores of 99.88% and 99.94% on Dataset 1 and Dataset 2, respectively, all while utilizing significantly fewer parameters than popular pre-trained architectures. In contrast to cutting-edge models like DenseNet201, MobileNetV2, VGG19, Xception, InceptionV3, and ResNet50, our approach consistently demonstrated superior performance with reduced computational overhead. These findings highlight the potential of the proposed model as a practical and reliable diagnostic aid in clinical environments.

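2605.12556 2026-05-14 cs.CV Updated version

M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement

Youssef Aboelwafa, Hicham G. Elmongui, Marwan Torki

Affiliations * Alexandria University, Egypt

AI summary Low-light image enhancement is challenging because of complex degradations such as amplified noise, artifacts, and color distortion. This paper proposes the Multi-Modal Retinexformer (M2Retinexformer) framework, which brings depth cues, luminance priors, and semantic features into a progressive refinement pipeline to improve enhancement. The method fuses multi-scale information through cross-modal attention and uses adaptive gating to dynamically balance illumination-guided self-attention and cross-attention. Experiments show it outperforming existing methods on several benchmark datasets.

Comments Accepted at 2026 IEEE International Conference on Image Processing (ICIP)

Details
English abstract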

Low-light image enhancement is challenging due to complex degradations, including amplified noise, artifacts, and color distortion. While Retinex-based deep learning methods have achieved promising results, they primarily rely on single-modality RGB information. We propose M2Retinexformer (Multi-Modal Retinexformer), a novel framework that extends Retinexformer by incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline. Depth provides geometric context that is invariant to lighting variations, while luminance and semantic features offer explicit guidance on brightness distribution and scene understanding. Modalities are extracted at multiple scales and fused through cross-attention, with adaptive gating dynamically balancing illumination-guided self-attention and cross-attention based on the reliability of auxiliary cues. Evaluations on the LOL, SID, SMID, and SDSD benchmarks demonstrate overall improvements over Retinexformer and recent state-of-the-art methods. Code and pretrained weights are available at https://github.com/YoussefAboelwafa/M2Retinexformer

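2605.12550 2026-05-14 cs.CV cs.AI Updated version

SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting

Mingrui Zhang, Hanchen Yang, Wengen Li, Xudong Jiang, Yichao Zhang, Jihong Guan, Shuigeng Zhou

AI summary This paper studies time series forecasting with vision models and points out that, after time series are rendered as images, spectral and structural gaps remain that limit the performance of pre-trained vision models. The authors propose SSDA, which bridges these gaps at the data level through spectral magnitude alignment and at the model level through structure-guided low-rank adaptation, improving forecasting accuracy. Experiments show that SSDA outperforms existing methods on several real-world datasets and generalizes well.

Details
English abstract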

Large vision models (LVMs) have recently proven to be surprisingly effective time series forecasters, simply by rendering temporal data as images. This success, however, rests on a largely unexamined premise: the rendered time series images are sufficiently close to natural images for knowledge in pre-trained models to transfer effectively. We argue that two gaps still remain, i.e., spectral and structural gaps, fundamentally limiting the potential of LVMs for time series forecasting. Spectrally, we systematically reveal that rendered time series images exhibit a markedly shallower power spectrum than the natural images LVMs are pre-trained to recognize. Structurally, reshaping 1D temporal sequences into 2D grids fabricates spurious spatial adjacencies while severing genuine temporal continuities, misleading the spatial inductive biases of pre-trained LVMs. To bridge these gaps, we propose SSDA, a dual-branch network that spectrally and structurally adapts to unlock the full potential of LVMs for time series forecasting. At the data level, a Spectral Magnitude Aligner (SMA) applies 2D FFT to selectively enhance the magnitude spectrum toward natural-image statistics while preserving phase. At the model level, a Structural-Guided Low-Rank Adaptation (SG-LoRA) injects position-aware temporal encodings into patch embeddings and adapts attention via low-rank updates. The two branches are further adaptively fused to produce the final forecast. Extensive experiments on seven real-world benchmarks demonstrate that SSDA consistently outperforms strong LVM- and LLM-based baselines under both full-shot and few-shot settings. Code is publicly available at https://anonymous.4open.science/r/SSDA-8C5B.

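2605.12549 2026-05-14 cs.CV Updated version

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

Jiaping Lin, Fei Shen, Junzhe Li, Ping Nie, Fei Yu, Ming Li, Haizhou Li

Affiliations * Guangming Laboratory; National University of Singapore; Peking University; University of Waterloo; The Chinese University of Hong Kong (Shenzhen)

AI summary Existing training-free GUI grounding methods usually rely on multiple inference runs to identify the target element, but each forward pass parses the instruction and visual layout independently, with no progressive interaction among visual tokens. This paper studies the internal mechanism of vision-language models (VLMs) during GUI grounding and finds a two-stage pattern: the prefill stage determines candidate UI elements, and the decoding stage refines the coordinates. Building on this, the authors propose Re-Prefill, a training-free method that adds an attention-guided second pass at the prefill stage, extracting the visual tokens most strongly attended from the query position as a preliminary hypothesis to improve grounding accuracy. Experiments show consistent gains on several benchmarks.

Details
English abstract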

Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at https://github.com/linjiaping1/Re-Prefill.
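
The token-selection step described above (visual tokens that consistently receive high attention from the final token across layers) could be sketched as follows. This is an illustration rather than the released code: the per-layer top-k vote, the `min_layers` threshold, and all names are assumptions, since the abstract does not specify how consistency is measured.

```python
from collections import Counter


def consistent_tokens(attn_per_layer, top_k, min_layers):
    """Indices of visual tokens that rank in the top-k of at least `min_layers` layers.

    attn_per_layer[l][i] is the attention weight from the query position
    (the final token) to visual token i at layer l.
    """
    votes = Counter()
    for weights in attn_per_layer:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        votes.update(ranked[:top_k])
    return sorted(i for i, n in votes.items() if n >= min_layers)


# Made-up attention weights for four visual tokens over three layers.
attn = [
    [0.05, 0.60, 0.30, 0.05],  # layer 0
    [0.10, 0.50, 0.10, 0.30],  # layer 1
    [0.02, 0.48, 0.40, 0.10],  # layer 2
]
hypothesis = consistent_tokens(attn, top_k=2, min_layers=3)
```

Here only token 1 is in the top two at every layer, so it alone forms the preliminary target hypothesis; relaxing `min_layers` to 2 would also admit token 2.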

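2605.12545 2026-05-14 cs.CV cs.AI Updated version

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

Zhitong Dong, Chao Li, Jie Yu, Hao Chen

Affiliations * Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology; Alibaba Group

AI summary This study proposes CROP, a new method that uses compositional reasoning and preference optimization to produce image crops aligned with expert aesthetics. Unlike earlier methods that rely on saliency prediction or retrieval augmentation, CROP recasts aesthetic cropping as a multimodal reasoning task, guiding a vision-language model to analyze, propose, and decide like a professional photographer. By decomposing the complex aesthetic problem and adding an expert preference alignment module, the method brings cropping results closer to the judgments of human experts. Experiments show superior performance on several datasets.

Details
English abstract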

Aesthetic image cropping aims to enhance the aesthetic quality of an image by improving its composition through spatial cropping. Previous methods often rely on saliency prediction or retrieval augmentation, ignoring the task's core requirement: a deep understanding of composition and aesthetics. Consequently, saliency-based methods struggle to make compositional trade-offs in complex scenes, while retrieval-based methods blindly refer to similar cases, lacking adaptive reasoning for unique scenes. Both approaches fail to align their automated cropping results with those of human experts. To address the above issues, we propose a novel paradigm that reformulates aesthetic cropping as a multimodal reasoning task, aiming to activate the VLM's analytical and comprehension capabilities in aesthetics. We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs the VLM to think like a professional photographer. It deconstructs a complex and subjective aesthetic problem into an "analysis-proposal-decision" process, reasoning step by step through the analysis of scene elements and compositional principles. Meanwhile, our expert preference alignment module makes the model's decision consistent with human expert aesthetics. Extensive experiments across multiple datasets validate our method's superiority and component effectiveness.

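2605.12528 2026-05-14 cs.CV cs.AI cs.AR Updated version

MorphOPC: Advancing Mask Optimization with Multi-scale Hierarchical Morphological Learning

Yuting Hu, Lei Zhuang, Chen Wang, Ruiyang Qin, Hua Xiang, Gi-joon Nam, Jinjun Xiong

Affiliations * University at Buffalo; IBM T. J. Watson Research Center; Villanova University

AI summary As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly difficult. To improve pattern fidelity and manufacturability, this paper proposes MorphOPC, a mask optimization model based on multi-scale hierarchical morphological learning that generates masks as a sequence of morphological operations on local layout features. Experiments show that MorphOPC outperforms existing methods on several benchmarks, achieving higher printing fidelity and lower manufacturing cost, which points to its potential for scalable mask optimization.

Details
English abstract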

As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly challenging. Optical proximity correction (OPC) is widely used to ensure pattern fidelity and manufacturability. Recent generative mask optimization models based on encoder-decoder architecture can synthesize near-optimal masks, serving as fast machine learning (ML) surrogates for traditional OPC. However, these models often fail to capture the geometric transformations from target layouts to mask patterns, leading to suboptimal quality. In this work, we formulate mask generation as a sequence of morphological operations on local layout features and propose MorphOPC, a multi-scale hierarchical model with neural morphological modules to learn these transformations. Experiments on edge-based OPC and ILT benchmarks across metal and via layers show that MorphOPC consistently outperforms state-of-the-art methods, achieving higher printing fidelity and lower manufacturing cost, demonstrating strong potential for scalable mask optimization.
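
As background for the phrase "sequence of morphological operations", the sketch below applies classical binary dilation and erosion with a 3x3 structuring element to a tiny layout grid. It only illustrates the fixed, hand-written operations; MorphOPC learns neural morphological modules, which this sketch does not attempt to reproduce. All function names are illustrative.

```python
def _neighborhood(grid, r, c):
    """Values in the 3x3 window around (r, c); out-of-bounds cells count as 0."""
    rows, cols = len(grid), len(grid[0])
    vals = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < rows and 0 <= cc < cols
            vals.append(grid[rr][cc] if inside else 0)
    return vals


def dilate(grid):
    """Binary dilation: a cell becomes 1 if any cell in its window is 1."""
    return [[max(_neighborhood(grid, r, c)) for c in range(len(grid[0]))]
            for r in range(len(grid))]


def erode(grid):
    """Binary erosion: a cell stays 1 only if every cell in its window is 1."""
    return [[min(_neighborhood(grid, r, c)) for c in range(len(grid[0]))]
            for r in range(len(grid))]


target = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
grown = dilate(target)    # the single pixel grows into a 3x3 block
restored = erode(grown)   # eroding that block recovers the original pixel
```

Chaining such operations with different structuring elements is one classical way to bias, grow, or trim layout shapes, which gives some intuition for why a learned sequence of them might suit target-to-mask transformations.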

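2605.12517 2026-05-14 cs.CL cs.AI cs.CV Updated version

Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

Mingyeong Kim, Jungwon Choi, Chaeyun Jang, Juho Lee

Affiliations * Graduate School of AI, KAIST

AI summary This study examines the performance drop and miscalibration that vision-language models show when given text-only input, finding that model confidence becomes unreliable even when the text preserves the key information. The authors propose the Latent Imagination Module (LIM), a lightweight cross-attention module that generates latent embeddings from text and feeds them into a frozen model backbone, improving accuracy and calibration without generating images. Experiments show gains across a range of text-only tasks and missing-image scenarios.

Comments 9 pages, 16 figures. Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety across Modalities

Details
English abstract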

Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach for reliable VLM inference under missing-modality.
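
Calibration error is commonly summarized with the expected calibration error (ECE), which bins predictions by confidence and averages the gap between accuracy and confidence per bin. The abstract does not state which metric the authors report, so the following is a generic sketch with equal-width bins, not the paper's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probability of the chosen answer, per sample.
    correct: 1 if the prediction was right, else 0, per sample.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map the confidence to a bin index; conf == 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident toy model: 90% confidence, 25% accuracy.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
```

A model that is always 90% confident but right only a quarter of the time scores an ECE of about 0.65, while a model whose confidence matches its accuracy in every bin scores 0.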

2605.12514 2026-05-14 cs.SI cs.CV cs.CY cs.DL stat.AP Updated version

Structural Diversity Drives Disruptive Scientific Innovation

Yichun Peng, Saike He, Peijie Zhang, Kang Zhao, Yi Yang, Ning Zhang, Qingpeng Zhang, Daniel Dajun Zeng, Hao Peng

Affiliations * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 101408, China; Department of Business Analytics, Tippie College of Business, The University of Iowa, Iowa City, IA 52242, United States of America; The University of Hong Kong, Institute of Data Science & Department of Pharmacology and Pharmacy

AI summary Scientific innovation increasingly depends on collaboration, yet the organizational structures that foster breakthrough ideas remain unclear. This paper proposes Structural Diversity (SD), a new measure of how far a team bridges multiple distinct knowledge communities within its prior collaboration network, and shows that it is a strong and robust predictor of disruptive innovation, outperforming traditional indicators such as team freshness and edge density. The study also finds that SD interacts positively with team size, mitigating the "curse of scale", and that it works through a disciplinary integration mechanism, offering a new theoretical framework and practical guidance for organizing scientific collaboration.

Details
English abstract

Scientific innovation increasingly depends on collaboration, yet the organizational structure that fosters breakthrough ideas remains poorly understood. Existing metrics - such as team size or compositional diversity - capture readily observable characteristics but not the deeper architecture of collaboration. We introduce Structural Diversity (SD): the extent to which a team bridges multiple distinct knowledge communities within its prior collaboration network. Using a century-scale dataset of 260 million scientific publications (1900-2025) and combining causal inference with a quasi-natural experiment based on a U.S. National Science Foundation policy change in 2012, we show that SD is a powerful and robust predictor of disruptive innovation, outperforming traditional team novelty indicators such as team freshness and edge density. Moreover, SD positively interacts with team size and is able to mitigate the well-known "curse of scale" by transforming scale from a liability into a resource for creative synthesis. We find that one mechanism underlying this effect is Disciplinary Integration (DI): teams with higher SD can more effectively combine heterogeneous knowledge into novel configurations. Our findings position SD as both a new theoretical construct and an actionable design principle for organizing scientific collaboration. By linking the architecture of team assembly to the dynamics of creative discovery, our work offers a structural explanation for how collective intelligence can be systematically engineered to foster disruptive innovation.

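2605.12506 2026-05-14 cs.CV cs.AI cs.HC cs.RO eess.IV Updated version

Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

Abdul Basit, Saim Rehman, Muhammad Shafique

Affiliations * New York University (NYU) Abu Dhabi

AI summary Running ML-based gesture detection on mobile devices under real-time, energy, and memory constraints is challenging, especially when battery levels vary. This paper proposes Scale-Gest, a run-time adaptive gesture detection framework that expands the detector space into a family of compact tiny-YOLO architectures and introduces device-calibrated ACE (Accuracy-Complexity-Energy) profiles to select the best model under different constraints. Experiments show that it keeps detection performance high while substantially reducing energy and latency, making it suitable for practical settings such as in-vehicle use.

Comments 7 pages, 11 figures, Accepted to DAC 2026

Details
English abstract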

Realizing on-device ML-based gesture detection under tight real-time performance, energy and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced complexity detection. To evaluate performance of our system in real-world car driving scenarios, we introduce a temporally-annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
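
The run-time selection step could look roughly like the sketch below: given a table of calibrated operating points, pick the most accurate one that fits the current latency and energy budget. The profile fields, the numbers, and the selection rule are all placeholders invented for illustration; the abstract does not describe the controller at this level of detail.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AceProfile:
    name: str            # hypothetical operating-point label
    f1: float            # accuracy proxy (event-level F1)
    latency_ms: float    # per-frame latency measured on the device
    energy_mj: float     # per-frame energy measured on the device


def select_profile(
    profiles: Sequence[AceProfile],
    max_latency_ms: float,
    max_energy_mj: float,
) -> Optional[AceProfile]:
    """Most accurate profile within both budgets, or None if none fits."""
    feasible = [
        p for p in profiles
        if p.latency_ms <= max_latency_ms and p.energy_mj <= max_energy_mj
    ]
    if not feasible:
        return None
    # Break F1 ties in favor of the lower-energy profile.
    return max(feasible, key=lambda p: (p.f1, -p.energy_mj))


# Placeholder numbers for illustration only; not measurements from the paper.
PROFILES = [
    AceProfile("large", f1=0.90, latency_ms=12.0, energy_mj=7.0),
    AceProfile("medium", f1=0.86, latency_ms=8.0, energy_mj=3.5),
    AceProfile("small", f1=0.80, latency_ms=5.0, energy_mj=1.5),
]

plugged_in = select_profile(PROFILES, max_latency_ms=15.0, max_energy_mj=10.0)
low_battery = select_profile(PROFILES, max_latency_ms=15.0, max_energy_mj=2.0)
```

With a generous budget the sketch picks the largest model, and with a tight energy budget it falls back to the smallest. As a side check on the abstract's own figures, 6.9 mJ / 1.6 mJ is about 4.3, in line with the reported 4x reduction.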