Abstract

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stochastic differential equation (SDE) solvers fail to account for this dynamic, naively injecting uniform white noise throughout the entire process and misusing the finite energy budget. In this work, we establish a mathematical framework that reconsiders SDE inference as a targeted, frequency-decoupled energy transfer. Leveraging this framework, we introduce Colored Noise Sampling (CNS), a novel, training-free stochastic solver. Rather than injecting uniform white noise, CNS utilizes a dynamic, timestep- and frequency-dependent schedule that more efficiently allocates injected energy toward structurally unresolved frequency bands. By actively exploiting the model's inherent spectral bias, CNS systematically steers the generated distribution toward the true data manifold. Extensive experiments demonstrate that CNS significantly outperforms standard ODE and SDE baselines as a strictly plug-and-play, inference-time sampler substitution across diverse architectures (SiT, JiT, FLUX). Compared to standard sampling on ImageNet-256, CNS achieves substantial unguided FID reductions, improving from 8.26 to 6.27 on SiT-XL/2, 32.39 to 26.69 on JiT-B/16, and 11.88 to 8.31 on JiT-H/16, while yielding consistent relative FID improvements with Classifier-Free Guidance. Project page is available at https://hadardavidson.github.io/CNS/.
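The idea of frequency-dependent noise injection can be illustrated with a minimal sketch. The code below is not the CNS schedule from the paper, which this abstract does not specify; it only shows how white Gaussian noise can be reshaped by a per-frequency gain, using a naive 1D discrete Fourier transform. The function names and the example low-pass gain are assumptions made for illustration.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns a complex sequence."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def colored_noise(n, gain, rng):
    """Shape white Gaussian noise with a per-frequency gain.

    `gain(f)` maps a frequency index f in [0, n // 2] to a non-negative scale.
    Applying the same gain to bins k and n - k keeps the output real-valued.
    """
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    spectrum = dft(white)
    shaped = [spectrum[k] * gain(min(k, n - k)) for k in range(n)]
    return [value.real for value in idft(shaped)]


def band_energy(x, lo, hi):
    """Sum of squared DFT magnitudes over frequency indices lo..hi (inclusive)."""
    spectrum = dft(x)
    return sum(abs(spectrum[k]) ** 2 for k in range(lo, hi + 1))


def low_pass(f):
    """Hypothetical gain: keep low frequencies, attenuate high ones."""
    return 1.0 if f <= 4 else 0.1


noise = colored_noise(64, low_pass, random.Random(0))
```

A timestep-dependent schedule, as the abstract describes, would presumably swap in a different gain function at each sampling step; how CNS chooses those gains is not stated here.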

2605.30328 2026-05-29 cs.CV (updated version)

Supercharging Thermal Gaussian Splatting with Depth Estimation

Manoj Biswanath, Chenxin Cai, Hannah Schieber, Daniel Roth, Benjamin Busam

Affiliations: Technical University of Munich; Munich Center for Machine Learning (MCML); Munich Institute of Robotics and Machine Intelligence (MIRMI); Human-Centered Computing and Extended Reality Lab; TUM University Hospital

AI summary: Proposes TDg, a single-modality method that uses only thermal infrared images and depth estimation to derive radiance fields via Thermal-to-Depth Gaussian Splatting, outperforming a multimodal baseline in rendering quality and training time.

Comments 8 pages, 4 figures. Accepted and will be published in ISPRS proceedings (ISPRS Congress 2026)

Abstract

Efficient and robust 3D scene representation is crucial in autonomous driving, robotics, and related fields. While RGB images provide valuable content for 3D reconstruction, other modalities like thermal or depth can provide additional information on the environment. Lately, novel view synthesis methods like 3D Gaussian Splatting have started using multiple modalities to further boost their performance. But fusing or combining multimodal data can make the process slower and can bring in additional challenges. Therefore, our project aims to use a single modality based on the thermal infrared domain, removing the reliance on visible light as much as possible. This single modality can be expected to be faster as it does not rely on multimodal data. We propose a method, Thermal-to-Depth Gaussian Splatting (TDg), that uses only thermal images and depth estimation in its architecture to derive the radiance fields. Our TDg method outperforms the MSMG (Multiple Single-Modal Gaussians) baseline in most cases on our test datasets, RGBT-Scenes and ThermalMix. On average, the rendering quality metrics of TDg, namely learned perceptual image patch similarity (LPIPS), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR), are 1.12%, 0.034%, and 0.01% better than the baseline MSMG values, respectively. It also reduces the training time significantly, by 12 min 47 s (a 55% improvement). Overall, our method is successful in deriving these thermal radiance fields, which can ultimately have several applications, such as identifying heat sources critical in surveillance, search or rescue operations, and industrial inspections where temperature is widely used to monitor machines.

2605.30325 2026-05-29 cs.CV (updated version)

Veda: Scalable Video Diffusion via Distilled Sparse Attention

Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi

Affiliations: ByteDance Inc.; The University of Hong Kong; University of Science and Technology of China

AI summary: Proposes Veda, a distilled sparse attention framework that uses statistics-aware tile scoring and head-aware tile selection to accelerate video diffusion models efficiently while preserving generation quality.

Comments Accepted to ICML 2026

Abstract

Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality is determined not by the sparsity ratio itself, but by how well the sparse mask aligns with the tile-wise geometry of full attention. Based on this insight, we propose Veda, a distilled sparse attention framework that formulates tile selection as an explicit reconstruction problem from full attention. Veda integrates statistics-aware tile scoring with head-aware tiling to reduce estimation error and structural mismatch, enabling aggressive sparsity. A hardware-efficient tile-skipping kernel converts theoretical sparsity into practical wall-clock speedups. Experiments on large video diffusion models, including Waver and Wan2.1, demonstrate substantial acceleration with no noticeable degradation in generation quality. To generate 720P 10-second videos on Waver-T2V-12B, Veda achieves a 5.1$\times$ end-to-end speedup and a 10.5$\times$ self-attention speedup, reducing attention overhead from 92% to 50%. Notably, the gains increase with sequence length, indicating that Veda scales favorably with spatiotemporal resolution across models.
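The relationship between the reported self-attention speedup and the end-to-end speedup can be sanity-checked with Amdahl's law. The sketch below is an idealized model, not a measurement from the paper: it assumes attention takes 92% of the original runtime, that only attention is accelerated (by 10.5x), and that everything else is unchanged. The function names are hypothetical.

```python
def end_to_end_speedup(attention_fraction, attention_speedup):
    """Amdahl's law: only the attention share of the runtime is accelerated."""
    remaining = 1.0 - attention_fraction
    new_total = remaining + attention_fraction / attention_speedup
    return 1.0 / new_total


def new_attention_fraction(attention_fraction, attention_speedup):
    """Share of the new, shorter runtime that is still spent in attention."""
    accelerated = attention_fraction / attention_speedup
    return accelerated / ((1.0 - attention_fraction) + accelerated)


speedup = end_to_end_speedup(0.92, 10.5)
fraction = new_attention_fraction(0.92, 10.5)
print(f"{speedup:.2f}x end-to-end, {fraction:.0%} attention share")
```

Under these assumptions the model gives about 5.97x end-to-end and a 52% attention share, in the same range as the reported 5.1x and 50%. A gap is expected, since the idealized model ignores any overhead outside the attention kernel.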

2605.30320 2026-05-29 cs.CV (updated version)

MonoPhysics: Estimating Geometry, Appearance, and Physical Parameters from Monocular Videos

Daniel Rho, Jun Myeong Choi, Matthew Thornton, Biswadip Dey, Roni Sengupta

Affiliations: University of North Carolina at Chapel Hill; Meta

AI summary: Proposes the MonoPhysics framework, which uses differentiable MPM simulation and 3D Gaussian Splatting to jointly optimize the geometry, appearance, and physical parameters of deformable objects from monocular video, addressing scale ambiguity and inaccurate geometry.

Abstract

Existing inverse physics methods recover physical parameters from multi-view videos, where geometric constraints across views resolve scale and 3D structure. In monocular settings, however, such constraints are absent, leading to severe scale ambiguity, inaccurate geometry, and weak coupling between appearance optimization and physical simulation. We propose MonoPhysics, a framework for monocular inverse physics estimation of deformable objects using differentiable MPM simulation and 3D Gaussian Splatting, which jointly optimizes geometry, appearance, and physical parameters from a single camera view. We address these challenges through three visual-physical bridges: global scale alignment, physics-aware geometry refinement, and a differentiable position map, which together enable accurate optimization from monocular observations alone. We evaluate on Vid2Sim and our new dataset of elastic and plastic objects, showing that MonoPhysics outperforms existing baselines in monocular settings and achieves performance comparable to multi-view baselines using only a single camera. Our project page is available at https://daniel03c1.github.io/MonoPhysics/

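2605.30318 2026-05-29 cs.GR cs.AI cs.CV (updated version)

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

Ruixiang Jiang, Chang Wen Chen

Affiliations: The Hong Kong Polytechnic University

AI summary: Proposes a method that generates portrait pose, camera, lighting, and exposure plans in 3D scenes. By building a Photographic Scene Graph, it performs aesthetic-guided planning and produces portraits that are visually compelling and geometrically and photometrically feasible.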

Abstract

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

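2605.30317 2026-05-29 cs.CV (updated version)

Improving Image Quality Assessment Performance: Unsupervised Score Fusion Based on Deep Maximum a Posteriori Estimation

Zhongling Wang, Raymond Zhou, Shahrukh Athar, Wenbo Yang, Zhou Wang

Affiliations: University of Waterloo, Canada; McMaster University, Canada

AI summary: Proposes an unsupervised score fusion framework for image quality assessment based on deep maximum a posteriori estimation, which uses fine-grained uncertainty estimation to improve the accuracy and reduce the uncertainty of fused predictions.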

Comments 2024 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

DOI: 10.1109/ICASSP48485.2024.10447233

Abstract

Over the past decades, numerous Image Quality Assessment (IQA) models have emerged, aiming to predict the perceptual quality of images. However, individual models are often biased toward certain types of image content or distortions, depending on the design principle and process. An intuitive idea is to harness the strengths and mitigate the weaknesses of each IQA model, by fusing the scores of multiple models into a stronger one. Here we make one of the first attempts to seek an optimal solution for the idea and propose a general framework for unsupervised IQA score fusion using deep Maximum a Posteriori (MAP) estimation. The proposed model conducts fine-grained uncertainty estimation at the score level to increase the accuracy and reduce the uncertainty in fused predictions. Comprehensive experiments demonstrate the superiority of the proposed model over individual IQA models and other fusion methods. It also exhibits an interesting capability of rejecting "bad" models in the fusion process.
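As a point of reference for uncertainty-aware score fusion, the textbook closed form is the inverse-variance weighted mean. It is the MAP estimate when each model's score is the true quality plus independent Gaussian noise of known variance and the prior is flat. The sketch below shows only that classical case, not the paper's deep MAP model; it assumes the scores are already on a common scale and that the variances are given, whereas the paper estimates uncertainty itself. The function name is hypothetical.

```python
def fuse_scores(scores, variances):
    """Inverse-variance weighted fusion of per-model quality scores.

    Returns the fused score and its variance under the assumption of
    independent Gaussian errors with the given (known) variances.
    """
    if len(scores) != len(variances) or not scores:
        raise ValueError("scores and variances must be non-empty and equal length")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * s for w, s in zip(weights, scores)) / total
    return fused, 1.0 / total


# Two confident models and one very uncertain ("bad") model.
fused, fused_var = fuse_scores([70.0, 80.0, 40.0], [4.0, 4.0, 400.0])
```

In this example the fused score is about 74.8: the high-variance model is almost ignored, which loosely mirrors the rejection of "bad" models mentioned in the abstract, and the fused variance (about 1.99) is lower than that of any single model.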

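2605.30268 2026-05-29 cs.CV cs.AI (updated version)

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

Omer Benishu, Gal Fiebelman, Sagie Benaim

Affiliations: Hebrew University of Jerusalem

AI summary: Proposes the PhyGenHOI framework, which combines a motion diffusion model with the Material Point Method and uses a Windowed Attraction Loss, Contact-Driven Re-simulation, and a Masked Video-SDS objective to generate physically consistent and visually realistic dynamic 4D human-object interaction scenes.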

Abstract

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) A Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) A Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) A Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baselines. Project page and videos: https://omerbenishu.github.io/PhyGenHOI/

2605.30265 2026-05-29 cs.CV cs.CL (updated version)

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang

Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai Jiao Tong University; University of Science and Technology of China

AI summary: Targets the "carrier sensitivity" problem, in which vision-language models degrade under modality substitution, and proposes Local Modality Substitution (LoMo), a data curation paradigm that dynamically renders text spans as images to train cross-modal representational invariance, significantly improving multimodal reasoning and fusion.

Abstract

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.

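2605.30263 2026-05-29 cs.CV (updated version)

Reinforcement Learning with Robust Rubric Rewards

Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

Affiliations: Huawei Technologies Co., Ltd.

AI summary: For partially verifiable vision-language tasks, proposes $\text{RLR}^3$, which extends verification from the task level to the criterion level through dual-path rubric execution, a minimal exposure strategy, and hierarchical aggregation, yielding an average gain of 4.7 points across 15 benchmarks.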

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm our deterministic verification and minimal exposure significantly reduce exploitable false positives.

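2605.30239 2026-05-29 cs.CV (updated version)

SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World

Xin Dong, Weijian Deng, Lihan Zhang, Tianru Dai, Wenfeng Deng, Yansong Tang

Affiliations: Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory

AI summary: Proposes the SAM3D-Phys framework, which combines scene reconstruction with the generative priors of SAM3D to recover complete, simulatable object geometry from partial observations, restores scene consistency through physics-constrained optimization and mask-guided appearance distillation, and supports simultaneous interactive simulation of multiple objects.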

Comments 23 pages, 11 figures

Abstract

This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods can produce visually accurate environments, objects are often incomplete due to occlusions and limited observations, making them unsuitable for physics simulation. To address this limitation, we propose SAM3D-Phys, a framework that integrates scene reconstruction with generative 3D priors of SAM3D to recover physically simulatable objects. Our approach first reconstructs the scene from multi-view images to obtain scene geometry and partial observations of objects. We then leverage SAM3D to infer complete object geometry from these partial observations. To ensure that the recovered objects remain consistent with the reconstructed scene, we restore scene-consistent object states through two complementary strategies: a physics-constrained spatial optimization algorithm that iteratively aligns the recovered object to its original location, and a mask-guided appearance distillation module that refines texture fidelity based on the observed images. By recovering complete object geometry and restoring its pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics-based simulation, enabling simultaneous and physically consistent interactive simulation of multiple objects within a reconstructed scene. Project page: https://chnxindong.github.io/sam3d-phys/

2605.30235 2026-05-29 cs.CV (updated version)

BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval

Marco Peer, Anna-Scius Bertrand, Patricia Scheurer, Andreas Fischer

Affiliations: AIBEX Group, University of Fribourg, Switzerland; iCoSys Institute, University of Applied Sciences and Arts Western Switzerland; Department of Computational Linguistics, University of Zurich, Switzerland

AI summary: Proposes BullingerDB, a large-scale historical document dataset based on the correspondence of Heinrich Bullinger, for handwritten text recognition and writer retrieval, and introduces a time-aware nDCG metric for evaluating retrieval performance.

Comments Accepted for presentation at ICDAR 2026. Dataset available via Zenodo
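For readers unfamiliar with the metric named in the summary, the following is a minimal sketch of standard nDCG (normalized discounted cumulative gain) with a log2 rank discount. The time-aware variant introduced for BullingerDB is not specified in this listing, so the code shows only the conventional definition; the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Writer retrieval example: 1 = same writer as the query, 0 = different writer.
perfect = ndcg([1, 1, 0])
late_hit = ndcg([0, 1])
```

Here `perfect` is 1.0 because all relevant documents are ranked first, while `late_hit` equals 1 / log2(3), roughly 0.63, because the only relevant document appears at rank 2.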

2605.30231 2026-05-29 cs.CV cs.AI (updated version)

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

Affiliations: FAIR at Meta

AI summary: Proposes the GASP framework, which injects geometric priors into the LLM's transformer layers and trains with a contrastive loss and depth consistency supervision, substantially improving the 3D spatial reasoning of VLMs with large gains on multiple benchmarks.

Comments CVPR 2026. Project page: https://danielchyeh.github.io/GASP/

Abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
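A contrastive loss on point correspondences is commonly written in the InfoNCE form, where each feature in one view should match its corresponding feature in the other view and not the remaining candidates. The sketch below shows that generic form only. The abstract does not give GASP's exact loss, so the function names, the cosine similarity, and the temperature value are assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss; view_a[i] and view_b[i] are corresponding points."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    losses = []
    for i, anchor in enumerate(a):
        # Cosine similarity between this anchor and every candidate in view_b.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in b]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when corresponding features agree (`aligned`) and large when each anchor is closest to the wrong candidate (`swapped`), which is the behavior a view-invariance objective relies on.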

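2605.30230 2026-05-29 cs.CV (updated version)

OmniCD: A Foundational Framework for Multimodal Semantic-Guided Change Detection in Remote Sensing Images

Chenhao Sun

Affiliations: Wuhan University

AI summary: Proposes the OmniCD framework, which unifies remote sensing change detection tasks through multimodal semantic guidance (image and text prompts), combines hierarchical scene retrieval with a style disentanglement mechanism, builds the large-scale RSITCD dataset, and achieves state-of-the-art performance on multiple benchmarks.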

Abstract

Change detection (CD) in remote sensing is vital for applications such as urban monitoring and disaster assessment, yet traditional methods struggle with generalization across diverse scenarios. We present OmniCD, a foundational framework that unifies and enhances remote sensing CD through multimodal semantic guidance. OmniCD incorporates image and text prompts -- such as textual descriptions, semantic maps, and geospatial metadata -- into a unified architecture, supporting tasks from binary CD to zero-shot semantic change understanding. The framework integrates a hierarchical scene retrieval module and a change detection module, reinforced by a style disentanglement mechanism for improved cross-domain robustness. We further introduce RSITCD, a large-scale multimodal dataset with 300K+ annotated image-text pairs. Extensive experiments show that OmniCD achieves state-of-the-art performance across benchmarks, demonstrating strong adaptability and setting a solid foundation for general-purpose CD systems in remote sensing.

2605.30167 2026-05-29 stat.ML cs.CV cs.LG stat.AP (updated version)

Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks

Daniel Tinoco, Raquel Menezes, Carlos Baquero, Alexandra Silva

Affiliations: Centro de Matemática (CMAT), Universidade do Minho; DEI-FEUP & INESC TEC, Universidade do Porto; Instituto Português do Mar e da Atmosfera, I. P. (IPMA, I. P.), Lisboa, Portugal; Centro de Ciências do Mar e do Ambiente (MARE), Évora, Portugal

AI summary: Proposes an architecture based on convolutional neural networks (CNNs) that learns spatial interpolation directly from a single partially observed field, with no external data or prior fields, as an alternative to Kriging.

Comments 53 pages, 10 figures

Abstract

Predicting a complete spatially correlated field from sparse observations is a fundamental challenge in spatial statistics and environmental modelling. Classical interpolation methods such as Kriging rely on Gaussian process assumptions and variography, which can limit their effectiveness in non-stationary settings and require substantial domain expertise. In this work, we leverage an architecture based on convolutional neural networks (CNNs) for spatial interpolation that is trained and applied on a single partially observed field, without access to external data or prior fields. The model is supervised directly on the observed locations and learns to predict values at unobserved points on the user-defined grid. Unlike Kriging, our method does not require explicit covariance modelling or variogram estimation, and it can flexibly capture local spatial patterns in a data-driven manner. This work demonstrates the potential of CNNs for single-instance spatial interpolation under sparse supervision, offering a practical alternative to classical geostatistical methods, and extending the use of CNNs to a new problem domain.
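Supervising a model only at observed locations is usually expressed as a masked loss: the prediction covers the full grid, but the error is averaged over the observed cells alone. The sketch below illustrates that idea with a masked mean squared error on nested lists. It is an assumption about how such supervision could look, not the loss used in the paper, and the function name is hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over observed grid cells only.

    pred, target, mask are equally shaped 2D lists; mask[i][j] is truthy
    where the field was observed. Unobserved cells do not affect the loss.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no observed cells")
    return total / count


pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 0.0], [0.0, 6.0]]  # values at unobserved cells are placeholders
mask = [[1, 0], [0, 1]]
loss = masked_mse(pred, target, mask)
```

Only the two observed cells contribute here, so the loss is (0 + 4) / 2 = 2.0, regardless of what the placeholder target values at unobserved cells are.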

2605.30161 2026-05-29 cs.CV (updated version)

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park

Affiliations: Seoul National University; The Ohio State University; NVIDIA

AI summary: Using minimal contrastive pair analysis, finds that vision-language models exhibit vertical-distance entanglement (conflating vertical image position with distance). This perspective bias causes accuracy gaps and intensifies with data scaling, while models with well-separated spatial axes are more robust.

Abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.

2605.30140 2026-05-29 cs.CV 版本更新

Evaluation of Conversational Agents: Understanding Culture, Context, and Environment in Emotion Detection

Martha Teiko Teye, Yaw Marfo Missah, Emmanuel Ahene, Twum Frimpong, Auxane Boch

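Affiliations: Cluster of Excellence, University of Stuttgart; Department of Computer Science, Kwame Nkrumah University of Science and Technology; Institute for Ethics in Artificial Intelligence, Technical University of Munich

AI summary: Targeting Black African society, the paper proposes an emotion prediction model that combines speech and image data and uses a 3-layer CNN with the AFME algorithm, reaching 85% to 96% accuracy and identifying sarcasm, to improve the credibility of emotion recognition systems in conversational AI.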

Comments IEEE paper on arxiv

DOI: 10.1109/ACCESS.2022.3153787
Journal ref: IEEE Access 10 (2022) 24976-24984; Erratum: IEEE Access (2022) 35900-35900

AI中文摘要

现在，有价值决策和高度优先分析依赖于面部生物识别、社交媒体照片标记和人机交互等应用。然而，成功部署这些应用的能力取决于它们在考虑可能边缘情况下的测试用例效率。多年来，已经实施了大量通用解决方案来模仿人类情感，包括讽刺。然而，地理位置或文化差异等因素在其解决伦理问题和改进对话AI（人工智能）的相关性中尚未得到充分探索。在本文中，我们旨在解决在黑人非洲社会中对话AI使用的潜在挑战。我们开发了一个情感预测模型，准确率在85%到96%之间。我们的模型结合了语音和图像数据来检测七种基本情感，并特别关注识别讽刺。它使用了3层卷积神经网络，并结合了一种新的音频帧平均表情（AFME）算法，重点放在模型的预处理和后处理阶段。最后，我们的解决方案有助于维护对话AI中情感识别系统的可信度。

英文摘要

Valuable decisions and high-priority analyses now depend on applications such as facial biometrics, social media photo tagging, and human-robot interaction. However, the ability to deploy such applications successfully depends on how well they perform on tested use cases, including possible edge cases. Over the years, many generalized solutions have been implemented to mimic human emotions, including sarcasm. However, factors such as geographical location and cultural difference have not been fully explored, despite their relevance to resolving ethical issues and improving conversational AI (artificial intelligence). In this paper, we seek to address the potential challenges in the use of conversational AI within Black African society. We develop an emotion prediction model with accuracies ranging between 85% and 96%. Our model combines speech and image data to detect the seven basic emotions, with an additional focus on identifying sarcasm. It uses a 3-layer convolutional neural network together with a new Audio-Frame Mean Expression (AFME) algorithm and focuses on the model's pre-processing and post-processing stages. Ultimately, our proposed solution contributes to maintaining the credibility of emotion recognition systems in conversational AI.

2605.30093 2026-05-29 cs.CV 版本更新

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

几何至关重要：用于学习语义对应的3D基础先验

Artur Jesslen, Olaf Dünkel, Adam Kortylewski

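Affiliations: University of Freiburg; Max Planck Institute for Informatics; CISPA Helmholtz Center for Information Security

AI summary: Proposes a 3D-aware post-training framework that uses a 3D foundation model (SAM3D) to estimate object geometry and pose, generates geometry-aware feature maps that complement DINO and Stable Diffusion features, filters candidate correspondences by geodesic distance, and trains a lightweight adapter to improve semantic correspondence.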

Comments 9 pages (main paper), 21 pages (total), 4 figures

AI中文摘要

来自自监督视觉模型和文本到图像扩散模型的基础特征已被证明对语义对应估计有效。然而，由于这些特征主要从2D图像目标学习，它们缺乏明确的3D意识，并且常常混淆对称物体侧面、重复部分以及在3D中不同的视觉相似结构。我们引入了一个3D感知的后训练框架，通过结合3D基础模型的先验，超越了现有的2D基础特征。给定一张图像，我们的方法使用SAM3D估计物体几何和姿态，并通过渲染-比较优化来细化姿态。随后，我们根据估计的物体姿态，将重建几何中的PartField描述符渲染到图像平面。由此产生的几何感知特征图补充了DINO和Stable Diffusion特征，而重建形状上的测地距离能够可靠地过滤候选对应。我们使用过滤后的匹配作为监督，在DINO和Stable Diffusion之上训练一个轻量适配器用于语义对应。与之前需要姿态标注并依赖粗略球形几何的后训练方法相比，我们的方法自动获得实例特定的3D结构，并用它来指导对应学习。实验表明，我们的方法改进了语义对应，同时减少了人工几何监督。代码和模型可在 https://github.com/GenIntel/3D-SC 获取。

英文摘要

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing manual geometric supervision. Code and model can be found at https://github.com/GenIntel/3D-SC.

2605.30090 2026-05-29 cs.CL cs.CV 版本更新

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

DirectorBench: 通过个性化多智能体评估诊断长视频生成

Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

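Affiliations: ByteDance Inc.; City University of Hong Kong

AI summary: Proposes DirectorBench, a multi-agent diagnostic benchmark that evaluates long-form video generation across five dimensions (script, visual, audio, cross-modal, and stability) using 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria, and localizes bottlenecks and user-preference dependencies.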

AI中文摘要

长视频生成正从短的单场景合成快速转向分钟级、多镜头的创作，具有叙事结构、电影控制、音频和跨模态同步。然而，评估此类视频仍然具有挑战性，因为现有基准主要关注局部视觉质量、短期时间一致性或通用提示对齐，并且对工作流故障和用户依赖偏好的诊断有限。我们引入了DirectorBench，一个用于长视频生成的个性化多智能体诊断基准。DirectorBench根据80个结构化元数据、7个用户画像和40个检查点标准，在脚本、视觉、音频、跨模态和稳定性五个维度上评估生成的视频。DirectorBench不将质量简化为单一聚合分数，而是定位检查点级别的瓶颈并支持画像感知评估。我们评估了4个长视频生成工作流、6个基础LLM和7个用户画像。在不同工作流中，DirectorBench揭示了一个单元间瓶颈：过渡质量平均仅为0.256，最佳工作流达到0.356，而提示级别的用户需求满足度平均为0.71。我们进一步进行了14名标注者的人工评估，以验证DirectorBench与人类判断的一致性。结果表明，DirectorBench捕捉到了人类可感知的质量差异，并揭示了聚合评分所隐藏的工作流和画像依赖的故障模式。这些发现强调了长视频生成中诊断性和画像感知基准的重要性。

英文摘要

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

2605.30083 2026-05-29 cs.CV 版本更新

FakeVLM-R1: Internalizing Physical Laws via Chain-of-Thought for Synthetic Image Detection

Leqi Zhu, Junyan Ye, Kaiqing Lin, Zhiyuan Yan, Conghui He, Weijia Li

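Affiliations: Shanghai AI Lab; Nanjing University; Sun Yat-Sen University; Shenzhen University; Peking University; Tsinghua University

AI summary: Proposes the FakeVLM-R1 framework, which combines supervised fine-tuning, Group Relative Policy Optimization, and a critical-thinking chain-of-thought mechanism; through bidirectional dialectical reasoning and physical commonsense it constructs authenticity counter-proofs, achieving high-precision, logically interpretable synthetic image detection and addressing the over-rejection bias of existing methods.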

AI中文摘要

生成式人工智能技术的发展已将合成图像的视觉真实性提升至前所未有的水平。尽管当前基于大型多模态模型（LMM）的可解释检测方法取得了一定进展，但它们仍然依赖于从大量伪造数据中获得的模仿学习，因此缺乏真正的因果推理能力，容易产生解释性幻觉。为克服这一瓶颈，我们提出FakeVLM-R1，旨在赋予模型在执行合成检测任务时类似人类的批判性思维能力。该框架在监督微调（SFT）基础上，将组相对策略优化（GRPO）与批判性思维链（CoT）机制相结合。在推理阶段，模型执行“双向辩证推理”过程：在提出伪造假设的同时，必须同时调用物理常识构建真实性反证。此外，我们构建了包含高质量样本的FakeClue++数据集，该数据集广泛引入了基于真实图像物理定律的注释，为模型提供了统一的真实性锚点。实验证实，FakeVLM-R1在多个基准测试中达到了评估模型中的最优性能（SOTA）。它不仅实现了高精度、逻辑可解释的检测，还解决了现有方法对真实图像的过度拒绝偏差，展现出对扰动的泛化性和鲁棒性。

英文摘要

The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made certain progress, they still rely on imitation learning derived from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, aiming to endow the model with human-like critical thinking capabilities when performing synthetic detection tasks. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we constructed the FakeClue++ dataset with high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves SOTA performance among the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.

2605.30045 2026-05-29 cs.CV 版本更新

GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

GenEraser：通过平衡文本-掩码引导和解耦定位器-保持器实现可泛化的视频对象移除

Yuqing Chen, Lin Liu, Haisu Wu, Xiaopeng Zhang, Yaowei Wang, Yujiu Yang, Qi Tian

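Affiliations: Tsinghua University; Pengcheng National Laboratory; Huawei; Southeast University; Harbin Institute of Technology

AI summary: Proposes the GenEraser framework, which uses a multi-conditional mixture-of-experts, a learnable deep CFG fusion mechanism, and a decoupled expert architecture to address the generalization difficulty of removing targets together with their physical effects in video object removal, improving by 2.16 dB on ROSE and 1.44 dB on VOR-Eval.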

AI中文摘要

视频对象移除在域外场景中常因复杂的时空歧义而难以同时消除目标对象及其关联的物理效应（如烟雾、反射、光线和涟漪）。现有方法主要依赖空间掩码，但往往无法捕捉弱相关效应，且显式文本引导的潜力尚未充分探索。此外，移除模型在高层语义泛化与精确像素级背景保持之间存在根本性的优化冲突。为解决这些挑战，我们提出GenEraser，一种用于泛化高保真视频对象与效应移除的新框架。首先，我们引入多条件混合专家（MC-MoE）配合二分文本引导，充分利用扩散变换器的多模态先验，显著增强复杂效应的识别。其次，开发可学习深度“CFG”融合机制（LD-CFG），以自适应平衡不同场景下掩码和文本条件的相对主导地位。最后，提出解耦专家架构，包含定位器和保持器，以缓解语义泛化与像素对齐之间的固有权衡。大量实验表明，我们的GenEraser超越了近期最先进方法，在ROSE基准和VOR-Eval上分别实现了显著的定量提升（2.16 dB和1.44 dB），同时在开放世界场景中保持了异常稳健的泛化能力。

英文摘要

Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that our GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., 2.16 dB and 1.44 dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining exceptionally robust generalization in open-world scenarios. https://cyqii.github.io/GenEraser.github.io/

2605.30038 2026-05-29 cs.LG cs.AI cs.CV 版本更新

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

对齐引导的分数匹配用于扩散模型中的文本到图像对齐

Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

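Affiliations: Graduate School of AI, KAIST, South Korea

AI summary: Proposes a lightweight, reward-free post-training method that integrates contrastive alignment guidance directly into the score-matching objective of diffusion models to address over-penalization and counting errors in text-image alignment.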

Comments ICML 2026, Project page: https://jaayeon.github.io/AGSM

AI中文摘要

扩散模型生成高度逼真的图像，但通常难以实现精确的文本-图像对齐。虽然最近的后训练方法使用外部奖励或人类偏好信号改善对齐，但其性能严重依赖奖励质量，且不直接解决扩散过程中的对齐问题。最近的无奖励方法如SoftREPA表明，通过对比学习优化软文本令牌可以有效改善文本-图像表示对齐，优于标准参数高效微调基线。然而，对比公式可能过度惩罚负对，表现为典型的失败案例，如过度计数和重复。为解决此问题，我们提出一种轻量级、无奖励的后训练方法，通过将对比对齐引导直接整合到扩散模型的分数匹配目标中来细化软令牌。通过在分数级别分配对齐方向，我们的方法缓解了这些限制，并产生更连贯和语义忠实的生成。实验表明，我们的方法与SoftREPA相当，同时显著改善了其失败案例，在GenEval基准上计数准确性提高了超过35%。我们的方法可无缝应用于现有扩散骨干网络（SD1.5、SDXL和SD3），并与现有的基于RL的扩散后训练方法互补。项目页面：https://jaayeon.github.io/AGSM

英文摘要

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

2605.30027 2026-05-29 cs.CV cs.IR 版本更新

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

DocRetriever：面向多模态文档检索的即插即用框架与综合基准

Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao

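Affiliations: Zhejiang University; Huawei Technologies Co., Ltd

AI summary: Proposes DocRetriever, a plug-and-play framework that addresses semantic obfuscation and generalization bottlenecks in multimodal document retrieval through layout-aware sparse embeddings and a reasoning-augmented reranker, and builds the MultiDocR benchmark for more rigorous evaluation.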

Comments Accepted at KDD 2026 Research Track

DOI: 10.1145/3770855.3817680

AI中文摘要

多模态文档包含表格、图形和布局等多样元素，可能使检索任务复杂化。当前方法通常将密集视觉嵌入模型与有监督重排序器相结合以实现高精度检索，但存在固有局限性。首先，密集嵌入的粗粒度特性往往模糊显式语义，无法利用结构显著信息。其次，有监督重排序模型面临泛化瓶颈，其性能严重依赖领域特定训练数据。此外，现有基准通常缺乏多样化的评估维度和全面的相关性标注，限制了可靠评估。为解决这些挑战，我们提出DocRetriever，一个即插即用框架。它通过布局感知的稀疏嵌入技术增强视觉检索，实现无需光学字符识别（OCR）开销的有效混合编码。我们还引入了一个可泛化的重排序器，利用推理增强的示范和优化采样来提高少样本场景下的准确性。最后，我们构建了一个新基准MultiDocR，以实现更严格的评估。在多个基准上的实验验证了DocRetriever相对于最先进方法的优越性。

英文摘要

Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.

2605.30011 2026-05-29 cs.CV cs.AI 版本更新

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

VisualThink-VLA：用于高效低延迟视觉-语言-动作策略的视觉中间推理

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang

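Affiliations: Zhejiang University; Cornell University; National University of Singapore; Xi'an University of Electronic Science and Technology

AI summary: Proposes the VisualThink-VLA framework, which uses visual intermediate reasoning and a selective routing mechanism to cut inference latency from multiple seconds to sub-second while maintaining high accuracy.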

AI中文摘要

近期工作开始为视觉-语言-动作（VLA）策略配备显式的中间推理。然而，在具身控制中，文本思维链并不适用：无关或弱文本信息会干扰动作预测，而自回归文本解码为实时闭环执行增加了过多延迟。我们提出VISUALTHINK-VLA，一个用于准确、低延迟VLA策略的视觉中间推理框架。我们的引导哲学是通过有效的视觉思维来指导动作：VISUALTHINK-VLA通过一个紧凑的视觉证据接口引导动作预测，该接口在避免解码开销的同时保持空间精度。此外，为了进一步提升性能和效率，VISUALTHINK-VLA采用了一种定制的选择性路由机制来学习视觉证据令牌，从而实现低延迟推理同时保持高容量专用性。我们还引入了VisualEvidence-Kit，这是一个以VisualEvidence-Agent为核心的监督与审计资源，该智能体构建了754.7k条VLA指令的VisualEvidence-Set，用于路由监督和反事实忠实性测试。在多个基准测试和真实机器人评估中，VISUALTHINK-VLA在大多数基准测试上实现了最高成功率，同时将推理增强基线的多秒延迟降至亚秒级。例如，在BridgeData V2上，它将步骤延迟从ECoT的8.377秒降至0.367秒，实现了22.8倍的加速。

英文摘要

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a VisualEvidence-Set of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8x speedup.
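
The reported speedup can be checked directly from the two step latencies quoted in the abstract. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the relative latency reduction is a derived figure that the abstract does not state.

```python
# Step latencies on BridgeData V2 as quoted in the abstract, in seconds.
ecot_latency_s = 8.377
visualthink_latency_s = 0.367

# Speedup is the ratio of the baseline latency to the new latency.
speedup = ecot_latency_s / visualthink_latency_s

# Relative latency reduction (derived here, not reported in the abstract).
latency_reduction = 1.0 - visualthink_latency_s / ecot_latency_s

print(f"speedup: {speedup:.1f}x, latency reduction: {latency_reduction:.1%}")
```

The ratio rounds to 22.8, which matches the figure given in the abstract.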

2605.30010 2026-05-29 cs.CV 版本更新

EarlyTom: Early Token Compression Completes Fast Video Understanding

EarlyTom: 早期令牌压缩实现快速视频理解

Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang

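Affiliations: Zhejiang University; Westlake University; Alibaba Cloud Computing

AI summary: To address the inefficiency of the vision encoding stage in video large language models, proposes EarlyTom, a training-free token compression framework that compresses tokens early, inside the vision encoder, substantially reducing time-to-first-token and raising throughput.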

Comments Accepted by CVPR 2026. 16 pages, 8 figures, 8 tables. Project page: https://viridisgreen.github.io/EarlyTom

AI中文摘要

视频大语言模型（Video-LLMs）在视频理解任务中展现了强大的能力。然而，处理大量视觉令牌带来的低效率仍然阻碍了它们的实际部署。尽管近期的方法在保持与全令牌基线相当准确性的同时实现了极低的令牌保留率，但大多数方法仅在预填充的后期阶段进行压缩，视觉编码器的效率未得到优化。在本文中，我们首先表明视觉编码对首令牌时间（TTFT）贡献很大。因此，与仅在视觉编码器之后压缩视觉令牌不同，在编码器内部进行压缩仍有很大的探索空间。基于这一见解，我们提出了EarlyTom，一种无训练的令牌压缩框架，在视觉编码器内部执行早期视觉令牌压缩，从而显著降低TTFT并提高吞吐量。此外，我们引入了一种解耦的空间令牌选择策略，提高了整体压缩效果。在单个NVIDIA A100 GPU上，对于LLaVA-OneVision-7B模型，EarlyTom将TTFT降低高达2.65倍，FLOPs降低高达61%，同时保持与全令牌基线相当的准确性。这些改进显著增强了Video-LLMs在实际生产场景中部署的实用性。

英文摘要

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

2605.29997 2026-05-29 cs.CV 版本更新

Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball

Li Yin, Qin Haobin, Tomohiro Suzuki, Calvin Yeung, Mariko Isogawa, Keisuke Fujii

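Affiliations: RIKEN Center for Advanced Intelligence Project

AI summary: Proposes MAEM, a training-free framework that uses a monocular 3D human mesh recovery model and a two-stage epipolar matching strategy to address occlusion and appearance similarity in multi-view multi-person 3D pose estimation for team sports.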

AI中文摘要

团队运动场景中的多视角多人3D姿态估计因球员遮挡、队服造成的外观相似性以及标注多视角数据的稀缺而仍然具有挑战性，这些因素限制了基于学习方法的有效性和泛化能力。相比之下，无训练方法的性能固有地受限于2D关键点检测的准确性和跨视角关联的鲁棒性。为应对这些挑战，我们提出了网格感知的对极匹配（MAEM），一种用于多视角多人3D姿态估计的无训练框架。我们的方法采用单目3D人体网格恢复模型作为前端，并基于恢复的网格输出引入了一种两阶段对极匹配策略。具体而言，所提出的框架结合了基于并查集的聚类与每关节三角测量，以实现鲁棒的跨视角关联和准确的3D姿态重建。在两个公开的多视角篮球数据集上的实验表明，MAEM持续优于现有的无训练关联基线，同时在室内和室外篮球场景中实现了有竞争力的仅RGB性能。MAEM在SportCenter EPFL上达到MPJPE/PA-MPJPE分数59.8/40.7毫米，在Human-M3 Basketball上达到74.0/51.8毫米，突显了密集网格几何在无需目标域训练或微调的情况下进行跨视角关联的有效性。

英文摘要

Multi-view multi-person 3D pose estimation in team sports scenarios remains challenging due to player occlusions, appearance similarity caused by team uniforms, and the scarcity of annotated multi-view data, all of which limit the effectiveness and generalization capability of learning-based methods. In contrast, the performance of training-free approaches is inherently constrained by the accuracy of 2D keypoint detection and the robustness of cross-view association. To address these challenges, we propose Mesh-Aware Epipolar Matching (MAEM), a training-free framework for multi-view multi-person 3D pose estimation. Our method employs a monocular 3D human mesh recovery model as the frontend and introduces a two-stage epipolar matching strategy based on the recovered mesh outputs. Specifically, the proposed framework combines disjoint-set-union-based clustering with per-joint triangulation to achieve robust cross-view association and accurate 3D pose reconstruction. Experiments on two public multi-view basketball datasets demonstrate that MAEM consistently outperforms existing training-free association baselines while achieving competitive RGB-only performance in both indoor and outdoor basketball scenarios. MAEM achieves MPJPE/PA-MPJPE scores of 59.8/40.7 mm on SportCenter EPFL and 74.0/51.8 mm on Human-M3 Basketball, highlighting the effectiveness of dense mesh geometry for cross-view association without requiring target-domain training or fine-tuning.
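
The disjoint-set-union clustering mentioned above can be sketched as follows. This is a generic union-find grouping of per-camera detections, assuming that some earlier matching step has already produced pairs of detections judged to be the same person; the detection indices, the `matches` list, and the function names are hypothetical, and the paper's actual matching criteria and constraints are not reproduced here.

```python
class DisjointSet:
    """Minimal union-find structure over integer ids 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_detections(num_detections, matched_pairs):
    """Group detections into identities from pairwise cross-view matches."""
    dsu = DisjointSet(num_detections)
    for a, b in matched_pairs:
        dsu.union(a, b)
    groups = {}
    for idx in range(num_detections):
        groups.setdefault(dsu.find(idx), []).append(idx)
    return sorted(groups.values())


# Hypothetical setup: detections 0-1 come from camera A, 2-3 from camera B,
# and 4 from camera C. Each pair below is assumed to pass an epipolar check.
matches = [(0, 2), (2, 4), (1, 3)]
clusters = cluster_detections(5, matches)
print(clusters)
```

Each resulting cluster collects the views of one person and could then be passed to per-joint triangulation. This sketch does not enforce that a cluster contains at most one detection per camera, which a real pipeline would likely need.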

2605.29935 2026-05-29 cs.CV cs.AI 版本更新

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

CityGen: 结构引导的城市风格合成用于跨城市自动驾驶

Zezhong Qian, Zhao Yang, Lu Tan, Zhihao Yan, Weiyi Hong, Haizhuang Liu, Yawei Jueluo

Affiliations: Jiangsu Cytoderm Intelligent Technology Co., Ltd., China; Xi'an Jiaotong University, Xi'an, China; Tsinghua University, Beijing, China; University of Science and Technology of China, Hefei, China

AI summary: Proposes CityGen, a diffusion-based generative framework that achieves zero-label city adaptation through HD-map conditioning and city-level visual prompts, improving the robustness of cross-city autonomous driving on perception, segmentation, and planning tasks.

AI中文摘要

自动驾驶系统通常在有限的地理区域内进行训练和评估，这阻碍了它们在新城市部署时的可扩展性。然而，外观、道路拓扑和交通模式的显著域偏移常常导致跨城市部署时性能严重下降。现有的基于域适应、数据增强或合成数据生成的方法通常依赖于标注的目标数据、城市特定的标注或任务特定的设计，限制了它们在整体评估中的可扩展性和有效性。在本文中，我们引入了CityTransfer-Bench，一个地理上不重叠的基准，用于评估跨城市泛化在感知、分割和规划任务上的表现，并提出了CityGen，一个基于扩散的生成框架，通过城市级视觉提示引导的高清地图条件合成实现零标签城市适应。大量实验表明，CityGen在多个任务上持续提高了跨城市鲁棒性，为可泛化的自动驾驶建立了可扩展且标签高效的基石。

英文摘要

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.

2605.29932 2026-05-29 cs.LG cs.CV 版本更新

Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression

治疗条件扩散用于预测神经退行性疾病进展

Danylo Boiko, Viktoriia Mishkurova

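Affiliations: Innoloft Inc.; Bogomolets National Medical University

AI summary: Proposes a treatment-conditioned diffusion framework that conditions the generative process on a patient's screening DaTscan image and levodopa equivalent daily dose over one year to predict high-fidelity future brain states, significantly outperforming the baseline in clinical fidelity.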

Comments 9 pages, 5 figures, 1 table

AI中文摘要

预测帕金森病等神经退行性疾病的进展对于有效的长期规划和个性化治疗干预至关重要。现有系统通常产生忽略纵向神经影像丰富结构的标量临床评分，而传统生成方法则遭受解剖细节丢失和细微进展模式模糊的问题。为此，我们引入了一种新颖的治疗条件扩散框架，通过将生成过程条件化于患者的筛查DaTscan图像和一年内左旋多巴等效日剂量，预测高保真的未来脑状态。该流程使用基于Transformer的编码器表示非线性、时间依赖的药理学动态，并通过一个关注生物关键区域的多权重感兴趣区域掩码优化生成。实验评估表明，我们的框架保持了清晰的解剖边界，并在临床保真度上显著优于基线，实现了MSE降低14.0%，MAE降低7.2%，SSIM提高4.9%。

英文摘要

Forecasting the progression of neurodegenerative diseases, such as Parkinson's disease, is essential for effective long-term planning and personalized therapeutic intervention. Existing systems typically produce scalar clinical scores that ignore the rich structure of longitudinal neuroimaging, while traditional generative approaches suffer from a loss of anatomical detail and blurring of subtle progression patterns. To address this, we introduce a novel treatment-conditioned diffusion framework that predicts high-fidelity future brain states by conditioning the generative process on patients' screening DaTscan images and levodopa equivalent daily dose over one year. The pipeline uses a Transformer-based encoder to represent non-linear, time-dependent pharmacological dynamics and optimizes generation through a multi-weight region-of-interest mask that focuses on biologically critical areas. Experimental evaluation shows that our framework maintains sharp anatomical boundaries and significantly improves clinical fidelity relative to the baseline, achieving 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM.
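
A multi-weight region-of-interest mask can be read as a per-pixel weighting of the reconstruction error. The sketch below shows one plausible form, a weighted mean squared error; the function name `roi_weighted_mse`, the toy 2x2 images, and the specific weights are assumptions made for illustration, since the abstract does not specify how the mask enters the objective.

```python
def roi_weighted_mse(prediction, target, weight_mask):
    """Mean squared error in which each pixel is scaled by its mask weight."""
    weighted_error = 0.0
    total_weight = 0.0
    for pred_row, target_row, weight_row in zip(prediction, target, weight_mask):
        for p, t, w in zip(pred_row, target_row, weight_row):
            weighted_error += w * (p - t) ** 2
            total_weight += w
    return weighted_error / total_weight


# Hypothetical 2x2 images; the bottom-right pixel stands in for a critical region.
prediction = [[0.0, 1.0], [1.0, 0.0]]
target = [[0.0, 0.0], [1.0, 1.0]]
mask = [[1.0, 1.0], [1.0, 3.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]

loss = roi_weighted_mse(prediction, target, mask)
baseline = roi_weighted_mse(prediction, target, uniform)
print(f"weighted: {loss:.4f}, uniform: {baseline:.4f}")
```

With the heavier weight on the critical pixel, the same two errors produce a larger loss than under uniform weighting, so an optimizer would be pushed to fix that region first.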

2605.29911 2026-05-29 cs.LG cs.CV 版本更新

Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation

通过逐像素生成图像插值减少空间推进薄膜冷却分析中的实验测试

Adam T. Müller, Philipp J. Teuffel, Konstantin Manassis, Nicolaj C. Stache

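Affiliations: Heilbronn University of Applied Sciences; Center for Machine Learning; Max-Planck-Str. 39; German Aerospace Center (DLR); Institute of Space Propulsion

AI summary: Proposes a machine learning method based on a lightweight feed-forward neural network with positional encoding that performs image regression from sparse experimental measurements, reducing the need for physical testing in film cooling studies for propulsion systems.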

Comments Presented at the 11th European Conference for Aeronautics and Aerospace Sciences (EUCASS), 2025, DOI: 10.13009/EUCASS2025-285

DOI: 10.13009/EUCASS2025-285

AI中文摘要

我们提出了一种从稀疏实验测量中进行图像回归的机器学习方法。我们展示了该方法在推进系统开发中薄膜冷却研究中的应用，旨在减少对大量物理测试的需求。我们的方法采用带有位置编码的轻量级前馈神经网络，根据输入参数生成图像。在真实和合成数据上的验证表明，该方法在减少30%测量量的同时，实现了高图像相似度（RMSE < 8%，SSIM > 93%）。我们进一步提出了一种知识驱动的扩展，用于生成图像的局部适应性。该方法显著减少了所需测试次数，同时保持了高质量数据，从而能够高效优化冷却剂喷射器配置，其应用范围超越航空航天领域。

英文摘要

We propose a machine learning approach for image regression from sparse experimental measurements. We show the application of the proposed method to film cooling studies in propulsion system development, aiming to reduce the need for extensive physical testing. Our method employs a lightweight feed-forward neural network with positional encoding to generate images conditioned on input parameters. Validated on real and synthetic data, it achieves high image similarity (RMSE < 8%, SSIM > 93%) while maintaining accuracy with a 30% reduction in measurements. We further propose a knowledge-informed extension for local adaptability of the generated images. This approach significantly reduces required tests while preserving high-quality data, enabling efficient optimization of coolant injector configurations with applications beyond aerospace.
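
Positional encoding in pixelwise regression networks is commonly implemented as a set of sine and cosine features at increasing frequencies. The sketch below shows that common Fourier-feature form under stated assumptions: the frequency schedule, the number of frequencies, and the idea of encoding normalized pixel coordinates together with the experimental parameters are illustrative choices, not details confirmed by the abstract.

```python
import math

def positional_encoding(x, num_frequencies=4):
    """Map a scalar in [0, 1] to sine/cosine features at doubling frequencies."""
    features = []
    for k in range(num_frequencies):
        freq = (2.0 ** k) * math.pi
        features.append(math.sin(freq * x))
        features.append(math.cos(freq * x))
    return features

def encode_query(pixel_xy, parameters, num_frequencies=4):
    """Build the network input for one pixel: encoded coordinates plus parameters."""
    encoded = []
    for value in (*pixel_xy, *parameters):
        encoded.extend(positional_encoding(value, num_frequencies))
    return encoded

# Hypothetical query: a normalized pixel position and one normalized test parameter.
query = encode_query(pixel_xy=(0.25, 0.5), parameters=(0.8,))
print(len(query))
```

A feed-forward network would take a vector like `query` and predict the value of that single pixel, so a full image is produced by evaluating every pixel position for a given parameter setting.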

2605.29894 2026-05-29 cs.CV 版本更新

Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning

训练智能体而非专家：学习利用异构专家进行多轮视觉推理

Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan

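Affiliations: Sun Yat-sen University; HKUST; Harbin Institute of Technology

AI summary: Proposes VisHarness, a trainable visual agent that decouples high-level perception and reasoning from low-level task execution and learns to harness heterogeneous visual expert models, solving general visual tasks through multi-turn interaction with only lightweight training.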

AI中文摘要

计算机视觉的最新进展产生了大量用于检测、分割、计数和其他视觉任务的强大专用模型。然而，这些模型通常针对孤立的任务形式进行优化，使得直接支持通用视觉智能变得困难，尤其是当任务需要复杂的语言理解和密集的小物体感知时。在本文中，我们提出了VisHarness，一种可训练的视觉智能体，它将高层感知、推理和决策与低层任务执行解耦。VisHarness不是训练模型来解决特定的视觉任务，而是学习利用一组精心设计的异构视觉专家。这种范式保留了智能体的通用智能，同时充分利用了专用视觉模型在具体视觉任务中的精度优势。仅通过轻量训练，VisHarness就能学习到可泛化的视觉专家利用策略，并通过与视觉专家模型的多轮交互，在各种复杂条件下解决常见的基础视觉任务。为了在实时环境中实现高效的在策略强化学习训练，我们引入了动态视觉记忆归档，这缓解了与视觉专家模型多轮交互导致的快速累积的视觉令牌开销。在涵盖推理分割、广义指代分割、密集小物体检测和指代计数的四个代表性基准上的实验表明，VisHarness显著优于现有的通用模型，并与任务专用模型相比取得了具有竞争力或更优的性能。

英文摘要

Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-purpose visual intelligence, especially when a task requires complex language understanding and dense small-object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert-harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi-turn interactions with visual expert models. To enable efficient on-policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual-token overhead caused by multi-turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance compared with task-specific models.


2605.29891 2026-05-29 cs.CV 版本更新

DVSM: Decoder-only View Synthesis Model Done Right

DVSM: 正确的仅解码器视图合成模型

Cheng Sun, Jaesung Choe, Min-Hung Chen, Ryo Hachiuma, Yu-Chiang Frank Wang

发表机构 * NVIDIA ； National Taiwan University（国立台湾大学）

AI总结提出仅解码器架构DVSM，通过隐式KV-cache表示场景，在相同渲染复杂度下以更少参数超越编码器-解码器变体，并利用共享权重、基础模型先验和分阶段块大小优化效率与质量，在多个基准上实现新视点合成的最优结果。

Comments Code at https://github.com/NVLabs/dvsm

详情

AI中文摘要

近期的大型视图合成模型（LVSMs）倡导一种编码器-解码器架构，将重建和渲染分离到不同的网络中。我们重新审视了这种设计。通过控制实验，我们表明仅解码器架构（将场景隐式表示为KV-cache）在相同渲染复杂度下使用更少参数，性能优于编码器-解码器变体。进一步分析表明，在颜色输入重建网络和仅相机渲染网络之间共享权重，能更好地对齐同一视点下的特征，从而促进图像合成。基于这一发现，我们的模型DVSM进一步结合了基础模型先验和分阶段块大小调整，以改进效率与质量的权衡。我们的结果在多个基准上为新颖视图合成设立了新的最先进水平，在某些情况下，甚至在密集输入视图下优于每场景优化的3DGS。

英文摘要

Recent Large View Synthesis Models (LVSMs) advocate an encoder-decoder architecture that separates reconstruction and rendering into distinct networks. We re-examine this design. Through controlled experiments, we show that a decoder-only architecture, which represents scenes implicitly as a KV-cache, outperforms encoder-decoder variants while using fewer parameters at identical rendering complexity. Further analysis shows that sharing weights between the color-input reconstruction network and the camera-only rendering network better aligns their features at the same viewpoint, facilitating image synthesis. Building on this finding, our model, dubbed DVSM, further incorporates foundation model priors and stage-wise patch sizing for an improved efficiency-quality tradeoff. Our results establish a new state of the art for novel-view synthesis across multiple benchmarks, in some cases even outperforming per-scene-optimized 3DGS under dense input views.


2605.29881 2026-05-29 cs.CV cs.AI 版本更新

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

通过屏障调控自适应闭式引导缓解视觉语言模型中的幻觉

Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh

发表机构 * Indian Institute of Technology Guwahati（印度理工学院古瓦哈提分校）

AI总结提出BRACS框架，通过监测视觉注意力并仅在接地退化时进行闭式修正，无需训练即可有效减少LVLM中的物体幻觉。

详情

AI中文摘要

大型视觉语言模型（LVLMs）经常凭空生成输入图像中不存在的物体，这主要是因为随着解码进行，视觉接地减弱。现有的推理时缓解方法在生成过程中修改logits或隐藏状态，但它们存在三个关键限制：缺乏明确的接地目标，即使在模型已经良好接地时也进行干预，以及使用固定的修正强度，无法适应接地失败的严重程度。我们提出BRACS（屏障调控自适应闭式引导），一种无需训练的引导框架，通过屏障调控自适应闭式引导解决这些问题。BRACS监测模型自身的注意力以衡量视觉接地，并仅在接地恶化时对隐藏状态进行修正。修正更新以闭式解析计算，无需训练辅助网络或重新训练模型。在LLaVA-1.5-7B和Qwen-VL-Chat上的实验表明，BRACS在幻觉基准上持续优于先前方法，将CHAIR$_s$降低9.4个点，将POPE F1提高2.7个点，同时在四个通用多模态基准上匹配或提升性能。BRACS还保持高效，吞吐量达到贪心解码的80%，平均速度是各基线方法的1.3倍。

英文摘要

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.


2605.29868 2026-05-29 cs.CR cs.CV cs.DC 版本更新

Ciphera: A Decentralised Biometric Identity Framework

Ciphera: 一种去中心化的生物特征身份框架

Ankit Kanaiyalal Prajapati, Shahzad Memon, Mohammed Mahir Rahman, Ameer Al-Nemrat

发表机构 * University of East London（东伦敦大学）

AI总结提出Ciphera框架，结合隐私保护面部识别、多节点验证、IPFS凭证元数据存储和区块链锚定撤销，实现去中心化生物特征身份管理，并通过功能、性能、安全性和分布式一致性评估验证其可行性。

Comments Accepted at the CyberAI 2026 Conference, and to be indexed at IEEE-Scopus

详情

Journal ref: CyberAI 2026 (https://cyberai-conf.org/)

AI中文摘要

中心化的生物特征身份系统使用户面临单点故障、不透明的验证过程以及不可逆的生物特征泄露风险。去中心化标识符（DID）和可验证凭证（VC）提供了更强的隐私保障，但它们与生物特征认证和分布式验证的整合仍未被充分探索。本文提出了Ciphera，一个去中心化的生物特征身份框架，结合了隐私保护的面部识别、多节点验证、基于IPFS的凭证元数据存储和区块链锚定的撤销。在功能、性能、安全性和分布式一致性维度上评估，Ciphera实现了81%的功能成功率，具有稳定的注册和认证，但存在可测量的撤销传播延迟和偶尔的审计日志不一致。性能测试显示，在并发多节点条件下，p95验证延迟约为820毫秒，低于1秒。安全性分析确认了强大的机密性和完整性保证，但不完整的活体检测使其容易受到深度伪造和重放攻击。结果证明了去中心化生物特征身份的可行性，同时指出了生产级部署的关键工程挑战。

英文摘要

Centralised biometric identity systems expose users to single points of failure, opaque verification processes, and irreversible biometric compromise. Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) offer stronger privacy guarantees, yet their integration with biometric authentication and distributed verification remains insufficiently explored. This paper presents Ciphera, a decentralised biometric identity framework combining privacy-preserving facial recognition, multi-node verification, IPFS-based credential metadata storage, and blockchain-anchored revocation. Evaluated across functional, performance, security, and distributed consistency dimensions, Ciphera achieved an 81% functional success rate, with stable enrolment and authentication but measurable revocation propagation delays and occasional audit-log inconsistencies. Performance testing demonstrated sub-second p95 verification latency of approximately 820ms under concurrent multi-node conditions. Security analysis confirmed strong confidentiality and integrity guarantees, though incomplete liveness detection leaves susceptibility to deepfake and replay attacks. The results demonstrate the feasibility of decentralised biometric identity while identifying key engineering challenges for production-grade deployment.


2605.29858 2026-05-29 cs.CV 版本更新

低倍率SEM可能足够：用于氧化锆增韧氧化铝多尺度断裂原因分类的可解释深度学习

Julian Schmid, Pawel Astankow, Tom Vater, Julius Beck, Robert Cichon, Danny Krautz

发表机构 * CeramTec GmbH（CeramTec公司）； School of Life Sciences, University of Applied Sciences（应用科学与艺术北瑞士学院生命科学学院）

AI总结提出一种可解释的视觉变换器工作流，利用低倍率SEM图像对氧化铝基复合材料植入物断裂原因进行自动分类，达到与高倍率相当的准确率。

详情

AI中文摘要

可靠识别氧化铝基复合材料髋关节和膝关节植入物的断裂起源对于质量保证和患者安全至关重要，然而当前的断口分析工作流程耗时、部分主观且依赖高倍率扫描电子显微镜（SEM）。我们提出了一种可解释的视觉变换器（ViT）工作流，用于对广泛用于全关节置换的氧化铝基复合材料（BIOLOX delta, CeramTec GmbH）的断裂原因进行自动分类。从五年的生产爆破和验证测试中整理了8,493张SEM图像（50倍至10,000倍）的数据集，并按照制造链定义的三个缺陷类别（生坯、硬加工和材料缺陷）进行标注。在严重的类别不平衡下，微调后的ViT在分层五折交叉验证中达到了0.907的准确率和0.888的宏F1分数，两阶段感知哈希/SSIM泄漏审计确认了样本重叠可忽略。值得注意的是，低倍率（50倍）下的性能与高倍率（1k-10k倍）相当，表明宏观特征——镜面几何和羽状纹线场——已经编码了足够的诊断信号。Grad-CAM归因一致地定位在经典的断口线索（镜面、羽状纹、孔隙、加工痕迹）上，与既定的断口分析标准一致。这些结果共同将可解释ViT定位为陶瓷植入物质量保证的补充工具，能够实现低倍率预筛选并减少对耗时的高倍率检查的依赖。

英文摘要

Reliable identification of fracture origins in alumina matrix composite hip and knee implants is critical for quality assurance and patient safety, yet current fractographic workflows are time-consuming, partly subjective, and reliant on high-magnification scanning electron microscopy (SEM). We present an interpretable vision-transformer (ViT) workflow for automated classification of fracture causes in an alumina matrix composite (BIOLOX delta, CeramTec GmbH) widely used in total joint replacements. A dataset of 8,493 SEM images (50x-10,000x) was curated from five years of in-production burst and proof tests and annotated into three defect categories defined along the manufacturing chain: green body, hard machining, and material defects. Under severe class imbalance, the fine-tuned ViT reached an accuracy of 0.907 and a macro-F1 of 0.888 in stratified five-fold cross-validation, with a two-stage perceptual-hash/SSIM leakage audit confirming negligible specimen overlap. Notably, performance at low magnification (50x) was comparable to that at high magnification (1k-10kx), indicating that macro-scale features - mirror geometry and hackle line fields - already encode sufficient diagnostic signal. Grad-CAM attributions consistently localised on canonical fractographic cues (mirrors, hackles, pores, machining marks), aligning with established fractographic criteria. Together, these results position interpretable ViTs as a complementary tool for ceramic-implant quality assurance, enabling low-magnification pre-screening and reducing reliance on time-intensive high-magnification inspection.


2605.29793 2026-05-29 cs.CV 版本更新

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

更少步骤，更优性能：基于语言的高效跨模态视频片段修剪用于视频时刻检索

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, Renfu Li

发表机构 * Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology（湖北大数据安全工程研究中心，网络空间安全学院，华中科技大学）； Peking University（北京大学）； Henan University（河南大学）； Dalian University of Technology（大连理工大学）； Sichuan University（四川大学）； Shenzhen University（深圳大学）； Huazhong University of Science and Technology（华中科技大学）

AI总结提出SpotVMR方法，通过可学习的片段搜索模型和低成本语义索引特征，高效修剪查询相关视频片段，作为即插即用模块提升现有VMR方法的效率与性能。

Comments Published in AAAI 2024

详情

AI中文摘要

给定一个未修剪的视频和一个句子查询，基于语言的视频时刻检索（VMR）旨在定位目标查询相关的时刻。由于未修剪的视频过长，几乎所有现有的VMR方法首先将每个未修剪的视频稀疏下采样为多个固定长度的视频片段，然后与查询特征和昂贵的片段特征进行多模态交互以进行推理，这对于跨越数小时的长真实世界视频是不可行的。由于视频被下采样为固定长度的片段，一些与查询相关的帧可能被过滤掉，这将模糊目标时刻的特定边界，将相邻的不相关帧作为新边界，容易导致跨模态错位，并引入边界偏差和推理偏差。为此，在本文中，我们提出了一种高效的方法SpotVMR，用于修剪与查询相关的片段。此外，我们提出的SpotVMR可以作为即插即用模块，在保持良好检索性能的同时提高最先进VMR方法的效率。特别地，我们首先设计了一个新颖的片段搜索模型，该模型学习根据语言查询识别有希望的视频区域进行搜索。然后，我们引入一组低成本的语义索引特征来捕获对象和交互的上下文，这些上下文提示在哪里搜索查询相关的时刻。此外，利用蒸馏损失来解决片段选择器和VMR模型端到端联合训练中出现的优化问题。在三个具有挑战性的数据集上的大量实验证明了其有效性。

英文摘要

Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions with the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Because the video is down-sampled into fixed-length clips, some query-related frames may be filtered out, which blurs the specific boundary of the target moment and takes adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To this end, in this paper, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. Moreover, SpotVMR can serve as a plug-and-play module that improves the efficiency of state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. In addition, a distillation loss is used to address the optimization issues arising from end-to-end joint training of the clip selector and the VMR model. Extensive experiments on three challenging datasets demonstrate its effectiveness.


2605.29776 2026-05-29 cs.CV 版本更新

Improving CLIP Adaptation by Breaking Tail Alignment for Source-Free Cross-Domain Few-Shot Learning

通过打破尾部对齐改进CLIP适应：面向无源跨域小样本学习

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

发表机构 * School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China（华中科技大学计算机科学与技术学院）； Institute of Artificial Intelligence, Huazhong University of Science and Technology, Wuhan, China（华中科技大学人工智能研究院）

AI总结针对CLIP在跨域小样本学习中的性能下降问题，提出自适应尾头对齐策略（ATHA），通过有选择地削弱低相似度图像令牌的对齐来减少过拟合，在四个基准上取得最优结果。

Comments Accepted by ICML 2026

详情

AI中文摘要

视觉语言模型（如CLIP）展现出强大的零样本泛化能力，但在目标域训练数据稀缺的跨域场景（跨域小样本学习，CDFSL）中性能显著下降。本文聚焦于基于CLIP的CDFSL任务中的目标域小样本微调。现有的微调范式将所有图像块令牌与其对应的文本嵌入统一对齐。然而，我们发现一个反直觉的现象：主动将某些低相似度图像令牌（称为“尾部令牌”）推离其文本嵌入能持续提升目标域性能。我们深入探究这一现象并给出新的解释：在巨大的域偏移和稀缺的训练数据下，模型难以从视觉输入中提取语义信息；因此，“对齐有益”这一普遍认识仅对已包含足够语义信息的令牌成立；对于尾部令牌，强制对齐会导致对稀缺训练数据的过度拟合，而打破对齐反而更有益。受此启发，我们提出自适应尾头对齐（ATHA），一种新颖的CLIP微调策略，将传统的统一对齐范式转变为自适应对齐范式，同时包含对齐增强和削弱。在四个具有挑战性的CDFSL基准上的大量实验验证了我们的最先进性能。我们的代码可在 https://github.com/shuaiyi308/ATHA 获取。

英文摘要

Vision-Language Models (VLMs) such as CLIP demonstrate strong zero-shot generalization, but their performance significantly degrades in cross-domain scenarios with scarce target-domain training data (Cross-Domain Few-Shot Learning, CDFSL). In this paper, we focus on the target-domain few-shot finetuning in the CLIP-based CDFSL task. Prevailing finetuning paradigms uniformly align all image patch tokens with their corresponding textual embeddings. However, we find a counterintuitive phenomenon: actively pushing away certain low-similarity image tokens, termed "tail tokens", from their textual embeddings consistently improves target-domain performance. We delve into this phenomenon and provide a novel interpretation: under great domain shifts and scarce training data, the model can hardly extract semantic information from visual inputs; therefore, the common belief of alignment is valid only for tokens already containing sufficient semantic information; for tail tokens, forcing the alignment would lead to excessive overfitting to the scarce training, while breaking the alignment is more useful. Motivated by this, we propose Adaptive Tail-Head Alignment (ATHA), a novel fine-tuning strategy for CLIP that transforms the conventional uniform alignment paradigm to an adaptive alignment paradigm, with both alignment strengthening and weakening. Extensive experiments on four challenging CDFSL benchmarks validate our state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/ATHA.


2605.29773 2026-05-29 cs.CV cs.AI cs.RO 版本更新

Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

能量感知NECO：用于语义分割中单次逐像素分布外检测

Boyuan Zhang, Huanshan Huang, Yifei Cao

发表机构 * Ecole Polytechnique, Institut Polytechnique de Paris（巴黎综合理工学院，巴黎理工学院）； CIAD, UTBM, Université Marie et Louis Pasteur（CIAD、UTBM、玛丽与路易·巴斯德大学）； U2IS, ENSTA, Institut Polytechnique de Paris（U2IS、ENSTA、巴黎理工学院）

AI总结提出一种结合NECO几何比率和能量分数的混合方法，实现单次前向传播的逐像素分布外检测，在miniMUAD数据集上AUROC达0.8539，优于单独使用NECO或能量分数。

Comments 7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)

详情

AI中文摘要

移动机器人的可靠语义分割需要准确的密集预测和分布偏移下的鲁棒不确定性估计。强不确定性基线如蒙特卡洛Dropout通常需要重复的随机前向传播，难以在边缘平台上部署。我们提出能量感知NECO，一种用于语义分割的单次逐像素分布外（OOD）检测器。该方法将从解码器特征计算的居中NECO风格几何比率与基于logit的能量分数相结合。两个分量均使用在纯分布内验证集上拟合的统计量进行标准化，并通过凸组合融合。我们在miniMUAD子集上使用真实像素级OOD标签评估该方法。所提出的混合分数达到0.8539的AUROC，优于仅NECO（0.8280）、仅能量（0.8171）和集成预测熵基线（0.8124）。额外的定性和操作点分析表明，混合检测器在保持单次设计效率优势的同时，提高了整体排名性能。代码可在 https://github.com/boyuan-zhangx/Energy-Aware_NECO 获取。

英文摘要

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
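The fusion step described above (standardize each score with in-distribution statistics, then take a convex combination) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the weight `alpha`, and the convention that both scores are oriented so that larger means "more OOD" are assumptions, and the NECO-style geometric ratio is treated as a precomputed input.

```python
import numpy as np

def energy_score(logits):
    """Negative log-sum-exp over the class axis (last axis).

    Assumed convention: a higher value means the pixel looks more OOD.
    """
    logits = np.asarray(logits, dtype=float)
    m = logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    return -(m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1)))

def fit_standardizer(id_scores):
    """Mean and standard deviation fitted on in-distribution validation scores."""
    id_scores = np.asarray(id_scores, dtype=float)
    return id_scores.mean(), id_scores.std() + 1e-8

def hybrid_ood_score(geom, energy, geom_stats, energy_stats, alpha=0.5):
    """Convex combination of two z-scored per-pixel scores (alpha is hypothetical)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    g = (np.asarray(geom, dtype=float) - geom_stats[0]) / geom_stats[1]
    e = (np.asarray(energy, dtype=float) - energy_stats[0]) / energy_stats[1]
    return alpha * g + (1.0 - alpha) * e
```

Because both statistics come from a held-out in-distribution split, the two components end up on a comparable scale before they are mixed, and the whole score needs only one forward pass.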


2605.29762 2026-05-29 cs.CV 版本更新

GeoMag: Geometric-Aware Video Motion Magnification via State Space Model

GeoMag: 基于状态空间模型的几何感知视频运动放大

Kecheng Han, Yuchen Zhang, Bingqing Liu, Boqiang Guo, Wenbin Zheng, Shiyuan Pei

发表机构 * School of Software Engineering, Xi'an Jiaotong University（西安交通大学软件工程学院）； Xi'an Jiaotong University（西安交通大学）

AI总结提出GeoMag框架，利用状态空间模型实现全局一致的运动放大，并构建Geo-200K数据集提升训练多样性，在视觉保真度和计算效率上优于现有方法。

Comments ICME 2026 Spotlight

详情

AI中文摘要

视频运动放大（VMM）揭示了不可感知的动态，但在复杂几何变换下常常遭受结构不一致的问题。现有的基于学习的方法通常面临CNN的有限全局上下文与Transformer的高计算成本之间的权衡。此外，当前的训练协议主要由简单的线性运动主导，未能捕捉真实世界视频中遇到的几何和成像复杂性。为了解决这些问题，我们提出了GeoMag，一个基于状态空间模型的几何感知VMM框架，以实现具有线性复杂度的全局一致运动放大。我们进一步构建了Geo-200K，一个大规模合成数据集，引入了丰富的几何变换以及传感器真实的退化，提高了训练信号的多样性和真实性。在合成和真实世界基准上的大量实验表明，GeoMag在视觉保真度和计算效率上始终优于先前的方法，同时产生更少的伪影和更好的结构一致性。

英文摘要

Video Motion Magnification (VMM) reveals imperceptible dynamics but often suffers from structural inconsistencies under complex geometric transformations. Existing learning-based methods generally face a trade-off between the limited global context of CNNs and the high computational cost of Transformers. In addition, current training protocols, largely dominated by simple linear motion, fail to capture the geometric and imaging complexities encountered in real-world videos. To address these issues, we propose GeoMag, a geometric-aware VMM framework built upon State Space Models to achieve globally consistent motion amplification with linear complexity. We further construct Geo-200K, a large-scale synthetic dataset that introduces rich geometric transformations together with sensor-realistic degradations, improving the diversity and realism of training signals. Extensive experiments on synthetic and real-world benchmarks show that GeoMag consistently outperforms prior methods in visual fidelity and computational efficiency, while producing fewer artifacts and better structural consistency.


2605.29761 2026-05-29 cs.CV cs.CG 版本更新

S2MDF: A Plug-And-Play Layer for Intersection-Free Multi-Object Signed Distance Fields

S2MDF：用于无交叉多物体有符号距离场的即插即用层

Deniz Sayin Mercadier, Federico Stella, Aurel Bizeau, Nicolas Talabot, Pascal Fua

发表机构 * CVLab, Ecole Polytechnique Fédérale de Lausanne (EPFL)（计算机视觉实验室，洛桑联邦理工学院（EPFL））

AI总结提出S2MDF模块，通过硬约束强制向量值有符号距离场避免物体间几何交叉，无需修改网络架构，在训练或后处理中均可使用，显著减少交叉至数值精度且保持重建质量。

详情

AI中文摘要

组合隐式表面表示将场景建模为物体集合，每个物体由有符号距离场（SDF）编码。该方法的一个基本限制是多个SDF可能产生相互穿透的几何形状，违反物理合理性。现有的缓解策略依赖于软惩罚项，这些项减少但不能消除交叉，并且需要仔细的损失加权。为了真正防止相互穿透，我们提出了对向量值SDF的硬约束，并引入了S2MDF，一个轻量级的即插即用模块，无需架构修改即可对任何物体组合SDF表示施加约束。它引入可忽略的计算开销，并与线性插值的标准网格化算法（如Marching Cubes）兼容。它可以在训练期间或作为后处理步骤应用。在多种最先进的组合方法上的实验表明，S2MDF将交叉减少到数值精度，同时保持重建质量，优于现有的缓解策略。

英文摘要

Compositional implicit surface representations model scenes as collections of objects, each encoded by a Signed Distance Field (SDF). A fundamental limitation of this approach is that multiple SDFs can produce geometries that interpenetrate, violating physical plausibility. Existing mitigation strategies rely on soft penalty terms that reduce but do not eliminate intersections, and require careful loss weighting. To truly prevent interpenetration, we propose a hard constraint on vector-valued SDFs and introduce S2MDF, a lightweight plug-and-play module that enforces the constraint on any object-compositional SDF representation without architectural modifications. It introduces negligible computational overhead and is compatible with linearly-interpolated standard meshing algorithms such as Marching Cubes. It can be applied during training or as a post-processing step. Experiments on multiple state-of-the-art compositional methods show that S2MDF reduces intersections to numerical precision while preserving reconstruction quality, outperforming existing mitigation strategies.
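To make the idea of a hard constraint on a vector-valued SDF concrete, here is one simple construction that guarantees at most one object is "inside" at any query point. It is only an illustration under stated assumptions and is not claimed to be the S2MDF layer itself: the function name is hypothetical, negative values are assumed to mean "inside", and the output is not guaranteed to remain an exact distance field.

```python
import numpy as np

def enforce_disjoint(sdf_values):
    """Map K per-object signed distances to values with no overlapping interiors.

    sdf_values: array of shape (..., K), negative = inside object k.
    Each value becomes max(f_i, (f_i - min_{j != i} f_j) / 2), so an object
    stays negative only where it is strictly the most negative one.
    """
    f = np.asarray(sdf_values, dtype=float)
    k = f.shape[-1]
    if k < 2:
        return f.copy()
    ordered = np.sort(f, axis=-1)
    smallest, second = ordered[..., :1], ordered[..., 1:2]
    argmin = f.argmin(axis=-1)[..., None]
    is_min = np.arange(k) == argmin            # marks the smallest entry per point
    min_others = np.where(is_min, second, smallest)
    return np.maximum(f, 0.5 * (f - min_others))
```

Where two raw fields overlap, the shared region is split along the surface on which the two distances are equal; where objects are well separated, the values pass through unchanged.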


2605.29726 2026-05-29 cs.CV 版本更新

SLAD: Shared LoRA Adapters for Task Specific Distillation

SLAD：用于任务特定蒸馏的共享LoRA适配器

Reda Bensaid, Yassir Bendou, Vincent Gripon, François Leduc-Primeau

发表机构 * IMT Atlantique（大西洋高等矿业电信学院）； Polytechnique Montréal（蒙特利尔理工学院）

AI总结提出SLAD方法，通过共享低秩适配器参数对齐教师和学生模型的特征表示，实现高效的知识蒸馏，在多个分类和分割数据集上达到最先进性能。

Comments CVPR Findings 2026

详情

AI中文摘要

在资源受限环境（如嵌入式系统）中，将缩小版基础模型适配到下游任务变得越来越流行。这最近激发了任务特定蒸馏的新场景，其中同一基础模型的较大和较小版本都适配到同一下游任务，目标是将知识从前者转移到后者。最近的工作展示了使用同一基础模型的较大版本协助较小版本适配的好处。通常，较大模型（教师）首先通过微调或线性探测进行适配，然后将其知识蒸馏到较小模型（学生）。虽然微调教师通常能提升其性能，但最近的工作表明，对教师进行探测能更好地向学生蒸馏知识。我们的发现表明，这主要是由于教师微调过程中教师和学生之间特征表示的对齐偏差。受现有保留先前学习知识的努力启发，我们首先提出利用低秩适配，从而带来更好的特征对齐，进而实现更好的知识转移。基于这一洞察，我们进一步通过联合训练期间两个编码器之间适配器的参数共享策略来增强特征对齐。我们提出的方法SLAD在教师和学生之间展现出更好的特征对齐，不仅提升了学生模型的性能，也提升了教师模型的性能，同时训练速度比微调快2倍。通过在多个分类和分割数据集上的大量实验，我们展示了该方法在准确性和迁移效率上的提升，在任务特定蒸馏框架中达到了最先进性能。

英文摘要

In the context of resource-constrained environments such as embedded systems, adapting reduced-size foundation models to downstream tasks has become increasingly popular. This has recently motivated the emerging setting of task-specific distillation, where a larger and a smaller version of the same foundation model are both adapted to the same downstream task, with the goal of transferring knowledge from the former to the latter. Recent work has demonstrated the benefits of using a larger version of the same foundation model to assist the adaptation of a smaller one. Typically, the larger model (teacher) is first adapted via fine-tuning or linear probing before its knowledge is distilled into the smaller model (student). While fine-tuning the teacher often increases its performance, recent work showed that probing it leads to better knowledge distillation to the student. Our findings show that this is mainly due to a misalignment in feature representation between the teacher and the student, which occurs during the teacher's fine-tuning. Inspired by existing efforts to preserve previously learned knowledge, we first propose leveraging low-rank adaptation, resulting in better feature alignment and therefore better knowledge transfer. Drawing from this insight, we further enhance the feature alignment through a parameter-sharing strategy of the adapters between the two encoders during joint training. Our proposed method, SLAD, shows better feature alignment between the teacher and student, which results in increased performance for not only the student but also the teacher model, while being 2x faster to train than fine-tuning. Through extensive experiments on multiple classification and segmentation datasets, we demonstrate the improved accuracy and transfer efficiency of our method, achieving state-of-the-art performance in the task-specific distillation framework.


2605.29720 2026-05-29 cs.CV cs.LG 版本更新

Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets

面向大规模人脸识别数据集的高效、免验证的内在质量评估

Zhichao Chen, Yongle Zhao, Kaicheng Yang, Meng Yang, Yin Xie, Ziyong Feng

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China（中国科学技术大学网络科学与技术学院）

AI总结提出一种无需训练的内在质量（IQ）指标，通过邻域一致性得分和全局表示子空间复杂度来估计人脸识别数据集生成高性能模型的潜力，实现快速数据集诊断与筛选。

Comments ICML 2026

2605.29703 2026-05-29 q-bio.NC cs.CV q-bio.TO 版本更新

Subcortical Shape Variations and Their Associations with Cognition Across the 8th Decade of Life. A Study in the Lothian Birth Cohort 1936

皮层下形状变化及其与第八个十年生命期认知的关联：洛锡安出生队列1936研究

Maria del C. Valdes-Hernandez, Wonjung Park, Joanna Moodie, Susana Muñoz Maniega, Janie Corley, Fraser N. Sneden, Mark E. Bastin, Joanna M. Wardlaw, Simon R. Cox, Jinah Park

发表机构 * Department of Neuroimaging Sciences（神经影像科学系）； University of Edinburgh（爱丁堡大学）； Computer Graphics and Visualization Laboratory（计算机图形与可视化实验室）； Korea Advanced Institute of Science and Technology (KAIST)（韩国科学技术院）； Department of Psychology（心理学系）； Edinburgh Futures Institute（爱丁堡未来研究所）

AI总结利用洛锡安出生队列1936的纵向数据，通过ANCOVA和混合线性模型分析，研究第八个十年中皮层下结构的形状变化及其与认知老化的关联。

Comments 34 pages

详情

AI中文摘要

对正常个体脑形态变化的研究可能捕捉到与功能相关的脑老化方面，而这些方面不一定完全由总体积测量所指示。尽管皮层下脑结构在认知中起重要作用，但其形态轨迹与认知老化之间的关联尚未被记录。我们利用来自一项大型认知老化纵向研究——洛锡安出生队列1936——的神经影像、人口统计学和认知数据，探索社区居住个体在第八个十年生命期中皮层下脑结构的形状变化。我们使用ANCOVA和混合线性模型分析研究这些变化与认知老化的关联。皮层下形状变化是异质性的，在整个时期呈现不同的萎缩模式。海马体和腹侧DC经历了不同的形态变形（相对于其基线点），左右半球不同，而丘脑和苍白球形状则经历了更均匀的体积收缩，几乎在不同时间线上对称。一般认知的变化主要与时间点之间的向内和向外顶点位移相关。

英文摘要

The study of brain morphology changes in normal individuals may capture aspects of functionally-relevant brain aging not fully indicated by gross volumetry. Despite the important role of subcortical brain structures in cognition, the associations between their morphological trajectories and cognitive changes in aging have not been documented. We use neuroimaging, demographic, and cognitive data from a large longitudinal study of cognitive aging, the Lothian Birth Cohort 1936, to explore shape changes in subcortical brain structures of community-dwelling individuals across their 8th decade of life. We investigate the association of these changes with cognitive aging using ANCOVA and mixed linear model analyses. Subcortical shape changes were heterogeneous, with varied atrophy patterns across the whole period. The hippocampus and the ventral DC experienced varied morphological deformations (relative to their baseline) that differed between the left and right hemispheres, while the thalami and globus pallidi shapes, for example, experienced a more uniform volume contraction, nearly symmetrical throughout different timelines. Changes in general cognition were mainly associated with inwards and outwards vertex displacements between the time-points.


2605.29691 2026-05-29 cs.CV 版本更新

Unsupervised Semantic Segmentation Facilitates Model Understanding

无监督语义分割促进模型理解

Xiaoyan Yu, Lisa Mais, Jannik Franzen, Peter Hirsch, Nick Lechtenbörger, Andreas Mardt, Dagmar Kainmüller

发表机构 * Max-Delbruck-Center（马克斯·德尔布吕克中心）； Helmholtz Imaging（亥姆霍兹成像）； Humboldt-Universität zu Berlin（柏林洪堡大学）； Charité Universitätsmedizin（夏里特大学医学院）； University of Potsdam（波茨坦大学）

AI总结提出基于无监督语义分割的可视化协议，直观揭示不同自监督视觉Transformer的注意力机制、位置偏差和缩放行为等模型特性。

详情

AI中文摘要

自监督学习（SSL）产生了多种视觉Transformer（ViT），其预训练表示支持广泛的下游任务。为了更好地理解这些模型，已有工作评估了自注意力的机制以及表示中捕获的信息类型，例如揭示了对比学习（CL）和掩码图像建模（MIM）训练模型之间的显著差异。然而，模型理解的这些进展尚未完全渗透到更广泛的社区，其中针对CL模型的见解有时被泛化到MIM模型。为了使模型理解对广大受众直接且直观，我们提出了一种简单且易于解释的可视化协议。我们的协议基于可视化无监督语义分割结果，但目标不是最大化分割性能。相反，它允许我们传达跨图像一致出现的模型行为。通过对不同层和表示上的多种SSL模型进行基准测试，我们获得了关于不同位置偏差和缩放行为的新见解，包括DINOv3-Large模型令牌中的强边界伪影。这些见解补充并有助于传达一系列先前发现。我们的协议进一步能够清晰地区分位置效应与密切相关但不同的局部性偏差，后者在文献中已被更广泛地研究。该协议在GitHub上公开，我们相信它将促进更广泛社区的进一步模型理解。

英文摘要

Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self-attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling (MIM). However, these advances in model understanding have not yet fully permeated the broader community, where insights specific to CL models are sometimes generalized to MIM models. To make model understanding straightforward and intuitive for a broad audience, we propose a simple and easily interpretable visualization protocol. Our protocol is based on visualizing unsupervised semantic segmentation results, yet our goal is not to maximize segmentation performance. Instead, it allows us to convey model behaviors that consistently emerge across images. Benchmarking a diverse set of SSL models across layers and representations, we obtain novel insights into distinct positional biases and scaling behaviors, including strong boundary artifacts in DINOv3-Large model tokens. These insights complement and help communicate a range of previous findings. Our protocol further enables a clear visual distinction between positional effects and the closely related but distinct locality bias, which has been studied much more extensively in the literature. The protocol is publicly available on GitHub and we believe it will catalyze further model understanding for a broad community.


2605.29673 2026-05-29 cs.LG cs.CV 版本更新

A Geometric View of SRC: Learning Representations for Stable Residual Inference

SRC的几何视角：学习用于稳定残差推理的表示

Vangelis P. Oikonomou

AI总结本文从几何角度分析稀疏表示分类（SRC）的残差排序稳定性，提出几何塑造目标以改善表示学习，并在多个数据集上验证了效果。

Comments 37 pages

详情

AI中文摘要

基于重构的推理通过比较类重构残差来分配类别；稀疏表示分类（SRC）是一个典型实例，其可靠性取决于学习表示的几何结构。我们采用严格的训练-推理分离：SRC仅作为固定的测试时规则使用，在训练过程中从不进行微分、展开或优化。在基于类条件张成子空间及其相关投影残差的张成子空间理想化中，我们通过残差间隔形式化残差排序稳定性，并刻画了可能在最坏方向破坏该间隔的几何障碍——张成子空间重叠、支配以及通过小主角产生的近重叠。这一张成子空间理论是首要的：它指定了理想化残差族何时良好分离，并为实际残差近似（如OMP）提供了条件性的求解器级解释，只要它们接近张成子空间级别的残差排序。在显式的覆盖和分离假设下，我们推导了（理想化）残差间隔的定量下界。在这些目标的指导下，我们提出了几何塑造目标，这些目标促进掩蔽的类内自表达性，抑制跨类重构路径和类间张成子空间对齐，并防止坍塌——而在训练过程中不调用SRC残差或预测。在图像（COIL-100）、文本（TREC）和EEG连接性上的实验，在相同的固定SRC/OMP推理下评估所有表示，并报告残差间隔和几何诊断；交叉熵仅作为相同评估协议下的参考几何包含在内。

英文摘要

Reconstruction-based inference assigns a class by comparing class-wise reconstruction residuals; Sparse Representation Classification (SRC) is a canonical instance whose reliability depends on the geometry of the learned representation. We adopt a strict training-inference separation: SRC is used only as a fixed test-time rule and is never differentiated, unrolled, or optimized during training. In a span-level idealization based on class-conditional spans and their associated projection residuals, we formalize residual-ordering stability through a residual margin and characterize geometric obstructions -- span overlap, dominance, and near-overlap via small principal angles -- that can collapse this margin in worst-case directions. This span-level theory is primary: it specifies when the idealized residual family is well-separated, and it provides a conditional solver-level interpretation for practical residual approximations (e.g., OMP) insofar as they remain close to the span-level residual ordering. Under explicit coverage and separation assumptions, we derive a quantitative lower bound on the (idealized) residual margin. Guided by these targets, we propose geometry-shaping objectives that promote masked within-class self-expressiveness, discourage cross-class reconstruction pathways and inter-class span alignment, and prevent collapse -- without invoking SRC residuals or predictions during training. Experiments on images (COIL-100), text (TREC), and EEG connectivity evaluate all representations under identical fixed SRC/OMP inference and report residual margins and geometric diagnostics; cross-entropy is included only as a reference geometry under the same evaluation protocol.
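The span-level idealization can be illustrated with a short sketch: each class is represented by the column span of a dictionary, the residual is the distance from a sample to that span, and the predicted class is the one with the smallest residual. The margin definition used here (smallest wrong-class residual minus the true-class residual) is an assumption for illustration, as are all function names; the paper's exact definitions may differ, and practical SRC would use a sparse solver such as OMP instead of plain least squares.

```python
import numpy as np

def span_residuals(x, class_dicts):
    """Projection residual of x onto the column span of each class dictionary."""
    residuals = []
    for D in class_dicts:
        coef, *_ = np.linalg.lstsq(D, x, rcond=None)  # least-squares projection
        residuals.append(np.linalg.norm(x - D @ coef))
    return np.array(residuals)

def predict(x, class_dicts):
    """Assign the class whose span reconstructs x with the smallest residual."""
    return int(np.argmin(span_residuals(x, class_dicts)))

def residual_margin(x, y, class_dicts):
    """Assumed margin: min wrong-class residual minus true-class residual.

    Positive means the residual ordering favors the true class y.
    """
    r = span_residuals(x, class_dicts)
    return float(np.delete(r, y).min() - r[y])
```

In this picture, overlapping or nearly aligned class spans shrink the gap between the best and second-best residuals, which is the geometric obstruction the abstract refers to.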


2605.29657 2026-05-29 cs.CV cs.AI 版本更新

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

OccamToken: 无需训练且预算自适应的令牌剪枝实现高效VLM推理

Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang

发表机构 * Nanyang Technological University (NTU)（南洋理工大学）

AI总结提出OccamToken框架，通过寄存器锚定的相对证据测试替代绝对排名范式，实现无需训练、自适应预算的视觉令牌剪枝，在保持高精度的同时大幅压缩令牌数量。

Comments 26 pages, 8 figures

详情

AI中文摘要

视觉语言模型（VLM）依赖长视觉令牌序列进行视觉理解，导致预填充阶段在计算和内存上开销巨大。现有大多数剪枝方法遵循绝对排名范式，为视觉令牌分配重要性分数并保留固定的Top-K子集。本文认为这种范式本质上是脆弱的：注意力汇聚点扭曲令牌重要性排名，而图像冗余和查询依赖的视觉证据使得固定令牌预算在不同输入间不可靠。我们提出OccamToken，一个无需训练的框架，用寄存器锚定的相对证据测试替代绝对令牌排名。OccamToken不询问哪些令牌全局重要，而是评估视觉令牌是否提供了超越寄存器基线的信息。我们的关键洞察是，寄存器令牌自然吸收低信息注意力模式，使其成为识别真正信息性视觉证据的稳定参考。基于这一原理，OccamToken通过从寄存器注意力中导出的动态阈值，执行图像自适应冗余剪枝和查询自适应相关性剪枝。在LLaVA-NeXT、LLaVA-v1.5和Qwen3-VL上，OccamToken一致地改善了准确率-效率权衡，无需额外训练。值得注意的是，在LLaVA-NeXT上，它将2880个视觉令牌减少到约40个，同时保留了超过93%的原始准确率，即使在极端的1.4%保留率下也能实现稳定的视觉令牌压缩。

英文摘要

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.
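The contrast between a fixed top-K budget and a register-anchored test can be sketched as follows. This is a loose illustration under stated assumptions, not OccamToken's actual rule: the abstract does not specify how the dynamic threshold is derived, so the use of the mean register attention, the `scale` factor, and the function name are all hypothetical.

```python
import numpy as np

def prune_by_register_reference(attn_visual, attn_register, scale=1.0):
    """Keep visual tokens whose attention exceeds a register-derived threshold.

    attn_visual:   (N,) attention mass received by each visual token.
    attn_register: (R,) attention mass received by each register token.
    The threshold scale * mean(attn_register) is an assumed stand-in for the
    paper's dynamic threshold.
    """
    attn_visual = np.asarray(attn_visual, dtype=float)
    threshold = scale * float(np.mean(attn_register))
    keep = np.flatnonzero(attn_visual > threshold)
    return keep, threshold
```

Unlike top-K selection, the number of retained tokens is not fixed in advance: an image with little informative content keeps few tokens, while a cluttered or query-relevant image keeps more.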

2605.29647 2026-05-29 cs.CV (updated version)

MARTIAN: A Rendering Framework for Aerial Mars Imagery from HiRISE Orbital Data

Dario Pisanti, Georgios Georgakis

Affiliations: Space Robotics Research Group, SnT, University of Luxembourg; Jet Propulsion Laboratory, California Institute of Technology

AI summary: Proposes MARTIAN, an open-source Blender-based rendering framework that uses real HiRISE orbital map data to synthesize realistic aerial views of Martian terrain under varying lighting and altitudes, with accurate pose annotations, to address the scarcity of training data for vision-based navigation on Mars.

Abstract

Aerial navigation on Mars requires vision-based pipelines that are robust to the diverse illumination conditions and terrain morphology of the Martian surface. A key bottleneck for training and evaluating such methods is the scarcity of large-scale, annotated aerial datasets. We present MARTIAN, an open-source Blender-based rendering framework that leverages real HiRISE orbital map products to synthesize realistic aerial views of the Martian terrain under controllable lighting conditions and at varying altitudes. MARTIAN generates observations with accurate pose annotations, directly addressing the scarcity of training data for vision-based navigation on Mars. The framework has been validated through its deployment in concurrent work on map-based localization systems for Ingenuity and future Mars rotorcraft, where synthetically trained deep image matchers were successfully evaluated on real Mars imagery. MARTIAN is publicly available at: https://github.com/nasa-jpl/martian.

2605.29643 2026-05-29 cs.CV cs.MA (updated version)

AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning

Yilun Qiu, Jiahe Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Chun Yuan

Affiliations: Xiaohongshu Inc.; Tsinghua Shenzhen International Graduate School, Tsinghua University

AI summary: Proposes AgentCVR, a multi-agent framework that treats cross-video reasoning as an active evidence-acquisition task. A Master Agent coordinates Visual and Audio Agents for targeted evidence extraction, and Script-Simulated RL optimizes the policy. The method outperforms single-pass baselines on cross-video alignment and localization and performs comparably to closed-source systems.

Abstract

Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script-Simulated RL, which optimizes the agent's policy with LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves comparable performance to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To ensure reproducibility, our code is available at https://github.com/wang-jh24/AgentCVR.

2605.29615 2026-05-29 cs.CV cs.CL (updated version)

How to Mitigate Distribution Shifts in Semantic Segmentation for Off-Road Environments

Ji-Hoon Hwang, Daeyoung Kim, Hyung-Suk Yoon, Dong-Wook Kim, Seung-Woo Seo

Affiliations: Department of Electrical and Communication Engineering, Seoul National University

AI summary: Proposes ST-Seg, a framework that uses style expansion and texture regularization to mitigate distribution shifts caused by source-target domain discrepancies and sensor corruption in off-road scenes, improving the robustness of semantic segmentation.

Comments 8 pages, 6 figures. Accepted to IEEE Robotics and Automation Letters (RA-L). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

DOI: 10.1109/LRA.2025.3551536
Journal ref: IEEE Robotics and Automation Letters, vol. 10, issue 5, pp. 4500-4507, 2025

Abstract

Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach for distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.

2605.29592 2026-05-29 cs.CV (updated version)

Non-Forgetting Knowledge Allocation with Bi-level Competition for Class-Incremental Learning

Xiang Tan, Run He, Yawen Cui, Mengchen Zhao, Yan Wu, Tianyi Chen, Huiping Zhuang, Xiaonan Luo, Guanbin Li

Affiliations: South China University of Technology; Hong Kong Polytechnic University; Agency for Science, Technology and Research; Microsoft; Guilin University of Electronic Technology; Sun Yat-sen University

AI summary: To address uneven adapter knowledge allocation and forgetting in class-incremental learning with pre-trained models, proposes Non-Forgetting Allocation with Bi-Level Competition (NoFA-BC), which builds a non-forgetting allocator via recursive least squares and introduces intra-task Winner-Takes-All and inter-task Last-Ones-Fall mechanisms to improve adapter utilization.

Abstract

Class-Incremental Learning (CIL) with pre-trained models (PTMs) aims to sequentially adapt PTMs to new categories without forgetting old knowledge. Built upon PTMs, existing adapter-based methods mainly train models via distinct task-specific adapters, and present a uniform knowledge allocation for each adapter during inference. However, this allocation mechanism ignores the nature of task discrepancy and leads to suboptimal utilization of adapters. Also, under the CIL constraint, an allocator is prone to forgetting when tasks evolve. To address these issues, we propose a Non-Forgetting Allocation with Bi-Level Competition (NoFA-BC). NoFA-BC constructs a non-forgetting allocator (NFA) by transforming the allocator training into a recursive least-squares problem and achieves an allocator equivalent to that trained with all data. Based on the NFA, a Bi-Level Competition (BLC) including an intra-task level Winner-Takes-All (WTA) mechanism and inter-task Last-Ones-Fall (LOF) elimination is proposed to provide better allocation of adapter knowledge. WTA extracts the most significant logit within a task to represent the adapter's contribution and LOF suppresses the irrelevant adapters. With BLC, the participation ratio of each adapter can be tailored for each input. Moreover, a Stability Enhancement (SE) process is incorporated to further improve the performance of old tasks.
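
The claim that a recursively trained allocator can match one trained on all data follows from a standard property of regularized recursive least squares. The sketch below shows that property on a toy regression problem. It is not the NoFA-BC allocator: the scalar target, the ridge strength `lam`, and the function names are illustrative assumptions.

```python
def rls_update(w, P, x, y):
    """One recursive least-squares step for a single sample (x, y).

    P tracks the inverse of (lam * I + sum of x x^T) and is updated with the
    Sherman-Morrison identity, so no past samples need to be stored.
    """
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    err = y - sum(w[i] * x[i] for i in range(n))
    w_new = [w[i] + gain[i] * err for i in range(n)]
    # P is symmetric, so x^T P equals (P x)^T.
    P_new = [[P[i][j] - gain[i] * Px[j] for j in range(n)] for i in range(n)]
    return w_new, P_new


def fit_streaming(samples, dim, lam=1.0):
    """Fit ridge-regression weights by visiting each sample exactly once."""
    w = [0.0] * dim
    P = [[(1.0 / lam if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    for x, y in samples:
        w, P = rls_update(w, P, x, y)
    return w


samples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 2.0), ([1.0, 2.0], 2.9), ([1.0, 3.0], 4.2)]
w_stream = fit_streaming(samples, dim=2, lam=1.0)
```

The streamed weights satisfy the same normal equations, (XᵀX + λI)w = Xᵀy, as the batch ridge solution, and they do not depend on the order in which the samples arrive. That order independence is what makes the approach attractive when earlier task data is no longer available.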

2605.29583 2026-05-29 cs.CV (updated version)

BitC-3DGS: High-Capacity 3D Gaussian Splatting Watermarking via Bit Compression

Yuquan Bi, Baosheng Yu, Yingke Lei, Jianwei Yang, Hongsong Wang, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok

Affiliations: School of Cyber Science and Engineering, Southeast University; Lee Kong Chian School of Medicine, Nanyang Technological University; College of Electronic Engineering, National University of Defense Technology; Institute of AI for Industries, Chinese Academy of Sciences; School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University; Department of Computer and Information Science, University of Macau; Faculty of Science and Technology, UOW College Hong Kong; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

AI summary: Proposes BitC-3DGS, which uses bit-compressed tokenization, a dual-branch architecture, and a hard-message sampling strategy to overcome the 77-bit message limit imposed by the CLIP text encoder, enabling high-capacity 3DGS watermark embedding and recovery.

Abstract

High-capacity watermarking is necessary for 3D Gaussian Splatting (3DGS) assets to embed rich information (e.g., ownership, provenance, and authentication codes), enabling reliable identification and integrity verification in large-scale 3D asset pipelines. Existing bit-to-token watermarking methods based on a pre-trained text encoder are limited to 77-bit messages due to CLIP's fixed 77-token context length, as tokens beyond this limit are unsupported by learned positional embeddings. To address this limitation, we introduce BitC-3DGS, a bit-compression framework that encodes multiple message bits per token. It employs a bit-compressed tokenization scheme that encodes multiple bits within the same chunk into a single semantic token. To enable recovery of the compressed information, it further introduces a dual-branch architecture for joint chunk decompression and bit decoding, along with a hard-message sampling strategy to improve combinatorial coverage during decoder training. Extensive experiments on the Blender and LLFF datasets demonstrate the effectiveness of BitC-3DGS for high-capacity watermarking, achieving high message recovery accuracy and rendering fidelity. For example, it supports 128-bit message capacity with recovery accuracy comparable to that of 64-bit messages in recent state-of-the-art methods.
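
The capacity argument can be made concrete with a small sketch of chunk-wise bit packing. It only shows how grouping bits shortens the token sequence. The mapping from chunk values to actual semantic tokens, and the learned dual-branch decoder, are not described in the abstract, so the integer token ids and function names here are assumptions.

```python
def pack_bits(bits, bits_per_token):
    """Group a bit list into chunks and map each chunk to one integer token id."""
    if len(bits) % bits_per_token != 0:
        raise ValueError("message length must be a multiple of bits_per_token")
    tokens = []
    for start in range(0, len(bits), bits_per_token):
        value = 0
        for bit in bits[start:start + bits_per_token]:
            value = (value << 1) | bit   # most significant bit first
        tokens.append(value)
    return tokens


def unpack_bits(tokens, bits_per_token):
    """Invert pack_bits: expand each token id back into its chunk of bits."""
    bits = []
    for token in tokens:
        for shift in range(bits_per_token - 1, -1, -1):
            bits.append((token >> shift) & 1)
    return bits


message = [(i * 7 % 3) % 2 for i in range(128)]   # arbitrary 128-bit message
tokens = pack_bits(message, bits_per_token=2)     # 64 token ids, each in 0..3
recovered = unpack_bits(tokens, bits_per_token=2)
```

With two bits per token, a 128-bit message needs 64 tokens, which fits within a 77-token context. The cost is a vocabulary of 2^k chunk symbols per position, which grows quickly with k and may be one reason the decoder needs broader combinatorial coverage during training.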

2605.27696 2026-05-29 cs.CV cs.LG (updated version)

Structure over Pixels: Learning Variable-Length Visual Programs

Piotr Wyrwiński, Kacper Dobek, Krzysztof Krawiec

Affiliations: Institute of Computing Science, Poznan University of Technology

AI summary: Proposes STROP, a discrete visual tokenizer architecture that learns variable-length visual programs under supervision from local rate-distortion probes on DINOv3 features, replacing pixel reconstruction with structural representation.

Abstract

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

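2605.26064 2026-05-29 cs.CV cs.LG (updated version)

Paris 2.0: A Decentralized Diffusion Model for Video Generation

Ali Rouzbayani, Bidhan Roy, Marcos Villagra, Zhiying Jiang

AI summary: Presents Paris 2.0, the first video generation model pre-trained with decentralized compute. Built on the Paris 1.0 diffusion framework, it reduces FVD on low-resolution text-to-video from 561.04 to 279.01 relative to a centralized model, roughly a 2x improvement, and also improves CLIP text-video similarity and aesthetic scores.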

Comments 6 pages, 5 figures

2605.25975 2026-05-29 cs.GR cs.CV (updated version)

HyperBones: Real-Time Bone-Driven Neural Garment Simulation with Hypernetwork Conditioning

Astitva Srivastava, Hsiao-Yu Chen, Ryan Goldade, Philipp Herholz, Zhongshi Jiang, Gene Wei-Chin Lin, Lingchen Yang, Nikolaos Sarafianos, Tuur Stuyck, Doug Roble, Avinash Sharma, Egor Larionov

Affiliations: Meta Reality Labs

AI summary: Proposes a real-time neural garment simulation method that combines coarse simulation driven by virtual bones with a convolutional neural map that recovers fine wrinkles, using hypernetwork conditioning for efficient physics supervision without an external simulator.

Abstract

Recent advances in garment simulation have brought high-quality results closer to real-time performance. Physics-based simulators can produce accurate motion, but remain too computationally expensive for interactive applications. In contrast, linear blend skinning is efficient, but cannot capture the complex dynamics of loose-fitting garments, often leading to unrealistic motion and visual artifacts. Neural methods offer a promising alternative, yet they still struggle to animate loose clothing plausibly under strict runtime constraints. We present a fast and physically plausible approach for dynamic garment simulation. Our method trains a reduced-space neural dynamics simulator composed of independent coarse- and fine-level components. At the coarse level, the garment is driven by a set of virtual bones integrated with a lightweight neural network. Fine-scale wrinkle details are then recovered using a trained convolutional neural map. By decoupling identity-specific computation from real-time neural integration, our architecture maintains high performance while supporting diverse body shapes and motions. We further introduce an effective physics-supervision scheme that enables accurate results without relying on an external simulator. Experiments show that our method produces physically plausible garment dynamics, generalizes across a range of motions and body shapes, and supports a fixed set of garments. Our simulator runs at 300+ FPS on a commodity GPU, making it suitable for real-time applications.

2605.14113 2026-05-29 cs.CV cs.AI cs.LG cs.MA (updated version)

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares, Jemma Kerns

Affiliations: School of Computing and Communications; Lancaster University; Lancaster Medical School; PUC-Rio; Puc-Behring Institute for AI

AI summary: Proposes ProtoMedAgent, a framework that constrains generation with a neuro-symbolic bottleneck and a reflective Scribe-Critic loop to address the missing semantic structure and retrieval sycophancy of prototype networks in clinical reporting, and adds a privacy gate based on k-anonymity and ℓ-diversity.

Comments CVR 2026

Abstract

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by k-anonymity and ℓ-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%). ProtoMedAgent additionally leverages a binding ℓ-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
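
For readers unfamiliar with the two privacy criteria named above, the following sketch checks their textbook definitions on a small table. It assumes the "distinct" variant of ℓ-diversity and uses made-up field names. How ProtoMedAgent applies the criteria inside its semantic privacy gate is not specified in the abstract.

```python
from collections import defaultdict


def passes_privacy_gate(records, quasi_identifiers, sensitive_field, k, l):
    """Check k-anonymity and distinct l-diversity over a list of dict records.

    k-anonymity: every group sharing the same quasi-identifier values has at
    least k records. Distinct l-diversity: every such group contains at least
    l different values of the sensitive field.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record[sensitive_field])
    return all(len(values) >= k and len(set(values)) >= l
               for values in groups.values())


# Hypothetical cohort rows: age band and sex are quasi-identifiers.
cohort = [
    {"age_band": "60-69", "sex": "F", "diagnosis": "osteoporosis"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "osteopenia"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "normal"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "osteoporosis"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "osteoporosis"},
]

ok_k2_l1 = passes_privacy_gate(cohort, ["age_band", "sex"], "diagnosis", k=2, l=1)
ok_k2_l2 = passes_privacy_gate(cohort, ["age_band", "sex"], "diagnosis", k=2, l=2)
```

The second group is 2-anonymous but contains a single diagnosis, so it fails once ℓ = 2 is required. This is the kind of case in which the ℓ-diversity constraint, rather than k-anonymity, becomes the binding one.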

2605.11723 2026-05-29 cs.CV cs.AI (updated version)

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, Honglie Wang, Yiyang Fan, Zhenlong Yuan, Zijun Li, Yongrui Heng, Guosheng Lin, Fan Yang, Tingting Gao

Affiliations: BJTU; NTU; BUPT; Kuaishou Technology

AI summary: Proposes CaC, a coarse-to-fine anomaly reward model based on vision-language models. It combines a global temporal scan, local spatial grounding, and structured spatiotemporal chain-of-thought reasoning with a large-scale generated-video anomaly dataset and three-stage progressive training, improving fine-grained anomaly detection accuracy and reducing anomalies in generated videos.

Comments 27 pages, 10 figures

Abstract

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
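
The Temporal and Spatial IoU rewards mentioned above build on the standard intersection-over-union measure. A minimal sketch of both quantities follows, assuming time windows given as (start, end) pairs and boxes as (x1, y1, x2, y2) corners. How CaC weights or combines these terms with the accuracy reward is not stated in the abstract.

```python
def temporal_iou(pred, target):
    """IoU of two time windows given as (start, end)."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred, target):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


window_score = temporal_iou((0.0, 4.0), (2.0, 6.0))              # overlap 2 s of 6 s
box_score = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))  # overlap 1 of 7
```

Both scores lie in [0, 1] and reach 1 only for an exact match, so they can act as dense signals for the intermediate localization steps rather than only for the final judgment.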

2605.05155 2026-05-29 cs.CV cs.AI (updated version)

Beyond Chain-of-Thought: Rewriting as a Universal Interface for Generative Multimodal Embeddings

Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun

Affiliations: WeChat Vision, Tencent Inc.; Zhejiang University; Tsinghua University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: To address the redundancy and semantic ambiguity that chain-of-thought reasoning introduces in retrieval, proposes RIME, a rewrite-driven multimodal embedding framework that jointly optimizes generation and embedding and achieves efficient, accurate retrieval through cross-mode alignment and Refine reinforcement learning.

Abstract

Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.

2603.29954 2026-05-29 cs.CV (updated version)

Detecting Unknown Objects via Energy-based Separation for Open World Object Detection

Jun-Woo Heo, Keonhee Park, Gyeong-Moon Park

Affiliations: Korea University, South Korea; Seoul National University, South Korea

AI summary: Proposes DEUS, a framework that addresses unknown-object detection and class forgetting in open-world object detection through Equiangular Tight Frame subspace unknown separation and an energy-based known distinction loss.

Comments 8 pages, Accepted at CVPR 2026

Abstract

In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector's known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.
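
As background for the ETF-based geometry referred to above, the sketch below constructs the standard simplex equiangular tight frame, in which K unit vectors share the same pairwise cosine of -1/(K-1). This is the textbook construction only. How DEUS derives its known and unknown subspaces from such a frame is not described in the abstract, and the function names are illustrative.

```python
import math


def simplex_etf(num_vectors):
    """Return K unit vectors in R^K forming a simplex equiangular tight frame.

    Row i is sqrt(K / (K - 1)) * (e_i - (1 / K) * ones).
    """
    K = num_vectors
    if K < 2:
        raise ValueError("a simplex ETF needs at least two vectors")
    scale = math.sqrt(K / (K - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
            for i in range(K)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


frame = simplex_etf(4)
pairwise = [cosine(frame[i], frame[j]) for i in range(4) for j in range(i + 1, 4)]
```

For K = 4, every pairwise cosine equals -1/3, the most negative value that K vectors can share equally. This maximal, uniform separation is the property that makes the frame useful as a fixed geometric target for class representations.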

2603.21746 2026-05-29 cs.CV (updated version)

Getting to the Point: Pointing Improves LVLMs at Counting

Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi

Affiliations: Signals and Interactive Systems Lab, University of Trento

AI summary: Proposes Point-then-Count, which performs zero-shot counting by first generating the coordinates of target objects. It achieves the highest accuracy across several LVLMs, and analysis shows that the spatial information encoded in the coordinates drives the gain.

Abstract

Pointing-based methods decompose complex tasks as sequential grounding and reasoning steps. Given a query, the model first grounds the relevant objects by generating their coordinates, and then predicts an answer conditioned on these points. While this approach has been shown to increase the performance of Large Vision-Language Models (LVLMs), it remains unclear why and how it improves the models' visual reasoning. In this work, we evaluate pointing-based methods in the task of zero-shot counting in visual scenes. We experiment with multiple fine-tuning and training-free approaches on state-of-the-art LVLMs, and compare them with Point-then-Count (PtC), where models first generate point coordinates for the target objects and then predict their count. Our results show that PtC achieves the highest accuracy among the evaluated approaches, with predicted points correctly grounded in the image in more than 94% of cases (based on F1-score). Mechanistic analyses show that gains arise from spatial information encoded in the predicted coordinates. Nevertheless, grounding performance varies across image regions, revealing spatial biases. Finally, the results indicate that PtC improves out-of-distribution generalization on both synthetic and real data, suggesting the potential of coordinates to help LVLMs improve their counting skills.

2603.12588 2026-05-29 cs.CV (updated version)

SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

Furui Chen, Han Wang, Yuhan Sun, Jianing You, Yixuan Lv, Zhuang Zhou, Hong Tan, Shengyang Li

Affiliations: Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences; Key Laboratory of Space Utilization, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Software, Beihang University

AI summary: To address the radiometric discrepancy between optical and SAR imagery in ship re-identification, proposes SDF-Net, which extracts modality-invariant identity features through a structure consistency constraint and disentangled feature learning and achieves the best performance on the HOSS-ReID dataset.

Abstract

Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery is fundamentally challenged by the severe radiometric discrepancy between passive optical imaging and coherent active radar sensing. While existing approaches primarily rely on statistical distribution alignment or semantic matching, they often overlook a critical physical prior: ships are rigid objects whose geometric structures remain stable across sensing modalities, whereas texture appearance is highly modality-dependent. In this work, we propose SDF-Net, a Structure-Aware Disentangled Feature Learning Network that systematically incorporates geometric consistency into optical-SAR ship ReID. Built upon a ViT backbone, SDF-Net introduces a structure consistency constraint that extracts scale-invariant gradient energy statistics from intermediate layers to robustly anchor representations against radiometric variations. At the terminal stage, SDF-Net disentangles the learned representations into modality-invariant identity features and modality-specific characteristics. These decoupled cues are then integrated through a parameter-free additive residual fusion, effectively enhancing discriminative power. Extensive experiments on the HOSS-ReID dataset demonstrate that SDF-Net consistently outperforms existing state-of-the-art methods. The code and trained models are publicly available at https://github.com/cfrfree/SDF-Net.

2603.04314 2026-05-29 cs.CV cs.AI (updated version)

MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification

William Grolleau, Achraf Chaouch, Astrid Sabourin, Guillaume Lapouge, Catherine Achard

Affiliations: Université Paris-Saclay, CEA, List; Sorbonne University, CNRS, ISIR

AI summary: Proposes MOO, a large-scale synthetic multi-view dataset with images of 1,000 cattle from 128 uniformly sampled viewpoints, used to quantify the effect of viewpoint variation on re-identification and to verify that synthetic geometric priors transfer to real-world settings.

Comments 6 pages, 3 figures, accepted to the CVPR 2026 Workshop on Computer Vision for Animal Behavior Tracking and Modeling (CV4Animals)

Abstract

Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models must match individuals across drastic elevation changes. However, existing datasets lack the precise angular annotations required to systematically analyze these geometric variations. To address this, we introduce the Multi-view Oriented Observation (MOO) dataset, a large-scale synthetic AG-ReID dataset of 1,000 cattle individuals captured from 128 uniformly sampled viewpoints (128,000 annotated images). Using this controlled dataset, we quantify the influence of elevation and identify a critical elevation threshold, above which models generalize significantly better to unseen views. Finally, we validate the transferability to real-world applications in both zero-shot and supervised settings, demonstrating performance gains across four real-world cattle datasets and confirming that synthetic geometric priors effectively bridge the domain gap. Collectively, this dataset and analysis lay the foundation for future model development in cross-view animal ReID. MOO is publicly available at https://github.com/TurtleSmoke/MOO.

2603.03503 2026-05-29 cs.CV cs.LG 版本更新

Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data

地理加权弱监督贝叶斯高分辨率Transformer：利用Sentinel-1、RCM和AMSR2数据实现200米分辨率泛北极海冰密集度制图与不确定性估计

Mabel Heffring, Lincoln Linlin Xu

发表机构 * Department of Geomatics Engineering, Schulich School of Engineering, University of Calgary（测绘工程系，Schulich工程学院，卡尔加里大学）

AI总结提出一种贝叶斯高分辨率Transformer模型，结合地理加权弱监督损失函数和决策级数据融合，利用Sentinel-1、RCM和AMSR2数据实现200米分辨率泛北极海冰密集度制图与不确定性量化。

Comments 23 pages, 20 figures

详情

DOI: 10.1016/j.isprsjprs.2026.05.032

AI中文摘要

尽管具有可靠对应不确定性的泛北极海冰高分辨率制图对于业务化海冰密集度（SIC）制图至关重要，但由于冰特征信号的细微性、SIC标签的不精确性、模型不确定性和数据异质性等关键挑战，这是一项艰巨的任务。本研究提出了一种新颖的贝叶斯高分辨率Transformer方法，利用Sentinel-1、RADARSAT星座任务（RCM）和先进微波扫描辐射计2（AMSR2）数据，实现200米分辨率泛北极SIC制图和不确定性量化。首先，为了改进微小和细微海冰特征（例如裂缝/水道、融池和浮冰）的提取，我们设计了一种新颖的高分辨率Transformer模型，该模型具有全局和局部模块，能够更好地区分海冰模式的细微差异。其次，为了解决低分辨率和非精确SIC标签的问题，我们设计了一种地理加权弱监督损失函数，在区域级别而非像素级别监督模型，并优先考虑纯开阔水和冰盖特征，同时减轻边缘冰区（MIZ）中模糊性的影响。第三，为了改进不确定性量化，我们设计了所提Transformer模型的贝叶斯扩展，将其参数视为随机变量，以更有效地捕获不确定性。第四，为了解决数据异质性，我们在决策级融合三种不同类型的数据（Sentinel-1、RCM和AMSR2），以改进SIC制图和不确定性量化。所提方法在2021年和2025年泛北极最小范围条件下进行了评估。结果表明，所提模型在使用Sentinel-1数据时实现了0.70的总体特征检测精度，同时保留了泛北极SIC模式（相对于ARTIST海冰产品，Sentinel-1 R² = 0.90）。

英文摘要

Although high-resolution mapping of pan-Arctic sea ice with reliable corresponding uncertainty is essential for operational sea ice concentration (SIC) charting, it is a difficult task due to key challenges, such as the subtle nature of ice signature features, inexact SIC labels, model uncertainty, and data heterogeneity. This study presents a novel Bayesian High-Resolution Transformer approach for 200 meter resolution pan-Arctic SIC mapping and uncertainty quantification using Sentinel-1, RADARSAT Constellation Mission (RCM), and Advanced Microwave Scanning Radiometer 2 (AMSR2) data. First, to improve small and subtle sea ice feature (e.g., cracks/leads, ponds, and ice floes) extraction, we design a novel high-resolution Transformer model with both global and local modules that can better discern the subtle differences in sea ice patterns. Second, to address low-resolution and inexact SIC labels, we design a geographically-weighted weakly supervised loss function to supervise the model at region level instead of pixel level, and to prioritize pure open water and ice pack signatures while mitigating the impact of ambiguity in the marginal ice zone (MIZ). Third, to improve uncertainty quantification, we design a Bayesian extension of the proposed Transformer model, treating its parameters as random variables to more effectively capture uncertainties. Fourth, to address data heterogeneity, we fuse three different data types (Sentinel-1, RCM, and AMSR2) at decision-level to improve both SIC mapping and uncertainty quantification. The proposed approach is evaluated under pan-Arctic minimum-extent conditions in 2021 and 2025. Results demonstrate that the proposed model achieves 0.70 overall feature detection accuracy using Sentinel-1 data, while also preserving pan-Arctic SIC patterns (Sentinel-1 $R^2 = 0.90$ relative to the ARTIST Sea Ice product).

2602.20316 2026-05-29 astro-ph.SR cs.CV 版本更新

Inspectorch: Efficient rare event exploration in solar observations

Inspectorch: 太阳观测中稀有事件的高效探索

C. J. Díaz Baso, I. J. Soler Poquet, C. Kuckein, M. van Noort, N. Poirier

发表机构 * Institute of Theoretical Astrophysics, University of Oslo, P.O. Box 1029 Blindern, N-0315 Oslo, Norway； Rosseland Centre for Solar Physics, University of Oslo, P.O. Box 1029 Blindern, N-0315 Oslo, Norway； Instituto de Astrofísica de Canarias, C/Vía Láctea s/n, E-38205 La Laguna, Tenerife, Spain； Departamento de Astrofísica, Universidad de La Laguna, E-38206 La Laguna, Tenerife, Spain； Max-Planck Institute for Solar System Research, Justus-von-Liebig-Weg 3, 37077 Göttingen, Germany； LPC2E, OSUC, Univ Orléans, CNRS, CNES, F-45071 Orléans, France

AI总结提出基于流的密度估计模型Inspectorch，用于从高维太阳观测数据中高效识别稀有事件，并聚焦计算资源于极端现象。

Comments 12+1 pages, 11+2 figures, submitted to A&A

详情

AI中文摘要

太阳正以前所未有的细节被观测，使我们能够在非常小的时空尺度上研究其活动。然而，望远镜收集的大量数据无法用传统方法完全分析。流行的机器学习方法从观测中识别一般趋势，但由于罕见事件发生频率低，往往忽略它们。我们研究无监督概率方法在多维太阳观测中高效识别罕见事件的适用性，并优化计算资源以研究这些极端现象。我们介绍了Inspectorch，一个开源框架，利用基于流的模型：灵活的概率密度估计器，能够学习太阳观测的多维分布。一旦优化，它为每个样本分配概率，使我们能够识别异常事件。我们将该方法应用于Hinode光谱偏振仪、界面区域成像光谱仪、瑞典1米太阳望远镜上的微透镜高光谱成像仪、太阳动力学观测站上的大气成像组件以及太阳轨道器上的极紫外成像仪的观测。我们发现该算法始终为表现出异常特征的光谱分配较低的概率。例如，它识别出具有非常强多普勒频移、不常见展宽以及与小尺度重联事件相关的时间动态的谱线等。因此，Inspectorch证明了使用基于流的模型进行密度估计为在大型太阳数据集中识别罕见事件提供了一种强大的方法。由此产生的概率异常分数允许将计算资源集中在最具信息量和物理相关的事件上。我们公开提供Python包，网址为https://github.com/cdiazbas/inspectorch。

英文摘要

The Sun is observed in unprecedented detail, enabling studies of its activity on very small spatiotemporal scales. However, the large volume of data collected by our telescopes cannot be fully analyzed with conventional methods. Popular machine learning methods identify general trends from observations, but tend to overlook unusual events due to their low frequency of occurrence. We study the applicability of unsupervised probabilistic methods to efficiently identify rare events in multidimensional solar observations and to focus our computational resources on the study of these extreme phenomena. We introduce Inspectorch, an open-source framework that utilizes flow-based models: flexible density estimators capable of learning the multidimensional distribution of solar observations. Once optimized, it assigns a probability to each sample, allowing us to identify unusual events. We apply this approach to observations from the Hinode Spectro-Polarimeter, the Interface Region Imaging Spectrograph, the Microlensed Hyperspectral Imager at the Swedish 1-m Solar Telescope, the Atmospheric Imaging Assembly on board the Solar Dynamics Observatory, and the Extreme Ultraviolet Imager on board Solar Orbiter. We find that the algorithm assigns consistently lower probabilities to spectra that exhibit unusual features. For example, it identifies profiles with very strong Doppler shifts, uncommon broadening, and temporal dynamics associated with small-scale reconnection events, among others. As a result, Inspectorch demonstrates that density estimation using flow-based models offers a powerful approach to identifying rare events in large solar datasets. The resulting probabilistic anomaly scores allow computational resources to be focused on the most informative and physically relevant events. We make our Python package publicly available at https://github.com/cdiazbas/inspectorch.

2602.18527 2026-05-29 cs.CV cs.AI cs.SD 版本更新

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

JAEGER：模拟物理环境中的联合3D音频-视觉定位与推理

Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

发表机构 * Tsinghua University（清华大学）； Zhejiang University（浙江大学）； The Chinese University of Hong Kong（香港中文大学）； Tencent AI Lab（腾讯AI实验室）

AI总结提出JAEGER框架，通过集成RGB-D观测和多通道一阶环境声学，将音频-视觉大语言模型扩展到3D空间，实现联合空间定位与推理，并引入神经强度向量（Neural IV）提升声源方向估计的鲁棒性。

Comments Accepted to ICML 2026

详情

AI中文摘要

当前的音频-视觉大语言模型（AV-LLMs）主要局限于2D感知，依赖于RGB视频和单声道音频。这种设计选择引入了基本的维度不匹配，阻碍了在复杂3D环境中可靠的声源定位和空间推理。我们通过提出JAEGER框架来解决这一限制，该框架将AV-LLMs扩展到3D空间，通过集成RGB-D观测和多通道一阶环境声学实现联合空间定位与推理。我们工作的核心贡献是神经强度向量（Neural IV），一种学习的空间音频表示，它编码了鲁棒的方向线索，以增强到达方向估计，即使在具有重叠声源的不利声学场景中也是如此。为了促进大规模训练和系统评估，我们提出了SpatialSceneQA，一个包含从模拟物理环境中整理的6.1万个指令调优样本的基准。大量实验表明，我们的方法在各种空间感知和推理任务中始终优于以2D为中心的基线，强调了显式3D建模对于推进物理环境中AI的必要性。我们的源代码、预训练模型检查点和数据集可在https://github.com/liuzhan22/JAEGER获取。

英文摘要

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints, and datasets are available at https://github.com/liuzhan22/JAEGER.

2602.15382 2026-05-29 cs.CL cs.CV cs.LG 版本更新

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

视觉虫洞：异构多智能体系统中的潜在空间通信

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao

发表机构 * Purdue University（普渡大学）； Contextual AI（情境人工智能）； Carnegie Mellon University（卡内基梅隆大学）； Georgia Institute of Technology（佐治亚理工学院）

AI总结提出Vision Wormhole框架，通过通用视觉编解码器将推理轨迹映射到共享连续空间，实现异构VLM间的潜在状态传输，无需配对翻译器，降低对齐复杂度并提升效率。

Comments Preprint. Work in progress

详情

AI中文摘要

由大型语言模型驱动的多智能体系统（MAS）实现了先进的协作推理，但仍受限于离散文本通信，这带来了运行时开销和信息量化损失。虽然潜在状态传输提供了一种替代方案，但现有方法要么假设同构的发送器-接收器架构，要么依赖于特定配对的学得翻译器，限制了跨具有不相交流形的不同模型族的可扩展性。我们将为自然图像训练的视觉-语言模型（VLM）的视觉接口重新概念化为异构智能体之间的连续通信通道，并将这一思想实例化为\textbf{视觉虫洞}：一种通用视觉编解码器，将推理轨迹映射到共享的连续参考空间，并将其注入接收器的视觉通路，实现无需配对翻译器的跨架构潜在状态传输。该框架采用中心辐射拓扑，将对齐复杂度从$O(N^2)$降低到$O(N)$，并通过无标签的教师-学生蒸馏针对文本通道进行训练，无需并行隐藏状态监督。在异构VLM族（Qwen-VL、Gemma、SmolVLM2、LFM2.5-VL）和九个推理基准上的大量实验表明，视觉虫洞在大多数评估设置中减少了端到端挂钟时间，并产生了正的宏平均$Δ$-准确率。

英文摘要

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain bottlenecked by discrete text communication, which imposes runtime overhead and information quantization loss. While latent state transfer offers an alternative, existing approaches either assume homogeneous sender--receiver architectures or rely on pair-specific learned translators, limiting scalability across diverse model families with disjoint manifolds. We reconceptualize the visual interface of Vision-Language Models (VLMs), trained for natural images, as a continuous communication channel between heterogeneous agents, and instantiate this idea as the \textbf{Vision Wormhole}: a Universal Visual Codec maps reasoning traces into a shared continuous reference space and injects them into the receiver's visual pathway, yielding cross-architecture latent state transfer without per-pair translators. The framework adopts a hub-and-spoke topology that reduces alignment complexity from $O(N^2)$ to $O(N)$, and is trained by label-free teacher--student distillation against the text channel, requiring no parallel hidden-state supervision. Extensive experiments across heterogeneous VLM families (Qwen-VL, Gemma, SmolVLM2, LFM2.5-VL) and nine reasoning benchmarks show that the Vision Wormhole reduces end-to-end wall-clock time across most evaluated settings and yields positive macro-average $Δ$-accuracy.

2602.01456 2026-05-29 cs.LG cs.CV 版本更新

Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations

Rectified LpJEPA：具有稀疏和最大熵表示的联合嵌入预测架构

Yilun Kuang, Yash Dagade, Tim G. J. Rudner, Randall Balestriero, Yann LeCun

发表机构 * New York University（纽约大学）； Duke University（杜克大学）； University of Toronto（多伦多大学）； Brown University（布朗大学）

AI总结提出Rectified Distribution Matching Regularization (RDMReg)损失，通过将表示对齐到Rectified Generalized Gaussian分布，实现稀疏且最大熵的表示，从而改进联合嵌入预测架构（JEPA）的性能。

Comments ICML 2026

详情

AI中文摘要

联合嵌入预测架构（JEPA）学习视角不变表示，并采用基于投影的分布匹配来防止崩溃。现有方法将表示正则化为各向同性高斯分布，但固有地偏向密集表示，未能捕捉高效表示中观察到的稀疏性关键特性。我们引入了Rectified Distribution Matching Regularization (RDMReg)，这是一种切片双样本分布匹配损失，将表示对齐到Rectified Generalized Gaussian (RGG)分布。RGG通过整流显式控制期望的$\ell_0$范数，而其连续截断部分在期望$\ell_p$范数和支撑约束下具有最大熵特性。将RDMReg应用于JEPA得到Rectified LpJEPA，它严格推广了先前基于高斯的JEPA。实验表明，Rectified LpJEPA学习到稀疏、非负的表示，具有有利的稀疏性-性能权衡，并在图像分类基准上取得了有竞争力的下游性能，表明RDMReg可以在保留任务相关信息的同时强制执行稀疏性。

英文摘要

Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while its continuous truncated component admits a maximum-entropy characterization under expected $\ell_p$ norm and support constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity--performance trade-offs and competitive downstream performance on image classification benchmarks, showing that RDMReg can enforce sparsity while preserving task-relevant information.

2601.19947 2026-05-29 cs.LG cs.AI cs.CV 版本更新

NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

NCSAM: 噪声补偿的锐度感知最小化用于噪声标签学习

Jiayu Xu, Junbiao Pang

发表机构 * Beijing University of Technology（北京工业大学）

AI总结提出NCSAM方法，通过噪声补偿扰动修正噪声标签引起的优化偏差，缓解对噪声标签的记忆，在合成和真实噪声标签基准上优于SAM基线。

Comments 11 pages, 1 figure, 8 tables. Major revision of v1: revised PAC-Bayesian theoretical analysis, clarified the NCSAM formulation, added appendix derivations, reorganized experiments and ablations, updated related work, citations, writing, and author list

详情

AI中文摘要

从噪声标签学习（LNL）仍然是深度学习中的一个基本挑战，因为现实世界的数据集通常包含损坏的注释。大多数现有方法依赖于标签校正或样本选择机制。相比之下，我们从优化角度研究LNL，通过建立标签噪声与锐度感知最小化（SAM）的平坦性寻求行为之间的理论联系。基于此分析，我们提出了噪声补偿的锐度感知最小化（NCSAM），它使用噪声补偿扰动来抵消由噪声标签引起的优化偏差。通过纠正失真的SAM扰动，NCSAM在训练过程中减轻了对噪声标签的记忆，同时保持了基于优化的学习的简单性。在合成和真实噪声标签基准上的实验表明，NCSAM在基于SAM的优化基线上持续改进，并与代表性的噪声标签学习方法保持竞争力。

英文摘要

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annotations. Most existing methods rely on label correction or sample selection mechanisms. In contrast, we study LNL from an optimization perspective by establishing a theoretical connection between label noise and the flatness-seeking behavior of Sharpness-Aware Minimization (SAM). Based on this analysis, we propose Noise-Compensated Sharpness-Aware Minimization (NCSAM), which uses a noise-compensated perturbation to counteract the optimization bias induced by noisy labels. By correcting distorted SAM perturbations, NCSAM mitigates the memorization of noisy labels during training while preserving the simplicity of optimization-based learning. Experiments on synthetic and real-world noisy-label benchmarks show that NCSAM consistently improves over SAM-based optimization baselines and remains competitive with representative noisy-label learning methods.

2601.12500 2026-05-29 cs.CV 版本更新

Video Individual Counting and Tracking from Moving Drones: A Benchmark and Methods

来自移动无人机的视频个体计数与跟踪：基准与方法

Yaowu Fan, Jia Wan, Tao Han, Andy J. Ma, Wanli Ouyang, Antoni B. Chan

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University（中山大学计算机科学与工程学院）； School of Computer Science and Engineering, Hong Kong University of Science and Technology（香港科技大学计算机科学与工程学院）； Department of Computer Science, City University of Hong Kong（香港城市大学计算机科学系）； School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳）计算机科学与技术学院）； Chinese University of Hong Kong（香港中文大学）

AI总结针对大规模密集人群场景，提出移动无人机视频数据集MovingDroneCrowd++，并设计基于最优传输和描述子投票的计数与跟踪方法GD3A和DVTrack，显著降低计数误差并提升跟踪精度。

详情

AI中文摘要

在大规模场景中计数和跟踪密集人群是一个高度实用但具有挑战性的问题。现有方法大多依赖于场景覆盖有限的固定摄像头数据集，使其不足以用于大规模场景的人群分析。为弥补这一差距，我们引入了MovingDroneCrowd++，这是最大的视频级数据集，专门用于快速移动无人机下的密集人群计数和跟踪，在多种飞行高度、相机角度和光照条件下采集。然而，现有方法在这些具有挑战性的空中条件下仍无法达到令人满意的视频个体计数或跟踪性能。为此，我们提出了GD3A（通过分组描述符关联的全局密度图分解），一种视频个体计数方法，该方法首先通过带有自适应垃圾桶分数的最优传输建立帧间行人描述符的像素级对应关系。然后，采用分组关联来指导将全局密度图分解为共享、流入和流出密度图。我们进一步引入了一种行人跟踪方法DVTrack（描述子投票跟踪），该方法通过描述子投票将描述符级匹配转换为实例级关联。我们的方法依赖于每个行人的分组多个描述符的关联结果，而不是单个向量。由于组内匹配错误不影响最终的计数和跟踪结果，我们的方法在密集人群和具有挑战性的空中条件下更加鲁棒。实验表明，我们的方法在密集人群和复杂运动的移动无人机视频上，在人群计数和跟踪方面均取得了显著提升，计数误差降低了47.4%，跟踪精度提高了64.6%。代码、数据集和预训练模型可在 https://github.com/fyw1999/MovingDroneCrowd 获取。

英文摘要

Counting and tracking dense crowds in large-scale scenes is a highly practical yet challenging problem. Existing methods mostly rely on fixed-camera datasets with limited scene coverage, making them inadequate for crowd analysis in large-scale scenes. To bridge this gap, we introduce MovingDroneCrowd++, the largest video-level dataset dedicated to dense crowd counting and tracking with fast-moving drones, captured under diverse flight altitudes, camera angles, and illumination conditions. Existing methods, however, still fail to achieve satisfactory video individual counting or tracking performance under these challenging aerial conditions. To this end, we propose GD3A (Global Density map Decomposition via group-wise Descriptor Association), a video individual counting method that first establishes pixel-level correspondences between pedestrian descriptors across frames via optimal transport with an adaptive dustbin score. Then, group-wise association is adopted to guide the decomposition of the global density map into shared, inflow, and outflow density maps. We further introduce a pedestrian tracking method, DVTrack (Descriptor Voting Track), which converts descriptor-level matching into instance-level association through descriptor voting. Our methods rely on the association results of group-wise multiple descriptors for each pedestrian rather than a single vector. Since intra-group matching errors do not affect the final counting and tracking results, our methods are more robust in dense crowds and challenging aerial conditions. Experiments show that our methods achieve substantial gains in both crowd counting and tracking on moving-drone videos with dense crowds and complex motions, reducing counting error by 47.4% and improving tracking accuracy by 64.6%. Code, dataset, and pretrained models are available at https://github.com/fyw1999/MovingDroneCrowd.

2601.05149 2026-05-29 cs.CV 版本更新

Multi-Scale Local Speculative Decoding for Image Generation

多尺度局部推测解码用于图像生成

Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian

发表机构 * Qualcomm AI Research（高通人工智能研究）

AI总结提出多尺度局部推测解码（MuLo-SD）框架，通过低分辨率草稿模型与高分辨率目标模型结合、局部拒绝与重采样机制，加速自回归图像生成，实现高达5倍加速并保持语义对齐和感知质量。

Comments Accepted at CVPR 2026

详情

AI中文摘要

自回归（AR）模型在图像合成中取得了显著成功，但其顺序性带来了严重的延迟限制。推测解码提供了一种有前景的加速途径，但现有方法受限于令牌级模糊性和缺乏空间感知。在这项工作中，我们引入了多尺度局部推测解码（MuLo-SD），一种新颖的框架，结合多分辨率草稿与空间感知验证来加速AR图像生成。我们的方法利用低分辨率草稿模型配合上采样步骤来提出候选图像令牌，然后由高分辨率目标模型并行验证。关键的是，我们引入了局部拒绝和重采样机制，通过关注空间邻域而非在第一次拒绝后进行光栅扫描重采样，从而高效纠正草稿错误。当与并行解码重采样集成时，MuLo-SD实现了显著的加速（高达$\mathbf{5\times}$），在加速方面优于推测解码和并行解码基线，同时保持相当的语义对齐和感知质量。这些结果在MS-COCO 5k验证集上使用GenEval、DPG-Bench和FID/HPSv2进行了验证。广泛的消融实验突出了上采样设计、概率池化以及局部拒绝和重采样与邻域扩展的影响。我们的方法为图像合成中的推测解码设立了新的最先进水平，弥合了效率与保真度之间的差距。项目页面见https://qualcomm-ai-research.github.io/mulo-sd-webpage/。

英文摘要

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups -- up to $\mathbf{5\times}$ -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity. Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage/.

2601.03729 2026-05-29 cs.CV 版本更新

MATANet: A Multi-context Attention and Taxonomy-Aware Network for Fine-Grained Underwater Recognition of Marine Species

MATANet：用于海洋物种细粒度水下识别的多上下文注意力与分类感知网络

Donghwan Lee, Byeongjin Kim, Geunhee Kim, Hyukjin Kwon, Nahyeon Maeng, Wooju Kim

发表机构 * Department of Industrial Engineering, Yonsei University（延世大学工业工程系）

AI总结提出MATANet框架，通过多上下文环境注意力模块和层级感知表示学习模块，结合生物外观、环境上下文和分类结构，实现海洋生物细粒度识别，在FathomNet2025和LifeCLEF2015-Fish上取得最优性能。

详情

AI中文摘要

海洋生物的细粒度识别对于生态研究、生物多样性监测、栖息地保护和基于证据的政策制定至关重要。然而，许多现有方法主要依赖于以物体或ROI为中心的表征。这些限制在具有挑战性的水下场景中会降低判别性能，因为视觉上相似的生物通常出现在不同的环境条件下。为了解决这些问题，我们提出了MATANet（多上下文注意与分类感知网络），一个用于海洋生物细粒度分类识别的框架。MATANet的动机来自专家分类识别实践，其中在识别过程中同时考虑生物体形态和上下文线索。该框架由两个主要组件组成。首先，多上下文环境注意力模块（MCEAM）对主要感兴趣区域（ROI）与多尺度周围环境区域之间的交叉注意力进行建模，从而将局部形态线索与栖息地级上下文信息相结合。其次，层级感知表示学习模块（HRLM）使用分类层次作为辅助监督来正则化表示学习，并鼓励跨分类级别的语义结构化嵌入。通过联合建模生物外观、环境上下文和分类结构，MATANet学习了用于细粒度分类识别的更具判别性的表示。在FathomNet2025和LifeCLEF2015-Fish上的实验表明，MATANet持续优于现有方法的识别性能。在FAIR1M上的额外实验进一步检验了所提框架在水下图像之外的适用性。值得注意的是，MATANet在CVPR 2025 FGVC12研讨会的FathomNet 2025挑战赛中获得了第一名。

英文摘要

Fine-grained recognition of marine organisms is important for ecological research, biodiversity monitoring, habitat conservation, and evidence-based policy-making. However, many existing approaches primarily rely on object- or ROI-centered representations. These limitations can reduce discriminative performance in challenging underwater scenes, where visually similar organisms often appear under diverse environmental conditions. To address these challenges, we propose MATANet (Multi-context Attention and Taxonomy-Aware Network), a framework for fine-grained taxonomic recognition of marine organisms. MATANet is motivated by expert taxonomic identification practices, in which both organism-level morphology and contextual cues are considered during recognition. The framework consists of two main components. First, the Multi-Context Environmental Attention Module (MCEAM) models cross-attention between the primary region of interest (ROI) and multi-scale surrounding environmental regions, thereby combining local morphological cues with habitat-level contextual information. Second, the Hierarchy-Aware Representation Learning Module (HRLM) uses taxonomic hierarchy as auxiliary supervision to regularize representation learning and encourage semantically structured embeddings across taxonomic levels. By jointly modeling organism appearance, environmental context, and taxonomic structure, MATANet learns more discriminative representations for fine-grained taxonomic recognition. Experiments on FathomNet2025 and LifeCLEF2015-Fish demonstrate that MATANet consistently improves recognition performance over existing methods. Additional experiments on FAIR1M further examine the applicability of the proposed framework beyond underwater imagery. Notably, MATANet ranked first in the FathomNet 2025 Challenge at the CVPR 2025 FGVC12 workshop.

2512.04733 2026-05-29 cs.CV cs.AI 版本更新

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

E3AD：面向以人为中心的端到端自动驾驶的情感感知视觉-语言-动作模型

Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

发表机构 * McGill University（麦吉尔大学）； University of Macau（澳门大学）； The Hong Kong Polytechnic University（香港理工大学）； Massachusetts Institute of Technology（麻省理工学院）； University of Washington（华盛顿大学）

AI总结提出E3AD框架，通过连续VAD情感模型和双路径空间推理模块，将情感理解融入视觉-语言-动作模型，实现开放域端到端自动驾驶中的情感感知轨迹规划，在真实数据集上达到SOTA性能。

详情

AI中文摘要

端到端自动驾驶系统越来越多地采用视觉-语言-动作模型，但它们通常忽略乘客的情绪状态，而情绪状态对舒适度和自动驾驶接受度至关重要。我们引入了开放域端到端自动驾驶，其中自动驾驶车辆必须解释自由形式的自然语言命令，推断情绪，并规划物理上可行的轨迹。我们提出了E3AD，一个情感感知的VLA框架，通过两个认知启发的组件增强语义理解：一个连续的Valence-Arousal-Dominance情感模型，从语言中捕捉语调和紧迫性；以及一个双路径空间推理模块，融合自我中心和异中心视角以实现类人空间认知。结合模态预训练和基于偏好的对齐的一致性导向训练方案，进一步强化了情感意图与驾驶行为之间的一致性。在真实世界数据集上，E3AD改进了视觉定位和路径点规划，并在情感估计方面达到了最先进的VAD相关性。这些评估结果表明，将情感注入VLA风格的驾驶能够产生更符合人类行为的定位、规划和反馈。

英文摘要

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These evaluation results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.

2511.19316 2026-05-29 cs.CV cs.AI 版本更新

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

评估数据集水印用于定制扩散模型微调可追溯性：一个综合基准与移除方法

Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou, Jianbo Zhang, Wei Sun, Dandan Zhu, Xiongkuo Min, Jun Jia, Zhijun Fang

发表机构 * Donghua University（东华大学）； Shanghai Jiao Tong University（上海交通大学）； Xidian University（西安电子科技大学）； Hefei University of Technology（合肥工业大学）； East China Normal University（华东师范大学）

AI总结针对扩散模型微调中的版权与安全风险，本文建立统一威胁模型并提出包含普适性、可传递性和鲁棒性的评估框架，揭示现有数据集水印方法的脆弱性，并进一步提出一种实用的水印移除方法。

详情

AI中文摘要

最近扩散模型的微调技术使其能够再现特定图像集，例如特定人脸或艺术风格，但也引入了版权和安全风险。数据集水印已被提出，通过将不可察觉的水印嵌入训练图像来确保可追溯性，即使在微调后这些水印在输出中仍然可检测。然而，当前方法缺乏统一的评估框架。为解决这一问题，本文建立了一个通用威胁模型，并引入了一个包含普适性、可传递性和鲁棒性的综合评估框架。实验表明，现有方法在普适性和可传递性方面表现良好，并对常见图像处理操作具有一定的鲁棒性，但在真实威胁场景下仍然不足。为揭示这些脆弱性，本文进一步提出了一种实用的水印移除方法，该方法在不影响微调的情况下完全消除数据集水印，突出了未来研究的一个关键挑战。

英文摘要

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

2511.08423 2026-05-29 cs.CV 版本更新

OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild

OmniAID: 解耦语义与伪影以实现通用AI生成图像野外检测

Yuncheng Guo, Junyan Ye, Chenjue Zhang, Hengrui Kang, Haohuan Fu, Conghui He, Weijia Li

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Sun Yat-Sen University（中山大学）； Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院，清华大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出OmniAID框架，通过解耦混合专家架构分离语义缺陷和通用伪影，结合两阶段训练策略和Mirage数据集，实现跨生成模型和语义内容的鲁棒AI生成图像检测。

Comments Accepted by ICML 2026

详情

AI中文摘要

一个真正通用的AI生成图像（AIGI）检测器必须同时泛化到多种生成模型和不同的语义内容。当前方法学习单一的、纠缠的伪造表示，混淆了内容相关的缺陷与内容无关的伪影，并进一步受到过时基准的限制。我们提出OmniAID，一种以解耦混合专家（MoE）架构为核心的新框架，该架构分离了：（1）通过可路由的专门语义专家在不同内容领域中的语义缺陷，以及（2）通过固定的通用伪影专家从内容相关缺陷中分离出内容无关的通用伪影。两阶段训练策略首先通过领域特定的困难采样独立专门化专家，然后训练一个轻量级门控网络以实现有效的输入路由。通过明确解耦“生成了什么”（内容特定缺陷）与“如何生成”（通用伪影），OmniAID实现了鲁棒的泛化。我们还引入了Mirage，一个大规模、当代的数据集，包含现代训练集和具有挑战性的测试集。大量实验表明，OmniAID超越了现有检测器，为针对现代野外威胁的AIGI检测建立了新标准。代码可在https://github.com/yunncheng/OmniAID获取。

英文摘要

A truly universal AI-Generated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current methods learn a single, entangled forgery representation, conflating content-dependent flaws with content-agnostic artifacts, and are further constrained by outdated benchmarks. We propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture that separates: (1) semantic flaws across distinct content domains via Routable Specialized Semantic Experts, and (2) content-agnostic universal artifacts from content-dependent flaws via a Fixed Universal Artifact Expert. A two-stage training strategy first specializes experts independently with domain-specific hard-sampling, then trains a lightweight gating network for effective input routing. By explicitly decoupling "what is generated" (content-specific flaws) from "how it is generated" (universal artifacts), OmniAID achieves robust generalization. We also introduce Mirage, a large-scale, contemporary dataset comprising a modern training set and a challenging test set. Extensive experiments demonstrate that OmniAID surpasses existing detectors, establishing a new standard for AIGI detection against modern, in-the-wild threats. Code is available at https://github.com/yunncheng/OmniAID.

2510.27391 2026-05-29 cs.CV cs.LG 版本更新

Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

异质双曲流形上的树间模态对齐

Wei Wu, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi

发表机构 * Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology（智能信息技术北京市重点实验室，计算机科学与技术学院，北京理工大学）； Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University（广东机器感知与智能计算实验室，深圳北理莫斯科大学）； Department of Electrical and Computer System Engineering, Monash University（电气与计算机系统工程系，莫纳什大学）

AI总结提出一种在异质双曲流形上对齐图像和文本树状层次特征的方法，通过交叉注意力提取视觉层次特征、异质流形嵌入及KL距离度量学习中间流形，在开放集分类任务中优于基线。

Comments Published as a conference paper at ICLR 2026

详情

Journal ref: The Fourteenth International Conference on Learning Representations (ICLR 2026), Rio de Janeiro, Brazil, 2026

AI中文摘要

模态对齐对于视觉-语言模型（VLM）有效整合跨模态信息至关重要。然而，现有方法在提取文本层次特征的同时，对每个图像仅用单一特征表示，导致不对称和次优的对齐。为解决此问题，我们提出树间对齐（Alignment across Trees）方法，该方法为图像和文本模态构建并对齐树状层次特征。具体而言，我们引入一个语义感知的视觉特征提取框架，该框架对来自中间Transformer层的视觉类别标记应用交叉注意力机制，由文本线索引导以提取具有从粗到细语义的视觉特征。然后，我们将两种模态的特征树嵌入到具有不同曲率的双曲流形中，以有效建模其层次结构。为了在不同曲率的异质双曲流形之间进行对齐，我们推导了异质流形上分布之间的KL距离度量，并通过最小化该距离学习一个用于流形对齐的中间流形。我们证明了最优中间流形的存在性和唯一性。在多个图像数据集上的分类学开放集分类任务实验表明，我们的方法在少样本和跨域设置下持续优于强基线。

英文摘要

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
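
To make "hyperbolic manifolds with distinct curvatures" concrete, here is a minimal sketch of a geodesic distance with an explicit curvature parameter. The abstract does not say which model of hyperbolic space the authors use; this sketch assumes the Poincaré ball with curvature -c, and the function name `poincare_distance` is hypothetical rather than taken from the paper.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance on the Poincare ball of curvature -c (c > 0).

    Assumption: the Poincare ball model is used here for illustration only.
    """
    def sq_norm(v):
        return sum(a * a for a in v)

    nx, ny = sq_norm(x), sq_norm(y)
    # Points must lie strictly inside the ball of radius 1/sqrt(c).
    if c * nx >= 1.0 or c * ny >= 1.0:
        raise ValueError("points must lie inside the ball of radius 1/sqrt(c)")
    diff = sq_norm([a - b for a, b in zip(x, y)])
    arg = 1.0 + 2.0 * c * diff / ((1.0 - c * nx) * (1.0 - c * ny))
    return math.acosh(arg) / math.sqrt(c)
```

The same pair of coordinates is a different distance apart under different values of `c`, which is one way to see why features embedded in manifolds of different curvature cannot be compared directly and need some form of alignment.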


2510.03550 2026-05-29 cs.CV 版本更新

Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!

流式拖拽导向的交互式视频操作：随时拖动任何物体！

Junbao Zhou, Yuan Zhou, Kesen Zhao, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang

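Affiliations: Nanyang Technological University; University of Science and Technology of China; Hefei University of Technology

AI summary: Introduces the REVEL task and the DragStream method, which enables streaming, drag-based interactive manipulation of autoregressive video diffusion models through adaptive distribution self-rectification and spatial-frequency selective optimization.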


AI中文摘要

实现对自回归视频扩散模型输出的流式、细粒度控制仍然具有挑战性，难以确保其始终与用户期望一致。为弥补这一差距，我们提出流式拖拽导向的交互式视频操作（REVEL），这是一个新任务，允许用户通过细粒度的交互式拖拽随时对任何物体修改生成的视频。超越DragVideo和SG-I2V，REVEL将拖拽式视频操作统一为编辑和动画化视频帧，同时支持用户指定的平移、变形和旋转效果，使拖拽操作更加通用。在解决REVEL时，我们观察到：（i）拖拽引起的扰动在潜在空间中累积，导致严重的潜在分布漂移，从而中断拖拽过程；（ii）流式拖拽容易受到上下文帧的干扰，从而产生视觉上不自然的结果。因此，我们提出一种无需训练的方法DragStream，包括：（i）自适应分布自校正策略，利用相邻帧的统计信息有效约束潜在嵌入的漂移；（ii）空间频率选择性优化机制，允许模型充分利用上下文信息，同时通过沿生成过程选择性传播视觉线索来减轻其干扰。我们的方法可以无缝集成到现有的自回归视频扩散模型中，大量实验有力地证明了DragStream的有效性。

英文摘要

Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL), a new task that enables users to modify generated videos anytime on anything via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames, with both supporting user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: (i) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; (ii) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, DragStream, comprising: (i) an adaptive distribution self-rectification strategy that leverages neighboring frames' statistics to effectively constrain the drift of latent embeddings; (ii) a spatial-frequency selective optimization mechanism that allows the model to fully exploit contextual information while mitigating its interference by selectively propagating visual cues throughout generation. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments firmly demonstrate the effectiveness of DragStream.


2510.00936 2026-05-29 cs.CV 版本更新

Resolution as a Direction: Vector-Panning Feature Alignment for Cross-Resolution Re-Identification

分辨率作为方向：跨分辨率重识别的向量平移特征对齐

Zanwu Liu, Chao Yuan, Bo Li, Xiaowei Zhang, Guanglin Niu

Affiliations: School of Artificial Intelligence, Beihang University, Beijing, China; School of Computer Science and Engineering, Beihang University, Beijing, China; College of Computer Science and Technology, Qingdao University, Qingdao, China

AI summary: Proposes Vector Panning Feature Alignment (VPFA), which pans low-resolution features along a learned resolution direction to obtain pseudo-high-resolution representations, giving lightweight and efficient cross-resolution person re-identification.


AI中文摘要

跨分辨率行人重识别（CR-ReID）在实际监控中仍然具有挑战性，其中相机质量和拍摄距离导致低分辨率（LR）查询与高分辨率（HR）图库图像之间存在显著的分辨率差距。先前的方法通常依赖于超分辨率（SR）或分辨率不变表示学习，这往往增加系统复杂性，并且可能无法直接解决由分辨率退化引起的特征不匹配问题。在这项工作中，我们从一项专门分析中报告了一个新的经验发现，其中身份特定的变化被平均化：标准ReID主干产生的HR-LR特征差异在嵌入空间中表现出一致的、与分辨率相关的语义方向。我们进一步基于典型相关分析（CCA）和皮尔逊相关分析支持这一观察。受此发现启发，我们提出了向量平移特征对齐（VPFA），一个轻量级的后处理模块，学习将LR特征沿学习到的分辨率方向平移，以获得伪HR表示。VPFA在特征提取后运行，可以以可忽略的开销集成到现有的ReID系统中。在多个CR-ReID基准上的大量实验表明，VPFA实现了最先进的性能，同时与基于SR或联合训练的方法相比提高了效率。

英文摘要

Cross-resolution person re-identification (CR-ReID) remains challenging in practical surveillance, where camera quality and capture distance lead to substantial resolution gaps between low-resolution (LR) queries and high-resolution (HR) gallery images. Prior approaches commonly rely on super-resolution (SR) or resolution-invariant representation learning, which often increases system complexity and may not directly address the feature mismatch induced by resolution degradation. In this work, we report a new empirical finding from a dedicated analysis in which identity-specific variation is averaged out: the HR--LR feature discrepancy produced by standard ReID backbones exhibits a consistent, resolution-related semantic direction in the embedding space. We further support this observation with statistical analyses based on Canonical Correlation Analysis (CCA) and Pearson correlation analysis. Motivated by this finding, we propose Vector Panning Feature Alignment (VPFA), a lightweight post-hoc module that learns to pan LR features along the learned resolution direction to obtain pseudo-HR representations. VPFA operates after feature extraction and can be integrated into existing ReID systems with negligible overhead. Extensive experiments on multiple CR-ReID benchmarks show that VPFA achieves state-of-the-art performance while improving efficiency compared to SR-based or jointly trained alternatives.
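
The "panning" idea can be illustrated with a toy sketch. The abstract says VPFA learns the resolution direction; as a stand-in, this sketch simply averages HR minus LR feature differences over paired samples, which mirrors the analysis in which identity-specific variation is averaged out. The function names `mean_direction` and `pan` and the step size `alpha` are hypothetical, not the authors' implementation.

```python
def mean_direction(hr_feats, lr_feats):
    """Average HR-minus-LR difference over paired feature vectors.

    Assumption: a plain mean stands in for the learned direction in the paper.
    """
    n = len(hr_feats)
    dim = len(hr_feats[0])
    return [
        sum(h[i] - l[i] for h, l in zip(hr_feats, lr_feats)) / n
        for i in range(dim)
    ]


def pan(lr_feat, direction, alpha=1.0):
    """Shift an LR feature along the direction to get a pseudo-HR feature."""
    return [f + alpha * d for f, d in zip(lr_feat, direction)]


# Two paired samples whose HR and LR features differ only in the first dimension.
hr = [[1.0, 2.0], [3.0, 1.0]]
lr = [[0.5, 2.0], [2.5, 1.0]]
direction = mean_direction(hr, lr)
pseudo_hr = pan([2.0, 2.0], direction)
```

Because the shift is a single vector addition applied after feature extraction, a module of this shape adds almost no inference cost, which is consistent with the "negligible overhead" claim in the abstract.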


2509.21979 2026-05-29 cs.CV cs.AI 版本更新

Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

医疗视觉语言模型中的谄媚行为基准测试与缓解

Juangui Xu, Zikun Guo, Jingwei Lv, Hongbin Lin, Shu Yang, Jun Wen, Di Wang, Lijie Hu

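Affiliations: MBZUAI; Saarland University; HKUST(GZ); KAUST

AI summary: Addresses sycophancy in medical vision-language models with a hierarchical medical visual question answering benchmark and the VIPER strategy, which filters out non-evidence-based social cues to reduce sycophancy and improve robustness.

Comments 19 figures, 61 pages. The first two authors contributed equally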


AI中文摘要

视觉语言模型（VLM）有潜力改变医疗工作流程。然而，其部署受到谄媚行为的限制。尽管这对患者安全构成严重威胁，但系统性的基准测试仍然缺乏。本文通过引入一个医疗基准来填补这一空白，该基准在分层医疗视觉问答任务中对VLM应用多种模板。我们发现当前的VLM极易受到视觉线索的影响，失败率与模型大小或整体准确性相关。我们发现感知权威和用户模仿是强大的触发因素，表明存在独立于视觉数据的偏差机制。为了克服这一点，我们提出了一种基于证据的视觉信息净化响应（VIPER）策略，该策略主动过滤掉非基于证据的社会线索，从而强化基于证据的推理。VIPER在保持可解释性的同时减少了谄媚，并且始终优于基线方法，为VLM的稳健和安全集成奠定了必要的基础。

英文摘要

Visual language models (VLMs) have the potential to transform medical workflows. However, their deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper addresses this gap by introducing a medical benchmark that applies multiple templates to VLMs in a hierarchical medical visual question answering task. We find that current VLMs are highly susceptible to visual cues, with failure rates showing a correlation to model size or overall accuracy. We discover that perceived authority and user mimicry are powerful triggers, suggesting a bias mechanism independent of visual data. To overcome this, we propose a Visual Information Purification for Evidence-based Responses (VIPER) strategy that proactively filters out non-evidence-based social cues, thereby reinforcing evidence-based reasoning. VIPER reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods, laying the necessary foundation for the robust and secure integration of VLMs.


2508.03221 2026-05-29 cs.CR cs.CV 版本更新

软化掩码：自适应时间软掩码用于高效动态面部表情识别

Meng-zhu Li, Quanxing Zha, Hongjun Wu

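Affiliations: Beijing Union University; Huaqiao University; Beijing University of Posts and Telecommunication

AI summary: Proposes AdaTosk, a network that combines self-supervised reconstruction with supervised classification. Its adaptive temporal soft masks (class-agnostic and class-semantic) emphasize key expression moments and reduce semantic redundancy, lowering computational cost while keeping competitive performance.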

Comments 6 pages, 3 figures


DOI: 10.1109/ICME59968.2025.11209787

AI中文摘要

动态面部表情识别（DFER）通过非语言交流促进对心理意图的理解。现有方法难以管理无关信息（如背景噪声和冗余语义），影响效率和有效性。本文提出一种新颖的监督式时间软掩码自编码器网络用于DFER，即AdaTosk，它将并行监督分类分支与自监督重建分支相结合。自监督重建分支应用随机二元硬掩码生成多样化的训练样本，促进可见令牌中的有意义的特征表示。同时，分类分支采用自适应时间软掩码，根据时间重要性灵活地掩盖可见令牌。其两个关键组成部分，即类不可知软掩码和类语义软掩码，分别用于增强关键表情时刻并随时间减少语义冗余。在广泛使用的基准测试上进行的大量实验表明，与当前最先进方法相比，我们的AdaTosk显著降低了计算成本，同时仍保持竞争性能。

英文摘要

Dynamic Facial Expression Recognition (DFER) facilitates the understanding of psychological intentions through non-verbal communication. Existing methods struggle to manage irrelevant information, such as background noise and redundant semantics, which impacts both efficiency and effectiveness. In this work, we propose a novel supervised temporal soft masked autoencoder network for DFER, namely AdaTosk, which integrates a parallel supervised classification branch with the self-supervised reconstruction branch. The self-supervised reconstruction branch applies a random binary hard mask to generate diverse training samples, encouraging meaningful feature representations in visible tokens. Meanwhile, the classification branch employs an adaptive temporal soft mask to flexibly mask visible tokens based on their temporal significance. Its two key components, the class-agnostic and class-semantic soft masks, serve respectively to enhance critical expression moments and to reduce semantic redundancy over time. Extensive experiments conducted on widely used benchmarks demonstrate that AdaTosk remarkably reduces computational costs compared with current state-of-the-art methods while still maintaining competitive performance.


2412.00452 2026-05-29 cs.LG cs.CV 版本更新

Learning Locally, Revising Globally: Global Reviser for Federated Learning with Noisy Labels

局部学习，全局修正：面向含噪标签联邦学习的全局修正器

Yuxin Tian, Mouxing Yang, Yuhao Zhou, Jian Wang, Qing Ye, Tongliang Liu, Gang Niu, Jiancheng Lv

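Affiliations: College of Computer Science, Sichuan University, Chengdu, China; Engineering Research Center of Machine Learning; University of Sydney, Sydney, Australia; Southeast University, Nanjing, China

AI summary: For federated learning where label noise and data heterogeneity coexist, proposes the Federated Global Reviser (FedGR), which exploits the global model's slow memorization of noisy labels. Three modules jointly rectify noisy labels and regularize local training, outperforming eight baselines on three benchmarks.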

Comments ICML 2026 Camera Ready


AI中文摘要

传统的联邦学习（FL）严重依赖高质量标签，这在实际应用中往往不现实，导致联邦标签噪声（F-LN）问题。更糟糕的是，FL的异质性加剧了F-LN问题，因为客户端经历不同的标签噪声类型、比率和数据分布。在本研究中，我们首先观察到FL的全局模型表现出对噪声标签的缓慢记忆现象，这表明其在FL中能够维持可靠的预测和鲁棒的表示。受此启发，我们提出了一种名为联邦全局修正器（FedGR）的新方法，这是一种直接而有效的方法，包含三个模块，协同修正噪声标签并正则化局部训练。通过利用这一固有属性，FedGR以自包含的方式提高了FL对标签噪声的鲁棒性。在三个广泛使用的F-LN基准上的大量实验表明，即使在严重的标签噪声和数据异质性下，FedGR也表现出优越的性能，始终优于八个最先进的基线。代码：https://github.com/cs-yuxintian/FedGR-ICML26

英文摘要

Conventional federated learning (FL) heavily depends on high-quality labels, which are often impractical in the real world, leading to the federated label-noise (F-LN) problem. Worse still, the F-LN problem is exacerbated by the heterogeneity of FL, as clients experience different label-noise types, ratios, and data distributions. In this study, we first observe an intriguing phenomenon: the global model of FL exhibits a slow memorization of noisy labels, suggesting its ability to maintain reliable predictions and robust representations in FL. Motivated by this, we propose a novel method termed Federated Global Reviser (FedGR), a straightforward yet effective method comprising three modules that collaboratively rectify noisy labels and regularize local training. By exploiting this inherent property, FedGR improves the label-noise robustness of FL in a self-contained manner. Extensive experiments on three widely used F-LN benchmarks demonstrate the superior performance of FedGR, which consistently outperforms eight state-of-the-art baselines even under severe label noise and data heterogeneity. Code: https://github.com/cs-yuxintian/FedGR-ICML26


2605.29579 2026-05-29 cs.CV 版本更新

ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation

ReactBench：通过系统评估的多模态幻觉因果驱动基准

Shizhe Zhou, Bohan Jia, Kai Wu, Yan Shen, Tongyun Li, Yuyang Wu, Shaohui Lin

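Affiliations: East China Normal University

AI summary: Introduces the ReactBench benchmark, which uses adversarial images and hallucination-inducing queries to systematically evaluate cause-specific hallucinations of multimodal large language models on four tasks: Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting.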


AI中文摘要

尽管多模态大语言模型（MLLMs）在视觉-语言理解方面取得了快速进展，但它们仍然容易产生多模态幻觉，即生成与视觉输入不一致的响应。现有基准主要侧重于检测幻觉结果，而非评估这些失败的潜在原因。此外，许多基准依赖于简单的场景和有限的评估格式，不再能挑战最先进的模型。为了解决这些局限性，我们引入了ReactBench，一个因果驱动的幻觉基准，具有多个任务和考试式评估格式。通过生成对抗性图像和诱导幻觉的查询，ReactBench引入了四个目标任务：关系擦除、反事实属性、变化追踪和密集计数。这些任务系统地暴露了共现偏差、语言先验、跨图像比较感知缺陷和细粒度感知瓶颈。除了基于标准准确率的评估外，我们利用思维链推理来识别每个任务中幻觉的细粒度子原因。大量评估表明，当前的MLLMs仍然容易受到特定因果幻觉触发因素的影响，这证明了ReactBench作为诊断和提高多模态模型鲁棒性的系统化和可解释测试平台的价值。项目页面见https://reactbench.github.io/。

英文摘要

While multimodal large language models (MLLMs) have achieved rapid progress in vision-language understanding, they remain prone to multimodal hallucinations, producing responses that are inconsistent with the visual input. Existing benchmarks predominantly focus on detecting hallucination outcomes rather than evaluating the underlying causes of these failures. Moreover, many benchmarks rely on simplistic scenarios and limited evaluation formats that no longer challenge state-of-the-art models. To address these limitations, we introduce ReactBench, a cause-driven hallucination benchmark featuring multiple tasks and an exam-style evaluation format. By generating adversarial images and hallucination-inducing queries, ReactBench introduces four targeted tasks: Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting. These tasks systematically expose co-occurrence bias, language priors, cross-image comparative perception deficiencies, and fine-grained perceptual bottlenecks. Beyond standard accuracy-based evaluation, we leverage Chain-of-Thought reasoning to identify fine-grained sub-causes of hallucination within each task. Extensive evaluations reveal that current MLLMs remain notably vulnerable to cause-specific hallucination triggers, demonstrating the value of ReactBench as a systematic and interpretable testbed for diagnosing and improving multimodal model robustness. The project page is available at https://reactbench.github.io/.


2605.29577 2026-05-29 cs.CV 版本更新

Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

通过逆动力学学习缓解视觉-语言-动作模型中的状态混叠

Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim

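Affiliations: KAIST; Korea University

AI summary: Proposes inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. Predicting the action between current and future observations pushes the encoder to capture fine-grained visual differences, mitigating state aliasing.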


AI中文摘要

视觉-语言-动作（VLA）模型通过将预训练的视觉-语言模型（VLM）适应于动作预测，成为统一机器人操作中感知、推理和控制的有前景的框架。然而，VLM 衍生的表示通常对低级控制所需的细微视觉差异不敏感，导致视觉相似但需要截然不同动作的状态之间出现状态混叠。先前的 VLA 研究通过生成视觉或推理输出（如未来帧、2D 接地点或轨迹、或中间空间推理步骤）来改善视觉理解，但这些目标通常仅通过端到端预测间接塑造视觉编码器，并未显式分析学习到的视觉特征空间中的状态混叠。为了缓解状态混叠，我们引入逆动力学学习作为辅助目标，直接监督 VLA 视觉编码器。通过预测当前与未来观测之间的动作，我们的目标鼓励编码器捕捉决定低级动作的细粒度视觉差异。我们进一步使用伪反向监督，使编码器暴露于更广泛的动作方向，并在有限的机器人演示下提高泛化能力。我们的方法适用于多种 VLA 基线，仅使用标准的观测-动作对，无需额外标注，并在测试时保留原始推理流程。在 CALVIN ABC-D 和 SimplerEnv 上的实验表明，在多种 VLA 基线上均获得一致的性能提升。冻结编码器探测和状态-特征对齐分析进一步表明，我们的方法学习了状态判别性的视觉表示，减少了状态混叠，并更好地与机器人状态变化对齐。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space. To mitigate state aliasing, we introduce inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between current and future observations, our objective encourages the encoder to capture fine-grained visual distinctions that determine low-level actions. We further use pseudo-reversed supervision to expose the encoder to a broader range of action directions and improve generalization under limited robot demonstrations. Our method applies to diverse VLA baselines, uses only standard observation-action pairs without additional annotations, and preserves the original inference pipeline at test time. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines. Frozen-encoder probing and state-feature alignment analyses further show that our method learns state-discriminative visual representations that reduce state aliasing and better align with robot state changes.


2605.29575 2026-05-29 cs.CV 版本更新

Optimizing Latent Representations for Robust Building Damage Assessment Onboard Earth Observation Satellites

优化潜在表示以实现地球观测卫星上稳健的建筑物损坏评估

Thomas Goudemant, Benjamin Francesconi

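AI summary: Proposes an AI-based onboard system that encodes pre-disaster images into compact latent representations and compares them in orbit with post-disaster images to localize and classify building damage, reducing downlink volume and improving response time.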

Comments IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2026), Jun 2026, Denver, United States


AI中文摘要

在自然灾害或战区后快速识别受损建筑物对于支持应急响应和优先干预至关重要。地球观测星座提供及时、大范围的覆盖，但可操作信息常因数据下行限制、地面处理及人工解读而延迟。减少这种延迟对于提高决策响应能力至关重要。本文提出一种原创的基于AI的系统，可直接在卫星上从灾前和灾后高分辨率光学图像中进行目标级建筑物损坏评估（定位和损坏分类）。可用的灾前图像在地面编码为紧凑潜在表示，传输至卫星，并与新获取的灾后观测在轨比较。利用AI解读能力和星上处理能力的提升，所提设计支持在数据源直接处理，减少需下行的信息量，同时保留任务相关内容并提高系统整体响应性。我们通过系统基准测试星上兼容变体，分析孪生处理、交叉注意力、潜在空间压缩和面向鲁棒性的数据增强的影响。在xBD数据集上的实验表明，在未对准情况下具有可靠且稳健的损坏评估，且在强压缩下性能退化最小。

英文摘要

Rapid identification of damaged buildings after natural disasters or in war zones is crucial to support emergency response and prioritize interventions. Earth Observation constellations provide timely, large-scale coverage, but actionable information is often delayed by data downlink constraints, on-ground processing, and human interpretation. Reducing this latency is essential to improve decision-making responsiveness. In this work, we propose an original AI-based system that enables object-level building damage assessment (localization and damage classification) directly onboard satellites from pre-disaster and post-disaster high-resolution optical imagery. Available pre-disaster images are encoded on the ground into compact latent representations, transmitted to the satellite, and compared onboard with newly acquired post-event observations. Leveraging AI interpretation capabilities and the increasing processing capabilities onboard satellites, the proposed design enables processing directly at the data source, reducing the amount of information to be downlinked while preserving task-relevant content and improving overall system responsiveness. We explore the design space through a systematic benchmark of onboard-compatible variants, analyzing the impact of Siamese processing, cross-attention, latent-space compression, and robustness-oriented data augmentation. Experiments on the xBD dataset demonstrate reliable and robust damage assessment under misalignment, with minimal performance degradation under strong compression.


2605.29570 2026-05-29 cs.CV 版本更新

DefSynUS: Real-time Patient-specific Intrahepatic Vessel Identification via Deformation-Aware CT-US Domain Adaptation

DefSynUS：通过形变感知CT-超声域自适应的实时患者特异性肝内血管识别

Karl-Philippe Beaudet, Yordanka Velikova, Sidaty El Hadramy, Nassir Navab, Philippe Cattin, Juan Verde, Stéphane Cotin

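Affiliations: Inria; University of Strasbourg; Technical University of Munich; University of Basel; Institute of Image-Guided Surgery

AI summary: Proposes a domain adaptation framework based on physics-based rendering and deformation-aware data augmentation that identifies intrahepatic vessel branches intraoperatively, in real time and patient-specifically, without preoperative ultrasound.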


AI中文摘要

目的：腹腔镜超声通过实时可视化肝内血管增强肝脏手术的安全性。然而，由于探头限制、复杂的血管结构和组织形变，血管识别仍然困难。本研究旨在通过可变形超声增强，实现实时、患者特异性的血管识别，并在形变下保持鲁棒性。方法：利用术前CT血管标注，通过优化的基于物理的渲染生成合成超声数据，并结合域自适应到术中超声。渲染过程以端到端方式训练，用于血管识别和患者特异性，无需术前超声。形变感知增强在渲染流程中模拟真实的术中运动和软组织形变。结果：在腹部体模和有限临床可行性实验（单病例临床评估）中，该框架实现了实时肝内血管分支识别，并在新患者姿势下保持性能。结论：该框架无需术前超声即可实现实时血管识别，并支持技术可行性，但仍需多患者验证以评估泛化性和临床可行性。

英文摘要

Purpose: Laparoscopic ultrasound (LUS) enhances the safety of liver surgery by visualizing intrahepatic vessels in real-time. Still, vessel identification remains difficult due to probe constraints, complex vascular structure, and tissue deformation. This work aims to enable real-time, patient-specific vessel identification that remains robust under deformation through deformable ultrasound augmentation. Methods: Preoperative CT vessel annotations are used to generate synthetic ultrasound data via optimized physics-based rendering, coupled with domain adaptation to intraoperative ultrasound. The rendering is trained end-to-end for vessel identification and patient-specificity, eliminating the need for preoperative ultrasound. A deformation-aware augmentation simulates realistic intraoperative motion and tissue deformation within the rendering pipeline. Results: In abdominal phantom and limited clinical feasibility experiments (single-case clinical evaluation), the framework achieved real-time intrahepatic vessel-branch identification, maintaining performance under new patient poses. Conclusion: The framework enables real-time vessel identification without preoperative ultrasound and supports technical feasibility, but multi-patient validation is still needed for generalizability and clinical feasibility.


2605.29565 2026-05-29 cs.CV cs.RO 版本更新

From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments

从通用视觉到可靠的可通行性估计：适应视觉基础模型用于非结构化户外环境

Ji-Hoon Hwang, Jisung Bae, Dong-Wook Kim, Yeonkyu Lee, Seung-Woo Seo

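AI summary: Proposes the ViTA framework, which adapts vision foundation models for reliable traversability estimation in unstructured outdoor environments using learnable prompts, Perspective-Diversified Training, and geometric knowledge distillation, substantially reducing false positives and improving cross-domain generalization.

Comments 8 pages, 5 figures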


AI中文摘要

基于视觉的方法已成为非结构化户外环境中可通行性估计的主导范式，通常通过语义分割监督来适应视觉基础模型（VFM）。然而，该范式面临三个根本性挑战，削弱了其可靠性：VFM的任务无关设计、可通行性标注的模糊性以及语义标签与物理安全性之间的差异。我们提出了视觉到可通行性适应（ViTA）框架，该框架将VFM适应于可靠的可通行性估计，并在SAM2上实例化。ViTA通过可学习的可通行性提示注入任务特定知识，同时保留VFM的跨域泛化能力。为处理标注模糊性，我们引入了视角多样化训练，通过估计语义不确定性来抑制模糊边界处的自信预测。为弥合语义与可通行性之间的差异，我们在训练期间蒸馏几何知识，使得推理时仅从RGB图像即可进行坡度和高程推理。语义和几何输出融合为一个连续的可通行性分数，同时反映语义不确定性和几何风险。在包括具有挑战性的真实越野数据集在内的多个领域的评估表明，ViTA实现了最先进的IoU和精确度，同时大幅减少误报并具备强大的跨域泛化能力。

英文摘要

Vision-based approaches have become the dominant paradigm for traversability estimation in unstructured outdoor environments, typically adapting vision foundation models (VFMs) via semantic segmentation supervision. However, this paradigm faces three fundamental challenges that undermine its reliability: the task-agnostic design of VFMs, the ambiguity of traversability annotations, and the discrepancy between semantic labels and physical safety. We propose Vision-to-Traversability Adaptation (ViTA), a framework that adapts VFMs for reliable traversability estimation, instantiated on SAM2. ViTA injects task-specific knowledge through learnable traversability prompts while preserving the VFM's cross-domain generalization. To handle annotation ambiguity, we introduce Perspective-Diversified Training, which estimates semantic uncertainty to suppress confident predictions at ambiguous boundaries. To bridge the semantic-traversability discrepancy, we distill geometric knowledge during training, enabling slope and elevation reasoning from RGB images alone at inference. The semantic and geometric outputs are fused into a continuous traversability score that reflects both semantic uncertainty and geometric risk. Evaluations across diverse domains, including challenging real-world off-road datasets, demonstrate that ViTA achieves state-of-the-art IoU and Precision with substantial false-positive reduction and strong cross-domain generalization.


2605.29562 2026-05-29 cs.RO cs.AI cs.CV 版本更新

RadioFormer3D：通过生成式建模在低空空域中进行弱监督三维无线电地图估计

Zheng Fang, Junjie Liu, Kangjun Liu, Jianguo Zhang, Yaowei Wang, Ke Chen

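Affiliations: Pengcheng Laboratory; Southern University of Science and Technology; Harbin Institute of Technology

AI summary: Proposes RadioFormer3D, which uses a Fourier-based sampling encoder, a volumetric decoder, and a Joint Spectrum Integrity Loss to estimate radio maps from sparse 3D measurements under weak supervision, improving reconstruction quality at unlabeled altitudes.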


AI中文摘要

随着三维环境中无线应用（如低空空域和三维异构网络）的出现，无线电地图估计越来越需要表征信号在水平和垂直维度上的传播。然而，由于空间稀疏性增加和连续高度上的监督有限，将无线电地图估计从二维扩展到三维仍然具有挑战性。在本文中，我们提出了RadioFormer3D，一种专门用于弱监督下体素频谱重建的模型。基于RadioFormer的双流多粒度融合架构，RadioFormer3D引入了基于傅里叶的采样编码器和体素解码器，以有效处理三维空间中的稀疏测量。为了缓解垂直监督的缺乏，我们提出了联合频谱完整性损失，它将体素级伪标签监督、地图级几何感知无线电渲染和像素级局部约束整合到一个统一的优化方案中。这种设计使模型能够在稀疏监督下更有效地捕捉复杂的垂直结构关系。在多个无线电地图数据集上的大量实验表明，与现有代表性方法相比，RadioFormer3D实现了优越的整体性能。特别是，它在保持精度和推理效率之间良好权衡的同时，在未标注高度层上展示了改进的重建质量，使其成为未来三维环境感知无线网络的一个非常有前景的解决方案。

英文摘要

With the emergence of wireless applications in three-dimensional environments, such as the low-altitude airspace and 3D heterogeneous networks, radio map estimation is increasingly required to characterize signal propagation across both horizontal and vertical dimensions. However, extending radio map estimation from 2D to 3D remains challenging due to increased spatial sparsity and limited supervision across continuous altitudes. In this paper, we propose RadioFormer3D, a specialized model for volumetric spectrum reconstruction under weak supervision. Building on the dual-stream, multi-granularity fusion architecture of RadioFormer, RadioFormer3D introduces a Fourier-based sampling encoder and a volumetric decoder to efficiently process sparse measurements in 3D space. To alleviate the lack of vertical supervision, we propose the Joint Spectrum Integrity Loss, which integrates volume-level pseudo-label supervision, map-level geometry-aware radio rendering, and pixel-level localized constraints within a unified optimization scheme. This design enables the model to capture complex vertical structural relationships more effectively under sparse supervision. Extensive experiments across several radio map datasets show that RadioFormer3D achieves superior overall performance compared to representative existing methods. In particular, it demonstrates improved reconstruction quality at unlabeled altitudes while maintaining a favorable trade-off between accuracy and inference efficiency, positioning it as a highly promising solution for future 3D environment-aware wireless networks.


2605.29531 2026-05-29 cs.SD cs.CV cs.LG 版本更新

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

使用交叉注意力特征融合的半真音频深度伪造检测与定位

S. Sutharya, Remya K. Sasi

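Affiliations: Department of Computer Science

AI summary: Proposes CAFNet, which jointly detects partially forged audio through ternary classification and boundary regression, reaching 92.71% accuracy and a 0.075 s boundary localisation MAE on the MLADDC dataset.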

Comments 13 pages, 5 figures, 11 tables


AI中文摘要

音频深度伪造检测通常作为二分类问题研究，但部分篡改语音（其中一段短合成片段被拼接进真实语音）构成了更困难且更现实的威胁。检测此类半真音频不仅需要区分真实和完全伪造语音，还需要定位篡改发生的位置。我们提出了CAFNet，一个576k参数的架构，联合处理这两个任务：它在单次前向传播中执行三元分类（真实、完全伪造或半真）并回归合成区域的时间边界。CAFNet通过并行深度可分离卷积分支和交叉注意力融合梅尔频率倒谱系数（MFCC）、线性频率倒谱系数（LFCC）和色度短时傅里叶变换（Chroma-STFT）特征，随后使用双向长短期记忆（BiLSTM）回归头进行边界预测。在组合的多语言音频深度伪造检测语料库（MLADDC）T2+T3测试集上，CAFNet达到92.71%的准确率和0.9910的宏观曲线下面积（AUC），边界定位平均绝对误差（MAE）为0.075秒，中位误差为0.052秒。在二分类检测中，它达到96.76%的准确率和3.20%的等错误率（EER），以超过500倍的参数减少优于微调的XLS-R 300M（78.31%）和AST 87M（93.03%）。跨数据集研究进一步表明，即使在降低骨干学习率的情况下，标准微调也会破坏跨域表示。

英文摘要

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
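
For readers unfamiliar with the Equal Error Rate reported above, the following is a minimal sketch of how an EER can be approximated from detector scores. It is not the authors' evaluation code. It assumes label 1 means fake, label 0 means real, and a higher score means "more likely fake"; the function name `equal_error_rate` is hypothetical. Thresholds are swept over the observed scores only, so the result is an approximation on small samples.

```python
def equal_error_rate(scores, labels):
    """Approximate EER: the operating point where false-accept and
    false-reject rates are closest, reported as their average."""
    fake = [s for s, y in zip(scores, labels) if y == 1]
    real = [s for s, y in zip(scores, labels) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = None, None
    for t in sorted(set(scores)):
        # A sample is flagged as fake when its score is >= t.
        far = sum(s >= t for s in real) / len(real)  # real flagged as fake
        frr = sum(s < t for s in fake) / len(fake)   # fake passed as real
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
overlapping = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```

A perfectly separable score set gives an EER of 0, while the overlapping example, in which one real clip outscores one fake clip, gives 0.5.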


2605.29505 2026-05-29 cs.CV 版本更新

V2XCrafter：学习生成跨智能体的驾驶场景

Yihang Tao, Yu Guo, Senkang Hu, Yanan Ma, Zihan Fang, Sam Kwong, Yuguang Fang

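Affiliations: Hong Kong JC STEM Lab of Smart City; City University of Hong Kong; Lingnan University

AI summary: Proposes the V2XCrafter framework, which uses a progressive multi-agent diffusion model and a cross-agent attention module to generate consistent, controllable collaborative driving scenes across agents' camera views, augmenting data and improving downstream collaborative 3D object detection.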


AI中文摘要

协作驾驶系统利用车联网（V2X）通信进行多智能体协作感知，以提升驾驶安全性，但仍受限于标注的真实世界V2X驾驶数据集稀缺以及在多样化驾驶条件下的泛化能力有限。虽然图像生成技术为数据增强提供了可行的解决方案，但现有针对单车辆多视角场景的方法在多智能体驾驶设置中面临两个基本挑战：（1）学习目标的扩展降低了生成质量；（2）跨智能体的高度动态变化阻碍了对联合观测对象物理属性（如颜色、类别）一致性的建模。为弥补这一差距，我们提出V2XCrafter，这是首个用于跨智能体相机视角生成可控且逼真的协作驾驶场景的框架。为了实现有效学习，我们基于单智能体骨干网络开发了一种渐进式多智能体扩散模型，利用相邻智能体的潜在状态作为参考信号，逐步引导从单智能体到多智能体的扩散过程。为解决跨车辆不一致性问题，我们提出了一个跨智能体注意力模块，该模块利用协作视图图和可学习的联合观测对象表示来建模动态的跨智能体相机视角关系。实验表明，V2XCrafter能够生成高保真且可控的街道视图，并保持跨智能体的一致性，从而有效提升下游协作3D目标检测任务的效果。

英文摘要

Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scenes across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and a learnable jointly observed object representation to model the dynamic cross-agent camera view relationships. Experiments show that V2XCrafter generates high-fidelity, controllable street views that are consistent across agents, thereby effectively improving downstream collaborative 3D object detection.

2605.29462 2026-05-29 cs.CV cs.AI 版本更新

3DVLA：通过3D空间和实例理解增强视觉-语言-动作模型

Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang

发表机构 * Wangxuan Institute of Computer Technology, Peking University（北京大学王选计算机研究所）

AI总结提出3DVLA框架，通过多视角一致性3D特征编码、实例估计模块和掩码自监督3D编码，解决VLA模型缺乏3D场景理解的问题，在LIBERO-Plus和RoboTwin 2.0上显著提升操作性能。

详情

AI中文摘要

视觉-语言-动作（VLA）模型在机器人操作中取得了显著进展，但存在一个关键限制：缺乏3D场景理解。这一缺陷表现为三个相互交织的挑战：由于未强制多视角一致性，3D空间位置提取能力较弱；3D实例理解不足；遮挡下的推理脆弱。尽管已有成熟的3D感知方法，但由于架构不兼容以及对昂贵实例级标注的严重依赖，它们难以直接集成到VLA流程中。为解决上述挑战，我们提出3DVLA，一个即插即用框架，可将稳健的3D推理注入预训练的VLA，既无需额外人工标注，也不丢弃VLM先验。具体来说，3DVLA通过以下方式应对三个挑战：（1）贯穿所有模态、带有显式多视角一致性约束的3D特征编码，以及空间条件几何聚合方法；（2）带有高层实例令牌的实例估计模块，用于实现3D实例感知；（3）掩码自监督3D编码分支，该分支保留其预测器以补全视觉令牌，从而处理遮挡。我们将3DVLA与多个VLA基线集成，并在LIBERO-Plus和RoboTwin 2.0上进行评估。结果显示操作性能持续且显著提升，验证了我们方法的有效性和即插即用兼容性。

英文摘要

Vision-Language-Action models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak extraction of 3D spatial positions without enforcing multi-view consistency, inadequate 3D instance understanding, and fragile reasoning under occlusion. Although mature 3D perception methods exist, their direct integration into VLA pipelines is hindered by architectural incompatibility and by heavy reliance on costly instance-level annotations. To address the above challenges, we propose 3DVLA, a plug-and-play framework that injects robust 3D reasoning into pretrained VLAs without requiring extra manual labels or discarding VLM priors. Specifically, 3DVLA tackles the three challenges through: (1) pervasive 3D feature encoding with explicit multi-view consistency constraints across all modalities and a Spatially-Conditioned Geometry Aggregation method, (2) an instance estimation module with high-level instance tokens for 3D instance awareness, and (3) a masked self-supervised 3D encoding branch that retains its predictor for visual token completion to handle occlusions. We integrate 3DVLA with multiple VLA baselines and evaluate on LIBERO-Plus and RoboTwin 2.0. Results show consistent and significant gains in manipulation performance, validating both the effectiveness and plug-and-play compatibility of our approach.

2605.29415 2026-05-29 eess.IV cs.CV cs.LG eess.SP stat.ML 版本更新

Constructing efficient channels for ideal observers using the conjugate gradient method

使用共轭梯度法构建理想观察者的高效通道

Weimin Zhou

发表机构 * University of Arizona, Wyant College of Optical Sciences（亚利桑那大学怀恩特光学科学学院）； University of Arizona, Department of Radiology & Imaging Sciences（亚利桑那大学放射学与影像科学系）

AI总结针对医学成像系统图像质量的任务评估，提出基于共轭梯度（CG）的方法构建高效通道，以近似贝叶斯理想观察者（IO）和霍特林观察者（HO）的性能。

Comments Submitted to the Journal of Medical Imaging (JMI) Special Issue Honoring Dr. Harrison H. Barrett

2605.29402 2026-05-29 cs.CV cs.AI 版本更新

DMC-CF: 用于因果推理的动态多模态反事实QA基准

Junzhe Zhang, Huixuan Zhang, Guirong Wang, Xingyao Zhang, Pei Liu, Lin Qu, Hu Wei, Xiaojun Wan

发表机构 * Wangxuan Institute of Computer Technology, Peking University（北京大学王选计算机研究所）； Alibaba Group（阿里巴巴集团）

AI总结针对现有因果推理数据集规模有限或基于非真实数据的问题，提出基于真实视频的大规模多模态因果反事实推理基准DMC-CF-Static，并利用动态图干预框架构建动态评估基准DMC-CF-Dynamic，实验表明当前多模态大模型在真实场景下的因果推理能力仍需大幅提升。

详情

AI中文摘要

随着多模态大语言模型（MLLMs）的快速发展，模型已展现出日益强大的多模态能力。然而，通过统计学习训练的MLLMs能否真正理解现实世界背后的因果关系仍是一个关键研究问题。近年来，众多多模态因果推理数据集被提出，但这些数据集要么规模有限，要么基于合成图像和视频、卡通内容或其他非真实多模态来源构建。为解决这些局限性，我们收集真实世界视频并构建了DMC-CF-Static，一个大规模多模态因果反事实推理基准。此外，为缓解传统静态评估中的数据污染等问题，我们使用因果图表示因果事件，并提出动态图干预（DGI）框架，从DMC-CF-Static构建动态评估基准DMC-CF-Dynamic。在包含静态和动态评估基准的整体DMC-CF上的实验结果表明，当前多模态大语言模型在真实场景下的多模态因果推理能力仍需大幅提升。

英文摘要

With the rapid advancement of multimodal large language models (MLLMs), models have demonstrated increasingly powerful multimodal capabilities. However, whether MLLMs trained through statistical learning can truly understand the causal relationships underlying the real world remains a key research question. In recent years, numerous multimodal causal reasoning datasets have been proposed. Nevertheless, these datasets are either limited in scale or constructed from synthetic images and videos, cartoon-based content, or other non-realistic multimodal sources. To address these limitations, we collect real-world videos and construct DMC-CF-Static, a large-scale benchmark for multimodal causal counterfactual reasoning. Furthermore, to mitigate issues such as data contamination in traditional static evaluation, we represent causal events using causal graphs and propose the Dynamic Graph Intervention (DGI) framework to build the dynamic evaluation benchmark DMC-CF-Dynamic from DMC-CF-Static. Experimental results on the overall DMC-CF, which includes both static and dynamic evaluation benchmarks, demonstrate that the multimodal causal reasoning capabilities of current multimodal large language models in real-world scenarios still require substantial improvement.

2605.29335 2026-05-29 cs.CV cs.AI 版本更新

Rethinking FID Through the Geometry of the Reference Dataset

通过参考数据集的几何结构重新思考FID

Yunghee Lee, Byeonghyun Pak

AI总结本文通过分析参考数据集的几何特性（密度和有效秩）来解释Fréchet Inception Distance (FID) 与样本质量之间的不一致性，并提出应结合参考数据集几何结构来更可靠地评估生成模型。

Comments 9 pages, 2 figures. Accepted to ICML 2026 Workshop: Combining Theory and Benchmarks

2605.29330 2026-05-29 cs.CV 版本更新

CapTalk: 文本引导的风格化与语音驱动的3D头部动画

Xuangeng Chu, Yuan Gan, Ziteng Cui, Shuhong Liu, Jian Wang, Bing Zhou, Tatsuya Harada

发表机构 * The University of Tokyo（东京大学）； Snap Research, Snap Inc. ； RIKEN AIP（理化学研究所AIP）

AI总结提出CapTalk框架，通过文本描述控制说话风格和情感，结合语音驱动生成同步唇动和面部表情，支持动态情感变化。

详情

AI中文摘要

音频驱动的3D面部动画旨在从任意音频片段生成同步的唇部运动和生动的面部表情。现有方法虽能产生同步唇动，但通常依赖预定义的身份或风格潜在特征，限制了用户自由控制说话风格的能力。此外，将固定风格或身份应用于整个音频片段通常导致面部动画风格无法适应音频的情感内容。为解决这些挑战，我们重新审视风格与情感的纠缠，构建了一个包含风格和情感文本描述的大规模数据集，并提出了一种新颖的说话头生成框架，能够分别控制风格和情感。我们的模型以说话风格和角色情感的文本描述以及驱动音频流为输入，能够实时生成与描述高度同步的唇部运动和面部表情。此外，我们的模型在推理时支持动态情感控制，能够处理目标情感在语音过程中变化的情况。

英文摘要

Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles. Moreover, applying a fixed style or identity to an entire audio segment typically results in facial animation styles that do not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion. Our model takes as input both textual descriptions of speaking style and character emotion, as well as the driving audio stream, enabling real-time generation of highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes throughout the speech.

2605.29302 2026-05-29 cs.CV 版本更新

ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement

ViASNet：用于预测动态显著性和观众参与度的视频广告显著性网络

Jianping Ye, Michel Wedel

发表机构 * Department of Mathematics, University of Maryland, College Park, MD 20742, USA（美国马里兰大学帕克分校数学系）； Robert H. Smith School of Business, University of Maryland, College Park, MD 20742, USA（美国马里兰大学帕克分校罗伯特·H·史密斯商学院）

AI总结提出基于3D U-Net架构的ViASNet模型，融合音频和场景语义，预测视频广告的动态显著性图，并通过熵分析诊断观众参与度。

详情

AI中文摘要

数字媒体领域已普遍转向电视、社交媒体和电子商务平台上的短视频广告。本研究聚焦于短视频广告的深度显著性预测。深度显著性模型已被用于生成人类眼动注视模式的预测，以增强用户与数字技术的交互并优化其设计。对于视频广告，动态显著性图捕捉观众观看的位置和时间，揭示视频广告为何有效以及如何优化其内容。我们开发并测试了一种新的深度动态显著性预测模型ViASNet（视频广告显著性网络），其架构基于3D U-Net，并考虑了音频和场景语义的影响。我们评估了该模型在151个视频广告上的性能，每个广告约有20名观众观看并记录其眼动，并通过消融实验探索影响模型性能的关键因素。我们逐帧计算预测显著性图的熵，作为诊断工具来识别未能吸引观众的广告和场景，并在15个未见广告的测试数据上展示了其应用。我们的研究表明，通过基于ViASNet等深度显著性模型的自动化系统，可以显著加快广告设计和测试的速度。

英文摘要

The digital media landscape has seen a pervasive shift toward short-form video advertising on TV, social media and e-commerce platforms. The present study focuses on deep saliency prediction for short-form video advertising. Deep saliency models have been used to generate predictions of human eye fixation patterns with the purpose of enhancing user interaction with digital technology and optimizing its design. For video ads, dynamic saliency maps capture where and when viewers are looking, revealing why video ads are effective, and how their content should be optimized. We develop and test a new deep dynamic saliency prediction model called ViASNet (Video Ad Saliency Network), which has an architecture founded on the 3D U-Net, and accommodates the influence of audio and the semantic meaning of scenes. We assess the model's performance on 151 video ads, each seen by about 20 viewers while their eye movements were tracked, and explore the critical factors influencing model performance through ablation experiments. We calculate the entropy of the predicted saliency maps frame-by-frame as a diagnostic tool to identify ads and scenes that fail to engage viewers, and illustrate its use on test data of 15 unseen ads. Our study reveals that ad design and testing can be sped up considerably through automated systems built on deep saliency models such as ViASNet.
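
The frame-by-frame entropy diagnostic mentioned in the abstract can be illustrated with a minimal sketch. It assumes each predicted saliency map is a non-negative 2D array that is normalized to a spatial probability distribution before Shannon entropy is computed; the function names and the toy maps are hypothetical and are not taken from the ViASNet implementation.

```python
import numpy as np

def saliency_entropy(saliency_map):
    """Shannon entropy (in bits) of one saliency map, treated as a spatial distribution."""
    s = np.asarray(saliency_map, dtype=np.float64)
    if s.min() < 0:
        raise ValueError("saliency values must be non-negative")
    total = s.sum()
    if total <= 0:
        raise ValueError("saliency map must contain some positive mass")
    p = s.ravel() / total
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def framewise_entropy(saliency_video):
    """Entropy of every frame in a (T, H, W) stack of saliency maps."""
    return np.array([saliency_entropy(frame) for frame in saliency_video])

# Toy data (hypothetical): one sharply focused frame, one fully diffuse frame.
focused = np.zeros((4, 4))
focused[1, 1] = 1.0
diffuse = np.ones((4, 4))
entropies = framewise_entropy(np.stack([focused, diffuse]))
```

Under this reading, a frame whose predicted attention is spread evenly over the image has high entropy and a frame with a single attention peak has low entropy; how entropy levels map to "failing to engage viewers" is defined by the paper, not by this sketch.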

2605.29230 2026-05-29 cs.CV cs.AI 版本更新

Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data

面向道德的面部年龄估计：无需儿童数据训练的广义零样本基准

Caio Petrucci, Leo Sampaio Ferraz Ribeiro, Sandra Avila

发表机构 * New York University（纽约大学）

AI总结提出一个广义零样本基准，训练时排除儿童数据，评估模型对未见年龄组的泛化能力，发现所有方法均存在严重性能下降和可见类偏见。

Comments 12 pages; 3 figures; 5 tables

详情

AI中文摘要

从面部图像进行年龄估计通常依赖于包含未成年人图像的训练数据，这种做法引发了严重的伦理、法律和隐私问题。在这项工作中，我们提出了一个用于面部年龄估计的广义零样本基准，该基准在训练时明确排除儿童数据，同时仍评估模型在年轻人群上的性能。我们重新审视了六个广泛使用的数据集，并引入了具有严格年龄组划分的标准化分割：18-59岁的样本用于训练、验证和测试；18岁以下的样本仅保留用于零样本评估；60岁以上的样本作为分布偏移下模型选择的未见验证集。对于具有身份注释的数据集，基于主体的分割防止了身份泄露，并更好地反映了实际部署条件。在此协议下评估九种最先进的年龄估计方法，结果表明所有评估方法均无法泛化到未见年龄组，性能相对于监督基线平均下降46.4%，最高达52.8%。此外，模型并非简单退化：它们系统性地将未见年龄的预测锚定到附近的可见类别，这是广义零样本学习中众所周知的可见类偏见的体现。通过将无儿童数据的年龄估计形式化为现有数据集上的广义零样本基准，这项工作突出了当前建模实践与现实伦理约束之间的关键差距。我们的基准为在受限数据制度下评估模型提供了原则性基础，并鼓励开发对分布偏移鲁棒且符合负责任数据使用的方法。

英文摘要

Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical, legal, and privacy concerns. In this work, we propose a generalized zero-shot benchmark for facial age estimation that explicitly excludes children's data during training while still assessing model performance on younger populations. We revisit six widely used datasets and introduce standardized splits with strict age-group separation: samples aged 18-59 for training, validation, and testing; samples under 18 reserved exclusively for zero-shot evaluation; and samples 60+ as an unseen validation set for model selection under distribution shift. For datasets with identity annotations, subject-exclusive splits prevent identity leakage and better reflect real-world deployment conditions. Evaluating nine state-of-the-art age estimation methods under this protocol reveals that all evaluated methods consistently fail to generalize to unseen age groups, suffering substantial performance degradation -- on average 46.4%, and up to 52.8% -- relative to the supervised baseline. Moreover, models do not simply degrade: they systematically anchor predictions for unseen ages to nearby seen classes, a manifestation of the well-known seen-class bias in generalized zero-shot learning. By formalizing age estimation without children's data as a generalized zero-shot benchmark on existing datasets, this work highlights a critical gap between current modeling practices and real-world ethical constraints. Our benchmark provides a principled basis for evaluating models under restricted data regimes and encourages the development of methods that are robust to distribution shift and aligned with responsible data use.
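
The age-group separation described in the abstract can be sketched as a simple partition rule. The boundaries (under 18, 18-59, 60+) come from the abstract; the partition names and helper functions are hypothetical, and the further division of the 18-59 group into training, validation, and test subsets (subject-exclusive where identities are annotated) is not shown.

```python
from collections import defaultdict

def assign_partition(age):
    """Map an age in years to the benchmark partition it may appear in."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "zero_shot_eval"      # never seen during training
    if age <= 59:
        return "seen"                # pool for train / validation / test
    return "unseen_validation"       # 60+, used for model selection under shift

def partition_samples(samples):
    """Group (sample_id, age) pairs by partition name."""
    groups = defaultdict(list)
    for sample_id, age in samples:
        groups[assign_partition(age)].append(sample_id)
    return dict(groups)

# Hypothetical records chosen to sit on the boundaries.
samples = [("a", 7), ("b", 18), ("c", 59), ("d", 60), ("e", 17)]
partitions = partition_samples(samples)
```

Keeping the rule in one function makes the boundary cases (17 versus 18, 59 versus 60) explicit and easy to audit.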

2605.29221 2026-05-29 cs.CV 版本更新

An Approach for Thyroid Nodule Analysis Using Thermographic Images

使用热成像图像进行甲状腺结节分析的方法

J. R. González, É. O. Rodrigues, C. P. Damião, C. A. P. Fontes, A. C. Silva, A. C. Paiva, H. Li, C. Du, A. Conci

发表机构 * Computer Science Department, Universidade Federal Fluminense（弗鲁米嫩塞联邦大学计算机科学系）； Radiology Department, Hospital Universitário Antônio Pedro (HUAP)（安东尼奥·佩德罗大学医院放射科）； Applied Computation Group NCA-UFMA, Universidade Federal do Maranhão（马拉尼昂联邦大学应用计算组NCA-UFMA）

AI总结本文综述了热成像在甲状腺分析中的应用，提出图像采集协议和自主配准方法，并通过特征提取、图像处理和分类方法区分健康与患病患者。

详情

DOI: 10.1007/978-981-10-3147-2_26
Journal ref: Application of Infrared to Biomedical Sciences 2017

AI中文摘要

据预测，到2030年，甲状腺癌将成为女性中第二常见的癌症类型，男性中第三常见。一般来说，早期检测癌症可提高个体生存机会。热成像是一种诊断工具，越来越多地用于检测癌症和异常，包括甲状腺异常。已有多种方法被提出用于分割和检测热成像图中的热区域，从而检测这些图像中存在的可疑组织。众所周知，医学诊断会产生大量信息。因此，医生必须在短时间内全面分析和评估这些信息，这在大多数情况下是不可行的。在这项工作中，我们对热成像进行了全面综述，重点关注甲状腺分析。我们提出了图像采集协议和甲状腺图像的自主配准方法。我们还对图像数据进行了分析，包括特征提取、图像处理以及一种可能的健康或非健康患者分类方法。总之，这项工作提出了在我们大学医院检测肿瘤的试点项目，这是支持我们内分泌科预防性医疗行动的一部分。经过一些未来调整后，该项目将提交给弗鲁米嫩塞联邦大学安东尼奥·佩德罗大学医院（HUAP-UFF）的伦理与研究委员会以及巴西卫生部伦理委员会审批，项目名称为：评估热成像在HUAP-UFF患者甲状腺结节诊断辅助中的重要性（葡萄牙语：Avaliação da importância da termografia no auxílio à investigação diagnóstica de nódulos tireoidianos em pacientes acompanhados no HUAP-UFF）。

英文摘要

Thyroid cancer is projected to be the second most common type of cancer in women and the third in men by 2030. In general, detecting cancer in its early stages improves the individual's chance of survival. Thermography is a diagnostic tool that has been increasingly used to detect cancer and abnormalities, including those of the thyroid. Various methods have been proposed to segment and detect hot regions in thermograms and, consequently, to detect suspicious tissues present in these images. It is well known that medical diagnosis yields a great deal of information. Thus, physicians have to comprehensively analyse and evaluate this information in a short period of time, which is infeasible in most cases. In this work, we perform a general review of thermography, focusing on thyroid analysis. We propose protocols for image acquisition and an autonomous registration for thyroid images. We also perform analyses of the image data, which include feature extraction, image processing, and a possible approach for classification of healthy or unhealthy patients. In summary, this work presents a pilot project for detection of tumors in our university hospital, which is part of an effort to support preventive medical actions in our endocrinology department. After some future adjustments, this project will be submitted for approval by the ethics and research committee of Hospital Universitário Antonio Pedro at Universidade Federal Fluminense (HUAP-UFF) and to the Brazilian Ministry of Health Ethical committee under the name: Evaluation of the importance of thermography to aid diagnosis of thyroid nodules of patients in HUAP-UFF (in Portuguese: Avaliação da importância da termografia no auxílio à investigação diagnóstica de nódulos tireoidianos em pacientes acompanhados no HUAP-UFF).

2605.29220 2026-05-29 cs.CV 版本更新

Motion-guided sparse correction enables expert-quality point tracking across diverse microscopy regimes

运动引导的稀疏校正实现多种显微成像场景下的专家级点跟踪

Leonidas Zimianitis, Pasindu Thenahandi, Kai Buckhalter, Dineth Jayakody, Julian O. Kimura, Xinyue Liang, Karen Cunningham, Azeem Ahmad, Balpreet S. Ahluwalia, Sampath Jayarathna, Nikos Chrisochoides, Brandon Weissbourd, Dushan N. Wadduwage

发表机构 * Department of Computer Science, Old Dominion University（欧道明大学计算机科学系）； Department of Biology, Massachusetts Institute of Technology（麻省理工学院生物学系）； The Picower Institute for Learning and Memory, Massachusetts Institute of Technology（麻省理工学院皮考尔学习与记忆研究所）； Department of Physics and Technology, UiT--The Arctic University of Norway（挪威北极大学物理与技术系）； Department of Physics, University of Oslo（奥斯陆大学物理系）； School of Data Science, Old Dominion University（欧道明大学数据科学学院）； Department of Physics, Old Dominion University（欧道明大学物理系）

AI总结提出RIPPLE方法，通过运动引导的稀疏校正，在多种显微镜视频中实现专家级点跟踪，将手动标注工作量减少3至25倍。

详情

AI中文摘要

在显微镜视频中跟踪非典型生物系统的动力学仍然是一个持续的挑战。经典和基于学习的跟踪器都需要专家审查的数据来进行评估和适应，然而详尽的手动标注很少能扩展到最需要这些工具的视频中。我们开发了RIPPLE（点位置估计的细化插值平台），它将标注重新定义为稀疏校正：用户点击一个起始点，RIPPLE提出完整的轨迹，用户仅在轨迹偏离时进行干预。我们在我们实验室的五个具有挑战性的显微镜数据集上测试了RIPPLE，其中四个来自透明水母Clytia hemisphaerica，一个跟踪快速移动精子上的标志点。在这些数据集中，RIPPLE达到了详尽手动标注的质量，同时将各数据集的手动点击次数减少至原来的1/25至1/3。因此，RIPPLE填补了手动标注和全自动跟踪之间的缺失层，使得能够立即量化生物动力学、进行方法基准测试，并生成适应未来自动显微镜跟踪器所需的金标准数据。

英文摘要

Tracking the dynamics of non-canonical biological systems in microscopy videos remains a persistent challenge. Both classical and learning-based trackers depend on expert-reviewed data to be evaluated and adapted, yet exhaustive manual annotation rarely scales to the videos where these tools are needed most. We developed RIPPLE (Refinement Interpolation Platform for Point Location Estimation), which recasts annotation as sparse correction: a user clicks a starting point, RIPPLE proposes a full trajectory, and the user intervenes only where the trajectory drifts. We tested RIPPLE on five challenging microscopy datasets from our laboratories, four from the transparent jellyfish Clytia hemisphaerica and one tracking landmarks on rapidly moving sperm. Across these, RIPPLE matched the quality of exhaustive manual annotation while reducing manual clicks by 3 to 25 times across datasets. RIPPLE thereby fills a missing layer between manual annotation and fully automated tracking, enabling immediate quantification of biological dynamics, method benchmarking, and the production of the gold-standard data needed to adapt future automated microscopy trackers.

2605.29217 2026-05-29 cs.CV 版本更新

Towards the automated segmentation of epicardial and mediastinal fats: A multi-manufacturer approach using intersubject registration and random forest

朝向心外膜和纵隔脂肪的自动分割：一种使用跨受试者配准和随机森林的多厂商方法

É. O. Rodrigues, A. Conci, F. F. C. Morais, M. G. Pérez

发表机构 * Institute of Computing（计算学院）； Institute of Medicine（医学学院）； Fac. de Ing. en Sist. Electr. e Ind.（系统、电子与工业工程学院）； Universidade Federal Fluminense（弗鲁米嫩塞联邦大学）； Universidade Federal do Rio de Janeiro（里约热内卢联邦大学）； Universidad Técnica de Ambato（安巴托技术大学）

AI总结提出一种基于跨受试者配准和随机森林的全自动方法，用于分割CT图像中的心外膜和纵隔脂肪，平均准确率达98.4%，Dice相似指数为96.8%。

详情

DOI: 10.1109/ICIT.2015.7125355
Journal ref: 2015 IEEE International Conference on Industrial Technology (ICIT)

AI中文摘要

心脏周围的脂肪量与多种健康风险因素相关，如颈动脉僵硬度、冠状动脉钙化、心房颤动、动脉粥样硬化、癌症发病率等。此外，心脏脂肪的变化与受试者的总体脂肪无关，因此对这些脂肪组织进行定量分析十分必要。临床决策支持系统是能够评估信息并提供相应诊断或数据以补充医生分析的计算机程序。本工作的目的是提出一种方法，能够在通过用于冠状动脉钙化评分的标准采集协议获得的CT图像上，全自动分割两种由心包隔开的心脏脂肪组织。我们致力于减少用户干预并提高可重复性。本文提出的方法包括配准（将输入图像粗略对齐到标准）、提取与像素及其周围区域相关的特征，以及基于数据挖掘分类算法的分割步骤，该算法判断输入像素是否属于某一类型。实验表明，心外膜和纵隔脂肪的平均准确率达到98.4%，平均真阳性率为96.2%。平均Dice相似指数为96.8%。

英文摘要

The amount of fat surrounding the heart is correlated with several health risk factors such as carotid stiffness, coronary artery calcification, atrial fibrillation, atherosclerosis, cancer incidence and others. Furthermore, cardiac fat varies independently of the subject's overall fat, which makes quantitative analysis of these adipose tissues essential. Clinical decision support systems are computer programs capable of evaluating information and providing a corresponding diagnosis or data to complement the physicians' analyses. The aim of this work is to propose a method capable of fully automatically segmenting two types of cardiac adipose tissue, which are separated from each other by the pericardium, on CT images obtained by the standard acquisition protocol used for coronary calcium scoring. Much effort was devoted to minimizing user intervention and easing reproducibility. The methodology proposed in this work consists of a registration step, which roughly aligns input images to a standard, an extraction of features related to pixels and their surrounding area, and a segmentation step based on data mining classification algorithms that decide whether an incoming pixel is of a certain type. Experiments showed that the achieved mean accuracy for the epicardial and mediastinal fats was 98.4% with a mean true positive rate of 96.2%. On average, the Dice similarity index was equal to 96.8%.
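
The three figures reported above (accuracy, true positive rate, Dice similarity index) are standard pixel-wise measures. As a minimal sketch, assuming a binary predicted mask and a binary ground-truth mask of the same shape, they can be computed as follows; the function name and the toy masks are hypothetical and do not reproduce the paper's data.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise accuracy, true positive rate, and Dice index for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if pred.shape != truth.shape:
        raise ValueError("masks must have the same shape")
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    accuracy = (tp + tn) / pred.size
    # TPR is undefined when the ground truth has no positive pixels.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    # Two empty masks agree perfectly, so Dice is taken as 1.0 in that case.
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    return {"accuracy": accuracy, "tpr": tpr, "dice": dice}

# Toy masks (hypothetical): 3 true positives, 1 false positive, 1 false negative.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 0, 0, 0]]
metrics = segmentation_metrics(pred, truth)
```

Accuracy counts background pixels as correct, so it can stay high even when the fat region is poorly segmented; that is why the Dice index and true positive rate are reported alongside it.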

2605.29212 2026-05-29 cs.CV cs.HC 版本更新

MetaRanker: Human-in-the-loop Active Ranking for Metalens Image Quality

MetaRanker：用于超透镜图像质量的人机协同主动排序

Yujin Park, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University（汉阳大学）； Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出MetaRanker框架，通过人机协同主动排序，以语义可解释性为指标评估超透镜图像质量，减少80%人工标注量，并实现与人类评估高度一致的排序。

Comments 12 pages, 6 figures

详情

AI中文摘要

现代成像系统中的图像质量源于传感器、光学元件和计算重建的耦合效应。超薄超透镜为实现光学模块的显著小型化提供了途径，但实际设计通常表现出明显的色差和视场相关像差，需要计算重建来补偿。在当前的超透镜流程中，重建模型通常使用基于失真的保真度目标（如PSNR）进行训练和选择，但这些代理指标与人类偏好和下游实用性的相关性较弱，反映了众所周知的感知-失真权衡。我们引入了MetaRanker，一种人机协同主动排序框架，以语义可解释性（定义为人类在存在光学伪影时可靠识别物体和结构的程度）来形式化超透镜图像质量。MetaRanker结合了概率偏好模型与不确定性感知的查询选择，并利用视觉-语言模型提供轻量级语义先验。重要的是，这些先验仅用于指导信息性比较的采样；人类判断始终是主要的监督信号。在具有不同退化特征的现实和合成超透镜数据集上，MetaRanker生成的排序与人类评估最为一致，同时相对于穷举成对评估，所需的成对标注数量减少了约80%。最后，我们表明标准图像质量评估指标在超透镜领域与人类可解释性的对齐有限，这使MetaRanker成为迈向基于感知的超透镜评估和协同设计的实际一步。

英文摘要

Image quality in modern imaging systems emerges from the coupled effects of the sensor, optics, and computational reconstruction. Ultra-thin metalenses offer a path toward substantial miniaturization of optical modules, but practical designs often exhibit pronounced chromatic and field-dependent aberrations that necessitate computational reconstruction. In current metalens pipelines, reconstruction models are commonly trained and selected using distortion-based fidelity objectives, such as PSNR, yet these proxies can be weakly correlated with human preference and downstream utility, reflecting the well-known perception--distortion trade-off. We introduce MetaRanker, a human-in-the-loop active ranking framework that formalizes metalens image quality in terms of semantic interpretability, defined as the degree to which humans can reliably recognize objects and structures in the presence of optical artifacts. MetaRanker combines a probabilistic preference model with uncertainty-aware query selection, and leverages vision--language models to provide lightweight semantic priors. Importantly, these priors are used only to guide the sampling of informative comparisons; human judgments remain the primary supervision signal throughout. Across real-world and synthetic metalens datasets with distinct degradation profiles, MetaRanker produces rankings that align most closely with human assessments, while reducing the number of pairwise annotations required by approximately 80% relative to exhaustive pairwise evaluation. Finally, we show that standard image quality assessment metrics exhibit limited alignment with human interpretability in the metalens domain, positioning MetaRanker as a practical step toward perceptually grounded metalens evaluation and co-design.
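
To make the idea of a probabilistic preference model with uncertainty-aware query selection concrete, here is a minimal sketch. It assumes a Bradley-Terry style model in which each image has a scalar quality score, and it selects the pair whose predicted outcome is closest to 50/50 as the next comparison to show a human. The abstract does not name the specific model or selection rule, so both are illustrative assumptions, as are all names and scores below.

```python
import itertools
import math

def win_probability(score_a, score_b):
    """Bradley-Terry style probability that item A is preferred over item B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def most_uncertain_pair(scores):
    """Pair of item names whose predicted preference is closest to 0.5."""
    pairs = itertools.combinations(sorted(scores), 2)
    return min(
        pairs,
        key=lambda pair: abs(win_probability(scores[pair[0]], scores[pair[1]]) - 0.5),
    )

def exhaustive_pair_count(n_items):
    """Number of comparisons needed to label every unordered pair."""
    return n_items * (n_items - 1) // 2

# Hypothetical current score estimates for four reconstructed images.
scores = {"img_a": 2.0, "img_b": 0.1, "img_c": 0.0, "img_d": -1.5}
next_query = most_uncertain_pair(scores)
all_pairs_for_10_images = exhaustive_pair_count(10)
```

Because the exhaustive pairwise budget grows quadratically with the number of images, asking only about pairs the model cannot already order is one plausible way to reach the kind of annotation savings the abstract reports.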

2605.29136 2026-05-29 cs.CV cs.LG 版本更新

Eulerian Gaussian Splatting using Hashed Probability Pyramids

使用哈希概率金字塔的欧拉高斯溅射

Mia Gaia Polansky, George Kopanas, Stephan Garbin, Todd Zickler, Dor Verbin

发表机构 * Harvard University（哈佛大学）； Google DeepMind（谷歌DeepMind）； Google（谷歌）

AI总结提出一种基于概率溅射的辐射场框架，用梯度优化的体积概率密度替代启发式操作，通过多尺度哈希网格实现端到端优化，在mip-NeRF 360上达到SOTA重建质量并保持3DGS渲染速度。

Comments CVPR 2026. Project Page: https://euleriansplatting.github.io

详情

AI中文摘要

我们引入了一种基于概率溅射的辐射场框架，该框架保留了3D高斯溅射（3DGS）的快速光栅化和测试时效率，同时用基于梯度优化的体积概率密度替代了启发式图元操作。我们不通过手动调整的密集化（例如ADC）来重新定位、分割或剔除高斯体，而是将图元位置视为从持久、可学习的密度中抽取的样本。我们使用一种新颖的、内存高效的多尺度层次网格来实例化该密度，从而实现端到端的梯度优化。为了稳定优化，我们推导了一个带有控制变量的无偏梯度估计器，显著降低了方差。通过允许概率质量流向损失要求的地方，我们的框架消除了脆弱的先验，并自然地探索体积，在mip-NeRF 360上实现了最先进的重建质量，同时保持了3DGS级别的渲染速度。

英文摘要

We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density using a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based optimization. To stabilize the optimization, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our framework eliminates brittle priors and naturally explores the volume, achieving state-of-the-art reconstruction quality on mip-NeRF 360 while preserving 3DGS-level rendering speed.
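
The role of a control variate in an unbiased gradient estimator can be shown on a one-dimensional toy problem. The sketch below is not the paper's estimator: it assumes samples drawn from a unit-variance Gaussian with learnable mean `theta`, uses the score-function (REINFORCE) estimator, and subtracts a constant baseline as the control variate. The objective `f` and all names are hypothetical.

```python
import numpy as np

def score_function_gradients(f, theta, n_samples, rng, baseline=0.0):
    """Per-sample estimates of d/dtheta E_{x ~ N(theta, 1)}[f(x)]."""
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)
    score = x - theta  # d/dtheta of log N(x; theta, 1)
    # E[baseline * score] = 0, so subtracting the baseline leaves the mean unchanged.
    return (f(x) - baseline) * score

def f(x):
    """Toy loss with a large constant offset that inflates estimator variance."""
    return (x - 3.0) ** 2 + 10.0

rng = np.random.default_rng(0)
theta = 1.0
# The baseline is estimated from an independent batch so the estimator stays unbiased.
baseline = f(rng.normal(loc=theta, scale=1.0, size=10_000)).mean()
plain = score_function_gradients(f, theta, 200_000, rng)
controlled = score_function_gradients(f, theta, 200_000, rng, baseline=baseline)
true_gradient = 2.0 * (theta - 3.0)  # analytic gradient of E[f] for this toy loss
```

Both sample means approximate the same analytic gradient, while the per-sample variance of `controlled` is much smaller than that of `plain`; this is the general mechanism by which a control variate stabilizes sample-based optimization.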

2605.29122 2026-05-29 cs.CV 版本更新

Robust Cross-Domain Generalization Using Unlabeled Target Data with Source-Domain Supervision

利用源域监督和无标签目标数据的鲁棒跨域泛化

Yuyue Zhou, Shrimanti Ghosh, Michael Xie, Justin JY Kim, Jessica Knight, Steel McDonald, Vincent Man, Jacob L. Jaremko, Abhilash Hareendranathan

发表机构 * Department of Radiology and Diagnostic Imaging, University of Alberta（阿尔伯塔大学放射学与诊断影像学系）

AI总结针对医学影像AI模型跨设备泛化问题，提出结合目标域无监督预训练（掩码图像建模与对比学习）和源域监督训练的策略，在儿科腕部超声骨折检测中实现超过6%的Dice提升。

详情

AI中文摘要

通常，我们希望将使用密集标注训练的医学影像AI模型泛化到来自不同超声扫描仪或临床站点的数据；然而，使用新标注重新训练这些模型往往困难且成本高昂。我们在儿科腕部骨折评估中研究了这一挑战，使用床旁超声（POCUS），其中骨折常见且可通过超声有效分诊。AI在骨折检测中已展现出放射科医生级别的性能，通常借助高质量骨结构分割。然而，由于显著的域偏移，模型在其他中心或探头的数据上表现不佳，并且由于手动标注工作和数据隐私问题，跨设备获取分割标签不切实际。为了解决这个问题，我们提出了一种目标信息引导的自监督预训练和模型集成策略。具体来说，我们的方法结合了掩码图像建模（MIM）和对比学习，无需标签即可学习目标域结构表示，并引入了一个置信度感知融合头来自适应地集成预测。使用Philips Lumify探头收集的源数据集包含密集标签，而使用TeleMED便携式探头收集的目标数据集未标注。整个过程中数据集严格分离。我们的方法使用带标签的源数据进行监督训练，并利用目标域预训练来提高泛化能力。在来自62个儿科POCUS视频的318张图像上，该方法显著提高了跨设备性能，与基线相比，目标域的Dice提升了超过6%。这些结果展示了一种标签高效且保护隐私的跨设备鲁棒超声AI方法，提供了一个可扩展到多中心研究或联邦学习设置的框架。

英文摘要

It is often desirable to generalize medical imaging AI models trained with dense annotations to data acquired from different ultrasound scanners or clinical sites; however, retraining these models with new annotations is often difficult and costly. We examine this challenge in pediatric wrist fracture assessment using point-of-care ultrasound (POCUS), where fractures are common and can be effectively triaged via ultrasound. AI has shown radiologist-level performance for fracture detection, often aided by high-quality bony structure segmentation. However, due to significant domain shifts, models perform poorly on data from other centers or probes, and obtaining segmentation labels across devices is impractical due to manual annotation effort and data privacy concerns. To address this, we propose a target-informed self-supervised pretraining and model-ensemble strategy. Specifically, our approach combines masked image modeling (MIM) and contrastive learning to learn target-domain structural representations without labels, and introduces a confidence-aware infusion head to adaptively integrate predictions. The source dataset, collected with a Philips Lumify probe, contained dense labels, while the target dataset, acquired with a TeleMED portable probe, was unlabeled. The datasets were kept strictly separate throughout the entire process. Our method used labeled source data for supervised training and leveraged target-domain pretraining to improve generalization. On 318 images from 62 pediatric POCUS videos, this approach significantly improved cross-device performance, achieving over 6% Dice improvement on the target domain versus the baseline. These results demonstrate a label-efficient and privacy-preserving approach for cross-device-robust ultrasound AI, offering a framework that can be extended to multi-center studies or federated learning setups.

2605.29098 2026-05-29 cs.CV 版本更新

Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals

透视箱子：基于雷达信号的非视距三维重建

Jiachen Lu, Hailan Shanbhag, Haitham Al Hassanieh

发表机构 * École Polytechnique Fédérale de Lausanne（洛桑联邦理工学院）

AI总结提出统一视距与非视距神经几何重建框架GeRaF 2.0，利用外部视距几何约束引导射频信号传播，实现稳定训练和物理一致的重建，在射频几何重建中达到新最优。

Comments Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

详情

AI中文摘要

从射频信号重建物体几何形状具有根本性挑战，因为射频传感的无透镜成像特性导致低空间分辨率和强噪声。与光信号不同，射频信号可以穿透遮挡物，从而捕获隐藏场景的信息。现有的非视距三维神经重建方法可以恢复封闭环境内的粗糙表面，但常常面临优化不稳定、表面几何噪声大和表面模糊等问题，无法从符号距离场生成精确的零水平集。这些局限性很大程度上源于忽略了封闭区域外视距几何的作用，而视距几何为建模信号传播提供了有价值的物理约束。本文提出统一视距与非视距神经几何重建框架GeRaF 2.0，利用外部视距几何来建模并引导射频信号从视距区域传播到非视距区域。通过将视觉视距先验融入神经场公式，GeRaF 2.0实现了可见和隐藏几何的稳定训练和物理一致重建，在基于射频的几何重建中达到了新的最优水平。

英文摘要

Reconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce a Unified LoS and NLoS neural geometry reconstruction framework GeRaF 2.0 that leverages the outside LoS geometry to model and guide RF propagation from the LoS region into the NLoS region. By integrating visual LoS priors into the neural field formulation, GeRaF 2.0 achieves stable training and physically consistent reconstruction of both visible and hidden geometry, setting a new state-of-the-art in RF-based geometry reconstruction.

2605.29097 2026-05-29 cs.CV 版本更新

GeRaF: Neural Geometry Reconstruction from Radio Frequency Signals

GeRaF: 从射频信号进行神经几何重建

Jiachen Lu, Hailan Shanbhag, Haitham Al Hassanieh

发表机构 * École Polytechnique Fédérale de Lausanne（洛桑联邦理工学院）

AI总结提出GeRaF方法，利用神经隐式学习从射频信号重建近距3D几何，通过滤波渲染、物理射频体渲染和无透镜采样策略解决低分辨率、噪声和镜面反射问题。

Comments Accepted at NeurIPS 2025 (Spotlight)

详情

Journal ref: Advances in Neural Information Processing Systems 38 (2026): 94200-94230

AI中文摘要

GeRaF是首个利用神经隐式学习从射频信号进行近距3D几何重建的方法。与基于RGB或LiDAR的方法不同，射频传感可以穿透遮挡，但由于其无透镜成像特性，存在分辨率低和噪声大的问题。虽然RGB成像中的透镜将采样限制在1D射线上，但射频信号在整个空间中传播，引入显著噪声并导致体渲染的立方复杂度。此外，射频信号通过镜面反射与表面相互作用，需要根本不同的建模。为解决这些挑战，GeRaF (1) 引入基于滤波的渲染以抑制无关信号，(2) 实现基于物理的射频体渲染管线，(3) 提出一种新颖的无透镜采样和无透镜alpha混合策略，使训练期间的全空间采样可行。通过MLP和可训练参数学习符号距离函数、反射率和信号功率，GeRaF迈出了从射频信号在真实环境中重建毫米级几何的第一步。

英文摘要

GeRaF is the first method to use neural implicit learning for near-range 3D geometry reconstruction from radio frequency (RF) signals. Unlike RGB or LiDAR-based methods, RF sensing can see through occlusion but suffers from low resolution and noise due to its lensless imaging nature. While lenses in RGB imaging constrain sampling to 1D rays, RF signals propagate through the entire space, introducing significant noise and leading to cubic complexity in volumetric rendering. Moreover, RF signals interact with surfaces via specular reflections, requiring fundamentally different modeling. To address these challenges, GeRaF (1) introduces filter-based rendering to suppress irrelevant signals, (2) implements a physics-based RF volumetric rendering pipeline, and (3) proposes a novel lensless sampling and lensless alpha blending strategy that makes full-space sampling feasible during training. By learning signed distance functions, reflectiveness, and signal power through MLPs and trainable parameters, GeRaF takes the first step towards reconstructing millimeter-level geometry from RF signals in real-world settings.
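
For readers unfamiliar with signed distance functions, the quantity GeRaF learns with MLPs, here is a minimal analytic example. It assumes the common convention of negative values inside the surface, zero on it, and positive values outside, and uses a sphere in place of a learned network; the function name and query points are hypothetical.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance from each 3D point to a sphere surface (negative inside)."""
    points = np.asarray(points, dtype=np.float64)
    center = np.asarray(center, dtype=np.float64)
    return np.linalg.norm(points - center, axis=-1) - radius

# Three query points: at the center, on the surface, and outside the unit sphere.
queries = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [2.0, 0.0, 0.0]])
distances = sphere_sdf(queries, center=[0.0, 0.0, 0.0], radius=1.0)
```

The reconstructed surface is the zero-level set of such a function, so geometry accuracy depends on how well the learned field locates the points where the value crosses zero.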

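2605.29092 2026-05-29 cs.CV cs.LG cs.MM Version update

Lightweight Complementary-Cue Fusion for Robust Video Face Forgery Detection

Sunghwan Baek, Tariq Anwaar, Karanveer Singh, Rita Singh

Affiliations: Carnegie Mellon University

AI summary: Proposes a lightweight fusion module that combines handcrafted features (a wavelet-denoised feature with either a phase spectrum or local binary patterns), substantially improving the robustness of video face forgery detection with a negligible increase in parameters.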

Comments 13 pages, 6 figures, 3 tables

Abstract

Current face video forgery detectors use wide or dual-stream backbones. We show that a single, lightweight fusion of two handcrafted cues can achieve higher accuracy with a much smaller model. Based on the Xception baseline model (21.9 million parameters), we build two detectors: LFWS, which adds a 1x1 convolution to combine a low-frequency Wavelet-Denoised Feature (WDF) with a phase-spectrum channel derived from Spatial-Phase Shallow Learning (SPSL), and LFWL, which merges WDF with Local Binary Patterns (LBP) in the same way. This extra module adds only 292 parameters, keeping the total at 21.9 million, smaller than F3Net (22.5 million) and less than half the size of SRM (55.3 million). Even with this minimal overhead, the fused models increase the average area under the curve (AUC) from 74.8% to 78.6% on FaceForensics++ and from 70.5% to 74.9% on DFDC-Preview, gains of 3.8 and 4.4 percentage points over the Xception baseline. They also consistently outperform F3Net, SRM, and SPSL on eight public benchmarks, without extra data or test-time augmentation. These results show that carefully paired handcrafted features, combined through the lightweight fusion block, can provide competitive robustness at a significantly lower cost than comparable frequency-based detectors. Our findings suggest a need to reevaluate scale-driven design choices in face video forgery detection.
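
The fusion step in this abstract can be pictured with a minimal sketch. A 1x1 convolution with one output channel is a per-pixel weighted sum over the input channels. The snippet assumes two single-channel cue maps given as nested lists (stand-ins for WDF and a second handcrafted cue); the function names, weights, and shapes are illustrative assumptions, not the authors' implementation. The 292-parameter figure depends on the paper's actual channel configuration, which the abstract does not state.

```python
def fuse_1x1(cue_maps, weights, bias=0.0):
    """Fuse same-sized 2D cue maps with a single-output 1x1 convolution.

    A 1x1 convolution mixes channels independently at every pixel, so the
    output at (r, c) is bias + sum_k weights[k] * cue_maps[k][r][c].
    """
    if len(cue_maps) != len(weights):
        raise ValueError("one weight per input channel is required")
    height, width = len(cue_maps[0]), len(cue_maps[0][0])
    return [
        [bias + sum(w * m[r][c] for w, m in zip(weights, cue_maps))
         for c in range(width)]
        for r in range(height)
    ]


def conv1x1_param_count(in_channels, out_channels, use_bias=True):
    """Number of learnable parameters in a 1x1 convolution layer."""
    return in_channels * out_channels + (out_channels if use_bias else 0)


# Hypothetical 1x2 cue maps standing in for WDF and a second handcrafted cue.
wdf = [[1.0, 2.0]]
second_cue = [[3.0, 4.0]]
fused = fuse_1x1([wdf, second_cue], weights=[0.5, 0.25])
```

Because the layer only mixes channels, its parameter count is independent of image size, which is why such a module can stay tiny next to a 21.9-million-parameter backbone.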

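2605.29089 2026-05-29 cs.LG cs.AI cs.CV Version update

OISD: On-Policy Internal Self-Distillation of Language Models

Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He

Affiliations: Auburn University; William & Mary

AI summary: Proposes the OISD framework, which distills the final layer's predictive signals into intermediate layers through logit alignment and attention alignment, improving reasoning and substantially outperforming baselines on mathematical reasoning tasks.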

Comments Under Review for Publication

Abstract

Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. OISD, together with GRPO, employs signed advantage-weighted Jensen-Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD.
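
As background for the Jensen-Shannon alignment mentioned in this abstract, the following is a minimal sketch of the plain Jensen-Shannon divergence between a teacher and a student token distribution. The logits, the `softmax` helper, and the variable names are illustrative assumptions; the paper's signed advantage weighting and attention alignment are not reproduced here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: the final layer acts as teacher, an intermediate
# layer (read out through the same vocabulary head) acts as student.
teacher = softmax([2.0, 0.5, -1.0])
student = softmax([1.0, 1.0, 0.0])
alignment_loss = js_divergence(student, teacher)
```

Unlike a one-directional KL term, this quantity stays finite even when one distribution assigns zero probability to a token, which is one common reason to prefer it for distillation.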

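2605.29088 2026-05-29 cs.CV Version update

A Deep Learning Iterative Framework for Sentinel-1 Stripmap Enhancement Based on Azimuth Doppler Decomposition

Juan Francisco Amieva, Christian Ayala, Roberto Del Prete, Mikel Galar

Affiliations: Tracasa Instrumental S.L.; European Space Agency; Public University of Navarre

AI summary: Proposes a self-supervised enhancement framework based on azimuth subaperture decomposition that generates training data from the physical consistency between subaperture and full-aperture images, progressively improves image quality through single- and multi-frame learning with iterative inference, and outperforms MERLIN in PSNR and SSIM on Sentinel-1 Stripmap data.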

Comments Accepted at the AI4Space Workshop, CVPR 2026

Abstract

Synthetic Aperture Radar (SAR) imagery enables all-weather, day-and-night Earth observation; however, it remains difficult to interpret due to speckle noise and other intrinsic imaging artifacts. Sentinel-1 (S1) constitutes one of the most widely used spaceborne SAR missions, offering systematic global coverage, high temporal resolution, dual-polarization imaging, and free data availability. Among S1 modes, Stripmap (SM) provides the highest resolution, yet speckle noise and spatial constraints often hinder applications requiring finer spatial detail. This motivates the need for effective image enhancement strategies. In this work, we propose a self-supervised enhancement framework for S1 SM imagery based on azimuth subaperture decomposition. The method exploits the physical consistency between subaperture reconstructions and the corresponding full-aperture image to generate paired training data without external sensors, simulated ground truth, or multi-temporal stacks. The proposed framework integrates single- and multi-frame learning and incorporates an iterative inference scheme that progressively refines image quality. Experiments on real S1 SM data show that the proposed approach consistently outperforms the widely adopted self-supervised deep learning baseline MERLIN, in terms of PSNR and SSIM, while MERLIN attains higher ENL, highlighting a trade-off between structural fidelity and speckle smoothing. Overall, the results demonstrate that subaperture-based supervision provides a physically grounded, reproducible, and operationally viable approach for SAR image enhancement using S1 data. It is worth noting that the proposed approach can be extended to other SAR platforms, polarizations, and acquisition modes.
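
Two of the metrics named in this abstract have simple standard definitions, sketched below under stated assumptions: ENL is computed as mean squared over variance of intensity samples from a region assumed to be homogeneous, and PSNR uses a peak value of 1.0. The sample values and function names are illustrative and not taken from the paper.

```python
import math
import statistics


def enl(intensity_samples):
    """Equivalent number of looks: mean^2 / variance over a homogeneous region.

    Higher values indicate stronger speckle smoothing.
    """
    mean = statistics.fmean(intensity_samples)
    variance = statistics.pvariance(intensity_samples)
    if variance == 0:
        return math.inf
    return mean * mean / variance


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    squared_errors = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    mse = statistics.fmean(squared_errors)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


# Hypothetical values for illustration only.
region_enl = enl([2.0, 4.0])                 # mean 3, variance 1
quality_db = psnr([0.0, 1.0], [0.0, 0.9])    # MSE 0.005
```

The two metrics reward different things, which is consistent with the trade-off the abstract reports: ENL grows as a region is smoothed, while PSNR and SSIM penalize any loss of structure relative to the reference.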

2605.29074 2026-05-29 cs.CV cs.RO Version update

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong

Affiliations: CFCS, School of CS, PKU; Jingdong Technology Information Technology Co., Ltd

AI summary: Proposes the Embodied3DBench benchmark, which systematically evaluates the low-level spatial intelligence of vision language models in 3D environments across 6 task categories (spatial structural understanding and interaction-oriented perception), and synthesizes 1.3M QA pairs of training data to close the capability gap.

Abstract

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.

2605.29064 2026-05-29 cs.CL cs.CV cs.HC cs.MA Version update

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

Neemias da Silva, Myriam Delgado, Rodrigo Minetto, Daniel Silver, Thiago H Silva

Affiliations: Universidade Tecnologica Federal do Parana; University of Toronto

AI summary: Compares text generated by multimodal LLMs under different persona prompts and a no-persona setting, finding that captions converge while rationales vary systematically with socioeconomic and political attributes, with no significant difference in perception labels.

Comments 10 pages, 6 figures

2605.29063 2026-05-29 eess.IV cs.CV Version update

Accelerating HEVC Intra Partitioning via a CNN-Hierarchical Attention Transformer Hybrid

Krishna Kumar Sharma, Somdyuti Paul

Affiliations: Department of Artificial Intelligence, Indian Institute of Technology Kharagpur

AI summary: Proposes HFViT, a hybrid architecture that fuses reparameterized depthwise-separable convolutions with a Hierarchical Attention Transformer to propagate global information efficiently at low complexity, reducing the VMAF BD-rate penalty in HEVC intra partition prediction while keeping CPU latency low.

Abstract

The recursive quad-tree partitioning in High Efficiency Video Coding (HEVC) incurs considerable computational overhead, with exhaustive rate-distortion optimization for CTU partition prediction consuming the dominant share of encoding time. Although partition prediction through deep learning has emerged as a viable encoding accelerator, an architectural dichotomy remains largely unaddressed: CNNs are computationally efficient but spatially myopic due to their localized effective receptive fields, failing to capture long-range semantic relationships and repetitive textures; conversely, transformer-based architectures are better at capturing global context but incur prohibitive CPU latency, a critical liability because deployment is predominantly CPU-bound. This paper introduces Hybrid Fast Vision Transformer (HFViT), a hybrid architecture designed to accelerate HEVC intra-mode partition prediction. HFViT fuses a reparameterized depthwise-separable convolutional backbone with a Hierarchical Attention Transformer (HAT) mechanism, leveraging a carrier token scheme to enable efficient global information propagation at sub-quadratic complexity. Post-training structural fusion collapses batch normalization into preceding layers to further reduce latency. Comprehensive evaluation reveals the efficacy of HFViT in accelerating HEVC intra-encoding across resolutions. On standard JCT-VC test sequences, HFViT reduces the average VMAF BD-rate penalty by 2.4, 2.6, and 7.9 percentage points on Classes A, B, and E, respectively, compared with the competing ETH-CNN baseline, while maintaining CPU inference latency within 8% of the CNN baseline and surpassing it on GPU by 40%, establishing practical viability for real-time encoder integration.
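
The batch-normalization folding mentioned in this abstract follows a standard identity, sketched here for a single output channel with scalar weights. The function names and numeric values are illustrative assumptions; a real convolution applies the same per-channel scale to every weight in that channel's kernel.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold inference-time batch norm into the preceding affine layer.

    BN(w*x + b) = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
                = (w * s) * x + ((b - mean) * s + beta),  s = gamma / sqrt(var + eps)
    """
    scale = gamma / math.sqrt(running_var + eps)
    return weight * scale, (bias - running_mean) * scale + beta


def layer_then_bn(x, weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Reference path: affine layer followed by a separate batch norm."""
    y = weight * x + bias
    return gamma * (y - running_mean) / math.sqrt(running_var + eps) + beta


# Hypothetical per-channel parameters and running statistics.
params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2,
              running_mean=0.3, running_var=4.0)
folded_w, folded_b = fold_batchnorm(**params)
reference = layer_then_bn(2.0, **params)
fused_output = folded_w * 2.0 + folded_b
```

The folded layer produces the same output with one fewer operation per activation, so the saving comes purely from removing work at inference time, not from changing the model.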

2605.29012 2026-05-29 cs.CV Version update

Trajectory Constraints for Imaging Inverse Problems

Chaoyan Huang, Haijie Yuan, Saiprasad Ravishankar

Affiliations: Department of Computational Mathematics, Science, & Engineering, Michigan State University; Department of Electrical Engineering and Computer Science, University of Michigan; Department of Biomedical Engineering, Michigan State University

AI summary: Proposes TRACE, a framework that constrains the reconstruction trajectory by coupling adjacent states, stabilizing the reconstruction process of diffusion-based and iterative methods for imaging inverse problems and improving reconstruction quality.

Comments 20 pages, 10 figures

Abstract

Diffusion-based and iterative methods have become effective tools for solving imaging inverse problems. Their reconstruction process naturally forms a trajectory of intermediate estimates. Although these intermediate estimates define a reconstruction trajectory, most methods do not explicitly regularize the transitions between consecutive states. To address this limitation, we introduce TRACE, a training-free TRAjectory-Constrained rEconstruction framework that stabilizes the reconstruction path by coupling adjacent states along the trajectory. This gives a trajectory-level model that can be interpreted as a sequence of proximal updates. Since the exact proximal update is generally intractable, we approximate it with a neural mapping. This yields a diffusion-like reconstruction process with an explicit coupling between neighboring states. We provide a stability analysis showing that temporal coupling bounds trajectory variation and that this control is preserved under untrained network updates. Experiments on linear and nonlinear image reconstruction tasks show that TRACE improves reconstruction quality. Trajectory-level analyses and ablations confirm that temporal coupling directly affects state transitions along the reconstruction path.
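
The idea of coupling adjacent states through proximal updates can be illustrated on a toy scalar problem. The sketch assumes a quadratic data term 0.5*(a*x - y)^2 and a coupling weight `lam`, both chosen for illustration; it uses the exact closed-form proximal step, whereas the paper approximates that step with a neural mapping that is not reproduced here.

```python
def coupled_proximal_step(x_prev, a, y, lam):
    """Minimize 0.5*(a*x - y)**2 + 0.5*lam*(x - x_prev)**2 in closed form.

    Setting the derivative a*(a*x - y) + lam*(x - x_prev) to zero gives
    x = (a*y + lam*x_prev) / (a*a + lam).
    """
    return (a * y + lam * x_prev) / (a * a + lam)


def trajectory(x0, a, y, lam, steps):
    """Run the coupled update and return every intermediate state."""
    states = [x0]
    for _ in range(steps):
        states.append(coupled_proximal_step(states[-1], a, y, lam))
    return states


def max_step(states):
    """Largest jump between consecutive states along a trajectory."""
    return max(abs(b - c) for b, c in zip(states[1:], states[:-1]))


# Hypothetical measurement model y = a * x with a = 2, y = 4 (solution x = 2).
uncoupled = trajectory(0.0, a=2.0, y=4.0, lam=0.0, steps=5)
coupled = trajectory(0.0, a=2.0, y=4.0, lam=4.0, steps=5)
```

In this toy setting a larger `lam` shrinks every jump between neighboring states while the iterates still approach the same solution, which mirrors the abstract's claim that temporal coupling bounds trajectory variation.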

2605.29004 2026-05-29 cs.CV cs.GR Version update

Auditing Training-Free 3D Shape Retrieval with Diffused Geodesic Moments

Zhicheng Du, Changyue Liu, Wenji Xi, Zhaotian Xie, Zhuo Deng, Ziheng Zhang, Yang Liu, Lan Ma

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Guangzhou International Economics College; School of Electrical and Electronic Engineering, The University of Sheffield

AI summary: Proposes Diffused Geodesic Moments (DGM) as a training-free shape descriptor, uses a protocol audit to isolate the effects of local signal design, normalization, aggregation, codebook fitting, and metric choice, and confirms on FAUST-Reg and TOSCA that the protocol can dominate results.

Abstract

Reported retrieval scores for training-free shape descriptors conflate local signal design, normalization, aggregation, codebook fitting, and metric choices, making isolated component evaluation difficult. This paper reframes descriptor evaluation as a protocol audit. We introduce Diffused Geodesic Moments (DGM), a seed-conditioned descriptor that computes sparse implicit heat responses, converts them to distance-like fields, and summarizes each vertex by low-order moments across seeds and scales. DGM is used both as a practical non-spectral baseline and as an instrument for isolating protocol effects. On the registered FAUST benchmark split (FAUST-Reg) and the TOSCA shape collection, aggregation-matched experiments show that an independent Geometric Moment Shape Descriptor baseline built on Heat Kernel Signature features (GMSD-HKS) obtains the highest scores in this implementation ($0.621/0.820$ and $0.865/0.963$ mean average precision (mAP)/top-1), Wave Kernel Signature (WKS) remains a strong classical signal, and DGM is useful mainly when sparse solves, non-spectral deployment, or symmetry-informative seed frames are priorities. The broader finding is methodological: the input field and aggregation protocol can dominate the moment formula. The paper contributes a reproducible protocol-cascade analysis, a cross-shape alignment diagnostic for functional-map compatibility, and concrete recommendations for designing and reporting training-free shape descriptors.
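
The mAP/top-1 pairs quoted in this abstract are standard retrieval metrics. A minimal sketch of how they are commonly computed follows; it assumes each query yields a ranked list of booleans (True where the retrieved shape shares the query's class) that contains every relevant item, and the example rankings are hypothetical rather than taken from the paper.

```python
def average_precision(ranked_relevance):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def retrieval_scores(rankings):
    """Return (mAP, top-1 accuracy) over a list of per-query rankings."""
    m_ap = sum(average_precision(r) for r in rankings) / len(rankings)
    top1 = sum(1 for r in rankings if r and r[0]) / len(rankings)
    return m_ap, top1


# Two hypothetical queries: the first finds relevant shapes at ranks 1 and 3,
# the second only at rank 2.
m_ap, top1 = retrieval_scores([[True, False, True], [False, True]])
```

Both numbers depend on how the ranking is produced (distance metric, normalization, aggregation), which is why the abstract treats the evaluation protocol itself as something to audit.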

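2605.28962 2026-05-29 cs.CV Version update

Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment

Yurong Gao, Zicheng Zhang, Congying Han, Tiande Guo, Xinmin Qiu

Affiliations: University of Chinese Academy of Sciences

AI summary: Addresses the underfitting that diffusion bridge models exhibit near the target endpoint by proposing the Noise-Aligned Diffusion Bridge (NADB), which resolves the noise mismatch with a mean network and a noise-aligned mapping, validated on image restoration and translation tasks.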

Comments Accepted by CVPR2026

Abstract

Diffusion bridge models offer a powerful framework for connecting two data distributions, such as in image restoration and translation. Many existing methods learn this bridge by mimicking the score-matching formulation of standard diffusion models. In this work, we find that this approach leads to an anomalous underfitting phenomenon near the target endpoint, as the process approaches the target distribution ($t \to 0$). This underfitting, characterized by significant drift in the predicted variance and direction, results from an excessively large discrepancy in noise levels between the network's input and its regression target. To resolve this issue, we propose the Noise-Aligned Diffusion Bridge (NADB). Our approach reformulates the diffusion bridge by first employing a mean network to provide a cleaner conditional target, and then introducing a novel, noise-aligned mapping relationship. This new formulation resolves the noise mismatch and corrects the underfitting near the target endpoint. Experimental validation across multiple image restoration and image translation tasks demonstrates the effectiveness of our approach. Code is available at https://github.com/gyr02/NADB.

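2605.28551 2026-05-29 cs.CV cs.GR cs.LG Version update

Resolution-free neural surrogates for geometric parameterization and mapping with spatially varying fields

Yanwen Huang, Lok Ming Lui, Gary P. T. Choi

Affiliations: Department of Mathematics, The Chinese University of Hong Kong

AI summary: Proposes a resolution-free neural surrogate that learns without labeled data, using multi-resolution geometric encoding and geometry-aware constraints (variational energies, diffusion-based density equalization, quasi-conformal theory), to predict mapped locations directly from spatially varying parameter fields on arbitrary structured or unstructured point sets.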

Abstract

Many imaging problems require computing spatial transformations induced by spatially varying intensity, feature, or density fields. Canonical examples include distortion correction, deformable image registration, atlas-based segmentation, and deformation-driven image analysis. These tasks can be formulated as geometric mapping problems in which the transformation is constrained to preserve local structure, control boundary behavior, or regulate angular distortion. Such formulations typically lead to variational models, diffusion processes, or elliptic partial differential equations. However, repeatedly solving high-resolution systems becomes computationally expensive when the underlying parameter fields vary across instances. In this work, we propose a resolution-free neural surrogate for geometric parameterization and mapping problems. Given a spatially varying parameter field $p:\Omega\to\mathbb{R}^m$ and query locations $\{x_i\}_{i=1}^N\subset\Omega$, the model predicts mapped locations $\{u(x_i)\}_{i=1}^N$ on arbitrary structured or unstructured point sets. To avoid dependence on a fixed grid, we use a multi-resolution geometric encoding strategy that conditions the network on coordinate-augmented samples of the parameter field. The model is trained without labeled solution data by enforcing geometry-aware constraints derived from variational energies, diffusion-based density equalization, and quasi-conformal theory. Experimental results on quasi-conformal mapping and density-equalizing mapping problems are presented to demonstrate the effectiveness of our proposed method.
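
For readers unfamiliar with how quasi-conformal theory quantifies angular distortion, here is a small worked sketch using the Beltrami coefficient of an affine map. The map $f(z) = a z + b \bar{z}$ and the chosen coefficients are illustrative assumptions; the abstract does not specify which quasi-conformal constraints the model enforces.

```python
def beltrami_affine(a, b):
    """Beltrami coefficient of f(z) = a*z + b*conj(z): mu = f_zbar / f_z = b / a."""
    if a == 0:
        raise ValueError("f_z must be nonzero")
    return b / a


def dilatation(mu):
    """Maximal dilatation K = (1 + |mu|) / (1 - |mu|); K = 1 means conformal."""
    r = abs(mu)
    if r >= 1:
        raise ValueError("|mu| must be < 1 for an orientation-preserving map")
    return (1 + r) / (1 - r)


def apply_affine(a, b, z):
    """Evaluate f(z) = a*z + b*conj(z) at a complex point z."""
    return a * z + b * z.conjugate()


# Hypothetical map f(z) = z + 0.5*conj(z): stretches x by 1.5, shrinks y to 0.5.
mu = beltrami_affine(1 + 0j, 0.5 + 0j)
K = dilatation(mu)
stretch_x = abs(apply_affine(1 + 0j, 0.5 + 0j, 1 + 0j))
stretch_y = abs(apply_affine(1 + 0j, 0.5 + 0j, 1j))
```

Here the ratio of the two axis stretches equals K, so keeping $|\mu|$ small is one way to express "regulate angular distortion" as a pointwise constraint.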

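2605.27959 2026-05-29 cs.CV cs.AI Version update

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

Affiliations: Kuaishou Technology

AI summary: Proposes ROVER, a lightweight learnable plugin that aggregates reasoning context, distills intra-image cues via object-centric differential attention, and routes history-aware evidence for efficient global visual evidence routing, improving answer and grounding accuracy in multi-image reasoning.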

Abstract

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.

2605.26994 2026-05-29 cs.CV Version update

ChartAct: A Benchmark for Dynamic Chart Understanding

Muye Huang, Lin Wu, Lingling Zhang, Hang Yan, Zhiyuan Wang, Yumeng Fu, Zesheng Yang, Jun Liu

Affiliations: School of Computer Science and Technology, Xi’an Jiaotong University; MOE KLNN Lab, Xi’an Jiaotong University

AI summary: Proposes the ChartAct benchmark, which collects 673 dynamic charts and 1,440 question-answer samples to evaluate multimodal models on interactive chart understanding, and finds that existing models remain limited.

Abstract

Charts are widely used to present complex data for analysis and decision making. Existing chart understanding benchmarks mainly focus on static charts, but real-world charts are often dynamic and interactive. Key information may only appear after actions such as hovering, clicking, zooming, or dragging. Dynamic chart understanding therefore requires models to identify visible content, choose proper interactions, and reason over changing chart states. To evaluate this ability, we propose ChartAct, an interactive benchmark for dynamic chart understanding. ChartAct collects and filters 673 dynamic charts from 8 real chart websites, covers 7 common chart types, and constructs 1,440 high-quality question-answer samples. Each sample is instantiated in two environments, Dynamic Chart and Dashboard Chart, to evaluate dynamic chart understanding under different contexts. Based on ChartAct, we systematically evaluate 11 advanced multimodal models and GUI agents. Experimental results show that existing models still have clear limitations in dynamic chart understanding. The strongest model, Claude-Opus-4.7, achieves an average success rate of 84.5%, while most models remain below 60%. We also conduct detailed failure attribution and case analysis. ChartAct provides a new benchmark for studying chart understanding in real interactive environments. Code is available at https://github.com/wulin-wulin/OSWorld_Chart.

2605.25299 2026-05-29 cs.CV cs.LG Version update

A Principled Self-Referenced Early Stopping Approach for Deep Image Prior

Chaoyan Huang, Cheng-Han Huang, Ismail R. Alkhouri, Rongrong Wang

Affiliations: Department of Computational Mathematics, Science, & Engineering, Michigan State University; Department of Electrical Engineering and Computer Science, University of Michigan; X Computational Physics Division, Los Alamos National Laboratory; Michigan Institute for Computational Discovery & Engineering, University of Michigan; Mathematical Sciences, Michigan State University

AI summary: Addresses overfitting in Deep Image Prior (DIP) by proposing an overfitting detection framework based on constructed pseudo self-referenced images, yielding early stopping methods that need no noise-level estimate.

Comments 35 pages, 10 figures, 14 tables

Abstract

Recently, Deep Image Prior (DIP) has demonstrated strong capabilities for solving inverse imaging problems (IIPs) by optimizing a randomly initialized convolutional neural network in a training-data-free regime. However, DIP suffers from overfitting to noisy measurements due to network over-parameterization, making early stopping (ES) essential. The most successful ES method tracks fluctuations in the running variance of the network output to detect overfitting. However, in many applications, these fluctuations may appear prematurely, leading to unstable reconstructions. In this paper, we first show that nearly optimal DIP early stopping can be achieved when two independent noisy copies of the degraded image are available. Motivated by this observation, and since obtaining two fully independent copies is infeasible, we propose an overfitting detection framework based on constructing pseudo self-referenced images, resulting in three IIP-specific algorithms. Our approach is further supported by theoretical results on single-reference validation, pseudo-validation estimation, and the impact of shared noise. Across different IIPs, ranging from natural image restoration to medical image reconstruction, and under varying noise levels and noise types, our methods consistently outperform existing DIP early stopping approaches, all without requiring an accurate estimate of the noise level.
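
The two-copy observation in this abstract can be sketched as a validation loop: fit to one noisy copy and monitor the error against a second, independently noisy copy, stopping when that error stops improving. The patience rule, the function names, and the error sequence below are illustrative assumptions; the paper's pseudo self-referenced construction for the single-copy case is not reproduced.

```python
def mse(signal_a, signal_b):
    """Mean squared error between two equal-length signals."""
    return sum((a - b) ** 2 for a, b in zip(signal_a, signal_b)) / len(signal_a)


def early_stop_iteration(validation_errors, patience=3):
    """Return the index of the best iteration seen before patience runs out.

    validation_errors[k] is the error of the k-th network output measured
    against the held-out noisy copy, not the copy being fitted.
    """
    best_index, best_error = 0, float("inf")
    for index, error in enumerate(validation_errors):
        if error < best_error:
            best_index, best_error = index, error
        elif index - best_index >= patience:
            break
    return best_index


# Hypothetical validation curve: improves, then rises as the network starts
# fitting noise; the late dip at the end is never reached.
errors = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 2.0]
stop_at = early_stop_iteration(errors, patience=3)
```

Because the noise in the held-out copy is independent of the noise being fitted, its error curve turns upward once the network begins to memorize noise, which is what makes it usable as a stopping signal without a noise-level estimate.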

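2605.25059 2026-05-29 cs.CV Version update

VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding

Ruoyu Wang, Yong Liu, Sheng Tao, Yuhang Lin, Yukai Ma

Affiliations: Institute of Cyber-Systems and Control

AI summary: Proposes VEOcc, a voxel-based recursive perception-and-assimilation framework whose spatio-temporal-aware online update strategy enables efficient, robust semantic occupancy prediction without initial scale estimation, reaching state-of-the-art performance in local and embodied settings.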

Abstract

Crucial for autonomous exploration, online 3D occupancy prediction and mapping incrementally constructs dense spatial representations on the fly. However, recent Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, fundamentally limiting their operational efficiency. In this work, we present VEOcc, a voxel-centric framework formulated as a recursive perception-and-assimilation paradigm. By eliminating the need for initial scale estimation, VEOcc enables highly streamlined, open-ended map expansion. Furthermore, to robustly aggregate noisy temporal observations within the discrete voxel space, we propose a Spatio-Temporal-Aware Online Update Strategy. It integrates Cross-Temporal Logit Aggregation (TLA) for temporal consistency, Reliability-Aware Confidence Modulation (RCM) for spatial uncertainty calibration, and Confidence-Driven Incremental State Update (CSU) for robust global state assimilation. Extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate that VEOcc establishes new state-of-the-art performance in both local and embodied settings. Notably, zero-shot evaluations on self-collected video sequences further confirm its robust out-of-distribution generalization capability in completely unseen real-world environments. Ultimately, our framework provides an accurate and highly efficient solution for autonomous exploration. Code and supplementary visualizations are available on our project page: https://wryzju.github.io/VEOcc/.
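
As background for aggregating noisy per-frame observations in a voxel, the sketch below shows the classical log-odds fusion used in binary occupancy mapping, with an optional per-observation confidence weight. This is a swapped-in textbook technique shown for intuition only: the function names and the weighting scheme are assumptions, and the paper's TLA, RCM, and CSU modules operate on semantic classes and may differ substantially.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_logits(observations, confidences=None):
    """Fuse per-frame occupancy probabilities for one voxel in log-odds space.

    Each observation contributes confidence * logit(p); a uniform 0.5 prior
    is assumed, so it adds nothing to the sum.
    """
    if confidences is None:
        confidences = [1.0] * len(observations)
    total = sum(c * logit(p) for p, c in zip(observations, confidences))
    return sigmoid(total)


# Hypothetical per-frame probabilities that one voxel is occupied.
agree = aggregate_logits([0.8, 0.8])                    # evidence reinforces
conflict = aggregate_logits([0.8, 0.2])                 # evidence cancels
downweighted = aggregate_logits([0.8, 0.2], [1.0, 0.0]) # second frame ignored
```

Working in logit space turns repeated Bayesian updates into simple addition, which is what makes incremental, per-frame map updates cheap.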

2605.23993 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Nano World Models: A Minimalist Implementation of Future Video Prediction

纳米世界模型：未来视频预测的极简实现

Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz

发表机构 * DeepMind

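AI summary: Introduces Nano World Models, a minimalist codebase for future video prediction built around diffusion forcing. It supports controlled studies of world-model design choices and experimentally analyzes how factors such as prediction parameterization and architecture scale affect video prediction quality.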

Comments Project page: https://simchowitzlabpublic.github.io/nano-world-model/

详情

AI中文摘要

世界模型已成为学习预测模拟器的核心范式，支持生成、规划和决策。然而，尽管工业级交互式视频生成取得了快速进展，更广泛的研究社区仍然缺乏紧凑、可重复且易于扩展的实现来研究现代世界模型的设计选择。我们介绍了Nano World Models，一个围绕扩散强迫的极简代码库，用于未来视频预测。Nano World Models为生成目标、模型规模、动作条件机制、潜在观测空间、数据集、评估协议和长程展开程序提供了统一接口。这种设计使得通常在不同实现中纠缠的世界模型组件可以进行受控研究。通过在简单控制环境、游戏模拟和真实机器人数据上的实验，我们考察了预测参数化、架构规模、动作注入、采样预算和领域复杂性如何影响视频预测质量和自回归展开行为。通过发布代码、配置、评估脚本和预训练检查点，Nano World Models旨在为开放、可重复和科学的世界模型研究提供一个紧凑但可扩展的实验基础。

英文摘要

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

2605.23531 2026-05-29 cs.CV 版本更新

PixIE: Prompted Pixel-Space Low-Light Image Enhancement

PixIE: 提示驱动的像素空间低光照图像增强

Ruirui Lin, Guoxi Huang, David Bull, Nantheera Anantrasirichai

发表机构 * Visual Information Laboratory, University of Bristol, United Kingdom（布里斯托大学视觉信息实验室，英国）

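AI summary: Proposes PixIE, a pixel-space low-light image enhancement framework that uses semantic prompts from a vision foundation model, combining cross-scale denoising with DINO-Prompted Pixel Blocks, and improves PSNR and LPIPS on multiple benchmarks.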

详情

AI中文摘要

低光照图像遭受严重的噪声、对比度损失和语义模糊，使得增强成为去噪和细节恢复的联合问题。我们提出PixIE，一种由视觉基础模型语义提示的前馈像素空间LLIE框架。PixIE首先执行跨尺度去噪以抑制噪声并保持结构，然后使用DINO提示像素块（DPPBs）细化细节，通过补丁条件、空间连续的逐像素调制注入中间DINOv3特征。为了使像素空间注意力在跨尺度上高效，我们引入了空间通道压缩（SCC），它联合减少空间令牌网格和通道维度。我们进一步提出多感受野像素嵌入（MRPE），在语义提示之前提供邻域感知的像素表示，提高对信号依赖噪声的鲁棒性，超越逐点嵌入。在LLIE基准上的实验表明，与最近的最先进方法相比，PixIE将平均PSNR提高了1.9-15.0%，并将LPIPS降低了8.5-44.4%。定性比较进一步显示更清晰的细节和更稳定的纹理，提高了重建保真度和感知质量。

英文摘要

Low-light images suffer from severe noise, contrast loss, and semantic ambiguity, making enhancement a joint problem of denoising and detail recovery. We propose PixIE, a feed-forward pixel-space LLIE framework semantically prompted by a vision foundation model. PixIE first performs cross-scale denoising to suppress noise and preserve structure, then refines details using DINO-Prompted Pixel Blocks (DPPBs), which inject intermediate DINOv3 features through patch-conditioned, spatially continuous per-pixel modulation. To make pixel-space attention efficient across scales, we introduce Spatial-Channel Compaction (SCC), which jointly reduces the spatial token grid and channel dimension. We further propose Multi-Receptive-Field Pixel Embedding (MRPE) to provide neighborhood-aware pixel representations before semantic prompting, improving robustness to signal-dependent noise beyond point-wise embeddings. Experiments on LLIE benchmarks show that PixIE improves average PSNR by 1.9-15.0% over recent state-of-the-art methods and reduces LPIPS by 8.5-44.4%. Qualitative comparisons further show sharper details and more stable textures, improving both reconstruction fidelity and perceptual quality.

2605.23345 2026-05-29 cs.CV 版本更新

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

SCOPE: 在可玩环境中模拟跨游戏操作以构建FPS世界模型

Zizhao Tong, Yeying Jin, Hongfeng Lai, Zeqing Wang, Zhaohu Xing, Kexu Cheng, Haoran Xu, Zhao Pu, Shangwen Zhu, Ruili Feng, Jian Zhao, Yan Zhang, Hao Tang, Ling Shao

发表机构 * UCAS-Terminus AI Lab, University of Chinese Academy of Sciences（中国科学院大学Terminus AI实验室）； Tencent（腾讯）； National University of Singapore（新加坡国立大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； University of Waterloo（滑铁卢大学）； Shanghai Jiaotong University（上海交通大学）； Zhongguancun Institute of Artificial Intelligence（中关村人工智能研究院）； State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University（北京大学计算机科学学院多媒体信息处理国家重点实验室）

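AI summary: Proposes SCOPE, which inserts a conditioning module into each transformer block and reshapes features into per-pixel temporal sequences, separating the effects of actions inside the local scope from global generation in FPS games. It also introduces the cross-game dataset CrossFPS, enabling zero-shot transfer.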

Comments Project page: https://z2tong.github.io/SCOPE/. Code is available at https://github.com/z2tong/SCOPE

详情

AI中文摘要

第一人称射击（FPS）游戏的交互式世界模型必须在每一帧解析高频重叠控制信号，同时不干扰未受影响的区域。现有方法全局注入动作并在单一游戏上训练，在密集FPS输入下失败。我们观察到FPS动作具有空间选择性：离散事件（如射击或换弹）仅影响武器周围的局部区域（scope），而连续的相机和移动信号控制稳定的环境。我们提出SCOPE，它在预训练视频扩散模型的每个Transformer块中插入一个条件模块。它将特征重塑为逐像素时间序列，使得每个位置根据局部视觉内容计算其动作响应。这无需分割标签即可将作用域内效果与作用域外生成分离。我们还引入了CrossFPS，这是第一个具有帧对齐动作遥测的多游戏FPS数据集。它包含来自7个游戏的69K个片段，具有10自由度控制器信号，并经过策划以消除游戏玩法偏差。该模型学习通用的视觉到动作映射，而非特定游戏模式，从而实现对未见场景的零样本迁移。实验证实了强动作响应性、精确的作用域分离以及有效的跨游戏泛化。

英文摘要

Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.

2605.22080 2026-05-29 cs.CV cs.AI 版本更新

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

JMed48k：用于视觉语言模型评估的多专业日本医疗执照基准

Yue Xun, Junyu Liu, Qian Niu, Xinyi Wang, Zheng Yuan, Zirui Li, Zequn Zhang, Bowen Zhao, Shujun Wang, Irene Li, Kan Hatakeyama-Sato, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； Kyoto University（京都大学）； The University of Tokyo（东京大学）； Hohai University（河海大学）； University of Science and Technology of China（中国科学技术大学）； University of Toronto（多伦多大学）

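AI summary: Presents JMed48k, a multi-profession Japanese medical licensing benchmark with 48,862 exam questions and 20,142 images. Evaluating 21 models with a paired image-removal audit shows that proprietary and open-source models benefit substantially from images, whereas medical-specific models make limited use of visual evidence.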

详情

AI中文摘要

我们引入了JMed48k，一个用于评估视觉语言模型的多专业日本医疗执照基准。该基准基于日本厚生劳动省发布的官方PDF材料构建，包含2005年至2025年间11个国家执照考试的48,862道试题和20,142张图像，视觉内容按8类分类法进行标注。从该语料库中，我们提取了JMed48k-Eval，一个近五年的评估子集，包含12,484道评分题，其中9,905道纯文本题和2,579道带图像题。我们评估了21个专有、开源和医学专用模型，分别报告纯文本和带图像的性能。由于这些子集包含不同的问题，我们进一步引入了一种配对图像移除审计，评估带图像的问题在移除视觉内容前后的表现，以探索四种答案转换状态。审计显示，专有和开源模型从图像中获益显著，而医学专用系统对视觉证据的利用有限，许多正确答案在图像移除后仍然存在。即使在专有模型中，净图像移除效应在不同专业间变化七倍，从医师问题的+5.7分到公共卫生护士问题的+39.8分。我们发布JMed48k以支持在医疗执照场景中对视觉语言模型进行可重复的、按专业分层的评估。

英文摘要

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
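
Illustrative sketch (not from the paper): a paired image-removal audit can be thought of as labeling each image question by whether the answer is correct with the image and again after the image is removed, which yields four transition states. The state names, the `audit` helper, and the sign convention for the net effect (accuracy with image minus accuracy without) are assumptions made for this example, and the data is a toy.

```python
from collections import Counter

def transition_state(correct_with_image: bool, correct_without_image: bool) -> str:
    """Label one question by how its correctness changes when the image is removed."""
    if correct_with_image and correct_without_image:
        return "correct->correct"   # the answer survives image removal
    if correct_with_image and not correct_without_image:
        return "correct->wrong"     # the image was needed
    if not correct_with_image and correct_without_image:
        return "wrong->correct"     # the image hurt
    return "wrong->wrong"

def audit(pairs):
    """Count transition states and the net accuracy effect in percentage points."""
    counts = Counter(transition_state(w, wo) for w, wo in pairs)
    n = len(pairs)
    # Assumed convention: positive means the model does better with the image.
    net_points = 100.0 * (counts["correct->wrong"] - counts["wrong->correct"]) / n
    return counts, net_points

# Toy data: (correct with image, correct after removing the image) per question.
pairs = [(True, True), (True, False), (True, False), (False, True), (False, False)]
counts, net_points = audit(pairs)
print(dict(counts), f"net effect: {net_points:+.1f} points")
```

Pairing the same questions removes the confound that text-only and with-image subsets contain different items, so a large "correct->correct" count points to answers that did not depend on the image.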

2605.22069 2026-05-29 cs.CV cs.LG 版本更新

TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

TWINGS: 基于薄板样条翘曲对齐的稀疏视图高斯泼溅初始化

Hyeseong Kim, Geonhui Son, Deukhee Lee, Dosik Hwang

发表机构 * Yonsei University（延世大学）； Korea Institute of Science and Technology（韩国科学技术研究院）

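AI summary: Proposes TWINGS, which uses Thin Plate Splines (TPS) to align backprojected points with triangulated control points, giving 3D Gaussian Splatting a geometrically accurate initialization that improves detail preservation and color fidelity in sparse-view scene reconstruction.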

Comments Accepted at CVPR 2026, Project page: https://sandokim.github.io/twings/

详情

AI中文摘要

从稀疏视图输入进行新视角合成是3D计算机视觉中的一个重大挑战，特别是在有限视角下实现高质量场景重建。我们引入了TWINGS，这是一个通过直接解决点稀疏性来增强3D高斯泼溅（3DGS）的框架。我们采用薄板样条（TPS），一种平滑的非刚性变形模型，通过最小化弯曲能量从控制点对应关系估计全局一致的翘曲，将估计深度反投影的点与三角化的3D控制点对齐，从而生成校准的反投影点。通过在这些控制点附近采样校准点，TWINGS为3DGS提供了快速且几何精确的初始化，最终改善了重建场景中结构细节的保留和颜色保真度。在DTU、LLFF和Mip-NeRF360上的大量实验表明，TWINGS在稀疏视图场景下始终优于现有方法，提供详细且准确的重建。

英文摘要

Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.

2605.17286 2026-05-29 cs.CV 版本更新

HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone

HyperVision: 一种通道自适应的地基高光谱视觉预训练骨干网络

Guanyiman Fu, Jingtao Li, Zihang Cheng, Zhuanfeng Li, Diqi Chen, Yan Xu, Xiangyu Liu, Fengchao Xiong, Jianfeng Lu, Chengrong Chen, Jun Zhou

发表机构 * Griffith University, Australia（格里菲斯大学，澳大利亚）； Wuhan University, China（武汉大学，中国）； Nanjing University of Science and Technology, China（南京理工大学，中国）； Huaiyin Normal University, China（淮阴师范学院，中国）； Massey University, New Zealand（梅西大学，新西兰）

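AI summary: To address varying sensor configurations, scarce and inconsistent labels, and limited dataset scale in ground-based hyperspectral imaging, proposes HyperVision, the first ground-based hyperspectral pre-trained backbone. It uses channel-adaptive dynamic embedding, multi-source pseudo-labeling, and cross-modal knowledge distillation, and achieves state-of-the-art performance on three downstream tasks.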

详情

AI中文摘要

虽然高光谱成像通过数百个窄波长波段提供丰富的空间-光谱信息，用于精确的材料识别，但地基高光谱预训练骨干网络仍然缺失，受限于传感器间的光谱配置差异、标签的稀缺性和不一致性，以及现有数据集的规模有限和场景多样性不足。为了解决这些挑战并实现通用感知，我们提出了HyperVision，这是首个地基高光谱预训练骨干网络。首先，为了处理不同的光谱配置，HyperVision采用通道自适应动态嵌入机制，将异构输入映射到统一的标记空间。其次，我们开发了一个无监督表示学习框架。具体来说，为了解决标签稀缺和不一致问题，引入了一种多源伪标签方法，融合来自SAM2的空间结构和来自HyperFree的细粒度光谱材料信息。此外，为了丰富场景多样性并补偿有限的数据集规模，利用跨模态知识蒸馏机制，将预训练RGB视觉模型的丰富语义表示迁移到我们的骨干网络。HyperVision在来自26个不同地基数据集的15000张图像集合上进行预训练，展现出卓越的泛化能力。仅需高效的头适配而无需调整骨干参数，它在不同传感器配置下的三个下游任务中取得了比任务特定方法更优的性能，在高光谱语义分割中$\mathrm{Acc}_{\mathrm{M}}$相对提升高达16.3%，目标跟踪AUC相对提升2.1%，显著目标检测MAE降低35.5%。源代码和预训练模型将在https://github.com/lronkitty/HyperVision 公开。

英文摘要

While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, we develop an unsupervised representation learning framework. Specifically, to address label scarcity and inconsistency, a multi-source pseudo-labeling method is introduced to fuse spatial structures from SAM2 and fine-grained spectral material information from HyperFree. Furthermore, to enrich scene diversity and compensate for limited dataset scale, a cross-modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre-trained RGB vision model to our backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head-only adaptation without adjusting backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_{\mathrm{M}}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available on https://github.com/lronkitty/HyperVision .

2605.15852 2026-05-29 cs.CV 版本更新

GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

GHOST: 用于高效3D重建的几何层次化在线流式令牌驱逐

Leyang Chen, Junyi Wu, Zhiteng Li, Yulun Zhang

发表机构 * Shanghai Jiao Tong University（上海交通大学）

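AI summary: Proposes GHOST, which uses the model's own 3D geometry outputs to evict redundant tokens online, cutting the KV cache by nearly half and delivering 1.75x faster inference while preserving reconstruction quality.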

详情

AI中文摘要

从长单目视频序列进行流式3D重建需要维护一个随序列长度线性增长的键值（KV）缓存，造成严重的内存瓶颈。现有方法要么将缓存截断为固定的一组锚帧，导致重建质量下降，要么依赖于对3D场景结构无关的注意力分数启发式方法，未能保留几何上有价值的令牌。为解决这些问题，我们提出GHOST（几何层次化在线流式令牌驱逐），一种无需训练的KV缓存管理框架，利用模型自身的3D几何输出在线驱逐冗余令牌。GHOST引入了三项相互增强的创新：层次化双层重要性评分方案、保护特殊令牌不被驱逐的特权机制，以及余弦相似度引导的逐层预算分配。在各种基准上的实验表明，GHOST在保持出色重建质量的同时，将KV缓存削减近一半，并且与最先进方法相比实现了1.75倍的推理加速。我们的代码可在 https://github.com/lokiniuniu/GHOST 获取。

英文摘要

Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.
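
Illustrative sketch (not from the paper): budgeted KV-cache eviction with a protected set can be reduced to "always keep privileged tokens, then fill the remaining budget with the highest-scoring tokens". The code below shows only that generic pattern. How GHOST derives importance from 3D geometry and how it allocates per-layer budgets are not reproduced here, and `evict_tokens`, its parameters, and the scores are hypothetical.

```python
def evict_tokens(scores, protected, budget):
    """Return the sorted indices of tokens to keep in the cache.

    scores    : per-token importance (higher means more valuable)
    protected : set of indices that are never evicted (e.g. special tokens)
    budget    : maximum number of tokens to keep
    """
    keep = {i for i in protected if 0 <= i < len(scores)}
    # Rank the remaining tokens by importance, most valuable first.
    candidates = sorted(
        (i for i in range(len(scores)) if i not in keep),
        key=lambda i: scores[i],
        reverse=True,
    )
    remaining = max(0, budget - len(keep))
    keep.update(candidates[:remaining])
    return sorted(keep)

# Token 3 has the lowest score but is protected, so it survives eviction.
scores = [0.9, 0.1, 0.4, 0.05, 0.7, 0.2]
kept = evict_tokens(scores, protected={3}, budget=3)
print(kept)
```

In this sketch the protected set takes priority over the budget, so if more tokens are protected than the budget allows, all protected tokens are still kept.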

2605.14270 2026-05-29 cs.CV 版本更新

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

诊断和纠正多模态扩散Transformer中的概念遗漏

Kanghyun Baek, Jaihyun Lew, Chaehun Shin, Jungbeom Lee, Sungroh Yoon

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, South Korea ； Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea ； Department of Computer Science & Engineering, Korea University, Seoul, South Korea ； ISRC, Seoul National University, Seoul, South Korea

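AI summary: Using linear probing, finds an "omission signal" in text embeddings that indicates a target concept is missing, and proposes Omission Signal Intervention (OSI), which amplifies this signal to actively catalyze generation of the missing concept. OSI significantly mitigates concept omission on FLUX.1-Dev and SD3.5-Medium.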

Comments Accepted to ICML 2026

2604.21654 2026-05-29 cs.CV cs.AI 版本更新

Causal Disentanglement-Inspired Degradation Representation Learning for Full-Reference Image Quality Assessment

因果解耦启发的退化表示学习用于全参考图像质量评估

Zhen Zhang, Jielei Chu, Tian Zhang, Lin Ma, Fengmao Lv, Weide Liu, Tianrui Li, Yuming Fang

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University（计算机与人工智能学院，西南交通大学）； School of Transportation and Logistics, Southwest Jiaotong University（交通运输与物流学院，西南交通大学）； School of Physics, Northeast Normal University（物理学院，东北师范大学）； School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics（计算机与人工智能学院，江西财经大学）； School of Information Management, Jiangxi University of Finance and Economics（信息管理学院，江西财经大学）

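AI summary: Proposes a new full-reference image quality assessment paradigm based on causal inference and decoupled representation learning, estimating degradation through interventions on latent representations. It performs well across multiple supervision settings and cross-domain scenarios.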

详情

AI中文摘要

现有的基于深度网络的全参考图像质量评估（FR-IQA）模型通常通过对参考图像和失真图像的深度特征进行成对比较来工作。在本文中，我们从不同的角度处理这个问题，提出了一种基于因果推断和解耦表示学习的新型FR-IQA范式。与典型的基于特征比较的FR-IQA模型不同，我们的方法将退化估计表述为一个由对潜在表示进行干预引导的因果解耦过程。我们首先利用参考图像和失真图像之间的内容不变性来解耦退化表示和内容表示。其次，受人类视觉掩蔽效应的启发，我们设计了一个掩蔽模块来建模图像内容与退化特征之间的因果关系，从而从失真图像中提取受内容影响的退化特征。最后，通过监督回归或无标签降维从这些退化特征预测质量分数。大量实验表明，我们的方法在全监督、少标签和无标签设置的标准IQA基准上取得了极具竞争力的性能。此外，我们还在数据稀缺的多种非标准自然图像域（包括水下、放射线、医学、中子和屏幕内容图像）上评估了该方法。得益于其能够在没有标记IQA数据的情况下进行场景特定训练和预测的能力，我们的方法在跨域泛化方面优于现有的无训练FR-IQA模型。

英文摘要

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

2604.18518 2026-05-29 cs.CV cs.LG 版本更新

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

UDM-GRPO：面向均匀离散扩散模型的稳定高效组相对策略优化

Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）

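AI summary: To address training instability and limited gains when integrating uniform discrete diffusion models (UDM) with reinforcement learning (RL), proposes UDM-GRPO, which treats the final clean sample as the action, reconstructs trajectories via the diffusion forward process, and adds Reduced-Step and CFG-Free strategies, significantly improving text-to-image generation performance.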

Comments UDM-GRPO is accepted by ICML 2026 (Spotlight). Code is available at https://github.com/Yovecent/UDM-GRPO

详情

AI中文摘要

均匀离散扩散模型（UDM）最近成为离散生成建模的一种有前景的范式；然而，其与强化学习的集成仍然很大程度上未被探索。我们观察到，将GRPO直接应用于UDM会导致训练不稳定和边际性能提升。为了解决这个问题，我们提出了UDM-GRPO，这是第一个将UDM与RL集成的框架。我们的方法基于两个关键见解：（i）将最终干净样本作为动作提供更准确和稳定的优化信号；（ii）通过扩散前向过程重建轨迹更好地将概率路径与预训练分布对齐。此外，我们引入了两种策略，即简化步数（Reduced-Step）和无CFG（CFG-Free），以进一步提高训练效率。UDM-GRPO在多个T2I任务上显著提升了基础模型性能。值得注意的是，GenEval准确率从69%提高到96%，PickScore从20.46增加到23.81，在连续和离散设置中均达到了最先进的性能。在OCR基准测试中，准确率从8%提高到57%，进一步验证了我们方法的泛化能力。代码可在https://github.com/Yovecent/UDM-GRPO获取。

英文摘要

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from $8\%$ to $57\%$, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
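
Illustrative sketch (not from the paper): the "group relative" part of GRPO is usually described as scoring each sample against the other samples drawn for the same prompt, by standardizing rewards within the group. The code below shows only that normalization step. It does not implement UDM-GRPO's action definition or trajectory reconstruction, and the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four images generated for one prompt, each scored by a reward model (toy values).
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because advantages are centered within the group, samples that beat their siblings are reinforced and the rest are pushed down, without a separately learned value function.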

2604.13019 2026-05-29 cs.CV 版本更新

PrecisionCUA: Iterative Visual Refinement for Pixel-Precise Cursor Grounding in Code Editors

PrecisionCUA：代码编辑器中像素级光标定位的迭代视觉细化

Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu

发表机构 * Microsoft（微软公司）； Carnegie Mellon University（卡内基梅隆大学）

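AI summary: Proposes PrecisionCUA, which achieves pixel-precise cursor grounding in code editors through iterative refinement with visual feedback, significantly improving click precision and task success rate.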

详情

AI中文摘要

计算机使用代理（CUA）从根本上依赖图形用户界面（GUI）定位，将语言指令转化为可执行的屏幕操作，但在密集编码界面（如VS Code和Cursor）中，需要亚像素精度才能与密集IDE元素交互的编辑级定位尚未得到充分探索。现有方法通常依赖单次坐标预测，缺乏纠错机制，在高密度界面中常常失败。在本技术报告中，我们对编码环境中的像素级光标定位进行了实证研究。我们的代理不是单步执行，而是参与迭代细化过程，利用先前尝试的视觉反馈来达到目标元素。这种闭环定位机制使代理能够自我纠正位移误差并适应动态UI变化。我们在Claude、Qwen和GPT上的一系列复杂编码基准上评估了我们的方法，结果表明多轮细化在点击精度和整体任务成功率上均显著优于最先进的单次模型。我们的结果表明，迭代视觉推理是下一代可靠软件工程代理的关键组成部分。代码：https://github.com/microsoft/precision-cua-bench/tree/main。

英文摘要

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces (such as VS Code and Cursor), where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across Claude, Qwen, and GPT on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench/tree/main.
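
Illustrative sketch (not from the paper): closed-loop grounding can be pictured as repeatedly estimating the remaining offset from the last attempt and correcting the position until the estimate is small. The code below is a toy simulation of that loop. `refine_click`, the `estimate_offset` callback, the tolerance, and the 70% feedback model are all hypothetical and stand in for a model that inspects a screenshot of its previous attempt.

```python
def refine_click(initial_guess, estimate_offset, max_turns=5, tolerance=1.0):
    """Iteratively correct a click position using feedback from earlier attempts.

    estimate_offset(guess) returns the (dx, dy) correction suggested after
    looking at where the last attempt landed (hypothetical callback).
    """
    x, y = initial_guess
    history = [(x, y)]
    for _ in range(max_turns):
        dx, dy = estimate_offset((x, y))
        if (dx * dx + dy * dy) ** 0.5 <= tolerance:
            break  # close enough: stop refining
        x, y = x + dx, y + dy
        history.append((x, y))
    return (x, y), history

# Toy feedback: an imperfect observer that perceives only 70% of the true offset.
TARGET = (412.0, 188.0)

def noisy_feedback(guess):
    return (0.7 * (TARGET[0] - guess[0]), 0.7 * (TARGET[1] - guess[1]))

final, history = refine_click((380.0, 200.0), noisy_feedback)
error = ((final[0] - TARGET[0]) ** 2 + (final[1] - TARGET[1]) ** 2) ** 0.5
print(f"attempts: {len(history)}, final error: {error:.2f} px")
```

Even with feedback that underestimates the offset, each turn shrinks the error by a constant factor, which is the advantage a multi-turn loop has over a single-shot prediction.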

2604.12772 2026-05-29 cs.CV cs.MA 版本更新

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

多轮自适应提示攻击对大型视觉-语言模型

In Chong Choi, Jiacheng Zhang, Feng Liu, Yiliao Song

发表机构 * The University of Melbourne（墨尔本大学）； The University of Adelaide（阿德莱德大学）

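AI summary: Proposes the multi-turn adaptive prompting attack (MAPA), which alternates text-vision attack actions within each turn and iteratively adjusts the attack trajectory across turns, significantly raising multi-turn jailbreak success rates against large vision-language models.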

详情

AI中文摘要

多轮越狱攻击已被证明对纯文本大型语言模型（LLMs）有效，其中恶意内容逐渐引入以绕过安全对齐。然而，将此类攻击有效扩展到大型视觉-语言模型（LVLMs）仍未被充分探索。在本文中，我们发现简单地将视觉输入纳入多轮越狱可能使其更容易防御；例如，过度恶意的视觉内容容易触发安全对齐的LVLMs中的防御机制，导致更保守的响应。基于这一发现，我们提出了多轮自适应提示攻击（MAPA），该攻击：1）在每一轮中，交替文本-视觉攻击动作以引发最恶意的响应；2）跨轮，通过迭代来回优化调整攻击轨迹，逐步放大响应的恶意程度。这种两级设计使MAPA能够持续优于最先进的方法，在最近的基准测试中，针对LLaVA-v1.6-Mistral-7B、Qwen2.5-VL-7B-Instruct、Llama-3.2-Vision-11B-Instruct和GPT-4o-mini，攻击成功率提高了15-30%。我们的代码可在https://github.com/thomaschoi143/MAPA获取。

英文摘要

Multi-turn jailbreak attacks have proven effective against text-only large language models (LLMs), where malicious content is gradually introduced to bypass safety alignment. However, effectively extending such attacks to large vision-language models (LVLMs) remains underexplored. In this paper, we find that naively incorporating visual inputs can make multi-turn jailbreaks easier to defend against; for example, overly malicious visual content will easily trigger the defense mechanism in safety-aligned LVLMs, resulting in more conservative responses. Based on this finding, we propose multi-turn adaptive prompting attack (MAPA) that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 15-30% on recent benchmarks against LLaVA-v1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini. Our code is available at: https://github.com/thomaschoi143/MAPA.

2602.13600 2026-05-29 cs.CV 版本更新

SAVAA: Mitigating Hallucinations in LVLMs via Step-wise Adaptive Visual Attention Amplification

SAVAA: 通过逐步自适应视觉注意力放大减轻LVLMs中的幻觉

Jiacheng Zhang, Feng Liu, Chao Du, Tianyu Pang

发表机构 * Sea AI Lab； The University of Melbourne（墨尔本大学）

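AI summary: Proposes SAVAA, which estimates hallucination risk with Visual Grounding Entropy and adaptively adjusts the visual attention amplification factor, significantly mitigating hallucinations in large vision-language models on multiple benchmarks.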

详情

AI中文摘要

最近一系列无需训练的减轻大型视觉语言模型（LVLMs）幻觉的方法，通过在单次前向传递的自回归生成过程中放大对视觉标记的注意力。我们将这种范式称为视觉注意力放大（VAA）。在本文中，我们识别出现有VAA方法的一个双重失败模式，原因是它们在生成步骤中使用固定的放大因子：在某些步骤可能太弱，无法解决幻觉，而在其他步骤太强，引入新的幻觉。受此发现启发，我们提出逐步自适应视觉注意力放大（SAVAA），一种新的VAA框架，它估计每个生成标记的幻觉风险，并使用估计的风险自适应地放大下一个生成步骤的视觉注意力。具体来说，我们引入视觉接地熵（VGE），一种轻量级的幻觉风险估计器，它用视觉接地增强预测熵，为那些不确定、在图像中接地较弱或两者兼有的标记分配更高的风险。在VGE的指导下，SAVAA使用估计的风险校准下一个生成步骤的VAA因子，对高风险步骤应用更强的放大，对低风险步骤应用更弱的放大。在LLaVA-NeXT-7B、Qwen3-VL-8B和InternVL3.5-8B上，SAVAA在生成幻觉基准（如CHAIR、SHR和AMBER）上显著优于基线方法。代码可在https://github.com/JiachengZ01/SAVVA获取。

英文摘要

A line of recent training-free methods for mitigating hallucinations in large vision-language models (LVLMs) operates by amplifying attention to visual tokens during autoregressive generation within a single forward pass. We refer to this paradigm as visual attention amplification (VAA). In this paper, we identify a dual failure pattern in existing VAA methods caused by their use of a fixed amplification factor across generation steps: it can be too weak at some steps, leaving hallucinations unresolved, while too strong at others, introducing new hallucinations. Motivated by this finding, we propose Step-wise Adaptive Visual Attention Amplification (SAVAA), a new VAA framework that estimates hallucination risk for each generated token and uses the estimated risk to adaptively amplify visual attention at the next generation step. Specifically, we introduce Visual Grounding Entropy (VGE), a lightweight hallucination-risk estimator that augments predictive entropy with visual grounding, assigning higher risk to tokens that are uncertain, weakly grounded in the image, or both. Guided by VGE, SAVAA uses the estimated risk to calibrate the VAA factor for the next generation step, applying stronger amplification to higher-risk steps and weaker amplification to lower-risk steps. Across LLaVA-NeXT-7B, Qwen3-VL-8B, and InternVL3.5-8B, SAVAA significantly outperforms baseline methods on generative hallucination benchmarks such as CHAIR, SHR and AMBER. Code is available at: https://github.com/JiachengZ01/SAVVA.
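
Illustrative sketch (not from the paper): the idea of replacing a fixed amplification factor with a risk-dependent one can be shown with a toy risk score that grows when the next-token distribution is uncertain or when little attention mass falls on visual tokens. The equal-weight combination, the `[1.0, 2.0]` factor range, and all function names below are hypothetical. The paper's actual VGE definition and calibration are not reproduced.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def risk_score(probs, visual_attention_mass):
    """Toy risk in [0, 1]: high when uncertain, weakly grounded, or both."""
    h_norm = entropy(probs) / math.log(len(probs))  # normalized entropy
    weak_grounding = 1.0 - visual_attention_mass    # share of attention off the image
    return 0.5 * h_norm + 0.5 * weak_grounding      # assumed equal weighting

def amplification(risk, low=1.0, high=2.0):
    """Map risk linearly to an attention amplification factor for the next step."""
    return low + (high - low) * risk

# A confident, well-grounded step versus an uncertain, weakly grounded one.
safe = amplification(risk_score([0.97, 0.01, 0.01, 0.01], visual_attention_mass=0.9))
risky = amplification(risk_score([0.25, 0.25, 0.25, 0.25], visual_attention_mass=0.1))
print(f"safe step factor: {safe:.2f}, risky step factor: {risky:.2f}")
```

A per-step factor of this kind addresses both failure cases named above: low-risk steps stay close to the unamplified model, while high-risk steps receive stronger visual emphasis.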

2602.07044 2026-05-29 cs.CV cs.AI 版本更新

PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging

PipeMFL-240K：管道磁通量泄漏成像中目标检测的大规模数据集与基准

Tianyi Qu, Songxiao Yang, Haolin Wang, Huadong Song, Xiaoting Guo, Wenguang Hu, Guanlin Liu, Honghe Chen, Yafei Ou

发表机构 * SINOMACH Sensing Technology Co., Ltd., Shenyang, Liaoning, China ； Institute of Science Tokyo, Tokyo, Japan ； Hokkaido University, Sapporo, Hokkaido, Japan

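AI summary: To address the lack of a large-scale public dataset and benchmark for pipeline magnetic flux leakage inspection, builds PipeMFL-240K with 249,320 images and 200,020 bounding-box annotations, and evaluates existing object detectors, revealing their shortcomings under long-tailed distributions, tiny objects, and intra-class variability.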

Comments Accepted by ACM KDD 2026 Datasets and Benchmarks Track

详情

AI中文摘要

管道完整性对工业安全和环境保护至关重要，磁通量泄漏（MFL）检测是一种主要的无损检测技术。尽管深度学习在自动化MFL解释方面具有前景，但由于缺乏大规模公开数据集和基准，可靠模型的进展受到限制，导致公平比较和可重复评估困难。我们引入了PipeMFL-240K，这是一个大规模、精心标注的数据集和基准，用于管道MFL伪彩色图像中的复杂目标检测。PipeMFL-240K反映了真实检测的复杂性，并提出了几个独特挑战：(i) 覆盖12个类别的极端长尾分布，(ii) 大量仅包含少数像素的小目标，(iii) 显著的类内变异。该数据集包含249,320张图像和200,020个高质量边界框标注，采集自12条总长约1,530公里的管道。我们使用最先进的目标检测器进行了大量实验以建立基线。结果表明，现代检测器仍然难以应对MFL数据的固有特性，凸显了巨大的改进空间，而PipeMFL-240K为驱动未来研究提供了可靠且具有挑战性的试验平台。作为管道MFL检测领域首个如此规模和范围的数据集和基准，它为高效的管道诊断和维护规划提供了关键基础，并有望加速基于MFL的管道完整性评估中的算法创新和可重复研究。

英文摘要

Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primary non-destructive testing technology. Despite the promise of deep learning for automating MFL interpretation, progress toward reliable models has been constrained by the absence of a large-scale public dataset and benchmark, making fair comparison and reproducible evaluation difficult. We introduce PipeMFL-240K, a large-scale, meticulously annotated dataset and benchmark for complex object detection in pipeline MFL pseudo-color images. PipeMFL-240K reflects real-world inspection complexity and poses several unique challenges: (i) an extremely long-tailed distribution over 12 categories, (ii) a high prevalence of tiny objects that often comprise only a handful of pixels, and (iii) substantial intra-class variability. The dataset contains 249,320 images and 200,020 high-quality bounding-box annotations, collected from 12 pipelines spanning approximately 1,530 km. Extensive experiments are conducted with state-of-the-art object detectors to establish baselines. Results show that modern detectors still struggle with the intrinsic properties of MFL data, highlighting considerable headroom for improvement, while PipeMFL-240K provides a reliable and challenging testbed to drive future research. As the first public dataset and the first benchmark of this scale and scope for pipeline MFL inspection, it provides a critical foundation for efficient pipeline diagnostics as well as maintenance planning and is expected to accelerate algorithmic innovation and reproducible research in MFL-based pipeline integrity assessment.

2602.06282 2026-05-29 cs.CV q-bio.QM 版本更新

An Interpretable Vision Transformer as a Fingerprint-Based Diagnostic Aid for Kabuki and Wiedemann-Steiner Syndromes

一种可解释的视觉Transformer：基于指纹的Kabuki和Wiedemann-Steiner综合征辅助诊断工具

Marilyn Lionts, Arnhildur Tomasdottir, Viktor I. Agustsson, Yuankai Huo, Hans T. Bjornsson, Lotta M. Ellingsen

发表机构 * Dept. of Computer Science, Vanderbilt University（范德比尔特大学计算机科学系）； Dept. of Genetics and Molecular Medicine, Landspitali University Hospital（Landspitali大学医院遗传学与分子医学系）； Louma G. Laboratory of Epigenetics Research, Faculty of Medicine, University of Iceland（冰岛大学医学院表观遗传学研究实验室）； McKusick-Nathans Dept. of Genetic Medicine, Johns Hopkins University School of Medicine（约翰霍普金斯大学医学院遗传医学系）； Faculty of Electrical and Computer Engineering, University of Iceland（冰岛大学电气与计算机工程学院）

AI总结本研究提出一种基于视觉Transformer的深度学习模型，利用指纹图像区分Kabuki综合征（KS）和Wiedemann-Steiner综合征（WSS）患者与健康对照，并通过注意力可视化增强可解释性，为罕见遗传病的非侵入性诊断提供新工具。

DOI: 10.1117/12.3085208

AI中文摘要

Kabuki综合征（KS）和Wiedemann-Steiner综合征（WSS）是罕见但不同的发育障碍，具有重叠的临床特征，包括神经发育迟缓、生长受限和持续性胎儿指尖垫。尽管基因检测仍是诊断的金标准，但由于基因检测和专业知识获取的障碍，许多KS或WSS患者仍未得到诊断。皮纹异常虽然是几种遗传综合征的既定标志，但在分子检测时代仍是一种未被充分利用的诊断信号。本研究提出一种基于视觉Transformer的深度学习模型，利用指纹图像区分KS和WSS患者与未受影响的对照组以及彼此。我们在三个二分类任务中评估模型性能。在三个分类任务中，模型在对照组vs. KS、对照组vs. WSS和KS vs. WSS上分别达到了0.80、0.73和0.85的AUC分数，相应的F1分数分别为0.71、0.72和0.83。除了分类，我们应用基于注意力的可视化来识别对模型预测最显著的指纹区域，增强了可解释性。总之，这些发现表明存在综合征特异性的指纹特征，证明了基于指纹的人工智能（AI）工具作为一种非侵入性、可解释且可获取的未来诊断辅助手段，用于早期诊断未充分诊断的遗传综合征的可行性。

英文摘要

Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. We evaluate model performance across three binary classification tasks. Across the three classification tasks, the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83, respectively. Beyond classification, we apply attention-based visualizations to identify fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early diagnosis of underdiagnosed genetic syndromes.

2602.00324 2026-05-29 math.OC cs.CV cs.RO eess.SP 版本更新

Dual Quaternion SE(3) Synchronization with Recovery Guarantees

对偶四元数 SE(3) 同步及其恢复保证

Jianing Zhao, Linglingzhi Zhu, Anthony Man-Cho So

发表机构 * Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, NT, Hong Kong（香港中文大学系统工程与工程管理学系）； H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA（佐治亚理工学院H. Milton Stewart工业与系统工程学院）

AI总结采用对偶四元数表示，通过谱初始化和对偶四元数广义幂法实现 SE(3) 同步，并给出误差界和线性收敛保证。

Comments ICML 2026

AI中文摘要

特殊欧几里得群 SE(3) 上的同步旨在从含噪的成对相对变换中恢复绝对位姿，是机器人和 3D 视觉中的核心基本操作。标准方法通常需要多步启发式程序来恢复有效位姿，这些程序难以分析且通常缺乏理论保证。本文采用对偶四元数表示，并直接在单位对偶四元数上建立 SE(3) 同步问题。开发了一个两阶段算法：先在 Hermitian 对偶四元数测量矩阵上用幂法计算谱初始化，随后使用对偶四元数广义幂法 (DQGPM)，通过逐次迭代的投影保证可行性。建立了谱估计器的估计误差界，并证明 DQGPM 具有有限迭代误差界，且误差线性收缩直至一个显式的、与噪声相关的阈值。在合成基准和真实多扫描点集配准上的实验表明，所提出的流程在准确性和效率上均优于代表性的基于矩阵的方法。

英文摘要

Synchronization over the special Euclidean group SE(3) aims to recover absolute poses from noisy pairwise relative transformations and is a core primitive in robotics and 3D vision. Standard approaches often require multi-step heuristic procedures to recover valid poses, which are difficult to analyze and typically lack theoretical guarantees. This paper adopts a dual quaternion representation and formulates SE(3) synchronization directly over unit dual quaternions. A two-stage algorithm is developed: a spectral initializer computed via the power method on a Hermitian dual quaternion measurement matrix, followed by a dual quaternion generalized power method (DQGPM) that enforces feasibility through per-iteration projection. Estimation error bounds are established for the spectral estimator, and DQGPM is shown to admit a finite-iteration error bound and to achieve linear error contraction up to an explicit noise-dependent threshold. Experiments on synthetic benchmarks and real-world multi-scan point-set registration demonstrate that the proposed pipeline improves both accuracy and efficiency over representative matrix-based methods.

2512.21032 2026-05-29 cs.CV 版本更新

Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model

基于潜在扩散模型的多属性引导热人脸图像翻译

Mingshu Cai, Osamu Yoshie, Yuya Ieiri

发表机构 * Graduate School of Information, Production and Systems, Waseda University（早稻田大学信息、生产与系统研究生院）

AI总结提出一种基于潜在扩散模型的多属性引导方法，从热红外图像生成高质量可见光人脸图像，同时保留关键身份特征，解决异质人脸识别中的域偏移和特征丢失问题。

Comments Accepted by 2025 IEEE International Joint Conference on Biometrics (IJCB 2025)

DOI: 10.1109/IJCB65343.2025.11411523
Journal ref: 2025 IEEE International Joint Conference on Biometrics (IJCB), 2025

AI中文摘要

现代监控系统越来越依赖多波长传感器和深度神经网络来识别夜间拍摄的红外图像中的人脸。然而，大多数人脸识别模型是在可见光数据集上训练的，由于显著的域偏移，在红外输入上性能大幅下降。早期的基于特征的红外人脸识别方法被证明效果不佳，促使研究人员采用生成式方法将红外图像转换为可见光图像以提高识别性能。这种被称为异质人脸识别（HFR）的范式面临模型和模态差异等挑战，导致生成图像出现失真和特征丢失。为了解决这些限制，本文引入了一种新颖的基于潜在扩散的模型，旨在从热输入生成高质量的可见光人脸图像，同时保留关键身份特征。我们集成一个多属性分类器，从可见光图像中提取关键面部属性，减轻红外到可见光图像恢复过程中的特征丢失。此外，我们提出了Self-attn Mamba模块，该模块增强了跨模态特征的全局建模，并显著提高了推理速度。在两个基准数据集上的实验结果表明，我们的方法在图像质量和身份保持方面均达到了最先进的性能。

英文摘要

Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.

2512.03010 2026-05-29 cs.CV cs.GR cs.RO 版本更新

SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting

SurfFill: 通过高斯曲面元溅射补全LiDAR点云

Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke

发表机构 * NavVis GmbH（NavVis公司）； Inria, Université Côte d'Azur（Inria与蔚蓝海岸大学）

AI总结针对LiDAR点云缺失薄结构和边缘细节的问题，提出基于高斯曲面元（Gaussian surfel）的补全方案SurfFill，利用光束发散启发式识别模糊区域并优化曲面元重建以生长新点，在合成和真实场景中优于先前方法。

Comments Project page: https://lfranke.github.io/surffill

AI中文摘要

LiDAR捕获的点云通常被视为主动3D重建的金标准。尽管其在平坦区域精度极高，但捕获容易遗漏小的几何结构，并可能在暗色、吸光材料上失败。或者，拍摄场景的多张照片并应用3D摄影测量可以推断这些细节，因为它们通常代表特征丰富的区域。然而，对于无特征区域，LiDAR的精度很少能达到。因此，我们建议通过引入SurfFill：一种基于高斯曲面元的LiDAR补全方案，结合LiDAR和基于相机的捕获的优势。我们分析LiDAR捕获，并将LiDAR光束发散归因于伪影的主要因素，主要表现为薄结构和边缘。我们利用这一见解，通过评估点云中密度的变化，引入一种用于完成扫描的模糊启发式方法。这使我们能够识别靠近缺失区域的点，然后我们可以使用这些点生长额外的点以完成扫描。对于这种点生长，我们约束高斯曲面元重建，将优化和密集化集中在这些模糊区域。最后，提取模糊区域重建的高斯基元并采样以获取点来完成点云。为了解决大规模重建的挑战，我们将此流程扩展为一种分治方案，用于建筑大小的点云补全。我们在合成和真实场景的LiDAR点云补全任务上评估，发现我们的方法优于先前的重建方法。

英文摘要

LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capture is prone to missing small geometric structures and may fail on dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details, as they often represent feature-rich regions. However, photogrammetry rarely reaches the accuracy of LiDAR in featureless regions. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR captures and identify LiDAR beam divergence as a main cause of artifacts, which manifest mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans that evaluates the change in density in the point cloud. This allows us to identify points close to missed areas, from which we can then grow additional points to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

2512.01334 2026-05-29 cs.CV 版本更新

AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

AlignVid: 文本引导图像到视频生成中语义保真度的免训练注意力缩放

Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang, Jingjin Zhu, Manyuan Zhang, Ser-Nam Lim, Harry Yang

发表机构 * Hong Kong University of Science（香港科学大学）； University of Central Florida, Orlando, FL, USA（中央佛罗里达大学）； Beijing Academy of Artificial Intelligence, Beijing, China（北京人工智能研究院）； The Chinese University of Hong Kong, Hong Kong SAR, China（香港中文大学）

AI总结针对文本引导图像到视频生成中视觉主导导致语义编辑失败的问题，提出免训练注意力缩放调制（ASM）和引导调度（GS）方法，并构建OmitI2V基准，有效提升语义保真度且计算开销可忽略。

AI中文摘要

文本引导的图像到视频生成取得了显著进展，但在执行需要对参考图像进行实质性更改（例如，添加、移除或修改对象）的文本指定编辑时仍存在困难。经验上，我们的分析表明，这源于\textbf{视觉主导}，即参考图像导致严重的注意力分散，抑制了模型整合新语义信息的能力。为解决此问题，我们提出\textbf{AlignVid}，一种免训练干预方法，重新校准模型内部的注意力分布。基于注意力的能量视角，AlignVid采用注意力缩放调制（\textbf{ASM}）以降低注意力熵并将焦点集中在语义标记上，同时结合引导调度（\textbf{GS}）以保持生成稳定性。为严格评估此能力，我们提出\textbf{OmitI2V}，一个全面的基准，用于评估对象修改、添加和删除中的提示遵循度。大量实验表明，AlignVid有效增强了语义保真度，且计算开销可忽略。代码和OmitI2V基准可在https://github.com/LAW1223/AlignVid获取。

英文摘要

Text-guided image-to-video generation has made substantial progress, yet it still struggles to execute text-specified edits that require substantial changes to a reference image (\textit{e.g., object addition, removal, or modification}). Empirically, our analysis reveals that this stems from \textbf{visual dominance}, where the reference image causes severe attention dispersion, inhibiting the model's ability to incorporate new semantic information. To address this, we propose \textbf{AlignVid}, a training-free intervention that re-calibrates the model's internal attention distribution. Drawing on an energy-based perspective of attention, AlignVid employs Attention Scaling Modulation (\textbf{ASM}) to reduce attention entropy and concentrate focus on semantic tokens, alongside Guidance Scheduling (\textbf{GS}) to maintain generation stability. To rigorously assess this capability, we present \textbf{OmitI2V}, a comprehensive benchmark for evaluating prompt adherence across object modification, addition, and deletion. Extensive experiments demonstrate that AlignVid effectively enhances semantic fidelity with negligible computational overhead. Code and the OmitI2V benchmark are available at https://github.com/LAW1223/AlignVid.
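
The link between attention scaling and attention entropy can be illustrated with a minimal Python sketch. It is not the AlignVid implementation: the `softmax` and `entropy` helpers, the example logits, and the scale value 2.0 are all assumptions chosen for illustration. The sketch only shows the general fact that multiplying attention logits by a factor greater than 1 sharpens the softmax and lowers its entropy.

```python
import math

def softmax(logits, scale=1.0):
    """Softmax over logits multiplied by `scale` (scale > 1 sharpens the distribution)."""
    scaled = [scale * x for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical attention logits for one query over four tokens.
logits = [2.0, 1.0, 0.5, 0.1]
base = softmax(logits)              # unscaled attention weights
sharp = softmax(logits, scale=2.0)  # scaled logits: more mass on the top token

print("entropy without scaling:", entropy(base))
print("entropy with scaling:   ", entropy(sharp))
```

In this toy setting the scaled distribution places more weight on the highest-scoring token, which is the qualitative effect the abstract attributes to ASM; how the method selects which tokens and layers to scale is not specified here.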

2511.10861 2026-05-29 cs.CV cs.AI cs.LG 版本更新

An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning

一种面向CNN的基于LRP剪枝的精度感知扩展，以防止数据稀缺迁移学习中的级联精度下降

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

发表机构 * Mathematics and Computer Science, National Defense Academy of Japan（日本防卫大学校数学与计算机科学系）

AI总结针对数据稀缺迁移学习中预训练CNN剪枝导致的级联精度下降问题，提出一种精度感知的剪枝控制机制，通过动态调整剪枝率和顺序来抑制精度下降，提升模型压缩后的分类性能。

Comments Accepted to Scientific Reports. The title was revised during the peer review process

DOI: 10.1038/s41598-026-47992-8

AI中文摘要

在大规模数据集（如ImageNet）上预训练的卷积神经网络（CNN）被广泛用作特征提取器，从稀缺数据中构建特定任务的高精度分类模型。在此类场景中，由于数据稀缺，微调预训练CNN变得困难，因此必须使用固定权重。然而，当权重固定时，许多对目标任务无贡献的滤波器仍保留在模型中，导致不必要的冗余和效率降低。因此，需要有效的方法通过剪枝对推理不必要的滤波器来减小模型大小。为此，已有研究提出了利用逐层相关性传播（LRP）的方法。LRP量化每个滤波器对推理结果的贡献，从而可以剪枝低相关性的滤波器。然而，现有基于LRP的剪枝方法被观察到会导致级联精度下降。在本研究中，我们为现有基于LRP的滤波器剪枝方法引入了一种精度感知的剪枝控制机制，该机制通过使用类别精度的调和平均数动态调整剪枝率和剪枝顺序，抑制级联精度下降，并在小数据环境下压缩预训练模型的同时保持任务特定性能。我们证明，该控制机制有效缓解了级联精度下降，与现有基于LRP的剪枝方法相比，实现了更高的分类精度，将VGG16的精度-剪枝率曲线下的类别平均面积（AUC）比传统基于LRP的方法提高了约15%。

英文摘要

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks. In such scenarios, fine-tuning the pre-trained CNN is difficult due to data scarcity, necessitating the use of fixed weights. However, when the weights are kept fixed, many filters that do not contribute to the target task remain in the model, leading to unnecessary redundancy and reduced efficiency. Therefore, effective methods are needed to reduce model size by pruning filters that are unnecessary for inference. To address this, approaches utilizing Layer-wise Relevance Propagation (LRP) have been proposed. LRP quantifies the contribution of each filter to the inference result, enabling the pruning of filters with low relevance. However, existing LRP-based pruning methods have been observed to cause cascading accuracy degradation. In this study, we introduce an accuracy-aware pruning control mechanism for existing LRP-based filter pruning methods, which suppresses cascading accuracy degradation by dynamically adjusting the pruning rate and the pruning order using the harmonic mean of class accuracy, and compresses the pre-trained model while preserving task-specific performance in a small-data environment. We demonstrate that this control mechanism effectively mitigates cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods, improving the class-averaged area under the accuracy-pruning-rate curve (AUC) of VGG16 by approximately 15\% over conventional LRP-based approaches.
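
To see why a harmonic mean of class accuracies is a sensible control signal, consider the following Python sketch. It is an illustration under stated assumptions, not the authors' procedure: `harmonic_mean` is the standard definition, while `next_pruning_rate`, its halving rule, and the `tolerance` value are hypothetical and stand in for whatever adjustment the paper actually uses.

```python
def harmonic_mean(values):
    """Harmonic mean of per-class accuracies; 0.0 if any class has collapsed to zero."""
    if any(v <= 0.0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

def next_pruning_rate(rate, score_before, score_after, tolerance=0.02):
    """Hypothetical control rule: halve the rate when the score drops by more than `tolerance`."""
    if score_before - score_after > tolerance:
        return rate / 2.0
    return rate

# Two made-up sets of per-class accuracies with the same arithmetic mean (0.80).
balanced = [0.80, 0.80, 0.80]
collapsed = [0.95, 0.95, 0.50]  # one class has degraded sharply

print("harmonic mean, balanced: ", harmonic_mean(balanced))
print("harmonic mean, collapsed:", harmonic_mean(collapsed))
print("next rate:", next_pruning_rate(0.10, harmonic_mean(balanced), harmonic_mean(collapsed)))
```

Both accuracy sets have the same arithmetic mean, but the harmonic mean is lower for the set in which one class has degraded, so a controller driven by it reacts to a single failing class before the average accuracy moves.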

2510.27607 2026-05-29 cs.CV cs.RO 版本更新

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

双流扩散用于世界模型增强的视觉-语言-动作模型

John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin

发表机构 * Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST), Seoul, Republic of Korea（韩国科学技术院金在哲人工智能研究生院，大韩民国首尔）

AI总结提出DUST框架，通过双流扩散Transformer和异步采样方法，解决世界模型增强的视觉-语言-动作模型中的模态差距问题，在模拟和真实任务中取得显著性能提升。

Comments Accepted at ICML 2026. Project page at https://periphanes.github.io/dust (20 pages, 10 figures)

AI中文摘要

用世界模型增强视觉-语言-动作模型（VLA）对于机器人策略学习很有前景，但由于模态差距，在联合预测状态和动作方面面临挑战。为了解决这个问题，我们提出了DUal-STream diffusion（DUST），一个世界模型增强的VLA框架，其特点是一个多模态扩散Transformer，在保持独立模态流的同时实现跨模态知识共享。此外，DUST利用独立的噪声扰动和解耦的流匹配损失来学习跨模态因果关系。我们进一步引入了一种用于动作和视觉令牌的异步采样方法，通过推理时缩放来增强性能。在RoboCasa和GR-1等模拟基准上的实验结果表明，DUST相对于最先进的VLA和世界建模基线实现了高达6%的性能提升，推理时缩放额外提供了2-5%的提升。在使用Franka Research 3的真实世界任务中，DUST的成功率比基线高出10%。最后，我们证明了DUST通过在无动作视频上的预训练以及与异构机器人和人类数据集的联合训练，实现了有效的迁移学习。

英文摘要

Augmenting vision-language-action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST utilizes independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that enhances performance through inference-time scaling. Experimental results on simulated benchmarks like RoboCasa and GR-1 show that DUST achieves up to 6% gains over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2-5% improvement. In real-world tasks using the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint-training with heterogeneous robot and human datasets.

2510.26412 2026-05-29 cs.CV cs.AI 版本更新

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

LoCoT2V-Bench: 长视频与复杂文本到视频生成的基准测试

Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China（哈尔滨工业大学（深圳）计算与智能研究院）； The University of Hong Kong（香港大学）

AI总结针对长视频生成在复杂文本输入下的评估挑战，提出包含多场景提示与层次元数据的基准LoCoT2V-Bench，并设计多维度评估框架LoCoT2V-Eval，实验发现模型在细粒度文本-视频对齐和角色一致性方面存在显著不足。

Comments Accepted by ICML 2026 (Regular)

AI中文摘要

近期文本到视频生成在短片段上取得了令人印象深刻的性能，但在复杂文本输入下评估长视频生成仍然是一个重大挑战。为应对这一挑战，我们提出了LoCoT2V-Bench，一个用于长视频生成（LVG）的基准，包含具有层次元数据（如角色设置和相机行为）的多场景提示，这些提示从收集的真实世界视频中构建。我们进一步提出了LoCoT2V-Eval，一个多维度评估框架，涵盖感知质量、文本-视频对齐、时间质量、动态质量和人类期望实现程度（HERD），重点关注细粒度文本-视频对齐和时间角色一致性等方面。在17个代表性LVG模型上的实验揭示了评估维度之间的显著能力差异，模型在感知质量和背景一致性方面表现强劲，但在细粒度文本-视频对齐和角色一致性方面明显较弱。这些发现表明，提高提示忠实度和身份保持仍是长视频生成的关键挑战。我们的代码和数据发布在https://github.com/XqZeppelinhead0702/LoCoT2V-Bench。

英文摘要

Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 17 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation. Our code and data are released at https://github.com/XqZeppelinhead0702/LoCoT2V-Bench

2508.16873 2026-05-29 cs.CV cs.SI 版本更新

Multimodal LLMs See Sentiment

多模态大语言模型感知情感

Neemias B. da Silva, John Harrison, Rodrigo Minetto, Myriam R. Delgado, Bogdan T. Nassu, Thiago H. Silva

发表机构 * Universidade Tecnológica Federal do Paraná（巴拉那联邦理工大学）； University of Toronto（多伦多大学）

AI总结本文通过系统评估研究，探讨多模态大语言模型在图像情感分析中的三种方法，发现基于MLLM描述的两阶段流水线在微调后性能显著优于传统基线。

Comments 24 pages, 7 figures

AI中文摘要

理解视觉内容如何传达情感在以图像为主导的数字环境中日益重要。然而，情感感知依赖于复杂的场景级语义，这对计算模型而言是一项具有挑战性的任务。本文通过一项系统性的、以评估为导向的研究，从三个视角考察多模态大语言模型如何执行图像情感分析：(i) 使用MLLM直接从图像进行情感分类；(ii) 使用预训练LLM对MLLM生成的描述进行情感分析；(iii) 在情感标注的描述上微调这些LLM以评估性能和泛化能力。在最新基准上的实验表明，两阶段MLLM描述中介流水线在多种评估设置下能显著提高预测准确性，尤其是当LLM组件被微调时。在不同的一致性阈值和情感粒度下，该流水线的最强配置在基准测试中分别优于基于词典、CNN和Transformer的基线高达30.9%、64.8%和42.4%。在跨数据集评估中，所提出的流水线——无需在目标数据集上进行训练或微调——仍比最佳域内基线高出8%以上。总体而言，本研究提供了对MLLM描述中介情感分析的综合评估，阐明了其有效的条件、失败的场景以及与基于传统视觉方法的比较，同时为未来研究提供了可复现的基准资源。

英文摘要

Understanding how visual content conveys sentiment is increasingly important in a digital landscape dominated by imagery. However, sentiment perception depends on complex scene-level semantics, making this a challenging task for computational models. This paper examines how Multimodal Large Language Models (MLLMs) perform sentiment analysis in images through a systematic, evaluation-driven study encompassing three perspectives: (i) direct sentiment classification from images using MLLMs; (ii) sentiment analysis on MLLM-generated descriptions using pre-trained LLMs; and (iii) fine-tuning these LLMs on sentiment-labeled descriptions to assess performance and generalization. Experiments on a recent benchmark show that a two-stage MLLM description-mediated pipeline can substantially improve prediction accuracy under several evaluation settings, particularly when the LLM component is fine-tuned. Across different agreement thresholds and sentiment granularities, the strongest configurations of this pipeline outperform lexicon-, CNN-, and Transformer-based baselines in our benchmark by up to 30.9%, 64.8%, and 42.4%, respectively. In cross-dataset evaluation, the proposed pipeline - without training or fine-tuning on the target dataset - still surpasses the best in-domain baseline by over 8%. Overall, the study provides a comprehensive assessment of MLLM description-mediated sentiment analysis, clarifying the conditions under which it is effective, the scenarios in which it fails, and its comparison with traditional vision-based approaches, while also providing a reproducible benchmark resource for future research.

2508.15151 2026-05-29 eess.IV cs.CV 版本更新

Zero-shot CT Super-Resolution using Diffusion-based 2D Projection Priors and Signed 3D Gaussians

基于扩散的二维投影先验和有符号三维高斯的零样本CT超分辨率

Jeonghyun Noh, Hyun-Jic Oh, Won-Ki Jeong

发表机构 * Department of Computer Science and Engineering, Korea University, Seoul, Korea（高丽大学计算机科学与工程系，韩国首尔）

AI总结提出一种零样本三维CT超分辨率框架，通过扩散模型上采样二维投影先验并结合有符号三维高斯溅射（NAB-GS）重建高分辨率CT体积，在公开数据集上实现4倍超分辨率的优越性能。

Comments Early accepted at MICCAI 2026

AI中文摘要

计算机断层扫描（CT）在临床诊断中至关重要，但获取高分辨率（HR）CT受到辐射暴露风险的限制。虽然基于深度学习的超分辨率（SR）方法在从低分辨率（LR）输入重建HR CT方面显示出前景，但监督方法需要通常不可用的配对数据集。零样本方法通过处理单个LR输入来解决这一限制；然而，由于单个体积内LR信息有限，它们常常无法恢复精细的结构细节。为克服这些限制，我们提出了一种新颖的零样本三维CT SR框架，将基于扩散的上采样二维投影先验集成到三维重建过程中。具体而言，我们的框架包含两个阶段：（1）LR CT投影SR，在丰富的X射线数据上训练扩散模型以对LR投影进行上采样，从而增强LR输入中固有的稀缺信息。（2）三维CT体积重建，使用我们新颖的负Alpha混合（NAB-GS）的三维高斯溅射，该技术建模正负高斯密度以学习扩散生成的HR投影与上采样的LR投影之间的有符号残差。我们的框架在两个公开数据集上展示了优越的定量和定性性能，专家评估表明了该框架在4倍超分辨率下的临床潜力。

英文摘要

Computed tomography (CT) is important in clinical diagnosis, but acquiring high-resolution (HR) CT is constrained by radiation exposure risks. While deep learning-based super-resolution (SR) methods have shown promise for reconstructing HR CT from low-resolution (LR) inputs, supervised approaches require paired datasets that are often unavailable. Zero-shot methods address this limitation by operating on single LR inputs; however, they frequently fail to recover fine structural details due to limited LR information within individual volumes. To overcome these limitations, we propose a novel zero-shot 3D CT SR framework that integrates diffusion-based upsampled 2D projection priors into the 3D reconstruction process. Specifically, our framework consists of two stages: (1) LR CT projection SR, training a diffusion model on abundant X-ray data to upsample LR projections, thereby enhancing the scarce information inherent in the LR inputs. (2) 3D CT volume reconstruction, using 3D Gaussian splatting with our novel Negative Alpha Blending (NAB-GS), which models positive and negative Gaussian densities to learn signed residuals between diffusion-generated HR and upsampled LR projections. Our framework demonstrates superior quantitative and qualitative performance on two public datasets, and expert evaluations present the framework's clinical potential at 4x.

2508.12176 2026-05-29 cs.CV cs.AI eess.SP 版本更新

Scalable RF Simulation in Generative 4D Worlds

生成式4D世界中的可扩展射频仿真

Zhiwei Zheng, Dongyin Hu, Mingmin Zhao

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

AI总结提出WaveVerse框架，通过语言引导的4D世界生成器和物理信号模拟器实现可扩展的射频信号仿真，在相位敏感基准上表现高保真度，并有效提升下游任务性能。

Comments Accepted to ICML 2026

AI中文摘要

射频（RF）感知已成为一种强大的、保护隐私的替代视觉方法，用于各种感知任务。然而，在动态和多样化的环境中构建高质量的RF数据集仍然是一个重大挑战。为了解决这一问题，我们引入了WaveVerse，一个基于提示的可扩展框架，该框架从生成的室内场景中模拟真实的RF信号，并包含由空间路径引导的人体运动，从而无需手动轨迹设计即可实现多样且可行的行为。WaveVerse具有语言引导的4D世界生成器和基于物理的信号模拟器，能够在多样化的环境中实现RF信号的逼真模拟。它采用了一个相位相干光线追踪器，保留了空间和时间上的相位一致性。模拟信号在相位敏感基准上显示出高保真度，并且与真实世界收集的测量数据以及来自专有电磁求解器的模拟结果高度一致。当用于数据增强时，WaveVerse在RF成像和人类活动识别等下游任务中持续提升性能，其增益随模拟数据量的增加而增长，并超越了现有方法。代码和附加材料可在网页上获取。

英文摘要

Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for various perception tasks. However, building high-quality RF datasets in dynamic and diverse environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions guided by spatial paths, enabling diverse and feasible behaviors without manual trajectory design. WaveVerse features a language-guided 4D world generator and a physics-based signal simulator that enables realistic simulation of RF signals in diverse environments. It employs a phase-coherent ray tracer that preserves both spatial and temporal phase consistency. The simulated signals show high fidelity on phase-sensitive benchmarks, and closely align with both real-world collected measurements and simulations from a proprietary electromagnetic solver. When used for data augmentation, WaveVerse consistently improves performance in downstream tasks like RF imaging and human activity recognition, with gains that grow with the amount of simulated data and surpass existing methods. Code and additional materials are available on the webpage.

2508.10566 2026-05-29 cs.CV 版本更新

HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

HM-Talker：用于高保真说话头合成的混合运动建模

Shiyu Liu, Kui Jiang, Junjun Jiang, Xianming Liu, Xiaocheng Feng, Hongxun Yao, Qi Tian

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Guangdong Laboratory of Artificial Intelligence and Digital Economy（广东省人工智能与数字经济实验室）

AI总结提出HM-Talker框架，通过混合显式发音线索与隐式韵律特征，结合交叉模态映射和随机特征配对策略，解决说话头生成中个性化与泛化的权衡问题，在视觉真实感和唇同步精度上超越现有方法。

AI中文摘要

音频驱动的说话头生成面临个性化与泛化之间的基本权衡，限制了其实际应用。隐式模型通常以结构不一致为代价实现泛化，导致不稳定的头部运动和不准确的唇同步。而显式方法引入了几何和解剖先验，如参数化面部几何的3D可变形模型（3DMM）或编码面部肌肉运动的动作单元（AU），但它们往往产生过度中性的表情或泛化能力有限。为解决这一矛盾，我们提出了HM-Talker，一个音频驱动的说话头框架，它协同整合显式发音线索与隐式韵律特征，以刻画身份特定动态，同时实现音频驱动的泛化。其显著特点可概括为：i) 跨模态映射模块（CMMM），从音频和视频中提取全面的运动线索词汇表；ii) 混合运动建模模块（HMMM），采用随机特征配对（SFP）策略，动态融合配对的隐式和显式特征以进行运动合成。该设计促进了下半部分面部运动的迭代优化，在身份特定目标与身份无关（仅音频）目标之间交替进行。大量实验表明，HM-Talker在多种设置下的视觉真实感和唇同步精度方面均优于最先进方法。

英文摘要

Audio-driven talking head generation faces a fundamental trade-off between personalization and generalization, limiting its practical application. Implicit models often achieve generalization at the cost of structural incoherence, resulting in unstable head motion and inaccurate lip synchronization. Explicit methods incorporate geometric and anatomical priors, such as 3D Morphable Models (3DMMs), which parameterize facial geometry, or Action Units (AUs), which code facial muscle movements, but they tend to produce overly neutral expressions or suffer from limited generalization. To resolve this conflict, we present HM-Talker, an audio-driven talking head framework that synergistically integrates explicit articulatory cues with implicit prosodic features to characterize identity-specific dynamics while enabling audio-driven generalization. Its distinctive features can be summarized as: i) the Cross-Modal Mapping Module (CMMM) that extracts a comprehensive vocabulary of motion cues from audio and video, and ii) the Hybrid Motion Modeling Module (HMMM) that employs a Stochastic Feature Pairing (SFP) strategy to dynamically merge paired implicit and explicit features for motion synthesis. This design facilitates an iterative optimization of the lower face motion, alternating between identity-specific and identity-agnostic (audio-only) objectives. Extensive experiments demonstrate that HM-Talker outperforms state-of-the-art methods in both visual realism and lip-sync accuracy across diverse settings.

URL PDF HTML ☆

赞 0 踩 0

2508.08677 2026-05-29 cs.LG cs.CV 版本更新

Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL

多级协作蒸馏遇见全局工作空间模型：面向OCIL的统一框架

Shibin Su, Guoqiang Liang, De Cheng, Shizhou Zhang, Lingyan Ran

发表机构 * School of Computer Science, Northwestern Polytechnical University（西北工业大学计算机学院）； School of Telecommunications Engineering, Xidian University（西安电子科技大学电信工程学院）

AI总结提出一种结合全局工作空间模型和多级协作蒸馏的统一框架，通过融合多学生模型参数形成共享隐式记忆并周期性广播，以及跨学生一致性和历史知识对齐机制，有效平衡在线类增量学习中的稳定性与可塑性。

Comments 15 pages, 8 figures

AI中文摘要

在线类增量学习（OCIL）使模型能够从非独立同分布的数据流中持续学习。由于数据流中的样本只能被看到一次，因此与离线学习相比，它更适用于现实场景。然而，这一约束加剧了OCIL在维持稳定性与可塑性之间适当平衡的挑战。此外，在现实世界中更严格的内存缓冲区约束下，当前基于重放的方法效果较差。虽然集成方法提高了可塑性，但它们常常在稳定性上遇到困难。受全局工作空间理论（GWT）启发，我们提出了一种新颖方法，通过全局工作空间模型（GWM）——一种共享的隐式记忆，指导多个学生模型的学习——来增强集成学习。GWM通过在每个训练批次中融合所有学生的参数形成，捕获历史学习轨迹，并作为知识巩固的动态锚点。类似于GWT的广播机制，GWM定期重新分发给学生，稳定学习并促进跨任务一致性。此外，我们引入了一种多级协作蒸馏机制。它强制学生之间保持对等一致性，并通过将每个学生与GWM对齐来保留历史知识。因此，学生模型在保持先前所学知识的同时，仍能适应新任务，在稳定性与可塑性之间实现更好的平衡。在三个标准OCIL基准上的大量实验表明，我们的方法在各种内存预算下为多个OCIL模型带来了显著的性能提升。代码可在https://github.com/susususushi/GWM获取。

英文摘要

Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. data streams. Since each sample in the data stream can be seen only once, OCIL is more suitable for real-world scenarios than offline learning. However, this constraint intensifies the challenge of maintaining an appropriate balance between stability and plasticity. Moreover, under the stricter memory buffer constraints of the real world, current replay-based methods are less effective. While ensemble methods improve plasticity, they often struggle with stability. Inspired by the Global Workspace Theory (GWT), we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM), a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. Like the broadcasting mechanism of GWT, the GWM is redistributed periodically to students, stabilizing learning and promoting cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. It enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets. The code is available at https://github.com/susususushi/GWM.
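
The fuse-then-broadcast cycle described above can be sketched in a few lines of Python. This is a toy illustration, not the released code: the abstract does not state the fusion rule, so plain key-wise averaging is an assumption, and the names `fuse`, `broadcast`, and the `mix` factor are hypothetical. Model parameters are represented as dictionaries of scalars to keep the example self-contained.

```python
def fuse(students):
    """Build the workspace by averaging student parameters key-wise (assumed fusion rule)."""
    n = len(students)
    return {k: sum(s[k] for s in students) / n for k in students[0]}

def broadcast(students, workspace, mix=1.0):
    """Pull every student toward the workspace; mix=1.0 copies it outright (hypothetical knob)."""
    return [
        {k: (1.0 - mix) * s[k] + mix * workspace[k] for k in s}
        for s in students
    ]

# Two toy "students", each with a weight and a bias.
students = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]

gwm = fuse(students)                 # shared workspace built from all students
synced = broadcast(students, gwm)    # periodic redistribution back to the students

print("workspace:", gwm)
print("students after broadcast:", synced)
```

In an actual training loop the fusion would run on every batch and the broadcast only every few steps, with the distillation losses applied in between; those schedules are not given in the abstract and are left out here.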

2507.21114 2026-05-29 cs.IR cs.AI cs.CV 版本更新

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

EPiC: 基于精确锚点视频引导的高效视频摄像机控制学习

Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal

AI总结提出EPiC框架，通过基于首帧可见性掩码构建精确对齐的锚点视频，并引入轻量模块Anchor-ControlNet，以极低参数实现高效、精确的3D摄像机控制，在RealEstate10K和MiraData上达到最先进性能。

Comments Accepted to ICML 2026. Project website: https://zunwang1.github.io/Epic

AI中文摘要

近期带摄像机控制的视频生成方法通常通过从估计的点云沿摄像机轨迹渲染，创建锚点视频（即近似所需摄像机运动的渲染视频），以作为结构化先验引导扩散模型。然而，点云和摄像机轨迹估计中的误差常导致不准确的锚点视频，并带来更高的训练成本和低效率，因为模型被迫补偿渲染错位。为解决这些局限，我们提出EPiC，一种高效且精确的摄像机控制学习框架，无需摄像机姿态或点云估计即可构建良好对齐的训练锚点视频。具体而言，我们通过基于首帧可见性掩码掩蔽源视频来创建高精度锚点视频，这确保了强对齐，消除了对摄像机/点云估计的需求，因此可轻松应用于任意野外视频。此外，我们引入Anchor-ControlNet，一种轻量模块，将可见区域中的锚点视频引导集成到预训练视频扩散模型中，仅增加不到1%的额外参数。EPiC以显著更少的参数、训练步骤和数据实现高效训练，并在测试时对使用点云制作的锚点视频具有鲁棒泛化能力，从而实现精确的3D感知摄像机控制。EPiC在RealEstate10K和MiraData上的I2V摄像机控制任务中达到最先进性能。值得注意的是，EPiC还展现出对视频到视频（V2V）场景的强零样本泛化能力。

英文摘要

Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds following camera trajectories. However, errors in point cloud and camera trajectory estimation often produce inaccurate anchor videos, which raises training cost and lowers efficiency because the model is forced to compensate for rendering misalignments. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that constructs well-aligned training anchor videos without the need for camera pose or point cloud estimation. Concretely, we create highly precise anchor videos by masking source videos based on first-frame visibility, which ensures strong alignment, eliminates the need for camera/point cloud estimation, and thus can be readily applied to any in-the-wild video. Furthermore, we introduce Anchor-ControlNet, a lightweight module that integrates anchor video guidance in visible regions into pretrained video diffusion models, with less than 1% additional parameters. EPiC achieves efficient training with substantially fewer parameters, fewer training steps, and less data, and generalizes robustly to anchor videos made with point clouds at test time, enabling precise 3D-informed camera control. EPiC achieves SoTA performance on RealEstate10K and MiraData on the I2V camera control task. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video (V2V) scenarios.

2504.12747 2026-05-29 cs.CV 版本更新

Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints

针对个性化文本到图像合成的跨图像一致性约束隐私保护

Guanyu Wang, Kailong Wang, Yihao Huang, Mingyi Zhou, Geguang Pu, Li Li

发表机构 * Beihang University（北京航空航天大学）； Huazhong University of Science and Technology（华中科技大学）； East China Normal University（华东师范大学）

AI总结提出跨图像反个性化框架，通过强制扰动图像间的风格一致性并采用动态比率调整策略，增强对扩散模型个性化攻击的抵抗能力。

AI中文摘要

扩散模型和个性化技术的快速发展使得仅凭少量公开图像就能重建个人肖像成为可能。虽然这种能力赋能了各种创意应用，但也带来了严重的隐私问题，因为攻击者可以利用它生成高度逼真的冒充图像。为应对这些威胁，反个性化方法被提出，通过向已发布图像添加对抗性扰动来破坏个性化模型的训练。然而，现有方法很大程度上忽视了个性化固有的多图像特性，而是采用一种朴素的独立应用扰动策略（如同在单图像设置中常见的那样）。这忽略了利用图像间关系实现更强隐私保护的机会。因此，我们倡导从群体层面看待针对个性化的隐私保护。具体而言，我们引入了跨图像反个性化（CAP），一种通过强制扰动图像间的风格一致性来增强对个性化抵抗能力的新型框架。此外，我们开发了一种动态比率调整策略，可在攻击迭代过程中自适应地平衡一致性损失的影响。在经典CelebHQ和VGGFace2基准上的大量实验表明，CAP显著改进了现有方法。

英文摘要

The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.

2503.20897 2026-05-29 cs.CV 版本更新

Domain-Agnostic Feature Modulation for Semi-Supervised Domain Generalization

面向半监督领域泛化的域无关特征调制

Venuri Amarasinghe, Kalinga Bandara, Isun Randila, Asini Jayakody, Chamuditha Jayanga Galappaththige, Ranga Rodrigo

发表机构 * University of Moratuwa（莫勒图沃大学）； Queensland University of Technology（昆士兰理工大学）

AI总结针对半监督领域泛化中无域标签的挑战，提出一种特征调制策略与损失缩放函数，通过增强类判别特征、抑制域特定信息并动态降低伪标签置信度阈值，显著提升模型在多个基准上的泛化性能。

Comments Accepted at CVPRW 2026

AI中文摘要

半监督领域泛化（SSDG）利用少量标注数据与大量未标注数据来增强模型泛化能力。现有SSDG方法大多依赖伪标签（PL）处理未标注数据，且常假设可获取域标签——这一特权并非总是可用。然而，域偏移引入域噪声，导致不一致的伪标签，从而降低模型性能。源自FixMatch的方法尤其受限于较低的伪标签准确率，削弱了未标注数据的效用。为解决此问题，我们应对更具挑战性的域标签不可知SSDG场景，即在训练过程中未标注数据的域标签不可用。首先，我们提出一种特征调制策略，该策略在抑制域特定信息的同时增强类判别特征。此调制将特征推向“相似平均表示”（类原型的改进版本），该表示跨域鲁棒，促使分类器区分紧密相关的类别，并促使特征提取器形成紧密聚类、域不变的表征。其次，为缓解域噪声并提高伪标签准确率，我们引入一个损失缩放函数，该函数动态降低伪标签的固定置信度阈值，从而优化未标注数据的利用。凭借这些关键创新，我们的方法在四个主要领域泛化基准上取得了显著改进——即使在没有域标签的情况下。我们将公开代码。

英文摘要

Semi-supervised domain generalization (SSDG) leverages a small fraction of labeled data alongside unlabeled data to enhance model generalization. Most existing SSDG methods rely on pseudo-labeling (PL) for unlabeled data, often assuming access to domain labels, a privilege that is not always available. However, domain shifts introduce domain noise, leading to inconsistent PLs that degrade model performance. Methods derived from FixMatch suffer particularly from lower PL accuracy, reducing the effectiveness of unlabeled data. To address this, we tackle the more challenging domain-label agnostic SSDG, where domain labels for unlabeled data are not available during training. First, we propose a feature modulation strategy that enhances class-discriminative features while suppressing domain-specific information. This modulation shifts features toward Similar Average Representations (a modified version of class prototypes) that are robust across domains, encouraging the classifier to distinguish between closely related classes and the feature extractor to form tightly clustered, domain-invariant representations. Second, to mitigate domain noise and improve pseudo-label accuracy, we introduce a loss-scaling function that dynamically lowers the fixed confidence threshold for pseudo-labels, optimizing the use of unlabeled data. With these key innovations, our approach achieves significant improvements on four major domain generalization benchmarks, even without domain labels. We will make the code available.
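
For context, the fixed confidence threshold that the abstract refers to is the standard FixMatch rule: an unlabeled sample contributes to the loss only if its highest predicted class probability reaches a threshold (0.95 is the usual FixMatch default). The sketch below shows only that baseline rule on plain Python lists; the function name `pseudo_label_mask` is illustrative, and the paper's loss-scaling function, which the abstract does not specify, is not reproduced here.

```python
from typing import List, Tuple


def pseudo_label_mask(
    probs: List[List[float]], threshold: float = 0.95
) -> Tuple[List[int], List[bool]]:
    """Return (pseudo_labels, keep_mask) under a fixed confidence threshold.

    probs holds one class-probability vector per unlabeled sample.
    A sample is kept only if its top probability is >= threshold.
    """
    labels: List[int] = []
    mask: List[bool] = []
    for p in probs:
        confidence = max(p)
        labels.append(p.index(confidence))  # argmax class as the pseudo-label
        mask.append(confidence >= threshold)
    return labels, mask


# The confident sample is kept; the uncertain one is masked out of the loss.
labels, mask = pseudo_label_mask([[0.97, 0.02, 0.01], [0.50, 0.30, 0.20]])
```

With a fixed threshold, samples from a shifted domain tend to fall below it and go unused, which is the inefficiency the abstract says its dynamic lowering targets.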

2502.16548 2026-05-29 cs.LG cs.AI cs.CV 版本更新

A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes

用于电影心脏磁共振-文本驱动的心力衰竭结局预测的可组合多模态框架

Jianzhou Chen, Jinyang Sun, Xiumei Wang, Xi Chen, Heyu Chu, Guo Song, Yuji Luo, Xingping Zhou, Rong Gu

发表机构 * Department of Cardiology, Nanjing Drum Tower Hospital, State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University（南京鼓楼医院心内科，南京大学国家药物生物技术重点实验室）； School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University（上海交通大学电子信息与电气工程学院）； College of Electronic and Optical Engineering, Nanjing University of Posts and Telecommunications（南京邮电大学电子与光学工程学院）； College of Integrated Circuit Science and Engineering, Nanjing University of Posts and Telecommunications（南京邮电大学集成电路科学与工程学院）； Department of Cardiology, Nanjing Drum Tower Hospital Clinical College of Nanjing Medical University（南京医科大学南京鼓楼医院临床学院心内科）； Institute of Quantum Information and Technology, Nanjing University of Posts and Telecommunications（南京邮电大学量子信息与技术研究院）

AI总结提出一种可组合多模态框架，通过整合cine CMR影像、结构化临床指标和非结构化文本记录，实现比单模态AI算法更准确的心力衰竭预后预测，并支持个性化治疗优化。

AI中文摘要

目的。根据世界卫生组织（WHO）及其他公共卫生机构的数据，心力衰竭是全球主要死因之一，每年导致数百万人死亡。尽管心力衰竭领域已取得显著进展，生存率和射血分数有所改善，但由于其复杂性和多因素特征，仍存在大量未满足的需求。本研究旨在提出并评估一种用于心力衰竭评估和治疗优化的可组合策略框架，旨在提供更全面的患者评估和管理。方法。该框架利用多模态算法分析全面的患者数据，明确整合了电影心脏磁共振（cine CMR）序列、结构化临床指标（如实验室结果、人口统计学数据）和非结构化文本记录（如病史、处方）。通过整合这些多种数据源，我们的框架为患者提供了更全面的评估和优化的治疗方案。主要结果。与单模态AI算法相比，该多模态框架在心力衰竭预后预测方面展现出更高的准确性。此外，它还能详细评估各种病理指标对心力衰竭结局的影响。意义。通过系统性地整合异质性临床数据，该方法支持更全面的预后评估，并有助于为心力衰竭患者制定优化的个性化治疗计划。

英文摘要

Objective. Heart failure is one of the leading causes of death worldwide, with millions of deaths each year, according to data from the World Health Organization (WHO) and other public health agencies. While significant progress has been made in the field of heart failure, leading to improved survival rates and ejection fraction, substantial unmet needs remain because of the disease's complexity and multifactorial nature. This study aims to propose and evaluate a composable strategy framework for assessment and treatment optimization in heart failure, designed to provide more holistic patient evaluation and management. Approach. The framework leverages multi-modal algorithms to analyze a comprehensive range of patient data, explicitly integrating cine cardiac magnetic resonance (cine CMR) sequences, structured clinical metrics (e.g., lab results, demographics), and unstructured textual records (e.g., medical history, prescriptions). By integrating these various data sources, our framework offers a more holistic evaluation and an optimized treatment plan for patients. Main results. The multi-modal framework demonstrates superior accuracy in heart failure (HF) prognosis prediction compared to single-modal AI algorithms. Additionally, it enables a detailed evaluation of the impact of various pathological indicators on HF outcomes. Significance. By integrating heterogeneous clinical data in a systematic manner, this approach supports more comprehensive prognosis assessment and facilitates optimized, personalized treatment planning for heart failure patients.

2412.15632 2026-05-29 cs.CV 版本更新

A New Method to Capturing Compositional Knowledge in Linguistic Space

一种在语言空间中捕获组合知识的新方法

Jiahe Wan

发表机构 * School of Computer Science（计算机科学学院）； South-Central Minzu University（中南民族大学）

AI总结提出YUKINO方法，通过文本反转和“no”逻辑正则化，在无需硬负样本的情况下提升视觉语言模型的组合理解能力，在SugarCREPE基准上超越现有多模态SOTA模型8%以上。

DOI: 10.1016/j.neucom.2026.133150
Journal ref: Neurocomputing 2026, 679, 133150

AI中文摘要

组合理解使视觉语言模型能够解释图像和文本中对象、属性和关系之间的复杂联系。然而，现有方法通常依赖硬负样本和微调，这可能会高估改进效果，且受限于获取硬负样本的难度。在这项工作中，我们引入了零样本组合理解（ZS-CU），这是一个无需硬负训练数据即可增强组合理解的新任务。我们提出了YUKINO（通过带有“NO”的文本反转产生的组合理解知识），该方法利用文本反转将未标记图像映射到预训练CLIP模型中的伪标记。我们提出引入“no”逻辑正则化来解决反转中标记交互的问题。此外，我们建议使用知识蒸馏来降低文本反转的时间复杂度。实验结果表明，YUKINO在SugarCREPE基准上比现有多模态SOTA模型高出8%以上，并且在图像检索任务中也取得了显著改进。

英文摘要

Compositional understanding allows visual language models to interpret complex relationships between objects, attributes, and relations in images and text. However, most existing methods rely on hard negative examples and fine-tuning, which can overestimate improvements and are limited by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. We introduce a "no" logical regularization to address the issue of token interaction in inversion. Additionally, we use knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms the existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark, and also achieves significant improvements in image retrieval tasks.

2411.14279 2026-05-29 cs.CV cs.CL 版本更新

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

超越文本：通过多模态双注意力和软图像引导减少大型视觉语言模型中的语言偏差

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Peking University（北京大学）； Tsinghua University（清华大学）

AI总结针对大型视觉语言模型因语言偏差导致的幻觉问题，提出LACING框架，采用多模态双注意力机制和软图像引导策略，在不增加训练资源的情况下增强视觉理解并减少幻觉。

Comments EMNLP 2025

AI中文摘要

大型视觉语言模型在各种视觉语言任务中取得了令人印象深刻的结果。然而，尽管表现出有前景的性能，大型视觉语言模型仍因语言偏差而产生幻觉，导致对图像的关注度降低和视觉理解效率低下。我们确定了这种偏差的两个主要原因：1. 大语言模型预训练阶段与多模态对齐阶段之间训练数据的规模差异。2. 文本数据短期依赖性导致的学习推理偏差。因此，我们提出了LACING，一个系统性框架，旨在通过多模态双注意力机制和软图像引导来解决大型视觉语言模型的语言偏差。具体来说，多模态双注意力机制引入了一种并行双注意力机制，增强了整个模型中视觉输入的整合。软图像引导在训练和推理过程中引入了一个可学习的软视觉提示，以替代视觉输入，旨在迫使大型视觉语言模型优先处理文本输入。然后，软图像引导进一步提出了一种使用软视觉提示的新解码策略，以减轻模型对相邻文本输入的过度依赖。综合实验表明，我们的方法有效地消除了大型视觉语言模型的语言偏差，增强了视觉理解并减少了幻觉，无需额外的训练资源或数据。代码和模型可在[lacing-lvlm.github.io](https://lacing-lvlm.github.io)获取。

英文摘要

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).

2404.07977 2026-05-29 cs.CV 版本更新

Gaga: Group Any Gaussians via 3D-aware Memory Bank

Gaga: 通过3D感知记忆库分组任意高斯体

Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, Ming-Hsuan Yang

发表机构 * University of California, Merced（加州大学默塞德分校）； NVIDIA Research（英伟达研究院）； Google DeepMind（谷歌DeepMind）； Atmanity Inc.（Atmanity公司）

AI总结提出Gaga框架，利用零样本类别无关分割模型预测的不一致2D掩码，通过3D感知记忆库关联不同视角下的物体掩码，实现开放世界3D场景的重建与分割。

Comments TMLR Camera-Ready Version. Project Page: https://weijielyu.github.io/Gaga

AI中文摘要

我们介绍了Gaga，一个通过利用零样本类别无关分割模型预测的不一致2D掩码来重建和分割开放世界3D场景的框架。与先前依赖视频对象跟踪或对比学习方法的3D场景分割方法不同，Gaga利用空间信息并通过新颖的3D感知记忆库有效关联不同相机姿态下的物体掩码。通过消除训练图像中连续视角变化的假设，Gaga展现出对相机姿态变化的鲁棒性，尤其有利于稀疏采样图像，确保精确的掩码标签一致性。此外，Gaga可兼容来自不同来源的2D分割掩码，并与不同的开放世界零样本类别无关分割模型展现出稳健性能，显著增强了其通用性。大量的定性和定量评估表明，Gaga的性能优于现有最先进方法，凸显了其在3D场景理解与操作等实际应用中的潜力。

英文摘要

We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot class-agnostic segmentation models. In contrast to prior 3D scene segmentation approaches that rely on video object tracking or contrastive learning methods, Gaga utilizes spatial information and effectively associates object masks across diverse camera poses through a novel 3D-aware memory bank. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, which is particularly beneficial for sparsely sampled images, and ensures precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and demonstrates robust performance with different open-world zero-shot class-agnostic segmentation models, significantly enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as 3D scene understanding and manipulation.
